python数据预处理


Ⅰ.数据源

图片[1] - python数据预处理 - MaxSSL
图片[2] - python数据预处理 - MaxSSL

Ⅱ.导入库

import pandas as pdimport numpy as npfrom scipy import statsimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.preprocessing import MinMaxScaler, StandardScaler# 避免画图时无法显示中文(中文部分以正方格显示)plt.rcParams['font.sans-serif']=['SimHei']plt.rcParams['axes.unicode_minus']=False# 显示所有的列,而不是以……显示pd.set_option('display.max_columns', None)# 显示所有的行,而不是以……显示pd.set_option('display.max_rows', None)# 不自动换行显示pd.set_option('display.width', None)

Ⅲ.读取数据

# 读取文件catering_sale=pd.read_excel('./data/catering_sale.xls')

Ⅳ.数据缺失值处理

# 判断是否存在缺失值print(catering_sale.info(),'\n')# print(np.isnan(catering_sale).any(),'\n')catering_sale2=catering_sale.dropna(axis=0)print(catering_sale2.info(),'\n')

输出结果如下:

<class 'pandas.core.frame.DataFrame'>RangeIndex: 201 entries, 0 to 200Data columns (total 2 columns): # ColumnNon-Null CountDtype ----------------------------0 日期201 non-nulldatetime64[ns] 1 销量200 non-nullfloat64 dtypes: datetime64[ns](1), float64(1)memory usage: 3.3 KBNone <class 'pandas.core.frame.DataFrame'>Int64Index: 200 entries, 0 to 200Data columns (total 2 columns): # ColumnNon-Null CountDtype ----------------------------0 日期200 non-nulldatetime64[ns] 1 销量200 non-nullfloat64 dtypes: datetime64[ns](1), float64(1)memory usage: 4.7 KBNone 

观察可知,【销量】存在一个缺失值,本例将缺失值所在行进行删除处理

Ⅴ.数据特征分析

# 数据特征分析y=catering_sale2.iloc[:,1]print(catering_sale2.describe(),'\n')m=stats.mode(y)r=np.max(y)-np.min(y)cv=np.std(y)/np.mean(y)n,(smin,smax),sm,sv,ss,sk=stats.describe(y)print(f"统计量:{n},最小值:{smin},最大值:{smax},极差:{r},众数:{m},均值:{sm},方差:{sv},偏度:{ss},峰度:{sk},变异系数:{cv}\n")

输出结果如下:

销量count 200.000000mean 2755.214700std 751.029772min22.00000025%2451.97500050%2655.85000075%3026.125000max9106.440000 统计量:200,最小值:22.0,最大值:9106.44,极差:9084.44,众数:ModeResult(mode=array([2618.2]), count=array([2])),均值:2755.2146999999995,方差:564045.7182129748,偏度:3.0384935298149753,峰度:29.36759633770712,变异系数:0.2719025288924932

Ⅵ.数据异常值处理

# 画箱线图y=np.array(y)def boxplot(y,title):plt.style.use('ggplot')plt.subplots()plt.boxplot(y,patch_artist=True,showmeans=True,showfliers=True,medianprops={'color':'yellow'},flierprops={'markerfacecolor':'red'},labels=[''])plt.xlabel('销量')plt.text(1.05,3850,'上边缘')plt.text(1.05,1780,'下边缘')plt.text(1.1,3000,'上四分位数Q3')plt.text(0.8,2600,'中位数Q2')plt.text(1.1,2300,'下四分位数Q1')plt.text(1.05,6500,'异常值')plt.title(f'{title}箱线图')plt.show()# 计算上下四分位数q1=np.quantile(y,q=0.25)q3=np.quantile(y,q=0.75)# 异常值判断标准,1.5倍的四分位差 计算上下须对应的值low_quantile=q1-1.5*(q3-q1)high_quantile=q3+1.5*(q3-q1)print(f'下四分位数Q1:{q1},上四分位数Q3:{q3},下边缘:{low_quantile},上边缘:{high_quantile}\n')y2=[]for i in y:if i>high_quantile:i=high_quantiley2.append(i)elif i<low_quantile:i=low_quantiley2.append(i)else:y2.append(i)boxplot(y,title='异常值处理前')boxplot(y2,title='异常值处理后')

输出结果如下:
图片[3] - python数据预处理 - MaxSSL

下四分位数Q1:2451.975,上四分位数Q3:3026.125,下边缘:1590.7499999999998,上边缘:3887.3500000000004

观察可知,箱线图上下边缘存在异常值,本例通过四分位法对异常值进行处理,即:超出上边缘的异常值让其落在上边缘,低于下边缘的异常值让其落在下边缘

输出结果如下:
图片[4] - python数据预处理 - MaxSSL

Ⅶ.数据标准化/归一化处理

# 标准化处理catering_sale_all=pd.read_excel('./data/catering_sale_all.xls',index_col=0)# 方法一:归一化处理normalization_data1=MinMaxScaler().fit_transform(catering_sale_all)normalization_data1=[[round(j,2) for j in normalization_data1[i]] for i in range(len(normalization_data1))]normalization_data1=np.array(normalization_data1)# 方法二:标准化处理normalization_data2=StandardScaler().fit_transform(catering_sale_all)normalization_data2=[[round(j,2) for j in normalization_data2[i]] for i in range(len(normalization_data2))]normalization_data2=np.array(normalization_data2)print(f'方法一:归一化处理:\n{normalization_data1}\n方法二:标准化处理:\n{normalization_data2}')

输出结果如下:

方法一:归一化处理:[[1. 0.10.41. 0.90.83 0.94 0.45 0.64 1.] [0.57 1. 1. 0.48 0.50.58 1. 0.73 1. 0.39] [0.50.30.80.48 0.40. 0.33 0.55 0.64 0.22] [0.43 0.10.20. 0.60.50.44 0.73 1. 0.39] [0.07 0.50.90.24 0.80.58 0.89 0.55 0.91 0.43] [0.71 0.50.90.62 0.40.50.61 0.55 0.18 0.22] [0.43 0.20.90.24 0.10.33 0.50.27 0.64 0.13] [0.43 0.70.90.14 0.30.42 0.28 0.64 0.73 0.04] [0.21 0.30.40. 0.10.08 0.22 0. 0.36 0.26] [0.43 0.60.90.14 0.40.33 0.28 0.36 0.45 0.22] [0.21 0.20.40.29 0. 0.33 0.39 0. 0. 0.26] [0.14 0.40. 0.19 0.40.50.78 0.36 0.91 0.22] [0.43 0.20.70.29 0.70.42 0.72 0.36 0.55 0.48] [0.50.30.60.14 1. 0.83 0.83 0.36 0.09 0.43] [0.71 0.70.80.33 0.50.67 0.39 0.64 0.55 0.48] [0.07 0.30.80.38 0.50.33 0.50.09 0.73 0.3 ] [0.21 0.70.60.29 0.70.08 0.39 0.82 0.27 0.39] [0.43 1. 0. 0.43 0.30.50. 0. 0.82 0.17] [0. 0.50.90.48 0.90.42 0.67 0. 0.73 0.3 ] [0.36 0.20.50.81 0.10.75 0.39 0.18 0.45 0.3 ] [0.57 0.10.70.24 0.30.42 0.44 0.18 0.64 0.22] [0.57 0.10.10.57 0.30.92 0.72 0.18 0.27 0.17] [0.07 0.20.60.19 0.20.58 0.44 0.64 0.36 0.04] [0.29 0. 0.20.19 0.81. 0.28 0.64 0.55 0.] [0.36 0.30.80.52 0.40.67 0.33 0.64 0.64 0.09] [0.07 0.50.80.29 0.30.42 0.39 0.45 0.64 0.3 ] [0.21 0.20.30.38 0.30.33 0.44 1. 0.64 0.3 ] [0.36 0. 0.70.33 0.40.67 0.39 0.45 0.64 0.22] [0.36 0.10.30.29 0.20.08 0.28 0.55 0.27 0.22]]方法二:标准化处理:[[ 2.81 -0.95 -0.643.051.771.421.930.110.3 3.89] [ 0.9 2.411.390.620.230.4 2.171.151.740.59] [ 0.58 -0.210.710.62 -0.16 -1.98 -0.680.450.3-0.36] [ 0.26 -0.95 -1.32 -1.590.610.06 -0.2 1.151.740.59] [-1.330.541.05 -0.491.380.4 1.690.451.380.82] [ 1.540.541.051.28 -0.160.060.510.45 -1.5-0.36] [ 0.26 -0.581.05 -0.49 -1.32 -0.620.03 -0.580.3-0.83] [ 0.261.291.05 -0.93 -0.55 -0.28 -0.920.8 0.66 -1.3 ] [-0.69 -0.21 -0.64 -1.59 -1.32 -1.64 -1.15 -1.62 -0.78 -0.12] [ 0.260.921.05 -0.93 -0.16 -0.62 -0.92 -0.24 -0.42 -0.36] [-0.69 -0.58 -0.64 -0.27 -1.7-0.62 -0.44 -1.62 -2.22 -0.12] [-1.010.17 -2. -0.71 -0.160.061.22 -0.241.38 -0.36] [ 0.26 -0.580.37 -0.271. -0.280.98 -0.24 -0.061.06] [ 0.58 -0.210.04 -0.932.151.421.46 -0.24 -1.860.82] [ 1.541.290.71 -0.050.230.74 -0.440.8-0.061.06] [-1.33 -0.210.710.180.23 -0.620.03 -1.280.660.11] [-0.691.290.04 -0.271. -1.64 -0.441.49 -1.140.59] [ 0.262.41 -2.0.4-0.550.06 -2.1-1.621.02 -0.59] [-1.650.541.050.621.77 -0.280.74 -1.620.660.11] [-0.05 -0.58 -0.3 2.16 -1.321.08 -0.44 -0.93 -0.420.11] [ 0.9-0.950.37 -0.49 -0.55 -0.28 -0.2-0.930.3-0.36] [ 0.9-0.95 -1.661.06 -0.551.750.98 -0.93 -1.14 -0.59] [-1.33 -0.580.04 -0.71 -0.930.4-0.2 0.8-0.78 -1.3 ] [-0.37 -1.33 -1.32 -0.711.382.09 -0.920.8-0.06 -1.54] [-0.05 -0.210.710.84 -0.160.74 -0.680.8 0.3-1.07] [-1.330.540.71 -0.27 -0.55 -0.28 -0.440.110.3 0.11] [-0.69 -0.58 -0.980.18 -0.55 -0.62 -0.2 2.180.3 0.11] [-0.05 -1.330.37 -0.05 -0.160.74 -0.440.110.3-0.36] [-0.05 -0.95 -0.98 -0.27 -0.93 -1.64 -0.920.45 -1.14 -0.36]]

Ⅷ.数据相关性分析

# 相关性分析pearson=catering_sale_all.corr(method='pearson')print('相关性矩阵:\n',pearson)y2=catering_sale_all.iloc[:,1:]# 相关性矩阵散点图# sns.pairplot(y2,kind='reg',diag_kind='hist')# plt.title('相关性矩阵散点图')# plt.show()# 相关性矩阵热力图sns.heatmap(pearson,square=True,annot=True)plt.title('相关性矩阵热力图')plt.show()

输出结果如下:
图片[5] - python数据预处理 - MaxSSL

相关性矩阵:百合酱蒸凤爪翡翠蒸香茜饺 金银蒜汁蒸排骨 乐膳真味鸡 蜜汁焗餐包生炒菜心铁板酸菜豆腐 香煎韭菜饺 香煎罗卜糕原汁原味菜心百合酱蒸凤爪 1.0000000.0092060.0167990.4556380.1267000.3084960.2048980.127448 -0.0902760.428316翡翠蒸香茜饺 0.0092061.0000000.304434 -0.0122790.066560 -0.180446 -0.0269080.0623440.2702760.020462金银蒜汁蒸排骨0.0167990.3044341.0000000.0351350.122710 -0.1842900.1872720.1215430.0778080.029074乐膳真味鸡0.455638 -0.0122790.0351351.0000000.0896020.3254620.297692 -0.068866 -0.0302220.421878蜜汁焗餐包0.1267000.0665600.1227100.0896021.0000000.3610680.5287720.2195780.2005500.516849生炒菜心 0.308496 -0.180446 -0.1842900.3254620.3610681.0000000.3697870.0382330.0498980.122988铁板酸菜豆腐 0.204898 -0.0269080.1872720.2976920.5287720.3697871.0000000.0955430.1579580.567332香煎韭菜饺0.1274480.0623440.121543 -0.0688660.2195780.0382330.0955431.0000000.1783360.049689香煎罗卜糕 -0.0902760.2702760.077808 -0.0302220.2005500.0498980.1579580.1783361.0000000.088980原汁原味菜心 0.4283160.0204620.0290740.4218780.5168490.1229880.5673320.0496890.0889801.000000
© 版权声明
THE END
喜欢就支持一下吧
点赞0 分享