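Contents
1. Importing the data
2. Inspecting the data
Visualizing the share of missing values
Plotting histograms of all variables
Correlation of each feature with the target variable price
3. Data processing
Handling outliers
Checking the values of seller and offerType
Checking the feature notRepairedDamage
Clipping outliers
Filling missing values
Dropping features with no variation
Inspecting the target variable price
Log-transforming price
4. Feature construction
New feature: per-brand sales statistics
New feature: time in use
Binning continuous features
Normalizing numeric features
Crossing anonymous features
Mean encoding
5. Feature selection
Correlation of each column with the transaction price
One-hot encoding the categorical features
Splitting features and labels
Selecting features with LightGBM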
1. Importing the data
# coding: utf-8
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Use a warnings filter to suppress warning messages.
warnings.filterwarnings('ignore')

# Show all columns.
pd.set_option('display.max_columns', None)
# # Show all rows.
# pd.set_option('display.max_rows', None)

Train_data = pd.read_csv("二手汽车价格预测/used_car_train_20200313.csv", sep=' ')
Test_data = pd.read_csv('二手汽车价格预测/used_car_testB_20200421.csv', sep=' ')

Train_data.shape, Test_data.shape
# ((150000, 31), (50000, 30))

Train_data.tail()
# Test_data.head()
2. Inspecting the data
Train_data.info()
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             150000 non-null  int64
 1   name               150000 non-null  int64
 2   regDate            150000 non-null  int64
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object
 11  regionCode         150000 non-null  int64
 12  seller             150000 non-null  int64
 13  offerType          150000 non-null  int64
 14  creatDate          150000 non-null  int64
 15  price              150000 non-null  int64
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
Train_data.duplicated().sum()  # no duplicate rows
Train_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
The features bodyType, fuelType, gearbox and model contain missing values.
Visualizing the share of missing values
# Visualize the NaN count per column
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
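The bar chart above plots raw counts, while the heading talks about the share of missing values. A minimal sketch of the ratio version follows; the toy frame and its values are made up for illustration and simply stand in for Train_data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Train_data.
toy = pd.DataFrame({
    "bodyType": [1.0, np.nan, 3.0, np.nan],
    "gearbox": [0.0, 1.0, np.nan, 1.0],
    "power": [75, 150, 60, 140],
})

# isnull().mean() is the fraction of missing rows per column.
missing_ratio = toy.isnull().mean()
missing_ratio = missing_ratio[missing_ratio > 0].sort_values()
print(missing_ratio)
# missing_ratio.plot.bar() would draw the same chart on a 0-1 scale.
```

On the real data the same two lines would show, for example, 8680 / 150000 (about 5.8%) for fuelType.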
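Plotting histograms of all variables
Train_data.hist(bins=50, figsize=(20, 15))
plt.cla()  # clear the current axes
The histograms show that seller, offerType and creatDate are very unevenly distributed, so we look at them individually.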
Correlation of each feature with the target variable price
# Drop string-typed variables and some irrelevant ones to get the column names we need
numeric_columns = Train_data.select_dtypes(exclude='object').columns
columns = [col for col in numeric_columns if col not in ['SaleID', 'name']]
# Extract the data by column name
train_set = Train_data[columns]
# Compute the correlation of each column with the transaction price
correlation = train_set.corr()
correlation['price'].sort_values(ascending=False)
price         1.000000
v_12          0.692823
v_8           0.685798
v_0           0.628397
regDate       0.611959
gearbox       0.329075
bodyType      0.241303
power         0.219834
fuelType      0.200536
v_5           0.164317
model         0.136983
v_2           0.085322
v_6           0.068970
v_1           0.060914
v_14          0.035911
regionCode    0.014036
creatDate     0.002955
seller       -0.002004
v_13         -0.013993
brand        -0.043799
v_7          -0.053024
v_4          -0.147085
v_9          -0.206205
v_10         -0.246175
v_11         -0.275320
kilometer    -0.440519
v_3          -0.730946
offerType          NaN
Name: price, dtype: float64
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
3. Data processing
Handling outliers
Checking the values of seller and offerType
Train_data['seller'].value_counts()
# Change the single anomalous value 1 in seller to 0
Train_data.loc[Train_data['seller'] == 1, 'seller'] = 0
Train_data['seller'].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Train_data['offerType'].value_counts()
0    150000
Name: offerType, dtype: int64
As shown, seller and offerType have essentially no variation: almost every row takes the same value, so both features can be dropped.
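Rather than checking columns one by one, near-constant features can be flagged programmatically. The sketch below is only an illustration: the toy frame, the helper name dominant_share and the 0.95 cut-off are assumptions, not something from the original analysis.

```python
import pandas as pd

# Hypothetical toy frame: seller and offerType are (almost) constant.
toy = pd.DataFrame({
    "seller": [0] * 99 + [1],
    "offerType": [0] * 100,
    "gearbox": [0] * 60 + [1] * 40,
})


def dominant_share(col: pd.Series) -> float:
    """Fraction of rows taken by the most frequent value of a column."""
    return col.value_counts(normalize=True, dropna=False).iloc[0]


threshold = 0.95  # assumed cut-off for "almost no variation"
shares = toy.apply(dominant_share)
near_constant = shares[shares >= threshold].index.tolist()
print(near_constant)
```

Columns returned in near_constant are candidates for dropping, just as seller and offerType are dropped later in this section.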
Checking the feature notRepairedDamage
notRepairedDamage does contain missing values, but they are encoded as "-", so the null checks above did not detect them. Replace "-" with NaN.
Train_data['notRepairedDamage'].value_counts()
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
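After the replacement the column is still of dtype object, because the remaining values are strings. A small sketch of one way to handle the placeholder and the dtype in a single step is shown here; the sample values are invented, and using pd.to_numeric is a suggestion rather than what the original code does.

```python
import pandas as pd

# Hypothetical values mimicking the column: strings with "-" as a placeholder.
raw = pd.Series(["0.0", "-", "1.0", "0.0"], name="notRepairedDamage")

# errors="coerce" turns anything unparseable (here "-") into NaN
# and returns a float column.
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.isnull().sum())
print(cleaned.dtype)
```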
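Clipping outliers
Train_data['power'].value_counts()
0       12829
75       9593
150      6495
60       6374
140      5963
        ...
513         1
1993        1
19          1
751         1
549         1
Name: power, Length: 566, dtype: int64
The competition description specifies the allowed range of power:
power | engine power, range [0, 600] |
Clip the outliers:
# Clip outliers
Train_data.loc[Train_data['power'] > 600, 'power'] = 600
# The next statement is truncated in the source and is left as found:
# Train_data['power'][Train_data['power']6] = 6
Train_data.loc[Train_data['v_14'] > 4, 'v_14'] = 4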
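The same kind of truncation can be written with Series.clip, which avoids chained indexing. This is a minimal sketch on made-up values, using the [0, 600] range quoted above for power.

```python
import pandas as pd

# Hypothetical power values, two of them outside the allowed range.
power = pd.Series([0, 75, 150, 1993, 751], name="power")

# Values below 0 become 0, values above 600 become 600.
clipped = power.clip(lower=0, upper=600)
print(clipped.tolist())  # [0, 75, 150, 600, 600]
```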
Filling missing values
Missing values in categorical features are filled with the mode.
print(Train_data.bodyType.mode())
print(Train_data.fuelType.mode())
print(Train_data.gearbox.mode())
print(Train_data.model.mode())
# Fill missing values with the mode (hard-coded as 0 here)
Train_data['bodyType'] = Train_data['bodyType'].fillna(0)
Train_data['fuelType'] = Train_data['fuelType'].fillna(0)
Train_data['gearbox'] = Train_data['gearbox'].fillna(0)
Train_data['model'] = Train_data['model'].fillna(0)
Train_data.isnull().sum()
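The code above prints the modes and then fills with a hard-coded 0. A slightly more robust variant computes the fill value from the data, so it keeps working if the mode is not 0. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing categorical codes.
toy = pd.DataFrame({
    "bodyType": [0.0, 0.0, 2.0, np.nan],
    "gearbox": [1.0, np.nan, 1.0, 0.0],
})

for col in ["bodyType", "gearbox"]:
    # mode() returns a Series (ties are possible); take its first entry.
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```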
Dropping features with no variation
'seller', 'offerType'
# Drop the columns whose values do not vary
Train_data.head()
Train_data = Train_data.drop(['seller', 'offerType'], axis=1)
Train_data.head()
Inspecting the target variable price
# Check the skewness and kurtosis of the target variable
sns.distplot(Train_data['price'])
print("Skewness: %f" % Train_data['price'].skew())  # skewness
print("Kurtosis: %f" % Train_data['price'].kurt())  # kurtosis
# Train_data.skew(), Train_data.kurt()
Skewness: 3.346487
Kurtosis: 18.995183
# Look at the frequency distribution of the target variable
# Plot a histogram of the label to inspect its distribution
plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
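Compressing the long tail of price with a log transform
np.log1p()
During preprocessing, strongly skewed data can first be transformed with log1p so that it is closer to a Gaussian distribution; this step may improve the results of the downstream regression models.
# Log-transform the target variable so that it is closer to a normal distribution
Train_data['price'] = np.log1p(Train_data['price'])
plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
sns.distplot(Train_data['price'])
print("Skewness: %f" % Train_data['price'].skew())  # skewness
print("Kurtosis: %f" % Train_data['price'].kurt())  # kurtosis
Skewness: -0.261727
Kurtosis: -0.182127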
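Because price is replaced by log1p(price) from here on, every model below is trained and scored in log space, and predictions have to be mapped back with np.expm1 before they can be read as prices. A minimal sketch with made-up prices:

```python
import numpy as np

# Hypothetical prices, including 0 to show that log1p is defined there.
price = np.array([0.0, 500.0, 3500.0, 99999.0])

log_price = np.log1p(price)     # log(1 + price)
restored = np.expm1(log_price)  # exp(x) - 1, the inverse of log1p

print(log_price)
print(restored)
```

The same expm1 call would be applied to model predictions before submitting them or computing an error in the original price units.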
4. Feature construction
4.1 New feature: per-brand sales statistics
# Compute sales statistics for each brand
Train_gb = Train_data.groupby("brand")
all_info = {}
for kind, kind_data in Train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
Train_data = Train_data.merge(brand_fe, how='left', on='brand')
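Most of these statistics can also be produced with a single groupby aggregation. The sketch below is a simplified alternative on made-up data; it leaves out the price > 0 filter and the smoothed brand_price_average from the loop above.

```python
import pandas as pd

# Hypothetical toy data: two brands with a few prices each.
toy = pd.DataFrame({
    "brand": [0, 0, 1, 1, 1],
    "price": [10.0, 20.0, 5.0, 7.0, 9.0],
})

# Named aggregation: one output column per keyword argument.
brand_fe = (
    toy.groupby("brand")["price"]
    .agg(
        brand_amount="count",
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)

# Attach the per-brand statistics to every row of that brand.
toy = toy.merge(brand_fe, how="left", on="brand")
print(toy)
```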
4.2 New feature: time in use
In general, the longer a car has been in use, the lower its price.
# Time in use: Train_data['creatDate'] - Train_data['regDate']
# In general, the longer a car has been in use, the lower its price.
# Some dates in the data are malformed; with errors='coerce', values that cannot be converted become NaT
Train_data['used_time'] = (pd.to_datetime(Train_data['creatDate'], format='%Y%m%d', errors='coerce') -
                           pd.to_datetime(Train_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
Train_data['used_time'].isnull().sum()
Train_data['used_time'].mean()  # 4432.082407160321
# Fill missing values with the mean or the mode
Train_data['used_time'] = Train_data['used_time'].fillna(4432)
Train_data['used_time'].isnull().sum()
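To make the effect of errors='coerce' concrete, here is a small sketch with made-up dates, one of which has month 00 and therefore cannot be parsed. Casting to str first and filling with the computed mean (instead of the hard-coded 4432) are choices made for this illustration.

```python
import pandas as pd

# Hypothetical dates in YYYYMMDD form; the second regDate has month "00".
toy = pd.DataFrame({
    "regDate": [20040402, 20070009, 20100801],
    "creatDate": [20160404, 20160309, 20160402],
})

reg = pd.to_datetime(toy["regDate"].astype(str), format="%Y%m%d", errors="coerce")
creat = pd.to_datetime(toy["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Rows with an unparseable date end up as NaN after the subtraction.
toy["used_time"] = (creat - reg).dt.days
# Fill them with the mean of the valid rows.
toy["used_time"] = toy["used_time"].fillna(toy["used_time"].mean())
print(toy)
```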
4.3 Binning continuous features
# Bin continuous data
# Bin power
bins = [i * 10 for i in range(31)]  # 30 bins
Train_data['power_bin'] = pd.cut(Train_data['power'], bins, labels=False)
Train_data[['power_bin', 'power']].head()
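One detail worth knowing about pd.cut: by default the intervals are open on the left and closed on the right, so with edges 0, 10, ..., 300 a power of exactly 0 and any power above 300 fall outside every bin and become NaN. A short sketch on made-up values; include_lowest=True is shown only as an option, not as what the code above does.

```python
import pandas as pd

bins = [i * 10 for i in range(31)]  # edges 0, 10, ..., 300
power = pd.Series([0, 5, 10, 75, 300, 600])  # hypothetical values

# Default: intervals (0, 10], (10, 20], ..., (290, 300]
default_bins = pd.cut(power, bins, labels=False)
# include_lowest=True makes the first interval include its left edge, 0
with_lowest = pd.cut(power, bins, labels=False, include_lowest=True)

print(default_bins.tolist())
print(with_lowest.tolist())
```

Since 0 is the most frequent value of power in this dataset, a noticeable share of power_bin will be NaN under the default setting.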
kilometer is already binned:
plt.hist(Train_data['kilometer'])
# Drop columns that are no longer needed
Train_data = Train_data.drop(['name', 'SaleID', 'regionCode'], axis=1)
Train_data.head()
- The data in its current form can already be fed to tree models, so we export it.
Train_data.to_csv('data_for_tree.csv', index=0)
4.5 Normalizing numeric features
# We can build a second feature set for models such as LR and NN.
# It is built separately because different models place different requirements on the data.
# First look at the distribution:
Train_data['power'].plot.hist()
# Min-max scaling helper used throughout this section
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Take the log first, then normalize
# (note that log1p(x + 1) equals log(x + 2))
Train_data['power'] = np.log1p(Train_data['power'] + 1)
Train_data['power'] = max_min(Train_data['power'])
Train_data['power'].plot.hist()
# kilometer is already binned, so we can normalize it directly
Train_data['kilometer'] = max_min(Train_data['kilometer'])
Train_data['kilometer'].plot.hist()
# Normalize the features constructed earlier:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# We do not analyze them one by one here and simply apply the transform.
# Train_data['brand_amount'] = max_min(Train_data['brand_amount'])
Train_data['brand_price_average'] = max_min(Train_data['brand_price_average'])
Train_data['brand_price_max'] = max_min(Train_data['brand_price_max'])
Train_data['brand_price_median'] = max_min(Train_data['brand_price_median'])
Train_data['brand_price_min'] = max_min(Train_data['brand_price_min'])
Train_data['brand_price_std'] = max_min(Train_data['brand_price_std'])
Train_data['brand_price_sum'] = max_min(Train_data['brand_price_sum'])
Train_data.head()
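Only the training set is scaled in this post. When the test set is processed later, it should be scaled with the minimum and maximum learned from the training data, not with its own. A minimal sketch of that idea; the helper names fit_min_max and apply_min_max and the sample values are hypothetical.

```python
import pandas as pd


def fit_min_max(x: pd.Series):
    """Return the (min, max) pair learned from a training column."""
    return x.min(), x.max()


def apply_min_max(x: pd.Series, lo: float, hi: float) -> pd.Series:
    """Scale x with previously learned bounds."""
    return (x - lo) / (hi - lo)


# Hypothetical kilometer values for train and test.
train_km = pd.Series([0.5, 5.0, 12.5, 15.0])
test_km = pd.Series([1.0, 15.0])

lo, hi = fit_min_max(train_km)
train_scaled = apply_min_max(train_km, lo, hi)
test_scaled = apply_min_max(test_km, lo, hi)  # reuses the training bounds
print(train_scaled.tolist())
print(test_scaled.tolist())
```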
4.6 Crossing anonymous features
# Cross the anonymous features
num_cols = [0, 2, 3, 6, 8, 10, 12, 14]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        Train_data['new'+str(value)+'*'+str(j)] = Train_data['v_'+str(value)] * Train_data['v_'+str(j)]
        Train_data['new'+str(value)+'+'+str(j)] = Train_data['v_'+str(value)] + Train_data['v_'+str(j)]
        Train_data['new'+str(value)+'-'+str(j)] = Train_data['v_'+str(value)] - Train_data['v_'+str(j)]
num_cols1 = [3, 5, 1, 11]
for index, value in enumerate(num_cols1):
    for j in num_cols1[index+1:]:
        Train_data['new'+str(value)+'-'+str(j)] = Train_data['v_'+str(value)] - Train_data['v_'+str(j)]
for i in range(15):
    Train_data['new'+str(i)+'*year'] = Train_data['v_'+str(i)] * Train_data['used_time']
# This dataset can be used for LR
Train_data.to_csv('Train_data_for_lr.csv', index=0)
Train_data.head()
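The nested index loops above enumerate every unordered pair of columns; itertools.combinations expresses the same thing more directly. A small sketch on a made-up frame with three of the anonymous features:

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy frame with three anonymous features.
toy = pd.DataFrame({"v_0": [1.0, 2.0], "v_2": [3.0, 4.0], "v_3": [5.0, 6.0]})

# Each unordered pair (a, b) yields a product, a sum and a difference column.
for a, b in combinations([0, 2, 3], 2):
    toy[f"new{a}*{b}"] = toy[f"v_{a}"] * toy[f"v_{b}"]
    toy[f"new{a}+{b}"] = toy[f"v_{a}"] + toy[f"v_{b}"]
    toy[f"new{a}-{b}"] = toy[f"v_{a}"] - toy[f"v_{b}"]

print(toy.columns.tolist())
```

With the eight columns in num_cols there are 28 such pairs, so the first loop above adds 84 new columns.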
5. Feature selection
5.1 Correlation of each column with the transaction price
correlation = Train_data.corr()
x = correlation['price'].sort_values(ascending=False)
y = np.abs(x) >= 0.01
5.2 One-hot encoding the categorical features
# `data` is not defined in the preceding code; the (200000, 364) shape below
# suggests it is the training and test sets concatenated.
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])
print(data.shape)
data.columns
(200000, 364)
Index(['SaleID', 'name', 'regDate', 'power', 'kilometer', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=364)
5.3 Splitting features and labels
# Split features and labels
train_set = Train_data.copy()
y_train = train_set['price']
x_train = train_set.drop(['price', 'regDate', 'creatDate'], axis=1)
x_train.head()
Selecting features with LightGBM
import gc

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

features = pd.get_dummies(x_train)
feature_names = list(features.columns)
features = np.array(features)
labels = np.array(y_train).reshape((-1, ))
feature_importance_values = np.zeros(len(feature_names))
task = 'regression'
early_stopping = True
eval_metric = 'l2'
n_iterations = 10
for _ in range(n_iterations):
    if task == 'classification':
        model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05, verbose=-1)
    elif task == 'regression':
        model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05, verbose=-1)
    else:
        raise ValueError('Task must be either "classification" or "regression"')
    # Early stopping requires a validation set
    if early_stopping:
        train_features, valid_features, train_labels, valid_labels = train_test_split(
            features, labels, test_size=0.15)
        # Train the model with early stopping.
        # LightGBM 4.0 removed the early_stopping_rounds and verbose arguments of fit();
        # the early_stopping callback replaces them.
        model.fit(train_features, train_labels, eval_metric=eval_metric,
                  eval_set=[(valid_features, valid_labels)],
                  callbacks=[lgb.early_stopping(100, verbose=False)])
        gc.enable()
        del train_features, train_labels, valid_features, valid_labels
        gc.collect()
    else:
        model.fit(features, labels)
    # Record the feature importances
    feature_importance_values += model.feature_importances_ / n_iterations
feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
# Sort the features by importance
feature_importances = feature_importances.sort_values('importance', ascending=False).reset_index(drop=True)
# Compute the relative importance of each feature; the relative importances sum to 1
feature_importances['normalized_importance'] = feature_importances['importance'] / feature_importances['importance'].sum()
# Compute the cumulative importance
# cumsum: returns the cumulative sum along the given axis
feature_importances['cumulative_importance'] = np.cumsum(feature_importances['normalized_importance'])
# Select the features whose cumulative importance exceeds 0.99; these features will be dropped.
drop_columns = list(feature_importances.query('cumulative_importance>0.99')['feature'])
# Drop the low-importance columns
x_set = x_train.copy()
x_set.drop(drop_columns, axis=1, inplace=True)
# Overview of the resulting dataset
pd.set_option("display.max_info_columns", 300)  # let info() list up to 300 columns
x_set.info()
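To see what the cumulative-importance cut actually removes, here is the same sorting, normalizing and thresholding logic on a tiny made-up importance table. The feature names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical importances; they sum to 1000 to keep the arithmetic easy.
fi = pd.DataFrame({
    "feature": ["v_3", "v_12", "used_time", "power", "noise_a", "noise_b"],
    "importance": [500.0, 300.0, 150.0, 45.0, 4.0, 1.0],
})

fi = fi.sort_values("importance", ascending=False).reset_index(drop=True)
fi["normalized_importance"] = fi["importance"] / fi["importance"].sum()
fi["cumulative_importance"] = np.cumsum(fi["normalized_importance"])

# Cumulative values are roughly 0.5, 0.8, 0.95, 0.995, 0.999, 1.0.
drop_columns = list(fi.query("cumulative_importance > 0.99")["feature"])
print(drop_columns)  # ['power', 'noise_a', 'noise_b']
```

Note that the feature that crosses the 0.99 line (power in this toy example) is dropped as well, because its own cumulative value is already above the threshold.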
6. Modeling and hyperparameter tuning
# Evaluation metrics for how well a model fits
from sklearn.metrics import mean_squared_error, mean_absolute_error

def model_goodness(model, x, y):
    prediction = model.predict(x)
    mae = mean_absolute_error(y, prediction)
    mse = mean_squared_error(y, prediction)
    rmse = np.sqrt(mse)
    print('MAE:', mae)    # mean absolute error
    print('MSE:', mse)    # mean squared error
    print('RMSE:', rmse)  # root mean squared error
# Helper for reporting the generalization performance of a model:
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
# Start by fitting a simple linear regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(x_set, y_train)
model_goodness(lin_reg, x_set, y_train)
'''
MAE: 0.17541397968387218
MSE: 0.07846792179703589
RMSE: 0.28012126266500353
'''
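For reference, the three metrics printed by model_goodness can be computed by hand as follows. The arrays are made-up values on the log1p(price) scale, which is the scale y_train is on in this post, so the reported errors are log-space errors rather than errors in price units.

```python
import numpy as np

# Hypothetical targets and predictions on the log1p(price) scale.
y_true = np.array([8.0, 9.0, 10.0, 7.0])
y_pred = np.array([8.5, 9.0, 9.0, 7.5])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error

print(mae, mse, rmse)
```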
Random forest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(x_set, y_train)
model_goodness(forest_reg, x_set, y_train)
# Use 10-fold cross-validation to check how well the model generalizes
scores = cross_val_score(forest_reg, x_set, y_train, scoring='neg_mean_absolute_error', cv=10)
mae_scores = np.abs(-scores)
display_scores(mae_scores)
'''
MAE: 0.047468466346616035
MSE: 0.008013848284210116
RMSE: 0.08952009988941095
'''
The model is overfitting: the training MAE (about 0.047) is far below the cross-validated MAE (about 0.129).
'''
Scores: [0.1294032 0.12707153 0.12940989 0.12829302 0.13042102 0.1285104
 0.12762524 0.12703461 0.1289176 0.12968754]
Mean: 0.12863740448866307
Standard deviation: 0.0010828607409916612
'''
GBDT
# GBDT
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor()
gbrt.fit(x_set, y_train)
model_goodness(gbrt, x_set, y_train)
scores = cross_val_score(gbrt, x_set, y_train, scoring='neg_mean_absolute_error', cv=10)
mae_scores = np.abs(scores)
display_scores(mae_scores)
MAE: 0.1579591089700307
MSE: 0.06534997589709124
RMSE: 0.2556364134803398
Scores: [0.16032467 0.15964983 0.16159922 0.15899314 0.16286916 0.16034439
 0.15793287 0.1580428 0.15949101 0.16185252]
Mean: 0.16010996168246888
Standard deviation: 0.0015434916175588425
XGBoost
# XGBoost
import lightgbm as lgb
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(x_set, y_train)
model_goodness(xgb_reg, x_set, y_train)
scores = cross_val_score(xgb_reg, x_set, y_train, scoring='neg_mean_absolute_error', cv=10)
mae_scores = np.abs(scores)
display_scores(mae_scores)
'''
MAE: 0.11684430449593118
MSE: 0.03652492452344296
RMSE: 0.1911149510724971
Scores: [0.13500033 0.1333282 0.13477914 0.13414655 0.1365417 0.13534464
 0.13483075 0.13339024 0.1352027 0.13584453]
Mean: 0.1348408781266727
Standard deviation: 0.000958580534103817
'''
LightGBM
# LightGBM
lgb_reg = lgb.LGBMRegressor()
lgb_reg.fit(x_set, y_train)
model_goodness(lgb_reg, x_set, y_train)
scores = cross_val_score(lgb_reg, x_set, y_train, scoring='neg_mean_absolute_error', cv=10)
mae_scores = np.abs(scores)
display_scores(mae_scores)
'''
MAE: 0.1307250662409778
MSE: 0.049472769306324126
RMSE: 0.22242474976118132
Scores: [0.13610695 0.13486826 0.13710767 0.13597915 0.13788547 0.13687976
 0.13471174 0.13481778 0.13525209 0.13684043]
Mean: 0.13604493148788416
Standard deviation: 0.0010560012820324028
'''
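Hyperparameter tuning and model ensembling are still missing; they will be added later.
Hyperparameter tuning
1. Tuning the random forest with random search
We use RandomizedSearchCV from sklearn.model_selection to run a random search. The hyperparameters searched are bootstrap, the maximum number of features max_features, the maximum tree depth max_depth, and n_estimators.
from sklearn.model_selection import RandomizedSearchCV
# Define the parameter space (max_features and max_depth must be at least 1)
space_forest = {
    'bootstrap': [True, False],
    'max_features': list(range(1, 25, 1)),
    'max_depth': list(range(1, 100, 1)),
    'n_estimators': list(range(30, 150, 1))
}
# Random search, using the 5-fold cross-validation score to judge the candidates
forest_reg = RandomForestRegressor()
random_search = RandomizedSearchCV(forest_reg, space_forest, cv=5, scoring='neg_mean_squared_error')
# The search has to be fitted before best_params_ is available
random_search.fit(x_set, y_train)
# Retrieve the best parameters
random_search.best_params_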
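A scaled-down, self-contained version of the random search is sketched below so that the whole flow can be run in a few seconds. The synthetic data, the much smaller parameter ranges, n_iter=5 and cv=3 are all assumptions made for the illustration, not the settings used on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic regression data: 80 rows, 4 features.
rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = X[:, 0] * 3.0 + X[:, 1] + rng.normal(scale=0.1, size=80)

param_space = {
    "bootstrap": [True, False],
    "max_features": [1, 2, 3, 4],
    "max_depth": list(range(1, 6)),
    "n_estimators": list(range(10, 30)),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_space,
    n_iter=5,  # number of random parameter combinations to try
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)  # best_params_ only exists after fit()

print(search.best_params_)
print(-search.best_score_)  # flip the sign to read the score as an MSE
```

The scoring string is a negated error, so larger is better for the search and the sign has to be flipped when reporting.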
2. Tuning LightGBM with Bayesian optimization
The Python package hyperopt supports Bayesian hyperparameter optimization; the article "Python 环境下的自动化机器学习超参数调优" (automated machine-learning hyperparameter tuning in Python) covers it in detail.
# Tune LightGBM with Bayesian optimization
# Define the parameter space
from hyperopt import hp
space = {
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'max_depth': hp.quniform('max_depth', 0, 100, 1),
    'n_estimators': hp.quniform('n_estimators', 30, 150, 1)
}
# Objective function: the 5-fold cross-validated MAE
from sklearn.model_selection import cross_val_score

def objective(params, n_folds=5):
    # hp.quniform returns floats, so integer parameters are cast explicitly
    num_leaf = int(params['num_leaves'])
    estimator = int(params['n_estimators'])
    rate = params['learning_rate']
    sub_for_bin = int(params['subsample_for_bin'])
    max_dep = int(params['max_depth'])
    lgb_reg = lgb.LGBMRegressor(num_leaves=num_leaf, n_estimators=estimator, learning_rate=rate,
                                subsample_for_bin=sub_for_bin, max_depth=max_dep)
    scores = cross_val_score(lgb_reg, x_set, y_train, scoring='neg_mean_absolute_error', cv=n_folds)
    mae_scores = np.abs(scores)
    loss = mae_scores.mean()
    return loss
# Find the hyperparameter combination that minimizes the objective, using fmin from hyperopt
from hyperopt import Trials, fmin, tpe
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=500)
To be continued...