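This post records my work on a Tianchi teaching competition: predicting whether bank customers will subscribe to a product. The competition page is:
[Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction (Learning Competition, Tianchi Competitions, Alibaba Cloud Tianchi)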
1. Reading the data
import pandas as pd

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
2. Data processing
2.1 Merging the data
# Merge the training and test sets so that features are processed the same way
df = pd.concat([train, test], axis=0)  # concatenate the two sets row-wise
df
Result:
         id  age           job   marital            education  default housing loan    contact month  ...
0         1   51        admin.  divorced  professional.course       no     yes  yes   cellular   aug  ...
1         2   50      services   married          high.school  unknown     yes   no   cellular   may  ...
2         3   48   blue-collar  divorced             basic.9y       no      no   no   cellular   apr  ...
3         4   26  entrepreneur    single          high.school      yes     yes  yes   cellular   aug  ...
4         5   45        admin.    single    university.degree       no      no   no   cellular   nov  ...
...     ...  ...           ...       ...                  ...      ...     ...  ...        ...   ...  ...
7495  29996   49        admin.   unknown    university.degree  unknown     yes  yes  telephone   apr  ...
7496  29997   34   blue-collar   married             basic.4y       no      no   no   cellular   jul  ...
7497  29998   50       retired    single             basic.4y       no     yes   no   cellular   jun  ...
7498  29999   31    technician   married  professional.course       no      no   no   cellular   aug  ...
7499  30000   46        admin.  divorced    university.degree       no     yes   no   cellular   aug  ...

      campaign|pdays|previous     poutcome  emp_var_rate  cons_price_index  cons_conf_index  lending_rate3m  nr_employed subscribe
0                       11122      failure           1.4             90.81           -35.53            0.69      5219.74        no
1                       14122  nonexistent          -1.8             96.33           -40.58            4.05      4974.79       yes
2                      010271      failure          -1.8             96.33           -44.74            1.50      5022.61        no
3                      269980  nonexistent           1.4             97.08           -35.55            5.11      5222.87       yes
4                       12404      success          -3.4             89.82           -33.83            1.17      4884.70        no
...                       ...          ...           ...               ...              ...             ...          ...       ...
7495                   503021      failure          -1.8             95.77           -40.50            3.86      5058.64       NaN
7496                    84403      failure           1.4             90.59           -47.29            1.77      5156.70       NaN
7497                    39970  nonexistent          -2.9             97.42           -39.69            1.29      5116.80       NaN
7498                   310280  nonexistent           1.4             96.90           -37.68            5.18      5144.45       NaN
7499                    23873      success           1.4             97.49           -31.54            3.79      5082.25       NaN

30000 rows × 22 columns
(The campaign, pdays and previous values are run together in the captured output and are shown unsplit.)
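The data contains both numeric and text columns, so the text columns need to be converted to numbers.
2.2 Converting non-numeric features to numbers
# Select every feature whose dtype is object (non-numeric); these are the columns to encode
cat_columns = df.select_dtypes(include='object').columns
df[cat_columns]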
# Encode the non-numeric features
from sklearn.preprocessing import LabelEncoder

job_le = LabelEncoder()
df['job'] = job_le.fit_transform(df['job'])
df['marital'] = df['marital'].map({'unknown': 0, 'single': 1, 'married': 2, 'divorced': 3})
df['education'] = df['education'].map({'unknown': 0, 'basic.4y': 1, 'basic.6y': 2, 'basic.9y': 3,
                                       'high.school': 4, 'university.degree': 5,
                                       'professional.course': 6, 'illiterate': 7})
df['housing'] = df['housing'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['loan'] = df['loan'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
df['day_of_week'] = df['day_of_week'].map({'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4})
df['poutcome'] = df['poutcome'].map({'nonexistent': 0, 'failure': 1, 'success': 2})
df['default'] = df['default'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['month'] = df['month'].map({'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
                               'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12})
df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
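One thing to watch when encoding with Series.map and a dict: any value missing from the dict silently becomes NaN. The month mapping above, for example, has no entries for 'jan' or 'feb'. Below is a minimal sketch for checking coverage on toy data. The helper name map_with_check and the toy series are hypothetical and are not part of the original code.

```python
import pandas as pd


def map_with_check(series, mapping):
    """Apply a dict mapping and list the values the dict does not cover.

    Hypothetical helper for illustration only.
    """
    # Values present in the data but absent from the mapping keys
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    return series.map(mapping), unmapped


# Toy data: 'jan' is deliberately missing from the mapping
toy_month = pd.Series(['mar', 'apr', 'jan', 'may'])
month_map = {'mar': 3, 'apr': 4, 'may': 5}

encoded, unmapped = map_with_check(toy_month, month_map)
print(unmapped)              # values that would turn into NaN
print(encoded.isna().sum())  # number of NaN entries after mapping
```

Running such a check on each mapped column before training would show whether the real data contains categories that the dicts leave out.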
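2.3 Splitting the data
# Split the data back into training and test sets by checking whether subscribe is null
train = df[df['subscribe'].notnull()]
test = df[df['subscribe'].isnull()]
# Check the share of labels 0 and 1 in the training set: the classes are imbalanced,
# with about 6.6 times as many 0s as 1s
train['subscribe'].value_counts()
Output: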
0.0    19548
1.0     2952
Name: subscribe, dtype: int64
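The two counts above give the imbalance ratio directly. They also give the accuracy that a trivial classifier predicting "no" for everyone would reach on the training set, which is a useful baseline when reading the accuracy figures later on. A small worked calculation using only the printed counts:

```python
# Label counts taken from the value_counts() output above
counts = {0: 19548, 1: 2952}
total = counts[0] + counts[1]

imbalance_ratio = counts[0] / counts[1]   # how many 0s per 1
positive_share = counts[1] / total        # fraction of subscribers
majority_baseline = counts[0] / total     # accuracy of always predicting 0

print(f"ratio 0:1         = {imbalance_ratio:.2f}")
print(f"positive share    = {positive_share:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

With these counts the ratio is about 6.62, positives make up 13.12% of the training rows, and always predicting 0 would already be right 86.88% of the time.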
2.4 Analyzing the data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Numeric features: every column that is neither categorical nor the id
num_features = [x for x in train.columns if x not in cat_columns and x != 'id']
fig = plt.figure(figsize=(80, 60))
# Draw one box plot per numeric feature
for i in range(len(num_features)):
    plt.subplot(7, 2, i + 1)
    sns.boxplot(train[num_features[i]])
    plt.ylabel(num_features[i], fontsize=36)
plt.show()
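The box plots show outliers, which are handled next.
2.5 Handling outliers
# Clip values that lie more than 10 interquartile ranges beyond the quartiles
for colum in num_features:
    temp = train[colum]
    q1 = temp.quantile(0.25)  # lower quartile
    q2 = temp.quantile(0.75)  # upper quartile
    delta = (q2 - q1) * 10
    train[colum] = np.clip(temp, q1 - delta, q2 + delta)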
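To see what this clipping rule does to a single extreme value, here is a small self-contained sketch on toy numbers. The function name clip_by_iqr and the toy array are made up for illustration; the factor k=10 mirrors the multiplier used above.

```python
import numpy as np


def clip_by_iqr(values, k=10.0):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)


# Toy data with one extreme value
toy = np.array([1, 2, 3, 4, 5, 1000])
clipped = clip_by_iqr(toy, k=10.0)
print(clipped)
```

Here Q1 is 2.25 and Q3 is 4.75, so the upper bound is 4.75 + 10 * 2.5 = 29.75: the value 1000 is pulled down to 29.75 while the other values are unchanged. A smaller k would clip more aggressively.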
2.6 Other processing
I also tried class balancing and feature selection, but both made the final classification results worse, so they are left out here. The original code is included below for reference.
'''
# Oversampling with SMOTE: training results improved, but the final classification
# results got worse, so oversampling is not used for now
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

#smo = SMOTE(random_state=0, k_neighbors=10)
adasyn = ADASYN()
X_smo, y_smo = adasyn.fit_resample(train.iloc[:,:-1], train.iloc[:,-1])
train_smo = pd.concat([X_smo, y_smo], axis=1)
train_smo['subscribe'].value_counts()
'''
'''
# Feature selection with SelectFromModel, using a tree model
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Extract the training data and labels
train_X = train.iloc[:,:-1]
train_y = train.iloc[:,-1]

# clf_etc is the tree model, FeaSel is the feature selector
clf_etc = ExtraTreesClassifier(n_estimators=50)
clf_etc = clf_etc.fit(train_X, train_y)
FeaSel = SelectFromModel(clf_etc, prefit=True)
train_sel = FeaSel.transform(train_X)
test_sel = FeaSel.transform(test.iloc[:,:-1])

# Recover the selected feature names and write them back onto the data
train_new = pd.DataFrame(train_sel)
feature_idx = FeaSel.get_support()  # mask of the selected columns
train_new.columns = train_X.columns[feature_idx]  # restore the column names
train_new = pd.concat([train_new, train_y], axis=1)
test_new = pd.DataFrame(test_sel)
test_new.columns = train_X.columns[feature_idx]
'''
The code in this part may contain some variable-naming problems.
2.7 Saving the data
train_new = train
test_new = test
# Write the processed data to train_new and test_new and save them
train_new.to_csv('train_new.csv', index=False)
test_new.to_csv('test_new.csv', index=False)
3. Model training
3.1 Importing packages and data
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBRFClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import time

# Candidate models for comparison
clf_lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf_dt = DecisionTreeClassifier()
clf_rf = RandomForestClassifier()
clf_gb = GradientBoostingClassifier()
clf_adab = AdaBoostClassifier()
clf_xgbrf = XGBRFClassifier()
clf_lgb = LGBMClassifier()

# Reload the processed data and separate features from the label
train_new = pd.read_csv('train_new.csv')
test_new = pd.read_csv('test_new.csv')
feature_columns = [col for col in train_new.columns if col not in ['subscribe']]
train_data = train_new[feature_columns]
target_data = train_new['subscribe']
3.2 Hyperparameter tuning
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.2,
                                                    shuffle=True, random_state=2023)
#X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, shuffle=True, random_state=2023)

n_estimators = [300]
learning_rate = [0.02]      # original note: 0.2 was best
subsample = [0.6]
colsample_bytree = [0.7]    # original note: 0.6 was best in [0.5, 0.6, 0.7]
max_depth = [9, 11, 13]     # original note: 11 was best in [7, 9, 11, 13]
is_unbalance = [False]
early_stopping_rounds = [300]
num_boost_round = [5000]
metric = ['binary_logloss']
feature_fraction = [0.6, 0.75, 0.9]
bagging_fraction = [0.6, 0.75, 0.9]
bagging_freq = [2, 4, 5, 8]
lambda_l1 = [0, 0.1, 0.4, 0.5]
lambda_l2 = [0, 10, 15, 35]
cat_smooth = [1, 10, 15, 20]

param = {'n_estimators': n_estimators,
         'learning_rate': learning_rate,
         'subsample': subsample,
         'colsample_bytree': colsample_bytree,
         'max_depth': max_depth,
         'is_unbalance': is_unbalance,
         'early_stopping_rounds': early_stopping_rounds,
         'num_boost_round': num_boost_round,
         'metric': metric,
         'feature_fraction': feature_fraction,
         'bagging_fraction': bagging_fraction,
         'lambda_l1': lambda_l1,
         'lambda_l2': lambda_l2,
         'cat_smooth': cat_smooth}

model = LGBMClassifier()
clf = GridSearchCV(model, param, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])
print(clf.best_params_, clf.best_score_)
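Parameters that list only a single value were already tuned with GridSearchCV in earlier runs; the code shown here searches over the last six parameters. Searching all parameters together takes too long, so they were tuned in separate stages.
Result: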
Early stopping, best iteration is:
[287]   training's binary_logloss: 0.22302   valid_1's binary_logloss: 0.253303
{'bagging_fraction': 0.6, 'cat_smooth': 1, 'colsample_bytree': 0.7, 'early_stopping_rounds': 300, 'feature_fraction': 0.75, 'is_unbalance': False, 'lambda_l1': 0.4, 'lambda_l2': 10, 'learning_rate': 0.02, 'max_depth': 11, 'metric': 'binary_logloss', 'n_estimators': 300, 'num_boost_round': 5000, 'subsample': 0.6} 0.8853333333333334
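The remark about training time can be made concrete by counting how many models a grid search has to fit: the number of parameter combinations multiplied by the number of cross-validation folds. The sketch below uses the candidate lists from the param dict in the search above together with cv=3, and only counts; it trains nothing.

```python
from math import prod

# Candidate lists copied from the param dict in the grid search above
param = {
    'n_estimators': [300],
    'learning_rate': [0.02],
    'subsample': [0.6],
    'colsample_bytree': [0.7],
    'max_depth': [9, 11, 13],
    'is_unbalance': [False],
    'early_stopping_rounds': [300],
    'num_boost_round': [5000],
    'metric': ['binary_logloss'],
    'feature_fraction': [0.6, 0.75, 0.9],
    'bagging_fraction': [0.6, 0.75, 0.9],
    'lambda_l1': [0, 0.1, 0.4, 0.5],
    'lambda_l2': [0, 10, 15, 35],
    'cat_smooth': [1, 10, 15, 20],
}

cv_folds = 3
# A full grid tries every combination, so the sizes multiply
n_combinations = prod(len(values) for values in param.values())
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # 1728 5184
```

Every extra list with several candidates multiplies the total, which is why tuning a few parameters at a time keeps each search manageable. The trade-off is that staged tuning can miss interactions between parameters tuned in different stages.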
3.3 Prediction results
# Evaluate the tuned model on the held-out split
y_true, y_pred = y_test, clf.predict(X_test)
accuracy = accuracy_score(y_true, y_pred)
print(classification_report(y_true, y_pred))
print('Accuracy', accuracy)
Result:
              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94      3933
         1.0       0.60      0.32      0.42       567

    accuracy                           0.89      4500
   macro avg       0.75      0.64      0.68      4500
weighted avg       0.87      0.89      0.87      4500

Accuracy 0.8875555555555555
View the confusion matrix:
from sklearn import metrics

confusion_matrix_result = metrics.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('predict')
plt.ylabel('true')
plt.show()
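The f1-score column of the classification report above follows from the precision and recall columns, and the macro and weighted averages follow from the per-class scores and supports. The sketch below recomputes them from the rounded values printed in the report, so the results agree with the report only to two decimals. The helper name f1_from is made up for illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values read off the classification report above
f1_neg = f1_from(0.91, 0.97)   # class 0.0
f1_pos = f1_from(0.60, 0.32)   # class 1.0
support = {'neg': 3933, 'pos': 567}

# Macro average: plain mean of the reported per-class F1 scores
macro_f1 = (0.94 + 0.42) / 2
# Weighted average: mean of per-class F1 weighted by class support
weighted_f1 = (0.94 * support['neg'] + 0.42 * support['pos']) / sum(support.values())

print(round(f1_neg, 2), round(f1_pos, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The low F1 for class 1.0 comes mainly from its recall of 0.32: most actual subscribers in the held-out split are predicted as non-subscribers, even though overall accuracy looks high.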
4. Output results
# Predict on the test set and write the submission file
test_x = test[feature_columns]
pred_test = clf.predict(test_x)
result = pd.read_csv('./submission.csv')
subscribe_map = {1: 'yes', 0: 'no'}
result['subscribe'] = [subscribe_map[x] for x in pred_test]
result.to_csv('./baseline_lgb1.csv', index=False)
result['subscribe'].value_counts()
Result:
no     6987
yes     513
Name: subscribe, dtype: int64
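5. Submitting the results
6. Summary
My approach only reached a score of 0.9676. I hope you can improve on this code and get better results. If you find a better method, please tell me in the comments so we can discuss it.
Ideas for improvement:
1. Data processing: when I balanced the classes, the training results looked very good but the final results were worse, probably because of overfitting. Outlier handling could also be considered more carefully.
2. Method: I compared lr, dt, rf, gb, adab, xgbrf and lgb. lgb performed best, so I chose it for tuning. Combining several methods for training is worth considering.
3. Further tuning on top of lgb. This is the least technically interesting option, but spending time on it should give better results than mine.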
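For idea 2, one simple way to combine models is soft voting: average the positive-class probabilities predicted by several classifiers and threshold the average. Below is a minimal sketch with NumPy, assuming each model can output probabilities (for example through predict_proba). The function name soft_vote and the toy probabilities are made up for illustration, and this was not tried in the work above.

```python
import numpy as np


def soft_vote(prob_list, weights=None, threshold=0.5):
    """Average positive-class probabilities across models, then threshold.

    prob_list: list of 1-D sequences, one per model, all the same length.
    weights:   optional per-model weights (None means equal weights).
    """
    stacked = np.vstack(prob_list)                        # shape: (n_models, n_samples)
    avg = np.average(stacked, axis=0, weights=weights)    # per-sample mean probability
    labels = (avg >= threshold).astype(int)
    return avg, labels


# Toy probabilities for three samples from three hypothetical models
p_lgb = [0.90, 0.40, 0.20]
p_rf = [0.70, 0.60, 0.10]
p_lr = [0.80, 0.55, 0.40]

avg, labels = soft_vote([p_lgb, p_rf, p_lr])
print(avg.round(3), labels)
```

On the second sample the models disagree (0.40, 0.60, 0.55), and the average of about 0.517 tips the vote to class 1. Passing weights would let a stronger model such as lgb count for more.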