Table of Contents

  • Competition Background
  • Full Code
    • Import Packages
    • Read the Data (first 10,000 rows of training data, first 100 rows of test data)
    • Read All the Data
    • Get the Training and Test Data
    • Hold Out 40% of the Data for Offline Validation
    • Cross-Validation: Evaluating Estimator Performance
    • F1 Validation
    • Splitting Data with ShuffleSplit
    • Model Hyperparameter Tuning
    • Confusion Matrix
    • Different Classification Models
      • LR Model
      • KNN Model
      • Decision Tree Model
      • Bagging Model
      • Random Forest Model
      • ExtraTrees Model
      • AdaBoost Model
      • GBDT Model
      • Voting Model
      • lgb Model
      • xgb Model
    • Wrapping Your Own Model
      • Stacking, Bootstrap, and Bagging in Practice
      • Testing the Wrapped Model Class
    • Tmall Repeat-Purchase Prediction in Practice
      • Read the Feature Data
      • Set the Model Parameters
      • Train the Model
      • Predict
      • Save the Results

Competition Background

Merchants usually run large-scale promotions around festivals such as Double 11 and Double 12, offering discount coupons, cash vouchers, and the like. However, many of the users drawn in by low prices and discounts buy once and never return: they are simply bargain hunting, so promotions aimed at them add marketing cost without raising future sales. Merchants therefore urgently want to know which users are likely to become loyal repeat buyers of their store, so that these promising users can be targeted precisely, cutting promotion costs and improving return on investment.
The goal of this competition is, given historical behavior data of users and stores, to train a model that predicts whether a new user will buy from the same store again within 6 months. This is therefore a typical binary classification problem.
Common classification algorithms: Naive Bayes, decision trees, support vector machines, KNN, logistic regression, and so on;
Ensemble learning: random forests, GBDT (gradient boosted decision trees), AdaBoost, XGBoost, LightGBM, CatBoost, and so on;
Neural networks: MLP (multi-layer perceptron), DL (deep learning), and so on.
The dataset in this competition is not large, so deep learning is generally unnecessary. Given the characteristics of the problem, ensemble algorithms, especially XGBoost, LightGBM, and CatBoost, tend to perform well.

Full Code

A typical applied machine-learning project covers 1) data processing, 2) feature selection and optimization, and 3) model selection, validation, and optimization. As the saying goes, "data and features determine the upper bound of machine learning; models and algorithms merely approach that bound." So when solving a machine-learning problem, most of the time goes into data processing and feature engineering.
It is best to run the code below cell by cell in a Jupyter notebook to deepen your understanding.
For the basics of machine learning, feel free to check out my other articles.
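To make the three stages concrete, here is a minimal, hypothetical scikit-learn sketch; the median imputer, the k=50 feature selector, and the model choice are illustrative assumptions, not the pipeline used in this competition:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# 1) data processing -> 2) feature selection -> 3) model, chained in one Pipeline
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),   # fill missing values
    ('select', SelectKBest(f_classif, k=50)),       # keep the 50 most informative features (illustrative)
    ('model', RandomForestClassifier(n_estimators=100, n_jobs=-1)),
])
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)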

Read All the Data
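The import and read cells are not shown in this extract. A minimal sketch of what they would look like, assuming the same file names used in the Tmall section at the end of this article:

import pandas as pd
import numpy as np

# Assumed file names, taken from the later "Read the Feature Data" cell
train_data = pd.read_csv('train_all.csv')
test_data = pd.read_csv('test_all.csv')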

train_data.columns

Get the Training and Test Data

features_columns = [col for col in train_data.columns if col not in ['user_id', 'label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

Hold Out 40% of the Data for Offline Validation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Cross-Validation: Evaluating Estimator Performance

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

F1 Validation

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5, scoring='f1_macro')
print(scores)
print("F1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
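For intuition: 'f1_macro' is the unweighted mean of the per-class F1 scores, so the rare repeat-buyer class counts as much as the majority class. A toy check (the labels here are made up for illustration):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1]
y_hat  = [0, 0, 1, 1, 0]
print(f1_score(y_true, y_hat, average='macro'))  # mean of the two per-class F1 scores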

Splitting Data with ShuffleSplit

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, train, target, cv=cv)
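Since repeat buyers are typically a small minority of the labels in this task, a stratified variant keeps the positive ratio stable across the random splits. Swapping it in is one line (StratifiedShuffleSplit lives in the same module); a sketch:

from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
# cross_val_score(clf, train, target, cv=cv) works unchanged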

Model Hyperparameter Tuning

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.5, random_state=0)

# model
clf = RandomForestClassifier(n_jobs=-1)

# Set the parameters by cross-validation
tuned_parameters = {
    'n_estimators': [50, 100, 200]
    # ,'criterion': ['gini', 'entropy']
    # ,'max_depth': [2, 5]
    # ,'max_features': ['log2', 'sqrt', 'int']
    # ,'bootstrap': [True, False]
    # ,'warm_start': [True, False]
}

scores = ['precision']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(clf, tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
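Grid search cost multiplies with every parameter list you uncomment above. If that gets slow, RandomizedSearchCV (in the same sklearn module) samples a fixed budget of combinations instead; a minimal sketch, with assumed parameter ranges that are not from the original article:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search ranges, for illustration only
param_dist = {'n_estimators': randint(50, 300), 'max_depth': randint(2, 10)}
search = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), param_dist,
                            n_iter=10, cv=5, scoring='precision_macro', random_state=0)
# search.fit(X_train, y_train); print(search.best_params_)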

Confusion Matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# label names
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Train a classifier and predict on the held-out set
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()
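If you are on scikit-learn 1.0 or newer, the hand-rolled plotting function above can be replaced by the library's built-in helper. A minimal equivalent, assuming the same y_test, y_pred, class_names, and plt from the cell above:

from sklearn.metrics import ConfusionMatrixDisplay

# Raw counts and a row-normalized version of the same matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=class_names)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=class_names, normalize='true')
plt.show()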

from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# label names
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Train a classifier and report per-class precision/recall/F1
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred, target_names=class_names))

Different Classification Models

LR Model

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Logistic regression is scale-sensitive, so standardize the features first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)

KNN Model

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so standardize the features first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

Decision Tree Model

from sklearn import tree

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Bagging Model

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Bag KNN classifiers, each trained on 50% of the samples and 50% of the features
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Random Forest Model

from sklearn.ensemble import RandomForestClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=3, min_samples_split=12, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

ExtraTrees Model

from sklearn.ensemble import ExtraTreesClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.n_features_in_)             # number of input features (n_features_ in older scikit-learn)
print(clf.feature_importances_[:10])  # importance of the first 10 features

AdaBoost Model

from sklearn.ensemble import AdaBoostClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = AdaBoostClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

GBDT Model

from sklearn.ensemble import GradientBoostingClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# max_depth=1 means each boosting round fits a decision stump
clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, max_depth=1, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Voting Model

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
y = target

clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
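A design note: voting='hard' takes a majority vote over predicted classes. Since all three base models here expose predict_proba, a soft-voting variant that averages predicted probabilities is a one-line change and often scores slightly higher; a sketch, not tuned:

# Same estimators as above, but average predicted probabilities instead of counting votes
eclf_soft = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
scores = cross_val_score(eclf_soft, X, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [Soft Voting]" % (scores.mean(), scores.std()))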

lgb Model

import lightgbm

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

clf = lightgbm
train_matrix = clf.Dataset(X_train, label=y_train)
test_matrix = clf.Dataset(X_test, label=y_test)

params = {
    'boosting_type': 'gbdt',
    # 'boosting_type': 'dart',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'min_child_weight': 1.5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,  # XGBoost parameter carried over from the book's code; LightGBM does not use it
    'learning_rate': 0.03,
    'tree_method': 'exact',    # likewise an XGBoost parameter, not used by LightGBM
    'seed': 2017,
    'num_class': 2,
    'silent': True,
}

num_round = 10000
early_stopping_rounds = 100

# Note: lightgbm>=4.0 removed this keyword; there, pass
# callbacks=[lightgbm.early_stopping(early_stopping_rounds)] instead
model = clf.train(params,
                  train_matrix,
                  num_round,
                  valid_sets=test_matrix,
                  early_stopping_rounds=early_stopping_rounds)

# Column 1 holds the predicted probability of the positive class
pre = model.predict(X_valid, num_iteration=model.best_iteration)
print('score : ', np.mean((pre[:, 1] > 0.5) == y_valid))

xgb Model

import xgboost

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

clf = xgboost
train_matrix = clf.DMatrix(X_train, label=y_train, missing=-1)
test_matrix = clf.DMatrix(X_test, label=y_test, missing=-1)
z = clf.DMatrix(X_valid, label=y_valid, missing=-1)

params = {
    'booster': 'gbtree',
    'objective': 'multi:softprob',
    'eval_metric': 'mlogloss',
    'gamma': 1,
    'min_child_weight': 1.5,
    'max_depth': 5,
    'lambda': 100,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'eta': 0.03,
    'tree_method': 'exact',
    'seed': 2017,
    'num_class': 2,
}

num_round = 10000
early_stopping_rounds = 100

watchlist = [(train_matrix, 'train'),
             (test_matrix, 'eval')]

model = clf.train(params,
                  train_matrix,
                  num_boost_round=num_round,
                  evals=watchlist,
                  early_stopping_rounds=early_stopping_rounds)

# ntree_limit is deprecated in newer XGBoost; use iteration_range=(0, model.best_iteration + 1) there
pre = model.predict(z, ntree_limit=model.best_ntree_limit)
# Note the 0.3 threshold here, versus 0.5 in the LightGBM cell above
print('score : ', np.mean((pre[:, 1] > 0.3) == y_valid))

Wrapping Your Own Model

Stacking, Bootstrap, and Bagging in Practice
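In plain terms, the SBBTree class below chains two ideas: k-fold stacking first trains stacking_num LightGBM models and appends their out-of-fold predictions to X as one extra feature; bagging then trains bagging_num more LightGBM models on different random train/test splits of that augmented matrix and averages their predictions at inference time.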

"""    导入相关包"""import pandas as pdimport numpy as npimport lightgbm as lgbfrom sklearn.metrics import f1_scorefrom sklearn.model_selection import train_test_splitfrom sklearn.model_selection import KFoldfrom sklearn.model_selection import StratifiedKFoldclass SBBTree():    """        SBBTree        Stacking,Bootstap,Bagging    """    def __init__(                    self,                     params,                    stacking_num,                    bagging_num,                    bagging_test_size,                    num_boost_round,                    early_stopping_rounds                ):        """            Initializes the SBBTree.            Args:              params : lgb params.              stacking_num : k_flod stacking.              bagging_num : bootstrap num.              bagging_test_size : bootstrap sample rate.              num_boost_round : boost num.              early_stopping_rounds : early_stopping_rounds.        """        self.params = params        self.stacking_num = stacking_num        self.bagging_num = bagging_num        self.bagging_test_size = bagging_test_size        self.num_boost_round = num_boost_round        self.early_stopping_rounds = early_stopping_rounds        self.model = lgb        self.stacking_model = []        self.bagging_model = []    def fit(self, X, y):        """ fit model. """        if self.stacking_num > 1:            layer_train = np.zeros((X.shape[0], 2))            self.SK = StratifiedKFold(n_splits=self.stacking_num, shuffle=True, random_state=1)            for k,(train_index, test_index) in enumerate(self.SK.split(X, y)):                X_train = X[train_index]                y_train = y[train_index]                X_test = X[test_index]                y_test = y[test_index]                lgb_train = lgb.Dataset(X_train, y_train)                lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)                gbm = lgb.train(self.params,                            lgb_train,                            num_boost_round=self.num_boost_round,                            valid_sets=lgb_eval,                            early_stopping_rounds=self.early_stopping_rounds)                self.stacking_model.append(gbm)                pred_y = gbm.predict(X_test, num_iteration=gbm.best_iteration)                layer_train[test_index, 1] = pred_y            X = np.hstack((X, layer_train[:,1].reshape((-1,1))))         else:            pass        for bn in range(self.bagging_num):            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.bagging_test_size, random_state=bn)            lgb_train = lgb.Dataset(X_train, y_train)            lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)            gbm = lgb.train(self.params,                        lgb_train,                        num_boost_round=10000,                        valid_sets=lgb_eval,                        early_stopping_rounds=200)            self.bagging_model.append(gbm)    def predict(self, X_pred):        """ predict test data. 
"""        if self.stacking_num > 1:            test_pred = np.zeros((X_pred.shape[0], self.stacking_num))            for sn,gbm in enumerate(self.stacking_model):                pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)                test_pred[:, sn] = pred            X_pred = np.hstack((X_pred, test_pred.mean(axis=1).reshape((-1,1))))          else:            pass         for bn,gbm in enumerate(self.bagging_model):            pred = gbm.predict(X_pred, num_iteration=gbm.best_iteration)            if bn == 0:                pred_out=pred            else:                pred_out+=pred        return pred_out/self.bagging_num

Testing the Wrapped Model Class

"""    TEST CODE"""from sklearn.datasets import make_classificationfrom sklearn.datasets import load_breast_cancerfrom sklearn.datasets import make_gaussian_quantilesfrom sklearn import metricsfrom sklearn.metrics import f1_score# X, y = make_classification(n_samples=1000, n_features=25, n_clusters_per_class=1, n_informative=15, random_state=1)X, y = make_gaussian_quantiles(mean=None, cov=1.0, n_samples=1000, n_features=50, n_classes=2, shuffle=True, random_state=2)# data = load_breast_cancer()# X, y = data.data, data.targetX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)params = {        'task': 'train',        'boosting_type': 'gbdt',        'objective': 'binary',        'metric': 'auc',        'num_leaves': 9,        'learning_rate': 0.03,        'feature_fraction_seed': 2,        'feature_fraction': 0.9,        'bagging_fraction': 0.8,        'bagging_freq': 5,        'min_data': 20,        'min_hessian': 1,        'verbose': -1,        'silent': 0        }# test 1model = SBBTree(params=params, stacking_num=2, bagging_num=1,  bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)model.fit(X,y)X_pred = X[0].reshape((1,-1))pred=model.predict(X_pred)print('pred')print(pred)print('TEST 1 ok')# test 1model = SBBTree(params, stacking_num=1, bagging_num=1, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)model.fit(X_train,y_train)pred1=model.predict(X_test)# test 2 model = SBBTree(params, stacking_num=1, bagging_num=3, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)model.fit(X_train,y_train)pred2=model.predict(X_test)# test 3 model = SBBTree(params, stacking_num=5, bagging_num=1, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)model.fit(X_train,y_train)pred3=model.predict(X_test)# test 4 model = SBBTree(params, stacking_num=5, bagging_num=3, bagging_test_size=0.33, num_boost_round=10000, early_stopping_rounds=200)model.fit(X_train,y_train)pred4=model.predict(X_test)fpr, tpr, thresholds = metrics.roc_curve(y_test+1, pred1, pos_label=2)print('auc: ',metrics.auc(fpr, tpr))fpr, tpr, thresholds = metrics.roc_curve(y_test+1, pred2, pos_label=2)print('auc: ',metrics.auc(fpr, tpr))fpr, tpr, thresholds = metrics.roc_curve(y_test+1, pred3, pos_label=2)print('auc: ',metrics.auc(fpr, tpr))fpr, tpr, thresholds = metrics.roc_curve(y_test+1, pred4, pos_label=2)print('auc: ',metrics.auc(fpr, tpr))# auc:  0.7281621243885396# auc:  0.7710471146419509# auc:  0.7894369046305492# auc:  0.8084519474787597

Tmall Repeat-Purchase Prediction in Practice

Read the Feature Data

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)

features_columns = [col for col in train_data.columns if col not in ['user_id', 'label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

Set the Model Parameters

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 9,
    'learning_rate': 0.03,
    'feature_fraction_seed': 2,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_data': 20,
    'min_hessian': 1,
    'verbose': -1,
    'silent': 0,
}

model = SBBTree(params=params,
                stacking_num=5,
                bagging_num=3,
                bagging_test_size=0.33,
                num_boost_round=10000,
                early_stopping_rounds=200)

Train the Model

model.fit(train, target)

Predict

pred = model.predict(test)

df_out = pd.DataFrame()
df_out['user_id'] = test_data['user_id'].astype(int)
df_out['predict_prob'] = pred
df_out.head()

Save the Results

"""    保留数据头,不保存index"""df_out.to_csv('df_out.csv',header=True,index=False)print('save OK!')

All of the content and code above comes from the book 《阿里云天池大赛赛题解析(机器学习篇)》, an excellent book that I highly recommend reading in the original!