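Undergraduate Machine Learning Course, Lab 3: Decision Trees for Classification

Lab 3.1: Decision Trees for Classification

Use sklearn.tree.DecisionTreeClassifier to classify tumors (the breast-cancer dataset).
Compute the ten-fold cross-validated accuracy, precision, recall, and F1 score at a maximum depth of 10.
Plot how the ten-fold cross-validated accuracy changes as the maximum depth goes from 1 to 10.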
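The code in this lab passes average="macro" to the scikit-learn metric functions. As a minimal sketch of what that means, the example below computes precision, recall, and F1 per class from raw counts and then takes their unweighted mean. The label arrays and the helper per_class_scores are made up for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

def per_class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 when `cls` is treated as the positive class."""
    tp = np.sum((y_pred == cls) & (y_true == cls))  # true positives
    fp = np.sum((y_pred == cls) & (y_true != cls))  # false positives
    fn = np.sum((y_pred != cls) & (y_true == cls))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Macro averaging: score each class separately, then take the plain mean
scores = np.array([per_class_scores(y_true, y_pred, c) for c in (0, 1)])
macro_precision, macro_recall, macro_f1 = scores.mean(axis=0)
accuracy = np.mean(y_true == y_pred)

print(accuracy, macro_precision, macro_recall, macro_f1)
```

Macro averaging gives every class the same weight regardless of how many samples it has, which matters when the classes are imbalanced.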

1. Load the data

import numpy as np
import pandas as pd

data = pd.read_csv('breast-cancer.csv')
print(data.shape)
data.head()

data = data.values
data_x = data[:, 2:-1]  # columns 2 up to (but not including) the last one are the features
data_y = data[:, 1:2]   # column 1 holds the label
data_y = np.reshape(data_y, (-1))
print(data_x.shape)
print(data_y.shape)

2. Import the model

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

3. Training and prediction

Compute the ten-fold cross-validated accuracy, precision, recall, and F1 score of a decision tree with a maximum depth of 10, using the data data_x and the labels data_y.

model = DecisionTreeClassifier(max_depth=10)  # max_depth sets the maximum depth of the tree
prediction = cross_val_predict(model, data_x, data_y, cv=10)

acc1 = accuracy_score(data_y, prediction)
precision1 = precision_score(data_y, prediction, average="macro")
recall1 = recall_score(data_y, prediction, average="macro")
f1 = f1_score(data_y, prediction, average="macro")

print("Four metrics of the decision tree under ten-fold cross-validation")
print("Accuracy:", acc1)
print("Precision:", precision1)
print("Recall:", recall1)
print("F1 score:", f1)

4. Vary the maximum depth and plot how the accuracy changes

Plot the ten-fold cross-validated accuracy of the decision tree as the maximum depth goes from 1 to 10.

import matplotlib.pyplot as plt
%matplotlib inline

y = []
for i in range(10):
    model = DecisionTreeClassifier(max_depth=i + 1)
    prediction = cross_val_predict(model, data_x, data_y, cv=10)
    y.append(prediction)

x = np.linspace(1, 10, 10)
test = [accuracy_score(data_y, val) for val in y]
plt.figure()
plt.plot(x, test, '-')
plt.title("DecisionTree's accuracy_score changes with the max_depth")
plt.xlabel("max_depth")
plt.ylabel("accuracy_score")

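5. Tune the parameters to obtain the model that generalizes best

Consult the decision tree documentation and tune the parameters to find the best model:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Then report the parameter settings and the resulting generalization metrics below.

criterion: the function used to measure the quality of a split. Either "gini" (Gini impurity) or "entropy" (information gain).
max_depth: the maximum depth of the tree. It controls the complexity of the tree and the risk of overfitting.
min_samples_split: the minimum number of samples required to split an internal node.
min_samples_leaf: the minimum number of samples required at a leaf node.
max_features: the number of features to consider when looking for the best split.
random_state: the seed that controls the randomness of the estimator.

GridSearchCV is used here. It automates parameter tuning: given a grid of candidate values, it cross-validates every combination and reports the best parameters and the best score.
Because it tries every combination, it suits small datasets and small grids; once the amount of data grows, the search can take too long to finish.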
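To see why the cost grows so quickly, the sketch below counts the candidate combinations in the grid searched in this lab using scikit-learn's ParameterGrid, and multiplies by the number of folds. The variable names n_candidates, n_folds, and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The same grid that is passed to GridSearchCV in this lab
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 11, 12],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 5 * 5 * 4 combinations
n_folds = 10
n_fits = n_candidates * n_folds                # one fit per combination per fold

print(n_candidates, n_fits)  # 600 6000
```

Every value added to one list multiplies the total, so the number of model fits grows multiplicatively with the size of the grid.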

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

modelfit = DecisionTreeClassifier(max_depth=10)
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [10, 11, 12],
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'max_features': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 3, 4, 5]}
grid = GridSearchCV(modelfit, param_grid, cv=10)
grid.fit(data_x, data_y)
best = grid.best_params_  # best parameter combination found by the search
print(best)

# Rebuild a classifier from the best parameters
best_decision_tree_classifier = DecisionTreeClassifier(criterion=best['criterion'],
                                                       max_depth=best['max_depth'],
                                                       max_features=best['max_features'],
                                                       min_samples_leaf=best['min_samples_leaf'],
                                                       min_samples_split=best['min_samples_split'])

# Evaluate the tuned classifier (not the earlier `model`) with ten-fold cross-validation
prediction11 = cross_val_predict(best_decision_tree_classifier, data_x, data_y, cv=10)
acc11 = accuracy_score(data_y, prediction11)
precision11 = precision_score(data_y, prediction11, average="macro")
recall11 = recall_score(data_y, prediction11, average="macro")
f1_11 = f1_score(data_y, prediction11, average="macro")

print("-------------------")
print("Accuracy:", acc11)
print("Precision:", precision11)
print("Recall:", recall11)
print("F1 score:", f1_11)

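Parameter settings and performance metrics of the tuned decision tree

Parameter settings:

Split criterion: Gini impurity; maximum depth: 10; maximum number of features: 5; minimum samples per leaf: 5; minimum samples required to split an internal node: 3.

Performance metrics:

Accuracy: 0.9104
Precision: 0.9033
Recall: 0.9056
F1 score: 0.9044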

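Lab 3.2: Decision Trees for Regression

Use sklearn.tree.DecisionTreeRegressor to solve the Kaggle house price prediction problem.
Compute the ten-fold cross-validated MAE and RMSE on the training set for a decision tree with a maximum depth of 10.
Plot the MAE on the training set and the test set as the maximum depth goes from 1 to 30.
Choose a reasonable maximum depth and justify the choice.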
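For reference, MAE is the mean of the absolute errors and RMSE is the square root of the mean squared error. The sketch below computes both by hand on made-up numbers (not from the house price data) and checks them against the scikit-learn functions used later in this lab.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, for illustration only
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 330.0, 250.0])

errors = y_true - y_pred               # [-10, 10, -30, 0]
mae = np.mean(np.abs(errors))          # (10 + 10 + 30 + 0) / 4 = 12.5
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 900 + 0) / 4) = sqrt(275)

# The library functions give the same values
sk_mae = mean_absolute_error(y_true, y_pred)
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(mae, rmse)
```

Because RMSE squares the errors before averaging, the single error of 30 pulls it well above the MAE; a large gap between the two usually points to a few large errors.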

1. Load the data

import pandas as pd

data = pd.read_csv('train.csv')
# Drop features (columns) that contain missing values
data.dropna(axis=1, inplace=True)
# Keep only the integer-valued features
data = data[[col for col in data.dtypes.index if data.dtypes[col] == 'int64']]

2. Split the dataset

70% for training, 30% for testing

from sklearn.utils import shuffle

data_shuffled = shuffle(data, random_state=32)
split_line = int(len(data_shuffled) * 0.7)
training_data = data_shuffled[:split_line]
testing_data = data_shuffled[split_line:]

3. Import the model

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
import numpy as np

4. Select the features and the label

features = data.columns.tolist()
target = 'SalePrice'
features.remove(target)

5. Training and prediction

Compute the ten-fold cross-validated MAE and RMSE of a decision tree with a maximum depth of 10, trained on all features of the training set.

# Ten-fold cross-validation on the training set only
model12 = DecisionTreeRegressor(max_depth=10)
predictions12 = cross_val_predict(model12, training_data[features], training_data[target], cv=10)

mae12 = mean_absolute_error(training_data[target], predictions12)
mse12 = mean_squared_error(training_data[target], predictions12)
rmse12 = np.sqrt(mse12)

print("Mean Absolute Error:", mae12)
print("Mean Squared Error:", mse12)
print("Root Mean Squared Error:", rmse12)

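6. Vary the maximum depth and plot how the error changes

Plot the MAE on the training set and the test set as the maximum depth goes from 1 to 30.

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")

y12_train = []  # training-set MAE for each depth
y12 = []        # test-set MAE for each depth
for i in range(30):
    model12 = DecisionTreeRegressor(max_depth=i + 1)
    model12.fit(training_data[features], training_data[target])

    # Error on the training set
    train_mae12 = mean_absolute_error(training_data[target], model12.predict(training_data[features]))
    y12_train.append(train_mae12)

    # Error on the test set
    predictions12 = model12.predict(testing_data[features])
    mae12 = mean_absolute_error(testing_data[target], predictions12)
    mse12 = mean_squared_error(testing_data[target], predictions12)
    rmse12 = np.sqrt(mse12)
    y12.append(mae12)

    print('----------------------')
    print("max_depth: ", i + 1)
    print("Mean Absolute Error:", mae12)
    print("Mean Squared Error:", mse12)
    print("Root Mean Squared Error:", rmse12)

x12 = np.linspace(1, 30, 30)
plt.figure()
plt.plot(x12, y12_train, '-', label="train")
plt.plot(x12, y12, '-', label="test")
plt.legend()
plt.title("DecisionTree's MAE changes with the max_depth")
plt.xlabel("max_depth")
plt.ylabel("MAE")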

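7. Choose a reasonable maximum depth and justify the choice

Based on the curve, I think a maximum depth of 6 is appropriate. Around depth 6 the MAE is already close to its global minimum. Increasing the depth further brings no clear improvement, and beyond a certain point the MAE starts to fluctuate. Weighing efficiency against accuracy, I choose a maximum depth of 6.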
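The reasoning above can also be written down as a simple rule: pick the smallest depth whose MAE is within a small tolerance of the best MAE. This rule is an assumption made for illustration, not the method used to produce the answer above; the helper pick_depth, the 2% tolerance, and the MAE values are all made up.

```python
import numpy as np

def pick_depth(depths, maes, tolerance=0.02):
    """Return the smallest depth whose MAE is within `tolerance` (relative) of the minimum MAE."""
    maes = np.asarray(maes, dtype=float)
    threshold = maes.min() * (1 + tolerance)
    good = np.flatnonzero(maes <= threshold)  # indices of all acceptable depths
    return depths[good[0]]                    # the shallowest acceptable one

# Made-up MAE values for depths 1..10, for illustration only
depths = list(range(1, 11))
maes = [40000, 33000, 29000, 27000, 26000, 25200, 25100, 25000, 25300, 25600]

print(pick_depth(depths, maes))  # 6
```

With a tolerance of 0 the rule reduces to taking the depth with the lowest MAE (depth 8 for these numbers); a positive tolerance trades a little accuracy for a shallower, cheaper tree.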

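Lab 3.3: Implementing a Decision Tree

Using the LendingClub Safe Loans dataset:

Implement three splitting criteria: information gain, information gain ratio, and the Gini index.
Train three decision trees on the given training set, one per criterion.
Compute the accuracy, precision, recall, and F1 score of the three trees at a maximum depth of 10, on both the training set and the test set.
Draw the decision tree (optional).

In this part we implement a very simple binary decision tree.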

1. Load the data

# Import libraries
import pandas as pd
import numpy as np
import json

# Load the data
loans = pd.read_csv('lending-club-data.csv', low_memory=False)

The quantity we want to predict is described by safe_loans and bad_loans, the positive and negative class respectively. We derive safe_loans from bad_loans, setting it to +1 for positive examples (bad_loans == 0) and -1 for negative examples, and then delete the bad_loans column.

# Preprocess the data, using safe_loans as the label
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)
del loans['bad_loans']

We use only the four columns grade, term, home_ownership, and emp_length as features and safe_loans as the label, and keep just these five columns of loans.

features = ['grade',            # grade of the loan
            'term',             # the term of the loan
            'home_ownership',   # home_ownership status: own, mortgage or rent
            'emp_length',       # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

2. Split into training and test sets

from sklearn.utils import shuffle

loans = shuffle(loans, random_state=34)
split_line = int(len(loans) * 0.6)
train_data = loans.iloc[:split_line]
test_data = loans.iloc[split_line:]

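3. Feature preprocessing

All of the features are categorical, so the data needs preprocessing; we use one-hot encoding.
The idea of one-hot encoding is to turn a categorical feature into a vector. Suppose feature $A$ takes three values $\{a, b, c\}$ that are on an equal footing. If we represented them with the numbers 1, 2, 3, the computation would be biased: algorithms that rely on a distance measure would treat 2 as close to 1 and 3 as far from 1, even though the three values should be equivalent. The solution is to represent them with three-dimensional vectors: $[1, 0, 0]$ for a, $[0, 1, 0]$ for b, and $[0, 0, 1]$ for c. The distances between the vectors are then all equal; any two of them are $\sqrt{2}$ apart in Euclidean space. That is the idea behind one-hot encoding.
In pandas, get_dummies generates the one-hot vectors.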
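As a quick check of the distance claim, the sketch below encodes a toy column with pd.get_dummies and measures the pairwise Euclidean distances. The DataFrame and variable names are made up for illustration; .astype(int) is used because newer pandas versions return boolean dummy columns.

```python
import numpy as np
import pandas as pd

# Toy categorical feature A with values a, b, c (illustrative data)
df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
encoded = pd.get_dummies(df['A'], prefix='A').astype(int)
print(encoded.columns.tolist())  # ['A_a', 'A_b', 'A_c']

# One-hot vectors of the first three rows, i.e. of a, b and c
vec_a = encoded.iloc[0].values
vec_b = encoded.iloc[1].values
vec_c = encoded.iloc[2].values

# Every pair of distinct values is the same distance apart
d_ab = np.linalg.norm(vec_a - vec_b)
d_ac = np.linalg.norm(vec_a - vec_c)
d_bc = np.linalg.norm(vec_b - vec_c)
print(d_ab, d_ac, d_bc)
```

All three distances equal the square root of 2, whereas encoding the values as 1, 2, 3 would put a and c twice as far apart as a and b.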

def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame
    features_categorical: list(str)
    '''
    # Loop over all categorical features
    for cat in features_categorical:
        # One-hot encode this column, using the feature name as the prefix
        one_encoding = pd.get_dummies(data[cat], prefix=cat)
        # Concatenate the one-hot columns onto the dataframe
        data = pd.concat([data, one_encoding], axis=1)
        # Drop the original categorical column
        del data[cat]
    return data

First one-hot encode the training set, then the test set. One caveat: if feature $A$ takes the values $\{a, b, c\}$ in the training set, we generate three columns, $A\_a$, $A\_b$, $A\_c$, and a model trained on this set considers exactly these three features. If a test sample has $A = d$, its $A\_a$, $A\_b$, $A\_c$ are all 0, and we ignore $A\_d$, because that feature did not exist when the model was trained.

train_data = one_hot_encoding(train_data, features)
train_data.head()

one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
one_hot_features

Next we one-hot encode the test set, keeping only the features that appear in one_hot_features.

test_data_tmp = one_hot_encoding(test_data, features)
# Create an empty DataFrame
test_data = pd.DataFrame(columns=train_data.columns)
for feature in train_data.columns:
    # If this training-set column also appears in test_data_tmp, copy it into test_data
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        # Otherwise fill it with a column of zeros
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype='uint8')
test_data.head()

After this processing all features are 0 or 1 and the labels are 1 or -1. That completes the data preprocessing.
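The column alignment done by the loop above can also be expressed with DataFrame.reindex, which keeps only the requested columns and fills the missing ones with a constant. This is an alternative sketch on toy data, not the code used in this lab; the frames train, test, and aligned are illustrative names.

```python
import pandas as pd

# Toy data: the test set contains a value 'd' that never occurs in training
train = pd.DataFrame({'A': ['a', 'b', 'c']})
test = pd.DataFrame({'A': ['a', 'd']})

train_encoded = pd.get_dummies(train, columns=['A']).astype(int)
test_encoded = pd.get_dummies(test, columns=['A']).astype(int)

# Keep exactly the training columns: A_d is dropped, A_b and A_c are added as zeros
aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(aligned)
```

The row for the unseen value 'd' ends up all zeros, which matches the convention described above.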

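4. Implement the three splitting criteria

Decision trees have several commonly used splitting criteria, such as information gain, information gain ratio, and the Gini index.

We need functions that, given the labels of all samples in a node of the tree, compute the value of the corresponding criterion.

We implement the three criteria below.

By convention, samples whose feature value is 0 go to the left subtree and samples whose feature value is 1 go to the right subtree.

4.1 Information gain

Information entropy:
$\mathrm{Ent}(D) = - \sum^{\vert \mathcal{Y} \vert}_{k = 1} p_k \log_2 p_k$

Information gain:
$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum^{V}_{v=1} \frac{\vert D^v \vert}{\vert D \vert} \mathrm{Ent}(D^v)$

When computing the entropy we adopt the convention that $p \log_2 p = 0$ if $p = 0$.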

def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node

    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]

    Returns
    ----------
    float: information entropy
    '''
    # Total number of samples
    num_of_samples = labels_in_node.shape[0]

    if num_of_samples == 0:
        return 0

    # Number of samples labeled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    # Number of samples labeled -1
    num_of_negative = len(labels_in_node[labels_in_node == -1])

    # Probability of a positive example
    prob_positive = num_of_positive / num_of_samples
    # Probability of a negative example
    prob_negative = num_of_negative / num_of_samples

    # Apply the convention p * log2(p) = 0 when p = 0
    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)

    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)

    return - (positive_part + negative_part)

Here are six test cases.

# Entropy test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels))  # 0.97095

# Entropy test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels))  # 0.86312

# Entropy test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels))  # 0.86312

# Entropy test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels))  # 0.99750

# Entropy test case 5
example_labels = np.array([1] * 8)
print(information_entropy(example_labels))  # 0

# Entropy test case 6
example_labels = np.array([])
print(information_entropy(example_labels))  # 0

Next, complete the function that computes the information gain of every feature.
Three parts need to be filled in.

def compute_information_gains(data, features, target, annotate=False):
    '''
    Compute the information gain of every feature

    Parameter
    ----------
    data: pd.DataFrame, the samples, with features and label
    features: list(str), the feature names
    target: str, name of the label
    annotate: boolean, whether to print the information gain of every feature, default False

    Returns
    ----------
    information_gains: dict, key: str, feature name
                             value: float, information gain
    '''
    # Store the information gain of each feature in a dict:
    # the key is the feature name, the value is its information gain
    information_gains = dict()

    # Loop over all features and compute the information gain of each
    for feature in features:
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        # Right subtree: all samples whose value of this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # Entropy of the left subtree
        left_entropy = information_entropy(left_split_target)
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the right subtree
        right_entropy = information_entropy(right_split_target)
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the current node
        current_entropy = information_entropy(data[target])

        # Information gain of splitting on this feature
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        # Store the feature name and its gain as a key-value pair
        information_gains[feature] = gain

        if annotate:
            print(" ", feature, gain)

    return information_gains

# Information gain test case 1
print(compute_information_gains(train_data, one_hot_features, target)['grade_A'])  # 0.01759

# Information gain test case 2
print(compute_information_gains(train_data, one_hot_features, target)['term_ 60 months'])  # 0.01429

# Information gain test case 3
print(compute_information_gains(train_data, one_hot_features, target)['grade_B'])  # 0.00370
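To make the information gain formula easy to check by hand, here is a self-contained sketch on eight made-up samples with a single binary feature. The helper entropy and the arrays feature and labels are illustrative and independent of the lab's functions and data.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of an array of +1/-1 labels, treating 0 * log2(0) as 0."""
    if len(labels) == 0:
        return 0.0
    probs = np.array([np.mean(labels == 1), np.mean(labels == -1)])
    probs = probs[probs > 0]  # drop zero probabilities (convention: 0 * log2(0) = 0)
    return float(-np.sum(probs * np.log2(probs)))

# Made-up data: one binary feature and the labels of eight samples
feature = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels = np.array([1, 1, 1, -1, -1, -1, -1, 1])

left = labels[feature == 0]    # feature value 0 goes left: three +1, one -1
right = labels[feature == 1]   # feature value 1 goes right: one +1, three -1
w_left = len(left) / len(labels)
w_right = len(right) / len(labels)

# Gain = entropy before the split minus the weighted entropy after it
gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
print(entropy(labels), entropy(left), entropy(right), gain)
```

The parent node is perfectly mixed (entropy 1 bit), each child has entropy of about 0.811 bits, so the split gains about 0.189 bits.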

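4.2 Information gain ratio

Information gain ratio:

$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$

where

$\mathrm{IV}(a) = - \sum^V_{v=1} \frac{\vert D^v \vert}{\vert D \vert} \log_2 \frac{\vert D^v \vert}{\vert D \vert}$

Complete the function that computes the information gain ratio of every feature.
Five parts need to be filled in.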

def compute_information_gain_ratios(data, features, target, annotate=False):
    '''
    Compute and store the information gain ratio of every feature

    Parameter
    ----------
    data: pd.DataFrame, data with features and label
    features: list(str), the feature names
    target: str, name of the label
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''
    gain_ratios = dict()

    # Loop over all features and apply the criterion to each
    for feature in features:
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        # Right subtree: all samples whose value of this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # Entropy of the left subtree
        left_entropy = information_entropy(left_split_target)
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the right subtree
        right_entropy = information_entropy(right_split_target)
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the current node
        current_entropy = information_entropy(data[target])

        # Information gain of the current node
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        # Term of the IV sum for feature value 0 (0 * log2(0) is taken as 0)
        if left_weight == 0:
            left_IV = 0
        else:
            left_IV = left_weight * np.log2(left_weight)

        # Term of the IV sum for feature value 1
        if right_weight == 0:
            right_IV = 0
        else:
            right_IV = right_weight * np.log2(right_weight)

        # IV is the negative of the sum of the two terms, so it is non-negative
        IV = - (left_IV + right_IV)

        # Information gain ratio of splitting on this feature.
        # A tiny constant is added to the denominator so that IV = 0 does not produce np.inf
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)

        # Store the information gain ratio
        gain_ratios[feature] = gain_ratio

        if annotate:
            print(" ", feature, gain_ratio)

    return gain_ratios

# Information gain ratio test case 1
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A'])  # 0.02573

# Information gain ratio test case 2
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B'])  # 0.00417

# Information gain ratio test case 3
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months'])  # 0.01970
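The gain ratio divides the gain by IV(a), which is simply the entropy of the split sizes. The following self-contained sketch works this out on eight made-up samples with an uneven 6/2 split; the helpers entropy_from_probs and label_entropy and the toy arrays are illustrative, not part of the lab code.

```python
import numpy as np

def entropy_from_probs(probs):
    """Entropy in bits of a probability vector, treating 0 * log2(0) as 0."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

def label_entropy(labels):
    """Entropy of an array of +1/-1 labels (0 for an empty array)."""
    if len(labels) == 0:
        return 0.0
    p_pos = np.mean(labels == 1)
    return entropy_from_probs([p_pos, 1 - p_pos])

# Made-up data: six samples have feature value 0, two have feature value 1
feature = np.array([0, 0, 0, 0, 0, 0, 1, 1])
labels = np.array([1, 1, 1, 1, -1, -1, -1, -1])

left, right = labels[feature == 0], labels[feature == 1]
w_left, w_right = len(left) / len(labels), len(right) / len(labels)

gain = label_entropy(labels) - (w_left * label_entropy(left) + w_right * label_entropy(right))
iv = entropy_from_probs([w_left, w_right])  # IV(a): entropy of the split proportions
gain_ratio = gain / iv

print(gain, iv, gain_ratio)
```

IV is largest for an even split and shrinks as the split becomes lopsided, so dividing by it penalizes features that split the data into many or very even parts less than plain information gain would reward them.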

4.3 Gini index

Gini value of dataset $D$:

$\begin{aligned} \mathrm{Gini}(D) & = \sum^{\vert \mathcal{Y} \vert}_{k=1} \sum_{k' \neq k} p_k p_{k'}\\ & = 1 - \sum^{\vert \mathcal{Y} \vert}_{k=1} p^2_k. \end{aligned}$

Gini index of attribute $a$:

$\mathrm{Gini\_index}(D, a) = \sum^V_{v = 1} \frac{\vert D^v \vert}{\vert D \vert} \mathrm{Gini}(D^v)$

Complete the computation of the Gini value of a dataset.
Three parts need to be filled in.

def gini(labels_in_node):
    '''
    Compute the Gini value of the samples in a node

    Parameters
    ----------
    labels_in_node: np.ndarray, the labels of the samples, e.g. [-1, -1, 1, 1, 1]

    Returns
    ---------
    gini: float, Gini value
    '''
    # Total number of samples
    num_of_samples = labels_in_node.shape[0]

    if num_of_samples == 0:
        return 0

    # Number of samples labeled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    # Number of samples labeled -1
    num_of_negative = len(labels_in_node[labels_in_node == -1])

    # Probability of a positive example
    prob_positive = num_of_positive / num_of_samples
    # Probability of a negative example
    prob_negative = num_of_negative / num_of_samples

    # Gini value
    gini = 1 - (prob_positive ** 2 + prob_negative ** 2)

    return gini

# Gini value test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(gini(example_labels))  # 0.48

# Gini value test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(gini(example_labels))  # 0.40816

# Gini value test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(gini(example_labels))  # 0.40816

# Gini value test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(gini(example_labels))  # 0.49827

# Gini value test case 5
example_labels = np.array([1] * 8)
print(gini(example_labels))  # 0

# Gini value test case 6
example_labels = np.array([])
print(gini(example_labels))  # 0

Then compute the Gini index of every feature.
Three parts need to be filled in.

def compute_gini_indices(data, features, target, annotate=False):
    '''
    Compute the Gini index of every feature when it is used for the split

    Parameter
    ----------
    data: pd.DataFrame, data with features and label
    features: list(str), the feature names
    target: str, name of the label
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gini_indices: dict, key: str, feature name
                        value: float, Gini index
    '''
    gini_indices = dict()

    # Loop over all features and apply the criterion to each
    for feature in features:
        # Left subtree: all samples whose value of this feature is 0
        left_split_target = data[data[feature] == 0][target]
        # Right subtree: all samples whose value of this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # Gini value of the left subtree
        left_gini = gini(left_split_target.values)
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Gini value of the right subtree
        right_gini = gini(right_split_target.values)
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # Gini index of splitting the current node on this feature
        gini_index = left_weight * left_gini + right_weight * right_gini

        # Store it
        gini_indices[feature] = gini_index

        if annotate:
            print(" ", feature, gini_index)

    return gini_indices

# Gini index test case 1
print(compute_gini_indices(train_data, one_hot_features, target)['grade_A'])  # 0.30095

# Gini index test case 2
print(compute_gini_indices(train_data, one_hot_features, target)['grade_B'])  # 0.30568

# Gini index test case 3
print(compute_gini_indices(train_data, one_hot_features, target)['term_ 36 months'])  # 0.30055
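As a hand check of the first Gini value test case (two samples labeled -1 and three labeled 1, so $p_1 = \tfrac{3}{5}$ and $p_2 = \tfrac{2}{5}$), both forms of the definition give the same result:

$\begin{aligned} \mathrm{Gini}(D) & = 1 - \left(\tfrac{3}{5}\right)^2 - \left(\tfrac{2}{5}\right)^2 = 1 - \tfrac{13}{25} = 0.48 \\ & = \sum_{k} \sum_{k' \neq k} p_k p_{k'} = 2 \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} = \tfrac{12}{25} = 0.48 \end{aligned}$

A pure node has a Gini value of 0 and a perfectly mixed binary node has 0.5, which is why the best split is the one with the smallest Gini index, whereas information gain and gain ratio are maximized.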

5. Select the best feature

The three splitting criteria are now implemented. Next we complete the function that picks the best feature.
Three parts need to be filled in.

def best_splitting_feature(data, features, target, criterion='gini', annotate=False):
    '''
    Given a splitting criterion and the data, find the best feature to split on

    Parameters
    ----------
    data: pd.DataFrame, data with features and label
    features: list(str), the feature names
    target: str, name of the label
    criterion: str, which criterion to use, one of 'information_gain', 'gain_ratio', 'gini'
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    best_feature: str, name of the best splitting feature
    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')

        # Information gain of every candidate feature
        information_gains = compute_information_gains(data, features, target, annotate)

        # information_gains is a dict; the best feature is the key with the largest value
        best_feature = max(information_gains, key=information_gains.get)

        return best_feature

    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')

        # Information gain ratio of every candidate feature
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)

        # The best feature is the one with the largest gain ratio
        best_feature = max(gain_ratios, key=gain_ratios.get)

        return best_feature

    elif criterion == 'gini':
        if annotate:
            print('using gini')

        # Gini index of every candidate feature
        gini_indices = compute_gini_indices(data, features, target, annotate)

        # The best feature is the one with the smallest Gini index
        best_feature = min(gini_indices, key=gini_indices.get)

        return best_feature
    else:
        raise Exception("Invalid criterion!", criterion)

6. Check whether all samples in a node belong to the same class

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Return the number of samples of the minority class in a node,
    e.g. for the input [1, 1, -1, -1, 1] the result is 2

    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series

    Returns
    ----------
    int: the count
    '''
    # An empty array yields 0
    if len(labels_in_node) == 0:
        return 0

    # Number of samples labeled 1
    num_of_one = np.sum(labels_in_node == 1)

    # Number of samples labeled -1
    num_of_minus_one = np.sum(labels_in_node == -1)

    return num_of_one if num_of_minus_one > num_of_one else num_of_minus_one

# Test case 1
print(intermediate_node_num_mistakes(np.array([1, 1, -1, -1, -1])))  # 2

# Test case 2
print(intermediate_node_num_mistakes(np.array([])))  # 0

# Test case 3
print(intermediate_node_num_mistakes(np.array([1])))  # 0

7. Create a leaf node

def create_leaf(target_values):
    '''
    Determine the label of the current leaf node and store the leaf's information in a dict

    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in the current leaf node

    Returns:
    ----------
    leaf: dict, a leaf node,
        leaf['splitting_feature'], None, a leaf has no splitting feature
        leaf['left'], None, a leaf has no left subtree
        leaf['right'], None, a leaf has no right subtree
        leaf['is_leaf'], True, whether the node is a leaf
        leaf['prediction'], int, the prediction of this leaf
    '''
    # Create the leaf node
    leaf = {'splitting_feature': None,
            'left': None,
            'right': None,
            'is_leaf': True}

    # Count the +1 and -1 labels in the node
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])

    # Majority vote: the leaf predicts the class with more samples (ties go to -1)
    if num_ones > num_minus_ones:
        leaf['prediction'] = 1
    else:
        leaf['prediction'] = -1

    # Return the leaf node
    return leaf

8. Build the decision tree recursively

The recursion stops under any of three conditions:

If all samples in the node have the same label, the node needs no further splitting and becomes a leaf.
If all features have already been used, so that no feature remains to split the samples in the current node, the node becomes a leaf.
If the depth of the current node has reached the maximum depth we allow, the node becomes a leaf.

def decision_tree_create(data, features, target, criterion='gini', current_depth=0, max_depth=10, annotate=False):
    '''
    Parameter:
    ----------
    data: pd.DataFrame, the data
    features: iterable, the features, e.g. a list
    target: str, name of the label
    criterion: str, splitting criterion, one of 'information_gain', 'gain_ratio', 'gini'
    current_depth: int, current depth, tracked during the recursion
    max_depth: int, maximum depth of the tree; the recursion stops once it is reached

    Returns:
    ----------
    dict, dict['is_leaf']          : False, the current node is not a leaf
          dict['prediction']       : None, non-leaf nodes have no prediction
          dict['splitting_feature']: splitting_feature, the feature used to split this node
          dict['left']             : dict
          dict['right']            : dict
    '''
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)

    # Copy the feature list; each feature used for a split is removed from the copy
    remaining_features = features[:]

    # Labels of the samples in this node
    target_values = data[target]
    if annotate:
        print("-" * 50)
        print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # Stopping condition 1
    # All samples in the node belong to one class, i.e. the minority class has 0 samples.
    # intermediate_node_num_mistakes from above performs this check
    if intermediate_node_num_mistakes(target_values) == 0:
        if annotate:
            print("Stopping condition 1 reached.")
        return create_leaf(target_values)  # create a leaf node

    # Stopping condition 2
    # No features are left to split on, i.e. remaining_features is empty
    if not remaining_features:
        if annotate:
            print("Stopping condition 2 reached.")
        return create_leaf(target_values)  # create a leaf node

    # Stopping condition 3
    # The current depth has reached the maximum depth
    if current_depth == max_depth:
        if annotate:
            print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values)  # create a leaf node

    # Find the best splitting feature with best_splitting_feature
    splitting_feature = best_splitting_feature(data, remaining_features, target, criterion, annotate)

    # Split the data in two using the best feature
    # Data of the left subtree
    left_split = data[data[splitting_feature] == 0]
    # Data of the right subtree
    right_split = data[data[splitting_feature] == 1]

    # The split is done, so remove this feature from the remaining features
    remaining_features.remove(splitting_feature)

    # Print the feature used and the number of samples in each subtree
    if annotate:
        print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))

    # If the split sends every sample into a single subtree, turn that subtree into a leaf directly
    # Check whether the left subtree received all samples
    if len(left_split) == len(data):
        if annotate:
            print("Creating leaf node.")
        return create_leaf(left_split[target])

    # Check whether the right subtree received all samples
    if len(right_split) == len(data):
        if annotate:
            print("Creating leaf node.")
        return create_leaf(right_split[target])

    # Recursively build the left subtree
    left_tree = decision_tree_create(left_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)

    # Recursively build the right subtree
    right_tree = decision_tree_create(right_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)

    # Return the non-leaf node
    return {'is_leaf': False,
            'prediction': None,
            'splitting_feature': splitting_feature,
            'left': left_tree,
            'right': right_tree}

Train a model

my_decision_tree = decision_tree_create(train_data, one_hot_features, target, 'gini', max_depth=6, annotate=False)

9. Prediction

Next we complete the prediction function.

def classify(tree, x, annotate=False):
    '''
    Predict recursively, one sample at a time

    Parameters
    ----------
    tree: dict
    x: pd.Series, the sample to predict
    annotate: boolean, whether to print annotations

    Returns
    ----------
    the predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

We test it on the first sample of the test set.

test_sample = test_data.iloc[0]
print(test_sample)

print('True class: %s ' % (test_sample['safe_loans']))
print('Predicted class: %s ' % classify(my_decision_tree, test_sample))

Print the decisions the tree makes along the way.

classify(my_decision_tree, test_sample, annotate=True)

10. Evaluate the model on the test set

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

First we write a batch prediction function. It takes a pd.DataFrame such as the whole test set and returns an np.ndarray holding the model's predictions.
One part needs to be filled in.

def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, store the values in predictions, and return the np.ndarray

    Parameter
    ----------
    tree: dict, the model
    data: pd.DataFrame, the data

    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))  # same length as data

    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])

    return predictions

11. Compute the four metrics of the models built with the different splitting criteria and report them

The maximum depth of the tree is 6.

criteria = ['information_gain', 'gain_ratio', 'gini']
for c in criteria:
    print(f'Using criterion: {c}')
    my_decision_tree = decision_tree_create(train_data, one_hot_features, target, c, max_depth=6, annotate=False)
    predictions = predict(my_decision_tree, test_data)
    acc33 = accuracy_score(test_data[target], predictions)
    precision33 = precision_score(test_data[target], predictions, average='macro')
    recall33 = recall_score(test_data[target], predictions, average='macro')
    f133 = f1_score(test_data[target], predictions, average='macro')
    print(f'Accuracy: {acc33}\nPrecision: {precision33}\nRecall: {recall33}\nF1 Score: {f133}\n')

Lab 3.4: Applying Random Forests to Iris Classification

Load the iris dataset from sklearn and use the first two features for classification.

Evaluate with four metrics: accuracy, precision, recall, and F1.

Visualize the classification result.

import sklearn
import numpy as np

1. Load the data

from sklearn.datasets import load_iris

feat, label = load_iris(return_X_y=True)
data = load_iris()
feat_names = data['feature_names']
label_names = data['target_names']
print(feat_names)
print(label_names)

Select the first two features

feat = feat[:, :2]
feat.shape

2. Import the model

from sklearn.ensemble import RandomForestClassifier

3. Train the model

from sklearn.model_selection import train_test_split

# Split into training and test sets
feat_train, feat_test, label_train, label_test = train_test_split(feat, label, test_size=0.2, random_state=42)

# Create the random forest classifier
rf_classifier34 = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
rf_classifier34.fit(feat_train, label_train)

# Predict on the test set
label_pred = rf_classifier34.predict(feat_test)

4. Compute the evaluation metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

acc34 = accuracy_score(label_test, label_pred)
precision34 = precision_score(label_test, label_pred, average='macro')
recall34 = recall_score(label_test, label_pred, average='macro')
f134 = f1_score(label_test, label_pred, average='macro')
print(f'Accuracy: {acc34}\nPrecision: {precision34}\nRecall: {recall34}\nF1 Score: {f134}\n')

5. Visualize the classification result

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Build a grid for drawing the decision boundary
x_min, x_max = feat[:, 0].min() - 1, feat[:, 0].max() + 1
y_min, y_max = feat[:, 1].min() - 1, feat[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

# Predict the class of every grid point
Z = rf_classifier34.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

cmap_background = ListedColormap(['#FFAAAA', '#AAAAFF', '#AAFFAA'])
cmap_points = ListedColormap(['#FF0000', '#0000FF', '#00FF00'])

# Draw the decision regions
plt.contourf(xx, yy, Z, cmap=cmap_background, alpha=0.3)

# Draw the test points, colored by their true label
scatter = plt.scatter(feat_test[:, 0], feat_test[:, 1], c=label_test, cmap=cmap_points, edgecolors='k', marker='o', s=80)

# Figure properties
plt.title('Random Forest Classifier - Iris Dataset')
plt.xlabel(feat_names[0])
plt.ylabel(feat_names[1])

# Legend: one handle per class value present in label_test
legend_labels = [f'{label_names[i]} ({i})' for i in range(len(label_names))]
handles, _ = scatter.legend_elements()
plt.legend(handles, legend_labels)

# Show the figure
plt.show()

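Lab 3.5: Implementing AdaBoost and Classifying Tumors

Load the breast cancer dataset from sklearn.

Choose your own base learners (ready-made Scikit-learn classifiers are allowed) and implement AdaBoost yourself, building at least two AdaBoost models with different base learners.

Compare them using accuracy, precision, recall, and F1. Randomly select 70% of the data as the training set and 30% as the test set, and present the results as a table.

Compare against the results of Scikit-learn's AdaBoostClassifier (with the same base learners as your own AdaBoost).

import sklearn
import numpy as np

Load the data

from sklearn.datasets import load_breast_cancer

feat, label = load_breast_cancer(return_X_y=True)
feat.shape

Split the dataset: 70% training set, 30% test set (random seed fixed at 32)

from sklearn.model_selection import train_test_split

trainX, testX, trainY, testY = train_test_split(feat, label, test_size=0.3, random_state=32)
trainX.shape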

Implement AdaBoost

import copy

class MyAdaBoostClassifier:
    def __init__(self, base_estimator, n_estimators=50, learning_rate=1.0, n_classes=2):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.lr = learning_rate
        self.R = n_classes
        self.estimators = []
        self.alphas = []  # weight of each weak learner
        for m in range(n_estimators):
            self.estimators.append(copy.deepcopy(base_estimator))

    def fit(self, X, y):
        # Initialize every sample weight to 1/N
        sample_weight = np.ones(len(X)) / len(X)
        self.alphas = []  # reset so that calling fit again does not accumulate old weights
        for i in range(self.n_estimators):
            model = self.estimators[i]
            # Train the weak learner on the weighted samples
            model.fit(X, y, sample_weight=sample_weight)
            y_pred = model.predict(X)
            # Weighted error rate, clipped away from 0 and 1 so the logarithm below stays finite
            error = np.sum(sample_weight * (y_pred != y))
            error = np.clip(error, 1e-10, 1 - 1e-10)
            # Weight of this learner
            alpha = self.lr * (np.log((1 - error) / error) + np.log(self.R - 1))
            # Update the weights of the misclassified samples
            sample_weight *= np.exp(alpha * (y_pred != y))
            # Normalize the sample weights
            sample_weight /= np.sum(sample_weight)
            self.alphas.append(alpha)
        return self

    def predict(self, X):
        # Assumes the classes are encoded as 0, 1, ...
        y_pred = []
        for i in range(self.n_estimators):
            y_pred.append(self.estimators[i].predict_proba(X))
        # Combine the predicted class probabilities, weighted by each learner's alpha
        y_pred = np.average(np.asarray(y_pred), weights=np.array(self.alphas), axis=0)
        y_pred = y_pred / np.array(self.alphas).sum()
        y_pred = np.argmax(y_pred, axis=1)
        return y_pred
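To make one round of the fit loop concrete, the sketch below replays the same error, alpha, and weight-update formulas on five made-up samples, one of which is misclassified. The arrays and variable names are illustrative and stand in for the output of a fitted weak learner.

```python
import numpy as np

# Made-up labels and weak-learner predictions; the third sample is misclassified
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
n_classes = 2
learning_rate = 1.0

# Start with uniform weights 1/N
sample_weight = np.ones(len(y_true)) / len(y_true)
miss = (y_pred != y_true)

# Weighted error rate: only the misclassified sample contributes, 1/5 = 0.2
error = np.sum(sample_weight * miss)

# Learner weight: log((1 - 0.2) / 0.2) + log(2 - 1) = log(4)
alpha = learning_rate * (np.log((1 - error) / error) + np.log(n_classes - 1))

# Misclassified samples are scaled by exp(alpha) = 4, then everything is renormalized
sample_weight = sample_weight * np.exp(alpha * miss)
sample_weight /= sample_weight.sum()

print(error, alpha, sample_weight)
```

After a single round the misclassified sample carries half of the total weight (0.5 versus 0.125 for each of the others), so the next weak learner is pushed to get it right.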

Base classifier: decision tree

from sklearn.tree import DecisionTreeClassifier

base_tree = DecisionTreeClassifier(max_depth=1)  # the base estimator can be customized
Mymodel1 = MyAdaBoostClassifier(base_estimator=base_tree, n_estimators=50, learning_rate=1.0)
Mymodel1.fit(trainX, trainY)

Base classifier: logistic regression

from sklearn.linear_model import LogisticRegression

base_lr = LogisticRegression()
Mymodel2 = MyAdaBoostClassifier(base_estimator=base_lr, n_estimators=50, learning_rate=1.0)
Mymodel2.fit(trainX, trainY)

Compute the evaluation metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

prediction1 = Mymodel1.predict(testX)
prediction2 = Mymodel2.predict(testX)

# AdaBoost with decision tree base learners
acc351 = accuracy_score(testY, prediction1)
precision351 = precision_score(testY, prediction1)
recall351 = recall_score(testY, prediction1)
f1351 = f1_score(testY, prediction1)
print(f'Accuracy: {acc351}\nPrecision: {precision351}\nRecall: {recall351}\nF1 Score: {f1351}\n')

# AdaBoost with logistic regression base learners
acc352 = accuracy_score(testY, prediction2)
precision352 = precision_score(testY, prediction2)
recall352 = recall_score(testY, prediction2)
f1352 = f1_score(testY, prediction2)
print(f'Accuracy: {acc352}\nPrecision: {precision352}\nRecall: {recall352}\nF1 Score: {f1352}\n')

Use the sklearn model

from sklearn.ensemble import AdaBoostClassifier

# Decision tree as the base learner
base_classifier = DecisionTreeClassifier(max_depth=1)
ada_boost_model = AdaBoostClassifier(base_classifier, n_estimators=50, learning_rate=1.0)
ada_boost_model.fit(trainX, trainY)

# Predict
predictions353 = ada_boost_model.predict(testX)

# Compute the evaluation metrics
acc353 = accuracy_score(testY, predictions353)
precision353 = precision_score(testY, predictions353)
recall353 = recall_score(testY, predictions353)
f1353 = f1_score(testY, predictions353)
print(f'Accuracy: {acc353}\nPrecision: {precision353}\nRecall: {recall353}\nF1 Score: {f1353}\n')
