11.3.4 Train-Test Split (Splitting the Dataset)

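Train-test split is a dataset-splitting method widely used in machine learning and data analysis to evaluate a model's performance and generalization ability. Its main purpose is to divide the original dataset into two mutually exclusive subsets: a training set and a test set. The model is fitted on the training set only, so the held-out test set gives an estimate of how it performs on data it has not seen.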
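As a minimal sketch of this idea, the following example applies train_test_split to a small made-up array and checks that the two subsets do not overlap. The toy data, the 80/20 ratio, and the random_state value are assumptions chosen for illustration and are not part of the chapter's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples, 2 features, binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The first column is unique per sample, so it can serve as a row identifier
train_ids = set(X_train[:, 0])
test_ids = set(X_test[:, 0])

print(X_train.shape, X_test.shape)       # (8, 2) (2, 2)
print(train_ids.isdisjoint(test_ids))    # True
```

Every sample lands in exactly one of the two subsets, which is what makes the test set a fair stand-in for unseen data.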

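(1) Import the train_test_split function from the sklearn (Scikit-Learn) library and display the first few rows of the dataset. train_test_split is a common tool for splitting a dataset into a training set and a test set according to a given ratio, so that a machine learning model can be trained and evaluated. The code is as follows.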

from sklearn.model_selection import train_test_split
dataset.head()

Running the code outputs:

   Address                                     FLAG  Avg min between sent tnx
0  0x00009277775ac7d0d59eaad8fee3d10ac6c805e8  0     844.26
1  0x0002b44ddb1476db43c868bd494422ee4c136fed  0     12709.07
2  0x0002bda54cb772d040f779e88eb453cac0daa244  0     246194.54
3  0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e  0     10219.60
4  0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89  0     36.61

   Avg min between received tnx  Time Diff between first and last (Mins)  Sent tnx  Received Tnx
0  1093.71                       704785.63                                721       89
1  2958.44                       1218216.73                               94        8
2  2434.02                       516729.30                                2         10
3  15785.09                      397555.90                                25        9
4  10707.77                      382472.42                                4598      20

   Number of Created Contracts  Unique Received From Addresses  Unique Sent To Addresses  ...
0  0                            40                              118                       ...
1  0                            5                               14                        ...
2  0                            10                              2                         ...
3  0                            7                               13                        ...
4  1                            7                               19                        ...

   max val sent to contract  total Ether sent  total ether balance  Total ERC20 tnxs
0  0.0                       865.691093        -279.224419          265.0
1  0.0                       3.087297          -0.001819            8.0
2  0.0                       3.588616          0.000441             8.0
3  0.0                       1750.045862       -854.646303          14.0
4  0.0                       104.318883        -50.896986           42.0

   ERC20 total Ether received  ERC20 total ether sent  ERC20 total Ether sent contract
0  3.558854e+07                3.560317e+07            0.0
1  4.034283e+02                2.260809e+00            0.0
2  5.215121e+02                0.000000e+00            0.0
3  1.711105e+04                1.141223e+04            0.0
4  1.628297e+05                1.235399e+05            0.0

   ERC20 uniq sent addr.1  ERC20 uniq rec contract addr  ERC20 min val rec
0  0.0                     58.0                          0.0
1  0.0                     7.0                           0.0
2  0.0                     8.0                           0.0
3  0.0                     11.0                          0.0
4  0.0                     27.0                          0.0

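(2) Store the target (response) variable in y and the feature variables in X, dropping the "FLAG" and "Address" columns from the features. Then define a function named train_val_test_split that splits the dataset into training, validation, and test sets by calling train_test_split twice: the first call holds out the test set, and the second call divides the remaining data into training and validation sets. Because the second call sees only the remaining data, the training fraction is rescaled to train_size / (train_size + val_size). Finally, use train_val_test_split to split the dataset into a training set (80%), a validation set (10%), and a test set (10%), stored in X_train, X_val, X_test, y_train, y_val, and y_test. The code is as follows.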

# Put the response variable in y and the feature variables in X
y = dataset['FLAG']
X = dataset.drop(['FLAG', 'Address'], axis=1)

# Define a function that splits the dataset
def train_val_test_split(X, y, train_size, val_size, test_size):
    # First hold out the test set
    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=test_size)
    # Rescale the training fraction relative to the remaining data
    relative_train_size = train_size / (val_size + train_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val,
        train_size=relative_train_size, test_size=1 - relative_train_size)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Split the dataset into training, validation, and test sets
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y, 0.8, 0.1, 0.1)
X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape

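The last line displays the shape of each subset. These shapes can be used to confirm that the dimensions of the subsets are correct, and they serve as a reference during training, validation, and testing.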
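One way to carry out such a check is sketched below on synthetic data. The random_state and stratify arguments are additions that the code above does not use; they are assumed here only to make the split reproducible and to keep the class ratio similar across the subsets. The synthetic DataFrame, its column names, and the 20% positive rate are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y, train_size, val_size, test_size, random_state=None):
    # Same two-stage logic as above; stratify and random_state are assumed extras
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state)
    relative_train_size = train_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, train_size=relative_train_size,
        stratify=y_train_val, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in for the real dataset: 1000 rows, roughly 20% positives
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
y = pd.Series((rng.random(1000) < 0.2).astype(int), name="FLAG")

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, 0.8, 0.1, 0.1, random_state=42)

# Report the size, share of the data, and positive rate of each subset
for name, features, target in [("train", X_train, y_train),
                               ("val", X_val, y_val),
                               ("test", X_test, y_test)]:
    print(name, features.shape, round(len(features) / len(X), 3),
          round(target.mean(), 3))
```

If the three shares are not close to 0.8, 0.1, and 0.1, or the positive rates differ noticeably, the split is worth revisiting before any model is trained.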

(3) Get the column names of the training set X_train. The code is as follows.

X_train.columns

Running the code returns the names of the feature columns in the training set (the target column is not included):

Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr', 'total ether balance', 'Time Diff between first and last (Mins)', 'max value received ', 'avg val received', ' ERC20 total Ether received', ' ERC20 min val rec', 'Unique Received From Addresses', 'Received Tnx', 'Avg min between received tnx', 'min value received', 'Avg min between sent tnx', 'total Ether sent', 'avg val sent', 'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],dtype='object')

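(4) Use mutual information to evaluate how important each feature is to the target, and plot the importance of the 18 features with the highest information gain. The code is as follows.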

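!pip install skfeature-chappers
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif

# Estimate the mutual information between each feature and the target
importance = mutual_info_classif(X_train, y_train)
feat_importances = pd.Series(importance, index=X_train.columns)

# Plot the 18 features with the highest scores as a horizontal bar chart
plt.figure(figsize=[30, 15])
feat_importances.nlargest(18).plot(kind='barh', color='teal')
plt.show()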
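To make the quantity behind this ranking more concrete, the following sketch computes mutual information directly from its definition, I(X;Y) = Σ p(x,y) · log(p(x,y) / (p(x) · p(y))), for two small discrete examples. The helper discrete_mutual_information and the toy lists are hypothetical. Note that mutual_info_classif estimates mutual information for continuous features with a nearest-neighbor method, so its scores are estimates and will generally not match a hand calculation like this one.

```python
import math
from collections import Counter

def discrete_mutual_information(xs, ys):
    """Mutual information (in nats) of two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

labels = [0, 0, 1, 1, 0, 0, 1, 1]
copy_of_label = list(labels)            # carries all information about the label
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]    # statistically independent of the label

print(round(discrete_mutual_information(labels, copy_of_label), 4))  # 0.6931 (= ln 2)
print(round(discrete_mutual_information(labels, unrelated), 4))      # 0.0
```

A feature that determines the label reaches the label's full entropy (ln 2 for a balanced binary label), while an independent feature scores zero, which is why larger scores are read as "more important".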

(5) Get the column names of the 18 features with the highest information gain and store them in a variable named col_x. The code is as follows.

col_x = feat_importances.nlargest(18).index
col_x

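Running the code returns the column names of these features, that is, the features that share the most information with the target variable: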

Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr', 'total ether balance', 'Time Diff between first and last (Mins)', 'max value received ', 'avg val received', ' ERC20 total Ether received', ' ERC20 min val rec', 'Unique Received From Addresses', 'Received Tnx', 'Avg min between received tnx', 'min value received', 'Avg min between sent tnx', 'total Ether sent', 'avg val sent', 'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],dtype='object')

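(6) Select the 18 features with the highest information gain from the training set X_train, the validation set X_val, and the test set X_test, overwrite each dataset with the selected columns, and then display feat_importances. The code is as follows.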

X_train = X_train[col_x]
X_val = X_val[col_x]
X_test = X_test[col_x]
feat_importances

Running the code outputs:

Avg min between sent tnx                   0.096649
Avg min between received tnx               0.102166
Time Diff between first and last (Mins)    0.237711
Sent tnx                                   0.068052
Received Tnx                               0.109679
####### (part of the output omitted)
 ERC20 total Ether sent contract           0.005287
 ERC20 uniq sent addr.1                    0.001419
 ERC20 uniq rec contract addr              0.254201
 ERC20 min val rec                         0.141128
dtype: float64
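The selection steps above can be sketched end to end on synthetic data. In this sketch the feature ranking is computed on the training set only and the resulting column list is then applied unchanged to the validation and test sets, so all three keep identical columns. The column names (signal, noise_a, noise_b), the value top_k = 2, and the random_state values are assumptions for illustration; the chapter's code keeps 18 features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic data: one feature that depends on the label, two that do not
rng = np.random.default_rng(42)
n = 600
y = pd.Series(rng.integers(0, 2, size=n), name="FLAG")
X = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "signal": y.to_numpy() * 3.0 + rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# 80% training, then the remaining 20% split evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Rank the features using the training set only
scores = pd.Series(
    mutual_info_classif(X_train, y_train, random_state=0),
    index=X_train.columns)
top_k = 2
col_x = scores.nlargest(top_k).index

# Apply the same column selection to every subset
X_train_sel = X_train[col_x]
X_val_sel = X_val[col_x]
X_test_sel = X_test[col_x]

print(scores.sort_values(ascending=False))
print(list(X_train_sel.columns) == list(X_val_sel.columns) == list(X_test_sel.columns))
```

Ranking on the training set alone keeps information from the validation and test sets out of the feature selection step.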
