This article is still in progress; please check back for updates.

Download links for the source code and dataset are at the end of this article.

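Ethereum is an open-source platform and cryptocurrency built on blockchain technology. It was launched in 2015 by developers including Vitalik Buterin and Gavin Wood, and has become one of the most popular cryptocurrencies after Bitcoin. Besides supporting cryptocurrency transactions, Ethereum gives developers and businesses powerful tools for building decentralized applications. This section implements a complete machine learning project that detects illicit accounts on the Ethereum blockchain, covering every stage from problem definition through model building and evaluation to the final summary and recommendations. The example highlights the importance of handling class imbalance and shows how several machine learning algorithms can be applied to a real-world problem. Data visualization and performance metrics are used to make the results easier to interpret and act on.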

Example 11-1: Using a model to detect illicit accounts on the Ethereum blockchain (source code path: daima/11/illicit-account-detection.ipynb)

11.3.1 Dataset Overview

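The dataset used in this project is intended for fraud detection research on the Ethereum blockchain. It contains records of known fraudulent transactions and valid transactions, and can be used for data analysis, machine learning, and the development and testing of fraud detection algorithms. The dataset is summarized below:

  1. Data source: the dataset comes from the Ethereum blockchain and contains information about Ethereum accounts and their transactions.
  2. Purpose: the dataset mainly serves as sample data for researchers and data scientists working on fraud detection. It can be used to train machine learning models that identify potentially fraudulent transactions.
  3. Columns: the dataset has multiple columns, including the account address, transaction types, time intervals between transactions, transaction counts, and Ether values. A "FLAG" column indicates whether a record is fraudulent.
  4. Uses: the dataset supports fraud detection, data mining, feature engineering, and other analysis and research on transaction behavior on the Ethereum blockchain.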

11.3.2 Data Preprocessing

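(1) Read the data file named "transaction_dataset.csv" and display the first few rows of the dataset to get an initial look at it. The code is shown below.

import pandas as pd

# Load the dataset and preview the first five rows
dataset = pd.read_csv("../input/ethereum-frauddetection-dataset/transaction_dataset.csv")
dataset.head()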

The output is:

| | Unnamed: 0 | Index | Address | FLAG | Avg min between sent tnx | Avg min between received tnx | Time Diff between first and last (Mins) | Sent tnx | Received Tnx | Number of Created Contracts | ... | ERC20 min val sent | ERC20 max val sent | ERC20 avg val sent | ERC20 min val sent contract | ERC20 max val sent contract | ERC20 avg val sent contract | ERC20 uniq sent token name | ERC20 uniq rec token name | ERC20 most sent token type | ERC20_most_rec_token_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0x00009277775ac7d0d59eaad8fee3d10ac6c805e8 | 0 | 844.26 | 1093.71 | 704785.63 | 721 | 89 | 0 | ... | 0.000000 | 1.683100e+07 | 271779.920000 | 0.0 | 0.0 | 0.0 | 39.0 | 57.0 | Cofoundit | Numeraire |
| 1 | 1 | 2 | 0x0002b44ddb1476db43c868bd494422ee4c136fed | 0 | 12709.07 | 2958.44 | 1218216.73 | 94 | 8 | 0 | ... | 2.260809 | 2.260809e+00 | 2.260809 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | Livepeer Token | Livepeer Token |
| 2 | 2 | 3 | 0x0002bda54cb772d040f779e88eb453cac0daa244 | 0 | 246194.54 | 2434.02 | 516729.30 | 2 | 10 | 0 | ... | 0.000000 | 0.000000e+00 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | None | XENON |
| 3 | 3 | 4 | 0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e | 0 | 10219.60 | 15785.09 | 397555.90 | 25 | 9 | 0 | ... | 100.000000 | 9.029231e+03 | 3804.076893 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | Raiden | XENON |
| 4 | 4 | 5 | 0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89 | 0 | 36.61 | 10707.77 | 382472.42 | 4598 | 20 | 1 | ... | 0.000000 | 4.500000e+04 | 13726.659220 | 0.0 | 0.0 | 0.0 | 6.0 | 27.0 | StatusNetwork | EOS |
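An "Unnamed: 0" column like the one in this preview typically appears when a CSV file was saved together with its row index. Whether that is how this particular file was produced is an assumption. The sketch below uses a small made-up frame (`toy`, with illustrative values only) to show the effect and one way to avoid it at load time with the `index_col` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({"Index": [1, 2], "FLAG": [0, 1]})

# Writing with the default index=True stores the row index as an unnamed first column.
csv_text = toy.to_csv()

# Reading the text back without options yields an extra "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# Treating the first column as the index avoids the extra column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

Loading with `index_col=0` is an alternative to dropping the column afterwards; both approaches leave the same feature columns.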

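(2) Get the dimensions of the dataset (number of rows and columns). dataset.shape returns a tuple of two values: the first is the number of rows and the second is the number of columns. For example, a result of (1000, 20) means that the dataset has 1000 rows and 20 columns. The code is shown below.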

dataset.shape

The output is:

(9841, 51)

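(3) Get detailed information about the dataset. This command is useful for quickly reviewing the structure and data types of the dataset and for checking whether it contains missing values. The code is shown below.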

dataset.info()

Running this line of code prints the following information about the dataset:

  1. The name of each column.
  2. The number of non-missing values in each column.
  3. The data type of each column (for example, integer, float, or object).
  4. The total number of rows in the dataset.

The output is:

RangeIndex: 9841 entries, 0 to 9840
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Unnamed: 0                    9841 non-null   int64
 1   Index                         9841 non-null   int64
 2   Address                       9841 non-null   object
 3   FLAG                          9841 non-null   int64
 4   Avg min between sent tnx      9841 non-null   ...
######## (part of the output omitted)
 46   ERC20 avg val sent contract  9012 non-null   float64
 47   ERC20 uniq sent token name   9012 non-null   float64
 48   ERC20 uniq rec token name    9012 non-null   float64
 49   ERC20 most sent token type   9000 non-null   object
 50   ERC20_most_rec_token_type    8990 non-null   object
dtypes: float64(39), int64(9), object(3)
memory usage: 3.8+ MB
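The dtype summary distinguishes numeric columns from text columns, and the two groups usually need different preprocessing. As a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data, `select_dtypes` can split the column names by type:

```python
import pandas as pd

# Toy frame with one text column and three numeric columns (illustrative values).
toy = pd.DataFrame({
    "Address": ["0xaaa", "0xbbb", "0xccc"],
    "FLAG": [0, 1, 0],
    "Sent tnx": [721, 94, 2],
    "avg val sent": [1.5, 0.2, 3.0],
})

# Integer and float columns.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text columns such as addresses or token names).
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```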

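(4) Check for duplicate rows in the dataset and delete them, then delete the column named "Unnamed: 0", which is not needed for further analysis. Finally, output the dimensions of the processed dataset. The code is shown below.

# Check for and remove duplicate rows
dataset.drop_duplicates(subset=None, inplace=True)
# Drop the "Unnamed: 0" column (not needed for further analysis)
dataset.drop(['Unnamed: 0'], axis=1, inplace=True)
# Get the dimensions of the processed dataset
dataset.shape

The output below shows that the row count is unchanged and that one column has been removed: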

(9841, 50)
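All 9841 rows remain after this step. In the code above, duplicates are checked while "Unnamed: 0" is still present, and a column that differs on every row prevents any two rows from comparing as fully identical. Whether "Unnamed: 0" is in fact unique per row here is an assumption based on the preview. The sketch below uses a made-up frame (`toy`) to show the effect and how the `subset` parameter restricts the comparison to chosen columns.

```python
import pandas as pd

# Toy frame: the first two rows describe the same account,
# but a running counter column makes them look different.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Address": ["0xaaa", "0xaaa", "0xbbb"],
    "FLAG": [1, 1, 0],
})

# Full-row comparison: the counter makes every row unique.
full_row_dupes = int(toy.duplicated().sum())

# Comparing only the descriptive columns finds the repeated account.
content_dupes = int(toy.duplicated(subset=["Address", "FLAG"]).sum())

# Keep the first occurrence of each repeated account.
deduped = toy.drop_duplicates(subset=["Address", "FLAG"], keep="first")

print(full_row_dupes, content_dupes, len(deduped))
```

Which columns belong in `subset` depends on what should count as "the same record"; the choice here is illustrative.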

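(5) Generate descriptive statistics for the dataset, summarizing its numeric columns: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. These statistics are useful for a first look at the distribution and characteristics of the data. The code is shown below.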

dataset.describe()

The output is:

| | Index | FLAG | Avg min between sent tnx | Avg min between received tnx | Time Diff between first and last (Mins) | Sent tnx | Received Tnx | Number of Created Contracts | Unique Received From Addresses | Unique Sent To Addresses | ... | ERC20 max val rec | ERC20 avg val rec | ERC20 min val sent | ERC20 max val sent | ERC20 avg val sent | ERC20 min val sent contract | ERC20 max val sent contract | ERC20 avg val sent contract | ERC20 uniq sent token name | ERC20 uniq rec token name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9841.000000 | 9841.000000 | 9841.000000 | 9841.000000 | 9.841000e+03 | 9841.000000 | 9841.000000 | 9841.000000 | 9841.000000 | 9841.000000 | ... | 9.012000e+03 | 9.012000e+03 | 9.012000e+03 | 9.012000e+03 | 9.012000e+03 | 9012.0 | 9012.0 | 9012.0 | 9012.000000 | 9012.000000 |
| mean | 1815.049893 | 0.221421 | 5086.878721 | 8004.851184 | 2.183333e+05 | 115.931714 | 163.700945 | 3.729702 | 30.360939 | 25.840159 | ... | 1.252524e+08 | 4.346203e+06 | 1.174126e+04 | 1.303594e+07 | 6.318389e+06 | 0.0 | 0.0 | 0.0 | 1.384931 | 4.826676 |
| std | 1222.621830 | 0.415224 | 21486.549974 | 23081.714801 | 3.229379e+05 | 757.226361 | 940.836550 | 141.445583 | 298.621112 | 263.820410 | ... | 1.053741e+10 | 2.141192e+08 | 1.053567e+06 | 1.179905e+09 | 5.914764e+08 | 0.0 | 0.0 | 0.0 | 6.735121 | 16.678607 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 25% | 821.000000 | 0.000000 | 0.000000 | 0.000000 | 3.169300e+02 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 50% | 1641.000000 | 0.000000 | 17.340000 | 509.770000 | 4.663703e+04 | 3.000000 | 4.000000 | 0.000000 | 2.000000 | 2.000000 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.000000 |
| 75% | 2601.000000 | 0.000000 | 565.470000 | 5480.390000 | 3.040710e+05 | 11.000000 | 27.000000 | 0.000000 | 5.000000 | 3.000000 | ... | 9.900000e+01 | 2.946467e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.000000 |
| max | 4729.000000 | 1.000000 | 430287.670000 | 482175.490000 | 1.954861e+06 | 10000.000000 | 10000.000000 | 9995.000000 | 9999.000000 | 9287.000000 | ... | 1.000000e+12 | 1.724181e+10 | 1.000000e+08 | 1.120000e+11 | 5.614756e+10 | 0.0 | 0.0 | 0.0 | 213.000000 | 737.000000 |
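In the statistics above, the three "sent contract" ERC20 value columns have a mean, standard deviation, minimum, and maximum of 0, so they carry no information that could separate the two classes. As a minimal sketch, assuming a made-up frame (`toy`) with simplified column names, such constant columns can be read directly off the `describe()` result. Whether the original notebook removes them is not stated in this section.

```python
import pandas as pd

# Toy frame with one constant column (illustrative values only).
toy = pd.DataFrame({
    "Sent tnx": [721, 94, 2, 25],
    "ERC20 min val sent contract": [0.0, 0.0, 0.0, 0.0],
    "ERC20 uniq rec token name": [57.0, 7.0, 8.0, 11.0],
})

stats = toy.describe()

# Columns whose standard deviation is exactly zero never change value.
constant_cols = stats.columns[stats.loc["std"] == 0].tolist()
print(constant_cols)

# One option is to drop them before modelling.
reduced = toy.drop(columns=constant_cols)
print(reduced.shape)
```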

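(6) Get the column names of the dataset, that is, the names of all columns it contains. Knowing the column names helps in understanding the structure of the dataset and in identifying its features or attributes. The code is shown below.

# Get the column names of the dataset
column = dataset.columns
column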

The output is:

Index(['Index', 'Address', 'FLAG', 'Avg min between sent tnx',
       'Avg min between received tnx',
       'Time Diff between first and last (Mins)', 'Sent tnx', 'Received Tnx',
       'Number of Created Contracts', 'Unique Received From Addresses',
       'Unique Sent To Addresses', 'min value received',
       'max value received ', 'avg val received', 'min val sent',
       'max val sent', 'avg val sent', 'min value sent to contract',
       'max val sent to contract', 'avg value sent to contract',
       'total transactions (including tnx to create contract',
       'total Ether sent', 'total ether received',
       'total ether sent contracts', 'total ether balance',
       ' Total ERC20 tnxs', ' ERC20 total Ether received',
       ' ERC20 total ether sent', ' ERC20 total Ether sent contract',
       ' ERC20 uniq sent addr', ' ERC20 uniq rec addr',
       ' ERC20 uniq sent addr.1', ' ERC20 uniq rec contract addr',
       ' ERC20 avg time between sent tnx', ' ERC20 avg time between rec tnx',
       ' ERC20 avg time between rec 2 tnx',
       ' ERC20 avg time between contract tnx', ' ERC20 min val rec',
       ' ERC20 max val rec', ' ERC20 avg val rec', ' ERC20 min val sent',
       ' ERC20 max val sent', ' ERC20 avg val sent',
       ' ERC20 min val sent contract', ' ERC20 max val sent contract',
       ' ERC20 avg val sent contract', ' ERC20 uniq sent token name',
       ' ERC20 uniq rec token name', ' ERC20 most sent token type',
       ' ERC20_most_rec_token_type'],
      dtype='object')
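Several of the names listed above start or end with a space (for example ' Total ERC20 tnxs' and 'max value received '), which makes column lookups easy to get wrong. The original notebook keeps these names as they are, so the following is only an optional sketch on a made-up frame (`toy`): `Index.str.strip()` removes the surrounding whitespace from every column name.

```python
import pandas as pd

# Toy frame reusing three of the real column names, spaces included.
toy = pd.DataFrame(
    [[10, 2.5, 0]],
    columns=[" Total ERC20 tnxs", "max value received ", "FLAG"],
)

# Strip leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

If the names are normalized this way, later code must use the stripped names, for example `toy["Total ERC20 tnxs"]`.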

(7) Check the dataset for missing values and count the missing values in each column. The code is shown below.

dataset.isnull().sum()

The output is:

Index                                        0
Address                                      0
FLAG                                         0
Avg min between sent tnx                     0
Avg min between received tnx                 0
Time Diff between first and last (Mins)      0
##### (part of the output omitted)
 ERC20 avg val sent contract               829
 ERC20 uniq sent token name                829
 ERC20 uniq rec token name                 829
 ERC20 most sent token type                841
 ERC20_most_rec_token_type                 851
dtype: int64
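The counts above show that the ERC20 columns are the ones with gaps. How the original notebook fills them is not shown in this section, so the following is only a sketch of one common approach on a made-up frame (`toy`): report the share of missing values per column, then fill numeric gaps with the column median and text gaps with a placeholder label. The "Unknown" label and the use of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "ERC20 avg val sent": [1.0, np.nan, 3.0, 100.0],
    "ERC20 most sent token type": ["Raiden", None, "Raiden", "EOS"],
})

# Share of missing values per column (0.25 means 25% missing).
missing_share = toy.isnull().mean()
print(missing_share)

filled = toy.copy()

# Numeric columns: fill with the median, which is robust to extreme values.
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

# Remaining (text) columns: fill with an explicit placeholder label.
cat_cols = filled.columns.difference(num_cols)
filled[cat_cols] = filled[cat_cols].fillna("Unknown")

print(filled)
```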

(8) Display the first few rows of the dataset again to review its content and structure. The code is shown below.

dataset.head()

The output is now:

| | Index | Address | FLAG | Avg min between sent tnx | Avg min between received tnx | Time Diff between first and last (Mins) | Sent tnx | Received Tnx | Number of Created Contracts | Unique Received From Addresses | ... | ERC20 min val sent | ERC20 max val sent | ERC20 avg val sent | ERC20 min val sent contract | ERC20 max val sent contract | ERC20 avg val sent contract | ERC20 uniq sent token name | ERC20 uniq rec token name | ERC20 most sent token type | ERC20_most_rec_token_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0x00009277775ac7d0d59eaad8fee3d10ac6c805e8 | 0 | 844.26 | 1093.71 | 704785.63 | 721 | 89 | 0 | 40 | ... | 0.000000 | 1.683100e+07 | 271779.920000 | 0.0 | 0.0 | 0.0 | 39.0 | 57.0 | Cofoundit | Numeraire |
| 1 | 2 | 0x0002b44ddb1476db43c868bd494422ee4c136fed | 0 | 12709.07 | 2958.44 | 1218216.73 | 94 | 8 | 0 | 5 | ... | 2.260809 | 2.260809e+00 | 2.260809 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | Livepeer Token | Livepeer Token |
| 2 | 3 | 0x0002bda54cb772d040f779e88eb453cac0daa244 | 0 | 246194.54 | 2434.02 | 516729.30 | 2 | 10 | 0 | 10 | ... | 0.000000 | 0.000000e+00 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | None | XENON |
| 3 | 4 | 0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e | 0 | 10219.60 | 15785.09 | 397555.90 | 25 | 9 | 0 | 7 | ... | 100.000000 | 9.029231e+03 | 3804.076893 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | Raiden | XENON |
| 4 | 5 | 0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89 | 0 | 36.61 | 10707.77 | 382472.42 | 4598 | 20 | 1 | 7 | ... | 0.000000 | 4.500000e+04 | 13726.659220 | 0.0 | 0.0 | 0.0 | 6.0 | 27.0 | StatusNetwork | EOS |

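(9) First get the column names of the dataset, then count the occurrences of each value in the column named ' ERC20 most sent token type'. This is useful for understanding how the values in a particular column are distributed. The code is shown below.

# Get the column names of the dataset
column = dataset.columns
column
# Count how often each token type appears in the column
dataset[' ERC20 most sent token type'].value_counts()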
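By default `value_counts()` leaves missing values out of the tally, which matters for a column that contains gaps. The sketch below uses a short made-up Series (`tokens`) to show the difference that `dropna=False` makes; the token names are illustrative.

```python
import pandas as pd

# Toy token-type column with three missing entries.
tokens = pd.Series(["Raiden", None, "EOS", "Raiden", None, None], name="token")

# Default behaviour: missing values are excluded from the counts.
default_counts = tokens.value_counts()

# With dropna=False the missing values get their own row.
with_missing = tokens.value_counts(dropna=False)

print(default_counts)
print(with_missing)
```

Comparing the two totals against the length of the column is a quick way to see how many entries are missing.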


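(10) Iterate over every column of the dataset and count the occurrences of each distinct value in that column. The code is shown below.

# Iterate over every column of the dataset and count the occurrences of each value
for col in column:
    print(dataset[col].value_counts())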

The code above helps in understanding the distribution of each feature or attribute. The output is:

1       3
1458    3
1452    3
1453    3
1454    3
       ..
3527    1
3526    1
3525    1
3524    1
4729    1
Name: Index, Length: 4729, dtype: int64
0x4cd526aa2db72eb1fd557b37c6b0394acd35b212    2
0x4cd3bb2110eda1805dc63abc1959a5ee2d386e9f    2
0x4c1da8781f6ca312bc11217b3f61e5dfdf428de1    2
0x4c24af967901ec87a6644eb1ef42b680f58e67f5    2
0x4c268c7b1d51b369153d6f1f28c61b15f0e17746    2
                                             ..
0x57b417366e5681ad493a03492d9b61ecd0d3d247    1
0x57bb2d6426fed243c633d0b16d4297d12bc20638    1
0x57c0cf70020f0af5073c24cb272e93e7529c6a40    1
0x57ccf2b7ffe5e4497a7e04ac174646f5f16e24ce    1
0xd624d046edbdef805c5e4140dce5fb5ec1b39a3c    1
Name: Address, Length: 9816, dtype: int64
0    7662
1    2179
Name: FLAG, dtype: int64
0.00    3522
2.11      14
########## (part of the output omitted)
Blockwell say NOTSAFU    779
DATAcoin                 358
Livepeer Token           207
                        ...
BCDN                       1
Egretia                    1
UG Coin                    1
Yun Planet                 1
INS Promo1                 1
Name: ERC20_most_rec_token_type, Length: 467, dtype: int64
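The FLAG counts in this output (7662 rows with 0 and 2179 rows with 1) make the class imbalance concrete. The sketch below works only from those two printed counts. The "balanced" weighting formula, n_samples / (n_classes * class_count), is one common heuristic and is shown as an illustration, not as the method the original notebook uses.

```python
import pandas as pd

# Class counts copied from the FLAG value_counts() output above.
flag_counts = pd.Series({0: 7662, 1: 2179}, name="FLAG")

# Share of each class in the dataset.
flag_share = flag_counts / flag_counts.sum()

# How many FLAG=0 rows there are for every FLAG=1 row.
imbalance_ratio = flag_counts[0] / flag_counts[1]

# "Balanced" class weights: n_samples / (n_classes * class_count).
class_weights = flag_counts.sum() / (len(flag_counts) * flag_counts)

print(flag_share.round(4))
print(round(imbalance_ratio, 2))
print(class_weights.round(4))
```

On a DataFrame, the same shares can be obtained directly with `value_counts(normalize=True)` on the FLAG column.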
