pandas模块介绍 pandas官方文档:https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750
pandas基于Numpy,可以看成是处理文本或者表格数据。pandas中有两个主要的数据结构,其中Series数据结构类似于Numpy中的一维数组,DataFrame类似于多维表格数据结构。
pandas是python数据分析的核心模块。它主要提供了五大功能:
支持文件存取操作,支持数据库(sql)、html、json、pickle、csv(txt、excel)、sas、stata、hdf等。 支持增删改查、切片、高阶函数、分组聚合等单表操作,以及和dict、list的互相转换。 支持多表拼接合并操作。 支持简单的绘图操作。 支持简单的统计分析操作。 一、Series数据结构 Series是一种类似于一维数组的对象,由一组数据和一组与之相关的数据标签(索引)组成。
Series比较像列表(数组)和字典的结合体
import numpy as npimport pandas as pddf = pd. Series( 0 , index= [ 'a' , 'b' , 'c' , 'd' ] ) print ( df) a 0 b 0 c 0 d 0 dtype: int64 print ( df. values) print ( df. index)
1 Series支持NumPy模块的特性(下标) 详解 方法 从ndarray创建Series Series(arr) 与标量运算 df*2 两个Series运算 df1+df2 索引 df[0], df[[1,2,4]] 切片 df[0:2] 通用函数 np.abs(df) 布尔值过滤 df[df>0]
arr = np. array( [ 1 , 2 , 3 , 4 , np. nan] ) print ( arr) df = pd. Series( arr, index= [ 'a' , 'b' , 'c' , 'd' , 'e' ] ) print ( df) a 1.0 b 2.0 c 3.0 d 4.0 e NaNdtype: float64 print ( df** 2 ) a 1.0 b 4.0 c 9.0 d 16.0 e NaN dtype: float64 print ( df[ 0 ] ) print ( df[ 'a' ] ) print ( df[ [ 0 , 1 , 2 ] ] ) a 1.0 b 2.0 c 3.0 dtype: float64 print ( df[ 0 : 2 ] ) a 1.0 b 2.0 dtype: float64 np. sin( df) a 0.841471 b 0.909297 c 0.141120 d - 0.756802 e NaN dtype: float64 df[ df > 1 ] b 2.0 c 3.0 d 4.0 dtype: float64
2 Series支持字典的特性(标签) 详解 方法 从字典创建Series Series(dic), in运算 ’a’ in sr 键索引 sr[‘a’], sr[[‘a’, ‘b’, ‘d’]]
df = pd. Series( { 'a' : 1 , 'b' : 2 } ) print ( df) a 1 b 2 dtype: int64 print ( 'a' in df) print ( df[ 'a' ] )
3 Series缺失数据处理 方法 详解 dropna() 过滤掉值为NaN的行 fillna() 填充缺失数据 isnull() 返回布尔数组,缺失值对应为True notnull() 返回布尔数组,缺失值对应为False
df = pd. Series( [ 1 , 2 , 3 , 4 , np. nan] , index= [ 'a' , 'b' , 'c' , 'd' , 'e' ] ) print ( df) a 1.0 b 2.0 c 3.0 d 4.0 e NaN dtype: float64 print ( df. dropna( ) ) a 1.0 b 2.0 c 3.0 d 4.0 dtype: float64print ( df. fillna( 5 ) ) a 1.0 b 2.0 c 3.0 d 4.0 e 5.0 dtype: float64print ( df. isnull( ) ) a False b False c False d False e True dtype: bool print ( df. notnull( ) ) a True b True c True d True e False dtype: bool
二、DataFrame数据结构 DataFrame是一个表格型的数据结构,含有一组有序的列。
DataFrame可以被看做是由Series组成的字典,并且共用一个索引。
1 产生时间对象数组:date_range date_range参数详解:
参数 详解 start 开始时间 end 结束时间 periods 时间长度 freq 时间频率,默认为’D’,可选H(our),W(eek),B(usiness),S(emi-)M(onth),(min)T(es), S(econd), A(year),…
dates = pd. date_range( '20190101' , periods= 6 , freq= 'M' ) print ( dates) DatetimeIndex( [ '2019-01-31' , '2019-02-28' , '2019-03-31' , '2019-04-30' , '2019-05-31' , '2019-06-30' ] , dtype= 'datetime64[ns]' , freq= 'M' ) np. random. seed( 1 ) arr = 10 * np. random. randn( 6 , 4 ) print ( arr) [ [ 16.24345364 - 6.11756414 - 5.28171752 - 10.72968622 ] [ 8.65407629 - 23.01538697 17.44811764 - 7.61206901 ] [ 3.19039096 - 2.49370375 14.62107937 - 20.60140709 ] [ - 3.22417204 - 3.84054355 11.33769442 - 10.99891267 ] [ - 1.72428208 - 8.77858418 0.42213747 5.82815214 ] [ - 11.00619177 11.4472371 9.01590721 5.02494339 ] ] df = pd. DataFrame( arr, index= dates, columns= [ 'c1' , 'c2' , 'c3' , 'c4' ] ) df
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
三、DataFrame属性 属性 详解 dtype是 查看数据类型 index 查看行序列或者索引 columns 查看各列的标签 values 查看数据框内的数据,也即不含表头索引的数据 describe 查看数据每一列的极值,均值,中位数,只可用于数值型数据 transpose 转置,也可用T来操作 sort_index 排序,可按行或列index排序输出 sort_values 按数据值来排序
print ( df2. dtypes) 0 float64 1 float64 2 float64 3 float64 dtype: object df
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
print ( df. index) DatetimeIndex( [ '2019-01-31' , '2019-02-28' , '2019-03-31' , '2019-04-30' , '2019-05-31' , '2019-06-30' ] , dtype= 'datetime64[ns]' , freq= 'M' ) print ( df. columns) Index( [ 'c1' , 'c2' , 'c3' , 'c4' ] , dtype= 'object' ) print ( df. values) [ [ 16.24345364 - 6.11756414 - 5.28171752 - 10.72968622 ] [ 8.65407629 - 23.01538697 17.44811764 - 7.61206901 ] [ 3.19039096 - 2.49370375 14.62107937 - 20.60140709 ] [ - 3.22417204 - 3.84054355 11.33769442 - 10.99891267 ] [ - 1.72428208 - 8.77858418 0.42213747 5.82815214 ] [ - 11.00619177 11.4472371 9.01590721 5.02494339 ] ] df. describe( )
空格 c1 c2 c3 c4 count 6.000000 6.000000 6.000000 6.000000 mean 2.022213 -5.466424 7.927203 -6.514830 std 9.580084 11.107772 8.707171 10.227641 min -11.006192 -23.015387 -5.281718 -20.601407 25% -2.849200 -8.113329 2.570580 -10.931606 50% 0.733054 -4.979054 10.176801 -9.170878 75% 7.288155 -2.830414 13.800233 1.865690 max 16.243454 11.447237 17.448118 5.828152
df. T
空格 2019-01-31 00:00:00 2019-02-28 00:00:00 2019-03-31 00:00:00 2019-04-30 00:00:00 2019-05-31 00:00:00 2019-06-30 00:00:00 c1 16.243454 8.654076 3.190391 -3.224172 -1.724282 -11.006192 c2 -6.117564 -23.015387 -2.493704 -3.840544 -8.778584 11.447237 c3 -5.281718 17.448118 14.621079 11.337694 0.422137 9.015907 c4 -10.729686 -7.612069 -20.601407 -10.998913 5.828152 5.024943
df. sort_index( axis= 0 )
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
df. sort_index( axis= 1 )
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
df. sort_values( by= 'c2' )
空格 c1 c2 c3 c4 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-06-30 -11.006192 11.447237 9.015907 5.024943
四、DataFrame取值 df
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
1 通过columns取值 df[ 'c2' ] 2019 - 01 - 31 - 6.117564 2019 - 02 - 28 - 23.015387 2019 - 03 - 31 - 2.493704 2019 - 04 - 30 - 3.840544 2019 - 05 - 31 - 8.778584 2019 - 06 - 30 11.447237 Freq: M, Name: c2, dtype: float64 df[ [ 'c2' , 'c3' ] ]
空格 c2 c3 2019-01-31 -6.117564 -5.281718 2019-02-28 -23.015387 17.448118 2019-03-31 -2.493704 14.621079 2019-04-30 -3.840544 11.337694 2019-05-31 -8.778584 0.422137 2019-06-30 11.447237 9.015907
2 loc(通过行标签取值) df. loc[ '2019-01-01' : '2019-01-03' ]
df[ 0 : 3 ]
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
3 iloc(类似于numpy数组取值) df. values array( [ [ 16.24345364 , - 6.11756414 , - 5.28171752 , - 10.72968622 ] , [ 8.65407629 , - 23.01538697 , 17.44811764 , - 7.61206901 ] , [ 3.19039096 , - 2.49370375 , 14.62107937 , - 20.60140709 ] , [ - 3.22417204 , - 3.84054355 , 11.33769442 , - 10.99891267 ] , [ - 1.72428208 , - 8.77858418 , 0.42213747 , 5.82815214 ] , [ - 11.00619177 , 11.4472371 , 9.01590721 , 5.02494339 ] ] ) print ( df. iloc[ 2 , 1 ] ) df. iloc[ 1 : 4 , 1 : 4 ]
空格 c2 c3 c4 2019-02-28 -23.015387 17.448118 -7.612069 2019-03-31 -2.493704 14.621079 -20.601407 2019-04-30 -3.840544 11.337694 -10.998913
4 使用逻辑判断取值 df[ df[ 'c1' ] > 0 ]
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
df[ ( df[ 'c1' ] > 0 ) & ( df[ 'c2' ] > - 8 ) ]
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
五、DataFrame值替换 df
空格 c1 c2 c3 c4 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686 2019-02-28 8.654076 -23.015387 17.448118 -7.612069 2019-03-31 3.190391 -2.493704 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
df. iloc[ 0 : 3 , 0 : 2 ] = 0 df
空格 c1 c2 c3 c4 2019-01-31 0.000000 0.000000 -5.281718 -10.729686 2019-02-28 0.000000 0.000000 17.448118 -7.612069 2019-03-31 0.000000 0.000000 14.621079 -20.601407 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
df[ 'c3' ] > 10 2019 - 01 - 31 False 2019 - 02 - 28 True 2019 - 03 - 31 True 2019 - 04 - 30 True 2019 - 05 - 31 False 2019 - 06 - 30 False Freq: M, Name: c3, dtype: bool df[ df[ 'c3' ] > 10 ] = 100 df
空格 c1 c2 c3 c4 2019-01-31 0.000000 0.000000 -5.281718 -10.729686 2019-02-28 100.000000 100.000000 100.000000 100.000000 2019-03-31 100.000000 100.000000 100.000000 100.000000 2019-04-30 100.000000 100.000000 100.000000 100.000000 2019-05-31 -1.724282 -8.778584 0.422137 5.828152 2019-06-30 -11.006192 11.447237 9.015907 5.024943
df = df. astype( np. int32) df[ df[ 'c3' ] . isin( [ 100 ] ) ] = 1000 df
空格 c1 c2 c3 c4 2019-01-31 0 0 -5 -10 2019-02-28 1000 1000 1000 1000 2019-03-31 1000 1000 1000 1000 2019-04-30 1000 1000 1000 1000 2019-05-31 -1 -8 0 5 2019-06-30 -11 11 9 5
六、读取CSV文件 import pandas as pdfrom io import StringIOtest_data = '''5.1,,1.4,0.24.9,3.0,1.4,0.24.7,3.2,,0.27.0,3.2,4.7,1.46.4,3.2,4.5,1.56.9,3.1,4.9,,,,''' test_data = StringIO( test_data) df = pd. read_csv( test_data, header= None ) df. columns = [ 'c1' , 'c2' , 'c3' , 'c4' ] df
空格 c1 c2 c3 c4 0 5.1 NaN 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 NaN 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5 5 6.9 3.1 4.9 NaN 6 NaN NaN NaN NaN
七、处理丢失数据 df. isnull( )
空格 c1 c2 c3 c4 0 False True False False 1 False False False False 2 False False True False 3 False False False False 4 False False False False 5 False False False True 6 True True True True
print ( df. isnull( ) . sum ( ) ) c1 1 c2 2 c3 2 c4 2 dtype: int64df. dropna( axis= 0 )
空格 c1 c2 c3 c4 1 4.9 3.0 1.4 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5
df. dropna( axis= 1 )
df. dropna( how= 'all' )
空格 c1 c2 c3 c4 0 5.1 NaN 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 NaN 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5 5 6.9 3.1 4.9 NaN
df. dropna( thresh= 4 )
空格 c1 c2 c3 c4 1 4.9 3.0 1.4 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5
df. dropna( subset= [ 'c2' ] )
空格 c1 c2 c3 c4 1 4.9 3.0 1.4 0.2 2 4.7 3.2 NaN 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5 5 6.9 3.1 4.9 NaN
df. fillna( value= 10 )
空格 c1 c2 c3 c4 0 5.1 10.0 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 10.0 0.2 3 7.0 3.2 4.7 1.4 4 6.4 3.2 4.5 1.5 5 6.9 3.1 4.9 10.0 6 10.0 10.0 10.0 10.0
八、合并数据 df1 = pd. DataFrame( np. zeros( ( 3 , 4 ) ) ) df1
空格 0 1 2 3 0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0
df2 = pd. DataFrame( np. ones( ( 3 , 4 ) ) ) df2
空格 0 1 2 3 0 1.0 1.0 1.0 1.0 1 1.0 1.0 1.0 1.0 2 1.0 1.0 1.0 1.0
pd. concat( ( df1, df2) , axis= 0 )
空格 0 1 2 3 0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 0 1.0 1.0 1.0 1.0 1 1.0 1.0 1.0 1.0 2 1.0 1.0 1.0 1.0
pd. concat( ( df1, df2) , axis= 1 ) 0 1 2 3 0 1 2 3 0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 df1. append( df2)
空格 0 1 2 3 0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 0 1.0 1.0 1.0 1.0 1 1.0 1.0 1.0 1.0 2 1.0 1.0 1.0 1.0
九、导入导出数据 使用df = pd.read_excel(filename)读取文件,使用df.to_excel(filename)保存文件。
1 读取文件导入数据 读取文件导入数据函数主要参数:
参数 详解 sep 指定分隔符,可用正则表达式如’\s+’ header=None 指定文件无行名 name 指定列名 index_col 指定某列作为索引 skip_row 指定跳过某些行 na_values 指定某些字符串表示缺失值 parse_dates 指定某些列是否被解析为日期,布尔值或列表
df = pd. read_excel( filename) df = pd. read_csv( filename)
2 写入文件导出数据
写入文件函数的主要参数:
参数 详解 sep 分隔符 na_rep 指定缺失值转换的字符串,默认为空字符串 header=False 不保存列名 index=False 不保存行索引 cols 指定输出的列,传入列表
df. to_excel( filename)
十、pandas读取json文件 strtext = '[ { "ttery" : "min" , "issue" : "20130801-3391" , "code" : "8,4,5,2,9" , "code1" : "297734529" , "code2" : null, "time" : 1013395466000 } , \{ "ttery" : "min" , "issue" : "20130801-3390" , "code" : "7,8,2,1,2" , "code1" : "298058212" , "code2" : null, "time" : 1013395406000 } , \{ "ttery" : "min" , "issue" : "20130801-3389" , "code" : "5,9,1,2,9" , "code1" : "298329129" , "code2" : null, "time" : 1013395346000 } , \{ "ttery" : "min" , "issue" : "20130801-3388" , "code" : "3,8,7,3,3" , "code1" : "298588733" , "code2" : null, "time" : 1013395286000 } , \{ "ttery" : "min" , "issue" : "20130801-3387" , "code" : "0,8,5,2,7" , "code1" : "298818527" , "code2" : null, "time" : 1013395226000 } ] 'df = pd. read_json( strtext, orient= 'records' ) df
空格 code code1 code2 issue time ttery 0 8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min 1 7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min 2 5,9,1,2,9 298329129 NaN 20130801-3389 1013395346000 min 3 3,8,7,3,3 298588733 NaN 20130801-3388 1013395286000 min 4 0,8,5,2,7 298818527 NaN 20130801-3387 1013395226000 min
df. to_excel( 'pandas处理json.xlsx' , index= False , columns= [ "ttery" , "issue" , "code" , "code1" , "code2" , "time" ] )
1 orient参数的五种形式 orient是表明预期的json字符串格式。orient的设置有以下五个值:
‘split’ : dict like {index -> [index], columns -> [columns], data -> [values]}
这种就是有索引,有列字段,和数据矩阵构成的json格式。key名称只能是index,columns和data。
s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}' df = pd. read_json( s, orient= 'split' ) df
‘records’ : list like [{column -> value}, … , {column -> value}]
这种就是成员为字典的列表。如我今天要处理的json数据示例所见。构成是列字段为键,值为键值,每一个字典成员就构成了dataframe的一行数据。
strtext = '[ { "ttery" : "min" , "issue" : "20130801-3391" , "code" : "8,4,5,2,9" , "code1" : "297734529" , "code2" : null, "time" : 1013395466000 } , \{ "ttery" : "min" , "issue" : "20130801-3390" , "code" : "7,8,2,1,2" , "code1" : "298058212" , "code2" : null, "time" : 1013395406000 } ] 'df = pd. read_json( strtext, orient= 'records' ) df
空格 code code1 code2 issue time ttery 0 8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min 1 7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min
‘index’ : dict like {index -> {column -> value}}
以索引为key,以列字段构成的字典为键值。如:
s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}' df = pd. read_json( s, orient= 'index' ) df
‘columns’ : dict like {column -> {index -> value}}
这种处理的就是以列为键,对应一个值字典的对象。这个字典对象以索引为键,以值为键值构成的json字符串。如下图所示:
s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}' df = pd. read_json( s, orient= 'columns' ) df
‘values’ : just the values array。
values这种我们就很常见了。就是一个嵌套的列表。里面的成员也是列表,2层的。
s = '[["a",1],["b",2]]' df = pd. read_json( s, orient= 'values' ) df
十一、pandas读取sql语句 import numpy as npimport pandas as pdimport pymysqldef conn ( sql) : conn = pymysql. connect( host= "localhost" , port= 3306 , user= "root" , passwd= "123" , db= "db1" , ) try : data = pd. read_sql( sql, con= conn) return data except Exception as e: print ( "SQL is not correct!" ) finally : conn. close( ) sql = "select * from test1 limit 0, 10" data = conn( sql) print ( data. columns. tolist( ) ) print ( data)