This blog post is my personal study notes for Morvan Python's Numpy & Pandas data-processing tutorial.
For an introduction to numpy, see 【python】numpy.
Last updated: 2018-12-10.
Comparing with Python's built-in types: Numpy is like a list, with no value labels, while Pandas is like a dictionary. Pandas is built on top of Numpy and makes Numpy-centric applications much easier.
To use pandas, you first need to know its two main data structures: Series and DataFrame.
参考:https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/3-1-pd-intro/
import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])
s
output
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index, an integer index from 0 to N-1 (N is the length) is created automatically.
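If you do want descriptive labels, you can pass an index explicitly. A minimal sketch (the labels 'a', 'b', 'c' are just for illustration):
s2 = pd.Series([1, 3, 6], index=['a', 'b', 'c'])  # index labels supplied by hand
s2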
A DataFrame is a tabular data structure. It holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a large dictionary of Series.
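A quick sketch of that "dictionary of Series" view (the column names 'a' and 'b' are just examples): each Series becomes one column, aligned on its index.
s_a = pd.Series([1, 2, 3])
s_b = pd.Series([4.0, 5.0, 6.0])
pd.DataFrame({'a': s_a, 'b': s_b})  # each Series becomes a column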
dates = pd.date_range('20181129',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
df
output
DatetimeIndex(['2018-11-29', '2018-11-30', '2018-12-01', '2018-12-02',
'2018-12-03', '2018-12-04'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(3,2))
df
output
df2 = pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo'})
df2
output
df2.dtypes
output
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
df2.index # row index
output
Int64Index([0, 1, 2, 3], dtype='int64')
df2.columns # column index
output
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df2.values # values
output
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
df2.describe()
df2.T
df2.sort_index(axis=1,ascending=False) # sort columns in descending order
df2.sort_index(axis=0,ascending=False) # sort rows in descending order
df2.sort_values(by='E') # sort by the values in column E
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df
output
Select column A; two equivalent ways:
print(df['A'],'\n')
print(df.A)
output
2018-11-29 0
2018-11-30 4
2018-12-01 8
2018-12-02 12
2018-12-03 16
2018-12-04 20
Freq: D, Name: A, dtype: int32
2018-11-29 0
2018-11-30 4
2018-12-01 8
2018-12-02 12
2018-12-03 16
2018-12-04 20
Freq: D, Name: A, dtype: int32
Select the first 3 rows; two equivalent ways:
print(df[0:3],'\n')
print(df['20181129':'20181201'])
output
A B C D
2018-11-29 0 1 2 3
2018-11-30 4 5 6 7
2018-12-01 8 9 10 11
A B C D
2018-11-29 0 1 2 3
2018-11-30 4 5 6 7
2018-12-01 8 9 10 11
# select by label:loc
print(df.loc['20181130'],'\n') # index
print(df.loc['20181130',['A','B']],'\n') # index
print(df.loc[:,['A','B']])
output
A 4
B 5
C 6
D 7
Name: 2018-11-30 00:00:00, dtype: int32
A 4
B 5
Name: 2018-11-30 00:00:00, dtype: int32
A B
2018-11-29 0 1
2018-11-30 4 5
2018-12-01 8 9
2018-12-02 12 13
2018-12-03 16 17
2018-12-04 20 21
# select by position: iloc
print(df.iloc[3],'\n')
print(df.iloc[3,1],'\n')
print(df.iloc[3:5,1:3],'\n')
print(df.iloc[[1,3,5],1:3])
output
A 12
B 13
C 14
D 15
Name: 2018-12-02 00:00:00, dtype: int32
13
B C
2018-12-02 13 14
2018-12-03 17 18
B C
2018-11-30 5 6
2018-12-02 13 14
2018-12-04 21 22
Mixing the two styles above:
# mixed selection: ix
print(df.ix[:3,['A','C']],'\n')
print(df.ix['20181129':'20181201',[0,2]])
output
A C
2018-11-29 0 2
2018-11-30 4 6
2018-12-01 8 10
A C
2018-11-29 0 2
2018-11-30 4 6
2018-12-01 8 10
In short: to pick non-adjacent rows, wrap them in a list, e.g. df.iloc[[X, Y], Z]; for a single position or a slice, plain df.iloc[X, Y] is enough.
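Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so on newer versions the same selections have to be written with loc or iloc. A sketch of the equivalents:
print(df.iloc[:3, [0, 2]], '\n')                  # first 3 rows, columns A and C, by position
print(df.loc['20181129':'20181201', ['A', 'C']])  # the same selection by label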
# boolean indexing
print(df,'\n')
print(df.A>8,'\n') # compare column A: returns a boolean Series of True/False
print(df[df.A>8]) # keep only the rows where the condition is True
output
A B C D
2018-11-29 0 1 2 3
2018-11-30 4 5 6 7
2018-12-01 8 9 10 11
2018-12-02 12 13 14 15
2018-12-03 16 17 18 19
2018-12-04 20 21 22 23
2018-11-29 False
2018-11-30 False
2018-12-01 False
2018-12-02 True
2018-12-03 True
2018-12-04 True
Freq: D, Name: A, dtype: bool
A B C D
2018-12-02 12 13 14 15
2018-12-03 16 17 18 19
2018-12-04 20 21 22 23
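Conditions can also be combined; use & and | with parentheses (Python's and/or do not work element-wise). A small sketch:
print(df[(df.A > 4) & (df.D < 20)])   # rows where both conditions hold
print(df[(df.A < 4) | (df.D > 20)])   # rows where either condition holds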
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df
df.iloc[2,2] = 1111 # set a single value by position
df.loc['2018-12-02','B'] = 222 # set a single value by label
df
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
print(df,'\n')
df[df.A>4] = 0
print(df)
output
Every row where A > 4 has been set to 0.
To change only column A, use the following:
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.A[df.A>4] = 0
df
Alternatively, set the values in column B that meet the condition to 0:
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.B[df.A>4] = 0 # in column B, zero out the rows where A > 4
df
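The assignments above use chained indexing (df.B[df.A > 4] = 0), which newer pandas versions flag with a SettingWithCopyWarning. A sketch of the recommended .loc form, which selects and assigns in one step (the column-A variant is analogous):
df.loc[df.A > 4, 'B'] = 0   # same effect as df.B[df.A > 4] = 0, without the warning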
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.loc['2018-11-29',:] = np.nan # set an entire row to nan
df.loc[:,'A'] = np.nan # set an entire column to nan
df.iloc[2,2] = np.nan # set a single cell to nan
df
Adding rows and columns
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df['F'] = np.nan # add a new column
df.loc['2018-12-05'] = np.nan # add a new row
df
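When the new column is built from a Series, pandas aligns it on the index rather than by position. A minimal sketch (the column name 'E' is just an example); any row label missing from the Series is filled with NaN:
df['E'] = pd.Series(np.arange(6), index=dates)  # aligned to the existing row index
df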
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df
To drop rows or columns that contain NaN directly, use dropna.
1) Drop every row that contains nan:
df.dropna(axis=0)
2) Drop every column that contains nan:
df.dropna(axis=1)
3) The how parameter
The default is 'any': a row or column is dropped as soon as it contains a single nan. It can be changed to 'all', in which case a row or column is dropped only when all of its values are nan.
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df['F'] = np.nan
df.iloc[0,-1] = 0
print(df)
print(df.dropna(axis=1,how='any'))
print(df.dropna(axis=1,how='all'))
To replace NaN with some other value, for example 0:
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df.fillna(value=0)
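fillna also accepts a Series, so each column can be filled with its own statistic. A sketch that fills each column's NaN with that column's mean:
df.fillna(value=df.mean())  # column-wise means as fill values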
To check for missing data (NaN); True marks a missing value:
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
df.isnull()
Combined with np.any() this becomes even handier:
import pandas as pd
import numpy as np
dates = pd.date_range('20181129',periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
np.any(df.isnull()==True)
output
True
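pandas itself offers the same check; a sketch of a few common variants:
print(df.isnull().any())         # True for every column that has at least one NaN
print(df.isnull().sum())         # number of NaN values per column
print(df.isnull().values.any())  # a single True/False for the whole frame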
Very simple and convenient: importing is always read_XXX and exporting is always to_XXX.
http://pandas.pydata.org/pandas-docs/stable/io.html
Create a new Excel file to try it out.
1) Import
import pandas as pd
data = pd.read_excel('C://Users/Administrator/Desktop/1.xlsx')
print(data)
output
StudentID name age genda
0 1 A 18 男
1 2 B 19 女
An index is added automatically by default.
2) Export
data.to_pickle('C://Users/Administrator/Desktop/student.pickle')
A student.pickle file is generated in the given directory. Very convenient.
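The same read_XXX / to_XXX pattern covers other formats as well; a quick sketch with CSV (the paths follow the example above and are only placeholders):
data.to_csv('C://Users/Administrator/Desktop/student.csv', index=False)  # export
data2 = pd.read_csv('C://Users/Administrator/Desktop/student.csv')       # import it back
print(data2)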
When pandas handles several datasets, you often need to combine them. concat is a basic way of merging, and it has many parameters for shaping the result into the form you want.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((2,2))*0, columns=['a','b'])
df2 = pd.DataFrame(np.ones((2,2))*1, columns=['a','b'])
df3 = pd.DataFrame(np.ones((2,2))*2, columns=['a','b'])
concat defaults to axis = 0 (stacking vertically).
res = pd.concat([df1,df2,df3],axis=0,ignore_index=True) # ignore_index=True renumbers the rows 0..n-1
res
res = pd.concat([df1,df2,df3],axis=1,ignore_index=True) # axis=1 places the frames side by side; ignore_index=True renumbers the columns
res
The join parameter can be 'inner' or 'outer'.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
1) inner: keep only the overlapping labels (the shared columns when axis=0, the shared index when axis=1).
res = pd.concat([df1,df2],axis=0,join = 'inner',ignore_index=True)
res
res = pd.concat([df1,df2],axis=1,join = 'inner',ignore_index=True)
res
2) outer: merge everything and fill the missing spots with nan.
res = pd.concat([df1,df2],axis=0,join = 'outer',ignore_index=True)
res
res = pd.concat([df1,df2],axis=1,join = 'outer',ignore_index=True)
res
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
res = pd.concat([df1,df2],axis = 1)
res
Merge horizontally, restricted to df1.index:
res = pd.concat([df1,df2],axis = 1,join_axes=[df1.index])
res
Merge vertically, restricted to df1.columns:
res = pd.concat([df1,df2],axis = 0,join_axes=[df1.columns])
res
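Note that join_axes was deprecated in pandas 0.25 and removed in 1.0; on newer versions the same results can be obtained with reindex. A sketch of the equivalents:
res = pd.concat([df1, df2], axis=1).reindex(df1.index)           # restrict the rows to df1.index
res = pd.concat([df1, df2], axis=0).reindex(columns=df1.columns) # restrict the columns to df1.columns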
append only concatenates vertically; there is no horizontal version.
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
s1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
s2 = pd.Series([1,2,3,4], index=['a','b','c','d'])
1) Append df2 to df1:
res = df1.append(df2,ignore_index=True)
res
2) Append df2 and df3 to df1:
res = df1.append([df2,df3],ignore_index=True)
res
3) Append a single row (a Series):
res = df1.append(s1,ignore_index=True)
res
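append itself was deprecated in pandas 1.4 and removed in 2.0; pd.concat covers all three cases. A sketch of the equivalents:
res = pd.concat([df1, df2], ignore_index=True)             # append df2 to df1
res = pd.concat([df1, df2, df3], ignore_index=True)        # append several frames
res = pd.concat([df1, s1.to_frame().T], ignore_index=True) # append a Series as one row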
merge in pandas is similar to concat, but it is mainly used for two datasets that share key columns, aligning them on those keys. It is also commonly used for database-style operations.
import pandas as pd
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
Merge on key:
res = pd.merge(left,right,on = 'key')
res
import pandas as pd
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
Merge on the key1 and key2 columns; how can be one of ['left', 'right', 'outer', 'inner'] (the default is 'inner'), and the four variants are shown below.
res = pd.merge(left,right,on=['key1','key2'],how = 'inner')
res
Note that A2 and B2 from the left frame are matched twice.
res = pd.merge(left,right,on=['key1','key2'],how = 'outer')
res
Missing values are filled with NaN.
res = pd.merge(left,right,on=['key1','key2'],how = 'left')
res
res = pd.merge(left,right,on=['key1','key2'],how = 'right')
res
indicator=True records, in a new column, how each row was merged.
import pandas as pd
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
res
You can also name that column yourself (the default is _merge):
res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
res
import pandas as pd
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
Note that left_index and right_index must both be True (so that the merge is done on the indexes).
res = pd.merge(left, right, left_index=True, right_index=True, how='outer',indicator='indicator_column')
res
res = pd.merge(left, right, left_index=True, right_index=True, how='inner',indicator='indicator_column')
res
res = pd.merge(left, right, left_index=True, right_index=True, how='left')
res
res = pd.merge(left, right, left_index=True, right_index=True, how='right')
res
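For pure index-on-index merges like these, DataFrame.join is a convenient shorthand; a small sketch:
res = left.join(right, how='outer')  # same as merging on both indexes
res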
import pandas as pd
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})
There are two age columns:
res = pd.merge(boys, girls, on='k', how='inner')
res
By default pandas appends _x and _y; use suffixes to choose the names yourself.
res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
res
Official pandas plotting documentation:
http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# plot data
#Series
data = pd.Series(np.random.randn(1000),index=np.arange(1000))
# take the cumulative sum so the trend is easier to see
data = data.cumsum()
data.plot()
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# plot data
# DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=['A','B','C','D'])
data = data.cumsum()
data.plot()
# plot methods:
# bar, hist,box,kde,area,scatter,hexbin,pie
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class 1')
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class 2',ax = ax)
plt.show()
import pandas as pd
data = [[1,2,3],
[4,5,6],
[7,8,9]]
df = pd.DataFrame(data, columns=[0,1,2])
print(df)
def add_num(x):
    return f"{x}01"
df.loc[:,0] = df.loc[:,0].apply(add_num)
print(df)
output
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
0 1 2
0 101 2 3
1 401 5 6
2 701 8 9
A more concise way to write it:
df.loc[:,0] = df.loc[:,0].apply(lambda x:f"{x}01")
To modify a row instead:
df.loc[0,:] = df.loc[0,:].apply(lambda x:f"{x}01")
print(df)
output
0 1 2
0 101 201 301
1 4 5 6
2 7 8 9
Another example:
import pandas as pd
data = [["a",'2','3'],
["b",'5','6'],
["c",'8','9']]
df = pd.DataFrame(data, columns=[0,1,2])
print(df)
df.loc[:,0] = df.loc[:,0].apply(lambda x:x.title())
print(df)
output
0 1 2
0 a 2 3
1 b 5 6
2 c 8 9
0 1 2
0 A 2 3
1 B 5 6
2 C 8 9
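apply works column by column (or row by row with axis=1); for an element-wise operation over the whole frame, applymap (renamed DataFrame.map in pandas 2.1) does the job. A sketch on the same data:
print(df.applymap(lambda x: x.title()))  # title-case every element of the frame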