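Learning Pandas: Numerical Operations
- Numerical computation and basic statistics
  - Common mathematical and statistical methods
  - Basic parameters: axis, skipna
  - Main mathematical methods, applicable to Series and DataFrame (1)
  - Main mathematical methods, applicable to Series and DataFrame (2): cumulative sum and cumulative product
  - Main mathematical methods, unique values: .unique()
  - Main mathematical methods, value counts (frequencies): .value_counts()
  - Main mathematical methods, membership test (contains): .isin()
- Working with text data
  - Access through .str, which automatically skips missing/NA values
  - Common string methods (1) - lower, upper, len, startswith, endswith
  - Common string methods (2) - strip
  - Common string methods (3) - replace
  - Common string methods (4) - split, rsplit
  - String indexing
- Merging: merge, join
  - merge → similar to VLOOKUP in Excel
  - merge → the how parameter → join type
  - merge → parameters left_on, right_on, left_index, right_index → when the key is not a single shared column, the left and right keys can be set separately
  - merge → the sort parameter
  - DataFrame.join() → joins directly on the index
- Concatenating and patching: concat, combine_first
  - Concatenation: concat
  - Join mode: join, join_axes
  - Overriding column names (!!!)
  - Patching: DataFrame.combine_first()
- Deduplication and replacement: .duplicated / .replace
  - Deduplication: .duplicated
  - Replacement: .replace
- Grouping data !!! (important)
  - Grouping with groupby
  - Grouping - iterable object
  - Grouping along the other axis
  - Grouping by a dict or Series
  - Grouping by a function
  - Group aggregation methods
  - Multiple aggregations per group: agg()
- Group transformation and general "split-apply-combine"
  - Group transformation: transform
  - General-purpose groupby method: apply
- Pivot tables and cross-tabulations
  - Pivot table: pivot_table
  - Cross-tabulation: crosstab
- Reading data
  - Reading generic delimited data: read_table
  - Reading CSV data: read_csv
  - Reading Excel data: read_excel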
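Numerical computation and basic statistics
'''
[Lesson 2.14] Numerical computation and basic statistics
Common mathematical and statistical methods
'''
Common mathematical and statistical methods
Basic parameters: axis, skipna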
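import numpy as np
import pandas as pd
df = pd.DataFrame({
'key1':[4,5,3,np.nan,2],
'key2':[1,2,np.nan,4,5],
'key3':[1,2,3,'j','k']},
index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype)
print('-----')
# numeric_only=True leaves out the object column key3, which recent pandas versions no longer drop silently
m1 = df.mean(numeric_only=True)
print(m1,type(m1))
print('Mean of a single column:',df['key2'].mean())
print('-----')
# axis=1 computes the mean across each row (the default axis=0 works down each column)
m2 = df.mean(axis=1, numeric_only=True)
print(m2)
print('-----')
# skipna=False keeps NaN in the calculation, so any column containing NaN returns NaN
m3 = df.mean(skipna=False, numeric_only=True)
print(m3)
print('-----')
key1 key2 key3
a 4.0 1.0 1
b 5.0 2.0 2
c 3.0 NaN 3
d NaN 4.0 j
e 2.0 5.0 k
float64 float64 object
-----
key1 3.5
key2 3.0
dtype: float64
Mean of a single column: 3.0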
-----
a 2.5
b 3.5
c 3.0
d 4.0
e 3.5
dtype: float64
-----
key1 NaN
key2 NaN
dtype: float64
-----
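As a compact recap of the two parameters above, the following sketch recomputes the means on a small frame and stores each variant under its own name. The frame `demo` and the variable names are made up for this illustration, and `numeric_only=True` is passed on the assumption that a recent pandas version is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: one NaN in each numeric column plus a text column.
demo = pd.DataFrame(
    {
        "key1": [4, 5, 3, np.nan, 2],
        "key2": [1, 2, np.nan, 4, 5],
        "key3": ["1", "2", "3", "j", "k"],
    },
    index=list("abcde"),
)

# Default axis=0: one mean per column, NaN values skipped.
col_mean = demo.mean(numeric_only=True)

# axis=1: one mean per row, computed over the numeric columns only.
row_mean = demo.mean(axis=1, numeric_only=True)

# skipna=False: a single NaN makes the whole column mean NaN.
strict_mean = demo.mean(numeric_only=True, skipna=False)

print(col_mean)
print(row_mean)
print(strict_mean)
```

Row `c` has only one non-missing numeric value, so its row mean is simply that value.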
Main mathematical methods, applicable to Series and DataFrame (1)
df = pd.DataFrame({
'key1':np.arange(10),
'key2':np.random.rand(10)*10})
print(df)
print('-----')
print(df.count(),'→ count: number of non-NA values\n')
print(df.min(),'→ min: minimum\n',df['key2'].max(),'→ max: maximum\n')
print(df.quantile(q=0.75),'→ quantile: quantile, with q setting the position\n')
print(df.sum(),'→ sum: sum\n')
print(df.mean(),'→ mean: mean\n')
print(df.median(),'→ median: median (50% quantile)\n')
print(df.std(),'\n',df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')
key1 key2
0 0 0.327398
1 1 0.959262
2 2 6.455080
3 3 6.275359
4 4 6.138641
5 5 8.853716
6 6 4.525300
7 7 9.740657
8 8 9.229833
9 9 0.949789
-----
key1 10
key2 10
dtype: int64 → count: number of non-NA values
key1 0.000000
key2 0.327398
dtype: float64 → min: minimum
9.740656570973671 → max: maximum
key1 6.750000
key2 8.254057
Name: 0.75, dtype: float64 → quantile: quantile, with q setting the position
key1 45.000000
key2 53.455034
dtype: float64 → sum: sum
key1 4.500000
key2 5.345503
dtype: float64 → mean: mean
key1 4.500
key2 6.207
dtype: float64 → median: median (50% quantile)
key1 3.027650
key2 3.556736
dtype: float64
key1 9.166667
key2 12.650371
dtype: float64 → std, var: standard deviation and variance
key1 0.000000
key2 -0.329924
dtype: float64 → skew: sample skewness
key1 -1.200000
key2 -1.430276
dtype: float64 → kurt: sample kurtosis
Main mathematical methods, applicable to Series and DataFrame (2): cumulative sum and cumulative product
df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')
df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')
print(df.cummax(),'\n',df.cummin(),'→ cummax, cummin: cumulative maximum and cumulative minimum\n')
key1 key2 key1_s key2_s
0 0 0.327398 0 0.327398
1 1 0.959262 1 1.286660
2 2 6.455080 3 7.741740
3 3 6.275359 6 14.017099
4 4 6.138641 10 20.155740
5 5 8.853716 15 29.009456
6 6 4.525300 21 33.534756
7 7 9.740657 28 43.275412
8 8 9.229833 36 52.505245
9 9 0.949789 45 53.455034 → cumsum: cumulative sum
key1 key2 key1_s key2_s key1_p key2_p
0 0 0.327398 0 0.327398 0 0.327398
1 1 0.959262 1 1.286660 0 0.314061
2 2 6.455080 3 7.741740 0 2.027286
3 3 6.275359 6 14.017099 0 12.721946
4 4 6.138641 10 20.155740 0 78.095454
5 5 8.853716 15 29.009456 0 691.434982
6 6 4.525300 21 33.534756 0 3128.950808
7 7 9.740657 28 43.275412 0 30478.035251
8 8 9.229833 36 52.505245 0 281307.179260
9 9 0.949789 45 53.455034 0 267182.375541 → cumprod: cumulative product
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 0.327398 0.0 0.327398 0.0 0.327398
1 1.0 0.959262 1.0 1.286660 0.0 0.327398
2 2.0 6.455080 3.0 7.741740 0.0 2.027286
3 3.0 6.455080 6.0 14.017099 0.0 12.721946
4 4.0 6.455080 10.0 20.155740 0.0 78.095454
5 5.0 8.853716 15.0 29.009456 0.0 691.434982
6 6.0 8.853716 21.0 33.534756 0.0 3128.950808
7 7.0 9.740657 28.0 43.275412 0.0 30478.035251
8 8.0 9.740657 36.0 52.505245 0.0 281307.179260
9 9.0 9.740657 45.0 53.455034 0.0 281307.179260
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 0.327398 0.0 0.327398 0.0 0.327398
1 0.0 0.327398 0.0 0.327398 0.0 0.314061
2 0.0 0.327398 0.0 0.327398 0.0 0.314061
3 0.0 0.327398 0.0 0.327398 0.0 0.314061
4 0.0 0.327398 0.0 0.327398 0.0 0.314061
5 0.0 0.327398 0.0 0.327398 0.0 0.314061
6 0.0 0.327398 0.0 0.327398 0.0 0.314061
7 0.0 0.327398 0.0 0.327398 0.0 0.314061
8 0.0 0.327398 0.0 0.327398 0.0 0.314061
9 0.0 0.327398 0.0 0.327398 0.0 0.314061 → cummax, cummin: cumulative maximum and cumulative minimum
Main mathematical methods, unique values: .unique()
s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq,type(sq))
print(pd.Series(sq))
sq.sort()
print(sq)
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g']
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']
Main mathematical methods, value counts (frequencies): .value_counts()
sc = s.value_counts(sort = False)
print(sc)
a 2
d 2
v 1
g 2
s 2
f 1
c 1
dtype: int64
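The counts above can also be turned into relative frequencies. The sketch below reuses the same letters and relies on the `normalize` parameter of `value_counts`; the variable names `counts` and `freq` are chosen only for this example.

```python
import pandas as pd

letters = pd.Series(list("asdvasdcfgg"))

# Default behaviour: counts sorted from most to least frequent.
counts = letters.value_counts()

# normalize=True divides every count by the number of non-missing values.
freq = letters.value_counts(normalize=True)

print(counts)
print(freq)
```

Because the frequencies are shares of the whole Series, they add up to 1.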
Main mathematical methods, membership test (contains): .isin()
s = pd.Series(np.arange(10,15))
df = pd.DataFrame({
'key1':list('asdcbvasd'),
'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')
print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
0 10
1 11
2 12
3 13
4 14
dtype: int32
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
-----
0 False
1 False
2 False
3 False
4 True
dtype: bool
key1 key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
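The boolean result of `.isin()` is most often used as a row filter. The following sketch assumes the same `key1`/`key2` layout as above; the names `frame`, `mask`, `selected` and `excluded` are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({"key1": list("asdcbvasd"), "key2": range(4, 13)})

# True where key1 is one of the listed values.
mask = frame["key1"].isin(["a", "d"])

# Keep the matching rows, or invert the mask with ~ to keep the rest.
selected = frame[mask]
excluded = frame[~mask]

print(selected)
print(excluded)
```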
Classroom exercises
ts1 = pd.DataFrame(np.random.rand(5,2)*100,columns=['key1','key2'])
print("The created DataFrame is:")
print(ts1)
print('------')
print("Mean of ts1['key1']:")
print(ts1['key1'].mean())
print('------')
print("Median of ts1['key1']:")
print(ts1['key1'].median())
print('------')
print("Mean of ts1['key2']:")
print(ts1['key2'].mean())
print('------')
print("Median of ts1['key2']:")
print(ts1['key2'].median())
print('------')
print("Cumulative sums of ts1['key1'] and ts1['key2']:")
ts1['key1_cumsum'] = ts1['key1'].cumsum()
ts1['key2_cumsum'] = ts1['key2'].cumsum()
print(ts1)
The created DataFrame is:
key1 key2
0 0.445031 70.879116
1 40.164080 8.052621
2 4.118756 72.932482
3 46.818794 12.744497
4 37.192819 18.393109
------
Mean of ts1['key1']:
25.747896160805663
------
Median of ts1['key1']:
37.192819239210486
------
Mean of ts1['key2']:
36.60036488397306
------
Median of ts1['key2']:
18.39310866824474
------
Cumulative sums of ts1['key1'] and ts1['key2']:
key1 key2 key1_cumsum key2_cumsum
0 0.445031 70.879116 0.445031 70.879116
1 40.164080 8.052621 40.609112 78.931737
2 4.118756 72.932482 44.727868 151.864219
3 46.818794 12.744497 91.546662 164.608716
4 37.192819 18.393109 128.739481 183.001824
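def f(s):
    # The Series holds only unique values when unique() drops nothing
    s2 = s.unique()
    if len(s) == len(s2):
        print('------\nAll values in this array are unique')
    else:
        print('------\nThis array contains duplicate values')
d = input('Enter a set of elements, separated by ASCII commas:\n')
lst = d.split(',')
ds = pd.Series(lst)
f(ds)
Enter a set of elements, separated by ASCII commas:
a,sc,2,2,2,d,s,s,a
------
This array contains duplicate values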
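Working with text data
'''
[Lesson 2.15] Text data
pandas provides a set of string methods that make it easy to operate on every element of an array
'''
Access through .str, which automatically skips missing/NA values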
s = pd.Series(['A','bB','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({
'key1':list('abcdef'),
'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')
print(s.str.count('b'))
print(df['key2'].str.upper())
print(df['key2'])
print('-----')
df.columns = df.columns.str.upper()
print(df)
0 A
1 bB
2 C
3 bbhello
4 123
5 NaN
6 hj
dtype: object
key1 key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
0 0.0
1 1.0
2 0.0
3 2.0
4 0.0
5 NaN
6 0.0
dtype: float64
0 HEE
1 FV
2 W
3 HIJA
4 123
5 NaN
Name: key2, dtype: object
key1 key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
KEY1 KEY2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
Common string methods (1) - lower, upper, len, startswith, endswith
s = pd.Series(['A','b','bbhello','123',np.nan])
print(s.str.lower(),'→ lower: lowercase\n')
print(s.str.upper(),'→ upper: uppercase\n')
print(s.str.len(),'→ len: string length\n')
print(s.str.startswith('b'),'→ checks whether the string starts with b\n')
print(s.str.endswith('3'),'→ checks whether the string ends with 3\n')
0 a
1 b
2 bbhello
3 123
4 NaN
dtype: object → lower: lowercase
0 A
1 B
2 BBHELLO
3 123
4 NaN
dtype: object → upper: uppercase
0 1.0
1 1.0
2 7.0
3 3.0
4 NaN
dtype: float64 → len: string length
0 False
1 True
2 True
3 False
4 NaN
dtype: object → checks whether the string starts with b
0 False
1 False
2 False
3 True
4 NaN
dtype: object → checks whether the string ends with 3
Common string methods (2) - strip
s = pd.Series([' jack', 'ji ll ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
index=range(3))
print(s)
print(df)
print('-----')
print(s.str.strip())
print(s.str.lstrip())
print(s.str.rstrip())
df.columns = df.columns.str.strip()
print(df)
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
Column A Column B
0 1.178373 -0.770705
1 0.611277 0.705297
2 -1.106696 1.455232
-----
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
0 jack
1 ji ll
2 jesse
3 frank
dtype: object
Column A Column B
0 1.178373 -0.770705
1 0.611277 0.705297
2 -1.106696 1.455232
Common string methods (3) - replace
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
-Column-A- -Column-B-
0 -1.140552 -2.215192
1 -0.386697 1.323757
2 -0.288860 1.405160
heheColumn-A- heheColumn-B-
0 -1.140552 -2.215192
1 -0.386697 1.323757
2 -0.288860 1.405160
Common string methods (4) - split, rsplit
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s)
print(s.str.split(','),type(s.str.split(',')))
print('-----')
print(s.str.split(',')[0],type(s.str.split(',')[0]))
print('-----')
print(s.str.split(',').str)
print(s.str.split(',').str[1],type(s.str.split(',').str[1]),'split.str..')
print(s.str.split(',').str.get(1))
print('-----')
print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')
df = pd.DataFrame({
'key1':['a,b,c','1,2,3',[':,., ']],
'key2':['a-b-c','1-2-3',[':-.- ']]})
print(df['key2'].str.split('-'))
0 a,b,c
1 1,2,3
2 [a,,,c]
3 NaN
dtype: object
0 [a, b, c]
1 [1, 2, 3]
2 NaN
3 NaN
dtype: object
-----
['a', 'b', 'c']
-----
0 b
1 2
2 NaN
3 NaN
dtype: object split.str..
0 b
1 2
2 NaN
3 NaN
dtype: object
-----
0 1 2
0 a b c
1 1 2 3
2 NaN NaN NaN
3 NaN NaN NaN
0 1
0 a b,c
1 1 2,3
2 NaN NaN
3 NaN NaN
0 1
0 a,b c
1 1,2 3
2 NaN NaN
3 NaN NaN
-----
0 [a, b, c]
1 [1, 2, 3]
2 NaN
Name: key2, dtype: object
String indexing
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({
'key1':list('abcdef'),
'key2':['hee','fv','w','hija','123',np.nan]})
print(s.str[0])
print(s.str[:2])
print(df['key2'].str[0])
0 A
1 b
2 C
3 b
4 1
5 NaN
6 h
dtype: object
0 A
1 b
2 C
3 bb
4 12
5 NaN
6 hj
dtype: object
0 h
1 f
2 w
3 h
4 1
5 NaN
Name: key2, dtype: object
Classroom exercises
df = pd.DataFrame({
'name':['jack','tom','Marry','zack','heheda'],
'gender':['M ','M',' F',' M ',' F'],
'score':['90-92-89','89-78-88','90-92-95','78-88-76','60-60-67']})
print(df)
df['gender'] = df['gender'].str.strip()
df['name'] = df['name'].str.capitalize()
df = df.reindex(['gender','name','score'],axis=1)
sf = df['score'].str.split('-', expand=True)
print(sf,type(sf))
print(sf[0],type(sf[0]))
df['math'] = sf[0]
df['english'] = sf[1]
df['art'] = sf[2]
del df['score']
print(df)
name gender score
0 jack M 90-92-89
1 tom M 89-78-88
2 Marry F 90-92-95
3 zack M 78-88-76
4 heheda F 60-60-67
0 1 2
0 90 92 89
1 89 78 88
2 90 92 95
3 78 88 76
4 60 60 67
0 90
1 89
2 90
3 78
4 60
Name: 0, dtype: object
gender name math english art
0 M Jack 90 92 89
1 M Tom 89 78 88
2 F Marry 90 92 95
3 M Zack 78 88 76
4 F Heheda 60 60 67
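The columns produced by `str.split(..., expand=True)` in the exercise above still hold strings (the output reports `dtype: object`), so they need converting before any arithmetic. A minimal sketch of that step follows; the column names mirror the exercise, while the `total` column is a hypothetical addition for illustration.

```python
import pandas as pd

scores = pd.Series(["90-92-89", "89-78-88", "60-60-67"])

# Split into three string columns and label them.
parts = scores.str.split("-", expand=True)
parts.columns = ["math", "english", "art"]

# Convert the strings to integers so that numeric methods work.
numeric = parts.astype(int)
numeric["total"] = numeric.sum(axis=1)

print(numeric)
```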
Merging: merge, join
'''
[Lesson 2.16] Merging: merge, join
pandas provides full-featured, high-performance in-memory join operations that closely resemble those of relational databases such as SQL
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False)
'''
merge → similar to VLOOKUP in Excel
df1 = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({
'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2']))
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
------
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
merge → the how parameter → join type
print(pd.merge(df3, df4,on=['key1','key2'], how = 'inner'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))
print('------')
print(df3)
print(df4)
print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
------
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
------
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
------
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
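When comparing the four join types it can help to see where each row came from. The `indicator` parameter listed in the `pd.merge` signature above adds a `_merge` column for that purpose. The sketch below uses two small made-up frames; `left`, `right`, `origin` and `unmatched` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# indicator=True adds a _merge column: left_only, right_only or both.
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
print(merged)

# Look up the origin of each key as a plain string.
origin = merged.set_index("key")["_merge"].astype(str)

# Rows that found no partner in the other frame.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

An inner join would keep only the rows marked `both`, a left join those marked `both` or `left_only`, and a right join those marked `both` or `right_only`.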
merge → parameters left_on, right_on, left_index, right_index → when the key is not a single shared column, the left and right keys can be set separately
df1 = pd.DataFrame({
'lkey':list('bbacaab'),
'data1':range(7)})
df2 = pd.DataFrame({
'rkey':list('abd'),
'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')
df1 = pd.DataFrame({
'key':list('abcdfeg'),
'data1':range(7)})
df2 = pd.DataFrame({
'date2':range(100,105)},
index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
lkey data1 rkey date2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
------
key data1 date2
0 a 0 100
1 b 1 101
2 c 2 102
3 d 3 103
5 e 5 104
merge → the sort parameter
df1 = pd.DataFrame({
'key':list('bbacaab'),
'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({
'key':list('abd'),
'date2':[11,2,33]})
x1 = pd.merge(df1,df2, on = 'key', how = 'outer')
x2 = pd.merge(df1,df2, on = 'key', sort=True, how = 'outer')
print(x1)
print(x2)
print('------')
print(x2.sort_values('data1'))
key data1 date2
0 b 1.0 2.0
1 b 3.0 2.0
2 b 7.0 2.0
3 a 2.0 11.0
4 a 5.0 11.0
5 a 9.0 11.0
6 c 4.0 NaN
7 d NaN 33.0
key data1 date2
0 a 2.0 11.0
1 a 5.0 11.0
2 a 9.0 11.0
3 b 1.0 2.0
4 b 3.0 2.0
5 b 7.0 2.0
6 c 4.0 NaN
7 d NaN 33.0
------
key data1 date2
3 b 1.0 2.0
0 a 2.0 11.0
4 b 3.0 2.0
6 c 4.0 NaN
1 a 5.0 11.0
5 b 7.0 2.0
2 a 9.0 11.0
7 d NaN 33.0
DataFrame.join() → joins directly on the index
left = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({
'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print(left)
print(right)
print(left.join(right))
print(left.join(right, how='outer'))
print('-----')
df1 = pd.DataFrame({
'key':list('bbacaab'),
'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({
'key':list('abd'),
'date2':[11,2,33]})
print(df1)
print(df2)
print(pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2')))
print(df1.join(df2['date2']))
print('-----')
left = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({
'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=['K0', 'K1'])
print(left)
print(right)
print(left.join(right, on = 'key'))
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
-----
key data1
0 b 1
1 b 3
2 a 2
3 c 4
4 a 5
5 a 9
6 b 7
key date2
0 a 11
1 b 2
2 d 33
key_1 data1 key_2 date2
0 b 1 a 11
1 b 3 b 2
2 a 2 d 33
key data1 date2
0 b 1 11.0
1 b 3 2.0
2 a 2 33.0
3 c 4 NaN
4 a 5 NaN
5 a 9 NaN
6 b 7 NaN
-----
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K0
3 A3 B3 K1
C D
K0 C0 D0
K1 C1 D1
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
Classroom exercises
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['key'] = list('abc')
print('df1 created:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('df2 created:')
df2['key'] = list('bcd')
print(df2)
df3 = pd.merge(df1,df2,on='key',how='outer')
print('Merged df3 (outer join, union of keys):')
print(df3)
df1 created:
values1 key
0 0.363363 a
1 0.705128 b
2 0.514941 c
df2 created:
values2 key
0 0.305494 b
1 0.243707 c
2 0.816473 d
Merged df3 (outer join, union of keys):
values1 key values2
0 0.363363 a NaN
1 0.705128 b 0.305494
2 0.514941 c 0.243707
3 NaN d 0.816473
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('df1 created:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('df2 created:')
df2['rkey'] = list('bcd')
print(df2)
df3 = pd.merge(df1,df2,left_on='lkey',right_on='rkey',how='left')
print('Merged df3 (left join, keeps every row of the left frame):')
print(df3)
df1 created:
values1 lkey
0 0.625525 a
1 0.121965 b
2 0.114507 c
df2 created:
values2 rkey
0 0.406097 b
1 0.922127 c
2 0.326960 d
Merged df3 (left join, keeps every row of the left frame):
values1 lkey values2 rkey
0 0.625525 a NaN NaN
1 0.121965 b 0.406097 b
2 0.114507 c 0.922127 c
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('df1 created:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'],index=list('bcd'))
print('df2 created:')
df2['value3'] = [5,6,7]
print(df2)
df3 = pd.merge(df1,df2,left_on='lkey',right_index=True,how='inner')
print('Merged df3 (inner join, intersection of keys):')
print(df3)
df1 created:
values1 lkey
0 0.509719 a
1 0.157929 b
2 0.392352 c
df2 created:
values2 value3
b 0.805541 5
c 0.897287 6
d 0.093350 7
Merged df3 (inner join, intersection of keys):
values1 lkey values2 value3
1 0.157929 b 0.805541 5
2 0.392352 c 0.897287 6
Concatenating and patching: concat, combine_first
'''
[Lesson 2.17] Concatenating and patching: concat, combine_first
Concatenation - joins objects along an axis
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)
'''
Concatenation: concat
s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s1,s2]))
print(pd.concat([s3,s4]).sort_index())
print('-----')
print(pd.concat([s3,s4], axis=1))
print('-----')
0 1
1 2
2 3
0 2
1 3
2 4
dtype: int64
a 1
b 2
c 2
d 4
e 3
h 3
dtype: int64
-----
0 1
a 1.0 NaN
b NaN 2.0
c 2.0 NaN
d NaN 4.0
e NaN 3.0
h 3.0 NaN
-----
Join mode: join, join_axes
s5 = pd.Series([1,2,3],index = ['a','b','c'])
s6 = pd.Series([2,3,4],index = ['b','c','d'])
print(pd.concat([s5,s6], axis= 1))
print(pd.concat([s5,s6], axis= 1, join='inner'))
# join_axes was removed in pandas 1.0; reindexing the concatenated result selects the same labels
print(pd.concat([s5,s6], axis= 1).reindex(['a','b','d']))
0 1
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
0 1
b 2 2
c 3 3
0 1
a 1.0 NaN
b 2.0 2.0
d NaN 4.0
Overriding column names (!!!)
sre = pd.concat([s5,s6], keys = ['one','two'])
print(sre,type(sre))
print(sre.index)
print('-----')
sre = pd.concat([s5,s6], axis=1, keys = ['one','two'])
print(sre,type(sre))
one a 1
b 2
c 3
two b 2
c 3
d 4
dtype: int64
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
-----
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
Patching: DataFrame.combine_first()
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],[np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],index=[1, 2])
print(df1)
print(df2)
print(df1.combine_first(df2))
print('-----')
df1.update(df2)
print(df1)
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN NaN
2 NaN 7.0 NaN
0 1 2
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN -8.2
2 -5.0 7.0 4.0
-----
0 1 2
0 NaN 3.0 5.0
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
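The two outputs above differ in a way that is easy to miss: `combine_first` only fills the holes in the calling frame, whereas `update` overwrites it wherever the other frame has a value. The sketch below contrasts the two on small made-up frames `a` and `b`; the copy is made because `update` modifies its frame in place.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [np.nan, 2.0, np.nan], "y": [1.0, np.nan, 3.0]})
b = pd.DataFrame({"x": [10.0, 20.0, 30.0], "y": [np.nan, 50.0, 60.0]})

# combine_first: values in a win; b is used only where a is NaN.
patched = a.combine_first(b)

# update: values in b win wherever b is not NaN; works in place.
updated = a.copy()
updated.update(b)

print(patched)
print(updated)
```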
Classroom exercises
df1 = pd.DataFrame(np.random.rand(4,2),index=list('abcd'),columns=['value1','value2'])
print("df1 created:")
print(df1)
print('----------')
df2 = pd.DataFrame(np.random.rand(4,2),index=list('efgh'),columns=['value1','value2'])
print("df2 created:")
print(df2)
print('----------')
df3 = pd.concat([df1,df2])
print('Stacked into df3:')
print(df3)
df1 created:
value1 value2
a 0.261681 0.109421
b 0.782509 0.374875
c 0.447257 0.056709
d 0.349732 0.669266
----------
df2 created:
value1 value2
e 0.902231 0.531241
f 0.818947 0.537972
g 0.052821 0.696736
h 0.098303 0.911916
----------
Stacked into df3:
value1 value2
a 0.261681 0.109421
b 0.782509 0.374875
c 0.447257 0.056709
d 0.349732 0.669266
e 0.902231 0.531241
f 0.818947 0.537972
g 0.052821 0.696736
h 0.098303 0.911916
data = np.random.rand(4,2)
# np.nan is the spelling that works across NumPy versions (the np.NAN alias was removed in NumPy 2.0)
data[1:3,0] = np.nan
df1 = pd.DataFrame(data,index=list('abcd'),columns=['value1','value2'])
print("df1 created:")
print(df1)
print('----------')
df2 = pd.DataFrame(np.arange(8).reshape(4,2),index=list('abcd'),columns=['value1','value2'])
print("df2 created:")
print(df2)
print('----------')
df3 = df1.combine_first(df2)
print('df1 after patching:')
print(df3)
df1 created:
value1 value2
a 0.451591 0.556266
b NaN 0.943348
c NaN 0.944175
d 0.273202 0.594670
----------
df2 created:
value1 value2
a 0 1
b 2 3
c 4 5
d 6 7
----------
df1 after patching:
value1 value2
a 0.451591 0.556266
b 2.000000 0.943348
c 4.000000 0.944175
d 0.273202 0.594670
Deduplication and replacement: .duplicated / .replace
'''
[Lesson 2.18] Deduplication and replacement
.duplicated / .replace
'''
Deduplication: .duplicated
s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')
s_re = s.drop_duplicates()
print(s_re)
print('-----')
df = pd.DataFrame({
'key1':['a','a',3,4,5],
'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
dtype: bool
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 False
1 True
2 False
3 False
4 False
dtype: bool
0 False
1 True
2 False
3 True
4 False
Name: key2, dtype: bool
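By default both `duplicated` and `drop_duplicates` treat the first occurrence as the original. The sketch below shows how the `subset` and `keep` parameters change which rows survive; the frame and the names `first`, `last` and `singles` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "key1": [1, 1, 3, 4, 5],
    "key2": ["a", "a", "b", "b", "c"],
})

# Compare rows on key2 only and keep the first row of each value.
first = frame.drop_duplicates(subset="key2")

# keep='last' keeps the final occurrence instead.
last = frame.drop_duplicates(subset="key2", keep="last")

# keep=False drops every row whose key2 value occurs more than once.
singles = frame.drop_duplicates(subset="key2", keep=False)

print(first)
print(last)
print(singles)
```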
Replacement: .replace
s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'] ,np.nan))
print(s.replace({
'a':'hello world!','s':123}))
0 NaN
1 s
2 c
3 NaN
4 NaN
5 z
6 s
7 d
dtype: object
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 z
6 NaN
7 d
dtype: object
0 hello world!
1 123
2 c
3 hello world!
4 hello world!
5 z
6 123
7 d
dtype: object
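Grouping data !!! (important)
'''
[Lesson 2.19] Grouping data
Grouped statistics - the groupby feature
① Split the data into groups according to some criteria
② Apply a function to each group independently
③ Combine the results into a single data structure
A DataFrame can be grouped along its rows (axis=0) or its columns (axis=1); a function is applied to each group to produce a new value, and the results are then combined into the final result object.
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
'''
Grouping with groupby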
df = pd.DataFrame({
'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
print(df)
print('------')
print(df.groupby('A'), type(df.groupby('A')))
print('------')
# numeric_only=True leaves the string column B out of the mean
a = df.groupby('A').mean(numeric_only=True)
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean()
print(a,type(a),'\n',a.columns)
print(b,type(b),'\n',b.columns)
print(c,type(c))
A B C D
0 foo one 0.172157 1.118132
1 bar one 0.323895 1.188046
2 foo two -1.048614 -0.747383
3 bar three 0.338934 1.587185
4 foo two 0.423342 -1.542578
5 bar two 0.255962 1.337651
6 foo one 0.225461 0.557273
7 foo three -0.748118 0.418550
------
------
C D
A
bar 0.306263 1.370960
foo -0.195154 -0.039201
Index(['C', 'D'], dtype='object')
C D
A B
bar one 0.323895 1.188046
three 0.338934 1.587185
two 0.255962 1.337651
foo one 0.198809 0.837702
three -0.748118 0.418550
two -0.312636 -1.144981
Index(['C', 'D'], dtype='object')
A
bar 1.370960
foo -0.039201
Name: D, dtype: float64
Grouping - iterable object
df = pd.DataFrame({
'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')
print(list(df.groupby('X')), '→ iterable object, converted directly to a list\n')
print(list(df.groupby('X'))[0], '→ each element is a tuple\n')
# Iterating over the groupby object yields (group name, sub-DataFrame) pairs
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
print(df.groupby('X').get_group('A'),'\n')
print(df.groupby('X').get_group('B'),'\n')
print('-----')
grouped = df.groupby('X')
print(grouped.groups)
print(grouped.groups['A'])
print('-----')
sz = grouped.size()
print(sz,type(sz))
print('-----')
df = pd.DataFrame({
'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
X Y
0 A 1
1 B 4
2 A 3
3 B 2
-----
[('A', X Y
0 A 1
2 A 3), ('B', X Y
1 B 4
3 B 2)] → iterable object, converted directly to a list
('A', X Y
0 A 1
2 A 3) → each element is a tuple
A
X Y
0 A 1
2 A 3
###
B
X Y
1 B 4
3 B 2
###
-----
X Y
0 A 1
2 A 3
X Y
1 B 4
3 B 2
-----
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
-----
X
A 2
B 2
dtype: int64
-----
A B C D
0 foo one 0.981468 0.473817
1 bar one -1.236826 0.028449
2 foo two -1.611723 1.444489
3 bar three 1.136316 0.881776
4 foo two 0.523383 0.707726
5 bar two -2.196340 -0.201260
6 foo one 1.014091 0.256455
7 foo three -1.700698 1.217236
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
Int64Index([7], dtype='int64')
Grouping along the other axis
df = pd.DataFrame({
'data1':np.random.rand(2),
'data2':np.random.rand(2),
'key1':[1,'b'],
'key2':['one','two']})
print(df)
print(df.dtypes,type(df.dtypes))
print('-----')
# Group the columns by their dtype; each iteration yields (dtype, sub-DataFrame)
for n,p in df.groupby(df.dtypes, axis=1):
    print(n)
    print(p)
    print('##')
data1 data2 key1 key2
0 0.572579 0.924789 1 one
1 0.575395 0.814979 b two
data1 float64
data2 float64
key1 object
key2 object
dtype: object
-----
float64
data1 data2
0 0.572579 0.924789
1 0.575395 0.814979
##
object
key1 key2
0 1 one
1 b two
##
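Column-wise grouping with `axis=1` is, as far as I know, deprecated in newer pandas releases, so treat the following as a hedged alternative rather than the method used above. For the specific case of splitting columns by dtype, `select_dtypes` gives the same two blocks without `groupby`. The frame below is made up and uses fixed numbers instead of random ones.

```python
import pandas as pd

frame = pd.DataFrame({
    "data1": [0.1, 0.2],
    "data2": [0.3, 0.4],
    "key1": ["a", "b"],
    "key2": ["one", "two"],
})

# Numeric columns only.
numeric_part = frame.select_dtypes(include="number")

# Everything that is not numeric (here: the text columns).
text_part = frame.select_dtypes(exclude="number")

print(numeric_part)
print(text_part)
```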
Grouping by a dict or Series
df = pd.DataFrame(np.arange(16).reshape(4,4),
columns = ['a','b','c','d'])
print(df)
print('-----')
mapping = {
'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
print('-----')
s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
-----
one two
0 1 5
1 9 13
2 17 21
3 25 29
-----
a one
b one
c two
d two
e three
dtype: object
one 2
three 1
two 2
dtype: int64
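If `axis=1` is unavailable in your pandas version (an assumption about newer releases that you should verify), the same column grouping can be sketched by transposing, grouping the rows with the mapping, and transposing back. The frame and mapping repeat the ones above; `by_label` is an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))
mapping = {"a": "one", "b": "one", "c": "two", "d": "two", "e": "three"}

# After transposing, the column labels become the index, so the dict
# maps each index label to its group; the unused key "e" is ignored.
by_label = frame.T.groupby(mapping).sum().T

print(by_label)
```

The result has one column per group label, matching the `one` and `two` columns printed above.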
Grouping by a function
df = pd.DataFrame(np.arange(16).reshape(4,4),
columns = ['a','b','c','d'],
index = ['abc','bcd','aa','b'])
print(df,'\n')
print(df.groupby(len).sum())
a b c d
abc 0 1 2 3
bcd 4 5 6 7
aa 8 9 10 11
b 12 13 14 15
a b c d
1 12 13 14 15
2 8 9 10 11
3 4 6 8 10
Group aggregation methods
s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: minimum and maximum of non-NaN values\n')
print(grouped.std(),'→ std, var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')
1 1
2 2
3 3
dtype: int64 → first: first non-NaN value
1 10
2 20
3 30
dtype: int64 → last: last non-NaN value
1 11
2 22
3 33
dtype: int64 → sum: sum of non-NaN values
1 5.5
2 11.0
3 16.5
dtype: float64 → mean: mean of non-NaN values
1 5.5
2 11.0
3 16.5
dtype: float64 → median: median of non-NaN values
1 2
2 2
3 2
dtype: int64 → count: number of non-NaN values
1 1
2 2
3 3
dtype: int64 → min, max: minimum and maximum of non-NaN values
1 6.363961
2 12.727922
3 19.091883
dtype: float64 → std, var: standard deviation and variance of non-NaN values
1 10
2 40
3 90
dtype: int64 → prod: product of non-NaN values
Multiple aggregations per group: agg()
df = pd.DataFrame({
'a':[1,1,2,2],
'b':np.random.rand(4),
'c':np.random.rand(4),
'd':np.random.rand(4),})
print(df)
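# A list applies every listed function to every column
print(df.groupby('a').agg(['mean','sum']))
# Renaming with a dict on a Series is no longer supported; keyword arguments name the result columns instead
print(df.groupby('a')['b'].agg(result1='mean', result2='sum'))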
a b c d
0 1 0.456934 0.286735 0.889033
1 1 0.354812 0.117281 0.476132
2 2 0.958267 0.239303 0.276428
3 2 0.840423 0.544267 0.514867
b c d
mean sum mean sum mean sum
a
1 0.405873 0.811746 0.202008 0.404016 0.682582 1.365165
2 0.899345 1.798690 0.391785 0.783570 0.395648 0.791296
result1 result2
a
1 0.405873 0.811746
2 0.899345 1.798690
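When different columns need different functions, named aggregation keeps the result flat and lets each output column be named explicitly. The sketch below uses fixed numbers rather than random ones so the result can be checked by hand; the output names `b_mean`, `b_sum` and `c_max` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [1.0, 3.0, 5.0, 7.0],
    "c": [10, 20, 30, 40],
})

# Each keyword is an output column: (source column, aggregation function).
summary = frame.groupby("a").agg(
    b_mean=("b", "mean"),
    b_sum=("b", "sum"),
    c_max=("c", "max"),
)

print(summary)
```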
Classroom exercises
df = pd.DataFrame({
'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
'C' : np.arange(10,26,2),
'D' : np.random.randn(8),
'E':np.random.rand(8)})
print(df)
df2 = df.groupby(['A'])[['C','D']].mean()
print(df2)
df2 = df.groupby(['A','B'])[['D','E']].sum()
print(df2)
dica = df.groupby(['A']).groups
print(dica)
dt = df.groupby(df.dtypes, axis=1).sum()
print(dt)
print('---------')
mapping = {
'C':'one','D':'one'}
dm = df.groupby(mapping,axis=1).groups
print(dm)
dmap = df.groupby(mapping,axis=1).sum()
print(dmap)
print('-------')
print(df.groupby(mapping,axis=1).get_group('one'))
print('-------')
dcd = df.groupby(mapping,axis=1).get_group('one').sum()
print(dcd)
print('-------')
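# Select the numeric columns explicitly so that the string column A is not aggregated
db = df.groupby(['B'])[['C','D','E']].agg(['mean','sum','max','min'])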
print(db)
A B C D E
0 one h 10 1.006026 0.133697
1 two h 12 -0.359184 0.976752
2 three h 14 0.066493 0.933959
3 one h 16 -1.462475 0.614514
4 two f 18 2.007785 0.458461
5 three f 20 -1.650301 0.805937
6 one f 22 -0.197564 0.760070
7 two f 24 -1.654774 0.633005
C D
A
one 16 -0.218005
three 17 -0.791904
two 18 -0.002057
D E
A B
one f -0.197564 0.760070
h -0.456449 0.748211
three f -1.650301 0.805937
h 0.066493 0.933959
two f 0.353012 1.091466
h -0.359184 0.976752
{'one': Int64Index([0, 3, 6], dtype='int64'), 'three': Int64Index([2, 5], dtype='int64'), 'two': Int64Index([1, 4, 7], dtype='int64')}
int32 float64 object
0 10 1.139723 oneh
1 12 0.617569 twoh
2 14 1.000452 threeh
3 16 -0.847962 oneh
4 18 2.466246 twof
5 20 -0.844364 threef
6 22 0.562506 onef
7 24 -1.021768 twof
---------
{'one': Index(['C', 'D'], dtype='object')}
one
0 11.006026
1 11.640816
2 14.066493
3 14.537525
4 20.007785
5 18.349699
6 21.802436
7 22.345226
-------
C D
0 10 1.006026
1 12 -0.359184
2 14 0.066493
3 16 -1.462475
4 18 2.007785
5 20 -1.650301
6 22 -0.197564
7 24 -1.654774
-------
C 136.000000
D -2.243994
dtype: float64
-------
C D E \
mean sum max min mean sum max min mean
B
f 21 84 24 18 -0.373713 -1.494854 2.007785 -1.654774 0.664369
h 13 52 16 10 -0.187285 -0.749141 1.006026 -1.462475 0.664731
sum max min
B
f 2.657474 0.805937 0.458461
h 2.658922 0.976752 0.133697
Group transformation and general "split-apply-combine"
'''
[Lesson 2.20] Group transformation and general "split-apply-combine"
transform / apply
'''
Group transformation: transform
df = pd.DataFrame({
'data1':np.random.rand(5),
'data2':np.random.rand(5),
'key1':list('aabba'),
'key2':['one','two','one','two','one']})
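# Mean of the numeric columns within each key1 group
k_mean = df.groupby('key1').mean(numeric_only=True)
print(df)
print(k_mean)
# Manual approach: merge the group means back onto the rows; add_prefix renames every column, not just the means
print(pd.merge(df,k_mean,left_on='key1',right_index=True).add_prefix('mean_'))
print('-----')
# transform returns the group means aligned with the original index
print(df.groupby('key2').mean(numeric_only=True))
print(df.groupby('key2')[['data1','data2']].transform('mean'))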
data1 data2 key1 key2
0 0.234441 0.600356 a one
1 0.773225 0.730067 a two
2 0.483987 0.637845 b one
3 0.243679 0.997665 b two
4 0.882532 0.617680 a one
data1 data2
key1
a 0.630066 0.649368
b 0.363833 0.817755
mean_data1_x mean_data2_x mean_key1 mean_key2 mean_data1_y mean_data2_y
0 0.234441 0.600356 a one 0.630066 0.649368
1 0.773225 0.730067 a two 0.630066 0.649368
4 0.882532 0.617680 a one 0.630066 0.649368
2 0.483987 0.637845 b one 0.363833 0.817755
3 0.243679 0.997665 b two 0.363833 0.817755
-----
data1 data2
key2
one 0.533653 0.618627
two 0.508452 0.863866
data1 data2
0 0.533653 0.618627
1 0.508452 0.863866
2 0.533653 0.618627
3 0.508452 0.863866
4 0.533653 0.618627
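Because `transform` keeps the original index, its result can be attached to the frame directly, without the merge step. This is a minimal sketch with small fixed numbers in place of the random data; the column names `mean_data1`, `mean_data2` and `data1_centered` are illustrative choices, not part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [1.0, 3.0, 2.0, 6.0, 5.0],
    'data2': [10.0, 30.0, 20.0, 60.0, 50.0],
    'key1': list('aabba')})

# Group means broadcast back to the original rows (same index as df)
group_mean = df.groupby('key1')[['data1', 'data2']].transform('mean')

# Attach them under explicit names and derive the deviation from the group mean
out = df.join(group_mean.add_prefix('mean_'))
out['data1_centered'] = out['data1'] - out['mean_data1']
print(out)
```

Only the new columns get the `mean_` prefix here, so the original column names stay unchanged.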
Generalized GroupBy method: apply
df = pd.DataFrame({
'data1':np.random.rand(5),
'data2':np.random.rand(5),
'key1':list('aabba'),
'key2':['one','two','one','two','one']})
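# apply calls the function on each group's sub-DataFrame and concatenates the results
print(df.groupby('key1').apply(lambda x: x.describe()))
# Return the first n rows of each group (extra arguments are passed through by apply)
def f_df1(d,n):
    return d.sort_index()[:n]
# Return column k1 of each group
def f_df2(d,k1):
    return d[k1]
print(df.groupby('key1').apply(f_df1,2),'\n')
print(df.groupby('key1').apply(f_df2,'data2'))
print(type(df.groupby('key1').apply(f_df2,'data2')))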
data1 data2
key1
a count 3.000000 3.000000
mean 0.545712 0.522070
std 0.315040 0.463898
min 0.202184 0.026861
25% 0.408011 0.309841
50% 0.613838 0.592821
75% 0.717477 0.769675
max 0.821116 0.946529
b count 2.000000 2.000000
mean 0.446845 0.399589
std 0.311004 0.160466
min 0.226932 0.286123
25% 0.336888 0.342856
50% 0.446845 0.399589
75% 0.556801 0.456322
max 0.666758 0.513056
data1 data2 key1 key2
key1
a 0 0.202184 0.946529 a one
1 0.821116 0.592821 a two
b 2 0.226932 0.286123 b one
3 0.666758 0.513056 b two
key1
a 0 0.946529
1 0.592821
4 0.026861
b 2 0.286123
3 0.513056
Name: data2, dtype: float64
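Extra arguments to `apply` can also be passed by keyword, which tends to read more clearly than positional values. The sketch below uses a hypothetical helper `top_n` and fixed numbers, neither of which is in the original notes. It selects the value columns first so that the grouping column is not handed to the function.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [0.2, 0.8, 0.3, 0.6, 0.5],
    'data2': [5, 4, 3, 2, 1],
    'key1': list('aabba')})

def top_n(group, column, n=1):
    """Hypothetical helper: the n rows of one group with the largest `column`."""
    return group.sort_values(column, ascending=False).head(n)

# Keyword arguments after the function are forwarded to it for every group
top = df.groupby('key1')[['data1', 'data2']].apply(top_n, column='data1', n=2)
print(top)
```

The result is indexed by the group key and the original row label, as in the `f_df1` output above.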
Classroom exercise
df = pd.DataFrame({
'data1':np.random.rand(8),
'data2':np.random.rand(8),
'key':list('aabbabab')})
print('Created df:\n',df,'\n------')
# Group means of data1 and data2 for each key
df2 = df.groupby(['key']).mean()
print(df2)
# Merge the group means back onto the original rows
df3 = pd.merge(df,df2,left_on='key',right_index=True).add_prefix('mean_')
print('Result after computing the group means and merging:')
print(df3)
Created df:
data1 data2 key
0 0.841120 0.987305 a
1 0.965404 0.734070 a
2 0.511385 0.044053 b
3 0.912349 0.828049 b
4 0.819506 0.131610 a
5 0.723875 0.642737 b
6 0.822328 0.457494 a
7 0.107970 0.936853 b
------
data1 data2
key
a 0.862090 0.577619
b 0.563895 0.612923
Result after computing the group means and merging:
mean_data1_x mean_data2_x mean_key mean_data1_y mean_data2_y
0 0.841120 0.987305 a 0.862090 0.577619
1 0.965404 0.734070 a 0.862090 0.577619
4 0.819506 0.131610 a 0.862090 0.577619
6 0.822328 0.457494 a 0.862090 0.577619
2 0.511385 0.044053 b 0.563895 0.612923
3 0.912349 0.828049 b 0.563895 0.612923
5 0.723875 0.642737 b 0.563895 0.612923
7 0.107970 0.936853 b 0.563895 0.612923
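In the result above, `add_prefix` runs after the merge, so every column is renamed, including the originals and `key`. A possible variant, sketched here with small fixed numbers as stand-ins for the random data, prefixes only the aggregated table and then looks the means up by key with `join`.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [1.0, 2.0, 3.0, 4.0],
    'data2': [10.0, 20.0, 30.0, 40.0],
    'key': list('abab')})

# Rename only the aggregated columns, then look them up by key
means = df.groupby('key').mean().add_prefix('mean_')
df3 = df.join(means, on='key')
print(df3)
```

This keeps the original row order and avoids the `_x` / `_y` suffixes that `merge` adds for overlapping column names.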
Pivot tables and cross-tabulations
'''
[Lesson 2.21] Pivot tables and cross-tabulations
Similar to Excel pivot tables - pivot table / crosstab
'''
Pivot table: pivot_table
date = ['2017-5-1','2017-5-2','2017-5-3']*3
rng = pd.to_datetime(date)
df = pd.DataFrame({
'date':rng,
'key':list('abcdabcda'),
'values':np.random.rand(9)*10})
print(df)
print('-----')
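# Sum of values per date
print(pd.pivot_table(df, values = 'values', index = 'date', aggfunc='sum'))
# One column per key; date/key combinations with no rows become NaN
print(pd.pivot_table(df, values = 'values', index = 'date', columns = 'key', aggfunc='sum'))
print('-----')
# Count the rows for each (date, key) pair
print(pd.pivot_table(df, values = 'values', index = ['date','key'], aggfunc=len))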
print('-----')
date key values
0 2017-05-01 a 1.573759
1 2017-05-02 b 3.750596
2 2017-05-03 c 4.958902
3 2017-05-01 d 0.797226
4 2017-05-02 a 5.757876
5 2017-05-03 b 0.082909
6 2017-05-01 c 3.799717
7 2017-05-02 d 0.754402
8 2017-05-03 a 3.117813
-----
values
date
2017-05-01 6.170701
2017-05-02 10.262874
2017-05-03 8.159623
key a b c d
date
2017-05-01 1.573759 NaN 3.799717 0.797226
2017-05-02 5.757876 3.750596 NaN 0.754402
2017-05-03 3.117813 0.082909 4.958902 NaN
-----
values
date key
2017-05-01 a 1.0
c 1.0
d 1.0
2017-05-02 a 1.0
b 1.0
d 1.0
2017-05-03 a 1.0
b 1.0
c 1.0
-----
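A pivot table can be read as a groupby followed by a reshape. The sketch below builds the date-by-key table both ways, using deterministic values as a stand-in for the random ones, and uses `fill_value=0` to replace the `NaN` cells of combinations that never occur.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-5-1', '2017-5-2', '2017-5-3'] * 3),
    'key': list('abcdabcda'),
    'values': np.arange(9) + 1.5})   # deterministic stand-in for the random values

# pivot_table with fill_value replaces the NaN cells of missing combinations
pivot = pd.pivot_table(df, values='values', index='date', columns='key',
                       aggfunc='sum', fill_value=0)

# The same table built step by step: group, aggregate, move 'key' to the columns
by_hand = df.groupby(['date', 'key'])['values'].sum().unstack(fill_value=0)

print(pivot)
print(np.allclose(pivot.to_numpy(), by_hand.to_numpy()))
```

Which form to use is mostly a matter of taste; `pivot_table` is shorter, while the groupby form makes each step explicit.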
Cross-tabulation: crosstab
df = pd.DataFrame({
'A': [1, 2, 2, 2, 2],
'B': [3, 3, 4, 4, 4],
'C': [1, 1, np.nan, 1, 1]})
print(df)
print('-----')
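# Frequency table: how often each (A, B) combination occurs
print(pd.crosstab(df['A'],df['B']))
print('-----')
# normalize=True divides every count by the grand total
print(pd.crosstab(df['A'],df['B'],normalize=True))
print('-----')
# Aggregate column C within each (A, B) cell instead of counting
print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc='sum'))
print('-----')
# margins=True adds the row and column totals (All)
print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc='sum', margins=True))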
print('-----')
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
-----
B 3 4
A
1 1 0
2 1 3
-----
B 3 4
A
1 0.2 0.0
2 0.2 0.6
-----
B 3 4
A
1 1.0 NaN
2 1.0 2.0
-----
B 3 4 All
A
1 1.0 NaN 1.0
2 1.0 2.0 3.0
All 2.0 2.0 4.0
-----
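To make the counting explicit, the sketch below reproduces the frequency table with `groupby(...).size()` and also shows `normalize='index'`, which divides each row by its own total where `normalize=True` divides by the grand total. It reuses columns A and B from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 2, 2],
    'B': [3, 3, 4, 4, 4]})

counts = pd.crosstab(df['A'], df['B'])

# Counting by hand: size of each (A, B) group, with B moved to the columns
by_size = df.groupby(['A', 'B']).size().unstack(fill_value=0)

# normalize='index' divides each row by its own total instead of the grand total
row_share = pd.crosstab(df['A'], df['B'], normalize='index')
print(row_share)
```

With these five rows, the A = 2 row splits into 0.25 for B = 3 and 0.75 for B = 4.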
Classroom exercise
df = pd.DataFrame({
'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
'C' : np.arange(10,26,2),
'D' : np.random.randn(8),
'E':np.random.rand(8)})
print(df)
print('---------')
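# Mean of C and D for each value of A
print(pd.pivot_table(df,index=['A'],values=['C','D'],aggfunc='mean'))
print('---------')
# Mean and sum of D and E for each (A, B) pair
print(pd.pivot_table(df,index=['A','B'],values=['D','E'],aggfunc=['mean','sum']))
print('---------')
# Number of rows for each combination of B (rows) and A (columns)
print(pd.pivot_table(df,index=['B'],values=['C'],columns=['A'],aggfunc='count'))
print('---------')
# The same frequency table with crosstab
print(pd.crosstab(df['B'],df['A']))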
A B C D E
0 one h 10 1.801648 0.234444
1 two h 12 1.015224 0.473324
2 three h 14 1.145384 0.423148
3 one h 16 0.782241 0.053959
4 two f 18 -0.015952 0.669829
5 three f 20 -0.356324 0.455806
6 one f 22 -1.555999 0.136985
7 two f 24 1.791435 0.448069
---------
C D
A
one 16 0.342630
three 17 0.394530
two 18 0.930235
---------
mean sum
D E D E
A B
one f -1.555999 0.136985 -1.555999 0.136985
h 1.291944 0.144202 2.583889 0.288403
three f -0.356324 0.455806 -0.356324 0.455806
h 1.145384 0.423148 1.145384 0.423148
two f 0.887741 0.558949 1.775482 1.117898
h 1.015224 0.473324 1.015224 0.473324
---------
C
A one three two
B
f 1 1 2
h 2 1 1
---------
A one three two
B
f 1 1 2
h 2 1 1
Reading data
'''
[Lesson 2.22] Reading data
Core functions: read_table, read_csv, read_excel
'''
Reading plain delimited data: read_table
import os
os.chdir('D:/data/pandasData/')
data1 = pd.read_table('data1.txt', delimiter=',',header = 0, index_col=1)
print(data1)
va1 va3 va4
va2
2 1 3 4
3 2 4 5
4 3 5 6
5 4 6 7
D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:7: FutureWarning: read_table is deprecated, use read_csv instead.
import sys
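The FutureWarning above points to `read_csv` as the replacement (as far as I know, later pandas releases dropped this warning again). A minimal sketch of the same read with `read_csv` follows. Since `data1.txt` is not shown, the file contents here are an assumption, reconstructed from the printed result and held in memory.

```python
import io
import pandas as pd

# Assumption: file contents reconstructed from the printed result above
text = "va1,va2,va3,va4\n1,2,3,4\n2,3,4,5\n3,4,5,6\n4,5,6,7\n"

# read_csv with an explicit separator; index_col=1 uses the second column (va2) as the index
data1 = pd.read_csv(io.StringIO(text), sep=',', header=0, index_col=1)
print(data1)
```

Passing a file path instead of the `StringIO` object works the same way.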
Reading CSV data: read_csv
data2 = pd.read_csv('data2.csv',engine = 'python')
print(data2.head())
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \
0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN NaN
1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN NaN
2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN NaN
3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN NaN
4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN NaN
... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 工作年份
0 ... NaN 硕士 1.0 NaN NaN NaN NaN NaN NaN NaN
1 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
2 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
3 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
4 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
[5 rows x 23 columns]
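For wide files like this one, a few `read_csv` parameters are often useful: `encoding` for non-UTF-8 files, `usecols` to load only some columns, and `nrows` to preview the first rows. The sketch below is illustrative only. It builds a small GBK-encoded sample in memory, which is a hypothetical stand-in (the encoding of `data2.csv` is not stated in the notes), and borrows three column names from the output above.

```python
import io
import pandas as pd

# Hypothetical stand-in for a GBK-encoded CSV file with Chinese headers
raw = ('省级政区名称,地市级政区名称,年份\n'
       '河北省,石家庄市,2000\n'
       '河北省,石家庄市,2001\n'
       '河北省,唐山市,2000\n').encode('gbk')

sample = pd.read_csv(
    io.BytesIO(raw),
    encoding='gbk',                      # must match how the file was saved
    usecols=['地市级政区名称', '年份'],   # load only the columns that are needed
    nrows=2)                             # read just the first two data rows
print(sample)
```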
Reading Excel data: read_excel
data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',sheet_name='中国人民共和国地市级党委书记数据库(2000-10)',header=0)
print(data3)
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 \
0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN
1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN
2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN
3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN
4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN
5 130000 河北省 130100 石家庄市 2005 吴振华 NaN NaN NaN
6 130000 河北省 130100 石家庄市 2006 吴振华 NaN NaN NaN
7 130000 河北省 130100 石家庄市 2007 吴显国 NaN NaN NaN
8 130000 河北省 130100 石家庄市 2008 吴显国 NaN NaN NaN
9 130000 河北省 130100 石家庄市 2009 车俊 NaN NaN NaN
10 130000 河北省 130100 石家庄市 2010 孙瑞彬 NaN NaN NaN
11 130000 河北省 130200 唐山市 2000 白润璋 NaN NaN NaN
12 130000 河北省 130200 唐山市 2001 白润璋 NaN NaN NaN
13 130000 河北省 130200 唐山市 2002 白润璋 NaN NaN NaN
14 130000 河北省 130200 唐山市 2003 张和 NaN NaN NaN
15 130000 河北省 130200 唐山市 2004 张和 NaN NaN NaN
16 130000 河北省 130200 唐山市 2005 张和 NaN NaN NaN
17 130000 河北省 130200 唐山市 2006 张和 NaN NaN NaN
18 130000 河北省 130200 唐山市 2007 赵勇 NaN NaN NaN
19 130000 河北省 130200 唐山市 2008 赵勇 NaN NaN NaN
20 130000 河北省 130200 唐山市 2009 赵勇 NaN NaN NaN
21 130000 河北省 130200 唐山市 2010 赵勇 NaN NaN NaN
22 130000 河北省 130300 秦皇岛市 2000 王建忠 NaN NaN NaN
23 130000 河北省 130300 秦皇岛市 2001 王建忠 NaN NaN NaN
24 130000 河北省 130300 秦皇岛市 2002 王建忠 NaN NaN NaN
25 130000 河北省 130300 秦皇岛市 2003 宋长瑞 NaN NaN NaN
26 130000 河北省 130300 秦皇岛市 2004 宋长瑞 NaN NaN NaN
27 130000 河北省 130300 秦皇岛市 2005 宋长瑞 NaN NaN NaN
28 130000 河北省 130300 秦皇岛市 2006 宋长瑞 NaN NaN NaN
29 130000 河北省 130300 秦皇岛市 2007 王三堂 NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
3633 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2003 NaN NaN NaN NaN
3634 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2004 NaN NaN NaN NaN
3635 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2005 NaN NaN NaN NaN
3636 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2006 NaN NaN NaN NaN
3637 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2007 NaN NaN NaN NaN
3638 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2008 NaN NaN NaN NaN
3639 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2009 NaN NaN NaN NaN
3640 650000 新疆维吾尔自治区 654000 伊犁哈萨克自治州 2010 NaN NaN NaN NaN
3641 650000 新疆维吾尔自治区 654200 塔城地区 2000 NaN NaN NaN NaN
3642 650000 新疆维吾尔自治区 654200 塔城地区 2001 NaN NaN NaN NaN
3643 650000 新疆维吾尔自治区 654200 塔城地区 2002 NaN NaN NaN NaN
3644 650000 新疆维吾尔自治区 654200 塔城地区 2003 NaN NaN NaN NaN
3645 650000 新疆维吾尔自治区 654200 塔城地区 2004 NaN NaN NaN NaN
3646 650000 新疆维吾尔自治区 654200 塔城地区 2005 NaN NaN NaN NaN
3647 650000 新疆维吾尔自治区 654200 塔城地区 2006 NaN NaN NaN NaN
3648 650000 新疆维吾尔自治区 654200 塔城地区 2007 NaN NaN NaN NaN
3649 650000 新疆维吾尔自治区 654200 塔城地区 2008 NaN NaN NaN NaN
3650 650000 新疆维吾尔自治区 654200 塔城地区 2009 NaN NaN NaN NaN
3651 650000 新疆维吾尔自治区 654200 塔城地区 2010 NaN NaN NaN NaN
3652 650000 新疆维吾尔自治区 654300 阿勒泰地区 2000 NaN NaN NaN NaN
3653 650000 新疆维吾尔自治区 654300 阿勒泰地区 2001 NaN NaN NaN NaN
3654 650000 新疆维吾尔自治区 654300 阿勒泰地区 2002 NaN NaN NaN NaN
3655 650000 新疆维吾尔自治区 654300 阿勒泰地区 2003 NaN NaN NaN NaN
3656 650000 新疆维吾尔自治区 654300 阿勒泰地区 2004 NaN NaN NaN NaN
3657 650000 新疆维吾尔自治区 654300 阿勒泰地区 2005 NaN NaN NaN NaN
3658 650000 新疆维吾尔自治区 654300 阿勒泰地区 2006 NaN NaN NaN NaN
3659 650000 新疆维吾尔自治区 654300 阿勒泰地区 2007 NaN NaN NaN NaN
3660 650000 新疆维吾尔自治区 654300 阿勒泰地区 2008 NaN NaN NaN NaN
3661 650000 新疆维吾尔自治区 654300 阿勒泰地区 2009 NaN NaN NaN NaN
3662 650000 新疆维吾尔自治区 654300 阿勒泰地区 2010 NaN NaN NaN NaN
籍贯省份名称 ... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 \
0 NaN ... NaN 硕士 1.0 NaN NaN NaN NaN NaN
1 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
2 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
3 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
4 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
5 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
6 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
7 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
8 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
9 NaN ... NaN 本科 1.0 0.0 1.0 0.0 0.0 0.0
10 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
11 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
12 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
13 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
14 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
15 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
16 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
17 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
18 NaN ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0
19 NaN ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0
20 NaN ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0
21 NaN ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0
22 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
23 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
24 NaN ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0
25 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
26 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
27 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
28 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
29 NaN ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
3633 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3634 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3635 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3636 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3637 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3638 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3639 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3640 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3641 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3642 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3643 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3644 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3645 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3646 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3647 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3648 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3649 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3650 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3651 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3652 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3653 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3654 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3655 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3656 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3657 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3658 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3659 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3660 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3661 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
3662 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN
入党年份 工作年份
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 NaN NaN
20 NaN NaN
21 NaN NaN
22 NaN NaN
23 NaN NaN
24 NaN NaN
25 NaN NaN
26 NaN NaN
27 NaN NaN
28 NaN NaN
29 NaN NaN
... ... ...
3633 NaN NaN
3634 NaN NaN
3635 NaN NaN
3636 NaN NaN
3637 NaN NaN
3638 NaN NaN
3639 NaN NaN
3640 NaN NaN
3641 NaN NaN
3642 NaN NaN
3643 NaN NaN
3644 NaN NaN
3645 NaN NaN
3646 NaN NaN
3647 NaN NaN
3648 NaN NaN
3649 NaN NaN
3650 NaN NaN
3651 NaN NaN
3652 NaN NaN
3653 NaN NaN
3654 NaN NaN
3655 NaN NaN
3656 NaN NaN
3657 NaN NaN
3658 NaN NaN
3659 NaN NaN
3660 NaN NaN
3661 NaN NaN
3662 NaN NaN
[3663 rows x 23 columns]
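Printing a frame of this size in full is hard to read. The sketch below shows a few ways to inspect a large table instead. It uses a synthetic frame `big` with the same number of rows, because the Excel file itself is not available here; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large table such as data3
big = pd.DataFrame({
    'year': np.tile(np.arange(2000, 2011), 333),
    'value': np.where(np.arange(3663) % 3 == 0, np.nan, 1.0)})

print(big.shape)            # (rows, columns)
print(big.head())           # first five rows
print(big.isna().mean())    # share of missing values per column

# Temporarily limit how many rows a full print shows
with pd.option_context('display.max_rows', 10):
    print(big)
```

The `option_context` block restores the previous display setting once it exits.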