Pandas学习(三)---数值运算

Pandas学习--数值运算

    • 数值计算和统计基础
      • 常用数学、统计方法
        • 基本参数:axis、skipna
        • 主要数学计算方法,可用于Series和DataFrame(1)
        • 主要数学计算方法,可用于Series和DataFrame(2)累和、累积
        • 主要数学计算方法,唯一值:.unique()
        • 主要数学计算方法,值计数(计算频率):.value_counts()
        • 主要数学计算方法,成员资格(是否包含):.isin()
            • 课堂作业
      • 处理文本数据
        • 通过str访问,且自动排除丢失/ NA值
        • 字符串常用方法(1) - lower,upper,len,startswith,endswith
        • 字符串常用方法(2) - strip
        • 字符串常用方法(3) - replace
        • 字符串常用方法(4) - split、rsplit
        • 字符串索引
            • 课堂作业
      • 合并 merge、join
        • merge合并 → 类似excel的vlookup
        • merge合并 → 参数how → 合并方式
        • merge合并 → 参数 left_on, right_on, left_index, right_index → 当键不为一个列时,可以单独设置左键与右键
        • merge合并 →参数 sort
        • df.join() → 直接通过索引连接
            • 课堂作业
      • 连接与修补 concat、combine_first
        • 连接:concat
        • 连接方式:join,join_axes
        • 覆盖列名(!!!)
        • 修补 pd.combine_first()
          • 课堂作业
      • 去重及替换 .duplicated / .replace
        • 去重 .duplicated
        • 替换 .replace
      • 数据分组!!!(重要)
        • groupby分组
        • 分组 - 可迭代对象
        • 其他轴上的分组
        • 通过字典或者Series分组
        • 通过函数分组
        • 分组计算函数方法
        • 分组多函数计算:agg()
            • 课堂作业
      • 分组转换及一般性“拆分-应用-合并”
        • 数据分组转换,transform
        • 一般化Groupby方法:apply
            • 课堂作业
      • 透视表及交叉表
        • 透视表:pivot_table
        • 交叉表:crosstab
            • 课堂作业
      • 数据读取
        • 读取普通分隔数据:read_table
        • 读取csv数据:read_csv
        • 读取excel数据:read_excel

数值计算和统计基础

'''
【课程2.14】  数值计算和统计基础

常用数学、统计方法
 
'''

常用数学、统计方法

基本参数:axis、skipna

# 基本参数:axis、skipna

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                   'key2':[1,2,np.nan,4,5],
                   'key3':[1,2,3,'j','k']},
                  index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype)
print('-----')

m1 = df.mean()
print(m1,type(m1))
print('单独统计一列:',df['key2'].mean())
print('-----')
# np.nan :空值
# .mean()计算均值
# 只统计数字列
# 可以通过索引单独统计一列

m2 = df.mean(axis=1)
print(m2)
print('-----')
# axis参数:默认为0,以列来计算,axis=1,以行来计算,这里就按照行来汇总了

m3 = df.mean(skipna=False)
print(m3)
print('-----')
# skipna参数:是否忽略NaN,默认True;设为False时,含NaN的列统计结果仍为NaN
   key1  key2 key3
a   4.0   1.0    1
b   5.0   2.0    2
c   3.0   NaN    3
d   NaN   4.0    j
e   2.0   5.0    k
float64 float64 object
-----
key1    3.5
key2    3.0
dtype: float64 
单独统计一列: 3.0
-----
a    2.5
b    3.5
c    3.0
d    4.0
e    3.5
dtype: float64
-----
key1   NaN
key2   NaN
dtype: float64
-----
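
补充一个小示例(基于较新版本 pandas 的行为,仅作示意):当 DataFrame 含有非数值列(如上面的 key3)时,可以显式传入 numeric_only=True,只统计数值列。

# 补充:显式只统计数值列(假设使用支持该参数的 pandas 版本)
m4 = df.mean(numeric_only=True)   # 忽略 object 类型的 key3 列
print(m4)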

主要数学计算方法,可用于Series和DataFrame(1)

# 主要数学计算方法,可用于Series和DataFrame(1)

df = pd.DataFrame({'key1':np.arange(10),
                   'key2':np.random.rand(10)*10})
print(df)
print('-----')

print(df.count(),'→ count统计非NaN值的数量\n')
print(df.min(),'→ min统计最小值\n',df['key2'].max(),'→ max统计最大值\n')
print(df.quantile(q=0.75),'→ quantile统计分位数,参数q确定位置\n')
print(df.sum(),'→ sum求和\n')
print(df.mean(),'→ mean求平均值\n')
print(df.median(),'→ median求算数中位数,50%分位数\n')
print(df.std(),'\n',df.var(),'→ std,var分别求标准差,方差\n')
print(df.skew(),'→ skew样本的偏度\n')
print(df.kurt(),'→ kurt样本的峰度\n')
   key1      key2
0     0  0.327398
1     1  0.959262
2     2  6.455080
3     3  6.275359
4     4  6.138641
5     5  8.853716
6     6  4.525300
7     7  9.740657
8     8  9.229833
9     9  0.949789
-----
key1    10
key2    10
dtype: int64 → count统计非NaN值的数量

key1    0.000000
key2    0.327398
dtype: float64 → min统计最小值
 9.740656570973671 → max统计最大值

key1    6.750000
key2    8.254057
Name: 0.75, dtype: float64 → quantile统计分位数,参数q确定位置

key1    45.000000
key2    53.455034
dtype: float64 → sum求和

key1    4.500000
key2    5.345503
dtype: float64 → mean求平均值

key1    4.500
key2    6.207
dtype: float64 → median求算数中位数,50%分位数

key1    3.027650
key2    3.556736
dtype: float64 
 key1     9.166667
key2    12.650371
dtype: float64 → std,var分别求标准差,方差

key1    0.000000
key2   -0.329924
dtype: float64 → skew样本的偏度

key1   -1.200000
key2   -1.430276
dtype: float64 → kurt样本的峰度
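
上面这些统计量也可以用 .describe() 一次性查看(count、mean、std、min、分位数、max),下面是一个简单的补充示例:

# 补充:用 describe() 一次性查看常用统计量
print(df.describe())            # 默认给出 count/mean/std/min/25%/50%/75%/max
print(df['key2'].describe())    # 也可以只看某一列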

主要数学计算方法,可用于Series和DataFrame(2)累和、累积

# 主要数学计算方法,可用于Series和DataFrame(2)

df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum样本的累计和\n')

df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod样本的累计积\n')

print(df.cummax(),'\n',df.cummin(),'→ cummax,cummin分别求累计最大值,累计最小值\n')
# cummax、cummin 对每一列(包括前面新增的累计列)逐行计算累计最大/最小值
   key1      key2  key1_s     key2_s
0     0  0.327398       0   0.327398
1     1  0.959262       1   1.286660
2     2  6.455080       3   7.741740
3     3  6.275359       6  14.017099
4     4  6.138641      10  20.155740
5     5  8.853716      15  29.009456
6     6  4.525300      21  33.534756
7     7  9.740657      28  43.275412
8     8  9.229833      36  52.505245
9     9  0.949789      45  53.455034 → cumsum样本的累计和

   key1      key2  key1_s     key2_s  key1_p         key2_p
0     0  0.327398       0   0.327398       0       0.327398
1     1  0.959262       1   1.286660       0       0.314061
2     2  6.455080       3   7.741740       0       2.027286
3     3  6.275359       6  14.017099       0      12.721946
4     4  6.138641      10  20.155740       0      78.095454
5     5  8.853716      15  29.009456       0     691.434982
6     6  4.525300      21  33.534756       0    3128.950808
7     7  9.740657      28  43.275412       0   30478.035251
8     8  9.229833      36  52.505245       0  281307.179260
9     9  0.949789      45  53.455034       0  267182.375541 → cumprod样本的累计积

   key1      key2  key1_s     key2_s  key1_p         key2_p
0   0.0  0.327398     0.0   0.327398     0.0       0.327398
1   1.0  0.959262     1.0   1.286660     0.0       0.327398
2   2.0  6.455080     3.0   7.741740     0.0       2.027286
3   3.0  6.455080     6.0  14.017099     0.0      12.721946
4   4.0  6.455080    10.0  20.155740     0.0      78.095454
5   5.0  8.853716    15.0  29.009456     0.0     691.434982
6   6.0  8.853716    21.0  33.534756     0.0    3128.950808
7   7.0  9.740657    28.0  43.275412     0.0   30478.035251
8   8.0  9.740657    36.0  52.505245     0.0  281307.179260
9   9.0  9.740657    45.0  53.455034     0.0  281307.179260 
    key1      key2  key1_s    key2_s  key1_p    key2_p
0   0.0  0.327398     0.0  0.327398     0.0  0.327398
1   0.0  0.327398     0.0  0.327398     0.0  0.314061
2   0.0  0.327398     0.0  0.327398     0.0  0.314061
3   0.0  0.327398     0.0  0.327398     0.0  0.314061
4   0.0  0.327398     0.0  0.327398     0.0  0.314061
5   0.0  0.327398     0.0  0.327398     0.0  0.314061
6   0.0  0.327398     0.0  0.327398     0.0  0.314061
7   0.0  0.327398     0.0  0.327398     0.0  0.314061
8   0.0  0.327398     0.0  0.327398     0.0  0.314061
9   0.0  0.327398     0.0  0.327398     0.0  0.314061 → cummax,cummin分别求累计最大值,累计最小值

主要数学计算方法,唯一值:.unique()

# 唯一值:.unique()

s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq,type(sq))
print(pd.Series(sq))
# 得到一个唯一值数组
# 通过pd.Series重新变成新的Series

sq.sort()
print(sq)
# 重新排序
0     a
1     s
2     d
3     v
4     a
5     s
6     d
7     c
8     f
9     g
10    g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g'] 
0    a
1    s
2    d
3    v
4    c
5    f
6    g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']

主要数学计算方法,值计数(计算频率):.value_counts()

# 值计数:.value_counts()

sc = s.value_counts(sort = False)  # 也可以这样写:pd.value_counts(s, sort = False)
print(sc)
# 得到一个新的Series,计算出不同值出现的频率
# sort参数:排序,默认为True
a    2
d    2
v    1
g    2
s    2
f    1
c    1
dtype: int64
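
补充:value_counts() 还支持 normalize 参数,可以把计数换算为占比;下面的小例子仅作演示:

# 补充:normalize=True 返回频率(占比)而不是次数
print(s.value_counts(normalize=True))
print(s.value_counts(sort=True))   # sort=True(默认)按出现次数从大到小排列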

主要数学计算方法,成员资格(是否包含):.isin()

# 成员资格:.isin()

s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),
                   'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')

print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
# 用[]表示
# 得到一个布尔值的Series或者Dataframe
0    10
1    11
2    12
3    13
4    14
dtype: int32
  key1  key2
0    a     4
1    s     5
2    d     6
3    c     7
4    b     8
5    v     9
6    a    10
7    s    11
8    d    12
-----
0    False
1    False
2    False
3    False
4     True
dtype: bool
    key1   key2
0   True  False
1  False  False
2  False  False
3  False  False
4  False   True
5  False  False
6   True  False
7  False  False
8  False  False
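
补充:isin() 得到的布尔结果常用来做行筛选,配合 ~ 可以取反;示例如下:

# 补充:用 isin 的布尔结果做筛选
print(df[df['key1'].isin(['a','b'])])    # 保留 key1 为 a 或 b 的行
print(df[~df['key1'].isin(['a','b'])])   # ~ 取反,保留其余行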
课堂作业
ts1 = pd.DataFrame(np.random.rand(5,2)*100,columns=['key1','key2'])
print("创建的Dataframe为:")
print(ts1)
print('------')
print("df['key1']的均值为:")
print(ts1['key1'].mean())
print('------')
print("df['key1']的中位数为:")
print(ts1['key1'].median())
print('------')
print("df['key2']的均值为:")
print(ts1['key2'].mean())
print('------')
print("df['key2']的中位数为:")
print(ts1['key2'].median())
print('------')
print("df['key1']、df['key2']的累计和为:")
ts1['key1_cumsum'] = ts1['key1'].cumsum()
ts1['key2_cumsum'] = ts1['key2'].cumsum()
print(ts1)
创建的Dataframe为:
        key1       key2
0   0.445031  70.879116
1  40.164080   8.052621
2   4.118756  72.932482
3  46.818794  12.744497
4  37.192819  18.393109
------
df['key1']的均值为:
25.747896160805663
------
df['key1']的中位数为:
37.192819239210486
------
df['key2']的均值为:
36.60036488397306
------
df['key2']的中位数为:
18.39310866824474
------
df['key1']、df['key2']的累计和为:
        key1       key2  key1_cumsum  key2_cumsum
0   0.445031  70.879116     0.445031    70.879116
1  40.164080   8.052621    40.609112    78.931737
2   4.118756  72.932482    44.727868   151.864219
3  46.818794  12.744497    91.546662   164.608716
4  37.192819  18.393109   128.739481   183.001824
# 作业2:写出一个输入元素直接生成数组的代码块,然后创建一个函数,该函数功能用于判断一个Series是否是唯一值数组,返回“是”和“不是”

def f(s):
    s2 = s.unique()
    if len(s) == len(s2):
        print('------\n该数组是唯一值数组')
    else:
        print('------\n该数组不是唯一值数组')

d = input('请随机输入一组元素,用逗号(英文符号)隔开:\n')
lst = d.split(',')
ds = pd.Series(lst)
f(ds)
请随机输入一组元素,用逗号(英文符号)隔开:
a,sc,2,2,2,d,s,s,a
------
该数组不是唯一值数组

处理文本数据

'''
【课程2.15】  文本数据

Pandas针对字符串配备的一套方法,使其易于对数组的每个元素进行操作
 
'''

通过str访问,且自动排除丢失/ NA值

# 通过str访问,且自动排除丢失/ NA值

s = pd.Series(['A','bB','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')

print(s.str.count('b')) # 大小写敏感
print(df['key2'].str.upper()) # 不会改变原有值
print(df['key2']) 
print('-----')
# 直接通过.str调用字符串方法
# 可以对Series、Dataframe使用
# 自动过滤NaN值

df.columns = df.columns.str.upper()
print(df)
# df.columns是一个Index对象,也可使用.str
0          A
1         bB
2          C
3    bbhello
4        123
5        NaN
6         hj
dtype: object
  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
-----
0    0.0
1    1.0
2    0.0
3    2.0
4    0.0
5    NaN
6    0.0
dtype: float64
0     HEE
1      FV
2       W
3    HIJA
4     123
5     NaN
Name: key2, dtype: object
  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
-----
  KEY1  KEY2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN

字符串常用方法(1) - lower,upper,len,startswith,endswith

# 字符串常用方法(1) - lower,upper,len,startswith,endswith

s = pd.Series(['A','b','bbhello','123',np.nan])

print(s.str.lower(),'→ lower小写\n')
print(s.str.upper(),'→ upper大写\n')
print(s.str.len(),'→ len字符长度\n')
print(s.str.startswith('b'),'→ 判断起始是否为b\n')
print(s.str.endswith('3'),'→ 判断结束是否为3\n')
0          a
1          b
2    bbhello
3        123
4        NaN
dtype: object → lower小写

0          A
1          B
2    BBHELLO
3        123
4        NaN
dtype: object → upper大写

0    1.0
1    1.0
2    7.0
3    3.0
4    NaN
dtype: float64 → len字符长度

0    False
1     True
2     True
3    False
4      NaN
dtype: object → 判断起始是否为b

0    False
1    False
2    False
3     True
4      NaN
dtype: object → 判断结束是否为3

字符串常用方法(2) - strip

# 字符串常用方法(2) - strip

s = pd.Series([' jack', 'ji ll ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)
print('-----')

print(s.str.strip())  # 去除字符串前后的空格
print(s.str.lstrip())  # 去除字符串中的左空格
print(s.str.rstrip())  # 去除字符串中的右空格

df.columns = df.columns.str.strip()
print(df)
# 这里去掉了columns的前后空格,但没有去掉中间空格
0       jack
1     ji ll 
2     jesse 
3      frank
dtype: object
    Column A    Column B 
0    1.178373   -0.770705
1    0.611277    0.705297
2   -1.106696    1.455232
-----
0     jack
1    ji ll
2    jesse
3    frank
dtype: object
0      jack
1    ji ll 
2    jesse 
3     frank
dtype: object
0      jack
1     ji ll
2     jesse
3     frank
dtype: object
   Column A  Column B
0  1.178373 -0.770705
1  0.611277  0.705297
2 -1.106696  1.455232

字符串常用方法(3) - replace

# 字符串常用方法(3) - replace

df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
# 替换

df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
# n:替换个数
   -Column-A-  -Column-B-
0   -1.140552   -2.215192
1   -0.386697    1.323757
2   -0.288860    1.405160
   heheColumn-A-  heheColumn-B-
0      -1.140552      -2.215192
1      -0.386697       1.323757
2      -0.288860       1.405160

字符串常用方法(4) - split、rsplit

# 字符串常用方法(4) - split、rsplit

s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s)
print(s.str.split(','),type(s.str.split(','))) # 因为['a,,,c']是列表,所以split结果是NaN;这里得到的是一个Series对象
print('-----')
# 类似字符串的split

print(s.str.split(',')[0],type(s.str.split(',')[0])) # 对于Series来说这是取第一行
print('-----')
# 直接索引得到一个list

print(s.str.split(',').str) # 这是对一个Series对象取str
print(s.str.split(',').str[1],type(s.str.split(',').str[1]),'split.str..') # 对于元素是列表的Series, .str[1] 相当于取每个列表的第2个元素
print(s.str.split(',').str.get(1))                                         # 而且返回的对象还是一个series
print('-----')
# 可以使用get或[]符号访问拆分列表中的元素

print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')
# 可以使用expand可以轻松扩展此操作以返回DataFrame
# n参数限制分割数
# rsplit类似于split,反向工作,即从字符串的末尾到字符串的开头

df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],
                   'key2':['a-b-c','1-2-3',[':-.- ']]})
print(df['key2'].str.split('-'))
# Dataframe使用split
0      a,b,c
1      1,2,3
2    [a,,,c]
3        NaN
dtype: object
0    [a, b, c]
1    [1, 2, 3]
2          NaN
3          NaN
dtype: object 
-----
['a', 'b', 'c'] 
-----

0      b
1      2
2    NaN
3    NaN
dtype: object  split.str..
0      b
1      2
2    NaN
3    NaN
dtype: object
-----
     0    1    2
0    a    b    c
1    1    2    3
2  NaN  NaN  NaN
3  NaN  NaN  NaN
     0    1
0    a  b,c
1    1  2,3
2  NaN  NaN
3  NaN  NaN
     0    1
0  a,b    c
1  1,2    3
2  NaN  NaN
3  NaN  NaN
-----
0    [a, b, c]
1    [1, 2, 3]
2          NaN
Name: key2, dtype: object

字符串索引

# 字符串索引

s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})

print(s.str[0])  # 取每个字符串的第一个字符
print(s.str[:2])  # 取每个字符串的前两个字符
print(df['key2'].str[0]) 
# str之后和字符串本身索引方式相同
0      A
1      b
2      C
3      b
4      1
5    NaN
6      h
dtype: object
0      A
1      b
2      C
3     bb
4     12
5    NaN
6     hj
dtype: object
0      h
1      f
2      w
3      h
4      1
5    NaN
Name: key2, dtype: object
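
补充:除了切片索引,.str 还提供 contains 等判断方法,返回布尔 Series,可直接用于筛选;对 NaN 可用 na 参数指定填充值。下面是一个简单示例:

# 补充:str.contains 判断是否包含子串
print(s.str.contains('b'))                 # 含 NaN 的位置返回 NaN
print(s.str.contains('b', na=False))       # na=False 把 NaN 处理为 False
print(df['key2'].str.contains('h', na=False))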
课堂作业
df = pd.DataFrame({'name':['jack','tom','Marry','zack','heheda'],
                   'gender':['M ','M','   F','  M ','  F'],
                   'score':['90-92-89','89-78-88','90-92-95','78-88-76','60-60-67']})
print(df)
df['gender'] = df['gender'].str.strip()
df['name'] = df['name'].str.capitalize()  # 首字母大写
df = df.reindex(['gender','name','score'],axis=1)

sf = df['score'].str.split('-', expand=True) # 对单个字段的expand=True返回的是一个dataframe
print(sf,type(sf))
print(sf[0],type(sf[0]))

df['math'] = sf[0]
df['english'] = sf[1]
df['art'] = sf[2]

del df['score']
print(df)


# print(sf)  # 这是一个DateFrame

# 重要结论 
# Series split expand=False  返回Series,se[0] 第一行数据  se.str[0] 返回Series中列表第一列的数据,还是个Series对象
# Series split expand=True  返回dataframe
     name gender     score
0    jack     M   90-92-89
1     tom      M  89-78-88
2   Marry      F  90-92-95
3    zack     M   78-88-76
4  heheda      F  60-60-67
    0   1   2
0  90  92  89
1  89  78  88
2  90  92  95
3  78  88  76
4  60  60  67 
0    90
1    89
2    90
3    78
4    60
Name: 0, dtype: object 
  gender    name math english art
0      M    Jack   90      92  89
1      M     Tom   89      78  88
2      F   Marry   90      92  95
3      M    Zack   78      88  76
4      F  Heheda   60      60  67

合并 merge、join

'''
【课程2.16】  合并 merge、join

Pandas 提供功能完整、高性能的内存连接操作,与 SQL 等关系型数据库的连接操作非常相似

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False)
 
'''

merge合并 → 类似excel的vlookup

# merge合并 → 类似excel的vlookup

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
# left:第一个df
# right:第二个df
# on:参考键

print(pd.merge(df3, df4, on=['key1','key2']))
# 多个连接键
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
------
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

merge合并 → 参数how → 合并方式

# 参数how → 合并方式

print(pd.merge(df3, df4,on=['key1','key2'], how = 'inner'))  
print('------')
# inner:默认,取交集

print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))  
print('------')
# outer:取并集,数据缺失范围NaN

print(df3)
print(df4)
print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))  
print('------')
# left:按照df3为参考合并,数据缺失范围NaN

print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))  
# right:按照df4为参考合并,数据缺失范围NaN
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
------
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
------
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
------
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
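
补充:merge 的 indicator 参数(见上面的函数签名)可以在结果中增加一列 _merge,标记每行来自左表、右表还是两者都有,便于核对合并结果;示例如下:

# 补充:indicator=True 标记每行的来源(left_only / right_only / both)
print(pd.merge(df3, df4, on=['key1','key2'], how='outer', indicator=True))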

merge合并 → 参数 left_on, right_on, left_index, right_index → 当键不为一个列时,可以单独设置左键与右键

# 参数 left_on, right_on, left_index, right_index → 当键不为一个列时,可以单独设置左键与右键

df1 = pd.DataFrame({'lkey':list('bbacaab'),
                    'data1':range(7)})
df2 = pd.DataFrame({'rkey':list('abd'),
                    'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')
# df1以‘lkey’为键,df2以‘rkey’为键

df1 = pd.DataFrame({'key':list('abcdfeg'),
                    'data1':range(7)})
df2 = pd.DataFrame({'date2':range(100,105)},
                   index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
# df1以‘key’为键,df2以index为键
# left_index:为True时,第一个df以index为键,默认False
# right_index:为True时,第二个df以index为键,默认False

# 所以left_on, right_on, left_index, right_index可以相互组合:
# left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index
  lkey  data1 rkey  date2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
------
  key  data1  date2
0   a      0    100
1   b      1    101
2   c      2    102
3   d      3    103
5   e      5    104

merge合并 →参数 sort

# 参数 sort

df1 = pd.DataFrame({'key':list('bbacaab'),
                    'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),
                    'date2':[11,2,33]})
x1 = pd.merge(df1,df2, on = 'key', how = 'outer')
x2 = pd.merge(df1,df2, on = 'key', sort=True, how = 'outer')
print(x1)
print(x2)
print('------')
# sort:按照字典顺序通过 连接键 对结果DataFrame进行排序。默认为False,设置为False会大幅提高性能

print(x2.sort_values('data1'))
# 也可直接用Dataframe的排序方法:sort_values,sort_index
  key  data1  date2
0   b    1.0    2.0
1   b    3.0    2.0
2   b    7.0    2.0
3   a    2.0   11.0
4   a    5.0   11.0
5   a    9.0   11.0
6   c    4.0    NaN
7   d    NaN   33.0
  key  data1  date2
0   a    2.0   11.0
1   a    5.0   11.0
2   a    9.0   11.0
3   b    1.0    2.0
4   b    3.0    2.0
5   b    7.0    2.0
6   c    4.0    NaN
7   d    NaN   33.0
------
  key  data1  date2
3   b    1.0    2.0
0   a    2.0   11.0
4   b    3.0    2.0
6   c    4.0    NaN
1   a    5.0   11.0
5   b    7.0    2.0
2   a    9.0   11.0
7   d    NaN   33.0

df.join() → 直接通过索引连接

# df.join() → 直接通过索引连接

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
print(left)
print(right)
print(left.join(right))
print(left.join(right, how='outer'))  
print('-----')
# 等价于:pd.merge(left, right, left_index=True, right_index=True, how='outer')

df1 = pd.DataFrame({'key':list('bbacaab'),
                    'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),
                    'date2':[11,2,33]})
print(df1)
print(df2)
print(pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2')))  
print(df1.join(df2['date2']))
print('-----')
# suffixes=('_x', '_y')默认

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])
print(left)
print(right)
print(left.join(right, on = 'key'))
# 等价于pd.merge(left, right, left_on='key', right_index=True, how='left', sort=False);
# left的‘key’和right的index
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
-----
  key  data1
0   b      1
1   b      3
2   a      2
3   c      4
4   a      5
5   a      9
6   b      7
  key  date2
0   a     11
1   b      2
2   d     33
  key_1  data1 key_2  date2
0     b      1     a     11
1     b      3     b      2
2     a      2     d     33
  key  data1  date2
0   b      1   11.0
1   b      3    2.0
2   a      2   33.0
3   c      4    NaN
4   a      5    NaN
5   a      9    NaN
6   b      7    NaN
-----
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K0
3  A3  B3  K1
     C   D
K0  C0  D0
K1  C1  D1
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1
课堂作业
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['key'] = list('abc')
print('创建df1为:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('创建df2为:')
df2['key'] = list('bcd')
print(df2)

df3 = pd.merge(df1,df2,on='key',how='outer')
print('合并df3(取并集)为:')
print(df3)
创建df1为:
    values1 key
0  0.363363   a
1  0.705128   b
2  0.514941   c
创建df2为:
    values2 key
0  0.305494   b
1  0.243707   c
2  0.816473   d
合并df3(取并集)为:
    values1 key   values2
0  0.363363   a       NaN
1  0.705128   b  0.305494
2  0.514941   c  0.243707
3       NaN   d  0.816473
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('创建df1为:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'])
print('创建df2为:')
df2['rkey'] = list('bcd')
print(df2)

df3 = pd.merge(df1,df2,left_on='lkey',right_on='rkey',how='left')
print('合并df3(左连接,保留left所有)为:')
print(df3)
创建df1为:
    values1 lkey
0  0.625525    a
1  0.121965    b
2  0.114507    c
创建df2为:
    values2 rkey
0  0.406097    b
1  0.922127    c
2  0.326960    d
合并df3(左连接,保留left所有)为:
    values1 lkey   values2 rkey
0  0.625525    a       NaN  NaN
1  0.121965    b  0.406097    b
2  0.114507    c  0.922127    c
df1 = pd.DataFrame(np.random.rand(3),columns=['values1'])
df1['lkey'] = list('abc')
print('创建df1为:')
print(df1)
df2 = pd.DataFrame(np.random.rand(3),columns=['values2'],index=list('bcd'))
print('创建df2为:')
df2['value3'] = [5,6,7]
print(df2)

df3 = pd.merge(df1,df2,left_on='lkey',right_index=True,how='inner')
print('合并df3(内连接,取交集)为:')
print(df3)
创建df1为:
    values1 lkey
0  0.509719    a
1  0.157929    b
2  0.392352    c
创建df2为:
    values2  value3
b  0.805541       5
c  0.897287       6
d  0.093350       7
合并df3(内连接,取交集)为:
    values1 lkey   values2  value3
1  0.157929    b  0.805541       5
2  0.392352    c  0.897287       6

连接与修补 concat、combine_first

'''
【课程2.17】  连接与修补 concat、combine_first

连接 - 沿轴执行连接操作

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
 
'''

连接:concat

# 连接:concat

s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s1,s2]))
print(pd.concat([s3,s4]).sort_index())
print('-----')
# 默认axis=0,行+行

print(pd.concat([s3,s4], axis=1))
print('-----')
# axis=1,列+列,成为一个Dataframe;按索引对齐,索引取并集,对不上的位置为NaN
0    1
1    2
2    3
0    2
1    3
2    4
dtype: int64
a    1
b    2
c    2
d    4
e    3
h    3
dtype: int64
-----
     0    1
a  1.0  NaN
b  NaN  2.0
c  2.0  NaN
d  NaN  4.0
e  NaN  3.0
h  3.0  NaN
-----


D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  if sys.path[0] == '':

连接方式:join,join_axes

# 连接方式:join,join_axes

s5 = pd.Series([1,2,3],index = ['a','b','c'])
s6 = pd.Series([2,3,4],index = ['b','c','d'])
print(pd.concat([s5,s6], axis= 1))
print(pd.concat([s5,s6], axis= 1, join='inner'))
print(pd.concat([s5,s6], axis= 1, join_axes=[['a','b','d']]))
# join:{'inner','outer'},默认为'outer'。决定如何处理另一条轴上的索引:outer取并集,inner取交集。
# join_axes:指定联合的index
     0    1
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   0  1
b  2  2
c  3  3
     0    1
a  1.0  NaN
b  2.0  2.0
d  NaN  4.0


D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """

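补充说明:join_axes 参数在较新版本的 pandas 中已被移除;如需指定结果的索引,可以先 concat 再 reindex,效果与上面 join_axes 的例子相同:

# 补充:新版本可用 reindex 代替 join_axes
print(pd.concat([s5,s6], axis=1).reindex(['a','b','d']))
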
覆盖列名(!!!)

# 覆盖列名

sre = pd.concat([s5,s6], keys = ['one','two'])
print(sre,type(sre))
print(sre.index)
print('-----')
# keys:序列,默认值无。使用传递的键作为最外层构建层次索引

sre = pd.concat([s5,s6], axis=1, keys = ['one','two'])
print(sre,type(sre))
# axis = 1, 覆盖列名
one  a    1
     b    2
     c    3
two  b    2
     c    3
     d    4
dtype: int64 
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
-----
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0 


D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:9: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  if __name__ == '__main__':

修补 pd.combine_first()

# 修补 pd.combine_first()

df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],[np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],index=[1, 2])
print(df1)
print(df2)
print(df1.combine_first(df2))
print('-----')
# 根据index,df1的空值被df2替代
# 如果df2的index多于df1,多出的index对应的行也会并入结果

df1.update(df2)
print(df1)
# update,直接df2覆盖df1,相同index位置
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN  NaN
2  NaN  7.0  NaN
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0
-----
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
课堂作业
df1 = pd.DataFrame(np.random.rand(4,2),index=list('abcd'),columns=['value1','value2'])
print("创建df1为:")
print(df1)
print('----------')

df2 = pd.DataFrame(np.random.rand(4,2),index=list('efgh'),columns=['value1','value2'])
print("创建df2为:")
print(df2)
print('----------')

df3 = pd.concat([df1,df2])
print('堆叠为df3:')
print(df3)
创建df1为:
     value1    value2
a  0.261681  0.109421
b  0.782509  0.374875
c  0.447257  0.056709
d  0.349732  0.669266
----------
创建df2为:
     value1    value2
e  0.902231  0.531241
f  0.818947  0.537972
g  0.052821  0.696736
h  0.098303  0.911916
----------
堆叠为df3:
     value1    value2
a  0.261681  0.109421
b  0.782509  0.374875
c  0.447257  0.056709
d  0.349732  0.669266
e  0.902231  0.531241
f  0.818947  0.537972
g  0.052821  0.696736
h  0.098303  0.911916
data = np.random.rand(4,2)
data[1:3,0] = np.NAN
df1 = pd.DataFrame(data,index=list('abcd'),columns=['value1','value2'])
print("创建df1为:")

print(df1)
print('----------')

df2 = pd.DataFrame(np.arange(8).reshape(4,2),index=list('abcd'),columns=['value1','value2'])
print("创建df2为:")
print(df2)
print('----------')

df3 = df1.combine_first(df2)
print('df1修补后为:')
print(df3)
创建df1为:
     value1    value2
a  0.451591  0.556266
b       NaN  0.943348
c       NaN  0.944175
d  0.273202  0.594670
----------
创建df2为:
   value1  value2
a       0       1
b       2       3
c       4       5
d       6       7
----------
df1修补后为:
     value1    value2
a  0.451591  0.556266
b  2.000000  0.943348
c  4.000000  0.944175
d  0.273202  0.594670

去重及替换 .duplicated / .replace

'''
【课程2.18】  去重及替换

.duplicated / .replace
 
'''

去重 .duplicated

# 去重 .duplicated

s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')
# 判断是否重复
# 通过布尔判断,得到不重复的值

s_re = s.drop_duplicates()
print(s_re)
print('-----')
# .drop_duplicates()移除重复
# inplace参数:是否替换原值,默认False

df = pd.DataFrame({'key1':['a','a',3,4,5],
                   'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
# Dataframe中使用duplicated
0     False
1      True
2      True
3      True
4     False
5      True
6      True
7     False
8     False
9     False
10     True
11     True
12     True
dtype: bool
0    1
4    2
7    3
8    4
9    5
dtype: int64
-----
0    1
4    2
7    3
8    4
9    5
dtype: int64
-----
0    False
1     True
2    False
3    False
4    False
dtype: bool
0    False
1     True
2    False
3     True
4    False
Name: key2, dtype: bool
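
补充:duplicated / drop_duplicates 还有常用参数 keep(控制保留第一次还是最后一次出现)和 subset(DataFrame 上指定按哪些列判断重复);示例如下:

# 补充:keep 与 subset 参数
print(s.duplicated(keep='last'))              # 保留最后一次出现,前面的标记为重复
print(df.drop_duplicates(subset=['key2']))    # 只按 key2 列判断重复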

替换 .replace

# 替换 .replace

s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'] ,np.nan))
print(s.replace({'a':'hello world!','s':123}))
# 可一次性替换一个值或多个值
# 可传入列表或字典
0    NaN
1      s
2      c
3    NaN
4    NaN
5      z
6      s
7      d
dtype: object
0    NaN
1    NaN
2      c
3    NaN
4    NaN
5      z
6    NaN
7      d
dtype: object
0    hello world!
1             123
2               c
3    hello world!
4    hello world!
5               z
6             123
7               d
dtype: object

数据分组!!!(重要)

'''
【课程2.19】  数据分组

分组统计 - groupby功能

① 根据某些条件将数据拆分成组
② 对每个组独立应用函数
③ 将结果合并到一个数据结构中

Dataframe在行(axis=0)或列(axis=1)上进行分组,将一个函数应用到各个分组并产生一个新值,然后函数执行结果被合并到最终的结果对象中。

df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
 
'''

groupby分组

# 分组

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print('------')

print(df.groupby('A'), type(df.groupby('A')))
print('------')
# 直接分组得到一个groupby对象,是一个中间数据,没有进行计算

a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean()  # 以A分组,算D的平均值
print(a,type(a),'\n',a.columns)
print(b,type(b),'\n',b.columns)
print(c,type(c))
# 通过分组后的计算,得到一个新的dataframe
# 默认axis = 0,以行来分组
# 可单个或多个([])列分组
     A      B         C         D
0  foo    one  0.172157  1.118132
1  bar    one  0.323895  1.188046
2  foo    two -1.048614 -0.747383
3  bar  three  0.338934  1.587185
4  foo    two  0.423342 -1.542578
5  bar    two  0.255962  1.337651
6  foo    one  0.225461  0.557273
7  foo  three -0.748118  0.418550
------
 
------
            C         D
A                      
bar  0.306263  1.370960
foo -0.195154 -0.039201  
 Index(['C', 'D'], dtype='object')
                  C         D
A   B                        
bar one    0.323895  1.188046
    three  0.338934  1.587185
    two    0.255962  1.337651
foo one    0.198809  0.837702
    three -0.748118  0.418550
    two   -0.312636 -1.144981  
 Index(['C', 'D'], dtype='object')
A
bar    1.370960
foo   -0.039201
Name: D, dtype: float64 

分组 - 可迭代对象

# 分组 - 可迭代对象

df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')

print(list(df.groupby('X')), '→ 可迭代对象,直接生成list\n')
print(list(df.groupby('X'))[0], '→ 以元组形式显示\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
# n是组名,g是分组后的Dataframe

print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group()提取分组后的组

grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A'])  # 也可写:df.groupby('X').groups['A']
print('-----')
# .groups:将分组后的groups转为dict
# 可以字典索引方法来查看groups里的元素

sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size():查看分组后的长度

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# 按照两个列进行分组
   X  Y
0  A  1
1  B  4
2  A  3
3  B  2
 
-----
[('A',    X  Y
0  A  1
2  A  3), ('B',    X  Y
1  B  4
3  B  2)] → 可迭代对象,直接生成list

('A',    X  Y
0  A  1
2  A  3) → 以元组形式显示

A
   X  Y
0  A  1
2  A  3
###
B
   X  Y
1  B  4
3  B  2
###
-----
   X  Y
0  A  1
2  A  3 

   X  Y
1  B  4
3  B  2 

-----
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
-----
X
A    2
B    2
dtype: int64 
-----
     A      B         C         D
0  foo    one  0.981468  0.473817
1  bar    one -1.236826  0.028449
2  foo    two -1.611723  1.444489
3  bar  three  1.136316  0.881776
4  foo    two  0.523383  0.707726
5  bar    two -2.196340 -0.201260
6  foo    one  1.014091  0.256455
7  foo  three -1.700698  1.217236
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
Int64Index([7], dtype='int64')

其他轴上的分组

# 其他轴上的分组

df = pd.DataFrame({'data1':np.random.rand(2),
                   'data2':np.random.rand(2),
                   'key1':[1,'b'],
                   'key2':['one','two']})
print(df)
print(df.dtypes,type(df.dtypes))  #返回的是一个Series
print('-----')
for n,p in df.groupby(df.dtypes, axis=1):
    print(n)
    print(p)
    print('##')
# 按照值类型分列
      data1     data2 key1 key2
0  0.572579  0.924789    1  one
1  0.575395  0.814979    b  two
data1    float64
data2    float64
key1      object
key2      object
dtype: object 
-----
float64
      data1     data2
0  0.572579  0.924789
1  0.575395  0.814979
##
object
  key1 key2
0    1  one
1    b  two
##

通过字典或者Series分组

# 通过字典或者Series分组

df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'])
print(df)
print('-----')
# 通过字典可以将多列变成一个自定义分组
mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
print('-----')
# mapping中,a、b列对应的为one,c、d列对应的为two,以字典来分组

s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
# s中,index中a、b对应的为one,c、d对应的为two,以Series来分组
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
-----
   one  two
0    1    5
1    9   13
2   17   21
3   25   29
-----
a      one
b      one
c      two
d      two
e    three
dtype: object 

one      2
three    1
two      2
dtype: int64

通过函数分组

# 通过函数分组

df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'],
                 index = ['abc','bcd','aa','b'])
print(df,'\n')
print(df.groupby(len).sum())# 默认传递的参数是索引
# 按照字母长度分组
      a   b   c   d
abc   0   1   2   3
bcd   4   5   6   7
aa    8   9  10  11
b    12  13  14  15 

    a   b   c   d
1  12  13  14  15
2   8   9  10  11
3   4   6   8  10

分组计算函数方法

# 分组计算函数方法

s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)  # 索引有重复时,用.groupby(level=0)把相同index的值分为一组
print(grouped)
print(grouped.first(),'→ first:非NaN的第一个值\n')
print(grouped.last(),'→ last:非NaN的最后一个值\n')
print(grouped.sum(),'→ sum:非NaN的和\n')
print(grouped.mean(),'→ mean:非NaN的平均值\n')
print(grouped.median(),'→ median:非NaN的算术中位数\n')
print(grouped.count(),'→ count:非NaN的值\n')
print(grouped.min(),'→ min、max:非NaN的最小值、最大值\n')
print(grouped.std(),'→ std,var:非NaN的标准差和方差\n')
print(grouped.prod(),'→ prod:非NaN的积\n')

1    1
2    2
3    3
dtype: int64 → first:非NaN的第一个值

1    10
2    20
3    30
dtype: int64 → last:非NaN的最后一个值

1    11
2    22
3    33
dtype: int64 → sum:非NaN的和

1     5.5
2    11.0
3    16.5
dtype: float64 → mean:非NaN的平均值

1     5.5
2    11.0
3    16.5
dtype: float64 → median:非NaN的算术中位数

1    2
2    2
3    2
dtype: int64 → count:非NaN的值

1    1
2    2
3    3
dtype: int64 → min、max:非NaN的最小值、最大值

1     6.363961
2    12.727922
3    19.091883
dtype: float64 → std,var:非NaN的标准差和方差

1    10
2    40
3    90
dtype: int64 → prod:非NaN的积

分组多函数计算:agg()

# 多函数计算:agg()

df = pd.DataFrame({'a':[1,1,2,2],
                   'b':np.random.rand(4),
                   'c':np.random.rand(4),
                   'd':np.random.rand(4)})
print(df)
print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg({'result1':np.mean,
                                'result2':np.sum})) # 快过时了
# 函数写法可以用str,或者np.方法
# 可以通过list,dict传入,当用dict时,key名为columns
   a         b         c         d
0  1  0.456934  0.286735  0.889033
1  1  0.354812  0.117281  0.476132
2  2  0.958267  0.239303  0.276428
3  2  0.840423  0.544267  0.514867
          b                   c                   d          
       mean       sum      mean       sum      mean       sum
a                                                            
1  0.405873  0.811746  0.202008  0.404016  0.682582  1.365165
2  0.899345  1.798690  0.391785  0.783570  0.395648  0.791296
    result1   result2
a                    
1  0.405873  0.811746
2  0.899345  1.798690


D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  # Remove the CWD from sys.path while we load stuff.
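
补充:上面给 agg 传字典重命名结果列的写法已被弃用(见警告);较新版本的 pandas 推荐使用命名聚合,用关键字参数直接指定结果列名,效果相同:

# 补充:命名聚合(较新版本 pandas 的写法)
print(df.groupby('a')['b'].agg(result1='mean', result2='sum'))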
课堂作业
df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
                   'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
                   'C' : np.arange(10,26,2),
                   'D' : np.random.randn(8),
                   'E' : np.random.rand(8)})
print(df)
df2 = df.groupby(['A'])[['C','D']].mean()
print(df2)
df2 = df.groupby(['A','B'])[['D','E']].sum()
print(df2)

dica = df.groupby(['A']).groups
print(dica)

# dtypes = df.dtypes
# print(dtypes)
dt = df.groupby(df.dtypes, axis=1).sum()  # axis=1就是按照列进行分组
print(dt)

print('---------')
mapping = {'C':'one','D':'one'}
dm = df.groupby(mapping,axis=1).groups
print(dm)  # 这个分组结构只会包含one
dmap = df.groupby(mapping,axis=1).sum()  # 凡是按照列分组,加上axis=1就对了 ; 这是将C列和D列相加
print(dmap)  # 凡是按照列分组,加上axis=1就对了

print('-------')
print(df.groupby(mapping,axis=1).get_group('one'))

print('-------')
dcd = df.groupby(mapping,axis=1).get_group('one').sum()  # 这是将 C列和D列从one中提取出来,分别对它们各自求和,注意,返回的是一个Series
print(dcd)                                                 # 其中index为 CD

print('-------')
db = df.groupby(['B']).agg(['mean',np.sum,'max','min'])
print(db)
       A  B   C         D         E
0    one  h  10  1.006026  0.133697
1    two  h  12 -0.359184  0.976752
2  three  h  14  0.066493  0.933959
3    one  h  16 -1.462475  0.614514
4    two  f  18  2.007785  0.458461
5  three  f  20 -1.650301  0.805937
6    one  f  22 -0.197564  0.760070
7    two  f  24 -1.654774  0.633005
        C         D
A                  
one    16 -0.218005
three  17 -0.791904
two    18 -0.002057
                D         E
A     B                    
one   f -0.197564  0.760070
      h -0.456449  0.748211
three f -1.650301  0.805937
      h  0.066493  0.933959
two   f  0.353012  1.091466
      h -0.359184  0.976752
{'one': Int64Index([0, 3, 6], dtype='int64'), 'three': Int64Index([2, 5], dtype='int64'), 'two': Int64Index([1, 4, 7], dtype='int64')}
   int32   float64  object
0     10  1.139723    oneh
1     12  0.617569    twoh
2     14  1.000452  threeh
3     16 -0.847962    oneh
4     18  2.466246    twof
5     20 -0.844364  threef
6     22  0.562506    onef
7     24 -1.021768    twof
---------
{'one': Index(['C', 'D'], dtype='object')}
         one
0  11.006026
1  11.640816
2  14.066493
3  14.537525
4  20.007785
5  18.349699
6  21.802436
7  22.345226
-------
    C         D
0  10  1.006026
1  12 -0.359184
2  14  0.066493
3  16 -1.462475
4  18  2.007785
5  20 -1.650301
6  22 -0.197564
7  24 -1.654774
-------
C    136.000000
D     -2.243994
dtype: float64
-------
     C                     D                                       E  \
  mean sum max min      mean       sum       max       min      mean   
B                                                                      
f   21  84  24  18 -0.373713 -1.494854  2.007785 -1.654774  0.664369   
h   13  52  16  10 -0.187285 -0.749141  1.006026 -1.462475  0.664731   

                                 
        sum       max       min  
B                                
f  2.657474  0.805937  0.458461  
h  2.658922  0.976752  0.133697  

分组转换及一般性“拆分-应用-合并”

'''
【课程2.20】  分组转换及一般性“拆分-应用-合并”

transform / apply
 
'''

数据分组转换,transform

# 数据分组转换,transform

df = pd.DataFrame({'data1':np.random.rand(5),
                   'data2':np.random.rand(5),
                   'key1':list('aabba'),
                   'key2':['one','two','one','two','one']})
k_mean = df.groupby('key1').mean()
print(df)
print(k_mean)
print(pd.merge(df,k_mean,left_on='key1',right_index=True).add_prefix('mean_'))  # .add_prefix('mean_'):添加前缀
print('-----')
# 通过分组、合并,得到一个包含均值的Dataframe

print(df.groupby('key2').mean()) # 按照key2分组求均值
print(df.groupby('key2').transform(np.mean))
# data1、data2每个位置元素取对应分组列的均值
# 字符串不能进行计算
      data1     data2 key1 key2
0  0.234441  0.600356    a  one
1  0.773225  0.730067    a  two
2  0.483987  0.637845    b  one
3  0.243679  0.997665    b  two
4  0.882532  0.617680    a  one
         data1     data2
key1                    
a     0.630066  0.649368
b     0.363833  0.817755
   mean_data1_x  mean_data2_x mean_key1 mean_key2  mean_data1_y  mean_data2_y
0      0.234441      0.600356         a       one      0.630066      0.649368
1      0.773225      0.730067         a       two      0.630066      0.649368
4      0.882532      0.617680         a       one      0.630066      0.649368
2      0.483987      0.637845         b       one      0.363833      0.817755
3      0.243679      0.997665         b       two      0.363833      0.817755
-----
         data1     data2
key2                    
one   0.533653  0.618627
two   0.508452  0.863866
      data1     data2
0  0.533653  0.618627
1  0.508452  0.863866
2  0.533653  0.618627
3  0.508452  0.863866
4  0.533653  0.618627
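
补充:transform 返回与原表等长、索引对齐的结果,因此很适合做组内计算,例如组内去均值(每个值减去所在组的均值);下面是一个简单示意:

# 补充:用 transform 做组内去均值
demean = df[['data1','data2']] - df.groupby('key2')[['data1','data2']].transform(np.mean)
print(demean)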

一般化Groupby方法:apply

# 一般化Groupby方法:apply

df = pd.DataFrame({'data1':np.random.rand(5),
                   'data2':np.random.rand(5),
                   'key1':list('aabba'),
                   'key2':['one','two','one','two','one']})

print(df.groupby('key1').apply(lambda x: x.describe()))
# apply直接运行其中的函数
# 这里为匿名函数,直接描述分组后的统计量

def f_df1(d,n):
    return(d.sort_index()[:n])
def f_df2(d,k1):
    return(d[k1])
print(df.groupby('key1').apply(f_df1,2),'\n')
print(df.groupby('key1').apply(f_df2,'data2'))
print(type(df.groupby('key1').apply(f_df2,'data2')))
# f_df1函数:返回排序后的前n行数据
# f_df2函数:返回分组后表的k1列,结果为Series,层次化索引
# 直接运行f_df函数
# 参数直接写在后面,也可以为.apply(f_df,n = 2))
               data1     data2
key1                          
a    count  3.000000  3.000000
     mean   0.545712  0.522070
     std    0.315040  0.463898
     min    0.202184  0.026861
     25%    0.408011  0.309841
     50%    0.613838  0.592821
     75%    0.717477  0.769675
     max    0.821116  0.946529
b    count  2.000000  2.000000
     mean   0.446845  0.399589
     std    0.311004  0.160466
     min    0.226932  0.286123
     25%    0.336888  0.342856
     50%    0.446845  0.399589
     75%    0.556801  0.456322
     max    0.666758  0.513056
           data1     data2 key1 key2
key1                                
a    0  0.202184  0.946529    a  one
     1  0.821116  0.592821    a  two
b    2  0.226932  0.286123    b  one
     3  0.666758  0.513056    b  two 

key1   
a     0    0.946529
      1    0.592821
      4    0.026861
b     2    0.286123
      3    0.513056
Name: data2, dtype: float64

课堂作业
df = pd.DataFrame({'data1':np.random.rand(8),
                   'data2':np.random.rand(8),
                   'key':list('aabbabab')})

print('创建df为:\n',df,'\n------')
df2 = df.groupby(['key']).mean()
print(df2)
df3 = pd.merge(df,df2,left_on='key',right_index=True).add_prefix('mean_')
print('求均值且合并后的结果为:')
print(df3)

# df_ = df.groupby('key').transform(np.mean)
# print('求和且合并之后结果为:\n',df.join(df_,rsuffix='_mean'),'\n------')
创建df为:
       data1     data2 key
0  0.841120  0.987305   a
1  0.965404  0.734070   a
2  0.511385  0.044053   b
3  0.912349  0.828049   b
4  0.819506  0.131610   a
5  0.723875  0.642737   b
6  0.822328  0.457494   a
7  0.107970  0.936853   b 
------
        data1     data2
key                    
a    0.862090  0.577619
b    0.563895  0.612923
求均值且合并后的结果为:
   mean_data1_x  mean_data2_x mean_key  mean_data1_y  mean_data2_y
0      0.841120      0.987305        a      0.862090      0.577619
1      0.965404      0.734070        a      0.862090      0.577619
4      0.819506      0.131610        a      0.862090      0.577619
6      0.822328      0.457494        a      0.862090      0.577619
2      0.511385      0.044053        b      0.563895      0.612923
3      0.912349      0.828049        b      0.563895      0.612923
5      0.723875      0.642737        b      0.563895      0.612923
7      0.107970      0.936853        b      0.563895      0.612923

透视表及交叉表

'''
【课程2.21】  透视表及交叉表

类似excel数据透视 - pivot table / crosstab
 
'''

透视表:pivot_table

# 透视表:pivot_table
# pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

date = ['2017-5-1','2017-5-2','2017-5-3']*3
rng = pd.to_datetime(date)
df = pd.DataFrame({'date':rng,
                   'key':list('abcdabcda'),
                   'values':np.random.rand(9)*10})
print(df)
print('-----')

# 相当于 df.groupby(index)[values].aggfunc(),columns 参数的作用见下
print(pd.pivot_table(df, values = 'values', index = 'date', aggfunc=np.sum))  # 也可以写 aggfunc='sum'
print(pd.pivot_table(df, values = 'values', index = 'date', columns = 'key', aggfunc=np.sum))  # 也可以写 aggfunc='sum'
                                    # 如果加上了columns,就会以该列的取值作为生成的DataFrame的columns,
                                    # 而columns下对应的值就是原表中由index和columns共同确定的values聚合值
print('-----')
# data:DataFrame对象
# values:要聚合的列或列的列表
# index:数据透视表的index,从原数据的列中筛选
# columns:数据透视表的columns,从原数据的列中筛选
# aggfunc:用于聚合的函数,默认为numpy.mean,支持numpy计算方法

print(pd.pivot_table(df, values = 'values', index = ['date','key'], aggfunc=len))
print('-----')
# 这里就分别以date、key共同做数据透视,值为values:统计不同(date,key)组合下values出现的个数
# aggfunc=len(或者count):计数
        date key    values
0 2017-05-01   a  1.573759
1 2017-05-02   b  3.750596
2 2017-05-03   c  4.958902
3 2017-05-01   d  0.797226
4 2017-05-02   a  5.757876
5 2017-05-03   b  0.082909
6 2017-05-01   c  3.799717
7 2017-05-02   d  0.754402
8 2017-05-03   a  3.117813
-----
               values
date                 
2017-05-01   6.170701
2017-05-02  10.262874
2017-05-03   8.159623
key                a         b         c         d
date                                              
2017-05-01  1.573759       NaN  3.799717  0.797226
2017-05-02  5.757876  3.750596       NaN  0.754402
2017-05-03  3.117813  0.082909  4.958902       NaN
-----
                values
date       key        
2017-05-01 a       1.0
           c       1.0
           d       1.0
2017-05-02 a       1.0
           b       1.0
           d       1.0
2017-05-03 a       1.0
           b       1.0
           c       1.0
-----

交叉表:crosstab

# 交叉表:crosstab
# 默认情况下,crosstab计算因子的频率表,比如用于str的数据透视分析
# pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)

df = pd.DataFrame({'A': [1, 2, 2, 2, 2],
                   'B': [3, 3, 4, 4, 4],
                   'C': [1, 1, np.nan, 1, 1]})
print(df)
print('-----')

print(pd.crosstab(df['A'],df['B']))
print('-----')
# 如果crosstab只接收两个Series,它将提供一个频率表。
# 用A的唯一值,统计B唯一值的出现次数

print(pd.crosstab(df['A'],df['B'],normalize=True))
print('-----')
# normalize:默认False,将所有值除以值的总和进行归一化 → 为True时候显示百分比

print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc=np.sum))
print('-----')
# values:可选,根据因子聚合的值数组
# aggfunc:可选,如果未传递values数组,则计算频率表,如果传递数组,则按照指定计算
# 这里相当于以A和B界定分组,计算出每组中第三个系列C的值

print(pd.crosstab(df['A'],df['B'],values=df['C'],aggfunc=np.sum, margins=True))
print('-----')
# margins:布尔值,默认值False,添加行/列边距(小计)
   A  B    C
0  1  3  1.0
1  2  3  1.0
2  2  4  NaN
3  2  4  1.0
4  2  4  1.0
-----
B  3  4
A      
1  1  0
2  1  3
-----
B    3    4
A          
1  0.2  0.0
2  0.2  0.6
-----
B    3    4
A          
1  1.0  NaN
2  1.0  2.0
-----
B      3    4  All
A                 
1    1.0  NaN  1.0
2    1.0  2.0  3.0
All  2.0  2.0  4.0
-----
课堂作业
df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
                   'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
                   'C' : np.arange(10,26,2),
                   'D' : np.random.randn(8),
                   'E' : np.random.rand(8)})

print(df)
print('---------')
print(pd.pivot_table(df,index=['A'],values=['C','D'],aggfunc='mean'))
print('---------')
print(pd.pivot_table(df,index=['A','B'],values=['D','E'],aggfunc=['mean','sum']))
print('---------')
print(pd.pivot_table(df,index=['B'],values=['C'],columns=['A'],aggfunc='count')) # 或者用下面这种
print('---------')
print(pd.crosstab(df['B'],df['A'])) # 一般使用交叉表计算频率
       A  B   C         D         E
0    one  h  10  1.801648  0.234444
1    two  h  12  1.015224  0.473324
2  three  h  14  1.145384  0.423148
3    one  h  16  0.782241  0.053959
4    two  f  18 -0.015952  0.669829
5  three  f  20 -0.356324  0.455806
6    one  f  22 -1.555999  0.136985
7    two  f  24  1.791435  0.448069
---------
        C         D
A                  
one    16  0.342630
three  17  0.394530
two    18  0.930235
---------
             mean                 sum          
                D         E         D         E
A     B                                        
one   f -1.555999  0.136985 -1.555999  0.136985
      h  1.291944  0.144202  2.583889  0.288403
three f -0.356324  0.455806 -0.356324  0.455806
      h  1.145384  0.423148  1.145384  0.423148
two   f  0.887741  0.558949  1.775482  1.117898
      h  1.015224  0.473324  1.015224  0.473324
---------
    C          
A one three two
B              
f   1     1   2
h   2     1   1
---------
A  one  three  two
B                 
f    1      1    2
h    2      1    1

数据读取

'''
【课程2.22】  数据读取

核心:read_table, read_csv, read_excel
 
'''

读取普通分隔数据:read_table

# 读取普通分隔数据:read_table
# 可以读取txt,csv

import os
os.chdir('D:/data/pandasData/')

data1 = pd.read_table('data1.txt', delimiter=',',header = 0, index_col=1)
print(data1)
# delimiter:用于拆分的字符,也可以用sep:sep = ','
# header:用做列名的序号,默认为0(第一行)
# index_col:指定某列为行索引,否则自动索引0, 1, .....

# read_table主要用于读取简单的数据,txt/csv
     va1  va3  va4
va2               
2      1    3    4
3      2    4    5
4      3    5    6
5      4    6    7


D:\python\Anaconda3\lib\site-packages\ipykernel_launcher.py:7: FutureWarning: read_table is deprecated, use read_csv instead.
  import sys

读取csv数据:read_csv

# 读取csv数据:read_csv
# 先熟悉一下excel怎么导出csv

data2 = pd.read_csv('data2.csv',engine = 'python')
print(data2.head())
# engine:使用的分析引擎。可以选择C或者是python。C引擎快但是Python引擎功能更加完备。
# encoding:指定字符集类型,即编码,通常指定为'utf-8'

# 大多数情况先将excel导出csv,再读取
   省级政区代码 省级政区名称  地市级政区代码 地市级政区名称    年份 党委书记姓名  出生年份  出生月份  籍贯省份代码 籍贯省份名称  \
0  130000    河北省   130100    石家庄市  2000    陈来立   NaN   NaN     NaN    NaN   
1  130000    河北省   130100    石家庄市  2001    吴振华   NaN   NaN     NaN    NaN   
2  130000    河北省   130100    石家庄市  2002    吴振华   NaN   NaN     NaN    NaN   
3  130000    河北省   130100    石家庄市  2003    吴振华   NaN   NaN     NaN    NaN   
4  130000    河北省   130100    石家庄市  2004    吴振华   NaN   NaN     NaN    NaN   

   ...   民族  教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科  专业:理工  专业:农科  专业:医科  入党年份  工作年份  
0  ...  NaN  硕士              1.0   NaN   NaN    NaN    NaN    NaN   NaN   NaN  
1  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
2  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
3  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
4  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  

[5 rows x 23 columns]
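
补充:read_csv 还有一些常用参数,如 sep(分隔符)、encoding(编码)、usecols(只读部分列)、nrows(只读前几行);下面的写法仅作示意,仍以上面的 data2.csv 为例:

# 补充:read_csv 常用参数示意
data2_part = pd.read_csv('data2.csv', engine='python', usecols=[0,1,2], nrows=10)
print(data2_part)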

读取excel数据:read_excel

# 读取excel数据:read_excel

data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',sheet_name='中国人民共和国地市级党委书记数据库(2000-10)',header=0)
print(data3)
# io :文件路径。
# sheet_name:返回多表使用sheet_name=[0,1],若sheet_name=None则返回全部表 → ① int/string 返回的是dataframe ② 而None和list返回的是dict
# header:指定列名行,默认0,即取第一行
# index_col:指定列为索引列,也可以使用u”strings”
      省级政区代码    省级政区名称  地市级政区代码   地市级政区名称    年份 党委书记姓名  出生年份  出生月份  籍贯省份代码  \
0     130000       河北省   130100      石家庄市  2000    陈来立   NaN   NaN     NaN   
1     130000       河北省   130100      石家庄市  2001    吴振华   NaN   NaN     NaN   
2     130000       河北省   130100      石家庄市  2002    吴振华   NaN   NaN     NaN   
3     130000       河北省   130100      石家庄市  2003    吴振华   NaN   NaN     NaN   
4     130000       河北省   130100      石家庄市  2004    吴振华   NaN   NaN     NaN   
5     130000       河北省   130100      石家庄市  2005    吴振华   NaN   NaN     NaN   
6     130000       河北省   130100      石家庄市  2006    吴振华   NaN   NaN     NaN   
7     130000       河北省   130100      石家庄市  2007    吴显国   NaN   NaN     NaN   
8     130000       河北省   130100      石家庄市  2008    吴显国   NaN   NaN     NaN   
9     130000       河北省   130100      石家庄市  2009     车俊   NaN   NaN     NaN   
10    130000       河北省   130100      石家庄市  2010    孙瑞彬   NaN   NaN     NaN   
11    130000       河北省   130200       唐山市  2000    白润璋   NaN   NaN     NaN   
12    130000       河北省   130200       唐山市  2001    白润璋   NaN   NaN     NaN   
13    130000       河北省   130200       唐山市  2002    白润璋   NaN   NaN     NaN   
14    130000       河北省   130200       唐山市  2003     张和   NaN   NaN     NaN   
15    130000       河北省   130200       唐山市  2004     张和   NaN   NaN     NaN   
16    130000       河北省   130200       唐山市  2005     张和   NaN   NaN     NaN   
17    130000       河北省   130200       唐山市  2006     张和   NaN   NaN     NaN   
18    130000       河北省   130200       唐山市  2007     赵勇   NaN   NaN     NaN   
19    130000       河北省   130200       唐山市  2008     赵勇   NaN   NaN     NaN   
20    130000       河北省   130200       唐山市  2009     赵勇   NaN   NaN     NaN   
21    130000       河北省   130200       唐山市  2010     赵勇   NaN   NaN     NaN   
22    130000       河北省   130300      秦皇岛市  2000    王建忠   NaN   NaN     NaN   
23    130000       河北省   130300      秦皇岛市  2001    王建忠   NaN   NaN     NaN   
24    130000       河北省   130300      秦皇岛市  2002    王建忠   NaN   NaN     NaN   
25    130000       河北省   130300      秦皇岛市  2003    宋长瑞   NaN   NaN     NaN   
26    130000       河北省   130300      秦皇岛市  2004    宋长瑞   NaN   NaN     NaN   
27    130000       河北省   130300      秦皇岛市  2005    宋长瑞   NaN   NaN     NaN   
28    130000       河北省   130300      秦皇岛市  2006    宋长瑞   NaN   NaN     NaN   
29    130000       河北省   130300      秦皇岛市  2007    王三堂   NaN   NaN     NaN   
...      ...       ...      ...       ...   ...    ...   ...   ...     ...   
3633  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2003    NaN   NaN   NaN     NaN   
3634  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2004    NaN   NaN   NaN     NaN   
3635  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2005    NaN   NaN   NaN     NaN   
3636  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2006    NaN   NaN   NaN     NaN   
3637  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2007    NaN   NaN   NaN     NaN   
3638  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2008    NaN   NaN   NaN     NaN   
3639  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2009    NaN   NaN   NaN     NaN   
3640  650000  新疆维吾尔自治区   654000  伊犁哈萨克自治州  2010    NaN   NaN   NaN     NaN   
3641  650000  新疆维吾尔自治区   654200      塔城地区  2000    NaN   NaN   NaN     NaN   
3642  650000  新疆维吾尔自治区   654200      塔城地区  2001    NaN   NaN   NaN     NaN   
3643  650000  新疆维吾尔自治区   654200      塔城地区  2002    NaN   NaN   NaN     NaN   
3644  650000  新疆维吾尔自治区   654200      塔城地区  2003    NaN   NaN   NaN     NaN   
3645  650000  新疆维吾尔自治区   654200      塔城地区  2004    NaN   NaN   NaN     NaN   
3646  650000  新疆维吾尔自治区   654200      塔城地区  2005    NaN   NaN   NaN     NaN   
3647  650000  新疆维吾尔自治区   654200      塔城地区  2006    NaN   NaN   NaN     NaN   
3648  650000  新疆维吾尔自治区   654200      塔城地区  2007    NaN   NaN   NaN     NaN   
3649  650000  新疆维吾尔自治区   654200      塔城地区  2008    NaN   NaN   NaN     NaN   
3650  650000  新疆维吾尔自治区   654200      塔城地区  2009    NaN   NaN   NaN     NaN   
3651  650000  新疆维吾尔自治区   654200      塔城地区  2010    NaN   NaN   NaN     NaN   
3652  650000  新疆维吾尔自治区   654300     阿勒泰地区  2000    NaN   NaN   NaN     NaN   
3653  650000  新疆维吾尔自治区   654300     阿勒泰地区  2001    NaN   NaN   NaN     NaN   
3654  650000  新疆维吾尔自治区   654300     阿勒泰地区  2002    NaN   NaN   NaN     NaN   
3655  650000  新疆维吾尔自治区   654300     阿勒泰地区  2003    NaN   NaN   NaN     NaN   
3656  650000  新疆维吾尔自治区   654300     阿勒泰地区  2004    NaN   NaN   NaN     NaN   
3657  650000  新疆维吾尔自治区   654300     阿勒泰地区  2005    NaN   NaN   NaN     NaN   
3658  650000  新疆维吾尔自治区   654300     阿勒泰地区  2006    NaN   NaN   NaN     NaN   
3659  650000  新疆维吾尔自治区   654300     阿勒泰地区  2007    NaN   NaN   NaN     NaN   
3660  650000  新疆维吾尔自治区   654300     阿勒泰地区  2008    NaN   NaN   NaN     NaN   
3661  650000  新疆维吾尔自治区   654300     阿勒泰地区  2009    NaN   NaN   NaN     NaN   
3662  650000  新疆维吾尔自治区   654300     阿勒泰地区  2010    NaN   NaN   NaN     NaN   

     籍贯省份名称  ...   民族   教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科  专业:理工  专业:农科  专业:医科  \
0       NaN  ...  NaN   硕士              1.0   NaN   NaN    NaN    NaN    NaN   
1       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
2       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
3       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
4       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
5       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
6       NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
7       NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
8       NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
9       NaN  ...  NaN   本科              1.0   0.0   1.0    0.0    0.0    0.0   
10      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
11      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
12      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
13      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
14      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
15      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
16      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
17      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
18      NaN  ...  NaN   博士              0.0   0.0   1.0    0.0    0.0    0.0   
19      NaN  ...  NaN   博士              0.0   0.0   1.0    0.0    0.0    0.0   
20      NaN  ...  NaN   博士              0.0   0.0   1.0    0.0    0.0    0.0   
21      NaN  ...  NaN   博士              0.0   0.0   1.0    0.0    0.0    0.0   
22      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
23      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
24      NaN  ...  NaN   本科              0.0   0.0   0.0    1.0    0.0    0.0   
25      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
26      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
27      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
28      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
29      NaN  ...  NaN   硕士              1.0   0.0   1.0    0.0    0.0    0.0   
...     ...  ...  ...  ...              ...   ...   ...    ...    ...    ...   
3633    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3634    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3635    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3636    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3637    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3638    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3639    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3640    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3641    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3642    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3643    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3644    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3645    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3646    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3647    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3648    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3649    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3650    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3651    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3652    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3653    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3654    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3655    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3656    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3657    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3658    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3659    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3660    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3661    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   
3662    NaN  ...  NaN  NaN              NaN   NaN   NaN    NaN    NaN    NaN   

      入党年份  工作年份  
0      NaN   NaN  
1      NaN   NaN  
2      NaN   NaN  
3      NaN   NaN  
4      NaN   NaN  
5      NaN   NaN  
6      NaN   NaN  
7      NaN   NaN  
8      NaN   NaN  
9      NaN   NaN  
10     NaN   NaN  
11     NaN   NaN  
12     NaN   NaN  
13     NaN   NaN  
14     NaN   NaN  
15     NaN   NaN  
16     NaN   NaN  
17     NaN   NaN  
18     NaN   NaN  
19     NaN   NaN  
20     NaN   NaN  
21     NaN   NaN  
22     NaN   NaN  
23     NaN   NaN  
24     NaN   NaN  
25     NaN   NaN  
26     NaN   NaN  
27     NaN   NaN  
28     NaN   NaN  
29     NaN   NaN  
...    ...   ...  
3633   NaN   NaN  
3634   NaN   NaN  
3635   NaN   NaN  
3636   NaN   NaN  
3637   NaN   NaN  
3638   NaN   NaN  
3639   NaN   NaN  
3640   NaN   NaN  
3641   NaN   NaN  
3642   NaN   NaN  
3643   NaN   NaN  
3644   NaN   NaN  
3645   NaN   NaN  
3646   NaN   NaN  
3647   NaN   NaN  
3648   NaN   NaN  
3649   NaN   NaN  
3650   NaN   NaN  
3651   NaN   NaN  
3652   NaN   NaN  
3653   NaN   NaN  
3654   NaN   NaN  
3655   NaN   NaN  
3656   NaN   NaN  
3657   NaN   NaN  
3658   NaN   NaN  
3659   NaN   NaN  
3660   NaN   NaN  
3661   NaN   NaN  
3662   NaN   NaN  

[3663 rows x 23 columns]
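
As the comments above note, passing sheet_name=None makes read_excel return a dict keyed by sheet name rather than a single DataFrame. A minimal sketch, assuming the same workbook:

# read every sheet at once: sheet_name=None → {sheet name: DataFrame}
sheets = pd.read_excel('地市级党委书记数据库(2000-10).xlsx', sheet_name=None, header=0)
for name, frame in sheets.items():
    print(name, frame.shape)    # each sheet's name and its (rows, columns) shape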
