



  • Concatenation: pd.concat, DataFrame.append
  • Merging: pd.merge, DataFrame.join

0. Review: concatenation in NumPy

import numpy as np
import pandas as pd
from pandas import Series,DataFrame



  1. Create two 3*3 matrices and concatenate them along each of the two axes
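
One possible way to approach the exercise, as a minimal sketch: the matrices `m1` and `m2` are made-up examples, and `np.concatenate` is only one of several NumPy functions that could be used here.

```python
import numpy as np

# Two 3*3 matrices with easily distinguishable values (illustrative data)
m1 = np.arange(9).reshape(3, 3)
m2 = np.arange(9, 18).reshape(3, 3)

# axis=0 stacks the rows of m2 under m1: the result has shape (6, 3)
rows = np.concatenate([m1, m2], axis=0)

# axis=1 places m2 to the right of m1: the result has shape (3, 6)
cols = np.concatenate([m1, m2], axis=1)

print(rows.shape, cols.shape)
```

The shapes must agree on every axis except the one being concatenated, which is why two 3*3 matrices can be joined along either axis.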


nd1 =np.array([1,2,3])
nd2 =np.array([-1,-2,-3,-4])
np.concatenate([nd1, nd2])
array([ 1,  2,  3, -1, -2, -3, -4])
nd3 = np.array([[-1,-2,-3],[0,2,4]])

nd1 + nd3
array([[0, 0, 0],
       [1, 4, 7]])
(2, 3)
nd1 + nd2

ValueError                                Traceback (most recent call last)

 in ()
----> 1 nd1 + nd2

ValueError: operands could not be broadcast together with shapes (3,) (4,) 


def make_df(cols,inds):
    data = {c:[c+str(i) for i in inds] for c in cols}
    return DataFrame(data,index = inds)
# when c = "a", the column holds a1, a2, a3
# when c = "b", the column holds b1, b2, b3

df1 = make_df(list("abc"),[1,2,3])
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
df2 = make_df(list('abc'),[4,5,6])
a b c
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6

1. Concatenating with pd.concat()


pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,

1) Simple concatenation

pd.concat([df1, df2])
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
df3 =make_df(list("def"),[1,2,3])
d e f
1 d1 e1 f1
2 d2 e2 f2
3 d3 e3 f3
df1 + df3
a b c d e f
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
pd.concat([df1, df3], axis = 1)
a b c d e f
1 a1 b1 c1 d1 e1 f1
2 a2 b2 c2 d2 e2 f2
3 a3 b3 c3 d3 e3 f3
pd.concat([df1,df2],axis = 1)
a b c a b c
1 a1 b1 c1 NaN NaN NaN
2 a2 b2 c2 NaN NaN NaN
3 a3 b3 c3 NaN NaN NaN
4 NaN NaN NaN a4 b4 c4
5 NaN NaN NaN a5 b5 c5
6 NaN NaN NaN a6 b6 c6



a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
df4 = make_df(list('abc'),[2,3,4])
a b c
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
pd.concat([df1, df4])
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
pd.concat([df1, df4], ignore_index=True)


a b c
0 a1 b1 c1
1 a2 b2 c2
2 a3 b3 c3
3 a2 b2 c2
4 a3 b3 c3
5 a4 b4 c4

Or use a hierarchical index via keys


pd.concat([df1,df4],keys = ["三班","四班"])
a b c
三班 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
四班 2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
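
The three ways of handling a repeated index can be compared side by side. This is a minimal sketch that rebuilds `df1` and `df4` with the same `make_df` helper; the key labels `"class3"` and `"class4"` are placeholders, and `verify_integrity=True` is shown as an extra option for rejecting duplicates outright.

```python
import pandas as pd

def make_df(cols, inds):
    # Each cell is the column name followed by the index label, e.g. "a1"
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df1 = make_df(list("abc"), [1, 2, 3])
df4 = make_df(list("abc"), [2, 3, 4])

# Default: index labels are kept, so 2 and 3 appear twice
dup = pd.concat([df1, df4])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([df1, df4], ignore_index=True)

# keys adds an outer index level naming the source of each row
labelled = pd.concat([df1, df4], keys=["class3", "class4"])

# verify_integrity=True raises instead of silently duplicating labels
try:
    pd.concat([df1, df4], verify_integrity=True)
except ValueError as exc:
    print("duplicate index rejected:", exc)
```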



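  1. Think about where concatenation is useful.

  2. Using what you learned yesterday, build a midterm score table ddd for Zhang San and Li Si.

  3. Suppose a new exam subject, "Computer Science", is added. How would you add it?

  4. How would you add the scores of a new student, Wang Laowu?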


2) Concatenation with mismatched labels


a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
df5 = make_df(list("abcd"),[3,4,5,6])
a b c d
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
pd.concat([df1, df5])
a b c d
1 a1 b1 c1 NaN
2 a2 b2 c2 NaN
3 a3 b3 c3 NaN
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6


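  • Outer join: fill missing entries with NaN (the default mode)
# This is the case shown above, i.e. the default behavior
  • Inner join: keep only the matching entries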
pd.concat([df1,df5],join = "inner")
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
3 a3 b3 c3
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6
  • Join on specified axes with join_axes
df6 = make_df(list("abcz"), [3,4,7,8])
a b c z
3 a3 b3 c3 z3
4 a4 b4 c4 z4
7 a7 b7 c7 z7
8 a8 b8 c8 z8
df6.columns
Index(['a', 'b', 'c', 'z'], dtype='object')
pd.concat([df6,df5,df2,df1], join_axes=[df6.columns])
# axis: a single axis; axes: the plural, i.e. several axes
# join_axes: list of Index objects
a b c z
3 a3 b3 c3 z3
4 a4 b4 c4 z4
7 a7 b7 c7 z7
8 a8 b8 c8 z8
3 a3 b3 c3 NaN
4 a4 b4 c4 NaN
5 a5 b5 c5 NaN
6 a6 b6 c6 NaN
4 a4 b4 c4 NaN
5 a5 b5 c5 NaN
6 a6 b6 c6 NaN
1 a1 b1 c1 NaN
2 a2 b2 c2 NaN
3 a3 b3 c3 NaN
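
The `join_axes` argument was deprecated in pandas 0.25 and removed in 1.0, so the call above fails on current releases. A minimal sketch of one way to get the same result, assuming a newer pandas: concatenate first, then reindex the columns to those of `df6`. The frames are rebuilt here with the same `make_df` helper.

```python
import pandas as pd

def make_df(cols, inds):
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df1 = make_df(list("abc"), [1, 2, 3])
df2 = make_df(list("abc"), [4, 5, 6])
df5 = make_df(list("abcd"), [3, 4, 5, 6])
df6 = make_df(list("abcz"), [3, 4, 7, 8])

# Outer-concatenate, then keep only the columns of df6 (drops column "d")
result = pd.concat([df6, df5, df2, df1], sort=False).reindex(columns=df6.columns)

print(result)
```

Rows that come from frames without a `z` column hold NaN there, matching the output shown above.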





3) Appending with the append() function


s1 = ["123"]
s1.append("456")
s1

['123', '456']
df1.append(df2)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6
a b c d
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
df5.append(df1)
a b c d
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
1 a1 b1 c1 NaN
2 a2 b2 c2 NaN
3 a3 b3 c3 NaN
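
`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. On those versions the same result can be obtained with `pd.concat`; the sketch below rebuilds `df1` and `df5` with the `make_df` helper and is only one way to write it.

```python
import pandas as pd

def make_df(cols, inds):
    data = {c: [c + str(i) for i in inds] for c in cols}
    return pd.DataFrame(data, index=inds)

df1 = make_df(list("abc"), [1, 2, 3])
df5 = make_df(list("abcd"), [3, 4, 5, 6])

# Equivalent of df5.append(df1): rows of df1 follow the rows of df5,
# and column "d" is NaN for the rows that came from df1
result = pd.concat([df5, df1], sort=False)

print(result)
```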





2. Merging with pd.merge()





1) One-to-one merge

df1 = DataFrame({"age":[30,22,36],"work":['tech',"accounting","sell"],"sex":["男","女","女"]}, index = list("abc"))
age sex work
a 30 男 tech
b 22 女 accounting
c 36 女 sell
df2 = DataFrame({"home":["上海","安徽","山东"],"work":['tech',"accounting","sell"],"weight":[60,50,45]},
                index = list("abc"))
home weight work
a 上海 60 tech
b 安徽 50 accounting
c 山东 45 sell
pd.concat([df1,df2],axis = 1)
age sex work home weight work
a 30 男 tech 上海 60 tech
b 22 女 accounting 安徽 50 accounting
c 36 女 sell 山东 45 sell
df1.merge(df2)
age sex work home weight
0 30 男 tech 上海 60
1 22 女 accounting 安徽 50
2 36 女 sell 山东 45

2) Many-to-one merge

age sex work
a 30 男 tech
b 22 女 accounting
c 36 女 sell
df3 = DataFrame({"home":["深圳","北京","上海","安徽","山东"],
                "weight":[60,75,80,54,63],
                "work":["tech","tech","tech","accounting","sell"]},index = list("abcde"))
home weight work
a 深圳 60 tech
b 北京 75 tech
c 上海 80 tech
d 安徽 54 accounting
e 山东 63 sell
df1.merge(df3)
age sex work home weight
0 30 男 tech 深圳 60
1 30 男 tech 北京 75
2 30 男 tech 上海 80
3 22 女 accounting 安徽 54
4 36 女 sell 山东 63

3) Many-to-many merge

df5 = DataFrame({"age":[28,30,22,36], "work":['tech',"tech","accounting","sell"],"sex":["女","男","女","女"]}, index = list("abce"))
age sex work
a 28 女 tech
b 30 男 tech
c 22 女 accounting
e 36 女 sell
home weight work
a 深圳 60 tech
b 北京 75 tech
c 上海 80 tech
d 安徽 54 accounting
e 山东 63 sell
df3.merge(df5)
home weight work age sex
0 深圳 60 tech 28 女
1 深圳 60 tech 30 男
2 北京 75 tech 28 女
3 北京 75 tech 30 男
4 上海 80 tech 28 女
5 上海 80 tech 30 男
6 安徽 54 accounting 22 女
7 山东 63 sell 36 女
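
Because a many-to-many merge multiplies rows, it can help to state the expected relationship explicitly. The sketch below uses the `validate` argument of `merge` for that; the two small frames `people` and `homes` are hypothetical stand-ins for the tables above.

```python
import pandas as pd

# Hypothetical tables: "work" is unique in people, repeated in homes
people = pd.DataFrame({"age": [30, 22, 36],
                       "work": ["tech", "accounting", "sell"]})
homes = pd.DataFrame({"home": ["Shenzhen", "Beijing", "Anhui"],
                      "work": ["tech", "tech", "accounting"]})

# Many rows of homes map to one row of people, so this check passes
ok = homes.merge(people, on="work", validate="many_to_one")

# Claiming a one-to-one relationship is wrong here ("tech" repeats in homes),
# so pandas raises MergeError instead of silently duplicating rows
try:
    people.merge(homes, on="work", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(ok), one_to_one_ok)
```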

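4) Normalizing the key

  • Use on= to state explicitly which column is the key; needed when the two tables share more than one column name
age sex work
a 28 女 tech
b 30 男 tech
c 22 女 accounting
e 36 女 sell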
df6 = DataFrame({"age":[30,27,36],"work":["tech","leader","sell"],"hoppy":["sixdog","diaofish","playcat"]}, index = list("abc"))
age hoppy work
a 30 sixdog tech
b 27 diaofish leader
c 36 playcat sell
df5.merge(df6, on = "age", suffixes=["_总部","_分部"])
age sex work_总部 hoppy work_分部
0 30 男 tech sixdog tech
1 36 女 sell playcat sell
df5.merge(df6,on = "work")
age_x sex work age_y hoppy
0 28 女 tech 30 sixdog
1 30 男 tech 30 sixdog
2 36 女 sell 36 playcat
  • Use left_on and right_on to name the key column on each side; needed when the key columns have different names in the two tables
age sex work
a 28 女 tech
b 30 男 tech
c 22 女 accounting
e 36 女 sell
df7 = DataFrame({"年龄":[30,22,36],"工作":["tech","accounting","sell"],"性别":["男","女","女"]},index = list("abc"))
工作 年龄 性别
a tech 30 男
b accounting 22 女
c sell 36 女
df5.merge(df7,left_on = "work", right_on = "工作")
age sex work 工作 年龄 性别
0 28 女 tech tech 30 男
1 30 男 tech tech 30 男
2 22 女 accounting accounting 22 女
3 36 女 sell sell 36 女
age sex work
a 28 女 tech
b 30 男 tech
c 22 女 accounting
e 36 女 sell
s = df5[["age"]]*1000
s.columns = ["salary"]
salary
a 28000
b 30000
c 22000
e 36000
df5.merge(s, left_index = True,right_index=True)
age sex work salary
a 28 女 tech 28000
b 30 男 tech 30000
c 22 女 accounting 22000
e 36 女 sell 36000
pd.concat([df5,s],axis = 1)
age sex work salary
a 28 女 tech 28000
b 30 男 tech 30000
c 22 女 accounting 22000
e 36 女 sell 36000
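
When both tables are keyed by their index, `DataFrame.join` is a shorter spelling of the index-on-index merge shown above. A minimal sketch, using a trimmed copy of `df5` (only the `age` and `work` columns) so that it runs on its own:

```python
import pandas as pd

# Trimmed, illustrative copy of df5
df5 = pd.DataFrame({"age": [28, 30, 22, 36],
                    "work": ["tech", "tech", "accounting", "sell"]},
                   index=list("abce"))

s = df5[["age"]] * 1000
s.columns = ["salary"]

# join aligns on the index by default (how="left")
joined = df5.join(s)

# The explicit merge form used in the text
merged = df5.merge(s, left_index=True, right_index=True)

print(joined)
```

Note that `join` defaults to a left join while `merge` defaults to an inner join; the two agree here only because both frames have the same index labels.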



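  1. Suppose there are two score tables: ddd holds Zhang San, Li Si and Wang Laowu, and ddd4 holds Zhang San and Zhao Xiaoliu. How would you merge them?

  2. What if Zhang San's name in ddd4 was mistyped as Zhang Shisan?

  3. Practice the many-to-one and many-to-many cases on your own.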


5) Inner and outer merges

  • Inner merge: keep only the keys present in both tables (the default mode)
home weight work
a 深圳 60 tech
b 北京 75 tech
c 上海 80 tech
d 安徽 54 accounting
e 山东 63 sell
age sex work
a 28 女 tech
b 30 男 tech
c 22 女 accounting
e 36 女 sell
age hoppy work
a 30 sixdog tech
b 27 diaofish leader
c 36 playcat sell
home weight work
a 深圳 60 tech
b 北京 75 tech
c 上海 80 tech
d 安徽 54 accounting
e 山东 63 sell
df3.merge(df6)
home weight work age hoppy
0 深圳 60 tech 30 sixdog
1 北京 75 tech 30 sixdog
2 上海 80 tech 30 sixdog
3 山东 63 sell 36 playcat
  • Outer merge, how='outer': fill missing entries with NaN
df3.merge(df6,how = "outer")
home weight work age hoppy
0 深圳 60.0 tech 30.0 sixdog
1 北京 75.0 tech 30.0 sixdog
2 上海 80.0 tech 30.0 sixdog
3 安徽 54.0 accounting NaN NaN
4 山东 63.0 sell 36.0 playcat
5 NaN NaN leader 27.0 diaofish
  • Left merge and right merge: how='left', how='right'
home weight work
a 深圳 60 tech
b 北京 75 tech
c 上海 80 tech
d 安徽 54 accounting
e 山东 63 sell
age hoppy work
a 30 sixdog tech
b 27 diaofish leader
c 36 playcat sell
df3.merge(df6, how = "left")
home weight work age hoppy
0 深圳 60 tech 30.0 sixdog
1 北京 75 tech 30.0 sixdog
2 上海 80 tech 30.0 sixdog
3 安徽 54 accounting NaN NaN
4 山东 63 sell 36.0 playcat
df3.merge(df6, how = "right")
home weight work age hoppy
0 深圳 60.0 tech 30 sixdog
1 北京 75.0 tech 30 sixdog
2 上海 80.0 tech 30 sixdog
3 山东 63.0 sell 36 playcat
4 NaN NaN leader 27 diaofish
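
To see which side each row of a merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch with two hypothetical frames, `left` and `right`, modelled loosely on the tables above:

```python
import pandas as pd

# Hypothetical tables sharing the key column "work"
left = pd.DataFrame({"work": ["tech", "accounting", "sell"],
                     "home": ["Shenzhen", "Anhui", "Shandong"]})
right = pd.DataFrame({"work": ["tech", "leader", "sell"],
                      "hobby": ["sixdog", "diaofish", "playcat"]})

# Outer merge keeps every key; _merge records where each key was found
out = left.merge(right, on="work", how="outer", indicator=True)

# Map each key to "both", "left_only" or "right_only"
origin = dict(zip(out["work"], out["_merge"].astype(str)))
print(origin)
```

Filtering on `_merge == "left_only"` or `"right_only"` is a quick way to list the keys that an inner merge would have dropped.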



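  1. If scores exist only for Zhang San and Zhao Xiaoliu, and only for Chinese, math and English, how would you merge?

  2. Considering the use case, merge ddd and ddd4 in several different ways.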


6) Resolving column conflicts
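
A column conflict arises when both tables carry a column with the same name that is not used as the key. The sketch below, built on two small hypothetical frames `hq` and `branch`, shows the default `_x` / `_y` suffixes and how the `suffixes` argument replaces them; the suffix names are placeholders.

```python
import pandas as pd

# Hypothetical tables: both have "age" (the key) and "work" (the conflict)
hq = pd.DataFrame({"age": [28, 30, 36], "work": ["tech", "tech", "sell"]})
branch = pd.DataFrame({"age": [30, 27, 36], "work": ["tech", "leader", "sell"]})

# on="age" picks the key; the conflicting "work" columns get _x and _y
default = hq.merge(branch, on="age")

# suffixes supplies more meaningful names for the conflicting columns
named = hq.merge(branch, on="age", suffixes=("_hq", "_branch"))

print(list(default.columns))
print(list(named.columns))
```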








3. Case study: analyzing US state population data


pop = pd.read_csv("./state-population.csv")
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
5 AL total 2011 4801627.0
6 AL total 2009 4757938.0
7 AL under18 2009 1134192.0
8 AL under18 2013 1111481.0
9 AL total 2013 4833722.0
10 AL total 2007 4672840.0
11 AL under18 2007 1132296.0
12 AL total 2008 4718206.0
13 AL under18 2008 1134927.0
14 AL total 2005 4569805.0
15 AL under18 2005 1117229.0
16 AL total 2006 4628981.0
17 AL under18 2006 1126798.0
18 AL total 2004 4530729.0
19 AL under18 2004 1113662.0

(2544, 4)
areas = pd.read_csv("./state-areas.csv")
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
5 Colorado 104100
6 Connecticut 5544
7 Delaware 1954
8 Florida 65758
9 Georgia 59441
10 Hawaii 10932
11 Idaho 83574
12 Illinois 57918
13 Indiana 36420
14 Iowa 56276
15 Kansas 82282
16 Kentucky 40411
17 Louisiana 51843
18 Maine 35387
19 Maryland 12407
20 Massachusetts 10555
21 Michigan 96810
22 Minnesota 86943
23 Mississippi 48434
24 Missouri 69709
25 Montana 147046
26 Nebraska 77358
27 Nevada 110567
28 New Hampshire 9351
29 New Jersey 8722
30 New Mexico 121593
31 New York 54475
32 North Carolina 53821
33 North Dakota 70704
34 Ohio 44828
35 Oklahoma 69903
36 Oregon 98386
37 Pennsylvania 46058
38 Rhode Island 1545
39 South Carolina 32007
40 South Dakota 77121
41 Tennessee 42146
42 Texas 268601
43 Utah 84904
44 Vermont 9615
45 Virginia 42769
46 Washington 71303
47 West Virginia 24231
48 Wisconsin 65503
49 Wyoming 97818
50 District of Columbia 68
51 Puerto Rico 3515
(52, 2)
abbr = pd.read_csv("./state-abbrevs.csv")
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
(51, 2)



# pop has 2544 rows; abbr has 51 rows
pop2 = pop.merge(abbr,left_on = "state/region", right_on = "abbreviation", how = "left")
state/region ages year population state abbreviation
0 AL under18 2012 1117489.0 Alabama AL
1 AL total 2012 4817528.0 Alabama AL
2 AL under18 2010 1130966.0 Alabama AL
3 AL total 2010 4785570.0 Alabama AL
4 AL under18 2011 1125763.0 Alabama AL


pop2.drop("abbreviation", axis = 1,inplace=True)

state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
5 AL total 2011 4801627.0 Alabama
6 AL total 2009 4757938.0 Alabama
7 AL under18 2009 1134192.0 Alabama
8 AL under18 2013 1111481.0 Alabama
9 AL total 2013 4833722.0 Alabama
10 AL total 2007 4672840.0 Alabama
11 AL under18 2007 1132296.0 Alabama
12 AL total 2008 4718206.0 Alabama
13 AL under18 2008 1134927.0 Alabama
14 AL total 2005 4569805.0 Alabama
15 AL under18 2005 1117229.0 Alabama
16 AL total 2006 4628981.0 Alabama
17 AL under18 2006 1126798.0 Alabama
18 AL total 2004 4530729.0 Alabama
19 AL under18 2004 1113662.0 Alabama
20 AL total 2003 4503491.0 Alabama
21 AL under18 2003 1113083.0 Alabama
22 AL total 2001 4467634.0 Alabama
23 AL under18 2001 1120409.0 Alabama
24 AL total 2002 4480089.0 Alabama
25 AL under18 2002 1116590.0 Alabama
26 AL under18 1999 1121287.0 Alabama
27 AL total 1999 4430141.0 Alabama
28 AL total 2000 4452173.0 Alabama
29 AL under18 2000 1122273.0 Alabama
... ... ... ... ... ...
2514 USA under18 1999 71946051.0 NaN
2515 USA total 2000 282162411.0 NaN
2516 USA under18 2000 72376189.0 NaN
2517 USA total 1999 279040181.0 NaN
2518 USA total 2001 284968955.0 NaN
2519 USA under18 2001 72671175.0 NaN
2520 USA total 2002 287625193.0 NaN
2521 USA under18 2002 72936457.0 NaN
2522 USA total 2003 290107933.0 NaN
2523 USA under18 2003 73100758.0 NaN
2524 USA total 2004 292805298.0 NaN
2525 USA under18 2004 73297735.0 NaN
2526 USA total 2005 295516599.0 NaN
2527 USA under18 2005 73523669.0 NaN
2528 USA total 2006 298379912.0 NaN
2529 USA under18 2006 73757714.0 NaN
2530 USA total 2007 301231207.0 NaN
2531 USA under18 2007 74019405.0 NaN
2532 USA total 2008 304093966.0 NaN
2533 USA under18 2008 74104602.0 NaN
2534 USA under18 2013 73585872.0 NaN
2535 USA total 2013 316128839.0 NaN
2536 USA total 2009 306771529.0 NaN
2537 USA under18 2009 74134167.0 NaN
2538 USA under18 2010 74119556.0 NaN
2539 USA total 2010 309326295.0 NaN
2540 USA under18 2011 73902222.0 NaN
2541 USA total 2011 311582564.0 NaN
2542 USA under18 2012 73708179.0 NaN
2543 USA total 2012 313873685.0 NaN

2544 rows × 5 columns



cond = pop2.isnull().any(axis = 1)
pop2[cond]

state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
2453 PR under18 1993 NaN NaN
2454 PR under18 1992 NaN NaN
2455 PR total 1992 NaN NaN
2456 PR under18 1994 NaN NaN
2457 PR total 1994 NaN NaN
2458 PR total 1995 NaN NaN
2459 PR under18 1995 NaN NaN
2460 PR under18 1996 NaN NaN
2461 PR total 1996 NaN NaN
2462 PR under18 1998 NaN NaN
2463 PR total 1998 NaN NaN
2464 PR total 1997 NaN NaN
2465 PR under18 1997 NaN NaN
2466 PR total 1999 NaN NaN
2467 PR under18 1999 NaN NaN
2468 PR total 2000 3810605.0 NaN
2469 PR under18 2000 1089063.0 NaN
2470 PR total 2001 3818774.0 NaN
2471 PR under18 2001 1077566.0 NaN
2472 PR total 2002 3823701.0 NaN
2473 PR under18 2002 1065051.0 NaN
2474 PR total 2004 3826878.0 NaN
2475 PR under18 2004 1035919.0 NaN
2476 PR total 2003 3826095.0 NaN
2477 PR under18 2003 1050615.0 NaN
... ... ... ... ... ...
2514 USA under18 1999 71946051.0 NaN
2515 USA total 2000 282162411.0 NaN
2516 USA under18 2000 72376189.0 NaN
2517 USA total 1999 279040181.0 NaN
2518 USA total 2001 284968955.0 NaN
2519 USA under18 2001 72671175.0 NaN
2520 USA total 2002 287625193.0 NaN
2521 USA under18 2002 72936457.0 NaN
2522 USA total 2003 290107933.0 NaN
2523 USA under18 2003 73100758.0 NaN
2524 USA total 2004 292805298.0 NaN
2525 USA under18 2004 73297735.0 NaN
2526 USA total 2005 295516599.0 NaN
2527 USA under18 2005 73523669.0 NaN
2528 USA total 2006 298379912.0 NaN
2529 USA under18 2006 73757714.0 NaN
2530 USA total 2007 301231207.0 NaN
2531 USA under18 2007 74019405.0 NaN
2532 USA total 2008 304093966.0 NaN
2533 USA under18 2008 74104602.0 NaN
2534 USA under18 2013 73585872.0 NaN
2535 USA total 2013 316128839.0 NaN
2536 USA total 2009 306771529.0 NaN
2537 USA under18 2009 74134167.0 NaN
2538 USA under18 2010 74119556.0 NaN
2539 USA total 2010 309326295.0 NaN
2540 USA under18 2011 73902222.0 NaN
2541 USA total 2011 311582564.0 NaN
2542 USA under18 2012 73708179.0 NaN
2543 USA total 2012 313873685.0 NaN

96 rows × 5 columns




state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
# Find the abbreviations of the states whose full name is null
cond_state = pop2["state"].isnull()
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: state, Length: 2544, dtype: bool
pop2.loc[cond_state, "state/region"].unique()
array(['PR', 'USA'], dtype=object)
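
The steps so far can be replayed on a tiny in-memory sample, which is handy when the CSV files are not at hand. This is a sketch only: the three `pop` rows are copied from the output above, `abbr` is cut down to two states, and the fill-in step with `.loc` is an assumed continuation (the name "United States" for `USA` is a guess, not taken from the data).

```python
import pandas as pd

# Three sample rows taken from the population output above
pop = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "ages": ["total", "total", "total"],
    "year": [2012, 2000, 2012],
    "population": [4817528.0, 3810605.0, 313873685.0],
})
abbr = pd.DataFrame({"state": ["Alabama", "Alaska"],
                     "abbreviation": ["AL", "AK"]})

# Left merge keeps every population row, even without a matching abbreviation
pop2 = pop.merge(abbr, left_on="state/region", right_on="abbreviation", how="left")
pop2 = pop2.drop("abbreviation", axis=1)

# Abbreviations whose full state name is missing
missing = pop2.loc[pop2["state"].isnull(), "state/region"].unique()
print(missing)

# Assumed next step: fill the missing names in place with .loc
pop2.loc[pop2["state/region"] == "PR", "state"] = "Puerto Rico"
pop2.loc[pop2["state/region"] == "USA", "state"] = "United States"
```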















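  • Index consistently with .loc[]
  • Use .isnull().any() to find the columns that contain NaN
  • Use .unique() to determine which keys in a column are the ones we need
  • Prefer outer or left merges in general, for one reason: it is better to have NaN in a column than to discard the information in the other columns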


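  • Series and DataFrame arithmetic aligns on index labels instead of broadcasting by position; where a label has no counterpart the result is NaN, or you can pass fill_value to add() to fill in the missing values
  • ndarrays do broadcast: existing values are repeated to carry out the computation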
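
The contrast between label alignment and broadcasting can be checked directly. In this minimal sketch the Series `s1` and `s2` are made-up examples, while `nd1` and `nd3` repeat the arrays from the NumPy review at the start.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([10, 20, 30], index=list("bcd"))

# Labels "a" and "d" exist on one side only, so the sum is NaN there
plain = s1 + s2

# fill_value=0 treats a missing operand as 0 before adding
filled = s1.add(s2, fill_value=0)

# ndarray broadcasting: the (3,) array is repeated for each row of the (2, 3) array
nd1 = np.array([1, 2, 3])
nd3 = np.array([[-1, -2, -3], [0, 2, 4]])
broadcast = nd1 + nd3

print(plain)
print(filled)
print(broadcast)
```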
