Data Analysis with Python: Data Wrangling, Part 2 (ETL)

When my studies bear fruit, may Rui and I be wed. @夏瑾墨 by Jooey

3. Concatenating data along an axis
NumPy has a concatenate function for joining raw NumPy arrays

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

arr=np.arange(12).reshape((3,4))
print (arr)
print (np.concatenate([arr,arr],axis=1))

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
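To make the role of the axis argument concrete, here is a minimal sketch that joins the same array along both axes and compares the resulting shapes. The names stacked_rows and stacked_cols are illustrative and do not appear in the original example.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 stacks the arrays vertically: rows are appended, column count unchanged
stacked_rows = np.concatenate([arr, arr], axis=0)
# axis=1 places the arrays side by side: columns are appended, row count unchanged
stacked_cols = np.concatenate([arr, arr], axis=1)

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```

All arrays must agree on every dimension except the one being concatenated, otherwise NumPy raises a ValueError.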

Suppose there are three Series whose indexes do not overlap

s1=Series([0,1],index=['a','b'])
s2=Series([2,3,4],index=['c','d','e'])
s3=Series([5,6],index=['f','g'])
print (pd.concat([s1,s2,s3]))

Output:

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0 and produces a new Series. If you pass axis=1, the result is a DataFrame instead (axis=1 is the column axis)

print (pd.concat([s1,s2,s3],axis=1))

Output:

     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
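The difference between the two axes can be checked directly. The sketch below uses two of the Series from above under the illustrative names stacked and side_by_side, and counts the missing cells that appear when the indexes do not overlap.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

stacked = pd.concat([s1, s2])               # axis=0: one longer Series
side_by_side = pd.concat([s1, s2], axis=1)  # axis=1: one column per input

print(type(stacked).__name__)              # Series
print(type(side_by_side).__name__)         # DataFrame
print(side_by_side.shape)                  # (5, 2)
# Every label exists in only one input, so half of the 10 cells are NaN
print(int(side_by_side.isna().sum().sum()))  # 5
```

Because NaN is a floating-point value, the integer inputs are converted to floats in the DataFrame, which is why the table above shows 0.0 rather than 0.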

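Here the Series share no labels on the other axis, which you can see from the index of the result: it is the sorted union (an outer join) of the input indexes. Pass join='inner' to get their intersection instead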

s4=pd.concat([s1*5,s3])
print (pd.concat([s1,s4],axis=1))
print (pd.concat([s1,s4],axis=1,join='inner'))

Output:

     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6
   0  1
a  0  0
b  1  5
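As a quick check of the two join modes, the following sketch rebuilds s1 and s4 and compares only the row labels of each result. The names outer and inner are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

outer = pd.concat([s1, s4], axis=1)                # default is join='outer'
inner = pd.concat([s1, s4], axis=1, join='inner')  # keep shared labels only

print(list(outer.index))  # ['a', 'b', 'f', 'g']
print(list(inner.index))  # ['a', 'b']
```

The inner join contains no NaN, so its columns keep their integer dtype, while the outer join converts the first column to floats.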

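In older pandas you could pass join_axes to specify the index to use on the other axis. That argument was deprecated in pandas 0.25 and removed in 1.0, so on current versions reindex the concatenated result instead

print (pd.concat([s1,s4],axis=1).reindex(['a','c','b','e']))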

Output:

     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN

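NaN stands for Not a Number.
One problem remains: the pieces that were concatenated cannot be told apart in the result. Suppose you want to create a hierarchical index on the concatenation axis. The keys argument does exactly that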

result=pd.concat([s1,s2,s3],keys=['one','two','three'])
print (result)
print (result.unstack())

Output:

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
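One practical benefit of keys is that each original piece can be pulled back out of the result by its key. The sketch below assumes the same three Series; the name piece is illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

# The result has a two-level index: (key, original label)
print(result.index.nlevels)        # 2

# Selecting on the outer level returns the rows that came from s2
piece = result.loc['two']
print(list(piece.index))           # ['c', 'd', 'e']
print(piece.tolist())              # [2, 3, 4]

# A full (key, label) tuple addresses a single value
print(result.loc[('three', 'g')])  # 6
```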

If you combine the Series along axis=1, the keys become the DataFrame column headers


print (pd.concat([s1,s2,s3],axis=1,keys=['one','two','three']))

Output:

   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0

The same logic applies to DataFrame objects

df5=DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df6=DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
print (pd.concat([df5,df6],axis=1,keys=['level1','level2']))

Output:

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0

If you pass a dict instead of a list, the dict's keys are used as the value of the keys option

print (pd.concat({'level1':df5,'level2':df6},axis=1))

Output:

  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
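To confirm that the list-plus-keys form and the dict form build the same column index, a small sketch can compare the two results. The names from_list and from_dict are illustrative.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

from_list = pd.concat([df5, df6], axis=1, keys=['level1', 'level2'])
from_dict = pd.concat({'level1': df5, 'level2': df6}, axis=1)

# Both calls produce the same two-level column index
print(from_list.columns.equals(from_dict.columns))  # True

# The outer key selects the columns that came from one input frame
print(list(from_dict['level2'].columns))            # ['three', 'four']
```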

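Two more arguments, names and levels, control how the hierarchical index is created (see the table below). The following example uses names to label the index levels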

print (pd.concat([df5,df6],axis=1,keys=['level1','level2'],names=['upper','lower']))

Output:

upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
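Once the levels are named, the names can be used in place of level positions. The following is a minimal sketch under that assumption; wide and sub are illustrative names.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df6 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

wide = pd.concat([df5, df6], axis=1,
                 keys=['level1', 'level2'], names=['upper', 'lower'])

print(list(wide.columns.names))                       # ['upper', 'lower']
print(list(wide.columns.get_level_values('lower')))   # ['one', 'two', 'three', 'four']

# Take a cross-section of the columns by naming the level to match on
sub = wide.xs('level2', axis=1, level='upper')
print(list(sub.columns))                              # ['three', 'four']
```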

In Python 3, the arguments of a function call are simply listed one after another, separated by commas.
[Image 1: table of concat arguments]
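The last case to consider is a DataFrame whose row index is irrelevant to the analysis at hand. Pass ignore_index=True, which discards the original row labels and numbers the rows of the result from 0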

df7=DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df8=DataFrame(np.random.randn(2,3),columns=['b','d','a'])
print (df7)
print (df8)
print (pd.concat([df7,df8],ignore_index=True))

Output:

       a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
          b         d         a
0 -1.377938 -0.616348  0.936278
1  0.400851  2.066192  0.127229
          a         b         c         d
0 -0.844224  0.593684  0.144469  0.729945
1  0.484216 -0.736679 -2.385474  0.004167
2 -0.007380 -0.129935 -0.014069  0.907947
3  0.936278 -1.377938       NaN -0.616348
4  0.127229  0.400851       NaN  2.066192
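Since the example above uses random numbers, here is a deterministic sketch of the same idea. The frames left and right are hypothetical stand-ins for df7 and df8, chosen so the effect on the index is easy to read.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
right = pd.DataFrame({'b': [7, 8], 'a': [9, 10]})

kept = pd.concat([left, right])                           # original labels kept
renumbered = pd.concat([left, right], ignore_index=True)  # fresh 0..n-1 labels

print(list(kept.index))          # [0, 1, 2, 0, 1]
print(list(renumbered.index))    # [0, 1, 2, 3, 4]

# Columns are still aligned by name, not by position
print(renumbered['a'].tolist())  # [1, 2, 3, 9, 10]
```

Without ignore_index the result carries duplicate row labels, which makes label-based lookups such as kept.loc[0] return more than one row.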

