python3.6 pandas data wrangling: merging datasets, transforming data, detecting and filtering outliers, permutation and random sampling, computing indicator/dummy variables

A large share of the work in data analysis and modeling goes into data preparation.

0. Summary

These study notes on "Python for Data Analysis" cover:

  1. Merging datasets
    1.1.1. Many-to-one merges
    1.1.2. Many-to-many merges
    1.1.3. Merging on index
    1.1.4. Concatenating along an axis
    1.1.5. Combining overlapping data

  2. Reshaping and pivoting

  3. Data transformation
    3.1. Removing duplicates
    3.2. Transforming data with a function or mapping
    3.3. Writing the mapping as a lambda
    3.4. Replacing values
    3.5. Renaming axis indexes
    3.6. Discretization and binning

  4. Detecting and filtering outliers

  5. Permutation and random sampling

  6. Computing indicator/dummy variables

1. Merging datasets

1.1. Database-style join and merge

inner, outer, left, and right joins work the same way as in a database.
A merge can also use multiple keys (key pairs).
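The multi-key case can be sketched as follows (the frames and column names here are made up for illustration; pass a list to on, and how selects the join type):

```python
import pandas as pd

# Hypothetical frames to illustrate a multi-key merge (names are made up)
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Merge on both key columns; how='outer' keeps non-matching rows too
merged = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(merged)
```

Rows that exist on only one side get NaN in the other side's columns, just as with a single-key outer join.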

1.1.1. Many-to-one merges

  1. inner merge takes the intersection of the keys (the default)

If no key is specified, the columns that df1 and df2 have in common are used as the merge key.

In [53]: df1
Out[53]:
  data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

In [54]: df2
Out[54]:
  data2 key
0      0   a
1      1   b
2      2   d

In [55]: pd.merge(df1,df2)
Out[55]:
  data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
  2. Use on to specify the key explicitly, or left_on/right_on when the two sides name their key columns differently
In [65]: pd.merge(df1,df2,on='key')
Out[65]:
  data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

In [58]: df3
Out[58]:
  data1 lkey
0      0    b
1      1    b
2      2    a
3      3    c
4      4    a
5      5    a
6      6    b

In [59]: df4
Out[59]:
  data2 rkey
0      0    a
1      1    b
2      2    d

In [60]: pd.merge(df3,df4,left_on='lkey',right_on = 'rkey')
Out[60]:
  data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

  3. outer merge: an outer join takes the union of the keys
In [66]: pd.merge(df1,df2,how='outer')
Out[66]:
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

1.1.2. Many-to-many merges

A many-to-many merge forms the Cartesian product of the matching rows.
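A minimal sketch of that behavior (made-up data): the key 'b' appears three times on the left and twice on the right, so the result contains 3 × 2 = 6 'b' rows.

```python
import pandas as pd

# Hypothetical frames (made-up data) to show the Cartesian-product behavior
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

# Every left 'b' row pairs with every right 'b' row: 3 * 2 = 6 rows for 'b'
result = pd.merge(df1, df2, on='key', how='inner')
print(result)
```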

1.1.3. Merging on index

Use a table's index as the merge key; everything else works the same.

Pass right_index=True and/or left_index=True.
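A small sketch with made-up data, matching a column on the left against the index on the right:

```python
import pandas as pd

# Hypothetical example (made-up data): the right frame's index holds the key
left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                     'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Match left's 'key' column against right's index (inner join by default,
# so the unmatched 'c' row is dropped)
merged = pd.merge(left, right, left_on='key', right_index=True)
print(merged)
```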

1.1.4. Concatenating along an axis

Much like extending an array by rows or columns in MATLAB.

axis=0: extend along the rows
axis=1: extend along the columns
  1. numpy.concatenate
In [8]: arr
Out[8]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [9]: np.concatenate([arr,arr],axis=1)
Out[9]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [11]: np.concatenate([arr,arr,arr])
Out[11]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
  2. pandas.concat

Turning the result into a DataFrame is pretty neat...

2.1 Series

In [15]: s1
Out[15]:
a    0
b    1
dtype: int64

In [16]: s2
Out[16]:
c    2
d    3
e    4
dtype: int64

In [17]: s3
Out[17]:
f    5
g    6
dtype: int64

In [18]: pd.concat([s1,s2,s3])
Out[18]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [19]: pd.concat([s1,s2,s3],axis=1)
Out[19]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0


2.2 pandas.concat on DataFrames

In [67]: df1
Out[67]:
          a         b         c         d
0  1.241152  0.719738  0.076190  0.126979
1  1.773495 -0.221310 -0.581650  0.824466
2  0.505247 -1.218025  1.291052 -0.208664

In [68]: df2
Out[68]:
          b         d         a
0 -0.809509  1.781506 -0.091040
1  0.943622 -1.629456 -0.258901

In [69]: pd.concat([df1,df2],ignore_index=True)
Out[69]:
          a         b         c         d
0  1.241152  0.719738  0.076190  0.126979
1  1.773495 -0.221310 -0.581650  0.824466
2  0.505247 -1.218025  1.291052 -0.208664
3 -0.091040 -0.809509       NaN  1.781506
4 -0.258901  0.943622       NaN -1.629456

1.1.5. Combining overlapping data

Overlapping data cannot be merged with the simple methods above; use a vectorized if-else instead: np.where, or combine_first.

In [80]: a = Series([np.nan,2.5,np.nan],index=['a','b','c'])

In [81]: b = Series([11,2.5,np.nan],index=['a','b','c'])

In [82]: a
Out[82]:
a    NaN
b    2.5
c    NaN
dtype: float64

In [83]: b
Out[83]:
a    11.0
b     2.5
c     NaN
dtype: float64

In [84]: np.where(pd.isnull(a),a,b)
Out[84]: array([ nan,  2.5,  nan])

In [88]: Series(np.where(pd.isnull(a),a,b))
Out[88]:
0    NaN
1    2.5
2    NaN
dtype: float64

In [91]: a.combine_first(b)
Out[91]:
a    11.0
b     2.5
c     NaN
dtype: float64

The output differs from the book: np.where returns an ndarray, which prints differently from a Series.
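combine_first also works on whole DataFrames, patching missing values column by column. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frames (made-up data): values from df1 take precedence,
# and holes in df1 are filled from df2 where possible
df1 = pd.DataFrame({'a': [1.0, np.nan, 5.0],
                    'b': [np.nan, 2.0, np.nan]})
df2 = pd.DataFrame({'a': [5.0, 4.0, np.nan],
                    'b': [np.nan, 3.0, 4.0]})

patched = df1.combine_first(df2)
print(patched)
```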

2. Reshaping and pivoting

Reshaping: reshape
Pivoting: stack, unstack

  1. stack: pivots the columns into rows
  2. unstack: the inverse
In [4]: data = DataFrame(np.arange(6).reshape(2,3),
   ...: index = pd.Index(['Ohio','Colorado'],name='state'),
   ...: columns = pd.Index(['one','two','three'],name='number'))

In [5]: data
Out[5]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

In [6]: data.stack()
Out[6]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

In [12]: result = data.stack()

In [13]: result.unstack()
Out[13]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

# specify which level to unstack

In [15]: result.unstack(0)
Out[15]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

In [16]: result.unstack(1)
Out[16]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

In [17]: result.unstack('state')
Out[17]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

In [18]: result.unstack('number')
Out[18]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

  1. Note: if the groups do not share all values, unstacking can introduce missing data. Because stack() drops NaN by default, the round trip looks invertible.
  2. To keep the NaN values when stacking back, pass dropna=False to stack()
In [22]: s1 = Series([0,1,2,3],index=['a','b','c','d'])

In [23]: s2 = Series([4,5,6],index=['c','d','e'])

In [24]: s1
Out[24]:
a    0
b    1
c    2
d    3
dtype: int64

In [25]: s2
Out[25]:
c    4
d    5
e    6
dtype: int64

In [26]: data2 = pd.concat([s1,s2],keys=['one','two'])

In [27]: data2
Out[27]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [28]: data2.unstack()
Out[28]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

In [29]: data2.unstack().stack()
Out[29]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [30]: data2.unstack().stack(dropna=False)
Out[30]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

3. Data transformation

3.1. Removing duplicates

  1. (default) deduplicate across all columns
In [41]: data = DataFrame({'k1':['one'] * 3 + ['two'] * 4,
    ...: 'k2':[1,1,2,3,3,4,4]})

In [42]: data
Out[42]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

In [43]: data.duplicated()
Out[43]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [44]: data.drop_duplicates()
Out[44]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
  2. Deduplicate on one or more specified columns; use keep to decide whether the first or the last occurrence is retained
In [45]: data.drop_duplicates('k1')
Out[45]:
    k1  k2
0  one   1
3  two   3

In [50]: data2 = DataFrame({'k1':['one'] * 3 + ['two'] * 4,
    ...: 'k2':range(7),
    ...: 'k3':['AA'] * 3 + ['BB'] * 2 + ['CC'] * 2})

In [51]: data2
Out[51]:
    k1  k2  k3
0  one   0  AA
1  one   1  AA
2  one   2  AA
3  two   3  BB
4  two   4  BB
5  two   5  CC
6  two   6  CC

In [52]: data2.drop_duplicates(['k1','k3'])
Out[52]:
    k1  k2  k3
0  one   0  AA
3  two   3  BB
5  two   5  CC

In [56]: data2.drop_duplicates(['k1','k3'], keep = 'last')
Out[56]:
    k1  k2  k3
2  one   2  AA
4  two   4  BB
6  two   6  CC

3.2. Transforming data with a function or mapping

The map function. Since I clearly don't want to type that many words, the usual pattern is to first normalize the values to a uniform format (e.g. lowercase) and then apply the mapping. (The shuffle used below is presumably sklearn.utils.shuffle, which shuffles the values while keeping the original index labels.)

In [76]: data = Series(['A']*4 + ['a'] + ['B']*3 + ['c']* 5)

In [77]: studentGrade = DataFrame({'grade':shuffle(data),'studenNo':range(13)})

In [78]: map_to_evaluation = {
    ...: 'a' : 'good',
    ...: 'b' : 'mid',
    ...: 'c' : 'bad'}

In [79]: studentGrade['evaluate'] = studentGrade['grade'].map(str.lower).map(map_to_evaluation)

In [80]: studentGrade
Out[80]:
   grade  studenNo evaluate
3      A         0     good
5      B         1      mid
10     c         2      bad
9      c         3      bad
2      A         4     good
4      a         5     good
8      c         6      bad
7      B         7      mid
0      A         8     good
1      A         9     good
12     c        10      bad
6      B        11      mid
11     c        12      bad

3.3. Writing the mapping as a lambda

In [84]: studentGrade['eval2'] = studentGrade['grade'].map(lambda x : map_to_evaluation[x.lower()])

In [85]: studentGrade
Out[85]:
   grade  studenNo evaluate eval2
3      A         0     good  good
5      B         1      mid   mid
10     c         2      bad   bad
9      c         3      bad   bad
2      A         4     good  good
4      a         5     good  good
8      c         6      bad   bad
7      B         7      mid   mid
0      A         8     good  good
1      A         9     good  good
12     c        10      bad   bad
6      B        11      mid   mid
11     c        12      bad   bad

3.4. Replacing values

The replace function: you can change a single value, limit the change to a range of values, or pass two arrays for a one-to-one mapping. Use your head and experiment; loosely typed Python can do almost anything...

In [89]: data
Out[89]:
0     A
1     A
2     A
3     A
4     a
5     B
6     B
7     B
8     c
9     c
10    c
11    c
12    c
dtype: object

# replace a single value

In [90]: data.replace('A',np.nan)
Out[90]:
0     NaN
1     NaN
2     NaN
3     NaN
4       a
5       B
6       B
7       B
8       c
9       c
10      c
11      c
12      c
dtype: object

# one-to-one replacement, written as a dict or as two lists

In [91]: data.replace(['A','B'],['A.change','B.change'])
Out[91]:
0     A.change
1     A.change
2     A.change
3     A.change
4            a
5     B.change
6     B.change
7     B.change
8            c
9            c
10           c
11           c
12           c
dtype: object

# replace a limited range of values; to target values by magnitude, sort first and then replace

In [102]: number.replace([range(4)],np.nan)
Out[102]:
0    NaN
1    NaN
2    NaN
3    NaN
4    4.0
5    5.0
6    6.0
7    7.0
dtype: float64

In [103]: number.replace([range(4,7)],np.nan)
Out[103]:
0    0.0
1    1.0
2    2.0
3    3.0
4    NaN
5    NaN
6    NaN
7    7.0
dtype: float64

3.5. Renaming axis indexes

rename is typically used to create a differently formatted copy of the data; it does not modify the original.

In [163]: testGrade
Out[163]:
  grade studenNo evaluate eval2
a     A        0     good  good
b     B        1      mid   mid
c     c        2      bad   bad
d     c        3      bad   bad
e     A        4     good  good
f     a        5     good  good
g     c        6      bad   bad
h     B        7      mid   mid
i     A        8     good  good
j     A        9     good  good
k     c       10      bad   bad
l     B       11      mid   mid
m     c       12      bad   bad

In [164]: testGrade.rename(index =str.upper, columns= str.upper)
Out[164]:
  GRADE STUDENNO EVALUATE EVAL2
A     A        0     good  good
B     B        1      mid   mid
C     c        2      bad   bad
D     c        3      bad   bad
E     A        4     good  good
F     a        5     good  good
G     c        6      bad   bad
H     B        7      mid   mid
I     A        8     good  good
J     A        9     good  good
K     c       10      bad   bad
L     B       11      mid   mid
M     c       12      bad   bad

In [165]: testGrade
Out[165]:
  grade studenNo evaluate eval2
a     A        0     good  good
b     B        1      mid   mid
c     c        2      bad   bad
d     c        3      bad   bad
e     A        4     good  good
f     a        5     good  good
g     c        6      bad   bad
h     B        7      mid   mid
i     A        8     good  good
j     A        9     good  good
k     c       10      bad   bad
l     B       11      mid   mid
m     c       12      bad   bad

3.6. Discretization and binning

Split continuous data into discrete bins, e.g. dividing customers' ages into age groups.

Since I don't feel like typing, you can generate a random array for testing; see my article: [python3.6 numpy 数组的多种取整方式]

The cut function

  1. Generate a random array of ages
In [268]: ages = np.array( 90 * np.random.rand(30) ).astype(np.int)

In [269]: ages
Out[269]:
array([85, 39, 55, 71,  4,  2, 24, 59, 65, 44, 84, 58,  2, 27, 36, 24, 41,
       12, 55, 14, 26, 81, 32, 82, 76, 40,  6, 29,  7, 47])
  2. If ages contains values outside the defined bins, cut returns NaN
In [278]: group_names = ['Youth','YouthAdult','MiddleAge','Senior']

In [279]: bins
Out[279]: [18, 25, 35, 60, 100]

In [280]: pd.cut(ages,bins,right=False,labels=group_names)
Out[280]:
[Senior, MiddleAge, MiddleAge, Senior, NaN, ..., MiddleAge, NaN, YouthAdult, NaN
, MiddleAge]
Length: 30
Categories (4, object): [Youth < YouthAdult < MiddleAge < Senior]

  3. So add a bin that starts at age 0
In [281]: ages
Out[281]:
array([85, 39, 55, 71,  4,  2, 24, 59, 65, 44, 84, 58,  2, 27, 36, 24, 41,
       12, 55, 14, 26, 81, 32, 82, 76, 40,  6, 29,  7, 47])

In [282]: bins2 = [0,18,25,35,60,100]

In [283]: group_names2 = ['Minor','Youth','YouthAdult','MiddleAge','Senior']

In [284]: pd.cut(ages,bins2,right=False,labels=group_names2)
Out[284]:
[Senior, MiddleAge, MiddleAge, Senior, Minor, ..., MiddleAge, Minor, YouthAdult,
 Minor, MiddleAge]
Length: 30
Categories (5, object): [Minor < Youth < YouthAdult < MiddleAge < Senior]

The qcut function

qcut cuts the data into bins by sample quantiles rather than by bin length: with A data points and qcut set to 4, each bin holds roughly A/4 points, and the bin edges are then computed from the data. (cut, by contrast, computes equal-length bins from the min and max.)

In [288]: data = np.random.rand(1000)

In [289]: cats = pd.qcut(data,4)

In [290]: pd.value_counts(cats)
Out[290]:
(0.736, 1.0]           250
(0.483, 0.736]         250
(0.225, 0.483]         250
(-0.0009003, 0.225]    250
dtype: int64

4. Detecting and filtering outliers

  1. Use describe() to get an overview of the values
  2. Filter out values beyond some threshold
  3. Assign new values to the "outliers" from step 2
In [312]: np.random.seed(12345)

In [313]: data = DataFrame(np.random.randn(1000,4))

In [314]: data.describe()
Out[314]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528

# inspect values in one column whose magnitude exceeds 3

In [315]: col = data[3]

In [316]: col[np.abs(col) > 3]
Out[316]:
97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64

# inspect rows containing any value whose magnitude exceeds 3

In [317]: data[(np.abs(data) > 3).any(1)]
Out[317]:
            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
400  0.146326  0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301
900  3.366626 -2.372214  0.851010  1.332846

# cap the outliers (note: selecting rows with .any(1) replaces whole rows; the book caps only the outlying cells with data[np.abs(data) > 3] = np.sign(data) * 3)

In [318]: data[(np.abs(data) > 3).any(1)] = np.sign(data) * 3

In [319]: data.describe()
Out[319]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.072154     0.072823     0.023305     0.001450
std       1.031713     1.028480     1.028945     1.017781
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.795388    -0.599807    -0.670407    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.624615     0.792539     0.680976     0.654328
max       3.000000     3.000000     3.000000     3.000000

5. Permutation and random sampling

  1. Reorder the rows with permutation
In [324]: df = DataFrame(np.arange(20).reshape(5,4))

In [325]: df
Out[325]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [326]: sampler = np.random.permutation(5)

In [327]: df.take(sampler)
Out[327]:
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19
0   0   1   2   3
2   8   9  10  11

In [330]: df.take(np.random.permutation(len(df)))
Out[330]:
    0   1   2   3
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
3  12  13  14  15
2   8   9  10  11
  2. Sample directly with random integers (sampling with replacement)
In [331]: bag = np.array([5,7,-1,6,4])

In [332]: sampler = np.random.randint(0,len(bag),size =10)

In [333]: sampler
Out[333]: array([3, 0, 4, 1, 1, 2, 3, 0, 1, 2])

In [334]: draws = bag.take(sampler)

In [335]: draws
Out[335]: array([ 6,  5,  4,  7,  7, -1,  6,  5,  7, -1])
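In newer pandas versions the same two operations can be done with DataFrame.sample (a sketch; the frame here is made up):

```python
import pandas as pd

# A small made-up frame
df = pd.DataFrame({'x': range(5)})

# sampling without replacement; n == len(df) gives a random permutation
perm = df.sample(n=len(df))

# sampling with replacement, like bag.take(np.random.randint(...)) above
draws = df.sample(n=10, replace=True)
print(perm)
print(draws)
```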

6. Computing indicator/dummy variables

Commonly used in machine-learning classification: a "1" means membership in class x, much like an encoding scheme.

  1. Basic usage
In [4]: df = DataFrame({'key':['b','b','a','c','a','b'],
   ...: 'data1':range(6)})

In [5]: pd.get_dummies(df['key'])
Out[5]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

In [6]: dummies = pd.get_dummies(df['key'],prefix='key')

In [7]: dummies
Out[7]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0

In [8]: df_with_dummies = df[['data1']].join(dummies)

In [9]: df_with_dummies
Out[9]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
  2. Continuous values can be discretized first and then encoded
In [19]: values = np.random.rand(10)

In [20]: values
Out[20]:
array([ 0.12422622,  0.66517564,  0.82179204,  0.95121697,  0.73976916,
        0.00804186,  0.51379806,  0.00963952,  0.11634595,  0.52704073])

In [21]: bins = [0,0.2,0.4,0.6,0.8,1]

In [22]: pd.get_dummies(pd.cut(values,bins))
Out[22]:
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           1           0           0           0           0
1           0           0           0           1           0
2           0           0           0           0           1
3           0           0           0           0           1
4           0           0           0           1           0
5           1           0           0           0           0
6           0           0           1           0           0
7           1           0           0           0           0
8           1           0           0           0           0
9           0           0           1           0           0


Yep, time to revisit database fundamentals... merge here is a lot like a database join. And the Cartesian product shows up again...

Cartesian product: the long name is easy to forget, so remember "product": the joined table has as many rows as the product of the two tables' row counts.

SQL joins: the following is well written and worth consulting:
极客学院-SQL 使用连接 (JikeXueYuan: using SQL joins)

2018.8.1 progress: page 214 of the book
