Much of the work in data analysis and modeling is spent on data preparation.
0. Summary
These study notes on Python for Data Analysis (《用python进行数据分析》) cover:
1. Merging datasets
1.1.1. Many-to-one merge
1.1.2. Many-to-many merge
1.1.3. Merging on index
1.1.4. Concatenating along an axis
1.1.5. Combining overlapping data
2. Reshaping and pivoting
3. Data transformation
3.1. Removing duplicates
3.2. Transforming data with a function or mapping
3.3. The same mapping as a lambda
3.4. Replacing values
3.5. Renaming axis indexes
3.6. Discretization and binning
4. Detecting and filtering outliers
5. Permutation and random sampling
6. Computing indicator/dummy variables
1. Merging datasets
1.1. Database-style join and merge
inner, outer, left, and right joins all behave as they do in a database.
Multiple keys (key pairs) can also be merged on; see the sketch right below.
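To merge on multiple keys, pass a list of column names to on. A minimal sketch (the frames and values are my own illustration, not from the book):
import pandas as pd
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
# rows pair up only when key1 AND key2 both match
pd.merge(left, right, on=['key1', 'key2'], how='outer')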
1.1.1. Many-to-one merge
- inner merge: takes the intersection of the keys (the default)
If no key is specified, merge uses the overlapping column names of df1 and df2 as the key.
In [53]: df1
Out[53]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [54]: df2
Out[54]:
data2 key
0 0 a
1 1 b
2 2 d
In [55]: pd.merge(df1,df2)
Out[55]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
- Use on to name the key column explicitly; when the key columns are named differently in the two tables, use left_on and right_on, as below.
In [65]: pd.merge(df1,df2,on='key')
Out[65]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
In [58]: df3
Out[58]:
data1 lkey
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [59]: df4
Out[59]:
data2 rkey
0 0 a
1 1 b
2 2 d
In [60]: pd.merge(df3,df4,left_on='lkey',right_on = 'rkey')
Out[60]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
- outer merge: the outer join takes the union of the keys
In [66]: pd.merge(df1,df2,how='outer')
Out[66]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0
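how='left' and how='right' behave as in SQL: all rows of one side are kept, and the other side's missing matches come back as NaN. A quick sketch against the same df1 and df2 as above:
# every row of df1 is kept; key 'c' has no match in df2,
# so its data2 comes back as NaN
pd.merge(df1, df2, how='left')
# how='right' keeps every row of df2 instead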
1.1.2. Many-to-many merge
Matching rows combine as a Cartesian product; the sketch below shows the row-count multiplication.
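A minimal sketch (toy frames of my own): 'b' appears three times on the left and twice on the right, so the result carries 3 * 2 = 6 rows for key 'b'.
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
# each 'b' row on the left pairs with each 'b' row on the right
pd.merge(df1, df2, on='key', how='left')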
1.1.3. Merging on index
Use a table's index as the join key by passing right_index=True and/or left_index=True; everything else works as before.
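A minimal sketch (frames are my own illustration): the right table carries the key in its index, so right_index=True stands in for right_on.
import pandas as pd
left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                     'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
# join left's 'key' column against right's index
pd.merge(left, right, left_on='key', right_index=True)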
1.1.4. Concatenating along an axis
Much like growing a matrix by rows or columns in MATLAB:
axis=0 extends along the rows
axis=1 extends along the columns
- numpy.concatenate
In [8]: arr
Out[8]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [9]: np.concatenate([arr,arr],axis=1)
Out[9]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
In [11]: np.concatenate([arr,arr,arr])
Out[11]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
- pandas.concat
Pretty cool that this can turn into a DataFrame...
With Series:
In [15]: s1
Out[15]:
a 0
b 1
dtype: int64
In [16]: s2
Out[16]:
c 2
d 3
e 4
dtype: int64
In [17]: s3
Out[17]:
f 5
g 6
dtype: int64
In [18]: pd.concat([s1,s2,s3])
Out[18]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
In [19]: pd.concat([s1,s2,s3],axis=1)
Out[19]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
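Passing keys to concat labels each input piece: along axis=0 this builds a hierarchical index, and along axis=1 the keys become the column headers. A sketch with the same kind of Series as above:
import pandas as pd
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
# hierarchical index on the rows
pd.concat([s1, s2], keys=['one', 'two'])
# keys become column names when concatenating side by side
pd.concat([s1, s2], axis=1, keys=['one', 'two'])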
With DataFrames:
In [67]: df1
Out[67]:
a b c d
0 1.241152 0.719738 0.076190 0.126979
1 1.773495 -0.221310 -0.581650 0.824466
2 0.505247 -1.218025 1.291052 -0.208664
In [68]: df2
Out[68]:
b d a
0 -0.809509 1.781506 -0.091040
1 0.943622 -1.629456 -0.258901
In [69]: pd.concat([df1,df2],ignore_index=True)
Out[69]:
a b c d
0 1.241152 0.719738 0.076190 0.126979
1 1.773495 -0.221310 -0.581650 0.824466
2 0.505247 -1.218025 1.291052 -0.208664
3 -0.091040 -0.809509 NaN 1.781506
4 -0.258901 0.943622 NaN -1.629456
1.1.5. Combining overlapping data
Use np.where, or combine_first.
When the data can't be combined with the simple methods above, use a vectorized if-else.
In [80]: a = Series([np.nan,2.5,np.nan],index=['a','b','c'])
In [81]: b = Series([11,2.5,np.nan],index=['a','b','c'])
In [82]: a
Out[82]:
a NaN
b 2.5
c NaN
dtype: float64
In [83]: b
Out[83]:
a 11.0
b 2.5
c NaN
dtype: float64
In [84]: np.where(pd.isnull(a),b,a)
Out[84]: array([ 11. ,   2.5,   nan])
In [88]: Series(np.where(pd.isnull(a),b,a))
Out[88]:
0    11.0
1     2.5
2     NaN
dtype: float64
In [91]: a.combine_first(b)
Out[91]:
a 11.0
b 2.5
c NaN
dtype: float64
The display differs from the book: np.where returns a plain ndarray, so the index labels are lost, whereas combine_first returns a Series with the labels intact. (Note the argument order: to fill a's missing values from b, the condition picks b where a is null.)
2. Reshaping and pivoting
Reshaping: reshape
Pivoting: stack, unstack
- stack: rotates the data's columns into rows
- unstack: the reverse
In [4]: data = DataFrame(np.arange(6).reshape(2,3),
...: index = pd.Index(['Ohio','Colorado'],name='state'),
...: columns = pd.Index(['one','two','three'],name='number'))
In [5]: data
Out[5]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
In [6]: data.stack()
Out[6]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
In [12]: result = data.stack()
In [13]: result.unstack()
Out[13]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
# specify which level to unstack
In [15]: result.unstack(0)
Out[15]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [16]: result.unstack(1)
Out[16]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
In [17]: result.unstack('state')
Out[17]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [18]: result.unstack('number')
Out[18]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
- Note: if the values are incomplete, unstacking may introduce missing values. The round trip looks invertible only because stack() drops NaN by default.
- To keep the NaN values when stacking, pass dropna=False
In [22]: s1 = Series([0,1,2,3],index=['a','b','c','d'])
In [23]: s2 = Series([4,5,6],index=['c','d','e'])
In [24]: s1
Out[24]:
a 0
b 1
c 2
d 3
dtype: int64
In [25]: s2
Out[25]:
c 4
d 5
e 6
dtype: int64
In [26]: data2 = pd.concat([s1,s2],keys=['one','two'])
In [27]: data2
Out[27]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
In [28]: data2.unstack()
Out[28]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
In [29]: data2.unstack().stack()
Out[29]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
In [30]: data2.unstack().stack(dropna=False)
Out[30]:
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
3. Data transformation
3.1. Removing duplicates
- By default, deduplication considers all columns
In [41]: data = DataFrame({'k1':['one'] * 3 + ['two'] * 4,
...: 'k2':[1,1,2,3,3,4,4]})
In [42]: data
Out[42]:
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
In [43]: data.duplicated()
Out[43]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
In [44]: data.drop_duplicates()
Out[44]:
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
- To deduplicate on one or several specific columns, pass them explicitly; use keep to choose whether the first or the last occurrence is retained
In [45]: data.drop_duplicates('k1')
Out[45]:
k1 k2
0 one 1
3 two 3
In [50]: data2 = DataFrame({'k1':['one'] * 3 + ['two'] * 4,
...: 'k2':range(7),
...: 'k3':['AA'] * 3 + ['BB'] * 2 + ['CC'] * 2})
In [51]: data2
Out[51]:
k1 k2 k3
0 one 0 AA
1 one 1 AA
2 one 2 AA
3 two 3 BB
4 two 4 BB
5 two 5 CC
6 two 6 CC
In [52]: data2.drop_duplicates(['k1','k3'])
Out[52]:
k1 k2 k3
0 one 0 AA
3 two 3 BB
5 two 5 CC
In [56]: data2.drop_duplicates(['k1','k3'], keep = 'last')
Out[56]:
k1 k2 k3
2 one 2 AA
4 two 4 BB
6 two 6 CC
3.2. Transforming data with a function or mapping
The map function. Obviously I don't want to type that many words by hand, so the usual pattern is to normalize the words to a consistent format first, then apply the mapping. (In the snippet below, shuffle is presumably sklearn.utils.shuffle, which returns a shuffled copy that keeps the original index labels.)
In [76]: data = Series(['A']*4 + ['a'] + ['B']*3 + ['c']* 5)
In [77]: studentGrade = DataFrame({'grade':shuffle(data),'studenNo':range(13)})
In [78]: map_to_evaluation = {
...: 'a' : 'good',
...: 'b' : 'mid',
...: 'c' : 'bad'}
In [79]: studentGrade['evaluate'] = studentGrade['grade'].map(str.lower).map(map_to_evaluation)
In [80]: studentGrade
Out[80]:
grade studenNo evaluate
3 A 0 good
5 B 1 mid
10 c 2 bad
9 c 3 bad
2 A 4 good
4 a 5 good
8 c 6 bad
7 B 7 mid
0 A 8 good
1 A 9 good
12 c 10 bad
6 B 11 mid
11 c 12 bad
3.3. The same mapping as a lambda
In [84]: studentGrade['eval2'] = studentGrade['grade'].map(lambda x : map_to_evaluation[x.lower()])
In [85]: studentGrade
Out[85]:
grade studenNo evaluate eval2
3 A 0 good good
5 B 1 mid mid
10 c 2 bad bad
9 c 3 bad bad
2 A 4 good good
4 a 5 good good
8 c 6 bad bad
7 B 7 mid mid
0 A 8 good good
1 A 9 good good
12 c 10 bad bad
6 B 11 mid mid
11 c 12 bad bad
3.4. Replacing values
Use the replace function: it can change a single value, replace a run of values built with range, or take two lists as a one-to-one mapping; use your head, experiment with the function, and loosely-constrained Python can do almost anything..
In [89]: data
Out[89]:
0 A
1 A
2 A
3 A
4 a
5 B
6 B
7 B
8 c
9 c
10 c
11 c
12 c
dtype: object
# replace a single value
In [90]: data.replace('A',np.nan)
Out[90]:
0 NaN
1 NaN
2 NaN
3 NaN
4 a
5 B
6 B
7 B
8 c
9 c
10 c
11 c
12 c
dtype: object
# one-to-one replacement; can be written as a dict or as two lists
In [91]: data.replace(['A','B'],['A.change','B.change'])
Out[91]:
0 A.change
1 A.change
2 A.change
3 A.change
4 a
5 B.change
6 B.change
7 B.change
8 c
9 c
10 c
11 c
12 c
dtype: object
# replace a run of values; to target values by magnitude, sort first, then replace
In [102]: number.replace([range(4)],np.nan)
Out[102]:
0 NaN
1 NaN
2 NaN
3 NaN
4 4.0
5 5.0
6 6.0
7 7.0
dtype: float64
In [103]: number.replace([range(4,7)],np.nan)
Out[103]:
0 0.0
1 1.0
2 2.0
3 3.0
4 NaN
5 NaN
6 NaN
7 7.0
dtype: float64
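The dict form mentioned above packs several distinct replacements into one call; a quick sketch:
import numpy as np
import pandas as pd
data = pd.Series(['A', 'a', 'B', 'c'])
# each key is replaced by its own value
data.replace({'A': np.nan, 'B': 'B.change'})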
3.5. Renaming axis indexes
Use rename; it is generally used to create a differently formatted copy of the data and does not modify the original.
In [163]: testGrade
Out[163]:
grade studenNo evaluate eval2
a A 0 good good
b B 1 mid mid
c c 2 bad bad
d c 3 bad bad
e A 4 good good
f a 5 good good
g c 6 bad bad
h B 7 mid mid
i A 8 good good
j A 9 good good
k c 10 bad bad
l B 11 mid mid
m c 12 bad bad
In [164]: testGrade.rename(index =str.upper, columns= str.upper)
Out[164]:
GRADE STUDENNO EVALUATE EVAL2
A A 0 good good
B B 1 mid mid
C c 2 bad bad
D c 3 bad bad
E A 4 good good
F a 5 good good
G c 6 bad bad
H B 7 mid mid
I A 8 good good
J A 9 good good
K c 10 bad bad
L B 11 mid mid
M c 12 bad bad
In [165]: testGrade
Out[165]:
grade studenNo evaluate eval2
a A 0 good good
b B 1 mid mid
c c 2 bad bad
d c 3 bad bad
e A 4 good good
f a 5 good good
g c 6 bad bad
h B 7 mid mid
i A 8 good good
j A 9 good good
k c 10 bad bad
l B 11 mid mid
m c 12 bad bad
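rename also accepts dict mappings to change only selected labels, plus inplace=True if you do want to modify the original. A sketch on a toy frame (not the book's data):
import pandas as pd
df = pd.DataFrame({'grade': ['A', 'B']}, index=['a', 'b'])
# only the listed labels change; everything else is untouched
df.rename(index={'a': 'x'}, columns={'grade': 'GRADE'})
# df.rename(index={'a': 'x'}, inplace=True)  # modifies df itself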
3.6. Discretization and binning
Split continuous data into discrete bins, for example dividing customers' ages into age groups.
Since I don't feel like typing data in, generate a random test array instead; see my earlier post: [python3.6 numpy 数组的多种取整方式]
The cut function
- Generate a random array of ages
In [268]: ages = np.array( 90 * np.random.rand(30) ).astype(np.int)
In [269]: ages
Out[269]:
array([85, 39, 55, 71, 4, 2, 24, 59, 65, 44, 84, 58, 2, 27, 36, 24, 41,
12, 55, 14, 26, 81, 32, 82, 76, 40, 6, 29, 7, 47])
- Ages that fall outside the defined bins come back as NaN
In [278]: group_names = ['Youth','YouthAdult','MiddleAge','Senior']
In [279]: bins
Out[279]: [18, 25, 35, 60, 100]
In [280]: pd.cut(ages,bins,right=False,labels=group_names)
Out[280]:
[Senior, MiddleAge, MiddleAge, Senior, NaN, ..., MiddleAge, NaN, YouthAdult, NaN, MiddleAge]
Length: 30
Categories (4, object): [Youth < YouthAdult < MiddleAge < Senior]
- So add a starting age to the bin edges
In [281]: ages
Out[281]:
array([85, 39, 55, 71, 4, 2, 24, 59, 65, 44, 84, 58, 2, 27, 36, 24, 41,
12, 55, 14, 26, 81, 32, 82, 76, 40, 6, 29, 7, 47])
In [282]: bins2 = [0,18,25,35,60,100]
In [283]: group_names2 = ['Minor','Youth','YouthAdult','MiddleAge','Senior']
In [284]: pd.cut(ages,bins2,right=False,labels=group_names2)
Out[284]:
[Senior, MiddleAge, MiddleAge, Senior, Minor, ..., MiddleAge, Minor, YouthAdult, Minor, MiddleAge]
Length: 30
Categories (5, object): [Minor < Youth < YouthAdult < MiddleAge < Senior]
The qcut function
qcut bins by sample quantiles rather than by equal-length intervals computed from the min and max (that is cut's behavior): with A data points sorted by value and qcut set to 4, each bin receives roughly A/4 points, and the bin edges are then read off from the values at those quantile positions.
In [288]: data = np.random.rand(1000)
In [289]: cats = pd.qcut(data,4)
In [290]: pd.value_counts(cats)
Out[290]:
(0.736, 1.0] 250
(0.483, 0.736] 250
(0.225, 0.483] 250
(-0.0009003, 0.225] 250
dtype: int64
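qcut also accepts explicit quantiles (numbers between 0 and 1) instead of an integer, for unequal-sized bins; a sketch:
import numpy as np
import pandas as pd
data = np.random.rand(1000)
# 10% / 40% / 40% / 10% of the points per bin
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
pd.value_counts(cats)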
4. Detecting and filtering outliers
- Use describe() for an overall picture of the values
- Filter out values beyond some threshold
- Reassign the "outliers" found in the previous step
In [312]: np.random.seed(12345)
In [313]: data = DataFrame(np.random.randn(1000,4))
In [314]: data.describe()
Out[314]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
# find values in one column exceeding 3 in absolute value
In [315]: col = data[3]
In [316]: col[np.abs(col) > 3]
Out[316]:
97 3.927528
305 -3.399312
400 -3.745356
Name: 3, dtype: float64
# find rows containing any value exceeding 3 in absolute value
In [317]: data[(np.abs(data) > 3).any(1)]
Out[317]:
0 1 2 3
5 -0.539741 0.476985 3.248944 -1.021228
97 -0.774363 0.552936 0.106061 3.927528
102 -0.655054 -0.565230 3.176873 0.959533
305 -2.315555 0.457246 -0.025907 -3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990 1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586 0.275144 1.179227 -3.184377 1.369891
808 -0.362528 -3.548824 1.553205 -2.186301
900 3.366626 -2.372214 0.851010 1.332846
# cap the outliers (see the note below on what .any(1) actually selects)
In [318]: data[(np.abs(data) > 3).any(1)] = np.sign(data) * 3
In [319]: data.describe()
Out[319]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.072154 0.072823 0.023305 0.001450
std 1.031713 1.028480 1.028945 1.017781
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.795388 -0.599807 -0.670407 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.624615 0.792539 0.680976 0.654328
max 3.000000 3.000000 3.000000 3.000000
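One caveat on the assignment above: the (np.abs(data) > 3).any(1) mask selects whole rows, so every value in a flagged row is overwritten with ±3, which is presumably why std grows slightly in the describe() output. To cap only the offending cells, index with the element-wise mask instead (this is, I believe, the book's version):
# only cells whose absolute value exceeds 3 are replaced;
# the rest of each row is left untouched
data[np.abs(data) > 3] = np.sign(data) * 3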
5. Permutation and random sampling
- Reorder with permutation
In [324]: df = DataFrame(np.arange(20).reshape(5,4))
In [325]: df
Out[325]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [326]: sampler = np.random.permutation(5)
In [327]: df.take(sampler)
Out[327]:
0 1 2 3
1 4 5 6 7
3 12 13 14 15
4 16 17 18 19
0 0 1 2 3
2 8 9 10 11
In [330]: df.take(np.random.permutation(len(df)))
Out[330]:
0 1 2 3
1 4 5 6 7
0 0 1 2 3
4 16 17 18 19
3 12 13 14 15
2 8 9 10 11
- Sampling with replacement, using random integers
In [331]: bag = np.array([5,7,-1,6,4])
In [332]: sampler = np.random.randint(0,len(bag),size =10)
In [333]: sampler
Out[333]: array([3, 0, 4, 1, 1, 2, 3, 0, 1, 2])
In [334]: draws = bag.take(sampler)
In [335]: draws
Out[335]: array([ 6, 5, 4, 7, 7, -1, 6, 5, 7, -1])
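If your pandas is new enough (0.16+), DataFrame.sample wraps both patterns; a sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(20).reshape(5, 4))
df.sample(n=3)                 # 3 rows without replacement
df.sample(n=10, replace=True)  # sampling with replacement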
6. Computing indicator/dummy variables
Commonly used in machine-learning classification algorithms: a "1" means the row falls into class x, much like one-hot encoding.
- The basic approach
In [4]: df = DataFrame({'key':['b','b','a','c','a','b'],
...: 'data1':range(6)})
In [5]: pd.get_dummies(df['key'])
Out[5]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [6]: dummies = pd.get_dummies(df['key'],prefix='key')
In [7]: dummies
Out[7]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [8]: df_with_dummies = df[['data1']].join(dummies)
In [9]: df_with_dummies
Out[9]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
- Ranges of continuous values can be discretized first
In [19]: values = np.random.rand(10)
In [20]: values
Out[20]:
array([ 0.12422622, 0.66517564, 0.82179204, 0.95121697, 0.73976916,
0.00804186, 0.51379806, 0.00963952, 0.11634595, 0.52704073])
In [21]: bins = [0,0.2,0.4,0.6,0.8,1]
In [22]: pd.get_dummies(pd.cut(values,bins))
Out[22]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 1 0 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 0 0 0 0 1
4 0 0 0 1 0
5 1 0 0 0 0
6 0 0 1 0 0
7 1 0 0 0 0
8 1 0 0 0 0
9 0 0 1 0 0
Right, time to brush up on database basics again... merge here is a lot like a SQL join. And the Cartesian product shows up again...
Cartesian product: the name is long and easy to forget, so remember the "product": in a full cross join, the new table's row count is the product of the two tables' row counts.
SQL joins: the following is well written and worth consulting
极客学院-SQL 使用连接
2018.8.1 progress: page 214 of the book