上一篇学习笔记学习了如何处理缺失值(数据清理),这一篇笔记继续学习pandas的数据处理:数据转换
Removing Duplicates去重
#首先还是进入ipython,当然你也可以用python进行练习
$ ipython
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...: 'k2': [1, 1, 2, 3, 3, 4, 4]}) #构建一个dataframe
In [3]: data #这里索引5和索引6对应的行是完全一样的
Out[3]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [4]: data.duplicated() #检查是否有重复的行,这里索引6的结果返回的是True
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [5]: data.drop_duplicates() #去掉.duplicated()函数判断结果为True的行,只保留返回结果为False的行
Out[5]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
上面的去重是直接把整行去掉,你也可以按照某一列里的元素进行去重:
In [6]: data['v1'] = range(7) #把dataframe加一列v1
In [7]: data
Out[7]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
In [8]: data.drop_duplicates(['k1']) #根据k1这一列里的元素进行去重,所以只返回了索引为0和1的两行,因为对于k1列来说,后面的元素都是one和two,所以都是重复项
Out[8]:
k1 k2 v1
0 one 1 0
1 two 1 1
duplicated
和drop_duplicates
都只保留了重复项的第一项,你也可以选择只保留所有重复项里最后出现的那一项:
In [9]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[9]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6 #索引5和6对应的两行是重复的,这里只保留了索引6对应的行
Transforming Data Using a Function or Mapping
对于许多数据集,你可能想根据数组、Series或Dataframe其中某一列中的值执行一些转换,举个例子:
In [11]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...: 'Pastrami', 'corned beef', 'Bacon',
...: 'pastrami', 'honey ham', 'nova lox'],
...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [12]: data #创建一个dataframe,一列是food,一列是ounces数
Out[12]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
然后再加一列,是每一种食物对应的喂养的动物:
In [13]: meat_to_animal ={
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a map‐
ping, but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the str.lower
Series method
这里需要注意的是,如果你要使用map
功能,你必须使你要map的两个对象的大小写一致,比如Pastrami这个单词,在data和meat_to_animal里的开头字母的大小写不一致,所以我们要先把data里的food一栏都改成小写开头:
In [14]: lowercased = data['food'].str.lower()
In [15]: lowercased
Out[15]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
然后将data和meat_to_animal合并,并根据food一栏进行map:
In [16]: data['animal'] = lowercased.map(meat_to_animal)
In [17]: data
Out[17]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
上面是map之后,返回的是完整的dataframe,你也可以选择只返回map好的meat_to_animal一列:
In [18]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[18]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values
上一篇笔记学习了用fillna
填补缺失值,这只是一个特殊情况。现在来学习一下一般值的替换。使用replace
功能:
In [19]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [20]: data
Out[20]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
In [21]: import numpy as np
In [22]: data.replace(-999, np.nan) #把-999的值替换成缺失值
Out[22]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [23]: data.replace([-999, -1000], np.nan) #把-999和-1000替换成缺失值
Out[23]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [24]: data.replace([-999, -1000], [np.nan, 0]) #把-999替换成缺失值,把-1000替换成0
Out[24]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
In [25]: data.replace({-999: np.nan, -1000: 0}) #用字典形式进行替换
Out[25]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
Renaming Axis Indexes重命名轴索引
除了对dataframe里的值进行map,你也可以map dataframe的行名和列名:
#将行名重命名
In [26]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
In [27]: data
Out[27]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [28]: transform = lambda x: x[:4].upper()
In [29]: data.index.map(transform)
Out[29]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
In [30]: data.index = data.index.map(transform)
In [31]: data
Out[31]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
#将列名重命名
#将列名进行大写处理
In [32]: data.rename(index=str.title, columns=str.upper)
Out[32]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
用rename
功能同时进行行名和列名的修改:
In [33]: data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
Out[33]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename
功能默认不覆盖原始dataframe,你也可以通过参数设置,让替换后的dataframe覆盖原始dataframe:
In [34]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
In [35]: data
Out[35]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
Discretization and Binning离散化和分箱
连续型数据经常是离散的,或者被分成不同的“箱”(bin)进行分析。假设这里有一组年龄的数据,你要把这些人的年龄分成几组:
In [36]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
然后我们将把年龄分成:18-25,26-35,36-60,以及61以上的。你可以使用pandas里的cut
函数:
In [37]: bins = [18, 25, 35, 60, 100]
In [38]: cats = pd.cut(ages,bins)
In [39]: cats #把ages里每一个数字都对应到相应的bin里
Out[39]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
除了把每一个元素对应到bin里,你还可以根据bin的索引来map你的每一个元素:
#查看你的bin
In [40]: cats.categories
Out[40]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
#根据bin的索引位置来map你的元素
In [41]: cats.codes
Out[41]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
#查看每一个bin里map了多少个元素
In [42]: pd.value_counts(cats)
Out[42]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
你也可以给每一个bin命名:
In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [44]: pd.cut(ages, bins, labels=group_names)
Out[44]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
上面是对于整数元素而言的,下面看看浮点元素的操作:
In [45]: data = np.random.rand(20)
In [46]: data #随机生成20个0-1的数字
Out[46]:
array([0.81110965, 0.47609535, 0.71813052, 0.71125473, 0.50874785,
0.54215936, 0.51799094, 0.92532814, 0.75840793, 0.59097154,
0.28318023, 0.580895 , 0.57077378, 0.86404974, 0.10115019,
0.29316477, 0.57158071, 0.14108119, 0.36732534, 0.18085161])
In [47]: pd.cut(data,4,precision=2) #从上面20个数字里的最小值到最大值,等分成4分,就是4个bin
#precision=2的意思是保留2位小数点
Out[47]:
[(0.72, 0.93], (0.31, 0.51], (0.51, 0.72], (0.51, 0.72], (0.31, 0.51], ..., (0.1, 0.31], (0.51, 0.72], (0.1, 0.31], (0.31, 0.51], (0.1, 0.31]]
Length: 20
Categories (4, interval[float64]): [(0.1, 0.31] < (0.31, 0.51] < (0.51, 0.72] < (0.72, 0.93]]
Detecting and Filtering Outliers检测并过滤离群值
In [48]: data = pd.DataFrame(np.random.randn(1000, 4)) #构建一个1000行,4列的datagrame由随机数构成
In [49]: data
Out[49]:
0 1 2 3
0 0.650892 2.070260 0.538299 0.679726
1 1.074305 0.495664 -0.970136 -0.344586
2 -0.859329 1.453186 1.612602 -0.321719
3 0.558712 0.562590 0.646543 -0.483733
4 0.935728 1.516592 0.852229 1.926105
.. ... ... ... ...
995 -1.304159 -0.085793 0.482636 1.523466
996 0.285450 -1.898452 1.422107 -1.227373
997 2.083328 -1.410110 0.061404 -0.860416
998 -0.438052 -1.744427 -1.677008 -0.481408
999 0.999009 -0.240101 -1.299566 -0.172057
[1000 rows x 4 columns]
In [50]: data.describe() #查看每一列的数据情况
Out[50]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.004135 -0.041167 -0.030157 -0.015668
std 0.987452 0.996992 1.004676 0.982699
min -3.438090 -3.179065 -3.055933 -3.336112
25% -0.683605 -0.714445 -0.679892 -0.677707
50% -0.033715 -0.066457 -0.122572 -0.039982
75% 0.646951 0.578665 0.643963 0.632722
max 3.625363 3.499694 3.896634 3.082743
如果你想找第2列里绝对值大于3的数字:
In [51]: col = data[2]
In [52]: col[np.abs(col) >3]
Out[52]:
106 3.248979
365 3.896634
467 3.659385
475 -3.055933
Name: 2, dtype: float64
寻找所有行里绝对值大于3的数字:
In [53]: data[(np.abs(data) > 3).any(1)]
Out[53]:
0 1 2 3
106 0.402003 0.675224 3.248979 0.594704
135 3.625363 0.757227 0.267661 2.870315
143 -0.040237 3.025956 -0.950451 2.369624
250 0.411540 3.499694 -0.268605 0.811673
365 -1.609991 -0.691870 3.896634 -0.761072
439 0.050316 0.433346 -0.938715 -3.277411
467 -0.722803 2.093205 3.659385 0.373497
475 0.326796 0.196501 -3.055933 -0.601524
492 -3.438090 0.132217 -0.305294 -0.574376
616 -0.623880 0.588810 -0.742245 -3.239445
646 -0.009522 -3.179065 -1.350068 0.309251
778 1.959143 -0.304691 -0.760753 3.082743
871 -0.686462 3.470719 1.305890 -1.060105
919 3.452396 1.216505 -0.495451 0.237577
966 -1.427159 -0.546189 1.346565 -3.336112
把dataframe里所有大于3和小于-3的元素,一律都设置为3:
In [54]: data[np.abs(data) > 3] = np.sign(data) * 3
In [55]: data.describe()
Out[55]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.003495 -0.041985 -0.031906 -0.014898
std 0.982432 0.993157 0.998391 0.979717
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.683605 -0.714445 -0.679892 -0.677707
50% -0.033715 -0.066457 -0.122572 -0.039982
75% 0.646951 0.578665 0.643963 0.632722
max 3.000000 3.000000 3.000000 3.000000
上面的代码里np.sign(data)
的意思是把任何数根据正负判断,变成-1和1:
In [57]: np.sign(data)
Out[57]:
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.0 -1.0 -1.0
2 -1.0 1.0 1.0 -1.0
3 1.0 1.0 1.0 -1.0
4 1.0 1.0 1.0 1.0
.. ... ... ... ...
995 -1.0 -1.0 1.0 1.0
996 1.0 -1.0 1.0 -1.0
997 1.0 -1.0 1.0 -1.0
998 -1.0 -1.0 -1.0 -1.0
999 1.0 -1.0 -1.0 -1.0
Permutation and Random Sampling随机重排
使用np.random.permutation
可以对Series或者dataframe的行和列进行重排:
In [58]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
In [59]: df
Out[59]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [60]: sampler = np.random.permutation(5)
In [61]: sampler
Out[61]: array([2, 4, 0, 3, 1])
In [62]: df.take(sampler)
Out[62]:
0 1 2 3
2 8 9 10 11
4 16 17 18 19
0 0 1 2 3
3 12 13 14 15
1 4 5 6 7
Computing Indicator/Dummy Variables计算指标/虚拟变量
统计建模或机器学习应用的另一种转换是将分类变量转换为“虚拟”或“指标”矩阵。如果DataFrame中的一列有k个不同的值,那么你将得到k个列,包含所有1和0的矩阵或DataFrame。pandas有一个get_dummies
函数来完成这个任务:
In [67]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
...: 'data1': range(6)})
In [68]: df
Out[68]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [69]: pd.get_dummies(df['key'])
Out[69]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
上面的函数可以这样理解:get_dummies
函数是把dataframe里其中一列抽出来,把这一列的元素作为列名,行代表是否出现过key这一列的元素,出现过即为1,没出现即为0。所以对于key列来说,b在最开始出现了两次,所以在 get_dummies`返回的dataframe里,b列的前2行是1,其他列是0。以此类推。
可以把上面得到的dataframe改一下列名:
In [70]: dummies = pd.get_dummies(df['key'], prefix='key')
In [71]: dummies
Out[71]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0