Data transformation in pandas covers filtering, cleaning, and similar operations.
duplicated() returns a boolean Series indicating whether each row duplicates one seen earlier
drop_duplicates() removes the duplicate rows (keeping the first occurrence)
These are straightforward, so let's go right to an example:
In [20]: s = pd.DataFrame({'key': ['a']*4 + ['b']*3,
    ...:                   'key0': [1, 1, 2, 3, 3, 4, 4]})

In [21]: s.duplicated()
Out[21]:
0    False
1     True
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [22]: s.drop_duplicates()
Out[22]:
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4

In [23]: s.drop_duplicates('key')  # deduplicate based on a single column
Out[23]:
  key  key0
0   a     1
4   b     3

In [24]: s.drop_duplicates(['key', 'key0'])  # pass a list of columns to deduplicate on
Out[24]:
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4
In [25]: s.key.drop_duplicates()  # attribute-style column access works too
Out[25]:
0 a
4 b
Name: key, dtype: object
In [61]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    ...:                               'Pastrami', 'corned beef', 'Bacon',
    ...:                               'pastrami', 'honey ham', 'nova lox'],
    ...:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [62]: data
Out[62]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you want to add a column indicating which animal each meat came from. Start by writing a mapping from meat to animal.
In [63]: meat_to_animal = {
    ...:     'bacon': 'pig',
    ...:     'pulled pork': 'pig',
    ...:     'pastrami': 'cow',
    ...:     'corned beef': 'cow',
    ...:     'honey ham': 'pig',
    ...:     'nova lox': 'salmon'
    ...: }
The map method of a Series accepts a function or a dict-like object containing a mapping.
There is a snag here, though: some of the food names are capitalized and others are not, so the case has to be normalized first.
In [64]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

map can also apply a function directly; here each element of data['food'] is passed to a lambda:
In [65]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[65]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
replace()
In [26]: re = pd.Series([1, -9999, -9999, 2, 3, 4, 5, -1000, 0])

In [27]: re
Out[27]:
0       1
1   -9999
2   -9999
3       2
4       3
5       4
6       5
7   -1000
8       0
dtype: int64

In [28]: re.replace(-9999, np.nan)  # replace a single value
Out[28]:
0       1.0
1       NaN
2       NaN
3       2.0
4       3.0
5       4.0
6       5.0
7   -1000.0
8       0.0
dtype: float64

In [29]: re.replace([-9999, -1000], np.nan)  # replace several values with one value
Out[29]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    0.0
dtype: float64

In [30]: re.replace([-9999, -1000], [np.nan, 0])  # parallel lists of values and replacements
Out[30]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64

In [32]: re.replace({-9999: np.nan, -1000: 0})  # the argument can also be a dict
Out[32]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64

(Beware: naming the Series re shadows the standard-library re module that is imported later in these notes; a name like ser would be safer.)
Renaming axis indexes
rename() creates a copy of the data; pass inplace=True to modify the object in place
In [41]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
    ...:                     index=pd.Index(['Oh', 'Co'], name='state'),
    ...:                     columns=pd.Index(['one', 'two', 'three'], name='number'))
In [42]: data.rename(index=str.title,columns=str.upper)
Out[42]:
number ONE TWO THREE
state
Oh 0 1 2
Co 3 4 5
In [43]: data.rename(index={'co': 'sx'}, columns={'one': 'first'})  # pass dicts to rename just a subset of the labels
Out[43]:
number  first  two  three
state
Oh          0    1      2
Co          3    4      5

Note that the dict keys are matched case-sensitively: 'co' does not match the label 'Co', so the row index is left unchanged here.
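None of the calls above touch data itself. A minimal sketch of the in-place variant (the new label 'Ohio' is mine, just for illustration):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Oh', 'Co'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
# with inplace=True, rename() mutates the frame and returns None
data.rename(index={'Oh': 'Ohio'}, inplace=True)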
cut()
In [45]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [47]: bins = [18, 25, 35, 60, 100]

In [48]: cats = pd.cut(ages, bins)  # to close the intervals on the left instead, pass pd.cut(ages, bins, right=False)

In [49]: cats  # the result is a special Categorical object
Out[49]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
You can also give the bins names:
In [56]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [57]: pd.cut(ages, bins, labels=group_names)
Out[57]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
If you pass an integer number of bins instead of explicit edges, cut computes equal-length bins from the minimum and maximum of the data:
In [58]: data = np.random.rand(20)
In [59]: pd.cut(data, 4, precision=2)  # split into 4 bins, limiting the decimal precision to two digits
Out[59]:
[(0.29, 0.52], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52], ...,
(0.75, 0.98], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52]]
Length: 20
Categories (4, interval[float64]): [(0.057, 0.29] < (0.29, 0.52] < (0.52, 0.75] < (0.75, 0.98]]
qcut is a function similar to cut that bins the data based on sample quantiles. Depending on the distribution of the data, cut usually will not give each bin the same number of points; qcut, because it uses sample quantiles, produces bins of roughly equal size.
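As a quick illustration of that difference (a minimal sketch; the data and bin count are arbitrary):

import numpy as np
import pandas as pd

data = np.random.randn(1000)   # 1000 draws from a normal distribution
quartiles = pd.qcut(data, 4)   # bin by sample quantiles (quartiles)
# each of the 4 bins ends up with roughly 250 points
print(pd.value_counts(quartiles))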
Next, let's randomly select some rows of a DataFrame. The approach is simply to generate row numbers at random and then select with them.
The numpy.random.permutation() function produces a random reordering.

In [67]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

In [68]: ran = np.random.permutation(5)

In [70]: ran
Out[70]: array([2, 3, 0, 1, 4])

In [71]: df
Out[71]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [72]: df.take(ran)
Out[72]:
    0   1   2   3
2   8   9  10  11
3  12  13  14  15
0   0   1   2   3
1   4   5   6   7
4  16  17  18  19
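If you only want a random subset of rows rather than a full reshuffle, one common pattern (a sketch; the sample size 3 is arbitrary) is to slice the permutation:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
# take the first 3 positions of a random permutation: sampling without replacement
sample = df.take(np.random.permutation(len(df))[:3])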
Converting a categorical variable into a "dummy matrix" or "indicator matrix": if a column of a DataFrame has k distinct values, you can derive from it a k-column matrix or DataFrame of 1s and 0s.
In [74]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
    ...:                    'data1': range(6)})

In [75]: pd.get_dummies(df['key'])
Out[75]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
To add a prefix to the columns of the indicator DataFrame:
In [76]: dummies = pd.get_dummies(df['key'], prefix='key')

In [77]: dummies
Out[77]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
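A common follow-up step (a sketch of the usual pattern, not something the session above runs) is to attach the indicator columns back onto the rest of the frame:

import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
dummies = pd.get_dummies(df['key'], prefix='key')
# keep data1 and attach the key_a / key_b / key_c indicator columns
df_with_dummy = df[['data1']].join(dummies)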
While we're here, note the following:
In [80]: df['data1']
Out[80]:
0    0
1    1
2    2
3    3
4    4
5    5
Name: data1, dtype: int64

In [81]: type(df['data1'])
Out[81]: pandas.core.series.Series

In [82]: df[['data1']]
Out[82]:
   data1
0      0
1      1
2      2
3      3
4      4
5      5

In [83]: type(df[['data1']])
Out[83]: pandas.core.frame.DataFrame
df['data1'] returns a Series, while df[['data1']] returns a DataFrame.
Python has simple, easy-to-use string and text processing facilities: most text operations are built in as string object methods, and regular expressions are available for everything else. pandas builds on both, letting you apply string and regex operations across whole arrays of data while handling the annoyance of missing values.
String object methods
A few simple examples:
In [87]: zifuchuan = ' i can be a can, i do not balabala'  # note the leading space
In [88]: sp = zifuchuan.split(',')
In [89]: sp
Out[89]: [' i can be a can', ' i do not balabala']
In [90]: ':::'.join(sp)
Out[90]: ' i can be a can::: i do not balabala'
In [91]: zifuchuan.index('can')
Out[91]: 3
In [92]: zifuchuan.index('i')
Out[92]: 1
In [94]: zifuchuan.count('can')
Out[94]: 2
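A couple of other built-in string methods worth a mention (a quick sketch on the same string):

zifuchuan = ' i can be a can, i do not balabala'
zifuchuan.strip()                  # drop leading/trailing whitespace
zifuchuan.replace('can', 'may')    # substitute every occurrence of a substring
zifuchuan.startswith(' i')         # prefix test -> True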
Regular expressions (regex) provide a flexible way to search for and match string patterns in text.
Regular expressions in Python live in the re module, whose functions fall into three categories: pattern matching, substitution, and splitting.
A good summary of regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Here are a few simple examples; a fuller write-up on regular expressions can wait for another time.
In [119]: import re  # first import Python's re module

In [120]: text = 'I love\t you'  # note: no trailing space here

In [121]: re.split(r'\s+', text)
Out[121]: ['I', 'love', 'you']

In [122]: text = 'I love\t you '  # with a trailing space, the split also produces an empty string at the end

In [123]: re.split(r'\s+', text)
Out[123]: ['I', 'love', 'you', '']
In the example above, the regular expression is compiled first and its split method is then called on text.
You can also do the compilation explicitly:
In [125]: pattern = re.compile(r'\s+')  # compile once to get a reusable regex object

In [126]: pattern.split(text)
Out[126]: ['I', 'love', 'you', '']
findall, search, match, sub
In [132]: text = """Dave [email protected]
     ...: Steve [email protected]
     ...: Rob [email protected]
     ...: Ryan [email protected]
     ...: """

In [133]: pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'  # matches an email address

In [134]: regex = re.compile(pattern, flags=re.IGNORECASE)  # compile first, ignoring case

In [135]: regex.findall(text)  # return every match of the pattern
Out[135]: ['[email protected]', '[email protected]', '[email protected]', '[email protected]']

In [137]: m = regex.search(text)  # return only the first match

In [138]: m
Out[138]: <_sre.SRE_Match object; span=(5, 20), match='[email protected]'>

In [141]: text[m.start():m.end()]
Out[141]: '[email protected]'

In [144]: m.string  # the original string the pattern was matched against
Out[144]: 'Dave [email protected]\nSteve [email protected]\nRob [email protected]\nRyan [email protected]\n'

In [143]: print(regex.match(text))  # match only matches at the start of the string; the text starts with 'Dave', so nothing matches and None is returned
None
In [147]: print(regex.sub('replace', text))  # replace every match of the pattern
Dave replace
Steve replace
Rob replace
Ryan replace
Splitting a match into groups:
In [148]: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [149]: regex = re.compile(pattern, flags=re.IGNORECASE)

In [153]: regex.findall(text)
Out[153]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [155]: s = regex.search(text)

In [156]: s
Out[156]: <_sre.SRE_Match object; span=(5, 20), match='[email protected]'>

In [157]: s.groups()
Out[157]: ('dave', 'google', 'com')
Giving the matched groups names:
In [163]: regex = re.compile(r"""
     ...:     (?P<username>[A-Z0-9._%+-]+)
     ...:     @(?P<domain>[A-Z0-9.-]+)
     ...:     \.(?P<suffix>[A-Z]{2,4})""",
     ...:     flags=re.IGNORECASE | re.VERBOSE)

In [164]: m = regex.match('[email protected]')

In [165]: m.groupdict()
Out[165]: {'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}

In [171]: f = regex.search(text)

In [172]: f.group('username')
Out[172]: 'dave'

In [173]: f = regex.findall(text)

In [174]: f
Out[174]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
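Grouping also pays off with sub: the replacement string can refer to the captured groups. A sketch continuing with the named-group regex and text from the session above (\g<name> is re's standard syntax for a named back-reference):

# rewrite every email address as a labeled breakdown of its parts
print(regex.sub(r'Username: \g<username>, Domain: \g<domain>, Suffix: \g<suffix>', text))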
Series has a str attribute whose methods operate element-wise on the contents of the Series, skipping missing values:
In [176]: series = pd.Series({'Dave': '[email protected]', 'Steve': '[email protected]',
     ...:                     'Rob': '[email protected]', 'Wes': np.nan})

In [177]: series
Out[177]:
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object

In [178]: series.str.contains('rob')
Out[178]:
Dave     False
Rob       True
Steve    False
Wes        NaN
dtype: object

In [179]: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [180]: series.str.findall(pattern, re.IGNORECASE)
Out[180]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
...
By contrast, map raises an error when it hits an NA value:
In [199]: matches = series.str.upper()

In [200]: matches
Out[200]:
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object

In [202]: series.map(str.upper)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-202> in <module>()
----> 1 series.map(str.upper)
...
TypeError: descriptor 'upper' requires a 'str' object but received a 'float'
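If you do need map with an arbitrary function, one workaround (a sketch, recreating the series from above) is to guard inside the function so NaN passes through untouched:

import numpy as np
import pandas as pd

series = pd.Series({'Dave': '[email protected]', 'Steve': '[email protected]',
                    'Rob': '[email protected]', 'Wes': np.nan})
# only call .upper() on actual strings; NaN (a float) is returned unchanged
series.map(lambda x: x.upper() if isinstance(x, str) else x)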