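Pandas Data Wrangling - Transformation - Discretization and Binning
To make analysis easier, continuous data is often discretized or split into "bins" (grouping intervals).
Discretizing continuous data: continuous values such as rainfall, age, or height can be plotted as a histogram, but they cannot be grouped and aggregated directly. Discretizing them removes that limit. For example, rainfall can become light, moderate, heavy, or torrential rain, and age can become youth, young adult, middle-aged, or senior, after which the data can be grouped and aggregated.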
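As a minimal sketch of that idea, the example below bins an age column and then uses the bins as a group key. The table, its column names, the `score` values, and the bin labels are all made up for illustration and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical data: values and column names are invented for illustration.
people = pd.DataFrame({
    "age": [19, 23, 31, 34, 47, 52, 66, 71],
    "score": [55, 61, 70, 74, 68, 72, 59, 63],
})

# Discretize the continuous age column into labeled bins.
people["age_group"] = pd.cut(
    people["age"],
    bins=[18, 25, 35, 60, 100],
    labels=["18-25", "26-35", "36-60", "60+"],
)

# The binned column can now serve as a group key for aggregation.
mean_score = people.groupby("age_group", observed=False)["score"].mean()
print(mean_score)
```

Grouping on the raw `age` column would produce one group per distinct age, which is rarely what an analysis needs.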
In [1]:
import numpy as np
import pandas as pd
Example: divide a set of ages into age groups.
The bins are (18, 25], (25, 35], (35, 60], and (60, 100].
In [2]:
# ages
ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# bin edges
bins = [18, 25, 35, 60, 100]
In [3]:
cats = pd.cut(ages, bins)
cats
Out[3]:
[NaN, (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
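The result is a Categorical object (the bin each value falls into). It can be treated like an array of strings naming the bins.
Internally it holds:
a codes attribute containing an integer bin label for each age
a categories array listing the distinct bins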
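The relationship between the two pieces can be sketched as follows: each code is a position in `categories`, and a code of -1 marks a value that falls in no bin. The `rebuilt` list is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Look up each code in cats.categories; -1 means "no bin" (displayed as NaN).
rebuilt = [
    cats.categories[int(code)] if code != -1 else None
    for code in cats.codes
]

for age, code, interval in zip(ages, cats.codes, rebuilt):
    print(age, code, interval)
```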
In [4]:
type(cats)
Out[4]:
pandas.core.arrays.categorical.Categorical
In [4]:
cats.codes # bin index of each value (a position in the categories below); -1 means the value is in no bin
Out[4]:
array([-1, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [5]:
cats.categories # the categories, i.e. the bin intervals
Out[5]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
In [7]:
cats[11] # look up the bin of a single value
Out[7]:
Interval(25, 35, closed='right')
In [8]:
# bin counts for the pd.cut result
pd.value_counts(cats) # number of values in each bin
Out[8]:
(18, 25] 4
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
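cut uses left-open, right-closed intervals by default: the start value is excluded and the end value is included.
With right=False the intervals are left-closed, right-open: the start value is included and the end value is excluded.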
In [9]:
cats2 = pd.cut(ages, bins, right=False)
cats2
Out[9]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
In [10]:
cats2.codes
Out[10]:
array([0, 0, 1, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [11]:
cats2.categories
Out[11]:
IntervalIndex([[18, 25), [25, 35), [35, 60), [60, 100)]
closed='left',
dtype='interval[int64]')
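With the default right-closed bins, the age 18 fell into no bin (the NaN in Out[3]). If the bins should stay right-closed but still include the lowest edge, `pd.cut` also accepts `include_lowest=True`. The original notebook does not use this parameter, so the sketch below is an addition rather than part of the author's walkthrough.

```python
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# include_lowest=True makes the first interval include its left edge,
# so the minimum bin edge (18) is no longer left out.
cats_incl = pd.cut(ages, bins, include_lowest=True)
print(cats_incl.codes)
```

Unlike `right=False`, this keeps 25 in the first bin and 35 in the second.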
Renaming the bins
In [12]:
cat3 = pd.cut(ages, bins)
cat3 = pd.cut(ages, bins, labels=False) # no bin labels: return integer bin numbers instead
cat3 = pd.cut(ages, bins, labels=['少年', '青年', '中年', '老年']) # custom bin labels (youth, young adult, middle-aged, senior)
cat3
Out[12]:
[NaN, 少年, 少年, 青年, 少年, ..., 青年, 老年, 中年, 中年, 青年]
Length: 12
Categories (4, object): [少年 < 青年 < 中年 < 老年]
In [13]:
cat3.codes
Out[13]:
array([-1, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [14]:
cat3.categories
Out[14]:
Index(['少年', '青年', '中年', '老年'], dtype='object')
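The `labels=False` call in In [12] is overwritten before its result is displayed. A small sketch of what it returns on its own: for a plain list, `pd.cut` gives back an array of bin numbers instead of a Categorical, with NaN where a value falls in no bin. The name `bin_numbers` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# labels=False returns the integer position of each value's bin.
bin_numbers = pd.cut(ages, bins, labels=False)
print(bin_numbers)
```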
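Instead of explicit bin edges, you can pass the number of bins; pandas then computes equal-width edges from the minimum and maximum of the data.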
In [15]:
cat4 = pd.cut(ages, 4, precision=2) # split into four equal-width bins; bin labels use 2 decimal places
cat4
Out[15]:
[(17.96, 28.75], (17.96, 28.75], (17.96, 28.75], (17.96, 28.75], (17.96, 28.75], ..., (28.75, 39.5], (50.25, 61.0], (39.5, 50.25], (39.5, 50.25], (28.75, 39.5]]
Length: 12
Categories (4, interval[float64]): [(17.96, 28.75] < (28.75, 39.5] < (39.5, 50.25] < (50.25, 61.0]]
In [20]:
(61 - 18) / 4 # (max - min) / number of bins gives the width of each bin
Out[20]:
10.75
In [25]:
18 + 10.75, 28.75 + 10.75, 39.5 + 10.75, 50.25 + 10.75
Out[25]:
(28.75, 39.5, 50.25, 61.0)
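The arithmetic above can be cross-checked against the edges pandas actually used. This sketch assumes `retbins=True`, a `pd.cut` parameter not used in the original notebook, and compares the returned edges with evenly spaced edges from `np.linspace`.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Four equal-width bins need five evenly spaced edges between min and max.
manual_edges = np.linspace(min(ages), max(ages), 5)

# retbins=True also returns the bin edges that pd.cut computed.
cat4, used_edges = pd.cut(ages, 4, precision=2, retbins=True)

print(manual_edges)
print(used_edges)
```

The only difference is the lowest edge: pandas moves it slightly below the minimum so that 18 lands inside the first right-closed bin, which is why Out[15] shows 17.96 rather than 18.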
In [29]:
cat4.codes
cat4.categories
cat4.value_counts()
pd.value_counts(cat4)
Out[29]:
(17.96, 28.75] 6
(28.75, 39.5] 3
(39.5, 50.25] 2
(50.25, 61.0] 1
dtype: int64
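qcut: binning by sample quantiles
Depending on how the data is distributed, cut usually does not put the same number of values in each bin.
qcut bins by sample quantiles, so the resulting bins hold roughly equal numbers of values.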
In [16]:
cat5 = pd.qcut(ages, 4)
cat5
Out[16]:
[(17.999, 22.75], (17.999, 22.75], (22.75, 29.0], (22.75, 29.0], (17.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(17.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
In [17]:
cat5.value_counts()
Out[17]:
(17.999, 22.75] 3
(22.75, 29.0] 3
(29.0, 38.0] 3
(38.0, 61.0] 3
dtype: int64
Passing the quartiles explicitly gives the same result
In [18]:
cat6 = pd.qcut(ages, [0,0.25,0.5,0.75,1])
cat6
Out[18]:
[(17.999, 22.75], (17.999, 22.75], (22.75, 29.0], (22.75, 29.0], (17.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(17.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
In [19]:
cat6.value_counts()
Out[19]:
(17.999, 22.75] 3
(22.75, 29.0] 3
(29.0, 38.0] 3
(38.0, 61.0] 3
dtype: int64
In [34]:
cat6.codes
cat6.categories
Out[34]:
IntervalIndex([(17.999, 22.75], (22.75, 29.0], (29.0, 38.0], (38.0, 61.0]]
closed='right',
dtype='interval[float64]')
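To see where the qcut edges come from, the sketch below computes the quartiles of the same ages with `np.quantile` and compares them with the edges returned by `pd.qcut(..., retbins=True)`. The `retbins` argument and the variable names are additions for illustration.

```python
import numpy as np
import pandas as pd

ages = [18, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Quartile edges computed directly from the sample.
quartile_edges = np.quantile(ages, [0, 0.25, 0.5, 0.75, 1])

# retbins=True returns the edges qcut used alongside the Categorical.
cat5, used_edges = pd.qcut(ages, 4, retbins=True)

print(quartile_edges)
print(used_edges)
print(cat5.value_counts())
```

The 17.999 in the first interval label is only a display adjustment so that the minimum value is included.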
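Quantile and bucket analysis
pandas has tools such as cut and qcut that split data into pieces based on chosen bins or sample quantiles.
Combining these functions with groupby makes bucket or quantile analysis of a dataset straightforward.
Example: given age and gender columns, to analyze the gender breakdown within each age range, first discretize age, group by the discretized values, then aggregate the gender column.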
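The age and gender scenario is described but not coded in the notebook. A minimal sketch, assuming a small invented `survey` table (the rows, column names, and bin labels are hypothetical):

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
survey = pd.DataFrame({
    "age": [19, 24, 28, 33, 41, 58, 63, 70],
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M"],
})

# Step 1: discretize age.
survey["age_group"] = pd.cut(
    survey["age"],
    bins=[18, 25, 35, 60, 100],
    labels=["18-25", "26-35", "36-60", "60+"],
)

# Step 2: group by the discretized age and by gender, then count rows.
counts = survey.groupby(["age_group", "gender"], observed=True).size()
print(counts)
```

Passing `observed=True` keeps only the combinations that actually occur in the data.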
Take the simple random dataset below as an example, and use cut to place it into equal-width buckets:
In [28]:
frame = pd.DataFrame({'data1': np.random.randn(1000), 'data2': np.random.randn(1000)})
frame.head()
Out[28]:
| | data1 | data2 |
---|---|---|
0 | -0.092267 | 0.455749 |
1 | 2.240468 | 0.500134 |
2 | -0.841825 | 0.796062 |
3 | 1.338347 | 1.470217 |
4 | 0.704546 | 0.485647 |
In [29]:
q = pd.cut(frame['data1'], 4)
q.head()
Out[29]:
0 (-1.487, 0.188]
1 (1.864, 3.539]
2 (-1.487, 0.188]
3 (0.188, 1.864]
4 (0.188, 1.864]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.169, -1.487] < (-1.487, 0.188] < (0.188, 1.864] < (1.864, 3.539]]
In [30]:
# q is a Series (with category dtype), not a Categorical
type(q)
Out[30]:
pandas.core.series.Series
In [31]:
# the .cat accessor exposes the categorical properties
type(q.cat)
Out[31]:
pandas.core.arrays.categorical.CategoricalAccessor
In [80]:
q.cat.codes
q.cat.categories
Out[80]:
IntervalIndex([(-3.169, -1.487], (-1.487, 0.188], (0.188, 1.864], (1.864, 3.539]]
closed='right',
dtype='interval[float64]')
In [32]:
q.value_counts()
Out[32]:
(-1.487, 0.188] 484
(0.188, 1.864] 415
(-3.169, -1.487] 65
(1.864, 3.539] 36
Name: data1, dtype: int64
The categorical result returned by cut can be passed directly to groupby, so we can compute statistics on the data2 column as follows:
In [33]:
frame.describe()
Out[33]:
| | data1 | data2 |
---|---|---|
count | 1000.000000 | 1000.000000 |
mean | 0.067439 | -0.016663 |
std | 1.017269 | 1.015648 |
min | -3.162162 | -3.058359 |
25% | -0.573029 | -0.720729 |
50% | 0.069688 | -0.047600 |
75% | 0.738430 | 0.666327 |
max | 3.539001 | 3.629984 |
In [35]:
frame.groupby(q).size()
frame.groupby(q)['data2'].size()
Out[35]:
data1
(-3.169, -1.487] 65
(-1.487, 0.188] 484
(0.188, 1.864] 415
(1.864, 3.539] 36
Name: data2, dtype: int64
In [37]:
frame.groupby(q).sum()
frame.groupby(q)['data2'].sum()
Out[37]:
data1
(-3.169, -1.487] 20.243026
(-1.487, 0.188] -6.394616
(0.188, 1.864] -31.361414
(1.864, 3.539] 0.849714
Name: data2, dtype: float64
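Use a custom function to compute several statistics at once for a quick summary.
When the function builds and returns a dict or Series, the per-group results are combined, and unstack() arranges them into a DataFrame.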
In [40]:
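def aaa(x):
    # Alternative: return the statistics as a dict
    # return {
    #     'count': x.count(),
    #     'mean': x.mean(),
    #     'std': x.std(),
    #     'min': x.min(),
    #     'max': x.max(),
    # }
    # Return the statistics for one group as a labeled Series
    return pd.Series([x.count(), x.mean(), x.std(), x.min(), x.max()],
                     index=['count', 'mean', 'std', 'min', 'max'])

# frame.groupby(q).apply(aaa)
frame.groupby(q)['data2'].apply(aaa)              # Series indexed by (bin, statistic)
frame.groupby(q)['data2'].apply(aaa).unstack()    # one row per bin, one column per statistic
frame.groupby(q)['data2'].apply(aaa).unstack().T  # transposed: one column per bin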
Out[40]:
data1 | (-3.169, -1.487] | (-1.487, 0.188] | (0.188, 1.864] | (1.864, 3.539] |
---|---|---|---|---|
count | 65.000000 | 484.000000 | 415.000000 | 36.000000 |
mean | 0.311431 | -0.013212 | -0.075570 | 0.023603 |
std | 1.119511 | 1.004911 | 1.007641 | 0.981124 |
min | -2.234248 | -3.058359 | -2.848361 | -2.647758 |
max | 3.629984 | 3.175076 | 3.080270 | 1.592952 |
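For these particular statistics, a custom function is not strictly needed. As a sketch of an alternative (not the notebook's own method), `agg` accepts a list of built-in aggregation names and returns one column per statistic. The seed and the `rng` generator below are arbitrary choices, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
frame = pd.DataFrame({
    "data1": rng.standard_normal(1000),
    "data2": rng.standard_normal(1000),
})
q = pd.cut(frame["data1"], 4)

# One row per bin, one column per named statistic.
stats = frame.groupby(q, observed=False)["data2"].agg(
    ["count", "mean", "std", "min", "max"]
)
print(stats.T)  # transpose to match the layout of the table above
```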
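Computing indicator/dummy variables (for reference)
A transformation often used in statistical modeling and machine learning converts a categorical variable into a dummy or indicator matrix (also called dummy variables or one-hot encoded variables).
If a column of a DataFrame contains k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s.
pandas provides the get_dummies function for this.
What one-hot encoding is for: turning strings, which cannot be used in arithmetic, into numeric values (a table or matrix).
String: 'A useful method for statistics applications: combine get_dummies with a discretization function such as cut'
[statistics, applications, useful, method, combine, discretization, function]
[1,1,1,1,1,1,1]
statistics: [1, 0, 0, 0, 0, 0, 0]
method: [0, 0, 0, 1, 0, 0, 0]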
In [41]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
Out[41]:
| | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
In [42]:
df['key']
Out[42]:
0 b
1 b
2 a
3 c
4 a
5 b
Name: key, dtype: object
Manual one-hot encoding
Distinct values: [a,b,c]
[1,1,1]
a: [1,0,0]
b: [0,1,0]
c: [0,0,1]
The key column: [b,b,a,c,a,b]
b: [1,1,0,0,0,1]
a: [0,0,1,0,1,0]
c: [0,0,0,1,0,0]
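The manual encoding above can be reproduced with a NumPy comparison: compare every key against every distinct value and turn the boolean result into 0s and 1s. The names `keys`, `categories`, and `one_hot` are introduced for illustration.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "b", "a", "c", "a", "b"])
categories = np.unique(keys)  # sorted distinct values: a, b, c

# Row i, column j is 1 when keys[i] equals categories[j].
one_hot = (keys[:, None] == categories[None, :]).astype(int)
print(one_hot)

# get_dummies yields the same matrix; astype(int) covers pandas versions
# that return boolean columns by default.
dummies = pd.get_dummies(pd.Series(keys)).astype(int)
print(dummies)
```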
In [43]:
pd.get_dummies(df['key'])
Out[43]:
| | a | b | c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
Joining the two tables
In [61]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
Out[61]:
| | key_a | key_b | key_c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
In [62]:
df
Out[62]:
| | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
In [63]:
df.join(dummies) # join on the row index
Out[63]:
| | key | data1 | key_a | key_b | key_c |
---|---|---|---|---|---|
0 | b | 0 | 0 | 1 | 0 |
1 | b | 1 | 0 | 1 | 0 |
2 | a | 2 | 1 | 0 | 0 |
3 | c | 3 | 0 | 0 | 1 |
4 | a | 4 | 1 | 0 | 0 |
5 | b | 5 | 0 | 1 | 0 |
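Example: converting a set of values to dummy variables
A useful method for statistics applications: combine get_dummies with a discretization function such as cut.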
In [44]:
# generate random data
np.random.seed(12345)
values = np.random.rand(10)
values
Out[44]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
Binning
In [45]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
x = pd.cut(values, bins)
x
Out[45]:
[(0.8, 1.0], (0.2, 0.4], (0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1.0], (0.6, 0.8], (0.6, 0.8], (0.6, 0.8]]
Categories (5, interval[float64]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]
In [46]:
x.categories
Out[46]:
IntervalIndex([(0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]]
closed='right',
dtype='interval[float64]')
In [47]:
x.codes
Out[47]:
array([4, 1, 0, 1, 2, 2, 4, 3, 3, 3], dtype=int8)
One-hot encode the binning result (dummy variables)
In [68]:
pd.get_dummies(x)
Out[68]:
| | (0.0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1.0] |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 0 | 1 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 1 |
7 | 0 | 0 | 0 | 1 | 0 |
8 | 0 | 0 | 0 | 1 | 0 |
9 | 0 | 0 | 0 | 1 | 0 |
In [69]:
values
Out[69]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
Elements in the (0.8, 1.0] bin: positions 0 and 6 (0.92961609 and 0.96451452).
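That observation can be checked programmatically. The sketch below rebuilds the dummy matrix with the same seed and reads the positions of the 1s in the last column, which corresponds to the (0.8, 1.0] bin. The names `top_bin` and `positions` are introduced for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
dummies = pd.get_dummies(pd.cut(values, bins))

# The last column is the indicator for the (0.8, 1.0] bin.
top_bin = dummies.iloc[:, -1]
positions = np.flatnonzero(top_bin.to_numpy())
print(positions)
```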