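Continuous values often need to be discretized, or separated into "bins", for analysis. Suppose we have the ages of a group of people and want to sort them into four groups: 18-25, 26-35, 36-60, and over 60. We can do this with cut in pandas: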
import pandas as pd
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
print(cats)
# [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
# Length: 12
# Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cut returns a Categorical object.
print(cats.codes)
# Index (code) of the bin that each element falls into
# [0 0 0 1 0 0 2 1 3 2 2 1]
print(cats.categories)
# The bins themselves, stored as an IntervalIndex
# IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
# closed='right',
# dtype='interval[int64]')
print(pd.Series(cats).value_counts())
# (18, 25] 5
# (35, 60] 3
# (25, 35] 3
# (60, 100] 1
# dtype: int64
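The interval notation above shows that each bin is open on the left and closed on the right, so an age of exactly 25 falls into (18, 25]. As a small sketch using the same ages and bins, the right parameter of cut switches which side is closed:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals are closed on the right, so 25 lands in (18, 25].
right_closed = pd.cut(ages, bins)

# right=False closes the left side instead, so 25 lands in [25, 35).
left_closed = pd.cut(ages, bins, right=False)

# The third element of ages is 25; compare the bin it is assigned to.
print(right_closed[2])
print(left_closed[2])
```

Which side to close depends on how the group boundaries are meant to be read, so it is worth checking a boundary value like this before relying on the counts.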
We can give the bins custom names by passing the labels option:
cats = pd.cut(ages,bins,labels=['youth','youngAdult','MiddleAge','Senior'])
print(cats)
# [youth, youth, youth, youngAdult, youth, ..., youngAdult, Senior, MiddleAge, MiddleAge, youngAdult]
# Length: 12
# Categories (4, object): [youth < youngAdult < MiddleAge < Senior]
print(cats.value_counts())
# youth 5
# youngAdult 3
# MiddleAge 3
# Senior 1
# dtype: int64
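If only the bin numbers are needed rather than named groups, passing labels=False is one option. The sketch below reuses the ages and bins from above and should give the same integers as cats.codes:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# labels=False returns the integer position of each value's bin
# instead of a Categorical with interval or string labels.
bin_numbers = pd.cut(ages, bins, labels=False)
print(bin_numbers)
```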
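When binning, we can also pass the number of bins instead of explicit bin edges. pandas then computes equal-width bins from the minimum and maximum values in the data: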
import pandas as pd
import numpy as np
data = np.random.rand(20)
print(pd.cut(data, 4, precision=2))
# precision=2 limits the bin edges shown in the labels to two digits of precision
# [(0.29, 0.52], (0.057, 0.29], (0.057, 0.29], (0.52, 0.74], (0.52, 0.74], ..., (0.29, 0.52], (0.057, 0.29], (0.29, 0.52], (0.74, 0.97], (0.74, 0.97]]
# Length: 20
# Categories (4, interval[float64]): [(0.057, 0.29] < (0.29, 0.52] < (0.52, 0.74] < (0.74, 0.97]]
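To see the edges pandas actually computed, cut accepts retbins=True and then returns the edges alongside the result. This is a minimal sketch; the seeded generator rng is only there to make the run repeatable and is not part of the original example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
data = rng.random(20)

# retbins=True returns a tuple: (binned result, array of bin edges).
cats, edges = pd.cut(data, 4, retbins=True)

print(edges)           # five edges delimit four bins
print(np.diff(edges))  # bin widths
```

The widths come out nearly identical. The lowest edge is pushed slightly below the minimum of the data so that the smallest value still falls inside the first (left-open) bin.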
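qcut in pandas bins the data by sample quantiles, so it produces bins that contain roughly equal numbers of points (rather than bins of equal width):
data = np.random.randn(1000)
cats = pd.qcut(data, 4)  # split into quartiles: four bins with equal counts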
print(cats)
# [(0.0416, 0.712], (-2.864, -0.607], (0.712, 3.555], (-2.864, -0.607], (-2.864, -0.607], ..., (-2.864, -0.607], (0.712, 3.555], (-0.607, 0.0416], (0.0416, 0.712], (0.0416, 0.712]]
# Length: 1000
# Categories (4, interval[float64]): [(-2.864, -0.607] < (-0.607, 0.0416] < (0.0416, 0.712] <
# (0.712, 3.555]]
print(cats.value_counts())
# (-2.864, -0.607] 250
# (-0.607, 0.0416] 250
# (0.0416, 0.712] 250
# (0.712, 3.555] 250
# dtype: int64
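The difference between cut and qcut is easiest to see on skewed data. The following sketch assumes an exponentially distributed sample (chosen here purely for illustration) and compares the per-bin counts from both functions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)     # fixed seed (an assumption, for repeatability)
data = rng.exponential(size=1000)  # right-skewed sample

# Equal-width bins: most points pile up in the first bin.
width_counts = pd.Series(pd.cut(data, 4)).value_counts(sort=False)

# Quantile bins: edges adapt so each bin holds the same number of points.
quantile_counts = pd.Series(pd.qcut(data, 4)).value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```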
We can also pass the quantiles explicitly. Because they are quantiles, they must be numbers between 0 and 1:
data = np.random.randn(1000)
cats = pd.qcut(data,[0,0.1,0.9,1.])
print(cats)
# [(-1.335, 1.269], (-1.335, 1.269], (-1.335, 1.269], (-1.335, 1.269], (1.269, 3.305], ..., (-1.335, 1.269], (-1.335, 1.269], (-1.335, 1.269], (1.269, 3.305], (-1.335, 1.269]]
# Length: 1000
# Categories (3, interval[float64]): [(-3.5829999999999997, -1.335] < (-1.335, 1.269] < (1.269, 3.305]]
print(cats.value_counts())
# (-3.5829999999999997, -1.335] 100
# (-1.335, 1.269] 800
# (1.269, 3.305] 100
# dtype: int64
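Custom quantiles can be combined with the labels option shown earlier for cut. In this sketch the label strings are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
data = rng.standard_normal(1000)

# Three quantile bins: lowest 10%, middle 80%, highest 10%.
tails = pd.qcut(
    data,
    [0, 0.1, 0.9, 1.0],
    labels=["bottom 10%", "middle 80%", "top 10%"],  # hypothetical names
)
print(tails.value_counts())
```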
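Now suppose we have a DataFrame with 1000 rows and 4 columns. To find the values in one column whose absolute value is greater than 3, we can do the following: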
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(1000,4))
col = data[2]
print(col[np.abs(col) > 3])
# 350 -4.575044
# 770 -3.001040
# Name: 2, dtype: float64
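We can also select every row that contains at least one element whose absolute value is greater than 3:
data = pd.DataFrame(np.random.randn(1000,4))
print(data[(np.abs(data) > 3).any(axis=1)])
# any(axis=1) checks, for each row, whether at least one value is True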
# 0 1 2 3
# 16 0.007730 1.592332 -0.516403 -3.182190
# 78 -0.574520 3.326386 1.297592 0.918248
# 148 -3.253763 -0.879153 -0.341380 -0.421963
# 195 -0.595610 0.122527 -2.147196 3.699034
# 344 -4.702560 -0.519187 -0.696294 0.027065
# 425 -0.745760 -0.530089 -1.714175 -3.268162
# 632 -1.320824 -0.182225 -0.007968 3.311693
# 650 -3.168038 1.513647 -0.604709 -0.353666
# 740 -0.225773 -3.128919 -1.628717 0.951415
# 788 -0.212660 2.015071 0.231082 4.244741
# 846 0.907867 1.527613 -3.339816 0.001608
# 852 -0.585060 0.137457 3.379468 -0.924044
# 942 3.438187 0.061082 -0.202052 -1.716093
# 943 -0.106008 -0.256375 -3.786214 -0.589181
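Because the random data above changes on every run, a tiny hand-made frame makes the behavior of any easier to follow. The column names a and b are invented for this sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [0.5, -3.2, 1.0], "b": [4.1, 0.0, -0.3]})
mask = np.abs(frame) > 3  # boolean DataFrame, True where |value| > 3

row_hits = mask.any(axis=1)  # per row: is any column beyond the threshold?
col_hits = mask.any(axis=0)  # per column: is any row beyond the threshold?
col_counts = mask.sum()      # per column: how many values are beyond it

print(row_hits)
print(col_hits)
print(col_counts)
```

Here the first two rows each contain one large value and the third contains none, so only the first two rows would be selected by frame[row_hits].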
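We can cap all values to the range from -3 to 3 (values above 3 become 3, values below -3 become -3) as follows:
data = pd.DataFrame(np.random.randn(1000,4))
data[np.abs(data) > 3] = np.sign(data) * 3
# np.sign returns 1, -1, or 0 according to the sign of each value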
print(data.describe())
# 0 1 2 3
# count 1000.000000 1000.000000 1000.000000 1000.000000
# mean -0.018379 -0.033751 0.003946 -0.013483
# std 1.010813 1.008051 0.979408 0.957267
# min -3.000000 -3.000000 -3.000000 -3.000000
# 25% -0.689586 -0.699035 -0.672904 -0.661879
# 50% -0.017652 -0.009849 0.015468 -0.005632
# 75% 0.633984 0.621483 0.685101 0.626534
# max 3.000000 3.000000 3.000000 2.711717
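The same capping can also be written with the clip method of DataFrame. The sketch below applies both approaches to one seeded sample (the seed is an assumption for repeatability) and checks that they agree:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the text: overwrite outliers with +/-3 via np.sign.
capped_sign = data.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip bounds every value to the interval [-3, 3].
capped_clip = data.clip(lower=-3, upper=3)

print(np.allclose(capped_sign, capped_clip))
```

clip returns a new DataFrame and leaves data unchanged, whereas the assignment form modifies the frame in place.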
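numpy.random.permutation(n) returns a random permutation of the integers 0 through n-1. Combined with the take method of DataFrame or Series, it can be used to permute (randomly reorder) rows.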
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(20).reshape((5,4)))
# 0 1 2 3
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 3 12 13 14 15
# 4 16 17 18 19
sampler = np.random.permutation(5)
sampler
# [1 2 4 3 0]
data.take(sampler)
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 4 16 17 18 19
# 3 12 13 14 15
# 0 0 1 2 3
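take selects by position, so it behaves like iloc with the same array of positions, and it can reorder columns as well when given axis=1. A brief sketch, using a seeded generator rng that is not part of the original example:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(20).reshape((5, 4)))
rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)

order = rng.permutation(5)
rows_shuffled = data.take(order)  # reorder rows by position
same_rows = data.iloc[order]      # iloc with the same positions

# axis=1 applies the positions to columns instead of rows.
cols_shuffled = data.take(rng.permutation(4), axis=1)

print(rows_shuffled)
print(cols_shuffled)
```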
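We can draw a random sample with the sample method of DataFrame and Series. The parameter n is the number of rows to draw, and replace controls whether a row can be drawn again after it has already been selected (the default is False):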
data.sample(n = 3,replace = True)
# 0 1 2 3
# 3 12 13 14 15
# 1 4 5 6 7
# 1 4 5 6 7
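The effect of replace shows up most clearly when asking for more rows than the frame contains. In this sketch, random_state is used only to make the draws repeatable:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(20).reshape((5, 4)))

# Without replacement every sampled row is distinct.
without_repl = data.sample(n=3, random_state=0)

# With replacement we may draw more rows than exist, so repeats must occur.
with_repl = data.sample(n=10, replace=True, random_state=0)

# Asking for 10 distinct rows out of 5 is impossible and raises ValueError.
try:
    data.sample(n=10)
    oversample_failed = False
except ValueError as err:
    oversample_failed = True
    print(err)
```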
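Converting a categorical variable into a "dummy" or "indicator" matrix is a common transformation in statistical modeling and machine learning. If a column of a DataFrame has k distinct values, we can derive a DataFrame or matrix with k columns containing only 0s and 1s. The get_dummies function in pandas does this: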
import pandas as pd
df = pd.DataFrame({'key':list('bbacab'),
'data1':range(6)})
# key data1
# 0 b 0
# 1 b 1
# 2 a 2
# 3 c 3
# 4 a 4
# 5 b 5
pd.get_dummies(df['key'])
# a b c
# 0 0 1 0
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
# 4 1 0 0
# 5 0 1 0
We can add a prefix to each column name and join the result with the other data:
dummies = pd.get_dummies(df['key'],prefix = "key")
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
# data1 key_a key_b key_c
# 0 0 0 1 0
# 1 1 0 1 0
# 2 2 1 0 0
# 3 3 0 0 1
# 4 4 1 0 0
# 5 5 0 1 0
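Note that in pandas 2.0 and later, get_dummies returns boolean columns (True/False) by default. As a sketch, passing dtype=int should give the 0/1 columns shown above regardless of version:

```python
import pandas as pd

df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})

# dtype=int forces integer 0/1 indicators instead of booleans.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)
df_with_dummy = df[["data1"]].join(dummies)
print(df_with_dummy)
```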
We can combine get_dummies with cut:
import numpy as np
values = np.random.rand(10)
# [0.40001288 0.34297895 0.28679133 0.74631303 0.42940742 0.48480155
# 0.0805901 0.38736358 0.76718969 0.38822771]
bins = [0,0.2,0.4,0.6,0.8,1]
print(pd.get_dummies(pd.cut(values, bins)))
# (0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
# 0 0 0 1 0 0
# 1 0 1 0 0 0
# 2 0 1 0 0 0
# 3 0 0 0 1 0
# 4 0 0 1 0 0
# 5 0 0 1 0 0
# 6 1 0 0 0 0
# 7 0 1 0 0 0
# 8 0 0 0 1 0
# 9 0 1 0 0 0
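For a repeatable version of this last example, the sketch below hard-codes the ten values printed above instead of drawing new random numbers, and adds dtype=int so the indicators are 0/1 on recent pandas versions as well:

```python
import numpy as np
import pandas as pd

# The ten values listed in the output above, fixed for repeatability.
values = np.array([0.40001288, 0.34297895, 0.28679133, 0.74631303, 0.42940742,
                   0.48480155, 0.0805901, 0.38736358, 0.76718969, 0.38822771])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# cut assigns each value to an interval; get_dummies makes one column per interval.
indicators = pd.get_dummies(pd.cut(values, bins), dtype=int)

print(indicators)
print(indicators.sum())  # number of values that fell into each bin
```

The (0.8, 1.0] column is kept even though no value falls into it, because get_dummies creates a column for every category of the Categorical, not only for the ones that occur.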