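Method | Description |
---|---|
count | Number of non-NaN values |
describe | Compute summary statistics for a Series or for each DataFrame column |
min, max | Compute the minimum and maximum values |
argmin, argmax | Compute the integer index positions at which the minimum and maximum occur |
idxmin, idxmax | Compute the index labels at which the minimum and maximum occur |
quantile | Compute a sample quantile (from 0 to 1) |
sum | Sum of the values |
mean | Mean of the values |
median | Median of the values (the 50% quantile) |
mad | Mean absolute deviation from the mean (deprecated in pandas 1.5 and removed in 2.0) |
var | Sample variance of the values |
std | Sample standard deviation of the values |
skew | Sample skewness (third moment) of the values |
kurt | Sample kurtosis (fourth moment) of the values |
cumsum | Cumulative sum of the values |
cummin, cummax | Cumulative minimum and cumulative maximum of the values |
cumprod | Cumulative product of the values |
diff | Compute the first-order difference (useful for time series) |
pct_change | Compute the percent change |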
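As a quick illustration of a few of the methods in the table, here is a minimal sketch. The Series `s` and `prices` are made-up sample data, not part of the original examples.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([10.0, 12.0, np.nan, 15.0], index=list("abcd"))

n_valid = s.count()       # number of non-NaN values
median = s.median()       # NaN is skipped by default
q25 = s.quantile(0.25)    # quantile takes a fraction between 0 and 1
first_diff = s.diff()     # s[i] - s[i-1]; NaN wherever either operand is NaN

# pct_change on a Series without missing values.
prices = pd.Series([100.0, 110.0, 99.0])
change = prices.pct_change()   # (current - previous) / previous

print(n_valid, median, q25)
print(first_diff)
print(change)
```

Note that `diff` yields NaN both at the missing position and at the position right after it, because each difference needs two valid neighbours.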
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,np.nan],
[2,6,np.nan],
[3,8,9],
[np.nan,2,1]],index=list('abcd'),columns=list('MNK'))
df
M N K
a 1.0 2 NaN
b 2.0 6 NaN
c 3.0 8 9.0
d NaN 2 1.0
By default, axis=0 computes down each column and skipna=True excludes NaN values from the calculation.
df.sum()
M 6.0
N 18.0
K 10.0
dtype: float64
With the default axis=0 (per column) and skipna=False, NaN values are not excluded, so any column containing NaN returns NaN.
df.sum(skipna=False)
M NaN
N 18.0
K NaN
dtype: float64
Computing across each row (axis=1 or axis='columns') with the default skipna=True excludes NaN values from the calculation.
df.sum(axis=1)
a 3.0
b 8.0
c 20.0
d 3.0
dtype: float64
Computing across each row (axis=1 or axis='columns') with skipna=False does not exclude NaN values, so any row containing NaN returns NaN.
df.sum(axis=1,skipna=False)
a NaN
b NaN
c 20.0
d NaN
dtype: float64
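The axis and skipna arguments are not specific to sum; other reductions such as mean accept them as well. The sketch below rebuilds the same frame so that it runs on its own; the variable names `col_mean` and `row_mean_strict` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as `df` above, rebuilt so the snippet is self-contained.
df = pd.DataFrame([[1, 2, np.nan],
                   [2, 6, np.nan],
                   [3, 8, 9],
                   [np.nan, 2, 1]], index=list('abcd'), columns=list('MNK'))

col_mean = df.mean()                             # axis=0, skipna=True
row_mean_strict = df.mean(axis=1, skipna=False)  # NaN for any row that holds a NaN

print(col_mean)
print(row_mean_strict)
```

Only row c has no missing value, so it is the only row with a numeric mean when skipna=False.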
idxmax()
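By default (axis=0), returns for each column the row index label at which the maximum occurs (if the maximum appears more than once, the first occurrence is returned); with axis=1, returns for each row the column label at which the maximum occurs. NaN values are skipped.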
idxmin()
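By default (axis=0), returns for each column the row index label at which the minimum occurs (if the minimum appears more than once, the first occurrence is returned); with axis=1, returns for each row the column label at which the minimum occurs.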
a = pd.DataFrame([ [4,2,4],
[2,6,np.nan],
[4,np.nan,9],
[np.nan,2,1]],index=list('abcd'),columns=list('MNK'))
a
M N K
a 4.0 2.0 4.0
b 2.0 6.0 NaN
c 4.0 NaN 9.0
d NaN 2.0 1.0
Index of the maximum value
print(a.idxmax())
print("="*20)
print(a.idxmax(axis=0))
print("="*20)
print(a.idxmax(axis=1))
M a
N b
K c
dtype: object
====================
M a
N b
K c
dtype: object
====================
a M
b N
c K
d N
dtype: object
Index of the minimum value
print(a.idxmin())
print("="*20)
print(a.idxmin(axis=0))
print("="*20)
print(a.idxmin(axis=1))
M b
N a
K d
dtype: object
====================
M b
N a
K d
dtype: object
====================
a N
b M
c M
d K
dtype: object
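The summary table distinguishes argmin/argmax (integer positions) from idxmin/idxmax (index labels). A minimal sketch of the difference on a single Series follows; `col` reuses the values of column M from the frame above.

```python
import numpy as np
import pandas as pd

# Values of column M from frame `a`, rebuilt as a standalone Series.
col = pd.Series([4, 2, 4, np.nan], index=list("abcd"))

max_pos = col.argmax()    # integer position of the first maximum
max_label = col.idxmax()  # index label of the first maximum
min_pos = col.argmin()    # integer position of the minimum
min_label = col.idxmin()  # index label of the minimum

print(max_pos, max_label)
print(min_pos, min_label)
```

The maximum 4 occurs at both 'a' and 'c'; both methods report the first occurrence.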
Computing the cumulative sum of the values
a2 = pd.DataFrame([[1,2,np.nan],
[2,6,4],
[3,8,9],
[2,1,np.nan]],index=list('abcd'),columns=list('MNK'))
a2
Only columns that contain NaN (a float value) have their other values converted to float; a column without NaN keeps its original dtype.
M N K
a 1 2 NaN
b 2 6 4.0
c 3 8 9.0
d 2 1 NaN
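With the default axis=0, cumsum computes the cumulative sum down each column: the first result is the first value, the second is the sum of the first two values, ..., and the n-th is the sum of the first n values. NaN values are skipped: the result at that position stays NaN, and the running total continues as if the NaN were 0. With axis=1, the cumulative sum is computed across each row.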
print(a2.cumsum()) # cumulative sum down each column (default)
print("="*20)
print(a2.cumsum(axis=0)) # cumulative sum down each column
print("="*20)
print(a2.cumsum(axis=1)) # cumulative sum across each row
M N K
a 1 2 NaN
b 3 8 4.0
c 6 16 13.0
d 8 17 NaN
====================
M N K
a 1 2 NaN
b 3 8 4.0
c 6 16 13.0
d 8 17 NaN
====================
M N K
a 1.0 3.0 NaN
b 2.0 8.0 12.0
c 3.0 11.0 20.0
d 2.0 3.0 NaN
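The other cumulative methods from the summary table treat NaN in the same way. The sketch below applies cummax and cumprod to the same data, and also shows cumsum with skipna=False for contrast; the result names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as `a2` above, rebuilt so the snippet is self-contained.
a2 = pd.DataFrame([[1, 2, np.nan],
                   [2, 6, 4],
                   [3, 8, 9],
                   [2, 1, np.nan]], index=list('abcd'), columns=list('MNK'))

running_max = a2.cummax()              # largest value seen so far in each column
running_prod = a2.cumprod()            # product of all values so far in each column
strict_sum = a2.cumsum(skipna=False)   # once a NaN is met, the rest of the column is NaN

print(running_max)
print(running_prod)
print(strict_sum)
```

Column K starts with NaN, so with skipna=False every cumulative sum in that column is NaN.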
describe()
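Produces descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.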
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Computes summary statistics for a Series or for each column of a DataFrame.
b = pd.DataFrame([[1,2,np.nan],
[2,6,4],
[3,8,9],
[2,1,np.nan]],index=list('abcd'),columns=list('MNK'))
b
M N K
a 1 2 NaN
b 2 6 4.0
c 3 8 9.0
d 2 1 NaN
count: number of elements (NaN values are not counted)
mean: mean
std: standard deviation
min: minimum
25%: first quartile
50%: second quartile (median)
75%: third quartile
max: maximum
print(b.describe())
M N K
count 4.000000 4.000000 2.000000
mean 2.000000 4.250000 6.500000
std 0.816497 3.304038 3.535534
min 1.000000 1.000000 4.000000
25% 1.750000 1.750000 5.250000
50% 2.000000 4.000000 6.500000
75% 2.250000 6.500000 7.750000
max 3.000000 8.000000 9.000000
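percentiles: the percentiles to include in the output. The default is [.25, .5, .75], i.e. 25%, 50%, and 75%. The list may contain any number of values, each between 0 and 1, for example [.12, .345, .23, .67, .985]. Percentile labels are displayed with zero or one decimal place, and in the output below the 50% row appears even though it was not requested.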
b1 = pd.DataFrame([[1,2,np.nan],
[2,6,np.nan],
[3,8,np.nan],
[2,1,np.nan]],index=list('abcd'),columns=list('MNK'))
print(b1)
print("="*40)
print(b1.describe())
print("="*40)
print(b1.describe(percentiles=[.12, .345, .23, .67, .985]))
M N K
a 1 2 NaN
b 2 6 NaN
c 3 8 NaN
d 2 1 NaN
========================================
M N K
count 4.000000 4.000000 0.0
mean 2.000000 4.250000 NaN
std 0.816497 3.304038 NaN
min 1.000000 1.000000 NaN
25% 1.750000 1.750000 NaN
50% 2.000000 4.000000 NaN
75% 2.250000 6.500000 NaN
max 3.000000 8.000000 NaN
========================================
M N K
count 4.000000 4.000000 0.0
mean 2.000000 4.250000 NaN
std 0.816497 3.304038 NaN
min 1.000000 1.000000 NaN
12% 1.360000 1.360000 NaN
23% 1.690000 1.690000 NaN
34.5% 2.000000 2.140000 NaN
50% 2.000000 4.000000 NaN
67% 2.010000 6.020000 NaN
98.5% 2.955000 7.910000 NaN
max 3.000000 8.000000 NaN
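The percentile rows can be cross-checked against quantile, which takes the same fractions between 0 and 1. This is a minimal sketch using the values of column N; the names `summary` and `q` are chosen here for illustration.

```python
import pandas as pd

# Values of column N from the frame above.
col = pd.Series([2, 6, 8, 1], name="N")

summary = col.describe(percentiles=[.345, .67])  # rows labelled '34.5%' and '67%'
q = col.quantile([.345, .67])                    # the same quantiles, indexed by fraction

print(summary)
print(q)
```

Both calls use linear interpolation between the sorted values by default, which is why the 34.5% row is 2.14 rather than one of the original data points.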
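include:
'all': the result includes all columns
None: the result includes all numeric columns
dtype: the result is limited to the given data types, for example number or object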
b2 = pd.DataFrame([[1,'a',np.nan],
[2,'b',4],
[3,'c',9],
[2,'d',np.nan]],index=list('abcd'),columns=list('MNK'))
b2
M N K
a 1 a NaN
b 2 b 4.0
c 3 c 9.0
d 2 d NaN
print(b2.describe()) # numeric columns only (default)
print("="*40)
print(b2.describe(include='all')) # all columns, whatever their dtype
print("="*40)
print(b2.describe(include=None)) # all numeric columns
print("="*40)
print(b2.describe(include='number')) # only columns of numeric dtype
print("="*40)
print(b2.describe(include='O')) # only columns of object dtype
M K
count 4.000000 2.000000
mean 2.000000 6.500000
std 0.816497 3.535534
min 1.000000 4.000000
25% 1.750000 5.250000
50% 2.000000 6.500000
75% 2.250000 7.750000
max 3.000000 9.000000
========================================
M N K
count 4.000000 4 2.000000
unique NaN 4 NaN
top NaN c NaN
freq NaN 1 NaN
mean 2.000000 NaN 6.500000
std 0.816497 NaN 3.535534
min 1.000000 NaN 4.000000
25% 1.750000 NaN 5.250000
50% 2.000000 NaN 6.500000
75% 2.250000 NaN 7.750000
max 3.000000 NaN 9.000000
========================================
M K
count 4.000000 2.000000
mean 2.000000 6.500000
std 0.816497 3.535534
min 1.000000 4.000000
25% 1.750000 5.250000
50% 2.000000 6.500000
75% 2.250000 7.750000
max 3.000000 9.000000
========================================
M K
count 4.000000 2.000000
mean 2.000000 6.500000
std 0.816497 3.535534
min 1.000000 4.000000
25% 1.750000 5.250000
50% 2.000000 6.500000
75% 2.250000 7.750000
max 3.000000 9.000000
========================================
N
count 4
unique 4
top c
freq 1
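include also accepts a list of dtypes. As a sketch of what the filter does, the example below compares it with selecting the numeric columns first via select_dtypes and then calling describe; the result names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as `b2` above, rebuilt so the snippet is self-contained.
b2 = pd.DataFrame([[1, 'a', np.nan],
                   [2, 'b', 4],
                   [3, 'c', 9],
                   [2, 'd', np.nan]], index=list('abcd'), columns=list('MNK'))

numeric_summary = b2.describe(include=[np.number])            # filter inside describe
same_summary = b2.select_dtypes(include="number").describe()  # filter first, then describe

print(numeric_summary)
print(numeric_summary.equals(same_summary))
```

Either form leaves out the string column N and summarizes only M and K.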
exclude: the data types to omit from the result; the default is None
b3 = pd.DataFrame([[1,'a',np.nan],
[2,'b','t'],
[4,'c',9],
[2,'d',np.nan]],index=list('abcd'),columns=list('MNK'))
b3
M N K
a 1 a NaN
b 2 b t
c 4 c 9
d 2 d NaN
print(b3.describe()) # numeric columns only (default)
print("="*40)
print(b3.describe(exclude=None))
print("="*40)
print(b3.describe(exclude="number")) # leave out columns of numeric dtype
M
count 4.000000
mean 2.250000
std 1.258306
min 1.000000
25% 1.750000
50% 2.000000
75% 2.500000
max 4.000000
========================================
M
count 4.000000
mean 2.250000
std 1.258306
min 1.000000
25% 1.750000
50% 2.000000
75% 2.500000
max 4.000000
========================================
N K
count 4 2
unique 4 2
top c 9
freq 1 1