The NumPy library provides many statistical functions.
They are documented in section 3.27, Statistics, of the NumPy reference (numpy-ref-1.14.pdf).
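1 Order statistics
1.1 Maximum amax(), nanmax() and minimum amin(), nanmin()
numpy.amin(a, axis=None, out=None, keepdims=<class 'numpy._globals._NoValue'>)  # amax, nanmax and nanmin take the same parameters as amin
Parameters
a - input data, array_like
axis - None (default), int, or tuple of ints; the axis or axes along which to operate. With None, the flattened array is used.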
amin() / amax()
Return the minimum / maximum of an array, or the minimum / maximum along an axis.
nanmin() / nanmax()
Return the minimum / maximum of an array, or the minimum / maximum along an axis, ignoring any NaNs.
When all-NaN slices are encountered, a RuntimeWarning is raised and NaN is returned for that slice.
import numpy as np

a = np.arange(4).reshape((2, 2))
print(a)
# [[0 1]
#  [2 3]]
b = np.amin(a)  # minimum of the flattened array
print(b)  # 0
c = np.amin(a, axis=0)  # minima along the first axis
print(c)  # [0 1]
d = np.amin(a, axis=1)  # minima along the second axis
print(d)  # [0 2]
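The signature also accepts a tuple for axis and a keepdims flag, neither of which the example above exercises. A minimal sketch, assuming a 3-D input (the array `cube` is made up for illustration):

```python
import numpy as np

# Hypothetical 3-D input: 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape((2, 3, 4))

# A tuple of axes reduces over several axes at once.
low = np.amin(cube, axis=(0, 1))
high = np.amax(cube, axis=(0, 1))
print(low)   # [0 1 2 3]
print(high)  # [20 21 22 23]

# keepdims=True keeps the reduced axes with length 1, so the
# result still broadcasts against the original array.
kept = np.amin(cube, axis=(0, 1), keepdims=True)
print(kept.shape)           # (1, 1, 4)
print((cube - kept).shape)  # (2, 3, 4)
```

Keeping the reduced axes is mostly useful when the minimum is subtracted from, or compared with, the original array.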
The parameters and ordinary usage are the same as for amin(). The following shows the special case in which the array contains NaN.
import numpy as np

a = np.arange(5, dtype=float)
print(a)  # [0. 1. 2. 3. 4.]
print('-' * 20)
a[2] = np.nan
b = np.amin(a)
print(b)  # nan
# RuntimeWarning: invalid value encountered in reduce
#   return umr_minimum(a, axis, None, out, keepdims)
print('-' * 20)
c = np.nanmin(a)
print(c)  # 0.0
print('-' * 20)
d = np.amax(a)
print(d)  # nan
# RuntimeWarning: invalid value encountered in reduce
#   return umr_maximum(a, axis, None, out, keepdims)
print('-' * 20)
f = np.nanmax(a)
print(f)  # 4.0
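In practice, the RuntimeWarning messages do not appear at a fixed position in the output. Warnings are written to stderr while print() writes to stdout, so the console or IDE may interleave the two streams in any order rather than placing each warning right after the statement that caused it, as shown below.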
C:\Users\zyong\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning: invalid value encountered in reduce
[0. 1. 2. 3. 4.]
--------------------
  return umr_minimum(a, axis, None, out, keepdims)
nan
C:\Users\zyong\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning: invalid value encountered in reduce
--------------------
0.0
  return umr_maximum(a, axis, None, out, keepdims)
--------------------
nan
--------------------
4.0
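The all-NaN case mentioned earlier (a RuntimeWarning plus NaN for that slice) can be checked without depending on the console's output order. A small sketch, assuming the standard-library warnings module is used to record the warning; the array `data` is illustrative:

```python
import warnings

import numpy as np

# Illustrative input: the first row contains only NaN.
data = np.array([[np.nan, np.nan], [1.0, 2.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning
    row_min = np.nanmin(data, axis=1)

print(row_min)  # first entry is nan, second is 1.0
for w in caught:
    # Show the category and text of each recorded warning.
    print(w.category.__name__, w.message)
```

Recording the warnings in a list keeps them separate from the printed results, so their position no longer depends on how stdout and stderr are flushed.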
1.2 Range of values along an axis: ptp()
numpy.ptp(a, axis=None, out=None)
Returns the difference between the maximum and the minimum (maximum - minimum) along the given axis, as an array.
The name ptp is short for "peak to peak".
import numpy as np

a = np.arange(9).reshape((3, 3))
print(a)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
b = np.ptp(a, axis=0)
print(b)  # [6 6 6], array of the differences
c = np.ptp(a, axis=1)
print(c)  # [2 2 2]
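Because every row and column of the arange example has the same range, a less regular input makes the "maximum minus minimum" relationship easier to see. A minimal sketch with a made-up array `m`:

```python
import numpy as np

# Illustrative 3x3 array with different ranges per row and column.
m = np.array([[4, 9, 2],
              [3, 5, 7],
              [8, 1, 6]])

col_range = np.ptp(m, axis=0)
row_range = np.ptp(m, axis=1)
print(col_range)  # [5 8 5]
print(row_range)  # [7 4 7]

# The same values computed explicitly as maximum - minimum.
explicit = np.amax(m, axis=0) - np.amin(m, axis=0)
print(explicit)   # [5 8 5]

# Without an axis, the range of the flattened array is returned.
total_range = np.ptp(m)
print(total_range)  # 8
```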
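1.3 Percentiles: percentile, nanpercentile
percentile(a, q[, axis, out, ...])
Computes the q-th percentile of the data along the specified axis.
a - array_like: an array or any object that can be converted to an array.
q - float in the range [0, 100], or a sequence of such floats.
percentile can compute any percentile of a multidimensional array.
Notes:
(1) The data are sorted in ascending order before the computation.
(2) The value located at the p% position is called the p-th percentile.
To make the meaning concrete, take the median (the 50th percentile) as an example.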
import numpy as np

a = np.arange(1, 10)
print(a)  # [1 2 3 4 5 6 7 8 9]
b = np.percentile(a, 0)
print(b)  # 1.0
c = np.percentile(a, 50)
print(c)  # 5.0, the median
d = np.percentile(a, 100)
print(d)  # 9.0
percentile applied to a multidimensional array

import numpy as np

a = np.arange(9).reshape((3, 3))
print(a)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
b = np.percentile(a, 50)
print(b)  # 4.0
c = np.percentile(a, 50, axis=0)
print(c)  # [3. 4. 5.]
d = np.percentile(a, 60, axis=0)
print(d)  # [3.6 4.6 5.6]
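The value 3.6 in the 60th-percentile result is not an element of the array; it comes from interpolating between neighbouring sorted values. A rough sketch of that computation, assuming NumPy's default linear interpolation. The helper `percentile_linear` is hypothetical, not part of NumPy, and only handles 1-D input:

```python
import numpy as np


def percentile_linear(values, q):
    """Hypothetical 1-D helper: q-th percentile by linear interpolation."""
    data = np.sort(np.asarray(values, dtype=float))
    # Fractional index of the q-th percentile in the sorted data.
    pos = (data.size - 1) * q / 100.0
    lower = int(np.floor(pos))
    upper = min(lower + 1, data.size - 1)
    frac = pos - lower
    # Move from the lower neighbour towards the upper one.
    return data[lower] + frac * (data[upper] - data[lower])


column = [0, 3, 6]  # first column of the 3x3 example
mine = percentile_linear(column, 60)
reference = np.percentile(column, 60)
print(mine, reference)  # both approximately 3.6
```

For the column [0, 3, 6] the fractional index is 2 * 0.6 = 1.2, so the result lies 20% of the way from 3 to 6.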
Similarly, nanpercentile computes percentiles while ignoring any NaN values it encounters.
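A short comparison of the two functions on data containing NaN; the array `x` is illustrative:

```python
import numpy as np

# Illustrative data with one missing value.
x = np.array([1.0, np.nan, 3.0, 5.0])

plain = np.percentile(x, 50)
ignoring = np.nanpercentile(x, 50)
print(plain)     # nan: the NaN propagates into the result
print(ignoring)  # 3.0: computed from the remaining values 1, 3, 5
```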
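2 Averages and variances
This part corresponds to section 3.27.2, Averages and variances, of the NumPy Reference (numpy-ref-1.14.5).
2.1 Median: median, nanmedian
np.median / np.nanmedian(a[, axis, out, overwrite_input, keepdims])
median - computes the median of array a along the axis direction
nanmedian - computes the median of array a along the axis direction, ignoring NaNs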
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 6], [6, 7, 12]])
print(a)
b = np.median(a)
print(b)  # 4.0
c = np.median(a, axis=0)
print(c)  # [3. 4. 6.]
d = np.median(a, axis=1)
print(d)  # [1. 4. 7.]
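The example above covers only median. A minimal sketch of nanmedian on an illustrative array with one NaN:

```python
import numpy as np

# Illustrative data with one NaN.
a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])

plain = np.median(a)
ignoring = np.nanmedian(a)
per_row = np.nanmedian(a, axis=1)
print(plain)     # nan
print(ignoring)  # 4.0, median of 1, 3, 4, 5, 6
print(per_row)   # [2. 5.]
```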
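2.2 Weighted average: average
average(a[, axis=None, weights=None, returned])
Computes the weighted average of array a along the axis direction, using the weights given in weights.
When axis = None, the weighted average is taken over the whole array.
Requirements for weights
1) The weights must be an array or array_like.
2) The weights array may be 1-D, in which case its length must equal the size of a along the given axis. Note that no broadcasting is applied here: a length mismatch raises an error.
3) The weights array may also have the same shape as a.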
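The requirements on weights can be tried out directly. A small sketch, assuming the 3x3 array used elsewhere in this section. The exact exception type and message may differ between NumPy versions, so both TypeError and ValueError are caught here:

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 6], [6, 7, 9]])

# Three rows along axis 0, but only two weights: not accepted.
try:
    np.average(a, weights=[1, 2], axis=0)
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print(type(exc).__name__, exc)

# Weights with the same shape as a need no axis argument.
w_full = np.ones_like(a)
same_shape_avg = np.average(a, weights=w_full)
print(same_shape_avg, np.average(a))  # equal, since all weights are 1
```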
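Mathematical definition of the weighted average:
Given n numbers $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$, the quantity
$$\bar{x} = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n}$$
is called the weighted average of these n numbers.
When every weight equals 1, this reduces to the ordinary arithmetic mean.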
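The weighted average is the sum of each value times its weight, divided by the sum of the weights. A minimal sketch that evaluates this by hand and compares it with np.average; the variable names are illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
w = np.array([5, 4, 3, 2, 1])

# Direct translation of the definition: sum(x_i * w_i) / sum(w_i).
manual = np.sum(x * w) / np.sum(w)
print(manual)  # 35 / 15, about 2.3333

# np.average gives the same value; returned=True also returns
# the sum of the weights.
avg, weight_sum = np.average(x, weights=w, returned=True)
print(avg, weight_sum)

# With all weights equal to 1 the result is the plain mean.
unit_avg = np.average(x, weights=np.ones(5))
print(unit_avg, np.mean(x))  # 3.0 3.0
```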
1) Computation when weights = None
In this case every element has a default weight of 1.
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 6], [6, 7, 9]])
print(a)
b = np.average(a)
print(b)  # 4.222222222222222
c = np.average(a, axis=0)
print(c)  # [3. 4. 5.66666667]
d = np.average(a, axis=1)
print(d)  # [1. 4.33333333 7.33333333]
2) When the weights are not all 1
One-dimensional array

import numpy as np

a = np.array([1, 2, 3, 4, 5])
a_weight = [5, 4, 3, 2, 1]
b = np.average(a, weights=a_weight)
print(b)  # 2.3333333333333335

Multidimensional array

import numpy as np

a = np.array([[0, 1, 2], [3, 4, 6], [6, 7, 9]])
print(a)
a_weight_1 = [1, 2, 3]  # a tuple (1, 2, 3) works as well
a_weight_2 = np.array([1, 2, 3])
b = np.average(a, weights=a_weight_1, axis=0)
print(b)  # [4. 5. 6.83333333]
c = np.average(a, weights=a_weight_2, axis=0)
print(c)  # [4. 5. 6.83333333]
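2.3 Arithmetic mean: mean, nanmean
The arithmetic mean is the sum of all elements divided by the number of elements.
mean() computes the arithmetic mean of the whole array or along an axis.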
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.mean(a)
print(b)  # 2.5
c = np.mean(a, axis=0)
print(c)  # [2. 3.]
d = np.mean(a, axis=1)
print(d)  # [1.5 3.5]
Note: the result can sometimes be imprecise.
Notes:
The arithmetic mean is the sum of the elements along the axis divided by the number of elements. Note that for floating-point input, the mean is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-precision accumulator using the dtype keyword can alleviate this issue. By default, float16 results are computed using float32 intermediates for extra precision.
import numpy as np

a = np.zeros((2, 512*512), dtype=np.float16)
a[0, :] = 1.0
a[1, :] = 0.1
b = np.mean(a)
print(b)  # 0.55
a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
b = np.mean(a)
print(b)  # 0.54999924
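The note above suggests a higher-precision accumulator through the dtype keyword. A minimal sketch applying it to the same float32 data:

```python
import numpy as np

a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1

# Accumulate in float64 instead of the input's float32.
m64 = np.mean(a, dtype=np.float64)
print(m64)              # very close to 0.55
print(abs(m64 - 0.55))  # remaining error, far below the float32 case
```

The small remaining error comes from 0.1 not being exactly representable in float32; the accumulation itself no longer contributes noticeably.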
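2.4 Standard deviation: std, nanstd
The standard deviation is the most commonly used measure of statistical dispersion in probability and statistics.
It is defined as the square root of the arithmetic mean of the squared deviations of the values from their mean, and it reflects how spread out the individuals within a group are.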
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.std(a)
print(b)  # 1.118033988749895
c = np.std(a, axis=0)
print(c)  # [1. 1.]
d = np.std(a, axis=1)
print(d)  # [0.5 0.5]
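The definition of the standard deviation (square root of the mean squared deviation from the mean) can be written out step by step and compared with np.std. A minimal sketch; the intermediate names are illustrative:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Square root of the mean squared deviation from the mean.
deviation = a - a.mean()
manual = np.sqrt(np.mean(deviation ** 2))
print(manual)     # about 1.118
print(np.std(a))  # same value

# nanstd applies the same definition after dropping NaNs.
b = np.array([1.0, np.nan, 3.0])
nan_result = np.nanstd(b)
print(nan_result)  # 1.0
```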
Arrays with different precision (dtype) give slightly different results.
import numpy as np

a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
b = np.std(a)
print(b)  # 0.45000005
a = np.zeros((2, 512*512), dtype=np.float64)
a[0, :] = 1.0
a[1, :] = 0.1
b = np.std(a)
print(b)  # 0.45
2.5 Variance: var, nanvar
The variance is the mean of the squared deviations of the elements from their mean: var = mean(abs(x - x.mean())**2).
numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<class 'numpy._globals._NoValue'>)
Without np.nan
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.var(a)
print(b)  # 1.25
c = np.var(a, axis=0)
print(c)  # [1. 1.]
d = np.var(a, axis=1)
print(d)  # [0.25 0.25]
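The signature includes ddof, which the examples leave at its default of 0. NumPy divides the summed squared deviations by N - ddof, so ddof=1 gives the sample variance. A minimal sketch with illustrative names:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
sq_dev = (x - x.mean()) ** 2  # squared deviations, summing to 5.0

# ddof=0 (default): divide by N.
population = np.var(x)
print(population)  # 1.25

# ddof=1: divide by N - 1.
sample = np.var(x, ddof=1)
by_hand = sq_dev.sum() / (x.size - 1)
print(sample, by_hand)  # both 5 / 3, about 1.6667
```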
With np.nan

import numpy as np

a = np.array([[np.nan, 2], [3, 4]])
b = np.var(a)
print(b)  # nan
c = np.var(a, axis=0)
print(c)  # [nan 1.]
d = np.nanvar(a, axis=0)
print(d)  # [0. 1.]
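The other NaN-aware functions named in the headings of this section (nanmean, nanstd) behave the same way as nanvar. A short sketch on the same array:

```python
import numpy as np

a = np.array([[np.nan, 2], [3, 4]])

# The NaN is dropped, leaving the values 2, 3 and 4.
mean_all = np.nanmean(a)
var_all = np.nanvar(a)
std_all = np.nanstd(a)
print(mean_all)  # 3.0
print(var_all)   # 2 / 3, about 0.6667
print(std_all)   # square root of the variance above

# Per-row means: only the first row loses an element.
row_means = np.nanmean(a, axis=1)
print(row_means)  # 2.0 for the first row, 3.5 for the second
```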
References
NumPy Reference, Release 1.14.5
NumPy statistical functions