In probability theory and statistics, covariance measures how two variables vary together, that is, their joint variability. Variance is the special case of covariance in which the two variables are the same. Its mathematical definition is:

$$Cov(X,Y)=E[(X-E(X))(Y-E(Y))]=E[XY]-E[X]E[Y]$$
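The second equality in the definition follows from linearity of expectation. The short derivation below spells out the intermediate step; it only assumes that $E[X]$ and $E[Y]$ are finite constants.

```latex
\begin{aligned}
Cov(X,Y) &= E\big[(X-E[X])(Y-E[Y])\big] \\
         &= E\big[XY - X\,E[Y] - E[X]\,Y + E[X]E[Y]\big] \\
         &= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\
         &= E[XY] - E[X]E[Y]
\end{aligned}
```

Setting $Y = X$ gives $Cov(X,X) = E[X^2] - (E[X])^2$, which is the variance of $X$.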
Formula
$$cov(X,Y) = \frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$
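The sample formula can be checked directly against `np.cov`. The sketch below is only an illustration: `sample_cov` is a hypothetical helper name, not part of NumPy, and the two arrays are the `T` and `S` data used in this post's examples.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance of two equal-length 1-D sequences (divides by n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

T = [9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20]
S = [39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80]

manual = sample_cov(T, S)
reference = np.cov(T, S)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
print(manual, reference)
```

Both values should agree up to floating-point rounding, because `np.cov` divides by n - 1 by default.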
Form of the result
$$
C = \begin{pmatrix}
\color{#F00}{cov(1,1)} & \color{#0F0}{cov(1,2)} & \color{#F0F}{cov(1,3)} & \cdots & cov(1,n) \\
\color{#0F0}{cov(2,1)} & \color{#F00}{cov(2,2)} & cov(2,3) & \cdots & cov(2,n) \\
\color{#F0F}{cov(3,1)} & cov(3,2) & \color{#F00}{cov(3,3)} & \cdots & cov(3,n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
cov(n,1) & cov(n,2) & cov(n,3) & \cdots & \color{#F00}{cov(n,n)}
\end{pmatrix}
$$
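Two properties are visible in the matrix layout: entries mirrored across the diagonal are equal (cov(i, j) = cov(j, i)), and the diagonal holds each variable's variance. A minimal check, using made-up random data (the seed and the shape are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 12))  # 3 variables (rows), 12 observations (columns)

C = np.cov(X)

# The matrix equals its transpose.
is_symmetric = np.allclose(C, C.T)

# The diagonal equals the per-row sample variance (ddof=1 divides by n - 1).
diag_is_variance = np.allclose(np.diag(C), np.var(X, axis=1, ddof=1))

print(C.shape, is_symmetric, diag_is_variance)
```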
Function prototype: def cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
import numpy as np
# When computing covariance, each row represents one feature (variable)
# Below we compute cov(T, S, M)
T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
# Each row of X is one attribute (variable)
# Each column is one example, i.e. one observation
print(np.cov(X))
# [[ 47.71969697 122.9469697 129.59090909]
# [122.9469697 370.08333333 374.59090909]
# [129.59090909 374.59090909 399. ]]
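Key point: the covariance matrix holds the covariances between different dimensions (variables), not between different samples. Given a sample matrix, the first thing to establish is what the rows represent and what the columns represent.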
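When the data arrive with observations in rows instead, the `rowvar` parameter from the prototype tells `np.cov` to treat columns as variables. The sketch below contrasts the two layouts; the variable names `X_obs_rows`, `C_rows`, `C_cols` and `C_wrong` are chosen only for this illustration.

```python
import numpy as np

T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.array([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])

X = np.vstack((T, S, M))  # shape (3, 12): variables in rows
X_obs_rows = X.T          # shape (12, 3): observations in rows

C_rows = np.cov(X)                          # default rowvar=True
C_cols = np.cov(X_obs_rows, rowvar=False)   # columns are the variables
C_wrong = np.cov(X_obs_rows)                # treats the 12 observations as 12 variables

print(C_rows.shape, C_cols.shape, C_wrong.shape)
```

The first two results are the same 3x3 matrix. The third is a 12x12 matrix of covariances between samples, which is the mistake the key point above warns about.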
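frequency weights (fweights): a 1-D array of integers giving the number of times each observation is repeated (in effect, assigning a weight to each observation)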
T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
print(np.cov(X, None, True, False, fweights=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
# Same result as the example above
# [[ 47.71969697 122.9469697 129.59090909]
# [122.9469697 370.08333333 374.59090909]
# [129.59090909 374.59090909 399. ]]
T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
print(np.cov(X, None, True, False, fweights=[2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
# The result changes, because this is equivalent to adding one more column [9, 39, 38].T to X
# [[ 45.6025641 121.55769231 128.43589744]
# [121.55769231 381.42307692 389.30769231]
# [128.43589744 389.30769231 415.76923077]]
T = np.array([9, 9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
print(np.cov(X, None, True, False))
# This confirms the statement above
# [[ 45.6025641 121.55769231 128.43589744]
# [121.55769231 381.42307692 389.30769231]
# [128.43589744 389.30769231 415.76923077]]
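The same equivalence can be checked for any integer weights without duplicating columns by hand, using `np.repeat` along the observation axis. The particular weights below are made up for illustration.

```python
import numpy as np

T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.array([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))

# Hypothetical frequency weights: one integer count per observation (column).
fweights = np.array([2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1])

weighted = np.cov(X, fweights=fweights)

# Physically repeat each column the given number of times, then use plain np.cov.
repeated = np.cov(np.repeat(X, fweights, axis=1))

print(np.allclose(weighted, repeated))
```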
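Unfortunately, I don't yet know how this one is computed. I'll read the source code carefully when I have time and then update this section.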
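If the parameter in question is `aweights` (an assumption; the text does not name it), the following sketch reproduces what `np.cov` appears to do when only `aweights` is given. The helper name `weighted_cov`, the example weights, and the normalization formula are my reading of NumPy's behaviour, not something stated in this post, so verify it against the installed version.

```python
import numpy as np

def weighted_cov(X, aweights, ddof=1):
    """Sketch of a covariance with observation weights (rows are variables)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(aweights, dtype=float)
    w_sum = w.sum()
    # Weighted mean of each variable.
    mean = (X * w).sum(axis=1, keepdims=True) / w_sum
    centered = X - mean
    # Assumed normalization: sum(w) - ddof * sum(w**2) / sum(w).
    norm = w_sum - ddof * (w * w).sum() / w_sum
    return (centered * w) @ centered.T / norm

T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.array([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))

# Made-up, non-integer weights: one per observation.
aweights = np.array([0.5, 1.0, 2.0, 1.0, 1.5, 1.0, 0.25, 1.0, 1.0, 3.0, 1.0, 1.0])

sketch = weighted_cov(X, aweights)
reference = np.cov(X, aweights=aweights)
print(np.allclose(sketch, reference))
```

With all weights equal, the normalization reduces to n - 1 (after rescaling), so the result matches the unweighted `np.cov(X)`.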
T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
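# This result matches the one above: y holds additional variables that np.cov stacks together with m, so np.cov(X[0:1], X[1:]) is the same as np.cov(X). I'm not sure why such a parameter exists, hhh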
print(np.cov(X[0:1], X[1:]))
# [[ 47.71969697 122.9469697 129.59090909]
# [122.9469697 370.08333333 374.59090909]
# [129.59090909 374.59090909 399. ]]
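>>> a = [1,2,3,4] # when a is a 1-D vector
>>> import numpy as np
>>> np.cov(a) # sample variance (divides by n - 1)
array(1.66666667)
>>> np.var(a) # population variance (divides by n)
1.25
The difference between cov(a) and var(a) is the factor (n - 1)/n, here 3/4: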
>>> 1.666666666666666667*3/4
1.25
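The two functions can be made to agree by changing the divisor through `ddof` or `bias`, both of which appear in the prototype of `np.cov` (`np.var` also accepts `ddof`). A small sketch with the same list:

```python
import numpy as np

a = [1, 2, 3, 4]
n = len(a)

sample_var = float(np.cov(a))      # divides by n - 1
population_var = float(np.var(a))  # divides by n

# np.var with ddof=1 switches to the n - 1 divisor used by np.cov.
var_ddof1 = float(np.var(a, ddof=1))

# np.cov with bias=True (or ddof=0) switches to the n divisor used by np.var.
cov_biased = float(np.cov(a, bias=True))
cov_ddof0 = float(np.cov(a, ddof=0))

print(sample_var, population_var, var_ddof1, cov_biased, cov_ddof0)
```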