From Wikipedia: in probability theory and statistics, covariance is a measure of the joint variability of two random variables.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive.
In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, the covariance is negative.
The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables.
The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
From Wikipedia: in probability theory and statistics, the mathematical concepts of covariance and correlation are very similar.
Both describe the degree to which two random variables or sets of random variables tend to deviate from their expected values in similar ways.
If $X$ and $Y$ are two random variables, with means (expected values) $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$, respectively, then their covariance and correlation are as follows:
$$covariance:\quad cov_{XY}=\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]$$

$$correlation:\quad corr_{XY}=\rho_{XY}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}$$
so that:
$$\rho_{XY}=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$
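This identity can be checked numerically. The sketch below uses invented data (not from the article) and NumPy's sample statistics:

```python
import numpy as np

# Illustrative data (invented for this sketch) to check the identity
# rho_XY = sigma_XY / (sigma_X * sigma_Y).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)

cov_xy = np.cov(x, y)[0, 1]  # sample covariance sigma_XY
rho = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef computes the same normalized quantity directly.
print(np.isclose(rho, np.corrcoef(x, y)[0, 1]))  # True
```

Note that `ddof=1` is needed so the standard deviations use the same $n-1$ divisor as `np.cov`.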
Both terms measure the relationship and the dependency between two variables.
“Covariance” indicates the direction of the linear relationship between variables.
“Correlation” on the other hand measures both the strength and direction of the linear relationship between two variables.
Correlation is a function of the covariance. What sets them apart is the fact that correlation values are standardized, whereas covariance values are not.
You can obtain the correlation coefficient of two variables by dividing the covariance of these variables by the product of the standard deviations of the same values.
If we revisit the definition of Standard Deviation, it essentially measures the absolute variability of a dataset’s distribution.
When you divide the covariance by the product of the standard deviations, the value is scaled down to the limited range of -1 to +1. This is precisely the range of the correlation values.
Covariance compares two variables in terms of the deviations from their mean (or expected) value.
The covariance of two variables (x and y) can be represented as cov(x,y).
If E[x] is the expected value or mean of a sample ‘x’, then cov(x,y) can be represented in the following way:
$$cov(x,y)=E[(x-\mu_x)(y-\mu_y)]=E[xy]-E[x]E[y]=E[xy]-\mu_x\mu_y$$

$$cov(y,y)=E[(y-\mu_y)(y-\mu_y)]=var(y)=E[(y-\mu_y)^2]=s^2$$

$$s^2=cov(x,x)=\frac{\sum^n_{i=1}(x_i-\mu_x)^2}{n-1}$$

$$cov(x,y)=\frac{\sum^n_{i=1}(x_i-\mu_x)(y_i-\mu_y)}{n-1}$$
The numerator of the sample variance formula is called the sum of squared deviations; the numerator of the sample covariance formula is called the sum of cross products.
$n$ is the number of samples in the data set.
$n-1$ indicates the degrees of freedom.
$E$ is the expectation and $\mu$ is the mean.
Degrees of freedom is the number of independent data points that went into calculating the estimate.
For example, take a set with $n=3$ and mean $\mu=10$: given two of the numbers, the third can be inferred, so it is determined rather than free to vary. The degrees of freedom are therefore 2.
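As a quick check, the sample formulas above can be transcribed directly into NumPy. The data here is made up for illustration, and `np.cov`, which also divides by $n-1$, should agree:

```python
import numpy as np

# Invented sample data for illustration.
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.1, 14.5])

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Sum of squared deviations / (n - 1): the sample variance s^2 = cov(x, x).
s2 = ((x - mu_x) ** 2).sum() / (n - 1)
# Sum of cross products / (n - 1): the sample covariance cov(x, y).
cov_xy = ((x - mu_x) * (y - mu_y)).sum() / (n - 1)

C = np.cov(x, y)  # np.cov uses the same n - 1 (degrees of freedom) divisor
print(np.isclose(s2, C[0, 0]), np.isclose(cov_xy, C[0, 1]))  # True True
```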
A covariance of 0 indicates that two variables have no linear relationship (they may still be related in a nonlinear way).
If the covariance is positive, the variables increase in the same direction, and if the covariance is negative, the variables change in opposite directions.
Correlation is a normalization of covariance by the standard deviation of each variable.
$$Corr(X,Y)=E\left[\frac{(X-\mu_X)}{\sigma_X}\frac{(Y-\mu_Y)}{\sigma_Y}\right]$$
$\sigma$ is the standard deviation.
The correlation coefficient is also known as the Pearson product-moment correlation coefficient, or Pearson’s correlation coefficient.
It is obtained by dividing the covariance of the two variables by the product of their standard deviations.
$$corr(x,y)=\frac{cov(x,y)}{s_xs_y}=\frac{E[(x-\mu_x)(y-\mu_y)]}{s_xs_y}=\frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y}$$
The values of the correlation coefficient can range from -1 to +1. The closer it is to +1 or -1, the more closely the two variables are related.
The sign signifies the direction of the correlation: if the coefficient is positive, then when one variable increases, the other is also expected to increase.
Correlation is dimensionless: it is a unit-free measure of the relationship between variables. This is because we divide the covariance by the product of the standard deviations, which carries the same units as the covariance.
While correlation coefficients lie between -1 and +1, covariance can take any value between $-\infty$ and $+\infty$.
The unit of covariance is a product of the units of the two variables.
Correlation is simply a normalized form of covariance and not affected by scale. Both covariance and correlation measure the linear relationship between variables but cannot be used interchangeably.
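The scale behaviour can be seen numerically. In this sketch (data invented; imagine $y$ being converted from metres to centimetres), rescaling one variable by 100 multiplies the covariance by 100 but leaves the correlation unchanged:

```python
import numpy as np

# Invented data for illustration.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.3, size=50)

# Covariance is affected by scale: cov(x, 100*y) = 100 * cov(x, y).
print(np.isclose(np.cov(x, 100 * y)[0, 1], 100 * np.cov(x, y)[0, 1]))      # True
# Correlation is not: corr(x, 100*y) = corr(x, y).
print(np.isclose(np.corrcoef(x, 100 * y)[0, 1], np.corrcoef(x, y)[0, 1]))  # True
```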
Now that we are done with mathematical theory, let us explore how and where it can be applied in the field of data analytics.
Correlation analysis, as many analysts know, is a vital tool for feature selection and multivariate analysis in data preprocessing and exploration. Correlation helps us investigate and establish relationships between variables, and it is employed in feature selection before any kind of statistical modelling or data analysis.
You are advised to use the covariance matrix when the variables are on similar scales and the correlation matrix when the scales of the variables differ.
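A small sketch of that advice, using invented height/weight data on very different scales:

```python
import numpy as np

# Invented data: the two variables live on very different scales.
rng = np.random.default_rng(2)
height_m = rng.normal(1.7, 0.1, size=200)               # metres
weight_kg = 40 * height_m + rng.normal(0, 5, size=200)  # kilograms

data = np.vstack([height_m, weight_kg])
print(np.cov(data))       # off-diagonal entry depends on the kg scale
print(np.corrcoef(data))  # off-diagonal entry is the unit-free correlation
```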
Covariance is used to determine how much two random variables vary together; whereas correlation is used to determine when a change in one variable can result in a change in another.
Covariance just expresses how the two households differ, with no judgment attached.
Correlation is the competition: after my family buys a new TV, will the neighbours replace their motorcycle? For a positive relationship, A buys an Audi and B buys a BMW; for a negative one, A buys an Audi and B sells their BMW; with no relation, A buys an Audi and B goes out on a date instead.
The former emphasizes the difference; the latter emphasizes the connection.
When the correlation coefficient is positive, an increase in one variable also results in an increase in the other.
If we are comparing the relationship among three or more variables, it is better to use correlation because the value ranges or unit may cause false assumptions.
If we want to measure the relationships between X and Y and between X and Z, covariance alone may suggest that the covariance of X and Y is bigger than that of X and Z simply because of the data ranges. We therefore need correlation to eliminate the effect of different value ranges.
Covariance should only be used to inspect how two variables deviate together; covariance values should not be compared across pairs of data sets, because the values are unit-dependent.
When comparing three or more variables, that is, when at least two covariance values must be compared, correlation is needed to make the comparison dimensionless.
Covariance compares two variables; correlation is for three or more variables (which means two or more covariances).
np.cov() returns the covariance matrix.
array([[0.07707877, 0.11350354],
[0.11350354, 0.18856815]])
The value at position [0,0] shows the covariance of X with itself, which is the variance;
The value at position [1,1] shows the covariance of Y with itself;
The covariance of X and Y is 0.11.
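The matrix above can be reproduced in spirit with invented data (the article's original X and Y samples are not given); the sketch shows how each entry is read:

```python
import numpy as np

# Invented positively related data, standing in for the article's X and Y.
rng = np.random.default_rng(3)
X = rng.normal(size=100)
Y = 1.5 * X + rng.normal(scale=0.4, size=100)

C = np.cov(X, Y)
print(np.isclose(C[0, 0], np.var(X, ddof=1)))  # [0,0] is var(X): True
print(np.isclose(C[1, 1], np.var(Y, ddof=1)))  # [1,1] is var(Y): True
print(np.isclose(C[0, 1], C[1, 0]))            # symmetric, cov(X,Y)=cov(Y,X): True
```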
A negative covariance example looks like:
array([[0.07707877, -0.11350354],
[-0.11350354, 0.18856815]])
For independent variables:
array([[0.07707877, 0.00350354],
[0.00350354, 0.18856815]])
which has a covariance value that is close to zero.
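For truly independent draws (invented here), the sample covariance comes out near zero, though not exactly zero for a finite sample:

```python
import numpy as np

# Two independent standard normal samples; their covariance should be
# close to 0 (it shrinks toward 0 as the sample size grows).
rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

print(np.cov(x, y)[0, 1])  # a small value near 0, not exactly 0
```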
This figure illustrates the notion that covariance describes how similarly two random variables deviate from their means.