In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on the correlations between variables, through which different patterns can be identified and analyzed. It gauges the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. In other words, it is a multivariate effect size.


Formally, the Mahalanobis distance of a multivariate vector x = ( x_1, x_2, x_3, \dots, x_N )^T from a group of values with mean \mu = ( \mu_1, \mu_2, \mu_3, \dots , \mu_N )^T and covariance matrix S is defined as:

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x-\mu)}.\,
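
As an illustration of the definition above, here is a minimal NumPy sketch. The function name `mahalanobis_to_group` and the toy data are hypothetical, and the covariance matrix S is assumed to be the unbiased sample covariance of the group (the text does not say which estimator is meant).

```python
import numpy as np

def mahalanobis_to_group(x, samples):
    """Mahalanobis distance from vector x to the group whose rows are `samples`."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    mu = samples.mean(axis=0)              # group mean
    S = np.cov(samples, rowvar=False)      # assumption: unbiased sample covariance
    diff = x - mu
    # Solve S z = diff rather than forming S^{-1} explicitly.
    z = np.linalg.solve(S, diff)
    return float(np.sqrt(diff @ z))

# Toy group: spread is twice as wide along the first axis as along the second.
group = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

d_wide = mahalanobis_to_group([2.0, 0.0], group)    # Euclidean distance to mean: 2
d_narrow = mahalanobis_to_group([0.0, 1.0], group)  # Euclidean distance to mean: 1
print(d_wide, d_narrow)  # both are sqrt(1.5), about 1.2247
```

The two points lie at different Euclidean distances from the mean but at the same Mahalanobis distance, because each offset is measured relative to the spread of the group in that direction.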


Mahalanobis distance (or "generalized squared interpoint distance" for its squared value) can also be defined as a dissimilarity measure between two random vectors \vec{x} and \vec{y} of the same distribution with the covariance matrix S:

 d(\vec{x},\vec{y})=\sqrt{(\vec{x}-\vec{y})^T S^{-1} (\vec{x}-\vec{y})}.\,
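
The two-vector form can be sketched the same way. In this hypothetical example the covariance matrix `S` is simply given (not estimated), and the helper name `mahalanobis` is chosen for illustration only.

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between x and y for a given covariance matrix S."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # diff^T S^{-1} diff, computed with a linear solve instead of an explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

# Assumed covariance with positive correlation between the two variables.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

origin = [0.0, 0.0]
d_along = mahalanobis([1.0, 1.0], origin, S)    # offset along the correlation
d_across = mahalanobis([1.0, -1.0], origin, S)  # offset against the correlation
print(d_along, d_across)  # about 0.8165 and 1.4142
```

Both offsets have the same Euclidean length, sqrt(2), yet the one that follows the correlation is rated closer than the one that runs against it.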

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance:

 d(\vec{x},\vec{y})=\sqrt{\sum_{i=1}^N  {(x_i - y_i)^2 \over s_{i}^2}},

where s_{i} is the standard deviation of the i-th component (x_i and y_i) over the sample set.
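
Both special cases can be checked numerically. The sketch below uses made-up vectors and standard deviations; the helper `mahalanobis` is a hypothetical name, not a library function.

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between x and y for a given covariance matrix S."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Identity covariance: the result equals the Euclidean distance.
d_identity = mahalanobis(x, y, np.eye(3))
d_euclidean = float(np.linalg.norm(x - y))

# Diagonal covariance: the result equals the normalized Euclidean distance.
s = np.array([3.0, 2.0, 1.0])                       # assumed standard deviations s_i
d_diagonal = mahalanobis(x, y, np.diag(s ** 2))
d_normalized = float(np.sqrt(np.sum(((x - y) / s) ** 2)))

print(d_identity, d_euclidean)    # both 5.0
print(d_diagonal, d_normalized)   # both sqrt(5), about 2.2361
```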



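1. The Mahalanobis distance is computed relative to the whole sample population, as follows from the role of the covariance matrix described above. In other words, if the same two samples are placed in two different populations, the Mahalanobis distance computed between them will generally differ, unless the two populations happen to have the same covariance matrix.
2. Computing the Mahalanobis distance requires that the number of samples in the population be greater than the dimension of the samples; otherwise the inverse of the sample covariance matrix does not exist. In that case, the Euclidean distance can be used instead.
3. It can also happen that the number of samples exceeds the dimension, yet the inverse of the covariance matrix still does not exist. An example is the three sample points (3,4), (5,6) and (7,8): the three samples are collinear in the two-dimensional plane they lie in. In this case the Euclidean distance is used as well.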
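
The singular cases in the notes above can be detected with a rank check before inverting. The following is a minimal sketch, not a definitive implementation: `distance_with_fallback` is a hypothetical helper, the covariance is assumed to be the unbiased sample covariance, and the rank test relies on the default tolerance of `np.linalg.matrix_rank`.

```python
import numpy as np

def distance_with_fallback(x, y, samples):
    """Return (distance, kind): Mahalanobis if the sample covariance is
    invertible, otherwise the Euclidean distance as suggested in the notes."""
    samples = np.asarray(samples, dtype=float)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    S = np.atleast_2d(np.cov(samples, rowvar=False))  # assumption: unbiased estimator
    if np.linalg.matrix_rank(S) < S.shape[0]:
        # Singular covariance: no inverse exists, fall back to Euclidean distance.
        return float(np.linalg.norm(diff)), "euclidean"
    return float(np.sqrt(diff @ np.linalg.solve(S, diff))), "mahalanobis"

# Three collinear points in the plane: the covariance is [[4, 4], [4, 4]], rank 1.
collinear = [(3, 4), (5, 6), (7, 8)]
d1, kind1 = distance_with_fallback((3, 4), (7, 8), collinear)

# Fewer samples than dimensions: 2 samples in 3 dimensions, rank at most 1.
too_few = [(1, 2, 3), (4, 0, 5)]
d2, kind2 = distance_with_fallback((1, 2, 3), (4, 0, 5), too_few)

# Moving one point off the line makes the covariance invertible.
general = [(3, 4), (5, 6), (7, 9)]
d3, kind3 = distance_with_fallback((3, 4), (7, 9), general)

print(kind1, kind2, kind3)  # euclidean euclidean mahalanobis
```

Checking the rank rather than comparing the determinant with zero is a design choice made here for numerical robustness; either test expresses the same condition that the inverse does not exist.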
