wiki: https://en.wikipedia.org/wiki/Kernel_density_estimation
Blog: http://blog.163.com/zhuandi_h/blog/static/1802702882012111092743556/
In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.
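Kernel density estimation is used in probability theory to estimate an unknown density function. It is a non-parametric method, proposed by Rosenblatt (1955) and Emanuel Parzen (1962), and is also known as the Parzen window. Ruppert and Cline proposed a modified kernel density estimation method based on a density-function clustering algorithm for data sets.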
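Let (x1, x2, …, xn) be an independent and identically distributed sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is
\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),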
where K(•) is the kernel (a non-negative function that integrates to one and has mean zero) and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and is defined as Kh(x) = (1/h) K(x/h). Intuitively one wants to choose h as small as the data allow; however, there is always a trade-off between the bias of the estimator and its variance: a smaller h lowers the bias but raises the variance. More on the choice of bandwidth below.
A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense,[3] though the loss of efficiency is small for the other kernels listed.[4] Because of its convenient mathematical properties, the normal kernel is often used, which means K(x) = ϕ(x), where ϕ is the standard normal density function.
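To make the kernel list concrete, here is a minimal Python sketch of the named kernels using their standard textbook forms. The function names and the `integrate` helper (a simple midpoint rule) are my own, introduced only for illustration. The loop checks numerically that each kernel integrates to approximately one, as the definition requires.

```python
import math

# Standard textbook forms of the kernels named above (u is the scaled distance).
def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def biweight(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0

def triweight(u):
    return 35.0 / 32.0 * (1.0 - u * u) ** 3 if abs(u) <= 1 else 0.0

def normal(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def integrate(func, lo=-8.0, hi=8.0, steps=16000):
    # Midpoint rule; a hypothetical helper, accurate enough for a sanity check.
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width

KERNELS = {
    "uniform": uniform,
    "triangular": triangular,
    "epanechnikov": epanechnikov,
    "biweight": biweight,
    "triweight": triweight,
    "normal": normal,
}

for name, kernel in KERNELS.items():
    area = integrate(kernel)
    mean = integrate(lambda u: u * kernel(u))
    print(f"{name}: area ~ {area:.4f}, mean ~ {mean:.4f}")
```

All of these are symmetric around zero, which is why the mean comes out as zero as well.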
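In short, the above means:
Suppose (x1, ..., xn) are drawn from a distribution with density f; what we want is to estimate this function f. The estimator of this function is given by the formula:
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)
Here h is the bandwidth, and a kernel scaled by h is called the scaled kernel. Commonly used kernel functions include: uniform, triangular, biweight, triweight, Epanechnikov, normal. Since the loss of efficiency among these kernels is small, the normal kernel is usually chosen for its convenient mathematical properties, i.e. K(x) = ϕ(x), where ϕ is the standard normal density function.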
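The estimator translates almost directly into code. Below is a minimal sketch, assuming a normal kernel by default. The function names `gaussian_kernel` and `kde` and the sample values are hypothetical, chosen only for illustration.

```python
import math

def gaussian_kernel(u):
    # Standard normal density, i.e. K(u) = phi(u).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h, kernel=gaussian_kernel):
    # f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h)
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

samples = [1.0, 2.0, 4.0]  # hypothetical data, for illustration only
print(kde(2.0, samples, h=1.0))
```

With a single sample and h = 1, the estimate is just the kernel centered on that sample, which makes for a quick sanity check.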
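When kernel density estimation uses a smooth kernel, the estimated probability density function is smooth as well. In many cases we use the Gaussian kernel.
However, kernel density estimation is not perfect and has some drawbacks. To obtain a good probability density function, the choice of the bandwidth h is a major issue: a value that is too large or too small can strongly affect the resulting p(x).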
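One rough way to see the effect of the bandwidth is to count the local maxima of the estimate on a grid. This is a sketch under assumptions: the data (two loose clusters), the three bandwidth values, and the helper `count_modes` are all hypothetical. A very small h tends to produce one spike per sample (undersmoothing), while a very large h merges everything into a single bump (oversmoothing).

```python
import math

def gaussian_kde(x, samples, h):
    # Gaussian-kernel estimate: 1 / (n * h) * sum_i phi((x - x_i) / h)
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / norm

def count_modes(samples, h, lo=-10.0, hi=16.0, step=0.01):
    # Evaluate the estimate on a regular grid and count strict local maxima.
    n_points = int(round((hi - lo) / step)) + 1
    ys = [gaussian_kde(lo + i * step, samples, h) for i in range(n_points)]
    return sum(1 for i in range(1, n_points - 1) if ys[i - 1] < ys[i] > ys[i + 1])

samples = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]  # hypothetical: two clusters of three points
for h in (0.05, 0.8, 5.0):
    print(f"h = {h}: {count_modes(samples, h)} local maxima")
```

The intermediate bandwidth recovers the two clusters, which is the structure a reasonable choice of h should preserve for data like this.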
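Here is an example to make this concrete (using the same 6 sample data points mentioned earlier):
Suppose we use a Gaussian kernel with variance 2.25, that is, bandwidth h = 1.5.
Note: the blue line is the estimated p(x), and each red line marks one sample data point. We can see that p(x) is continuous; in a sense, this amounts to interpolating over the regions away from the observed values.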
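The example can be reproduced numerically along these lines. Since the six data points are not listed in this section, the values below are hypothetical stand-ins; only the Gaussian kernel and the variance of 2.25 come from the text. The sketch evaluates p(x) in a gap between samples, where it is still positive, and checks that the estimate integrates to roughly one.

```python
import math

samples = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]  # hypothetical stand-in data
variance = 2.25
h = math.sqrt(variance)  # bandwidth = standard deviation of the Gaussian kernel

def p(x):
    # Gaussian-kernel density estimate at x.
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / norm

# A point with no sample nearby still gets a positive density value.
gap_value = p(3.5)

# Midpoint-rule integral of p over [-12, 16]; should be close to 1.
step = 0.01
total = sum(p(-12.0 + (i + 0.5) * step) for i in range(2800)) * step

print(f"h = {h}, p(3.5) = {gap_value:.4f}, integral ~ {total:.4f}")
```

Plotting p over a grid, together with a vertical line at each sample, would give a figure like the one the note describes.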