Gaussian Mixture Models and Expectation-Maximization: A Full Explanation

In the previous article, we described the Bayesian framework for linear regression and how we can use latent variables to reduce model complexity.

In this post, we will explain how latent variables can also be used to frame a classification problem, namely with the Gaussian Mixture Model (or GMM for short), which allows us to perform soft probabilistic clustering.

This model is classically trained with an optimization procedure named Expectation-Maximization (or EM for short), which we will review thoroughly. At the end of this article, we will also see why we do not use traditional optimization methods.

This article contains a few mathematical notations and derivations. We are not trying to scare anybody. We believe that once the intuition is given, it is important to dive into the math to understand things for real.

This post was inspired by this excellent course on Coursera: Bayesian Methods for Machine Learning. If you are into machine learning, I definitely recommend this course.

Gaussian Mixture Model

This model is a soft probabilistic clustering model that allows us to describe the membership of points to a set of clusters using a mixture of Gaussian densities. It is a soft classification (in contrast to a hard one) because it assigns probabilities of belonging to a specific class instead of a definitive choice. In essence, each observation will belong to every class but with different probabilities.

We take the famous Iris classification problem as an example. So we have 150 Iris flowers divided between 3 classes. For each of them, we have the sepal length and width, the petal length and width, and the class.

Let’s have a quick view of the data using the pair plot of the seaborn package:

[Figure: pair plot of the Iris dataset (Image by Author)]

The Gaussian Mixture model tries to describe the data as if it originated from a mixture of Gaussian distributions. So first, if we only take one dimension, say the petal width, and try to fit 3 different Gaussians, we would end up with something like this:

[Figure: three Gaussians fitted to the petal width (Image by Author)]

The algorithm found that the mixture most likely to represent the data generation process is made of the following three normal distributions:

[Table: parameters of the three fitted normal distributions]
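The article does not show how the fit was obtained; a sketch of how a comparable 3-component fit on the petal width could be reproduced, assuming scikit-learn's GaussianMixture (the author may well have used something else), is:

```python
# Sketch: fit a 3-component Gaussian mixture on petal width only
# (scikit-learn assumed; the article does not name the library it used)
import seaborn as sns
from sklearn.mixture import GaussianMixture

iris = sns.load_dataset("iris")
X = iris[["petal_width"]].values                      # shape (150, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.weights_)       # mixture weights alpha_k
print(gmm.means_)         # one mean per component
print(gmm.covariances_)   # one variance per component in the 1-D case
```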

The setosa petal widths are much more concentrated, with a mean of 0.26 and a variance of 0.04. The other two classes are comparatively more spread out but with different locations. Now let's see the result with two dimensions, say petal width and petal length.

[Figure: mixture fitted on petal width and petal length (Image by Author)]

Now the constituents of the mixture are the following:

[Table: parameters of the three fitted bivariate normal distributions]

Note that the GMM is a pretty flexible model. It can be shown that, for a large enough number of mixture components and an appropriate choice of the involved parameters, one can approximate arbitrarily closely any continuous pdf (with the extra computational cost that it entails).

Formalization — MLE

So how does the algorithm find the best set of parameters to describe the mixture?

Well, we start by defining the probability model. The probability of observing any observation, that is, its probability density, is a weighted sum of K Gaussian distributions (as pictured in the previous section):

p(x_i | Θ) = Σ_{k=1..K} α_k · N(x_i | μ_k, Σ_k)

Each point is a mixture of K weighted Gaussians which are parameterized by a mean and a covariance matrix. So overall, we can describe the probability of observing a specific observation as a mixture. To be sure that you fully understand, in the one-dimensional example above using the petal width, the probability of observing a petal width in the region [0.2, 0.3] is the highest. We also have a strong probability of observing a petal width in the regions [1.2, 1.5] and [1.8, 2.2].
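To make this concrete, here is a small sketch of the density as a weighted sum of Gaussians in the univariate case. The parameter values are illustrative: only the first mean and standard deviation come from the setosa figures quoted earlier (std = sqrt(0.04) = 0.2); the others are made up.

```python
# Sketch: p(x | theta) = sum_k alpha_k * N(x | mu_k, sigma_k) in one dimension
# (illustrative parameter values, not the article's fitted ones)
import numpy as np
from scipy.stats import norm

def gmm_density(x, alphas, mus, sigmas):
    """Weighted sum of K univariate Gaussian densities evaluated at x."""
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

print(gmm_density(0.25, alphas=[1/3, 1/3, 1/3],
                  mus=[0.26, 1.3, 2.0], sigmas=[0.2, 0.2, 0.25]))
```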

Note that in the case of 3 clusters, the set of parameters will be:

Θ = { α_1, α_2, α_3, μ_1, μ_2, μ_3, Σ_1, Σ_2, Σ_3 }

Then we want to find parameter values that maximize the likelihood of the dataset. We want to find the maximum likelihood estimates of the parameters. That is, we want to find the parameters that maximize the probability of observing all the data points together (i.e., the joint probability), solving the following optimization problem:

max_Θ  p(X | Θ) = max_Θ  Π_{i=1..N} p(x_i | Θ)

Note that the full joint probability of the dataset can be factorized (i.e., decomposed as a product of individual probabilities, with the Π operator) only under the assumption that the observations are drawn i.i.d. (independent and identically distributed). When they are, the events are independent and the probability of observing one data point is not influenced by the other probabilities.

Instead of the likelihood, we usually maximize the log-likelihood, in part because it turns the product of probabilities into a sum (simpler to work with). We are allowed to do so because the natural logarithm is a monotonically increasing concave function, so it does not change the location of the maximum (the location where the derivative is null remains the same).
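A quick numerical check of this product-to-sum property (which is also far more stable numerically when many small probabilities are multiplied):

```python
# The log of a product of probabilities equals the sum of their logs
import numpy as np

p = np.array([0.1, 0.02, 0.3, 0.07])
print(np.log(np.prod(p)))    # log of the product
print(np.log(p).sum())       # sum of the logs -- same value
```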

max_Θ  Σ_{i=1..N} log p(x_i | Θ)

Now, maximum likelihood estimation can be done in many different ways. It can be done by direct optimization (finding the points where the partial derivatives are null) or by numerical optimization like gradient descent. The MLE of GMMs is not done using those methods, for a number of reasons that I will explain. But I leave that for the end of the article because I want to get to the most relevant material first. The MLE of GMMs is done using the Expectation-Maximization algorithm.

Expectation-Maximization

GMM training intuition

First, we are going to visually describe what happens during the training of a GMM model, because it will really help to build the necessary intuition for EM. So let's say we are back to the one-dimensional example, but without labels this time. Try to imagine how we could assign cluster labels to the observations below:

[Figure: unlabeled one-dimensional observations (Image by Author)]

Well, if we already knew where the Gaussians are in the above plot, for each observation, we could compute the cluster probabilities. Let’s draw this picture so you have it in mind. So we would be assigning a color to the points below:

[Figure: points colored according to the known Gaussians (Image by Author)]

Intuitively, for one selected observation and one selected Gaussian, the probability of the observation belonging to the cluster would be the ratio between the Gaussian value and the sum of all the Gaussians. Something like:

p(point ∈ cluster k) ≈ N(x | μ_k, Σ_k) / Σ_{j=1..K} N(x | μ_j, Σ_j)
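A small sketch of that ratio in code (the parameter values are the same illustrative ones as before, not the article's):

```python
# Sketch of the intuition: the membership probability of a point for cluster k is the
# value of Gaussian k at that point divided by the sum of all Gaussian values there
# (unweighted, exactly as described; the full model also multiplies by alpha_k)
import numpy as np
from scipy.stats import norm

def soft_assignments(x, mus, sigmas):
    densities = np.array([norm.pdf(x, loc=m, scale=s) for m, s in zip(mus, sigmas)])
    return densities / densities.sum()   # one probability per cluster, summing to 1

print(soft_assignments(1.4, mus=[0.26, 1.3, 2.0], sigmas=[0.2, 0.2, 0.25]))
```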

Ok, but how do we know where the Gaussians are located in the above plot (i.e., how do we find the Gaussian parameters)? Well, let's say we are already in possession of the observations' labels, like so:

[Figure: observations with known cluster labels (Image by Author)]

Now we could easily find the parameter values and draw the Gaussians. It suffices to consider each group of points independently, say the red points, and find the maximum likelihood estimates. For a Gaussian distribution, one can demonstrate the following results:

μ_ML = (1/n) Σ_{i=1..n} x_i        Σ_ML = (1/n) Σ_{i=1..n} (x_i − μ_ML)(x_i − μ_ML)ᵀ
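In code, these estimators are just the sample mean and the (biased, divide-by-n) sample covariance; a one-dimensional sketch on hypothetical "red" points:

```python
# Sketch: maximum likelihood estimates of a single Gaussian fitted to one cluster's points
import numpy as np

red_points = np.array([0.2, 0.25, 0.3, 0.2, 0.35])     # hypothetical 1-D observations

mu_hat = red_points.mean()                       # sample mean
var_hat = ((red_points - mu_hat) ** 2).mean()    # MLE variance (divides by n, not n - 1)
print(mu_hat, var_hat)
```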

Applying the above formulas to the red points, then the blue points, and then the yellow points, we get the following normal distributions:

[Figure: the three Gaussians fitted to the labeled groups (Image by Author)]

Ok, so given the normal distribution parameters, we can find the observation labels; and given the observation labels, we can find the normal distribution parameters. So it seems like we have a kind of chicken-and-egg problem, right?

Well, in fact solving it is not that hard. We just have to start somewhere. So we can set the Gaussian parameters to random values. Then we perform the optimization by iterating through the following two steps until convergence:

  1. We assign labels to the observations using the current Gaussian parameters
  2. We update the Gaussian parameters so that the fit is more likely

This would give the following result:

[Figure: the successive steps of the iterative procedure (Image by Author)]

As you can see, as soon as we reach step 4 we are already at the best possible fit.
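Putting the two alternating steps together, a bare-bones one-dimensional version of this procedure could look like the sketch below (illustrative only: a fixed number of iterations, no safeguards, and not necessarily what the author used to produce the figures):

```python
# Bare-bones EM for a 1-D Gaussian mixture, mirroring the two steps described above
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K=3, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    alphas = np.full(K, 1.0 / K)                  # mixture weights
    mus = rng.choice(x, size=K, replace=False)    # random initial means
    sigmas = np.full(K, x.std())                  # common initial spread

    for _ in range(n_iter):
        # Step 1: soft-assign labels given the current Gaussian parameters
        dens = np.stack([a * norm.pdf(x, m, s)
                         for a, m, s in zip(alphas, mus, sigmas)])   # (K, N)
        gamma = dens / dens.sum(axis=0, keepdims=True)               # responsibilities

        # Step 2: update the Gaussian parameters given the soft labels
        Nk = gamma.sum(axis=1)
        alphas = Nk / len(x)
        mus = (gamma @ x) / Nk
        sigmas = np.sqrt((gamma * (x - mus[:, None]) ** 2).sum(axis=1) / Nk)

    return alphas, mus, sigmas

# e.g. em_gmm_1d(petal_widths) for the 1-D data used above
```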

EM intuition

The Expectation-Maximization algorithm is performed in exactly the same way. In fact, the optimization procedure we described above for GMMs is a specific implementation of the EM algorithm. The EM algorithm is just defined more generally and formally (as it can be applied to many other optimization problems).

So the general idea is that we are trying to maximize a likelihood (and more frequently a log-likelihood), that is, we are trying to solve the following optimization problem:

max_Θ  log p(X | Θ) = max_Θ  Σ_{i=1..N} log p(x_i | Θ)

This time we are not saying that the likelihood P(x_i|Θ) is a mixture of Gaussians. It can be anything.

Now, let's visualize things! Imagine that the log-likelihood (log P(X|Θ)) is the following one-dimensional curve over the parameter space:

[Figure: the log-likelihood as a function of Θ (Image by Author)]

The major trick that makes this algorithm work lies in the definition and usage of a specific function. This function is defined in such a way that, at any given point in the parameter space, we know for sure that it will always have a value lower than or equal to the log-likelihood. It is called a lower bound. We will call it L (pictured in red below):

[Figure: a lower bound L of the log-likelihood (Image by Author)]

Now, in fact, we don't use a single lower bound but a family of lower bounds, parameterized by the vector of parameters Θ and a variational distribution q. So L(Θ, q) can be located anywhere as long as it remains a lower bound for the log-likelihood:

[Figure: a family of lower bounds L(Θ, q) (Image by Author)]

The EM algorithm starts by assigning random parameters. So let’s say we start with the following lower bound:

[Figure: the initial lower bound (Image by Author)]

The algorithm will now perform two successive steps:

  1. Fix Θ and adjust q so that the lower-bound gets as close as possible to the log-likelihood. For example, during the first step, we compute q1:

[Figure: adjusting q so the lower bound gets close to the log-likelihood (Image by Author)]

  2. Fix q and adjust Θ so that the lower-bound gets maximized. For example, during the second step, we compute Θ1:

[Figure: adjusting Θ to maximize the lower bound (Image by Author)]

So to sum up this intuition: the EM algorithm breaks up the difficulty of finding the maximum of the likelihood (or at least a local maximum) into a series of successive steps that are much easier to deal with. In order to do so, it introduces a lower bound that is parametrized by the vector Θ, for which we want to find the optimum, and by a variational distribution q that we can also modify at will.

Jensen's inequality

This inequality is in some way just a rewording of the definition of a concave function. Recall that for any concave function f, any weight α in [0, 1] and any two points x and y:

f(α·x + (1 − α)·y) ≥ α·f(x) + (1 − α)·f(y)

In fact, this definition can be generalized to more than two points, as long as the weights are non-negative and sum up to 1:

f( Σ_i α_i·x_i ) ≥ Σ_i α_i·f(x_i)    with α_i ≥ 0 and Σ_i α_i = 1

If the weights sum up to 1, we can say that they represent a probability distribution, and this gives us the definition of Jensen's inequality:

f( E[x] ) ≥ E[ f(x) ]

The image of the expected value under a concave function is always greater than or equal to the expected value of the function.
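A quick numerical sanity check of this inequality with the (concave) natural logarithm:

```python
# Jensen's inequality for the concave log: log(E[x]) >= E[log(x)]
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=100_000)      # any distribution over positive values
print(np.log(x.mean()), np.log(x).mean())    # the first value is the larger one
```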

EM Formalization

The Expectation-Maximization algorithm is used with models that make use of latent variables. In general, we define a latent variable t that explains an observation x, with one instance of the latent variable per observation. So we can draw the following diagram:

[Diagram: each observation x_i is explained by its own latent variable t_i (Image by Author)]

Also because t explains the observation x, it defines the probability that the observation belongs to one of the clusters. So we can write:

p(t_i = c | Θ) = probability that observation x_i belongs to cluster c

Now the full likelihood of the observation can be written as a marginal likelihood (i.e., by marginalizing out t):

p(x_i | Θ) = Σ_{c=1..K} p(x_i, t_i = c | Θ) = Σ_{c=1..K} p(x_i | t_i = c, Θ) · p(t_i = c | Θ)

Now recall that we are trying to solve the following optimization problem:

max_Θ  Σ_{i=1..N} log p(x_i | Θ)

We also just introduced Jensen's inequality, which can be written as:

log( Σ_c α_c·y_c ) ≥ Σ_c α_c·log(y_c)    with α_c ≥ 0 and Σ_c α_c = 1

We want to use this inequality to help us define the lower bound. So, by identification with the marginal likelihood, we could write:

log p(x_i | Θ) = log Σ_{c=1..K} p(t_i = c | Θ) · p(x_i | t_i = c, Θ)

Which gives us:

log p(x_i | Θ) ≥ Σ_{c=1..K} p(t_i = c | Θ) · log p(x_i | t_i = c, Θ)

But if we think about it, the right-hand side of the inequality (the lower bound) is now a function of Θ, and Θ only. Θ is the only thing in this lower bound we have our hands on, because the rest depends on the data. So we need to do something else: we want a lower bound that depends on both Θ and a variational distribution q. Well, the trick is to introduce q like so:

log p(x_i | Θ) = log Σ_{c=1..K} q(t_i = c) · [ p(x_i, t_i = c | Θ) / q(t_i = c) ]

By multiplying and dividing by q, we don't change anything. And now we can use Jensen's inequality:

log p(X | Θ) = Σ_{i=1..N} log p(x_i | Θ) ≥ Σ_{i=1..N} Σ_{c=1..K} q(t_i = c) · log [ p(x_i, t_i = c | Θ) / q(t_i = c) ] = L(Θ, q)

And we succeeded: we built a lower bound for the full marginal log-likelihood that depends on both Θ and q. So we are now able to maximize the lower bound by alternating the two following steps:

Expectation step:

q^(k+1) = argmax_q L(Θ^(k), q)

We fix Θ, and we try to get the lower bound as close as possible to the log-likelihood; that is, we try to minimize:

log p(X | Θ^(k)) − L(Θ^(k), q)

With a few extra steps (that I will spare you here), we can demonstrate that:

q^(k+1)(t_i = c) = p(t_i = c | x_i, Θ^(k))

So, in order to find the next value (k+1) of the variational distribution q, we need to consider every observation x_i independently, and for every class we compute the probability of that observation belonging to the class, p(t_i|x_i, Θ). Recall that, by Bayes' rule:

p(t_i = c | x_i, Θ) = [ p(x_i | t_i = c, Θ) · p(t_i = c | Θ) ] / [ Σ_{j=1..K} p(x_i | t_i = j, Θ) · p(t_i = j | Θ) ]

Now we can make the link with Gaussian Mixture Models. We find in the above formula what we intuitively derived, i.e.:

p(t_i = c | x_i, Θ) = α_c · N(x_i | μ_c, Σ_c) / Σ_{j=1..K} α_j · N(x_i | μ_j, Σ_j)

Maximization step:

Θ^(k+1) = argmax_Θ L(Θ, q^(k+1))

We fix q, and we maximize the lower bound, which is defined as:

L(Θ, q) = Σ_{i=1..N} Σ_{c=1..K} q(t_i = c) · log p(x_i, t_i = c | Θ) − Σ_{i=1..N} Σ_{c=1..K} q(t_i = c) · log q(t_i = c)

Now the second term in the subtraction is not dependent on Θ, so we can write:

Θ^(k+1) = argmax_Θ Σ_{i=1..N} Σ_{c=1..K} q(t_i = c) · log p(x_i, t_i = c | Θ)

Now, the function to maximize is usually concave and easy to optimize. In the case of Gaussian Mixture Models, we use the MLE formulas of the Gaussian distributions, weighted by the responsibilities computed in the E-step.
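For reference, here is a sketch of what these closed-form m-step updates look like in code, for X of shape (N, d) and responsibilities gamma of shape (N, K) coming from the e-step. This is the standard weighted-MLE form, written from the general definition rather than taken from the article:

```python
# Sketch: closed-form m-step updates (responsibility-weighted Gaussian MLE)
import numpy as np

def m_step(X, gamma):
    """X: (N, d) observations, gamma: (N, K) responsibilities from the e-step."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                       # effective number of points per cluster
    alphas = Nk / N                              # updated mixture weights
    mus = (gamma.T @ X) / Nk[:, None]            # updated means, shape (K, d)
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                                         # (N, d)
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # (d, d)
    return alphas, mus, np.stack(covs)
```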

If you want to see the full derivations that allow us to get the closed-form expressions for the updates of the parameters in the m-step, I wrote them up in a dedicated article (in order not to overload this one).

If you read up to this point, congratulations! You should now have a very good grasp of GMM and EM. Optionally, if you want to understand why we don't use traditional methods to find the best set of parameters, read on.

So if we were to start from scratch, how would one perform maximum likelihood estimation in the case of Gaussian Mixture Models?

Direct optimization: A first approach

A way to find the maximum likelihood estimate is to set the partial derivatives of the log-likelihood with respect to the parameters to 0 and solve the equations. We would call that approach direct optimization.

∂/∂Θ [ Σ_{i=1..N} log ( Σ_{k=1..K} α_k · N(x_i | μ_k, Σ_k) ) ] = 0

As you can see, this approach is impractical because the sum over the Gaussian components appears inside the log, tying all the parameters together. For example, if K=2 and we try to solve the equation for α_1, we have:

Σ_{i=1..N} N(x_i | μ_1, Σ_1) / ( α_1 · N(x_i | μ_1, Σ_1) + α_2 · N(x_i | μ_2, Σ_2) ) = 0

We got through the first step using the chain rule… but we are not capable of expressing the parameter α_1 in terms of the other parameters. So we are stuck! We cannot solve this system of equations analytically. This means that we cannot find the global optimum, or at least a local one, in one step using analytical expressions.

Numerical optimization: A second approach

Ok, what else can we do? Well, we have to rely on a numerical optimization method. We could, for example, try to use our favorite stochastic gradient descent algorithm. How does it work?

Well, we initialize the vector of parameters Θ randomly and we iterate through the observations of the dataset. At step k, we compute the gradient of the log-likelihood for one selected observation. Then we update the parameter values by taking one step in the direction of that gradient (equivalently, in the opposite direction of the gradient of the negative log-likelihood), using a specific learning rate η; that is:

Θ^(k+1) = Θ^(k) + η · ∇_Θ log p(x_i | Θ^(k))

Using the chain rule, we can compute the partial derivatives for the remaining parameters (like we did above for α_1). For example, we would get the following result for μ_1:

∂ log p(x_i | Θ) / ∂μ_1 = [ α_1 · N(x_i | μ_1, Σ_1) / Σ_{k=1..K} α_k · N(x_i | μ_k, Σ_k) ] · Σ_1⁻¹ (x_i − μ_1)

Now you are going to tell me that the parameters are still dependent on each other. Well yes, but we are not solving equations anymore. To take a step, we evaluate the partial derivatives at the current location, with the current set of parameters. It is an iterative algorithm.

So let's say we are computing Θ¹. We have at our disposal Θ⁰ = (α_0, α_1, μ_0, μ_1, Σ_0, Σ_1), initialized randomly, and we have the first selected observation x_0. Given the formulas of the partial derivatives, we have all we need to compute Θ¹. And we proceed through the successive steps until convergence (that is, until the current step gets too small).

[Animation: gradient descent on the Beale function (Image by the Author)]

For example, I generated the above animation by simulating the optimization of the Beale function B with two parameters (x and y). We start at a random point, here around (-3.5, -3.5), and at each step, we update the parameters towards the minimum.

NB: Note that, in reality, the differentiation of the loss function performed by numerical frameworks like Tensorflow or PyTorch is not done the way we did it above. We don't use a set of hard-coded mathematical rules like we used to do in high school (what is called manual differentiation). There might not even be convenient formulas for the derivatives. It is done using automatic differentiation.
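As a tiny illustration of automatic differentiation (a generic PyTorch snippet, not something taken from the article):

```python
# Automatic differentiation: the gradient is obtained without any hand-written formula
import torch

mu = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(1.2)
log_density = -0.5 * (x - mu) ** 2      # log of an unnormalized unit-variance Gaussian
log_density.backward()                  # backpropagate to get d(log_density)/d(mu)
print(mu.grad)                          # equals (x - mu) = 0.7
```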

Ok, this looks good! So we are done? Well, not so fast! We forgot an important thing: we have to fulfill two constraints while solving this optimization problem. We are performing constrained optimization.

The first one is that the mixture weights α must be non-negative and sum up to 1. This makes the probability of observing a data point a proper probability density function; that is:

α_k ≥ 0 for all k,   and   Σ_{k=1..K} α_k = 1

To overcome this constraint, we can incorporate it into the optimization problem with the use of Lagrange multipliers, i.e., by reformulating the problem as:

max_{Θ, λ}  Σ_{i=1..N} log ( Σ_{k=1..K} α_k · N(x_i | μ_k, Σ_k) ) + λ · ( Σ_{k=1..K} α_k − 1 )

λ is added as an additional parameter to the vector Θ. Then we run our stochastic optimization procedure again and we are done! Well, not so fast! The real problem lies in the second constraint. Recall that the multivariate normal probability density function is written as:

N(x | μ, Σ) = 1 / ( (2π)^(d/2) · |Σ|^(1/2) ) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

But the matrix Σ cannot be arbitrary! It is a covariance matrix and must therefore respect a certain number of properties:

  • The diagonal terms are variances and so must be positive
  • The matrix must be symmetric
  • For two distinct predictors, the square of their covariance must be less than the product of their variances
  • The matrix must be invertible
  • The determinant must be positive

Those conditions are fulfilled when the matrix is symmetric positive semidefinite (in fact positive definite, since it must also remain invertible).
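A small sketch of what checking such a candidate matrix could look like (a generic numpy check, not part of the original article):

```python
# Sketch: check that a candidate covariance matrix is symmetric and positive definite
import numpy as np

def is_valid_covariance(S, tol=1e-10):
    S = np.asarray(S, dtype=float)
    symmetric = np.allclose(S, S.T)
    positive_definite = np.all(np.linalg.eigvalsh(S) > tol)   # all eigenvalues > 0
    return symmetric and positive_definite

print(is_valid_covariance([[1.0, 0.5], [0.5, 2.0]]))   # True
print(is_valid_covariance([[1.0, 3.0], [3.0, 1.0]]))   # False: eigenvalues 4 and -2
```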

And this is a much harder constraint to fulfill, and it is still an active research area. In fact, there is a whole subfield of convex optimization dedicated to it, called semidefinite programming. It really emerged as a discipline starting in the 90s, with methods like interior-point or augmented Lagrangian methods. But mixture models emerged before that and led to the Expectation-Maximization algorithm.

Don't worry if you did not understand all of the above (especially the part regarding the Lagrangian formulation). All you need to understand is that traditional methods for finding the maximum likelihood estimates of the parameters are not well suited to Gaussian mixture models.

Conclusion

In this article, we have had a thorough review of Gaussian Mixture Models and Expectation-Maximization, both visually (to give some insight) and more formally, with the full mathematical derivations.

This article has been pretty heavy on the math, but I think that if you managed to take the time, it is really worth it. Having a deeper insight into this kind of model makes you understand a lot of the different techniques that are spread out all over the field of machine learning and statistics. And next time you try to understand another model of the same caliber, it will be much easier. I promise you that!

In the meantime, take care of yourself and your loved ones!

Translated from: https://towardsdatascience.com/gaussian-mixture-models-and-expectation-maximization-a-full-explanation-50fa94111ddd
