Author: Li Dong; Time: 14 April 2020
The following are my rough notes on the Dirichlet Process video by Professor Richard Xu. Any errors are mine.
A motivating example
Let's assume that the above data comes from a GMM, X = {x1, x2, …, xN}. Suppose the samples come from a mixture of K Gaussian distributions; then the joint probability of all samples is

P(X) = ∏_{i=1}^{N} ∑_{k=1}^{K} αk · N(xi | μk, Σk).
Now the question is: how do we determine K? The physical meaning of K is the number of Gaussian mixture components. (In EM, K is a constant.) In this model the parameters are θ = {μ1…μK, Σ1…ΣK, α1…αK}, and at this point we cannot tell the value of K from the graph above.
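As a quick numerical illustration of this joint likelihood, here is a minimal sketch; the weights, means, standard deviations, and data points below are made-up values for demonstration only:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D GMM with K = 2 components (illustrative values only)
alpha = np.array([0.4, 0.6])    # mixture weights, sum to 1
mu    = np.array([-2.0, 3.0])   # component means
sigma = np.array([1.0, 0.5])    # component standard deviations

x = np.array([-1.8, 0.2, 2.9, 3.1])  # toy data

# P(X) = prod_i sum_k alpha_k * N(x_i | mu_k, sigma_k^2)
per_point = (alpha * norm.pdf(x[:, None], mu, sigma)).sum(axis=1)
log_joint = np.log(per_point).sum()
print(log_joint)
```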
Take K as a parameter:
θ = {μ1…μK, Σ1…ΣK, α1…αK, K}, and we solve argmax P(X). In this case the solution must be K = N (N is the total number of data points), because the likelihood is maximized when K = N, with every point given its own component. This clustering is obviously not what we were hoping for.
We want the model to determine K automatically; say it is a function of N, K = f(N).
(Under the Dirichlet Process, E[K] ∝ log N.) We have a set of data X = {x1, x2, …, xN}. Each data point comes from a distribution with its own parameter θ, so the data correspond to a set of parameters {θ1, ⋯, θN}; now we need a distribution to explain these parameters.
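As a quick check of the E[K] ∝ log N claim above, here is a sketch using the known closed form E[K] = ∑_{i=1}^{N} α/(α + i − 1) for the expected number of distinct clusters, which will reappear in the Chinese Restaurant Process at the end of these notes:

```python
import numpy as np

# Expected number of distinct clusters after N draws:
# E[K] = sum_{i=1}^{N} alpha / (alpha + i - 1), which grows like alpha * log(N).
alpha = 1.0
for N in (10, 100, 1000, 10000):
    i = np.arange(1, N + 1)
    expected_K = (alpha / (alpha + i - 1)).sum()
    print(N, expected_K, alpha * np.log(N))
```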
Here is the assumption: each parameter is drawn from some distribution H, i.e. θi ~ H.
If this H is a continuous distribution, then samples drawn from H can never be exactly the same (because if θi, θj ~ H independently, then P(θi = θj) = 0). In that case K = N, which brings us back to where we don't want to be. So we need a discrete distribution G with θi ~ G; meanwhile, we want G and H to be similar. At this point we introduce the Dirichlet Process to construct this G:

G ~ DP(α, H).
Here H, as above, is called the base measure; α is a scalar (α > 0) that describes the degree of dispersion of G. When α = 0, G is the most discrete case and has only one value; when α = ∞, G = H. This is exactly what we expected!
In a real situation, in the Dirichlet Process, H can also be a continuous distribution; G is a random distribution drawn from the DP.
The characteristics of this G can be viewed as follows. First, we divide the support of G into regions with several vertical lines. It can be divided into any number of regions, and the size of each region is arbitrary. We name the regions; for example, suppose we divide it into K regions a1, …, aK.
Since G is a random measure, the total weight of each region (that is, the total length of the vertical lines in it), written G(a1), G(a2), …, is also random; these weights have a probabilistic nature, and together they obey a Dirichlet distribution:

(G(a1), …, G(aK)) ~ Dir(αH(a1), …, αH(aK)).

This is the definition of the Dirichlet Process. Note that G(a1) here represents the total weight of G in region a1, and H(a1) represents the total weight of H in region a1. In other words, under each DP sample, every finite partition is subject to a Dirichlet distribution.
The properties of the Dirichlet distribution are as follows: if (x1, …, xK) ~ Dir(α1, …, αK) and s = α1 + ⋯ + αK, then

E[xk] = αk / s,  Var[xk] = αk(s − αk) / (s²(s + 1)).
So, for the DP above, substituting αk = αH(ak) (here s = α, since H(a1) + ⋯ + H(aK) = 1), we can write

E[G(ak)] = H(ak),  Var[G(ak)] = H(ak)(1 − H(ak)) / (α + 1).
We find that α does not appear in the mean, which means α has no effect on the mean. Next we look at the two extreme cases. When α = ∞, the variance is 0; in other words, the measure of G in every region equals the measure of H in that region, i.e. G = H. When α = 0, Var[G(ak)] = H(ak)(1 − H(ak)); the distribution of G(ak) is then like a Bernoulli variable, meaning the mass is either entirely in the region or not in it at all, and G is the most discrete case. This matches the description of α above.
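A small Monte Carlo sanity check of these mean and variance formulas; this is a sketch, and the partition masses H(a1..a3) are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed partition with base-measure masses H(a1), H(a2), H(a3)
H = np.array([0.2, 0.3, 0.5])

for alpha in (0.5, 5.0, 50.0):
    G = rng.dirichlet(alpha * H, size=200_000)  # rows ~ Dir(alpha * H)
    print(alpha,
          G.mean(axis=0),               # should be close to H
          G.var(axis=0),                # should be close to H*(1-H)/(alpha+1)
          H * (1 - H) / (alpha + 1))
```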
First, let us explain what construction means. For an ordinary distribution in statistics we have a probability density function, and we sample individual points from it. In the Dirichlet Process, however, each sample is a random measure, which contains countless points and their corresponding weights. Sampling such an object is very difficult, and the definition alone does not give us a sampling method. Therefore we need a construction method that can generate such a measure.
Earlier we said that each DP sample is a random measure consisting of countless points and their corresponding lengths. In statistics such a point is called an atom. So one sample from the DP is obtained by drawing countless atoms and their weights. Let's see how to sample an atom and its weight step by step. We have a base distribution H. First, we randomly draw a value θ from H; this value corresponds to an atom, and its weight is the height of the vertical line at that position. The atom of the first sample is θ1:

θ1 ~ H.
Next we need to determine the height of the vertical line, which is the weight corresponding to this atom. We first draw a value from a Beta distribution with parameters (1, α):

β1 ~ Beta(1, α).
The horizontal line in the figure above is a segment of length 1, with left endpoint 0 and right endpoint 1. We randomly select an atom, and its weight is obtained by first drawing a result β1 from Beta(1, α); the weight is π1 = β1.
For the second sample we again randomly draw an atom θ2, and its weight is obtained by first drawing a result β2 from Beta(1, α); its weight is then π2 = (1 − π1)β2. This means taking the segment from the end of the first draw (at β1) to the endpoint 1 as a new segment and drawing a new proportion of it as the new weight; the position of the new weight must therefore lie between β1 and 1, as the picture shows. In other words, the result of the second atom draw is:

π2 = β2(1 − β1).
Subsequent samples (atoms) continue to be drawn in the above manner, and all the atoms with their weights together constitute one draw of G. In general,

βk ~ Beta(1, α),  πk = βk ∏_{j=1}^{k−1} (1 − βj).

When α = 0, E[β1] = 1; in other words all the weight is on the first atom and the other weights are all 0, which is the most discrete case. When α = ∞, E[β1] = 0; then every atom's weight is close to 0, which means G = H. Therefore, if G is a sample of the DP, it is composed of countless atoms and their weights, and G can be written as:

G = ∑_{k=1}^{∞} πk δ_{θk}.
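Here is a minimal sketch of this stick-breaking construction, truncated to a finite number of atoms; the function name and the choice of H = N(0, 1) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, H_sampler, n_atoms=1000):
    """Truncated stick-breaking draw of G ~ DP(alpha, H).

    Returns atoms theta_k ~ H and weights pi_k = beta_k * prod_{j<k}(1 - beta_j),
    with beta_k ~ Beta(1, alpha). n_atoms truncates the infinite sum.
    """
    betas = rng.beta(1.0, alpha, size=n_atoms)
    # remaining stick length before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    atoms = H_sampler(n_atoms)
    return atoms, weights

# Base measure H = N(0, 1); any distribution works
atoms, weights = stick_breaking(alpha=2.0, H_sampler=lambda n: rng.normal(0, 1, n))
print(weights[:5], weights.sum())  # weights sum to ~1 for a long enough truncation
```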
In the Dirichlet Process Mixture Model, we now have data x1, …, xn, and each data point is generated from a distribution with its own parameter; we denote these parameters by θ1, …, θn. All these parameters should be generated from a discrete measure G, and the G generated from the Dirichlet Process can be used in this place:

G ~ DP(α, H),  θi ~ G.
The question now is: given the observed θ1, …, θn, what is the posterior of G?
The relationship between the Dirichlet distribution and the Multinomial distribution (conjugacy): if p ~ Dir(α1, …, αK) and counts (n1, …, nK) are observed from Mult(p), then

p | n ~ Dir(α1 + n1, …, αK + nK).
The relationship between the Dirichlet distribution and the Dirichlet Process is the definition above: for any partition a1, …, aK,

(G(a1), …, G(aK)) ~ Dir(αH(a1), …, αH(aK)).
Bringing the conjugacy property into the above: with nk the number of θi that fall in region ak,

(G(a1), …, G(aK)) | θ1, …, θn ~ Dir(αH(a1) + n1, …, αH(aK) + nK),

and since this holds for every partition,

G | θ1, …, θn ~ DP(α + n, (αH + ∑_{i=1}^{n} δ_{θi}) / (α + n)).
And the posterior base measure decomposes as:

(α / (α + n)) H + (1 / (α + n)) ∑_{i=1}^{n} δ_{θi}.
In the formula above, the first part is a continuous base measure and the latter is a discrete measure. This form is called spike and slab in statistics.
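A minimal sketch of drawing from this spike-and-slab posterior base measure; the function name and the toy observed values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_from_posterior_base(alpha, observed_thetas, H_sampler):
    """Draw one theta from (alpha * H + sum_i delta_{theta_i}) / (alpha + n):
    the continuous 'slab' H with probability alpha / (alpha + n),
    otherwise a discrete 'spike' at one of the observed values."""
    n = len(observed_thetas)
    if rng.random() < alpha / (alpha + n):
        return H_sampler()                    # new draw from H
    return observed_thetas[rng.integers(n)]   # reuse an observed atom

observed = [0.7, 0.7, -1.2]                   # toy observed parameters
draws = [sample_from_posterior_base(1.0, observed, lambda: rng.normal())
         for _ in range(10)]
print(draws)
```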
Suppose a Chinese restaurant has unlimited tables. The first customer sits at the first table after arriving. When the second customer comes, he can choose to sit at the first table or at a new one. Suppose that when the (n+1)-th customer arrives, there are already k occupied tables with n1, n2, …, nk customers respectively; then the (n+1)-th customer sits at the i-th table with probability ni / (n + α), and at a new table with probability α / (n + α). The mathematical expression of the Chinese Restaurant Process is as follows. Suppose θ1, …, θn are generated from the same distribution; find P(θi | θ-i), where θ-i = {θ1, …, θi-1, θi+1, …}. (I was a bit confused here: shouldn't it be {θ1, …, θi-1}? In fact, because the θ's are exchangeable, conditioning on all the other values gives the same form as conditioning only on the earlier ones.) Suppose W is the parameter of this distribution, so:

P(θi | θ-i) = ∫ P(θi | W) P(W | θ-i) dW.
In this case we do not care about the actual value of θi, only which class it belongs to; in the corresponding CRP this is which table the i-th customer goes to. So we introduce {z1, …, zn}, in one-to-one correspondence with {θ1, …, θn}, where zi indicates which class θi is in; in the CRP it is the label of the table the customer goes to. So:

P(zi = m | z-i) = ∫ P(zi = m | p) P(p | z-i) dp,

where p = (p1, …, pK) are the class proportions with prior p ~ Dir(α/K, …, α/K).
Because the Dirichlet distribution is the conjugate prior of the Multinomial distribution, we know:

p | z-i ~ Dir(α/K + n_{1,-i}, …, α/K + n_{K,-i}).
Regarding the coefficient a: this coefficient arises when we divide points into categories and care only about the class sizes, not the labels. In the Dirichlet process the class labels themselves carry no meaning, so assignments that differ only by relabelling are equivalent; we therefore remove this coefficient and substitute into the expression above,
so:

P(zi = m | z-i) = (n_{m,-i} + α/K) / (n − 1 + α).
Here n_{m,-i} means the number of zj among z-i equal to m, and m ranges from 1 to k (the occupied classes). Letting K → ∞, the α/K term vanishes for occupied classes, and summing m over 1 to k:

∑_{m=1}^{k} n_{m,-i} / (n − 1 + α) = (n − 1) / (n − 1 + α),

so the remaining probability, α / (n − 1 + α), is the probability of opening a new class.
That means:

θi | θ-i ~ (α / (n − 1 + α)) H + ∑_{m=1}^{k} (n_{m,-i} / (n − 1 + α)) δ_{θ*m},

where θ*1, …, θ*k are the distinct values among θ-i.
This result is called the Chinese Restaurant Process.
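A minimal sketch of simulating the Chinese Restaurant Process as described above; the function name and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def chinese_restaurant_process(n_customers, alpha):
    """Seat customers one by one: the i-th customer sits at occupied table m
    with probability n_m / (i - 1 + alpha), or at a new table with
    probability alpha / (i - 1 + alpha)."""
    counts = []        # number of customers at each occupied table
    assignments = []
    for i in range(1, n_customers + 1):
        probs = np.array(counts + [alpha], dtype=float) / (i - 1 + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = chinese_restaurant_process(1000, alpha=1.0)
print(len(counts))  # number of occupied tables, roughly alpha * log(1000)
```

Running this repeatedly shows the number of occupied tables growing like α log N, consistent with E[K] ∝ log N from the beginning of these notes.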