Directions in the development of topic models


Combine semantic variables (NLP) with topological variables (SNA) to predict some other semantic variables.

LDA and its extensions can be used to model the evolution of topics over time, to model the connections among topics, and to predict links among objects in a network. Topic modeling is a case study in machine learning rather than a field in itself; it draws on several different concepts, including Bayesian statistics, time series analysis, hierarchical models, Markov chain Monte Carlo (MCMC), Bayesian non-parametric statistics, and sparsity. In LDA, a document is represented as a mixture of topics (a hypothetical quantity that captures content clustering), and a topic is a distribution over the words in a vocabulary.


There are established ways of estimating the parameters of the model as well as the topic assignments, including mean field variational methods, expectation propagation, Gibbs sampling, collapsed Gibbs sampling, collapsed variational Bayes, and online variational Bayes. Each of these estimation methods has its own advantages and disadvantages. Blei showed that LDA and pLSI have a lot in common. Unlike LDA, pLSI uses maximum likelihood estimation (via the EM algorithm) for its parameters and tends to overfit badly. The hyperparameter α adds regularization to the θ parameter in the LDA model.
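To make this concrete, here is a minimal sketch of fitting LDA with gensim, which implements online variational Bayes; the toy corpus and the hyperparameter values are made up for illustration and are not from the talk.

```python
# Minimal sketch: fitting LDA with gensim's online variational Bayes.
# The toy corpus and parameter values below are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["galaxy", "star", "telescope", "orbit"],
    ["neuron", "brain", "cortex", "synapse"],
    ["star", "orbit", "planet", "telescope"],
]

dictionary = Dictionary(docs)                      # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words counts

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,       # K is fixed a priori in plain LDA
    alpha=0.01,         # document-topic hyperparameter (on theta)
    eta=0.01,           # topic-word hyperparameter
    passes=10,
)

for k in range(2):
    print(lda.show_topic(k, topn=4))   # top words per topic
```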


A lot of preprocessing must be performed before computing a topic model. First, we should remove stopwords, words that provide essentially no clues to the content of the text. If we leave stopwords in the corpus when computing the model, we may end up with meaningless topics described only by stopwords, because of their high probability. Second, Blei mentioned that stemming is a good idea, but modern stemming algorithms tend to be too aggressive; if resources allow, I think it would be useful to have humans manually reduce words to their roots. Multiword phrases such as “black hole” are also an issue. With sufficient resources, one could ask human labelers to identify these phrases and recode each as a single word by replacing the space between words with an underscore. Hanna Wallach (U. Mass) has a paper that describes how to identify multiword phrases using n-grams, and Blei has a similar paper that discusses an algorithm called TurboTopics. He also mentioned that a standard statistical hypothesis test such as chi-squared, a permutation test, or a nested hypothesis test would also be sufficient, though inefficient; I have not thought through how this would work, however. Finally, remove rare words, because they can lead to local optima in the likelihood surface and make computation inefficient.
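A rough sketch of these preprocessing steps (stopword removal, multiword phrase detection, rare-word filtering) using gensim; the stopword list, the thresholds, and the toy documents are all illustrative choices, not recommendations from the talk.

```python
# Sketch of the preprocessing steps described above, using gensim.
# The stopword list, thresholds, and toy documents are illustrative.
from gensim.corpora import Dictionary
from gensim.models.phrases import Phrases, Phraser

stopwords = {"the", "a", "of", "and", "is", "in"}

raw_docs = [
    "the black hole is a region of spacetime",
    "a black hole and the event horizon",
    "the expansion of the universe is accelerating",
]

tokenized = [[w for w in doc.lower().split() if w not in stopwords]
             for doc in raw_docs]

# Detect multiword phrases ("black hole" -> "black_hole") from bigram counts.
bigram = Phraser(Phrases(tokenized, min_count=1, threshold=1))
tokenized = [bigram[doc] for doc in tokenized]

# Drop rare words (here: words appearing in fewer than 2 documents).
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=2, no_above=1.0)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
```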


Some hairy details. One of the parameters that makes LDA useful is α, a hyperparameter that determines the sparsity of draws from the underlying Dirichlet distribution. α is typically a small number; Blei mentioned that 0.01 is a good a priori value. As α gets larger, the distribution of topics tends toward the uniform distribution (each topic equally likely), and as α approaches 0 we get sparser draws, meaning more peaked topic probabilities. Setting α ridiculously small (e.g. 0.001) may yield a single topic dominating the model. α can be chosen by hand, or we can fit it to the data using cross-validation or some other method. He also discussed the parameter η, the analogous hyperparameter for the topic-word distributions.
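A small numerical illustration of how α controls sparsity; K and the α values below are arbitrary.

```python
# Demonstration of how alpha controls the sparsity of Dirichlet draws
# (K = 5 topics; the alpha values are just examples).
import numpy as np

rng = np.random.default_rng(0)
K = 5

for alpha in (10.0, 1.0, 0.1, 0.01):
    theta = rng.dirichlet(alpha * np.ones(K))
    print(f"alpha={alpha:>5}: {np.round(theta, 3)}")
    # Large alpha -> near-uniform theta; small alpha -> mass piles on few topics.
```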


Directions in the development of topic models:

The beauty of LDA is that it can be embedded in many more complicated models. Some applications of these extensions include word sense, graphs and hierarchies. Before delving into specifics, there are a couple of changes to the LDA model that motivate the next topics.

  1. The probability of observing a word w given a set of topics β and a set of topic labels z, P(w|β,z), is multinomial. This distribution can be changed depending on what we are modeling; for count data, for example, P(w|β,z) can be Poisson, although this drastically changes the model. In LDA, the multinomial P(w|β,z) is convenient because the Dirichlet distribution is its conjugate prior.
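For concreteness, here is a minimal sketch of the LDA generative process for a single document, showing where the multinomial P(w|β,z) enters; the dimensions and hyperparameter values are made up.

```python
# Generative process of LDA for a single document (toy dimensions).
import numpy as np

rng = np.random.default_rng(1)
K, V, N = 3, 20, 15          # topics, vocabulary size, words in the document
alpha, eta = 0.1, 0.01

beta = rng.dirichlet(eta * np.ones(V), size=K)   # topic-word distributions
theta = rng.dirichlet(alpha * np.ones(K))        # document-topic proportions

words = []
for n in range(N):
    z = rng.choice(K, p=theta)                   # topic label for word n
    w = rng.choice(V, p=beta[z])                 # P(w | beta, z) is multinomial
    words.append(w)
```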

Correlated Topic Model. In LDA, all topics are considered independent of each other, which is usually unrealistic. CTM allows the topics to be correlated. For example, a paper classified as being about calculus is more likely to also be classified as being about physics than about sewing. Blei mentioned that CTM allows for better prediction, likely because it is more realistic, and that it is also more robust to overfitting. The main distinction from LDA is that θ follows the logistic normal distribution instead of the Dirichlet distribution.
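A minimal sketch of that change: draw from a logistic normal (a Gaussian pushed through the softmax) rather than a Dirichlet, so a full covariance matrix can encode correlations between topics. The values of μ and Σ below are arbitrary.

```python
# Sketch of the CTM draw for theta: Gaussian with full covariance, then softmax.
# mu and Sigma are illustrative values; topics 0 and 1 are positively correlated.
import numpy as np

rng = np.random.default_rng(2)
K = 3
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.8, -0.5],
                  [0.8, 1.0, -0.4],
                  [-0.5, -0.4, 1.0]])

eta = rng.multivariate_normal(mu, Sigma)         # Gaussian draw
theta = np.exp(eta) / np.exp(eta).sum()          # softmax -> topic proportions
```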

Dynamic Topic Model. DTM models how each individual topic changes over time. One example Blei showed involved a topic that could be labeled “technology”: in the late 1700s this topic contained words like “coal” and “steel” (I am making it up from memory…probably badly…bear with me), and in 2011 it contained words like “silicon” and “solar”. The main distinction from LDA is two-fold: the topic at time t is normally distributed with the topic at time t-1 as its mean and some variance, that is,

β_{t,k} | β_{t-1,k} ~ N(β_{t-1,k}, σ²I)

and likewise the topic proportions drift over time,

α_t | α_{t-1} ~ N(α_{t-1}, δ²I)

so the topics evolve as Gaussians in natural (log) parameter space and are mapped back to distributions over words, instead of everything being drawn from static Dirichlet/multinomial distributions as in LDA.

A limitation of DTM is that it does not handle the death of a topic gracefully.
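A minimal simulation of the Gaussian drift described above for a single topic; the vocabulary size, number of time steps, and variance are illustrative values, not from the talk.

```python
# Simulate how one topic's word distribution drifts over time in DTM.
# beta is the topic's natural (log-space) parameter; the word distribution
# is obtained by softmax. V, T, and sigma are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(3)
V, T, sigma = 10, 5, 0.5

beta = rng.normal(size=V)            # topic at time t = 0
for t in range(1, T):
    beta = rng.normal(loc=beta, scale=sigma)        # beta_t ~ N(beta_{t-1}, sigma^2 I)
    word_dist = np.exp(beta) / np.exp(beta).sum()   # map back to the simplex
    print(f"t={t}:", np.round(word_dist, 3))
```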

Supervised LDA. In sLDA, we associate each document with an external variable. For example, a document may be a Yelp review containing text, and the external variable associated with the review may be the number of stars in the rating. sLDA uses the estimated topics as regressors to predict this external variable Y. Various types of regression can be performed, from standard linear regression to generalized linear models (GLMs); the Yelp example would likely use an ordered logit model for Y.
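As a rough illustration of the idea (not sLDA proper, which fits the topics and the regression jointly), one can regress an external variable on estimated topic proportions; everything below is synthetic.

```python
# Two-stage shortcut that mimics the sLDA setup: topic proportions as regressors
# for an external response Y (e.g., Yelp stars). All data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
D, K = 100, 5
theta = rng.dirichlet(0.1 * np.ones(K), size=D)   # stand-in for topic proportions
stars = theta @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(scale=0.2, size=D)

reg = LinearRegression().fit(theta, stars)        # topics as regressors for Y
print(np.round(reg.coef_, 2))
```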

Relational Topic Models. RTM applies the sLDA idea to every pair of documents in a corpus and attempts to use content to predict connectedness in a graph. For example, given the content of my Facebook profile, one could use sLDA to predict what kind of reaction I would have to an ad (i.e. click or no click), and this could be used for targeted ad serving or any other type of recommendation engine. Think collaborative filtering! RTM is also good for certain types of data that have spatial/geographic dependencies.
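A hedged sketch of the intuition: predict a link between two documents from the element-wise product of their topic proportions. The logistic regression here only stands in for RTM's link probability function, and all data is synthetic.

```python
# Toy link prediction from topic proportions, in the spirit of RTM.
# Features are the Hadamard product of two documents' topic mixes;
# the "connected" labels are synthetic for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
D, K = 60, 4
theta = rng.dirichlet(0.2 * np.ones(K), size=D)

features, links = [], []
for i in range(D):
    for j in range(i + 1, D):
        f = theta[i] * theta[j]              # element-wise product of topic mixes
        features.append(f)
        links.append(int(f.sum() > 0.3))     # synthetic "connected" label

clf = LogisticRegression().fit(np.array(features), np.array(links))
```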

Ideal Point Topic Models were barely touched upon since we were running short on time (although we voted to extend the session by 30 mins and Blei happily obliged). They seem particularly useful in political science for predicting roll call votes.

Bayesian Non-Parametric Models are a hot topic but are too complicated to describe here. In LDA, the number of topics is determined a priori and remains fixed throughout the model. In real life, topics can be “born” and can “die” off, and we may not know a priori how many topics to use. One can model the latter situation as a Chinese Restaurant Process, where each table is associated with a topic. Furthermore, a Chinese Restaurant Franchise can be used for modeling hierarchies (hLDA). In the CRF, there is a corpus-level restaurant where each table corresponds to a parameter and a topic (the shared dishes), and each document has its own Chinese restaurant where each table is associated with a customer in the corpus-level restaurant. Blei recommended a book by Hjort.
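A minimal simulation of the Chinese Restaurant Process seating rule, showing that the number of tables (topics) is not fixed a priori; the concentration parameter and number of customers are arbitrary.

```python
# Chinese Restaurant Process: customer n sits at an existing table with
# probability proportional to its occupancy, or opens a new table with
# probability proportional to gamma (values here are illustrative).
import numpy as np

rng = np.random.default_rng(6)
gamma, N = 1.0, 50
table_counts = []                       # number of customers per table

for n in range(N):
    probs = np.array(table_counts + [gamma], dtype=float)
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    if choice == len(table_counts):     # open a new table (a new "topic")
        table_counts.append(1)
    else:
        table_counts[choice] += 1

print("tables:", table_counts)          # the number of tables grows with the data
```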

Algorithms. The last few minutes were dedicated to discussing inference algorithms for LDA, particularly Gibbs sampling and variational Bayes. Gibbs sampling is very simple to implement, though Blei stated that it does not work for DTM or CTM because the assumptions of conjugacy (multinomial/Dirichlet) are violated. Variational Bayes is more difficult to implement, but handles non-conjugacy in CTM and DTM much better.
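To make the contrast concrete, here is a bare-bones collapsed Gibbs sampler for LDA; the toy corpus, K, α, η, and the number of sweeps are all illustrative.

```python
# Bare-bones collapsed Gibbs sampler for LDA, to make the update concrete.
# The toy corpus, K, alpha, eta, and iteration count are all illustrative.
import numpy as np

rng = np.random.default_rng(7)
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 1, 0]]   # word ids per document
V, K, alpha, eta = 6, 2, 0.1, 0.01

n_dk = np.zeros((len(docs), K))     # doc-topic counts
n_kw = np.zeros((K, V))             # topic-word counts
n_k = np.zeros(K)                   # total words assigned to each topic
z = []                              # current topic assignment of every token

for d, doc in enumerate(docs):      # random initialization
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[d].append(k)
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for it in range(200):               # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]             # remove the token's current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # collapsed conditional: p(z = k | everything else), up to a constant
            p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k             # add the token back under the new assignment
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```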


Application-oriented extensions that can improve LDA's classification ability:

1. Use more explicit user feedback to improve LDA's classification, making topic assignments smarter and more accurate by integrating topic labels, topical n-gram models, and entity detection (extracting meaningful terms from documents via capitalization/entity detection). This talk was presented by David Andrzejewski (@davidandrzej).

2. Use external metadata together with topics (LDA) to model users' ratings of services via localized factor models.

3. Use user preferences and factor weights (i.e. how important is free wireless Internet in a hotel room to you?) to predict service ratings.

4. Discover document/topic transition paths: determining the network process that created a piece of text (who copied from whom?).

5. Combine LDA with other sources of information, such as tags, WordNet features, and sentiment.

6. Model the temporal evolution of topics and interests.

7. Model the temporal evolution of lexical relationships.




