http://blog.csdn.net/pipisorry/article/details/43271429
ABSTRACT
Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results.
In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics.
This paper takes a radically different approach: learning as humans do, i.e., retaining the results learned in the past and using them to help future learning. "When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web."
The proposed algorithm mines two forms of knowledge: must-links (meaning that two words should be in the same topic) and cannot-links (meaning that two words should not be in the same topic).
It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
Topic models have been widely used to mine topics from documents. However, a key weakness of topic models is that they need large amounts of data to provide reliable statistics for producing reasonable topics. In practice, many document collections lack that many documents, so the classic topic model LDA produces very poor topics on them. Knowledge-based topic models were proposed to let users supply some existing domain knowledge to guide the model toward better topics.
Introduction
Approaches to addressing the key weakness of topic modeling:
1. Inventing better topic models: This approach may be effective if a large number of documents are available. However, since topic models perform unsupervised learning, if the data is small, there is simply not enough information to provide reliable statistics to generate coherent topics. Some form of supervision or external information beyond the given documents is necessary.
2. Asking users to provide prior domain knowledge: An obvious form of external information is the prior knowledge of the domain from the user. For example, the user can input the knowledge in the form of must-links and cannot-links. A must-link states that two terms (or words) should belong to the same topic, e.g., price and cost. A cannot-link indicates that two terms should not be in the same topic, e.g., price and picture. Some existing knowledge-based topic models (e.g., [1, 2, 9, 10, 14, 15, 26, 28]) can exploit such prior domain knowledge to produce better topics. However, asking the user to provide prior domain knowledge can be problematic in practice because the user may not know what knowledge to provide and wants the system to discover it for him/her. It also makes the approach non-automatic.
3. Learning like humans (lifelong learning): We still use the knowledge-based approach but mine the prior knowledge automatically from the results of past learning. This approach works like human learning. We humans always retain the results learned in the past and use them to help future learning. However, this approach is very different from existing lifelong learning methods (see below).
Lifelong learning is possible in our context due to two key observations (which tell us how to find must-links and cannot-links):
1. Although every domain is different, there is a fair amount of topic overlap across domains. For example, every product review domain (cell phones, alarm clocks, laptops, etc.) has the topic of price, most electronic products share the topic of battery, and some also have the topic of screen. From the topics learned from these domains, we can mine frequently shared terms among the topics. For example, we may find price and cost frequently appearing together in some topics, which indicates that they are likely to belong to the same topic and thus form a must-link. Note that we have the frequency requirement because we want reliable knowledge.
2. From the previously generated topics from many domains, it is also possible to find that picture and price should not be in the same topic (a cannot-link). This can be done by finding a set of topics that have picture as a top topical term, in which the term price almost never appears at the top, i.e., the two terms are negatively correlated.
The proposed lifelong learning approach:
Phase 1 (Initialization): Given n prior document collections D = {D1, ..., Dn}, a topic model (e.g., LDA) is run on each collection Di ∈ D to produce a set of topics Si. Let S = ∪i Si, which we call the prior topics (or p-topics for short). It then mines must-links M from S using a multiple minimum supports frequent itemset mining algorithm.
Phase 2 (Lifelong learning): Given a new document collection Dt, a knowledge-based topic model (KBTM) with the must-links M is run to generate a set of topics At. Based on At, the algorithm finds a set of cannot-links C. The KBTM then continues, now guided by both must-links M and cannot-links C, to produce the final topic set At. We will explain why we mine cannot-links based on At in Section 4.2. To enable lifelong learning, At is incorporated into S, which is used to generate a new set of must-links M.
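The two phases above can be sketched as a short Python loop. Note this is only a structural sketch: `run_lda`, `run_kbtm`, `mine_must_links`, and `mine_cannot_links` are hypothetical callables standing in for LDA, the knowledge-based topic model, and the two knowledge miners.

```python
def lifelong_topic_modeling(prior_collections, new_collection,
                            run_lda, run_kbtm,
                            mine_must_links, mine_cannot_links):
    """Sketch of the two-phase lifelong learning procedure.

    All four callables are hypothetical stand-ins for the components
    described in the text, not a real API.
    """
    # Phase 1 (Initialization): run a topic model on every prior
    # collection Di and pool the resulting topics into the p-topic set S.
    S = []
    for D_i in prior_collections:
        S.extend(run_lda(D_i))
    M = mine_must_links(S)          # must-links via MS-FIM

    # Phase 2 (Lifelong learning): model the new collection Dt.
    A_t = run_kbtm(new_collection, must_links=M)   # first KBTM pass
    C = mine_cannot_links(S, A_t)                  # cannot-links need At
    A_t = run_kbtm(new_collection, must_links=M, cannot_links=C)

    # Retain the new topics so future tasks can mine knowledge from them.
    S.extend(A_t)
    return A_t, S
```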
Block diagram of the lifelong learning approach:
OVERALL ALGORITHM
MINING KNOWLEDGE
1. Mining Must-Link Knowledge
E.g., for a must-link such as price and cost, we should expect to see price and cost as topical terms in the same topic across many domains. Note that they may not appear together in every topic about price due to the special context of the domain or past topic modeling errors.
In practice, the top terms under a topic are expected to represent some similar semantic meaning. The lower-ranked terms usually have very low probabilities due to the smoothing effect of the Dirichlet hyper-parameters rather than true correlations within the topic, making them unreliable. Thus, in this work, only the top 15 terms are employed to represent a topic.
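Extracting a topic's top-15 representation is straightforward; a minimal sketch, assuming the topic is given as a term-to-probability mapping:

```python
def top_terms(topic_word_probs, n=15):
    """Return the n highest-probability terms of one topic.

    topic_word_probs: dict mapping term -> P(term | topic).
    Only these top terms represent the topic; lower-ranked terms
    mostly reflect Dirichlet smoothing rather than the topic itself.
    """
    ranked = sorted(topic_word_probs, key=topic_word_probs.get, reverse=True)
    return ranked[:n]
```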
Given a set of prior topics (p-topics) S, we find sets of terms that appear together in multiple topics using the data mining technique of frequent itemset mining (FIM). But a single minimum support is not appropriate:
different topics may have very different frequencies in the data. This is called the rare item problem.
We thus use multiple minimum supports frequent itemset mining (MS-FIM).
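The MS-FIM idea, restricted to 2-itemsets (candidate must-links are word pairs), can be sketched as follows. Each p-topic acts as one transaction of its top terms, and each term has its own minimum item support (MIS), so a rare-but-important term needs fewer co-occurrences than a common one. The `mis` values here are illustrative, not the paper's settings.

```python
from itertools import combinations

def mine_must_link_pairs(p_topics, mis, default_mis=3):
    """Sketch of MS-FIM restricted to word pairs.

    p_topics: list of topics, each given as its list of top terms
              (each topic is treated as one transaction).
    mis:      dict term -> minimum item support; giving rare terms a
              lower MIS is what avoids the rare item problem of a
              single global minimum support.
    A pair is frequent if its co-occurrence count across p-topics
    reaches the smaller MIS of its two terms.
    """
    counts = {}
    for topic in p_topics:
        for pair in combinations(sorted(set(topic)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return [pair for pair, c in counts.items()
            if c >= min(mis.get(pair[0], default_mis),
                        mis.get(pair[1], default_mis))]
```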
2. Mining Cannot-Link Knowledge
For a term w, there are usually only a few terms wm that share must-links with w while there are a huge number of terms wc that can form cannot-links with w.
However, for a new or test domain Dt, most of these cannot-links are not useful because the vocabulary size of Dt is much smaller than V. Thus, we focus only on those terms that are relevant to Dt.
We extract cannot-links from each pair of top terms w1 and w2 in each c-topic Atj ∈ At: cannot-link mining is targeted at each c-topic.
determine whether two terms form a cannot-link:
Let the number of prior domains that w1 and w2 appear in different p-topics be Ndiff and the number of prior domains that w1 and w2 share the same topic be Nshare. Ndiff should be much larger than Nshare.
We need to use two conditions or thresholds to control the formation of a cannot-link: {i.e., shared domains should be few and different ones many}
1. The ratio Ndiff / (Nshare + Ndiff) (called the support ratio) is equal to or larger than a threshold πc. This condition is intuitive because p-topics may contain noise due to errors of topic models.
2. Ndiff is greater than a support threshold πdiff . This condition is needed because the above ratio can be 1, but Ndiff can be very small, which may not give reliable cannot-links.
{E.g., screen and pad both appear rarely in p-topics, so Ndiff is small, but they also have no overlap (Nshare = 0), which makes the support ratio 1.}
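The two conditions combine into a simple check. A minimal sketch, where the threshold values πc and πdiff are hypothetical placeholders (the paper sets its own):

```python
def is_cannot_link(n_diff, n_share, pi_c=0.9, pi_diff=5):
    """Decide whether (w1, w2) forms a cannot-link.

    n_diff:  number of prior domains where w1 and w2 appear only in
             different p-topics.
    n_share: number of prior domains where w1 and w2 share a p-topic.
    pi_c, pi_diff: illustrative threshold values, not the paper's.
    """
    if n_diff + n_share == 0:
        return False
    support_ratio = n_diff / (n_share + n_diff)      # condition 1
    return support_ratio >= pi_c and n_diff > pi_diff  # condition 2
```

The second condition is what rejects the screen/pad case: the ratio is 1, but n_diff is too small to be reliable.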
AMC MODEL
handling incorrect knowledge:
The idea is that the semantic relationships reflected by correct must-links and cannot-links should also be reasonably induced by the statistical information underlying the domain collection.
Dealing with Issues of Must-Links
1. A term can have multiple meanings or senses: the transitivity problem.
2. Not every must-link is suitable for a domain.
Recognizing Multiple Senses
Given two must-links m1 and m2, if they share the same word sense, the p-topics that cover m1 should have some overlap with the p-topics that cover m2. For example, the must-links {light, bright} and {light, luminance} should mostly come from the same set of p-topics related to the semantic meaning "something that makes things visible" of light.
m1 and m2 are considered to share the same sense if the p-topics covering them overlap sufficiently (a threshold condition).
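This overlap test can be sketched directly on the covering p-topic sets. The overlap measure and the 0.5 cutoff below are illustrative assumptions; the paper uses its own threshold condition.

```python
def share_same_sense(cover1, cover2, threshold=0.5):
    """Sketch: do must-links m1 and m2 reflect the same word sense?

    cover1, cover2: sets of p-topic ids covering m1 and m2.
    The overlap ratio (relative to the smaller cover) and its cutoff
    are hypothetical choices made for illustration.
    """
    if not cover1 or not cover2:
        return False
    overlap = len(cover1 & cover2)
    return overlap / min(len(cover1), len(cover2)) >= threshold
```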
Detecting Possible Wrong Knowledge
We apply Pointwise Mutual Information (PMI), which is a popular measure of word associations in text.
i.e., PMI(w1, w2) > 0 ⇔ P(w1, w2) > P(w1)P(w2) ⇔ P(w1 | w2) > P(w1)
A positive PMI value implies a semantic correlation of terms, while a non-positive PMI value indicates little or no semantic correlation. Thus, we only consider the positive PMI values.
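A minimal document-level PMI sketch, estimating each probability as a fraction of documents containing the term(s):

```python
import math

def pmi(w1, w2, docs):
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ).

    docs: iterable of documents, each a set (or container) of terms.
    P(w) is the fraction of documents containing w; P(w1, w2) is the
    fraction containing both. PMI > 0 suggests semantic correlation,
    PMI <= 0 little or none.
    """
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0 or p1 == 0 or p2 == 0:
        return float("-inf")   # never co-occur: no correlation
    return math.log(p12 / (p1 * p2))
```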
Dealing with Issues of Cannot-Links
two cases:
a) A cannot-link contains terms that have semantic correlations. For example, {battery, charger} is not a correct cannot-link.
b) A cannot-link does not fit a particular domain.
We detect and balance cannot-links inside the sampling process. More specifically, we extend the Pólya urn model to incorporate cannot-links.
Proposed Gibbs Sampler
Pólya Urn Model
In the topic model context, a term can be seen as a ball of a certain color and a topic as an urn.
In the simple Pólya urn (SPU) model, when a ball of a particular color is drawn from an urn, the ball is put back into the urn along with a new ball of the same color. The content of the urn changes over time, which gives a self-reinforcing property known as "the rich get richer". This process corresponds to assigning a topic to a term in Gibbs sampling.
The generalized Pólya urn (GPU) model [22, 24] differs from SPU in that, when a ball of a certain color is drawn, two balls of that color are put back along with a certain number of balls of some other colors. These additional balls of some other colors added to the urn increase their proportions in the urn. This is the key technique for incorporating must-links, as we will see below.
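The SPU/GPU difference is easy to see in a toy simulation. This is only a sketch of the urn mechanics, not the model's actual sampler; the `promotion` mapping stands in for the must-link promotion described below.

```python
import random

def spu_draw(urn):
    """Simple Pólya urn step: draw a ball, put it back plus one more
    of the same color ('the rich get richer')."""
    color = random.choice(urn)
    urn.append(color)
    return color

def gpu_draw(urn, promotion):
    """Generalized Pólya urn step: drawing a color also adds balls of
    related colors, given by promotion[color] -> list of extra colors.
    This is how a must-link partner of a drawn word gets promoted."""
    color = random.choice(urn)
    urn.append(color)                      # the drawn color, plus one
    urn.extend(promotion.get(color, []))   # promote related colors
    return color
```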
The multi-generalized Pólya urn (M-GPU) model considers a set of urns in the sampling process simultaneously. M-GPU allows a ball to be transferred from one urn to another, enabling multi-urn interactions.
Thus, during sampling, the populations of several urns will evolve even if only one ball is drawn from one urn.
Proposed M-GPU Model
In M-GPU, when a ball is randomly drawn, certain numbers of additional balls of each color are returned to the urn.
Applying the idea to our case, when a term w is assigned to a topic k, each term w′ that shares a must-link with w is also assigned to topic k by a certain amount, which is decided by the matrix λw′,w (see the equation).
A must-link set of w is sampled from the must-link graph G via a sampling distribution, and its related terms are then promoted.
Dealing with the multiple-sense problem in M-GPU:
If a term w does not have must-links, then we do not have the multiple sense problem caused by must-links. If w has must-links, the rationale here is to sample a must-link (say m) that contains w to be used to represent the likely word sense from the must-link graph G (built in Section 5.1.1). The sampling distribution will be given in Section 5.3.3. Then, the must-links that share the same word sense with m, including m, are used to promote the related terms of w.
Dealing with possibly wrong must-links:
A factor parameter controls how much the M-GPU model should trust the word relationship indicated by PMI.
Dealing with cannot-links:
As M-GPU allows multi-urn interactions, when sampling a ball representing term w from a term urn UkW, we want to transfer the balls representing the cannot-terms of w, say wc (sharing cannot-links with w), to other urns (see Step 5 below), i.e., decreasing the probabilities of those cannot-terms under this topic while increasing their corresponding probabilities under some other topic. In order to correctly transfer a ball that represents term wc, it should be transferred to an urn which has a higher proportion of wc.
the M-GPU sampling scheme
Sampling Distributions
Because words are non-exchangeable under the M-GPU model, we take the same approach as that for GPU in [24], which approximates the true Gibbs sampling distribution by treating each word as if it were the last.
For each term wi in each document d:
Phase 1 (Steps 1-4 in M-GPU): calculate the conditional probability of sampling a topic for term wi.
a) Sample a must-link mi that contains wi
b) Create a set of must-links {m′} where m′ is either mi or a neighbor of mi in the must-link graph G.
c) The conditional probability of assigning topic k to term wi is defined as below:
where n−i is the count excluding the current assignment of zi, i.e., z−i. w refers to all the terms in all documents in the document collection Dt, and wi is the current term to be sampled, with a topic denoted by zi. nd,k denotes the number of times that topic k is assigned to terms in document d. nk,w refers to the number of times that term w appears under topic k. α and β are predefined Dirichlet hyper-parameters. K is the number of topics, and V is the vocabulary size. {m′v} is the set of must-links sampled for each term v following Phase 1 a) and b), which is recorded during the iterations.
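The equation itself did not survive in this copy. Given the variables just defined and the GPU-style approximation described above, it has the following form (a reconstruction; in AMC the promotion matrix λ is applied according to the must-link sets {m′v} sampled in steps a) and b)):

```latex
p(z_i = k \mid \mathbf{z}^{-i}, \mathbf{w}) \propto
  \frac{n^{-i}_{d,k} + \alpha}{\sum_{k'=1}^{K} n^{-i}_{d,k'} + K\alpha}
  \times
  \frac{\sum_{w'=1}^{V} \lambda_{w',w_i}\, n^{-i}_{k,w'} + \beta}
       {\sum_{v=1}^{V} \left( \sum_{w'=1}^{V} \lambda_{w',v}\, n^{-i}_{k,w'} + \beta \right)}
```

The first factor is the usual document-topic term of LDA's collapsed Gibbs sampler; the second replaces the raw topic-word count with counts promoted through λ, which is how must-link partners raise each other's probability under a topic.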
Phase 2 (Step 5 in M-GPU) deals with cannot-links.
Draw a cannot-term wc of wi and re-sample wc's topic, i.e., transfer wc to another topic in which wc has a higher proportion.
I() is an indicator function, which restricts the ball to be transferred only to an urn that contains a higher proportion of term wc. If no topic k has a higher proportion of wc than zc, then keep the original topic assignment, i.e., assign zc to wc.
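The transfer rule can be sketched as follows. For simplicity this picks the eligible topic with the highest proportion of wc, whereas the model samples among eligible topics; the proportion vector is assumed given (e.g., nk,wc normalized per topic).

```python
def transfer_cannot_term(wc, zc, proportions):
    """Step 5 sketch: move cannot-term wc out of its current topic zc.

    proportions[k]: wc's current proportion under topic k.
    The indicator I() only allows transfer to a topic with a HIGHER
    proportion of wc than zc; if none exists, zc is kept.  Choosing
    the argmax here is a simplification of sampling among candidates.
    """
    candidates = [k for k, p in enumerate(proportions)
                  if k != zc and p > proportions[zc]]
    if not candidates:
        return zc                      # no eligible urn: keep assignment
    return max(candidates, key=lambda k: proportions[k])
```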
ref: KDD 2014 - Zhiyuan (Brett) Chen - Mining Topics in Documents