Week1：

Guiding Questions

Develop your answers to the following guiding questions while watching the video lectures throughout the week.

What does a computer have to do in order to understand a natural language sentence?
What is ambiguity?
Why is natural language processing (NLP) difficult for computers?
What is bag-of-words representation?
Why is this word-based representation more robust than representations derived from syntactic and semantic analysis of text?
What is a paradigmatic relation?
What is a syntagmatic relation?
What is the general idea for discovering paradigmatic relations from text?
What is the general idea for discovering syntagmatic relations from text?
Why do we want to do Term Frequency Transformation when computing similarity of context?
How does BM25 Term Frequency transformation work?
Why do we want to do Inverse Document Frequency (IDF) weighting when computing similarity of context?

未完成：

已完成：

黄莉婷
http://blog.csdn.net/weixin_40962955/article/details/78828721
梁清源
http://blog.csdn.net/qq_33414271/article/details/78802272
http://www.jianshu.com/u/337e85e2a284
曾伟
http://www.jianshu.com/p/9e520d5ccdaa
程会林
http://blog.csdn.net/qq_35159009/article/details/78836340
余艾锶
http://blog.csdn.net/xy773545778/article/details/78829053
陈南浩
http://blog.csdn.net/DranGoo/article/details/78850788

Week2：

Guiding Questions
Develop your answers to the following guiding questions while watching the video lectures throughout the week.

What is entropy? For what kind of random variables does the entropy function reach its minimum and maximum, respectively? 1
What is conditional entropy? 2
What is the relation between conditional entropy H(X|Y) and entropy H(X)? Which is larger? 3
How can conditional entropy be used for discovering syntagmatic relations? 4
What is mutual information I(X;Y)? How is it related to entropy H(X) and conditional entropy H(X|Y)? 5
What’s the minimum value of I(X;Y)? Is it symmetric? 6
For what kind of X and Y, does mutual information I(X;Y) reach its minimum? For a given X, for what Y does I(X;Y) reach its maximum? 1
Why is mutual information sometimes more useful for discovering syntagmatic relations than conditional entropy?
What is a topic? 2
How can we define the task of topic mining and analysis computationally? What’s the input? What’s the output? 3
How can we heuristically solve the problem of topic mining and analysis by treating a term as a topic? What are the main problems of such an approach? 4
What are the benefits of representing a topic by a word distribution? 5
What is a statistical language model? What is a unigram language model? How can we compute the probability of a sequence of words given a unigram language model? 6
What is Maximum Likelihood estimate of a unigram language model given a text article? 1
What is the basic idea of Bayesian estimation? What is a prior distribution? What is a posterior distribution? How are they related with each other? What is Bayes rule? 2

未完成：陈南浩

已完成：
梁清源
http://blog.csdn.net/qq_33414271/article/details/78871154
程会林
https://www.jianshu.com/p/61614d406b0f
黄莉婷
http://blog.csdn.net/weixin_40962955/article/details/78877103
余艾锶
http://blog.csdn.net/xy773545778/article/details/78848613
曾伟
http://blog.csdn.net/qq_39759159/article/details/78882651

Week3：

Guiding Questions
Develop your answers to the following guiding questions while watching the video lectures throughout the week.

What is a mixture model? In general, how do you compute the probability of observing a particular word from a mixture model? What is the general form of the expression for this probability? 3
What does the maximum likelihood estimate of the component word distributions of a mixture model behave like? In what sense do they “collaborate” and/or “compete”? 4
Why can we use a fixed background word distribution to force a discovered topic word distribution to reduce its probability on the common (often non-content) words? 5
What is the basic idea of the EM algorithm? What does the E-step typically do? What does the M-step typically do? In which of the two steps do we typically apply the Bayes rule? Does EM converge to a global maximum? 6
What is PLSA? How many parameters does a PLSA model have? How is this number affected by the size of our data set to be mined? How can we adjust the standard PLSA to incorporate a prior on a topic word distribution? 1
How is LDA different from PLSA? What is shared by the two models? 2

未完成：余艾锶
已完成：
程会林：公式归一化为什么不同？
https://www.jianshu.com/p/bcef1ad7a530?utm_campaign=haruki&utm_content=note&utm_medium=reader_share&utm_source=qq
曾伟
http://www.cnblogs.com/Negan-ZW/p/8179076.html
梁清源
http://blog.csdn.net/qq_33414271/article/details/78938301
黄莉婷 LDA 的原理
http://blog.csdn.net/weixin_40962955/article/details/78941383#t10
陈南浩
http://blog.csdn.net/DranGoo/article/details/78968749

Week4：

Guiding Questions
Develop your answers to the following guiding questions while watching the video lectures throughout the week.

What is clustering? What are some applications of clustering in text mining and analysis? 3
How can we use a mixture model to do document clustering? How many parameters are there in such a model? 4
How is the mixture model for document clustering related to a topic model such as PLSA? In what way are they similar? Where are they different? 5
How do we determine the cluster for each document after estimating all the parameters of a mixture model? 6
How does hierarchical agglomerative clustering work? How do single-link, complete-link, and average-link work for computing group similarity? Which of these three ways of computing group similarity is least sensitive to outliers in the data? 1
How do we evaluate clustering results? 2
What is text categorization? What are some applications of text categorization? 3
What does the training data for categorization look like?
How does the Naïve Bayes classifier work? 4
Why do we often use logarithm in the scoring function for Naïve Bayes? 5