Week 4 (Text Mining)

Guiding Questions
Develop your answers to the following guiding questions while watching the video lectures throughout the week.

  1. What is clustering? What are some applications of clustering in text mining and analysis?
  2. How can we use a mixture model to do document clustering? How many parameters are there in such a model?
  3. How is the mixture model for document clustering related to a topic model such as PLSA? In what way are they similar? Where are they different?
  4. How do we determine the cluster for each document after estimating all the parameters of a mixture model?
  5. How does hierarchical agglomerative clustering work? How do single-link, complete-link, and average-link work for computing group similarity? Which of these three ways of computing group similarity is least sensitive to outliers in the data?
  6. How do we evaluate clustering results?
  7. What is text categorization? What are some applications of text categorization?
  8. What does the training data for categorization look like?
  9. How does the Naïve Bayes classifier work?
  10. Why do we often use logarithm in the scoring function for Naïve Bayes?

4.1 Text Clustering: Motivation


4.2 Text Clustering: Generative Probabilistic Models Part 1


Clustering with this model assumes that each document covers exactly one topic.

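  1. Choosing a distribution: the cluster model chooses which word distribution to use only once per document; the topic model makes that choice again for every word in the document.
  2. Generating words: in the cluster model, the chosen word distribution generates every word in the document; in the topic model, a single word distribution does not have to generate all the words, because some words can be generated by other topics.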
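
The two differences above can be illustrated with a minimal sketch of both generative processes. The function names, the toy vocabularies, and the probabilities are all hypothetical and chosen only for illustration; they do not come from the lecture.

```python
import random

def sample_word(dist, rng):
    """Draw one word from a word distribution given as {word: probability}."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate_cluster_doc(priors, word_dists, length, rng):
    """Cluster (mixture) model: choose ONE distribution for the whole document."""
    z = rng.choices(range(len(priors)), weights=priors, k=1)[0]
    words = [sample_word(word_dists[z], rng) for _ in range(length)]
    return z, words

def generate_topic_doc(coverage, word_dists, length, rng):
    """Topic model: choose a distribution again for EVERY word."""
    choices, words = [], []
    for _ in range(length):
        z = rng.choices(range(len(coverage)), weights=coverage, k=1)[0]
        choices.append(z)
        words.append(sample_word(word_dists[z], rng))
    return choices, words

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy distributions with disjoint vocabularies.
    dists = [{"text": 0.5, "mining": 0.5}, {"food": 0.5, "health": 0.5}]
    print(generate_cluster_doc([0.5, 0.5], dists, 8, rng))
    print(generate_topic_doc([0.5, 0.5], dists, 8, rng))
```

With disjoint vocabularies like these, a document from the cluster model only ever contains words from one distribution, while a document from the topic model can mix words from both.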

L: the number of words in the document.
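
To show where L enters, here is the two-component likelihood written out in LaTeX. The notation is an assumption, since the original slide is not reproduced here: the document is d = x_1 ... x_L, each component has a word distribution \theta_i, and p(\theta_i) is the probability of choosing it. The topic-model version is shown next to it for comparison.

```latex
% Mixture model for clustering: the component is chosen once per document,
% so the sum over components sits OUTSIDE the product over the L words.
p(d) = p(\theta_1) \prod_{j=1}^{L} p(x_j \mid \theta_1)
     + p(\theta_2) \prod_{j=1}^{L} p(x_j \mid \theta_2)

% Topic model: the component is chosen again for each word,
% so the sum over components sits INSIDE the product over the L words.
p(d) = \prod_{j=1}^{L} \left[ p(\theta_1)\, p(x_j \mid \theta_1)
     + p(\theta_2)\, p(x_j \mid \theta_2) \right]
```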

4.3 Text Clustering: Generative Probabilistic Models Part 2


How to extend the model from 2 clusters to N clusters
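
As a rough sketch of the N-cluster case, the two-term sum simply becomes a sum over N components, each with its own prior and word distribution. The helper names and example numbers below are hypothetical, and the sketch assumes every word distribution gives nonzero probability to every word (for example, after smoothing). It works in log space so that the product over many words does not underflow.

```python
import math
from collections import Counter

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cluster_log_scores(words, priors, word_dists):
    """For each cluster i: log p(theta_i) + sum_j log p(x_j | theta_i)."""
    counts = Counter(words)
    scores = []
    for prior, dist in zip(priors, word_dists):
        score = math.log(prior)
        for word, count in counts.items():
            # Assumes dist[word] > 0 for every word in the document.
            score += count * math.log(dist[word])
        scores.append(score)
    return scores

def doc_log_likelihood(words, priors, word_dists):
    """log p(d) = log of the sum over all N clusters of the joint score."""
    return log_sum_exp(cluster_log_scores(words, priors, word_dists))

def cluster_posteriors(words, priors, word_dists):
    """p(theta_i | d) for each cluster; the largest one gives the cluster of d."""
    scores = cluster_log_scores(words, priors, word_dists)
    total = log_sum_exp(scores)
    return [math.exp(s - total) for s in scores]

if __name__ == "__main__":
    # Hypothetical parameters for N = 3 clusters over a 3-word vocabulary.
    priors = [0.5, 0.3, 0.2]
    dists = [
        {"text": 0.6, "mining": 0.3, "food": 0.1},
        {"text": 0.1, "mining": 0.1, "food": 0.8},
        {"text": 0.3, "mining": 0.3, "food": 0.4},
    ]
    doc = ["text", "mining", "text"]
    print(doc_log_likelihood(doc, priors, dists))
    print(cluster_posteriors(doc, priors, dists))
```

In practice the priors and word distributions would be estimated from data (for example with the EM algorithm) instead of being written by hand as they are here.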

