Text Mining, Sentiment Analysis, and Visualization of the Harry Potter Novels in R | Code and Data Included

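Full text download link: http://tecdat.cn/?p=22984

Recently, a client asked us to write a research report on text mining, including some graphical and statistical output.

Once we have cleaned our text and performed some basic word frequency analysis, the next step is to understand the opinions or emotions expressed in the text. This is known as sentiment analysis, and this tutorial walks you through a simple approach to it.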

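In short

This tutorial is an introduction to sentiment analysis. It builds on the tidy text tutorial, so if you have not read that one, I suggest you start there. This tutorial covers the following:

Replication requirements: what is needed to reproduce the analysis in this tutorial?
Sentiment datasets: the main datasets used to score sentiment
Basic sentiment analysis: performing basic sentiment analysis
Comparing sentiments: comparing how sentiments differ across the sentiment lexicons
Common sentiment words: finding the most common positive and negative words
Sentiment analysis with larger units: analyzing sentiment across larger units of text rather than individual words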

Replication requirements

This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.

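library(tidyverse)   # data manipulation and plotting
library(stringr)     # text cleaning and regular expressions
library(tidytext)    # additional text mining functions
library(harrypotter) # text of the novels (bradleyboehmke/harrypotter on GitHub, not the CRAN palette package)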

The seven novels we are working with are:

philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)

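Each text is a character vector in which each element represents a single chapter. For example, the following shows the raw text of the first two chapters of philosophers_stone.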

philosophers_stone[1:2]
## [1] "THE BOY WHO LIVED　　Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
## with such nonsense.　　Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
## small son called Dudley and in their opinion there was no finer boy anywhere.　　The Dursleys had everything they wanted, but they also
## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'... 
## [2] "THE VANISHING GLASS　　Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
## too.　　Yet Harry Potter was still there, asleep at the moment, but no...

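Sentiment datasets

A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package contains three sentiment lexicons in its sentiments dataset.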

sentiments
## # A tibble: 23,165 × 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
## 1       abacus     trust     nrc    NA
## 2      abandon      fear     nrc    NA
## 3      abandon  negative     nrc    NA
## 4      abandon   sadness     nrc    NA
## 5    abandoned     anger     nrc    NA
## 6    abandoned      fear     nrc    NA
## 7    abandoned  negative     nrc    NA
## 8    abandoned   sadness     nrc    NA
## 9  abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 23,155 more rows

The three lexicons are:

AFINN
bing
nrc

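All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive/negative sentiment and, in some cases, for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion ("yes"/"no") into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.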

# View the individual lexicons
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")

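Basic sentiment analysis

To perform sentiment analysis we need our data in a tidy format. The following converts all seven Harry Potter novels into a tibble in which every word is arranged by chapter and by book. See the tidy text tutorial for more details.

# set book as a factor to keep the books in order of publication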
series$book <- factor(series$book, levels = rev(titles))

series
## # A tibble: 1,089,386 × 3
##                   book chapter    word
## *               <fctr>   <int>   <chr>
## 1  Philosopher's Stone       1     the
## 2  Philosopher's Stone       1     boy
## 3  Philosopher's Stone       1     who
## 4  Philosopher's Stone       1   lived
## 5  Philosopher's Stone       1      mr
## 6  Philosopher's Stone       1     and
## 7  Philosopher's Stone       1     mrs
## 8  Philosopher's Stone       1 dursley
## 9  Philosopher's Stone       1      of
## 10 Philosopher's Stone       1  number
## # ... with 1,089,376 more rows
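
The code that builds series is not shown here, so the following is only a minimal sketch of the idea, using two made-up books (toy_books and its text are hypothetical stand-ins for the novels). It tokenizes each chapter into one word per row with unnest_tokens and records the book and chapter for every word.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-ins for two books; each element is one chapter.
toy_books <- list(
  "Book One" = c("The boy who lived.", "The glass vanished!"),
  "Book Two" = c("A very bad birthday.")
)

# One tibble per book: chapter number plus raw text, then one word per row.
toy_series <- bind_rows(lapply(names(toy_books), function(title) {
  tibble(book = title,
         chapter = seq_along(toy_books[[title]]),
         text = toy_books[[title]]) %>%
    unnest_tokens(word, text)   # lowercases and strips punctuation
}))

# Store book as a factor so the books keep their listed order.
toy_series$book <- factor(toy_series$book, levels = names(toy_books))

toy_series
```

With the real data, the list would hold the seven character vectors (philosophers_stone and so on) instead of the toy text.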

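Now let's use the nrc sentiment dataset to assess the different sentiments represented across the Harry Potter series. We can see that negative sentiment has a stronger presence than positive sentiment.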

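series %>%
        inner_join(get_sentiments("nrc"), by = "word") %>%  # assumed pipeline head: join the nrc lexicon
        filter(!is.na(sentiment)) %>%
        count(sentiment, sort = TRUE)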

## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1      negative 56579
## 2      positive 38324
## 3       sadness 35866
## 4         anger 32750
## 5         trust 23485
## 6          fear 21544
## 7  anticipation 21123
## 8           joy 14298
## 9       disgust 13381
## 10     surprise 12991

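This gives a good overall picture, but what if we want to understand how sentiment changes over the course of each novel? To do this, we perform the following steps:

Create an index that breaks each book up by 500 words; this is the approximate number of words on every two pages, so it lets us assess changes in sentiment even within chapters.
Join the bing lexicon with inner_join to assess the positive and negative sentiment of each word.
Count how many positive and negative words there are in every two pages.
Spread our data so that positive and negative counts sit in separate columns.
Calculate a net sentiment (positive - negative).
Plot our data.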

        ggplot(aes(index, sentiment, fill = book)) +
          geom_bar(alpha = 0.5, stat = "identity")
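
The steps above can be sketched end to end on a tiny example. The lexicon toy_bing and the text toy_series below are hypothetical, and chunk_size is set to 3 instead of 500 so the result is easy to check by hand. This sketch uses pivot_wider, which tidyr recommends in place of the older spread, and geom_col, which is equivalent to geom_bar(stat = "identity"). It also assigns exactly chunk_size words to every chunk, which is one possible way to build the index.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical two-word lexicon standing in for get_sentiments("bing")
toy_bing <- tibble(word = c("good", "bad"),
                   sentiment = c("positive", "negative"))

# Hypothetical tidy text: one word per row, as in `series`
toy_series <- tibble(
  book = "Toy Book",
  word = c("good", "day", "bad", "good", "bad", "bad")
)

chunk_size <- 3   # the tutorial uses 500 words per chunk

net_sentiment <- toy_series %>%
  group_by(book) %>%
  mutate(index = (row_number() - 1) %/% chunk_size + 1) %>%  # chunk number within each book
  ungroup() %>%
  inner_join(toy_bing, by = "word") %>%                      # keep only words found in the lexicon
  count(book, index, sentiment) %>%                          # positive/negative counts per chunk
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                    # net sentiment per chunk

# One bar per chunk, one panel per book
p <- ggplot(net_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
```

Here the first chunk has one positive and one negative word, and the second has one positive and two negative words, so the net sentiment moves from neutral to negative.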

Now we can see how the plot of each novel moves toward more positive or more negative sentiment over the trajectory of the story.


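Comparing sentiments

With several sentiment lexicons to choose from, you may want more information about which one suits your purpose. Let's use all three sentiment lexicons and examine how they differ for each novel.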

        summarise(sentiment = sum(score)) %>%
        mutate(method = "AFINN")

bing_and_nrc <-
                  inner_join(get_sentiments("nrc") %>%
                                     filter(sentiment %in% c("positive", "negative"))) %>%
              
        spread(sentiment, n, fill = 0) %>%

We now have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let's plot them.

  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method)
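
To make the difference between the scoring methods concrete, here is a small sketch under stated assumptions: toy_afinn and toy_bing are hypothetical miniature lexicons, and toy_words is made-up text that already carries a chunk index. A numeric lexicon is summed directly, while a categorical lexicon is counted and then reduced to positive minus negative. Note that current tidytext releases name the AFINN score column value, whereas the code in this tutorial uses the older name score.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature lexicons (not the real AFINN or bing contents)
toy_afinn <- tibble(word = c("good", "bad", "awful"), value = c(3, -3, -3))
toy_bing  <- tibble(word = c("good", "bad"),
                    sentiment = c("positive", "negative"))

# Hypothetical words that already carry a chunk index
toy_words <- tibble(index = c(1, 1, 1, 2, 2),
                    word  = c("good", "good", "bad", "awful", "bad"))

# Numeric lexicon: net sentiment is the sum of the word scores
afinn_net <- toy_words %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

# Categorical lexicon: net sentiment is positive count minus negative count
bing_net <- toy_words %>%
  inner_join(toy_bing, by = "word") %>%
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative, method = "Bing") %>%
  select(index, sentiment, method)

comparison <- bind_rows(afinn_net, bing_net)
comparison
```

Both methods agree on the direction of each chunk here, but the magnitudes differ, partly because the toy bing lexicon has no entry for "awful". The same effect is one reason the absolute values differ across the real lexicons.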

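The three lexicons give results that differ in absolute terms but follow fairly similar relative trajectories through the novels. We see similar dips and peaks in sentiment at about the same places in each novel, but the absolute values are noticeably different. In some cases the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. The output also lets us compare across novels. First, you get a good sense of the differences in book length: Order of the Phoenix is much longer than Philosopher's Stone. Second, you can compare how the books in the series differ in their sentiment.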

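Common sentiment words

One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.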

word_counts
## # A tibble: 3,313 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    like  positive  2416
## 2    well  positive  1969
## 3   right  positive  1643
## 4    good  positive  1065
## 5    dark  negative  1034
## 6   great  positive   877
## 7   death  negative   757
## 8   magic  positive   606
## 9  better  positive   533
## 10 enough  positive   509
## # ... with 3,303 more rows

We can view this visually to assess the top n words for each sentiment.

        ggplot(aes(reorder(word, n), n, fill = sentiment)) +
          geom_bar(alpha = 0.8, stat = "identity")
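
The counting and ranking behind a table like word_counts can be sketched as follows. The lexicon toy_bing and the words in toy_words are hypothetical; with the real data, the join would be against get_sentiments("bing") and the input would be series. slice_max picks the top words within each sentiment, which is one way to select the rows to plot.

```r
library(dplyr)

# Hypothetical miniature lexicon and word list
toy_bing <- tibble(word = c("good", "well", "dark", "death"),
                   sentiment = c("positive", "positive", "negative", "negative"))
toy_words <- tibble(word = c("good", "dark", "good", "well",
                             "death", "dark", "good", "the"))

# Count how often each lexicon word occurs, most frequent first
word_counts <- toy_words %>%
  inner_join(toy_bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Keep the single most frequent word for each sentiment
top_words <- word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words
```

Raising n = 1 to a larger value, such as 10, gives the top ten words per sentiment for plotting.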

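Sentiment analysis with larger units

A lot of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (i.e. single words) and try to understand the sentiment of a sentence as a whole. These algorithms try to understand that

I am not having a good day.

is a sad sentence, not a happy one, because of the negation. Stanford's CoreNLP tools are an example of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences. I use the philosophers_stone dataset to illustrate.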

tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
##                                                                       sentence
##                                                                          <chr>
## 1                                              the boy who lived  mr. and mrs.
## 2  dursley, of number four, privet drive, were proud to say that they were per
## 3  they were the last people you'd expect to be involved in anything strange o
## 4                                                                          mr.
## 5      dursley was the director of a firm called grunnings, which made drills.
## 6  he was a big, beefy man with hardly any neck, although he did have a very l
## 7                                                                         mrs.
## 8  dursley was thin and blonde and had nearly twice the usual amount of neck, 
## 9  the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows

The argument token = "sentences" attempts to break up the text by punctuation.

Let's go ahead and break up the philosophers_stone text by chapter and by sentence.

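# ps_sentences is an assumed name; chapter numbers follow the element order
ps_sentences <- tibble(chapter = seq_along(philosophers_stone),
                        text = philosophers_stone) %>% 
  unnest_tokens(sentence, text, token = "sentences")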

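This lets us assess net sentiment by chapter and by sentence. First, we need to track the sentence numbers, and then I create an index that tracks progress through each chapter. Next, I unnest the sentences by word, which gives us a tibble with the individual words by sentence within each chapter. Now, as before, I join the AFINN lexicon and compute the net sentiment score for each chapter and index value. We can see that the most positive sentences are about halfway through chapter 9, towards the end of chapter 17, early in chapter 4, and so on.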

        group_by(chapter, index) %>%
        summarise(sentiment = sum(score, na.rm = TRUE)) %>%
        arrange(desc(sentiment))


## Source: local data frame [1,401 x 3]
## Groups: chapter [17]
## 
##    chapter index sentiment
##      <int> <dbl>     <int>
## 1        9  0.47        14
## 2       17  0.91        13
## 3        4  0.11        12
## 4       12  0.45        12
## 5       17  0.54        12
## 6        1  0.25        11
## 7       10  0.04        11
## 8       10  0.16        11
## 9       11  0.48        11
## 10      12  0.70        11
## # ... with 1,391 more rows
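
The sentence-level steps described above can be sketched on a toy example. Everything named toy_ below is hypothetical, including the three-word lexicon standing in for AFINN, and the index is computed as the sentence number divided by the number of sentences in the chapter, which is one plausible way to track progress through a chapter.

```r
library(dplyr)
library(tidytext)

# Hypothetical miniature lexicon standing in for AFINN
toy_afinn <- tibble(word = c("happy", "sad", "wonderful"),
                    value = c(3, -2, 4))

# Hypothetical chapters: one row per chapter
toy_chapters <- tibble(
  chapter = 1:2,
  text = c("Harry was happy. Then he was sad. It was a wonderful day.",
           "Everyone felt sad. Nobody was happy.")
)

# One row per sentence
toy_sentences <- toy_chapters %>%
  unnest_tokens(sentence, text, token = "sentences")

sentence_sent <- toy_sentences %>%
  group_by(chapter) %>%
  mutate(sentence_num = row_number(),
         index = round(sentence_num / n(), 2)) %>%   # progress through the chapter, 0 to 1
  ungroup() %>%
  unnest_tokens(word, sentence) %>%                  # one row per word, keeping chapter and index
  inner_join(toy_afinn, by = "word") %>%
  group_by(chapter, index) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  arrange(desc(sentiment))

sentence_sent
```

Note that "Nobody was happy." still receives a positive score here, because summing word scores ignores negation. That is exactly the limitation that sentence-aware tools try to address.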

We can visualize this with a heatmap that shows the most positive and most negative sentiment as we progress through each chapter.

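ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +  # assumed mapping: progress on x, chapter on y
        geom_tile(color = "white") +
        scale_fill_gradient2() +  # diverging scale centered at zero
        labs(x = "Chapter progression", y = "Chapter")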


This article is excerpted from "Text Mining, Sentiment Analysis, and Visualization of the Harry Potter Novels in R".