An introduction to text mining with tidytext.

Tidy data

Tidy data has the following structure:

Each row is an observation
Each column is a variable

For tidy text data, the unit stored in each row is usually a single word, but it can also be an n-gram, a sentence, or a paragraph.

unnest_tokens

Use the unnest_tokens function to convert text into this format:

# unnest_tokens
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")

text

library(dplyr)
# tibble() replaces the deprecated data_frame()
text_df <- tibble(line = 1:4, text = text)

text_df

# However, the basic unit of text analysis is the word, so the text needs to be split into words
library(tidytext)

text_df %>%
  unnest_tokens(word, text)


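A brief look at the unnest_tokens function:
Two basic arguments are used here. The first is the name of the output column (word above), and the second is the input column that holds the text (text in this example).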

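After applying unnest_tokens, each original row has been split so that every row of the new data frame holds one token (here, one word). Note also that:

Other columns, such as the line number each word came from, are retained.
Punctuation has been stripped.
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn this behavior off.)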
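To make the lowercase behavior concrete, here is a minimal sketch that tokenizes the same two lines twice, once with the defaults and once with to_lower = FALSE. The object names tokens_lower and tokens_cased are made up for this illustration.

```r
library(dplyr)
library(tidytext)

# Two lines from the poem used above
text_df <- tibble(
  line = 1:2,
  text = c("Because I could not stop for Death -",
           "He kindly stopped for me -")
)

# Default behavior: tokens are converted to lowercase
tokens_lower <- text_df %>%
  unnest_tokens(word, text)

# Keep the original capitalization
tokens_cased <- text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

tokens_lower$word
tokens_cased$word
```

In both results the line column is kept and the trailing dashes are dropped; only the capitalization of words such as "Death" differs.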

[Figure: the text analysis workflow]

Tidying the works of Jane Austen

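Jane Austen is the author of Pride and Prejudice, which, honestly, is well worth reading.
The data comes from the janeaustenr package, which provides the text of Jane Austen's novels.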

library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

original_books
# A tibble: 73,422 x 4
   text                  book                linenumber chapter
   <chr>                 <fct>                    <int>   <int>
 1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
 2 ""                    Sense & Sensibility          2       0
 3 by Jane Austen        Sense & Sensibility          3       0
 4 ""                    Sense & Sensibility          4       0
 5 (1811)                Sense & Sensibility          5       0
 6 ""                    Sense & Sensibility          6       0
 7 ""                    Sense & Sensibility          7       0
 8 ""                    Sense & Sensibility          8       0
 9 ""                    Sense & Sensibility          9       0
10 CHAPTER 1             Sense & Sensibility         10       1
# ... with 73,412 more rows

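linenumber is the line number within each book, and chapter is the chapter that line belongs to (0 for the lines before the first chapter heading).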
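The chapter column comes from combining str_detect() with cumsum(): every line that looks like a chapter heading counts as 1, and the running total gives the current chapter number. A minimal sketch on a made-up vector of lines (the lines object is invented for this illustration) shows the idea:

```r
library(stringr)

# A few made-up lines in the same layout as the novels
lines <- c("SENSE AND SENSIBILITY",
           "",
           "CHAPTER 1",
           "The family of Dashwood had long been settled in Sussex.",
           "CHAPTER II",
           "Mrs. John Dashwood now installed herself mistress of Norland.")

# TRUE for lines that start with "chapter" followed by a digit or a Roman numeral letter
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# The running count of headings seen so far is the chapter number
chapter <- cumsum(is_heading)

chapter
```

Lines before the first heading get chapter 0, which matches the title lines in the output above.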

To work with this as a tidy dataset, we need to restructure it from one line per row into the one-token-per-row format:

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books


# A tibble: 725,055 x 4
   book                linenumber chapter word       
   <fct>                    <int>   <int> <chr>      
 1 Sense & Sensibility          1       0 sense      
 2 Sense & Sensibility          1       0 and        
 3 Sense & Sensibility          1       0 sensibility
 4 Sense & Sensibility          3       0 by         
 5 Sense & Sensibility          3       0 jane       
 6 Sense & Sensibility          3       0 austen     
 7 Sense & Sensibility          5       0 1811       
 8 Sense & Sensibility         10       1 chapter    
 9 Sense & Sensibility         10       1 1          
10 Sense & Sensibility         13       1 the        
# ... with 725,045 more rows

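This function uses the tokenizers package to split each line of text in the original data frame into tokens. The default tokenization is by word, but other options include characters, n-grams, sentences, lines, paragraphs, or splitting around a regular expression pattern.
This is controlled by the following argument: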

token   
Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, should take a character vector and return a list of character vectors of the same length.
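As a rough illustration of the token argument, the sketch below splits one line into bigrams and a short passage into sentences. The input strings and the object names bigrams and sentences are made up for this example; n = 2 is passed through to the n-gram tokenizer.

```r
library(dplyr)
library(tidytext)

# One short line, split into overlapping two-word sequences
line_df <- tibble(line = 1, text = "He kindly stopped for me")

bigrams <- line_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams$bigram

# A short passage, split into sentences instead of words
passage_df <- tibble(text = "The Carriage held but just Ourselves. And Immortality.")

sentences <- passage_df %>%
  unnest_tokens(sentence, text, token = "sentences")

sentences
```

With five words in the input line, the bigram result has four rows, starting with "he kindly".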

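Now that the data is in one-word-per-row format, we can manipulate it with tidy tools such as dplyr. In text analysis we often want to remove stop words: words that are not useful for the analysis, typically very common words such as "the", "of", and "to" in English. We can remove stop words (kept in the tidytext dataset stop_words) with anti_join().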

data(stop_words)

# Drop every row whose word appears in the stop word list
tidy_books <- tidy_books %>%
  anti_join(stop_words, by = "word")
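anti_join() keeps only the rows of the left table that have no match in the right table, which is why it works as a stop word filter. A minimal sketch on a made-up four-word table (toy and kept are names invented for this illustration):

```r
library(dplyr)
library(tidytext)

data(stop_words)

# A tiny made-up table in the same one-word-per-row format
toy <- tibble(word = c("the", "carriage", "of", "immortality"))

# Rows whose word is found in stop_words are removed; the rest are kept in order
kept <- toy %>%
  anti_join(stop_words, by = "word")

kept$word
```

Here "the" and "of" are dropped because they appear in stop_words, while "carriage" and "immortality" remain.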

We can also use dplyr's count() to find the most common words across all the books.

tidy_books %>%
  count(word, sort = TRUE)
# A tibble: 13,914 x 2
   word       n
   <chr>  <int>
 1 miss    1855
 2 time    1337
 3 fanny    862
 4 dear     822
 5 lady     817
 6 sir      806
 7 day      797
 8 emma     787
 9 sister   727
10 house    699

As we can see, the most common word is "miss".

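Visualization:

Because we have been using tidy tools throughout, the word counts are stored in a tidy data frame. This lets us pipe them straight into the ggplot2 package, for example to plot the most common words: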

library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

[Figure: bar chart of the most common words in Jane Austen's novels (words with n > 600)]
