R: DocumentTermMatrix fails with "Error in nchar(Terms(x), type = "chars") : invalid multibyte string"

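A while ago I was doing topic analysis on text in R, and when I tried to generate the DTM I ran into the following error:

Error in nchar(Terms(x), type = "chars") : invalid multibyte string

The R code that raised the error: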

library(tm)    # Corpus, VectorSource, tm_map, DocumentTermMatrix
library(tmcn)  # stopwordsCN
# Read the pre-segmented text; the column x holds one document per row
samgov.segmentText <- read.csv('samgov_segment.csv', header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
d.corpus <- Corpus(VectorSource(samgov.segmentText$x), readerControl = list(language = "UTF-8"))
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(2, Inf), weighting = weightTf, encoding = "UTF-8")
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)  # the error is raised here

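I tried several approaches suggested online. The one recommended most often is to change the locale:

first call Sys.setlocale(locale="English"), run the code above, and then switch back with Sys.setlocale(locale="Chinese (Simplified)_People's Republic of China.936"). None of this worked for me.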

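After digging through a lot more material, I finally found a fix that works, on Zhihu [1]  (*^▽^*)

The fix:

Add one line, m <- enc2utf8(samgov.segmentText$x), and build the corpus from m. enc2utf8 converts the character vector to UTF-8, so the terms that later reach nchar are valid multibyte strings.

The R code becomes: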

library(tm)    # Corpus, VectorSource, tm_map, DocumentTermMatrix
library(tmcn)  # stopwordsCN
samgov.segmentText <- read.csv('samgov_segment.csv', header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
# Convert the text column to UTF-8 before building the corpus
m <- enc2utf8(samgov.segmentText$x)
d.corpus <- Corpus(VectorSource(m), readerControl = list(language = "UTF-8"))
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(2, Inf), weighting = weightTf, encoding = "UTF-8")
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
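
If the error persists, it can help to check which strings are not valid UTF-8 before building the corpus. The following is only a sketch in base R: find_invalid_utf8 is a hypothetical helper (not part of tm), and the GBK bytes are made up to simulate the problem rather than taken from the original data.

```r
# Hypothetical helper (not from tm): indices of elements that are not valid UTF-8.
find_invalid_utf8 <- function(x) {
  which(!validUTF8(x))
}

# Simulate the failure mode: GBK bytes for the characters U+4E2D U+6587,
# wrongly labelled as UTF-8.
gbk_bytes <- as.raw(c(0xd6, 0xd0, 0xce, 0xc4))
bad <- rawToChar(gbk_bytes)
Encoding(bad) <- "UTF-8"

# The same two characters as a genuine UTF-8 string.
good <- "\u4e2d\u6587"

texts <- c(good, bad)
invalid_idx <- find_invalid_utf8(texts)
print(invalid_idx)

# nchar(type = "chars") is the call named in the error message;
# allowNA = TRUE makes it return NA for invalid strings instead of stopping.
char_counts <- nchar(texts, type = "chars", allowNA = TRUE)
print(char_counts)

# One possible repair when the real source encoding is known
# (assumption: this platform's iconv supports GBK).
fixed <- iconv(rawToChar(gbk_bytes), from = "GBK", to = "UTF-8")
```

Applied to the code above, find_invalid_utf8(m) would point at the rows of the CSV that still need attention.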

Result of running it:

[Screenshot of the output; image not available]

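DTM (DocumentTermMatrix):

This is also called the document-term matrix. Its rows represent documents, its columns represent terms, and each element is the number of times a given term occurs in a given document.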

See the Wikipedia article [2] for a fuller explanation.

In R, a DTM can be obtained with the DocumentTermMatrix function from the tm package.
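
To make the definition concrete, here is a minimal base R sketch that builds a small document-term matrix by hand. The two toy documents and all variable names are made up for illustration. It is not how tm implements DocumentTermMatrix, only the same idea at small scale.

```r
# Two toy documents, already segmented into space-separated terms.
docs <- c(d1 = "apple banana apple", d2 = "banana cherry")

# Split each document into its terms; the list keeps the names d1 and d2.
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of all distinct terms.
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document (one column per document).
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# Transpose so that rows are documents and columns are terms.
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Here dtm["d1", "apple"] is 2 because "apple" occurs twice in the first document, and dtm["d2", "apple"] is 0.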


References:

[1] Zhihu (I can no longer find the exact link T_T, but many thanks to whoever posted the method)

[2] https://en.wikipedia.org/wiki/Document-term_matrix
