Text Analysis in R (3)

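I have been writing code along with the Text Mining chapter of "R and Data Mining: Examples and Case Studies". The earlier steps went fairly smoothly, and I was quietly pleased. But the errors started at stem completion, and I have not found a fix yet.
Here is the code as I copied it: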

library(twitteR)
load("rdmTweets.RData")
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
dim(df)
head(df)

library(tm)
myCorpus <- Corpus(VectorSource(df$text)) 
# convert to lower case 
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
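# remove URLs; removeURL is not part of tm, so it has to be defined first
# (punctuation is already stripped, so each URL is one alphanumeric run)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# my stop words
myStopwords <- c(stopwords('english'), "available", "via")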
myStopwords <- setdiff(myStopwords, c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[11:15])
for(i in 11:15) {
  cat(paste("[[",i,"]]", sep = ""))
  writeLines(strwrap(as.character(myCorpus[[i]]), width = 73))
}

# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

This is where things start to go wrong. The error message is:

> myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code
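
One possible reading of this warning: tm's documentation describes the first argument of stemCompletion as a character vector of stems, whereas tm_map hands it one whole document at a time. The base-R sketch below only illustrates that idea and is not tm's implementation; `complete_shortest`, `dictionary` and `stemmed_doc` are hypothetical names and toy data made up for the example.

```r
# Toy stand-ins (hypothetical): a small dictionary and one stemmed "document".
dictionary <- c("mining", "examples", "studies", "analysis")
stemmed_doc <- "text min exampl"

# Prefix lookup, the basic idea behind stem completion.
# Illustration only; unlike tm, unmatched stems are kept unchanged here.
complete_shortest <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) return(s)   # nothing matches: keep the stem
    hits[which.min(nchar(hits))]       # otherwise take the shortest match
  }, character(1), USE.NAMES = FALSE)
}

# Whole document passed as a single string: no dictionary word starts with it.
whole <- complete_shortest(stemmed_doc, dictionary)

# Split into stems first, complete each one, then glue the words back together.
stems <- strsplit(stemmed_doc, " ", fixed = TRUE)[[1]]
per_word <- paste(complete_shortest(stems, dictionary), collapse = " ")

print(whole)
print(per_word)
```

If this reading is right, completion has to happen per word, with each document split before the lookup and reassembled afterwards.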

After turning to Stack Overflow, I found the following solutions:

  • Solution 1:
    Replace myCorpus <- tm_map(myCorpus, tolower) with myCorpus <- tm_map(myCorpus, content_transformer(tolower))
    Result: no effect.
  • Solution 2:

    # Stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

Replace the call above with:

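# Stem completion
# NOTE: as quoted, `dict` is accepted but never passed on to stemCompletion(),
# whose `dictionary` argument has no default.
stemCompletion_mod <- function(x, dict) {
    words <- unlist(strsplit(as.character(x), " "))
    completed <- stemCompletion(words, type = "shortest")
    PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, myCorpusCopy)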

Result: no effect.
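
For comparison, here is a minimal sketch of the same per-word workaround with the dictionary actually handed to stemCompletion. It runs on toy data rather than on rdmTweets, so it is not a confirmed fix for the error above. `complete_doc`, `dict_words` and `stemmed` are hypothetical names introduced for the example, and the dictionary is given as a plain character vector instead of a corpus.

```r
library(tm)

# Toy stand-ins (hypothetical) for the dictionary and the stemmed documents.
dict_words <- c("mining", "examples", "text")
stemmed <- c("text min exampl", "min text")

# Complete one document: split into stems, complete each stem, rejoin.
complete_doc <- function(doc, dictionary) {
  stems <- unlist(strsplit(as.character(doc), " ", fixed = TRUE))
  stems <- stems[stems != ""]
  completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
  # fall back to the original stem where no completion was found
  missing <- is.na(completed) | completed == ""
  completed[missing] <- stems[missing]
  paste(completed, collapse = " ")
}

completed_docs <- vapply(stemmed, complete_doc, character(1),
                         dictionary = dict_words, USE.NAMES = FALSE)

# Rebuild a corpus from the completed text so later tm steps can use it.
completedCorpus <- Corpus(VectorSource(completed_docs))
print(completed_docs)
```

The sketch returns plain strings and rebuilds the corpus at the end, instead of leaving a bare list of documents as lapply does. That choice is an assumption about what later steps need, not something taken from the book.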
