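Data:
The DBLP public dataset, http://dblp.uni-trier.de/ . DBLP holds a large number of bibliographic records; here we use the papers of the ICCS (International Conference on Computational Science) conference as the object of study. A sample of the data is shown below: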
Method:
Use the tm text-mining package in R to find the frequent words in this set of papers.
Code & comments:
# load tm package
library(tm)
# load RODBC package to extract data from the MySQL database
library(RODBC)
# build a connection to DBLP database
channel <- odbcConnect("dblp",uid="root", pwd="admin")
# copy data from the "paper" table
paper<-sqlFetch(channel,"paper")
# select the subset of papers from the ICCS conference
tmsample<-subset(paper,conference =="ICCS")
# view the dataset, which includes 2041 papers for the ICCS conference
View(tmsample)
# save title data for text mining
title<-as.vector(tmsample$title)
# establish corpus for title data
tm<-VCorpus(VectorSource(title))
# data cleaning
tm<-tm_map(tm, content_transformer(tolower))
tm<-tm_map(tm, removeWords,stopwords("english"))
# create document term matrix
dtm<-DocumentTermMatrix(tm,control=list(removePunctuation= TRUE,stopwords=TRUE))
# sparse is the maximum allowed sparsity of a term: a smaller value keeps
# fewer (only the more frequent) words. With 0.98, a word that appears in
# less than (1 - 0.98) = 2% of the titles is dropped from the document term matrix
dtm2 <-removeSparseTerms(dtm, sparse=0.98)
# check frequent words in dtm2; the figure below shows the result (35 frequent words in all)
dtm2$dimnames$Terms
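To make the effect of the `sparse` argument concrete, here is a minimal base-R sketch of the thresholding rule as I understand it: a term is kept only if it occurs in more than (1 - sparse) of the documents. The toy matrix and the helper `keep_terms` are made up for illustration and are not part of tm.

```r
# Toy document-term counts (made-up data): rows are titles, columns are terms.
counts <- matrix(
  c(1, 0, 1,
    1, 0, 0,
    1, 1, 0,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("model", "grid", "parallel"))
)

# Hypothetical helper, not a tm function: keep a term only if the number of
# documents containing it exceeds (1 - sparse) * number of documents.
keep_terms <- function(counts, sparse) {
  doc_freq <- colSums(counts > 0)   # documents in which each term occurs
  names(doc_freq)[doc_freq > nrow(counts) * (1 - sparse)]
}

keep_terms(counts, sparse = 0.98)   # lenient threshold: all three terms stay
keep_terms(counts, sparse = 0.5)    # strict threshold: only "model" stays
```

With the 2041 ICCS titles and `sparse = 0.98`, the same rule would require a word to appear in more than about 40 titles.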
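As the figure above shows, we found some frequent words in the ICCS papers, for example algorithm, modeling, and optimization, which fit the themes of the conference reasonably well in practice. However, the frequent words include plural forms that duplicate the meaning of another term, for example model and models. Could anyone explain how to remove these plural forms when mining frequent words with the tm package?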
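One possible direction, offered as a sketch rather than a definitive answer: tm provides `stemDocument` (backed by the SnowballC package), which can be applied with `tm_map` before building the matrix, but it returns stems rather than readable words. A gentler alternative is to fold a plural column into its singular only when both are present. The helper `merge_plurals` below is hypothetical, uses only base R, works on a plain count matrix, and uses a deliberately naive rule (strip one trailing "s").

```r
# Hypothetical helper, not part of tm: merge a plural term into its singular
# when the singular is also a column of the matrix.
merge_plurals <- function(counts) {
  terms <- colnames(counts)
  # Naive singular candidate: drop one trailing "s".
  singular <- sub("s$", "", terms)
  # Remap only when the stripped form is itself an existing term, so words
  # such as "analysis" are left untouched.
  target <- ifelse(singular %in% terms, singular, terms)
  # Sum the columns that share the same target term.
  t(rowsum(t(counts), group = target))
}

# Toy document-term counts (made-up data).
counts <- matrix(
  c(1, 1, 0,
    0, 2, 1,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("model", "models", "analysis"))
)

merged <- merge_plurals(counts)
colnames(merged)   # "models" has been folded into "model"
colSums(merged)    # total count per remaining term
```

In the script above, `as.matrix(dtm)` could supply `counts`. Merging before the sparsity filter is probably preferable, so that model and models are counted together when the threshold is applied.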