First, install and load the tm package.
1. Read the text
x = readLines("222.txt")
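The setup and read step can be sketched as follows. This is a minimal sketch that assumes a UTF-8 text file with one record per line; the temporary file and its contents are made up here to stand in for 222.txt, and the CRAN mirror URL is just one common choice.

```r
# Install tm once if it is missing, then load it.
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm", repos = "https://cloud.r-project.org")
}
library(tm)

# Hypothetical stand-in for "222.txt": a small temporary file, one record per line.
path <- tempfile(fileext = ".txt")
writeLines(c("Female; Humans", "", "Breast Neoplasms/*therapy; Humans"), path)

# Read every line as one element of a character vector.
x <- readLines(path, encoding = "UTF-8")

# Drop empty lines so they do not turn into empty documents later.
x <- x[nzchar(trimws(x))]
length(x)
```

Filtering blank lines is optional; it only matters because each element of x becomes one document in the next step.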
2. Build the corpus
> r=Corpus(VectorSource(x))
> r
A corpus with 7012 text documents
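A small sketch of the same step on a toy vector (the two strings are made up for illustration). It uses `VCorpus` explicitly, on the assumption that a recent tm version is installed; depending on the version, `Corpus(VectorSource(x))` may return a different corpus class and print a summary that looks different from the output above.

```r
library(tm)

# Toy stand-in for the lines read from the file.
x <- c("Female; Humans", "Breast Neoplasms/*therapy; Humans")

# VectorSource treats each element of x as one document.
r <- VCorpus(VectorSource(x))

length(r)            # number of documents in the corpus
content(r[[1]])      # text of the first document
```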
3. Export the corpus and save it to disk
> writeCorpus(r)
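Called with no further arguments, `writeCorpus` writes into the current working directory, one text file per document. The sketch below writes a toy corpus into a scratch directory instead; the directory name and file names are hypothetical choices for illustration.

```r
library(tm)

# Toy corpus (made up for illustration).
r <- VCorpus(VectorSource(c("Female; Humans",
                            "Breast Neoplasms/*therapy; Humans")))

# Write into a scratch directory rather than the working directory.
out_dir <- file.path(tempdir(), "corpus_out")
dir.create(out_dir, showWarnings = FALSE)

# One plain-text file per document, with explicitly chosen file names.
writeCorpus(r, path = out_dir,
            filenames = paste0("doc", seq_along(r), ".txt"))

list.files(out_dir)
```

With 7012 documents this produces 7012 files, so a dedicated output directory is usually the safer choice.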
4. Inspect the corpus
> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
> inspect(r[2:2])
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Female; Genital Neoplasms, Female/*therapy; Humans
> r[[2]]
Female; Genital Neoplasms, Female/*therapy; Humans
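The two indexing forms above behave differently: single brackets return a smaller corpus, double brackets return one document. A minimal sketch on a toy corpus (the strings are made up), assuming a recent tm version where `content()` and `meta()` are the accessors:

```r
library(tm)

r <- VCorpus(VectorSource(c("Female; Humans",
                            "Female; Genital Neoplasms, Female/*therapy; Humans",
                            "Breast Neoplasms/*therapy; Humans")))

sub <- r[2:3]     # single brackets: still a corpus, here with 2 documents
doc <- r[[2]]     # double brackets: a single document

length(sub)
content(doc)      # the document text
meta(doc)         # document-level metadata (id, timestamp, language, ...)
```

Note that `r[2:2]` in the transcript selects the same thing as `r[2]`: a corpus holding only the second document.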
5. Build the document-term matrix
> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)
Non-/sparse entries: 110/98176
Sparsity : 100%
Maximal term length: 81
Weighting : term frequency (tf)
6. Inspect the document-term matrix
> inspect(dtm[1:2,1:4])
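To make the shape of the matrix concrete, here is a minimal sketch on a three-document toy corpus (made up for illustration). Rows are documents, columns are terms, and each cell is a term count.

```r
library(tm)

# Toy corpus, one document per element.
r <- VCorpus(VectorSource(c("stem cells and cancer",
                            "stem cells therapy",
                            "cancer therapy trial")))

dtm <- DocumentTermMatrix(r)
dim(dtm)                 # documents x terms

# inspect() prints a summary plus a sample of the cells.
inspect(dtm[1:2, ])

# as.matrix() gives the full dense matrix; practical only for small data.
m <- as.matrix(dtm)
m[, "stem"]              # count of "stem" in each document
```

For a matrix the size of the one above (7012 documents by 16381 terms), converting the whole thing with `as.matrix()` can use a lot of memory, which is why the transcript inspects only a small slice.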
7. Find terms that occur at least 200 times
> findFreqTerms(dtm,200)
 [1] "acute"          "adjuvant"       "advanced"       "after"
 [5] "and"            "breast"         "cancer"         "cancer:"
 [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"
[13] "colorectal"     "factor"         "for"            "from"
[17] "group"          "growth"         "iii"            "leukemia"
[21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
[25] "oncology"       "patients"       "phase"          "plus"
[29] "prostate"       "randomized"     "receptor"       "response"
[33] "results"        "risk"           "study"          "survival"
[37] "the"            "therapy"        "treatment"      "trial"
[41] "tumor"          "with"
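The list above contains both "cancer" and "cancer:", as well as function words such as "and", "the" and "with", because the matrix was built with default settings. One possible cleanup, sketched on a toy corpus (made up for illustration), is to pass `removePunctuation` and `stopwords` in the `control` list when building the matrix; whether this is appropriate depends on the data.

```r
library(tm)

r <- VCorpus(VectorSource(c("Breast cancer: a phase III trial",
                            "The cancer trial and the therapy",
                            "Lung cancer therapy")))

# Default settings: tokens keep attached punctuation, stopwords stay in.
raw <- DocumentTermMatrix(r)
findFreqTerms(raw, 2)        # terms occurring at least twice in total

# Strip punctuation and drop English stopwords while building the matrix.
clean <- DocumentTermMatrix(r, control = list(removePunctuation = TRUE,
                                              stopwords = TRUE))
findFreqTerms(clean, 2)
```

In the raw matrix "cancer:" is counted separately from "cancer"; after cleaning, the two are merged and "the" no longer appears among the frequent terms.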
8. Remove sparse terms (terms that appear in too few documents)
inspect(removeSparseTerms(dtm, 0.4))
9. Find terms whose correlation with "stem" is at least 0.5
> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61
10. Compute document similarity (as cosine distance)
> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934
11. Clustering
> hc <- hclust(dist_dtm, method = 'ave')
> plot(hc,xlab='')
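The last four steps can be sketched end to end on a toy corpus (made up for illustration). Two assumptions are worth stating: newer tm releases, as far as I know, no longer provide `dissimilarity()`, so the cosine distance is computed by hand in base R here; and the return format of `findAssocs` differs between versions, so its result is only printed.

```r
library(tm)

r <- VCorpus(VectorSource(c("stem cells transplant",
                            "stem cells therapy",
                            "lung cancer therapy",
                            "lung cancer trial")))
dtm <- DocumentTermMatrix(r)

# Keep only terms whose sparsity is below 0.6, i.e. terms present in
# more than 40% of the documents (here: in at least 2 of 4).
dtm_small <- removeSparseTerms(dtm, 0.6)
Terms(dtm_small)

# Terms whose correlation with "stem" is 0.5 or higher.
findAssocs(dtm, "stem", 0.5)

# Cosine distance by hand: 1 - (a . b) / (|a| * |b|) for each document pair.
# Assumes no document is empty; an all-zero row would give NaN.
m     <- as.matrix(dtm)
norms <- sqrt(rowSums(m^2))
sim   <- (m %*% t(m)) / (norms %o% norms)
d     <- as.dist(1 - sim)

# Average-linkage hierarchical clustering, then cut into two groups.
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
plot(hc, xlab = "")
```

In this toy data the two "stem cells" documents end up in one group and the two "lung cancer" documents in the other. The `method = 'ave'` in the transcript is a partial match for `"average"`.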