A summary of ways to compute TF-IDF values in Python

1. Computing with the sklearn package
1.1 Computing with TfidfTransformer

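from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

if __name__ == "__main__":
    # Each document is already segmented, with tokens separated by spaces
    corpus = ["我 来到 北京 清华大学",
              "小明 硕士 毕业 与 中国 科学院",
              "我 爱 北京 天安门"]
    # Note: the default token_pattern drops single-character tokens such as "我"
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    # Vocabulary terms; on scikit-learn older than 1.0 use get_feature_names() instead
    word = vectorizer.get_feature_names_out()
    # weight[i][j] is the TF-IDF weight of term j in document i
    weight = tfidf.toarray()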

1.2 Computing with TfidfVectorizer

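from sklearn.feature_extraction.text import TfidfVectorizer

# sentences_train / sentences_test: lists of segmented sentences (tokens separated by spaces)
count_vec = TfidfVectorizer()
x_train = count_vec.fit_transform(sentences_train)  # learn the vocabulary and IDF on the training set
x_test = count_vec.transform(sentences_test)        # reuse them on the test set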

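TfidfVectorizer takes the list of segmented sentences as its input directly, whereas TfidfTransformer needs CountVectorizer to first turn the sentence list into a bag-of-words count matrix, which it then converts to TF-IDF values.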
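
To make it clearer what the numbers in the weight matrix are, here is a rough pure-Python sketch of the formula scikit-learn applies under its default settings (smooth_idf=True, norm='l2'): idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name sklearn_style_tfidf is made up for this illustration, and the filter on token length is only an approximation of the default token_pattern, so treat it as a sketch rather than the library's implementation.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, rows of TF-IDF weights)."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # L2-normalize each document vector (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

corpus = ["我 来到 北京 清华大学",
          "小明 硕士 毕业 与 中国 科学院",
          "我 爱 北京 天安门"]
# Assumption: keep only tokens of 2+ characters, roughly mirroring the default token_pattern
docs = [[t for t in s.split() if len(t) >= 2] for s in corpus]
vocab, rows = sklearn_style_tfidf(docs)
```

In the third document, "天安门" appears in only one document while "北京" appears in two, so "天安门" ends up with the larger weight.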

2. Computing TF-IDF values with the gensim package

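import os
import jieba
from gensim import corpora, models

# Segment every file under the corpus directory into a list of tokens
train_set = []
walk = os.walk('/u01/jerry/Reduced')
for root, dirs, files in walk:
    for name in files:
        with open(os.path.join(root, name), 'r') as f:
            raw = f.read()
        word_list = list(jieba.cut(raw, cut_all=False))
        train_set.append(word_list)

dic = corpora.Dictionary(train_set)                 # map each token to an integer id
corpus = [dic.doc2bow(text) for text in train_set]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                   # fit the IDF statistics
corpus_tfidf = tfidf[corpus]                        # TF-IDF weighted corpus
# Train a 10-topic LDA model on the TF-IDF corpus and project the corpus into topic space
lda = models.LdaModel(corpus_tfidf, id2word=dic, num_topics=10)
corpus_lda = lda[corpus_tfidf]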
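
gensim's numbers differ from sklearn's because its default weighting is different. As far as I understand the defaults of TfidfModel, the weight is tf * log2(N / df), each document vector is then scaled to unit length, and terms whose weight is zero (those that occur in every document) are dropped. The pure-Python sketch below follows that description; gensim_style_tfidf and the toy bow_docs are made up for this illustration and are not part of gensim.

```python
import math
from collections import Counter

def gensim_style_tfidf(bow_docs):
    """bow_docs: list of [(term_id, count), ...], the shape produced by doc2bow."""
    n_docs = len(bow_docs)
    # Document frequency of each term id
    df = Counter(term_id for doc in bow_docs for term_id, _ in doc)
    # Unsmoothed base-2 IDF: log2(N / df); it is 0 for terms found in every document
    idf = {t: math.log2(n_docs / d) for t, d in df.items()}
    result = []
    for doc in bow_docs:
        weights = [(t, c * idf[t]) for t, c in doc]
        # Drop terms whose weight is (numerically) zero
        weights = [(t, w) for t, w in weights if abs(w) > 1e-12]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(t, w / norm) for t, w in weights] if norm else [])
    return result

# Toy corpus (hypothetical): term 0 occurs in all three documents
bow_docs = [[(0, 1), (1, 1)],
            [(0, 1), (2, 2)],
            [(0, 1), (1, 1), (3, 1)]]
weighted = gensim_style_tfidf(bow_docs)
```

Term 0 disappears from every document because its IDF is log2(3 / 3) = 0, which is one visible difference from sklearn's smoothed formula, where the IDF never falls below 1.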
