A note up front: the best learning materials are the official documentation and API reference:
http://radimrehurek.com/gensim/tutorial.html
http://radimrehurek.com/gensim/apiref.html
Part of the content below comes from the official documentation.
Use TF-IDF/LDA to do topic classification on Chinese documents. scikit-learn also has a TF-IDF implementation (a small sketch follows the documentation link below). For Chinese text, first do word segmentation, then generate the document vectors, and finally cluster the documents into topics based on those vectors.
The official scikit-learn documentation:
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
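As a rough sketch of the scikit-learn route (an illustration, not part of the gensim example below): TfidfVectorizer works on whitespace-separated tokens, so the jieba segmentation result is joined with spaces before vectorizing, and the resulting TF-IDF vectors can then be clustered, for example with KMeans. The sample documents and the two-cluster setting are just for demonstration:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["新春备年货,新年联欢晚会", "新春节目单,春节联欢晚会红火", "大盘下跌股市散户", "下跌股市赚钱"]

# TfidfVectorizer expects whitespace-separated tokens, so segment the Chinese text first
segmented = [" ".join(jieba.cut(doc, cut_all=False)) for doc in documents]

# Turn each document into a TF-IDF vector
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(segmented)

# Cluster the documents into 2 topics based on those vectors
kmeans = KMeans(n_clusters=2, n_init=10).fit(tfidf_matrix)
print(kmeans.labels_)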
documents = ["新春备年货,新年联欢晚会", "新春节目单,春节联欢晚会红火", "金猴新春 红火新年", "新车新年 年货新春", "新年,看春节联欢晚会", "大盘下跌股市散户", "下跌股市赚钱", "股市反弹下跌", "股市散户赚钱", "大盘下跌散户"] words_list = [[word for word in jieba.cut(str(document), cut_all=False) if len(word) > 1 ] for document in documents] logging.info(words_list) #构建字典 words_dictionary = corpora.Dictionary(words_list) logging.info(words_dictionary.token2id) #根据词袋模型生成对应的文档向量 word_corpus = [words_dictionary.doc2bow(word) for word in words_list] for corpu in word_corpus: logging.info(corpu) #第一步初始货模型 tfidf = models.TfidfModel(word_corpus) corpus_tfidf = tfidf[word_corpus] #使用lda模型分类找出主题 text_lda = models.LdaModel(word_corpus,id2word=words_dictionary,num_topics=2) logging.info(text_lda) doc_bow = text_lda[word_corpus] corpus_lda = text_lda[doc_bow] for doc in corpus_lda: logging.info(doc)
Apply word2vec to book reviews previously scraped from Douban. The goal is to run similarity queries over those reviews. Since the scraped data is mostly Chinese, it first needs Chinese word segmentation; the segmented text is then used to train a word-vector model, and the model API is used to query how related the words in the vocabulary are. The results will vary with the size of the corpus.
I scraped 848 reviews of Stoner (斯通纳). This corpus is actually far too small for a real word2vec test; it only serves to help understand word2vec. The code is as follows:
import logging
import jieba
import pandas as pd
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the stop-word list, one word per line
stop_words = [line.strip() for line in open('stop_words.txt', encoding='utf-8').readlines()]
logging.info(stop_words)

text_list = []
# Read the scraped reviews from the csv file
reviews = pd.read_csv('26425831.csv', encoding='utf-8')
for idx, row in reviews.iterrows():
    review_content = row['UserComment']
    # Segment the review and keep tokens longer than one character
    seg_list = jieba.cut(str(review_content), cut_all=False)
    word_list = [item for item in seg_list if len(item) > 1]
    # Drop stop words (the set also removes duplicate tokens within a review)
    text_list.append(list(set(word_list) - set(stop_words)))
logging.info(text_list)

# Train the word2vec model on the segmented reviews
model = Word2Vec(text_list)
# model.save("word.model")

print(model.wv.most_similar('格蕾丝'))
# print(model.wv['通纳'])
print(model.wv.similarity('通纳', '小说'))
print(model.wv.key_to_index)  # vocabulary; on gensim < 4.0 use model.wv.vocab instead
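The commented-out model.save call above is where persistence would go; a small sketch of saving the trained model and reloading it later for the same kind of similarity query, reusing the file name word.model from the code:

# Persist the trained model so it can be queried later without retraining
model.save("word.model")

loaded = Word2Vec.load("word.model")
print(loaded.wv.most_similar('格蕾丝', topn=5))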