Text Classification with Gensim

pip install gensim

Data source:

WeChat Moments posts

Code:

# -*- coding:utf-8 -*-

import time

import jieba
import numpy as np
from gensim import corpora, models, similarities


def load_stopword():
    '''
    Load the stopword list.

    :return: list of stopwords
    '''
    with open('stopword.txt', encoding='utf-8') as f_stop:
        sw = [line.strip() for line in f_stop]
    return sw


if __name__ == '__main__':
    print('1.初始化停止词列表 ------')
    # Start time
    t_start = time.time()
    # Load the stopword list
    # NOTE: stop_words is loaded here but is not applied anywhere below
    stop_words = load_stopword()

    print('2.开始读入语料数据 ------ ')
    wxid_list = list()
    texts = list()
    # Read the corpus: one document per line, formatted as "<wxid>\t<text>"
    with open('sample2_combined_7201wxid.txt', 'r', encoding='utf-8') as file_to_read1:
        while True:
            one_line = file_to_read1.readline()
            # Stop at end of file or at the first empty line
            if one_line == '' or one_line == '\n':
                break
            line_strs = one_line.split('\n')[0].split('\t')
            wx_id = line_strs[0]
            text = line_strs[1]
            # Segment the Chinese text into words
            words = jieba.lcut(text)
            wxid_list.append(wx_id)
            texts.append(words)
    print('读入语料数据完成，用时%.3f秒' % (time.time() - t_start))
    M = len(texts)
    print('文本数目：%d个' % M)

    print('3.正在建立词典 ------')
    # Build the dictionary (token <-> integer id)
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)

    print('4.正在计算文本向量 ------')
    # Convert each document to (token id, count) pairs
    corpus = [dictionary.doc2bow(text) for text in texts]

    print('5.正在计算文档TF-IDF ------')
    t_start = time.time()
    # Compute TF-IDF weights
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    print('建立文档TF-IDF完成，用时%.3f秒' % (time.time() - t_start))

    print('6.LDA模型拟合推断 ------')
    # Train the model
    num_topics = 30
    t_start = time.time()
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha=0.01, eta=0.01, minimum_probability=0.001,
                          update_every=1, chunksize=100, passes=1)
    print('LDA模型完成，训练时间为\t%.3f秒' % (time.time() - t_start))

    # Print the topics of 10 randomly chosen documents
    num_show_topic = 10  # number of top topics shown per document
    print('7.结果：10个文档的主题分布：--')
    doc_topics = lda.get_document_topics(corpus_tfidf)  # topic distribution of every document
    idx = np.arange(M)
    np.random.shuffle(idx)
    idx = idx[:10]
    for i in idx:
        topic = np.array(doc_topics[i])
        topic_ids = topic[:, 0].astype(int)
        topic_distribute = topic[:, 1]
        # argsort returns positions in the returned list; map them back to
        # topic ids, because topics below minimum_probability are omitted
        order = topic_distribute.argsort()[:-num_show_topic - 1:-1]
        topic_idx = topic_ids[order]
        print(('第%d个文档的前%d个主题：' % (i, num_show_topic)), topic_idx)
        print(topic_distribute[order])

    num_show_term = 7  # number of words shown per topic
    print('8.结果：每个主题的词分布：--')
    for topic_id in range(num_topics):
        print('主题#%d：\t' % topic_id)
        term_distribute_all = lda.get_topic_terms(topicid=topic_id)
        term_distribute = term_distribute_all[:num_show_term]
        term_distribute = np.array(term_distribute)
        # np.int was removed from recent NumPy releases; use the built-in int
        term_id = term_distribute[:, 0].astype(int)
        print('词：\t')
        for t in term_id:
            # dictionary[t] fills the lazily built id2token mapping if needed
            print(dictionary[t])
        print('\n概率：\t', term_distribute[:, 1])
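To make the dictionary and doc2bow steps in the script above more concrete, here is a minimal pure-Python sketch of the same idea: give every distinct token an integer id, then represent each document as (token id, count) pairs. The helper names `build_vocab` and `to_bow` are hypothetical, and the id assignment (order of first appearance) is an assumption made for illustration; it is not how gensim's `corpora.Dictionary` assigns ids internally.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two tiny, already segmented documents (illustrative data only)
texts = [['早餐', '牛肉', '早餐'], ['牛肉', '贷款']]
token2id = build_vocab(texts)
corpus = [to_bow(tokens, token2id) for tokens in texts]
print(corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The sparse (id, count) form is what the TF-IDF and LDA steps consume; word order within a document is discarded.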

Output:

1.初始化停止词列表 ------

2.开始读入语料数据 ------

Building prefix dict from the default dictionary ...

Loading model from cache /tmp/jieba.cache

Loading model cost 0.453 seconds.

Prefix dict has been built succesfully.

读入语料数据完成，用时20.850秒

文本数目：7201个

3.正在建立词典 ------

4.正在计算文本向量 ------

5.正在计算文档TF-IDF ------

建立文档TF-IDF完成，用时0.340秒

6.LDA模型拟合推断 ------

LDA模型完成，训练时间为 39.582秒

7.结果：10个文档的主题分布：--

第4776个文档的前10个主题： [12 29 28 1 2 3 4 5 6 7]

[0.58493376 0.01431263 0.01431263 0.01431263 0.01431263 0.01431263

0.01431263 0.01431263 0.01431263 0.01431263]

第4029个文档的前10个主题： [12 3 20 4 29 13 1 2 5 6]

[0.37388983 0.3025465 0.12476926 0.11684622 0.00315185 0.00315185

0.00315185 0.00315185 0.00315185 0.00315185]

第3246个文档的前10个主题： [ 3 5 25 4 29 13 1 2 6 7]

[0.37498331 0.2976712 0.19527724 0.05764455 0.00286245 0.00286245

0.00286245 0.00286245 0.00286245 0.00286245]

第4338个文档的前10个主题： [ 3 25 4 0 5 12 13 1 2 6]

[0.5332467 0.15297471 0.12927592 0.07833721 0.04124153 0.02898731

0.00149736 0.00149736 0.00149736 0.00149736]

第2161个文档的前10个主题： [ 3 25 5 4 27 0 23 12 1 2]

[0.38401735 0.21167417 0.14914817 0.10948798 0.04541766 0.03539393

0.02509075 0.00172913 0.00172913 0.00172913]

第4651个文档的前10个主题： [ 3 25 5 12 29 13 1 2 4 6]

[0.32238352 0.31823975 0.21561985 0.04819759 0.00367536 0.00367536

0.00367536 0.00367536 0.00367536 0.00367536]

第5238个文档的前10个主题： [12 3 25 5 27 29 13 1 2 4]

[0.27936012 0.26403227 0.1416378 0.1241198 0.08557552 0.00421098

0.00421098 0.00421098 0.00421098 0.00421098]

第4130个文档的前10个主题： [ 3 27 5 4 29 13 1 2 6 7]

[0.59734505 0.10868986 0.10651511 0.07392456 0.00436636 0.00436636

0.00436636 0.00436636 0.00436636 0.00436636]

第5269个文档的前10个主题： [25 3 4 27 0 13 1 2 5 6]

[0.343871 0.30449691 0.19992411 0.05840852 0.05270721 0.00162369

0.00162369 0.00162369 0.00162369 0.00162369]

第6264个文档的前10个主题： [25 12 3 5 27 6 29 13 1 2]

[0.33372545 0.19687143 0.16887571 0.10230102 0.07008811 0.04191673

0.00359256 0.00359256 0.00359256 0.00359256]

8.结果：每个主题的词分布：--

主题#0：

词：

早餐

牛肉

维生素

C

早安



导致

概率： [0.03386845 0.02483381 0.02025204 0.01660909 0.01363995 0.01068665

0.00706152]

主题#1：

词：

讲授

宛伊

春季班

浅秋

网课

航道

帮转

概率： [7.93909385e-06 7.93909385e-06 7.93909385e-06 7.93909385e-06

7.93909385e-06 7.93909385e-06 7.93909385e-06]

主题#2：

词：



小伙伴

流泪

辅导

宠儿

疯狂

微笑

概率： [1.29068419e-02 7.86041255e-06 7.84468011e-06 7.84414897e-06

7.84136773e-06 7.83974428e-06 7.83950691e-06]

主题#3：

词：

，

]

[

！

️

太阳

概率： [0.01985054 0.01532708 0.01354059 0.0135403 0.00976824 0.00958529

0.00929036]

主题#4：

词：

含有

预防

油脂

玫瑰

代谢

气血

高

概率： [0.02105247 0.01467578 0.00972506 0.00949167 0.00915031 0.00889741

0.00666286]

主题#5：

词：

本人

信息

工作

食用

联系电话

联系

求购

概率： [0.02339416 0.01594725 0.01259175 0.01181062 0.01179646 0.01147621

0.01118094]

主题#6：

词：

首付

E

拥抱

一厅

出租

跳跳

家电

概率： [0.00836604 0.008056 0.00526544 0.00428819 0.00402217 0.00337555

0.00293302]

主题#7：

词：

★

复制

三网

站

☆

下载

概率： [7.94328753e-06 7.94041080e-06 7.94032258e-06 7.94012067e-06

7.93994695e-06 7.93993877e-06 7.93991512e-06]

主题#8：

词：

110

码数

珍惜

靴

围巾

概率： [0.01233861 0.00504165 0.00441902 0.00437879 0.00392553 0.00380516

0.00368871]

主题#9：

词：

平米



下款



配合

服装

一批

概率： [2.61398800e-05 7.94631796e-06 7.94381049e-06 7.94274365e-06

7.94042626e-06 7.94038988e-06 7.94034531e-06]

主题#10：

词：

抽绳

吧

就是

到

298

外套

39.6

概率： [7.94010339e-06 7.94001789e-06 7.93999061e-06 7.93988329e-06

7.93971594e-06 7.93969502e-06 7.93959043e-06]

主题#11：

词：

刺激

趣

☕

閃電

高檔

抓

設計

概率： [7.93989784e-06 7.93967229e-06 7.93963409e-06 7.93953132e-06

7.93952859e-06 7.93952768e-06 7.93951222e-06]

主题#12：

词：

-

尺码

面料

新款

款

概率： [0.01297308 0.01196654 0.00738309 0.00686454 0.00652365 0.00635273

0.00621089]

主题#13：

词：

手串

极品

克

精品

重

mm

克价

概率： [7.93997333e-06 7.93986601e-06 7.93975596e-06 7.93964227e-06

7.93962499e-06 7.93960680e-06 7.93958588e-06]

主题#14：

词：

贷款

办理

自动

信用卡

软件

导航

刷卡

概率： [0.0075584 0.00740018 0.00605603 0.0055211 0.00507937 0.0047877

0.00400551]

主题#15：

词：

松茸

姬

十袋

、

豆包

降血糖

黏

概率： [7.93930394e-06 7.93929939e-06 7.93925938e-06 7.93925210e-06

7.93922936e-06 7.93922754e-06 7.93922663e-06]

主题#16：

词：

越

缺钱

先找

不放马

尽我所能

让开

骄纵

概率： [7.93957497e-06 7.93950130e-06 7.93943491e-06 7.93943309e-06

7.93943309e-06 7.93943309e-06 7.93943309e-06]

主题#17：

词：

批

跑步

鞋

高帮

Jordan

流泪

概率： [0.0087569 0.00830467 0.00424777 0.00316585 0.00305585 0.00290347

0.00265311]

主题#18：

词：

笑

坏

出单

散客

客源

父母

鹿鞭

概率： [7.11068418e-03 3.31070670e-03 1.58573002e-05 9.37107234e-06

8.43134512e-06 7.91026287e-06 7.89694332e-06]

主题#19：

词：

丈母娘

女儿

照料

用卡

她

困难

购点

概率： [7.93986328e-06 7.93965410e-06 7.93955314e-06 7.93948948e-06

7.93940035e-06 7.93935214e-06 7.93934214e-06]

主题#20：

词：

领取

兔绒

ML

咖色

概率： [1.21185910e-02 8.46080638e-06 7.86211695e-06 7.84995154e-06

7.84794338e-06 7.84665463e-06 7.84584790e-06]

主题#21：

词：

话费

充

补单

吸引

成功

记住

概率： [9.07428004e-03 2.24792189e-03 8.18135049e-06 7.97946359e-06

7.96390214e-06 7.91229650e-06 7.89194746e-06]

主题#22：

词：

┈

封号

┄

安卓

群

╰

打扰

概率： [0.02807738 0.00463936 0.00386185 0.00289498 0.00250433 0.00160384

0.00153386]

主题#23：

词：

cn

用户

http

注册

t

链接

下载

概率： [0.00543195 0.00511594 0.00499532 0.00436718 0.00374179 0.00372209

0.00302911]

主题#24：

词：

我手

残

佛为

儒为表

思在脑

技

大度

概率： [7.93967683e-06 7.93967502e-06 7.93964318e-06 7.93963864e-06

7.93963409e-06 7.93959862e-06 7.93957406e-06]

主题#25：

词：

..

、

，

。

：

概率： [0.03780913 0.02099271 0.01741591 0.01197653 0.01115799 0.01110726

0.01001806]

主题#26：

词：

讲授

宛伊

春季班

浅秋

网课

航道

帮转

概率： [7.93909385e-06 7.93909385e-06 7.93909385e-06 7.93909385e-06

7.93909385e-06 7.93909385e-06 7.93909385e-06]

主题#27：

词：

皮

美容

早

万人

商机

人们

女生

概率： [0.0233007 0.01607903 0.01494002 0.01140604 0.01007851 0.01006253

0.0097663 ]

主题#28：

词：

尺寸

蕾丝

牛仔

长款

大牌

礼盒

建议

概率： [0.00684191 0.00433312 0.00364425 0.00293131 0.00268158 0.00265364

0.00230511]

主题#29：

词：

楼

缩阴

宫颈

精准

阴道

丰胸

糜烂

概率： [0.00535257 0.00325915 0.0028789 0.00250742 0.00237429 0.00229252

0.0020758 ]
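Several topics in the output above (for example #1, #10, #13 and #26) list seven words whose probabilities are all close to 7.94e-06. That pattern suggests those topics are nearly uniform over the vocabulary and carry little signal. The sketch below is one rough way to flag such topics from the printed probabilities. The function name `is_flat_topic` and the `max_ratio` threshold of 1.01 are assumptions chosen for illustration, not part of gensim.

```python
def is_flat_topic(probs, max_ratio=1.01):
    """Return True when the largest and smallest listed probabilities are nearly equal."""
    lo, hi = min(probs), max(probs)
    return lo > 0 and hi / lo <= max_ratio


# Probabilities copied from the output above
topic_1 = [7.93909385e-06] * 7
topic_0 = [0.03386845, 0.02483381, 0.02025204, 0.01660909,
           0.01363995, 0.01068665, 0.00706152]

print(is_flat_topic(topic_1))  # True
print(is_flat_topic(topic_0))  # False
```

If many topics turn out flat, a smaller `num_topics` or more training passes may be worth trying.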

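The top 7 words of each topic show that the method achieves some degree of separation: for example, topic #14 groups loan and credit-card terms (贷款, 信用卡, 刷卡), and topic #28 groups clothing terms (蕾丝, 牛仔, 长款). Next steps: extend the stopword list and tune the model parameters.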
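As a starting point for the stopword step, the following sketch filters a segmented document against a stopword list. Punctuation such as "，" and "。" dominates topics #3 and #25 above, so removing it before building the dictionary is a natural first experiment. The helper name `remove_stopwords` and the sample stopword list are hypothetical.

```python
def remove_stopwords(tokens, stop_words):
    """Drop tokens that are stopwords or consist only of whitespace."""
    stop_set = set(stop_words)  # set membership is O(1) per token
    return [t for t in tokens if t.strip() and t not in stop_set]


# Illustrative data: a short stopword list and one segmented document
stop_words = ['，', '。', '、', '的']
tokens = ['早餐', '，', '牛肉', ' ', '的', '维生素', '。']
print(remove_stopwords(tokens, stop_words))  # ['早餐', '牛肉', '维生素']
```

In the script, this would be applied to the result of `jieba.lcut(text)` before `texts.append(words)`, using the list returned by `load_stopword()`.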
