jieba is an excellent third-party library for Chinese word segmentation
- Chinese text must be segmented to obtain individual words
- jieba is a third-party library and has to be installed separately
- the jieba library provides three segmentation modes, and the simplest usage requires only one function
Install from the cmd command line: pip install jieba  or  easy_install jieba
C:\Users\lenovo>easy_install jieba
Searching for jieba
Reading https://pypi.python.org/simple/jieba/
Downloading https://files.pythonhosted.org/packages/71/46/c6f9179f73b818d5827202ad1c4a94e371a29473b7f043b736b4dab6b8cd/jieba-0.39.zip#sha256=de385e48582a4862e55a9167334d0fbe91d479026e5dac40e59e22c08b8e883e
Best match: jieba 0.39
Processing jieba-0.39.zip
Writing C:\Users\lenovo\AppData\Local\Temp\easy_install-o02rlo5j\jieba-0.39\setup.cfg
Running jieba-0.39\setup.py -q bdist_egg --dist-dir C:\Users\lenovo\AppData\Local\Temp\easy_install-o02rlo5j\jieba-0.39\egg-dist-tmp-9zp6cf8i
zip_safe flag not set; analyzing archive contents...
jieba.__pycache__._compat.cpython-37: module references __file__
jieba.analyse.__pycache__.tfidf.cpython-37: module references __file__
creating d:\python37\lib\site-packages\jieba-0.39-py3.7.egg
Extracting jieba-0.39-py3.7.egg to d:\python37\lib\site-packages
Adding jieba 0.39 to easy-install.pth file
Installed d:\python37\lib\site-packages\jieba-0.39-py3.7.egg
Processing dependencies for jieba
Finished processing dependencies for jieba
(1) jieba segmentation relies on a Chinese dictionary
- it uses a Chinese word dictionary to determine the association probability between characters
- characters with a high association probability are grouped into words, which form the segmentation result
- besides segmentation, users can also add their own custom words
(2) The three segmentation modes of jieba (see the table and the short sketch below)
- precise mode: splits the text exactly into words, with no redundant tokens
- full mode: scans out every possible word in the text, with redundancy
- search-engine mode: on top of precise mode, long words are split again into shorter words, suitable for search-engine indexing
Function | Description |
jieba.lcut(s) | Precise mode: returns the segmentation result as a list, e.g. >>> jieba.lcut("中国是一个伟大的国家") |
jieba.lcut(s, cut_all=True) | Full mode: returns the segmentation result as a list, with redundancy, e.g. >>> jieba.lcut("中国是一个伟大的国家", cut_all=True) |
jieba.lcut_for_search(s) | Search-engine mode: returns the segmentation result as a list, with redundancy, e.g. >>> jieba.lcut_for_search("中华人民共和国是一个伟大的国家!") |
jieba.add_word(w) | Adds the new word w to the segmentation dictionary, e.g. >>> jieba.add_word("蟒蛇语言") |
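A minimal runnable sketch of the three modes plus add_word (a sketch, assuming jieba is installed; the exact token lists may vary with the jieba version and dictionary, and the last example sentence is an assumption):

import jieba
s = "中国是一个伟大的国家"
print(jieba.lcut(s))                                          # precise mode, e.g. ['中国', '是', '一个', '伟大', '的', '国家']
print(jieba.lcut(s, cut_all=True))                            # full mode, redundant tokens such as '国是' may appear
print(jieba.lcut_for_search("中华人民共和国是一个伟大的国家"))  # search-engine mode, long words are split again
jieba.add_word("蟒蛇语言")                                     # register a custom word
print(jieba.lcut("蟒蛇语言很有趣"))                             # the custom word is now kept as one token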
Examples: word-frequency statistics
- English text: Hamlet, analyze word frequencies
https://python123.io/resources/pye/hamlet.txt
- Chinese text: Romance of the Three Kingdoms (《三国演义》), analyze the characters
https://python123.io/resources/pye/threekingdoms.txt
(1) Hamlet
#CalHamletV1.py
def getText():
    txt = open(r"C:\Users\lenovo\Desktop\hamlet.txt","r").read()   # the r prefix keeps the backslashes in the path from being treated as escapes
    txt = txt.lower()                                              # convert all letters to lowercase
    for ch in '!#$%^&*()_"+./<>=;:,-~`?@[]{}\\|':
        txt = txt.replace(ch," ")                                  # replace punctuation with spaces
    return txt

hamletTxt = getText()                        # read and normalize the file
words = hamletTxt.split()                    # split on whitespace into a list of words
counts = {}                                  # dictionary mapping word -> count
for word in words:
    counts[word] = counts.get(word,0) + 1    # get() returns the current count, defaulting to 0
items = list(counts.items())                 # convert to a list of (word, count) pairs
items.sort(key=lambda x:x[1],reverse=True)   # sort by count, descending
for i in range(10):                          # print the ten most frequent words
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))
Output:
the 1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436
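For comparison, the same counting and top-10 selection can be written with collections.Counter from the standard library (a sketch, not part of the original notes; it reuses hamletTxt from getText() above):

from collections import Counter
counts = Counter(hamletTxt.split())            # count every word in one pass
for word, count in counts.most_common(10):     # most_common returns (word, count) pairs sorted by count
    print("{0:<10}{1:>5}".format(word, count))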
(2) Romance of the Three Kingdoms (三国演义)
import jieba
txt = open(r"C:\Users\lenovo\Desktop\三国演义.txt","r",encoding="gb18030").read()
words = jieba.lcut(txt)                      # segment the whole novel in precise mode
counts = {}
for word in words:
    if len(word) == 1:                       # skip single characters (mostly function words and punctuation)
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)   # sort by frequency, descending
for i in range(15):                          # print the fifteen most frequent words
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 2.325 seconds.
Prefix dict has been built succesfully.
曹操 953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358
The first run still mixes in words that are not character names (将军, 却说, ...) and splits the same person across several aliases (孔明/孔明曰, 玄德/玄德曰), so a second version merges the aliases into one canonical name and removes the noise words:

import jieba
txt = open(r"C:\Users\lenovo\Desktop\三国演义.txt","r",encoding="gb18030").read()
excludes = {"将军","却说","荆州","二人","不可","不能","如此","左右"}   # frequent words that are not character names
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1    # count under the canonical name
for word in excludes:
    del counts[word]                           # drop the excluded non-name words
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for i in range(10):
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))
Output (the exact counts depend on the alias-merging rules and the jieba version):
孔明 1391
曹操 963
张飞 366
商议 353
如何 344
主公 338
军士 320
吕布 303
军马 297
赵云 282
Extensions: government work reports, research papers, news articles, word clouds, ...
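For the word-cloud extension, a minimal sketch using the third-party wordcloud library (an assumption, not covered in the notes above; the font path and output file name are placeholders, and a font with Chinese glyphs is required to render Chinese text):

# pip install wordcloud
import jieba
from wordcloud import WordCloud

txt = open(r"C:\Users\lenovo\Desktop\三国演义.txt","r",encoding="gb18030").read()
words = " ".join(jieba.lcut(txt))              # wordcloud expects space-separated tokens
wc = WordCloud(font_path="msyh.ttc",           # placeholder: path to a font that supports Chinese
               width=800, height=600,
               background_color="white")
wc.generate(words)
wc.to_file("threekingdoms_wordcloud.png")      # placeholder output file name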
While opening the file, an error occurred:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte
Text encoding and decoding problems come up often in Python, and the decoding error above is one of the most common. The fixes below also apply when 'utf-8' is replaced by 'gbk' (a combined sketch follows the list).
(1) First, set the encoding explicitly when opening the file, e.g. open("1.txt", encoding="gbk");
(2) If (1) does not help, the text probably contains special characters outside the gbk range; switch to the broader "gb18030" encoding, e.g. open("1.txt", encoding="gb18030");
(3) If (2) still fails, the text contains characters that even gb18030 cannot decode; skip them with errors="ignore", e.g. open("1.txt", encoding="gb18030", errors="ignore");
(4) Another common approach is to read the raw bytes and decode them by hand, e.g. open("1.txt","rb").read().decode("gb18030","ignore").
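The fallback chain above can be wrapped into one small helper (a sketch; the helper name read_text and the encoding order are assumptions, not part of the original notes):

def read_text(path):
    # try progressively more permissive encodings, in the order described above
    for enc in ("utf-8", "gbk", "gb18030"):
        try:
            return open(path, "r", encoding=enc).read()
        except UnicodeDecodeError:
            continue
    # last resort: gb18030 while ignoring the bytes it cannot decode
    return open(path, "r", encoding="gb18030", errors="ignore").read()

txt = read_text(r"C:\Users\lenovo\Desktop\三国演义.txt")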