Peking University Open-Sources a New Chinese Word Segmentation Toolkit: Accuracy Far Exceeds THULAC and jieba (Repost)

https://www.jianshu.com/p/3d9cd356da1a
https://www.jianshu.com/p/528e46284cbc

(nlp) spring@ubuntu18:~$ pip install pkuseg
Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Collecting pkuseg
  Downloading https://mirrors.aliyun.com/pypi/packages/36/d8/2cd2d21fc960815d4bb521e1e2e2f725c0e4d1ab88cefa4c73520cd84825/pkuseg-0.0.22-cp36-cp36m-manylinux1_x86_64.whl (50.2MB)
     |████████████████████████████████| 50.2MB 1.9MB/s 
Requirement already satisfied: numpy>=1.16.0 in ./anaconda3/envs/nlp/lib/python3.6/site-packages (from pkuseg) (1.17.4)
Installing collected packages: pkuseg
Successfully installed pkuseg-0.0.22
(nlp) spring@ubuntu18:~$ python
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pkuseg
>>> seg = pkuseg.pkuseg()
>>> text = seg.cut('我爱杭州西湖')
>>> print(text)
['我', '爱', '杭州', '西湖']
>>> text = seg.cut('我叫马化腾,我想学区块链,你说好不好啊,天青色等烟雨,而我在 等你,月色被打捞器,晕开了结局')
>>> text
['我', '叫', '马化腾', ',', '我', '想', '学区', '块链', ',', '你', '说', '好不', '好', '啊', ',', '天青色', '等', '烟雨', ',', '而', '我', '在', '等', '你', ',', '月色', '被', '打捞器', ',', '晕开', '了', '结局']
>>> lexicon = ['区块链','好不好', '天青色']
>>> seg = pkuseg.pkuseg(user_dict=lexicon)
>>> text = seg.cut('我叫马化腾,我想学区块链,你说好不好啊,天青色等烟雨,而我在 等你,月色被打捞器,晕开了结局')
>>> text
['我', '叫', '马化腾', ',', '我', '想', '学', '区块链', ',', '你', '说', '好不好', '啊', ',', '天青色', '等', '烟雨', ',', '而', '我', '在', '等', '你', ',', '月色', '被', '打捞器', ',', '晕开', '了', '结局']
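For readers who would rather not retype the interactive session, the same workflow can be collected into a short standalone script. This is a minimal sketch that uses only the calls shown in the transcript above (pkuseg.pkuseg, the user_dict argument, and cut); the file name demo_pkuseg.py is just an illustrative choice.

# demo_pkuseg.py -- minimal sketch consolidating the session above.
# Assumes pkuseg is installed (pip install pkuseg); only calls shown
# in the transcript are used.
import pkuseg

sentence = '我叫马化腾,我想学区块链,你说好不好啊,天青色等烟雨,而我在等你,月色被打捞器,晕开了结局'

# Default model: domain words such as 区块链 are split incorrectly
# (学区 / 块链), and 好不好 is broken into 好不 / 好.
seg = pkuseg.pkuseg()
print(seg.cut(sentence))

# Supplying a user dictionary keeps those words intact.
lexicon = ['区块链', '好不好', '天青色']
seg = pkuseg.pkuseg(user_dict=lexicon)
print(seg.cut(sentence))

With the user dictionary in place, 区块链, 好不好 and 天青色 come out as single tokens, matching the second run in the transcript.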
