Python Chinese word segmentation (pymmseg-cpp) and the garbled Chinese output problem

  pymmseg-cpp
http://code.google.com/p/pymmseg-cpp/

pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm, written in C++ with a Ruby interface.

Download the binary release on the right sidebar and copy the pymmseg directory to your Python's path (e.g. /usr/lib/python2.5/site-packages/). Here's an example of usage:

from pymmseg import mmseg

mmseg.dict_load_defaults()
text = # ...
algor = mmseg.Algorithm(text)
for tok in algor:
    print '%s [%d..%d]' % (tok.text, tok.start, tok.end)

Alternatively, you can download the source tarball or check out the latest code from the git repo hosted on GitHub. In that case you need to build the mmseg-cpp module yourself: go to the mmseg-cpp subdirectory and run the build.py script, which builds the native module for you.

For more information, refer to the README file.


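Many people run into garbled output. The likely cause is that mmseg works with UTF-8, while the default local encoding on Windows is cp936 (that is, GBK), so printing a UTF-8 string directly to the console comes out garbled.
Solution:
Transcode at the point where you print to the console. Write the print statement like this:
print myname.decode('UTF-8').encode('GBK')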


from pymmseg import mmseg

mmseg.dict_load_defaults()
text = # ...
algor = mmseg.Algorithm(text)
for tok in algor:
    print '%s [%d..%d]' % (tok.text.decode('UTF-8').encode('GBK'), tok.start, tok.end)
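
The transcoding step can also be tried on its own, without pymmseg installed. The following is a minimal sketch, assuming the tokens are UTF-8 byte strings as described above. The helper name utf8_to_console and its console_encoding parameter are hypothetical and not part of pymmseg. It uses only bytes.decode and the encode method of the resulting text, so it should behave the same way under Python 2.6+ and Python 3.3+.

```python
import sys


def utf8_to_console(raw, console_encoding=None):
    """Re-encode UTF-8 bytes into the console's encoding.

    Hypothetical helper for illustration only. If no encoding is given,
    fall back to what the interpreter reports for stdout, then to UTF-8.
    Characters the target encoding cannot represent become '?' instead of
    raising UnicodeEncodeError.
    """
    if console_encoding is None:
        console_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return raw.decode('utf-8').encode(console_encoding, 'replace')


# The two characters U+4E2D U+6587 ("Chinese text"), written with escapes
# so that this file's own encoding does not matter.
utf8_bytes = u'\u4e2d\u6587'.encode('utf-8')    # 6 bytes in UTF-8
gbk_bytes = utf8_to_console(utf8_bytes, 'gbk')  # 4 bytes in GBK

# What a GBK console effectively does with untranscoded UTF-8 bytes:
# it interprets them as GBK, which yields different characters.
garbled = utf8_bytes.decode('gbk', 'replace')
```

In the Python 2 setting of this post, the returned byte string can be passed straight to print, for example print utf8_to_console(tok.text, 'gbk'). Under Python 3, print takes text rather than bytes, so decoding the token with .decode('utf-8') is enough and the re-encoding step is no longer needed.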
