涓枃鍒嗚瘝椤圭洰鎬荤粨

1) ICTCLAS

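One of the earliest open-source Chinese word segmentation projects, developed by Zhang Huaping and Liu Qun at the Institute of Computing Technology, Chinese Academy of Sciences. It is written in C/C++, and its algorithm is based on the paper 《基于多层隐马模型的汉语词法分析研究》 (Chinese lexical analysis based on a multi-layer hidden Markov model). The open-source version is FreeICTCLAS; the latest API version is the NLPIR/ICTCLAS2014 segmentation system. (The predecessor of NLPIR is the ICTCLAS lexical analysis system released in 2000. Starting in 2009 it was renamed the NLPIR segmentation system, both to mark a clear break from the earlier work and to promote the NLPIR shared platform for natural language processing and information retrieval.)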

FreeICTCLAS source code:

https://github.com/hecor/ICTCLAS-2009-free

https://github.com/pierrchen/ictclas_plus (ICTCLAS Version 1.0 for Linux)

http://download.csdn.net/detail/shinezlee/1535796

http://www.codeforge.cn/article/106151

NLPIR/ICTCLAS2014 API download:

http://ictclas.nlpir.org/downloads

Other versions:

(a) A C# version, rewritten by Lü Zhenyu (吕震宇) from the open-source C++ code of FreeICTCLAS.

Download:

https://github.com/smartbooks/SharpICTCLAS (original version)

https://github.com/geekfivestart/SharpICTCLAS (version with multithreading support)

(b) For help understanding the ICTCLAS and SharpICTCLAS source code, see:

http://www.cnblogs.com/zhenyulu/articles/653254.html

http://sewm.pku.edu.cn/QA/reference/ICTCLAS/FreeICTCLAS/codes.html

(c) ictclas4j is an open-source Java segmentation project that sinboy built on top of FreeICTCLAS; it reduces the complexity of the original segmenter.

Download:

http://sourceforge.net/projects/ictclas4j/

https://code.google.com/p/ictclas4j/

(d) Calling ICTCLAS from Python

For calling NLPIR (ICTCLAS2013) from Python, see:

http://ictclas.nlpir.org/newsDetail?DocId=382

For a Python wrapper for ICTCLAS 2015, see:

https://github.com/haobibo/ICTCLAS_Python_Wrapper

https://github.com/tsroten/pynlpir (written by a developer outside China; documentation at http://pynlpir.rtfd.org)

2) MMSEG

Uses Chih-Hao Tsai's MMSEG algorithm (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). MMSEG offers two segmentation methods, both built on forward maximum matching: Simple (forward maximum matching only) and Complex (three-word chunk maximum matching plus 3 additional rules to resolve ambiguities, so Complex filters candidates with four rules in total).

Source code download:

http://technology.chtsai.org/mmseg/
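The Simple method described above can be sketched in a few lines. This is a minimal illustration of forward maximum matching, not Tsai's implementation: the function name, the `max_word_len` parameter, and the toy lexicon are all assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, made up for illustration.
LEXICON = {"研究", "研究生", "生命", "的", "起源"}

print(forward_max_match("研究生命的起源", LEXICON))
# ['研究生', '命', '的', '起源']
```

The greedy scan grabs 研究生 first and leaves 命 stranded, although 研究 / 生命 / 的 / 起源 is the intended reading. Ambiguities of this kind are what the additional rules of the Complex method are meant to resolve.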

娉細

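(a) LibMMSeg is a Chinese word segmentation package that Coreseek.com designed for the Sphinx full-text search engine. It is released under the GPL and also uses Chih-Hao Tsai's MMSEG algorithm. LibMMSeg is written in C++ and supports both Linux and Windows.

Source code download: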

http://www.coreseek.cn/opensource/mmseg/

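(b) friso is a Chinese word segmenter written in C that implements the popular MMSEG algorithm. It segments UTF-8 and GBK encoded text and comes with a PHP extension and a Sphinx token plugin.

It has three segmentation modes: (1) simple mode, the FMM (forward maximum matching) algorithm; (2) complex mode, the four MMSEG filtering rules; (3) detect mode, which returns only entries already present in the lexicon.

Source code download: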

https://code.google.com/p/friso/

http://git.oschina.net/lionsoul/friso

(c) MMSEG4J is an open-source Java Chinese word segmentation component based on the MMSEG algorithm; it provides Lucene and Solr interfaces.

Source code download:

https://code.google.com/p/mmseg4j/

(d) RMMSeg is written in pure Ruby. It is an implementation of the MMSEG word segmentation algorithm, which is based on two variants of the maximum matching algorithm.

Source code download:

http://rmmseg.rubyforge.org/

(e) rmmseg-cpp is a rewrite of the original RMMSeg gem in C++; the core is written in C++ and is independent of Ruby. It is much faster and consumes much less memory than RMMSeg. The interface of rmmseg-cpp is almost identical to that of RMMSeg.

Source code download:

http://rmmseg-cpp.rubyforge.org/

https://github.com/pluskid/rmmseg-cpp/

(f) pymmseg-cpp is a Python interface to rmmseg-cpp.

Source code download:

https://github.com/pluskid/pymmseg-cpp/

https://code.google.com/p/pymmseg-cpp/

3) IKAnalyzer

IKAnalyzer鏄竴涓紑婧愬熀浜巎ava璇█寮�鍙戠殑杞婚噺绾х殑涓枃鍒嗚瘝宸ュ叿鍖呫�備粠2006骞�12鏈堟帹鍑�1.0鐗堝紑濮嬶紝IKAnalyzer宸茬粡鎺ㄥ嚭 浜�3涓ぇ鐗堟湰銆傛渶鍒濓紝瀹冩槸浠ュ紑婧愰」鐩甃uence涓哄簲鐢ㄤ富浣撶殑锛岀粨鍚堣瘝鍏稿垎璇嶅拰鏂囨硶鍒嗘瀽绠楁硶鐨勪腑鏂囧垎璇嶇粍浠躲�傛柊鐗堟湰IKAnalyzer3.0閲囩敤浜嗙壒鏈夌殑鈥滄鍚戣凯浠f渶缁嗙矑搴﹀垏鍒嗙畻娉曗��,宸插彂灞曚负闈㈠悜Java鐨勫叕鐢ㄥ垎璇嶇粍浠讹紝鐙珛浜嶭ucene椤圭洰锛屽悓鏃舵彁渚涗簡瀵筁ucene鐨勯粯璁や紭鍖栧疄鐜般��

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

https://code.google.com/p/ik-analyzer/

https://github.com/yozhao/IKAnalyzer

4) FNLP (FudanNLP)

FudanNLP涓昏鏄负涓枃鑷劧璇█澶勭悊鑰屽紑鍙戠殑宸ュ叿鍖�(鐜板凡鏇村悕涓篎NLP)锛屽姛鑳藉寘鍚俊鎭绱紙鏂囨湰鍒嗙被銆佹柊闂昏仛绫伙級锛屼腑鏂囧鐞嗭紙涓枃鍒嗚瘝銆佽瘝鎬ф爣娉ㄣ�佸疄浣撳悕璇嗗埆銆佸叧閿瘝鎶藉彇銆佷緷瀛樺彞娉曞垎鏋� 鏃堕棿鐭璇嗗埆锛夛紝缁撴瀯鍖栧涔狅紙鍦ㄧ嚎瀛︿範銆佸眰娆″垎绫汇�佽仛绫伙級銆備粠鍔熻兘鐨勮搴﹁�岃█锛孎NLP涓庤憲鍚嶇殑Python鑷劧璇█澶勭悊宸ュ叿鍖匩LTK杈冧负绫讳技锛屼絾鍚庤�呭涓枃澶勭悊鐨勮兘鍔涜緝宸�侳NLP閲囩敤Java缂栧啓锛屽彲杞绘澗杩愯鍦ㄥ悇绉嶄笉鍚岀殑骞冲彴涔嬩笂銆�

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

https://github.com/xpqiu/fnlp/

5) NiuParser

涓枃鍙ユ硶璇箟鍒嗘瀽绯荤粺NiuParser鏀寔涓枃鍙ュ瓙绾х殑鑷姩鍒嗚瘝銆佽瘝鎬ф爣娉ㄣ�佸懡鍚嶅疄浣撹瘑鍒�佺粍鍧楄瘑鍒�佹垚鍒嗗彞娉曞垎鏋愩�佷緷瀛樺彞娉曞垎鏋愬拰璇箟瑙掕壊鏍囨敞涓冨ぇ璇█鍒嗘瀽鎶�鏈�傛墍鏈変唬鐮侀噰鐢–++璇█寮�鍙戯紝涓嶅寘鍚换浣曞叾瀹冨紑婧愪唬鐮併�侼iuParser绯荤粺鍙互鍏嶈垂鐢ㄤ簬鐮旂┒鐩殑锛屼絾鍟嗕笟鐢ㄩ�旈渶鑾峰緱鍟嗕笟鎺堟潈璁稿彲銆�

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

http://www.niuparser.com/index.en.html

6) LTP

璇█鎶�鏈钩鍙帮紙Language Technology Platform锛孡TP锛夋槸鎻愪緵鍖呮嫭涓枃鍒嗚瘝銆佽瘝鎬ф爣娉ㄣ�佸懡鍚嶅疄浣撹瘑鍒�佷緷瀛樺彞娉曞垎鏋愩�佽涔夎鑹叉爣娉ㄧ瓑涓板瘜銆� 楂樻晥銆佺簿鍑嗙殑鑷劧璇█澶勭悊鎶�鏈�侺TP鍒跺畾浜嗗熀浜嶺ML鐨勮瑷�澶勭悊缁撴灉琛ㄧず锛屽苟鍦ㄦ鍩虹涓婃彁渚涗簡涓�鏁村鑷簳鍚戜笂鐨勪赴瀵岃�屼笖楂樻晥鐨勪腑鏂囪瑷�澶勭悊妯″潡锛堝寘鎷瘝娉曘�佸彞娉曘�佽涔夌瓑6椤逛腑鏂囧鐞嗘牳蹇冩妧鏈級锛屼互鍙婂熀浜庡姩鎬侀摼鎺ュ簱锛圖ynamic Link Library, DLL锛夌殑搴旂敤绋嬪簭鎺ュ彛銆佸彲瑙嗗寲宸ュ叿锛屽苟涓旇兘澶熶互缃戠粶鏈嶅姟锛圵eb Service锛夌殑褰㈠紡杩涜浣跨敤銆�

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

https://github.com/HIT-SCIR/ltp

娉細

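(a) LTP's word segmentation module (LTP-CWS) is built on the structured perceptron algorithm and supports user-defined dictionaries to suit different users' needs. It also adds customized (incremental) training: when users need to segment text from a new domain, for example, they can annotate the segmentation of a small number of sentences themselves (such as by correcting LTP's output), and the LTP segmentation module can retrain a segmenter that copes better with the new domain, further improving segmentation accuracy there.

Source code download: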

https://github.com/HIT-SCIR/ltp-cws

(b) pyltp is the Python wrapper for LTP.

Source code download:

https://github.com/HIT-SCIR/pyltp

7) Ansj Chinese word segmentation

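A Java implementation of Chinese word segmentation based on a Google semantic model plus a conditional random field (CRF) model. It implements Chinese word segmentation, Chinese personal-name recognition, and user-defined dictionaries. Ansj is a Java implementation based on the ICTCLAS tool, with essentially all of the data structures and algorithms rewritten. It uses the dictionary from the open-source ICTCLAS release, with some manual tuning.

Source code download: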

https://github.com/NLPchina/ansj_seg

8) jieba Chinese word segmentation

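jieba ("结巴") is a Chinese word segmentation component for Python. It supports three segmentation modes: (a) accurate mode, which tries to cut the sentence as precisely as possible and suits text analysis; (b) full mode, which scans out every substring of the sentence that can form a word and is very fast but cannot resolve ambiguity; (c) search-engine mode, which starts from accurate mode and splits long words again to raise recall, making it suitable for search-engine indexing. jieba also supports Traditional Chinese text and user-defined dictionaries.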

绠楁硶涓昏鍖呮嫭锛氬熀浜嶵rie鏍戠粨鏋勫疄鐜伴珮鏁堢殑璇嶅浘鎵弿锛岀敓鎴愬彞瀛愪腑姹夊瓧鏋勬垚鐨勬湁鍚戞棤鐜浘锛圖AG)锛涢噰鐢ㄤ簡璁板繂鍖栨悳绱㈠疄鐜版渶澶ф鐜囪矾寰勭殑璁$畻, 鎵惧嚭鍩轰簬璇嶉鐨勬渶澶у垏鍒嗙粍鍚堬紱瀵逛簬鏈櫥褰曡瘝锛岄噰鐢ㄤ簡鍩轰簬姹夊瓧浣嶇疆姒傜巼鐨勬ā鍨嬶紝浣跨敤浜哣iterbi绠楁硶銆�

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

https://github.com/fxsjy/jieba
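The first two steps of the algorithm described above (the word DAG and the maximum-probability path) can be sketched in plain Python. This is an illustration under stated assumptions and not jieba's own code: the frequency table is made up, a plain dict stands in for the trie, and the Viterbi step for out-of-vocabulary words is left out entirely.

```python
import math

# Toy word frequencies, made up for illustration (not jieba's real dictionary).
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 200, "起源": 30,
        "研": 2, "究": 2, "生": 10, "起": 5, "源": 3}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in FREQ]
        # Fall back to the single character so every position has an edge.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag):
    """Right-to-left dynamic programming: best (log-probability, end index) per start."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def cut(sentence):
    """Follow the best route from the first character to the end of the sentence."""
    route = best_route(sentence, build_dag(sentence))
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print(cut("研究生命的起源"))
# ['研究', '生命', '的', '起源']
```

In jieba itself the three modes are exposed as `jieba.cut(sentence)` for accurate mode, `jieba.cut(sentence, cut_all=True)` for full mode, and `jieba.cut_for_search(sentence)` for search-engine mode.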

娉細

(a) For how the model data is generated, see:

https://github.com/fxsjy/jieba/issues/7

(b) CppJieba is the C++ version of jieba Chinese word segmentation; for code details, see:

https://github.com/yanyiwu/cppjieba

(c) cppjiebapy is a SWIG wrapper for cppjieba. To call cppjieba from Python, see:

https://github.com/jannson/cppjiebapy

(d) For study notes on jieba segmentation, see:

http://segmentfault.com/a/1190000004061791

9) HanLP

HanLP鏄敱涓�绯诲垪妯″瀷涓庣畻娉曠粍鎴愮殑Java姹夎瑷�澶勭悊宸ュ叿鍖咃紝鎻愪緵涓枃鍒嗚瘝銆佽瘝鎬ф爣娉ㄣ�佸懡鍚嶅疄浣撹瘑鍒�佷緷瀛樺彞娉曞垎鏋愩�佸叧閿瘝鎻愬彇銆佽嚜鍔ㄦ憳瑕併�佺煭璇彁鍙栥�佹嫾闊炽�佺畝绻佽浆鎹㈢瓑瀹屽鐨勫姛鑳姐�侰RFSegment鏀寔鑷畾涔夎瘝鍏革紝鑷畾涔夎瘝鍏哥殑浼樺厛绾ч珮浜庢牳蹇冭瘝鍏搞��

婧愪唬鐮佷笅杞藉湴鍧�涓猴細

http://hanlp.linrunsoft.com/

https://github.com/hankcs/HanLP

10) BosonNLP

BosonNLP鏄竴瀹跺垵鍒涘叕鍙告彁渚涚殑API SDK璋冪敤鎺ュ彛銆傚姛鑳藉寘鎷細Tokenization and part of speech

tagging, named-entity recognition, tokenization and compute word weight,

automatic detection of opinions embodied in text, work out the

grammatical structure of sentences, categorization

the given articles, Get relative words.

API涓嬭浇鍦板潃涓猴細

https://github.com/liwenzhu/bosonnlp

11) Pullword online word extraction

Pullword鏄案涔呭厤璐圭殑鍩轰簬娣卞害瀛︿範鐨勪腑鏂囧湪绾挎娊璇�

API璋冪敤Pullword锛屽寘鍚玴ython,R绛夎瑷�锛屽弬瑙侊細

http://api.pullword.com/

12) Sogou online segmentation

Sogou's online segmenter uses a character-tagging approach to segmentation, built mainly on a linear-chain CRF model. Its part-of-speech tagging module is based mainly on a structured linear model.

Online service:

http://www.sogou.com/labs/webservice/
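In a character-tagging approach, the model labels each character with its position inside a word, and the words are then read off the label sequence. The sketch below assumes the common B/M/E/S scheme (begin, middle, end, single); the source does not say which tag set Sogou uses, and the function names are made up for illustration. The tagging model itself, the CRF, is not shown.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from the characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "M"):
            current += ch
        else:
            # "E" closes the open word; "S" is a one-character word.
            words.append(current + ch)
            current = ""
    if current:
        # Tolerate a tag sequence that ends in the middle of a word.
        words.append(current)
    return words


tags = words_to_tags(["研究", "生命", "的", "起源"])
print(tags)
# ['B', 'E', 'B', 'E', 'S', 'B', 'E']
print(tags_to_words("研究生命的起源", tags))
# ['研究', '生命', '的', '起源']
```

Framing segmentation this way turns it into a sequence-labeling problem, which is the kind of task a linear-chain CRF is designed for.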

13) THULAC

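THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit open-sourced by Tsinghua University. Its main functions are Chinese word segmentation and part-of-speech tagging. The toolkit uses a re-ranking method based on a word lattice.

Source code download: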

http://thulac.thunlp.org

Bonus:

(1) CRF training tools for word segmentation:

CRFsuite (http://www.chokkan.org/software/crfsuite/)

CRF++ (http://taku910.github.io/crfpp/)

wapiti (https://github.com/Jekub/Wapiti) or (https://wapiti.limsi.fr/)

chinesesegmentor (https://github.com/fancyerii/chinesesegmentor) or (http://fancyerii.github.io/sgdcrf/index.html)

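CRF decoder contains the segmentation decoder from the CRF++ package. It simplifies the complex code structure of CRF++ and strips out the code the decoder does not need, which makes the decoder much easier to read and understand. Download: http://sourceforge.net/projects/crfdecoder/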

(2) 涓枃鍒嗚瘝鍣ㄥ垎璇嶆晥鏋滆瘎浼板姣旓紝鍙傝锛�

https://github.com/ysc/cws_evaluation
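Segmenters are commonly compared by word-level precision, recall, and F1 against a hand-segmented reference. The sketch below shows that standard calculation; it is an assumption that this matches what cws_evaluation reports, and the function names are made up for illustration.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_words, predicted_words):
    """A predicted word counts as correct only if both of its boundaries match the reference."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["研究", "生命", "的", "起源"]
predicted = ["研究生", "命", "的", "起源"]
print(precision_recall_f1(gold, predicted))
# (0.5, 0.5, 0.5)
```

Comparing character spans instead of word strings keeps a repeated word from being credited at the wrong position.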

(3)涓枃璇嶅吀寮�婧愰」鐩紙CC-CEDICT锛�

涓�浠戒互姹夎鎷奸煶涓轰腑鏂囪緟鍔╃殑姹夎嫳杈炲吀,鍙敤浜庝腑鏂囧垎璇嶄娇鐢紝鑰屼笖涓嶅瓨鍦ㄧ増鏉冮棶棰樸�侰hrome涓枃鐗堝氨鏄娇鐢ㄧ殑杩欎釜璇嶅吀杩涜涓枃鍒嗚瘝鐨勩��

鏁版嵁鍙婃枃妗d笅杞藉湴鍧�涓猴細

http://www.mdbg.net/chindict/chindict.php?page=cedict

http://cc-cedict.org/wiki/
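To use CC-CEDICT as a segmentation lexicon, the headwords have to be pulled out of the data file. Each entry line has the form `Traditional Simplified [pin1 yin1] /definition 1/definition 2/`, and comment lines start with `#`. The sketch below parses that format; the function names and the sample lines are made up for illustration.

```python
import re

# Traditional, simplified, bracketed pinyin, then slash-delimited definitions.
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_cedict_line(line):
    """Parse one CC-CEDICT line; return None for comments, blanks, and malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": definitions.split("/"),
    }


def load_lexicon(lines):
    """Collect the simplified headwords, e.g. as the word list for a dictionary-based segmenter."""
    entries = (parse_cedict_line(line) for line in lines)
    return {entry["simplified"] for entry in entries if entry}


# Sample lines written for this example, in the CC-CEDICT layout.
sample = [
    "# CC-CEDICT",
    "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/",
    "分詞 分词 [fen1 ci2] /word segmentation/",
]
print(load_lexicon(sample))
```

With the real data file, `load_lexicon(open(path, encoding="utf-8"))` would work the same way, where `path` points to the downloaded and unpacked dictionary file.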
