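This article is reposted from https://www.jiqizhixin.com/articles/2019-03-04-8. It is reposted here to share information more widely; copyright belongs to the original author or source organization.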
1. textfilter: sensitive-word filtering for Chinese and English
https://github.com/observerss/textfilter
2. langid: language detection covering 97 languages
https://github.com/saffsd/langid.py
3. langdetect: another language-detection library
https://code.google.com/archive/p/language-detection/
4. phone: lookup of the home region of international mobile and landline numbers
https://github.com/AfterShip/phone
6. ngender: guesses gender from a name, using probabilities computed with Naive Bayes
https://github.com/observerss/ngender
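To illustrate the general idea behind a Naive Bayes gender guess, here is a minimal sketch. It is not ngender's actual code: the training names, the function names `train` and `guess`, and the add-one smoothing are all assumptions made for illustration, and the toy data is invented.

```python
import math
from collections import Counter, defaultdict

# Toy, invented training data (hypothetical): (given name, gender label).
TRAIN = [
    ("伟强", "male"), ("志强", "male"), ("建军", "male"), ("伟军", "male"),
    ("丽娟", "female"), ("秀英", "female"), ("丽英", "female"), ("秀娟", "female"),
]

def train(samples):
    """Count how often each character appears in names of each gender."""
    class_counts = Counter()
    char_counts = defaultdict(Counter)
    vocab = set()
    for name, label in samples:
        class_counts[label] += 1
        for ch in name:
            char_counts[label][ch] += 1
            vocab.add(ch)
    return class_counts, char_counts, vocab

def guess(name, model):
    """Return (most likely label, posterior probability) for a given name."""
    class_counts, char_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        # Add-one smoothing; the extra +1 reserves mass for unseen characters.
        denom = sum(char_counts[label].values()) + len(vocab) + 1
        for ch in name:
            score += math.log((char_counts[label][ch] + 1) / denom)
        log_scores[label] = score
    # Turn log scores into normalized probabilities.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

model = train(TRAIN)
print(guess("伟", model))
```

With this toy data, `guess("伟", model)` favors "male" with a probability of about 0.75. A real model would be trained on a large name corpus, which is what makes the probabilities meaningful.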
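7. Regular expression for extracting Chinese ID card numbers
import re
# Non-capturing groups make findall return whole matches; lookarounds replace ^...$ so IDs inside longer text are found.
IDCards_pattern = r'(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?!\d)'
IDs = re.findall(IDCards_pattern, text)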
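As a usage illustration, the sketch below extracts 18-digit ID numbers from running text and then filters them by the check digit. The check-digit step is an addition that is not part of the original snippet; it follows the weighting scheme of the national ID standard as I understand it. The names `ID_PATTERN`, `checksum_ok`, and `extract_ids` are hypothetical, and the pattern is an unanchored variant so that it can match inside longer text.

```python
import re

# Unanchored variant of the pattern: lookarounds keep the match from starting
# or ending in the middle of a longer digit run.
ID_PATTERN = re.compile(
    r'(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])'
    r'(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?!\d)'
)

# Weights for the first 17 digits and the check characters indexed by (sum mod 11).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

def checksum_ok(id_number):
    """Return True if the 18th character matches the computed check character."""
    total = sum(int(d) * w for d, w in zip(id_number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[-1].upper()

def extract_ids(text):
    """Find ID-shaped strings in text and keep those with a valid check digit."""
    return [m for m in ID_PATTERN.findall(text) if checksum_ok(m)]

# Sample text: one well-formed ID, one with a wrong check digit, one phone number.
sample = "ID: 11010519491231002X, bad: 110105194912310021, phone: 13800138000"
print(extract_ids(sample))
```

The regular expression only checks the shape of the number (region prefix, birth date ranges, sequence digits), so impossible dates such as February 31 still pass; the check-digit filter catches many, though not all, mistyped numbers.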
8. Chinese abbreviation dataset
https://github.com/zhangyics/Chinese-abbreviation-dataset/blob/master/dev_set.txt
9. Chinese character decomposition dictionary
https://github.com/kfcd/chaizi
10. Word sentiment scores
https://github.com/rainarch/SentiBridge/blob/master/Entity_Emotion_Express/CCF_data/pair_mine_result
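11. Chinese word lists, stop words, and sensitive words. This package sorts its sensitive words into finer categories: a reactionary word list, sensitive word list statistics, a violence and terrorism word list, a livelihood word list, and a pornography word list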
https://github.com/fighting41love/Chinese_from_dongxiexidian
12. Chinese character to pinyin conversion
https://github.com/mozillazg/python-pinyin
13. Synonym, antonym, and negation word lists
https://github.com/guotong1988/chinese_dictionary
14. Splitting English strings that lack spaces into individual words
https://github.com/keredson/wordninja
15. Word lists compiled by THU, covering IT, finance, idioms, place names, historical figures, poetry, medicine, food, law, automobiles, and animals
http://thuocl.thunlp.org/sendMessage
16. Baidu Chinese question-answering dataset
Link:
https://pan.baidu.com/s/1QUsKcFWZ7Tg1dk_AbldZ1A
Extraction code: 2dva
17. BERT resources
Text classification in practice
https://github.com/NLPScott/bert-Chinese-classification-task
BERT Tutorial: a text classification tutorial
https://github.com/Socialbird-AILab/BERT-Classification-Tutorial
BERT implementation in PyTorch
https://github.com/huggingface/pytorch-pretrained-BERT
BERT for Chinese named entity recognition, TensorFlow version
https://github.com/macanv/BERT-BiLSTM-CRF-NER
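Kashgari, a Keras-based classification and sequence-labeling framework that wraps BERT; a classification or sequence-labeling model can be set up in a few minutes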
https://github.com/BrikerMan/Kashgari
Illustrated guide to BERT and ELMo
https://jalammar.github.io/illustrated-bert/
BERT: Pre-trained models and downstream applications
https://github.com/asyml/texar/tree/master/examples/bert