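4.7 - Multilingual Search, Chinese Word Segmentation, and Retrieval

Natural Language and Query Recall

When dealing with human natural language, there are cases where the search terms do not exactly match the original text, yet we still want the relevant content to be found
- Quick brown fox vs. fast brown fox / Jumping fox vs. Jumped foxes
Some optimizations that can help
- Token normalization: strip diacritics, so that a search for rôle also matches role
- Stemming: remove the differences between singular and plural forms and between tenses
- Include synonyms
- Spelling errors: handle misspellings and homophones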
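
The first item, token normalization, can be illustrated outside Elasticsearch in a few lines of Python. This is only a sketch built on the standard `unicodedata` module; the function name `fold_diacritics` is made up for this example. In Elasticsearch the same job is normally done by a token filter such as `asciifolding`.

```python
import unicodedata

def fold_diacritics(token: str) -> str:
    """Lowercase a token and strip combining marks (accents)."""
    # NFD splits "ô" into "o" plus U+0302 (combining circumflex).
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("rôle"))                              # role
print(fold_diacritics("Rôle") == fold_diacritics("role"))   # True
```

Applying the same function at index time and at query time is what makes the two spellings meet on one token.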

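Challenges of Mixing Multiple Languages

Some concrete multilingual scenarios
- Different indices use different languages / different fields in the same index use different languages / a single field of one document mixes several languages
Challenges that arise with mixed languages
- Stemming: documents from Israel, for example, can contain Hebrew, Arabic, Russian, and English, and each language needs its own stemming rules
- Incorrect document frequencies: in a corpus that is mostly English, German terms score higher because they are rare
- The language of the user's query has to be determined, which requires language identification (e.g. Compact Language Detector)
  - For example, query a different index depending on the detected language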
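
To make the "detect the language, then pick an index" idea concrete, here is a deliberately crude Python sketch. It guesses by Unicode script only, which is an assumption of this example and much weaker than a real detector such as Compact Language Detector (it cannot tell English from German, and Cyrillic is not only Russian). The index names are hypothetical.

```python
def detect_script(text: str) -> str:
    """Return a coarse language guess based on the dominant Unicode script."""
    counts = {"zh": 0, "he": 0, "ar": 0, "ru": 0, "en": 0}
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:        # CJK Unified Ideographs
            counts["zh"] += 1
        elif 0x0590 <= cp <= 0x05FF:      # Hebrew block
            counts["he"] += 1
        elif 0x0600 <= cp <= 0x06FF:      # Arabic block
            counts["ar"] += 1
        elif 0x0400 <= cp <= 0x04FF:      # Cyrillic block
            counts["ru"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    best = max(counts, key=counts.get)
    # Fall back to English when the text contains no letters at all.
    return best if counts[best] > 0 else "en"

# Hypothetical per-language index names.
INDEX_BY_LANG = {"zh": "blogs-zh", "he": "blogs-he", "ar": "blogs-ar",
                 "ru": "blogs-ru", "en": "blogs-en"}

def index_for_query(query: str) -> str:
    """Pick the index to search from the guessed language of the query."""
    return INDEX_BY_LANG[detect_script(query)]

print(index_for_query("中文分词"))          # blogs-zh
print(index_for_query("quick brown fox"))   # blogs-en
```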

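Challenges of Tokenization

English tokenization: should You're be one token or several? What about Half-baked?
Chinese word segmentation
- Segmentation standards: under the Harbin Institute of Technology (HIT) standard, family name and given name are split apart, while HanLP keeps them together. The standard has to be chosen to fit the use case
- Ambiguity (combinational ambiguity, overlapping ambiguity, true ambiguity)
  - 中华人民共和国 / 美国会通过对台售武法案 / 上海仁和服装厂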

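Evolution of Chinese Word Segmentation Methods – Dictionary-Based

Dictionary lookup: the most obvious segmentation method (proposed by Professor Liang Nanyuan of Beijing University of Aeronautics and Astronautics)
- Scan the sentence once from left to right and mark every word that is found in the dictionary; where a compound word is possible, take the longest match
- Character strings that are not recognized are split into single-character words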
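
The left-to-right, longest-match scan described above is usually called forward maximum matching. A minimal Python sketch follows; the toy dictionary is made up for illustration and is not taken from any real lexicon.

```python
def forward_max_match(sentence, dictionary, max_len=None):
    """Greedy left-to-right segmentation that always takes the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # An unknown single character is emitted as a one-character word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Toy dictionary, for illustration only.
TOY_DICT = {"上海", "大学", "大学城", "书店", "上海大学"}
print(forward_max_match("上海大学城书店", TOY_DICT))
# ['上海大学', '城', '书店']
```

The greedy scan commits to 上海大学 and can no longer find 大学城, which is exactly the kind of ambiguity a pure dictionary method cannot resolve.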
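Minimum word count theory: Dr. Wang Xiaolong of Harbin Institute of Technology formalized the dictionary-lookup method
- A sentence should be split into the smallest possible number of words
- It is helpless when a split is ambiguous (for example, "发展中国家" / "上海大学城书店")
- Attempts to resolve the ambiguity with various grammar rules were not successful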
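
The minimum word count criterion, and the way it stalls on ambiguous input, can be shown with a small dynamic program. This is an illustrative sketch, not Dr. Wang's formulation: it assumes single characters are always allowed as words, and the three-word dictionary is invented for the example.

```python
def min_word_segmentations(sentence, dictionary):
    """Return every segmentation of `sentence` that uses the fewest words.

    Single characters are always allowed so that unknown text can still be split.
    """
    n = len(sentence)
    # best[i] = (minimum word count for sentence[:i], segmentations achieving it)
    best = [(0, [[]])] + [(float("inf"), []) for _ in range(n)]
    for end in range(1, n + 1):
        for start in range(end):
            word = sentence[start:end]
            if len(word) > 1 and word not in dictionary:
                continue
            count = best[start][0] + 1
            if count < best[end][0]:
                # Strictly fewer words: replace the candidates.
                best[end] = (count, [seg + [word] for seg in best[start][1]])
            elif count == best[end][0]:
                # Tie: keep both alternatives.
                best[end][1].extend(seg + [word] for seg in best[start][1])
    return best[n][1]

# Toy dictionary, for illustration only.
TOY_DICT = {"发展", "中国", "国家"}
for seg in min_word_segmentations("发展中国家", TOY_DICT):
    print(" / ".join(seg))
```

With this dictionary both 发展 / 中 / 国家 and 发展 / 中国 / 家 use three words, so the criterion alone cannot choose between them.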

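Evolution of Chinese Word Segmentation Methods – Statistics-Based Machine Learning

Statistical language models: around 1990, Dr. Guo Jin of the Department of Electronic Engineering at Tsinghua University
- Addressed the ambiguity problem with a statistical language model and reduced the Chinese segmentation error rate by an order of magnitude. Segmentation becomes a probability problem: dynamic programming with the Viterbi algorithm quickly finds the most probable segmentation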
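
As a rough illustration of "probability plus dynamic programming", the sketch below scores each candidate word with a unigram probability and uses a Viterbi-style search to find the most probable segmentation. It is a simplification under stated assumptions, not a reconstruction of Dr. Guo's model: the word counts are hypothetical, and unknown single characters get a crude fixed pseudo-count.

```python
import math

# Hypothetical word counts; a real model would estimate these from a large corpus.
TOY_COUNTS = {"发展": 50, "中": 30, "国家": 40, "中国": 60, "家": 10}

def viterbi_segment(sentence, counts, max_len=4):
    """Find the segmentation with the highest unigram probability."""
    total = sum(counts.values())
    n = len(sentence)
    # score[i] = best log-probability of sentence[:i]; back[i] = start of its last word
    score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in counts:
                logp = math.log(counts[word] / total)
            elif len(word) == 1:
                # Unknown single character: crude smoothing with a pseudo-count of 0.5.
                logp = math.log(0.5 / total)
            else:
                continue
            if score[start] + logp > score[end]:
                score[end] = score[start] + logp
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]

print(viterbi_segment("发展中国家", TOY_COUNTS))
# ['发展', '中', '国家']
```

With these toy counts, 发展 / 中 / 国家 beats 发展 / 中国 / 家 because 家 on its own is rare, so the tie left by the minimum word count criterion is broken by probability.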

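Statistics-based machine learning algorithms
- The algorithms commonly used today include HMM, CRF, SVM, and deep learning. HanLP, for example, provides a CRF-based segmenter. Taking CRF as the example, the basic idea is to train a model that labels each Chinese character. It considers not only how often words occur but also their context, and it learns well, so it handles both ambiguous words and out-of-vocabulary words effectively.
- With the rise of deep learning, neural-network segmenters have also appeared. Some have built segmenters with a bidirectional LSTM + CRF, which is still sequence labeling at its core; the character-level accuracy of such a segmenter is reported to reach 97.5%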
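
"Labeling each character" means that the model predicts one tag per character and the words are rebuilt from the tags. The sketch below assumes the widely used B/M/E/S tag set (begin, middle, end, single); the notes do not say which tag set the tools above use, so treat that as an assumption. It shows only the conversion between words and tags, not the CRF or LSTM that predicts the tags.

```python
def words_to_bmes(words):
    """Turn a segmented sentence into per-character tags (training labels)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Rebuild words from per-character B/M/E/S tags (decoding a prediction)."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # E closes a multi-character word; S is a complete one-character word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Tolerate a sequence that ends in B or M by flushing the partial word.
        words.append(current)
    return words

tags = words_to_bmes(["中文", "分词器", "好"])
print(tags)                                  # ['B', 'E', 'B', 'M', 'E', 'S']
print(bmes_to_words("中文分词器好", tags))   # ['中文', '分词器', '好']
```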

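Current State of Chinese Segmenters

Chinese segmenters are built on statistical language models, and after decades of development word segmentation can now largely be regarded as a solved problem
The differences in quality between segmenters come mainly from the data they use and the precision of the engineering implementation
Common segmenters combine machine learning algorithms with dictionaries, which improves segmentation accuracy on the one hand and domain adaptability on the other.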

Some Chinese Segmenters

HanLP – a natural language processing toolkit for production environments
- http://hanlp.com/
- https://github.com/KennFalcon/elasticsearch-analysis-hanlp
IK analyzer
- https://github.com/medcl/elasticsearch-analysis-ik

HanLP Analysis

HanLP
- ./elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip


IK Analysis

IK
- ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
Features
- Supports hot reloading of dictionaries


Pinyin Analysis

Pinyin
- ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.1.0/elasticsearch-analysis-pinyin-7.1.0.zip


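Chinese Word Segmentation Demo

Test the results with different analyzers
At index time, split the text into tokens that are as fine-grained as possible; at query time, use tokens that are as long as possible
Pinyin analyzer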
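
One common way to apply the "fine-grained at index time, coarse at query time" advice with the IK plugin is to pair `ik_max_word` as the index analyzer with `ik_smart` as the search analyzer. The Python sketch below only builds the request body for creating such an index and prints it; it does not talk to a cluster. The field name `title` and the helper name `build_index_body` are placeholders chosen for this example.

```python
import json

def build_index_body(field="title",
                     index_analyzer="ik_max_word",
                     search_analyzer="ik_smart"):
    """Build the body for `PUT /<index>` using an Elasticsearch 7.x typeless mapping."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    # Fine-grained tokens when documents are indexed.
                    "analyzer": index_analyzer,
                    # Coarser, longer tokens when the query string is analyzed.
                    "search_analyzer": search_analyzer,
                }
            }
        }
    }

print(json.dumps(build_index_body(), indent=2))
```

The printed JSON can be pasted into the console as the body of a `PUT` request for a new index, assuming the IK plugin is installed.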

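Section Recap

Challenges of multilingual search
Tokenization / language detection / relevance scoring
Some techniques used for multilingual search in Elasticsearch
Token normalization / stemming / synonyms / spelling errors
The evolution of Chinese word segmentation, and an introduction to some Chinese analyzers and the pinyin analyzer for ES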

Course demo

来到杨过曾经生活过的地方，小龙女动情地说：“我也想过过过儿过过的生活。”
你也想犯范范玮琪犯过的错吗
校长说衣服上除了校徽别别别的
这几天天天天气不好
我背有点驼，麻麻说“你的背得背背背背佳”

#stop word

DELETE my_index
PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }


POST my_index/_search
{
  "query": {
    "match": {
      "title": "not happy fox"
    }
  }
}


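# Using the `english` analyzer makes matching looser and improves recall, but it reduces the ability to rank exact matches precisely. To get the best of both, use multi-fields to index the title field twice: once with the `english` analyzer and once with the `standard` analyzer: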

DELETE my_index

# Elasticsearch 7.x: mappings are typeless and "text" replaces the old "string" type
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

# Recreate the index; the main field uses the default standard analyzer,
# and the title.english sub-field uses the english analyzer
DELETE my_index

PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": {
            "type":     "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}


PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}


# Install the IK plugin
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
# Install the HanLP plugin
bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip




#ik_max_word
#ik_smart
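#hanlp: HanLP default segmentation
#hanlp_standard: standard segmentation
#hanlp_index: index segmentation
#hanlp_nlp: NLP segmentation
#hanlp_n_short: N-shortest-path segmentation
#hanlp_dijkstra: shortest-path segmentation
#hanlp_crf: CRF segmentation (deprecated since HanLP 1.6.6)
#hanlp_speed: high-speed dictionary segmentation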

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": ["剑桥分析公司多位高管对卧底记者说，他们确保了唐纳德·特朗普在总统大选中获胜"]

}     

#Pinyin
PUT /artists/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}


GET /artists/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}

Related Resources

Elasticsearch IK analysis plugin https://github.com/medcl/elasticsearch-analysis-ik/releases

Elasticsearch HanLP analysis plugin https://github.com/KennFalcon/elasticsearch-analysis-hanlp

A survey of word segmentation algorithms https://zhuanlan.zhihu.com/p/50444885

Some segmentation tools, for reference:

NLPIR, Institute of Computing Technology, Chinese Academy of Sciences http://ictclas.nlpir.org/nlpir/

ansj segmenter https://github.com/NLPchina/ansj_seg

LTP, Harbin Institute of Technology https://github.com/HIT-SCIR/ltp

THULAC, Tsinghua University https://github.com/thunlp/THULAC

Stanford Word Segmenter https://nlp.stanford.edu/software/segmenter.shtml

HanLP segmenter https://github.com/hankcs/HanLP

Jieba segmenter https://github.com/yanyiwu/cppjieba

KCWS segmenter (character embeddings + Bi-LSTM + CRF) https://github.com/koth/kcws

ZPar https://github.com/frcchang/zpar/releases

IKAnalyzer https://github.com/wks/ik-analyzer
