Using Chinese Word Vectors with spaCy

spaCy already supports Chinese word segmentation via jieba. This post tries loading a set of pre-trained word vectors into spaCy and then tests word similarity based on those vectors.
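As a quick sanity check that the Chinese pipeline is wired up, here is a minimal sketch (it assumes jieba is installed, e.g. via pip install jieba):

import spacy

# a blank Chinese pipeline; in spaCy 2.0 its tokenizer delegates
# word segmentation to jieba, so jieba must be installed
nlp = spacy.blank('zh')
doc = nlp('您好,你好')
print([token.text for token in doc])  # shows how jieba segmented the text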

Unlike earlier versions, spaCy 2.0 changed the load/save interfaces for word vectors and models, so the test here uses spaCy 2.0, with a script adapted from the official examples:

loadvector.py:

#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy

import spacy
from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to .vec file", "positional", None, str),
    lang=("Optional language ID. If not set, blank Language() will be used.",
          "positional", None, str))
def main(vectors_loc, lang=None):
    if lang is None:
        nlp = Language()
    else:
        # create empty language class – this is required if you're planning to
        # save the model to disk and load it back later (models always need a
        # "lang" setting). Use 'xx' for blank multi-language class.
        nlp = spacy.blank(lang)
    with open(vectors_loc, 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        print(nr_row, nr_dim)

        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    # test the vectors and similarity
    text = '您好,你好'
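    # with the Chinese (jieba) tokenizer the text is segmented into word tokens;
    # doc[0] and doc[2] are intended to be the two greetings '您好' and '你好'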
    doc = nlp(text)
    print(text, doc[0].similarity(doc[2]))
    nlp.to_disk("./zh_model")


if __name__ == '__main__':
    plac.call(main)

Run the script with the path to the word vector file. The vectors here were trained with fastText and saved in the plain-text (non-binary) .vec format, which looks like this (a header line, then one word plus its vector per line):

17484 100
的 -0.093497 -0.16838 -0.31183 0.18158 0.14234 0.047932 -0.17727 -0.11675 0.037068 -0.090361 -0.24306 0.096267 -0.17542 0.17559 -0.012545 0.1336 0.13552 -0.10716 0.10519 -0.076989 -0.11632 -0.14894 -0.099211 -0.068264 -0.16019 -0.20795 0.10994 -0.19069 -0.070186 -0.10722 0.056536 0.037165 -0.16839 -0.2232 -0.42118 -0.25819 0.086529 0.18487 -0.044813 0.07809 0.15395 0.096284 -0.0054108 0.09635 0.045701 0.11826 0.02093 -0.061605 -0.069395 0.098948 -0.093462 -0.10125 -0.14047 0.12453 -0.1935 0.12049 -0.25669 0.08099 -0.086279 0.23138 -0.097905 -0.19973 0.34899 0.11208 -0.025583 -0.11361 -0.23792 -0.32146 0.25924 -0.013813 -0.2467 0.039815 -0.073362 -0.31727 -0.050605 0.075048 -0.25274 0.18276 0.097259 0.17918 -0.052097 -0.24945 -0.034484 -0.093092 0.095478 -0.017527 0.03188 0.035525 -0.26906 0.016382 -0.22175 -0.061337 -0.15386 -0.12769 0.115 0.082991 -0.055711 0.033301 0.0084445 0.1597
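Before loading everything, it can help to check that the header's row count and dimension match what the parser in loadvector.py expects. A minimal sketch (the path below is just the example file used in the run command that follows):

# quick sanity check of the .vec header against the first data row
with open('./tests/data/mobile.vec', 'rb') as f:
    nr_row, nr_dim = f.readline().split()
    pieces = f.readline().rstrip().decode('utf8').rsplit(' ', int(nr_dim))
    print('rows:', int(nr_row), 'dims:', int(nr_dim))
    print('first word:', pieces[0], 'values:', len(pieces) - 1)  # should equal the dim count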

Run the script, specifying Chinese (zh) as the language:

python loadvector.py ./tests/data/mobile.vec zh

This creates a model under zh_model:

zh_model
├── meta.json
├── tokenizer
└── vocab
    ├── key2row
    ├── lexemes.bin
    ├── strings.json
    └── vectors
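meta.json holds the model's metadata, including the 'lang' setting that spacy.load() checks. A quick way to look at what was written:

import json

# inspect the metadata spaCy wrote for the saved model;
# spacy.load() expects a valid 'lang' entry here
with open('./zh_model/meta.json') as f:
    print(json.load(f))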

After that, the model can be used by calling spacy.load("./zh_model").
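For example, reloading the saved model and repeating the similarity check (this assumes a spaCy version that already includes the 'lang' fix described in the summary below, or the manual workaround shown there):

import spacy

# reload the model saved by loadvector.py and repeat the similarity test
nlp = spacy.load('./zh_model')
doc = nlp('您好,你好')
print([t.text for t in doc])
print(doc[0].similarity(doc[-1]))  # similarity between the first and last token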

Summary:

When calling spacy.load("./zh_model"), the version current at the time of writing throws:

ValueError: No valid 'lang' setting found in model meta.json
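Until a release containing the fix described below is available, one manual workaround (my own stopgap, not the upstream fix) is to add the 'lang' entry to the saved meta.json before loading:

import json

# stopgap: write a 'lang' entry into meta.json so spacy.load() accepts the model
meta_path = './zh_model/meta.json'
with open(meta_path) as f:
    meta = json.load(f)
meta.setdefault('lang', 'zh')
with open(meta_path, 'w') as f:
    json.dump(meta, f, ensure_ascii=False)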

The cause is that spaCy's Chinese support did not write the 'lang' information into meta.json. I implemented that part following the other language classes and submitted an issue and a pull request to spaCy; to my surprise, it was merged the very next day. To close, a line from the spaCy contributor guidelines:

You don't have to be an NLP expert or Python pro to contribute, and we're happy to help you get started.