Stanford NLP: Chinese Word Segmentation (segmenter) and Chinese Named Entity Recognition (NER)

1. Downloading the toolkits

Word segmenter: https://nlp.stanford.edu/software/segmenter.shtml

Named entity recognizer (NER): https://nlp.stanford.edu/software/CRF-NER.shtml

Note: you need to download both stanford-ner-2012-11-11-chinese.zip and stanford-corenlp-full-2017-06-09.zip.

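2. Project setup

Place the model files from the downloaded packages in the project root directory and add the jar files to the project's classpath. The code in this post expects the segmenter data under data/ and the NER model under classifiers/.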
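
A missing model file tends to surface only as a load failure at startup, so it can help to check the layout first. The following is a minimal sketch, not part of the original demo: the three relative paths are the ones the demo code in this post loads, while the class name ModelFileCheck and the method missingFiles are hypothetical names introduced for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ModelFileCheck {
    // Relative paths loaded by the demo code in this post.
    static final String[] REQUIRED = {
        "data/dict-chris6.ser.gz",
        "data/ctb.gz",
        "classifiers/chinese.misc.distsim.crf.ser.gz"
    };

    /** Returns the required paths that are not regular files under the given root. */
    static List<String> missingFiles(Path root) {
        List<String> missing = new ArrayList<>();
        for (String rel : REQUIRED) {
            if (!Files.isRegularFile(root.resolve(rel))) {
                missing.add(rel);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Check relative to the current working directory (assumed to be the project root).
        List<String> missing = missingFiles(Paths.get("."));
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing model files: " + missing);
        }
    }
}
```

Run it from the project root before starting the demos; an empty list means the paths used below should resolve.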

3. Stanford NLP segmenter reference code:

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class ZH_SegDemo {
    public static CRFClassifier<CoreLabel> segmenter;
    static {
        // Initialization parameters for the segmenter
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        // segmentString returns a List<String>; casting toArray() to String[]
        // would throw a ClassCastException at runtime, so join the list directly.
        List<String> words = segmenter.segmentString(sent);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        try {
            // Read the input explicitly as UTF-8 rather than the platform default
            String text = FileUtils.readFileToString(new File("a.txt"), "UTF-8");
            System.out.println("segmented res: " + doSegment(text));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
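
The demo above reads the whole file into one string and segments it in a single call. For longer inputs it may be more convenient to segment line by line. The sketch below is one way to do that under stated assumptions: LineSegmenter, segmentLines and segmentFile are hypothetical names, and the tokenizer is passed in as a plain function so the helper itself has no dependency on the Stanford jars.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class LineSegmenter {
    /** Applies the tokenizer to each non-blank line and joins its tokens with single spaces. */
    static List<String> segmentLines(List<String> lines, Function<String, List<String>> tokenizer) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines instead of passing them to the tokenizer
            }
            out.add(String.join(" ", tokenizer.apply(line)));
        }
        return out;
    }

    /** Reads a UTF-8 text file and segments it line by line. */
    static List<String> segmentFile(Path file, Function<String, List<String>> tokenizer) throws IOException {
        return segmentLines(Files.readAllLines(file, StandardCharsets.UTF_8), tokenizer);
    }
}
```

With the classifier from the demo, a call such as LineSegmenter.segmentFile(Paths.get("a.txt"), ZH_SegDemo.segmenter::segmentString) should work, since segmentString takes a String and returns a list of tokens.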

4. Named entity recognition (NER): the text must be segmented first, then passed to the recognizer

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class ExtractDemo {
    // Shared across instances so the model is loaded only once
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public ExtractDemo() {
        initNer();
    }

    public void initNer() {
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    public String doNer(String sent) {
        // Returns the input with recognized entities wrapped in inline XML tags
        return ner.classifyWithInlineXML(sent);
    }

    public static void main(String[] args) {
        // Input must already be segmented (tokens separated by spaces)
        String str = "北海 已 成为 中国 对外开放 中 升起 的 一 颗 明星";
        ExtractDemo extractDemo = new ExtractDemo();
        System.out.println(extractDemo.doNer(str));
        System.out.println("Complete!");
    }
}
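
classifyWithInlineXML returns the sentence with entities wrapped in inline tags, so a small post-processing step is usually needed to get the entities out as data. The following is a minimal sketch using only the JDK: InlineXmlEntities and extract are hypothetical names, and the GPE tag and the tagged sample string are assumed for illustration rather than taken from an actual run of the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineXmlEntities {
    // Matches <TAG>text</TAG>, where the closing tag repeats the opening tag name.
    private static final Pattern ENTITY = Pattern.compile("<([A-Za-z_]+)>(.*?)</\\1>");

    /** Returns each tagged span as "TAG:text", in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            // A multi-token entity may keep the spaces from segmentation; remove them.
            entities.add(m.group(1) + ":" + m.group(2).replace(" ", ""));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical tagged output; the tag name and labeling are assumptions.
        String tagged = "<GPE>北海</GPE> 已 成为 <GPE>中国</GPE> 对外开放 中 升起 的 一 颗 明星";
        System.out.println(extract(tagged));
    }
}
```

Stripping the spaces suits Chinese text, where the segmenter inserted them; for space-delimited languages the replace call would be dropped.
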
The demo has been uploaded to GitHub.
