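Custom Named Entities with OpenNLP

Preface

This article looks at how to define custom named entities with OpenNLP: annotating training data, training a model, and applying the model.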

maven

        
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.8.4</version>
        </dependency>

Practice

Training the model

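        import java.util.Collections;
        import opennlp.tools.namefind.BioCodec;
        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.NameSample;
        import opennlp.tools.namefind.NameSampleDataStream;
        import opennlp.tools.namefind.TokenNameFinderFactory;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.util.ObjectStream;
        import opennlp.tools.util.PlainTextByLineStream;
        import opennlp.tools.util.TrainingParameters;

        // train the name finder; each line holds one annotated entity
        String typedEntities = "<START:organization> NATO <END>\n" +
                "<START:location> United States <END>\n" +
                "<START:organization> NATO Parliamentary Assembly <END>\n" +
                "<START:location> Edinburgh <END>\n" +
                "<START:location> Britain <END>\n" +
                "<START:person> Anders Fogh Rasmussen <END>\n" +
                "<START:location> U . S . <END>\n" +
                "<START:person> Barack Obama <END>\n" +
                "<START:location> Afghanistan <END>\n" +
                "<START:person> Rasmussen <END>\n" +
                "<START:location> Afghanistan <END>\n" +
                "<START:date> 2010 <END>";
        // MockInputStreamFactory is a helper from OpenNLP's test sources;
        // any InputStreamFactory that serves the training text works here.
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new MockInputStreamFactory(typedEntities), "UTF-8"));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
                params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));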

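OpenNLP uses <START> and <END> tags to annotate custom entities in the training text. To give the entity a type, write the type after START, separated by a colon, for example <START:person> Barack Obama <END>.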
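
To make the annotation format concrete, here is a minimal sketch that builds and reads back annotated lines using only the Java standard library. It is not part of OpenNLP: the class `AnnotationFormatSketch`, its methods, and the `untyped` placeholder are hypothetical names chosen for this illustration, and the regular expression only covers the simple whitespace-separated form shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hypothetical helper, not an OpenNLP class.
class AnnotationFormatSketch {

    // Matches "<START:type> tokens <END>"; the ":type" part is optional.
    private static final Pattern ENTITY =
            Pattern.compile("<START(?::([^>\\s]+))?>\\s+(.+?)\\s+<END>");

    /** Wraps whitespace-tokenized text in START/END tags for the given type. */
    static String annotate(String type, String text) {
        return "<START:" + type + "> " + text.trim() + " <END>";
    }

    /** Returns "type=text" for every annotated entity found in the line. */
    static List<String> entities(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(line);
        while (m.find()) {
            // "untyped" is a placeholder of this sketch for a bare <START> tag.
            String type = m.group(1) == null ? "untyped" : m.group(1);
            found.add(type + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = annotate("person", "Barack Obama") + " visited "
                + annotate("location", "Afghanistan") + " .";
        System.out.println(line);
        System.out.println(entities(line));
    }
}
```

Note the spaces around the tags and between tokens: the training text is expected to be whitespace-tokenized, as in "U . S ." in the sample above.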

Parameter notes

ALGORITHM_PARAM

Selects the training algorithm; the example uses MAXENT (maximum entropy). On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well.

ITERATIONS_PARAM

The number of training iterations (70 in the example). On the command line, this option is ignored if -params is used.

CUTOFF_PARAM

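The minimum number of times a feature must be seen in the training data. The example sets it to 1, so a feature that occurs only once is still kept.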
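
The effect of a cutoff can be sketched in a few lines of plain Java. This is only an illustration of the idea described above, assuming that a feature is kept when its count is at least the cutoff; it is not OpenNLP's trainer, and `CutoffSketch`, `keep`, and the `w=...` feature strings are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: counts feature occurrences and applies a cutoff.
class CutoffSketch {

    /** Returns the features seen at least `cutoff` times, in sorted order. */
    static List<String> keep(List<String> features, int cutoff) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String feature : features) {
            counts.merge(feature, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("w=NATO", "w=United", "w=NATO", "w=Obama");
        System.out.println(keep(features, 1)); // every feature survives
        System.out.println(keep(features, 2)); // only features seen twice or more
    }
}
```

With a training sample as small as the one in this article, a higher cutoff would discard most features, which is a plausible reason for choosing 1 here.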

Using the model

Once the model has been trained, it can be used to find entities in text.

        import java.util.stream.Collectors;
        import java.util.stream.IntStream;
        import java.util.stream.Stream;
        import opennlp.tools.util.Span;

        NameFinderME nameFinder = new NameFinderME(nameFinderModel);

        // now test if it can detect the sample sentences
        String[] sentence = "NATO United States Barack Obama".split("\\s+");

        // each Span covers the tokens from getStart() (inclusive) to getEnd() (exclusive)
        Span[] names = nameFinder.find(sentence);

        Stream.of(names)
                .forEach(span -> {
                    String named = IntStream.range(span.getStart(), span.getEnd())
                            .mapToObj(i -> sentence[i])
                            .collect(Collectors.joining(" "));
                    System.out.println("find type: " + span.getType() + ",name: " + named);
                });

The output is as follows:

find type: organization,name: NATO
find type: location,name: United States
find type: person,name: Barack Obama
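
The mapping from token indices to the printed names can be reproduced without a trained model. The sketch below uses a hypothetical `TokenSpan` class as a stand-in for OpenNLP's `Span` (it is not the real class) and assumes the same half-open convention: the start index is included and the end index is not.

```java
import java.util.Arrays;

// Illustration only: TokenSpan is a stand-in for opennlp.tools.util.Span.
class SpanSketch {

    static final class TokenSpan {
        final int start; // index of the first token (inclusive)
        final int end;   // index after the last token (exclusive)
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins the tokens covered by the span with single spaces. */
    static String text(String[] tokens, TokenSpan span) {
        return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
    }

    public static void main(String[] args) {
        String[] sentence = "NATO United States Barack Obama".split("\\s+");
        TokenSpan[] spans = {
            new TokenSpan(0, 1, "organization"),
            new TokenSpan(1, 3, "location"),
            new TokenSpan(3, 5, "person")
        };
        for (TokenSpan span : spans) {
            System.out.println("find type: " + span.type + ",name: " + text(sentence, span));
        }
    }
}
```

The span (1, 3) therefore covers exactly two tokens, "United" and "States", which is why the loop in the article runs from `getStart()` up to but not including `getEnd()`.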

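Summary

OpenNLP's custom entity annotation leaves room for customization: developers can define the named entities specific to their own domain and so improve the accuracy with which those entities are recognized.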

doc

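opennlp-1.8.4-docs
Chinese named entity recognition with OpenNLP (Part 1: preprocessing and training the model)
Chinese named entity recognition with OpenNLP (Part 2: loading the model and recognizing entities)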