ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline

The code that triggers the error:

from spacy.lang.en import English
from tqdm import tqdm

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def normalize(text):
    text = text.lower().strip()
    doc = nlp(text)
    filtered_sentences = []
    for sentence in tqdm(doc.sents):  # the error is raised here
        filtered_sentences.append(sentence.text)
    return ' '.join(filtered_sentences)

Error:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Cause:

This is currently a limitation of the sentencizer, because the is_sentenced property is based on whether the Token.is_sent_start properties were changed. However, for the first token in a sentence, this will always default to True. So if the sentence only contains one token, there's no way for spaCy to tell whether the sentence boundaries have been set or not.
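A minimal sketch of that failure mode, assuming an affected spaCy 2.1.x install (e.g. 2.1.3): with a one-token input, the sentencizer's output is indistinguishable from the defaults, so iterating doc.sents raises E030 even though the component ran.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

doc = nlp("hello")        # a single-token text
for sent in doc.sents:    # raises ValueError [E030] on affected versions
    print(sent.text)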

As a workaround, you could trick spaCy into ignoring this by setting doc.is_parsed = True, i.e. by making it believe that the dependency parse was assigned and sentence boundaries were applied this way.
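A sketch of that workaround applied to the normalize() function above. Note that doc.is_parsed is writable only in spaCy 2.x (the attribute was removed in v3), so treat this as version-specific:

def normalize(text):
    text = text.lower().strip()
    doc = nlp(text)
    doc.is_parsed = True  # trick spaCy into believing sentence boundaries were set by the parser
    filtered_sentences = []
    for sentence in doc.sents:
        filtered_sentences.append(sentence.text)
    return ' '.join(filtered_sentences)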


Fix: this is a spaCy version problem; downgrade from 2.1.3 to 2.1.0:

pip uninstall spacy 

pip install spacy==2.1.0

Damn, this problem nearly killed me.
