Study notes on "Python Natural Language Processing" by Jalaj Thanaki: 04 Preprocessing

04 Preprocessing

  • 4.1 Handling raw corpus text
    • 4.1.1 Getting raw text
    • 4.1.2 Lowercase conversion
    • 4.1.3 Sentence splitting
    • 4.1.4 Stemming raw text
    • 4.1.5 Lemmatizing raw text
    • 4.1.6 Stop-word removal
  • 4.2 Handling raw corpus sentences
    • 4.2.1 Tokenization
    • 4.2.2 Word lemmatization
  • 4.3 Basic preprocessing
    • 4.3.1 Regular expressions
    • 4.3.2 Basic-level regular expressions
    • 4.3.3 Advanced-level regular expressions
  • 4.4 Practical and customized preprocessing
    • 4.4.1 Deciding by yourself
    • 4.4.2 Preprocessing pipeline
    • 4.4.3 Types of preprocessing
    • 4.4.4 Understanding case studies of preprocessing
  • 4.5 Summary

4.1 Handling raw corpus text

4.1.1 Getting raw text

The following are the data sources:

  • A raw text file

  • Raw data text defined inside the script as a local variable

  • Any available corpus from nltk
    Accessing a raw text file: a .txt file saved on the local machine contains data in paragraph form. Read the file's contents and, as the next step, run a sentence tokenizer over the loaded content to extract its sentences.

Defining raw data text as a local variable in the script: if we only have a small amount of data, we can assign it to a local string variable, for example: text = "This is a sentence."

Using an available corpus from nltk: import an available corpus, such as the Brown corpus or the Gutenberg corpus, from NLTK and load its content. Three functions are defined:

fileread(): reads the contents of a file
localtextvalue(): loads the locally defined text
readcorpus(): reads the Gutenberg corpus content

import nltk
from nltk.corpus import gutenberg as cg
from nltk.tokenize import sent_tokenize as st

# Read the contents of the file
def fileread():
    with open("rawtextcorpus.txt", "r") as f:
        file_contents = f.read()
    return file_contents

# Load the locally defined text
def localtextvalue():
    text = """ one paragraph, of 100-250 words, which summarizes the purpose, methods, results and conclusions of the paper.
    It is not easy to include all this information in just a few words. Start by writing a summary that includes whatever you think is important,
    and then gradually prune it down to size by removing unnecessary words, while still retaining the necessary concepts.
    Don't use abbreviations or citations in the abstract. It should be able to stand alone without any footnotes. Fig 1.1.1 shows below."""
    return text

# Read the Gutenberg corpus content
def readcorpus():
    raw_content_cg = cg.raw("burgess-busterbrown.txt")
    return raw_content_cg[0:1000]
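
A minimal driver, not from the book, that exercises the three helpers (it assumes rawtextcorpus.txt sits in the working directory and that the Gutenberg corpus has been fetched via nltk.download('gutenberg')):

if __name__ == "__main__":
    print(fileread())        # contents of rawtextcorpus.txt
    print(localtextvalue())  # the locally defined paragraph
    print(readcorpus())      # first 1000 characters of the Gutenberg text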

4.1.2 Lowercase conversion

Converting all the data to lowercase during parsing helps the preprocessing stage as well as the later stages of an NLP application.

text= "I am a person. Do you know what is time now?"
text.lower()
'i am a person. do you know what is time now?'

4.1.3 Sentence splitting

In raw text data, the data is in paragraph form. If you want the individual sentences of a paragraph, you need to tokenize at the sentence level. The following tools provide sentence tokenizers:

OpenNLP
Stanford CoreNLP
GATE
nltk

Here we use the NLTK sentence tokenizer.

rawtext = localtextvalue()
st(rawtext)
[' one paragraph, of 100-250 words, which summarizes the purpose, methods, results and conclusions of the paper.',
 'It is not easy to include all this information in just a few words.',
 'Start by writing a summary that includes whatever you think is important,\n    and then gradually prune it down to size by removing unnecessary words, while still retaining the necessary concepts.',
 "Don't use abbreviations or citations in the abstract.",
 'It should be able to stand alone without any footnotes.',
 'Fig 1.1.1 shows below.']
filecontentdetails = fileread()
print(filecontentdetails)
ABSTRACT

1. An abstract, or summary, is published together with a research article, giving the reader a "preview" of what's to come. Such abstracts may also be published separately in bibliographical sources, such as Biological Abstracts. They allow other scientists to quickly scan the large scientific literature, and decide which articles they want to read in depth. The abstract should be a little less technical than the article itself; you don't want to dissuade your potential audience from reading your paper.

2. Your abstract should be one paragraph, of 100-250 words, which summarizes the purpose, methods, results and conclusions of the paper.

3. It is not easy to include all this information in just a few words. Start by writing a summary that includes whatever you think is important, and then gradually prune it down to size by removing unnecessary words, while still retaining the necessary concepts.

4. Don't use abbreviations or citations in the abstract. It should be able to stand alone without any footnotes.

INTRODUCTION

What question did you ask in your experiment? Why is it interesting? The introduction summarizes the relevant literature so that the reader will understand why you were interested in the question you asked. One to four paragraphs should be enough. End with a sentence explaining the specific question you asked in this experiment.

MATERIALS AND METHODS

1. How did you answer this question? There should be enough information here to allow another scientist to repeat your experiment. Look at other papers that have been published in your field to get some idea of what is included in this section.

2. If you had a complicated protocol, it may be helpful to include a diagram, table or flowchart to explain the methods you used.

3. Do not put results in this section. You may, however, include preliminary results that were used to design the main experiment that you are reporting on. ("In a preliminary study, I observed the owls for one week, and found that 73 % of their locomotor activity occurred during the night, and so I conducted all subsequent experiments between 11 pm and 6 am.")

4. Mention relevant ethical considerations. If you used human subjects, did they consent to participate. If you used animals, what measures did you take to minimize pain?

RESULTS

1. This is where you present the results you've gotten. Use graphs and tables if appropriate, but also summarize your main findings in the text. Do NOT discuss the results or speculate as to why something happened; that goes in the Discussion.

2. You don't necessarily have to include all the data you've gotten during the semester. This isn't a diary.

3. Use appropriate methods of showing data. Don't try to manipulate the data to make it look like you did more than you actually did.

"The drug cured 1/3 of the infected mice, another 1/3 were not affected, and the third mouse got away."

TABLES AND GRAPHS

1. If you present your data in a table or graph, include a title describing what's in the table ("Enzyme activity at various temperatures", not "My results".) For graphs, you should also label the x and y axes.

2. Don't use a table or graph just to be "fancy". If you can summarize the information in one sentence, then a table or graph is not necessary.

DISCUSSION

1. Highlight the most significant results, but don't just repeat what you've written in the Results section. How do these results relate to the original question? Do the data support your hypothesis? Are your results consistent with what other investigators have reported? If your results were unexpected, try to explain why. Is there another way to interpret your results? What further research would be necessary to answer the questions raised by your results? How do your results fit into the big picture?

2. End with a one-sentence summary of your conclusion, emphasizing why it is relevant.

ACKNOWLEDGMENTS

This section is optional. You can thank those who either helped with the experiments, or made other important contributions, such as discussing the protocol, commenting on the manuscript, or buying you pizza.

REFERENCES (LITERATURE CITED)

There are several possible ways to organize this section. Here is one commonly used way:

1. In the text, cite the literature in the appropriate places:

Scarlet (1990) thought that the gene was present only in yeast, but it has since been identified in the platypus (Indigo and Mauve, 1994) and wombat (Magenta, et al., 1995).

2. In the References section list citations in alphabetical order.

Indigo, A. C., and Mauve, B. E. 1994. Queer place for qwerty: gene isolation from the platypus. Science 275, 1213-1214.

Magenta, S. T., Sepia, X., and Turquoise, U. 1995. Wombat genetics. In: Widiculous Wombats, Violet, Q., ed. New York: Columbia University Press. p 123-145.

Scarlet, S.L. 1990. Isolation of qwerty gene from S. cerevisae. Journal of Unusual Results 36, 26-31.
st_list_rawfile = st(filecontentdetails)
len(st_list_rawfile)
80

Challenges of sentence splitting

  • If a period is followed by a lowercase letter, the sentence should not be split at the period, as in:

He has completed his Ph.D. degree. He is happy.

  • If a lowercase letter follows the period because of a common typing mistake, the sentence should still be split at the period:

This is an apple.an apple is good for health.

  • If the sentence contains a person's initials, the sentence should not be split after the initials:

Harry Potter was written by J.K. Rowling. It is an entertaining one.

  • Grammarly Inc., the maker of grammar-correction software, describes a high-accuracy approach to sentence identification and sentence-boundary detection:

https://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html
You can develop a rule-based system to improve the sentence splitter's performance:

For the preceding approach, you can use a named-entity recognition (NER) tool and a part-of-speech (POS) tagger, then analyze the output of those tools alongside the sentence splitter's output and correct the places where the splitting went wrong. With the help of the NER tool and POS tagger, you can fix the splitter's erroneous output. In that case, write the rules, then write the code, and check whether the output is as expected.
Test the code! Check it against edge cases. Does it perform well? If yes, great! If not, change it a bit; you can also use machine learning or deep learning techniques to improve the sentence splitter. A small rule-based sketch follows.
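
As an illustration (a sketch, not code from the book), a small rule-based post-processor can merge fragments that the sentence tokenizer wrongly splits after initials or abbreviations such as J.K. or Ph.D.; the regex used here is an assumption of ours:

import re
from nltk.tokenize import sent_tokenize

def merge_bad_splits(sentences):
    # If the previous fragment ends in an initial such as "J.K."
    # or "Ph.D.", glue the current fragment back onto it.
    merged = []
    for sent in sentences:
        if merged and re.search(r"\b[A-Z]\.([A-Z]\.)?$", merged[-1]):
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged

text = "Harry Potter was written by J.K. Rowling. It is an entertaining one."
print(merge_bad_splits(sent_tokenize(text)))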

4.1.4 Stemming raw text

from nltk.stem import PorterStemmer
port = PorterStemmer()
text="""
Stemming is funnier than a bummer says the sushi loving computer scientist.She really wants to buy cars. She told me angrily.
"""
print(" ".join([port.stem(i) for i in text.split()]))
stem is funnier than a bummer say the sushi love comput scientist.sh realli want to buy cars. she told me angrily.

Challenges of stemming

Stemmers were originally built for English, where their accuracy is high, but for languages such as Urdu and Hebrew they do not perform well. Developing stemmers for other languages is therefore quite challenging, and it is still an open research area. A quick look at NLTK's non-English stemmers follows.
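
NLTK's SnowballStemmer does ship stemmers for a number of non-English languages (though not for Urdu or Hebrew); a quick sketch, with the expected stem noted as a comment rather than a guarantee:

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)           # languages that have a Snowball stemmer
german_stemmer = SnowballStemmer("german")
print(german_stemmer.stem("Autobahnen"))   # expected: something like "autobahn"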

4.1.5 Lemmatizing raw text

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
text = """Stemming is funnier than a bummer says the sushi loving computer scientist.
She really wants to buy cars. She told me angrily.
It is better for you. Man is walking. We are meeting tomorrow."""
# Pos = verb
print("\nVerb lemma") 
print(" ".join([wordnet_lemmatizer.lemmatize(i,pos="v") for i in text.split()])) 
# Pos =  noun
print("\nNoun lemma") 
print(" ".join([wordnet_lemmatizer.lemmatize(i,pos="n") for i in text.split()])) 
# Pos = Adjective
print("\nAdjective lemma") 
print(" ".join([wordnet_lemmatizer.lemmatize(i, pos="a") for i in text.split()])) 
# Pos = satellite adjectives
print("\nSatellite adjectives lemma") 
print(" ".join([wordnet_lemmatizer.lemmatize(i, pos="s") for i in text.split()])) 
print("\nAdverb lemma") 
# POS = Adverb
print(" ".join([wordnet_lemmatizer.lemmatize(i, pos="r") for i in text.split()])) 
Verb lemma
Stemming be funnier than a bummer say the sushi love computer scientist. She really want to buy cars. She tell me angrily. It be better for you. Man be walking. We be meet tomorrow.

Noun lemma
Stemming is funnier than a bummer say the sushi loving computer scientist. She really want to buy cars. She told me angrily. It is better for you. Man is walking. We are meeting tomorrow.

Adjective lemma
Stemming is funny than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is good for you. Man is walking. We are meeting tomorrow.

Satellite adjectives lemma
Stemming is funny than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is good for you. Man is walking. We are meeting tomorrow.

Adverb lemma
Stemming is funnier than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is well for you. Man is walking. We are meeting tomorrow.

Challenges of lemmatization
Lemmatization uses a tagged dictionary such as WordNet. Most of the time it is a human-annotated dictionary, so the human effort and the time it takes to build a WordNet for other languages make this challenging.

4.1.6 Stop-word removal

from nltk.corpus import stopwords
stopwordlist = stopwords.words('english')
for s in stopwordlist:
    print(s)
i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
needn't
shan
shan't
shouldn
shouldn't
wasn
wasn't
weren
weren't
won
won't
wouldn
wouldn't
stop = set(stopwords.words('english'))
sentence = "this is a test sentence. I am very happy today."
print(" ".join([i for i in sentence.lower().split() if i not in stop]))
test sentence. happy today.

You can also customize which words to remove according to your NLP application.

stop_words = set(["hi", "bye"])
line = """hi this is foo. bye"""
print(" ".join(word for word in line.split() if word not in stop_words))
this is foo.

4.2 Handling raw corpus sentences

4.2.1 Tokenization

Tokenization is the process of splitting a stream of text into words, phrases, and other meaningful strings.

from nltk.tokenize import word_tokenize
content = """Stemming is funnier than a bummer says the sushi loving computer scientist.
    She really wants to buy cars. She told me angrily. It is better for you.
    Man is walking. We are meeting tomorrow. You really don't know..!"""
print(word_tokenize(content))
['Stemming', 'is', 'funnier', 'than', 'a', 'bummer', 'says', 'the', 'sushi', 'loving', 'computer', 'scientist', '.', 'She', 'really', 'wants', 'to', 'buy', 'cars', '.', 'She', 'told', 'me', 'angrily', '.', 'It', 'is', 'better', 'for', 'you', '.', 'Man', 'is', 'walking', '.', 'We', 'are', 'meeting', 'tomorrow', '.', 'You', 'really', 'do', "n't", 'know..', '!']

Challenges of tokenization

If you analyze the preceding output, you can observe that the word don't was tokenized into do and n't. To deal with this kind of issue, you can write exception code and improve the accuracy ad hoc. Depending on your application's customization and variation, you need to write pattern-matching rules that address the defined challenge; one such rule is sketched below. Another challenge involves languages such as Urdu, Hebrew, and Arabic, where determining word boundaries and finding meaningful chunks is quite difficult.
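
One hedged sketch of such a pattern-matching rule: NLTK's RegexpTokenizer can keep contractions like don't together as single tokens. The pattern below is an illustrative choice of ours, not the book's:

from nltk.tokenize import RegexpTokenizer

# Keep contractions (don't, it's) as one token; otherwise emit runs of
# word characters, and each punctuation mark as its own token.
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+|[^\w\s]")
print(tokenizer.tokenize("You really don't know..!"))
# ['You', 'really', "don't", 'know', '.', '.', '!']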

4.2.2 Word lemmatization

from nltk.stem.wordnet import WordNetLemmatizer
wordlemma = WordNetLemmatizer()
print(wordlemma.lemmatize('cars')) 
print(wordlemma.lemmatize('walking',pos='v')) 
print(wordlemma.lemmatize('meeting',pos='n')) 
print(wordlemma.lemmatize('meeting',pos='v')) 
print(wordlemma.lemmatize('better',pos='a')) 
car
walk
meeting
meet
good

Challenges of word lemmatization
Building a dictionary is time consuming. Building a lemmatizer that takes into account the context of the surrounding sentence is still an open research area; a common partial remedy is sketched below.
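
A common partial remedy (a sketch under our own assumptions, not the book's code; it needs NLTK's tagger data, e.g. via nltk.download('averaged_perceptron_tagger')) is to drive the lemmatizer with a POS tagger, mapping Penn Treebank tags onto the WordNet POS labels used above:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map Penn Treebank tags (JJ*, VB*, RB*, ...) to WordNet POS labels.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

wordlemma = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("We are meeting tomorrow.")):
    print(word, "->", wordlemma.lemmatize(word, pos=penn_to_wordnet(tag)))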

4.3 Basic preprocessing

4.3.1 Regular expressions

Regular expressions help you find, or find and replace, specific patterns in a sequence of characters. A specific syntax must be followed when writing a regex.

Online tool: https://regex101.com/

4.3.2 Basic-level regular expressions

import re
line = "This is test sentence and test sentence is also a sentence."
findallobj = re.findall(r'sentence', line)
print(findallobj)
['sentence', 'sentence', 'sentence']
contactInfo = 'Doe, John: 1111-1212'
groupwiseobj = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
print("1st group ------- " + groupwiseobj.group(1))
1st group ------- Doe
print("2nd group ------- " + groupwiseobj.group(2))
2nd group ------- John
print("3rd group ------- " + groupwiseobj.group(3))
3rd group ------- 1111-1212
phone = "1111-2222-3333 # This is Phone Number"
num = re.sub(r'#.*$', "", phone)
print( "Phone Num : ", num)
Phone Num :  1111-2222-3333 
contactInforevised = re.sub(r'John', "Peter", contactInfo)
print("Revised contactINFO : ", contactInforevised)
Revised contactINFO :  Doe, Peter: 1111-1212

4.3.3 Advanced-level regular expressions

text = "I play on playground. It is the best ground."
  • Positive lookahead

A positive lookahead matches a substring in the string only when that substring is followed by the pattern defined in the lookahead.

positivelookaheadobjpattern = re.findall(r'play(?=ground)',text,re.M | re.I)
print("Positive lookahead: " + str(positivelookaheadobjpattern)) 
positivelookaheadobj = re.search(r'play(?=ground)',text,re.M | re.I)
print("Positive lookahead character index: "+ str(positivelookaheadobj.span())) 
Positive lookahead: ['play']
Positive lookahead character index: (10, 14)
  • Positive lookbehind
positivelookbehindobjpattern = re.findall(r'(?<=play)ground',text,re.M | re.I)
print("Positive lookbehind: " + str(positivelookbehindobjpattern)) 
positivelookbehindobj = re.search(r'(?<=play)ground',text,re.M | re.I)
print("Positive lookbehind character index: " + str(positivelookbehindobj.span())) 
Positive lookbehind: ['ground']
Positive lookbehind character index: (14, 20)
  • Negative lookahead
negativelookaheadobjpattern = re.findall(r'play(?!ground)', text, re.M | re.I)
print("Negative lookahead: " + str(negativelookaheadobjpattern)) 
negativelookaheadobj = re.search(r'play(?!ground)', text, re.M | re.I)
print("Negative lookahead character index: " + str(negativelookaheadobj.span())) 
Negative lookahead: ['play']
Negative lookahead character index: (2, 6)
  • Negative lookbehind
negativelookbehindobjpattern = re.findall(r'(?<!play)ground', text, re.M | re.I)
print("negative lookbehind: " + str(negativelookbehindobjpattern)) 
negativelookbehindobj = re.search(r'(?<!play)ground', text, re.M | re.I)
print("Negative lookbehind character index: " + str(negativelookbehindobj.span())) 
negative lookbehind: ['ground']
Negative lookbehind character index: (37, 43)

4.4 Practical and customized preprocessing

4.4.1 Deciding by yourself

How would we do preprocessing for a news text summarization application?

4.4.2 Preprocessing pipeline

1. You now have the raw data for text summarization, and your dataset contains HTML tags, duplicate text, and so on.

2. If the raw data contains everything described in the first point, preprocessing is needed; in that case we need to remove the HTML tags and the duplicate sentences. Otherwise, no preprocessing is needed.

3. You also need to apply the lowercase convention.

4. After that, you need to apply a sentence tokenizer to the text summarization dataset.

5. Finally, you need to apply a word tokenizer to the text summarization dataset.

6. Whether a dataset needs preprocessing depends on the problem statement and on what data the raw dataset contains. A sketch of this pipeline follows the list.
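
A minimal sketch of this pipeline, under our own assumptions (the tag-stripping regex is deliberately crude, and deduplication is exact string matching):

import re
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess_for_summarization(raw_html):
    # 1. Remove HTML tags (a crude regex; a real system might prefer an HTML parser).
    text = re.sub(r'<[^>]+>', ' ', raw_html)
    # 2. Apply the lowercase convention.
    text = text.lower()
    # 3. Sentence-tokenize and drop exact duplicate sentences, keeping order.
    seen, sentences = set(), []
    for sent in sent_tokenize(text):
        if sent not in seen:
            seen.add(sent)
            sentences.append(sent)
    # 4. Word-tokenize each remaining sentence.
    return [word_tokenize(sent) for sent in sentences]

print(preprocess_for_summarization("<p>This is a test. This is a test. A second sentence.</p>"))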

4.4.3 Types of preprocessing

In our text summarization example, if the raw dataset contains HTML tags, long text, and duplicate text, then during development and in the output you do not need the following data:

1. You do not need the HTML tags, so you can remove them.

2. You do not need the duplicate sentences, so you can remove them as well.

3. If there is long text content, you should remove the stop words and high-frequency short words, if you can find them.

4.4.4 Understanding case studies of preprocessing

Grammar correction system
Suppose you are building a grammar correction system, and think about its subtasks: you want to build a system that predicts where to place the articles a, an, and the in a given sentence. For this kind of system, if you think you need to remove the stop words every time then, oops, you are wrong, because this time we really cannot blindly remove all the stop words: a, an, and the are precisely the words we need to predict. You can remove words that carry no meaning at all; for example, if the dataset contains mathematical symbols, you can remove them. But this time you need a detailed analysis of whether short words such as abbreviations can be removed, because your system also needs to predict which abbreviations take an article and which do not.
Sentiment analysis
Sentiment analysis is mostly about evaluating customer reviews and classifying them into positive, negative, and neutral categories:

For this kind of system, your dataset contains user reviews, and the language of users is usually informal.

Since the data contains informal language, we need to remove words such as hi, hey, and hello: we do not use Hi, Hey, or How are u? to decide whether a user review is positive, negative, or neutral.

Besides that, you can also remove duplicate reviews.

You can also preprocess the data by using tokenization and lemmatization techniques, as in the sketch below.
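
A toy sketch of those steps; the greeting list (hi, hey, hello) and the sample reviews are our own assumptions, not data from the book:

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

greetings = {"hi", "hey", "hello"}              # informal words we choose to drop
lemmatizer = WordNetLemmatizer()

reviews = ["Hi, I loved this phone!",
           "Hi, I loved this phone!",           # duplicate review
           "Hey, battery dies fast."]
unique_reviews = list(dict.fromkeys(reviews))   # remove duplicates, keep order

for review in unique_reviews:
    tokens = [t for t in word_tokenize(review.lower()) if t not in greetings]
    print([lemmatizer.lemmatize(t, pos="v") for t in tokens])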
Machine translation
Machine translation is also a widely used NLP application. In machine translation, our goal is to translate one language into another in a logically sound way. So, if we want to translate English into German, you can apply the following preprocessing steps:

1. Convert the dataset to lowercase.
2. Apply a sentence tokenizer to the dataset so that you can identify each sentence boundary.
3. Now suppose you have a corpus in which all the English sentences are in an English-sentence file and all the German sentences are in a German-sentence file, and you know that for every English sentence there is a corresponding German sentence in the German file. This kind of corpus is called a parallel corpus, so in this case you also need to check that the sentences in the two files are properly aligned; a sanity-check sketch follows this list.
4. You can also apply stemming to each word in the sentences.
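
A hedged sketch of the alignment sanity check in step 3, assuming one sentence per line and two hypothetical files english.txt and german.txt:

def check_parallel_alignment(en_path="english.txt", de_path="german.txt"):
    # In a parallel corpus, line i of the English file should correspond
    # to line i of the German file, so at a minimum both files must
    # contain the same number of sentences.
    with open(en_path, encoding="utf-8") as f:
        en_lines = f.read().splitlines()
    with open(de_path, encoding="utf-8") as f:
        de_lines = f.read().splitlines()
    if len(en_lines) != len(de_lines):
        raise ValueError("Misaligned corpora: %d English vs %d German sentences"
                         % (len(en_lines), len(de_lines)))
    return list(zip(en_lines, de_lines))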
Spelling correction
Spelling correction is also a very useful preprocessing tool, because it helps improve NLP applications. The code below is Peter Norvig's well-known frequency-based spelling corrector, which builds its dictionary from the word counts in big.txt.

import re
from collections import Counter
def words(text):
    return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('big.txt').read()))
def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
print(correction('aple')) 
print(correction('correcton')) 
print(correction('statament')) 
print(correction('tutpore')) 
able
correction
statement
tutor

4.5 Summary

In this chapter we looked at a variety of preprocessing techniques that will be useful to you when developing NLP systems and applications. We also touched on a spelling correction system, which you can regard as one of the most useful preprocessing techniques for your future development.
Acknowledgments
Python Natural Language Processing [1][2][3] by Jalaj Thanaki is a very hands-on recent work. To understand the book's content more deeply, I have extended and practiced parts of it and share the notes here, hoping they help. Feel free to add me on WeChat (verification message: NLP) to study and discuss together; corrections for any shortcomings are welcome.

References


  1. https://github.com/jalajthanaki

  2. Python Natural Language Processing (Chinese edition), Jalaj Thanaki; translated by Zhang Jinchao, Liu Shuman, et al.; China Machine Press, 2018

  3. Jalaj Thanaki, Python Natural Language Processing, 2017
