Mastering Natural Language Processing with Python (Deepti Chopra), Reading Notes (Chapter 3): Morphology

Mastering Natural Language Processing with Python

Deepti Chopra (India)
Translated by Wang Wei

Chapter 3 Morphology: Learning Through Practice

3.1 Introduction to Morphology

Morphology may be defined as the study of how tokens are constructed with the help of morphemes.
A morpheme is the smallest meaning-bearing unit of language. There are two types:

Stems (free morphemes)
Affixes (bound morphemes)

For example, in the word "unhappiness", "happy" is a free morpheme, while "un-" and "-ness" are bound morphemes.

Languages fall into three categories:

  • Isolating languages (e.g., Chinese);
  • Agglutinative languages (e.g., Turkish);
  • Inflecting languages (e.g., Latin).

3.2 Understanding Stemmers

Stemming may be defined as the process of obtaining a stem from a word by removing its affixes.

Stemming with the PorterStemmer class:
import nltk
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
print(stemmerporter.stem('working'))
print(stemmerporter.stem('happiness'))

The Lancaster stemming algorithm is more aggressive than the Porter stemming algorithm and often produces shorter stems.

Stemming with the LancasterStemmer class:
import nltk
from nltk.stem import LancasterStemmer
stemmerlan=LancasterStemmer()
print(stemmerlan.stem('working'))
print(stemmerlan.stem('happiness'))

The RegexpStemmer class takes a regular expression and removes any prefix or suffix of a word that matches it.

Stemming with the RegexpStemmer class:
import nltk
from nltk.stem import RegexpStemmer
stemmerregexp=RegexpStemmer('ing')
print(stemmerregexp.stem('working'))
print(stemmerregexp.stem('happiness'))
print(stemmerregexp.stem('pairing'))

The SnowballStemmer class is used to perform stemming in 13 languages other than English.

Stemming with the SnowballStemmer class:
import nltk
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
spanishstemmer=SnowballStemmer('spanish')
print(spanishstemmer.stem('comiendo'))
frenchstemmer=SnowballStemmer('french')
print(frenchstemmer.stem('manger'))
Stemming with multiple stemmers:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

def obtain_tokens():
    # Read the sample file and split it into tokens
    with open('/home/p/NLTK/sample1.txt') as stem:
        tok = nltk.word_tokenize(stem.read())
    return tok

def stemming(filtered):
    # Apply the Porter stemmer to every token
    stem = []
    for x in filtered:
        stem.append(PorterStemmer().stem(x))
    return stem

if __name__ == "__main__":
    tok = obtain_tokens()
    print("tokens is %s" % tok)
    stem_tokens = stemming(tok)
    print("After stemming is %s" % stem_tokens)
    res = dict(zip(tok, stem_tokens))
    print("{tok:stemmed}=%s" % res)

3.3 Understanding Lemmatization

Lemmatization is the process of transforming a word into its base form (lemma), taking its word class (part of speech) into account:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer_output=WordNetLemmatizer()
print(lemmatizer_output.lemmatize('working'))
print(lemmatizer_output.lemmatize('working',pos='v'))
print(lemmatizer_output.lemmatize('works'))
The difference between a stemmer and a lemmatizer: the stemmer truncates 'happiness' to 'happi', while the lemmatizer returns the valid dictionary form 'happiness'.
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer_output=PorterStemmer()
print(stemmer_output.stem('happiness'))
lemmatizer_output=WordNetLemmatizer()
print(lemmatizer_output.lemmatize('happiness'))

3.4 Developing a Stemmer for Non-English Languages

Using polyglot to obtain the table of supported languages:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
The required models can be downloaded with the following code:
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!
Consider an example that obtains morpheme output from polyglot:
from polyglot.text import Text, Word
tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    s = Word(s, language="en")
    print("{:<20}{}".format(s, s.morphemes))

unconditional       ['un', 'conditional']
precooked           ['pre', 'cook', 'ed']
impossible          ['im', 'possible']
painful             ['pain', 'ful']
entered             ['enter', 'ed']
If tokenization has not been performed properly, we can use morphological analysis to split the text into its original constituents:
sent = "Ihopeyoufindthebookinteresting"
para = Text(sent)
para.language = "en"
para.morphemes
WordList(['I', 'hope', 'you', 'find', 'the', 'book', 'interesting'])

3.5 Morphological Analyzer

Morphological analysis is the process of obtaining grammatical information from a token. It can be performed in three ways: morpheme-based morphology (the item-and-arrangement approach), lexeme-based morphology (the item-and-process approach), and word-based morphology (the word-and-paradigm approach).

Performing morphological analysis:
import enchant
s = enchant.Dict("en_US")
tok = []

def tokenize(st1):
    # Recursively split the string at the longest prefix that is a valid English word
    if not st1:
        return
    for j in range(len(st1), -1, -1):
        if s.check(st1[0:j]):
            tok.append(st1[0:j])
            st1 = st1[j:]
            tokenize(st1)
            break

tokenize("itismyfavouritebook")
print(tok)
tok = []
tokenize("ihopeyoufindthebookinteresting")
print(tok)

3.6 Morphological Generator

A morphological generator is a program that performs the task of morphological generation. For example, if the root is go, the part of speech is verb, the tense is present, and it occurs with a third-person singular subject, then the morphological generator produces the surface form goes.
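
The book gives no code for this step; as a rough, minimal sketch of the idea (the rule table and function name below are illustrative assumptions, not the book's implementation), a rule-based generator for a few regular English verb forms might look like this:

# Minimal illustrative rule-based morphological generator (assumed example, not from the book)
def generate_verb(root, tense, person, number):
    # Generate a surface form from a root plus grammatical features (regular verbs only)
    if tense == 'present' and person == 3 and number == 'singular':
        if root.endswith(('s', 'sh', 'ch', 'x', 'o')):
            return root + 'es'    # e.g. go -> goes, watch -> watches
        return root + 's'         # e.g. walk -> walks
    if tense == 'past':
        return root + 'd' if root.endswith('e') else root + 'ed'    # naive regular past
    return root                   # default: return the base form

print(generate_verb('go', 'present', 3, 'singular'))    # goes
print(generate_verb('walk', 'past', 1, 'singular'))     # walked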

3.7 Search Engines

We can build a vector space search engine by converting texts into vectors.

1. Consider the following code for eliminating stop words and performing tokenization:

def eliminatestopwords(self, list):
    """
    Eliminate words which occur often and have little significance from the
    point of view of the context.
    """
    return [word for word in list if word not in self.stopwords]

def tokenize(self, string):
    """
    Perform the task of splitting text into stop words and tokens.
    """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

2. Consider the following code for mapping keywords onto vector dimensions:

def obtainvectorkeywordindex(self, documentList):
    """
    In the document vectors, generate the keyword associated with the position of each element.
    """
    # Perform mapping of text into strings
    vocabstring = "".join(documentList)

    vocablist = self.parser.tokenize(vocabstring)
    # Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)

    vectorIndex = {}
    offset = 0
    # Attach a position to each keyword; this position maps to the dimension used to represent the token
    for word in uniqueVocablist:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex    # (keyword:position)

3. Code for converting a text string into a vector:

def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1    # the simple Term Count Model is used
    return vector

4. To search for similar documents, we find the cosine of the angle between their document vectors. Code for computing the cosine between text vectors using SciPy:

# dot and norm are assumed imports here (the book mentions SciPy; NumPy provides these functions)
from numpy import dot
from numpy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = (X * Y) / ||X|| * ||Y||
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))

5. Perform mapping of the query keywords into the vector space and search it:

def searching(self, searchinglist):
    """
    Search for documents that match the given list of search items.
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings
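
The fragments above are methods of a larger vector-space engine class in the book (parser, stemmer, and util are defined elsewhere). As a self-contained toy illustration of the same term-count and cosine-ranking idea, using hypothetical names rather than the book's class, one might write:

# Toy, self-contained illustration of term-count vectors plus cosine ranking (not the book's class)
from numpy import dot
from numpy.linalg import norm

documents = ["the cat sat on the mat", "dogs chase cats", "a book about morphology"]
vocabulary = sorted({w for doc in documents for w in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}    # keyword -> dimension

def to_vector(text):
    # Build a simple term-count vector over the shared vocabulary
    vec = [0] * len(index)
    for word in text.split():
        if word in index:
            vec[index[word]] += 1
    return vec

def cosine(v1, v2):
    return float(dot(v1, v2) / (norm(v1) * norm(v2)))

query = to_vector("cat on the mat")
ratings = sorted((cosine(query, to_vector(d)), d) for d in documents)
print(ratings[-1])    # the highest-scoring (most similar) document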

6. Language detection of the source text:

import nltk
import sys
try:
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords
except ImportError:
    print('An error has occurred')

#---------------------------------------------------------
def _calculate_languages_ratios(text):
    """
    Compute how likely the given document is to be written in each of several languages
    and return a dictionary that looks like {'german': 2, 'french': 4, 'english': 1}.
    """
    languages_ratios = {}
    '''
    nltk.wordpunct_tokenize splits all punctuation into separate tokens:
    wordpunct_tokenize("I hope you like the book interesting .")
    ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
    '''
    tok = wordpunct_tokenize(text)
    words = [word.lower() for word in tok]

    # Compute the occurrence of unique stopwords in the text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)    # language "score"
    return languages_ratios

#---------------------------------------------------------
def detect_language(text):
    """
    Compute the score of the given text for each language and return the highest-scored
    one. It uses a stopwords-based calculation approach and finds the unique stopwords
    present in the analyzed text.
    """
    ratios = _calculate_languages_ratios(text)
    most_rated_language = max(ratios, key=ratios.get)
    return most_rated_language

if __name__ == '__main__':
    text = '''
All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan
Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another.
The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Bangladeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilize fiction's mass emotional appeal, rather than its potential for nuance and universality.
'''
    language = detect_language(text)
    print(language)

The preceding code searches for stop words and detects the language of the text, which here is English.

"""***Author's note: This post organizes the content of Chapter 3 of Mastering Natural Language Processing with Python: Morphology. Every code snippet from the book is included. I hope it helps those who are reading this book. FIGHTING... (Criticism, corrections, and discussion are warmly welcome.)
(The best way to end your fear is to face it yourself.)
***"""
