Deepti Chopra (India)
Translated by Wang Wei
Morphology can be defined as the study of how tokens are constructed from morphemes. A morpheme is the smallest meaning-bearing unit of language. There are two types:
Roots (free morphemes)
Affixes (bound morphemes)
Languages can be divided into three categories: isolating, agglutinative, and inflecting.
Stemming can be defined as the process of obtaining a stem by removing the affixes from a word.
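Before turning to NLTK's stemmers, the definition above can be sketched as a naive suffix stripper. The suffix list below is a toy assumption for illustration only, not Porter's actual rule set:

```python
# A minimal illustration of the idea behind stemming: strip a known
# suffix (affix) from a word to recover an approximate stem.
SUFFIXES = ["ing", "ness", "ed", "ly", "es", "s"]

def naive_stem(word):
    """Remove the first matching suffix, if the remainder is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("working"))    # work
print(naive_stem("happiness"))  # happi
```

Note that, just like real stemmers, this can produce stems that are not dictionary words ("happi").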
import nltk
from nltk.stem import PorterStemmer

stemmerporter = PorterStemmer()
print(stemmerporter.stem('working'))    # work
print(stemmerporter.stem('happiness'))  # happi
The Lancaster stemming algorithm is more aggressive than the Porter algorithm: it applies a larger set of iterative rules and often produces shorter stems.
import nltk
from nltk.stem import LancasterStemmer

stemmerlan = LancasterStemmer()
print(stemmerlan.stem('working'))
print(stemmerlan.stem('happiness'))
The RegexpStemmer class takes a regular expression and, when a word matches it, removes the matching prefix or suffix from the word.
import nltk
from nltk.stem import RegexpStemmer

# Remove substrings matching the regular expression 'ing'
stemmerregexp = RegexpStemmer('ing')
print(stemmerregexp.stem('working'))    # work
print(stemmerregexp.stem('happiness'))  # happiness (no match)
print(stemmerregexp.stem('pairing'))    # pair
The SnowballStemmer class performs stemming in 13 languages other than English.
import nltk
from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)        # the supported languages
spanishstemmer = SnowballStemmer('spanish')
print(spanishstemmer.stem('comiendo'))  # com
frenchstemmer = SnowballStemmer('french')
print(frenchstemmer.stem('manger'))     # mang
import nltk
from nltk.stem.porter import PorterStemmer

def obtain_tokens():
    with open('/home/p/NLTK/sample1.txt') as f:
        tokens = nltk.word_tokenize(f.read())
    return tokens

def stemming(filtered):
    stem = []
    for x in filtered:
        stem.append(PorterStemmer().stem(x))
    return stem

if __name__ == "__main__":
    tok = obtain_tokens()
    print("tokens is %s" % tok)
    stem_tokens = stemming(tok)
    print("After stemming is %s" % stem_tokens)
    res = dict(zip(tok, stem_tokens))
    print("{tok:stemmed}=%s" % res)
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer_output = WordNetLemmatizer()
print(lemmatizer_output.lemmatize('working'))           # working (treated as a noun by default)
print(lemmatizer_output.lemmatize('working', pos='v'))  # work
print(lemmatizer_output.lemmatize('works'))             # work
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Stemming may yield a non-word; lemmatization returns a dictionary form
stemmer_output = PorterStemmer()
print(stemmer_output.stem('happiness'))          # happi
lemmatizer_output = WordNetLemmatizer()
print(lemmatizer_output.lemmatize('happiness'))  # happiness
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!
from polyglot.text import Text, Word

tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    s = Word(s, language="en")
    print("{:<20}{}".format(s, s.morphemes))
unconditional       ['un', 'conditional']
precooked           ['pre', 'cook', 'ed']
impossible          ['im', 'possible']
painful             ['pain', 'ful']
entered             ['enter', 'ed']
sent = "Ihopeyoufindthebookinteresting"
para = Text(sent)
para.language = "en"
para.morphemes
WordList(['I', 'hope', 'you', 'find', 'the', 'book', 'interesting'])
Morphological analysis is the process of obtaining grammatical information from tokens. It can be performed in three ways: morpheme-based morphology (an item-and-arrangement approach), lexeme-based morphology (an item-and-process approach), and word-based morphology (a word-and-paradigm approach).
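A minimal sketch of the morpheme-based (item-and-arrangement) view: a word is treated as a linear arrangement of prefix + root + suffix, matched against small hand-made inventories. The prefix and suffix lists here are illustrative assumptions, not a real morphological lexicon:

```python
# Toy item-and-arrangement analyzer: split a word into an ordered
# list of morphemes using small hardcoded affix inventories.
PREFIXES = ["un", "pre", "im"]
SUFFIXES = ["ful", "ed", "ing"]

def analyze(word):
    morphemes = []
    for p in PREFIXES:                 # strip at most one known prefix
        if word.startswith(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for sfx in SUFFIXES:               # strip at most one known suffix
        if word.endswith(sfx) and len(word) > len(sfx):
            suffix = sfx
            word = word[:-len(sfx)]
            break
    morphemes.append(word)             # the root (free morpheme)
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(analyze("unconditional"))  # ['un', 'conditional']
print(analyze("painful"))        # ['pain', 'ful']
print(analyze("entered"))        # ['enter', 'ed']
```

The outputs mirror the polyglot morpheme splits shown earlier, but only for words covered by the toy affix lists.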
import enchant

s = enchant.Dict("en_US")
tok = []

def tokenize(st1):
    """Greedily split a concatenated string into dictionary words."""
    if not st1:
        return
    for j in range(len(st1), 0, -1):   # try the longest prefix first
        if s.check(st1[0:j]):
            tok.append(st1[0:j])
            tokenize(st1[j:])
            break
tokenize("itismyfavouritebook")
print(tok)
tok = []
tokenize("ihopeyoufindthebookinteresting")
print(tok)
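The same greedy longest-match segmentation can be tried without installing pyenchant by swapping the dictionary for a small hardcoded word set (an illustrative assumption, not a real lexicon):

```python
# Greedy longest-match word segmentation over a toy word set.
WORDS = {"it", "is", "my", "favourite", "book", "i", "hope", "you",
         "find", "the", "interesting"}

def segment(text):
    tokens = []
    while text:
        for j in range(len(text), 0, -1):   # longest match first
            if text[:j].lower() in WORDS:
                tokens.append(text[:j])
                text = text[j:]
                break
        else:                               # no known word matched
            tokens.append(text)
            break
    return tokens

print(segment("itismyfavouritebook"))  # ['it', 'is', 'my', 'favourite', 'book']
```

An iterative loop replaces the recursion of the enchant version, but the matching strategy is the same.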
Morphological generation is the task performed by a program that produces surface forms. For example, if the root is go, the part of speech is verb, the tense is present, and it occurs with a third-person singular subject, the generator produces the surface form goes.
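The goes example can be sketched as a toy generator implementing the standard English third-person singular spelling rules, simplified; the function name and rule set are assumptions for illustration:

```python
# Toy morphological generator: root + (verb, present, 3rd person
# singular) -> surface form, using simplified English spelling rules.
def generate_present_3sg(root):
    if root.endswith(("o", "s", "x", "z", "ch", "sh")):
        return root + "es"                 # go -> goes, watch -> watches
    if root.endswith("y") and root[-2:-1] not in "aeiou":
        return root[:-1] + "ies"           # try -> tries
    return root + "s"                      # walk -> walks

print(generate_present_3sg("go"))     # goes
print(generate_present_3sg("watch"))  # watches
print(generate_present_3sg("try"))    # tries
```

A full generator would also need an exception list for irregular verbs (e.g. be, have), which this sketch omits.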
We can build a vector space search engine by converting texts into vectors.
def eliminatestopwords(self, wordlist):
    """
    Eliminate words which occur often and have little significance
    from the context point of view.
    """
    return [word for word in wordlist if word not in self.stopwords]

def tokenize(self, string):
    """
    Perform the task of splitting text into tokens and removing stop characters.
    """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

def obtainvectorkeywordindex(self, documentList):
    """
    Generate the keyword index: the position of each keyword is the
    dimension used to depict that token in the document vectors.
    """
    # Map the documents into a single string
    vocabstring = " ".join(documentList)
    vocablist = self.parser.tokenize(vocabstring)
    # Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)
    vectorIndex = {}
    offset = 0
    # Attach a position to each keyword
    for word in uniqueVocablist:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # {keyword: position}

def constructVector(self, wordString):
    # Initialise the vector with zeros
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple term-count model
    return vector

def cosine(vec1, vec2):
    """
    cosine = (X . Y) / (||X|| * ||Y||)
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))

def searching(self, searchinglist):
    """
    Search for documents that match the query terms, ranked by rating.
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector)
               for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings
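The fragments above belong to a larger search-engine class from the book. As a self-contained sketch of the same idea, assuming plain whitespace tokenization and no stemming or stop-word removal:

```python
# End-to-end vector-space search sketch: term-count vectors plus
# cosine similarity, with pure-Python math.
import math

def vectorize(text, index):
    vec = [0] * len(index)
    for word in text.lower().split():
        if word in index:
            vec[index[word]] += 1      # simple term-count model
    return vec

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

documents = ["the cat sat on the mat", "dogs chase cats", "I like my book"]
# Map each unique word to a dimension of the vector space
vocabulary = sorted({w for d in documents for w in d.lower().split()})
index = {word: pos for pos, word in enumerate(vocabulary)}

doc_vectors = [vectorize(d, index) for d in documents]
query = vectorize("my favourite book", index)
ratings = [cosine(query, dv) for dv in doc_vectors]
print(ratings.index(max(ratings)))   # 2, the document about the book
```

Unknown query words ("favourite") simply contribute nothing, which is the usual behaviour of a fixed-vocabulary term-count model.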
import nltk
import sys

try:
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords
except ImportError:
    print('Error has occurred')

#---------------------------------------------------------

def _calculate_languages_ratios(text):
    """
    Compute a score for each language the given document could be
    written in, and return a dictionary that looks like
    {'german': 2, 'french': 4, 'english': 1}.
    """
    languages_ratios = {}
    # wordpunct_tokenize splits all punctuation into separate tokens:
    # wordpunct_tokenize("I hope you like the book interesting.")
    # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
    tokens = wordpunct_tokenize(text)
    words = [word.lower() for word in tokens]
    # Count the occurrence of each language's unique stopwords in the text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"
    return languages_ratios

#---------------------------------------------------------

def detect_language(text):
    """
    Compute the score of the given text for each language and return
    the highest-scored one. It uses a stopwords-based approach: it
    counts the unique stopwords present in the analyzed text.
    """
    ratios = _calculate_languages_ratios(text)
    most_rated_language = max(ratios, key=ratios.get)
    return most_rated_language

if __name__ == '__main__':
    text = '''
All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another.
The word 'religion' does not have a derogatory meaning - a literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindu votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Bangladeshi Doctor-cum-Writer, in her controversial novel 'Lajja' (1993) in which she seems to utilize fiction's mass emotional appeal, rather than its potential for nuance and universality.
'''
    language = detect_language(text)
    print(language)
The code above counts stopword matches and detects the language of the text, namely English.
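The same stopword-overlap approach can be sketched without downloading the NLTK stopwords corpus, using tiny hardcoded stopword sets; these word lists are illustrative assumptions, far smaller than NLTK's actual ones:

```python
# Language detection by stopword overlap, with toy stopword sets.
STOPWORDS = {
    "english": {"the", "is", "a", "an", "of", "and", "to", "in"},
    "french":  {"le", "la", "les", "de", "et", "un", "une", "dans"},
    "german":  {"der", "die", "das", "und", "ein", "eine", "ist", "im"},
}

def detect(text):
    words = set(text.lower().split())
    # Score each language by how many of its stopwords appear in the text
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect("the creator and the runner of this world"))  # english
```

With such small word sets, ties and misdetections are common on short inputs; NLTK's full stopword lists make the scores far more reliable.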
***Author's note: These are my notes on Chapter 3, Morphology, of Mastering Natural Language Processing with Python, including every code snippet in the chapter. I hope they help other readers of the book. FIGHTING... (Criticism, corrections, and discussion are warmly welcome.)
(The best way to end your fear is to face it yourself.)***