Python Natural Language Processing study notes (2)

Chapter 2: Accessing Text Corpora and Lexical Resources

1 Accessing text corpora

The Gutenberg corpus (gutenberg)

>>> import nltk

>>> nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

 

 

# number of word tokens in a text

>>> emma=nltk.corpus.gutenberg.words("austen-emma.txt")

>>> len(emma)

192427

 

# search for a word in the text
>>> emma=nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))
>>> emma.concordance("surprize")

Displaying 25 of 37 matches:

er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." "You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr .
 sound of at first , without great surprize ." So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

 

 

 

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars=len(gutenberg.raw(fileid)) # number of characters, including spaces
...     num_words=len(gutenberg.words(fileid)) # number of word tokens
...     num_sents=len(gutenberg.sents(fileid)) # number of sentences
...     num_vocab=len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print(int(num_chars/num_words), # average word length
...           int(num_words/num_sents), # average sentence length
...           int(num_words/num_vocab), # average number of times each word appears
...           fileid) # file identifier
...

Output:

4 24 26 austen-emma.txt

4 26 16 austen-persuasion.txt

4 28 22 austen-sense.txt

4 33 79 bible-kjv.txt

4 19 5 blake-poems.txt

4 19 14 bryant-stories.txt

4 17 12 burgess-busterbrown.txt

4 20 12 carroll-alice.txt

4 20 11 chesterton-ball.txt

4 22 11 chesterton-brown.txt

4 18 10 chesterton-thursday.txt

4 20 24 edgeworth-parents.txt

4 25 15 melville-moby_dick.txt

4 52 10 milton-paradise.txt

4 11 8 shakespeare-caesar.txt

4 12 7 shakespeare-hamlet.txt

4 12 6 shakespeare-macbeth.txt

4 36 12 whitman-leaves.txt

 

Web and chat text

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65])
...

 

Output:

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb

 

# chat logs
>>> from nltk.corpus import nps_chat
>>> chatroom=nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

 

The Brown corpus

>>> from nltk.corpus import brown
>>> brown.categories() # the categories in the corpus
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

 

# words in the news category
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

 

# words from a specific file
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

 

# sentences from the specified categories
>>> brown.sents(categories=['news','editorial','reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

 

# count modal verbs in a particular genre
>>> news_text=brown.words(categories='news')
>>> fdist=nltk.FreqDist([w.lower() for w in news_text])
>>> modals=['can','could','may','might','must','will']
>>> for m in modals:
...     print(m+" : ",fdist[m])
...

can : 94

could : 87

may : 93

might : 38

must : 53

will : 389

 

# conditional frequency distribution
>>> cfd=nltk.ConditionalFreqDist(
... (genre,word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
>>> genres=['news','religion','hobbies','science_fiction','romance','humor']
>>> modals=['can','could','may','might','must','will']
>>> cfd.tabulate(conditions=genres,samples=modals)

                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13
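
The same conditions and samples can also be drawn as a line plot; a minimal sketch, using the same conditions/samples keyword arguments that tabulate() accepts above:

>>> cfd.plot(conditions=genres, samples=modals)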

 

 

The Reuters corpus

>>> from nltk.corpus import reuters
>>> reuters.fileids()

>>> reuters.fileids()

 

# test data
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840',

...

 'test/21576',

 

# training data
'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', 'training/10008', 'training/10011',

...

'training/9995']

 

# topic categories; a single news story may cover several topics
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

 

# topics covered by one or more documents
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']

>>> reuters.categories(['training/9865','training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']

 

 

 

# documents that contain one or more categories
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']

 

>>> reuters.fileids(['barley','corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341',

...

]

 

# fetch the words we want
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865','training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

 

 

>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley','corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

 

 

The inaugural address corpus

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']

 

>>> print(inaugural.raw('2009-Obama.txt'))

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.

These are the indicators of crisis, subject to data and statistics. Less measurable but no less profound is a sapping of confidence across our land -- a nagging fear that America's decline is inevitable, that the next generation must lower its sights.

Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know this, America -- they will be met.

On this day, we gather because we have chosen hope over fear, unity of purpose over conflict and discord.

On this day, we come to proclaim an end to the petty grievances and false promises, the recriminations and worn-out dogmas that for far too long have strangled our politics.

We remain a young nation, but in the words of Scripture, the time has come to set aside childish things. The time has come to reaffirm our enduring spirit; to choose our better history; to carry forward that precious gift, that noble idea, passed on from generation to generation: the God-given promise that all are equal, all are free, and all deserve a chance to pursue their full measure of happiness.

In reaffirming the greatness of our nation, we understand that greatness is never a given. It must be earned. Our journey has never been one of shortcuts or settling for less. It has not been the path for the faint-hearted -- for those who prefer leisure over work, or seek only the pleasures of riches and fame. Rather, it has been the risk-takers, the doers, the makers of things -- some celebrated but more often men and women obscure in their labor, who have carried us up the long, rugged path towards prosperity and freedom.

For us, they packed up their few worldly possessions and traveled across oceans in search of a new life.

For us, they toiled in sweatshops and settled the West; endured the lash of the whip and plowed the hard earth.

For us, they fought and died, in places like Concord and Gettysburg; Normandy and Khe Sahn.

Time and again these men and women struggled and sacrificed and worked till their hands were raw so that we might live a better life. They saw America as bigger than the sum of our individual ambitions; greater than all the differences of birth or wealth or faction.

This is the journey we continue today. We remain the most prosperous, powerful nation on Earth. Our workers are no less productive than when this crisis began. Our minds are no less inventive, our goods and services no less needed than they were last week or last month or last year. Our capacity remains undiminished. But our time of standing pat, of protecting narrow interests and putting off unpleasant decisions -- that time has surely passed. Starting today, we must pick ourselves up, dust ourselves off, and begin again the work of remaking America.

For everywhere we look, there is work to be done. The state of our economy calls for action, bold and swift, and we will act -- not only to create new jobs, but to lay a new foundation for growth. We will build the roads and bridges, the electric grids and digital lines that feed our commerce and bind us together. We will restore science to its rightful place, and wield technology's wonders to raise health care's quality and lower its cost. We will harness the sun and the winds and the soil to fuel our cars and run our factories. And we will transform our schools and colleges and universities to meet the demands of a new age. All this we can do. All this we will do.

Now, there are some who question the scale of our ambitions -- who suggest that our system cannot tolerate too many big plans. Their memories are short. For they have forgotten what this country has already done; what free men and women can achieve when imagination is joined to common purpose, and necessity to courage.

What the cynics fail to understand is that the ground has shifted beneath them -- that the stale political arguments that have consumed us for so long no longer apply. The question we ask today is not whether our government is too big or too small, but whether it works -- whether it helps families find jobs at a decent wage, care they can afford, a retirement that is dignified. Where the answer is yes, we intend to move forward. Where the answer is no, programs will end. And those of us who manage the public's dollars will be held to account -- to spend wisely, reform bad habits, and do our business in the light of day -- because only then can we restore the vital trust between a people and their government.

Nor is the question before us whether the market is a force for good or ill. Its power to generate wealth and expand freedom is unmatched, but this crisis has reminded us that without a watchful eye, the market can spin out of control -- the nation cannot prosper long when it favors only the prosperous. The success of our economy has always depended not just on the size of our Gross Domestic Product, but on the reach of our prosperity; on the ability to extend opportunity to every willing heart -- not out of charity, but because it is the surest route to our common good.

As for our common defense, we reject as false the choice between our safety and our ideals. Our Founding Fathers, faced with perils that we can scarcely imagine, drafted a charter to assure the rule of law and the rights of man, a charter expanded by the blood of generations. Those ideals still light the world, and we will not give them up for expedience's sake. And so to all the other peoples and governments who are watching today, from the grandest capitals to the small village where my father was born: know that America is a friend of each nation and every man, woman, and child who seeks a future of peace and dignity, and we are ready to lead once more.

Recall that earlier generations faced down fascism and communism not just with missiles and tanks, but with the sturdy alliances and enduring convictions. They understood that our power alone cannot protect us, nor does it entitle us to do as we please. Instead, they knew that our power grows through its prudent use; our security emanates from the justness of our cause, the force of our example, the tempering qualities of humility and restraint.

We are the keepers of this legacy. Guided by these principles once more, we can meet those new threats that demand even greater effort -- even greater cooperation and understanding between nations. We will begin to responsibly leave Iraq to its people, and forge a hard-earned peace in Afghanistan. With old friends and former foes, we will work tirelessly to lessen the nuclear threat, and roll back the specter of a warming planet. We will not apologize for our way of life, nor will we waver in its defense, and for those who seek to advance their aims by inducing terror and slaughtering innocents, we say to you now that our spirit is stronger and cannot be broken; you cannot outlast us, and we will defeat you.

For we know that our patchwork heritage is a strength, not a weakness. We are a nation of Christians and Muslims, Jews and Hindus -- and non-believers. We are shaped by every language and culture, drawn from every end of this Earth; and because we have tasted the bitter swill of civil war and segregation, and emerged from that dark chapter stronger and more united, we cannot help but believe that the old hatreds shall someday pass; that the lines of tribe shall soon dissolve; that as the world grows smaller, our common humanity shall reveal itself; and that America must play its role in ushering in a new era of peace.

To the Muslim world, we seek a new way forward, based on mutual interest and mutual respect. To those leaders around the globe who seek to sow conflict, or blame their society's ills on the West -- know that your people will judge you on what you can build, not what you destroy. To those who cling to power through corruption and deceit and the silencing of dissent, know that you are on the wrong side of history; but that we will extend a hand if you are willing to unclench your fist.

To the people of poor nations, we pledge to work alongside you to make your farms flourish and let clean waters flow; to nourish starved bodies and feed hungry minds. And to those nations like ours that enjoy relative plenty, we say we can no longer afford indifference to the suffering outside our borders; nor can we consume the world's resources without regard to effect. For the world has changed, and we must change with it.

As we consider the road that unfolds before us, we remember with humble gratitude those brave Americans who, at this very hour, patrol far-off deserts and distant mountains. They have something to tell us, just as the fallen heroes who lie in Arlington whisper through the ages. We honor them not only because they are the guardians of our liberty, but because they embody the spirit of service; a willingness to find meaning in something greater than themselves. And yet, at this moment -- a moment that will define a generation -- it is precisely this spirit that must inhabit us all.

For as much as government can do and must do, it is ultimately the faith and determination of the American people upon which this nation relies. It is the kindness to take in a stranger when the levees break, the selflessness of workers who would rather cut their hours than see a friend lose their job which sees us through our darkest hours. It is the firefighter's courage to storm a stairway filled with smoke, but also a parent's willingness to nurture a child, that finally decides our fate.

Our challenges may be new. The instruments with which we meet them may be new. But those values upon which our success depends -- honesty and hard work, courage and fair play, tolerance and curiosity, loyalty and patriotism -- these things are old. These things are true. They have been the quiet force of progress throughout our history. What is demanded then is a return to these truths. What is required of us now is a new era of responsibility -- a recognition, on the part of every American, that we have duties to ourselves, our nation, and the world, duties that we do not grudgingly accept but rather seize gladly, firm in the knowledge that there is nothing so satisfying to the spirit, so defining of our character, than giving our all to a difficult task.

This is the price and the promise of citizenship.

This is the source of our confidence -- the knowledge that God calls on us to shape an uncertain destiny.

This is the meaning of our liberty and our creed -- why men and women and children of every race and every faith can join in celebration across this magnificent mall, and why a man whose father less than sixty years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath.

So let us mark this day with remembrance, of who we are and how far we have traveled. In the year of America's birth, in the coldest of months, a small band of patriots huddled by dying campfires on the shores of an icy river. The capital was abandoned. The enemy was advancing. The snow was stained with blood. At a moment when the outcome of our revolution was most in doubt, the father of our nation ordered these words be read to the people:

"Let it be told to the future world ... that in the depth of winter, when nothing but hope and virtue could survive ... that the city and the country, alarmed at one common danger, came forth to meet ... it."

America! In the face of our common dangers, in this winter of our hardship, let us remember these timeless words. With hope and virtue, let us brave once more the icy currents, and endure what storms may come. Let it be said by our children's children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God's grace upon us, we carried forth that great gift of freedom and delivered it safely to future generations.

Thank you. God bless you. And God bless the United States of America.

 

 

# conditional frequency distribution

>>> cfd=nltk.ConditionalFreqDist(

... (target,fileid[:4])

... for fileid in inaugural.fileids()

... for w in inaugural.words(fileid)

... for target in ['america','citizen']

... if w.lower().startswith(target))

>>> cfd.plot()

Backend TkAgg is interactive backend. Turning interactive mode on.

 

Output: a plot of how often words beginning with 'america' and 'citizen' occur in each address, by year (plot image not reproduced here).

 

Annotated text corpora



Corpora in other languages

Handling character encodings

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricité_de_France', ...]

>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]

>>> nltk.corpus.indian.words('hindi.pos')
['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...]

>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8',
...
 'Zapoteco-SanLucasQuiavini-Latin1', 'Zhuang-Latin1', 'Zulu-Latin1']

>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...]

 

# differences in word length across language versions
>>> import nltk
>>> from nltk.corpus import udhr
>>> # the book's list omits the encoding suffix and appends '-Latin1' inside udhr.words();
>>> # here the suffix is part of each name, so a GB2312 file can be included as well
>>> languages=['Chickasaw-Latin1','English-Latin1','German_Deutsch-Latin1','Greenlandic_Inuktikut-Latin1','Hungarian_Magyar-Latin1','Ibibio_Efik-Latin1','Chinese_Mandarin-GB2312']

>>> cfd=nltk.ConditionalFreqDist(
... (lang,len(word))
... for lang in languages
... for word in udhr.words(lang))
>>> cfd.plot(cumulative=True)

 

Output: a cumulative plot of word-length distributions, one line per language (plot image not reproduced here).

 

 >>>udhr.raw("Chinese_Mandarin-GB2312")

 

The structure of text corpora

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> raw=gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words=gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']
>>> sents=gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves',
...
]]

 

 

Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root='H:/nltk_data/custom_data'
>>> wordlists=PlaintextCorpusReader(corpus_root,'.*')
>>> wordlists.fileids() # lists the files found under corpus_root
>>> wordlists.words('Jython.txt')
['下载安装文件', 'cmd直接运行', ',', '安装完成', '修改环境变量', ...]
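
A PlaintextCorpusReader offers the same access methods as the built-in corpora, not just words(). A short sketch, reusing the wordlists object above (outputs omitted, since they depend on your files):

>>> wordlists.raw('Jython.txt')[:40]   # the raw character string
>>> wordlists.sents('Jython.txt')[:2]  # sentence-segmented, tokenized text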

 

2 Conditional frequency distributions

>>> from nltk.corpus import brown
>>> cfd=nltk.ConditionalFreqDist(
... (genre,word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

 

>>> genre_word=[(genre,word)
...     for genre in ['news','romance'] # just the news and romance genres
...     for word in brown.words(categories=genre)] # pair each genre with each of its words
...

>>> len(genre_word)

170576

 

# pairs at the start of the list are ('news', word)
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]

# pairs at the end are ('romance', word)
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

 

 

# create the ConditionalFreqDist; typing the variable name confirms it has two conditions
>>> cfd=nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>

>>> cfd.conditions()

['news', 'romance']

 

# each condition has its own frequency distribution
>>> cfd['news']
FreqDist({'the': 5580, ',': 5188, '.': 4030, 'of': 2849, 'and': 2146, 'to': 2116, 'a': 1993, 'in': 1893, 'for': 943, 'The': 806, ...})
>>> cfd['romance']
FreqDist({',': 3899, '.': 3736, 'the': 2758, 'and': 1776, 'to': 1502, 'a': 1335, 'of': 1186, '``': 1045, "''": 1044, 'was': 993, ...})

>>> len(cfd['news'])

14394

>>> len(cfd['romance'])

8452

>>> cfd['romance']['could']
193

 

# plotting and tabulating distributions
>>> from nltk.corpus import inaugural

>>> cfd=nltk.ConditionalFreqDist(

... (target,fileid[:4])

... for fileid in inaugural.fileids()

... for w in inaugural.words(fileid)

... for target in ['america','citizen']

... if w.lower().startswith(target))

>>> cfd.plot()

 

Output: the same 'america'/'citizen' plot as in section 1 (plot image not reproduced here).

 

>>> import nltk

>>> from nltk.corpus import udhr

>>> languages=['Chickasaw-Latin1','English-Latin1','German_Deutsch-Latin1','Greenlandic_Inuktikut-Latin1','Hungarian_Magyar-Latin1','Ibibio_Efik-Latin1','Chinese_Mandarin-GB2312']

>>> cfd = nltk.ConditionalFreqDist(
... (lang, len(word))
... for lang in languages
... for word in udhr.words(lang))

# the English text has 1638 word tokens of 9 or fewer characters (counts are cumulative)
>>> cfd.tabulate(conditions=['English-Latin1','German_Deutsch-Latin1'], # which conditions to display
...              samples=range(10),   # which samples to display
...              cumulative=True)

                         0    1    2    3    4     5     6     7     8     9
       English-Latin1    0  185  525  883  997  1166  1283  1440  1558  1638
German_Deutsch-Latin1    0  171  263  614  717   894  1013  1110  1213  1275

 

 

# generating random text with bigrams
>>> def generate_model(cfdist,word,num=15):
...     for i in range(num):
...         print(word)
...         word=cfdist[word].max() # always follow with the most likely successor
...
>>> text=nltk.corpus.genesis.words('english-kjv.txt')
>>> bigrams=nltk.bigrams(text)
>>> cfd=nltk.ConditionalFreqDist(bigrams)
>>> print(cfd['living'])
<FreqDist with 6 samples and 16 outcomes>
>>> generate_model(cfd,'living')

living

creature

that

he

said

,

and

the

land

of

the

land

of

the

land
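
Because max() always chooses the single most likely successor, the generated text falls into the deterministic "the land of the land of ..." loop. A hedged variant (my own generate_model_random, not from the book) that samples successors in proportion to their bigram frequency avoids that trap:

import random

def generate_model_random(cfdist, word, num=15):
    for i in range(num):
        print(word)
        fd = cfdist[word]
        # sample the next word weighted by observed bigram counts, instead of taking the max
        word = random.choices(list(fd.keys()), weights=list(fd.values()))[0]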

 

Common ConditionalFreqDist methods (example / function):

cfd = ConditionalFreqDist(pairs)   create a conditional frequency distribution from a list of pairs
cfd.conditions()                   the conditions, sorted alphabetically
cfd[condition]                     the frequency distribution for the given condition
cfd.tabulate()                     tabulate the conditional frequency distribution
cfd.tabulate(samples, conditions)  tabulation limited to the specified samples and conditions
cfd.plot()                         graphical plot of the conditional frequency distribution
cfd.plot(samples, conditions)      plot limited to the specified samples and conditions
cfd1 < cfd2                        test if samples in cfd1 occur less frequently than in cfd2

 

 

3 Reusing Python code

Creating programs with a text editor

Multi-line programs can be written in IDLE, the editor that ships with Python.

Functions

A function is a named block of code, defined with def.

Example:


def plural(word):
    """
    将单词变复数
    :param word:单词
    :return: 单词复数
    """
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'


print(plural('fairy'))
print(plural('woman'))

 

Output:

fairies

women

 

Modules

from Two.textproc import plural

print(plural('wish'))

 

Output:

fairies

women

wishes
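
The extra 'fairies' and 'women' lines appear because importing a module runs its top-level code. A sketch of what Two/textproc.py presumably contains (the plural() function from the previous section plus its two test prints):

# Two/textproc.py -- assumed contents of the module imported above
def plural(word):
    """Return a naive plural of word (same rules as in the previous section)."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

# these top-level calls execute on import, printing 'fairies' and 'women'
print(plural('fairy'))
print(plural('woman'))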

 

4 Lexical resources

Wordlist corpora

# find rare or misspelled words in a text

>>> import nltk

>>> def unusual_words(text):
...     text_vocab=set(w.lower() for w in text if w.isalpha())
...     english_vocab=set(w.lower() for w in nltk.corpus.words.words())
...     unusual=text_vocab.difference(english_vocab)
...     return sorted(unusual)
...
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents',
...
 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest']

>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy',
...
 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']

 

# the stopwords corpus contains common high-frequency words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

 

# fraction of the text that is not in the stopword list
>>> def content_fraction(text):
...     stopwords=nltk.corpus.stopwords.words('english')
...     content=[w for w in text if w.lower() not in stopwords]
...     return len(content)/len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())

 

0.735240435097661

 

# find words that satisfy the puzzle's constraints
>>> puzzle_letters=nltk.FreqDist('egivrvonl')
>>> obligatory='r'
>>> wordlist=nltk.corpus.words.words()
>>> [w for w in wordlist if len(w)>=6  # at least six letters long
...     and obligatory in w  # must contain the obligatory letter
...     and nltk.FreqDist(w) <= puzzle_letters] # each letter used no more often than it appears in the puzzle
...
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
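
The FreqDist comparison behaves like multiset inclusion: F1 <= F2 holds only when every sample in F1 occurs at least as often in F2. Two quick checks (my own examples) against the puzzle letters:

>>> nltk.FreqDist('rove') <= puzzle_letters   # every letter is available: True
True
>>> nltk.FreqDist('error') <= puzzle_letters  # needs three r's but the puzzle has one: False
False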

 

 

 

# find names that appear in both files

>>> names=nltk.corpus.names

>>> names.fileids()

['female.txt', 'male.txt']

 

>>> male_names=names.words('male.txt')
>>> female_names=names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie',
...
 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']

 

 

# frequency distribution of the final letters of male and female names

>>> cfd=nltk.ConditionalFreqDist(

... (fileid,name[-1])

... for fileid in names.fileids()

... for name in names.words(fileid))

>>> cfd.plot()

 

Output: a plot of final-letter frequencies, one line per file (plot image not reproduced here).

 

 

A pronouncing dictionary

>>> entries=nltk.corpus.cmudict.entries()
>>> len(entries)
133737
>>> for entry in entries[39943:39951]:
...     print(entry)

...    print(entry)

 

('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])

 

 

# words whose pronunciation consists of three phones
>>> for word,pron in entries:
...     if len(pron)==3:
...         ph1,ph2,ph3=pron
...         if ph1=='P' and ph3=='T':
...             print(word,ph2)

pait EY1

pat AE1

pate EY1

patt AE1

peart ER1

peat IY1

peet IY1

peete IY1

pert ER1

pet EH1

pete IY1

pett EH1

piet IY1

piette IY1

pit IH1

pitt IH1

pot AA1

pote OW1

pott AA1

pout AW1

puett UW1

purt ER1

put UH1

putt AH1

 

 

# find words whose pronunciation ends like 'nicks'
>>> syllable=['N','IH0','K','S']
>>> [word for word,pron in entries if pron[-4:]==syllable]
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']

 

# spelling/pronunciation mismatches: words ending in 'n' whose final phone is M
>>> [w for w,pron in entries if pron[-1]=='M' and w[-1]=='n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']

 

Digits in the phones mark stress:

1 primary stress

2 secondary stress

0 no stress

 

# find words with a particular stress pattern
>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
...
>>> [w for w,pron in entries if stress(pron)==['0','1','0','2','0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator',
...
'unsaturated', 'velociraptor', 'vocabulary', 'voluntarism']

>>> [w for w,pron in entries if stress(pron)==['0','2','0','1','0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients', 'academicians',
...
 'wakabayashi', 'yekaterinburg']

 

# use a conditional frequency distribution to find minimal contrasting sets
# three-phone words starting with 'p', grouped by their first and last phones
>>> p3=[(pron[0]+'-'+pron[2],word)
...     for (word,pron) in entries
...     if pron[0]=='P' and len(pron)==3]
>>> cfd=nltk.ConditionalFreqDist(p3)
>>> for template in cfd.conditions():
...     if len(cfd[template])>10:
...         words=cfd[template].keys()
...         wordlist=' '.join(words)
...         print(template,wordlist[:70]+'...')
...
P-CH petsch piche poche peach putsch piech pautsch pitch patch pitsch puche...
P-N pen payne pun paign pinn pine paine penn pyne pane pawn penh poon pin ...
P-R peer pore poore par paar poor parr pour pier porr pair pare pear por...
P-UW1 prugh prue prew pshew peru peugh plue pew plew pru pugh...
P-L peel pyle pehl peal puhl perl pell pill paull paille peele poul pohl p...
P-Z p.'s paws pause p's paz paiz pows pez poe's perz pays poise pies purrs...
P-K pac poke pak pique puck pik pyke polk paque pack pake pic purk pick pa...
P-S piece pass pearse pasts purse posts perse pease pesce poss perce piss ...
P-T pout pett putt pote pet pott peete puett pert pot pait piet pate peet ...
P-P pipp poop paape pup popp paap peep papp paup pape pipe poppe pope pap ...

 

# looking up entries as a dictionary
>>> product=nltk.corpus.cmudict.dict()
>>> product['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
# looking up a key that does not exist
>>> product['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> product['blog']=[['B','L','AA1','G']]
>>> product['blog']
[['B', 'L', 'AA1', 'G']]

 

 

>>> text=['natural','language','processing']
>>> [ph for w in text for ph in product[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH', 'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']

 

 

Comparative wordlists

The Swadesh wordlists contain about 200 common words in several languages; languages are identified with ISO 639 codes.

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']

 

 

 

>>> fr2en=swadesh.entries(['fr','en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ('quelques', 'some'), ('peu', 'few'), ('autre', 'other'), ('un', 'one'), ('deux', 'two'), ('trois', 'three'), ('quatre', 'four'), ('cinq', 'five'), ('grand', 'big'), ('long', 'long'), ('large', 'wide'), ('épais', 'thick'), ('lourd', 'heavy'), ('petit', 'small'), ('court', 'short'), ('étroit', 'narrow'), ('mince', 'thin'), ('femme', 'woman'), ('homme', 'man (adult male)'), ('homme', 'man (human being)'), ('enfant', 'child'), ('femme, épouse', 'wife'), ('mari, époux', 'husband'), ('mère', 'mother'), ('père', 'father'), ('animal', 'animal'), ('poisson', 'fish'), ('oiseau', 'bird'), ('chien', 'dog'), ('pou', 'louse'), ('serpent', 'snake'), ('ver', 'worm'), ('arbre', 'tree'), ('forêt', 'forest'), ('bâton', 'stick'), ('fruit', 'fruit'), ('graine', 'seed'), ('feuille', 'leaf'), ('racine', 'root'), ('écorce', 'bark (from tree)'), ('fleur', 'flower'), ('herbe', 'grass'), ('corde', 'rope'), ('peau', 'skin'), ('viande', 'meat'), ('sang', 'blood'), ('os', 'bone'), ('graisse', 'fat (noun)'), ('œuf', 'egg'), ('corne', 'horn'), ('queue', 'tail'), ('plume', 'feather'), ('cheveu', 'hair'), ('tête', 'head'), ('oreille', 'ear'), ('œil', 'eye'), ('nez', 'nose'), ('bouche', 'mouth'), ('dent', 'tooth'), ('langue', 'tongue'), ('ongle', 'fingernail'), ('pied', 'foot'), ('jambe', 'leg'), ('genou', 'knee'), ('main', 'hand'), ('aile', 'wing'), ('ventre', 'belly'), ('entrailles', 'guts'), ('cou', 'neck'), ('dos', 'back'), ('sein, poitrine', 'breast'), ('cœur', 'heart'), ('foie', 'liver'), ('boire', 'drink'), ('manger', 'eat'), ('mordre', 'bite'), ('sucer', 'suck'), ('cracher', 'spit'), ('vomir', 'vomit'), ('souffler', 'blow'), ('respirer', 'breathe'), ('rire', 'laugh'), ('voir', 'see'), ('entendre', 'hear'), ('savoir', 'know (a fact)'), ('penser', 'think'), ('sentir', 'smell'), ('craindre, avoir peur', 'fear'), ('dormir', 'sleep'), ('vivre', 'live'), ('mourir', 'die'), ('tuer', 'kill'), ('se battre', 'fight'), ('chasser', 'hunt'), ('frapper', 'hit'), ('couper', 'cut'), ('fendre', 'split'), ('poignarder', 'stab'), ('gratter', 'scratch'), ('creuser', 'dig'), ('nager', 'swim'), ('voler', 'fly (verb)'), ('marcher', 'walk'), ('venir', 'come'), ("s'étendre", 'lie'), ("s'asseoir", 'sit'), ('se lever', 'stand'), ('tourner', 'turn'), ('tomber', 'fall'), ('donner', 'give'), ('tenir', 'hold'), ('serrer', 'squeeze'), ('frotter', 'rub'), ('laver', 'wash'), ('essuyer', 'wipe'), ('tirer', 'pull'), ('pousser', 'push'), ('jeter', 'throw'), ('lier', 'tie'), ('coudre', 'sew'), ('compter', 'count'), ('dire', 'say'), ('chanter', 'sing'), ('jouer', 'play'), ('flotter', 'float'), ('couler', 'flow'), ('geler', 'freeze'), ('gonfler', 'swell'), ('soleil', 'sun'), ('lune', 'moon'), ('étoile', 'star'), ('eau', 'water'), ('pluie', 'rain'), ('rivière', 'river'), ('lac', 'lake'), ('mer', 'sea'), ('sel', 'salt'), ('pierre', 'stone'), ('sable', 'sand'), ('poussière', 'dust'), ('terre', 'earth'), ('nuage', 'cloud'), ('brouillard', 'fog'), ('ciel', 'sky'), ('vent', 'wind'), ('neige', 'snow'), ('glace', 'ice'), ('fumée', 'smoke'), ('feu', 'fire'), ('cendres', 'ashes'), ('brûler', 'burn'), ('route', 'road'), ('montagne', 'mountain'), ('rouge', 'red'), ('vert', 'green'), ('jaune', 'yellow'), ('blanc', 'white'), ('noir', 'black'), ('nuit', 'night'), ('jour', 'day'), ('an, année', 'year'), ('chaud', 'warm'), ('froid', 'cold'), ('plein', 'full'), ('nouveau', 'new'), ('vieux', 'old'), ('bon', 'good'), ('mauvais', 'bad'), ('pourri', 'rotten'), ('sale', 'dirty'), ('droit', 'straight'), ('rond', 'round'), ('tranchant, pointu, aigu', 'sharp'), ('émoussé', 'dull'), ('lisse', 'smooth'), ('mouillé', 'wet'), ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]

 

 

>>> translate=dict(fr2en) # convert to a plain dictionary
>>> translate['chien']
'dog'
>>> translate['jeter']
'throw'
>>> de2en=swadesh.entries(['de','en']) # German-English
>>> es2en=swadesh.entries(['es','en']) # Spanish-English
>>> translate.update(dict(de2en)) # extend the original dictionary
>>> translate.update(dict(es2en))

>>> translate['Hund']

'dog'

>>> translate['perro']

'dog'

 

# compare words across Germanic and Romance languages
>>> languages=['en','de','nl','es','fr','pt','la']
>>> for i in [139,140,141,142]:
...     print(swadesh.entries(languages)[i])
...

('say', 'sagen', 'zeggen', 'decir', 'dire','dizer', 'dicere')

('sing', 'singen', 'zingen', 'cantar','chanter', 'cantar', 'canere')

('play', 'spielen', 'spelen', 'jugar','jouer', 'jogar, brincar', 'ludere')

('float', 'schweben', 'zweven', 'flotar','flotter', 'flutuar, boiar', 'fluctuare')

 

 

Lexicon tools: Toolbox and Shoebox

Toolbox is a tool for managing lexical data, formerly known as Shoebox.

>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')

[('kaa',
  [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'), ('vx', '1'), ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apokaira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')]),
 ('kaa',
  [('ps', 'V'), ('pt', 'B'), ('ge', 'strangle'), ('tkp', 'pasim nek'), ('arg', 'O'), ('vx', '2'),
   ('dt', '07/Oct/2006'),
   ('ex', 'Rera raurororera kaarevoi.'),
   ('xp', 'Em i holim pas em na nekim em.'),
   ('xe', 'He is holding him and strangling him.'),
   ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvarerirovira kaureoparoveira.'),
   ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   ('xe', "They strangled the man's neck with rope because he was very stubborn and arrogant."),
   ('ex', 'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
   ('xp', 'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'),
   ('xe', "They strangled the man's neck with a rope. And he couldn't breathe and he died.")]),
...
]

 

5 WordNet

Senses and synonyms

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
# the lemma names of the synset
>>> wn.synset('car.n.01')._lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']
# the definition of the synset
>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']

# the lemmas of the synset
>>> wn.synset('car.n.01')._lemmas
[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

 

# look up a particular lemma
>>> wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')

# the synset a lemma belongs to
>>> wn.lemma('car.n.01.automobile').synset()
Synset('car.n.01')

# the name of a lemma (via the attribute or the method)
>>> wn.lemma('car.n.01.automobile')._name
'automobile'

>>> wn.lemma('car.n.01.automobile').name()
'automobile'

 

# a word with several synsets
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]

>>> for synset in wn.synsets('car'):
...     print(synset._lemma_names)
...
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']

['car', 'gondola']

['car', 'elevator_car']

['cable_car', 'car']

 

# all lemmas for 'car'
>>> wn.lemmas('car')
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

 

 

The WordNet hierarchy

# hyponyms (more specific concepts)
>>> motorcar=wn.synset('car.n.01')
>>> types_of_motorcar=motorcar.hyponyms()
>>> types_of_motorcar[0]
Synset('ambulance.n.01')

>>> sorted([lemma._name for synset in types_of_motorcar for lemma in synset.lemmas()])
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']

 

# hypernyms (more general concepts)
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths=motorcar.hypernym_paths()

>>> len(paths)
2
>>> [synset._name for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset._name for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

 

# the most general hypernym (the root)

>>> motorcar.root_hypernyms()

[Synset('entity.n.01')]

 

 

Lexical relations

# parts of a tree (part meronyms)
>>> wn.synset('tree.n.01').part_meronyms()
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
# the substances a tree is made of (substance meronyms)
>>> wn.synset('tree.n.01').substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
# the collection a tree belongs to (member holonyms)
>>> wn.synset('tree.n.01').member_holonyms()
[Synset('forest.n.01')]

 

 

# the related senses of the word 'mint'
>>> for synset in wn.synsets('mint',wn.NOUN):
...     print(synset._name+" : ",synset._definition)
...
batch.n.02 :  (often followed by `of') a large number or amount or extent
mint.n.02 :  any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03 :  any member of the mint family of plants
mint.n.04 :  the leaves of a mint plant used fresh or candied
mint.n.05 :  a candy that is flavored with a mint oil
mint.n.06 :  a plant where money is coined by authority of the government

# mint.n.04 (the leaves) is part of mint.n.02 (the plant)
>>> wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
# and is the substance that mint.n.05 (the candy) is made from
>>> wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

 

# relationships between verbs: entailment
>>> wn.synset('walk.v.01').entailments()
[Synset('step.v.01')]
>>> wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
>>> wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

 

# antonyms
>>> wn.lemma('supply.n.02.supply').antonyms()
[Lemma('demand.n.02.demand')]
>>> wn.lemma('rush.v.01.rush').antonyms()
[Lemma('linger.v.04.linger')]
>>> wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]
>>> wn.lemma('staccato.r.01.staccato').antonyms()
[Lemma('legato.r.01.legato')]

 

Semantic similarity

# a few synsets to compare
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> right=wn.synset('right_whale.n.01')
>>> orca=wn.synset('orca.n.01')
>>> minke=wn.synset('minke_whale.n.01')

>>> tortoise=wn.synset('tortoise.n.01')

>>> novel=wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

 

 

# the depth of each synset in the hierarchy
>>> wn.synset('baleen_whale.n.01').min_depth()
14
>>> wn.synset('whale.n.02').min_depth()
13
>>> wn.synset('vertebrate.n.01').min_depth()
8

8

>>> wn.synset('entity.n.01').min_depth()

0

 

# path-based semantic similarity
>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)

0.07692307692307693

>>> right.path_similarity(novel)

0.043478260869565216
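
These scores come from the hypernym hierarchy: path_similarity is 1/(d+1), where d is the length of the shortest path between the two synsets. A quick sanity check using NLTK's shortest_path_distance helper (the 0.25 above implies d = 3):

>>> right.shortest_path_distance(minke)
3
>>> 1/(right.shortest_path_distance(minke)+1)
0.25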

 

 
