Python Natural Language Processing study notes (2)

Chapter 2: Accessing Text Corpora and Lexical Resources

1 Accessing text corpora

The Gutenberg corpus (gutenberg)

>>> import nltk

>>> nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

 

 

# number of word tokens in a text

>>> emma=nltk.corpus.gutenberg.words("austen-emma.txt")

>>> len(emma)

192427

 

# search for a word in the text
>>> emma=nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))
>>> emma.concordance("surprize")

Displaying 25 of 37 matches:

er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." "You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr .
 sound of at first , without great surprize ." So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

 

 

 

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars=len(gutenberg.raw(fileid)) # number of characters, including spaces
...     num_words=len(gutenberg.words(fileid)) # number of word tokens
...     num_sents=len(gutenberg.sents(fileid)) # number of sentences
...     num_vocab=len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print(int(num_chars/num_words), # average word length
...           int(num_words/num_sents), # average sentence length
...           int(num_words/num_vocab), # average number of times each word appears
...           fileid) # file identifier
...

Output:

4 24 26 austen-emma.txt

4 26 16 austen-persuasion.txt

4 28 22 austen-sense.txt

4 33 79 bible-kjv.txt

4 19 5 blake-poems.txt

4 19 14 bryant-stories.txt

4 17 12 burgess-busterbrown.txt

4 20 12 carroll-alice.txt

4 20 11 chesterton-ball.txt

4 22 11 chesterton-brown.txt

4 18 10 chesterton-thursday.txt

4 20 24 edgeworth-parents.txt

4 25 15 melville-moby_dick.txt

4 52 10 milton-paradise.txt

4 11 8 shakespeare-caesar.txt

4 12 7 shakespeare-hamlet.txt

4 12 6 shakespeare-macbeth.txt

4 36 12 whitman-leaves.txt

 

Web and chat text

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65])
...

 

Output:

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb

 

# chat logs
>>> from nltk.corpus import nps_chat
>>> chatroom=nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

 

The Brown corpus

>>> from nltk.corpus import brown
>>> brown.categories() # the categories in the corpus
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

 

# words in the news category
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

 

# words from a specific file
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

 

# sentences from the specified categories
>>> brown.sents(categories=['news','editorial','reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

 

# count modal verbs in a particular genre
>>> news_text=brown.words(categories='news')
>>> fdist=nltk.FreqDist([w.lower() for w in news_text])
>>> modals=['can','could','may','might','must','will']
>>> for m in modals:
...     print(m+" : ",fdist[m])
...

can : 94

could : 87

may : 93

might : 38

must : 53

will : 389

 

# conditional frequency distribution
>>> cfd=nltk.ConditionalFreqDist(
... (genre,word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
>>> genres=['news','religion','hobbies','science_fiction','romance','humor']
>>> modals=['can','could','may','might','must','will']
>>> cfd.tabulate(conditions=genres,samples=modals)

                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13
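
The same conditions and samples can also be drawn as a line plot; a minimal sketch, using the same conditions/samples keyword arguments that tabulate() accepts above:

>>> cfd.plot(conditions=genres, samples=modals)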

 

 

The Reuters corpus

>>> from nltk.corpus import reuters
>>> reuters.fileids()

>>> reuters.fileids()

 

# test data
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840',

...

 'test/21576',

 

# training data
'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', 'training/10008', 'training/10011',

...

'training/9995']

 

# topic categories; a single news story may cover several topics
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

 

# topics covered by one or more documents
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']

>>> reuters.categories(['training/9865','training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']

 

 

 

# documents that contain one or more categories
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']

 

>>> reuters.fileids(['barley','corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341',

...

]

 

# fetch the words we want
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865','training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]

 

 

>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley','corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

 

 

The inaugural address corpus

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']

 

>>> print(inaugural.raw('2009-Obama.txt'))

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.

These are the indicators of crisis, subject to data and statistics. Less measurable but no less profound is a sapping of confidence across our land -- a nagging fear that America's decline is inevitable, that the next generation must lower its sights.

Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know this, America -- they will be met.

On this day, we gather because we have chosen hope over fear, unity of purpose over conflict and discord.

On this day, we come to proclaim an end to the petty grievances and false promises, the recriminations and worn-out dogmas that for far too long have strangled our politics.

We remain a young nation, but in the words of Scripture, the time has come to set aside childish things. The time has come to reaffirm our enduring spirit; to choose our better history; to carry forward that precious gift, that noble idea, passed on from generation to generation: the God-given promise that all are equal, all are free, and all deserve a chance to pursue their full measure of happiness.

In reaffirming the greatness of our nation, we understand that greatness is never a given. It must be earned. Our journey has never been one of shortcuts or settling for less. It has not been the path for the faint-hearted -- for those who prefer leisure over work, or seek only the pleasures of riches and fame. Rather, it has been the risk-takers, the doers, the makers of things -- some celebrated but more often men and women obscure in their labor, who have carried us up the long, rugged path towards prosperity and freedom.

For us, they packed up their few worldly possessions and traveled across oceans in search of a new life.

For us, they toiled in sweatshops and settled the West; endured the lash of the whip and plowed the hard earth.

For us, they fought and died, in places like Concord and Gettysburg; Normandy and Khe Sahn.

Time and again these men and women struggled and sacrificed and worked till their hands were raw so that we might live a better life. They saw America as bigger than the sum of our individual ambitions; greater than all the differences of birth or wealth or faction.

This is the journey we continue today. We remain the most prosperous, powerful nation on Earth. Our workers are no less productive than when this crisis began. Our minds are no less inventive, our goods and services no less needed than they were last week or last month or last year. Our capacity remains undiminished. But our time of standing pat, of protecting narrow interests and putting off unpleasant decisions -- that time has surely passed. Starting today, we must pick ourselves up, dust ourselves off, and begin again the work of remaking America.

For everywhere we look, there is work to be done. The state of our economy calls for action, bold and swift, and we will act -- not only to create new jobs, but to lay a new foundation for growth. We will build the roads and bridges, the electric grids and digital lines that feed our commerce and bind us together. We will restore science to its rightful place, and wield technology's wonders to raise health care's quality and lower its cost. We will harness the sun and the winds and the soil to fuel our cars and run our factories. And we will transform our schools and colleges and universities to meet the demands of a new age. All this we can do. All this we will do.

Now, there are some who question the scale of our ambitions -- who suggest that our system cannot tolerate too many big plans. Their memories are short. For they have forgotten what this country has already done; what free men and women can achieve when imagination is joined to common purpose, and necessity to courage.

What the cynics fail to understand is that the ground has shifted beneath them -- that the stale political arguments that have consumed us for so long no longer apply. The question we ask today is not whether our government is too big or too small, but whether it works -- whether it helps families find jobs at a decent wage, care they can afford, a retirement that is dignified. Where the answer is yes, we intend to move forward. Where the answer is no, programs will end. And those of us who manage the public's dollars will be held to account -- to spend wisely, reform bad habits, and do our business in the light of day -- because only then can we restore the vital trust between a people and their government.

Nor is the question before us whether the market is a force for good or ill. Its power to generate wealth and expand freedom is unmatched, but this crisis has reminded us that without a watchful eye, the market can spin out of control -- the nation cannot prosper long when it favors only the prosperous. The success of our economy has always depended not just on the size of our Gross Domestic Product, but on the reach of our prosperity; on the ability to extend opportunity to every willing heart -- not out of charity, but because it is the surest route to our common good.

As for our common defense, we reject as false the choice between our safety and our ideals. Our Founding Fathers, faced with perils that we can scarcely imagine, drafted a charter to assure the rule of law and the rights of man, a charter expanded by the blood of generations. Those ideals still light the world, and we will not give them up for expedience's sake. And so to all the other peoples and governments who are watching today, from the grandest capitals to the small village where my father was born: know that America is a friend of each nation and every man, woman, and child who seeks a future of peace and dignity, and we are ready to lead once more.

Recall that earlier generations faced down fascism and communism not just with missiles and tanks, but with the sturdy alliances and enduring convictions. They understood that our power alone cannot protect us, nor does it entitle us to do as we please. Instead, they knew that our power grows through its prudent use; our security emanates from the justness of our cause, the force of our example, the tempering qualities of humility and restraint.

We are the keepers of this legacy. Guided by these principles once more, we can meet those new threats that demand even greater effort -- even greater cooperation and understanding between nations. We will begin to responsibly leave Iraq to its people, and forge a hard-earned peace in Afghanistan. With old friends and former foes, we will work tirelessly to lessen the nuclear threat, and roll back the specter of a warming planet. We will not apologize for our way of life, nor will we waver in its defense, and for those who seek to advance their aims by inducing terror and slaughtering innocents, we say to you now that our spirit is stronger and cannot be broken; you cannot outlast us, and we will defeat you.

For we know that our patchwork heritage is a strength, not a weakness. We are a nation of Christians and Muslims, Jews and Hindus -- and non-believers. We are shaped by every language and culture, drawn from every end of this Earth; and because we have tasted the bitter swill of civil war and segregation, and emerged from that dark chapter stronger and more united, we cannot help but believe that the old hatreds shall someday pass; that the lines of tribe shall soon dissolve; that as the world grows smaller, our common humanity shall reveal itself; and that America must play its role in ushering in a new era of peace.

To the Muslim world, we seek a new way forward, based on mutual interest and mutual respect. To those leaders around the globe who seek to sow conflict, or blame their society's ills on the West -- know that your people will judge you on what you can build, not what you destroy. To those who cling to power through corruption and deceit and the silencing of dissent, know that you are on the wrong side of history; but that we will extend a hand if you are willing to unclench your fist.

To the people of poor nations, we pledge to work alongside you to make your farms flourish and let clean waters flow; to nourish starved bodies and feed hungry minds. And to those nations like ours that enjoy relative plenty, we say we can no longer afford indifference to the suffering outside our borders; nor can we consume the world's resources without regard to effect. For the world has changed, and we must change with it.

As we consider the road that unfolds before us, we remember with humble gratitude those brave Americans who, at this very hour, patrol far-off deserts and distant mountains. They have something to tell us, just as the fallen heroes who lie in Arlington whisper through the ages. We honor them not only because they are the guardians of our liberty, but because they embody the spirit of service; a willingness to find meaning in something greater than themselves. And yet, at this moment -- a moment that will define a generation -- it is precisely this spirit that must inhabit us all.

For as much as government can do and must do, it is ultimately the faith and determination of the American people upon which this nation relies. It is the kindness to take in a stranger when the levees break, the selflessness of workers who would rather cut their hours than see a friend lose their job which sees us through our darkest hours. It is the firefighter's courage to storm a stairway filled with smoke, but also a parent's willingness to nurture a child, that finally decides our fate.

Our challenges may be new. The instruments with which we meet them may be new. But those values upon which our success depends -- honesty and hard work, courage and fair play, tolerance and curiosity, loyalty and patriotism -- these things are old. These things are true. They have been the quiet force of progress throughout our history. What is demanded then is a return to these truths. What is required of us now is a new era of responsibility -- a recognition, on the part of every American, that we have duties to ourselves, our nation, and the world, duties that we do not grudgingly accept but rather seize gladly, firm in the knowledge that there is nothing so satisfying to the spirit, so defining of our character, than giving our all to a difficult task.

This is the price and the promise of citizenship.

This is the source of our confidence -- the knowledge that God calls on us to shape an uncertain destiny.

This is the meaning of our liberty and our creed -- why men and women and children of every race and every faith can join in celebration across this magnificent mall, and why a man whose father less than sixty years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath.

So let us mark this day with remembrance, of who we are and how far we have traveled. In the year of America's birth, in the coldest of months, a small band of patriots huddled by dying campfires on the shores of an icy river. The capital was abandoned. The enemy was advancing. The snow was stained with blood. At a moment when the outcome of our revolution was most in doubt, the father of our nation ordered these words be read to the people:

"Let it be told to the future world ... that in the depth of winter, when nothing but hope and virtue could survive ... that the city and the country, alarmed at one common danger, came forth to meet ... it."

America! In the face of our common dangers, in this winter of our hardship, let us remember these timeless words. With hope and virtue, let us brave once more the icy currents, and endure what storms may come. Let it be said by our children's children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God's grace upon us, we carried forth that great gift of freedom and delivered it safely to future generations.

Thank you. God bless you. And God bless the United States of America.

 

 

# conditional frequency distribution

>>> cfd=nltk.ConditionalFreqDist(

... (target,fileid[:4])

... for fileid in inaugural.fileids()

... for w in inaugural.words(fileid)

... for target in ['america','citizen']

... if w.lower().startswith(target))

>>> cfd.plot()

Backend TkAgg is interactive backend. Turning interactive mode on.

 

Output: a plot of how often words beginning with 'america' and 'citizen' occur in each address, by year (plot image not reproduced here).

 

Annotated text corpora



Corpora in other languages

Handling character encodings

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricité_de_France', ...]

>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]

>>> nltk.corpus.indian.words('hindi.pos')
['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...]

>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8',
...
 'Zapoteco-SanLucasQuiavini-Latin1', 'Zhuang-Latin1', 'Zulu-Latin1']

>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...]

 

# differences in word length across language versions
>>> import nltk
>>> from nltk.corpus import udhr
>>> # the book's list omits the encoding suffix and appends '-Latin1' inside udhr.words();
>>> # here the suffix is part of each name, so a GB2312 file can be included as well
>>> languages=['Chickasaw-Latin1','English-Latin1','German_Deutsch-Latin1','Greenlandic_Inuktikut-Latin1','Hungarian_Magyar-Latin1','Ibibio_Efik-Latin1','Chinese_Mandarin-GB2312']

>>> cfd=nltk.ConditionalFreqDist(
... (lang,len(word))
... for lang in languages
... for word in udhr.words(lang))
>>> cfd.plot(cumulative=True)

 

Output: a cumulative plot of word-length distributions, one line per language (plot image not reproduced here).

 

 >>>udhr.raw("Chinese_Mandarin-GB2312")

 

The structure of text corpora

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> raw=gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words=gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']
>>> sents=gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves',
...
]]

 

 

Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root='H:/nltk_data/custom_data'
>>> wordlists=PlaintextCorpusReader(corpus_root,'.*')
>>> wordlists.fileids() # lists the files found under corpus_root
>>> wordlists.words('Jython.txt')
['下载安装文件', 'cmd直接运行', ',', '安装完成', '修改环境变量', ...]
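
A PlaintextCorpusReader offers the same access methods as the built-in corpora, not just words(). A short sketch, reusing the wordlists object above (outputs omitted, since they depend on your files):

>>> wordlists.raw('Jython.txt')[:40]   # the raw character string
>>> wordlists.sents('Jython.txt')[:2]  # sentence-segmented, tokenized text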

 

2 Conditional frequency distributions

>>> from nltk.corpus import brown
>>> cfd=nltk.ConditionalFreqDist(
... (genre,word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

 

>>> genre_word=[(genre,word)
...     for genre in ['news','romance'] # just the news and romance genres
...     for word in brown.words(categories=genre)] # pair each genre with each of its words
...

>>> len(genre_word)

170576

 

# pairs at the start of the list are ('news', word)
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]

# pairs at the end are ('romance', word)
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

 

 

# create the ConditionalFreqDist; typing the variable name confirms it has two conditions
>>> cfd=nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>

>>> cfd.conditions()

['news', 'romance']

 

# each condition has its own frequency distribution
>>> cfd['news']
FreqDist({'the': 5580, ',': 5188, '.': 4030, 'of': 2849, 'and': 2146, 'to': 2116, 'a': 1993, 'in': 1893, 'for': 943, 'The': 806, ...})
>>> cfd['romance']
FreqDist({',': 3899, '.': 3736, 'the': 2758, 'and': 1776, 'to': 1502, 'a': 1335, 'of': 1186, '``': 1045, "''": 1044, 'was': 993, ...})

>>> len(cfd['news'])

14394

>>> len(cfd['romance'])

8452

>>> cfd['romance']['could']
193

 

# plotting and tabulating distributions
>>> from nltk.corpus import inaugural

>>> cfd=nltk.ConditionalFreqDist(

... (target,fileid[:4])

... for fileid in inaugural.fileids()

... for w in inaugural.words(fileid)

... for target in ['america','citizen']

... if w.lower().startswith(target))

>>> cfd.plot()

 

Output: the same 'america'/'citizen' plot as in section 1 (plot image not reproduced here).

 

>>> import nltk

>>> from nltk.corpus import udhr

>>> languages=['Chickasaw-Latin1','English-Latin1','German_Deutsch-Latin1','Greenlandic_Inuktikut-Latin1','Hungarian_Magyar-Latin1','Ibibio_Efik-Latin1','Chinese_Mandarin-GB2312']

>>> cfd = nltk.ConditionalFreqDist(
... (lang, len(word))
... for lang in languages
... for word in udhr.words(lang))

# the English text has 1638 word tokens of 9 or fewer characters (counts are cumulative)
>>> cfd.tabulate(conditions=['English-Latin1','German_Deutsch-Latin1'], # which conditions to display
...              samples=range(10),   # which samples to display
...              cumulative=True)

                         0    1    2    3    4     5     6     7     8     9
       English-Latin1    0  185  525  883  997  1166  1283  1440  1558  1638
German_Deutsch-Latin1    0  171  263  614  717   894  1013  1110  1213  1275

 

 

# generating random text with bigrams
>>> def generate_model(cfdist,word,num=15):
...     for i in range(num):
...         print(word)
...         word=cfdist[word].max() # always follow with the most likely successor
...
>>> text=nltk.corpus.genesis.words('english-kjv.txt')
>>> bigrams=nltk.bigrams(text)
>>> cfd=nltk.ConditionalFreqDist(bigrams)
>>> print(cfd['living'])
<FreqDist with 6 samples and 16 outcomes>
>>> generate_model(cfd,'living')

living

creature

that

he

said

,

and

the

land

of

the

land

of

the

land
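
Because max() always chooses the single most likely successor, the generated text falls into the deterministic "the land of the land of ..." loop. A hedged variant (my own generate_model_random, not from the book) that samples successors in proportion to their bigram frequency avoids that trap:

import random

def generate_model_random(cfdist, word, num=15):
    for i in range(num):
        print(word)
        fd = cfdist[word]
        # sample the next word weighted by observed bigram counts, instead of taking the max
        word = random.choices(list(fd.keys()), weights=list(fd.values()))[0]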

 

Common ConditionalFreqDist methods (example / function):

cfd = ConditionalFreqDist(pairs)   create a conditional frequency distribution from a list of pairs
cfd.conditions()                   the conditions, sorted alphabetically
cfd[condition]                     the frequency distribution for the given condition
cfd.tabulate()                     tabulate the conditional frequency distribution
cfd.tabulate(samples, conditions)  tabulation limited to the specified samples and conditions
cfd.plot()                         graphical plot of the conditional frequency distribution
cfd.plot(samples, conditions)      plot limited to the specified samples and conditions
cfd1 < cfd2                        test if samples in cfd1 occur less frequently than in cfd2

 

 

3 Reusing Python code

Creating programs with a text editor

Multi-line programs can be written in IDLE, the editor that ships with Python.

Functions

A function is a named block of code, defined with def.

Example:


def plural(word):
    """
    将单词变复数
    :param word:单词
    :return: 单词复数
    """
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'


print(plural('fairy'))
print(plural('woman'))

 

Output:

fairies

women

 

Modules

from Two.textproc import plural

print(plural('wish'))

 

Output:

fairies

women

wishes
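
The extra 'fairies' and 'women' lines appear because importing a module runs its top-level code. A sketch of what Two/textproc.py presumably contains (the plural() function from the previous section plus its two test prints):

# Two/textproc.py -- assumed contents of the module imported above
def plural(word):
    """Return a naive plural of word (same rules as in the previous section)."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

# these top-level calls execute on import, printing 'fairies' and 'women'
print(plural('fairy'))
print(plural('woman'))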

 

4 Lexical resources

Wordlist corpora

# find rare or misspelled words in a text

>>> import nltk

>>> def unusual_words(text):
...     text_vocab=set(w.lower() for w in text if w.isalpha())
...     english_vocab=set(w.lower() for w in nltk.corpus.words.words())
...     unusual=text_vocab.difference(english_vocab)
...     return sorted(unusual)
...
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents',
...
 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest']

>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy',
...
 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']

 

# the stopwords corpus contains common high-frequency words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

 

# fraction of the text that is not in the stopword list
>>> def content_fraction(text):
...     stopwords=nltk.corpus.stopwords.words('english')
...     content=[w for w in text if w.lower() not in stopwords]
...     return len(content)/len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())

 

0.735240435097661

 

# find words that satisfy the puzzle's constraints
>>> puzzle_letters=nltk.FreqDist('egivrvonl')
>>> obligatory='r'
>>> wordlist=nltk.corpus.words.words()
>>> [w for w in wordlist if len(w)>=6  # at least six letters long
...     and obligatory in w  # must contain the obligatory letter
...     and nltk.FreqDist(w) <= puzzle_letters] # each letter used no more often than it appears in the puzzle
...
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
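
The FreqDist comparison behaves like multiset inclusion: F1 <= F2 holds only when every sample in F1 occurs at least as often in F2. Two quick checks (my own examples) against the puzzle letters:

>>> nltk.FreqDist('rove') <= puzzle_letters   # every letter is available: True
True
>>> nltk.FreqDist('error') <= puzzle_letters  # needs three r's but the puzzle has one: False
False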

 

 

 

# find names that appear in both files

>>> names=nltk.corpus.names

>>> names.fileids()

['female.txt', 'male.txt']

 

>>> male_names=names.words('male.txt')
>>> female_names=names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie',
...
 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']

 

 

# frequency distribution of the final letters of male and female names

>>> cfd=nltk.ConditionalFreqDist(

... (fileid,name[-1])

... for fileid in names.fileids()

... for name in names.words(fileid))

>>> cfd.plot()

 

Output: a plot of final-letter frequencies, one line per file (plot image not reproduced here).

 

 

A pronouncing dictionary

>>> entries=nltk.corpus.cmudict.entries()
>>> len(entries)
133737
>>> for entry in entries[39943:39951]:
...     print(entry)

...    print(entry)

 

('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])

 

 

# words whose pronunciation consists of three phones
>>> for word,pron in entries:
...     if len(pron)==3:
...         ph1,ph2,ph3=pron
...         if ph1=='P' and ph3=='T':
...             print(word,ph2)

pait EY1

pat AE1

pate EY1

patt AE1

peart ER1

peat IY1

peet IY1

peete IY1

pert ER1

pet EH1

pete IY1

pett EH1

piet IY1

piette IY1

pit IH1

pitt IH1

pot AA1

pote OW1

pott AA1

pout AW1

puett UW1

purt ER1

put UH1

putt AH1

 

 

# find words whose pronunciation ends like 'nicks'
>>> syllable=['N','IH0','K','S']
>>> [word for word,pron in entries if pron[-4:]==syllable]
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']

 

# spelling/pronunciation mismatches: words ending in 'n' whose final phone is M
>>> [w for w,pron in entries if pron[-1]=='M' and w[-1]=='n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']

 

Digits in the phones mark stress:

1 primary stress

2 secondary stress

0 no stress

 

# find words with a particular stress pattern
>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
...
>>> [w for w,pron in entries if stress(pron)==['0','1','0','2','0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator',
...
'unsaturated', 'velociraptor', 'vocabulary', 'voluntarism']

>>> [w for w,pron in entries if stress(pron)==['0','2','0','1','0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients', 'academicians',
...
 'wakabayashi', 'yekaterinburg']

 

# use a conditional frequency distribution to find minimal contrasting sets
# three-phone words starting with 'p', grouped by their first and last phones
>>> p3=[(pron[0]+'-'+pron[2],word)
...     for (word,pron) in entries
...     if pron[0]=='P' and len(pron)==3]
>>> cfd=nltk.ConditionalFreqDist(p3)
>>> for template in cfd.conditions():
...     if len(cfd[template])>10:
...         words=cfd[template].keys()
...         wordlist=' '.join(words)
...         print(template,wordlist[:70]+'...')
...
P-CH petsch piche poche peach putsch piech pautsch pitch patch pitsch puche...
P-N pen payne pun paign pinn pine paine penn pyne pane pawn penh poon pin ...
P-R peer pore poore par paar poor parr pour pier porr pair pare pear por...
P-UW1 prugh prue prew pshew peru peugh plue pew plew pru pugh...
P-L peel pyle pehl peal puhl perl pell pill paull paille peele poul pohl p...
P-Z p.'s paws pause p's paz paiz pows pez poe's perz pays poise pies purrs...
P-K pac poke pak pique puck pik pyke polk paque pack pake pic purk pick pa...
P-S piece pass pearse pasts purse posts perse pease pesce poss perce piss ...
P-T pout pett putt pote pet pott peete puett pert pot pait piet pate peet ...
P-P pipp poop paape pup popp paap peep papp paup pape pipe poppe pope pap ...

 

# looking up entries as a dictionary
>>> product=nltk.corpus.cmudict.dict()
>>> product['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
# looking up a key that does not exist
>>> product['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> product['blog']=[['B','L','AA1','G']]
>>> product['blog']
[['B', 'L', 'AA1', 'G']]

 

 

>>> text=['natural','language','processing']
>>> [ph for w in text for ph in product[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH', 'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']

 

 

Comparative wordlists

The Swadesh wordlists contain about 200 common words in several languages; languages are identified with ISO 639 codes.

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']

 

 

 

>>> fr2en=swadesh.entries(['fr','en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ('quelques', 'some'), ('peu', 'few'), ('autre', 'other'), ('un', 'one'), ('deux', 'two'), ('trois', 'three'), ('quatre', 'four'), ('cinq', 'five'), ('grand', 'big'), ('long', 'long'), ('large', 'wide'), ('épais', 'thick'), ('lourd', 'heavy'), ('petit', 'small'), ('court', 'short'), ('étroit', 'narrow'), ('mince', 'thin'), ('femme', 'woman'), ('homme', 'man (adult male)'), ('homme', 'man (human being)'), ('enfant', 'child'), ('femme, épouse', 'wife'), ('mari, époux', 'husband'), ('mère', 'mother'), ('père', 'father'), ('animal', 'animal'), ('poisson', 'fish'), ('oiseau', 'bird'), ('chien', 'dog'), ('pou', 'louse'), ('serpent', 'snake'), ('ver', 'worm'), ('arbre', 'tree'), ('forêt', 'forest'), ('bâton', 'stick'), ('fruit', 'fruit'), ('graine', 'seed'), ('feuille', 'leaf'), ('racine', 'root'), ('écorce', 'bark (from tree)'), ('fleur', 'flower'), ('herbe', 'grass'), ('corde', 'rope'), ('peau', 'skin'), ('viande', 'meat'), ('sang', 'blood'), ('os', 'bone'), ('graisse', 'fat (noun)'), ('œuf', 'egg'), ('corne', 'horn'), ('queue', 'tail'), ('plume', 'feather'), ('cheveu', 'hair'), ('tête', 'head'), ('oreille', 'ear'), ('œil', 'eye'), ('nez', 'nose'), ('bouche', 'mouth'), ('dent', 'tooth'), ('langue', 'tongue'), ('ongle', 'fingernail'), ('pied', 'foot'), ('jambe', 'leg'), ('genou', 'knee'), ('main', 'hand'), ('aile', 'wing'), ('ventre', 'belly'), ('entrailles', 'guts'), ('cou', 'neck'), ('dos', 'back'), ('sein, poitrine', 'breast'), ('cœur', 'heart'), ('foie', 'liver'), ('boire', 'drink'), ('manger', 'eat'), ('mordre', 'bite'), ('sucer', 'suck'), ('cracher', 'spit'), ('vomir', 'vomit'), ('souffler', 'blow'), ('respirer', 'breathe'), ('rire', 'laugh'), ('voir', 'see'), ('entendre', 'hear'), ('savoir', 'know (a fact)'), ('penser', 'think'), ('sentir', 'smell'), ('craindre, avoir peur', 'fear'), ('dormir', 'sleep'), ('vivre', 'live'), ('mourir', 'die'), ('tuer', 'kill'), ('se battre', 'fight'), ('chasser', 'hunt'), ('frapper', 'hit'), ('couper', 'cut'), ('fendre', 'split'), ('poignarder', 'stab'), ('gratter', 'scratch'), ('creuser', 'dig'), ('nager', 'swim'), ('voler', 'fly (verb)'), ('marcher', 'walk'), ('venir', 'come'), ("s'étendre", 'lie'), ("s'asseoir", 'sit'), ('se lever', 'stand'), ('tourner', 'turn'), ('tomber', 'fall'), ('donner', 'give'), ('tenir', 'hold'), ('serrer', 'squeeze'), ('frotter', 'rub'), ('laver', 'wash'), ('essuyer', 'wipe'), ('tirer', 'pull'), ('pousser', 'push'), ('jeter', 'throw'), ('lier', 'tie'), ('coudre', 'sew'), ('compter', 'count'), ('dire', 'say'), ('chanter', 'sing'), ('jouer', 'play'), ('flotter', 'float'), ('couler', 'flow'), ('geler', 'freeze'), ('gonfler', 'swell'), ('soleil', 'sun'), ('lune', 'moon'), ('étoile', 'star'), ('eau', 'water'), ('pluie', 'rain'), ('rivière', 'river'), ('lac', 'lake'), ('mer', 'sea'), ('sel', 'salt'), ('pierre', 'stone'), ('sable', 'sand'), ('poussière', 'dust'), ('terre', 'earth'), ('nuage', 'cloud'), ('brouillard', 'fog'), ('ciel', 'sky'), ('vent', 'wind'), ('neige', 'snow'), ('glace', 'ice'), ('fumée', 'smoke'), ('feu', 'fire'), ('cendres', 'ashes'), ('brûler', 'burn'), ('route', 'road'), ('montagne', 'mountain'), ('rouge', 'red'), ('vert', 'green'), ('jaune', 'yellow'), ('blanc', 'white'), ('noir', 'black'), ('nuit', 'night'), ('jour', 'day'), ('an, année', 'year'), ('chaud', 'warm'), ('froid', 'cold'), ('plein', 'full'), ('nouveau', 'new'), ('vieux', 'old'), ('bon', 'good'), ('mauvais', 'bad'), ('pourri', 'rotten'), ('sale', 'dirty'), ('droit', 'straight'), ('rond', 'round'), ('tranchant, pointu, aigu', 'sharp'), ('émoussé', 'dull'), ('lisse', 'smooth'), ('mouillé', 'wet'), ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]

 

 

>>> translate=dict(fr2en) # convert to a plain dictionary
>>> translate['chien']
'dog'
>>> translate['jeter']
'throw'
>>> de2en=swadesh.entries(['de','en']) # German-English
>>> es2en=swadesh.entries(['es','en']) # Spanish-English
>>> translate.update(dict(de2en)) # extend the original dictionary
>>> translate.update(dict(es2en))

>>> translate['Hund']

'dog'

>>> translate['perro']

'dog'

 

# compare words across Germanic and Romance languages
>>> languages=['en','de','nl','es','fr','pt','la']
>>> for i in [139,140,141,142]:
...     print(swadesh.entries(languages)[i])
...

('say', 'sagen', 'zeggen', 'decir', 'dire','dizer', 'dicere')

('sing', 'singen', 'zingen', 'cantar','chanter', 'cantar', 'canere')

('play', 'spielen', 'spelen', 'jugar','jouer', 'jogar, brincar', 'ludere')

('float', 'schweben', 'zweven', 'flotar','flotter', 'flutuar, boiar', 'fluctuare')

 

 

Lexicon tools: Toolbox and Shoebox

Toolbox is a tool for managing lexical data, formerly known as Shoebox.

>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')

[('kaa',
  [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'), ('vx', '1'), ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apokaira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')]),
 ('kaa',
  [('ps', 'V'), ('pt', 'B'), ('ge', 'strangle'), ('tkp', 'pasim nek'), ('arg', 'O'), ('vx', '2'),
   ('dt', '07/Oct/2006'),
   ('ex', 'Rera raurororera kaarevoi.'),
   ('xp', 'Em i holim pas em na nekim em.'),
   ('xe', 'He is holding him and strangling him.'),
   ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvarerirovira kaureoparoveira.'),
   ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   ('xe', "They strangled the man's neck with rope because he was very stubborn and arrogant."),
   ('ex', 'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
   ('xp', 'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'),
   ('xe', "They strangled the man's neck with a rope. And he couldn't breathe and he died.")]),
...
]

 

5 WordNet

Senses and synonyms

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
# the lemma names of the synset
>>> wn.synset('car.n.01')._lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']
# the definition of the synset
>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']

# the lemmas of the synset
>>> wn.synset('car.n.01')._lemmas
[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

 

# look up a particular lemma
>>> wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')

# the synset a lemma belongs to
>>> wn.lemma('car.n.01.automobile').synset()
Synset('car.n.01')

# the name of a lemma (via the attribute or the method)
>>> wn.lemma('car.n.01.automobile')._name
'automobile'

>>> wn.lemma('car.n.01.automobile').name()
'automobile'

 

# a word with several synsets
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]

>>> for synset in wn.synsets('car'):
...     print(synset._lemma_names)
...
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']

['car', 'gondola']

['car', 'elevator_car']

['cable_car', 'car']

 

# all lemmas for 'car'
>>> wn.lemmas('car')
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

 

 

The WordNet hierarchy

# hyponyms (more specific concepts)
>>> motorcar=wn.synset('car.n.01')
>>> types_of_motorcar=motorcar.hyponyms()
>>> types_of_motorcar[0]
Synset('ambulance.n.01')

>>> sorted([lemma._name for synset in types_of_motorcar for lemma in synset.lemmas()])
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']

 

# hypernyms (more general concepts)
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths=motorcar.hypernym_paths()

>>> len(paths)
2
>>> [synset._name for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset._name for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

 

# the most general hypernym (the root)

>>> motorcar.root_hypernyms()

[Synset('entity.n.01')]

 

 

Lexical relations

# parts of a tree (part meronyms)
>>> wn.synset('tree.n.01').part_meronyms()
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
# the substances a tree is made of (substance meronyms)
>>> wn.synset('tree.n.01').substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
# the collection a tree belongs to (member holonyms)
>>> wn.synset('tree.n.01').member_holonyms()
[Synset('forest.n.01')]

 

 

# the related senses of the word 'mint'
>>> for synset in wn.synsets('mint',wn.NOUN):
...     print(synset._name+" : ",synset._definition)
...
batch.n.02 :  (often followed by `of') a large number or amount or extent
mint.n.02 :  any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03 :  any member of the mint family of plants
mint.n.04 :  the leaves of a mint plant used fresh or candied
mint.n.05 :  a candy that is flavored with a mint oil
mint.n.06 :  a plant where money is coined by authority of the government

# mint.n.04 (the leaves) is part of mint.n.02 (the plant)
>>> wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
# and is the substance that mint.n.05 (the candy) is made from
>>> wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

 

# relationships between verbs: entailment
>>> wn.synset('walk.v.01').entailments()
[Synset('step.v.01')]
>>> wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
>>> wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

 

# antonyms
>>> wn.lemma('supply.n.02.supply').antonyms()
[Lemma('demand.n.02.demand')]
>>> wn.lemma('rush.v.01.rush').antonyms()
[Lemma('linger.v.04.linger')]
>>> wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]
>>> wn.lemma('staccato.r.01.staccato').antonyms()
[Lemma('legato.r.01.legato')]

 

Semantic similarity

# a few synsets to compare
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> right=wn.synset('right_whale.n.01')
>>> orca=wn.synset('orca.n.01')
>>> minke=wn.synset('minke_whale.n.01')

>>> tortoise=wn.synset('tortoise.n.01')

>>> novel=wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

 

 

# the depth of each synset in the hierarchy
>>> wn.synset('baleen_whale.n.01').min_depth()
14
>>> wn.synset('whale.n.02').min_depth()
13
>>> wn.synset('vertebrate.n.01').min_depth()
8

8

>>> wn.synset('entity.n.01').min_depth()

0

 

# path-based semantic similarity
>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)

0.07692307692307693

>>> right.path_similarity(novel)

0.043478260869565216
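
These scores come from the hypernym hierarchy: path_similarity is 1/(d+1), where d is the length of the shortest path between the two synsets. A quick sanity check using NLTK's shortest_path_distance helper (the 0.25 above implies d = 3):

>>> right.shortest_path_distance(minke)
3
>>> 1/(right.shortest_path_distance(minke)+1)
0.25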

 

 
