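Preface: Starting today I am formally studying natural language processing, alongside statistical learning methods and machine learning. I hope I can keep it up.
(The answers below are not official solutions. If you spot a mistake, please reply and point it out. Thanks for your understanding.)
Before starting, import nltk and nltk.book: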
import nltk
from nltk.book import *
12/(4+1)
output:2.4
26**100
output:3142930641582938830174357788501626427282669988762475256374173175398995908420104023465432599069702289330964075081611719197835869803511992549376
print(['Monty','Python'] * 20)
print(3 * sent1)
output:['Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python']
['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.']
print(len(text2),len(set(text2)))
output:141576 6833
text2.dispersion_plot(["Elinor","Marianne","Edward","Willoughby"])
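From the plot, Elinor and Marianne (the two Dashwood sisters) are mentioned throughout the novel, while Edward and Willoughby show up in separate stretches. The overlaps suggest that the likely couples are Elinor with Edward and Marianne with Willoughby.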
text5.collocation_list()
output:['wanna chat', 'PART JOIN', 'MODE #14-19teens', 'JOIN PART', 'PART PART', 'cute.-ass MP3', 'MP3 player', 'JOIN JOIN', 'times .. .', 'ACTION watches', 'guys wanna', 'song lasts', 'last night', 'ACTION sits', '-...)...- S.M.R.', 'Lime Player', 'Player 12%', 'dont know', 'lez gurls', 'long time']
len(set(text4))
output:9913
The expression combines the built-in functions len() and set(); it gives the number of distinct tokens (word types) in text4.
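As a small illustration of the same idea that runs without the NLTK corpora, the sketch below counts tokens and distinct tokens in a made-up list. The list `tokens` is an assumption for illustration and is not taken from text4.

```python
# Toy token list (hypothetical; it stands in for an nltk Text such as text4).
tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'The', 'dog', 'ran', '.']

n_tokens = len(tokens)       # total tokens, duplicates included
n_types = len(set(tokens))   # distinct tokens; 'the' and 'The' count separately

print(n_tokens, n_types)
```

Note that set() is case-sensitive, so the count of distinct tokens depends on whether the tokens were lowercased first.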
my_string = 'My String'
print(my_string)
my_string
output:My String 'My String'
print(my_string+my_string,'\n'+
my_string*3,'\n'+
my_string+' '+my_string)
output:My StringMy String
My StringMy StringMy String
My String My String
my_sent = ["My", "sent"]
msent1 = ''.join(my_sent)
print(msent1)
output:Mysent
msent2 = msent1.split("s")
print(msent2)
output:['My', 'ent']
phrase1 = ["123", "is"]
phrase2 = ["my", "password"]
print(phrase1 + phrase2)
print(len(phrase1 + phrase2))
print(len(phrase1)+len(phrase2))
output:['123', 'is', 'my', 'password']
4
4
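len(phrase1 + phrase2) concatenates the two lists first and then takes the length of the combined list.
len(phrase1)+len(phrase2) takes the length of each list separately and then adds the two, so both expressions always give the same value.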
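The equality of the two expressions can be checked directly. The sketch below reuses the two lists from the exercise and adds a string case for contrast, where len() counts characters rather than list items; the names `combined` and `chars` are introduced here only for illustration.

```python
phrase1 = ["123", "is"]
phrase2 = ["my", "password"]

combined = phrase1 + phrase2   # list concatenation gives a new list
same = len(combined) == len(phrase1) + len(phrase2)

# For strings, + also concatenates, but len() then counts characters.
chars = len("123" + "is")

print(combined, same, chars)
```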
print("Monty Python"[6:12])
print(["Monty", "Python"][1])
output:Python
Python
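The first takes a slice of a string; the second takes an item from a list. The second is generally more relevant in NLP, where a text is usually handled as a list of tokens.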
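To make the difference concrete, here is a minimal sketch that reaches the same word in both ways. Splitting on whitespace with str.split() is a simplification assumed here; it is not how the NLTK texts were tokenized.

```python
text = "Monty Python"
tokens = text.split()          # ['Monty', 'Python'], a naive whitespace split

by_characters = text[6:12]     # positions 6..11 of the string
by_token = tokens[1]           # second item of the token list

print(by_characters, by_token)
```

The slice depends on exact character offsets, while the list index depends only on the word's position, which is why token lists are usually easier to work with.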
sent1[2][2]
output:'h'
sent1[2] is 'Ishmael', so the expression is equivalent to 'Ishmael'[2], the third character of the string: 'h'.
for i in range(len(sent3)):
    if sent3[i] == 'the':
        print(i)
output:1
5
8
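The same search can also be written with enumerate(), which yields each index together with its token. This is a sketch of an alternative; sent3 is written out literally so that it runs without nltk.book.

```python
# sent3 as defined in nltk.book, written out so the example is self-contained.
sent3 = ['In', 'the', 'beginning', 'God', 'created', 'the',
         'heaven', 'and', 'the', 'earth', '.']

# Collect every index whose token equals 'the'.
positions = [i for i, word in enumerate(sent3) if word == 'the']

print(positions)  # [1, 5, 8]
```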
sorted([word for word in set(text5) if word.startswith('b')])
output:['b', 'b-day', 'b/c', 'b4', 'babay', 'babble', 'babblein', 'babe', 'babes', 'babi', 'babies', 'babiess', 'baby', 'babycakeses', 'bachelorette', 'back', 'backatchya', 'backfrontsidewaysandallaroundtheworld', 'backroom', 'backup', 'bacl', 'bad', 'bag', 'bagel', 'bagels', 'bahahahaa', 'bak', 'baked', 'balad', 'balance', 'balck', 'ball', 'ballin', 'balls', 'ban', 'band', 'bandito', 'bandsaw', 'banjoes', 'banned', 'baord', 'bar', 'barbie', 'bare', 'barely', 'bares', 'barfights', 'barks', 'barn', 'barrel', 'base', 'bases', 'basically', 'basket', 'battery', 'bay', 'bbbbbyyyyyyyeeeeeeeee', 'bbiam', 'bbl', 'bbs', 'bc', 'be', 'beach', 'beachhhh', 'beam', 'beams', 'beanbag', 'beans', 'bear', 'bears', 'beat', 'beaten', 'beatles', 'beats', 'beattles', 'beautiful', 'because', 'beckley', 'become', 'bed', 'bedford', 'bedroom', 'beeeeehave', 'beeehave', 'been', 'beer', 'before'...
(remainder omitted)
print(list(range(10)),
list(range(10, 20)),
list(range(10, 20, 2)),
list(range(20, 10, -2)))
output:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [10, 11, 12, 13, 14, 15, 16, 17, 18, 19] [10, 12, 14, 16, 18] [20, 18, 16, 14, 12]
idx = text9.index('sunset')  # 629, the first occurrence of 'sunset'
startin = 0
endin = len(text9)
# Scan backwards for the token that ends the previous sentence.
for i in range(idx, 1, -1):
    if text9[i] in ('.', '?', '!'):
        startin = i
        break
# Scan forwards for the token that ends this sentence.
for i in range(idx, endin):
    if text9[i] in ('.', '?', '!'):
        endin = i
        break
print(text9[startin+1:endin+1])
output:['CHAPTER', 'I', 'THE', 'TWO', 'POETS', 'OF', 'SAFFRON', 'PARK', 'THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a', 'cloud', 'of', 'sunset', '.']
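The two scans can be wrapped in a reusable function. The following is a sketch under stated assumptions: the function name `sentence_around`, its `enders` parameter, and the toy token list are all hypothetical, and a sentence is simply taken to end at '.', '?' or '!'.

```python
def sentence_around(tokens, idx, enders=('.', '?', '!')):
    """Return the slice of tokens forming the sentence that contains tokens[idx]."""
    # Scan backwards: the sentence starts just after the previous terminator.
    start = 0
    for i in range(idx - 1, -1, -1):
        if tokens[i] in enders:
            start = i + 1
            break
    # Scan forwards: the sentence ends at the next terminator, inclusive.
    end = len(tokens)
    for i in range(idx, len(tokens)):
        if tokens[i] in enders:
            end = i + 1
            break
    return tokens[start:end]

# Toy data (not from text9).
tokens = ['It', 'rained', '.', 'The', 'sunset', 'was', 'red', '.', 'Night', 'fell', '.']
result = sentence_around(tokens, tokens.index('sunset'))
print(result)
```

Because the start defaults to 0 and the end to len(tokens), the function also handles a target word in the first or last sentence.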
word_list = set(sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8)
print(sorted(word_list))
output:['!', ',', '-', '.', '1', '25', '29', '61', ':', 'ARTHUR', 'Call', 'Citizens', 'Dashwood', 'Fellow', 'God', 'House', 'I', 'In', 'Ishmael', 'JOIN', 'KING', 'MALE', 'Nov.', 'PMing', 'Pierre', 'Representatives', 'SCENE', 'SEXY', 'Senate', 'Sussex', 'The', 'Vinken', 'Whoa', '[', ']', 'a', 'and', 'as', 'attrac', 'been', 'beginning', 'board', 'clop', 'created', 'director', 'discreet', 'earth', 'encounters', 'family', 'for', 'had', 'have', 'heaven', 'in', 'join', 'lady', 'lol', 'long', 'me', 'nonexecutive', 'of', 'old', 'older', 'people', 'problem', 'seeks', 'settled', 'single', 'the', 'there', 'to', 'will', 'wind', 'with', 'years']
print(len(sorted(set([w.lower() for w in text1]))))
print(len(sorted([w.lower() for w in set(text1)])))
output:17231
19317
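The first lowercases every token in text1 and then applies set(), so words that differ only in case are merged into one.
The second applies set() first and lowercases afterwards, so forms such as 'The' and 'the' have already been kept as two distinct items, and the resulting list can contain duplicates.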
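The effect of the ordering is easy to see on a tiny example. The token list below is made up for illustration and is not taken from text1.

```python
# Toy tokens (hypothetical) with words that differ only in case.
tokens = ['The', 'whale', 'the', 'Whale', 'sea', 'the']

# Lowercase first, then deduplicate: case variants collapse.
lower_then_set = set(w.lower() for w in tokens)

# Deduplicate first, then lowercase: case variants survive as duplicates.
set_then_lower = [w.lower() for w in set(tokens)]

print(len(lower_then_set), len(set_then_lower))
```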
w = ']'
print(w.isupper())
print(not w.islower())
output:False
True
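The first expression tests whether w is uppercase: True means w contains at least one cased character and all of its cased characters are uppercase; False tells us nothing more, since w could be lowercase, mixed case, or, as with ']', contain no letters at all.
The second only tests whether w is not all-lowercase: False means w is lowercase, while True covers uppercase, mixed-case, and non-alphabetic tokens alike.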
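Running both tests over a few representative strings shows where they disagree. The sample strings are chosen here for illustration.

```python
samples = ['ABC', 'abc', 'Abc', ']', '42']

# Map each sample to the pair (w.isupper(), not w.islower()).
results = {w: (w.isupper(), not w.islower()) for w in samples}

for w, (is_upper, not_lower) in results.items():
    print(repr(w), is_upper, not_lower)
```

The two expressions agree only on fully uppercase and fully lowercase strings; for mixed-case strings and strings without letters, isupper() is False while not islower() is True.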
text2[-2::1]
output:['THE', 'END']
FreqDist([word for word in text5 if len(word)==4])
output:FreqDist({'JOIN': 1021, 'PART': 1016, 'that': 274, 'what': 183, 'here': 181, '....': 170, 'have': 164, 'like': 156, 'with': 152, 'chat': 142, ...})
set([word for word in text6 if word.isupper()])
output:{'A', 'ALL', 'AMAZING', 'ANIMATOR', 'ARMY', 'ARTHUR', 'B', 'BEDEVERE', 'BLACK', 'BORS', 'BRIDE', 'BRIDGEKEEPER', 'BROTHER', 'C', 'CAMERAMAN', 'CART', 'CARTOON', 'CHARACTER', 'CHARACTERS', 'CONCORDE', 'CRAPPER', 'CRASH', 'CRONE', 'CROWD', 'CUSTOMER', 'DEAD', 'DENNIS', 'DINGO', 'DIRECTOR', 'ENCHANTER', 'FATHER', 'FRENCH', 'GALAHAD', 'GIRLS', 'GOD', 'GREEN', 'GUARD', 'GUARDS', 'GUEST', 'GUESTS', 'HEAD', 'HEADS', 'HERBERT', 'HISTORIAN', 'I', 'INSPECTOR', 'KING', 'KNIGHT', 'KNIGHTS', 'LAUNCELOT', 'LEFT', 'LOVELY', 'LUCKY', 'MAN', 'MASTER', 'MAYNARD', 'MIDDLE', 'MIDGET', 'MINSTREL', 'MONKS', 'N', 'NARRATOR', 'NI', 'O', 'OF', 'OFFICER', 'OLD', 'OTHER', 'PARTY', 'PATSY', 'PERSON', 'PIGLET', 'PRINCE', 'PRINCESS', 'PRISONER', 'RANDOM', 'RIGHT', 'ROBIN', 'ROGER', 'S', 'SCENE', 'SECOND', 'SENTRY', 'SHRUBBER', 'SIR', 'SOLDIER', 'STUNNER', 'SUN', 'THE', 'TIM', 'U', 'VILLAGER', 'VILLAGERS', 'VOICE', 'W', 'WIFE', 'WINSTON', 'WITCH', 'WOMAN', 'Y', 'ZOOT'}
print([word for word in text6 if word.endswith("ize")],
[word for word in text6 if 'z' in word],
[word for word in text6 if 'pt' in word],
[word for word in text6 if word.istitle()])
output:[] ['zone', 'amazes', 'Fetchez', 'Fetchez', 'zoop', 'zoo', 'zhiv', 'frozen', 'zoosh'] ['empty', 'aptly', 'Thpppppt', 'Thppt', 'Thppt', 'empty', 'Thppppt', 'temptress', 'temptation', 'ptoo', 'Chapter', 'excepting', 'Thpppt'] ['Whoa', 'Halt', 'Who', 'It', 'I', 'Arthur', 'Uther', 'Pendragon', 'Camelot', 'King', 'Britons', 'Saxons', 'England', 'Pull', 'I', 'Patsy', 'We', 'Camelot', 'I', 'What', 'Ridden', 'Yes', 'You', 'What', 'You', 'So', 'We', 'Mercea', 'Where', 'We', 'Found', 'In', 'Mercea', 'The', 'What', 'Well', 'The', 'Are', 'Not', 'They', 'What', 'A', 'It', 'It', 'It', 'A', 'Well', 'Will', 'Arthur', 'Court', 'Camelot', 'Listen', 'In', 'Please', 'Am', 'I', 'I', 'It', 'African', 'Oh'...
(remainder omitted)
sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']
print([word for word in sent if word.startswith('sh')])
output:['she', 'shells', 'shore']
print([word for word in sent if len(word) > 4])
output:['sells', 'shells', 'shore']
sum1 = sum([len(w) for w in text1])  # total number of characters in text1
print(sum1 / len(text1))
output:3.830411128023649
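The average above includes punctuation tokens, which pull the value down. As a sketch on a toy list, the code below computes the average both ways; filtering with str.isalpha() is one possible choice assumed here, and it also drops tokens such as numbers or hyphenated words.

```python
# sent1 from nltk.book, written out so the example is self-contained.
tokens = ['Call', 'me', 'Ishmael', '.']

# Average over every token, punctuation included.
avg_all = sum(len(w) for w in tokens) / len(tokens)

# Average over alphabetic tokens only.
words = [w for w in tokens if w.isalpha()]
avg_words = sum(len(w) for w in words) / len(words)

print(avg_all, avg_words)
```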
def vocab_size(text):
    # Vocabulary size is the number of distinct tokens, not the total token count.
    return len(set(text))
vocab_size(text1)
output:19317
def percent(word, text):
    # Share of tokens in text that equal word, ignoring case, as a percentage string.
    word = word.lower()
    count1 = 0
    for w in text:
        if w.lower() == word:
            count1 += 1
    return '%.2f%%' % (100 * count1 / len(text))
percent('the', text1)
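For a quick check of how such a function should behave, here is a small sketch on a toy list where the answer can be verified by hand. The name `percent_of` and the token list are hypothetical; the comparison is case-insensitive and the fraction is multiplied by 100 before formatting.

```python
def percent_of(word, tokens):
    """Return how often word occurs in tokens (ignoring case), as a percentage."""
    word = word.lower()
    hits = sum(1 for w in tokens if w.lower() == word)
    return 100 * hits / len(tokens)

# Toy data: 3 of the 10 tokens are some casing of 'the'.
tokens = ['The', 'cat', 'and', 'the', 'dog', 'saw', 'THE', 'bird', '.', '.']
share = percent_of('the', tokens)
formatted = '%.2f%%' % share

print(formatted)
```

Comparing each token w with the target word (rather than comparing the target with itself) is what keeps the count from matching every token of the same length.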
set(text3) < set(text1)
output:False
This tests whether the vocabulary of one text is a proper subset of the vocabulary of another.
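The set comparison operators are worth a closer look, since `<` means proper subset and `<=` means subset. The two small vocabularies below are made up for illustration.

```python
# Toy vocabularies (hypothetical, not taken from text1 or text3).
small = {'in', 'the', 'beginning'}
large = {'in', 'the', 'beginning', 'god'}

proper = small < large            # proper subset: contained and not equal
not_proper_self = small < small   # a set is never a proper subset of itself
subset_self = small <= small      # but it is a subset of itself
missing = large - small           # items that keep large from being a subset of small

print(proper, not_proper_self, subset_self, missing)
```

When the comparison returns False, taking the set difference shows which tokens are responsible.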