A note before we start: my machine-learning teacher once said that people who really do machine learning spend most of their time on data analysis and processing, while the algorithmic theory is, well, "too simple, sometimes naive!" Take that with a grain of salt, but the importance of data-analysis skills hardly needs arguing. Why use Python for data analysis? Well, there is a whole book called Python for Data Analysis. I also recommend the Python distribution Anaconda: it manages packages and environments through the conda tool and already bundles Python with the usual supporting tools. Very convenient; give it a try if you haven't.
Here we use one of Obama's inaugural addresses as the sample text.
1. Method 1
speech_text = '''My fellow citizens:... (full speech text goes here) ...
'''
With the text stored in speech_text, the first step is tokenization. English words are separated mainly by spaces, so split() breaks the text on whitespace, and lower() converts all letters to lowercase first.
speech=speech_text.lower().split()
speech
['my',
'fellow',
'citizens:',
'i',
'stand',
'here',
'today',
'humbled',
'by',
'the',
...
...]
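Notice that tokens like 'citizens:' keep their trailing punctuation, because split() only breaks on whitespace. As a side note, a cleaner tokenizer can be sketched with the standard re module (the regex and sample string below are illustrative choices, not part of the original tutorial):

```python
import re

# A short sample in place of the full speech text.
sample = "My fellow citizens: I stand here today, humbled by the task before us."

# \w+(?:'\w+)? keeps internal apostrophes (e.g. "don't") but drops
# the punctuation that split() would leave attached to words.
tokens = re.findall(r"\w+(?:'\w+)?", sample.lower())
print(tokens[:3])  # ['my', 'fellow', 'citizens']
```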
Count how many times each word occurs in speech:
dic = {}
for word in speech:
    if word not in dic:
        dic[word] = 1
    else:
        dic[word] += 1
dic
{'my': 3,
'fellow': 1,
'citizens:': 1,
'i': 3,
'stand': 2,
'here': 1,
'today': 4,
'humbled': 1,
'by': 11,
'the': 157,
......
}
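The if/else counting loop above can also be written with dict.get(), which defaults to zero for unseen words and avoids the explicit membership test. A minimal self-contained sketch (the word list here is made up for illustration):

```python
# Same counting logic as above, using get(word, 0) to default to zero.
words = ['my', 'fellow', 'citizens', 'my', 'fellow', 'my']
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'my': 3, 'fellow': 2, 'citizens': 1}
```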
The items() method returns an iterable view of the dictionary's (key, value) tuples:
dic.items()
dict_items([('my', 3), ('fellow', 1), ('citizens:', 1), ('i', 3), ('stand', 2), ('here', 1), ('today', 4), ......])
Sort the (word, count) pairs by count:
import operator
sort=sorted(dic.items(),key=operator.itemgetter(1),reverse=True)
sort
[('the', 157),
('and', 136),
('of', 97),
('to', 89),
('our', 82),
('we', 77),
('a', 60),
('that', 59),
('is', 49),
('-', 31),
('for', 30),
......
]
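operator.itemgetter(1) extracts the second element of each pair (the count), so the sort ranks words by frequency. An equivalent version with a plain lambda, shown on a small hypothetical dictionary:

```python
word_counts = {'the': 157, 'us': 19, 'and': 136}
# kv[1] is the count, so this sorts pairs by frequency, highest first.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('the', 157), ('and', 136), ('us', 19)]
```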
The top entries 'the', 'and', 'of' are common function words (stop words); with the NLP toolkit NLTK we can filter them out.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
for k, v in sort:
    if k not in stop_words:
        print(k, v)
- 31
us 19
every 10
nation 10
new 10
cannot 10
us, 9
common 9
must 8
.......
whether 1
Here the top token is '-', which is clearly meaningless; append() adds '-' to stop_words:
stop_words.append('-')
for k, v in sort:
    if k not in stop_words:
        print(k, v)
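Since stop_words is a list, every `k not in stop_words` check is a linear scan; converting it to a set makes each lookup constant-time, which matters for long texts. A minimal sketch with made-up data:

```python
stop_words = {'the', 'and', 'of', '-'}  # a set gives O(1) membership tests
pairs = [('the', 157), ('-', 31), ('us', 19), ('nation', 10)]
filtered = [(k, v) for k, v in pairs if k not in stop_words]
print(filtered)  # [('us', 19), ('nation', 10)]
```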
If you see an error like the following:
LookupError:
Resource u'corpora/stopwords' not found. Please use the
NLTK Downloader to obtain the resource: >>> nltk.download()
Searched in:
- 'C:\Users\Tree/nltk_data'
- 'C:\nltk_data'
- 'D:\nltk_data'
- 'E:\nltk_data'
- 'F:\Program Files (x86)\python\nltk_data'
- 'F:\Program Files (x86)\python\lib\nltk_data'
- 'C:\Users\Tree\AppData\Roaming\nltk_data'
just follow the hint and run the commands below; for other errors see https://www.douban.com/note/534906136/
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
2. Method 2
from collections import Counter
c = Counter(speech)
c
Counter({'my': 3,
'fellow': 1,
'citizens:': 1,
'i': 3,
'stand': 2,
'here': 1,
'today': 4,
'humbled': 1,
'by': 11,
'the': 157,
'task': 1,
'before': 5,
'us,': 9,
'grateful': 1,
......
})
c.most_common(10)
[('the', 157),
('and', 136),
('of', 97),
('to', 89),
('our', 82),
('we', 77),
('a', 60),
('that', 59),
('is', 49),
('-', 31)]
Again, remove the non-content words:
stop_words.append('-')
for s in stop_words:
    del c[s]
c.most_common(10)
[('us', 19),
('every', 10),
('nation', 10),
('new', 10),
('cannot', 10),
('us,', 9),
('common', 9),
('must', 8),
('whether', 8),
('america', 6)]
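Note that 'us' (19) and 'us,' (9) are still counted separately because punctuation was never stripped. A sketch of merging such tokens with str.strip and string.punctuation (the counts below are made up to mirror the output above):

```python
from collections import Counter
import string

raw = Counter({'us': 19, 'us,': 9, 'nation': 10})
merged = Counter()
for word, count in raw.items():
    # strip(string.punctuation) removes leading/trailing punctuation only,
    # so 'us' and 'us,' collapse into a single key.
    merged[word.strip(string.punctuation)] += count
print(merged.most_common(2))  # [('us', 28), ('nation', 10)]
```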
Reference: the hands-on course "Python数据科学挖掘精华实战" (Python data-science mining in practice).