Advanced Python: Counting Element Frequencies

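Suppose we have the following list:
[1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
We want to count how many times each element occurs and end up with a result like {8: 2, 9: 2...}, that is, {element: number of occurrences...}.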

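Method 1:
First use the elements as dictionary keys to build a dictionary whose values all start at 0, then loop over the list and add 1 to the matching entry for each element: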

>>> from random import randint
>>> data = [randint(1, 10) for x in range(10)]
>>> data
[1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
>>> d = dict.fromkeys(data, 0)
>>> d
{1: 0, 4: 0, 5: 0, 7: 0, 8: 0, 9: 0, 10: 0}
>>> for x in data:
...     d[x] += 1
...
>>> d
{1: 1, 4: 1, 5: 1, 7: 1, 8: 2, 9: 2, 10: 2}
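
The steps of Method 1 can also be wrapped in a small function. The following is only a minimal sketch: the name count_with_dict is hypothetical, and a fixed list replaces the random data so the result is reproducible.

```python
def count_with_dict(items):
    """Count occurrences of each hashable item using a plain dict."""
    # Start every distinct item at 0, as in Method 1.
    # `items` is traversed twice, so pass a list or tuple, not a generator.
    counts = dict.fromkeys(items, 0)
    for item in items:
        counts[item] += 1
    return counts


data = [1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
result = count_with_dict(data)
print(sorted(result.items()))
# [(1, 1), (4, 1), (5, 1), (7, 1), (8, 2), (9, 2), (10, 2)]
```

Sorting the items before printing keeps the output the same regardless of how a given Python version orders dictionary keys.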

Method 2:
Use Counter from the collections module. Counter is a simple counter class:

>>> from collections import Counter
>>> c = Counter(data)
>>> c
Counter({1: 1, 4: 1, 5: 1, 7: 1, 8: 2, 9: 2, 10: 2})
>>> isinstance(c, dict)
True
# A Counter object is a subclass of dict, so a count can be looked up by key
>>> c[1]
1
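# most_common(n) returns the n elements with the highest counts
# (8, 9 and 10 all occur twice here; which tied elements appear depends on the Python version)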
>>> c.most_common(2)
[(8, 2), (9, 2)]
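
The same approach applies to counting words, which is where the term "word frequency" comes from. The sketch below assumes a made-up sample sentence and a simple whitespace split; real text would usually need punctuation handling as well.

```python
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "the quick brown fox jumps over the lazy dog the fox"

# Lowercase the text and split on whitespace before counting.
words = text.lower().split()
word_counts = Counter(words)

print(word_counts["the"])          # 3
print(word_counts["fox"])          # 2
print(word_counts["cat"])          # 0: a missing key counts as zero, no KeyError
print(word_counts.most_common(1))  # [('the', 3)]
```

Unlike the plain dictionary in Method 1, Counter needs no initialization step, and looking up a word that never occurred returns 0 instead of raising KeyError.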

Reference documentation:

class Counter(__builtin__.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
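
As the note in the documentation says, an entry whose count has dropped to zero stays in the counter. One way to get a copy without such entries, shown here only as a sketch, is to add an empty Counter: Counter addition keeps only positive counts in its result.

```python
from collections import Counter

c = Counter('aaabbc')
c['b'] -= 2                      # 'b' drops from 2 to 0 but stays as a key
print('b' in c)                  # True

# Adding an empty Counter builds a new Counter with only positive counts.
positive = c + Counter()
print('b' in positive)           # False
print(sorted(positive.items()))  # [('a', 3), ('c', 1)]
```

The original counter c is left unchanged; use del c['b'] to remove the entry in place.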
