Advanced Python: Counting Word Frequencies

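[1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
Given a list like the one above, we want to count how many times each element occurs and end up with a result like {8: 2, 9: 2...}, that is, {element: number of occurrences...}.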

  • Method 1:
>>> from random import randint
>>> data = [randint(1,10) for x in xrange(10)]  # Python 2; use range() in Python 3
>>> data
[1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
>>> d = dict.fromkeys(data, 0)
>>> d
{1: 0, 4: 0, 5: 0, 7: 0, 8: 0, 9: 0, 10: 0}
>>> for x in data:
...     d[x] += 1
...
>>> d
{1: 1, 4: 1, 5: 1, 7: 1, 8: 2, 9: 2, 10: 2}
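The same loop can also be written without pre-seeding the dictionary through dict.fromkeys. The following is a minimal sketch, assuming Python 3; the helper name count_items is hypothetical and not part of the original session.

```python
def count_items(items):
    """Count occurrences of each hashable item using a plain dict."""
    counts = {}
    for item in items:
        # dict.get returns the default 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    return counts


# Same sample list as in the session above
data = [1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
result = count_items(data)
print(result)
```

In Python 3.7 and later the keys print in first-seen order rather than the order shown in the Python 2 session, but the counts are the same.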
  • Method 2:
    Use Counter from the collections module. Counter is a simple counter:
>>> from collections import Counter
>>> c = Counter(data)
>>> c
Counter({1: 1, 4: 1, 5: 1, 7: 1, 8: 2, 9: 2, 10: 2})
>>> isinstance(c, dict)
True
# Counter is a subclass of dict, so a count can be looked up by key
>>> c[1]
1
# most_common(n) returns the n most frequent elements with their counts
>>> c.most_common(2)
[(8, 2), (9, 2)]
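The session above counts integers, but the same Counter calls apply to words in a text. The following is a minimal sketch, assuming Python 3; the function name word_frequencies, the sample sentence, and the simple regular-expression tokenizer are illustrative assumptions rather than anything from the original.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its count."""
    # \w+ is a crude tokenizer: runs of letters, digits, and underscores
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


sample = "the quick brown fox jumps over the lazy dog the fox"
freq = word_frequencies(sample)
print(freq.most_common(2))   # [('the', 3), ('fox', 2)]
print(freq["cat"])           # a missing key reads as 0 instead of raising KeyError
```

A real text would likely need a more careful tokenizer (punctuation, apostrophes, non-English scripts); the counting step stays the same.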


class Counter(__builtin__.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
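The note at the end of the help text means a zero count can linger in results such as most_common(). One way to drop such entries is sketched below, assuming Python 3.3 or later, where Counter supports the unary plus operator; the variable names positive and filtered are illustrative.

```python
from collections import Counter

c = Counter('aaabbc')
c['b'] -= 2                      # 'b' stays in the counter with a count of 0

# Unary plus builds a new Counter that keeps only the positive counts
positive = +c

# Equivalent explicit filter; c itself is left unchanged by both approaches
filtered = Counter({k: v for k, v in c.items() if v > 0})

print(sorted(positive.items()))  # [('a', 3), ('c', 1)]
```

On Python 2.7, where unary plus is not available, adding an empty counter (c + Counter()) is a commonly used alternative with the same effect.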

