Counting Word Occurrences

Count how many times each word occurs in the English text of THE TRAGEDY OF ROMEO AND JULIET.

Download link for the novel's TXT file:
Link: https://pan.baidu.com/s/1u2c7O-617MboXSwBHnoOcA  Extraction code: vX47

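Evaluation criteria:

  • The source file opens correctly and runs

  • The text is correctly split into a word list, e.g. ['THE', 'TRAGEDY', 'OF', 'ROMEO', 'AND', 'JULIET', 'by', 'William', 'Shakespeare']

  • The word-frequency dictionary is built correctly, e.g. {'straight;': 1, 'noise.': 1}

  • The results are printed in descending order of frequency, e.g.
    (601, 'the'),
    (549, 'I'),
    (468, 'and'),
    (451, 'to')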
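
The three data-related criteria above (word list, frequency dictionary, descending output) can be sketched on a tiny string before touching the novel. This is only an illustration: the sample sentence and the helper names split_words, count_words and by_frequency are made up for this example and are not part of the assignment.

```python
# Hypothetical sample text; the real input would be the novel's contents.
sample = "the cat and the dog and the bird"


def split_words(text):
    """Split on any run of whitespace; punctuation stays attached to words."""
    return text.split()


def count_words(words):
    """Build a {word: count} dictionary from a list of words."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq


def by_frequency(freq):
    """Return (count, word) tuples, highest count first."""
    return sorted(((n, w) for w, n in freq.items()), reverse=True)


words = split_words(sample)
freq = count_words(words)
ranked = by_frequency(freq)
for pair in ranked[:2]:
    print(pair)  # (3, 'the') then (2, 'and')
```

Putting the count first in each tuple matches the (601, 'the') shape shown in the criteria and lets sorted() order by count without a key function; ties are then broken by the word itself.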

import re


def wordCount():
    """Count how often each word occurs in the text and print the result."""
    with open("罗密欧与朱丽叶(英文版)莎士比亚.txt", "r") as f:
        words = []
        for line in f:
            if line.strip():  # skip blank lines, process the rest
                # Remove English punctuation; this character class is incomplete
                # (e.g. ; ' " ( ) are left in place) and still needs improvement
                line = re.sub(r"[,.!?:[\]-]", "", line)
                # split() with no argument drops the trailing \n and never
                # yields empty strings for runs of spaces
                words.extend(line.split())
    wordMap = {}
    for word in words:
        wordMap[word] = wordMap.get(word, 0) + 1
    # Sort by frequency: each item is a tuple such as ("the", 101), and item[1] is the count
    sortedWords = sorted(wordMap.items(), key=lambda item: item[1], reverse=True)
    for wordcount in sortedWords:
        print(wordcount)


if __name__ == '__main__':
    wordCount()

Output:

('the', 603)
('I', 573)
('and', 483)
('to', 458)
('a', 397)
('of', 371)
('my', 311)
...
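
The comment in wordCount notes that the punctuation removal still needs improvement. One possible variant, sketched here under assumptions, extracts words with re.findall instead of deleting a fixed set of characters, and counts them with collections.Counter. The function name count_words, the ignore_case flag and the sample sentence are hypothetical, and this version would give different counts from the output shown above, because it also drops characters such as ; and " and can optionally fold case.

```python
import re
from collections import Counter


def count_words(text, ignore_case=False):
    """Count words, treating letters (with inner apostrophes) as words."""
    if ignore_case:
        text = text.lower()
    # A word is a run of letters, optionally joined by apostrophes
    # (so "o'er" stays whole); everything else acts as a separator.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return Counter(words)


# Hypothetical sample; for the novel, pass the file's contents instead.
sample = 'Go straight; make no noise. "Go," he said.'
counts = count_words(sample)
for word, n in counts.most_common(3):
    print(word, n)
```

Counter.most_common(n) returns (word, count) pairs in descending order of count, so no separate sorting step is needed. A leading apostrophe, as in 'tis, is dropped by this pattern, which may or may not be what the exercise wants.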
