PythonShowMeTheCode(0004): Counting Word Occurrences

1. Problem

Problem 0004: Given any plain-text English file, count how many times each word occurs in it.

2. Expected result

#------1.txt-----------
  There are moments in life when you miss only
 one life and one chance to do
 you want to do.is 
isn't don't word_d common

#------Output----------
do: 2
word_d: 1
want: 1
to: 2
is: 1
you: 2
isn't: 1
don't: 1
...
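  • All words are converted to lowercase before counting
  • Tokens such as isn't and word_d should each be counted as a single word

3. Implementation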

# -*- coding:utf-8 -*-
import re


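def get_word_dict(file_path=None):
    """Count how often each word occurs in the file and print the totals."""
    if file_path is None:
        print("Error: no file path given")
        return None

    word_dict = {}
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Lowercase the line, then take runs of letters, apostrophes,
            # underscores, and hyphens that end on a word boundary.
            words = re.findall(r"[a-z'_-]+\b", line.lower())
            for word in words:
                word_dict[word] = word_dict.get(word, 0) + 1
    for word, count in word_dict.items():
        # print() already ends the line, so no explicit "\n" is needed.
        print("%s: %d" % (word, count))
    return word_dict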


if __name__ == "__main__":
    get_word_dict("1.txt")
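
The counting loop can also be written with collections.Counter from the standard library. The following is only a sketch of that alternative, not part of the original solution: count_words, WORD_PATTERN, and the sample string are names made up for illustration, and the regular expression is the same one used in the listing above.

```python
import re
from collections import Counter

# Same pattern as in the listing above: runs of letters, apostrophes,
# underscores, and hyphens that end on a word boundary.
WORD_PATTERN = re.compile(r"[a-z'_-]+\b")


def count_words(text):
    """Return a Counter mapping each lowercase word in text to its count."""
    return Counter(WORD_PATTERN.findall(text.lower()))


# Made-up sample text, loosely based on the lines of 1.txt.
sample = "you want to do.is\nisn't don't word_d common to do you"
counts = count_words(sample)

# most_common(n) returns the n most frequent words with their counts.
for word, count in counts.most_common(3):
    print("%s: %d" % (word, count))
```

Counter removes the explicit "is the key already there" check, and most_common() gives a sorted view of the result if the output order matters.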

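4. Problems solved

I. Words such as isn't were not recognized
Fix: add a \b to the regular expression to act as a word boundary. In the pattern used above, the apostrophe is part of the character class and the trailing \b makes each match end at a word boundary.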
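
To see what the trailing \b changes, here is a small comparison of the pattern with and without it. The sample line is made up for this illustration. The sketch only shows the behavior of these two patterns and makes no claim about what the regular expression looked like before the fix.

```python
import re

# Made-up sample line with trailing punctuation and a stray hyphen.
line = "isn't 'quoted' done- - word_d"

# With the trailing \b every match has to end at a word boundary, so a
# trailing apostrophe or hyphen is dropped and a lone "-" never matches.
with_boundary = re.findall(r"[a-z'_-]+\b", line)

# Without it, the trailing punctuation stays attached to the word.
without_boundary = re.findall(r"[a-z'_-]+", line)

print(with_boundary)
print(without_boundary)
```

Note that the \b only trims the end of a match: a leading apostrophe, as in 'quoted, is still kept as part of the word.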

II. Encoding errors when reading the file
Fix: pass the encoding parameter to open(), for example encoding="utf-8" as in the code above.
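
A minimal sketch of why the explicit encoding matters: when no encoding is given, open() falls back to a platform-dependent default, which is not always UTF-8. The file name sample.txt and the sample text are assumptions made for this example. It writes a UTF-8 file into a temporary directory, reads it back with the matching encoding, and then shows that a mismatched encoding (ASCII here) raises UnicodeDecodeError.

```python
import os
import tempfile

# Hypothetical sample text containing a non-ASCII character
# (a curly apostrophe, U+2019).
text = "isn\u2019t it\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding returns the original text.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

    # Reading with a mismatched encoding fails on the non-ASCII bytes.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

print(decoded == text, ascii_failed)
```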
