Python: Word Frequency Counting for Chinese and English Text (Notes)

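As a web-scraping engineer, it's worth knowing the basics of word frequency counting. When processing text for public-opinion monitoring, you often need to count how many times each word appears, or find the top 10 words in a text. This is a small bit of groundwork for simple data analysis later on. Let's get started.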

import re

text = '''Which year will be the turning point for the world's most populous country in which its population experiences negative growth? Chinese demographers differ in their answers.
Experts with the Chinese Academy of Social Sciences estimate the turning point could arrive around 2028 after the population peaks to 1.44 billion, says the Green Book of Population and Labor co-released by the Chinese Academy of Social Sciences and Social Sciences Academic Press on Thursday. 
However, Huang Wenzheng, a demographics expert, told the Global Times on Friday that this estimate is too optimistic. He estimated the year 2024 or 2025 will be the threshold for population negative growth.
According to Huang, the prediction in the green book is based on the fertility rate that could remain at 1.6, which is hard to realize. 
In 2016, China's fertility rate was 1.7, but in 2017, the number of births was less, according to media reports. 
The births in 2016 and 2017 were high compared to years before, said Huang. "This was due to the introduction of two-child policy for all families [in 2016] which encouraged those who had the willingness to have a second child before the policy. So they hastened to give birth in these two years."
"But the overall trend is that people are no longer willing to have more children."
Huang elaborated that people's concept of raising children has changed. Urban people care about quality, rather than quantity. "They want to provide the best resources they have to bring up their children. This won't be possible if they have several," he said. 
With rapid urbanization, many people from rural areas come to work in the city and also follow this practice. 
"Previously people thought that having two or three children is normal. But now they are accustomed to having only one child. They find this normal," Huang said.
Yi Fuxian, a research fellow at the University of Wisconsin-Madison, holds a more pessimistic view. He told the Global Times that 2018 has seen negative growth based on his own research and analysis. 
Both Yi and Huang believe that China will abandon the two-child policy this year, putting an end to family planning, in order to stimulate births. They also warned that the sharp decline in population could have negative influence on the economy.
China has introduced a series of new measures to stimulate fertility. This year, the country's tax cuts also favor families with children. Families are able to deduct 12,000 yuan ($1,748) a year from their taxable income for children's education.
Huang said this is still far from enough. He suggested the government provide free upbringing of children aged 0 to 3 and make kindergarten education compulsory to further ease the burden of educating children. 


'''
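# Word frequency count
def word_count(string):
    # Reject non-string input early; raising a plain str is invalid in Python 3
    if not isinstance(string, str):
        raise TypeError('Please enter a string')
    new_text = string.strip()
    # Split on runs of whitespace (raw string so \s reaches the regex engine unchanged).
    # Punctuation stays attached to words, e.g. "growth?" is counted separately from "growth".
    str_list = re.split(r'\s+', new_text)
    word_dict = {}
    for str_word in str_list:
        if str_word in word_dict:  # if the key already exists, add 1 to its value
            word_dict[str_word] = word_dict[str_word] + 1
        else:
            word_dict[str_word] = 1
    return word_dict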


word = word_count(string=text)
# print(word)

# Sort by count in descending order and take the top 10
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[:10]
print(word_list)

image.png

The screenshot above shows the top 10 words in the text and how many times each one occurs.
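The dictionary loop can also be written with the standard library's collections.Counter. What follows is only a minimal sketch, not the code used for the screenshot: the function name top_words, the regular expression, and the sample sentence are all assumptions made for illustration. It lowercases the text and drops surrounding punctuation, so "The" and "the" are counted as the same word.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    # Lowercase first, then keep runs of letters/digits, allowing an
    # internal apostrophe or hyphen ("world's", "two-child").
    words = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    # most_common(n) sorts by count, highest first, and returns n pairs.
    return Counter(words).most_common(n)


# Hypothetical sample sentence, not taken from the article text.
sample = "The cat saw the dog. The dog, however, ignored the cat!"
print(top_words(sample, 3))
```

Calling top_words(text) on the news article should give a cleaner top 10 than plain whitespace splitting, because tokens such as "growth?" and "growth." are merged into one entry.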

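That covers English word frequency. Now let's see how to count words in Chinese text.
Chinese is written without spaces between words, so we first need a third-party word-segmentation library, jieba.
Install it: pip install jieba
Then segment the text: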
import jieba
content_text ='''然而,我们并没有时间去探索数据集中的数千个案例。我们应该做的则是在测试案例的典型范例上继续运行LIME,看看哪些词的占有率仍能位居前列。通过这种方法,我们可以获得像以前模型那样的单词的重要性分数,并验证模型的预测'''

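def get_(string):
    # cut_all=True is jieba's full mode: it returns every word it can find, so tokens may overlap
    tokens = list(jieba.cut(string, cut_all=True))
    counts = {}  # avoid naming variables dict or str, which would shadow the built-ins
    for token in tokens:
        if token != '' and token != '\n':  # skip empty strings and newlines
            if token in counts:
                counts[token] = counts[token] + 1
            else:
                counts[token] = 1
    return counts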

word = get_(string=content_text)
# Take the top 10 words
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[:10]
print(word_list)
image.png

This is a screenshot of the Chinese word frequency result.
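Segmenters usually return punctuation marks as tokens too, and these can crowd real words out of the top 10. The sketch below shows one possible way to filter them out before counting. The helper name count_tokens and the hand-segmented token list are hypothetical. The list stands in for jieba output so the example runs without jieba installed.

```python
from collections import Counter


def count_tokens(tokens, n=10):
    """Count pre-segmented tokens, skipping whitespace and punctuation-only ones."""
    # Keep a token only if it contains at least one alphanumeric character.
    # str.isalnum() is True for Chinese characters as well as ASCII letters and digits.
    kept = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return Counter(kept).most_common(n)


# Hypothetical hand-segmented input standing in for a segmenter's output.
tokens = ["我们", "可以", ",", "我们", "验证", "模型", "。", "\n", "模型", "我们"]
print(count_tokens(tokens, 2))
```

With jieba installed, the same helper can be fed real segmentation results, for example count_tokens(jieba.lcut(content_text)). jieba.lcut returns a list and defaults to precise mode, which avoids the overlapping tokens that full mode produces.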

That's it for today's notes. If you're interested, feel free to send me a private message.
