Learning Python: Day 3

First, I need to fix yesterday's failed wordcloud installation.


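The failed installation prints the details of the error. It turned out I had picked the wrong build: I needed the 32-bit one but had installed the 64-bit one by mistake.
After uninstalling and reinstalling, wordcloud installed successfully. Yeah!!!
Now to continue the word cloud drawing that I did not finish yesterday.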

# Draw a word cloud
from wordcloud import WordCloud
text = 'He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish. In the first forty days a boy had been with him. But after forty days without a fish the boy’s parents had told him that the old man was now definitely and finally salao, which is the worst form of unlucky, and the boy had gone at their orders in another boat which caught three good fish the first week. It made the boy sad to see the old man come in each day with his skiff empty and he always went down to help him carry either the coiled lines or the gaff and harpoon and the sail that was furled around the mast. The sail was patched with flour sacks and, furled, it looked like the flag of permanent defeat.'
# generate is an instance method, so create a WordCloud object first
wc = WordCloud().generate(text)
wc.to_file('老人与海.png')

Running the program generates the image 老人与海.png.


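Word cloud for the novel Romance of the Three Kingdoms

Chinese text needs a font that contains Chinese glyphs, so set font_path='msyh.ttc' (Microsoft YaHei) here.
Otherwise the words are rendered as garbled characters.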

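import imageio
import jieba
from wordcloud import WordCloud

mask = imageio.imread('./china.jpg')  # mask image that gives the cloud its shape
with open('./novel/threekingdom.txt', 'r', encoding='utf-8') as f:
    words = f.read()
    # print(len(words))  # number of characters: about 550,000
    words_list = jieba.lcut(words)
    # print(len(words_list))  # number of tokens after segmentation: about 350,000
    print(words_list)
    # Join words_list into one space-separated string
    novel_words = " ".join(words_list)
    # print(novel_words)
    # Configure WordCloud() through its keyword arguments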
    wc = WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        width=800,
        height=600,
        mask=mask
    ).generate(novel_words)
    wc.to_file('三国词云.png')

Running the program generates the image 三国词云.png.



Now on to today's material.

Top 10 characters in Romance of the Three Kingdoms

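    1. Read the novel's text
    2. Segment it into words
    3. Filter the words: remove irrelevant words and merge duplicate segmentations
    4. Sort
    5. Draw conclusions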
import jieba
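# 1. Read the novel's text
with open('./novel/threekingdom.txt', 'r', encoding='utf-8') as f:
    words = f.read()
    counts = {}  # maps word to count, e.g. {'曹操': 234, '回寨': 56}
    # 2. Segment the text into words
    words_list = jieba.lcut(words)
    for word in words_list:
        if len(word) <= 1:
            continue
        else:
            # Update the count stored in the dictionary:
            # counts[word] = the value currently stored under that key + 1
            # counts[word] = counts[word] + 1 raises KeyError when the key is missing
            # dict.get(k) returns None for a missing key; dict.get(k, 0) returns 0 instead
            counts[word] = counts.get(word, 0) + 1
    print(counts)
    # 3. Filter the words (irrelevant words, duplicate segmentations): added in the next step
    # 4. Sort the (word, count) pairs: [(), ()]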
    items = list(counts.items())
    print('排序前的列表', items)
    def sort_by_count(x):
        return x[1]
    items.sort(key=sort_by_count, reverse=True)
    for i in range(20):
        # sequence unpacking
        role, count = items[i]
        print(role, count)
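
The counting loop above can also be written with `collections.Counter` from the standard library. A minimal sketch, assuming a small hand-written token list (`tokens` is a made-up stand-in for the output of `jieba.lcut`, not real novel data):

```python
from collections import Counter

# Hypothetical stand-in for jieba.lcut(words); not real novel data.
tokens = ['曹操', '曰', '孔明', '曹操', '之', '孔明', '曹操', '玄德']

# Manual counting with dict.get, as in the loop above.
counts = {}
for word in tokens:
    if len(word) <= 1:
        continue
    counts[word] = counts.get(word, 0) + 1

# The same count with Counter, skipping single-character tokens.
counter = Counter(word for word in tokens if len(word) > 1)

print(counts)
print(counter.most_common(2))  # the two most frequent words, already sorted
```

`most_common(n)` returns the pairs sorted by count, so it would also cover the sorting step.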

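Next, exclude the tokens that are not character names, merge the different names of the same character, and then rank the top 10.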

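# Tokens that are not character names, plus the aliases that are merged below
exclude = {"将军", "却说", "丞相", "二人", "不可", "荆州", "不能", "如此", "商议",
           "如何", "主公", "军士", "军马", "左右", "次日", "引兵", "大喜", "天下",
           "东吴", "于是", "今日", "不敢", "魏兵", "陛下", "都督", "人马", "不知",
           "孔明曰", "玄德曰", "刘备", "云长"}

# Merge the different names of the same character into one count
counts['孔明'] = counts['孔明'] + counts['孔明曰']
counts['玄德'] = counts['玄德'] + counts['玄德曰'] + counts['刘备']
counts['关公'] = counts['关公'] + counts['云长']
# Remove the excluded words and the merged aliases
for word in exclude:
    del counts[word]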
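
Both `counts[key]` and `del counts[key]` raise `KeyError` when the key is absent, which can happen if the segmenter never produces one of the listed words. A defensive sketch using `dict.get` and `dict.pop` with defaults; the `counts` values and the `aliases` mapping here are made up for illustration:

```python
# Made-up counts; note that '玄德曰' and '却说' are deliberately missing.
counts = {'孔明': 10, '孔明曰': 5, '玄德': 7, '刘备': 3, '将军': 20}
aliases = {'孔明曰': '孔明', '玄德曰': '玄德', '刘备': '玄德'}  # alias -> main name
exclude = {'将军', '却说'}

for alias, name in aliases.items():
    # pop returns 0 when the alias never occurred, so nothing raises
    counts[name] = counts.get(name, 0) + counts.pop(alias, 0)

for word in exclude:
    counts.pop(word, None)  # no error if the word is absent

print(counts)
```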

Finally, draw the word cloud. The complete code:

import jieba
from wordcloud import WordCloud
import imageio
# 1. Read the novel's text
with open('./novel/threekingdom.txt', 'r', encoding='utf-8') as f:
    words = f.read()
    counts = {}  # maps word to count, e.g. {'曹操': 234, '回寨': 56}
    exclude = {"将军", "却说", "丞相", "二人", "不可", "荆州", "不能", "如此", "商议",
               "如何", "主公", "军士", "军马", "左右", "次日", "引兵", "大喜", "天下",
               "东吴", "于是", "今日", "不敢", "魏兵", "陛下", "都督", "人马", "不知",
               "孔明曰", "玄德曰", "刘备", "云长"}

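    # 2. Segment the text into words
    words_list = jieba.lcut(words)
    for word in words_list:
        if len(word) <= 1:
            continue
        else:
            # Update the count stored in the dictionary:
            # counts[word] = the value currently stored under that key + 1
            # counts[word] = counts[word] + 1 raises KeyError when the key is missing
            # dict.get(k) returns None for a missing key; dict.get(k, 0) returns 0 instead
            counts[word] = counts.get(word, 0) + 1
    print(counts)
    # 3. Filter the words: merge aliases, then delete irrelevant words and the merged aliases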
    counts['孔明'] = counts['孔明'] + counts['孔明曰']
    counts['玄德'] = counts['玄德'] + counts['玄德曰'] + counts['刘备']
    counts['关公'] = counts['关公'] + counts['云长']
    for word in exclude:
        del counts[word]
    # 4. Sort the (word, count) pairs: [(), ()]
    items = list(counts.items())
    print('排序前的列表', items)
    def sort_by_count(x):
        return x[1]
    items.sort(key=sort_by_count, reverse=True)

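    li = []  # each name repeated count times, e.g. ['孔明', '孔明', '孔明', '曹操', ...]
    for i in range(10):
        # sequence unpacking
        role, count = items[i]
        print(role, count)
        # the name _ tells readers that the loop variable is not used in the body
        for _ in range(count):
            li.append(role)
    # 5. Draw conclusions: render the top 10 as a word cloud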
    mask = imageio.imread('./china.jpg')
    text = ' '.join(li)
    WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        width=800,
        height=600,
        mask=mask,
        # count single words only; do not pair adjacent words into two-word phrases (bigrams)
        collocations=False
    ).generate(text).to_file('top10.png')
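
Repeating each name `count` times only serves to carry the frequencies into `generate`. The same information fits in a plain dict, and `WordCloud` also provides `generate_from_frequencies`, which accepts such a dict directly. A small sketch with made-up counts showing that the two forms hold the same data (the `WordCloud` call itself is left out so the snippet runs without the library):

```python
from collections import Counter

# Made-up (name, count) pairs standing in for the sorted `items` list.
items = [('孔明', 4), ('曹操', 3), ('玄德', 2), ('关公', 1)]

top = dict(items[:3])  # the first 3 pairs as a frequency dict

# The repeated-word text, built the same way as in the code above.
li = []
for role, count in items[:3]:
    li.extend([role] * count)
text = ' '.join(li)

# Counting the words of the text recovers the same frequencies.
print(Counter(text.split()) == top)
```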

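The collocations=False argument must not be left out. By default (collocations=True) WordCloud also counts two-word phrases (bigrams), and because the text here consists of each name repeated many times, the image can then show doubled entries such as "孔明 孔明" instead of single names.

Setting it to False turns off this pairing of adjacent words.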
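
To see why repeated names pair up, it helps to look at what a bigram is. The following is only an illustration of adjacent-pair counting in plain Python, not WordCloud's actual implementation; the word list is made up:

```python
from collections import Counter

# Made-up text in which every name is repeated, as in the top 10 word list.
words = ['孔明', '孔明', '孔明', '曹操', '曹操']

# Pair every word with its right-hand neighbour.
bigrams = Counter(zip(words, words[1:]))

print(bigrams)
```

In a text built by repetition, pairs of the same name are the most frequent bigrams, which is presumably why they end up in the image when collocations is left on.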

Anonymous functions (lambda)

Syntax: lambda x1, x2, ..., xn: expression. A lambda can take any number of parameters, but its body is a single expression.
eg1: add two numbers

sum_number = lambda x1, x2: x1 + x2
print(sum_number(2, 3))

eg2: sort a list of tuples by the second item of each tuple, in descending order

name_info_list = [
    ('张三', 4500),
    ('李四', 9900),
    ('王五', 2000),
    ('赵六', 5500),
]
name_info_list.sort(key=lambda x: x[1], reverse=True)
print(name_info_list)

eg3: sort a list of dictionaries by the value stored under "age", in descending order

stu_info = [
    {"name": 'zhangsan', "age": 18},
    {"name": 'lisi', "age": 30},
    {"name": 'wangwu', "age": 99},
    {"name": 'tiaqi', "age": 3},

]
stu_info.sort(key=lambda x: x['age'], reverse=True)
print(stu_info)

So the sorting step in the Three Kingdoms top 10 character analysis can be simplified as well. Before:

    def sort_by_count(x):
        return x[1]
    items.sort(key=sort_by_count, reverse=True)

After:

    items.sort(key=lambda x: x[1], reverse=True)
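
Another option besides a lambda is `operator.itemgetter` from the standard library. A short sketch with made-up counts:

```python
from operator import itemgetter

items = [('曹操', 3), ('孔明', 5), ('玄德', 4)]

# sorted returns a new list; list.sort would reorder items in place.
by_lambda = sorted(items, key=lambda x: x[1], reverse=True)
by_getter = sorted(items, key=itemgetter(1), reverse=True)

print(by_getter)
```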

List comprehensions

Until now we have built lists with an ordinary for loop

li = []
for i in range(10):
    li.append(i)
print(li)

With a list comprehension, a single statement achieves the same result

# [expression for variable in iterable], optionally followed by an if condition
print([i for i in range(10)])
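
The expression part does not have to be the loop variable itself. A small sketch that builds the squares of 0 to 9 both ways:

```python
# Ordinary loop
squares_loop = []
for i in range(10):
    squares_loop.append(i * i)

# Equivalent list comprehension
squares = [i * i for i in range(10)]

print(squares)
```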

List comprehensions with a condition

For example, to pick out all the even numbers from 0 to 9, the usual approach is

li = []
for i in range(10):
    if i%2 == 0:
        li.append(i)
print(li)

With a list comprehension this becomes

print([i for i in range(10) if i%2 == 0])

eg: select the numbers greater than 0 from a list

from random import randint
num_list = [randint(-10, 10) for _ in range(10)]
print(num_list)
print([i for i in num_list if i > 0])

Running the program prints the random list, followed by its positive elements.
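
Because `randint` gives different numbers on every run, a fixed seed makes such an experiment repeatable. A sketch, assuming a seed of 0 chosen arbitrarily, that also collects the numbers the filter rejects:

```python
import random

random.seed(0)  # arbitrary seed so that every run produces the same list
num_list = [random.randint(-10, 10) for _ in range(10)]

positive = [i for i in num_list if i > 0]
non_positive = [i for i in num_list if i <= 0]

print(num_list)
print(positive)
```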


Dictionary comprehensions

eg1: generate grades for 100 students

from random import randint
stu_grades = {'student{}'.format(i): randint(50, 100) for i in range(1, 101)}
print(stu_grades)

Running it prints the generated dictionary.



eg2: select all students who scored above 60

from random import randint
stu_grades = {'student{}'.format(i): randint(50, 100) for i in range(1, 101)}
print({k: v for k, v in stu_grades.items() if v > 60})

Running it prints only the students who scored above 60.
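
A dictionary comprehension can also transform the values instead of filtering them. A sketch with made-up grades (fixed rather than random, so the result is predictable):

```python
stu_grades = {'student1': 58, 'student2': 91, 'student3': 60, 'student4': 77}

# Keep only grades above 60, as in eg2.
passed = {k: v for k, v in stu_grades.items() if v > 60}

# Transform every value into a pass/fail label.
status = {k: ('pass' if v > 60 else 'fail') for k, v in stu_grades.items()}

# max with key=stu_grades.get returns the key that has the largest value.
best = max(stu_grades, key=stu_grades.get)

print(passed)
print(status)
print(best)
```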


Matplotlib

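Matplotlib is a 2D plotting library for Python that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms.
With Matplotlib, a few lines of code are enough to generate plots, histograms, power spectra, bar charts, error charts, scatter plots, and more.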

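Plot the sine curve over [0, 2π] using 100 points

Note: do not give your Python file the same name as a library it imports (for example matplotlib.py). Python would then import your own file instead of the library, and the script fails.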
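
When an import fails for this reason, printing where Python found the module shows whether it picked up a local file. A sketch using the standard-library `random` module so that it runs anywhere; for the case in this article the name to check would be `matplotlib`:

```python
import importlib.util

# find_spec reports where the import system locates a module.
spec = importlib.util.find_spec('random')

# origin is the location that `import random` would load from
print(spec.name, spec.origin)
```

If the printed path points into your own project folder rather than the Python installation, renaming the local file should resolve the clash.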

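from matplotlib import pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ['SimHei']  # font with Chinese glyphs for the labels
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign rendering correctly with this font

# Plot the sine curve over [0, 2π] using 100 points
# np.linspace returns evenly spaced values over a closed interval (both endpoints included)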
x = np.linspace(0, 2*np.pi, num=100)
print(x)
y = np.sin(x)
# plt.plot(x, y)
# plt.show()

# Sine and cosine in the same coordinate system
cosy = np.cos(x)
plt.plot(x, y, color='g', linestyle='--', label='sin(x)')
plt.plot(x, cosy, color='r', label='cos(x)')
plt.xlabel('时间(s)')
plt.ylabel('电压(v)')
plt.title('欢迎来到python世界')
# Legend
plt.legend()
plt.show()

Running it shows both curves in one figure.
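
The spacing that `np.linspace` uses with its default `endpoint=True` can be reproduced in plain Python. This is a sketch of the formula, not NumPy's implementation; `linspace` here is a local helper written for illustration:

```python
import math

def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, both included (num >= 2)."""
    step = (stop - start) / (num - 1)  # num points leave num - 1 gaps
    return [start + i * step for i in range(num)]

x = linspace(0, 2 * math.pi, 100)
y = [math.sin(v) for v in x]

print(len(x), x[0], x[-1])
```

With 100 points there are 99 gaps, so the step is 2π/99 rather than 2π/100.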


Bar chart

from matplotlib import pyplot as plt
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
import numpy as np
import string
from random import randint
# print(string.ascii_uppercase[0:6])
# ['A', 'B', 'C'...]
x = ['口红{}'.format(x) for x in string.ascii_uppercase[0:5]]

y = [randint(200, 500) for _ in range(5)]
print(x)
print(y)
plt.xlabel('口红品牌')
plt.ylabel('价格(元)')
plt.bar(x, y)
plt.show()

Pie chart

from matplotlib import pyplot as plt
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
import numpy as np
import string
from random import randint
counts = [randint(3500, 9000) for _ in range(9)]
labels = ['员工{}'.format(x) for x in string.ascii_lowercase[:9]]
# Offset of each wedge from the center, as a fraction of the radius
explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0]
colors = ['red', 'purple', 'blue', 'yellow', 'gray', 'green']
plt.pie(counts, explode=explode, colors=colors, shadow=True, labels=labels, autopct='%1.1f%%')
plt.legend(loc=1)  # legend location (1 = upper right)
plt.axis('equal')
plt.show()
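
`autopct='%1.1f%%'` is a printf-style format string that is filled with each wedge's share of the total, in percent. A sketch of that arithmetic with made-up values, independent of Matplotlib:

```python
counts = [5000, 3000, 2000]  # made-up values
total = sum(counts)

# %1.1f prints one decimal place; %% is a literal percent sign.
labels = ['%1.1f%%' % (100 * c / total) for c in counts]

print(labels)  # ['50.0%', '30.0%', '20.0%']
```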

Scatter plot

from matplotlib import pyplot as plt
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
import numpy as np
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
# alpha sets the transparency (0 = fully transparent, 1 = opaque)
plt.scatter(x, y, alpha=0.5)
plt.show()

Pie chart of the Romance of the Three Kingdoms top 10

import jieba
from matplotlib import pyplot as plt
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# 1. Read the novel's text
with open('./novel/threekingdom.txt', 'r', encoding='utf-8') as f:
    words = f.read()
    counts = {}  # maps word to count, e.g. {'曹操': 234, '回寨': 56}
    exclude = {"将军", "却说", "丞相", "二人", "不可", "荆州", "不能", "如此", "商议",
               "如何", "主公", "军士", "军马", "左右", "次日", "引兵", "大喜", "天下",
               "东吴", "于是", "今日", "不敢", "魏兵", "陛下", "都督", "人马", "不知",
               "孔明曰", "玄德曰", "刘备", "云长"}

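    # 2. Segment the text into words
    words_list = jieba.lcut(words)
    for word in words_list:
        if len(word) <= 1:
            continue
        else:
            # Update the count stored in the dictionary:
            # counts[word] = the value currently stored under that key + 1
            # counts[word] = counts[word] + 1 raises KeyError when the key is missing
            # dict.get(k) returns None for a missing key; dict.get(k, 0) returns 0 instead
            counts[word] = counts.get(word, 0) + 1
    # print(counts)
    # 3. Filter the words: merge aliases, then delete irrelevant words and the merged aliases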
    counts['孔明'] = counts['孔明'] + counts['孔明曰']
    counts['玄德'] = counts['玄德'] + counts['玄德曰'] + counts['刘备']
    counts['关公'] = counts['关公'] + counts['云长']
    for word in exclude:
        del counts[word]
    # 4. Sort the (word, count) pairs: [(), ()]
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    roles = []
    cishu = []
    for i in range(10):
        # sequence unpacking
        role, count = items[i]
        roles.append(role)
        cishu.append(count)

    print(roles)
    print(cishu)
    # Offset of each wedge from the center, as a fraction of the radius
    explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    colors = ['red', 'purple', 'blue', 'yellow', 'gray', 'green']
    plt.pie(cishu, explode=explode, colors=colors, shadow=True, labels=roles, autopct='%1.1f%%')
    plt.legend(loc=2)  # legend location (2 = upper left)
    plt.axis('equal')
    plt.show()

Running it displays the pie chart of the top 10 characters.

Dream of the Red Chamber: top 10 characters and top 10 pie chart

import jieba
from wordcloud import WordCloud
import imageio
from matplotlib import pyplot as plt
plt.rcParams["font.sans-serif"] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

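# 1. Read the novel's text
with open('./novel/hongloumeng.txt', 'r', encoding='utf-8') as f:
    words = f.read()
    counts = {}  # maps each word to its number of occurrences
    exclude = {"什么", "一个", "我们", "你们", "如今", "说道", "知道", "起来", "这里",
               "出来", "众人", "那里", "自己", "一面", "只见", "太太", "两个", "没有",
               "怎么", "不是", "不知", "这个", "听见", "这样", "进来", "咱们", "就是",
               "老太太", "东西", "告诉", "回来", "只是", "大家", "姑娘", "奶奶", "凤姐儿",
               # full names that are merged into the short names below,
               # removed so that the same character is not counted twice
               "林黛玉", "贾宝玉", "薛宝钗", "贾政", "王熙凤"}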

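    # 2. Segment the text into words
    words_list = jieba.lcut(words)
    for word in words_list:
        if len(word) <= 1:
            continue
        else:
            # Update the count stored in the dictionary:
            # counts[word] = the value currently stored under that key + 1
            # counts[word] = counts[word] + 1 raises KeyError when the key is missing
            # dict.get(k) returns None for a missing key; dict.get(k, 0) returns 0 instead
            counts[word] = counts.get(word, 0) + 1
    # print(counts)
    # 3. Filter the words: merge aliases, then delete irrelevant words and the merged aliases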
    counts['贾母'] = counts['贾母'] + counts['老太太']
    counts['黛玉'] = counts['黛玉'] + counts['林黛玉']
    counts['宝玉'] = counts['宝玉'] + counts['贾宝玉']
    counts['宝钗'] = counts['宝钗'] + counts['薛宝钗']
    counts['老爷'] = counts['老爷'] + counts['贾政']
    counts['王夫人'] = counts['王夫人'] + counts['太太']
    counts['凤姐'] = counts['凤姐儿'] + counts['凤姐'] + counts['王熙凤']
    for word in exclude:
        del counts[word]
    # 4. Sort the (word, count) pairs: [(), ()]
    items = list(counts.items())
    # print('排序前的列表', items)
    # def sort_by_count(x):
    #     return x[1]
    # items.sort(key=sort_by_count, reverse=True)
    items.sort(key=lambda x: x[1], reverse=True)
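    li = []  # each name repeated count times, e.g. ['宝玉', '宝玉', '宝玉', '贾母', ...]
    roles = []
    cishu = []  # occurrence counts ("cishu" means "number of times")
    for i in range(10):
        # sequence unpacking
        role, count = items[i]
        roles.append(role)
        cishu.append(count)
        print(role, count)
        # the name _ tells readers that the loop variable is not used in the body
        for _ in range(count):
            li.append(role)
    # 5. Draw conclusions: word cloud and pie chart of the top 10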
    mask = imageio.imread('./china.jpg')
    text = ' '.join(li)

    WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        width=800,
        height=600,
        mask=mask,
        # count single words only; do not pair adjacent words into two-word phrases (bigrams)
        collocations=False
    ).generate(text).to_file('hlm_top10.png')

    print(roles)
    print(cishu)
    # Offset of each wedge from the center, as a fraction of the radius
    explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    colors = ['red', 'purple', 'blue', 'yellow', 'gray', 'green']
    plt.pie(cishu, explode=explode, colors=colors, shadow=True, labels=roles, autopct='%1.1f%%')
    plt.legend(loc=2)  # legend location (2 = upper left)
    plt.axis('equal')
    plt.show()

Running it saves the word cloud as hlm_top10.png and displays the pie chart.
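
The Three Kingdoms and Dream of the Red Chamber scripts differ only in the input file, the exclusion set and the alias merges, so the shared part could be factored into a function. A sketch under that assumption; `top_roles`, `aliases` and the sample data are hypothetical names introduced here, and the input is a made-up token list rather than jieba output:

```python
from collections import Counter

def top_roles(tokens, aliases, exclude, n=10):
    """Count tokens, merge aliases, drop excluded words, return the top n pairs."""
    counts = Counter(t for t in tokens if len(t) > 1)
    for alias, name in aliases.items():
        # fold the alias count into the main name; 0 if the alias never occurred
        counts[name] += counts.pop(alias, 0)
    for word in exclude:
        counts.pop(word, None)
    return counts.most_common(n)

# Made-up tokens, not real segmentation output.
tokens = ['宝玉', '贾宝玉', '说道', '黛玉', '宝玉', '林黛玉', '说道', '贾母', '的']
aliases = {'贾宝玉': '宝玉', '林黛玉': '黛玉'}
exclude = {'说道'}

print(top_roles(tokens, aliases, exclude, n=2))
```

The returned pairs can be split into `roles` and `cishu` for `plt.pie` in the same way as in the loops above.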
