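Section 2: A Review of Python Essentials

Table of Contents

      • Part One
        • 1. Creating arrays with array
        • 2. Creating arrays with functions
        • 3. Indexing and slicing
        • 4. Deduplication and combining
        • 5. Plotting
        • 6. Probability distributions
        • 7. Plotting 3D surfaces
        • 8. Miscellaneous
      • Part Two
        • 1. Converting image pixels to characters
        • 2. Message digests and secure hash algorithms: MD5/SHA1
        • 3. Statistics: mean/variance/skewness/kurtosis
        • 4. Multivariate Gaussian distribution
        • 5. Extending the factorial to the reals: the Gamma function
        • 6. Computing the correlation coefficient
        • 7. Fast Fourier transform (FFT) and signal filtering
        • 8. Singular value decomposition (SVD) and image features
        • 9. Stock closing-price curve and moving-average (MA) curve
        • 10. Stock candlestick (K-line) chart
        • 11. Image convolution
        • 12. The butterfly effect: generating curves of the Lorenz system
        • 13. Drawing a fractal: the Mandelbrot set
      • Part Three
        • 1. Bookmakers and odds
        • 2. Reading and processing data with Pandas
        • 3. Data cleaning and correction
        • 4. Fuzzy string matching with Fuzzywuzzy
        • 5. Feature extraction with principal component analysis (PCA)
        • 6. One-hot encoding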

Part One

1. Creating arrays with array

2. Creating arrays with functions
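
The three creation functions covered in this part differ mainly in how the end of the range is handled. The following is a minimal sketch using the same arguments as the script further down; the variable names are only illustrative.

```python
import numpy as np

# arange: start, stop (excluded), step
a = np.arange(1, 10, 0.5)
# linspace: start, stop (included by default), number of points
b = np.linspace(1, 10, 10)
# endpoint=False drops the stop value, so the step becomes (10 - 1) / 10
c = np.linspace(1, 10, 10, endpoint=False)
# logspace: exponents 1..4 with base 2, i.e. a geometric sequence
d = np.logspace(1, 4, 4, base=2)

print(a.size, a[-1])
print(b)
print(c)
print(d)
```

Because arange works with a step and linspace with a count, linspace is usually the safer choice when the exact number of floating-point samples matters.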

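3. Indexing and slicing

3.1 The basics: array elements are accessed the same way as in standard Python
3.2 Indexing with integer/boolean arrays
3.3 Slicing a two-dimensional array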
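
The key point of sections 3.1 and 3.2 is that a basic slice is a view onto the original data, while indexing with an integer or boolean array returns a copy. The sketch below checks this with `np.shares_memory`; the names `view`, `picked` and `masked` are illustrative only.

```python
import numpy as np

a = np.arange(10)

view = a[2:5]          # basic slice: a view onto a's buffer
view[0] = 200          # also changes a[2]

picked = a[[0, 2, 4]]  # integer-array indexing: an independent copy
picked[1] = -1         # a[2] is not affected

masked = a[a > 5]      # boolean indexing: also a copy
masked[:] = 0          # a is not affected

print(a)
print(np.shares_memory(a, view),
      np.shares_memory(a, picked),
      np.shares_memory(a, masked))
```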

4. Deduplication and combining

4.1 Timing NumPy against Python's math library
4.2 Removing duplicate elements
4.3 stack and axis
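
Two points from this part can be checked in a few lines. First, `np.unique` flattens a two-dimensional array unless an axis is given; passing `axis=0` is an alternative to the complex-number and set approaches used in the script below (it is not the method the script uses). Second, the `axis` argument of `np.stack` decides where the new dimension is inserted.

```python
import numpy as np

c = np.array(((1, 2), (3, 4), (5, 6), (1, 3), (3, 4), (7, 6)))

flat_unique = np.unique(c)          # flattens first: unique scalars
row_unique = np.unique(c, axis=0)   # unique rows, sorted lexicographically

a = np.arange(1, 7).reshape((2, 3))
b = a + 10
# Stacking two (2, 3) arrays: the new axis of length 2 appears at position k
shapes = [np.stack((a, b), axis=k).shape for k in range(3)]

print(flat_unique)
print(row_unique)
print(shapes)
```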

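5. Plotting

5.1 Plotting the probability density function of the normal distribution
5.2 Loss functions: logistic loss (labels -1/1), SVM hinge loss, 0/1 loss
5.3 x^x
5.4 Heart curve
5.5 Involute
5.6 Bar

6. Probability distributions

6.1 Uniform distribution
6.2 Verifying the central limit theorem, including for other distributions
6.3 Poisson distribution
6.4 Using histograms
6.5 Interpolation
6.6 Poisson distribution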
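
The central limit theorem experiment in 6.2 can also be checked numerically rather than visually. The sketch below averages uniform samples on [-5, 5] and compares the spread of the averages with the theoretical value sqrt(((b - a)^2 / 12) / n). The sample sizes are smaller than in the script and were chosen only to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms = 200       # number of uniform variables averaged per sample
n_samples = 5000    # number of averages collected

# Each column is one experiment: the mean of n_terms draws from U(-5, 5)
means = rng.uniform(-5, 5, size=(n_terms, n_samples)).mean(axis=0)

# Variance of U(a, b) is (b - a)^2 / 12; averaging n_terms draws divides it by n_terms
predicted_std = np.sqrt((10 ** 2 / 12) / n_terms)

print(means.mean(), means.std(), predicted_std)
```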

7. Plotting 3D surfaces

8. Miscellaneous

8.1 Linear regression with scipy
8.2 Finding the minimum of a function with scipy
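
For 8.2, the minimum of x^x on x > 0 has a closed form: the derivative x^x (ln x + 1) vanishes at x = 1/e. The sketch below confirms this with `scipy.optimize.minimize_scalar`, which is a different routine from the `fmin` family called in the script; it is used here only because the problem is one-dimensional and bounded. The bounds are an assumption made for the example.

```python
import math
from scipy.optimize import minimize_scalar

# Minimise x**x on a bounded interval that contains 1/e
res = minimize_scalar(lambda x: x ** x, bounds=(1e-6, 2), method='bounded')

print(res.x, 1 / math.e)
```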

#!/usr/bin/python
# -*- coding:utf-8 -*-

# Import NumPy; the alias np is the near-universal convention
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import time
from scipy.optimize import leastsq
from scipy import stats
import scipy.optimize as opt
import matplotlib.pyplot as plt
from scipy.stats import norm, poisson
from scipy.interpolate import BarycentricInterpolator
from scipy.interpolate import CubicSpline
import scipy as sp
import math
# import seaborn


def residual(t, x, y):
    return y - (t[0] * x ** 2 + t[1] * x + t[2])


def residual2(t, x, y):
    print(t[0], t[1])
    return y - (t[0]*np.sin(t[1]*x) + t[2])


# x ** x        x > 0
# (-x) ** (-x)  x < 0
def f(x):
    y = np.ones_like(x)
    i = x > 0
    y[i] = np.power(x[i], x[i])
    i = x < 0
    y[i] = np.power(-x[i], -x[i])
    return y


if __name__ == "__main__":
    # Opening remarks:
    # NumPy is a very handy numerical package; for example, the two-dimensional array below can be built like this
    # [[ 0  1  2  3  4  5]
    #  [10 11 12 13 14 15]
    #  [20 21 22 23 24 25]
    #  [30 31 32 33 34 35]
    #  [40 41 42 43 44 45]
    #  [50 51 52 53 54 55]]
    a = np.arange(0, 60, 10).reshape((-1, 1)) + np.arange(6)
    print(a)

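    # Getting started  -:)
    # In a standard Python list, every element is an object.
    # For example, L = [1, 2, 3] needs three pointers and three integer objects, which wastes memory and CPU time in numerical work.
    # NumPy therefore provides the ndarray (N-dimensional array object): a multidimensional array holding a single data type.

    # 1. Creating arrays with array
    # Pass a list object to the array function
    L = [1, 2, 3, 4, 5, 6]
    print("L = ", L)
    a = np.array(L)
    print("a = ", a)
    print(type(a), type(L))
    # A nested list produces a multidimensional array
    b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
    print(b)
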
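    # The size of an array is available through its shape attribute
    print(a.shape)
    print(b.shape)

    # The shape can also be assigned directly
    b.shape = 4, 3
    print(b)
    # Note: going from (3, 4) to (4, 3) does not transpose the array; only the length of each axis changes, and the elements stay where they are in memory

    # When an axis is given as -1, its length is computed from the number of elements
    b.shape = 2, -1
    print(b)
    print(b.shape)

    b.shape = 3, 4
    print(b)
    # reshape returns an array with the new shape; the shape of the original array is unchanged
    c = b.reshape((4, -1))
    print("b = \n", b)
    print('c = \n', c)

    # b and c share memory, so modifying either one affects the other
    b[0][1] = 20
    print("b = \n", b)
    print("c = \n", c)

    # The element type is available through the dtype attribute
    print(a.dtype)
    print(b.dtype)

    # The element type can be set at creation time with the dtype parameter
    # (the aliases np.float, np.int and np.complex were removed in NumPy 1.24; use the builtins instead)
    d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
    # f = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=complex)
    print(d)
    # print(f)

    # To change the element type, use astype for a safe conversion
    # (the result is named e so that the function f defined above is not shadowed)
    e = d.astype(int)
    print(e)

    # Do not overwrite dtype on its own: the line below reinterprets the raw bytes of the float64 values as integers
    d.dtype = np.int64
    print(d)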

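    # 2. Creating arrays with functions
    # NumPy provides dedicated functions for generating data that follows a rule
    # arange resembles Python's range: it builds an array from a start value, a stop value and a step
    # Like range, arange excludes the stop value; unlike range, it can also produce floating-point values
    np.set_printoptions(linewidth=100, suppress=True)
    a = np.arange(1, 10, 0.5)
    print(a)

    # linspace builds an array from a start value, a stop value and a number of elements; the stop value is included by default
    b = np.linspace(1, 10, 10)
    print('b = ', b)

    # The endpoint keyword controls whether the stop value is included
    c = np.linspace(1, 10, 10, endpoint=False)
    print('c = ', c)

    # Like linspace, logspace builds a sequence, but a geometric one
    # The call below creates 4 values running from 2^1 to 2^4
    d = np.logspace(1, 4, 4, endpoint=True, base=2)
    print(d)

    # The call below creates 11 values running from 2^0 to 2^10 (inclusive)
    # (named e so that the function f defined above is not shadowed)
    e = np.logspace(0, 10, 11, endpoint=True, base=2)
    print(e)

    # frombuffer, fromstring, fromfile and similar functions create arrays from byte sequences
    # (binary-mode fromstring is deprecated, so frombuffer is applied to the encoded bytes)
    s = 'abcdzzzz'
    g = np.frombuffer(s.encode('ascii'), dtype=np.int8)
    print(g)

    # 3. Indexing and slicing
    # 3.1 The basics: array elements are accessed the same way as in standard Python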
    a = np.arange(10)
    print(a)
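    # Get a single element
    print(a[3])
    # Slice [3, 6): closed on the left, open on the right
    print(a[3:6])
    # Omitting the start index means starting from 0
    print(a[:5])
    # Omitting the stop index means running to the end
    print(a[3:])
    # Step of 2
    print(a[1:9:2])
    # Step of -1, i.e. reversed
    print(a[::-1])
    # A slice is a view of the original array and shares its memory, so elements can be modified through it
    # a[1:4] = 10, 20, 30
    # print(a)
    # In practice, therefore, take care not to corrupt the original data by accident, e.g.: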
    b = a[2:5]
    b[0] = 200
    print(a)

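    # 3.2 Indexing with integer/boolean arrays
    # 3.2.1
    # Indexing with an integer array: when an integer sequence is used as the index,
    # each of its elements is used as a subscript; the sequence may be a list or an ndarray.
    # The array obtained this way does not share memory with the original array.
    a = np.logspace(0, 9, 10, base=2)
    print(a)
    i = np.arange(0, 10, 2)
    print(i)
    # Use i to pick elements of a
    b = a[i]
    print(b)
    # Changing an element of b leaves a untouched
    b[2] = 1.6
    print(b)
    print(a)

    # 3.2.2
    # Indexing array a with a boolean array i returns every element of a whose corresponding entry in i is True
    # Generate 10 random numbers uniformly distributed on [0, 1)
    a = np.random.rand(10)
    print(a)
    # Boolean mask of the elements greater than 0.5
    print(a > 0.5)
    # The elements greater than 0.5
    b = a[a > 0.5]
    print(b)
    # Clip the elements of the original array that exceed 0.5 down to 0.5
    a[a > 0.5] = 0.5
    print(a)
    # b is not affected
    print(b)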
    # 3.3 Slicing a two-dimensional array
    # [[ 0  1  2  3  4  5]
    #  [10 11 12 13 14 15]
    #  [20 21 22 23 24 25]
    #  [30 31 32 33 34 35]
    #  [40 41 42 43 44 45]
    #  [50 51 52 53 54 55]]
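    a = np.arange(0, 60, 10)    # row vector
    print('a = ', a)
    b = a.reshape((-1, 1))      # turn it into a column vector
    print(b)
    c = np.arange(6)
    print(c)
    e = b + c   # column + row (broadcasting); named e so that the function f defined above is not shadowed
    print(e)
    # The code above in one line:
    a = np.arange(0, 60, 10).reshape((-1, 1)) + np.arange(6)
    print(a)
    # Indexing a two-dimensional array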
    print(a[[0, 1, 2], [2, 3, 4]])
    print(a[4, [2, 3, 4]])
    print(a[4:, [2, 3, 4]])
    i = np.array([True, False, True, False, False, True])
    print(a[i])
    print(a[i, 3])

    # 4.1 Timing NumPy against Python's math library
    for j in np.logspace(0, 7, 8):
        x = np.linspace(0, 10, int(j))   # the sample count must be an integer
        start = time.perf_counter()      # time.clock() was removed in Python 3.8
        y = np.sin(x)
        t1 = time.perf_counter() - start

        x = x.tolist()
        start = time.perf_counter()
        for i, t in enumerate(x):
            x[i] = math.sin(t)
        t2 = time.perf_counter() - start
        print(j, ": ", t1, t2, t2/t1)

    # 4.2 Removing duplicate elements
    # 4.2.1 Using the library function directly
    # a = np.array((1, 2, 3, 4, 5, 5, 7, 3, 2, 2, 8, 8))
    # print('原始数组:', a)
    # # Use the library function unique
    # b = np.unique(a)
    # print('去重后:', b)
    # 4.2.2 Deduplicating a two-dimensional array: is the result what you expect?
    c = np.array(((1, 2), (3, 4), (5, 6), (1, 3), (3, 4), (7, 6)))
    print('二维数组:\n', c)
    print('去重后:', np.unique(c))
    # 4.2.3 Approach 1: convert each row to a complex number
    r, i = np.split(c, (1, ), axis=1)
    x = r + i * 1j
    # x = c[:, 0] + c[:, 1] * 1j
    print('转换成虚数:', x)
    print('虚数去重后:', np.unique(x))
    print(np.unique(x, return_index=True))   # think about what return_index is for
    idx = np.unique(x, return_index=True)[1]
    print('二维数组去重:\n', c[idx])
    # 4.2.4 Approach 2: use a set
    print('去重方案2:\n', np.array(list(set([tuple(t) for t in c]))))

    # 4.3 stack and axis
    a = np.arange(1, 7).reshape((2, 3))
    b = np.arange(11, 17).reshape((2, 3))
    c = np.arange(21, 27).reshape((2, 3))
    d = np.arange(31, 37).reshape((2, 3))
    print('a = \n', a)
    print('b = \n', b)
    print('c = \n', c)
    print('d = \n', d)
    s = np.stack((a, b, c, d), axis=0)
    print('axis = 0 ', s.shape, '\n', s)
    s = np.stack((a, b, c, d), axis=1)
    print('axis = 1 ', s.shape, '\n', s)
    s = np.stack((a, b, c, d), axis=2)
    print('axis = 2 ', s.shape, '\n', s)

    # a = np.arange(1, 10).reshape(3,3)
    # print(a)
    # b = a + 10
    # print(b)
    # print(np.dot(a, b))
    # print(a * b)
    # 
    # a = np.arange(1, 10)
    # print(a)
    # b = np.arange(20,25)
    # print(b)
    # print(np.concatenate((a, b)))

    # 5. Plotting
    # 5.1 Plotting the probability density function of the normal distribution
    mpl.rcParams['font.sans-serif'] = [u'SimHei']  # SimHei (黑体); FangSong and KaiTi are alternatives
    mpl.rcParams['axes.unicode_minus'] = False
    mu = 0
    sigma = 1
    x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 51)
    y = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    print(x.shape)
    print('x = \n', x)
    print(y.shape)
    print('y = \n', y)
    plt.figure(facecolor='w')
    plt.plot(x, y, 'ro-', linewidth=2)
    # plt.plot(x, y, 'r-', x, y, 'go', linewidth=2, markersize=8)
    plt.xlabel('X', fontsize=15)
    plt.ylabel('Y', fontsize=15)
    plt.title(u'高斯分布函数', fontsize=18)    #
    plt.grid(True)
    plt.show()

    # 5.2 Loss functions: logistic loss (labels -1/1), SVM hinge loss, 0/1 loss
    plt.figure(figsize=(10,8))
    x = np.linspace(start=-2, stop=3, num=1001, dtype=float)   # np.float was removed in NumPy 1.24
    y_logit = np.log(1 + np.exp(-x)) / math.log(2)
    y_boost = np.exp(-x)
    y_01 = x < 0
    y_hinge = 1.0 - x
    y_hinge[y_hinge < 0] = 0
    plt.plot(x, y_logit, 'r-', label='Logistic Loss', linewidth=2)
    plt.plot(x, y_01, 'g-', label='0/1 Loss', linewidth=2)
    plt.plot(x, y_hinge, 'b-', label='Hinge Loss', linewidth=2)
    plt.plot(x, y_boost, 'm--', label='Adaboost Loss', linewidth=2)
    plt.grid()
    plt.legend(loc='upper right')
    plt.savefig('1.png')
    plt.show()

    # 5.3 x^x
    plt.figure(facecolor='w')
    x = np.linspace(-1.3, 1.3, 101)
    y = f(x)
    plt.plot(x, y, 'g-', label='x^x', linewidth=2)
    plt.grid()
    plt.legend(loc='upper left')
    plt.show()

    # 5.4 Heart curve
    t = np.linspace(0, 2*np.pi, 100)
    x = 16 * np.sin(t) ** 3
    y = 13 * np.cos(t) - 5 * np.cos(2*t) - 2 * np.cos(3*t) - np.cos(4*t)
    plt.plot(x, y, 'r-', linewidth=2)
    plt.grid(True)
    plt.show()

    # 5.5 Involute
    t = np.linspace(0, 50, num=1000)
    x = t*np.sin(t) + np.cos(t)
    y = np.sin(t) - t*np.cos(t)
    plt.plot(x, y, 'r-', linewidth=2)
    plt.grid()
    plt.show()

    # 5.6 Bar
    x = np.arange(0, 10, 0.1)
    y = np.sin(x)
    plt.bar(x, y, width=0.04, linewidth=0.2)
    plt.plot(x, y, 'r--', linewidth=2)
    plt.title(u'Sin曲线')
    plt.xticks(rotation=-60)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid()
    plt.show()

    # 6. Probability distributions
    # 6.1 Uniform distribution
    x = np.random.rand(10000)
    t = np.arange(len(x))
    # plt.hist(x, 30, color='m', alpha=0.5, label=u'均匀分布')
    plt.plot(t, x, 'g.', label=u'均匀分布')
    plt.legend(loc='upper left')
    plt.grid()
    plt.show()

    # 6.2 Verifying the central limit theorem
    t = 1000
    a = np.zeros(10000)
    for i in range(t):
        a += np.random.uniform(-5, 5, 10000)
    a /= t
    plt.hist(a, bins=30, color='g', alpha=0.5, density=True, label=u'均匀分布叠加')   # density replaces the removed normed argument
    plt.legend(loc='upper left')
    plt.grid()
    plt.show()

    # 6.2.1 The central limit theorem for other distributions
    lamda = 7
    p = stats.poisson(lamda)
    y = p.rvs(size=1000)
    mx = 30
    r = (0, mx)
    bins = r[1] - r[0]
    plt.figure(figsize=(15, 8), facecolor='w')
    plt.subplot(121)
    plt.hist(y, bins=bins, range=r, color='g', alpha=0.8, density=True)   # density replaces the removed normed argument
    t = np.arange(0, mx+1)
    plt.plot(t, p.pmf(t), 'ro-', lw=2)
    plt.grid(True)

    N = 1000
    M = 10000
    plt.subplot(122)
    a = np.zeros(M, dtype=float)
    p = stats.poisson(lamda)
    for i in np.arange(N):
        a += p.rvs(size=M)
    a /= N
    plt.hist(a, bins=20, color='g', alpha=0.8, density=True)
    plt.grid(True)
    plt.show()

    # 6.3 Poisson distribution
    x = np.random.poisson(lam=5, size=10000)
    print(x)
    pillar = 15
    a = plt.hist(x, bins=pillar, density=True, range=[0, pillar], color='g', alpha=0.5)   # density replaces the removed normed argument
    plt.grid()
    plt.show()
    print(a)
    print(a[0].sum())

    # 6.4 Using histograms
    mu = 2
    sigma = 3
    data = mu + sigma * np.random.randn(1000)
    h = plt.hist(data, 30, density=True, color='#FFFFA0')   # density replaces the removed normed argument
    x = h[1]
    y = norm.pdf(x, loc=mu, scale=sigma)
    plt.plot(x, y, 'r-', x, y, 'ro', linewidth=2, markersize=4)
    plt.grid()
    plt.show()


    # 6.5 Interpolation
    rv = poisson(5)
    x1 = a[1]                                  # bin edges of the Poisson histogram from 6.3
    y1 = rv.pmf(x1)
    itp = BarycentricInterpolator(x1, y1)      # barycentric interpolation
    x2 = np.linspace(x1.min(), x1.max(), 50)   # stay inside the range of the interpolation nodes (x was overwritten in 6.4)
    y2 = itp(x2)
    cs = CubicSpline(x1, y1)                   # cubic-spline interpolation
    plt.plot(x2, cs(x2), 'm--', linewidth=5, label='CubicSpline')          # cubic spline
    plt.plot(x2, y2, 'g-', linewidth=3, label='BarycentricInterpolator')   # barycentric interpolation
    plt.plot(x1, y1, 'r-', linewidth=1, label='Actual Value')              # original values
    plt.legend(loc='upper right')
    plt.grid()
    plt.show()

    # 6.6 Poisson distribution
    size = 1000
    lamda = 5
    p = np.random.poisson(lam=lamda, size=size)
    plt.figure()
    plt.hist(p, bins=range(3 * lamda), histtype='bar', align='left', color='r', rwidth=0.8, density=True)
    plt.grid(True, ls=':')
    # plt.xticks(range(0, 15, 2))
    plt.title('Numpy.random.poisson', fontsize=13)

    plt.figure()
    r = stats.poisson(mu=lamda)
    p = r.rvs(size=size)
    plt.hist(p, bins=range(3 * lamda), color='r', align='left', rwidth=0.8, density=True)
    plt.grid(True, ls=':')
    plt.title('scipy.stats.poisson', fontsize=13)
    plt.show()

    # 7. Plotting 3D surfaces
    x, y = np.mgrid[-3:3:7j, -3:3:7j]
    print(x)
    print(y)
    u = np.linspace(-3, 3, 101)
    x, y = np.meshgrid(u, u)
    print(x)
    print(y)
    z = x*y*np.exp(-(x**2 + y**2)/2) / math.sqrt(2*math.pi)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    # ax.plot_surface(x, y, z, rstride=5, cstride=5, cmap=cm.coolwarm, linewidth=0.1)  #
    ax.plot_surface(x, y, z, rstride=3, cstride=3, cmap=cm.gist_heat, linewidth=0.5)
    plt.show()
    # # cmaps = [('Perceptually Uniform Sequential',
    # #           ['viridis', 'inferno', 'plasma', 'magma']),
    # #          ('Sequential', ['Blues', 'BuGn', 'BuPu',
    # #                          'GnBu', 'Greens', 'Greys', 'Oranges', 'OrRd',
    # #                          'PuBu', 'PuBuGn', 'PuRd', 'Purples', 'RdPu',
    # #                          'Reds', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd']),
    # #          ('Sequential (2)', ['afmhot', 'autumn', 'bone', 'cool',
    # #                              'copper', 'gist_heat', 'gray', 'hot',
    # #                              'pink', 'spring', 'summer', 'winter']),
    # #          ('Diverging', ['BrBG', 'bwr', 'coolwarm', 'PiYG', 'PRGn', 'PuOr',
    # #                         'RdBu', 'RdGy', 'RdYlBu', 'RdYlGn', 'Spectral',
    # #                         'seismic']),
    # #          ('Qualitative', ['Accent', 'Dark2', 'Paired', 'Pastel1',
    # #                           'Pastel2', 'Set1', 'Set2', 'Set3']),
    # #          ('Miscellaneous', ['gist_earth', 'terrain', 'ocean', 'gist_stern',
    # #                             'brg', 'CMRmap', 'cubehelix',
    # #                             'gnuplot', 'gnuplot2', 'gist_ncar',
    # #                             'nipy_spectral', 'jet', 'rainbow',
    # #                             'gist_rainbow', 'hsv', 'flag', 'prism'])]

    # 8.1 Linear regression with scipy
    # Least-squares fit, example 1 (a quadratic, which is linear in its coefficients)
    x = np.linspace(-2, 2, 50)
    A, B, C = 2, 3, -1
    y = (A * x ** 2 + B * x + C) + np.random.rand(len(x))*0.75

    t = leastsq(residual, [0, 0, 0], args=(x, y))
    theta = t[0]
    print('真实值:', A, B, C)
    print('预测值:', theta)
    y_hat = theta[0] * x ** 2 + theta[1] * x + theta[2]
    plt.plot(x, y, 'r-', linewidth=2, label=u'Actual')
    plt.plot(x, y_hat, 'g-', linewidth=2, label=u'Predict')
    plt.legend(loc='upper left')
    plt.grid()
    plt.show()

    # Least-squares fit, example 2 (a sine model, which is nonlinear in its parameters)
    x = np.linspace(0, 5, 100)
    a = 5
    w = 1.5
    phi = -2
    y = a * np.sin(w*x) + phi + np.random.rand(len(x))*0.5

    t = leastsq(residual2, [3, 5, 1], args=(x, y))
    theta = t[0]
    print('真实值:', a, w, phi)
    print('预测值:', theta)
    y_hat = theta[0] * np.sin(theta[1] * x) + theta[2]
    plt.plot(x, y, 'r-', linewidth=2, label='Actual')
    plt.plot(x, y_hat, 'g-', linewidth=2, label='Predict')
    plt.legend(loc='lower left')
    plt.grid()
    plt.show()

    # 8.2 Finding the minimum of a function with scipy
    a = opt.fmin(f, 1)
    b = opt.fmin_cg(f, 1)
    c = opt.fmin_bfgs(f, 1)
    print(a, 1/a, math.e)
    print(b)
    print(c)

    # marker	description
    # ”.”	point
    # ”,”	pixel
    # “o”	circle
    # “v”	triangle_down
    # “^”	triangle_up
    # “<”	triangle_left
    # “>”	triangle_right
    # “1”	tri_down
    # “2”	tri_up
    # “3”	tri_left
    # “4”	tri_right
    # “8”	octagon
    # “s”	square
    # “p”	pentagon
    # “*”	star
    # “h”	hexagon1
    # “H”	hexagon2
    # “+”	plus
    # “x”	x
    # “D”	diamond
    # “d”	thin_diamond
    # “|”	vline
    # “_”	hline
    # TICKLEFT	tickleft
    # TICKRIGHT	tickright
    # TICKUP	tickup
    # TICKDOWN	tickdown
    # CARETLEFT	caretleft
    # CARETRIGHT	caretright
    # CARETUP	caretup
    # CARETDOWN	caretdown

Part Two

1. Converting image pixels to characters

#!/usr/bin/env python
# coding: utf-8

import numpy as np
from PIL import Image

if __name__ == '__main__':
    image_file = '2023.png'
    height = 100

    img = Image.open(image_file)
    img_width, img_height = img.size
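    width = 2 * height * img_width // img_height   # assume a character is twice as tall as it is wide
    img = img.resize((width, height), Image.LANCZOS)   # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter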
    pixels = np.array(img.convert('L'))
    print(pixels.shape)
    print(pixels)
    chars = "MNHQ$OC?7>!:-;. "
    N = len(chars)
    step = 256 // N
    print(N)
    result = ''
    for i in range(height):
        for j in range(width):
            result += chars[pixels[i][j] // step]
        result += '\n'
    with open('text.txt', mode='w') as f:
        f.write(result)
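
The nested loop above maps each gray level to a character by integer division. The same mapping can be written with NumPy fancy indexing, as in the sketch below. The gradient array is a made-up stand-in for `np.array(img.convert('L'))`, so the example runs without an image file.

```python
import numpy as np

chars = np.array(list("MNHQ$OC?7>!:-;. "))   # dark -> light, same ramp as above
step = 256 // len(chars)                      # 16 gray levels per character

# Hypothetical stand-in for the grayscale image: a horizontal gradient, 4 rows x 32 columns
pixels = np.tile(np.arange(0, 256, 8, dtype=np.uint8), (4, 1))

# Indexing chars with the whole index array converts every pixel at once
text = '\n'.join(''.join(row) for row in chars[pixels // step])
print(text)
```

This only works unchanged because the ramp has 16 characters, so `pixels // step` never exceeds the last index; a ramp whose length does not divide 256 would need the index clipped.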

2. Message digests and secure hash algorithms: MD5/SHA1

#!/usr/bin/python

import hashlib


if __name__ == "__main__":
    md5 = hashlib.md5()
    md5.update('This is a sentence.'.encode('utf-8'))
    md5.update('This is a second sentence.'.encode('utf-8'))
    print('不出意外,这个将是“乱码”:', md5.digest())
    print('MD5:', md5.hexdigest())

    md5 = hashlib.md5()
    md5.update('This is a sentence.This is a second sentence.'.encode('utf-8'))
    print('MD5:', md5.hexdigest())
    print(md5.digest_size, md5.block_size)
    print('------------------')

    sha1 = hashlib.sha1()
    sha1.update('This is a sentence.'.encode('utf-8'))
    sha1.update('This is a second sentence.'.encode('utf-8'))
    print('不出意外,这个将是“乱码”:', sha1.digest())
    print('SHA1:', sha1.hexdigest())

    sha1 = hashlib.sha1()
    sha1.update('This is a sentence.This is a second sentence.'.encode('utf-8'))
    print('SHA1:', sha1.hexdigest())
    print(sha1.digest_size, sha1.block_size)
    print('=====================')

    md5 = hashlib.new('md5', 'This is a sentence.This is a second sentence.'.encode('utf-8'))
    print(md5.hexdigest())
    sha1 = hashlib.new('sha1', 'This is a sentence.This is a second sentence.'.encode('utf-8'))
    print(sha1.hexdigest())

    print(hashlib.algorithms_available)
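
The script above relies on one property of `hashlib` objects: feeding the data in several `update` calls gives the same digest as hashing the concatenated bytes in one go. A minimal check of that property, using the same two sentences (the helper name `digest_in_parts` is illustrative only):

```python
import hashlib

parts = [b'This is a sentence.', b'This is a second sentence.']

def digest_in_parts(name, chunks):
    """Hash the chunks one update() call at a time and return the hex digest."""
    h = hashlib.new(name)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

for name in ('md5', 'sha1'):
    incremental = digest_in_parts(name, parts)
    one_shot = hashlib.new(name, b''.join(parts)).hexdigest()
    print(name, incremental == one_shot, len(incremental))
```

This is what makes it possible to hash a large file block by block without loading it into memory.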

3. Statistics: mean/variance/skewness/kurtosis

#!/usr/bin/python
#  -*- coding:utf-8 -*-

import numpy as np
from scipy import stats
import math
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import seaborn


def calc_statistics(x):
    n = x.shape[0]  # number of samples

    # Manual computation
    m = 0
    m2 = 0
    m3 = 0
    m4 = 0
    for t in x:
        m += t
        m2 += t*t
        m3 += t**3
        m4 += t**4
    m /= n
    m2 /= n
    m3 /= n
    m4 /= n

    mu = m
    sigma = np.sqrt(m2 - mu*mu)
    skew = (m3 - 3*mu*m2 + 2*mu**3) / sigma**3
    kurtosis = (m4 - 4*mu*m3 + 6*mu*mu*m2 - 4*mu**3*mu + mu**4) / sigma**4 - 3
    print('手动计算均值、标准差、偏度、峰度:', mu, sigma, skew, kurtosis)

    # Check against the library functions
    mu = np.mean(x, axis=0)
    sigma = np.std(x, axis=0)
    skew = stats.skew(x)
    kurtosis = stats.kurtosis(x)
    return mu, sigma, skew, kurtosis


if __name__ == '__main__':
    d = np.random.randn(10000)
    print(d)
    print(d.shape)
    mu, sigma, skew, kurtosis = calc_statistics(d)
    print('函数库计算均值、标准差、偏度、峰度:', mu, sigma, skew, kurtosis)
    # One-dimensional histogram
    mpl.rcParams['font.sans-serif'] = 'SimHei'
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(num=1, facecolor='w')
    y1, x1, dummy = plt.hist(d, bins=50, density=True, color='g', alpha=0.75)   # density replaces the removed normed argument
    t = np.arange(x1.min(), x1.max(), 0.05)
    y = np.exp(-t**2 / 2) / math.sqrt(2*math.pi)
    plt.plot(t, y, 'r-', lw=2)
    plt.title('高斯分布,样本个数:%d' % d.shape[0])
    plt.grid(True)
    # plt.show()

    d = np.random.randn(100000, 2)
    mu, sigma, skew, kurtosis = calc_statistics(d)
    print('函数库计算均值、标准差、偏度、峰度:', mu, sigma, skew, kurtosis)

    # Two-dimensional histogram drawn as a 3D plot
    N = 30
    density, edges = np.histogramdd(d, bins=[N, N])
    print('样本总数:', np.sum(density))
    density /= density.max()
    x = y = np.arange(N)
    print('x = ', x)
    print('y = ', y)
    t = np.meshgrid(x, y)
    print(t)
    fig = plt.figure(facecolor='w')
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(t[0], t[1], density, c='r', s=50*density, marker='o', depthshade=True)
    ax.plot_surface(t[0], t[1], density, cmap=cm.Accent, rstride=1, cstride=1, alpha=0.9, lw=0.75)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    plt.title('二元高斯分布,样本个数:%d' % d.shape[0], fontsize=15)
    plt.tight_layout(pad=0.1)   # pad must be passed by keyword in current Matplotlib
    plt.show()
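
The manual part of `calc_statistics` expresses skewness and excess kurtosis through the raw moments E[x^k]. The sketch below restates those formulas in vectorised form and compares them with `scipy.stats`; the function name `moments_stats` and the exponential test data are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def moments_stats(x):
    """Mean, std, skewness and excess kurtosis from the raw moments E[x^k]."""
    m1, m2, m3, m4 = (np.mean(x ** k) for k in range(1, 5))
    var = m2 - m1 ** 2
    # Third and fourth central moments expanded in terms of raw moments
    skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / var ** 1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / var ** 2 - 3
    return m1, np.sqrt(var), skew, kurt

rng = np.random.default_rng(0)
x = rng.exponential(size=5000)   # a skewed distribution, so the values are far from 0
mu, sigma, skew, kurt = moments_stats(x)
print(mu, sigma, skew, kurt)
print(stats.skew(x), stats.kurtosis(x))
```

The agreement holds for the default (biased) estimators in `scipy.stats.skew` and `scipy.stats.kurtosis`, which divide by n just as the manual formulas do.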

4. Multivariate Gaussian distribution

#!/usr/bin/python
#  -*- coding:utf-8 -*-

import numpy as np
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm


if __name__ == '__main__':
    x1, x2 = np.mgrid[-5:5:51j, -5:5:51j]
    x = np.stack((x1, x2), axis=2)

    mpl.rcParams['axes.unicode_minus'] = False
    mpl.rcParams['font.sans-serif'] = 'SimHei'
    plt.figure(figsize=(9, 8), facecolor='w')
    sigma = (np.identity(2), np.diag((3,3)), np.diag((2,5)), np.array(((2,1), (1,5))))
    for i in np.arange(4):
        ax = plt.subplot(2, 2, i+1, projection='3d')
        norm = stats.multivariate_normal((0, 0), sigma[i])
        y = norm.pdf(x)
        ax.plot_surface(x1, x2, y, cmap=cm.Accent, rstride=2, cstride=2, alpha=0.9, lw=0.3)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.set_zlabel('Z')
    plt.suptitle('二元高斯分布方差比较', fontsize=18)
    plt.tight_layout(pad=1.5)   # pad must be passed by keyword in current Matplotlib
    plt.show()
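
The surfaces plotted above come from `stats.multivariate_normal(...).pdf`. For reference, the density can also be evaluated directly from the closed-form expression exp(-(x - mu)^T Sigma^{-1} (x - mu) / 2) / sqrt((2 pi)^k det(Sigma)). The sketch below does this for a single point; `gaussian_pdf` and the test point are hypothetical names and values chosen for illustration.

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mean, cov):
    """Density of N(mean, cov) at a single point x, from the closed-form formula."""
    x, mean, cov = (np.asarray(v, dtype=float) for v in (x, mean, cov))
    d = x - mean
    k = mean.size
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    # solve(cov, d) computes cov^{-1} d without forming the inverse explicitly
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm_const

cov = np.array(((2, 1), (1, 5)))   # the fourth covariance matrix used above
point = (0.5, -1.0)
manual = gaussian_pdf(point, (0, 0), cov)
reference = stats.multivariate_normal((0, 0), cov).pdf(point)
print(manual, reference)
```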

5. Extending the factorial to the reals: the Gamma function

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.special import gamma
from scipy.special import factorial

mpl.rcParams['axes.unicode_minus'] = False
mpl.rcParams['font.sans-serif'] = 'SimHei'


if __name__ == '__main__':
    N = 5
    x = np.linspace(0, N, 50)
    y = gamma(x+1)
    plt.figure(facecolor='w')
    plt.plot(x, y, 'r-', x, y, 'mo', lw=2, ms=7)
    z = np.arange(0, N+1)
    f = factorial(z, exact=True)    # factorial
    print(f)
    plt.plot(z, f, 'go', markersize=9)
    plt.grid(True)
    plt.xlim(-0.1,N+0.1)
    plt.ylim(0.5, np.max(y)*1.05)
    plt.xlabel('X', fontsize=15)
    plt.ylabel('Gamma(X) - 阶乘', fontsize=15)
    plt.title('阶乘和Gamma函数', fontsize=16)
    plt.show()
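
The plot relies on the identity Gamma(n + 1) = n! for non-negative integers n, which is why the factorial points fall on the Gamma curve. A short numerical check, which also evaluates one non-integer argument:

```python
import math
from scipy.special import gamma

# Gamma(n + 1) == n! for non-negative integers
pairs = [(n, gamma(n + 1), math.factorial(n)) for n in range(6)]
for n, g, f in pairs:
    print(n, g, f)

# Between the integers the function is still defined, e.g. Gamma(1/2) == sqrt(pi)
half = gamma(0.5)
print(half, math.sqrt(math.pi))
```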

6. Computing the correlation coefficient

#!/usr/bin/python
#  -*- coding:utf-8 -*-

import numpy as np
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings

mpl.rcParams['axes.unicode_minus'] = False
mpl.rcParams['font.sans-serif'] = 'SimHei'


def calc_pearson(x, y):
    std1 = np.std(x)
    # np.sqrt(np.mean(x**2) - np.mean(x)**2)
    std2 = np.std(y)
    cov = np.cov(x, y, bias=True)[0,1]
    return cov / (std1 * std2)


def intro():
    N = 10
    x = np.random.rand(N)
    y = 2 * x + np.random.randn(N) * 0.1
    print(x)
    print(y)
    print('系统计算:', stats.pearsonr(x, y)[0])
    print('手动计算:', calc_pearson(x, y))


def rotate(x, y, theta=45):
    data = np.vstack((x, y))
    # print data
    mu = np.mean(data, axis=1)
    mu = mu.reshape((-1, 1))
    # print mu
    data -= mu
    # print data
    theta *= (np.pi / 180)
    c = np.cos(theta)
    s = np.sin(theta)
    m = np.array(((c, -s), (s, c)))
    return m.dot(data) + mu


def pearson(x, y, tip):
    clrs = list('rgbmycrgbmycrgbmycrgbmyc')
    plt.figure(figsize=(10, 8), facecolor='w')
    for i, theta in enumerate(np.linspace(0, 90, 6)):
        xr, yr = rotate(x, y, theta)
        p = stats.pearsonr(xr, yr)[0]
        # print calc_pearson(xr, yr)
        print('旋转角度:', theta, 'Pearson相关系数:', p)
        label = '相关系数:%.3f' % p   # named label to avoid shadowing the builtin str
        plt.scatter(xr, yr, s=40, alpha=0.9, linewidths=0.5, c=clrs[i], marker='o', label=label)
    plt.legend(loc='upper left', shadow=True)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Pearson相关系数与数据分布:%s' % tip, fontsize=18)
    plt.grid(True)
    plt.show()


if __name__ == '__main__':
    # warnings.filterwarnings(action='ignore', category=RuntimeWarning)
    np.random.seed(0)

    intro()

    N = 1000
    tip = '一次函数关系'
    x = np.random.rand(N)
    y = np.zeros(N) + np.random.randn(N)*0.001

    # tip = u'二次函数关系'
    # x = np.random.rand(N)
    # y = x ** 2 #+ np.random.randn(N)*0.002

    # tip = u'正切关系'
    # x = np.random.rand(N) * 1.4
    # y = np.tan(x)

    # tip = u'二次函数关系'
    # x = np.linspace(-1, 1, 101)
    # y = x ** 2

    # tip = u'椭圆'
    # x, y = np.random.rand(2, N) * 60 - 30
    # y /= 5
    # idx = (x**2 / 900 + y**2 / 36 < 1)
    # x = x[idx]
    # y = y[idx]

    pearson(x, y, tip)
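
The experiments above illustrate that the Pearson coefficient measures linear association only. The sketch below makes the same point without plotting: an exact linear relation gives 1, while y = x^2 on a symmetric interval gives a value close to 0 even though y is fully determined by x. The function `pearson_r` is an illustrative restatement of `calc_pearson`, not a replacement for it.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / (x.std() * y.std())

x = np.linspace(-1, 1, 101)
r_linear = pearson_r(x, 3 * x + 2)   # exact linear relation
r_square = pearson_r(x, x ** 2)      # deterministic but not linear
r_mixed = pearson_r(x, x ** 2 + x)   # partly linear

print(r_linear, r_square, r_mixed)
```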

7. Fast Fourier transform (FFT) and signal filtering

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt


def triangle_wave(size, T):
    t = np.linspace(-1, 1, size, endpoint=False)
    # where
    # y = np.where(t < 0, -t, 0)
    # y = np.where(t >= 0, t, y)
    y = np.abs(t)
    y = np.tile(y, T) - 0.5
    x = np.linspace(0, 2*np.pi*T, size*T, endpoint=False)
    return x, y


def sawtooth_wave(size, T):
    t = np.linspace(-1, 1, size)
    y = np.tile(t, T)
    x = np.linspace(0, 2*np.pi*T, size*T, endpoint=False)
    return x, y


def triangle_wave2(size, T):
    x, y = sawtooth_wave(size, T)
    return x, np.abs(y)


def non_zero(f):
    f1 = np.real(f)
    f2 = np.imag(f)
    eps = 1e-4
    return f1[(f1 > eps) | (f1 < -eps)], f2[(f2 > eps) | (f2 < -eps)]


if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    np.set_printoptions(suppress=True)

    x = np.linspace(0, 2*np.pi, 16, endpoint=False)
    print('时域采样值:', x)
    y = np.sin(2*x) + np.sin(3*x + np.pi/4) + np.sin(5*x)
    # y = np.sin(x)

    N = len(x)
    print('采样点个数:', N)
    print('\n原始信号:', y)
    f = np.fft.fft(y)
    print('\n频域信号:', f/N)
    a = np.abs(f/N)
    print('\n频率强度:', a)

    iy = np.fft.ifft(f)
    print('\n逆傅里叶变换恢复信号:', iy)
    print('\n虚部:', np.imag(iy))
    print('\n实部:', np.real(iy))
    print('\n恢复信号与原始信号是否相同:', np.allclose(np.real(iy), y))

    plt.figure(facecolor='w')
    plt.subplot(211)
    plt.plot(x, y, 'go-', lw=2)
    plt.title('时域信号', fontsize=15)
    plt.grid(True)
    plt.subplot(212)
    w = np.arange(N) * 2*np.pi / N
    print('频率采样值:', w)
    plt.stem(w, a, linefmt='r-', markerfmt='ro')
    plt.title('频域信号', fontsize=15)
    plt.grid(True)
    plt.show()

    # Triangle / sawtooth wave
    x, y = triangle_wave(20, 5)
    # x, y = sawtooth_wave(20, 5)
    N = len(y)
    f = np.fft.fft(y)
    # print '原始频域信号:', np.real(f), np.imag(f)
    print('原始频域信号:', non_zero(f))
    a = np.abs(f / N)

    # np.real_if_close
    f_real = np.real(f)
    eps = 0.1 * f_real.max()
    print('f_real = \n', f_real)
    print(eps)
    f_real[(f_real < eps) & (f_real > -eps)] = 0
    f_imag = np.imag(f)
    eps = 0.3 * f_imag.max()
    print(eps)
    f_imag[(f_imag < eps) & (f_imag > -eps)] = 0
    f1 = f_real + f_imag * 1j
    y1 = np.fft.ifft(f1)
    y1 = np.real(y1)
    # print '恢复频域信号:', np.real(f1), np.imag(f1)
    print('恢复频域信号:', non_zero(f1))

    plt.figure(figsize=(8, 8), facecolor='w')
    plt.subplot(311)
    plt.plot(x, y, 'g-', lw=2)
    plt.title('三角波', fontsize=15)
    plt.grid(True)
    plt.subplot(312)
    w = np.arange(N) * 2*np.pi / N
    plt.stem(w, a, linefmt='r-', markerfmt='ro')
    plt.title('频域信号', fontsize=15)
    plt.grid(True)
    plt.subplot(313)
    plt.plot(x, y1, 'b-', lw=2, markersize=4)
    plt.title('三角波恢复信号', fontsize=15)
    plt.grid(True)
    plt.tight_layout(pad=1.5, rect=[0, 0.04, 1, 0.96])   # pad must be passed by keyword in current Matplotlib
    plt.suptitle('快速傅里叶变换FFT与频域滤波', fontsize=17)
    plt.show()
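
The first half of the script builds a signal from three sinusoids and inspects its spectrum. The sketch below uses the same 16-sample signal to read the component frequencies off the spectrum and then applies a simple low-pass filter by zeroing bins. It uses `np.fft.rfft`, the real-input variant, instead of the `np.fft.fft` call in the script, and the cut-off at bin 4 is an assumption chosen for the example.

```python
import numpy as np

N = 16
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
y = np.sin(2 * x) + np.sin(3 * x + np.pi / 4) + np.sin(5 * x)

spectrum = np.fft.rfft(y)               # bins 0..N/2 for a real signal
# One-sided amplitude of each sinusoid (the DC and Nyquist bins would not take the factor 2)
amplitude = 2 * np.abs(spectrum) / N
peaks = np.flatnonzero(amplitude > 0.5)
print(peaks, amplitude[peaks])

# Low-pass filter: zero every bin from 4 cycles upward, then transform back
filtered = spectrum.copy()
filtered[4:] = 0
y_low = np.fft.irfft(filtered, n=N)
```

After filtering, `y_low` contains only the components at 2 and 3 cycles, which is the same idea as the thresholding applied to the triangle wave above.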

8. Singular value decomposition (SVD) and image features

#!/usr/bin/python
#  -*- coding:utf-8 -*-

import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib as mpl
from pprint import pprint


def restore1(sigma, u, v, K):  # singular values, left singular vectors, right singular vectors
    m = len(u)
    n = len(v[0])
    a = np.zeros((m, n))
    for k in range(K):
        uk = u[:, k].reshape(m, 1)
        vk = v[k].reshape(1, n)
        a += sigma[k] * np.dot(uk, vk)
    a[a < 0] = 0
    a[a > 255] = 255
    # a = a.clip(0, 255)
    return np.rint(a).astype('uint8')


def restore2(sigma, u, v, K):  # singular values, left singular vectors, right singular vectors
    m = len(u)
    n = len(v[0])
    a = np.zeros((m, n))
    for k in range(K):   # use the first K singular values, matching restore1
        for i in range(m):
            a[i] += sigma[k] * u[i][k] * v[k]
    a[a < 0] = 0
    a[a > 255] = 255
    return np.rint(a).astype('uint8')


if __name__ == "__main__":
    A = Image.open("..\\lena.png", 'r')
    print(A)
    output_path = r'.\SVD_Output'
    if not os.path.exists(output_path):
        os.mkdir(output_path)
    a = np.array(A)
    print(a.shape)
    K = 50
    u_r, sigma_r, v_r = np.linalg.svd(a[:, :, 0])
    u_g, sigma_g, v_g = np.linalg.svd(a[:, :, 1])
    u_b, sigma_b, v_b = np.linalg.svd(a[:, :, 2])
    plt.figure(figsize=(11, 9), facecolor='w')
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    for k in range(1, K+1):
        print(k)
        R = restore1(sigma_r, u_r, v_r, k)
        G = restore1(sigma_g, u_g, v_g, k)
        B = restore1(sigma_b, u_b, v_b, k)
        I = np.stack((R, G, B), axis=2)
        Image.fromarray(I).save('%s\\svd_%d.png' % (output_path, k))
        if k <= 12:
            plt.subplot(3, 4, k)
            plt.imshow(I)
            plt.axis('off')
            plt.title('奇异值个数:%d' % k)
    plt.suptitle('SVD与图像分解', fontsize=20)
    plt.tight_layout(pad=0.3, rect=(0, 0, 1, 0.92))   # pad must be passed by keyword in current Matplotlib
    # plt.subplots_adjust(top=0.9)
    plt.show()
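
`restore1` accumulates the rank-K approximation one term sigma_k * u_k * v_k^T at a time. The same sum can be written as a single matrix product, and the Frobenius norm of the residual then equals the square root of the sum of the squared discarded singular values (the Eckart-Young result). The sketch below checks both on a random matrix, which is a made-up stand-in for one colour channel; clipping and rounding to uint8 are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(40, 30)).astype(float)   # hypothetical stand-in for one channel

u, sigma, vt = np.linalg.svd(a, full_matrices=False)

def rank_k(k):
    """Sum of the first k terms sigma_i * u_i * v_i^T, written as one matrix product."""
    return (u[:, :k] * sigma[:k]) @ vt[:k]

k = 10
err = np.linalg.norm(a - rank_k(k))           # Frobenius norm of the residual
expected = np.sqrt(np.sum(sigma[k:] ** 2))    # energy of the discarded singular values
print(err, expected)
```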

9. Stock closing-price curve and moving-average (MA) curve

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt


if __name__ == "__main__":
    stock_max, stock_min, stock_close, stock_amount = np.loadtxt('..\\SH600000.txt', delimiter='\t', skiprows=2, usecols=(2, 3, 4, 5), unpack=True)
    N = 100
    stock_close = stock_close[:N]
    print(stock_close)

    n = 10
    weight = np.ones(n)
    weight /= weight.sum()
    print(weight)
    stock_sma = np.convolve(stock_close, weight, mode='valid')  # simple moving average

    weight = np.linspace(1, 0, n)
    weight = np.exp(weight)
    weight /= weight.sum()
    print(weight)
    stock_ema = np.convolve(stock_close, weight, mode='valid')  # exponential moving average

    t = np.arange(n-1, N)
    poly = np.polyfit(t, stock_ema, 10)
    print(poly)
    stock_ema_hat = np.polyval(poly, t)

    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(np.arange(N), stock_close, 'ro-', linewidth=2, label='原始收盘价')
    t = np.arange(n-1, N)
    plt.plot(t, stock_sma, 'b-', linewidth=2, label='简单移动平均线')
    plt.plot(t, stock_ema, 'g-', linewidth=2, label='指数移动平均线')
    plt.legend(loc='upper right')
    plt.title('股票收盘价与滑动平均线MA', fontsize=15)
    plt.grid(True)
    plt.show()

    plt.figure(figsize=(8.8, 6.6), facecolor='w')
    plt.plot(np.arange(N), stock_close, 'ro-', linewidth=1, label='原始收盘价')
    plt.plot(t, stock_ema, 'g-', linewidth=2, label='指数移动平均线')
    plt.plot(t, stock_ema_hat, '-', color='#FF4040', linewidth=3, label='指数移动平均线估计')
    plt.legend(loc='upper right')
    plt.title('滑动平均线MA的估计', fontsize=15)
    plt.grid(True)
    plt.show()
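
Both averages in the script are the same operation, a convolution of the price series with a normalized weight vector. A minimal sketch on made-up prices follows; the helper name `moving_average` and the data are assumptions for illustration only.

```python
import numpy as np


def moving_average(values, weights):
    """Weighted moving average; weights[0] applies to the most recent value."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize so the weights sum to 1
    return np.convolve(values, w, mode='valid')


prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])            # made-up data
sma = moving_average(prices, np.ones(3))                      # equal weights
ema = moving_average(prices, np.exp(np.linspace(1, 0, 3)))    # heavier weight on recent values
print(sma)
print(ema)
```

With `mode='valid'` a window of length n shortens the series by n-1 points, which is why the script plots the averages against `np.arange(n-1, N)`. On a rising series the exponentially weighted average sits above the simple one because it leans on the newest prices.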

10. Stock candlestick (K-line) chart

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
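# matplotlib.finance was removed in matplotlib 2.2; the same function ships in the mplfinance package
from mplfinance.original_flavor import candlestick_ohlc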


if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    np.set_printoptions(suppress=True, linewidth=100, edgeitems=5)
    data = np.loadtxt('..\\SH600000.txt', dtype=float, delimiter='\t', skiprows=2, usecols=(1, 2, 3, 4))
    data = data[:50]
    N = len(data)

    t = np.arange(1, N+1).reshape((-1, 1))
    data = np.hstack((t, data))

    fig, ax = plt.subplots(facecolor='w')
    fig.subplots_adjust(bottom=0.2)
    candlestick_ohlc(ax, data, width=0.6, colorup='r', colordown='b', alpha=0.9)
    plt.xlim((0, N+1))
    plt.grid(True)
    plt.title('Stock candlestick chart', fontsize=15)
    plt.tight_layout(pad=2)
    plt.show()
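
To make clear what a candlestick encodes, here is a small sketch that splits open/high/low/close rows into the pieces of each candle. The function name `candle_geometry` and the two quote rows are made up for illustration; this is not how `candlestick_ohlc` is implemented.

```python
import numpy as np


def candle_geometry(ohlc):
    """Split (open, high, low, close) rows into the parts of a candlestick."""
    o, h, l, c = np.asarray(ohlc, dtype=float).T
    body_bottom = np.minimum(o, c)        # lower edge of the body
    body_height = np.abs(c - o)           # body spans open..close
    rising = c >= o                       # the script colors rising candles red (colorup='r')
    return body_bottom, body_height, h, l, rising


quotes = [(10.0, 10.8, 9.9, 10.5),        # made-up rows: open, high, low, close
          (10.5, 10.6, 10.0, 10.1)]
bottom, height, high, low, rising = candle_geometry(quotes)
print(bottom, height, rising)
```

These arrays could be drawn with plain matplotlib, for example `ax.bar(t, height, bottom=bottom)` for the bodies and `ax.vlines(t, low, high)` for the wicks.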

11. Image convolution

#!/usr/bin/python
#  -*- coding:utf-8 -*-

import numpy as np
import os
from PIL import Image


def convolve(image, weight):
    height, width = image.shape
    h, w = weight.shape
    height_new = height - h + 1
    width_new = width - w + 1
    image_new = np.zeros((height_new, width_new), dtype=float)
    for i in range(height_new):
        for j in range(width_new):
            image_new[i,j] = np.sum(image[i:i+h, j:j+w] * weight)
    image_new = image_new.clip(0, 255)
    image_new = np.rint(image_new).astype('uint8')
    return image_new

# image_new = 255 * (image_new - image_new.min()) / (image_new.max() - image_new.min())

if __name__ == "__main__":
    A = Image.open("..\\son.png", 'r')
    output_path = '.\\ImageConvolve\\'
    if not os.path.exists(output_path):
        os.mkdir(output_path)
    a = np.array(A)
    avg3 = np.ones((3, 3))
    avg3 /= avg3.sum()
    avg5 = np.ones((5, 5))
    avg5 /= avg5.sum()
    gauss = np.array(([0.003, 0.013, 0.022, 0.013, 0.003],
                      [0.013, 0.059, 0.097, 0.059, 0.013],
                      [0.022, 0.097, 0.159, 0.097, 0.022],
                      [0.013, 0.059, 0.097, 0.059, 0.013],
                      [0.003, 0.013, 0.022, 0.013, 0.003]))
    sobel_x = np.array(([-1, 0, 1], [-2, 0, 2], [-1, 0, 1]))
    sobel_y = np.array(([-1, -2, -1], [0, 0, 0], [1, 2, 1]))
    sobel = np.array(([-1, -1, 0], [-1, 0, 1], [0, 1, 1]))
    prewitt_x = np.array(([-1, 0, 1], [-1, 0, 1], [-1, 0, 1]))
    prewitt_y = np.array(([-1, -1,-1], [0, 0, 0], [1, 1, 1]))
    prewitt = np.array(([-2, -1, 0], [-1, 0, 1], [0, 1, 2]))
    laplacian4 = np.array(([0, -1, 0], [-1, 4, -1], [0, -1, 0]))
    laplacian8 = np.array(([-1, -1, -1], [-1, 8, -1], [-1, -1, -1]))
    weight_list = ('avg3', 'avg5', 'gauss', 'sobel_x', 'sobel_y', 'sobel', 'prewitt_x', 'prewitt_y', 'prewitt', 'laplacian4', 'laplacian8')
    print('Gradient detection:')
    for weight in weight_list:
        print(weight, 'R', end=' ')
        R = convolve(a[:, :, 0], eval(weight))
        print('G', end=' ')
        G = convolve(a[:, :, 1], eval(weight))
        print('B')
        B = convolve(a[:, :, 2], eval(weight))
        I = np.stack((R, G, B), 2)
        Image.fromarray(I).save(output_path + weight + '.png')

    # # X & Y
    # print '梯度检测XY:'
    # for w in (0, 2):
    #     weight = weight_list[w]
    #     print weight, 'R',
    #     R = convolve(a[:, :, 0], eval(weight))
    #     print 'G',
    #     G = convolve(a[:, :, 1], eval(weight))
    #     print 'B'
    #     B = convolve(a[:, :, 2], eval(weight))
    #     I1 = np.stack((R, G, B), 2)
    #
    #     weight = weight_list[w+1]
    #     print weight, 'R',
    #     R = convolve(a[:, :, 0], eval(weight))
    #     print 'G',
    #     G = convolve(a[:, :, 1], eval(weight))
    #     print 'B'
    #     B = convolve(a[:, :, 2], eval(weight))
    #     I2 = np.stack((R, G, B), 2)
    #
    #     I = 255 - np.maximum(I1, I2)
    #     Image.fromarray(I).save(output_path + weight[:-2] + '.png')
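
The double loop in `convolve` is easy to read but slow on large images. As a hedged sketch, the same sliding-window sum can be written without Python loops using NumPy's `sliding_window_view` (available from NumPy 1.20); `convolve_valid` is an illustrative name. Note that, like the loop version, this does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def convolve_valid(image, weight):
    """Same sliding-window sum as the loop-based convolve(), without Python loops."""
    # windows has shape (H-h+1, W-w+1, h, w): one h-by-w patch per output pixel
    windows = sliding_window_view(image.astype(float), weight.shape)
    out = np.einsum('ijkl,kl->ij', windows, weight)
    return np.rint(out.clip(0, 255)).astype('uint8')


img = np.full((5, 5), 9, dtype='uint8')     # made-up constant image
avg3 = np.ones((3, 3)) / 9
print(convolve_valid(img, avg3))
```

For symmetric kernels such as the averaging and Gaussian filters the missing flip makes no difference; for the Sobel and Prewitt kernels it only reverses the sign of the response.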

12. The butterfly effect: generating trajectories of the Lorenz system

#!/usr/bin/python
# -*- coding:utf-8 -*-

from scipy.integrate import odeint
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def lorenz(state, t):
    sigma = 10
    rho = 28
    beta = 8/3.  # same value as in lorenz_trajectory, so both panels integrate the same system
    x, y, z = state
    return np.array([sigma*(y-x), x*(rho-z)-y, x*y-beta*z])


def lorenz_trajectory(s0, N):
    sigma = 10
    rho = 28
    beta = 8/3.

    delta = 0.001
    s = np.empty((N+1, 3))
    s[0] = s0
    for i in np.arange(1, N+1):
        x, y, z = s[i-1]
        a = np.array([sigma*(y-x), x*(rho-z)-y, x*y-beta*z])
        s[i] = s[i-1] + a * delta
    return s


if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    # Figure 1
    s0 = (0., 1., 0.)
    t = np.arange(0, 30, 0.01)
    s = odeint(lorenz, s0, t)
    plt.figure(figsize=(12, 8), facecolor='w')
    plt.subplot(121, projection='3d')
    plt.plot(s[:, 0], s[:, 1], s[:, 2], c='g')
    plt.title('Result from odeint', fontsize=16)

    s = lorenz_trajectory(s0, 40000)
    plt.subplot(122, projection='3d')
    plt.plot(s[:, 0], s[:, 1], s[:, 2], c='r')
    plt.title('Result from forward-Euler accumulation', fontsize=16)

    plt.tight_layout(pad=1, rect=(0,0,1,0.98))
    plt.suptitle('Lorenz system', fontsize=20)
    plt.show()

    # Figure 2
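    # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib; add_subplot does
    ax = plt.figure(figsize=(8, 8)).add_subplot(projection='3d')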
    s0 = (0., 1., 0.)
    s1 = lorenz_trajectory(s0, 50000)
    s0 = (0., 1.0001, 0.)
    s2 = lorenz_trajectory(s0, 50000)
    # trajectories
    ax.plot(s1[:, 0], s1[:, 1], s1[:, 2], c='g', lw=0.4)
    ax.plot(s2[:, 0], s2[:, 1], s2[:, 2], c='r', lw=0.4)
    # start points
    ax.scatter(s1[0, 0], s1[0, 1], s1[0, 2], c='g', s=50, alpha=0.5)
    ax.scatter(s2[0, 0], s2[0, 1], s2[0, 2], c='r', s=50, alpha=0.5)
    # end points
    ax.scatter(s1[-1, 0], s1[-1, 1], s1[-1, 2], c='g', s=100)
    ax.scatter(s2[-1, 0], s2[-1, 1], s2[-1, 2], c='r', s=100)
    ax.set_title('Lorenz equations and sensitivity to initial conditions', fontsize=20)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    plt.show()
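
The second figure shows two trajectories whose starting points differ by 0.0001 in y. The sketch below puts a number on that divergence, using the same forward-Euler scheme and the classic parameters (sigma=10, rho=28, beta=8/3); the function name `lorenz_euler` and the step count are assumptions for illustration.

```python
import numpy as np


def lorenz_euler(s0, n_steps, dt=0.001, sigma=10.0, rho=28.0, beta=8 / 3):
    """Forward-Euler integration of the Lorenz equations."""
    s = np.empty((n_steps + 1, 3))
    s[0] = s0
    for i in range(1, n_steps + 1):
        x, y, z = s[i - 1]
        derivative = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
        s[i] = s[i - 1] + dt * derivative
    return s


a = lorenz_euler((0.0, 1.0, 0.0), 30000)
b = lorenz_euler((0.0, 1.0001, 0.0), 30000)
gap = np.linalg.norm(a - b, axis=1)      # distance between the two trajectories at each step
print(gap[0], gap.max())
```

The initial gap is tiny, yet over the integration window it grows to the scale of the attractor itself, which is the butterfly effect the heading refers to.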

13. Drawing a fractal: the Mandelbrot set

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm


def divergent(c):
    z = 0
    i = 0
    while i < 100:
        z = z**2 + c
        if abs(z) > 2:
            break
        i += 1
    return i


def draw_mandelbrot(center_x, center_y, size):
    x1, x2 = center_x - size, center_x + size
    y1, y2 = center_y - size, center_y + size
    x, y = np.mgrid[x1:x2:500j, y1:y2:500j]
    c = x + y * 1j
    divergent_ = np.frompyfunc(divergent, 1, 1)
    mandelbrot = divergent_(c)
    mandelbrot = mandelbrot.astype(np.float64)    # the ufunc returns an object array
    print(size, mandelbrot.max(), mandelbrot.min())
    plt.pcolormesh(x, y, mandelbrot, cmap=cm.jet)
    plt.xlim((np.min(x), np.max(x)))
    plt.ylim((np.min(y), np.max(y)))
    plt.savefig(str(size)+'.png')
    plt.show()


if __name__ == '__main__':
    draw_mandelbrot(0, 0, 2)

    # Alternative point of interest: (0.33987, -0.575578)
    interested_x, interested_y = 0.27322626, 0.595153338
    for size in np.logspace(0, -3, 4, base=10):
        print(size)
        draw_mandelbrot(interested_x, interested_y, size)
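
`np.frompyfunc` calls `divergent` once per pixel in Python. As a sketch under the same escape rule (stop once |z| > 2, at most 100 iterations), the whole grid can instead be iterated at once with a boolean mask; `escape_counts` is an illustrative name and is meant to return the same counts as `divergent`.

```python
import numpy as np


def escape_counts(c, max_iter=100):
    """Number of iterations of z -> z**2 + c completed before |z| exceeds 2."""
    c = np.asarray(c, dtype=complex)
    z = np.zeros_like(c)
    counts = np.full(c.shape, max_iter, dtype=int)
    active = np.ones(c.shape, dtype=bool)          # points that have not escaped yet
    for i in range(max_iter):
        z[active] = z[active] ** 2 + c[active]     # only update points still in play
        escaped = active & (np.abs(z) > 2)
        counts[escaped] = i
        active &= ~escaped
    return counts


x, y = np.mgrid[-2:2:200j, -2:2:200j]
grid = escape_counts(x + 1j * y)
print(grid.min(), grid.max())
```

Freezing escaped points also avoids the overflow warnings that appear if every point keeps being squared after it has left the disc.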

Part Three

1. The bookmaker and the odds

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from time import time
import math


def is_prime(x):
    return 0 not in [x % i for i in range(2, int(math.sqrt(x)) + 1)]


def is_prime3(x):
    flag = True
    for p in p_list2:
        if p > math.sqrt(x):
            break
        if x % p == 0:
            flag = False
            break
    if flag:
        p_list2.append(x)
    return flag


if __name__ == "__main__":
    a = 2
    b = 1000

    # Method 1: direct computation
    t = time()
    p = [p for p in range(a, b) if 0 not in [p % d for d in range(2, int(math.sqrt(p)) + 1)]]
    print(time() - t)
    print(p)

    # Method 2: filter
    t = time()
    p = list(filter(is_prime, list(range(a, b))))
    print(time() - t)
    print(p)

    # Method 3: filter with a lambda
    t = time()
    is_prime2 = (lambda x: 0 not in [x % i for i in range(2, int(math.sqrt(x)) + 1)])
    p = list(filter(is_prime2, list(range(a, b))))
    print(time() - t)
    print(p)

    # Method 4: trial division by the primes found so far
    t = time()
    p_list = []
    for i in range(2, b):
        flag = True
        for p in p_list:
            if p > math.sqrt(i):
                break
            if i % p == 0:
                flag = False
                break
        if flag:
            p_list.append(i)
    print(time() - t)
    print(p_list)

    # Method 5: the same idea as a function, used with filter
    p_list2 = []
    t = time()
    list(filter(is_prime3, list(range(2, b))))
    print(time() - t)
    print(p_list2)

    print('---------------------')
    a = 750
    b = 900
    p_list2 = []
    np.set_printoptions(linewidth=150)
    p = np.array(list(filter(is_prime3, list(range(2, b+1)))))
    p = p[p >= a]
    print(p)
    p_rate = float(len(p)) / float(b-a+1)
    print('Probability of a prime:', p_rate, end='\t  ')
    print('Fair odds:', 1/p_rate)
    print('Probability of a composite:', 1-p_rate, end='\t  ')
    print('Fair odds:', 1 / (1-p_rate))

    alpha1 = 5.5 * p_rate
    alpha2 = 1.1 * (1 - p_rate)
    print('Odds coefficients:', alpha1, alpha2)
    print(1 - (alpha1 + alpha2) / 2)
    print((1 - alpha1) * p_rate + (1 - alpha2) * (1 - p_rate))
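
The script combines two ideas: finding the primes in a range and turning their frequency into odds. A compact sketch of both follows. It assumes the 5.5 in the script is the total payout per unit staked on "prime"; that reading, and the helper names `primes_upto` and `house_edge`, are assumptions rather than something the original states.

```python
import numpy as np


def primes_upto(n):
    """Sieve of Eratosthenes: all primes <= n (n >= 2)."""
    is_p = np.ones(n + 1, dtype=bool)
    is_p[:2] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_p[i]:
            is_p[i * i::i] = False        # cross out multiples of i
    return np.flatnonzero(is_p)


def house_edge(p_win, payout):
    """Bookmaker's expected profit per unit staked when a win returns `payout` in total."""
    return 1 - payout * p_win


primes = primes_upto(900)
p = np.mean(np.isin(np.arange(750, 901), primes))   # chance that a number in [750, 900] is prime
print('fair payout:', 1 / p, 'edge at a payout of 5.5:', house_edge(p, 5.5))
```

A payout of exactly 1/p makes the bet fair (edge 0); any payout below that leaves the bookmaker a positive expected margin.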

2. Reading and processing data with Pandas

3. Data cleaning and correction

4. Fuzzy string matching with Fuzzywuzzy

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


def enum_row(row):
    print(row['state'])


def find_state_code(row):
    if row['state'] != 0:
        print(process.extractOne(row['state'], states, score_cutoff=80))


def capital(s):
    return s.capitalize()


def correct_state(row):
    if row['state'] != 0:
        state = process.extractOne(row['state'], states, score_cutoff=80)
        if state:
            state_name = state[0]
            return ' '.join(map(capital, state_name.split(' ')))
    return row['state']


def fill_state_code(row):
    if row['state'] != 0:
        state = process.extractOne(row['state'], states, score_cutoff=80)
        if state:
            state_name = state[0]
            return state_to_code[state_name]
    return ''


if __name__ == "__main__":
    pd.set_option('display.width', 200)
    data = pd.read_excel('..\\sales.xlsx', sheet_name='sheet1', header=0)
    print('data.head() = \n', data.head())
    print('data.tail() = \n', data.tail())
    print('data.dtypes = \n', data.dtypes)
    print('data.columns = \n', data.columns)
    for c in data.columns:
        print(c, end=' ')
    print()
    data['total'] = data['Jan'] + data['Feb'] + data['Mar']
    print(data.head())
    print(data['Jan'].sum())
    print(data['Jan'].min())
    print(data['Jan'].max())
    print(data['Jan'].mean())

    print('=============')
    # append a summary row
    s1 = data[['Jan', 'Feb', 'Mar', 'total']].sum()
    print(s1)
    s2 = pd.DataFrame(data=s1)
    print(s2)
    print(s2.T)
    print(s2.T.reindex(columns=data.columns))
    # equivalently:
    s = pd.DataFrame(data=data[['Jan', 'Feb', 'Mar', 'total']].sum()).T
    s = s.reindex(columns=data.columns, fill_value=0)
    print(s)
    data = pd.concat([data, s], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    data = data.rename(index={15:'Total'})
    print(data.tail())

    # using apply
    print('==============using apply==========')
    data.apply(enum_row, axis=1)

    state_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP", "GUAM": "GU",
                     "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC", "HAWAII": "HI",
                     "NEW YORK": "NY", "CALIFORNIA": "CA", "ALABAMA": "AL", "IDAHO": "ID",
                     "FEDERATED STATES OF MICRONESIA": "FM",
                     "Armed Forces Americas": "AA", "DELAWARE": "DE", "ALASKA": "AK", "ILLINOIS": "IL",
                     "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD", "CONNECTICUT": "CT", "MONTANA": "MT",
                     "MASSACHUSETTS": "MA",
                     "PUERTO RICO": "PR", "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH", "MARYLAND": "MD",
                     "NEW MEXICO": "NM",
                     "MISSISSIPPI": "MS", "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO",
                     "Armed Forces Middle East": "AE",
                     "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV", "WASHINGTON": "WA",
                     "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA", "VIRGIN ISLANDS": "VI",
                     "MARSHALL ISLANDS": "MH",
                     "WYOMING": "WY", "OHIO": "OH", "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV",
                     "LOUISIANA": "LA",
                     "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI",
                     "NORTH DAKOTA": "ND",
                     "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK", "KENTUCKY": "KY",
                     "RHODE ISLAND": "RI",
                     "DISTRICT OF COLUMBIA": "DC", "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}
    states = list(state_to_code.keys())
    print(fuzz.ratio('Python Package', 'PythonPackage'))
    print(process.extract('Mississippi', states))
    print(process.extract('Mississipi', states, limit=1))
    print(process.extractOne('Mississipi', states))
    data.apply(find_state_code, axis=1)

    print('Before Correct State:\n', data['state'])
    data['state'] = data.apply(correct_state, axis=1)
    print('After Correct State:\n', data['state'])
    data.insert(5, 'State Code', np.nan)
    data['State Code'] = data.apply(fill_state_code, axis=1)
    print(data)

    # group by
    print('==============group by================')
    print(data.groupby('State Code'))
    print('All Columns:\n')
    print(data.groupby('State Code').sum(numeric_only=True))
    print('Short Columns:\n')
    print(data[['State Code', 'Jan', 'Feb', 'Mar', 'total']].groupby('State Code').sum())

    # write the result to a file (recent pandas can no longer write the legacy .xls format)
    data.to_excel('sales_result.xlsx', sheet_name='Sheet1', index=False)
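
The correction step relies on approximate string matching: a misspelled state name is replaced by the closest known name if the match is good enough. The sketch below shows the same idea with the standard library's `difflib` as a stand-in for Fuzzywuzzy, so the scores are on a 0 to 1 scale and will not be identical to Fuzzywuzzy's; `closest_state` and the short candidate list are illustrative.

```python
import difflib


def closest_state(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, or None if nothing reaches the cutoff."""
    by_upper = {c.upper(): c for c in candidates}     # compare case-insensitively
    hits = difflib.get_close_matches(name.upper(), list(by_upper), n=1, cutoff=cutoff)
    return by_upper[hits[0]] if hits else None


states = ['MISSISSIPPI', 'MISSOURI', 'MICHIGAN', 'NEW YORK']
print(closest_state('Mississipi', states))
print(closest_state('new yrok', states))
print(closest_state('Atlantis', states))
```

Returning `None` below the cutoff plays the same role as `score_cutoff=80` in the script: an unrecognizable value is left alone rather than forced onto the nearest state.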

5. Feature extraction: principal component analysis (PCA)

Reference blog post

# -*- coding:utf-8 -*-

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches


def extend(a, b):
    return 1.05*a-0.05*b, 1.05*b-0.05*a


if __name__ == '__main__':
    stype = 'pca'
    pd.set_option('display.width', 200)
    data = pd.read_csv('..\\iris.data', header=None)
    columns = np.array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type'])
    data.rename(columns=dict(list(zip(np.arange(5), columns))), inplace=True)
    data['type'] = pd.Categorical(data['type']).codes
    print(data.head(5))
    x = data[columns[:-1]]
    y = data[columns[-1]]

    if stype == 'pca':
        pca = PCA(n_components=2, whiten=True, random_state=0)
        x = pca.fit_transform(x)
        print('Variance along each component:', pca.explained_variance_)
        print('Explained variance ratio:', pca.explained_variance_ratio_)
        x1_label, x2_label = 'Component 1', 'Component 2'
        title = 'Iris data: PCA dimensionality reduction'
    else:
        fs = SelectKBest(chi2, k=2)
        # fs = SelectPercentile(chi2, percentile=60)
        fs.fit(x, y)
        idx = fs.get_support(indices=True)
        print('fs.get_support() = ', idx)
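        x = x.iloc[:, idx]  # idx holds integer positions, not column labels
        x = x.values    # convert the DataFrame to an ndarray for the code below
        x1_label, x2_label = columns[idx]
        title = 'Iris data: feature selection'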
    print(x[:5])
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    mpl.rcParams['font.sans-serif'] = 'SimHei'
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.scatter(x[:, 0], x[:, 1], s=30, c=y, marker='o', cmap=cm_dark)
    plt.grid(True, ls=':', color='k')
    plt.xlabel(x1_label, fontsize=12)
    plt.ylabel(x2_label, fontsize=12)
    plt.title(title, fontsize=15)
    # plt.savefig('1.png')
    plt.show()

    x, x_test, y, y_test = train_test_split(x, y, train_size=0.7)
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=2, include_bias=True)),
        ('lr', LogisticRegressionCV(Cs=np.logspace(-3, 4, 8), cv=5, fit_intercept=False))
    ])
    model.fit(x, y)
    print('Best C:', model.named_steps['lr'].C_)
    y_hat = model.predict(x)
    print('Training accuracy:', metrics.accuracy_score(y, y_hat))
    y_test_hat = model.predict(x_test)
    print('Test accuracy:', metrics.accuracy_score(y_test, y_test_hat))

    N, M = 500, 500     # number of samples along each axis
    x1_min, x1_max = extend(x[:, 0].min(), x[:, 0].max())   # range of column 0
    x2_min, x2_max = extend(x[:, 1].min(), x[:, 1].max())   # range of column 1
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)                    # grid of sample points
    x_show = np.stack((x1.flat, x2.flat), axis=1)   # points to classify
    y_hat = model.predict(x_show)  # predicted class
    y_hat = y_hat.reshape(x1.shape)  # reshape to match the grid
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # decision regions
    plt.scatter(x[:, 0], x[:, 1], s=30, c=y, edgecolors='k', cmap=cm_dark)  # training samples
    plt.xlabel(x1_label, fontsize=12)
    plt.ylabel(x2_label, fontsize=12)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid(True, ls=':', color='k')
    # optional extra shapes
    # a = mpl.patches.Wedge(((x1_min+x1_max)/2, (x2_min+x2_max)/2), 1.5, 0, 360, width=0.5, alpha=0.5, color='r')
    # plt.gca().add_patch(a)
    patches = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
               mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
               mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
    plt.legend(handles=patches, fancybox=True, framealpha=0.8, loc='lower right')
    plt.title('Iris: logistic regression classification', fontsize=15)
    plt.show()
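
To show what `PCA(n_components=2)` computes, here is a minimal NumPy sketch of PCA through the SVD of the centered data. It uses synthetic data in place of the iris file and leaves out whitening, which the script enables and which would additionally rescale each component to unit variance; the function name `pca` and the data are illustrative.

```python
import numpy as np


def pca(x, n_components):
    """Project x onto its leading principal components (no whitening)."""
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                       # rows are unit-length directions
    explained_variance = s[:n_components] ** 2 / (len(x) - 1)
    return centered @ components.T, components, explained_variance


rng = np.random.default_rng(0)
# synthetic stand-in for the four iris features, with very different spreads
x = rng.normal(size=(150, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
z, components, var = pca(x, 2)
print(var)
```

The explained variances are the largest eigenvalues of the sample covariance matrix, which is what `explained_variance_` reports in the script.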

6. One-hot encoding

# coding:utf-8

import pandas as pd

if __name__ == '__main__':
    from sklearn.preprocessing import OneHotEncoder
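    ohe = OneHotEncoder()  # the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2
    x = [[1, 2, 1],
         [1, 2, 0],
         [2, 0, 2],
         [0, 2, 2]]
    x_onehot = ohe.fit_transform(x).toarray()  # densify the sparse result; works in every version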
    print(x_onehot)
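
For readers who want to see the encoding spelled out, the following sketch reproduces it with NumPy alone on the same 4x3 matrix: each column is mapped to the sorted list of its distinct values and replaced by indicator columns. `one_hot` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np


def one_hot(x):
    """One-hot encode each column of an integer matrix and concatenate the results."""
    x = np.asarray(x)
    blocks = []
    for col in x.T:
        cats, idx = np.unique(col, return_inverse=True)   # idx[i] is the position of col[i] in cats
        blocks.append(np.eye(len(cats))[idx])             # pick one row of the identity per sample
    return np.hstack(blocks)


x = [[1, 2, 1],
     [1, 2, 0],
     [2, 0, 2],
     [0, 2, 2]]
encoded = one_hot(x)
print(encoded)
```

The columns take 3, 2 and 3 distinct values, so the result has 8 columns, and every row contains exactly three ones, one per original feature.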

# -*- coding:utf-8 -*-

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
import matplotlib as mpl
import matplotlib.pyplot as plt


if __name__ == '__main__':
    pd.set_option('display.width', 300)
    pd.set_option('display.max_columns', 300)

    data = pd.read_csv('..\\car.data', header=None)
    n_columns = len(data.columns)
    columns = ['buy', 'maintain', 'doors', 'persons', 'boot', 'safety', 'accept']
    new_columns = dict(list(zip(np.arange(n_columns), columns)))
    data.rename(columns=new_columns, inplace=True)
    print(data.head(10))

    # one-hot encoding
    x = pd.DataFrame()
    for col in columns[:-1]:
        t = pd.get_dummies(data[col])
        t = t.rename(columns=lambda x: col+'_'+str(x))
        x = pd.concat((x, t), axis=1)
    print(x.head(10))
    # print x.columns
    y = np.array(pd.Categorical(data['accept']).codes)
    # y[y == 1] = 0
    # y[y >= 2] = 1

    x, x_test, y, y_test = train_test_split(x, y, train_size=0.7)
    clf = LogisticRegressionCV(Cs=np.logspace(-3, 4, 8), cv=5)
    clf.fit(x, y)
    print(clf.C_)
    y_hat = clf.predict(x)
    print('Training accuracy:', metrics.accuracy_score(y, y_hat))
    y_test_hat = clf.predict(x_test)
    print('Test accuracy:', metrics.accuracy_score(y_test, y_test_hat))
    n_class = len(np.unique(y))
    if n_class > 2:
        y_test_one_hot = label_binarize(y_test, classes=np.arange(n_class))
        y_test_one_hot_hat = clf.predict_proba(x_test)
        fpr, tpr, _ = metrics.roc_curve(y_test_one_hot.ravel(), y_test_one_hot_hat.ravel())
        print('Micro AUC:\t', metrics.auc(fpr, tpr))
        auc = metrics.roc_auc_score(y_test_one_hot, y_test_one_hot_hat, average='micro')
        print('Micro AUC(System):\t', auc)
        auc = metrics.roc_auc_score(y_test_one_hot, y_test_one_hot_hat, average='macro')
        print('Macro AUC:\t', auc)
    else:
        fpr, tpr, _ = metrics.roc_curve(y_test.ravel(), y_test_hat.ravel())
        print('AUC:\t', metrics.auc(fpr, tpr))
        auc = metrics.roc_auc_score(y_test, y_test_hat)
        print('AUC(System):\t', auc)

    mpl.rcParams['font.sans-serif'] = 'SimHei'
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 7), dpi=80, facecolor='w')
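    # label the curve with its own area; in the multi-class branch `auc` holds the macro average instead
    plt.plot(fpr, tpr, 'r-', lw=2, label='AUC=%.4f' % metrics.auc(fpr, tpr))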
    plt.legend(loc='lower right')
    plt.xlim((-0.01, 1.02))
    plt.ylim((-0.01, 1.02))
    plt.xticks(np.arange(0, 1.1, 0.1))
    plt.yticks(np.arange(0, 1.1, 0.1))
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.grid(True, ls=':')
    plt.title('ROC curve and AUC', fontsize=18)
    plt.show()
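
The multi-class branch prints both a micro and a macro AUC. The sketch below shows how the two differ, using a plain NumPy AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The helper names and the four-sample example are made up for illustration.

```python
import numpy as np


def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]                 # every positive against every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def micro_macro_auc(y_onehot, proba):
    """Micro: pool every (sample, class) pair. Macro: average the per-class AUCs."""
    micro = auc_score(y_onehot.ravel(), proba.ravel())
    macro = np.mean([auc_score(y_onehot[:, k], proba[:, k]) for k in range(y_onehot.shape[1])])
    return micro, macro


y = np.eye(3, dtype=int)[[0, 1, 2, 1]]            # made-up labels: 4 samples, 3 classes
proba = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.3, 0.2]])
micro, macro = micro_macro_auc(y, proba)
print(micro, macro)
```

In this example every class is ranked perfectly on its own, so the macro AUC is 1, while pooling all scores lowers the micro AUC; the two averages answer different questions and need not agree.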
