Machine learning tips and modeling code for common algorithms

I. Common machine learning tricks

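1. Automatic label encoding:
    Option 1: pd.Categorical(data[4]).codes
    Option 2: LabelEncoder, which assigns integer codes following the sorted order of the unique values
        sklearn: from sklearn.preprocessing import LabelEncoder
        LabelEncoder().fit_transform(data[4])
    Option 3: pd.factorize, which maps nominal data to integers in order of first appearance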
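A minimal sketch comparing the three encoders above on a toy column. The `colors` values are made up for illustration. It suggests that `pd.Categorical(...).codes` and `LabelEncoder` both follow the sorted order of the unique values, while `pd.factorize` numbers values in order of first appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy nominal column (hypothetical values, not from the original data)
colors = pd.Series(["red", "blue", "green", "blue"])

# Categories are sorted (blue, green, red), so codes follow sorted order
cat_codes = pd.Categorical(colors).codes

# LabelEncoder also sorts the unique values before assigning codes
le_codes = LabelEncoder().fit_transform(colors)

# factorize assigns codes in order of first appearance (red, blue, green)
fact_codes, uniques = pd.factorize(colors)

print(cat_codes.tolist())   # [2, 0, 1, 0]
print(le_codes.tolist())    # [2, 0, 1, 0]
print(fact_codes.tolist())  # [0, 1, 2, 1]
```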
    
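2. One-hot encoding: one-hot encoding makes distance computations between categorical values more reasonable, and turning features into 0/1 indicators can speed up computation. When the number of categories is small, it is a good first choice. When the number of categories is large, the feature space becomes very large; in that case PCA is commonly used to reduce the dimensionality, and the one-hot encoding + PCA combination is useful in practice.
    pd.get_dummies, OneHotEncoder
    Related classes in sklearn.preprocessing: OneHotEncoder, LabelEncoder, LabelBinarizer, MultiLabelBinarizer
    https://blog.csdn.net/qq_40587575/article/details/81118610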
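As a small illustration of the one-hot idea, the sketch below uses `pd.get_dummies` on a made-up frame. The column names and values are assumptions chosen only for the example.

```python
import pandas as pd

# Toy frame with one nominal column (hypothetical values)
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"], "size": [1, 2, 3, 4]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded.columns.tolist())
# ['size', 'color_blue', 'color_green', 'color_red']
print(encoded.loc[0].tolist())
# [1, 0, 0, 1]
```

Each row has exactly one indicator set to 1, which is why the feature count grows with the number of categories.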

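3. Data normalization:
    MinMaxScaler (sklearn) rescales each feature (column) to [0, 1]:
        from sklearn.preprocessing import MinMaxScaler
        x = MinMaxScaler().fit_transform(x)
    Normalizer rescales each sample (row) to unit norm:
        from sklearn.preprocessing import Normalizer
        scaler = Normalizer().fit(X_train)
        normalized_X = scaler.transform(X_train)
        normalized_X_test = scaler.transform(X_test)
    Another approach (standardization) is in item 19.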
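The two scalers above behave quite differently, which a tiny example can make concrete. This is only a sketch on made-up numbers: `MinMaxScaler` works column by column, while `Normalizer` works row by row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

# MinMaxScaler works per column: (x - min) / (max - min)
minmax = MinMaxScaler().fit_transform(X)

# Normalizer works per row: each sample is divided by its L2 norm
unit = Normalizer(norm="l2").fit_transform(np.array([[3.0, 4.0]]))

print(minmax[:, 0])  # first column rescaled to 0, 0.5, 1
print(unit)          # the row (3, 4) becomes (0.6, 0.8)
```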
    
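4. Simple binning: pd.cut (equal-width bins) and pd.qcut (quantile bins) discretize a continuous variable; the resulting bins can then be processed like any other categorical feature, for example by encoding them.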
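A short sketch of the difference between the two binning functions, using a made-up series with one outlier. `labels=False` returns the integer bin index instead of interval labels.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins over [1, 100]: the outlier leaves most values in the first bin
equal_width = pd.cut(values, bins=3, labels=False)

# Quantile bins: each bin receives roughly the same number of samples
equal_freq = pd.qcut(values, q=3, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_freq.tolist())   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```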

5. Target variable y: when y is discrete the task is classification; when y is continuous the task is regression.

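6. Logistic regression is a generalized linear model that treats the log-odds as a linear function of the features; despite the name "regression", it solves classification problems.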

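7. Maximizing the log-likelihood is the same as minimizing its negative. That negative term is the objective function (obj) in θ, and because it is minimized it is also called the loss function.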
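To make the loss in items 6 and 7 concrete, here is a minimal NumPy sketch of the negative log-likelihood of logistic regression. The function names and the toy data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """Negative log-likelihood of logistic regression (the loss to minimize)."""
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (hypothetical): an intercept column plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

loss_zero = neg_log_likelihood(np.zeros(2), X, y)            # every p is 0.5
loss_good = neg_log_likelihood(np.array([0.0, 2.0]), X, y)   # separates the classes
print(loss_zero, loss_good)
```

With θ = 0 every predicted probability is 0.5, so the loss equals 4·ln 2; a θ that separates the classes gives a smaller loss, which is what the minimization looks for.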

8. Overfitting: the model fits the training set too closely and therefore predicts poorly on other data.

9. Set the maximum display width per line (common display options):
    pd.set_option('display.width', 100)
    np.set_printoptions(linewidth=100, suppress=True)
    pd.set_option('expand_frame_repr', False)

10. Ignore warnings (the pink warning block shown when calling a package):
    import warnings
    from statsmodels.tools.sm_exceptions import HessianInversionWarning
    warnings.filterwarnings(action='ignore', category=HessianInversionWarning)
    
11. Shift data (axis decides whether the shift runs down the rows or across the columns): x.shift(periods=n)  # n is the shift amount, default 1
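A quick sketch of `shift` on made-up data, including the common "value minus previous value" differencing pattern:

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10])

shifted = s.shift(periods=1)   # values move down one row; the first becomes NaN
first_diff = s - shifted       # first-order difference

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
shifted_cols = df.shift(periods=1, axis=1)  # values move one column to the right

print(first_diff.tolist())  # [nan, 2.0, 3.0, 4.0]
```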

12. Bagging and boosting (ensemble learning algorithms):
    https://blog.csdn.net/chenyukuai6625/article/details/73692347   
        
13. Characteristics of and differences between GBDT, AdaBoost and XGBoost (a very good explanation)
    https://blog.csdn.net/chengfulukou/article/details/76906710

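14. A single decision tree serves as a weak classifier. A random forest is an equally weighted combination (ensemble) of many such weak classifiers. GBDT uses the gradient of the loss to locate where the current model falls short (so several kinds of base learners can be used). AdaBoost raises the weights of misclassified samples and lowers the weights of weak classifiers with high error. XGBoost is a gradient boosting implementation that also uses second-order gradient information and regularization.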

15. Save a model:
        import joblib  # standalone package; older scikit-learn exposed it as sklearn.externals.joblib
        model_file = 'F:\\data\\adult.pkl'
        joblib.dump(model, model_file)  # save
        model = joblib.load(model_file)  # load
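A self-contained round trip, as a sketch: it trains a small tree on the iris data bundled with scikit-learn and writes to a temporary directory rather than a fixed drive path. Both choices are assumptions made so the example runs anywhere.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Write to a temporary directory instead of a hard-coded drive path
model_file = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, model_file)      # save
restored = joblib.load(model_file)  # load

same = bool((restored.predict(X) == model.predict(X)).all())
print(same)  # True
```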

16. Pruning:
    https://blog.csdn.net/zfan520/article/details/82454814
        
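17. Generate synthetic data:
    import numpy as np
    from scipy import stats
    N = 200  # samples per cluster (example value)
    x = np.empty((4*N,2))  # two-dimensional samples
    means = [(-1, 1), (1, 1), (1, -1), (-1, -1)]
    sigmas = [np.eye(2), 2*np.eye(2), np.diag((1,2)), np.array(((3, 2), (2, 3)))]
    for i in range(4):
        mn = stats.multivariate_normal(means[i], sigmas[i]*0.1)  # multivariate normal with the given mean and covariance
        x[i*N:(i+1)*N, :] = mn.rvs(N)  # draw N samples (rvs = random variable samples)
    Or:
    from sklearn.datasets import make_blobs
    x, y = make_blobs(800, n_features=2, centers=means, cluster_std=(0.1, 0.2, 0.3, 0.4))  # 800 samples, 2 features, 4 centers; cluster_std is the standard deviation of each cluster (default 1.0)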
    
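18. Randomization: shuffling and random sampling
    N=100
    test = np.arange(N)
    np.random.shuffle(test)  # shuffles in place
    Another way:
    test=np.random.randint(0,N,size=1000)  # sampling with replacement, so values may repeat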
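For comparison, a sketch using NumPy's `Generator` interface, which covers shuffling and sampling with or without repeats. The seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducible results
N = 100

perm = rng.permutation(N)                           # shuffled copy of 0..N-1
subset = rng.choice(N, size=10, replace=False)      # 10 distinct indices
with_repl = rng.choice(N, size=1000, replace=True)  # repeats are unavoidable here

print(perm[:5], subset)
```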

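19. Standardization: https://blog.csdn.net/quiet_girl/article/details/72517053
    from sklearn.preprocessing import StandardScaler
    # Standardize so that every feature has mean 0 and variance 1, so that features with very large values do not dominate the prediction
    ss = StandardScaler()
    # fit_transform() learns the mean and variance from the training data, then standardizes it
    X_train = ss.fit_transform(X_train)
    # transform() standardizes the test data with the training-set statistics
    X_test = ss.transform(X_test)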
    
20. Count the values in a column: data.Fare.value_counts()

21. Histogram and box plot
    train_data['Age'].hist(bins=70)
    train_data.boxplot(column='Age', showfliers=False)
    
22. Regular expressions with the re module: https://www.cnblogs.com/MrFiona/p/5954084.html
        
23. Difference between apply, map and applymap
    import pandas as pd
    import numpy as np
    from pandas import DataFrame
    from pandas import Series
    df1= DataFrame({
                    "sales1":[-1,2,3],
                    "sales2":[3,-5,7],
                   })
    df1
    # 1. Use apply() to operate on a DataFrame row by row or column by column
    # 2. Use applymap() to operate on every element of a DataFrame; the result is a DataFrame
    # 3. Use map() to operate on every element of a Series
    df1.apply(lambda x :x.max()-x.min(),axis=1)
    # axis=1 applies the function to each row
    # The result has one value per row: the row maximum minus the row minimum
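The three methods side by side, as a sketch on the same small frame. One assumption to flag: recent pandas versions expose the element-wise DataFrame method as `DataFrame.map`, so the code picks whichever name is available.

```python
import pandas as pd

df1 = pd.DataFrame({"sales1": [-1, 2, 3], "sales2": [3, -5, 7]})

# apply: one result per row (axis=1) or per column (axis=0)
row_range = df1.apply(lambda r: r.max() - r.min(), axis=1)

# Series.map: element-wise on a single column
abs_sales1 = df1["sales1"].map(abs)

# Element-wise on the whole frame; applymap is named DataFrame.map in newer pandas
elementwise = df1.map if hasattr(df1, "map") else df1.applymap
doubled = elementwise(lambda v: v * 2)

print(row_range.tolist())   # [4, 7, 4]
print(abs_sales1.tolist())  # [1, 2, 3]
```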
    
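24. ['Fare'] is a Series and [['Fare']] is a DataFrame. transform returns a result aligned with the original rows, where each position holds the value computed by the function for that row's group; below, missing fares are filled with the mean fare of the corresponding Pclass group
    combined_data['Fare'] = combined_data['Fare'].fillna(combined_data.groupby('Pclass')['Fare'].transform('mean'))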
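A small worked example of filling missing values with a group mean via `transform`. The frame is made up; only the column names mirror the Titanic data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Fare":   [10.0, np.nan, 4.0, 6.0, np.nan],
})

# transform returns one value per original row: the mean Fare of that row's Pclass
group_mean = df.groupby("Pclass")["Fare"].transform("mean")
df["Fare"] = df["Fare"].fillna(group_mean)

print(df["Fare"].tolist())  # [10.0, 10.0, 4.0, 6.0, 5.0]
```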

25.  .drop(['Title_code'],axis=1,inplace=True)  # inplace=True modifies the original table instead of returning a copy

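26. heatmap
    # heatmap parameters:
    # data: a matrix-like dataset, either a numpy array or a pandas DataFrame. For a DataFrame, the index labels the heatmap rows and the columns label the heatmap columns
    # mask: controls whether a cell is shown. Default is None. For a boolean array or DataFrame, cells where the mask is True are not drawn
    # cmap: mapping from numbers to colors: a matplotlib colormap name or object, or a list of colors; the default depends on the center parameter
    # square: shape of each cell; default False, True gives square cells
    # ax: the axes to draw on; set it per subplot when drawing several subplots
    sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10,as_cmap=True),square=True, ax=ax)
        # np.zeros_like(corr, dtype=bool) builds an all-False array with the same shape as corr, so no cell is masked
    # Create a diverging palette: diverging_palette(h_neg, h_pos, s=75, l=50, sep=10, n=6, center='light', as_cmap=False)
    # h_neg and h_pos: hues for the negative and positive ends
        # optional parameters:
    # s: saturation
    # l: lightness
    # n: number of colors in the palette (when not returning a cmap)
    # center: whether the center of the palette is light or dark (the palette is the color bar on the right); default light
    # as_cmap: if True, return a matplotlib colormap object instead of a list of colors; running the function on its own makes this clear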
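A common use of `mask` is hiding the redundant upper triangle of a correlation matrix. The sketch below only builds the mask with NumPy on random data (an assumption for illustration); the commented line shows where it would be passed to `sns.heatmap`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = df.corr()

# True marks cells to hide: here the upper triangle including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))

# With seaborn installed, the mask would be passed as:
# sns.heatmap(corr, mask=mask, cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
print(mask)
```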

27. Display Chinese characters in matplotlib
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    
28. Simple imputation of missing data:
    import pandas as pd
    from sklearn.impute import SimpleImputer
    my_imputer = SimpleImputer()  # default strategy: column mean
    data_with_imputed_values = my_imputer.fit_transform(original_data)
    
    # make copy to avoid changing original data (when Imputing)
    new_data = original_data.copy()

    # make new columns indicating what will be imputed
    cols_with_missing = [col for col in new_data.columns
                                 if new_data[col].isnull().any()]
    for col in cols_with_missing:
        new_data[col + '_was_missing'] = new_data[col].isnull()

    # Imputation (keep the column names, including the new indicator columns)
    my_imputer = SimpleImputer()
    new_data = pd.DataFrame(my_imputer.fit_transform(new_data), columns=new_data.columns)
    
29. Return a generic type name
    import numbers

    def generic_type_name(v):
        """Return a descriptive type name that isn't Python specific. For example,
        an int value will return 'integer' rather than 'int'."""
        if isinstance(v, numbers.Integral):
            # Must come before real numbers check since integrals are reals too
            return 'integer'
        elif isinstance(v, numbers.Real):
            return 'float'
        elif isinstance(v, (tuple, list)):
            return 'list'
        elif isinstance(v, str):  # six.string_types in code that also supports Python 2
            return 'string'
        elif v is None:
            return 'null'
        else:
            return type(v).__name__

II. Time series

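1. Basic steps of time series modeling
    Collect the time series data of the observed system.
    Plot the data and check whether the series is stationary; a non-stationary series must first be differenced d times to turn it into a stationary one.
    With the stationary series from the previous step, compute its autocorrelation function (ACF) and partial autocorrelation function (PACF), and choose the orders p and q by analyzing the ACF and PACF plots.
    Build the ARIMA model from the d, p and q obtained above, then run diagnostic checks on the fitted model.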
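Steps 2 and 3 above can be sketched with plain NumPy. The helper names `difference` and `sample_acf` are hypothetical, and the random walk is synthetic; in practice a library such as statsmodels offers ready-made ACF/PACF tools.

```python
import numpy as np

def difference(x, d=1):
    """Apply d-th order differencing."""
    return np.diff(x, n=d)

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))  # random walk: non-stationary

acf_raw = sample_acf(walk, nlags=5)                    # decays very slowly
acf_diff = sample_acf(difference(walk, d=1), nlags=5)  # near zero after lag 0
print(acf_raw.round(2), acf_diff.round(2))
```

A slowly decaying ACF on the raw series hints that differencing is needed; after one difference the autocorrelations drop close to zero, which suggests d = 1 is enough for this synthetic series.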
    
    
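2. Time series: the whole process is explained in great detail                 https://www.jianshu.com/p/cced6617b423
    from datetime import datetime
    import warnings
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    # requires statsmodels < 0.13, where this module still provides ARIMA
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.tools.sm_exceptions import HessianInversionWarning

    def extend(a,b):
        return 1.05*a-0.05*b,1.05*b-0.05*a

    def date_parser(date):
        return datetime.strptime(date,'%Y-%m')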

    if __name__ == '__main__':
        warnings.filterwarnings(action='ignore', category=HessianInversionWarning)  # ignore this warning
        pd.set_option('display.width',100)
        np.set_printoptions(linewidth=100, suppress=True)

        f = rf'F:\data\AirPassengers.csv'
        data = pd.read_csv(f, header=0, parse_dates=['Month'], date_parser=date_parser, index_col=['Month'] )
        data.rename(columns={'#Passengers':'Passengers'}, inplace=True)

        x = data['Passengers'].astype(float)
        x = np.log(x)
        #     print(x.head())

        show = 'prime'#'diff','ma','prime'
        d = 1
        diff = x - x.shift(periods=d)
        ma = x.rolling(window=12).mean()  # rolling-window mean
        xma = x - ma

        p = 2
        q = 2
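        model = ARIMA(endog=x, order=(p, d, q))  # endog: the endogenous variable, i.e. the series being modeled; order = (AR order p, differencing order d, MA order q)
        arima = model.fit(disp=-1)  # disp < 1 suppresses the optimizer output
        prediction = arima.fittedvalues  # with d=1 in this older API, fitted values of the differenced series
        y = prediction.cumsum() + x.iloc[0]  # undo the differencing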
        mse = ((x-y)**2).mean()
        rmse = np.sqrt(mse)

        plt.figure(facecolor='w')
        if show == 'diff':
            plt.plot(x,'r-',lw=2,label='Original data')
            plt.plot(diff,'g-',lw=2,label='Order-{} difference'.format(d))
            #plt.plot(prediction,'r-',lw=2,label='Predicted data')
            title = 'Passenger count curve (log scale)'
        elif show == 'ma':
            #plt.plot(x, 'r-', lw=2, label='Original data')
            #plt.plot(ma, 'g-', lw=2, label='Moving average')
            plt.plot(xma, 'g-', lw=2, label='ln(original data) - ln(moving average)')
            plt.plot(prediction, 'r-', lw=2, label='Predicted data')
            title = 'Moving average and MA prediction'
        else:
            plt.plot(x, 'r-', lw=2, label='Original data')
            plt.plot(y, 'g-', lw=2, label='Predicted data')
            title = 'Log passenger count and prediction (AR=%d, d=%d, MA=%d): RMSE=%.4f' % (p, d, q, rmse)
        plt.legend(loc='lower right')
        plt.grid(True,ls=':')
        plt.title(title,fontsize=16)
        plt.tight_layout(pad=2)
        # plt.savefig('%s.png' % title)
        plt.show()

III. Decision trees and random forests

1. Decision tree and random forest modeling (including graphviz usage)
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pydotplus

if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False

    iris_feature_E = 'sepal length', 'sepal width'
    #, 'petal length', 'petal width'
    iris_feature = 'Sepal length', 'Sepal width', 'Petal length', 'Petal width'
    iris_class = 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'

    path = r'F:\data\iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    #x = data[list(range(4))]
    x = data[[0,1]]
    # y = pd.Categorical(data[4]).codes
    y = LabelEncoder().fit_transform(data[4])
    # For visualization, only the first two feature columns are used
    #x = x.iloc[:, :2]
    # x = x[[0,1]]
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=1)

    # Decision tree parameters
    # min_samples_split = 10: a node must contain at least 10 samples before it can (possibly) be split
    # min_samples_leaf = 10: a split is accepted only if every resulting child node contains at least 10 samples; otherwise the node is not split
    model = DecisionTreeClassifier(criterion='entropy')
    model.fit(x_train, y_train)
    y_test_hat = model.predict(x_test)      # test data
    print('accuracy_score:', accuracy_score(y_test, y_test_hat))
    

    # Save
    # dot -Tpng my.dot -o my.png
    # 1. Write to an open file
    with open(r'G:\Download\python\iris.dot', 'w') as f:
        tree.export_graphviz(model, out_file=f)
    # 2. Pass a file name
    # tree.export_graphviz(model, out_file='iris1.dot')
    # 3. Export as PDF and PNG
    dot_data = tree.export_graphviz(model, out_file=None, feature_names=iris_feature_E, class_names=iris_class,
                                    filled=True, rounded=True, special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf(r'G:\Download\python\iris.pdf')
    with open(r'G:\Download\python\iris.png', 'wb') as f:
        f.write(graph.create_png())

    # Plot
    N, M = 50, 50  # number of sample points along each axis
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
    x_show = np.stack((x1.flat, x2.flat), axis=1)  # points to predict
    print(x_show.shape)

    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_show_hat = model.predict(x_show)  # predicted classes
    print(y_show_hat.shape)
    print(y_show_hat)
    y_show_hat = y_show_hat.reshape(x1.shape)  # reshape to match the grid
    print(y_show_hat)
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_show_hat, cmap=cm_light)  # show the predicted regions
    plt.scatter(x_test[0], x_test[1], c=y_test.ravel(), edgecolors='k', s=100, zorder=10, cmap=cm_dark, marker='*')  # test data
    plt.scatter(x[0], x[1], c=y.ravel(), edgecolors='k', s=20, cmap=cm_dark)  # all data
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid(True, ls=':', color='#606060')
    plt.title('Decision tree classification of the iris data', fontsize=15)
    plt.show()

    # Prediction results on the test set
    y_test = y_test.reshape(-1)
    print(y_test_hat)
    print(y_test)
    result = (y_test_hat == y_test)   # True where the prediction is correct, False otherwise
    acc = np.mean(result)
    print('Accuracy: %.2f%%' % (100 * acc))

    # Overfitting: error rate against tree depth
    depth = np.arange(1, 15)
    err_list = []
    for d in depth:
        clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
        clf.fit(x_train, y_train)
        y_test_hat = clf.predict(x_test)  # test data
        result = (y_test_hat == y_test)  # True where the prediction is correct, False otherwise
        err = 1 - np.mean(result)
        err_list.append(err)
        print(d, ' error rate: %.2f%%' % (100 * err))
    plt.figure(facecolor='w')
    plt.plot(depth, err_list, 'ro-', markeredgecolor='k', lw=2)
    plt.xlabel('Decision tree depth', fontsize=13)
    plt.ylabel('Error rate', fontsize=13)
    plt.title('Decision tree depth and overfitting', fontsize=15)
    plt.grid(True, ls=':', color='#606060')
    plt.show()

2. Feature pairs and contour plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False    
    
    iris_feature = 'Sepal length', 'Sepal width', 'Petal length', 'Petal width'
    path = rf'G:\Download\python\iris.data'
    data = pd.read_csv(path, header=None)
    x_prime = data[list(range(4))]
    y = pd.Categorical(data[4]).codes
    x_prime__train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7,random_state=0)
    
    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(8,6), facecolor='#FFFFFF')
    for i,pair in enumerate(feature_pairs):  # enumerate yields (index, value) tuples
        x_train = x_prime__train[pair]
        x_test = x_prime_test[pair]
        
        model = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=3)
        model.fit(x_train, y_train)
        
        N,M=50,50
        x1_min, x2_min = x_train.min() 
        x1_max, x2_max = x_train.max()
        t1 = np.linspace(x1_min,x1_max,N)
        t2 = np.linspace(x2_min,x2_max,M)
        x1,x2 = np.meshgrid(t1,t2)  # grid sampling; x1 and x2 each have shape (50, 50)
        x_show = np.stack((x1.flat,x2.flat),axis=1)  # .flat flattens each grid to 1-D; stacking gives one (x1, x2) point per row
        
        y_train_pred = model.predict(x_train)
        acc_train = accuracy_score(y_train, y_train_pred)
        y_test_pred = model.predict(x_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        print('Features',iris_feature[pair[0]],iris_feature[pair[1]])
        print(acc_train,acc_test)
        
        cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
        
        y_hat = model.predict(x_show)
        y_hat = y_hat.reshape(x1.shape) 
        
        plt.subplot(2,3,i+1)
        plt.contour(x1,x2,y_hat,colors='k', levels=[0,1], antialiased=True, linewidths=1)  # contour lines
        plt.pcolormesh(x1,x2,y_hat,cmap=cm_light)
        plt.scatter(x_train[pair[0]],x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark, label='Training set')
        plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=80, marker='*', edgecolors='k', cmap=cm_dark, label='Test set')
        plt.xlabel(iris_feature[pair[0]], fontsize=12)
        plt.ylabel(iris_feature[pair[1]], fontsize=12)
        plt.legend(loc='upper right', fancybox=True, framealpha=0.3)
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.grid(True, ls=':', color='#606060')
    plt.suptitle('Decision tree classification of the iris data on feature pairs', fontsize=15)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.94))    # (left, bottom, right, top)
    plt.show()
    
3. Decision tree regression
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

if __name__ == '__main__':
    N=100
    x = np.random.rand(N)*6-3
    x.sort()
    y = np.sin(x) + np.random.rand(N)*0.05
    print(y)
    x = x.reshape(-1,1)
    
    dt = DecisionTreeRegressor(criterion='squared_error', max_depth=9)  # named 'mse' in older scikit-learn
    dt.fit(x,y)
    x_test = np.linspace(-3,3,100).reshape(-1,1)
    y_hat = dt.predict(x_test)
    
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(x, y, 'r*', markersize=10, markeredgecolor='k', label='Actual')
    plt.plot(x_test,y_hat, 'g-', linewidth=2, label='Predicted')
    plt.legend(loc='upper left',fontsize=12)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid(True, ls=':', color='#606060')
    plt.title('Decision tree regression', fontsize=15)
    plt.tight_layout(pad=2)
    plt.show()
    
    depth = [2,4,6,8,10]
    clr = 'rgbmy'
    dtr = DecisionTreeRegressor(criterion='squared_error')  # named 'mse' in older scikit-learn
    plt.figure(facecolor='w')
    plt.plot(x,y,'ro',ms=5, mec='k', label='Actual')
    x_test = np.linspace(-3,3,100).reshape(-1,1)
    for d, c in zip(depth, clr):
        dtr.set_params(max_depth=d)
        dtr.fit(x,y)
        y_hat = dtr.predict(x_test)
        print(mean_squared_error(y, dtr.predict(x)))  # training MSE; y_hat is evaluated on x_test, which has no matching labels
        plt.plot(x_test, y_hat, '-', color=c, lw=2, mec='k', label='Depth={}'.format(d))
    plt.legend(loc='upper left', fontsize=12)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid(True, ls=':', color='#606060')
    plt.title('Decision tree regression', fontsize=15)
    plt.tight_layout(pad=2)
    plt.show()

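4. Multi-output decision tree regression
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

N=400
# x = np.random.rand(N) * 8 - 4     # [-4,4)
x = np.random.rand(N) * 4*np.pi     # [0, 4*pi)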
x.sort()
print(x.shape)
print('====================')
# y1 = np.sin(x) + 3 + np.random.randn(N) * 0.1
# y2 = np.cos(0.3*x) + np.random.randn(N) * 0.01
# y1 = np.sin(x) + np.random.randn(N) * 0.05
# y2 = np.cos(x) + np.random.randn(N) * 0.1
y1 = 16 * np.sin(x) ** 3 + np.random.randn(N)*0.5
y2 = 13 * np.cos(x) - 5 * np.cos(2*x) - 2 * np.cos(3*x) - np.cos(4*x) + np.random.randn(N)*0.5
np.set_printoptions(suppress=True)  # print without scientific notation
#
y = np.vstack((y1, y2)).T
print(y1.shape)
print(y2.shape)
print(y.shape)
data = np.vstack((x,y1,y2)).T
print(data.shape)
x = x.reshape(-1,1)

deep=10
reg = DecisionTreeRegressor(criterion='squared_error',max_depth=deep)  # named 'mse' in older scikit-learn
dt = reg.fit(x,y)

x_test = np.linspace(x.min(), x.max(), num=1000).reshape(-1,1)
print(x_test.shape)
y_hat = reg.predict(x_test)
print(y_hat.shape)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

plt.figure(facecolor='w')
plt.scatter(y[:,0],y[:,1], c='r', marker='s', edgecolors='k', s=60, label='Actual', alpha=0.8)
plt.scatter(y_hat[:,0],y_hat[:,1], c='g', marker='o', edgecolors='g', s=30, label='Predicted', alpha=0.8)
plt.legend(loc='lower left', fancybox=True, fontsize=12)
plt.xlabel('$Y_1$', fontsize=12)
plt.ylabel('$Y_2$', fontsize=12)
plt.grid(True,ls=':',color='#606060')
plt.title('Multi-output decision tree regression',fontsize=15)
plt.tight_layout(pad=2)
plt.show()

5. Random forest classification of the iris data on feature pairs
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    iris_feature = 'Sepal length', 'Sepal width', 'Petal length', 'Petal width'
    path = r'F:\data\iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x_prime = data[list(range(4))]
    y = pd.Categorical(data[4]).codes
    x_prime_train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7, random_state=0)

    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(8, 6), facecolor='#FFFFFF')
    for i, pair in enumerate(feature_pairs):  # enumerate yields (index, value) tuples
        # Prepare the data
        x_train = x_prime_train[pair]
        x_test = x_prime_test[pair]

        # Train the random forest
        model = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=5, oob_score=True)
        model.fit(x_train, y_train)

        # Plot
        N, M = 500, 500  # number of sample points along each axis
        x1_min, x2_min = x_train.min()
        x1_max, x2_max = x_train.max()
        t1 = np.linspace(x1_min, x1_max, N)
        t2 = np.linspace(x2_min, x2_max, M)
        x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
        x_show = np.stack((x1.flat, x2.flat), axis=1)  # points to predict

        # Prediction results on the training and test sets
        y_train_pred = model.predict(x_train)
        acc_train = accuracy_score(y_train, y_train_pred)
        y_test_pred = model.predict(x_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        print('Features:', iris_feature[pair[0]], ' + ', iris_feature[pair[1]])
        print('OOB Score:', model.oob_score_)
        print('\tTraining accuracy: %.4f%%' % (100*acc_train))
        print('\tTest accuracy: %.4f%%\n' % (100*acc_test))

        cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
        y_hat = model.predict(x_show)
        y_hat = y_hat.reshape(x1.shape)
        plt.subplot(2, 3, i+1)
        plt.contour(x1, x2, y_hat, colors='k', levels=[0, 1], antialiased=True, linestyles='--', linewidths=1)
        plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # predicted regions
        plt.scatter(x_train[pair[0]], x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark, label='Training set')
        plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=100, marker='*', edgecolors='k', cmap=cm_dark, label='Test set')
        plt.xlabel(iris_feature[pair[0]], fontsize=12)
        plt.ylabel(iris_feature[pair[1]], fontsize=12)
        plt.legend(loc='upper right', fancybox=True, framealpha=0.3)
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.grid(True, ls=':', color='#606060')
    plt.suptitle('Random forest classification of the iris data on feature pairs', fontsize=15)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.95))    # (left, bottom, right, top)
    plt.show()

6. Boston housing price modeling (random forest)
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
import sklearn.datasets
from pprint import pprint
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import warnings


def not_empty(s):
    return s != ''


if __name__ == "__main__":
#     warnings.filterwarnings(action='ignore')
#     np.set_printoptions(suppress=True)
#     file_data = pd.read_csv('..\\housing.data', header=None)
#     # a = np.array([float(s) for s in str if s != ''])
#     data = np.empty((len(file_data), 14))
#     for i, d in enumerate(file_data.values):
#         d = list(map(float, list(filter(not_empty, d[0].split(' ')))))
#         data[i] = d
#     x, y = np.split(data, (13, ), axis=1)
    data = sklearn.datasets.load_boston()  # load_boston is only available in scikit-learn < 1.2
    x = np.array(data.data)
    y = np.array(data.target)
    print('Samples: %d, features: %d' % x.shape)
    print(y.shape)
    y = y.ravel()

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
    # model = Pipeline([
    #     ('ss', StandardScaler()),
    #     ('poly', PolynomialFeatures(degree=3, include_bias=True)),
    #     ('linear', ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.99, 1], alphas=np.logspace(-3, 2, 5),
    #                             fit_intercept=False, max_iter=1e3, cv=3))
    # ])
    model = RandomForestRegressor(n_estimators=50, criterion='squared_error')  # n_estimators: number of trees; the criterion was named 'mse' in older scikit-learn
    print('Fitting the model...')
    model.fit(x_train, y_train)
    # linear = model.get_params('linear')['linear']
    # print('Hyperparameter:', linear.alpha_)
    # print('L1 ratio:', linear.l1_ratio_)
    # print('Coefficients:', linear.coef_.ravel())

    order = y_test.argsort(axis=0)  # indices that sort y_test in ascending order
    y_test = y_test[order]
    x_test = x_test[order, :]
    y_pred = model.predict(x_test)
    r2 = model.score(x_test, y_test)
    mse = mean_squared_error(y_test, y_pred)
    print('R2:', r2)
    print('Mean squared error:', mse)

    t = np.arange(len(y_pred))
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(t, y_test, 'r-', lw=2, label='Actual')
    plt.plot(t, y_pred, 'g-', lw=2, label='Predicted')
    plt.legend(loc='best')
    plt.title('Boston housing price prediction', fontsize=18)
    plt.xlabel('Sample index', fontsize=15)
    plt.ylabel('House price', fontsize=15)
    plt.grid()
    plt.show()   
    

IV. Regression comparison: ridge, decision tree, bagged ridge, bagged trees

1. Regression comparison of Ridge, DT, bagging Ridge and bagging DT
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def f(x):
    return 0.5*np.exp(-(x+3)**2) + np.exp(-x**2) + 1.5*np.exp(-(x-3)**2)

if __name__ == '__main__':
    np.random.seed(0)
    N=200
    x = np.random.rand(N)*10-5  # [-5,5)
    x = np.sort(x)
    y = f(x)+0.05*np.random.rand(N)
    x.shape = -1,1
    
    degree=6
    n_estimators=50
    max_samples=0.5
    ridge = RidgeCV(alphas=np.logspace(-3,2,20), fit_intercept=False)
    ridged = Pipeline([('poly',PolynomialFeatures(degree=degree)),('Ridge',ridge)])
    bagging_ridged = BaggingRegressor(ridged, n_estimators=n_estimators, max_samples=max_samples)
    dtr = DecisionTreeRegressor(max_depth=9)
    regs = [
        ('DecisionTree', dtr),
        ('Ridge(%d Degree)' % degree, ridged),
        ('Bagging Ridge(%d Degree)' % degree, bagging_ridged),
        ('Bagging DecisionTree', BaggingRegressor(dtr, n_estimators=n_estimators, max_samples=max_samples))]
    x_test = np.linspace(1.1*x.min(), 1.1*x.max(), 1000)
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8,6), facecolor='w')
    plt.plot(x,y,'ro', mec='k',label='Training data')
    plt.plot(x_test, f(x_test), color='k', lw=3, ls='-', label='True function')
    clrs = '#FF2020', 'm', 'y', 'g'
    for i,(name,reg) in enumerate(regs):
        reg.fit(x,y)
        label = '%s, $R^2$=%.3f' % (name, reg.score(x, y))
        y_test = reg.predict(x_test.reshape(-1,1))
        plt.plot(x_test, y_test, color=clrs[i], lw=(i+1)*0.5, label=label, zorder=6-i)
    plt.legend(loc='upper left', fontsize=11)
    plt.xlabel('X', fontsize=12)
    plt.ylabel('Y', fontsize=12)
    plt.title('Regression curve fitting: samples_rate(%.1f), n_trees(%d)' % (max_samples, n_estimators), fontsize=15)
    plt.ylim((-0.2, 1.1*y.max()))
    plt.tight_layout(pad=2)
    plt.grid(True, ls=':', color='#606060')
    plt.show()

V. Comparing XGBoost, LR, RF and AdaBoost

1. xgboost binary classification
import xgboost as xgb
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# 1. Basic usage of XGBoost
# 2. Custom gradient and second derivative of the loss function
# 3. binary:logistic/logitraw


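# Define f: theta * x
def g_h(y_hat, y):
    # Custom objective: gradient g and second derivative h of the log loss with respect to the raw score y_hat
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()
    h = p * (1.0-p)
    return g, h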


def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)


if __name__ == "__main__":
    # Load the data
    data_train = xgb.DMatrix(r'F:\data\agaricus_train.txt')
    data_test = xgb.DMatrix(r'F:\data\agaricus_test.txt')
    print (data_train)
    print (type(data_train))

    # Set the parameters
    param = {'max_depth': 3, 'eta': 0.4, 'verbosity': 0, 'objective': 'binary:logistic'}  # binary:logistic is the cross-entropy loss; for regression use 'reg:logistic'
    # max_depth: maximum tree depth; eta: shrinkage factor (learning rate); verbosity: 0 prints nothing (older versions used 'silent'); objective: logistic regression for binary classification, where binary:logitraw returns the raw score before the logistic transform
    # param = {'max_depth': 3, 'eta': 0.3, 'verbosity': 0, 'objective': 'reg:logistic'}
    watchlist = [(data_test, 'eval'), (data_train, 'train')]  # evaluation data and training data
    n_round = 3  # number of boosting rounds (trees)
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist)
    # bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist, obj=g_h, feval=error_rate)

    # Compute the error rate
    y_hat = bst.predict(data_test)
    y = data_test.get_label()
    print (y_hat)
    print (y)
    error = sum(y != (y_hat > 0.5))  # threshold the probabilities into a boolean array, then count the entries that differ from y
    err_rate = float(error) / len(y_hat)  # separate name so the error_rate() function is not shadowed
    print ('Total samples:\t', len(y_hat))
    print ('Errors:\t%4d' % error)
    print ('Error rate:\t%.5f%%' % (100*err_rate))
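The `g_h` function in the script above returns `p - y` and `p * (1 - p)`. As a sanity check under the assumption that the loss is the per-sample log loss on the raw score, the sketch below compares those formulas with finite-difference derivatives in plain NumPy. The function names and sample scores are made up, and no xgboost call is involved.

```python
import numpy as np

def logistic_loss(y_hat, y):
    """Per-sample log loss as a function of the raw score y_hat."""
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_hess(y_hat, y):
    """Analytic first and second derivatives with respect to y_hat."""
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y_hat = np.array([-2.0, -0.5, 0.3, 1.5])  # raw scores (hypothetical)
y = np.array([0.0, 1.0, 1.0, 0.0])

eps = 1e-4
num_g = (logistic_loss(y_hat + eps, y) - logistic_loss(y_hat - eps, y)) / (2 * eps)
num_h = (logistic_loss(y_hat + eps, y) - 2 * logistic_loss(y_hat, y)
         + logistic_loss(y_hat - eps, y)) / eps ** 2

g, h = grad_hess(y_hat, y)
grad_ok = np.allclose(g, num_g, atol=1e-6)
hess_ok = np.allclose(h, num_h, atol=1e-4)
print(grad_ok, hess_ok)
```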

2. xgboost three-class iris classification, compared with LR and RF
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

if __name__ == '__main__':
    path = rf'F:\data\iris.data'
    data = pd.read_csv(path,header=None)
    x, y = data[list(range(0,4))], data[4]
    y = pd.Categorical(y).codes
    x = MinMaxScaler().fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=50, random_state=1)
    
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test,'eval'), (data_train,'train')]
    param = {'max_depth':4, 'eta':0.3,'silent':1, 'objective':'multi:softmax', 'num_class':3}
    
    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)
    result = y_test == y_hat
    print('Accuracy: {}'.format(float(np.sum(result))/len(y_hat)))
    print('==========')
    
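    models = [('LogisticRegression',LogisticRegressionCV(Cs=10, cv=5)),('RandomForest',RandomForestClassifier(n_estimators=30, criterion='gini'))]
    # Cs=10 picks 10 candidate values of C (the inverse regularization strength, 1/λ) on a log scale between 1e-4 and 1e4; cv=5 is 5-fold cross-validation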
    for name, model in models:
        model.fit(x_train, y_train)
        print(name, accuracy_score(y_train, model.predict(x_train)))
        print(name, accuracy_score(y_test, model.predict(x_test)))
        
3. xgboost, LR and RF: three-class wine classification
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

if __name__ == '__main__':
    path = rf'F:\data\wine.data'
    data = pd.read_csv(path, header=None)
    y, x = data[0], data[list(range(1,data.shape[1]))]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)
    #logisticRegression
    lr = LogisticRegression(penalty='l2')
    lr.fit(x_train,y_train)
    y_hat = lr.predict(x_test)
    print('lr',accuracy_score(y_test,y_hat))
    
    #RF
    rf = RandomForestClassifier(n_estimators=30, criterion='gini', max_depth=8, min_samples_leaf=3)
    rf.fit(x_train,y_train)
    y_train_pred = rf.predict(x_train)
    y_test_pred = rf.predict(x_test)
    print('RF train',accuracy_score(y_train,y_train_pred))
    print('RF test',accuracy_score(y_test,y_test_pred))    
    
    #xgb
    # xgboost expects class labels in [0, num_class), so class 3 is relabeled as 0
    y_train[y_train==3] = 0
    y_test[y_test==3] = 0
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test,'eval'),(data_train,'train')]
    param = {'max_depth':3,'eta':1, 'verbosity':1, 'objective':'multi:softmax','num_class':3}
    bst = xgb.train(param, data_train,num_boost_round=2, evals=watch_list)
    y_hat = bst.predict(data_test)
    print('xgb',accuracy_score(y_hat,y_test))

    
    
4. Titanic passenger survival, with missing values filled by the median and by a random forest
    import csv
    import xgboost as xgb
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import accuracy_score


def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    acc_rate = 100 * float(acc.sum()) / a.size
    print('%s accuracy: %.3f%%' % (tip, acc_rate))
    return acc_rate

def load_data(file_name, is_train):
    data = pd.read_csv(file_name)
    pd.set_option('display.width',200)
    print('data.describe() = \n',data.describe())
    data['Sex'] = pd.Categorical(data['Sex']).codes
    
    # Fill in fares recorded as 0 (treated as missing)
    if len(data.Fare[data.Fare == 0]) > 0:
        fare = np.zeros(3)
        for f in range(0, 3):  # one guessed fare for each of the three passenger classes
            fare[f] = data[data['Pclass'] == f + 1]['Fare'].dropna().median()  # median fare of that class
        print('='*30)
        print(fare)
        print('='*30)
        for f in range(0, 3):  # loop 0 to 2
            data.loc[(data.Fare == 0) & (data.Pclass == f + 1), 'Fare'] = fare[f]

    print('data.describe() = \n', data.describe())
    # Age: replace missing values with the mean
    # mean_age = data['Age'].dropna().mean()
    # data.loc[(data.Age.isnull()), 'Age'] = mean_age
    if is_train:  # the caller passes True or False to say whether this file is the training set
        # Age: predict missing ages with a random forest
        print('Random forest imputation of missing ages: --start--')
        data_for_age = data[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
        age_exist = data_for_age.loc[(data.Age.notnull())]   # rows where the age is present
        age_null = data_for_age.loc[(data.Age.isnull())]
        print(age_exist)
        x = age_exist.values[:, 1:]
        y = age_exist.values[:, 0]
        rfr = RandomForestRegressor(n_estimators=20)
        rfr.fit(x, y)
        age_hat = rfr.predict(age_null.values[:, 1:])
        # print(age_hat)
        data.loc[(data.Age.isnull()), 'Age'] = age_hat
        print('Random forest imputation of missing ages: --over--')
    else:
        print('Random forest imputation of missing ages 2: --start--')
        data_for_age = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
        age_exist = data_for_age.loc[(data.Age.notnull())]  # rows where the age is present
        age_null = data_for_age.loc[(data.Age.isnull())]
        # print(age_exist)
        x = age_exist.values[:, 1:]
        y = age_exist.values[:, 0]
        rfr = RandomForestRegressor(n_estimators=1000)
        rfr.fit(x, y)
        age_hat = rfr.predict(age_null.values[:, 1:])
        # print(age_hat)
        data.loc[(data.Age.isnull()), 'Age'] = age_hat
        print('Random forest imputation of missing ages 2: --over--')
    data['Age'] = pd.cut(data['Age'], bins=6, labels=np.arange(6))  # bin the ages into 6 equal-width bins

    # Port of embarkation
    data.loc[(data.Embarked.isnull()), 'Embarked'] = 'S'  # fill missing embarkation ports with 'S'
    embarked_data = pd.get_dummies(data.Embarked, dtype=int)  # one-hot DataFrame; dtype=int keeps 0/1 instead of bool in recent pandas
    print('embarked_data = ', embarked_data)
    # embarked_data = embarked_data.rename(columns={'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown', 'U': 'UnknownCity'})
    embarked_data = embarked_data.rename(columns=lambda x: 'Embarked_' + str(x))
    data = pd.concat([data, embarked_data], axis=1)  # append as new columns on the right

    print(data.describe())
    data.to_csv('F:\\data\\New_Data.csv')

    x = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']]
    # x = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    y = None
    if 'Survived' in data:
        y = data['Survived']

    x = np.array(x)
    y = np.array(y)

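    # Think: what does this actually do?
    # (Every sample is repeated 5 times, so a later random train/test split puts copies of the
    # same passenger on both sides and the test accuracy looks better than it really is.)
    x = np.tile(x, (5, 1))  # stack 5 copies vertically
    y = np.tile(y, (5, ))   # repeat 5 times along the single axis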
    if is_train:
        return x, y
    return x, data['PassengerId']


def write_result(c, c_type):
    file_name = r'F:\data\Titanic.test.csv'
    x, passenger_id = load_data(file_name, False)

    if c_type == 3:  # 3 = XGBoost Booster, which needs a DMatrix (the original tested the builtin `type`)
        x = xgb.DMatrix(x)
    y = c.predict(x)
    y[y > 0.5] = 1
    y[~(y > 0.5)] = 0

    # Python 3: csv.writer needs a text-mode file opened with newline=''
    with open("Prediction_%d.csv" % c_type, "w", newline='') as predictions_file:
        open_file_object = csv.writer(predictions_file)
        open_file_object.writerow(["PassengerId", "Survived"])
        open_file_object.writerows(list(zip(passenger_id, y)))


#if __name__ == "__main__":
x, y = load_data('F:\\data\\Titanic.train.csv', True)
print('x = ', x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
#
lr = LogisticRegression(penalty='l2')
lr.fit(x_train, y_train)
y_hat = lr.predict(x_test)
lr_acc = accuracy_score(y_test, y_hat)
# write_result(lr, 1)

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
y_hat = rfc.predict(x_test)  # y: survived or not
rfc_acc = accuracy_score(y_test, y_hat)
# write_result(rfc, 2)

# XGBoost
data_train = xgb.DMatrix(x_train, label=y_train)
data_test = xgb.DMatrix(x_test, label=y_test)
watch_list = [(data_test, 'eval'), (data_train, 'train')]
param = {'max_depth': 6, 'eta': 0.8, 'verbosity': 0, 'objective': 'binary:logistic'}  # 'verbosity' replaces the deprecated 'silent'
         # 'subsample': 1, 'alpha': 0, 'lambda': 0, 'min_child_weight': 1}
bst = xgb.train(param, data_train, num_boost_round=20, evals=watch_list)
y_hat = bst.predict(data_test)
# write_result(bst, 3)
y_hat[y_hat > 0.5] = 1
y_hat[~(y_hat > 0.5)] = 0
xgb_acc = accuracy_score(y_test, y_hat)

print('Logistic regression: %.3f%%' % (100*lr_acc))
print('Random forest: %.3f%%' % (100*rfc_acc))
print('XGBoost: %.3f%%' % (100*xgb_acc))
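
The np.tile step in load_data repeats every passenger 5 times before train_test_split is called. A minimal NumPy sketch of the consequence, assuming 200 hypothetical passengers and a plain random 25% test split (it stands in for train_test_split and uses no Titanic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # hypothetical number of distinct passengers
ids = np.tile(np.arange(n), 5)            # each passenger now appears 5 times

perm = rng.permutation(ids.size)          # random split, 25% test as in the script
n_test = ids.size // 4
test_ids, train_ids = ids[perm[:n_test]], ids[perm[n_test:]]

# share of test rows whose passenger also has a copy in the training set
leaked = np.isin(test_ids, train_ids).mean()
print('share of test rows also present in training: %.3f' % leaked)
```

Because almost every test row has an identical twin in the training set, models that memorize samples (random forests, boosted trees) can score far better on this split than on genuinely unseen passengers.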

5.AdaBoost classification results on pairs of iris features
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    
    iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
    path = 'F:\\data\\iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x_prime = data[list(range(4))]
    y = pd.Categorical(data[4]).codes
    x_prime_train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7, random_state=0)

    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(11, 8), facecolor='#FFFFFF')
    for i, pair in enumerate(feature_pairs):
        # prepare the data
        x_train = x_prime_train[pair]
        x_test = x_prime_test[pair]
    
        base_estimator = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=4)
        # scikit-learn 1.2 renamed base_estimator to estimator; use base_estimator= on older versions
        model = AdaBoostClassifier(estimator=base_estimator, n_estimators=10, learning_rate=0.1)
        model.fit(x_train, y_train)

        # plot
        M,N = 500,500
        x1_min, x2_min = x_train.min()
        x1_max, x2_max = x_train.max()
        t1 = np.linspace(x1_min, x1_max, M)
        t2 = np.linspace(x2_min, x2_max, N)
        x1, x2 = np.meshgrid(t1,t2)
        x_show = np.stack((x1.flat, x2.flat), axis=1)

        y_train_pred = model.predict(x_train)
        y_test_pred = model.predict(x_test)
        acc_train = accuracy_score(y_train_pred, y_train)
        acc_test = accuracy_score(y_test_pred, y_test)
        print('Features:', iris_feature[pair[0]], ' + ', iris_feature[pair[1]])
        print('\tTraining accuracy: %.4f%%' % (100*acc_train))
        print('\tTest accuracy: %.4f%%\n' % (100*acc_test))

        cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
        y_hat = model.predict(x_show)
        y_hat = y_hat.reshape(x1.shape)
        plt.subplot(2, 3, i+1)
        plt.contour(x1, x2, y_hat, colors='k', levels=[0, 1], antialiased=True, linestyles='--', linewidths=1.5)
        plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # predicted class regions
        plt.scatter(x_train[pair[0]], x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark)
        plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=100, marker='*', edgecolors='k', cmap=cm_dark)
        plt.xlabel(iris_feature[pair[0]], fontsize=14)
        plt.ylabel(iris_feature[pair[1]], fontsize=14)
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.grid(True)
    plt.suptitle('AdaBoost classification of iris data by feature pairs', fontsize=18)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.95))    # (left, bottom, right, top)
    plt.show()
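
To make the boosting step concrete, here is a sketch of one round of the classic binary (discrete) AdaBoost update: misclassified samples gain weight and the learner's vote depends on its weighted error. The labels and predictions are made up, and scikit-learn's AdaBoostClassifier uses a multi-class variant (SAMME), so its internal details differ from this simplified version.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])        # hypothetical labels in {-1, +1}
y_pred = np.array([1, -1, -1, -1, 1])       # one weak learner's predictions (one mistake)
w = np.full(y_true.size, 1 / y_true.size)   # start from uniform sample weights

miss = y_pred != y_true
err = w[miss].sum()                         # weighted error rate of the weak learner
alpha = 0.5 * np.log((1 - err) / err)       # weight of this learner in the final vote

# raise the weights of misclassified samples, lower the others, then renormalize
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()
print(err, alpha)
print(w_new)
```

The next weak learner is trained on w_new, so it concentrates on the sample the first learner got wrong.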
    
6.AdaBoost: saving the model and evaluating it
    import numpy as np
    import scipy as sp
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import metrics
    import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
    import os

if __name__ == '__main__':
    pd.set_option('display.width', 400)
    pd.set_option('display.expand_frame_repr', False)
    pd.set_option('display.max_columns', 70)
    column_names = 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', \
                   'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
    model_file = 'G:\\data\\adult.pkl'
    if os.path.exists(model_file):
        model = joblib.load(model_file)
    else:
        show_result = True
        print('Loading data')
        data = pd.read_csv('G:\\data\\adult.data',header=None, names=column_names)
        for name in data.columns:
            data[name] = pd.Categorical(data[name]).codes
        x = data[data.columns[:-1]]
        y = data[data.columns[-1]]
        x_train, x_valid, y_train, y_valid = train_test_split(x,y,test_size=0.3, random_state=0)
        print(y_train.mean())
        print (y_valid.mean())
        base_estimator = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=5)
        model = AdaBoostClassifier(base_estimator=base_estimator, n_estimators=50, learning_rate=0.1)
        model.fit(x_train,y_train)
        joblib.dump(model, model_file)
        if show_result:
            y_train_pred = model.predict(x_train)
            print('Training accuracy:', accuracy_score(y_train, y_train_pred))
            print('\tTraining precision:', precision_score(y_train, y_train_pred))
            print('\tTraining recall:', recall_score(y_train, y_train_pred))
            print('\tTraining F1:', f1_score(y_train, y_train_pred))
            y_valid_pred = model.predict(x_valid)
            print('Validation accuracy:', accuracy_score(y_valid, y_valid_pred))
            print('\tValidation precision:', precision_score(y_valid, y_valid_pred))
            print('\tValidation recall:', recall_score(y_valid, y_valid_pred))
            print('\tValidation F1:', f1_score(y_valid, y_valid_pred))
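    data_test = pd.read_csv('G:\\data\\adult.test', header=None, skiprows=1, names=column_names)
    # Caution: the codes are derived from each file separately, so they match the training
    # encoding only if both files contain exactly the same set of values in every column
    for name in data_test.columns:
        data_test[name] = pd.Categorical(data_test[name]).codes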
    x_test = data_test[data_test.columns[:-1]]
    y_test = data_test[data_test.columns[-1]]
    y_test_pred = model.predict(x_test)
    print('Test accuracy:', accuracy_score(y_test, y_test_pred))
    print('\tTest precision:', precision_score(y_test, y_test_pred))
    print('\tTest recall:', recall_score(y_test, y_test_pred))
    print('\tTest F1:', f1_score(y_test, y_test_pred))

    y_test_proba = model.predict_proba(x_test)
    print(y_test_proba)  # probabilities of income <=50K and >50K
    y_test_proba = y_test_proba[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_proba)
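    # thresholds (the third return value) are the cut-off points: each of the n thresholds gives one (FPR, TPR) point
    auc = metrics.auc(fpr, tpr)  # must be defined, because the plot label uses it
    # or call roc_auc_score directly: metrics.roc_auc_score(y_test, y_test_proba)
    print('AUC = ', auc)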

    mpl.rcParams['font.sans-serif'] = 'SimHei'
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(fpr, tpr, 'r-', lw=2, alpha=0.8, label='AUC=%.3f' % auc)
    plt.plot((0, 1), (0, 1), c='b', lw=1.5, ls='--', alpha=0.7)
    plt.xlim((-0.01, 1.02))
    plt.ylim((-0.01, 1.02))
    plt.xticks(np.arange(0, 1.1, 0.1))
    plt.yticks(np.arange(0, 1.1, 0.1))
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.grid(True)
    plt.legend(loc='lower right', fancybox=True, framealpha=0.8, fontsize=14)
    plt.title('ROC curve and AUC for the Adult data', fontsize=17)
    plt.show()
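
AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way with NumPy; the labels and scores are made up and it does not replace metrics.roc_auc_score.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])              # made-up labels
score = np.array([0.1, 0.4, 0.35, 0.8])      # made-up predicted probabilities

pos, neg = score[y_true == 1], score[y_true == 0]
diff = pos[:, None] - neg[None, :]           # every positive/negative pair
auc_pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()
print(auc_pairwise)
```

Three of the four positive/negative pairs are ranked correctly here, which is why a curve hugging the diagonal (AUC near 0.5) means the scores barely separate the classes.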

VI. SVM

1.SVM three-class iris classification
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from sklearn import svm
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

if __name__ == '__main__':
    iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
    path = 'F:\\data\\iris.data'
    data = pd.read_csv(path, header=None)
    x,y = data[[0,1]], pd.Categorical(data[4]).codes
    x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.4, random_state=1)
    
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr', probability=True)  # multi-class decision function: one-vs-rest ('ovr') or one-vs-one ('ovo')
    clf.fit(x_train,y_train.ravel())
    print(clf.decision_function(x))
    print(clf.score(x_train,y_train))
    print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
    print(clf.score(x_test, y_test))
    print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))
    
    # decision_function
    print(x_train[:5])
    print('decision_function:\n', clf.decision_function(x_train))
    print('\npredict:\n', clf.predict(x_train))
    # plot
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]  # generate the grid of sample points
    grid_test = np.stack((x1.flat, x2.flat), axis=1)  # grid points to predict
    grid_hat = clf.predict(grid_test)       # predicted class
    grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)      # samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', zorder=10)     # circle the test-set samples
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('SVM classification of iris with two features', fontsize=16)
    plt.grid(True, ls=':')
    plt.tight_layout(pad=1.5)
    plt.show()
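
The decision-region plots in these scripts all follow the same pattern: build a grid with np.mgrid, flatten it into one (x1, x2) row per point, predict, and reshape the predictions back to the grid. A small sketch of the shapes involved, using made-up feature ranges and a stand-in for clf.predict:

```python
import numpy as np

x1_min, x1_max, x2_min, x2_max = 4.0, 8.0, 2.0, 4.5     # hypothetical feature ranges
# a complex step (5j) means "5 points, endpoints included" rather than a step size
x1, x2 = np.mgrid[x1_min:x1_max:5j, x2_min:x2_max:3j]
grid_test = np.stack((x1.ravel(), x2.ravel()), axis=1)  # one (x1, x2) row per grid point
print(x1.shape, grid_test.shape)

# stand-in for clf.predict(grid_test): one label per grid point
fake_pred = (grid_test[:, 0] > 6).astype(int)
grid_hat = fake_pred.reshape(x1.shape)                  # back to grid layout for pcolormesh
print(grid_hat)
```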
    
2.SVM: plots for different kernels and parameters
    import numpy as np
    import pandas as pd
    from sklearn import svm
    from sklearn.metrics import accuracy_score
    import matplotlib as mpl
    import matplotlib.colors
    import matplotlib.pyplot as plt


if __name__ == "__main__":
    data = pd.read_csv('F:\\data\\bipartition.txt', sep='\t', header=None)
    x, y = data[[0, 1]], data[2]

    # classifiers
    clf_param = (('linear', 0.1), ('linear', 0.5), ('linear', 1), ('linear', 2),
                ('rbf', 1, 0.1), ('rbf', 1, 1), ('rbf', 1, 10), ('rbf', 1, 100),
                ('rbf', 5, 0.1), ('rbf', 5, 1), ('rbf', 5, 10), ('rbf', 5, 100))
    x1_min, x2_min = np.min(x, axis=0)
    x1_max, x2_max = np.max(x, axis=0)
    x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)

    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FFA0A0'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r'])
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(13, 9), facecolor='w')
    for i, param in enumerate(clf_param):
        clf = svm.SVC(C=param[1], kernel=param[0])
        if param[0] == 'rbf':
            clf.gamma = param[2]
            title = r'RBF kernel, C=%.1f, $\gamma$=%.1f' % (param[1], param[2])  # raw string so that \g is not treated as an escape
        else:
            title = 'Linear kernel, C=%.1f' % param[1]

        clf.fit(x, y)
        y_hat = clf.predict(x)
        print('Accuracy:', accuracy_score(y, y_hat))

        # plot
        print(title)
        print('Number of support vectors per class:', clf.n_support_)
        print('Dual coefficients of the support vectors:', clf.dual_coef_)
        print('Support vector indices:', clf.support_)
        plt.subplot(3, 4, i+1)
        grid_hat = clf.predict(grid_test)       # predicted class
        grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid
        plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, alpha=0.8)
        plt.scatter(x[0], x[1], c=y, edgecolors='k', s=40, cmap=cm_dark)      # samples
        plt.scatter(x.loc[clf.support_, 0], x.loc[clf.support_, 1], edgecolors='k', facecolors='none', s=100, marker='o')   # support vectors
        z = clf.decision_function(grid_test)
        # print 'z = \n', z
        print('clf.decision_function(x) = ', clf.decision_function(x))
        print('clf.predict(x) = ', clf.predict(x))
        z = z.reshape(x1.shape)
        plt.contour(x1, x2, z, colors=list('kbrbk'), linestyles=['--', '--', '-', '--', '--'],
                    linewidths=[1, 0.5, 1.5, 0.5, 1], levels=[-1, -0.5, 0, 0.5, 1])
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.title(title, fontsize=12)
    plt.suptitle('SVM classification with different parameters', fontsize=16)
    plt.tight_layout(pad=1.4)
    plt.subplots_adjust(top=0.92)
    plt.show()
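
The gamma values compared in the plots control how fast the RBF kernel K(x, z) = exp(-gamma * ||x - z||^2) decays with distance. A minimal sketch, assuming two made-up points; rbf_kernel here is a hand-written helper, not a scikit-learn function:

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), the kernel form used by SVC(kernel='rbf')."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])           # hypothetical points at squared distance 2
for gamma in (0.1, 1, 10, 100):    # the gamma values tried in the script
    print(gamma, rbf_kernel(x, z, gamma))
```

With a large gamma only very close points look similar, so each support vector influences a small neighbourhood and the boundary can wrap tightly around individual samples; a small gamma gives smoother, almost linear boundaries.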
    
3.SVM on imbalanced data: effect of class weights with linear and Gaussian (RBF) kernels
    import numpy as np
    from sklearn import svm
    import matplotlib.colors
    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.exceptions import UndefinedMetricWarning
    import warnings


if __name__ == "__main__":
    warnings.filterwarnings(action='ignore', category=UndefinedMetricWarning)
    np.random.seed(0)   # generate the same data on every run

    c1 = 990
    c2 = 10
    N = c1 + c2
    x_c1 = 3*np.random.randn(c1, 2)
    x_c2 = 0.5*np.random.randn(c2, 2) + (4, 4)
    x = np.vstack((x_c1, x_c2))
    y = np.ones(N)
    y[:c1] = -1

    # marker sizes
    s = np.ones(N) * 30
    s[:c1] = 10

    # classifiers
    clfs = [svm.SVC(C=1, kernel='linear'),
           svm.SVC(C=1, kernel='linear', class_weight={-1: 1, 1: 50}),
           svm.SVC(C=0.8, kernel='rbf', gamma=0.5, class_weight={-1: 1, 1: 2}),
           svm.SVC(C=0.8, kernel='rbf', gamma=0.5, class_weight={-1: 1, 1: 10})]
    titles = 'Linear', 'Linear, Weight=50', 'RBF, Weight=2', 'RBF, Weight=10'

    x1_min, x2_min = np.min(x, axis=0)
    x1_max, x2_max = np.max(x, axis=0)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)  # grid points to predict

    cm_light = matplotlib.colors.ListedColormap(['#77E0A0', '#FF8080'])
    cm_dark = matplotlib.colors.ListedColormap(['g', 'r'])
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(10, 8), facecolor='w')
    for i, clf in enumerate(clfs):
        clf.fit(x, y)

        y_hat = clf.predict(x)
        # show_accuracy(y_hat, y) # accuracy
        # show_recall(y, y_hat)   # recall
        print('Run', i+1, ':')
        print('accuracy:\t', accuracy_score(y, y_hat))
        print('precision:\t', precision_score(y, y_hat, pos_label=1))
        print('recall:\t', recall_score(y, y_hat, pos_label=1))
        print('F1-score:\t', f1_score(y, y_hat, pos_label=1))
        print()


        # plot
        plt.subplot(2, 2, i+1)
        grid_hat = clf.predict(grid_test)       # predicted class
        grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid
        plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, alpha=0.8)
        plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', s=s, cmap=cm_dark)      # samples
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.title(titles[i])
        plt.grid(True, ls=':')
    plt.suptitle('Handling imbalanced data', fontsize=18)
    plt.tight_layout(pad=1.5)
    plt.subplots_adjust(top=0.92)
    plt.show()
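
The script reports precision, recall and F1 rather than accuracy alone because accuracy is misleading at a 990:10 class ratio. A small sketch with the same class sizes and a deliberately useless model that always predicts the majority class (the "model" is an assumption for illustration):

```python
import numpy as np

y = np.ones(1000)
y[:990] = -1                        # 990 negatives, 10 positives, as in the script
y_hat = -np.ones(1000)              # degenerate model: always predict -1

tp = np.sum((y == 1) & (y_hat == 1))
fn = np.sum((y == 1) & (y_hat == -1))
accuracy = np.mean(y == y_hat)
recall = tp / (tp + fn)
print(accuracy, recall)
```

Raising class_weight for the positive class makes errors on the 10 positives more expensive, which trades some accuracy for a higher recall.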

4.Handwritten digit image recognition, tuning with cross-validation
    import numpy as np
    from sklearn import svm
    import matplotlib.colors
    import matplotlib.pyplot as plt
    from PIL import Image
    from sklearn.metrics import accuracy_score
    import os
    from sklearn.model_selection import GridSearchCV
    from time import time


def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    print(tip + ' accuracy: %.2f%%' % (100*np.mean(acc)))


def save_image(im, i):
    im *= 15.9375
    im = 255 - im
    a = im.astype(np.uint8)
    output_path = '.\\HandWritten'
    if not os.path.exists(output_path):
        os.mkdir(output_path)
    Image.fromarray(a).save(output_path + ('\\%d.png' % i))


if __name__ == "__main__":
    print('Load Training File Start...')
    data = np.loadtxt('F:\\data\\optdigits.tra', dtype=float, delimiter=',')  # np.float and np.int were removed in NumPy 1.24
    x, y = np.split(data, (-1, ), axis=1)
    images = x.reshape(-1, 8, 8)
    y = y.ravel().astype(int)

    print('Load Test Data Start...')
    data = np.loadtxt('F:\\data\\optdigits.tes', dtype=float, delimiter=',')
    x_test, y_test = np.split(data, (-1, ), axis=1)
    print(y_test.shape)
    images_test = x_test.reshape(-1, 8, 8)
    y_test = y_test.ravel().astype(int)
    print('Load Data OK...')

    # x, x_test, y, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
    # images = x.reshape(-1, 8, 8)
    # images_test = x_test.reshape(-1, 8, 8)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(15, 9), facecolor='w')
    for index, image in enumerate(images[:16]):
        plt.subplot(4, 8, index + 1)
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Train image: %i' % y[index])
    for index, image in enumerate(images_test[:16]):
        plt.subplot(4, 8, index + 17)
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        save_image(image.copy(), index)
        plt.title('Test image: %i' % y_test[index])
    plt.tight_layout()
    plt.show()

    # params = {'C':np.logspace(0, 3, 7), 'gamma':np.logspace(-5, 0, 11)}
    # model = GridSearchCV(svm.SVC(kernel='rbf'), param_grid=params, cv=3)
    model = svm.SVC(C=10, kernel='rbf', gamma=0.001)
    print('Start Learning...')
    t0 = time()
    model.fit(x, y)
    t1 = time()
    t = t1 - t0
    print('Training + CV time: %d min %.3f s' % (int(t/60), t - 60*int(t/60)))
    # print('Best parameters:\t', model.best_params_)
    #clf.fit(x, y)
    print('Learning is OK...')
    print('Training accuracy:', accuracy_score(y, model.predict(x)))
    y_hat = model.predict(x_test)
    print('Test accuracy:', accuracy_score(y_test, y_hat))
    print(y_hat)
    print(y_test)

    err_images = images_test[y_test != y_hat]
    err_y_hat = y_hat[y_test != y_hat]
    err_y = y_test[y_test != y_hat]
    print(err_y_hat)
    print(err_y)
    plt.figure(figsize=(10, 8), facecolor='w')
    for index, image in enumerate(err_images):
        if index >= 12:
            break
        plt.subplot(3, 4, index + 1)
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Predicted: %i, actual: %i' % (err_y_hat[index], err_y[index]))
    plt.tight_layout()
    plt.show()
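
The constant 15.9375 in save_image is 255/16: optdigits pixel values lie in 0..16, so the multiplication stretches them to 0..255 and the subtraction inverts the image before it is stored as 8-bit grey levels. A sketch on three made-up pixel values:

```python
import numpy as np

im = np.array([[0.0, 8.0, 16.0]])   # optdigits pixel values lie in 0..16
scaled = im * 15.9375               # 15.9375 = 255 / 16, so 16 maps to 255
inverted = 255 - scaled             # invert: 0 becomes white (255), 16 becomes black (0)
a = inverted.astype(np.uint8)       # the cast drops any fractional part
print(a)
```

save_image multiplies in place (im *= ...), which is why the script passes image.copy() instead of the array that is still being plotted.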

5.Regression (SVR)
    import numpy as np
    from sklearn import svm
    import matplotlib.pyplot as plt

N=50
np.random.seed(0)
x = np.sort(np.random.uniform(0,6,N), axis=0)
y = 2*np.sin(x) + 0.1*np.random.randn(N)
x = x.reshape(-1,1)

svr_rbf = svm.SVR(C=100, kernel='rbf', gamma=0.2)  # gamma acts like an inverse variance: gamma = 1/(2*sigma^2)
svr_rbf.fit(x,y)

svr_linear = svm.SVR(C=100, kernel='linear')
svr_linear.fit(x,y)

svr_poly = svm.SVR(C=100, kernel='poly', degree=3)  # polynomial kernel of degree 3
svr_poly.fit(x,y)

x_test = np.linspace(x.min(), 1.1*x.max(), 100).reshape(-1, 1)
y_rbf = svr_rbf.predict(x_test)
y_linear = svr_linear.predict(x_test)
y_poly = svr_poly.predict(x_test)

plt.figure(figsize=(7,6),facecolor='w')
plt.plot(x_test,y_rbf,'r-',linewidth=2, label='RBF Kernel')
plt.plot(x_test,y_linear,'g-',linewidth=2, label='Linear Kernel')
plt.plot(x_test,y_poly,'b-',linewidth=2, label='Polynomial Kernel')
plt.plot(x,y,'mo', ms=6, mec='k')
plt.scatter(x[svr_rbf.support_], y[svr_rbf.support_], s=200, c='r', marker='*', edgecolors='k', label='RBF Support Vectors', zorder=10)
plt.legend(loc='lower left', fontsize=12)
plt.title('SVR', fontsize=15)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True, ls=':')
plt.tight_layout(pad=2)
plt.show()
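
SVR fits with an epsilon-insensitive loss: residuals smaller than epsilon cost nothing, and only samples on or outside this tube become support vectors (the starred points in the plot). The script leaves epsilon at scikit-learn's default of 0.1. A sketch of the loss itself, with made-up targets and predictions; epsilon_insensitive is a hand-written helper:

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """SVR loss: errors inside the epsilon tube cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.00, 1.00, 1.00])
y_pred = np.array([1.05, 1.10, 1.50])     # made-up predictions
loss = epsilon_insensitive(y_true, y_pred)
print(loss)
```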

6.Regression: hyperparameter tuning
    import numpy as np
    from sklearn import svm
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV    # in scikit-learn 0.17 these lived in sklearn.grid_search
    import matplotlib.pyplot as plt


if __name__ == "__main__":
    N = 50
    np.random.seed(0)
    x = np.sort(np.random.uniform(0, 6, N), axis=0)
    y = 2*np.sin(x) + 0.1*np.random.randn(N)
    x = x.reshape(-1, 1)
    print('x =\n', x)
    print('y =\n', y)

    model = svm.SVR(kernel='rbf')
    c_can = np.logspace(-2, 2, 10)
    gamma_can = np.logspace(-2, 2, 10)
    svr = GridSearchCV(model, param_grid={'C': c_can, 'gamma': gamma_can}, cv=5)
    svr.fit(x, y)
    print('Best parameters from cross-validation:\n', svr.best_params_)

    x_test = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
    y_hat = svr.predict(x_test)

    sp = svr.best_estimator_.support_
    plt.figure(facecolor='w')
    plt.scatter(x[sp], y[sp], s=120, c='r', marker='*', label='Support Vectors', zorder=3)
    plt.plot(x_test, y_hat, 'r-', linewidth=2, label='RBF Kernel')
    plt.plot(x, y, 'go', markersize=5)
    plt.legend(loc='upper right')
    plt.title('SVR', fontsize=16)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid(True)
    plt.show()
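
Grid search is exhaustive, so its cost grows with the product of the candidate lists. A rough sketch of how many SVR fits the search above performs, rebuilt with itertools only (it does not run GridSearchCV):

```python
import numpy as np
from itertools import product

c_can = np.logspace(-2, 2, 10)        # same candidate lists as the script
gamma_can = np.logspace(-2, 2, 10)
cv = 5

candidates = list(product(c_can, gamma_can))   # every (C, gamma) combination
n_fits = len(candidates) * cv                  # one fit per combination per fold
print(len(candidates), n_fits)
```

With the default refit=True, GridSearchCV adds one final fit of the best combination on all the data; RandomizedSearchCV, imported alongside it, samples a fixed number of combinations instead of trying them all.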

VII. Clustering

Laplacian matrix and spectral clustering: https://blog.csdn.net/guoxinian/article/details/79532893

1.KMeans++ clustering and several evaluation metrics
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_mutual_info_score, adjusted_rand_score, silhouette_score
    from sklearn.cluster import KMeans

def expand(a,b):
    d = (b-a)*0.1
    return a-d, b+d

if __name__ == '__main__':
    N=400
    centers = 4
    data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=2)
    data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)
    data3 = np.vstack((data[y==0][:],data[y==1][:50],data[y==2][:20],data[y==3][:5]))
    y3 = np.array([0]*100+[1]*50+[2]*20+[3]*5)
    m = np.array(((1,1),(1,3)))
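    data_r = data.dot(m)  # linear transform by m (m is not orthogonal, so this stretches the data rather than purely rotating it)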
    
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    cm = mpl.colors.ListedColormap(list('rgbm'))
    data_list = data, data, data_r, data_r, data2, data2, data3, data3
    y_list = y, y, y, y, y2, y2, y3, y3
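    titles = 'Original data', 'KMeans++ clustering', 'Transformed data', 'KMeans++ on transformed data',\
             'Unequal-variance data', 'KMeans++ on unequal-variance data', 'Unequal-size data', 'KMeans++ on unequal-size data'
    model = KMeans(n_clusters=4, init='k-means++', n_init=5)  # 5 initializations; the best run is kept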
    plt.figure(figsize=(8,9), facecolor='w')
    for i,(x,y,title) in enumerate(zip(data_list,y_list,titles),start=1):
        plt.subplot(4,2,i)
        plt.title(title)
        if i % 2==1:
            y_pred = y
        else:
            y_pred = model.fit_predict(x)
        print(i)
        print('Homogeneity:', homogeneity_score(y, y_pred))
        print('completeness:', completeness_score(y, y_pred))
        print('V measure:', v_measure_score(y, y_pred))
        print('AMI:', adjusted_mutual_info_score(y, y_pred))
        print('ARI:', adjusted_rand_score(y, y_pred))
        print('Silhouette:', silhouette_score(x, y_pred), '\n')
        plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=30, cmap=cm, edgecolors='none')
        x1_min, x2_min = np.min(x, axis=0)
        x1_max, x2_max = np.max(x, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(True, ls=':')
    plt.tight_layout(pad=2, rect=(0, 0, 1, 0.97))
    plt.suptitle('Effect of the data distribution on KMeans clustering', fontsize=18)
    plt.show()
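
Whether a matrix such as m is a rotation can be checked directly: a rotation is orthogonal (its transpose times itself is the identity) and has determinant +1. The sketch below applies that test to m and, for comparison, to a genuine rotation by an arbitrarily chosen angle; is_rotation is a hand-written helper.

```python
import numpy as np

m = np.array(((1, 1), (1, 3)))              # the matrix applied in the script
theta = np.pi / 6                           # hypothetical angle for comparison
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

def is_rotation(a):
    # a rotation is orthogonal (a.T @ a = I) with determinant +1
    return bool(np.allclose(a.T @ a, np.eye(2)) and np.isclose(np.linalg.det(a), 1.0))

print(is_rotation(m), is_rotation(rot))
```

Because m changes distances unevenly along different directions, the blobs become elongated, which is the situation in which KMeans (which assumes roughly spherical clusters) starts to misassign points.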
    
2.Image simplification and compression by clustering
    from PIL import Image
    import numpy as np
    from sklearn.cluster import KMeans
    import matplotlib
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    # %matplotlib inline  (IPython/Jupyter magic: only valid inside a notebook, a syntax error in a plain script)


def restore_image(cb, cluster, shape):
    row, col, dummy = shape
    image = np.empty((row, col, 3))
    index = 0
    for r in range(row):
        for c in range(col):
            image[r, c] = cb[cluster[index]]  # paint each pixel with the RGB value of its cluster center
            index += 1
    return image


def show_scatter(a):
    N = 10
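    print('Raw data:\n', a)
    # a has shape (-1, 3) with all values in [0, 1]; bin it into an N x N x N 3-D histogram, which returns the counts and the bin edges
    density, edges = np.histogramdd(a, bins=[N,N,N], range=[(0,1), (0,1), (0,1)])
    np.set_printoptions(linewidth=300, suppress=True)  # up to 300 characters per line, no scientific notation
    print('print density\n',density)
    density /= density.sum()  # convert counts to relative frequencies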
    x = y = z = np.arange(N)
    d = np.meshgrid(x, y, z)

    fig = plt.figure(1, facecolor='w')
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(d[1], d[0], d[2], c='r', s=100*density/density.max(), marker='o', depthshade=True)
    ax.set_xlabel('Red component')
    ax.set_ylabel('Green component')
    ax.set_zlabel('Blue component')
    plt.title('3-D frequency distribution of image colors', fontsize=13)

    plt.figure(2, facecolor='w')
    den = density[density > 0]
    print(den.shape)
    den = np.sort(den)[::-1]  # sort in descending order
    t = np.arange(len(den))
    plt.plot(t, den, 'r-', t, den, 'go', lw=2)
    plt.title('Frequency distribution of image colors', fontsize=13)
    plt.grid(True)

    plt.show()


if __name__ == '__main__':
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    num_vq = 50
    im = Image.open('G:\\data\\lena.png')     # son.bmp(100)/flower2.png(200)/son.png(60)/lena.png(50)
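    image = np.array(im).astype(float) / 255  # np.array converts the image to an array (np.float was removed in NumPy 1.24)
    image = image[:, :, :3]  # axes are (rows, columns, channels): keep only the RGB channels
    image_v = image.reshape((-1, 3))  # flatten to one row per pixel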
    show_scatter(image_v)

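    N = image_v.shape[0]    # total number of pixels
    # pick enough samples (e.g. 1000) to compute the cluster centers
    #idx = np.random.randint(0, N, size=1000)  # 1000 random indices in 0..N-1
    idx = np.arange(N)
    np.random.shuffle(idx)
    idx = idx[:1000]
    image_sample = image_v[idx]
    model = KMeans(num_vq)  # number of clusters (default 8)
    model.fit(image_sample)  # fit on the 1000 sampled pixels
    c = model.predict(image_v)  # cluster assignment of every pixel
    print('Cluster assignments:\n', c)  # which cluster each pixel falls into
    print('Cluster centers:\n', model.cluster_centers_)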

    plt.figure(figsize=(12, 6), facecolor='w')
    plt.subplot(121)
    plt.axis('off')
    plt.title('Original image', fontsize=14)
    plt.imshow(image)
    # plt.savefig('1.png')

    plt.subplot(122)
    vq_image = restore_image(model.cluster_centers_, c, image.shape)
    plt.axis('off')
    plt.title('Image after vector quantization: %d colors' % num_vq, fontsize=14)
    plt.imshow(vq_image)
    # plt.savefig('2.png')

    plt.tight_layout(pad=2)
    plt.show()
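
restore_image fills the output pixel by pixel in a Python loop, which is slow for large images. One possible alternative, shown here as a sketch, uses NumPy fancy indexing to look up every pixel's center color at once. The codebook and cluster indices below are random stand-ins, and restore_image_fast is a hypothetical name, not part of the original script.

```python
import numpy as np

def restore_image_loop(cb, cluster, shape):
    # same logic as restore_image in the script
    row, col, _ = shape
    image = np.empty((row, col, 3))
    index = 0
    for r in range(row):
        for c in range(col):
            image[r, c] = cb[cluster[index]]
            index += 1
    return image

def restore_image_fast(cb, cluster, shape):
    # fancy indexing picks the center color of every pixel in one step
    row, col, _ = shape
    return cb[cluster].reshape(row, col, 3)

rng = np.random.default_rng(0)
cb = rng.random((4, 3))                    # hypothetical codebook: 4 cluster centers (RGB)
cluster = rng.integers(0, 4, size=6 * 5)   # hypothetical cluster index per pixel
shape = (6, 5, 3)
same = np.array_equal(restore_image_loop(cb, cluster, shape),
                      restore_image_fast(cb, cluster, shape))
print(same)
```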

3.AP (Affinity Propagation) clustering
    import numpy as np 
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import matplotlib.colors
    import sklearn.datasets as ds
    from sklearn.cluster import AffinityPropagation
    from sklearn.metrics import euclidean_distances  # Euclidean distance

if __name__ == '__main__':
    N=400
    centers = [[1,2],[-1,-1],[1,-1],[-1,1]]
    data, y = ds.make_blobs(N,n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    m = euclidean_distances(data, squared=True)
    preference = -np.median(m)
    print(preference)
    
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(12,9), facecolor='w')
    for i, mul in enumerate(np.linspace(1,4,9)):
        print(mul)  # multiplier
        p = mul*preference
        model = AffinityPropagation(affinity='euclidean', preference=p)
        af = model.fit(data)
        center_indices = af.cluster_centers_indices_  # indices of the exemplars (cluster centers)
        n_clusters = len(center_indices)
        print('mul = %.1f' % mul, 'preference =', p, 'number of clusters:', n_clusters)
        y_hat = af.labels_
        
        plt.subplot(3,3,i+1)
        plt.title('Preference %.2f, clusters: %d'%(p, n_clusters))
        clrs = []  # colors
        for c in np.linspace(16711680,255,n_clusters, dtype=int):
            clrs.append('#%06x'%c)
        for k, clr in enumerate(clrs):
            cur = (y_hat==k)
            plt.scatter(data[cur,0],data[cur,1], s=15, c=clr, edgecolors='none')
            center = data[center_indices[k]]
            for x in data[cur]:
                plt.plot([x[0], center[0]], [x[1], center[1]],color=clr, lw=0.5, zorder=1)
        plt.scatter(data[center_indices,0], data[center_indices,1], s=80, c=clrs, marker='*',edgecolors='k', zorder=2)
        plt.grid(True, ls=':')  # the `b` keyword is no longer accepted by current matplotlib
    plt.tight_layout()
    plt.suptitle('AP clustering',fontsize=20)
    plt.subplots_adjust(top=0.92)
    plt.show()
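
To make the role of `preference` more concrete, the sketch below builds the similarity matrix that Affinity Propagation works on when `affinity='euclidean'`: negative squared distances off the diagonal and the preference on the diagonal. `similarity_matrix` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np


def similarity_matrix(data, preference=None):
    """Negative squared Euclidean similarities with the preference on the diagonal."""
    diff = data[:, None, :] - data[None, :, :]
    s = -np.sum(diff ** 2, axis=2)
    if preference is None:
        # same heuristic as in the listing above: the median similarity
        preference = np.median(s)
    np.fill_diagonal(s, preference)
    return s, preference


points = np.array([[0.0, 0.0], [3.0, 4.0]])
s, pref = similarity_matrix(points)
print(pref)
print(s)
```

A more negative preference makes every point less willing to become an exemplar, so it typically yields fewer clusters, which is what the loop over multipliers explores.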

4. MeanShift clustering (the code follows the same pattern as AP)
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    import matplotlib.colors
    from sklearn.cluster import MeanShift
    from sklearn.metrics import euclidean_distances


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 7), facecolor='w')
    m = euclidean_distances(data, squared=True)
    bw = np.median(m)
    print(bw)
    for i, mul in enumerate(np.linspace(0.1, 0.4, 4)):
        band_width = mul * bw
        model = MeanShift(bin_seeding=True, bandwidth=band_width)  # bandwidth = radius of the kernel window
        ms = model.fit(data)
        centers = ms.cluster_centers_
        y_hat = ms.labels_
        n_clusters = np.unique(y_hat).size
        print('bandwidth:', mul, band_width, 'number of clusters:', n_clusters)

        plt.subplot(2, 2, i+1)
        plt.title('Bandwidth: %.2f, clusters: %d' % (band_width, n_clusters))
        clrs = []
        for c in np.linspace(16711680, 255, n_clusters, dtype=int):
            clrs.append('#%06x' % c)
        # clrs = plt.cm.Spectral(np.linspace(0, 1, n_clusters))
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            plt.scatter(data[cur, 0], data[cur, 1], c=clr, edgecolors='none')
        plt.scatter(centers[:, 0], centers[:, 1], s=150, c=clrs, marker='*', edgecolors='k')
        plt.grid(True, ls=':')
    plt.tight_layout(pad=2)
    plt.suptitle('MeanShift clustering', fontsize=15)
    plt.subplots_adjust(top=0.9)
    plt.show()
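
What the bandwidth controls is easier to see in a stripped-down version of the algorithm. The following is a minimal sketch with a flat kernel, assuming the starting point is itself a data point so the window is never empty; `mean_shift_point` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np


def mean_shift_point(x, data, bandwidth, n_iter=50, tol=1e-6):
    """Shift a single point x toward a density mode using a flat kernel."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # points inside the window of radius `bandwidth` around x
        inside = np.linalg.norm(data - x, axis=1) <= bandwidth
        new_x = data[inside].mean(axis=0)
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x


data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                 [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
mode_a = mean_shift_point(data[0], data, bandwidth=1.0)
mode_b = mean_shift_point(data[3], data, bandwidth=1.0)
print(mode_a, mode_b)
```

Points that converge to the same mode belong to the same cluster, so a larger bandwidth merges nearby modes and reduces the number of clusters.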
    
5. Hierarchical clustering
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph
    import sklearn.datasets as ds
    import warnings


def extend(a, b):
    return 1.05*a-0.05*b, 1.05*b-0.05*a


if __name__ == '__main__':
    warnings.filterwarnings(action='ignore', category=UserWarning)
    np.set_printoptions(suppress=True)  # print floats without scientific notation
    np.random.seed(0)
    n_clusters = 4
    N = 400
    data1, y1 = ds.make_blobs(n_samples=N, n_features=2, centers=((-1, 1), (1, 1), (1, -1), (-1, -1)),
                              cluster_std=(0.1, 0.2, 0.3, 0.4), random_state=0)
    data1 = np.array(data1)
    n_noise = int(0.1*N)  # 40 noise points
    r = np.random.rand(n_noise, 2)  # 40 uniform random points in 2-D
    data_min1, data_min2 = np.min(data1, axis=0)
    data_max1, data_max2 = np.max(data1, axis=0)
    r[:, 0] = r[:, 0] * (data_max1-data_min1) + data_min1
    r[:, 1] = r[:, 1] * (data_max2-data_min2) + data_min2
    data1_noise = np.concatenate((data1, r), axis=0)
    y1_noise = np.concatenate((y1, [4]*n_noise))

    data2, y2 = ds.make_moons(n_samples=N, noise=.05)  # two interleaving half-moon shapes
    data2 = np.array(data2)
    n_noise = int(0.1 * N)
    r = np.random.rand(n_noise, 2)
    data_min1, data_min2 = np.min(data2, axis=0)
    data_max1, data_max2 = np.max(data2, axis=0)
    r[:, 0] = r[:, 0] * (data_max1 - data_min1) + data_min1
    r[:, 1] = r[:, 1] * (data_max2 - data_min2) + data_min2
    data2_noise = np.concatenate((data2, r), axis=0)
    y2_noise = np.concatenate((y2, [3] * n_noise))

    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    cm = mpl.colors.ListedColormap(['r', 'g', 'b', 'm', 'c'])
    plt.figure(figsize=(10, 8), facecolor='w')
    plt.cla()
    linkages = ("ward", "complete", "average")
    for index, (n_clusters, data, y) in enumerate(((4, data1, y1), (4, data1_noise, y1_noise),
                                                   (2, data2, y2), (2, data2_noise, y2_noise))):
        plt.subplot(4, 4, 4*index+1)
        plt.scatter(data[:, 0], data[:, 1], c=y, s=12, edgecolors='k', cmap=cm)
        plt.title('Original', fontsize=12)
        plt.grid(True, ls=':')
        data_min1, data_min2 = np.min(data, axis=0)
        data_max1, data_max2 = np.max(data, axis=0)
        plt.xlim(extend(data_min1, data_max1))
        plt.ylim(extend(data_min2, data_max2))

        # Minkowski distance with p=2 is the Euclidean distance
        connectivity = kneighbors_graph(data, n_neighbors=7, mode='distance', metric='minkowski', p=2, include_self=True)
        connectivity = 0.5 * (connectivity + connectivity.T)
        for i, linkage in enumerate(linkages):
            # Euclidean distance is the default, so it is not passed explicitly
            # (newer scikit-learn releases renamed the `affinity` keyword to `metric`)
            ac = AgglomerativeClustering(n_clusters=n_clusters,
                                         connectivity=connectivity, linkage=linkage)
            ac.fit(data)
            y = ac.labels_
            plt.subplot(4, 4, i+2+4*index)
            plt.scatter(data[:, 0], data[:, 1], c=y, s=12, edgecolors='k', cmap=cm)
            plt.title(linkage, fontsize=12)
            plt.grid(True, ls=':')
            plt.xlim(extend(data_min1, data_max1))
            plt.ylim(extend(data_min2, data_max2))
    plt.suptitle('Hierarchical clustering with different linkage strategies', fontsize=15)
    plt.tight_layout(pad=0.5, rect=(0, 0, 1, 0.95))
    plt.show()
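
The three linkage strategies compared above differ only in how the distance between two clusters is measured. The sketch below computes these quantities for two small clusters; `linkage_distances` is a hypothetical helper, and the Ward value is the textbook merge cost (increase in within-cluster sum of squares), which is an assumption about scaling rather than a claim about scikit-learn's internal bookkeeping.

```python
import numpy as np


def linkage_distances(a, b):
    """Distance between clusters a and b under several linkage rules."""
    # all pairwise Euclidean distances between members of a and b
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    na, nb = len(a), len(b)
    centroid_gap = np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
    return {
        'complete': d.max(),                        # farthest pair
        'average': d.mean(),                        # mean over all pairs
        'ward': na * nb / (na + nb) * centroid_gap  # growth in within-cluster variance
    }


a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])
result = linkage_distances(a, b)
print(result)
```

At every step the agglomerative algorithm merges the pair of clusters with the smallest value under the chosen rule.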

6. Density-based clustering: DBSCAN
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    import matplotlib.colors
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # Parameters for dataset 1: (epsilon, min_samples)
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))

    # Dataset 2: three concentric circles
    # t = np.arange(0, 2*np.pi, 0.1)
    # data1 = np.vstack((np.cos(t), np.sin(t))).T
    # data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    # data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    # data = np.vstack((data1, data2, data3))
    # # # Parameters for dataset 2: (epsilon, min_samples)
    # params = ((0.5, 3), (0.5, 5), (0.5, 10), (1., 3), (1., 10), (1., 20))

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    plt.figure(figsize=(9, 7), facecolor='w')
    plt.suptitle('DBSCAN clustering', fontsize=15)

    for i in range(6):
        eps, min_samples = params[i]
        model = DBSCAN(eps=eps, min_samples=min_samples)
        model.fit(data)
        y_hat = model.labels_

        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[model.core_sample_indices_] = True

        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, 'number of clusters:', n_clusters)

        # clrs = []
        # for c in np.linspace(16711680, 255, y_unique.size):
        #     clrs.append('#%06x' % c)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        print(clrs)
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            if k == -1:
                plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')
                continue
            plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='k')
            plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1], s=30, c=clr, marker='o', edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.plot()
        plt.grid(True, ls=':', color='#606060')
        plt.title(r'$\epsilon$ = %.1f  m = %d, clusters: %d' % (eps, min_samples, n_clusters), fontsize=12)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
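
The two DBSCAN parameters define what a core point is: a point with at least `min_samples` points (itself included) within distance `eps`. A minimal NumPy sketch of that test, using a hypothetical helper `core_point_mask` and a dense distance matrix that is only practical for small data sets:

```python
import numpy as np


def core_point_mask(data, eps, min_samples):
    """Boolean mask of core points: at least min_samples neighbors within eps."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # the point itself counts as one of its own neighbors
    neighbor_counts = np.sum(d <= eps, axis=1)
    return neighbor_counts >= min_samples


points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
mask = core_point_mask(points, eps=0.2, min_samples=3)
print(mask)
```

Clusters are grown from core points; non-core points that are not within `eps` of any core point end up labeled -1 (noise), which is why the listing above plots label -1 in black.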

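7. HDBSCAN: a combination of hierarchical and density-based clustering; it is relatively insensitive to parameters such as eps and m and often gives better results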
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    import matplotlib.colors
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler
    import hdbscan


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # Parameters for dataset 1: (epsilon, min_samples)
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))

    # Dataset 2: three concentric circles
    # t = np.arange(0, 2*np.pi, 0.1)
    # data1 = np.vstack((np.cos(t), np.sin(t))).T
    # data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    # data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    # data = np.vstack((data1, data2, data3))
    # # # Parameters for dataset 2: (epsilon, min_samples)
    # params = ((0.5, 3), (0.5, 5), (0.5, 10), (1., 3), (1., 10), (1., 20))

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('HDBSCAN clustering', fontsize=16)

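    for i in range(6):
        eps, min_samples = params[i]
        # Note: the model below ignores eps and min_samples from params and uses fixed settings,
        # so all six panels show the same clustering; the two values only appear in the titles
        model = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=10)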
        model.fit(data)
        y_hat = model.labels_

        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[y_hat != -1] = True

        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, 'number of clusters:', n_clusters)

        # clrs = []
        # for c in np.linspace(16711680, 255, y_unique.size):
        #     clrs.append('#%06x' % c)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            # if k == -1:
            #     plt.scatter(data[cur, 0], data[cur, 1], s=20, c='k')
            #     continue

            plt.scatter(data[cur, 0], data[cur, 1], s=60*model.probabilities_[cur], marker='o', c=clr, edgecolors='k', alpha=0.9)
            #plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1], s=60, c=clr, marker='o', edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(True, ls=':', color='#808080')
        plt.title(r'$\epsilon$ = %.1f  m = %d, clusters: %d' % (eps, min_samples, n_clusters), fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
    
8. Spectral clustering (SC) and image processing
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.datasets as ds
    import matplotlib.colors
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics import euclidean_distances


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    t = np.arange(0, 2*np.pi, 0.1)
    data1 = np.vstack((np.cos(t), np.sin(t))).T
    data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    data = np.vstack((data1, data2, data3))

    n_clusters = 3
    m = euclidean_distances(data, squared=True)

    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('Spectral clustering', fontsize=16)
    clrs = plt.cm.Spectral(np.linspace(0, 0.8, n_clusters))
    for i, s in enumerate(np.logspace(-2, 0, 6)):
        print(s)
        # m already holds squared distances (squared=True), so m ** 2 is the fourth power of the distance
        af = np.exp(-m ** 2 / (s ** 2)) + 1e-6
        model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', assign_labels='kmeans', random_state=1)
        y_hat = model.fit_predict(af)
        plt.subplot(2, 3, i+1)
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            plt.scatter(data[cur, 0], data[cur, 1], s=40, c=clr, edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(True, ls=':', color='#808080')
        plt.title(r'$\sigma$ = %.2f' % s, fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
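
For comparison, the conventional Gaussian (RBF) affinity is exp(-d^2 / (2 * sigma^2)), where d is the plain Euclidean distance. The listing above uses a variant that squares the already-squared distances and omits the factor 2. A minimal sketch of the conventional form, with `gaussian_affinity` as a hypothetical helper:

```python
import numpy as np


def gaussian_affinity(data, sigma):
    """Dense RBF affinity matrix exp(-d^2 / (2 * sigma^2))."""
    diff = data[:, None, :] - data[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)   # squared Euclidean distances
    return np.exp(-dist2 / (2.0 * sigma ** 2))


points = np.array([[0.0, 0.0], [1.0, 0.0]])
affinity = gaussian_affinity(points, sigma=1.0)
print(affinity)
```

A matrix built this way can be passed to `SpectralClustering(affinity='precomputed')` exactly as `af` is in the listing; a small `sigma` keeps only very close points connected, a large one connects almost everything.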

Image processing:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image
from PIL import Image
import time


if __name__ == "__main__":
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    pic = Image.open('F:\\data\\Chrome.png')
    pic = pic.convert('L')
    data = np.array(pic).astype(float) / 255  # np.float was removed from NumPy; the built-in float is equivalent

    plt.figure(figsize=(10, 5), facecolor='w')
    plt.subplot(121)
    plt.imshow(pic, cmap=plt.cm.gray, interpolation='nearest')
    plt.title('Original image', fontsize=18)
    n_clusters = 15

    affinity = image.img_to_graph(data)
    beta = 3
    affinity.data = np.exp(-beta * affinity.data / affinity.data.std()) + 10e-5
    # a = affinity.toarray()
    # b = np.diag(a.diagonal())
    # a -= b
    print('Starting spectral clustering...')
    y = spectral_clustering(affinity, n_clusters=n_clusters, assign_labels='kmeans', random_state=1)
    print('Spectral clustering finished.')
    y = y.reshape(data.shape)
    for n in range(n_clusters):
        data[y == n] = n
    plt.subplot(122)
    clrs = []
    for c in np.linspace(16776960, 16711935, n_clusters, dtype=int):
        clrs.append('#%06x' % c)
    cm = matplotlib.colors.ListedColormap(clrs)
    plt.imshow(data, cmap=cm, interpolation='nearest')
    plt.title('Spectral clustering: %d clusters' % n_clusters, fontsize=18)
    plt.tight_layout()
    plt.show()

八、EM (expectation-maximization for Gaussian mixtures)

1. EM algorithm
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances_argmin


mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False


if __name__ == '__main__':
    style = 'myself'

    np.random.seed(0)
    mu1_fact = (0, 0, 0)
    cov1_fact = np.diag((1, 2, 3))
    data1 = np.random.multivariate_normal(mu1_fact, cov1_fact*0.3, 400)
    mu2_fact = (2, 2, 1)
    cov2_fact = np.array(((6, 1, 3), (1, 5, 1), (3, 1, 4)))
    data2 = np.random.multivariate_normal(mu2_fact, cov2_fact*0.1, 100)
    data = np.vstack((data1, data2))
    y = np.array([True] * 400 + [False] * 100)

    if style == 'sklearn':
        g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)
        g.fit(data)
        print('Class probability:\t', g.weights_[0])
        print('Means:\n', g.means_, '\n')
        print('Covariances:\n', g.covariances_, '\n')
        mu1, mu2 = g.means_
        sigma1, sigma2 = g.covariances_
    else:
        num_iter = 100
        n, d = data.shape
        # Alternative: random initialization
        # mu1 = np.random.standard_normal(d)
        # print(mu1)
        # mu2 = np.random.standard_normal(d)
        # print(mu2)
        mu1 = data.min(axis=0)
        mu2 = data.max(axis=0)
        sigma1 = np.identity(d)
        sigma2 = np.identity(d)
        pi = 0.5
        # EM
        for i in range(num_iter):
            # E Step
            norm1 = multivariate_normal(mu1, sigma1)
            norm2 = multivariate_normal(mu2, sigma2)
            tau1 = pi * norm1.pdf(data)
            tau2 = (1 - pi) * norm2.pdf(data)
            gamma = tau1 / (tau1 + tau2)

            # M Step
            mu1 = np.dot(gamma, data) / np.sum(gamma)
            mu2 = np.dot((1 - gamma), data) / np.sum((1 - gamma))
            sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)
            sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
            pi = np.sum(gamma) / n
            print(i, ":\t", mu1, mu2)
        print('Class probability:\t', pi)
        print('Means:\t', mu1, mu2)
        print('Covariances:\n', sigma1, '\n\n', sigma2, '\n')

    # Predict the class of each sample
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = norm1.pdf(data)
    tau2 = norm2.pdf(data)

    fig = plt.figure(figsize=(10, 5), facecolor='w')
    ax = fig.add_subplot(121, projection='3d')
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c='b', s=30, marker='o', edgecolors='k', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('Original data', fontsize=15)
    ax = fig.add_subplot(122, projection='3d')
    order = pairwise_distances_argmin([mu1_fact, mu2_fact], [mu1, mu2], metric='euclidean')
    print(order)
    if order[0] == 0:
        c1 = tau1 > tau2
    else:
        c1 = tau1 < tau2
    c2 = ~c1
    acc = np.mean(y == c1)
    print('Accuracy: %.2f%%' % (100*acc))
    ax.scatter(data[c1, 0], data[c1, 1], data[c1, 2], c='r', s=30, marker='o', edgecolors='k', depthshade=True)
    ax.scatter(data[c2, 0], data[c2, 1], data[c2, 2], c='g', s=30, marker='^', edgecolors='k', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('EM classification', fontsize=15)
    plt.suptitle('EM algorithm implementation', fontsize=18)
    plt.tight_layout()
    plt.subplots_adjust(top=0.90)  # applied after tight_layout so the suptitle keeps its space
    plt.show()
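
A useful sanity check on any hand-written EM loop is that the log-likelihood never decreases from one iteration to the next. The sketch below shows this on a simplified one-dimensional, two-component mixture. It is not the 3-D code above: `em_1d`, the synthetic data and the initialization are assumptions chosen for illustration.

```python
import numpy as np


def em_1d(x, n_iter=30):
    """EM for a two-component 1-D Gaussian mixture; also records the log-likelihood."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    history = []
    for _ in range(n_iter):
        # E step: weighted component densities, shape (n, 2)
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1)
        history.append(np.log(total).sum())   # log-likelihood under the current parameters
        gamma = dens / total[:, None]         # responsibilities
        # M step: responsibility-weighted means, variances and mixing weights
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return mu, var, w, np.array(history)


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(6.0, 1.0, 100)])
mu, var, w, history = em_1d(x)
print(mu, w)
```

If the recorded values ever drop by more than rounding error, the E or M step contains a bug.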
    
2. GMM parameter tuning
    import numpy as np
    from sklearn.mixture import GaussianMixture
    import matplotlib as mpl
    import matplotlib.colors
    import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False


def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d


def accuracy_rate(y1, y2):
    acc = np.mean(y1 == y2)
    return acc if acc > 0.5 else 1-acc


if __name__ == '__main__':
    np.random.seed(0)
    cov1 = np.diag((1, 2))
    print(cov1)
    N1 = 500
    N2 = 300
    N = N1 + N2
    x1 = np.random.multivariate_normal(mean=(1, 2), cov=cov1, size=N1)
    m = np.array(((1, 1), (1, 3)))
    x1 = x1.dot(m)
    x2 = np.random.multivariate_normal(mean=(-1, 10), cov=cov1, size=N2)
    x = np.vstack((x1, x2))
    y = np.array([0]*N1 + [1]*N2)

    types = ('spherical', 'diag', 'tied', 'full')
    err = np.empty(len(types))
    bic = np.empty(len(types))
    for i, cov_type in enumerate(types):  # renamed to avoid shadowing the built-in `type`
        gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
        gmm.fit(x)
        err[i] = 1 - accuracy_rate(gmm.predict(x), y)
        bic[i] = gmm.bic(x)
    print('Error rate:', err.ravel())
    print('BIC:', bic.ravel())
    xpos = np.arange(4)
    plt.figure(facecolor='w')
    ax = plt.axes()
    b1 = ax.bar(xpos-0.3, err, width=0.3, color='#77E0A0', edgecolor='k')
    b2 = ax.twinx().bar(xpos, bic, width=0.3, color='#FF8080', edgecolor='k')
    plt.grid(True, ls=':', color='#606060')
    bic_min, bic_max = expand(bic.min(), bic.max())
    plt.ylim((bic_min, bic_max))
    plt.xticks(xpos, types)
    plt.legend([b1[0], b2[0]], ('Error rate', 'BIC'))
    plt.title('Error rate and BIC for each covariance type', fontsize=15)
    plt.show()

    optimal = bic.argmin()
    gmm = GaussianMixture(n_components=2, covariance_type=types[optimal], random_state=0)
    gmm.fit(x)
    print('Means = \n', gmm.means_)
    print('Covariances = \n', gmm.covariances_)
    y_hat = gmm.predict(x)

    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    if gmm.means_[0][0] > gmm.means_[1][0]:
        z = grid_hat == 0
        grid_hat[z] = 1
        grid_hat[~z] = 0
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=30, c=y, marker='o', cmap=cm_dark, edgecolors='k')

    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('GMM tuning: covariance_type=%s' % types[optimal], fontsize=15)
    plt.grid(True, ls=':', color='#606060')
    plt.tight_layout(pad=2)
    plt.show()
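
The BIC used above to pick the covariance type is -2 ln L + k ln n, where L is the likelihood of the fitted model, k the number of free parameters and n the number of samples. The covariance types differ in k, which the sketch below counts; `gmm_n_parameters` and `bic` are hypothetical helpers written for illustration.

```python
import math


def gmm_n_parameters(n_components, n_features, covariance_type):
    """Number of free parameters of a Gaussian mixture model."""
    k, d = n_components, n_features
    cov_params = {
        'full': k * d * (d + 1) // 2,   # one symmetric matrix per component
        'diag': k * d,                  # one variance per feature and component
        'tied': d * (d + 1) // 2,       # a single shared symmetric matrix
        'spherical': k,                 # one variance per component
    }[covariance_type]
    mean_params = k * d
    weight_params = k - 1               # mixing weights sum to one
    return cov_params + mean_params + weight_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


counts = {t: gmm_n_parameters(2, 2, t) for t in ('spherical', 'diag', 'tied', 'full')}
print(counts)
```

A richer covariance type fits the data better but pays a larger penalty, so the lowest BIC marks the best trade-off rather than the best raw fit.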
    
3. GMM on the iris data
    import numpy as np
    import pandas as pd
    from sklearn.mixture import GaussianMixture
    import matplotlib as mpl
    import matplotlib.colors
    import matplotlib.pyplot as plt
    from sklearn.metrics.pairwise import pairwise_distances_argmin

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

iris_feature = 'Sepal length', 'Sepal width', 'Petal length', 'Petal width'


def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d


if __name__ == '__main__':
    path = 'F:\\data\\iris.data'
    data = pd.read_csv(path, header=None)
    x_prime = data[np.arange(4)]
    y = pd.Categorical(data[4]).codes

    n_components = 3
    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(8, 6), facecolor='w')
    for k, pair in enumerate(feature_pairs, start=1):
        x = x_prime[pair]
        m = np.array([np.mean(x[y == i], axis=0) for i in range(3)])  # true per-class means
        print('True means = \n', m)

        gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=0)
        gmm.fit(x)
        print('Estimated means = \n', gmm.means_)
        print('Estimated covariances = \n', gmm.covariances_)
        y_hat = gmm.predict(x)
        # order[i] is the index of the fitted component whose mean is closest to the true mean of class i
        order = pairwise_distances_argmin(m, gmm.means_, axis=1, metric='euclidean')
        print('Order:\t', order)

        n_sample = y.size
        n_types = 3
        change = np.empty((n_types, n_sample), dtype=bool)  # built-in bool works across NumPy versions
        for i in range(n_types):
            change[i] = y_hat == order[i]
        for i in range(n_types):
            y_hat[change[i]] = i
        acc = 'Accuracy: %.2f%%' % (100*np.mean(y_hat == y))
        print(acc)

        cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['r', 'g', '#6060FF'])
        x1_min, x2_min = x.min()
        x1_max, x2_max = x.max()
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]
        grid_test = np.stack((x1.flat, x2.flat), axis=1)
        grid_hat = gmm.predict(grid_test)

        change = np.empty((n_types, grid_hat.size), dtype=bool)  # built-in bool works across NumPy versions
        for i in range(n_types):
            change[i] = grid_hat == order[i]
        for i in range(n_types):
            grid_hat[change[i]] = i

        grid_hat = grid_hat.reshape(x1.shape)
        plt.subplot(2, 3, k)
        plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
        plt.scatter(x[pair[0]], x[pair[1]], s=20, c=y, marker='o', cmap=cm_dark, edgecolors='k')
        xx = 0.95 * x1_min + 0.05 * x1_max
        yy = 0.1 * x2_min + 0.9 * x2_max
        plt.text(xx, yy, acc, fontsize=10)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.xlabel(iris_feature[pair[0]], fontsize=11)
        plt.ylabel(iris_feature[pair[1]], fontsize=11)
        plt.grid(True, ls=':', color='#606060')
    plt.suptitle('Unsupervised classification of the iris data with EM', fontsize=14)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.95))
    plt.show()
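
Because a mixture model numbers its components arbitrarily, the listing above renames the predicted labels before computing accuracy. It first records boolean masks and only then overwrites, so a label that was just reassigned is not changed a second time. The same renaming can be sketched as a lookup table, assuming `order` is a permutation (every component is matched to exactly one class); `relabel` is a hypothetical helper.

```python
import numpy as np


def relabel(y_hat, order):
    """Rename cluster ids so that cluster order[i] becomes class i.

    Assumes `order` is a permutation of 0..k-1.
    """
    order = np.asarray(order)
    mapping = np.empty(len(order), dtype=int)
    mapping[order] = np.arange(len(order))   # mapping[cluster id] = class id
    return mapping[np.asarray(y_hat)]


# class 0 matches cluster 1, class 1 matches cluster 2, class 2 matches cluster 0
renamed = relabel([2, 2, 0, 1], order=[1, 2, 0])
print(renamed)
```

If two classes are matched to the same component, `order` is no longer a permutation and neither approach gives a meaningful accuracy.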
    
4.DPGMM
    import numpy as np
    from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
    import scipy as sp
    import scipy.linalg  # make sure the sp.linalg submodule is loaded
    import matplotlib as mpl
    import matplotlib.colors
    import matplotlib.pyplot as plt
    from matplotlib.patches import Ellipse


def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d


matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False


if __name__ == '__main__':
    np.random.seed(0)
    cov1 = np.diag((1, 2))
    N1 = 500
    N2 = 300
    N = N1 + N2
    x1 = np.random.multivariate_normal(mean=(3, 2), cov=cov1, size=N1)
    m = np.array(((1, 1), (1, 3)))
    x1 = x1.dot(m)
    x2 = np.random.multivariate_normal(mean=(-1, 10), cov=cov1, size=N2)
    x = np.vstack((x1, x2))
    y = np.array([0]*N1 + [1]*N2)
    n_components = 3

    # Used for plotting
    colors = '#A0FFA0', '#2090E0', '#FF8080'
    cm = mpl.colors.ListedColormap(colors)
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)

    plt.figure(figsize=(6, 6), facecolor='w')
    plt.suptitle('GMM vs. DPGMM', fontsize=15)

    ax = plt.subplot(211)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=0)
    gmm.fit(x)
    centers = gmm.means_
    covs = gmm.covariances_
    print('GMM means = \n', centers)
    print('GMM covariances = \n', covs)
    y_hat = gmm.predict(x)

    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm)
    plt.scatter(x[:, 0], x[:, 1], s=20, c=y, cmap=cm, marker='o', edgecolors='#202020')

    clrs = list('rgbmy')
    for i, (center, cov) in enumerate(zip(centers, covs)):
        value, vector = sp.linalg.eigh(cov)
        width, height = value[0], value[1]  # eigenvalues (variances) used directly as axis lengths
        v = vector[:, 0]  # eigh returns unit eigenvectors as columns; column 0 pairs with value[0]
        angle = np.degrees(np.arctan2(v[1], v[0]))
        e = Ellipse(xy=center, width=width, height=height,
                    angle=angle, color=clrs[i], alpha=0.5, clip_box=ax.bbox)
        ax.add_artist(e)

    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('GMM', fontsize=15)
    plt.grid(True, ls=':', color='#606060')

    # DPGMM
    dpgmm = BayesianGaussianMixture(n_components=n_components, covariance_type='full', max_iter=1000, n_init=5,
                                    weight_concentration_prior_type='dirichlet_process', weight_concentration_prior=0.1)
    dpgmm.fit(x)
    centers = dpgmm.means_
    covs = dpgmm.covariances_
    print('DPGMM means = \n', centers)
    print('DPGMM covariances = \n', covs)
    y_hat = dpgmm.predict(x)
    print(y_hat)

    ax = plt.subplot(212)
    grid_hat = dpgmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm)
    plt.scatter(x[:, 0], x[:, 1], s=20, c=y, cmap=cm, marker='o', edgecolors='#202020')

    for i, cc in enumerate(zip(centers, covs)):
        if i not in y_hat:  # skip components that no sample was assigned to
            continue
        center, cov = cc
        value, vector = sp.linalg.eigh(cov)
        width, height = value[0], value[1]  # eigenvalues (variances) used directly as axis lengths
        v = vector[:, 0]  # eigh returns unit eigenvectors as columns; column 0 pairs with value[0]
        angle = np.degrees(np.arctan2(v[1], v[0]))
        e = Ellipse(xy=center, width=width, height=height,
                    angle=angle, color='m', alpha=0.5, clip_box=ax.bbox)
        ax.add_artist(e)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('DPGMM', fontsize=15)
    plt.grid(True, ls=':', color='#606060')
    plt.tight_layout(pad=2, rect=(0, 0, 1, 0.95))
    plt.show()
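
The ellipses drawn above come from the eigen-decomposition of each covariance matrix: the eigenvectors give the axis directions and the eigenvalues the variances along them. The sketch below turns a 2x2 covariance into ellipse parameters. Here `covariance_ellipse` is a hypothetical helper, and scaling the axes to `n_std` standard deviations is an assumption that differs from the listing, which uses the eigenvalues themselves as axis lengths.

```python
import numpy as np


def covariance_ellipse(cov, n_std=2.0):
    """Return (width, height, angle in degrees) of the n_std ellipse of a 2x2 covariance."""
    values, vectors = np.linalg.eigh(cov)   # eigenvalues ascending, eigenvectors as columns
    major = vectors[:, -1]                  # direction of the largest variance
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    width = 2.0 * n_std * np.sqrt(values[-1])    # full length along the major axis
    height = 2.0 * n_std * np.sqrt(values[0])    # full length along the minor axis
    return width, height, angle


width, height, angle = covariance_ellipse(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(width, height, angle)
```

The three values map directly onto the `width`, `height` and `angle` arguments of `matplotlib.patches.Ellipse`.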
    

九、Time modules

import time
import datetime
time.time()  # current timestamp (seconds since the epoch)
time.localtime()  # struct_time with nine fields, in local time
time.gmtime(time.time())  # timestamp -> struct_time (UTC)
time.mktime(time.localtime())  # struct_time (local time) -> timestamp
time.strftime('%Y%m%d',time.localtime())  # struct_time -> string in the given format
time.strptime('20181016','%Y%m%d')  # string in the given format -> struct_time
'''
datetime.date: a date; common attributes are year, month, day.
datetime.time: a time of day; common attributes are hour, minute, second, microsecond.
datetime.datetime: a date combined with a time.
datetime.timedelta: a duration, i.e. the length of time between two points in time.
datetime.tzinfo: time zone information.
'''
datetime.datetime.now()  # the current date and time as a datetime object
datetime.datetime.now().strftime('%Y%m%d')  # datetime -> string via strftime
datetime.datetime.strptime('20181016',"%Y%m%d")  # formatted string -> datetime object
'''
datetime2 = datetime1 + timedelta  # add an interval to a datetime; returns a new datetime
datetime2 = datetime1 - timedelta  # subtract an interval from a datetime; returns a new datetime
timedelta = date1 - date2   # subtract two dates; returns a timedelta
datetime1 < datetime2  # compare two datetimes
'''
datetime.datetime.now() - datetime.timedelta(days = 7)  # the datetime seven days ago
# Signature: datetime.timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)
datetime.datetime.fromtimestamp(time.time())  # timestamp -> datetime object (local time)
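'''
Directive	Meaning	Example
%a	Abbreviated weekday name	Mon
%A	Full weekday name	Monday
%b	Abbreviated month name	Mar
%B	Full month name	March
%c	Locale's date and time representation	Mon May 20 16:00:02 2013
%d	Day of the month, range [01,31]	20
%H	Hour (24-hour clock), range [00,23]	17
%I	Hour (12-hour clock), range [01,12]	10
%j	Day of the year, range [001,366]	120
%m	Month as a decimal number, range [01,12]	05
%M	Minute, range [00,59]	50
%p	AM or PM	PM
%S	Second, range [00,61]	30
%U	Week number of the year (Sunday as the first day of the week; days before the first Sunday are in week 0), range [00,53]	20
%w	Weekday as a decimal number, range [0 (Sunday),6]	1
%W	Week number of the year (Monday as the first day of the week; days before the first Monday are in week 0), range [00,53]	20
%x	Locale's date representation	05/20/13
%X	Locale's time representation	16:00:02
%y	Year without century, range [00,99]	13
%Y	Year with century	2013
%Z	Time zone name	CST (China Standard Time)
%%	A literal % character	%
'''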
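
The conversions listed in this section chain together naturally. The short sketch below parses a date string, moves one week back, and converts between a datetime and a timestamp; the variable names are illustrative only.

```python
import datetime
import time

# string -> datetime -> one week earlier -> string
day = datetime.datetime.strptime('20181016', '%Y%m%d')
week_before = day - datetime.timedelta(days=7)
text = week_before.strftime('%Y%m%d')

# datetime -> timestamp -> datetime (both interpreted in local time)
stamp = time.mktime(day.timetuple())
restored = datetime.datetime.fromtimestamp(stamp)

# the difference between two datetimes is a timedelta
gap = day - week_before
print(text, restored, gap.days)
```

Both `time.mktime` and `datetime.fromtimestamp` work in local time, so the round trip returns the original datetime; mixing them with `time.gmtime` would shift the result by the UTC offset.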
