I. Common Machine Learning Tips
1. Label-encoding methods:
The first is pd.Categorical(data[4]).codes.
The second is LabelEncoder, which assigns integer codes following the sorted order of the unique values:
sklearn: first from sklearn.preprocessing import LabelEncoder
LabelEncoder().fit_transform(data[4])
The third, pd.factorize, maps nominal values to a set of integer codes (see the sketch below).
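A minimal sketch of the three approaches on a made-up toy column (the values are illustrative assumptions):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
s = pd.Series(['b', 'a', 'c', 'a'])
print(pd.Categorical(s).codes)           # [1 0 2 0] - codes follow the sorted category order
print(LabelEncoder().fit_transform(s))   # [1 0 2 0] - also sorted order
codes, uniques = pd.factorize(s)
print(codes, uniques)                    # [0 1 2 1] - codes follow order of first appearance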
2. One-hot encoding (OneHotEncoder): one-hot encoding makes distance computations between features more reasonable, and converting features to 0/1 also helps computation speed. If the number of categories is not too large, consider it first. When there are many categories, the feature space becomes very large; in that case PCA is commonly used to reduce the dimensionality, and the one-hot + PCA combination is very useful in practice (see the sketch below).
pd.get_dummies, OneHotEncoder
Several options in sklearn.preprocessing: OneHotEncoder, LabelEncoder, LabelBinarizer, MultiLabelBinarizer
https://blog.csdn.net/qq_40587575/article/details/81118610
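A minimal sketch of both APIs on a made-up column (assumes sklearn >= 0.20, where OneHotEncoder accepts strings; the sparse flag was later renamed sparse_output):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'color': ['red', 'green', 'red']})
print(pd.get_dummies(df, columns=['color']))    # color_green / color_red indicator columns
enc = OneHotEncoder(sparse=False)
print(enc.fit_transform(df[['color']]))         # the same indicators as a numpy array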
3. Data scaling: MinMaxScaler (sklearn): x = MinMaxScaler().fit_transform(x)
or from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
Another option, standardization, is in item 19.
4. Simple binning: pd.cut (equal-width) and pd.qcut (quantile-based) handle a continuous variable by discretizing it into bins, which can then be processed further, e.g. encoded (see the sketch below).
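A minimal sketch on made-up ages: pd.cut gives equal-width bins, pd.qcut gives roughly equal-count bins.
import pandas as pd
age = pd.Series([18, 22, 25, 31, 40, 58, 63])
print(pd.cut(age, bins=3, labels=[0, 1, 2]))    # equal-width bins over [18, 63]
print(pd.qcut(age, q=3, labels=[0, 1, 2]))      # tertiles: ~equal samples per bin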
5. Defined by the target variable y: when y is discrete the task is called classification; when continuous, regression.
6. Logistic regression is a generalized linear regression through a log-odds (sigmoid) link; despite the name, it solves classification problems.
7. Maximum-likelihood estimation amounts to minimizing the negative log-likelihood; that negative term is the objective function obj(θ), and since it is minimized it is also called the loss function. For logistic regression, obj(θ) = -Σ_i [y_i·log p_i + (1-y_i)·log(1-p_i)].
8. Overfitting: the model fits the training set too closely, so its predictions on other data are poor.
9. Setting the maximum display width per line (commonly used display options):
pd.set_option('display.width', 100)
np.set_printoptions(linewidth=100, suppress=True)
pd.set_option('expand_frame_repr', False)
10. Ignoring warnings (the pink warning block printed when calling a package):
import warnings
from statsmodels.tools.sm_exceptions import HessianInversionWarning
warnings.filterwarnings(action='ignore', category=HessianInversionWarning)
11. Shifting data (axis decides up/down vs. left/right): x.shift(periods=1), where periods defaults to 1 (sketch below).
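A minimal sketch with toy data: periods sets the step, axis the direction for a DataFrame.
import pandas as pd
s = pd.Series([1, 2, 3])
print(s.shift(periods=1))             # NaN, 1.0, 2.0 - values move down
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.shift(periods=1, axis=1))    # columns move right; column 'a' becomes NaN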
12. Bagging and boosting (ensemble learning):
https://blog.csdn.net/chenyukuai6625/article/details/73692347
13. Characteristics of and differences among GBDT, AdaBoost, and XGBoost (explained very well):
https://blog.csdn.net/chengfulukou/article/details/76906710
14. A decision tree is a weak classifier; a random forest is an equal-weight (averaged) ensemble of such trees. GBDT and XGBoost locate the model's shortcomings by computing gradients of the loss (the base learner can in principle be any of several regressors), while AdaBoost raises the weights of misclassified samples and lowers the weights of the less accurate base classifiers (sketch below).
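A minimal sketch putting the three flavors side by side on synthetic data (all parameter values here are arbitrary):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
x, y = make_classification(n_samples=500, random_state=0)
for model in (RandomForestClassifier(n_estimators=50),       # bagging: equal-weight vote of trees
              AdaBoostClassifier(n_estimators=50),           # boosting: reweights misclassified samples
              GradientBoostingClassifier(n_estimators=50)):  # boosting: fits the gradient of the loss
    model.fit(x, y)
    print(type(model).__name__, model.score(x, y))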
15. Saving a model (in newer scikit-learn versions, import joblib directly):
from sklearn.externals import joblib
model_file = 'F:\\data\\adult.pkl'
joblib.dump(model, model_file)
model = joblib.load(model_file)
16. Pruning:
https://blog.csdn.net/zfan520/article/details/82454814
17. Generating synthetic data:
import numpy as np
from scipy import stats
N = 100   # samples per cluster (N was undefined in the original snippet)
x = np.empty((4*N, 2))
means = [(-1, 1), (1, 1), (1, -1), (-1, -1)]
sigmas = [np.eye(2), 2*np.eye(2), np.diag((1, 2)), np.array(((3, 2), (2, 3)))]
for i in range(4):
mn = stats.multivariate_normal(means[i], sigmas[i]*0.1)   # the assignment to mn was missing
x[i*N:(i+1)*N, :] = mn.rvs(N)
or:
from sklearn.datasets import make_blobs
x, y = make_blobs(800, n_features=2, centers=means, cluster_std=(0.1, 0.2, 0.3, 0.4))
18. Randomizing, shuffling, and drawing random samples
N=100
test = np.arange(N)
np.random.shuffle(test)
Alternatively:
test=np.random.randint(0,N,size=1000)
19. Data standardization: https://blog.csdn.net/quiet_girl/article/details/72517053
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
20. Value counts for a column: data.Fare.value_counts()
21. Histogram and box plot
train_data['Age'].hist(bins=70)
train_data.boxplot(column='Age', showfliers=False)
22. Regular expressions (the re module): https://www.cnblogs.com/MrFiona/p/5954084.html
23. Differences among apply, map, and applymap
import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas import Series
df1= DataFrame({
"sales1":[-1,2,3],
"sales2":[3,-5,7],
})
df1
df1.apply(lambda x :x.max()-x.min(),axis=1)
24. ['Fare'] gives a Series while [['Fare']] gives a DataFrame; transform fills each element position with the value computed by the given function — below, the mean of the element's Pclass group (sketch after the example):
combined_data['Fare'] = combined_data[['Fare']].fillna(combined_data.groupby('Pclass').transform(np.mean))
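A minimal sketch of why this works, on made-up fares: transform returns a result aligned with the original index, so each NaN is filled with its own group's mean.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Pclass': [1, 1, 2, 2], 'Fare': [80.0, np.nan, 20.0, 10.0]})
group_mean = df.groupby('Pclass')['Fare'].transform('mean')   # 80, 80, 15, 15
df['Fare'] = df['Fare'].fillna(group_mean)                    # the NaN in class 1 becomes 80.0
print(df)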
25. Dropping a column: .drop(['Title_code'], axis=1, inplace=True)
26.heatmap
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10,as_cmap=True),square=True, ax=ax)
27. Displaying Chinese labels in matplotlib
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
28. Simple imputation of missing data:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
new_data = original_data.copy()
cols_with_missing = (col for col in new_data.columns
if new_data[col].isnull().any())
for col in cols_with_missing:
new_data[col + '_was_missing'] = new_data[col].isnull()
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data))
new_data.columns = original_data.columns
29. Returning a generic data type name:
import numbers
import six
def generic_type_name(v):
"""Return a descriptive type name that isn't Python specific. For example,
an int value will return 'integer' rather than 'int'."""
if isinstance(v, numbers.Integral):
return 'integer'
elif isinstance(v, numbers.Real):
return 'float'
elif isinstance(v, (tuple, list)):
return 'list'
elif isinstance(v, six.string_types):
return 'string'
elif v is None:
return 'null'
else:
return type(v).__name__
II. Time Series
1. Basic steps of time-series modeling
Obtain the time-series data of the system under observation;
Plot the data and check whether the series is stationary; a non-stationary series must first be differenced d times to make it stationary;
For the stationary series from step two, compute the autocorrelation function (ACF) and the partial autocorrelation function (PACF); analyzing the ACF and PACF plots gives the best orders p and q;
With the p, d, and q obtained above, build the ARIMA model, then run diagnostic checks on it (a sketch of steps 2-3 follows).
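A minimal sketch of steps 2-3 with statsmodels (x is assumed to be a pandas Series of the observations):
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
diff1 = x.diff(1).dropna()                   # first-order differencing
print('ADF p-value:', adfuller(diff1)[1])    # small p-value -> stationary
plot_acf(diff1, lags=30)                     # cut-off lag suggests q
plot_pacf(diff1, lags=30)                    # cut-off lag suggests p
plt.show()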
2. Time series: the whole workflow, described in great detail: https://www.jianshu.com/p/cced6617b423
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARIMA
import warnings
from statsmodels.tools.sm_exceptions import HessianInversionWarning
def extend(a,b):
return 1.05*a-0.05*b,1.05*b-0.05*a
def date_parser(date):
return pd.datetime.strptime(date,'%Y-%m')
if __name__ == '__main__':
warnings.filterwarnings(action='ignore', category=HessianInversionWarning)
pd.set_option('display.width',100)
np.set_printoptions(linewidth=100, suppress=True)
f = rf'F:\data\AirPassengers.csv'
data = pd.read_csv(f, header=0, parse_dates=['Month'], date_parser=date_parser, index_col=['Month'] )
data.rename(columns={'#Passengers':'Passengers'}, inplace=True)
x = data['Passengers'].astype(np.float)
x = np.log(x)
show = 'prime'
d = 1
diff = x - x.shift(periods=d)
ma = x.rolling(window=12).mean()
xma = x - ma
p = 2
q = 2
model = ARIMA(endog=x, order=(p, d, q))
arima = model.fit(disp=-1)
prediction = arima.fittedvalues
y = prediction.cumsum() + x[0]
mse = ((x-y)**2).mean()
rmse = np.sqrt(mse)
plt.figure(facecolor='w')
if show == 'diff':
plt.plot(x,'r-',lw=2,label='原始数据')
plt.plot(diff,'g-',lw=2,label='{}阶差分'.format(d))
title = '乘客人数变化曲线-取对数'
elif show == 'ma':
plt.plot(xma, 'g-', lw=2, label='ln原始数据 - ln滑动平均数据')
plt.plot(prediction, 'r-', lw=2, label='预测数据')
title = '滑动平均值与MA预测值'
else:
plt.plot(x, 'r-', lw=2, label='原始数据')
plt.plot(y, 'g-', lw=2, label='预测数据')
title = '对数乘客人数与预测值(AR=%d, d=%d, MA=%d):RMSE=%.4f' % (p, d, q, rmse)
plt.legend(loc='lower right')
plt.grid(b=True,ls=':')
plt.title(title,fontsize=16)
plt.tight_layout(2)
plt.show()
III. Decision Trees and Random Forests
1. Decision-tree and random-forest modeling (including graphviz usage)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pydotplus
if __name__ == "__main__":
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
iris_feature_E = 'sepal length', 'sepal width'
iris_feature = '花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度'
iris_class = 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
path = rf'F:\data\iris.data'
data = pd.read_csv(path, header=None)
x = data[[0,1]]
y = LabelEncoder().fit_transform(data[4])
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=1)
model = DecisionTreeClassifier(criterion='entropy')
model.fit(x_train, y_train)
y_test_hat = model.predict(x_test)
print('accuracy_score:', accuracy_score(y_test, y_test_hat))
with open('G:\Download\python\iris.dot', 'w') as f:
tree.export_graphviz(model, out_file=f)
dot_data = tree.export_graphviz(model, out_file=None, feature_names=iris_feature_E, class_names=iris_class,
filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf('G:\Download\python\iris.pdf')
f = open('G:\Download\python\iris.png', 'wb')
f.write(graph.create_png())
f.close()
N, M = 50, 50
x1_min, x2_min = x.min()
x1_max, x2_max = x.max()
t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, M)
x1, x2 = np.meshgrid(t1, t2)
x_show = np.stack((x1.flat, x2.flat), axis=1)
print(x_show.shape)
cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
y_show_hat = model.predict(x_show)
print(y_show_hat.shape)
print(y_show_hat)
y_show_hat = y_show_hat.reshape(x1.shape)
print(y_show_hat)
plt.figure(facecolor='w')
plt.pcolormesh(x1, x2, y_show_hat, cmap=cm_light)
plt.scatter(x_test[0], x_test[1], c=y_test.ravel(), edgecolors='k', s=100, zorder=10, cmap=cm_dark, marker='*')
plt.scatter(x[0], x[1], c=y.ravel(), edgecolors='k', s=20, cmap=cm_dark)
plt.xlabel(iris_feature[0], fontsize=13)
plt.ylabel(iris_feature[1], fontsize=13)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid(b=True, ls=':', color='#606060')
plt.title('鸢尾花数据的决策树分类', fontsize=15)
plt.show()
y_test = y_test.reshape(-1)
print(y_test_hat)
print(y_test)
result = (y_test_hat == y_test)
acc = np.mean(result)
print('准确度: %.2f%%' % (100 * acc))
depth = np.arange(1, 15)
err_list = []
for d in depth:
clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
clf.fit(x_train, y_train)
y_test_hat = clf.predict(x_test)
result = (y_test_hat == y_test)
err = 1 - np.mean(result)
err_list.append(err)
print(d, ' 错误率: %.2f%%' % (100 * err))
plt.figure(facecolor='w')
plt.plot(depth, err_list, 'ro-', markeredgecolor='k', lw=2)
plt.xlabel('决策树深度', fontsize=13)
plt.ylabel('错误率', fontsize=13)
plt.title('决策树深度与过拟合', fontsize=15)
plt.grid(b=True, ls=':', color='#606060')
plt.show()
2. Feature-pair combinations and contour plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
if __name__ == '__main__':
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
iris_feature = u'花萼长度', u'花萼宽度', u'花瓣长度', u'花瓣宽度'
path = rf'G:\Download\python\iris.data'
data = pd.read_csv(path, header=None)
x_prime = data[list(range(4))]
y = pd.Categorical(data[4]).codes
x_prime__train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7,random_state=0)
feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
plt.figure(figsize=(8,6), facecolor='#FFFFFF')
for i,pair in enumerate(feature_pairs):
x_train = x_prime__train[pair]
x_test = x_prime_test[pair]
model = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=3)
model.fit(x_train, y_train)
N,M=50,50
x1_min, x2_min = x_train.min()
x1_max, x2_max = x_train.max()
t1 = np.linspace(x1_min,x1_max,N)
t2 = np.linspace(x2_min,x2_max,M)
x1,x2 = np.meshgrid(t1,t2)
x_show = np.stack((x1.flat,x2.flat),axis=1)
y_train_pred = model.predict(x_train)
acc_train = accuracy_score(y_train, y_train_pred)
y_test_pred = model.predict(x_test)
acc_test = accuracy_score(y_test, y_test_pred)
print('特征',iris_feature[pair[0]],iris_feature[pair[1]])
print(acc_train,acc_test)
cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
y_hat = model.predict(x_show)
y_hat = y_hat.reshape(x1.shape)
plt.subplot(2,3,i+1)
plt.contour(x1,x2,y_hat,colors='k', levels=[0,1], antialiased=True, linewidths=1)
plt.pcolormesh(x1,x2,y_hat,cmap=cm_light)
plt.scatter(x_train[pair[0]],x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark, label='训练集')
plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=80, marker='*', edgecolors='k', cmap=cm_dark, label=u'测试集')
plt.xlabel(iris_feature[pair[0]], fontsize=12)
plt.ylabel(iris_feature[pair[1]], fontsize=12)
plt.legend(loc='upper right', fancybox=True, framealpha=0.3)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid(b=True, ls=':', color='#606060')
plt.suptitle(u'决策树对鸢尾花数据两特征组合的分类结果', fontsize=15)
plt.tight_layout(1, rect=(0, 0, 1, 0.94))
plt.show()
3. Decision-tree regression
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
if __name__ == '__main__':
N=100
x = np.random.rand(N)*6-3
x.sort()
y = np.sin(x) + np.random.rand(N)*0.05
print(y)
x = x.reshape(-1,1)
dt = DecisionTreeRegressor(criterion='mse', max_depth=9)
dt.fit(x,y)
x_test = np.linspace(-3,3,100).reshape(-1,1)
y_hat = dt.predict(x_test)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.plot(x, y, 'r*', markersize=10, markeredgecolor='k', label='实际值')
plt.plot(x_test,y_hat, 'g-', linewidth=2, label='预测')
plt.legend(loc='upper left',fontsize=12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(b=True, ls=':', color='#606060')
plt.title('决策树回归', fontsize=15)
plt.tight_layout(2)
plt.show()
depth = [2,4,6,8,10]
clr = 'rgbmy'
dtr = DecisionTreeRegressor(criterion='mse')
plt.figure(facecolor='w')
plt.plot(x,y,'ro',ms=5, mec='k', label='实际值')
x_test = np.linspace(-3,3,100).reshape(-1,1)
for d, c in zip(depth, clr):
dtr.set_params(max_depth=d)
dtr.fit(x,y)
y_hat = dtr.predict(x_test)
print(mean_squared_error(y,y_hat))
plt.plot(x_test, y_hat, '-', color=c, lw=2, mec='k', label='Depth={}'.format(d))
plt.legend(loc='upper left', fontsize=12)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(b=True, ls=':', color='#606060')
plt.title('决策树回归', fontsize=15)
plt.tight_layout(2)
plt.show()
4. Multi-output decision-tree regression
N=400
x = np.random.rand(N) * 8 - 4      # earlier variant, overwritten by the next line
x = np.random.rand(N) * 4*np.pi
x.sort()
print(x.shape)
print('====================')
y1 = 16 * np.sin(x) ** 3 + np.random.randn(N)*0.5
y2 = 13 * np.cos(x) - 5 * np.cos(2*x) - 2 * np.cos(3*x) - np.cos(4*x) + np.random.randn(N)*0.5
np.set_printoptions(suppress=True)
y = np.vstack((y1, y2)).T
print(y1.shape)
print(y2.shape)
print(y.shape)
data = np.vstack((x,y1,y2)).T
print(data.shape)
x = x.reshape(-1,1)
deep=10
reg = DecisionTreeRegressor(criterion='mse',max_depth=deep)
dt = reg.fit(x,y)
x_test = np.linspace(x.min(), x.max(), num=1000).reshape(-1,1)
print(x_test.shape)
y_hat = reg.predict(x_test)
print(y_hat.shape)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.scatter(y[:,0],y[:,1], c='r', marker='s', edgecolors='k', s=60, label='真实值', alpha=0.8)
plt.scatter(y_hat[:,0],y_hat[:,1], c='g', marker='o', edgecolor='k', edgecolors='g', s=30, label='预测值', alpha=0.8)
plt.legend(loc='lower left', fancybox=True, fontsize=12)
plt.xlabel('$Y_1$', fontsize=12)
plt.ylabel('$Y_2$', fontsize=12)
plt.grid(b=True,ls=':',color='#606060')
plt.title('决策树多输出回归',fontsize=15)
plt.tight_layout(2)
plt.show()
5. Random forest on each pair of iris features
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
if __name__ == "__main__":
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
iris_feature = u'花萼长度', u'花萼宽度', u'花瓣长度', u'花瓣宽度'
path = rf'F:\data\iris.data'
data = pd.read_csv(path, header=None)
x_prime = data[list(range(4))]
y = pd.Categorical(data[4]).codes
x_prime_train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7, random_state=0)
feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
plt.figure(figsize=(8, 6), facecolor='#FFFFFF')
for i, pair in enumerate(feature_pairs):
x_train = x_prime_train[pair]
x_test = x_prime_test[pair]
model = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=5, oob_score=True)
model.fit(x_train, y_train)
N, M = 500, 500
x1_min, x2_min = x_train.min()
x1_max, x2_max = x_train.max()
t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, M)
x1, x2 = np.meshgrid(t1, t2)
x_show = np.stack((x1.flat, x2.flat), axis=1)
y_train_pred = model.predict(x_train)
acc_train = accuracy_score(y_train, y_train_pred)
y_test_pred = model.predict(x_test)
acc_test = accuracy_score(y_test, y_test_pred)
print('特征:', iris_feature[pair[0]], ' + ', iris_feature[pair[1]])
print('OOB Score:', model.oob_score_)
print('\t训练集准确率: %.4f%%' % (100*acc_train))
print('\t测试集准确率: %.4f%%\n' % (100*acc_test))
cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
y_hat = model.predict(x_show)
y_hat = y_hat.reshape(x1.shape)
plt.subplot(2, 3, i+1)
plt.contour(x1, x2, y_hat, colors='k', levels=[0, 1], antialiased=True, linestyles='--', linewidths=1)
plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)
plt.scatter(x_train[pair[0]], x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark, label='训练集')
plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=100, marker='*', edgecolors='k', cmap=cm_dark, label='测试集')
plt.xlabel(iris_feature[pair[0]], fontsize=12)
plt.ylabel(iris_feature[pair[1]], fontsize=12)
plt.legend(loc='upper right', fancybox=True, framealpha=0.3)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid(b=True, ls=':', color='#606060')
plt.suptitle('随机森林对鸢尾花数据两特征组合的分类结果', fontsize=15)
plt.tight_layout(1, rect=(0, 0, 1, 0.95))
plt.show()
6. Boston housing model (random forest)
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
import sklearn.datasets
from pprint import pprint
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import warnings
def not_empty(s):
return s != ''
if __name__ == "__main__":
data = sklearn.datasets.load_boston()
x = np.array(data.data)
y = np.array(data.target)
print('样本个数:%d, 特征个数:%d' % x.shape)
print(y.shape)
y = y.ravel()
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
model = RandomForestRegressor(n_estimators=50, criterion='mse')
print('开始建模...')
model.fit(x_train, y_train)
order = y_test.argsort(axis=0)
y_test = y_test[order]
x_test = x_test[order, :]
y_pred = model.predict(x_test)
r2 = model.score(x_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print('R2:', r2)
print('均方误差:', mse)
t = np.arange(len(y_pred))
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.plot(t, y_test, 'r-', lw=2, label='真实值')
plt.plot(t, y_pred, 'g-', lw=2, label='估计值')
plt.legend(loc='best')
plt.title('波士顿房价预测', fontsize=18)
plt.xlabel('样本编号', fontsize=15)
plt.ylabel('房屋价格', fontsize=15)
plt.grid()
plt.show()
IV. Regression comparison: Ridge, decision tree, bagged Ridge, bagged tree
1. Regression comparison of Ridge, DT, bagging-Ridge, and bagging-DT
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
def f(x):
return 0.5*np.exp(-(x+3)**2) + np.exp(-x**2) + 1.5*np.exp(-(x-3)**2)
if __name__ == '__main__':
np.random.seed(0)
N=200
x = np.random.rand(N)*10-5
x = np.sort(x)
y = f(x)+0.05*np.random.rand(N)
x.shape = -1,1
degree=6
n_estimators=50
max_samples=0.5
ridge = RidgeCV(alphas=np.logspace(-3,2,20), fit_intercept=False)
ridged = Pipeline([('poly',PolynomialFeatures(degree=degree)),('Ridge',ridge)])
bagging_ridged = BaggingRegressor(ridged, n_estimators=n_estimators, max_samples=max_samples)
dtr = DecisionTreeRegressor(max_depth=9)
regs = [
('DecisionTree', dtr),
('Ridge(%d Degree)' % degree, ridged),
('Bagging Ridge(%d Degree)' % degree, bagging_ridged),
('Bagging DecisionTree', BaggingRegressor(dtr, n_estimators=n_estimators, max_samples=max_samples))]
x_test = np.linspace(1.1*x.min(), 1.1*x.max(), 1000)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(8,6), facecolor='w')
plt.plot(x,y,'ro', mec='k',label='训练数据')
plt.plot(x_test, f(x_test), color='k', lw=3, ls='-', label='真实值')
clrs = '#FF2020', 'm', 'y', 'g'
for i,(name,reg) in enumerate(regs):
reg.fit(x,y)
label = '%s, $R^2$=%.3f' % (name, reg.score(x, y))
y_test = reg.predict(x_test.reshape(-1,1))
plt.plot(x_test, y_test, color=clrs[i], lw=(i+1)*0.5, label=label, zorder=6-i)
plt.legend(loc='upper left', fontsize=11)
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.title('回归曲线拟合:samples_rate(%.1f), n_trees(%d)' % (max_samples, n_estimators), fontsize=15)
plt.ylim((-0.2, 1.1*y.max()))
plt.tight_layout(2)
plt.grid(b=True, ls=':', color='#606060')
plt.show()
V. Comparing XGBoost, LR, RF, and AdaBoost
1. XGBoost binary classification
import xgboost as xgb
import numpy as np
from sklearn.tree import DecisionTreeClassifier
def g_h(y_hat, y):
p = 1.0 / (1.0 + np.exp(-y_hat))
g = p - y.get_label()
h = p * (1.0-p)
return g, h
def error_rate(y_hat, y):
return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)
if __name__ == "__main__":
data_train = xgb.DMatrix(rf'F:\data\agaricus_train.txt')
data_test = xgb.DMatrix(rf'F:\data\agaricus_test.txt')
print (data_train)
print (type(data_train))
param = {'max_depth': 3, 'eta': 0.4, 'silent': 1, 'objective': 'binary:logistic'}
watchlist = [(data_test, 'eval'), (data_train, 'train')]
n_round = 3
bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist)
y_hat = bst.predict(data_test)
y = data_test.get_label()
print (y_hat)
print (y)
error = sum(y != (y_hat > 0.5))
error_rate = float(error) / len(y_hat)
print ('样本总数:\t', len(y_hat))
print ('错误数目:\t%4d' % error)
print ('错误率:\t%.5f%%' % (100*error_rate))
2. XGBoost three-class iris classification, compared with LR and RF
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
if __name__ == '__main__':
path = rf'F:\data\iris.data'
data = pd.read_csv(path,header=None)
x, y = data[list(range(0,4))], data[4]
y = pd.Categorical(y).codes
x = MinMaxScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=50, random_state=1)
data_train = xgb.DMatrix(x_train, label=y_train)
data_test = xgb.DMatrix(x_test, label=y_test)
watch_list = [(data_test,'eval'), (data_train,'train')]
param = {'max_depth':4, 'eta':0.3,'silent':1, 'objective':'multi:softmax', 'num_class':3}
bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
y_hat = bst.predict(data_test)
result = y_test == y_hat
print('正确率:{}'.format(float(np.sum(result))/len(y_hat)))
print('==========')
models = [('LogisticRegression',LogisticRegressionCV(Cs=10, cv=5)),('RandomForest',RandomForestClassifier(n_estimators=30, criterion='gini'))]
for name, model in models:
model.fit(x_train, y_train)
print(name, accuracy_score(y_train, model.predict(x_train)))
print(name, accuracy_score(y_test, model.predict(x_test)))
3. XGBoost, LR, and RF on the three-class wine data
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
if __name__ == '__main__':
path = rf'F:\data\wine.data'
data = pd.read_csv(path, header=None)
y, x = data[0], data[list(range(1,data.shape[1]))]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)
lr = LogisticRegression(penalty='l2')
lr.fit(x_train,y_train)
y_hat = lr.predict(x_test)
print('lr',accuracy_score(y_test,y_hat))
rf = RandomForestClassifier(n_estimators=30, criterion='gini', max_depth=8, min_samples_leaf=3)
rf.fit(x_train,y_train)
y_train_pred = rf.predict(x_train)
y_test_pred = rf.predict(x_test)
print('RF训练集',accuracy_score(y_train,y_train_pred))
print('RF测试集',accuracy_score(y_test,y_test_pred))
y_train[y_train==3] = 0
y_test[y_test==3] = 0
data_train = xgb.DMatrix(x_train, label=y_train)
data_test = xgb.DMatrix(x_test, label=y_test)
watch_list = [(data_test,'eval'),(data_train,'train')]
param = {'max_depth':3,'eta':1, 'silent':0, 'objective':'multi:softmax','num_class':3}
bst = xgb.train(param, data_train,num_boost_round=2, evals=watch_list)
y_hat = bst.predict(data_test)
print('xgb',accuracy_score(y_hat,y_test))
4. Titanic passenger survival, including median and random-forest imputation of missing values
import csv
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
def show_accuracy(a, b, tip):
acc = a.ravel() == b.ravel()
acc_rate = 100 * float(acc.sum()) / a.size
print('%s正确率:%.3f%%' % (tip, acc_rate))
return acc_rate
def load_data(file_name, is_train):
data = pd.read_csv(file_name)
pd.set_option('display.width',200)
print('data.describe() = \n',data.describe())
data['Sex'] = pd.Categorical(data['Sex']).codes
if len(data.Fare[data.Fare == 0]) > 0:
fare = np.zeros(3)
for f in range(0, 3):
fare[f] = data[data['Pclass'] == f + 1]['Fare'].dropna().median()
print('='*30)
print(fare)
print('='*30)
for f in range(0, 3):
data.loc[(data.Fare == 0) & (data.Pclass == f + 1), 'Fare'] = fare[f]
print('data.describe() = \n', data.describe())
if is_train:
print('随机森林预测缺失年龄:--start--')
data_for_age = data[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_exist = data_for_age.loc[(data.Age.notnull())]
age_null = data_for_age.loc[(data.Age.isnull())]
print(age_exist)
x = age_exist.values[:, 1:]
y = age_exist.values[:, 0]
rfr = RandomForestRegressor(n_estimators=20)
rfr.fit(x, y)
age_hat = rfr.predict(age_null.values[:, 1:])
data.loc[(data.Age.isnull()), 'Age'] = age_hat
print('随机森林预测缺失年龄:--over--')
else:
print('随机森林预测缺失年龄2:--start--')
data_for_age = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_exist = data_for_age.loc[(data.Age.notnull())]
age_null = data_for_age.loc[(data.Age.isnull())]
x = age_exist.values[:, 1:]
y = age_exist.values[:, 0]
rfr = RandomForestRegressor(n_estimators=1000)
rfr.fit(x, y)
age_hat = rfr.predict(age_null.values[:, 1:])
data.loc[(data.Age.isnull()), 'Age'] = age_hat
print('随机森林预测缺失年龄2:--over--')
data['Age'] = pd.cut(data['Age'], bins=6, labels=np.arange(6))
data.loc[(data.Embarked.isnull()), 'Embarked'] = 'S'
embarked_data = pd.get_dummies(data.Embarked)
print('embarked_data = ', embarked_data)
embarked_data = embarked_data.rename(columns=lambda x: 'Embarked_' + str(x))
data = pd.concat([data, embarked_data], axis=1)
print(data.describe())
data.to_csv('F:\\data\\New_Data.csv')
x = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']]
y = None
if 'Survived' in data:
y = data['Survived']
x = np.array(x)
y = np.array(y)
x = np.tile(x, (5, 1))
y = np.tile(y, (5, ))
if is_train:
return x, y
return x, data['PassengerId']
def write_result(c, c_type):
file_name = rf'F:\data\Titanic.test.csv'
x, passenger_id = load_data(file_name, False)
if c_type == 3:   # was "type == 3", which compared against the builtin type
x = xgb.DMatrix(x)
y = c.predict(x)
y[y > 0.5] = 1
y[~(y > 0.5)] = 0
predictions_file = open("Prediction_%d.csv" % c_type, "w", newline='')   # csv.writer needs text mode in Python 3
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(list(zip(passenger_id, y)))
predictions_file.close()
x, y = load_data('F:\\data\\Titanic.train.csv', True)
print('x = ', x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
lr = LogisticRegression(penalty='l2')
lr.fit(x_train, y_train)
y_hat = lr.predict(x_test)
lr_acc = accuracy_score(y_test, y_hat)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
y_hat = rfc.predict(x_test)
rfc_acc = accuracy_score(y_test, y_hat)
data_train = xgb.DMatrix(x_train, label=y_train)
data_test = xgb.DMatrix(x_test, label=y_test)
watch_list = [(data_test, 'eval'), (data_train, 'train')]
param = {'max_depth': 6, 'eta': 0.8, 'silent': 1, 'objective': 'binary:logistic'}
bst = xgb.train(param, data_train, num_boost_round=20, evals=watch_list)
y_hat = bst.predict(data_test)
y_hat[y_hat > 0.5] = 1
y_hat[~(y_hat > 0.5)] = 0
xgb_acc = accuracy_score(y_test, y_hat)
print('Logistic回归:%.3f%%' % (100*lr_acc))
print('随机森林:%.3f%%' % (100*rfc_acc))
print('XGBoost:%.3f%%' % (100*xgb_acc))
5. AdaBoost on each pair of iris features
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
if __name__ == "__main__":
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
iris_feature = '花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度'
path = 'F:\\data\\iris.data'
data = pd.read_csv(path, header=None)
x_prime = data[list(range(4))]
y = pd.Categorical(data[4]).codes
x_prime_train, x_prime_test, y_train, y_test = train_test_split(x_prime, y, train_size=0.7, random_state=0)
feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
plt.figure(figsize=(11, 8), facecolor='#FFFFFF')
for i, pair in enumerate(feature_pairs):
x_train = x_prime_train[pair]
x_test = x_prime_test[pair]
base_estimator = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=4)
model = AdaBoostClassifier(base_estimator=base_estimator, n_estimators=10, learning_rate=0.1)
model.fit(x_train, y_train)
M,N = 500,500
x1_min, x2_min = x_train.min()
x1_max, x2_max = x_train.max()
t1 = np.linspace(x1_min, x1_max, M)
t2 = np.linspace(x2_min, x2_max, N)
x1, x2 = np.meshgrid(t1,t2)
x_show = np.stack((x1.flat, x2.flat), axis=1)
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print('特征:', iris_feature[pair[0]], ' + ', iris_feature[pair[1]])
print('\t训练集准确率: %.4f%%' % (100*acc_train))
print('\t测试集准确率: %.4f%%\n' % (100*acc_test))
cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
y_hat = model.predict(x_show)
y_hat = y_hat.reshape(x1.shape)
plt.subplot(2, 3, i+1)
plt.contour(x1, x2, y_hat, colors='k', levels=[0, 1], antialiased=True, linestyles='--', linewidths=1.5)
plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)
plt.scatter(x_train[pair[0]], x_train[pair[1]], c=y_train, s=20, edgecolors='k', cmap=cm_dark)
plt.scatter(x_test[pair[0]], x_test[pair[1]], c=y_test, s=100, marker='*', edgecolors='k', cmap=cm_dark)
plt.xlabel(iris_feature[pair[0]], fontsize=14)
plt.ylabel(iris_feature[pair[1]], fontsize=14)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid(b=True)
plt.suptitle('Adaboost对鸢尾花数据两特征组合的分类结果', fontsize=18)
plt.tight_layout(1, rect=(0, 0, 1, 0.95))
plt.show()
6. AdaBoost model persistence and evaluation
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.externals import joblib
import os
if __name__ == '__main__':
pd.set_option('display.width', 400)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_columns', 70)
column_names = 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', \
'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
model_file = 'G:\\data\\adult.pkl'
show_result = True   # must be set before the branch; the original set it only in the else-branch, causing a NameError when the saved model is loaded
if os.path.exists(model_file):
model = joblib.load(model_file)
else:
print('读入数据')
data = pd.read_csv('G:\\data\\adult.data',header=None, names=column_names)
for name in data.columns:
data[name] = pd.Categorical(data[name]).codes
x = data[data.columns[:-1]]
y = data[data.columns[-1]]
x_train, x_valid, y_train, y_valid = train_test_split(x,y,test_size=0.3, random_state=0)
print(y_train.mean())
print (y_valid.mean())
base_estimator = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=5)
model = AdaBoostClassifier(base_estimator=base_estimator, n_estimators=50, learning_rate=0.1)
model.fit(x_train,y_train)
joblib.dump(model, model_file)
if show_result:
y_train_pred = model.predict(x_train)
print('训练集准确率:', accuracy_score(y_train, y_train_pred))
print('\t训练集查准率:', precision_score(y_train, y_train_pred))
print('\t训练集召回率:', recall_score(y_train, y_train_pred))
print('\t训练集F1:', f1_score(y_train, y_train_pred))
y_valid_pred = model.predict(x_valid)
print('验证集准确率:', accuracy_score(y_valid, y_valid_pred))
print('\t验证集查准率:', precision_score(y_valid, y_valid_pred))
print('\t验证集召回率:', recall_score(y_valid, y_valid_pred))
print('\t验证集F1:', f1_score(y_valid, y_valid_pred))
data_test = pd.read_csv('G:\\data\\adult.test', header=None, skiprows=1, names=column_names)
for name in data_test.columns:
data_test[name] = pd.Categorical(data_test[name]).codes
x_test = data_test[data_test.columns[:-1]]
y_test = data_test[data_test.columns[-1]]
y_test_pred = model.predict(x_test)
print('测试集准确率:', accuracy_score(y_test, y_test_pred))
print('\t测试集查准率:', precision_score(y_test, y_test_pred))
print('\t测试集召回率:', recall_score(y_test, y_test_pred))
print('\t测试集F1:', f1_score(y_test, y_test_pred))
y_test_proba = model.predict_proba(x_test)
print (y_test_proba)
y_test_proba = y_test_proba[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_proba)
auc = metrics.roc_auc_score(y_test, y_test_proba)   # keep the value; it is used in the legend label below
print('AUC = ', auc)
mpl.rcParams['font.sans-serif'] = 'SimHei'
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.plot(fpr, tpr, 'r-', lw=2, alpha=0.8, label='AUC=%.3f' % auc)
plt.plot((0, 1), (0, 1), c='b', lw=1.5, ls='--', alpha=0.7)
plt.xlim((-0.01, 1.02))
plt.ylim((-0.01, 1.02))
plt.xticks(np.arange(0, 1.1, 0.1))
plt.yticks(np.arange(0, 1.1, 0.1))
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.grid(b=True)
plt.legend(loc='lower right', fancybox=True, framealpha=0.8, fontsize=14)
plt.title('Adult数据的ROC曲线和AUC值', fontsize=17)
plt.show()
VI. SVM
1. Three-class iris classification with SVM (using two features)
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
if __name__ == '__main__':
iris_feature = '花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度'
path = 'F:\\data\\iris.data'
data = pd.read_csv(path, header=None)
x,y = data[[0,1]], pd.Categorical(data[4]).codes
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.4, random_state=1)
clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr', probability=True)
clf.fit(x_train,y_train.ravel())
print(clf.decision_function(x))
print(clf.score(x_train,y_train))
print('训练集准确率:', accuracy_score(y_train, clf.predict(x_train)))
print(clf.score(x_test, y_test))
print('测试集准确率:', accuracy_score(y_test, clf.predict(x_test)))
print(x_train[:5])
print('decision_function:\n', clf.decision_function(x_train))
print('\npredict:\n', clf.predict(x_train))
x1_min, x2_min = x.min()
x1_max, x2_max = x.max()
x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
grid_test = np.stack((x1.flat, x2.flat), axis=1)
grid_hat = clf.predict(grid_test)
grid_hat = grid_hat.reshape(x1.shape)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
plt.figure(facecolor='w')
plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)
plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', zorder=10)
plt.xlabel(iris_feature[0], fontsize=13)
plt.ylabel(iris_feature[1], fontsize=13)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title('鸢尾花SVM二特征分类', fontsize=16)
plt.grid(b=True, ls=':')
plt.tight_layout(pad=1.5)
plt.show()
2. SVM decision regions for different kernels and parameters
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt
if __name__ == "__main__":
data = pd.read_csv('F:\\data\\bipartition.txt', sep='\t', header=None)
x, y = data[[0, 1]], data[2]
clf_param = (('linear', 0.1), ('linear', 0.5), ('linear', 1), ('linear', 2),
('rbf', 1, 0.1), ('rbf', 1, 1), ('rbf', 1, 10), ('rbf', 1, 100),
('rbf', 5, 0.1), ('rbf', 5, 1), ('rbf', 5, 10), ('rbf', 5, 100))
x1_min, x2_min = np.min(x, axis=0)
x1_max, x2_max = np.max(x, axis=0)
x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]
grid_test = np.stack((x1.flat, x2.flat), axis=1)
cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FFA0A0'])
cm_dark = mpl.colors.ListedColormap(['g', 'r'])
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(13, 9), facecolor='w')
for i, param in enumerate(clf_param):
clf = svm.SVC(C=param[1], kernel=param[0])
if param[0] == 'rbf':
clf.gamma = param[2]
title = '高斯核,C=%.1f,$\gamma$ =%.1f' % (param[1], param[2])
else:
title = '线性核,C=%.1f' % param[1]
clf.fit(x, y)
y_hat = clf.predict(x)
print('准确率:', accuracy_score(y, y_hat))
print(title)
print('支撑向量的数目:', clf.n_support_)
print('支撑向量的系数:', clf.dual_coef_)
print('支撑向量:', clf.support_)
plt.subplot(3, 4, i+1)
grid_hat = clf.predict(grid_test)
grid_hat = grid_hat.reshape(x1.shape)
plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, alpha=0.8)
plt.scatter(x[0], x[1], c=y, edgecolors='k', s=40, cmap=cm_dark)
plt.scatter(x.loc[clf.support_, 0], x.loc[clf.support_, 1], edgecolors='k', facecolors='none', s=100, marker='o')
z = clf.decision_function(grid_test)
print('clf.decision_function(x) = ', clf.decision_function(x))
print('clf.predict(x) = ', clf.predict(x))
z = z.reshape(x1.shape)
plt.contour(x1, x2, z, colors=list('kbrbk'), linestyles=['--', '--', '-', '--', '--'],
linewidths=[1, 0.5, 1.5, 0.5, 1], levels=[-1, -0.5, 0, 0.5, 1])
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title(title, fontsize=12)
plt.suptitle('SVM不同参数的分类', fontsize=16)
plt.tight_layout(1.4)
plt.subplots_adjust(top=0.92)
plt.show()
3. SVM on imbalanced data: how class weights adjust the result for linear and RBF kernels
import numpy as np
from sklearn import svm
import matplotlib.colors
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.exceptions import UndefinedMetricWarning
import warnings
if __name__ == "__main__":
warnings.filterwarnings(action='ignore', category=UndefinedMetricWarning)
np.random.seed(0)
c1 = 990
c2 = 10
N = c1 + c2
x_c1 = 3*np.random.randn(c1, 2)
x_c2 = 0.5*np.random.randn(c2, 2) + (4, 4)
x = np.vstack((x_c1, x_c2))
y = np.ones(N)
y[:c1] = -1
s = np.ones(N) * 30
s[:c1] = 10
clfs = [svm.SVC(C=1, kernel='linear'),
svm.SVC(C=1, kernel='linear', class_weight={-1: 1, 1: 50}),
svm.SVC(C=0.8, kernel='rbf', gamma=0.5, class_weight={-1: 1, 1: 2}),
svm.SVC(C=0.8, kernel='rbf', gamma=0.5, class_weight={-1: 1, 1: 10})]
titles = 'Linear', 'Linear, Weight=50', 'RBF, Weight=2', 'RBF, Weight=10'
x1_min, x2_min = np.min(x, axis=0)
x1_max, x2_max = np.max(x, axis=0)
x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
grid_test = np.stack((x1.flat, x2.flat), axis=1)
cm_light = matplotlib.colors.ListedColormap(['#77E0A0', '#FF8080'])
cm_dark = matplotlib.colors.ListedColormap(['g', 'r'])
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(10, 8), facecolor='w')
for i, clf in enumerate(clfs):
clf.fit(x, y)
y_hat = clf.predict(x)
print(i+1, '次:')
print('accuracy:\t', accuracy_score(y, y_hat))
print('precision:\t', precision_score(y, y_hat, pos_label=1))
print('recall:\t', recall_score(y, y_hat, pos_label=1))
print('F1-score:\t', f1_score(y, y_hat, pos_label=1))
print()
plt.subplot(2, 2, i+1)
grid_hat = clf.predict(grid_test)
grid_hat = grid_hat.reshape(x1.shape)
plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, alpha=0.8)
plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', s=s, cmap=cm_dark)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title(titles[i])
plt.grid(b=True, ls=':')
plt.suptitle('不平衡数据的处理', fontsize=18)
plt.tight_layout(1.5)
plt.subplots_adjust(top=0.92)
plt.show()
4. Handwritten-digit image recognition, with cross-validation for parameter tuning
import numpy as np
from sklearn import svm
import matplotlib.colors
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.metrics import accuracy_score
import os
from sklearn.model_selection import GridSearchCV
from time import time
def show_accuracy(a, b, tip):
acc = a.ravel() == b.ravel()
print(tip + '正确率:%.2f%%' % (100*np.mean(acc)))
def save_image(im, i):
im *= 15.9375
im = 255 - im
a = im.astype(np.uint8)
output_path = '.\\HandWritten'
if not os.path.exists(output_path):
os.mkdir(output_path)
Image.fromarray(a).save(output_path + ('\\%d.png' % i))
if __name__ == "__main__":
print('Load Training File Start...')
data = np.loadtxt('F:\\data\\optdigits.tra', dtype=np.float, delimiter=',')
x, y = np.split(data, (-1, ), axis=1)
images = x.reshape(-1, 8, 8)
y = y.ravel().astype(np.int)
print('Load Test Data Start...')
data = np.loadtxt('F:\\data\\optdigits.tes', dtype=np.float, delimiter=',')
x_test, y_test = np.split(data, (-1, ), axis=1)
print(y_test.shape)
images_test = x_test.reshape(-1, 8, 8)
y_test = y_test.ravel().astype(np.int)
print('Load Data OK...')
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(15, 9), facecolor='w')
for index, image in enumerate(images[:16]):
plt.subplot(4, 8, index + 1)
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('训练图片: %i' % y[index])
for index, image in enumerate(images_test[:16]):
plt.subplot(4, 8, index + 17)
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
save_image(image.copy(), index)
plt.title('测试图片: %i' % y_test[index])
plt.tight_layout()
plt.show()
model = svm.SVC(C=10, kernel='rbf', gamma=0.001)
print('Start Learning...')
t0 = time()
model.fit(x, y)
t1 = time()
t = t1 - t0
print('训练+CV耗时:%d分钟%.3f秒' % (int(t/60), t - 60*int(t/60)))
print('Learning is OK...')
print('训练集准确率:', accuracy_score(y, model.predict(x)))
y_hat = model.predict(x_test)
print('测试集准确率:', accuracy_score(y_test, model.predict(x_test)))
print(y_hat)
print(y_test)
err_images = images_test[y_test != y_hat]
err_y_hat = y_hat[y_test != y_hat]
err_y = y_test[y_test != y_hat]
print(err_y_hat)
print(err_y)
plt.figure(figsize=(10, 8), facecolor='w')
for index, image in enumerate(err_images):
if index >= 12:
break
plt.subplot(3, 4, index + 1)
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('错分为:%i,真实值:%i' % (err_y_hat[index], err_y[index]))
plt.tight_layout()
plt.show()
5. Regression (SVR)
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt
N=50
np.random.seed(0)
x = np.sort(np.random.uniform(0,6,N), axis=0)
y = 2*np.sin(x) + 0.1*np.random.randn(N)
x = x.reshape(-1,1)
svr_rbf = svm.SVR(C=100, kernel='rbf', gamma=0.2)
svr_rbf.fit(x,y)
svr_linear = svm.SVR(C=100, kernel='linear')
svr_linear.fit(x,y)
svr_poly = svm.SVR(C=100, kernel='poly', degree=3)
svr_poly.fit(x,y)
x_test = np.linspace(x.min(), 1.1*x.max(), 100).reshape(-1, 1)
y_rbf = svr_rbf.predict(x_test)
y_linear = svr_linear.predict(x_test)
y_poly = svr_poly.predict(x_test)
plt.figure(figsize=(7,6),facecolor='w')
plt.plot(x_test,y_rbf,'r-',linewidth=2, label='RBF Kernel')
plt.plot(x_test,y_linear,'g-',linewidth=2, label='Linear Kernel')
plt.plot(x_test,y_poly,'b-',linewidth=2, label='Polynomial Kernel')
plt.plot(x,y,'mo', ms=6, mec='k')
plt.scatter(x[svr_rbf.support_], y[svr_rbf.support_], s=200, c='r', marker='*', edgecolors='k', label='RBF Support Vectors', zorder=10)
plt.legend(loc='lower left', fontsize=12)
plt.title('SVR', fontsize=15)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(b=True, ls=':')
plt.tight_layout(2)
plt.show()
6. Tuning SVR parameters with grid search
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import matplotlib.pyplot as plt
if __name__ == "__main__":
N = 50
np.random.seed(0)
x = np.sort(np.random.uniform(0, 6, N), axis=0)
y = 2*np.sin(x) + 0.1*np.random.randn(N)
x = x.reshape(-1, 1)
print('x =\n', x)
print('y =\n', y)
model = svm.SVR(kernel='rbf')
c_can = np.logspace(-2, 2, 10)
gamma_can = np.logspace(-2, 2, 10)
svr = GridSearchCV(model, param_grid={'C': c_can, 'gamma': gamma_can}, cv=5)
svr.fit(x, y)
print('验证参数:\n', svr.best_params_)
x_test = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
y_hat = svr.predict(x_test)
sp = svr.best_estimator_.support_
plt.figure(facecolor='w')
plt.scatter(x[sp], y[sp], s=120, c='r', marker='*', label='Support Vectors', zorder=3)
plt.plot(x_test, y_hat, 'r-', linewidth=2, label='RBF Kernel')
plt.plot(x, y, 'go', markersize=5)
plt.legend(loc='upper right')
plt.title('SVR', fontsize=16)
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True)
plt.show()
VII. Clustering
Laplacian matrices and spectral clustering: https://blog.csdn.net/guoxinian/article/details/79532893 (a from-scratch sketch follows)
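As a companion to the link, a minimal from-scratch sketch of spectral clustering through the normalized Laplacian (sigma and k are arbitrary choices, not from the article):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
def spectral_sketch(data, k=3, sigma=1.0):
    w = np.exp(-pairwise_distances(data, metric='sqeuclidean') / (2 * sigma**2))  # affinity matrix W
    d = np.diag(1.0 / np.sqrt(w.sum(axis=1)))            # D^(-1/2)
    lap = np.eye(len(data)) - d @ w @ d                  # normalized Laplacian L = I - D^(-1/2) W D^(-1/2)
    vals, vecs = np.linalg.eigh(lap)                     # eigh returns eigenvalues in ascending order
    u = vecs[:, :k]                                      # eigenvectors of the k smallest eigenvalues
    u /= np.linalg.norm(u, axis=1, keepdims=True)        # row-normalize (Ng-Jordan-Weiss variant)
    return KMeans(n_clusters=k).fit_predict(u)           # k-means on the spectral embedding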
1. KMeans++ clustering and several evaluation metrics
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn.datasets as ds
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_mutual_info_score, adjusted_rand_score, silhouette_score
from sklearn.cluster import KMeans
def expand(a,b):
d = (b-a)*0.1
return a-d, b+d
if __name__ == '__main__':
N=400
centers = 4
data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=2)
data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)
data3 = np.vstack((data[y==0][:],data[y==1][:50],data[y==2][:20],data[y==3][:5]))
y3 = np.array([0]*100+[1]*50+[2]*20+[3]*5)
m = np.array(((1,1),(1,3)))
data_r = data.dot(m)
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
cm = mpl.colors.ListedColormap(list('rgbm'))
data_list = data, data, data_r, data_r, data2, data2, data3, data3
y_list = y, y, y, y, y2, y2, y3, y3
titles = '原始数据', 'KMeans++聚类', '旋转后数据', '旋转后KMeans++聚类',\
'方差不相等数据', '方差不相等KMeans++聚类', '数量不相等数据', '数量不相等KMeans++聚类'
model = KMeans(n_clusters=4, init='k-means++', n_init=5)
plt.figure(figsize=(8,9), facecolor='w')
for i,(x,y,title) in enumerate(zip(data_list,y_list,titles),start=1):
plt.subplot(4,2,i)
plt.title(title)
if i % 2==1:
y_pred = y
else:
y_pred = model.fit_predict(x)
print(i)
print('Homogeneity:', homogeneity_score(y, y_pred))
print('completeness:', completeness_score(y, y_pred))
print('V measure:', v_measure_score(y, y_pred))
print('AMI:', adjusted_mutual_info_score(y, y_pred))
print('ARI:', adjusted_rand_score(y, y_pred))
print('Silhouette:', silhouette_score(x, y_pred), '\n')
plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=30, cmap=cm, edgecolors='none')
x1_min, x2_min = np.min(x, axis=0)
x1_max, x2_max = np.max(x, axis=0)
x1_min, x1_max = expand(x1_min, x1_max)
x2_min, x2_max = expand(x2_min, x2_max)
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.grid(b=True, ls=':')
plt.tight_layout(2, rect=(0, 0, 1, 0.97))
plt.suptitle('数据分布对KMeans聚类的影响', fontsize=18)
plt.show()
2. Image simplification and compression by clustering (vector quantization)
from PIL import Image
import numpy as np
from sklearn.cluster import KMeans
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
def restore_image(cb, cluster, shape):
row, col, dummy = shape
image = np.empty((row, col, 3))
index = 0
for r in range(row):
for c in range(col):
image[r, c] = cb[cluster[index]]
index += 1
return image
def show_scatter(a):
N = 10
print('原始数据:\n', a)
density, edges = np.histogramdd(a, bins=[N,N,N], range=[(0,1), (0,1), (0,1)])
np.set_printoptions(linewidth=300, suppress=True)
print('print density\n',density)
density /= density.sum()
x = y = z = np.arange(N)
d = np.meshgrid(x, y, z)
fig = plt.figure(1, facecolor='w')
ax = fig.add_subplot(111, projection='3d')
ax.scatter(d[1], d[0], d[2], c='r', s=100*density/density.max(), marker='o', depthshade=True)
ax.set_xlabel('红色分量')
ax.set_ylabel('绿色分量')
ax.set_zlabel('蓝色分量')
plt.title('图像颜色三维频数分布', fontsize=13)
plt.figure(2, facecolor='w')
den = density[density > 0]
print(den.shape)
den = np.sort(den)[::-1]
t = np.arange(len(den))
plt.plot(t, den, 'r-', t, den, 'go', lw=2)
plt.title('图像颜色频数分布', fontsize=13)
plt.grid(True)
plt.show()
if __name__ == '__main__':
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
num_vq = 50
im = Image.open('G:\\data\\lena.png')
image = np.array(im).astype(np.float) / 255
image = image[:, :, :3]
image_v = image.reshape((-1, 3))
show_scatter(image_v)
N = image_v.shape[0]
idx = np.arange(N)
np.random.shuffle(idx)
idx = idx[:1000]
image_sample = image_v[idx]
model = KMeans(num_vq)
model.fit(image_sample)
c = model.predict(image_v)
print('聚类结果:\n', c)
print('聚类中心:\n', model.cluster_centers_)
plt.figure(figsize=(12, 6), facecolor='w')
plt.subplot(121)
plt.axis('off')
plt.title('原始图片', fontsize=14)
plt.imshow(image)
plt.subplot(122)
vq_image = restore_image(model.cluster_centers_, c, image.shape)
plt.axis('off')
plt.title('矢量量化后图片:%d色' % num_vq, fontsize=14)
plt.imshow(vq_image)
plt.tight_layout(2)
plt.show()
3. Affinity propagation (AP) clustering
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.datasets as ds
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import euclidean_distances
if __name__ == '__main__':
N=400
centers = [[1,2],[-1,-1],[1,-1],[-1,1]]
data, y = ds.make_blobs(N,n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
m = euclidean_distances(data, squared=True)
preference = -np.median(m)
print(preference)
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(12,9), facecolor='w')
for i, mul in enumerate(np.linspace(1,4,9)):
print(mul)
p = mul*preference
model = AffinityPropagation(affinity='euclidean', preference=p)
af = model.fit(data)
center_indices = af.cluster_centers_indices_
n_clusters = len(center_indices)
print('p = %.1f' % mul, p, '聚类簇的个数为', n_clusters)   # was the typo '%.lf'
y_hat = af.labels_
plt.subplot(3,3,i+1)
plt.title('Preference %.2f, 簇个数:%d'%(p, n_clusters))
clrs = []
for c in np.linspace(16711680,255,n_clusters, dtype=int):
clrs.append('#%06x'%c)
for k, clr in enumerate(clrs):
cur = (y_hat==k)
plt.scatter(data[cur,0],data[cur,1], s=15, c=clr, edgecolors='none')
center = data[center_indices[k]]
for x in data[cur]:
plt.plot([x[0], center[0]], [x[1], center[1]],color=clr, lw=0.5, zorder=1)
plt.scatter(data[center_indices,0], data[center_indices,1], s=80, c=clrs, marker='*',edgecolors='k', zorder=2)
plt.grid(b=True, ls=':')
plt.tight_layout()
plt.suptitle('AP聚类',fontsize=20)
plt.subplots_adjust(top=0.92)
plt.show()
4. MeanShift clustering (same code pattern as AP)
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import MeanShift
from sklearn.metrics import euclidean_distances
if __name__ == "__main__":
N = 1000
centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(8, 7), facecolor='w')
m = euclidean_distances(data, squared=True)
bw = np.median(m)
print(bw)
for i, mul in enumerate(np.linspace(0.1, 0.4, 4)):
band_width = mul * bw
model = MeanShift(bin_seeding=True, bandwidth=band_width)
ms = model.fit(data)
centers = ms.cluster_centers_
y_hat = ms.labels_
n_clusters = np.unique(y_hat).size
print('带宽:', mul, band_width, '聚类簇的个数为:', n_clusters)
plt.subplot(2, 2, i+1)
plt.title('带宽:%.2f,聚类簇的个数为:%d' % (band_width, n_clusters))
clrs = []
for c in np.linspace(16711680, 255, n_clusters, dtype=int):
clrs.append('#%06x' % c)
for k, clr in enumerate(clrs):
cur = (y_hat == k)
plt.scatter(data[cur, 0], data[cur, 1], c=clr, edgecolors='none')
plt.scatter(centers[:, 0], centers[:, 1], s=150, c=clrs, marker='*', edgecolors='k')
plt.grid(b=True, ls=':')
plt.tight_layout(2)
plt.suptitle('MeanShift聚类', fontsize=15)
plt.subplots_adjust(top=0.9)
plt.show()
5. Hierarchical (agglomerative) clustering
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
import sklearn.datasets as ds
import warnings
def extend(a, b):
return 1.05*a-0.05*b, 1.05*b-0.05*a
if __name__ == '__main__':
warnings.filterwarnings(action='ignore', category=UserWarning)
np.set_printoptions(suppress=True)
np.random.seed(0)
n_clusters = 4
N = 400
data1, y1 = ds.make_blobs(n_samples=N, n_features=2, centers=((-1, 1), (1, 1), (1, -1), (-1, -1)),
cluster_std=(0.1, 0.2, 0.3, 0.4), random_state=0)
data1 = np.array(data1)
n_noise = int(0.1*N)
r = np.random.rand(n_noise, 2)
data_min1, data_min2 = np.min(data1, axis=0)
data_max1, data_max2 = np.max(data1, axis=0)
r[:, 0] = r[:, 0] * (data_max1-data_min1) + data_min1
r[:, 1] = r[:, 1] * (data_max2-data_min2) + data_min2
data1_noise = np.concatenate((data1, r), axis=0)
y1_noise = np.concatenate((y1, [4]*n_noise))
data2, y2 = ds.make_moons(n_samples=N, noise=.05)
data2 = np.array(data2)
n_noise = int(0.1 * N)
r = np.random.rand(n_noise, 2)
data_min1, data_min2 = np.min(data2, axis=0)
data_max1, data_max2 = np.max(data2, axis=0)
r[:, 0] = r[:, 0] * (data_max1 - data_min1) + data_min1
r[:, 1] = r[:, 1] * (data_max2 - data_min2) + data_min2
data2_noise = np.concatenate((data2, r), axis=0)
y2_noise = np.concatenate((y2, [3] * n_noise))
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
cm = mpl.colors.ListedColormap(['r', 'g', 'b', 'm', 'c'])
plt.figure(figsize=(10, 8), facecolor='w')
plt.cla()
linkages = ("ward", "complete", "average")
for index, (n_clusters, data, y) in enumerate(((4, data1, y1), (4, data1_noise, y1_noise),
(2, data2, y2), (2, data2_noise, y2_noise))):
plt.subplot(4, 4, 4*index+1)
plt.scatter(data[:, 0], data[:, 1], c=y, s=12, edgecolors='k', cmap=cm)
plt.title('Prime', fontsize=12)
plt.grid(b=True, ls=':')
data_min1, data_min2 = np.min(data, axis=0)
data_max1, data_max2 = np.max(data, axis=0)
plt.xlim(extend(data_min1, data_max1))
plt.ylim(extend(data_min2, data_max2))
connectivity = kneighbors_graph(data, n_neighbors=7, mode='distance', metric='minkowski', p=2, include_self=True)
connectivity = 0.5 * (connectivity + connectivity.T)
for i, linkage in enumerate(linkages):
ac = AgglomerativeClustering(n_clusters=n_clusters, affinity='euclidean',
connectivity=connectivity, linkage=linkage)
ac.fit(data)
y = ac.labels_
plt.subplot(4, 4, i+2+4*index)
plt.scatter(data[:, 0], data[:, 1], c=y, s=12, edgecolors='k', cmap=cm)
plt.title(linkage, fontsize=12)
plt.grid(b=True, ls=':')
plt.xlim(extend(data_min1, data_max1))
plt.ylim(extend(data_min2, data_max2))
plt.suptitle('层次聚类的不同合并策略', fontsize=15)
plt.tight_layout(0.5, rect=(0, 0, 1, 0.95))
plt.show()
6. Density-based clustering: DBSCAN
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d

if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # (eps, min_samples) pairs to compare, one subplot each
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(9, 7), facecolor='w')
    plt.suptitle('DBSCAN聚类', fontsize=15)
    for i in range(6):
        eps, min_samples = params[i]
        model = DBSCAN(eps=eps, min_samples=min_samples)
        model.fit(data)
        y_hat = model.labels_                      # label -1 marks noise points
        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[model.core_sample_indices_] = True
        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, '聚类簇的个数为:', n_clusters)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        print(clrs)
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            if k == -1:
                plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')   # noise in black
                continue
            plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='k')
            # core samples drawn larger than border samples
            plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1], s=30, c=clr, marker='o', edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(b=True, ls=':', color='#606060')
        plt.title(r'$\epsilon$ = %.1f m = %d,聚类数目:%d' % (eps, min_samples, n_clusters), fontsize=12)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
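The grid above simply tries several (eps, min_samples) pairs. A common heuristic that is not part of the original notes: sort each point's distance to its k-th nearest neighbor and look for the elbow of the curve; eps is read off at the bend, and min_samples is then roughly k. A minimal sketch, assuming the same `data` as above:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(data)
dist, _ = nn.kneighbors(data)          # column -1 is the distance to the k-th neighbor (the point itself is the 0-th)
plt.plot(np.sort(dist[:, -1]))         # a sharp bend suggests a reasonable eps
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to the %d-th neighbor' % k)
plt.show()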
7. HDBSCAN, a combination of hierarchical and density-based clustering; it is relatively insensitive to the parameters eps and m and tends to give better results
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.preprocessing import StandardScaler
import hdbscan

def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d

if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # The same (eps, min_samples) grid as the DBSCAN demo. HDBSCAN has no eps, so the
    # fitted model is identical in every panel -- that is the point of the comparison.
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('HDBSCAN聚类', fontsize=16)
    for i in range(6):
        eps, min_samples = params[i]    # only used in the subplot title
        model = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=10)
        model.fit(data)
        y_hat = model.labels_
        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[y_hat != -1] = True
        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, '聚类簇的个数为:', n_clusters)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            # marker size scaled by HDBSCAN's per-point membership probability
            plt.scatter(data[cur, 0], data[cur, 1], s=60*model.probabilities_[cur], marker='o', c=clr, edgecolors='k', alpha=0.9)
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(b=True, ls=':', color='#808080')
        plt.title(r'$\epsilon$ = %.1f m = %d,聚类数目:%d' % (eps, min_samples, n_clusters), fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
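Since HDBSCAN has no eps, the parameter that actually matters is min_cluster_size. A minimal sketch (assuming the same `data` as above, not in the original notes) of how the cluster count reacts to it:
import hdbscan
for mcs in (5, 10, 30):
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs).fit_predict(data)
    n = len(set(labels)) - (1 if -1 in labels else 0)
    print('min_cluster_size=%d -> %d clusters' % (mcs, n))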
8. Spectral clustering (SC) and image processing
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
from sklearn.cluster import SpectralClustering
from sklearn.metrics import euclidean_distances

def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d

if __name__ == "__main__":
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    # Three concentric circles: the classic case where k-means fails but spectral clustering works
    t = np.arange(0, 2*np.pi, 0.1)
    data1 = np.vstack((np.cos(t), np.sin(t))).T
    data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    data = np.vstack((data1, data2, data3))
    n_clusters = 3
    m = euclidean_distances(data, squared=True)    # m[i, j] = ||x_i - x_j||^2
    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('谱聚类', fontsize=16)
    clrs = plt.cm.Spectral(np.linspace(0, 0.8, n_clusters))
    for i, s in enumerate(np.logspace(-2, 0, 6)):
        print(s)
        # Gaussian (RBF) affinity; m already holds squared distances, so it must not be squared again
        af = np.exp(-m / (2 * s ** 2)) + 1e-6
        model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', assign_labels='kmeans', random_state=1)
        y_hat = model.fit_predict(af)
        plt.subplot(2, 3, i+1)
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            plt.scatter(data[cur, 0], data[cur, 1], s=40, c=clr, edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(b=True, ls=':', color='#808080')
        plt.title(r'$\sigma$ = %.2f' % s, fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
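In LaTeX form, the precomputed affinity swept over above is the Gaussian kernel (the small constant keeps the matrix strictly positive):

A_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) + 10^{-6}

A small σ makes only very close points similar, so the graph nearly disconnects; a large σ makes everything similar and the rings merge, which is what the six σ panels illustrate.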
Image processing:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image
from PIL import Image

if __name__ == "__main__":
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    pic = Image.open('F:\\data\\Chrome.png')
    pic = pic.convert('L')                            # grayscale
    data = np.array(pic).astype(float) / 255
    plt.figure(figsize=(10, 5), facecolor='w')
    plt.subplot(121)
    plt.imshow(pic, cmap=plt.cm.gray, interpolation='nearest')
    plt.title('原始图片', fontsize=18)
    n_clusters = 15
    # img_to_graph connects neighboring pixels, with edge values taken from the image gradient
    affinity = image.img_to_graph(data)
    beta = 3
    affinity.data = np.exp(-beta * affinity.data / affinity.data.std()) + 1e-4
    print('开始谱聚类...')
    y = spectral_clustering(affinity, n_clusters=n_clusters, assign_labels='kmeans', random_state=1)
    print('谱聚类完成...')
    y = y.reshape(data.shape)
    for n in range(n_clusters):
        data[y == n] = n                              # replace pixel values by their cluster index
    plt.subplot(122)
    clrs = []
    for c in np.linspace(16776960, 16711935, n_clusters, dtype=int):
        clrs.append('#%06x' % c)
    cm = matplotlib.colors.ListedColormap(clrs)
    plt.imshow(data, cmap=cm, interpolation='nearest')
    plt.title('谱聚类:%d簇' % n_clusters, fontsize=18)
    plt.tight_layout()
    plt.show()
八、EM (expectation maximization for Gaussian mixtures)
1. The EM algorithm
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances_argmin

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

if __name__ == '__main__':
    style = 'myself'    # 'sklearn' uses GaussianMixture; anything else runs the hand-written EM
    np.random.seed(0)
    mu1_fact = (0, 0, 0)
    cov1_fact = np.diag((1, 2, 3))
    data1 = np.random.multivariate_normal(mu1_fact, cov1_fact*0.3, 400)
    mu2_fact = (2, 2, 1)
    cov2_fact = np.array(((6, 1, 3), (1, 5, 1), (3, 1, 4)))
    data2 = np.random.multivariate_normal(mu2_fact, cov2_fact*0.1, 100)
    data = np.vstack((data1, data2))
    y = np.array([True] * 400 + [False] * 100)
    if style == 'sklearn':
        g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)
        g.fit(data)
        print('类别概率:\t', g.weights_[0])
        print('均值:\n', g.means_, '\n')
        print('方差:\n', g.covariances_, '\n')
        mu1, mu2 = g.means_
        sigma1, sigma2 = g.covariances_
    else:
        # Hand-written EM for a two-component Gaussian mixture
        num_iter = 100
        n, d = data.shape
        mu1 = data.min(axis=0)
        mu2 = data.max(axis=0)
        sigma1 = np.identity(d)
        sigma2 = np.identity(d)
        pi = 0.5
        for i in range(num_iter):
            # E step: responsibility of component 1 for each sample
            norm1 = multivariate_normal(mu1, sigma1)
            norm2 = multivariate_normal(mu2, sigma2)
            tau1 = pi * norm1.pdf(data)
            tau2 = (1 - pi) * norm2.pdf(data)
            gamma = tau1 / (tau1 + tau2)
            # M step: re-estimate means, covariances and the mixing weight
            mu1 = np.dot(gamma, data) / np.sum(gamma)
            mu2 = np.dot((1 - gamma), data) / np.sum((1 - gamma))
            sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)
            sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
            pi = np.sum(gamma) / n
            print(i, ":\t", mu1, mu2)
        print('类别概率:\t', pi)
        print('均值:\t', mu1, mu2)
        print('方差:\n', sigma1, '\n\n', sigma2, '\n')
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = norm1.pdf(data)
    tau2 = norm2.pdf(data)
    fig = plt.figure(figsize=(10, 5), facecolor='w')
    ax = fig.add_subplot(121, projection='3d')
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c='b', s=30, marker='o', edgecolors='k', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('原始数据', fontsize=15)
    ax = fig.add_subplot(122, projection='3d')
    # Match the estimated components to the true ones by nearest mean
    order = pairwise_distances_argmin([mu1_fact, mu2_fact], [mu1, mu2], metric='euclidean')
    print(order)
    if order[0] == 0:
        c1 = tau1 > tau2
    else:
        c1 = tau1 < tau2
    c2 = ~c1
    acc = np.mean(y == c1)
    print('准确率:%.2f%%' % (100*acc))
    ax.scatter(data[c1, 0], data[c1, 1], data[c1, 2], c='r', s=30, marker='o', edgecolors='k', depthshade=True)
    ax.scatter(data[c2, 0], data[c2, 1], data[c2, 2], c='g', s=30, marker='^', edgecolors='k', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('EM算法分类', fontsize=15)
    plt.suptitle('EM算法的实现', fontsize=18)
    plt.subplots_adjust(top=0.90)
    plt.tight_layout()
    plt.show()
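For reference, the hand-written loop above implements the standard two-component EM updates (E step, then M step):

\gamma_i = \frac{\pi\,\mathcal{N}(x_i;\,\mu_1,\Sigma_1)}{\pi\,\mathcal{N}(x_i;\,\mu_1,\Sigma_1) + (1-\pi)\,\mathcal{N}(x_i;\,\mu_2,\Sigma_2)}

\mu_1 = \frac{\sum_i \gamma_i x_i}{\sum_i \gamma_i}, \qquad
\Sigma_1 = \frac{\sum_i \gamma_i (x_i-\mu_1)(x_i-\mu_1)^{\mathsf T}}{\sum_i \gamma_i}, \qquad
\pi = \frac{1}{n}\sum_i \gamma_i

with the symmetric updates for μ2 and Σ2 using weights 1-γi. Note the code plugs the freshly updated μ into the Σ update within the same iteration, a common shortcut.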
2. GMM parameter tuning (covariance_type)
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d

def accuracy_rate(y1, y2):
    # Cluster labels are only defined up to permutation; with two classes the
    # only other assignment is the flipped one, hence max(acc, 1-acc).
    acc = np.mean(y1 == y2)
    return acc if acc > 0.5 else 1-acc

if __name__ == '__main__':
    np.random.seed(0)
    cov1 = np.diag((1, 2))
    print(cov1)
    N1 = 500
    N2 = 300
    N = N1 + N2
    x1 = np.random.multivariate_normal(mean=(1, 2), cov=cov1, size=N1)
    m = np.array(((1, 1), (1, 3)))
    x1 = x1.dot(m)      # linear map makes the first component's covariance non-diagonal
    x2 = np.random.multivariate_normal(mean=(-1, 10), cov=cov1, size=N2)
    x = np.vstack((x1, x2))
    y = np.array([0]*N1 + [1]*N2)
    types = ('spherical', 'diag', 'tied', 'full')
    err = np.empty(len(types))
    bic = np.empty(len(types))
    for i, cov_type in enumerate(types):
        gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
        gmm.fit(x)
        err[i] = 1 - accuracy_rate(gmm.predict(x), y)
        bic[i] = gmm.bic(x)
    print('错误率:', err.ravel())
    print('BIC:', bic.ravel())
    xpos = np.arange(4)
    plt.figure(facecolor='w')
    ax = plt.axes()
    b1 = ax.bar(xpos-0.3, err, width=0.3, color='#77E0A0', edgecolor='k')
    b2 = ax.twinx().bar(xpos, bic, width=0.3, color='#FF8080', edgecolor='k')
    plt.grid(b=True, ls=':', color='#606060')
    bic_min, bic_max = expand(bic.min(), bic.max())
    plt.ylim((bic_min, bic_max))
    plt.xticks(xpos, types)
    plt.legend([b1[0], b2[0]], ('错误率', 'BIC'))
    plt.title('不同方差类型的误差率和BIC', fontsize=15)
    plt.show()
    # Refit with the covariance type that minimizes BIC
    optimal = bic.argmin()
    gmm = GaussianMixture(n_components=2, covariance_type=types[optimal], random_state=0)
    gmm.fit(x)
    print('均值 = \n', gmm.means_)
    print('方差 = \n', gmm.covariances_)
    y_hat = gmm.predict(x)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    # Keep colors consistent: cluster 0 is the component with the smaller first mean coordinate
    if gmm.means_[0][0] > gmm.means_[1][0]:
        z = grid_hat == 0
        grid_hat[z] = 1
        grid_hat[~z] = 0
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=30, c=y, marker='o', cmap=cm_dark, edgecolors='k')
    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('GMM调参:covariance_type=%s' % types[optimal], fontsize=15)
    plt.grid(b=True, ls=':', color='#606060')
    plt.tight_layout(pad=2)
    plt.show()
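gmm.bic(x) used above returns the Bayesian information criterion (lower is better), where L̂ is the maximized likelihood, k the number of free parameters and n the sample size:

\mathrm{BIC} = k \ln n - 2 \ln \hat{L}

Picking covariance_type by the smallest BIC, as done here, trades goodness of fit against model complexity.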
3. GMM on the iris data
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances_argmin

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
iris_feature = '花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度'

def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d

if __name__ == '__main__':
    path = 'F:\\data\\iris.data'
    data = pd.read_csv(path, header=None)
    x_prime = data[np.arange(4)]
    y = pd.Categorical(data[4]).codes
    n_components = 3
    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(8, 6), facecolor='w')
    for k, pair in enumerate(feature_pairs, start=1):
        x = x_prime[pair]
        m = np.array([np.mean(x[y == i], axis=0) for i in range(3)])
        print('实际均值 = \n', m)
        gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=0)
        gmm.fit(x)
        print('预测均值 = \n', gmm.means_)
        print('预测方差 = \n', gmm.covariances_)
        y_hat = gmm.predict(x)
        # GMM labels are arbitrary: map each true class to its nearest predicted mean
        order = pairwise_distances_argmin(m, gmm.means_, axis=1, metric='euclidean')
        print('顺序:\t', order)
        n_sample = y.size
        n_types = 3
        change = np.empty((n_types, n_sample), dtype=bool)
        for i in range(n_types):
            change[i] = y_hat == order[i]
        for i in range(n_types):
            y_hat[change[i]] = i
        acc = '准确率:%.2f%%' % (100*np.mean(y_hat == y))
        print(acc)
        cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['r', 'g', '#6060FF'])
        x1_min, x2_min = x.min()
        x1_max, x2_max = x.max()
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]
        grid_test = np.stack((x1.flat, x2.flat), axis=1)
        grid_hat = gmm.predict(grid_test)
        change = np.empty((n_types, grid_hat.size), dtype=bool)
        for i in range(n_types):
            change[i] = grid_hat == order[i]
        for i in range(n_types):
            grid_hat[change[i]] = i
        grid_hat = grid_hat.reshape(x1.shape)
        plt.subplot(2, 3, k)
        plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
        plt.scatter(x[pair[0]], x[pair[1]], s=20, c=y, marker='o', cmap=cm_dark, edgecolors='k')
        xx = 0.95 * x1_min + 0.05 * x1_max
        yy = 0.1 * x2_min + 0.9 * x2_max
        plt.text(xx, yy, acc, fontsize=10)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.xlabel(iris_feature[pair[0]], fontsize=11)
        plt.ylabel(iris_feature[pair[1]], fontsize=11)
        plt.grid(b=True, ls=':', color='#606060')
    plt.suptitle('EM算法无监督分类鸢尾花数据', fontsize=14)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.95))
    plt.show()
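The nearest-mean matching above can in principle map two true classes onto the same predicted cluster. With only three classes, brute-forcing all 3! relabelings is a safe alternative; a minimal sketch (the helper name is hypothetical, not from the original notes):
from itertools import permutations
import numpy as np

def best_permutation_accuracy(y_true, y_pred, n_types=3):
    # Try every relabeling of the predicted clusters and keep the best accuracy.
    best = 0.0
    for perm in permutations(range(n_types)):
        remapped = np.array([perm[label] for label in y_pred])
        best = max(best, np.mean(remapped == y_true))
    return best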
4. DPGMM (Dirichlet-process GMM via BayesianGaussianMixture)
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
import scipy as sp
import scipy.linalg                      # needed so that sp.linalg is available
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def expand(a, b, rate=0.05):
    d = (b - a) * rate
    return a-d, b+d

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

if __name__ == '__main__':
    np.random.seed(0)
    cov1 = np.diag((1, 2))
    N1 = 500
    N2 = 300
    N = N1 + N2
    x1 = np.random.multivariate_normal(mean=(3, 2), cov=cov1, size=N1)
    m = np.array(((1, 1), (1, 3)))
    x1 = x1.dot(m)
    x2 = np.random.multivariate_normal(mean=(-1, 10), cov=cov1, size=N2)
    x = np.vstack((x1, x2))
    y = np.array([0]*N1 + [1]*N2)
    n_components = 3                     # deliberately more components than true clusters
    colors = '#A0FFA0', '#2090E0', '#FF8080'
    cm = mpl.colors.ListedColormap(colors)
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    plt.figure(figsize=(6, 6), facecolor='w')
    plt.suptitle('GMM/DPGMM比较', fontsize=15)
    ax = plt.subplot(211)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full', random_state=0)
    gmm.fit(x)
    centers = gmm.means_
    covs = gmm.covariances_
    print('GMM均值 = \n', centers)
    print('GMM方差 = \n', covs)
    y_hat = gmm.predict(x)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm)
    plt.scatter(x[:, 0], x[:, 1], s=20, c=y, cmap=cm, marker='o', edgecolors='#202020')
    clrs = list('rgbmy')
    for i, (center, cov) in enumerate(zip(centers, covs)):
        # Eigendecomposition gives the ellipse axes. eigh returns eigenvectors as
        # columns, so the first eigenvector is vector[:, 0] (the original notes
        # used the row vector[0]); arctan2 avoids division by zero.
        value, vector = sp.linalg.eigh(cov)
        width, height = value[0], value[1]
        v = vector[:, 0] / sp.linalg.norm(vector[:, 0])
        angle = 180 * np.arctan2(v[1], v[0]) / np.pi
        e = Ellipse(xy=center, width=width, height=height,
                    angle=angle, color=clrs[i], alpha=0.5, clip_box=ax.bbox)
        ax.add_artist(e)
    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('GMM', fontsize=15)
    plt.grid(b=True, ls=':', color='#606060')
    # DPGMM: the Dirichlet-process prior lets the model switch off redundant components
    dpgmm = BayesianGaussianMixture(n_components=n_components, covariance_type='full', max_iter=1000, n_init=5,
                                    weight_concentration_prior_type='dirichlet_process', weight_concentration_prior=0.1)
    dpgmm.fit(x)
    centers = dpgmm.means_
    covs = dpgmm.covariances_
    print('DPGMM均值 = \n', centers)
    print('DPGMM方差 = \n', covs)
    y_hat = dpgmm.predict(x)
    print(y_hat)
    ax = plt.subplot(212)
    grid_hat = dpgmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm)
    plt.scatter(x[:, 0], x[:, 1], s=20, c=y, cmap=cm, marker='o', edgecolors='#202020')
    for i, cc in enumerate(zip(centers, covs)):
        if i not in y_hat:
            continue                     # skip components the DPGMM never assigns a point to
        center, cov = cc
        value, vector = sp.linalg.eigh(cov)
        width, height = value[0], value[1]
        v = vector[:, 0] / sp.linalg.norm(vector[:, 0])
        angle = 180 * np.arctan2(v[1], v[0]) / np.pi
        e = Ellipse(xy=center, width=width, height=height,
                    angle=angle, color='m', alpha=0.5, clip_box=ax.bbox)
        ax.add_artist(e)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('DPGMM', fontsize=15)
    plt.grid(b=True, ls=':', color='#606060')
    plt.tight_layout(pad=2, rect=(0, 0, 1, 0.95))
    plt.show()
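A tiny standalone check of the ellipse geometry used above, with a made-up covariance matrix:
import numpy as np
import scipy.linalg

cov = np.array([[6., 1.],
                [1., 5.]])                     # hypothetical 2x2 covariance
value, vector = scipy.linalg.eigh(cov)          # ascending eigenvalues; eigenvectors in columns
print(value)                                    # used as the ellipse width/height above
v = vector[:, 0]
print(180 * np.arctan2(v[1], v[0]) / np.pi)     # orientation of the first principal axis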
九、Time modules (time and datetime)
import time
import datetime
time.time()                                   # seconds since the epoch, as a float
time.localtime()                              # struct_time in local time
time.gmtime(time.time())                      # struct_time in UTC
time.mktime(time.localtime())                 # struct_time -> timestamp
time.strftime('%Y%m%d', time.localtime())     # struct_time -> string
time.strptime('20181016', '%Y%m%d')           # string -> struct_time
'''
datetime.date: date class; common attributes are year, month, day.
datetime.time: time class; common attributes are hour, minute, second, microsecond.
datetime.datetime: combined date and time.
datetime.timedelta: a time interval, i.e. the duration between two points in time.
datetime.tzinfo: time-zone information.
'''
datetime.datetime.now()                               # current local datetime
datetime.datetime.now().strftime('%Y%m%d')            # datetime -> string
datetime.datetime.strptime('20181016', '%Y%m%d')      # string -> datetime
'''
datetime2 = datetime1 + timedelta   # add an interval to a datetime, returning a new datetime
datetime2 = datetime1 - timedelta   # subtract an interval, returning a new datetime
timedelta = date1 - date2           # subtracting two dates gives a timedelta
datetime1 < datetime2               # datetimes can be compared directly
'''
datetime.datetime.now() - datetime.timedelta(days=7)  # one week ago
datetime.datetime.fromtimestamp(time.time())          # timestamp -> datetime
'''
Directive   Meaning                                                                  Example
%a          Abbreviated weekday name                                                 Mon
%A          Full weekday name                                                        Monday
%b          Abbreviated month name                                                   Mar
%B          Full month name                                                          March
%c          Locale's date and time representation                                    Mon May 20 16:00:02 2013
%d          Day of the month, range [01,31]                                          20
%H          Hour (24-hour clock), range [00,23]                                      17
%I          Hour (12-hour clock), range [01,12]                                      10
%j          Day of the year, range [001,366]                                         120
%m          Month as a decimal number, range [01,12]                                 05
%M          Minute, range [00,59]                                                    50
%p          AM or PM                                                                 PM
%S          Second, range [00,61]                                                    30
%U          Week number of the year (Sunday as the first day of the week; days
            before the first Sunday are in week 0), range [00,53]                    20
%w          Weekday as a decimal number, range [0 (Sunday), 6]                       1
%W          Week number of the year (Monday as the first day of the week; days
            before the first Monday are in week 0), range [00,53]                    20
%x          Locale's date representation                                             05/20/13
%X          Locale's time representation                                             16:00:02
%y          Year without century, range [00,99]                                      13
%Y          Year with century                                                        2013
%Z          Time zone name                                                           CST (China Standard Time)
%%          A literal '%' character                                                  %
'''
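A minimal round trip tying the calls above together (the dates are just examples):
import datetime
d = datetime.datetime.strptime('20181016', '%Y%m%d')   # string -> datetime
d2 = d + datetime.timedelta(days=7)                    # shift by one week
print(d2.strftime('%Y%m%d'))                           # datetime -> string: '20181023'
print((d2 - d).days)                                   # timedelta back: 7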