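Basic machine learning workflow
- Choose the algorithm the task calls for
- Load the data
- Preprocess the data
- Set the parameters from experience, then build and train the model
- Go back and tune the parameters based on the training results
- Output the model's evaluation metrics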
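The steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, and the algorithm (`LogisticRegression`) and its parameter values are arbitrary picks for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data: a synthetic stand-in (assumption, not a real dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Preprocess: hold out a test set, then scale using training statistics only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Build the model with hand-picked parameters and train it
model = LogisticRegression(C=1.0, max_iter=2000)
model.fit(X_train_s, y_train)

# Output the evaluation metrics; if they look poor, go back and adjust C
pred = model.predict(X_test_s)
acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
print('accuracy:', acc)
print('f1:', f1)
```

The tuning step in this workflow is manual: change a parameter, retrain, and compare the printed metrics.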
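Advanced machine learning workflow
- When no algorithm is specified, shortlist a few plausible candidates based on experience
- Load the data
- Preprocess the data
- Take a small sample of the data and compare the candidates with cross-validation to pick the most suitable algorithm
- Run a cross-validated grid search on the best algorithm to find its optimal parameters
- Build the model with the optimal parameters
- Use a pipeline to wrap the preprocessing, dimensionality reduction, and the tuned model into a single object
- Feed the data into the pipeline for training and print the metrics
- Save the pipeline to disk so it can be reused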
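The comparison and tuning steps above can be sketched on synthetic data as follows. This is one possible arrangement, not the only one: it places the scaler and PCA inside the pipeline before cross-validating, so each fold fits them on its own training portion. The helper `make_pipeline_for`, the step name `clf`, and the small per-model grids are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)


def make_pipeline_for(estimator):
    """Hypothetical helper: scaling -> PCA -> the given classifier."""
    return Pipeline([
        ('ss', StandardScaler()),
        ('pca', PCA(n_components=3)),
        ('clf', estimator),
    ])


# Candidate algorithms shortlisted "from experience"
candidates = {
    'LR': LogisticRegression(max_iter=2000),
    'DT': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(),
}

# Compare the candidates by mean cross-validated F1
scores = {
    name: np.mean(
        cross_val_score(make_pipeline_for(est), X, y, scoring='f1', cv=5)
    )
    for name, est in candidates.items()
}
best_name = max(scores, key=scores.get)

# Illustrative grids; keys use the '<step name>__<parameter>' convention
grids = {
    'LR': {'clf__C': [0.1, 1, 10]},
    'DT': {'clf__max_depth': [3, 5, None]},
    'SVM': {'clf__C': [0.1, 1, 10], 'clf__kernel': ['rbf', 'linear']},
}

# Grid-search the winning candidate; best_estimator_ is a fitted pipeline
search = GridSearchCV(
    make_pipeline_for(candidates[best_name]),
    param_grid=grids[best_name],
    scoring='f1',
    cv=5,
)
search.fit(X, y)
print(best_name, search.best_params_, search.best_score_)
```

Because the search runs over the whole pipeline, `search.best_estimator_` can be used or saved directly without rebuilding the model from the best parameters.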
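The complete code, using a brain tumor dataset as the example, is shown below
'''
Don't rush the imports or try to memorize them; import whatever you need as you go
'''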
import numpy as np
import pandas as pd
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score,GridSearchCV,train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,roc_curve
from matplotlib import pyplot as plt
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
warnings.filterwarnings('ignore')
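# Load the data and widen pandas' display limits
df = pd.read_excel(r'../datas/naoliu.xls')
pd.set_option('display.max_columns',20)
pd.set_option('display.max_rows',200)
# Drop the leftover 'Unnamed: 0' column and any rows with missing values
df.drop(columns=['Unnamed: 0'],inplace=True)
df.dropna(inplace=True)
# Take a 500-row sample for the quick model comparison; the last column is the label
df_small = df.sample(500)
X = df_small.iloc[:,:-1]
y = df_small.iloc[:,-1]
# Standardize the features, then reduce them to 3 principal components
X = StandardScaler().fit_transform(X)
model_pca = PCA(n_components=3)
X = model_pca.fit_transform(X)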
# Candidate models
LR = LogisticRegression(max_iter=2000)
DT = DecisionTreeClassifier()
SVM = SVC(kernel='linear',max_iter=2000)
# Compare the candidates with 10-fold cross-validation on the F1 score
LR_f1 = cross_val_score(LR,X,y,scoring='f1',cv=10)
DT_f1 = cross_val_score(DT,X,y,scoring='f1',cv=10)
SVM_f1 = cross_val_score(SVM,X,y,scoring='f1',cv=10)
print('LR mean F1:',np.mean(LR_f1))
print('SVM mean F1:',np.mean(SVM_f1))
print('DT mean F1:', np.mean(DT_f1))
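# Grid-search the SVM hyperparameters; GridSearchCV cross-validates every combination
params = {
'C':[0.1,0.3,0.03,1,3,5,10,13,30,50,100,500,1000],
'kernel':['rbf','linear','sigmoid'],
'max_iter':[500,1000,2000,5000,10000]
}
model_nice = GridSearchCV(SVM,param_grid=params).fit(X,y)
# best_params_ holds the winning parameter combination, not a score
best_params = model_nice.best_params_
print(best_params)
# Wrap scaling, PCA, and the tuned SVM in one pipeline; probability=True enables predict_proba
model_pip = Pipeline([
('ss',StandardScaler()),
('pca',PCA(n_components=3)),
('svm',SVC(C=best_params.get('C'),kernel=best_params.get('kernel'),max_iter=best_params.get('max_iter'),probability=True))
])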
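# Train the pipeline on the full dataset with a 70/30 train/test split
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)
model_pip.fit(X_train,y_train)
pred = model_pip.predict(X_test)
preds = model_pip.predict_proba(X_test)
# Accuracy on the test set, then on the training set
print(model_pip.score(X_test,y_test))
print(model_pip.score(X_train,y_train))
# AUC from the predicted probability of the positive class (last column)
AUC = roc_auc_score(y_test,preds[:,-1])
print(AUC)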
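# roc_curve returns (FPR, TPR, thresholds); plot FPR on the x-axis and TPR on the y-axis
FPR,TPR,TH = roc_curve(y_test,preds[:,-1])
# The SimHei font is only needed when plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('ROC curve')
plt.plot(FPR,TPR)
plt.show()
# Save the fitted pipeline to disk
joblib.dump(model_pip,filename='naoliu.model')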
The results are as follows
Load the saved model to make predictions in production
import pandas as pd
import warnings
import joblib
warnings.filterwarnings('ignore')
# Load and clean the data the same way as in the training script
df = pd.read_excel(r'../datas/naoliu.xls')
df.drop(columns=['Unnamed: 0'],inplace=True)
df.dropna(inplace=True)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Load the saved pipeline and report its accuracy on the data
model = joblib.load(filename='naoliu.model')
print(model.score(X, y))
>>0.8457831325301205
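Scoring requires labels, but data arriving in production usually has none, so the loaded pipeline would more typically be used through `predict` or `predict_proba`. The following is a minimal sketch of that round trip under stated assumptions: the data is synthetic, the file name `demo.model` is hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Train a small stand-in pipeline on synthetic data (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('svm', SVC(probability=True, random_state=0)),
]).fit(X, y)

# Save the pipeline, then load it back as a separate object
path = os.path.join(tempfile.mkdtemp(), 'demo.model')  # hypothetical file name
joblib.dump(pipe, path)
loaded = joblib.load(path)

# Predict on "new" rows; the raw features go straight in because the
# pipeline applies its own scaling before the classifier
new_rows = X[:5]
labels = loaded.predict(new_rows)
probas = loaded.predict_proba(new_rows)
print(labels)
print(probas.round(3))
```

Since the preprocessing steps are stored inside the pipeline, the prediction script does not need to repeat the scaling or dimensionality reduction by hand.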