Common Tricks for Machine Learning Models (continuously updated...)

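Contents

1. A boilerplate grid-search function (using a decision tree as the example):

2. Saving a model:

3. Model evaluation metrics:

4. Plotting a decision tree

5. Handling class imbalance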

1. A boilerplate grid-search function (using a decision tree as the example):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def check_model(x, y):
    # A decision tree is used as the example estimator
    classifier = DecisionTreeClassifier(random_state=1)
    # Keys must be DecisionTreeClassifier parameters; the grid search tries every combination
    parameters = {
        'max_leaf_nodes': list(range(2, 100)),
        'min_samples_split': [8, 10, 15]
    }
    # StratifiedKFold is like KFold but samples by stratum, so the training and
    # test folds keep the same class proportions as the original dataset
    folder = StratifiedKFold(n_splits=3, shuffle=True)
    # Exhaustive search over specified parameter values for an estimator.
    grid_search = GridSearchCV(
        estimator=classifier,
        param_grid=parameters,
        cv=folder,
        n_jobs=2,
        verbose=1    # Controls the verbosity: the higher, the more messages.
    )
    grid_search = grid_search.fit(x, y)
    print(grid_search.best_params_)
    return grid_search

# x_train, y_train are your training features and labels
model = check_model(x_train, y_train)
model = model.best_estimator_    # keep the best classifier found by the search
# Continue with prediction, model evaluation, etc.
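To make the stratified sampling used in check_model more concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical and introduced only for illustration; they are not part of the original snippet.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 90 samples of class 0 and 30 of class 1 (25% positives)
X_demo = np.arange(120).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 30)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_positive_rates = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Share of class 1 inside this test fold
    rate = y_demo[test_idx].mean()
    fold_positive_rates.append(rate)
    print(len(test_idx), rate)
```

Because both class counts divide evenly by 3 here, every test fold should keep the original 25% share of positives, which a plain KFold does not guarantee.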

2. Saving a model:

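# Save the model
# Method 1: pickle
import os, pickle
if not os.path.isfile('test_model.pkl'):
    # No saved file yet: write the model in binary mode ('wb')
    with open('test_model.pkl', 'wb') as f:
        pickle.dump(model, f)
else:
    # The file exists: read the saved model back in binary mode ('rb').
    # Note that 'w+' truncates the file; appending would need 'ab'.
    with open('test_model.pkl', 'rb') as f:
        model = pickle.load(f)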

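# Method 2: joblib (sklearn.externals.joblib has been removed from newer
# scikit-learn releases, so import the standalone joblib package instead)
import joblib
# Save
joblib.dump(model, 'test.pkl')
# Load
estimator = joblib.load('test.pkl')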

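3. Model evaluation metrics:

For classification models: accuracy_score (accuracy, i.e. the fraction of samples classified correctly);

For regression models: MSE (mean squared error);

More generally, an ROC curve can also be used;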
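As a quick illustration of the two metrics just named, the sketch below computes them on small hand-made lists. The label and prediction values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical classification labels and predictions: 4 of 5 match
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true_cls, y_pred_cls)

# Hypothetical regression targets and predictions: squared errors are 1, 0, 4
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)

print(acc, mse)
```

Accuracy here is 4/5 and the MSE is (1 + 0 + 4) / 3; on real data the same calls would take y_test and the model's predictions.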

① classification_report:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# The output reports precision, recall, and f1-score for each class

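② roc_curve, roc_auc_score

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# clf is a fitted binary classifier; score with the predicted probability of the
# positive class so that the AUC matches the curve being plotted
y_score = clf.predict_proba(x_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, threshold = roc_curve(y_test, y_score)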

plt.figure()
plt.plot(fpr, tpr, label='logit regression (area=%0.2f)'%logit_roc_auc)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')
plt.savefig('1.png')
plt.show()


4. Plotting a decision tree

# conda install python-graphviz
# conda install pydotplus
from io import StringIO  # sklearn.externals.six has been removed from newer scikit-learn releases
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()
# clf is an already-fitted decision tree; feature_cols is the list of feature names
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())


5. Handling class imbalance

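This article gives a fairly comprehensive overview, covering how class imbalance is handled in machine learning, computer vision, and NLP (炼丹笔记一:样本不平衡问题, "Alchemy Notes 1: The Class Imbalance Problem")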
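As one concrete starting point, the sketch below shows class reweighting with scikit-learn's class_weight='balanced' option. This is a common technique and is not necessarily the approach the linked article recommends; the data `X_imb` and `y_imb` are hypothetical and generated only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data: 90 negatives and 10 positives
rng = np.random.RandomState(0)
X_imb = rng.normal(size=(100, 2))
y_imb = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * samples_in_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_imb)
print(weights)

# The same weighting applied inside the classifier during training
clf_balanced = DecisionTreeClassifier(class_weight='balanced', random_state=1)
clf_balanced.fit(X_imb, y_imb)
```

With these counts the minority class gets a weight of 100 / (2 * 10) = 5, so each positive sample counts nine times as much as a negative one when the tree evaluates splits. Resampling the training set is another option and may suit other models better.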
