[Hyperparameter Tuning] Python Implementation of Hyperparameter Tuning with Cross Validation: Multi-Parameter Search

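  • I. How Grid Search Works
  • II. Grid Search + Cross Validation for Multi-Parameter Tuning in Python
    • 1. Model and Parameters to Tune
    • 2. Grid Search with Nested Loops + Cross Validation
    • 3. GridSearchCV
  • References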

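For the basics of cross validation, see the previous post. This post covers the parameter search method used when tuning several hyperparameters with cross validation, namely grid search, along with two common ways to implement it in Python.

I. How Grid Search Works

Grid search is an exhaustive search method: it loops over every combination of candidate values for the parameters, evaluates the model for each combination, and takes the parameters of the best-performing model as the optimum.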
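The exhaustive loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a real cross-validated score, chosen so that it peaks at C = 2, gamma = 0.5.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Return (best_params, best_score) over every combination in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product(...) enumerates the full Cartesian grid of candidate values
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, gamma):
    # Hypothetical stand-in for a cross-validated accuracy (assumption, not real data)
    return -((C - 2.0) ** 2) - (gamma - 0.5) ** 2


best_params, best_score = grid_search(
    toy_score, {"C": [0.5, 1.0, 2.0, 4.0], "gamma": [0.1, 0.5, 1.0]}
)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is why the number of candidate values per parameter is usually kept modest.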

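II. Grid Search + Cross Validation for Multi-Parameter Tuning in Python

Code and sample data: https://github.com/shiluqiang/python_GridSearch-CV

1. Model and Parameters to Tune

The multi-parameter machine learning model used in this post is the nonlinear SVM (see reference [1]). Its optimization problem is: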

$$
\begin{array}{l}
\min\limits_{W,e} \; \frac{1}{2}\left\| W \right\|^2 + \frac{C}{2}\sum\limits_{i=1}^{m} e_i^2 \\
\text{s.t.} \quad y_i\left( W \cdot \varphi(x_i) + b \right) \ge 1 - e_i, \quad i = 1, \cdots, m \\
\qquad\; e_i \ge 0, \quad i = 1, \cdots, m
\end{array}
$$

Applying the method of Lagrange multipliers and converting to the dual problem, the optimization problem becomes:

$$
\begin{array}{l}
\min\limits_{\alpha} \; \frac{1}{2}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{m} \alpha_i \alpha_j y_i y_j K\left( x_i, x_j \right) - \sum\limits_{i=1}^{m} \alpha_i \\
\text{s.t.} \quad \sum\limits_{i=1}^{m} \alpha_i y_i = 0 \\
\qquad\; 0 \le \alpha_i \le C, \quad i = 1, \cdots, m
\end{array}
$$

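where the kernel is

$$
K\left( x_i, x_j \right) = \exp\left( -\frac{\left\| x_i - x_j \right\|^2}{2\sigma^2} \right)
$$

The nonlinear SVM therefore has two parameters: the regularization parameter $C$ and the kernel parameter $\sigma$. The code in this post tunes scikit-learn's `gamma` argument rather than $\sigma$ directly; for the RBF kernel the two are related by $\gamma = 1/(2\sigma^2)$.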
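scikit-learn's `SVC` writes the RBF kernel as exp(-gamma * ||x - z||^2), so a kernel width sigma maps to gamma = 1 / (2 * sigma^2). A minimal sketch of that mapping and of the kernel itself, using only the standard library; the helper names `sigma_to_gamma` and `rbf_kernel` are illustrative and not part of any library:

```python
import math


def sigma_to_gamma(sigma):
    """Convert the kernel width sigma to gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


gamma = sigma_to_gamma(1.0)                          # sigma = 1 corresponds to gamma = 0.5
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma)   # identical points
k_far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma)    # squared distance is 25
print(gamma, k_same, k_far)
```

A larger gamma (smaller sigma) makes the kernel value fall off faster with distance, which tends to give a more flexible decision boundary; this is the trade-off the search over gamma explores.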

2. Grid Search with Nested Loops + Cross Validation

import numpy as np
from sklearn import svm
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

def load_data(data_file, n_features=13):
    '''Load training data in LIBSVM format ("label index:value index:value ...").
    input:  data_file(string): path to the training data file
            n_features(int): number of features per sample (13 for heart_scale)
    output: data(ndarray): features of the training samples
            label(ndarray): labels of the training samples
    '''
    data = []
    label = []
    with open(data_file) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # the first field is the label
            label.append(float(fields[0]))
            # the remaining fields are 1-based "index:value" pairs;
            # features that do not appear on the line stay 0
            row = [0.0] * n_features
            for item in fields[1:]:
                idx, value = item.split(':')
                row[int(idx) - 1] = float(value)
            data.append(row)
    return np.array(data), np.array(label)

### 1. Load the data set
trainX, trainY = load_data('heart_scale')

### 2. Set the candidate values for C and gamma
c_list = [i * 0.5 for i in range(1, 50)]       # 0.5, 1.0, ..., 24.5
gamma_list = [j * 0.2 for j in range(1, 40)]   # roughly 0.2, 0.4, ..., 7.8

### 3.1 Nested loops: grid search + cross validation
best_value = 0.0
best_parameters = None

for c in c_list:
    for gamma in gamma_list:
        rbf_svm = svm.SVC(kernel='rbf', C=c, gamma=gamma)
        # mean accuracy over 3 folds for this (C, gamma) pair
        scores = cross_val_score(rbf_svm, trainX, trainY, cv=3, scoring='accuracy')
        current_value = scores.mean()
        if current_value >= best_value:
            best_value = current_value
            best_parameters = {'C': c, 'gamma': gamma}

# report the result once, after the whole grid has been searched
print('Best Value is :%f' % best_value)
print('Best Parameters is', best_parameters)
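The score computed for each (C, gamma) pair in the loop is the mean of the per-fold scores. A minimal sketch of that mechanism in plain Python, assuming contiguous, unshuffled folds and a hypothetical `fit_and_score` callback; scikit-learn's own splitters differ in detail (for classifiers, an integer `cv` selects stratified folds by default):

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    # the first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mean(fit_and_score, n_samples, k):
    """Average fit_and_score(train_idx, test_idx) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # every fold except the i-th is used for training
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(10, 3)
# Hypothetical scorer for illustration only: fraction of samples used for training
mean_score = cross_val_mean(lambda tr, te: len(tr) / (len(tr) + len(te)), 10, 3)
print(folds, mean_score)
```

In a real run, `fit_and_score` would train the model on `train_idx` and return its accuracy on `test_idx`.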

3. GridSearchCV

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

def load_data(data_file, n_features=13):
    '''Load training data in LIBSVM format ("label index:value index:value ...").
    input:  data_file(string): path to the training data file
            n_features(int): number of features per sample (13 for heart_scale)
    output: data(ndarray): features of the training samples
            label(ndarray): labels of the training samples
    '''
    data = []
    label = []
    with open(data_file) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # the first field is the label
            label.append(float(fields[0]))
            # the remaining fields are 1-based "index:value" pairs;
            # features that do not appear on the line stay 0
            row = [0.0] * n_features
            for item in fields[1:]:
                idx, value = item.split(':')
                row[int(idx) - 1] = float(value)
            data.append(row)
    return np.array(data), np.array(label)

### 1. Load the data set
trainX, trainY = load_data('heart_scale')

### 2. Set the candidate values for C and gamma
c_list = [i * 0.5 for i in range(1, 50)]       # 0.5, 1.0, ..., 24.5
gamma_list = [j * 0.2 for j in range(1, 40)]   # roughly 0.2, 0.4, ..., 7.8
    
### 3.2 GridSearchCV (grid search + CV)
param_grid = {'C': c_list,
              'gamma': gamma_list}

rbf_svm1 = svm.SVC(kernel='rbf')
grid = GridSearchCV(rbf_svm1, param_grid, cv=3, scoring='accuracy')
grid.fit(trainX, trainY)
best_parameter = grid.best_params_
print(best_parameter)
# mean cross-validated accuracy of the best parameter combination
print('Best Value is :%f' % grid.best_score_)

References

[1] https://blog.csdn.net/google19890102/article/details/35989959
