Configuring and Using auto-sklearn

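1. Installation pitfalls

1) Installing swig: on macOS, installing it with brew is enough.

2) Memory limit: fitting models in auto-sklearn is memory-intensive. Raise memory_limit to 300000 (the value is in MB), otherwise the fit fails with an error.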
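Since memory_limit is given in megabytes, 300000 is roughly 293 GB, which effectively removes the cap on most machines. Below is a minimal sketch of picking a limit from the machine's RAM instead. The helper name suggest_memory_limit_mb and the 50% fraction are assumptions for illustration and not part of auto-sklearn; the division by n_jobs reflects my understanding that the limit applies per job.

```python
def suggest_memory_limit_mb(ram_mb, n_jobs=1, fraction=0.5):
    """Hypothetical helper: split a fraction of physical RAM evenly across jobs.

    ram_mb   -- physical memory of the machine, in MB
    n_jobs   -- number of parallel jobs passed to AutoSklearnClassifier
    fraction -- share of RAM we are willing to hand to the search (assumed 50%)
    """
    if ram_mb <= 0 or n_jobs <= 0 or not 0 < fraction <= 1:
        raise ValueError("ram_mb and n_jobs must be positive, fraction in (0, 1]")
    return int(ram_mb * fraction / n_jobs)


# The value used in this post, expressed in GB for comparison
print(round(300000 / 1024, 1))

# Example: a 16 GB machine running 4 parallel jobs
print(suggest_memory_limit_mb(16 * 1024, n_jobs=4))
```

The result can be passed as memory_limit=... in place of the hard-coded 300000.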

2. Usage and exploration

1) Loading the data and importing libraries, again using the iris dataset

"""
Created on Sat Apr 16 15:26:21 2022

@author: johnny
"""
import autosklearn
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import accuracy_score


# Load the data
from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target

from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.3,random_state=0)

2) Model configuration

# define search
model = AutoSklearnClassifier(memory_limit=300000,
                              time_left_for_this_task=5*60,
                              per_run_time_limit=50,
                              n_jobs=4,
                              tmp_folder='/Users/johnny/Downloads/CreditMaster/temp/autosklearn_classification_example_tmp')
# perform the search and time it
import time
start = time.time()
model.fit(train_x, train_y)
end = time.time()

print(end - start)

time_left_for_this_task is the total time budget for the task, in seconds; here it is set to 5 minutes. If this parameter is not specified, the search runs for one hour.

per_run_time_limit sets the time allotted to each individual model evaluation, here 50 seconds.
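To get a feel for how the two budgets interact, here is a rough back-of-the-envelope sketch. The function name is hypothetical, and the estimate ignores overhead such as ensemble building, so treat it as an approximation rather than a guarantee.

```python
def full_length_runs(total_seconds, per_run_seconds, n_jobs=1):
    """Rough count of evaluations that fit if every run used its whole per-run limit."""
    return (total_seconds // per_run_seconds) * n_jobs


total_budget = 5 * 60   # time_left_for_this_task
per_run = 50            # per_run_time_limit
workers = 4             # n_jobs

# 300 // 50 = 6 slots per worker, times 4 workers
print(full_length_runs(total_budget, per_run, workers))  # 24
```

In practice most evaluations on a small dataset like iris finish well within the per-run limit, so the search fits in many more runs than this pessimistic figure; the run in this post completed 57.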

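ensemble_size and initial_configurations_via_metalearning can be used to fine-tune the classifier. By default, the search above builds an ensemble of the best-performing models, warm-started from meta-learning. To avoid overfitting, both can be disabled by setting ensemble_size = 1 and initial_configurations_via_metalearning = 0.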
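As a sketch of what that configuration could look like, the settings can be collected in a dictionary and passed to the constructor. The dictionary name and the build_classifier helper are illustrative assumptions; also note that, as far as I know, newer auto-sklearn releases deprecate ensemble_size in favour of ensemble_kwargs, so check the version you have installed.

```python
# Settings from this post plus the two switches discussed above.
search_kwargs = {
    "time_left_for_this_task": 5 * 60,             # total budget in seconds
    "per_run_time_limit": 50,                      # budget per evaluation in seconds
    "ensemble_size": 1,                            # keep only the single best model
    "initial_configurations_via_metalearning": 0,  # no meta-learning warm start
}


def build_classifier(**kwargs):
    """Hypothetical helper: create the classifier from a settings dictionary."""
    # Imported lazily so the settings can be inspected without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier
    return AutoSklearnClassifier(**kwargs)


# Usage (requires auto-sklearn):
# model = build_classifier(**search_kwargs)
# model.fit(train_x, train_y)
```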

3) Results

Run time: 311.1847469806671
# Summary of the search statistics
print(model.sprint_statistics())

# Model leaderboard
print(model.leaderboard())
print(model.sprint_statistics())
auto-sklearn results:
  Dataset name: 2b87dd22-be00-11ec-83d9-acde48001122
  Metric: accuracy
  Best validation score: 0.971429
  Number of target algorithm runs: 57
  Number of successful target algorithm runs: 57
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
print(model.leaderboard())
          rank  ensemble_weight                 type      cost  duration
model_id                                                                
56           1             0.12                  lda  0.028571  6.529774
55           2             0.06                  qda  0.028571  6.576634
43           3             0.06        liblinear_svc  0.057143  4.191880
27           4             0.04                  qda  0.057143  2.138941
53           5             0.04  k_nearest_neighbors  0.085714  5.523627
52           6             0.12                  qda  0.085714  4.785708
6            7             0.02        random_forest  0.085714  6.293557
9            8             0.04           libsvm_svc  0.085714  3.043304
48           9             0.06        liblinear_svc  0.114286  4.556017
45          10             0.04        liblinear_svc  0.114286  5.430604
35          11             0.02                  lda  0.114286  0.922487
3           12             0.02        random_forest  0.114286  5.901496
23          13             0.02   passive_aggressive  0.114286  1.694792
20          14             0.02          extra_trees  0.114286  2.763696
13          15             0.04                  mlp  0.114286  4.000428
5           16             0.08   passive_aggressive  0.114286  2.807672
4           17             0.02        random_forest  0.114286  5.275651
28          18             0.04        liblinear_svc  0.114286  2.893949
34          19             0.02          extra_trees  0.114286  5.547182
49          20             0.04        liblinear_svc  0.142857  5.844000
46          21             0.08             adaboost  0.171429  5.250963

In total, 57 models were evaluated and all 57 runs succeeded; the best validation score was 0.97.
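The leaderboard columns can be cross-checked against the summary with a little arithmetic. The sketch below assumes that cost is reported as 1 minus accuracy for this metric, which matches the numbers above (best cost 0.028571, best validation score 0.971429). The weights are copied from the ensemble_weight column.

```python
# ensemble_weight column of the leaderboard above, top to bottom (21 models)
weights = [0.12, 0.06, 0.06, 0.04, 0.04, 0.12, 0.02, 0.04, 0.06, 0.04, 0.02,
           0.02, 0.02, 0.02, 0.04, 0.08, 0.02, 0.04, 0.02, 0.04, 0.08]


def cost_to_accuracy(cost):
    """Convert a leaderboard cost into accuracy, assuming cost = 1 - accuracy."""
    return round(1.0 - cost, 6)


# The rank-1 model (lda) reproduces the "Best validation score" line
print(cost_to_accuracy(0.028571))

# The ensemble weights of the listed models add up to 1
print(len(weights), round(sum(weights), 2))
```

So 21 of the 57 evaluated models ended up in the final ensemble, and their weights sum to 1.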

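from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns

# Predict on the held-out test set first, then score the predictions
y_pred = model.predict(test_x)
m1_acc_score = accuracy_score(test_y, y_pred)
print(m1_acc_score)

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf_matrix = confusion_matrix(test_y, y_pred)
sns.heatmap(conf_matrix, annot=True)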

[Figure 1: heatmap of the confusion matrix on the test set]
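When reading such a heatmap it helps to keep the orientation in mind: scikit-learn's confusion_matrix takes the true labels first, so rows are true classes and columns are predicted classes. The following dependency-free sketch illustrates that convention with made-up toy labels (not the iris results); confusion_counts is a hypothetical stand-in for the library function.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows indexed by true label and columns by predicted label."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels for illustration only: one class-1 sample is predicted as class 2
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]

correct_order = confusion_counts(toy_true, toy_pred, 3)
swapped_order = confusion_counts(toy_pred, toy_true, 3)

print(correct_order)  # the error appears in row 1, column 2
print(swapped_order)  # swapping the arguments transposes the matrix
```

Swapping the arguments leaves the diagonal (and therefore the accuracy) unchanged, but it mirrors the off-diagonal errors, which changes how misclassifications are read.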

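3. Conclusion

auto-sklearn is an automated machine learning approach that combines model search with ensembling. It is quite handy when you want to save yourself the manual tuning effort.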
