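# The medical breast cancer dataset bundled with scikit-learn's datasets module
# Using the default Gaussian (RBF) kernel, let GridSearchCV select the gamma parameter automatically and report the best model score found by the hyperparameter search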
from sklearn import datasets
datas=datasets.load_breast_cancer()
print(datas)
Output:
{'DESCR': 'Breast Cancer Wisconsin (Diagnostic) Database\n=============================================\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 569\n\n :Number of Attributes: 30 numeric, predictive attributes and the class\n\n :Attribute Information:\n - radius (mean of distances from center to points on the perimeter)\n - texture (standard deviation of gray-scale values)\n - perimeter\n - area\n - smoothness (local variation in radius lengths)\n - compactness (perimeter^2 / area - 1.0)\n - concavity (severity of concave portions of the contour)\n - concave points (number of concave portions of the contour)\n - symmetry \n - fractal dimension ("coastline approximation" - 1)\n\n The mean, standard error, and "worst" or largest (mean of the three\n largest values) of these features were computed for each image,\n resulting in 30 features. For instance, field 3 is Mean Radius, field\n 13 is Radius SE, field 23 is Worst Radius.\n\n - class:\n - WDBC-Malignant\n - WDBC-Benign\n\n :Summary Statistics:\n\n ===================================== ====== ======\n Min Max\n ===================================== ====== ======\n radius (mean): 6.981 28.11\n texture (mean): 9.71 39.28\n perimeter (mean): 43.79 188.5\n area (mean): 143.5 2501.0\n smoothness (mean): 0.053 0.163\n compactness (mean): 0.019 0.345\n concavity (mean): 0.0 0.427\n concave points (mean): 0.0 0.201\n symmetry (mean): 0.106 0.304\n fractal dimension (mean): 0.05 0.097\n radius (standard error): 0.112 2.873\n texture (standard error): 0.36 4.885\n perimeter (standard error): 0.757 21.98\n area (standard error): 6.802 542.2\n smoothness (standard error): 0.002 0.031\n compactness (standard error): 0.002 0.135\n concavity (standard error): 0.0 0.396\n concave points (standard error): 0.0 0.053\n symmetry (standard error): 0.008 0.079\n fractal dimension (standard error): 0.001 0.03\n radius (worst): 7.93 36.04\n texture (worst): 12.02 49.54\n perimeter (worst): 50.41 251.2\n area 
(worst): 185.2 4254.0\n smoothness (worst): 0.071 0.223\n compactness (worst): 0.027 1.058\n concavity (worst): 0.0 1.252\n concave points (worst): 0.0 0.291\n symmetry (worst): 0.156 0.664\n fractal dimension (worst): 0.055 0.208\n ===================================== ====== ======\n\n :Missing Attribute Values: None\n\n :Class Distribution: 212 - Malignant, 357 - Benign\n\n :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian\n\n :Donor: Nick Street\n\n :Date: November, 1995\n\nThis is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.\nhttps://goo.gl/U2Uwz2\n\nFeatures are computed from a digitized image of a fine needle\naspirate (FNA) of a breast mass. They describe\ncharacteristics of the cell nuclei present in the image.\n\nSeparating plane described above was obtained using\nMultisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree\nConstruction Via Linear Programming." Proceedings of the 4th\nMidwest Artificial Intelligence and Cognitive Science Society,\npp. 97-101, 1992], a classification method which uses linear\nprogramming to construct a decision tree. Relevant features\nwere selected using an exhaustive search in the space of 1-4\nfeatures and 1-3 separating planes.\n\nThe actual linear program used to obtain the separating plane\nin the 3-dimensional space is that described in:\n[K. P. Bennett and O. L. Mangasarian: "Robust Linear\nProgramming Discrimination of Two Linearly Inseparable Sets",\nOptimization Methods and Software 1, 1992, 23-34].\n\nThis database is also available through the UW CS ftp server:\n\nftp ftp.cs.wisc.edu\ncd math-prog/cpo-dataset/machine-learn/WDBC/\n\nReferences\n----------\n - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction \n for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on \n Electronic Imaging: Science and Technology, volume 1905, pages 861-870,\n San Jose, CA, 1993.\n - O.L. Mangasarian, W.N. Street and W.H. Wolberg. 
Breast cancer diagnosis and \n prognosis via linear programming. Operations Research, 43(4), pages 570-577, \n July-August 1995.\n - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques\n to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) \n 163-171.\n', 'target_names': array(['malignant', 'benign'], dtype='
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn import metrics
from sklearn.svm import SVC
import numpy as np
x_train,x_test,y_train,y_test=train_test_split(datas.data,datas.target)
thresholds=np.linspace(0,0.001,100)  # list of candidate values for the gamma parameter
thresholds
Output:
array([ 0.00000000e+00, 1.01010101e-05, 2.02020202e-05, 3.03030303e-05, 4.04040404e-05, 5.05050505e-05, 6.06060606e-05, 7.07070707e-05, 8.08080808e-05, 9.09090909e-05, 1.01010101e-04, 1.11111111e-04, 1.21212121e-04, 1.31313131e-04, 1.41414141e-04, 1.51515152e-04, 1.61616162e-04, 1.71717172e-04, 1.81818182e-04, 1.91919192e-04, 2.02020202e-04, 2.12121212e-04, 2.22222222e-04, 2.32323232e-04, 2.42424242e-04, 2.52525253e-04, 2.62626263e-04, 2.72727273e-04, 2.82828283e-04, 2.92929293e-04, 3.03030303e-04, 3.13131313e-04, 3.23232323e-04, 3.33333333e-04, 3.43434343e-04, 3.53535354e-04, 3.63636364e-04, 3.73737374e-04, 3.83838384e-04, 3.93939394e-04, 4.04040404e-04, 4.14141414e-04, 4.24242424e-04, 4.34343434e-04, 4.44444444e-04, 4.54545455e-04, 4.64646465e-04, 4.74747475e-04, 4.84848485e-04, 4.94949495e-04, 5.05050505e-04, 5.15151515e-04, 5.25252525e-04, 5.35353535e-04, 5.45454545e-04, 5.55555556e-04, 5.65656566e-04, 5.75757576e-04, 5.85858586e-04, 5.95959596e-04, 6.06060606e-04, 6.16161616e-04, 6.26262626e-04, 6.36363636e-04, 6.46464646e-04, 6.56565657e-04, 6.66666667e-04, 6.76767677e-04, 6.86868687e-04, 6.96969697e-04, 7.07070707e-04, 7.17171717e-04, 7.27272727e-04, 7.37373737e-04, 7.47474747e-04, 7.57575758e-04, 7.67676768e-04, 7.77777778e-04, 7.87878788e-04, 7.97979798e-04, 8.08080808e-04, 8.18181818e-04, 8.28282828e-04, 8.38383838e-04, 8.48484848e-04, 8.58585859e-04, 8.68686869e-04, 8.78787879e-04, 8.88888889e-04, 8.98989899e-04, 9.09090909e-04, 9.19191919e-04, 9.29292929e-04, 9.39393939e-04, 9.49494949e-04, 9.59595960e-04, 9.69696970e-04, 9.79797980e-04, 9.89898990e-04, 1.00000000e-03])
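For context on what this grid is tuning: with kernel='rbf', SVC uses the kernel K(x, x') = exp(-gamma * ||x - x'||^2), so a smaller gamma makes distant points look more alike. The sketch below is only an illustration under that formula; the two points are made up and are not taken from the dataset. It compares a hand-computed kernel value with sklearn.metrics.pairwise.rbf_kernel for a few gamma values.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical sample points (illustrative only, not from the dataset)
a = np.array([[1.0, 2.0]])
b = np.array([[4.0, 6.0]])

# Squared Euclidean distance between the points: 3^2 + 4^2 = 25
sq_dist = float(np.sum((a - b) ** 2))

for gamma in (1e-4, 1e-3, 1e-1):
    manual = np.exp(-gamma * sq_dist)              # K(a, b) = exp(-gamma * ||a - b||^2)
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]  # same quantity computed by scikit-learn
    print(gamma, manual, library)
```

Under this formula, the first grid value, gamma = 0, gives K = 1 for every pair of points, so that candidate cannot distinguish between samples.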
# This step directly affects the results below
param_grid={'gamma':thresholds}
clf=GridSearchCV(SVC(kernel='rbf'),param_grid,cv=5)
clf.fit(x_train,y_train)
Output:
GridSearchCV(cv=5, error_score='raise', estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False), fit_params=None, iid=True, n_jobs=1, param_grid={'gamma': array([ 0.00000e+00, 1.01010e-05, ..., 9.89899e-04, 1.00000e-03])}, pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring=None, verbose=0)
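After fitting, the per-candidate cross-validation scores are available through the cv_results_ attribute, which can help show how sensitive the score is to gamma. The following is a minimal sketch, not the post's exact run: it assumes a fixed random_state=0 for repeatability and uses a much smaller grid (5 values instead of 100) so that it runs quickly.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

datas = datasets.load_breast_cancer()
# random_state=0 is an assumption added here so the sketch is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    datas.data, datas.target, random_state=0)

# A deliberately small grid (5 values) to keep the sketch fast
param_grid = {'gamma': np.linspace(1e-5, 1e-3, 5)}
clf = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
clf.fit(x_train, y_train)

# cv_results_ holds one entry per candidate; mean_test_score is the
# accuracy averaged over the 5 cross-validation folds
results = clf.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params['gamma'], round(float(mean_score), 3))

print(clf.best_params_, clf.best_score_)
```

The best_score_ attribute is the highest of these mean cross-validation scores, which is why it can differ from the scores on the training and test sets.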
print("Best score: %0.3f" % clf.best_score_)
print("Best parameter combination:")
best_parameters=clf.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print('\t%s:%r' %(param_name,best_parameters[param_name]))
Output:
Best score: 0.925
Best parameter combination:
	gamma:8.0808080808080811e-05
Note: this result is not unique, so the numbers below will also differ from run to run. The reason is explained at the beginning of https://blog.csdn.net/wjwfighting/article/details/80970396.
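One likely source of this run-to-run variation is that train_test_split shuffles the data randomly when no random_state is given. As a hedged sketch (the seed value 42 is arbitrary and not from the original post), fixing random_state makes the split repeatable:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

datas = datasets.load_breast_cancer()

# random_state=42 is an arbitrary seed chosen for illustration
split_a = train_test_split(datas.data, datas.target, random_state=42)
split_b = train_test_split(datas.data, datas.target, random_state=42)

# With the same seed, both calls return identical partitions
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# Default test_size is 0.25: shapes of the training and test feature arrays
print(split_a[0].shape, split_a[1].shape)
```

With a fixed seed, the grid search and the scores that follow should be the same on every run in the same environment.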
print("Training set score:",clf.score(x_train,y_train))
print("Test set score:",clf.score(x_test,y_test))
Output:
Training set score: 0.934272300469
Test set score: 0.979020979021
predicted=clf.predict(x_test)
print('Predicted:',predicted)
print('Actual:',y_test)
Output:
Predicted: [0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0]
Actual: [0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0]
# The metric functions take the true labels first, then the predictions
print('Precision:',metrics.precision_score(y_test,predicted))
print('Recall:',metrics.recall_score(y_test,predicted))
print('F1:',metrics.f1_score(y_test,predicted))
print("Accuracy:",np.mean(predicted==y_test))
Output:
Precision: 0.978021978022
Recall: 0.988888888889
F1: 0.983425414365
Accuracy: 0.979020979021
# Classification report
xx=metrics.classification_report(y_test,predicted,target_names=datas.target_names)
print(xx)
Output:
             precision    recall  f1-score   support

  malignant       0.98      0.96      0.97        53
     benign       0.98      0.99      0.98        90

avg / total       0.98      0.98      0.98       143
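To make the columns of the report concrete, precision, recall, and F1 can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below uses small made-up label arrays (hypothetical, not the post's data) and checks the hand-computed values against sklearn.metrics. It also shows why argument order matters: swapping the true and predicted labels swaps precision and recall.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for illustration (1 = positive class); not the post's data
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn expects the true labels first, then the predictions
print(precision, metrics.precision_score(y_true, y_pred))
print(recall, metrics.recall_score(y_true, y_pred))
print(f1, metrics.f1_score(y_true, y_pred))

# Swapping the arguments returns the recall where the precision was expected
print(metrics.precision_score(y_pred, y_true))
```

F1 is unaffected by the swap because it is symmetric in precision and recall.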
confusion_matrix=metrics.confusion_matrix(y_test,predicted)
print('--Confusion matrix--')
print(confusion_matrix)
Output: