Breast Cancer Detection with SVM

The data come from the Wisconsin (USA) breast cancer diagnostic dataset. The table contains 32 fields, whose meanings are as follows:
[Figure 1: table describing the 32 fields]
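In the table above, mean stands for the mean, se for the standard error, and worst for the largest value (the mean of the three largest values). The 30 feature columns (excluding the ID field and the class label field diagnosis) are really 10 features (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal_dimension) measured along 3 dimensions, so we only need to take one dimension, the mean, as the features for this exercise. No field has missing values. Of the 569 patients, 357 are benign and 212 are malignant.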
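The counts quoted above can be checked with a few lines of pandas. The sketch below is only an illustration: it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the article's `data.csv`. That copy has no ID column, names its columns differently (for example `mean radius` instead of `radius_mean`), and encodes the label as 0 = malignant, 1 = benign, which is the opposite of the M to 1, B to 0 mapping used later in this article.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the article's data.csv.
bunch = load_breast_cancer(as_frame=True)
df = bunch.frame  # 30 feature columns plus a 'target' column

n_rows, n_cols = df.shape
# Total number of missing cells across the whole table
n_missing = int(df.isnull().sum().sum())

# In the bundled copy, target 0 means malignant and target 1 means benign.
counts = df['target'].value_counts()
n_malignant = int(counts[0])
n_benign = int(counts[1])

print("rows:", n_rows, "columns:", n_cols)
print("missing cells:", n_missing)
print("malignant:", n_malignant, "benign:", n_benign)
```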

Reading the data

import pandas as pd
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve
import warnings
warnings.filterwarnings('ignore')
# Read the data
data = pd.read_csv('D:/breast_cancer_data-master/breast_cancer_data-master/data.csv', encoding='utf-8')
print(data.head(5))
print(data.describe())

The output is shown below (truncated, not every column is listed):

         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

            ...             radius_worst  texture_worst  perimeter_worst  \
0           ...                    25.38          17.33           184.60   
1           ...                    24.99          23.41           158.80   
2           ...                    23.57          25.53           152.50   
3           ...                    14.91          26.50            98.87   
4           ...                    22.54          16.67           152.20   

   area_worst  smoothness_worst  compactness_worst  concavity_worst  \
0      2019.0            0.1622             0.6656           0.7119   
1      1956.0            0.1238             0.1866           0.2416   
2      1709.0            0.1444             0.4245           0.4504   
3       567.7            0.2098             0.8663           0.6869   
4      1575.0            0.1374             0.2050           0.4000   

   concave points_worst  symmetry_worst  fractal_dimension_worst  
0                0.2654          0.4601                  0.11890  
1                0.1860          0.2750                  0.08902  
2                0.2430          0.3613                  0.08758  
3                0.2575          0.6638                  0.17300  
4                0.1625          0.2364                  0.07678  

[5 rows x 32 columns]
                 id  radius_mean  texture_mean  perimeter_mean    area_mean  \
count  5.690000e+02   569.000000    569.000000      569.000000   569.000000   
mean   3.037183e+07    14.127292     19.289649       91.969033   654.889104   
std    1.250206e+08     3.524049      4.301036       24.298981   351.914129   
min    8.670000e+03     6.981000      9.710000       43.790000   143.500000   
25%    8.692180e+05    11.700000     16.170000       75.170000   420.300000   
50%    9.060240e+05    13.370000     18.840000       86.240000   551.100000   
75%    8.813129e+06    15.780000     21.800000      104.100000   782.700000   
max    9.113205e+08    28.110000     39.280000      188.500000  2501.000000

Data processing

# Use the "mean" dimension as the feature variables
features_mean = list(data.columns[2:12])
# Drop the id field, which carries no diagnostic information
data.drop('id', axis=1, inplace=True)
# Map M (malignant) to 1 and B (benign) to 0
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
# Plot the number of malignant and benign cases
sns.countplot(x='diagnosis', data=data)
plt.show()
plt.figure(figsize=(10, 10))
# Check collinearity among the mean features
corr = data[features_mean].corr()
sns.heatmap(corr, annot=True)
plt.show()

The output is as follows:
[Figure 2: count plot of the diagnosis classes]
[Figure 3: correlation heatmap of the mean features]
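The heatmap shows that the correlation coefficients among radius_mean, perimeter_mean, and area_mean are far above 0.8, meaning they are highly linearly correlated, so only one of them needs to be kept. The correlation coefficients among compactness_mean, concavity_mean, and concave points_mean are also above 0.8, so one feature from that group is enough as well.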
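Reading pairs off a heatmap by eye is easy to get wrong, so it can help to list them programmatically. The following is a minimal sketch, again assuming scikit-learn's bundled copy of the dataset rather than the article's CSV (so the columns are named `mean radius`, `mean perimeter`, and so on). The 0.8 threshold is the one used in the text; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled copy stands in for data.csv; its first 10 columns
# are the "mean" features ('mean radius' ... 'mean fractal dimension').
X = load_breast_cancer(as_frame=True).data
mean_cols = list(X.columns[:10])

# Absolute Pearson correlation between every pair of mean features
corr = X[mean_cols].corr().abs()

threshold = 0.8  # same cut-off as in the text
high_pairs = []
for i, a in enumerate(mean_cols):
    for b in mean_cols[i + 1:]:  # upper triangle only, skip the diagonal
        r = float(corr.loc[a, b])
        if r > threshold:
            high_pairs.append((a, b, round(r, 3)))

for a, b, r in high_pairs:
    print(f"{a} ~ {b}: {r}")
```

Any pair printed here is a candidate for keeping only one of the two features.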

Feature selection and train/test split

# Feature selection
features_remain = ['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']
train, test = train_test_split(data, test_size=0.2)
x_train = train[features_remain]
y_train = train['diagnosis']
x_test = test[features_remain]
y_test = test['diagnosis']
# Standardize the data (fit on the training set only, then apply to the test set)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
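The split above is random on every run and does not preserve the benign/malignant ratio. A possible variant, shown here as a sketch and not as the article's method, passes `stratify` and `random_state` to `train_test_split`. It assumes scikit-learn's bundled dataset in place of the CSV, and the seed 42 is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy stands in for the article's data.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y keeps the class ratio similar in both parts;
# random_state=42 (an arbitrary seed) makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training part only, so that no statistics
# from the test part leak into preprocessing.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print("train size:", len(X_train), "test size:", len(X_test))
print("class-1 share in train:", round(float(y_train.mean()), 3))
print("class-1 share in test:", round(float(y_test.mean()), 3))
```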

Modeling

# Build the SVC model
model = SVC()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
score = accuracy_score(y_test, y_predict)
print(score)

The output is as follows:

0.956140350877193

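The accuracy reaches 0.956 on this split, so the model performs well. Note that train_test_split is called without a fixed random_state, so the exact value changes from run to run. Thanks for reading!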
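Since a single random split gives a noisy estimate, cross-validation is one way to get a steadier number. The sketch below is an assumption-laden illustration, not the article's code: it uses scikit-learn's bundled dataset, picks the six columns that correspond to `features_remain` under that copy's naming, and wraps scaling and `SVC()` in a pipeline so the scaler is refit inside each fold. The fold count and seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled copy stands in for data.csv; these names are its
# equivalents of the article's features_remain list.
cols = ['mean radius', 'mean texture', 'mean smoothness',
        'mean compactness', 'mean symmetry', 'mean fractal dimension']
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = X[cols]

# The pipeline refits the scaler on the training part of every fold.
pipe = make_pipeline(StandardScaler(), SVC())

# 5 stratified folds; shuffle with an arbitrary fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

For a diagnostic task it may also be worth looking at recall on the malignant class, for example by changing `scoring`, since accuracy alone hides which kind of error the model makes.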
