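The data comes from the Breast Cancer Wisconsin (Diagnostic) dataset, collected in Wisconsin, USA. The table has 32 fields, with the following meanings:
In the table above, mean denotes the mean, se the standard error, and worst the largest value (the mean of the three largest values). The 30 feature columns (excluding the ID field and the class label field diagnosis) are really 10 measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal_dimension), each reported in 3 dimensions, so for this walkthrough we take just one dimension as the features. No field has missing values. Of the 569 patients, 357 are benign and 212 are malignant.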
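The "10 measurements times 3 dimensions" structure can be sketched in a few lines. This is only an illustration of the naming scheme: the `_mean` and `_worst` column names match the printed header further down, while the `_se` names are assumed to follow the same pattern because the printed output elides them.

```python
# The ten base measurements listed above (names as they appear in the CSV header).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Assumption: every dimension uses the "<base>_<suffix>" naming pattern.
suffixes = ["mean", "se", "worst"]

# 10 measurements x 3 dimensions = 30 feature columns.
all_features = [f"{base}_{suffix}" for suffix in suffixes for base in base_features]

# Keep only the "mean" dimension, as this walkthrough does.
mean_features = [name for name in all_features if name.endswith("_mean")]

print(len(all_features))
print(mean_features)
```

Selecting by suffix rather than by column position keeps the choice readable and does not depend on the column order in the file.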
import pandas as pd
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve
import warnings
warnings.filterwarnings('ignore')
# Load the data
data = pd.read_csv('D:/breast_cancer_data-master/breast_cancer_data-master/data.csv', encoding='utf-8')
print(data.head(5))
print(data.describe())
The output is as follows (not shown in full):
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... radius_worst texture_worst perimeter_worst \
0 ... 25.38 17.33 184.60
1 ... 24.99 23.41 158.80
2 ... 23.57 25.53 152.50
3 ... 14.91 26.50 98.87
4 ... 22.54 16.67 152.20
area_worst smoothness_worst compactness_worst concavity_worst \
0 2019.0 0.1622 0.6656 0.7119
1 1956.0 0.1238 0.1866 0.2416
2 1709.0 0.1444 0.4245 0.4504
3 567.7 0.2098 0.8663 0.6869
4 1575.0 0.1374 0.2050 0.4000
concave points_worst symmetry_worst fractal_dimension_worst
0 0.2654 0.4601 0.11890
1 0.1860 0.2750 0.08902
2 0.2430 0.3613 0.08758
3 0.2575 0.6638 0.17300
4 0.1625 0.2364 0.07678
[5 rows x 32 columns]
id radius_mean texture_mean perimeter_mean area_mean \
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000
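The summary statistics show features on very different scales (area_mean averages about 655, radius_mean about 14), which is why the features are standardized before the SVM is fitted later on. The following dependency-free sketch shows the per-column z-score that StandardScaler computes, using the population standard deviation. It runs on only the five rows printed by head(), so the numbers are illustrative and differ from those obtained on the full training set; the helper name `standardize` is hypothetical.

```python
import math

def standardize(values):
    """Return z-scores (x - mean) / std, using the population std (ddof=0)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# First five rows of two columns from the head() output above.
radius_mean = [17.99, 20.57, 19.69, 11.42, 20.29]
area_mean = [1001.0, 1326.0, 1203.0, 386.1, 1297.0]

radius_z = standardize(radius_mean)
area_z = standardize(area_mean)

# After scaling, both columns have mean ~0 and unit variance,
# so neither dominates the distance computations inside the SVM kernel.
print([round(z, 2) for z in radius_z])
print([round(z, 2) for z in area_z])
```

Note that describe() reports the sample standard deviation (ddof=1), so its std row is slightly larger than the value a population-based scaler would use.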
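# Use the "mean" dimension as the feature variables
features_mean = list(data.columns[2:12])
# Drop the id field, which carries no predictive information
data.drop('id', axis=1, inplace=True)
# Map M (malignant) to 1 and B (benign) to 0
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})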
# Plot the number of malignant and benign cases
sns.countplot(x='diagnosis', data=data)
plt.show()
plt.figure(figsize=(10, 10))
# Check for collinearity between the features
corr = data[features_mean].corr()
sns.heatmap(corr, annot=True)
plt.show()
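The output is as follows:
The heatmap shows that radius_mean, perimeter_mean, and area_mean have pairwise correlation coefficients well above 0.8, i.e. they are highly linearly correlated, so only one of them needs to be kept as a feature. Likewise, compactness_mean, concavity_mean, and concave points_mean are correlated above 0.8, so one of them suffices.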
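The manual reading of the heatmap can also be expressed as a rule: keep a feature only if its absolute correlation with every feature already kept is at most 0.8. Below is a minimal, dependency-free sketch of that greedy rule. The helper names `pearson` and `drop_correlated` are hypothetical, and the data is just the five rows printed by head(), so the result only illustrates the mechanics rather than reproducing the selection made in this post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.8):
    """Greedily keep a column only if |r| <= threshold with every column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# First five rows from the head() output above; illustrative only.
sample = {
    "radius_mean": [17.99, 20.57, 19.69, 11.42, 20.29],
    "perimeter_mean": [122.80, 132.90, 130.00, 77.58, 135.10],
    "area_mean": [1001.0, 1326.0, 1203.0, 386.1, 1297.0],
    "texture_mean": [10.38, 17.77, 21.25, 20.38, 14.34],
}
print(drop_correlated(sample))
```

The rule is order-dependent: whichever member of a correlated group comes first is the one that survives, which matches the idea of picking one representative per group.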
# Feature selection
features_remain = ['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']
train, test = train_test_split(data, test_size=0.2)
x_train = train[features_remain]
y_train = train['diagnosis']
x_test = test[features_remain]
y_test = test['diagnosis']
# Standardize the data (fit on the training set only, then apply to the test set)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
# Build and fit the SVC model
model = SVC()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
score = accuracy_score(y_test, y_predict)
print(score)
The output is as follows:
0.956140350877193
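To put that number in context, the arithmetic behind it can be reconstructed. The sketch below assumes that train_test_split rounds a fractional test size up, which gives 114 test samples out of 569; the `accuracy` helper is a hypothetical stand-in for accuracy_score, and the majority-class baseline uses the class counts quoted at the start of the post.

```python
import math

n_samples = 569
test_size = 0.2

# Assumption: a fractional test size is rounded up to a whole number of samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recover how many test samples were classified correctly from the printed score.
reported_score = 0.956140350877193
n_correct = round(reported_score * n_test)
print(n_train, n_test, n_correct, n_test - n_correct)

# Baseline for comparison: always predicting "benign" on the full dataset.
baseline = 357 / n_samples
print(round(baseline, 3))

# Tiny hypothetical example of the accuracy computation itself.
print(accuracy([1, 0, 0, 1, 0], [1, 0, 1, 1, 0]))
```

Under these assumptions the score corresponds to a handful of misclassified test samples, and it sits well above what a classifier that ignores the features would achieve.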
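Accuracy on the held-out test set reaches 0.956, so the model performs well. Because train_test_split is called without a fixed random_state, the exact figure varies from run to run. Thanks for reading!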