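Because I mainly want to work on binary classification, I removed one of the three classes from the iris dataset to turn it into a binary classification problem.
Importing the data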
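The post does not show how the class was removed. As a minimal sketch, assuming the dropped class is labeled 'Iris-virginica' and the label column is named 'Species' (the small `demo` frame below is purely illustrative), the filtering could be done in pandas like this:

```python
import pandas as pd

# Illustrative three-row frame standing in for the full iris data (assumption)
demo = pd.DataFrame({
    'Petal_Length': [1.4, 4.7, 6.0],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Keep only the rows whose label is not the class being dropped
binary_df = demo[demo['Species'] != 'Iris-virginica'].reset_index(drop=True)

print(binary_df['Species'].unique())
```

Filtering with a boolean mask leaves the remaining two classes untouched, so the rest of the workflow can proceed as a binary problem.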
import pandas as pd
# Read in the data
df = pd.read_csv('F:/Program Files/coding/book/python book/iris/iris_data.csv')
df.head()
You can then inspect the distribution and summary statistics of the data with the following statement:
df.describe()
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
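# Color the points by iris species label
def scatter_plot_by_category(feat, x, y):
    alpha = 0.5
    # Group the rows by the label column; each group gets its own color
    gs = df.groupby(feat)
    cs = cm.rainbow(np.linspace(0, 1, len(gs)))
    for g, c in zip(gs, cs):
        # g is a (label, sub-DataFrame) pair, so g[1] holds the rows of that group
        plt.scatter(g[1][x], g[1][y], color=c, alpha=alpha)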
plt.figure(figsize=(20,5))
plt.subplot(131)
scatter_plot_by_category('Species', 'Sepal_Length', 'Petal_Length')
plt.xlabel('Sepal_Length')
plt.ylabel('Petal_Length')
plt.title('Species')
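Splitting the data into training and test sets
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split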
all_inputs = df[['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']].values
all_classes = df['Species'].values
(X_train,X_test,Y_train,Y_test) = train_test_split(all_inputs, all_classes, train_size=0.8, random_state=1)
print(Y_test)
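- Double brackets are required when selecting the columns for all_inputs: the inner brackets form a list of column names. With single brackets, the names are treated as one tuple key and pandas raises an error.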
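The difference between the two bracket forms can be seen on a tiny frame. This is a sketch only; the `demo` frame and its two rows are illustrative and not the author's data:

```python
import pandas as pd

# Illustrative frame with the same column naming as the post (assumption)
demo = pd.DataFrame({
    'Sepal_Length': [5.1, 7.0],
    'Sepal_Width': [3.5, 3.2],
    'Species': ['Iris-setosa', 'Iris-versicolor'],
})

# Double brackets: a list of column names, returns a DataFrame
features = demo[['Sepal_Length', 'Sepal_Width']].values
print(features.shape)

# Single brackets: the names become one tuple key, which is not a column
try:
    demo['Sepal_Length', 'Sepal_Width']
    single_bracket_failed = False
except KeyError:
    single_bracket_failed = True
print(single_bracket_failed)
```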
Choosing a model and training it
Training with a decision tree
from sklearn.tree import DecisionTreeClassifier
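# Create a decision tree classifier
decision_tree_classifier = DecisionTreeClassifier()
# Train the model
model = decision_tree_classifier.fit(X_train, Y_train)
# Accuracy of the trained model on the test set
print(decision_tree_classifier.score(X_test, Y_test))
# Use the trained model to make predictions; to save effort,
# just take the first three rows of the test set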
print(X_test[0:3])
print(Y_test[0:3])
model.predict(X_test[0:3])
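For a classifier, score() reports mean accuracy, that is, the fraction of test samples whose predicted label matches the true one. The sketch below checks this by hand. It assumes scikit-learn's bundled iris data (load_iris) with class 2 dropped as a stand-in for the author's CSV, and it fixes random_state on the tree only to make the run repeatable:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the author's file: bundled iris data, reduced to two classes
X, y = load_iris(return_X_y=True)
mask = y != 2
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], train_size=0.8, random_state=1)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy computed by hand: share of predictions equal to the true labels
predictions = clf.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

print(manual_accuracy, clf.score(X_test, y_test))
```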
Training with an SVM
from sklearn.svm import LinearSVC
svm_classifier = LinearSVC(penalty='l1', tol=0.1, C=1, dual=False, class_weight=None,max_iter=100)
model = svm_classifier.fit(X_train, Y_train)
label = svm_classifier.predict(X_test)
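# decision_function returns each sample's signed score relative to the decision boundary.
# With three classes there is one column per class; with two classes it is a single 1-D array of scores.
score = svm_classifier.decision_function(X_train)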
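The shape of the decision_function output depends on the number of classes. As a rough sketch, again assuming scikit-learn's bundled iris data in place of the author's CSV and default LinearSVC settings apart from dual=False and a larger max_iter:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Three classes: one score column per class
multi = LinearSVC(dual=False, max_iter=1000).fit(X, y)
multi_scores = multi.decision_function(X)

# Two classes (class 2 dropped): a single score per sample
mask = y != 2
binary = LinearSVC(dual=False, max_iter=1000).fit(X[mask], y[mask])
binary_scores = binary.decision_function(X[mask])

print(multi_scores.shape)   # (150, 3)
print(binary_scores.shape)  # (100,)

# In the binary case a positive score means the second class in classes_
from_scores = binary.classes_[(binary_scores > 0).astype(int)]
```

In the binary setting used in this post, the sign of the score is therefore enough to recover the predicted label.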
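At this point in the experiment I found that string labels make some further analysis awkward, so I started over and converted the labels to numbers first.
read_csv() has a converters parameter: a dict of column conversion functions, whose keys can be either column names or column indices.
def iris_type(s):
    # Map each species name to an integer label
    class_label = {'Iris-setosa': 0, 'Iris-versicolor': 1}
    return class_label[s]
# Read in the data
df = pd.read_csv('F:/Program Files/paper2019/team/data/iris_data.csv', converters={4: iris_type})
# Show the first 5 rows
df.head()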
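To try the converters parameter without the CSV file on disk, the following sketch reads a two-row in-memory CSV. The column names and the sample rows are assumptions chosen to match the names used earlier in the post, and it shows that keying by column index or by column name gives the same result:

```python
import io
import pandas as pd

# Illustrative in-memory CSV; header names are assumed to match the post
csv_text = """Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

def iris_type(s):
    # Map each species name to an integer label
    class_label = {'Iris-setosa': 0, 'Iris-versicolor': 1}
    return class_label[s]

# Key by column position (0-based) ...
df_by_index = pd.read_csv(io.StringIO(csv_text), converters={4: iris_type})
# ... or by column name
df_by_name = pd.read_csv(io.StringIO(csv_text), converters={'Species': iris_type})

print(df_by_index['Species'].tolist())
```

Note that iris_type raises a KeyError for any label missing from the dict, so a file that still contains the third class would fail to load with this converter.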