The file 【dataset/titanic_train.csv】 contains Titanic passenger records together with a label indicating whether each passenger survived. The fields are: PassengerId (passenger ID), Survived (label: 1 = survived, 0 = died), Pclass (ticket class: 1/2/3), Name, Sex, Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Ticket (ticket number), Fare, Cabin (cabin number), and Embarked (port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton).
The tasks below are: explore the data, clean the missing values, encode the categorical features, scale the numeric features, train a logistic regression model, and evaluate it on held-out data.
Preparation for this walkthrough: import the libraries, fix the random seeds, and load the dataset.
import random
import numpy as np
import pandas as pd

# Fix the random seeds for reproducibility
random_state = 100
random.seed(random_state)
np.random.seed(random_state)

# Load the training data
data_file = 'dataset/titanic_train.csv'
df = pd.read_csv(data_file)
print(df.head())
print("=" * 100)
print("Training data shape:", df.shape)
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
====================================================================================================
Training data shape: (891, 12)
Check each field's type (text/numeric) and count the missing values per field.
print(df.info())
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
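From this output, Age is missing 891 − 714 = 177 values, Cabin is missing 891 − 204 = 687, and Embarked is missing 891 − 889 = 2; all other fields are complete.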
Count the survival (Survived) outcomes for each passenger class (Pclass) and display them as a stacked bar chart.
import matplotlib.pyplot as plt
%matplotlib inline

# Count passengers of each Pclass among those who did not survive
no_survived = df['Pclass'][df['Survived'] == 0].value_counts()
# Count passengers of each Pclass among those who survived
survived = df['Pclass'][df['Survived'] == 1].value_counts()
# Build a small DataFrame for plotting
df_temp = pd.DataFrame({'Survived': survived, 'Died': no_survived})
# Draw the stacked bar chart
df_temp.plot(kind='bar', stacked=True)
plt.xlabel('Class')
plt.ylabel('Sum')
plt.show()
![Stacked bar chart of Survived/Died counts per Pclass](output_7_0.png)
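As an aside, the same counts and chart can be produced more compactly with pd.crosstab; a minimal sketch using the same df:

```python
# Sketch: cross-tabulate Pclass against Survived and plot directly.
counts = pd.crosstab(df['Pclass'], df['Survived'])
counts.columns = ['Died', 'Survived']  # rename 0/1 for a readable legend
counts.plot(kind='bar', stacked=True)
plt.xlabel('Class')
plt.ylabel('Sum')
plt.show()
```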
The code below drops the fields irrelevant to prediction, keeping only the feature fields and the label.
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
print(df.columns)
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
'Embarked'],
dtype='object')
As the output shows, the irrelevant fields have been removed.
Split the data 7:3 into a training set and a test set.
Note that Survived (column index 0) is the label; the remaining fields are the features.
from sklearn.model_selection import train_test_split

# Column 0 (Survived) is the label; all remaining columns are features
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.3, random_state=random_state)
print("Training features:", X_train.shape, ", labels:", y_train.shape)
print("Test features:", X_test.shape, ", labels:", y_test.shape)
Training features: (623, 7) , labels: (623,)
Test features: (268, 7) , labels: (268,)
That gives 623 training samples and 268 test samples (891 × 0.3 = 267.3, rounded up to 268 for the test set). All samples are kept, but some fields contain missing values, which must be filled in a suitable way.
from sklearn.ensemble import RandomForestRegressor

# Use the numeric fields 'Age', 'Fare', 'Parch', 'SibSp', 'Pclass'
age = X_train[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
known_age = age[age.Age.notnull()].values    # rows where Age is present
unknown_age = age[age.Age.isnull()].values   # rows where Age is missing
X = known_age[:, 1:]  # the last 4 columns form the feature matrix
y = known_age[:, 0]   # column 0 (Age) is the regression target
# Train the model on the training rows with known Age (training data only, to avoid leakage)
rf = RandomForestRegressor(random_state=random_state, n_estimators=200)
rf.fit(X, y)
# Predict Age for the training rows where it is missing
predicts = rf.predict(unknown_age[:, 1:])
# Write the predictions back into X_train
X_train.loc[(X_train.Age.isnull()), 'Age'] = predicts
# Do the same for X_test, reusing the model fitted on the training set
age_test = X_test[['Fare', 'Parch', 'SibSp', 'Pclass']][X_test.Age.isnull()]
predicts = rf.predict(age_test)
X_test.loc[(X_test.Age.isnull()), 'Age'] = predicts
print("Imputed Age values in the test data:", predicts)
print("=" * 100)
print(X_train.info())
print("=" * 100)
print(X_test.info())
Imputed Age values in the test data: [ 4.9025846 27.01755548 29.26959254 11.22093849 29.146325 28.85321249
26.396 11.22093849 26.41578589 23.26333333 25.06640712 32.73736452
39.30680952 22.71342866 27.01546825 26.41578589 39.05960516 29.44317208
27.01755548 29.26959254 32.73736452 32.73736452 29.44317208 16.00583333
32.73736452 32.73736452 27.84292626 27.84292626 27.01755548 22.71342866
27.84292626 23.34266667 32.43666667 25.62333333 32.73736452 38.06933333
32.51466667 25.41690476 33.41228986 23.02333333 22.90907184 27.01755548
36.85508479 30.35875 27.01755548 39.58833333 52.01260387 25.6397619
36.85508479]
====================================================================================================
Int64Index: 623 entries, 69 to 520
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 623 non-null int64
1 Sex 623 non-null object
2 Age 623 non-null float64
3 SibSp 623 non-null int64
4 Parch 623 non-null int64
5 Fare 623 non-null float64
6 Embarked 622 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 38.9+ KB
None
====================================================================================================
Int64Index: 268 entries, 205 to 277
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 268 non-null int64
1 Sex 268 non-null object
2 Age 268 non-null float64
3 SibSp 268 non-null int64
4 Parch 268 non-null int64
5 Fare 268 non-null float64
6 Embarked 267 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 16.8+ KB
None
Above are the predicted Age values for the test set; the info() output also confirms that Age now has no missing values in either the training set or the test set.
embarked_mode = X_train['Embarked'].mode().values[0]
print("Mode of the Embarked field:", embarked_mode)
# Fill missing Embarked values in both splits with the training-set mode
X_train.loc[X_train.Embarked.isnull(), 'Embarked'] = embarked_mode
X_test.loc[X_test.Embarked.isnull(), 'Embarked'] = embarked_mode
print(X_train.info())
print("=" * 100)
print(X_test.info())
Mode of the Embarked field: S
Int64Index: 623 entries, 69 to 520
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 623 non-null int64
1 Sex 623 non-null object
2 Age 623 non-null float64
3 SibSp 623 non-null int64
4 Parch 623 non-null int64
5 Fare 623 non-null float64
6 Embarked 623 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 38.9+ KB
None
====================================================================================================
Int64Index: 268 entries, 205 to 277
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 268 non-null int64
1 Sex 268 non-null object
2 Age 268 non-null float64
3 SibSp 268 non-null int64
4 Parch 268 non-null int64
5 Fare 268 non-null float64
6 Embarked 268 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 16.8+ KB
None
As shown, the Embarked field no longer has missing values.
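For reference, the same fill can be expressed with sklearn's SimpleImputer, fitted on the training split and applied to both; a minimal sketch (a no-op at this point, since Embarked is already filled):

```python
from sklearn.impute import SimpleImputer

# Sketch: 'most_frequent' reproduces the manual mode fill above.
imputer = SimpleImputer(strategy='most_frequent')
X_train[['Embarked']] = imputer.fit_transform(X_train[['Embarked']])
X_test[['Embarked']] = imputer.transform(X_test[['Embarked']])
```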
Convert the Sex values: female → 0, male → 1.
# Map the Sex strings to integers in both splits
X_train.loc[X_train.Sex == 'male', 'Sex'] = 1
X_train.loc[X_train.Sex == 'female', 'Sex'] = 0
X_test.loc[X_test.Sex == 'male', 'Sex'] = 1
X_test.loc[X_test.Sex == 'female', 'Sex'] = 0
print(X_train.head())
print("=" * 100)
print(X_test.head())
Pclass Sex Age SibSp Parch Fare Embarked
69 3 1 26.00000 2 0 8.6625 S
85 3 0 33.00000 3 0 15.8500 S
794 3 1 25.00000 0 0 7.8958 S
161 2 0 40.00000 0 0 15.7500 S
815 1 1 39.30681 0 0 0.0000 S
====================================================================================================
Pclass Sex Age SibSp Parch Fare Embarked
205 3 0 2.0 0 1 10.4625 S
44 3 0 19.0 0 0 7.8792 Q
821 3 1 27.0 0 0 8.6625 S
458 2 0 50.0 0 0 10.5000 S
795 2 1 39.0 0 0 13.0000 S
As shown, the Sex field has been converted to 0/1 values.
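The same conversion can also be written with Series.map; a minimal sketch, to be run instead of (not after) the .loc assignments above:

```python
# Sketch: map each Sex string to its numeric code.
sex_code = {'female': 0, 'male': 1}
X_train['Sex'] = X_train['Sex'].map(sex_code)
X_test['Sex'] = X_test['Sex'].map(sex_code)
```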
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoder.fit(X_train.loc[:, ['Embarked']])  # even for a single column, put the name inside [] so a 2-D frame is selected
# Encode the training and test sets with the same fitted encoder
train_onehot = encoder.transform(X_train.loc[:, ['Embarked']]).toarray()  # toarray() converts the sparse result to a dense array
test_onehot = encoder.transform(X_test.loc[:, ['Embarked']]).toarray()
# Append one new column per category
for index in range(train_onehot.shape[1]):
    category_name = encoder.categories_[0][index]  # name of the index-th category after encoding
    X_train['Embarked_' + category_name] = train_onehot[:, index]  # add the encoded column to the frame
    X_test['Embarked_' + category_name] = test_onehot[:, index]
# Drop the original Embarked field
X_train.drop(['Embarked'], axis=1, inplace=True)
X_test.drop(['Embarked'], axis=1, inplace=True)
# Print the first 5 rows
print(X_train.head())
print("=" * 100)
print(X_test.head())
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q \
69 3 1 26.00000 2 0 8.6625 0.0 0.0
85 3 0 33.00000 3 0 15.8500 0.0 0.0
794 3 1 25.00000 0 0 7.8958 0.0 0.0
161 2 0 40.00000 0 0 15.7500 0.0 0.0
815 1 1 39.30681 0 0 0.0000 0.0 0.0
Embarked_S
69 1.0
85 1.0
794 1.0
161 1.0
815 1.0
====================================================================================================
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q \
205 3 0 2.0 0 1 10.4625 0.0 0.0
44 3 0 19.0 0 0 7.8792 0.0 1.0
821 3 1 27.0 0 0 8.6625 0.0 0.0
458 2 0 50.0 0 0 10.5000 0.0 0.0
795 2 1 39.0 0 0 13.0000 0.0 0.0
Embarked_S
205 1.0
44 0.0
821 1.0
458 1.0
795 1.0
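Had the encoding been done with pandas instead, get_dummies is shorter, but it infers the categories from whatever frame it receives, so the training and test column sets can drift apart; the fitted OneHotEncoder above guarantees consistent columns. A sketch for comparison, which assumes a frame that still has the raw Embarked column:

```python
# Sketch: pandas one-hot encoding (categories inferred per frame).
dummies = pd.get_dummies(X_train['Embarked'], prefix='Embarked')
X_train_alt = pd.concat([X_train.drop('Embarked', axis=1), dummies], axis=1)
```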
As noted above, the Age and Fare fields need rescaling (standardization or min-max normalization); standardization is used here.
Note that the test set must be standardized with the parameters learned from the training set.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform learns the training means/std devs and applies them...
X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])
# ...transform reuses those training parameters on the test set
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])
print(X_test.head())
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q \
205 3 0 -2.076413 0 1 -0.427985 0.0 0.0
44 3 0 -0.815021 0 0 -0.477484 0.0 1.0
821 3 1 -0.221425 0 0 -0.462475 0.0 0.0
458 2 0 1.485164 0 0 -0.427266 0.0 0.0
795 2 1 0.668970 0 0 -0.379363 0.0 0.0
Embarked_S
205 1.0
44 0.0
821 1.0
458 1.0
795 1.0
After standardization, the Age and Fare values have been rescaled (zero mean and unit variance on the training set).
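As a quick check, StandardScaler stores the training-set means and standard deviations it applies as z = (x − μ_train) / σ_train; a minimal sketch inspecting them:

```python
# Sketch: the parameters learned from the training set and reused on the test set.
print("Training means (Age, Fare):", scaler.mean_)
print("Training std devs (Age, Fare):", scaler.scale_)
```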
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Take the first 10 test samples and predict their classes
number = 10
predicts = model.predict(X_test.iloc[:number, :])
# Predict the class probabilities for the same samples
predicts_prob = model.predict_proba(X_test.iloc[:number, :])
# Show the features, true labels, and predictions side by side
results = X_test.iloc[:number, :].copy()
results['Survived'] = y_test[:number]
results['Predicted'] = predicts
results['Predicts_prob_0'] = predicts_prob[:, 0]  # predicted probability of class 0 (did not survive)
results['Predicts_prob_1'] = predicts_prob[:, 1]  # predicted probability of class 1 (survived)
print(results)
Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q \
205 3 0 -2.076413 0 1 -0.427985 0.0 0.0
44 3 0 -0.815021 0 0 -0.477484 0.0 1.0
821 3 1 -0.221425 0 0 -0.462475 0.0 0.0
458 2 0 1.485164 0 0 -0.427266 0.0 0.0
795 2 1 0.668970 0 0 -0.379363 0.0 0.0
118 1 1 -0.444023 0 1 4.114350 1.0 0.0
424 3 1 -0.889220 1 1 -0.241162 0.0 0.0
678 3 0 0.965768 1 6 0.270204 0.0 0.0
269 1 0 0.372172 0 0 1.970445 0.0 0.0
229 3 0 -1.861042 3 1 -0.140485 0.0 0.0
Embarked_S Survived Predicted Predicts_prob_0 Predicts_prob_1
205 1.0 0 1 0.229934 0.770066
44 0.0 1 1 0.290268 0.709732
821 1.0 1 0 0.900322 0.099678
458 1.0 1 1 0.323410 0.676590
795 1.0 0 0 0.826048 0.173952
118 0.0 0 1 0.300159 0.699841
424 1.0 0 0 0.913576 0.086424
678 1.0 0 0 0.830145 0.169855
269 1.0 1 1 0.063346 0.936654
229 1.0 0 1 0.439897 0.560103
Evaluate the model on the test data: accuracy, precision, recall, and F1 score.
Accuracy can be computed with the model's score method; the precision_score, recall_score, and f1_score functions in sklearn.metrics compute the other metrics.
from sklearn.metrics import precision_score, recall_score, f1_score

accuracy = model.score(X_test, y_test)
# Predict the full test set
predicts = model.predict(X_test)
precision = precision_score(y_test, predicts)
recall = recall_score(y_test, predicts)
f1 = f1_score(y_test, predicts)
print("Accuracy: %.3f, Precision: %.3f, Recall: %.3f, F1: %.3f" % (accuracy, precision, recall, f1))
Accuracy: 0.791, Precision: 0.785, Recall: 0.670, F1: 0.723
sklearn.metrics.confusion_matrix tabulates the TN/FP/FN/TP counts from the true and predicted labels; with labels 0 and 1, the layout is [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class).
from sklearn.metrics import confusion_matrix
predicts = model.predict(X_test)
confusion_matrix_model = confusion_matrix(y_test, predicts)
print(confusion_matrix_model)
[[139 20]
[ 36 73]]
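Reading the matrix back: TN = 139, FP = 20, FN = 36, TP = 73, which reproduces the metrics above; a minimal sketch:

```python
# Sketch: recompute the metrics from the confusion-matrix counts.
(tn, fp), (fn, tp) = confusion_matrix_model
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 212 / 268 ≈ 0.791
precision = tp / (tp + fp)                          # 73 / 93   ≈ 0.785
recall = tp / (tp + fn)                             # 73 / 109  ≈ 0.670
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.723
print("Accuracy: %.3f, Precision: %.3f, Recall: %.3f, F1: %.3f" % (accuracy, precision, recall, f1))
```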
sklearn.metrics.classification_report generates a combined report of per-class precision, recall, F1 score, and support from the true and predicted labels.
from sklearn.metrics import classification_report
predicts = model.predict(X_test)
print(classification_report(y_test, predicts))
precision recall f1-score support
0 0.79 0.87 0.83 159
1 0.78 0.67 0.72 109
accuracy 0.79 268
macro avg 0.79 0.77 0.78 268
weighted avg 0.79 0.79 0.79 268
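The decision threshold itself can also be tuned. predict uses 0.5 by default; lowering the threshold to K = 0.4 classifies a passenger as a survivor whenever the predicted probability of class 1 exceeds 0.4, trading some precision for recall: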
predicts_prob = model.predict_proba(X_test)
K = 0.4
# For each sample, predict class 1 if its class-1 probability exceeds K, else class 0
predicts = [1 if prob[1] > K else 0 for prob in predicts_prob]
# Compute the accuracy by hand
corrects = np.sum(predicts == y_test)
accuracy = corrects / len(y_test)
precision = precision_score(y_test, predicts)
recall = recall_score(y_test, predicts)
f1 = f1_score(y_test, predicts)
print("Accuracy: %.2f, Precision: %.2f, Recall: %.2f, F1: %.2f" % (accuracy, precision, recall, f1))
Accuracy: 0.81, Precision: 0.78, Recall: 0.73, F1: 0.75
This concludes the Titanic survival-prediction experiment. 【dataset/titanic_test.csv】 holds a further batch of passenger records (without the Survived label). As an exercise, run the model above on them and submit the predictions to Kaggle to check the accuracy; see https://www.kaggle.com/c/titanic for details.
The file 【dataset/exam_score.csv】 holds student score data; each sample has two feature fields, exam1_score and exam2_score, and one label, passed (whether the student passed overall). The task: visualize the samples, train a logistic regression classifier, and draw its decision boundary.
Analysis:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

trainData = np.loadtxt(open('dataset/exam_score.csv', 'r'), delimiter=",", skiprows=1)
x1 = trainData[:, 0]  # first feature (exam1_score)
x2 = trainData[:, 1]  # second feature (exam2_score)
y = trainData[:, 2]   # label

def initPlot():
    plt.figure()
    plt.title('Data for ')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    return plt

plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]    # exam1_score of samples labeled 1 (passed)
score2ForPassed = trainData[trainData[:, 2] == 1, 1]    # exam2_score of samples labeled 1 (passed)
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]  # exam1_score of samples labeled 0
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]  # exam2_score of samples labeled 0
plt.plot(score1ForPassed, score2ForPassed, 'r+')      # passed: red plus signs
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')  # not passed: black circles
plt.show()
![Scatter plot of exam scores: passed (r+) vs. not passed (ko)](output_48_0.png)
# Prepare the data
from sklearn.linear_model import LogisticRegression

X_train = trainData[:, [0, 1]]
y_train = trainData[:, 2]
model = LogisticRegression()
model.fit(X_train, y_train)
# Features of 4 new samples to classify
newScores = np.array([[58, 67], [90, 90], [35, 38], [55, 56]])
print("Predictions:")
print(model.predict(newScores))
Predictions:
[1. 1. 0. 0.]
Of the four samples, the first two are predicted 1 (passed) and the last two 0 (not passed).
# Collect the weights: w0 (intercept), w1 and w2 (feature coefficients)
W = np.array([model.intercept_[0], model.coef_[0, 0], model.coef_[0, 1]])
plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
# Draw the decision boundary: w0 + w1*x + w2*y = 0, i.e. y = -(w1*x + w0) / w2
boundaryX = np.array([30, 100])                # two x-coordinates spanning the data
boundaryY = -(W[1] * boundaryX + W[0]) / W[2]  # corresponding y-coordinates on the boundary
plt.plot(boundaryX, boundaryY, 'b-')           # connect the two boundary points
plt.show()
![Exam-score scatter plot with the linear decision boundary](output_54_0.png)
The file 【dataset/non_linear.csv】 contains two feature fields and one class label. The task: visualize the data, build polynomial features so a non-linear boundary can be fitted, and study how the regularization strength C affects the result.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

trainData = np.loadtxt(open('dataset/non_linear.csv', 'r'), delimiter=",", skiprows=0)
x1 = trainData[:, 0]  # first feature
x2 = trainData[:, 1]  # second feature
y = trainData[:, 2]   # label

def initPlot():
    plt.figure()
    plt.title('Data for ')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    return plt

plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
plt.show()
![Scatter plot of the non-linear dataset: class 1 (r+) vs. class 0 (ko)](output_58_0.png)
Clearly these samples are not linearly separable: no single straight line separates the two classes well. One remedy is to expand the two features into all polynomial terms up to degree 6 and fit the logistic regression in that 28-dimensional space, as the mapFeatures function below does.
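For reference, sklearn's PolynomialFeatures produces the same 28 columns (a degree-6 expansion of two variables has (6 + 1)(6 + 2) / 2 = 28 terms, constant column included), though the term order may differ from the handwritten mapFeatures below; a minimal sketch:

```python
from sklearn.preprocessing import PolynomialFeatures

# Sketch: degree-6 polynomial expansion of the two raw features.
poly = PolynomialFeatures(degree=6)
features_alt = poly.fit_transform(trainData[:, [0, 1]])
print(features_alt.shape)  # expected: (118, 28)
```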
def mapFeatures(x1, x2):  # build the degree-6 two-variable polynomial feature matrix
    rowCount = len(x1)
    colIndex = 1  # column 0 is the intercept item and stays 1
    features = np.ones((rowCount, FEATURE_COUNT))
    for i in np.arange(1, DEGREE + 1):  # 1, 2, 3, ..., DEGREE
        for j in np.arange(0, i + 1):   # 0, 1, 2, ..., i
            features[:, colIndex] = (x1 ** (i - j)) * (x2 ** j)  # one feature column per (i, j)
            colIndex = colIndex + 1
    return features

# Global constants
DEGREE = 6                  # highest polynomial degree
FEATURE_COUNT = 28          # two variables at degree 6 give 28 features (intercept item included)
ROW_COUNT = len(trainData)  # number of samples

features = mapFeatures(x1, x2)  # a ROW_COUNT x FEATURE_COUNT feature array
print("Shape of the high-order feature matrix:", features.shape)  # each sample has 28 dimensions
Shape of the high-order feature matrix: (118, 28)
X_train = features
y_train = trainData[:, 2]
# penalty='none' disables regularization entirely (newer sklearn versions spell this penalty=None)
model = LogisticRegression(penalty='none', max_iter=2000)
model.fit(X_train, y_train)
print("Intercept:", model.intercept_[0])
print("Weights:", model.coef_[0])
Intercept: 13.805214927591399
Weights: [ 13.80521493 41.32782962 40.70132754 -280.85788491
-152.60128169 -130.58982339 -299.39921416 -464.90952213
-328.27047934 -144.13859136 944.5933964 1006.56535514
1300.86067248 619.81271183 249.08387005 501.85908681
1084.70837594 1402.01857905 1175.93373061 564.27758173
159.45797961 -1047.61826184 -1743.79125559 -2795.08309536
-2483.86639601 -2035.87732905 -854.59244568 -213.22468803]
c:\users\iahuo\appdata\local\programs\python\python38\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
The decision boundary is no longer a straight line. To draw it, evaluate the classifier on a dense grid of (x1, x2) points, record the predicted class (0 or 1) for each point, and then plot the contour at height 0.5 (the decision threshold) with a contour chart.
plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
# Evaluate the model on a 50 x 50 grid of points
plotX1 = np.linspace(-1, 1.5, 50)
plotX2 = np.linspace(-1, 1.5, 50)
Z = np.zeros((len(plotX2), len(plotX1)))
for i in np.arange(0, len(plotX1)):       # predict one grid column (fixed x1) at a time
    a1 = np.full(len(plotX2), plotX1[i])  # repeat plotX1[i] for every x2 value
    plotFeatures = mapFeatures(a1, plotX2)
    Z[:, i] = model.predict(plotFeatures)  # store as a column: contour expects Z[x2_index, x1_index]
plt.contour(plotX1, plotX2, Z, levels=[0.5])  # the Z = 0.5 contour is the decision boundary
plt.show()
![Decision boundary with no regularization (overfit)](output_66_0.png)
Although the separation looks good, the model has clearly overfit.
# Strong regularization: C = 0.1 (smaller C means a stronger penalty)
model = LogisticRegression(C=0.1)
model.fit(X_train, y_train)
print("Intercept:", model.intercept_[0])
print("Weights:", model.coef_[0])
plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
# Evaluate the model on a 50 x 50 grid of points
plotX1 = np.linspace(-1, 1.5, 50)
plotX2 = np.linspace(-1, 1.5, 50)
Z = np.zeros((len(plotX2), len(plotX1)))
for i in np.arange(0, len(plotX1)):       # predict one grid column (fixed x1) at a time
    a1 = np.full(len(plotX2), plotX1[i])
    plotFeatures = mapFeatures(a1, plotX2)
    Z[:, i] = model.predict(plotFeatures)
plt.contour(plotX1, plotX2, Z, levels=[0.5])  # the Z = 0.5 contour is the decision boundary
plt.show()
Intercept: 0.3261743348125194
Weights: [ 4.80060874e-06 -8.15346950e-03 1.65795385e-01 -4.46717768e-01
-1.11773868e-01 -2.78919687e-01 -7.14543762e-02 -5.78891579e-02
-6.50971508e-02 -1.06370649e-01 -3.36728581e-01 -1.29717223e-02
-1.16707334e-01 -2.80967442e-02 -2.86026426e-01 -1.16148883e-01
-3.70447251e-02 -2.24215126e-02 -4.88657219e-02 -4.16295811e-02
-1.86754269e-01 -2.53337925e-01 -2.91085963e-03 -5.79667693e-02
-5.28007020e-04 -6.35287458e-02 -1.20640539e-02 -2.71483918e-01]
![Decision boundary with C = 0.1 (strong regularization)](output_70_1.png)
# Moderate regularization: C = 10
model = LogisticRegression(C=10)
model.fit(X_train, y_train)
plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
# Evaluate the model on a 50 x 50 grid of points
plotX1 = np.linspace(-1, 1.5, 50)
plotX2 = np.linspace(-1, 1.5, 50)
Z = np.zeros((len(plotX2), len(plotX1)))
for i in np.arange(0, len(plotX1)):       # predict one grid column (fixed x1) at a time
    a1 = np.full(len(plotX2), plotX1[i])
    plotFeatures = mapFeatures(a1, plotX2)
    Z[:, i] = model.predict(plotFeatures)
plt.contour(plotX1, plotX2, Z, levels=[0.5])  # the Z = 0.5 contour is the decision boundary
plt.show()
![Decision boundary with C = 10](output_73_0.png)
This setting works well: the boundary largely reflects the class structure of the samples.
# Very weak regularization: C = 1000
model = LogisticRegression(C=1000)
model.fit(X_train, y_train)
plt = initPlot()
score1ForPassed = trainData[trainData[:, 2] == 1, 0]
score2ForPassed = trainData[trainData[:, 2] == 1, 1]
score1ForUnpassed = trainData[trainData[:, 2] == 0, 0]
score2ForUnpassed = trainData[trainData[:, 2] == 0, 1]
plt.plot(score1ForPassed, score2ForPassed, 'r+')
plt.plot(score1ForUnpassed, score2ForUnpassed, 'ko')
# Evaluate the model on a 50 x 50 grid of points
plotX1 = np.linspace(-1, 1.5, 50)
plotX2 = np.linspace(-1, 1.5, 50)
Z = np.zeros((len(plotX2), len(plotX1)))
for i in np.arange(0, len(plotX1)):       # predict one grid column (fixed x1) at a time
    a1 = np.full(len(plotX2), plotX1[i])
    plotFeatures = mapFeatures(a1, plotX2)
    Z[:, i] = model.predict(plotFeatures)
plt.contour(plotX1, plotX2, Z, levels=[0.5])  # the Z = 0.5 contour is the decision boundary
plt.show()
c:\users\iahuo\appdata\local\programs\python\python38\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
![Decision boundary with C = 1000 (overfitting again)](output_76_1.png)
Here overfitting reappears. Extrapolating, increasing C further makes the penalty term ever weaker, until the model behaves as if there were no regularization at all (severe overfitting).
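Rather than picking C by eyeballing plots, it can be chosen by cross-validation; a minimal sketch using sklearn's LogisticRegressionCV, where the candidate grid is an assumption for illustration:

```python
from sklearn.linear_model import LogisticRegressionCV

# Sketch: 5-fold cross-validation over candidate C values (grid is illustrative).
cv_model = LogisticRegressionCV(Cs=[0.1, 1, 10, 100, 1000], cv=5, max_iter=2000)
cv_model.fit(X_train, y_train)
print("Best C:", cv_model.C_[0])
```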
The logistic regression above handles binary outcomes (two classes) well, but real problems often involve many classes. For example, recognizing which of the digits 0-9 a handwritten numeral represents is a 10-class problem.
【dataset/digits_training.csv】 stores the pixel data of 5000 handwritten digit images, one image per row and one pixel per column (the 2-D pixel array is flattened into one dimension).
【dataset/digits_testing.csv】 stores 500 test samples.
Task: train a logistic regression model that can recognize the 0-9 digit class from the pixel array.
Analysis:
- Prepare the training data. For reasonable accuracy the model needs plenty of it; here there are 5000 handwritten images of the digits 0-9 for training (plus 500 test images).
- Build the feature matrix from the training data.
- Classify with logistic regression.
- Predict with the resulting hypothesis.
- Validate with the test data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
%matplotlib inline

# Simple normalization: center each pixel column on the training mean, then scale by 255
def normalizeData(X, col_avg):
    return (X - col_avg) / 255

trainData = np.loadtxt(open('dataset/digits_training.csv', 'r'), delimiter=",", skiprows=1)
MTrain, NTrain = np.shape(trainData)
xTrain = trainData[:, 1:NTrain]
xTrain_col_avg = np.mean(xTrain, axis=0)
xTrain = normalizeData(xTrain, xTrain_col_avg)
yTrain = trainData[:, 0]
print("Loaded", MTrain, "training samples; training......")
model = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=500)
model.fit(xTrain, yTrain)
print("Training finished")

testData = np.loadtxt(open('dataset/digits_testing.csv', 'r'), delimiter=",", skiprows=1)
MTest, NTest = np.shape(testData)
xTest = testData[:, 1:NTest]
xTest = normalizeData(xTest, xTrain_col_avg)  # normalize with the training-set column means
yTest = testData[:, 0]
print("Loaded", MTest, "test samples; predicting......")
yPredict = model.predict(xTest)
errors = np.count_nonzero(yTest - yPredict)  # a nonzero difference means a misclassification
print("Prediction finished. Errors:", errors)
print("Test accuracy:", (MTest - errors) / MTest)
Loaded 5000 training samples; training......
Training finished
Loaded 500 test samples; predicting......
Prediction finished. Errors: 54
Test accuracy: 0.892
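To see which digits the model confuses most, a multiclass confusion matrix can be printed; a minimal sketch reusing yTest and yPredict from above:

```python
from sklearn.metrics import confusion_matrix

# Sketch: 10 x 10 matrix, rows = true digit, columns = predicted digit.
print(confusion_matrix(yTest, yPredict))
```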
The prediction accuracy on the test data is about 89.2%, which is actually quite poor for handwritten digit recognition; deep learning with convolutional neural networks typically reaches over 99%. To some extent this shows that simple machine learning methods are not well suited to image problems.