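Big news: I gained a follower. Exactly one. My blog finally has its first follower, and for that one follower I will keep updating diligently.
Today's update covers days 4 through 6 of the 100 Days of Machine Learning challenge. Why all three? On day 4 the original author gave the theory, on day 5 the author studied that theory in more depth without publishing concrete material, and on day 6 the author gave the code and a walkthrough of the dataset. So I am translating all three together here. Without further ado, let's begin.
Please credit the source when reposting.
Below is the infographic provided by the original author.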
Logistic regression is used for a different class of problems, known as classification problems.
Here the aim is to predict the group to which the object under observation belongs.
It gives a discrete binary outcome: either 0 or 1.
A simple example would be whether or not a person will vote in an upcoming election.
Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.
The logistic function, also called the sigmoid function, is what produces these probabilities.
To actually make a prediction, the probabilities must then be transformed into binary values.
These values between 0 and 1 are turned into either 0 or 1 using a threshold classifier.
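To make the thresholding step concrete, here is a minimal sketch. The helper name `threshold_classify`, the cutoff of 0.5, and the input scores are my own assumptions for illustration; none of them come from the original infographic.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def threshold_classify(probabilities, threshold=0.5):
    """Hypothetical helper: turn probabilities into hard 0/1 labels with a cutoff."""
    return (probabilities >= threshold).astype(int)

# Made-up linear scores, purely for illustration (not taken from the dataset)
scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probabilities = sigmoid(scores)
labels = threshold_classify(probabilities)

print(probabilities)
print(labels)  # [0 0 1 1 1]
```

A cutoff of 0.5 is only a common default. Raising or lowering it trades one kind of mistake for the other.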
Logistic regression gives a discrete outcome, whereas linear regression gives a continuous outcome.
The sigmoid function is an S-shaped curve that maps any real-valued input to a value between 0 and 1, but never exactly to those limits.
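For reference, the sigmoid function described above is usually written as follows. The symbol names $\sigma$ and $z$ are my own choice, since the infographic does not name them.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\lim_{z \to -\infty} \sigma(z) = 0, \qquad
\sigma(0) = \tfrac{1}{2}, \qquad
\lim_{z \to +\infty} \sigma(z) = 1
```

Because $e^{-z}$ is always positive, the denominator is always greater than 1, so $\sigma(z)$ stays strictly between 0 and 1 for every finite $z$.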
This infographic covers only the intuition behind logistic regression and is very brief.
The mathematical logic and the implementation will be covered in another infographic.
Below is what the original author wrote on day 5:
Moving forward with #100DaysOfMLCode, today I dived deeper into what logistic regression actually is and the math behind it.
I learned how the cost function is calculated and how to apply the gradient descent algorithm to the cost function to minimize the prediction error.
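The original author did not publish the day 5 material, so the following is only my own sketch of the idea. It assumes the usual cross-entropy (log loss) cost and plain batch gradient descent. The function names, learning rate, step count, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    """Average cross-entropy cost; eps guards against log(0)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    """Minimize the log loss by repeatedly stepping against its gradient."""
    w = np.zeros(X.shape[1])
    b = 0.0
    history = []
    for _ in range(n_steps):
        error = sigmoid(X @ w + b) - y   # predicted probability minus label
        grad_w = X.T @ error / len(y)    # gradient of the cost w.r.t. the weights
        grad_b = error.mean()            # gradient of the cost w.r.t. the bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(log_loss(w, b, X, y))
    return w, b, history

# Synthetic two-feature data (made up; this is not the SUV dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, history = gradient_descent(X, y)
print(f"cost after first step: {history[0]:.4f}, after last step: {history[-1]:.4f}")
```

The cost should shrink as the steps accumulate, which is exactly the "minimize the error in prediction" part of the author's note.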
Because I am short on time, I will now post an infographic on alternate days. Also, if someone wants to help me document the code, already has some experience in the field, and knows Markdown for GitHub, please contact me on LinkedIn :).
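/* Translator's note:
The Python code here is short, so don't expect to learn the theory from it. I think theory matters. Some training institutes also teach machine learning; I skimmed their course descriptions, and a large share of the syllabus is Python basics and Python applications such as web scraping and building websites, which does not help us learn machine learning. These courses cover very little of the fundamentals of machine learning. They mostly teach coding and simple examples such as building LeNet, which is a castle in the air. So once again I recommend Professor Hung-yi Lee's (李宏毅) theory lectures: lively, fun, and very good.
*/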
Below is the author's image introducing the dataset.
This dataset contains information about users of a social network: the user ID, gender, age, and estimated salary. A car company has just launched its brand-new luxury SUV, and we are trying to see which of these users are going to buy it. The last column tells whether or not the user bought the SUV. We are going to build a model that predicts whether a user will buy the SUV based on two variables: age and estimated salary. So our matrix of features consists of only these two columns. We want to find correlations between a user's age and estimated salary and his or her decision to purchase the SUV.
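# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Load the dataset: columns 2 and 3 (age, estimated salary) are the features, column 4 is the label
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Split into training and test sets (sklearn.cross_validation was removed; use model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)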
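The scaling step is easy to gloss over, so here is a small sketch of what `StandardScaler` does under the hood. The three training rows and the one test row are made-up numbers, not rows from Social_Network_Ads.csv, and the variable names are mine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, estimated salary) rows, for illustration only
X_train_demo = np.array([[25.0, 30000.0], [35.0, 60000.0], [45.0, 90000.0]])
X_test_demo = np.array([[30.0, 45000.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(X_train_demo)  # learns mean and std from the training rows
scaled_test = scaler.transform(X_test_demo)        # reuses the training mean and std

# The same computation by hand: subtract the training mean, divide by the training std
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (X_test_demo - mean) / std

print(scaled_test)
print(manual_test)
```

Calling only `transform` on the test set matters: the test rows must be scaled with statistics learned from the training rows, otherwise information from the test set leaks into the preprocessing.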
The library for this job is the linear model library. It is called linear because logistic regression is a linear classifier, which means that here, since we are in two dimensions, our two categories of users will be separated by a straight line. We then import the LogisticRegression class and create a new object from it, which will be the classifier that we fit on our training set.
# Fit logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = classifier.predict(X_test)
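To see why the two groups end up separated by a straight line, here is a rough sketch on synthetic data (not the SUV dataset; the `_demo` names and the data are my own assumptions). It reads the fitted weights from `coef_` and `intercept_` and checks that predicting class 1 is the same as landing on the positive side of the line w1*x1 + w2*x2 + b = 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# The fitted model is a weight per feature plus a bias
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# Points with a positive linear score fall on the class-1 side of the line
linear_scores = w1 * X_demo[:, 0] + w2 * X_demo[:, 1] + b
manual_pred = (linear_scores > 0).astype(int)

print(np.array_equal(manual_pred, clf.predict(X_demo)))
```

The sigmoid only turns the score into a probability. The boundary where that probability equals 0.5 is the line where the score is zero.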
We have predicted the test results, and now we evaluate whether our logistic regression model learned correctly. The confusion matrix contains the correct predictions that our model made on the test set as well as the incorrect ones.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Run as written, the original author's code displays nothing, so we see no result. We need to add the following line:
print(cm)
What gets printed looks like this:
[[65 3]
[ 8 24]]
This matrix deserves a closer explanation:
actual \ predicted | 0 | 1 |
0 | 65 | 3 |
1 | 8 | 24 |
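0 means the user does not want to buy and 1 means the user wants to buy. Position (0,0) holds the non-buyers predicted as non-buyers (the 65), (0,1) the non-buyers predicted as buyers (the 3), (1,0) the buyers predicted as non-buyers (the 8), and (1,1) the buyers predicted as buyers (the 24). In other words, the main diagonal holds the correct predictions.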
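The original author stops at the confusion matrix, but a few summary numbers can be read straight off it. This sketch is my own addition: it hard-codes the matrix printed above and computes accuracy, precision, and recall with plain NumPy.

```python
import numpy as np

# The confusion matrix printed above: rows are actual labels, columns are predicted labels
cm = np.array([[65, 3],
               [8, 24]])

tn, fp = cm[0]  # actual 0: correctly rejected, wrongly predicted as buyers
fn, tp = cm[1]  # actual 1: missed buyers, correctly predicted buyers

accuracy = (tn + tp) / cm.sum()  # share of all predictions on the main diagonal
precision = tp / (tp + fp)       # of those predicted as buyers, how many really are
recall = tp / (tp + fn)          # of the real buyers, how many the model found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.89 precision=0.89 recall=0.75
```

So the model gets 89 of the 100 test users right, and most of its mistakes are the 8 buyers it failed to spot.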