

This is the second article in a series in which we explore the “under the hood” workings of various ML algorithms, using their base math equations.


With so many optimized implementations out there, we sometimes focus too much on the library and the abstraction it provides, and too little on the underlying calculations that go into the model. Understanding these calculations can often be the difference between a good and a great model.

In this series, I focus on implementing the algorithms by hand to understand the math behind them, which will hopefully help us train and deploy better models.


Note — This series assumes you know the basics of Machine Learning and why we need it. If not, do give this article a read to get up to speed on why and how we utilize ML.

Logistic Regression builds on the concepts of Linear Regression, where the model produces a linear equation relating the input features (X) to the target variable (Y).


The two major differentiating features of the Logistic Regression algorithm are —


  1. The target variable is discrete (0 or 1) rather than continuous, as it is in Linear Regression. This adds a step after calculating the output of the linear equation, to turn it into a discrete value.

  2. The equation built by the model focuses on separating the discrete values of the target: it tries to identify a line such that all 1’s fall on one side of the line and all 0’s on the other.


Consider the following data, with two input features — X1, X2, and one binary (0/1) target feature — Y



Logistic Regression will try to find the optimum values for the parameters w1, w2, and b, such that —


Here, the function H, also known as the activation function, converts the continuous output of the linear equation into a discrete value. This ensures that the model can output a 1 or a 0, matching the target values in the data.
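In symbols, and assuming the notation used in the steps that follow (z for the continuous output of the linear equation, ŷ for the output of the activation function, x1 and x2 for the values of the features X1 and X2 in one row), the relationship the model is trying to fit can be sketched as:

```latex
z = w_1 x_1 + w_2 x_2 + b, \qquad \hat{y} = H(z)
```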

The algorithm finds these optimum values using the following steps —


  1. Assign random values to w1, w2, and b.


2. Pick one instance of the data and calculate the continuous output (z).


3. Calculate the discrete output (ŷ) using the activation function H().

4. Calculate loss — Did our assumptions lead us close to a 1, when the actual target was 1?


5. Calculate the gradient for w1, w2 and b — How should we change the parameters to move closer to the actual output?


6. Update w1, w2 and b.


7. Repeat steps 2–6 until convergence.


Let’s look at each of these steps in detail.


1. Assign random values to w1, w2, and b

We start by assigning random values to the model parameters:


2. Pick one instance of the data and calculate the continuous output (z)

Let’s start with the first row of our data —



Putting in our assumed parameter values, we get —
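As a rough illustration of Steps 1 and 2, here is a minimal Python sketch. The parameter and feature values are made up for this example (they are not the values from the article's data), and `linear_output` is a hypothetical helper name.

```python
def linear_output(x1, x2, w1, w2, b):
    """Step 2: continuous output z of the linear equation."""
    return w1 * x1 + w2 * x2 + b

# Hypothetical values for illustration only; not the article's data.
w1, w2, b = 0.5, -0.3, 0.1   # assumed "random" starting parameters (Step 1)
x1, x2 = 2.0, 1.0            # assumed feature values for one row

z = linear_output(x1, x2, w1, w2, b)
print(round(z, 4))
```

With these assumed numbers, z works out to 0.5·2.0 − 0.3·1.0 + 0.1 = 0.8.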


3. Calculate the discrete output (ŷ) using the activation function H()

If you’ve been following the “Under the Hood” series, you would have noticed that Steps 1 and 2 are exactly the same as in Linear Regression. This point will come up often, as Linear Regression forms the foundation for most algorithms out there — from simple ML algorithms to Neural Networks.

It’s all powered by Y = MX + c

With our foundation laid out, we now focus on what makes Logistic Regression different from Linear Regression, and this is where the “activation function” comes in.


The activation function forms the bridge between the linear world with continuous target values and the “logistic” world (like the one we are working in right now!) with discrete target values.


A simple form of an activation function would be a thresholding function (also known as the “Step function”) —
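A step function of this kind could be sketched in Python as follows. The default threshold of 0.0 and the choice to return 1 when z equals the threshold are assumptions made for this example.

```python
def step(z, threshold=0.0):
    """Return 1 if z reaches the threshold, otherwise 0."""
    # The threshold default and the >= convention are assumptions.
    return 1 if z >= threshold else 0

print(step(0.8))                  # z above the default threshold
print(step(-0.2))                 # z below the default threshold
print(step(0.8, threshold=1.0))   # same z, different threshold
```

Note how the same z gets a different label once the threshold changes, which is the manual tuning problem described next.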


But, as with all things ML, “it can’t be that simple”.

One major flaw of a simple thresholding function like this is that we have to manually select the right threshold (based on the range of the output) every time we build a classification model. Our values of z could span any arbitrary range, depending on the input variables and weights.

Another fact working against a simple function like this is that it is not differentiable at z = threshold, and its gradient is zero everywhere else, so it gives us nothing to learn from. We need to follow the pipeline of loss -> gradient -> weight update, so we make things a bit more complex here to make our life simpler when we calculate the derivative (gradient) of our loss.

This is where a slightly modified version of the threshold, the Sigmoid function, comes in:


Just like thresholding, the sigmoid function converts any real number to a value between 0 and 1, with its midpoint (0.5) at z = 0, so we no longer need to pick a threshold for each model (thus solving our first problem). As evident from the graph, the function is smooth at all points, which will be of benefit when calculating gradients (thus solving our second problem).
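For reference, the sigmoid is σ(z) = 1 / (1 + e^(−z)), and its derivative is σ(z)·(1 − σ(z)). A minimal Python sketch follows. The two-branch form is an implementation choice for this example, used to avoid overflow for large negative z.

```python
import math

def sigmoid(z):
    """Map any real z to a value between 0 and 1, with sigmoid(0) == 0.5."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that never exponentiates
    # a large positive number.
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_derivative(z):
    """Slope of the sigmoid at z; defined everywhere, largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-5.0, 0.0, 0.8, 5.0):
    print(z, round(sigmoid(z), 4), round(sigmoid_derivative(z), 4))
```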

Another advantage of the sigmoid function is that it tells us how close our estimate is to 0 or 1. This gives us a better understanding of the model loss (error): a prediction of 0.9 for a row that has an actual 1 is better than a prediction of 0.7. Such nuances are lost when using a step function.

Let’s use the sigmoid function to calculate our estimated output —


4. Calculate loss

Building on Linear Regression, we could choose squared error to represent our loss, but that error would always be small (at most 1), as both our predictions and our targets lie in the 0–1 range. We need a function that outputs a large loss when our assumptions produce a value close to 0 while the actual is 1, and vice versa.

One such function is the log-loss, which uses a log transformation of our predicted output (ŷ) to calculate the loss.


We define the loss function as —


The output of the log function approaches negative infinity as the input approaches 0, and it is 0 when the input is 1. The negative sign inverts the log values to ensure our loss lies between 0 and infinity.
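To see this behaviour in numbers, here is a small Python sketch of the per-instance log-loss, compared with squared error for a row whose actual target is 1. The `eps` clipping is an assumed safeguard against log(0) and is not part of the article's definition.

```python
import math

def log_loss(y, y_hat, eps=1e-12):
    """Log-loss for one instance; y is 0 or 1, y_hat lies between 0 and 1."""
    # Clip y_hat away from exactly 0 or 1 so math.log never receives 0.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

# Actual target is 1; compare how each loss treats worsening predictions.
for y_hat in (0.9, 0.7, 0.1):
    squared_error = (1 - y_hat) ** 2
    print(y_hat, round(squared_error, 3), round(log_loss(1, y_hat), 3))
```

For the badly wrong prediction of 0.1, the squared error is 0.81 while the log-loss is about 2.30, and the log-loss keeps growing without bound as the prediction approaches 0.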


We can write the function as a single equation —
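Assuming y is the actual target (0 or 1) and ŷ is the sigmoid output, the standard single-equation form of log-loss is:

```latex
L(y, \hat{y}) = -\left[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\right]
```

When y = 1 only the first term survives, giving −log(ŷ). When y = 0 only the second term survives, giving −log(1 − ŷ).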


We can now calculate the error our assumptions lead us to —


5. Calculate the gradient for w1, w2 and b

We now calculate the impact each of our parameters has on the predicted output (and loss) by calculating the gradient of the loss with respect to each parameter.


This is where having differentiable functions in our pipeline makes our life easy, giving us derivatives** for each of our parameters as:
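For a sigmoid activation combined with log-loss, the derivatives take a well-known compact form: ∂L/∂w1 = (ŷ − y)·x1, ∂L/∂w2 = (ŷ − y)·x2, and ∂L/∂b = ŷ − y. The sketch below assumes that standard result; the input values are hypothetical and `gradients` is an illustrative helper name.

```python
import math

def sigmoid(z):
    # Plain form; adequate for the moderate z values used here.
    return 1.0 / (1.0 + math.exp(-z))

def gradients(x1, x2, y, w1, w2, b):
    """Gradients of the log-loss on one instance w.r.t. w1, w2 and b."""
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    error = y_hat - y            # shared factor: dL/dz
    return error * x1, error * x2, error

# Hypothetical row and parameters, for illustration only.
dw1, dw2, db = gradients(x1=2.0, x2=1.0, y=1, w1=0.5, w2=-0.3, b=0.1)
print(dw1, dw2, db)
```

Here the prediction is below the actual target of 1, so all three gradients are negative: moving against them increases the parameters and pushes ŷ toward 1.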



6. Update w1, w2 and b

The gradients tell us how much we should change each of our parameter assumptions to reduce the loss and move our predicted output closer to the actual output.


As we are “training” the model on one instance at a time, we want to limit the impact that the loss on this individual instance has on our parameters. Thus, before updating the parameters, we scale the gradients to a tenth of their value using a “learning rate” (η) of 0.1:
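The update rule is simply parameter = parameter − η · gradient. A minimal sketch, with hypothetical parameter and gradient values and an illustrative helper named `update`:

```python
def update(params, grads, learning_rate=0.1):
    """Move each parameter a small step against its gradient."""
    return tuple(p - learning_rate * g for p, g in zip(params, grads))

# Hypothetical parameters (w1, w2, b) and their gradients.
params = (0.5, -0.3, 0.1)
grads = (-0.62, -0.31, -0.31)

w1, w2, b = update(params, grads)
print(w1, w2, b)
```

Because the gradients in this example are negative, each parameter increases by a tenth of the gradient's magnitude.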

This completes one iteration of training.


To get the optimum parameter values, we repeat the above steps either for a fixed number of times, or until our loss stops reducing i.e., convergence.
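Putting the steps together, the whole procedure can be sketched as a short training loop. This is only an illustration under stated assumptions: the four rows are made-up data (not the article's table), the loop runs for a fixed number of iterations rather than checking for convergence, and the seed, the initial range of the parameters, and the name `train` are choices made for this example.

```python
import math
import random

def sigmoid(z):
    # Plain form; adequate for the moderate z values in this example.
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def train(rows, n_iterations=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting parameters (the range is an assumption).
    w1, w2, b = (rng.uniform(-1.0, 1.0) for _ in range(3))
    for _ in range(n_iterations):
        x1, x2, y = rng.choice(rows)              # Step 2: pick one instance
        y_hat = sigmoid(w1 * x1 + w2 * x2 + b)    # Steps 2-3: z, then y_hat
        error = y_hat - y                         # Step 5: shared gradient factor
        w1 -= learning_rate * error * x1          # Step 6: update parameters
        w2 -= learning_rate * error * x2
        b -= learning_rate * error
    return w1, w2, b

# Hypothetical training rows: (x1, x2, y).
rows = [(0.5, 1.0, 0), (1.0, 0.5, 0), (3.0, 2.5, 1), (2.5, 3.0, 1)]
w1, w2, b = train(rows)

# Step 4, evaluated after training: mean log-loss over the rows.
mean_loss = sum(
    log_loss(y, sigmoid(w1 * x1 + w2 * x2 + b)) for x1, x2, y in rows
) / len(rows)
print(w1, w2, b, mean_loss)
```

To stop on convergence instead, one option is to track the mean loss every few iterations and exit the loop once it stops decreasing.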

Let’s work through another iteration, using the updated parameters —



Looks like our loss has increased on this instance. This shows how instances of the two classes pull the weights in opposite directions to generate a decision boundary between the classes.

Repeating the steps for another few iterations*, sampling data randomly from the first 4 rows, we get the optimum weights as —


Using these values on our last row of data (which we did not use while training), we get —


That is really close to our actual class ‘1’.

Choosing a cutoff at 0.5 (the midpoint of our sigmoid curve) gives us a prediction of 1 for this instance.
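Prediction on a new row is then just the forward pass followed by the cutoff. In the sketch below, the weights and feature values are hypothetical stand-ins (not the trained values from this article), and `predict` is an illustrative helper name.

```python
import math

def sigmoid(z):
    # Plain form; adequate for the moderate z values used here.
    return 1.0 / (1.0 + math.exp(-z))

def predict(x1, x2, w1, w2, b, cutoff=0.5):
    """Return the sigmoid output and the class label it implies."""
    probability = sigmoid(w1 * x1 + w2 * x2 + b)
    return probability, int(probability >= cutoff)

# Hypothetical weights and unseen rows, for illustration only.
w1, w2, b = 1.0, 1.0, -3.5
print(predict(3.0, 3.0, w1, w2, b))   # row far on the "1" side of the line
print(predict(0.5, 0.5, w1, w2, b))   # row far on the "0" side of the line
```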


And that’s it. At its core, this is all that the Logistic Regression algorithm does.

“Under the Hood” being the focus of this series, we took a look at the foundation of Logistic Regression, taking one sample at a time and updating our parameters to fit the data.


While it’s true that this is what Logistic Regression does at its core, there is a lot more that goes into a good Logistic Regression model, like —


1. Regularization — L1 and L2


2. Learning Rate scheduling


3. Why choose sigmoid activation?


4. Scaling and Normalizing variables


5. Multi-class classification


I will cover these concepts in a parallel series focused on the various intricacies of the different ML algorithms we cover in this series.


In the next article in this series, we will continue with the classification task, and look under the hood of another class of algorithms — Decision Trees, which work a lot like how you and I would use reasoning to arrive at a conclusion regarding the data.

*Over 5 iterations, this is how our loss, gradients, and the parameter values changed —



** I have not covered how we get the derivatives of our loss function w.r.t our parameters in this article, as the derivation is extensive and warrants an article in its own right (It will be covered in another article :))

