Simple Linear Regression Using Least Squares from Scratch


'We all walk before we run': the basics come before the big things. Simple linear regression is one of the most basic techniques in machine learning. In this post, we implement linear regression from scratch using the statistical technique of least squares.

Introduction to Simple Linear Regression

Simple linear regression models the relationship between a dependent variable (Y) and a single independent variable (X). In machine learning terms it is written as y = wx + b, where w is the weight of the feature x and b is the bias; in mathematics the same equation is usually written as y = mx + c, with slope m and intercept c.

Now we will implement this from scratch using the method of least squares, in Python. The following Python libraries are used:

  • numpy
  • pandas
  • matplotlib

In this example, we predict the percentage of income that an employee saves, based on their experience (in months) at a company. The complete code and the data set are available in my GitHub repo.

The data should be preprocessed before fitting the model. The preprocessing is also implemented from scratch and explained below.

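# Reading the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read_excel needs an Excel engine such as openpyxl installed for .xlsx files
dataset = pd.read_excel('Experience vs savings %.xlsx')
dataset.plot(x='Experience(in months)', y='% of savings from income',
             kind='scatter', title='Experience VS % of Savings')
plt.show()  # display the scatter plot when run as a script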

We first import the required packages, then read the data set, an Excel spreadsheet, with the pandas read_excel() function. The result is stored as a DataFrame in the variable dataset. A scatter plot then gives a quick overview of the data.

# Checking for null values
df = dataset.copy()
print("Test for null values in the dataset : {}".format(df.isnull().values.any()))

# Column 1 holds the % of savings (target), column 0 the experience in months (feature)
dependent_var = df.iloc[:, 1].values
independent_var = df.iloc[:, 0].values
# Output:
# Test for null values in the dataset : False

Here, a copy of the original dataset DataFrame is created and checked for null values. In our case the data set has none. Finally, the data is split into the dependent variable and the independent variable.

The data is scaled using min-max scaling, which maps the highest value to 1, the lowest value to 0, and everything in between proportionally into the range [0, 1]. Scaling puts both variables on a comparable numeric range, which can make the calculations faster.
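As a small worked example of the scaling arithmetic (the sample values are made up for illustration and are not from the data set), min-max scaling and its inverse can be sketched as:

```python
import numpy as np

# Hypothetical sample values, not taken from the article's data set
values = np.array([10.0, 20.0, 40.0])

lo, hi = values.min(), values.max()

# Min-max scaling: subtract the minimum, divide by the range
scaled = (values - lo) / (hi - lo)

# Inverse transform: multiply by the range, add the minimum back
restored = scaled * (hi - lo) + lo

print(scaled)    # 10 maps to 0, 40 maps to 1, 20 lands one third of the way
print(restored)  # the original values again
```

The smallest value always maps to 0 and the largest to 1, so the inverse transform only needs the stored minimum and maximum.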

# Scaling the data using the min-max scaling technique
class Scaler:
    def __init__(self):
        self.min = None
        self.max = None

    def scale(self, data):
        # Learn min and max on the first call; later calls reuse them
        if self.min is None and self.max is None:
            self.min = data.min()
            self.max = data.max()
        return (data - self.min) / (self.max - self.min)

    def reverse_scaling(self, data):
        # Map scaled values back to the original range
        return (data * (self.max - self.min)) + self.min

xscaler = Scaler()
yscaler = Scaler()
x = xscaler.scale(independent_var)
y = yscaler.scale(dependent_var)

This snippet defines a Scaler class with an initialiser and two methods, scale() and reverse_scaling(). scale() maps the data to the range [0, 1] using min-max scaling, while reverse_scaling() converts scaled values back to their original range. independent_var and dependent_var are scaled and stored as x and y.

# Splitting the dataset into train and test set
def splitter(x, y, train_size=0.75, seed=None):
    np.random.seed(seed)
    # Pair each x with its y so that shuffling keeps them aligned
    data = np.concatenate([x.reshape(-1, 1), y.reshape(-1, 1)], axis=1)
    np.random.shuffle(data)
    split = int(len(data) * train_size)  # index where the test set starts
    xtrain = data[:split, 0]
    ytrain = data[:split, 1]
    xtest = data[split:, 0]
    ytest = data[split:, 1]
    return xtrain, ytrain, xtest, ytest

xtrain, ytrain, xtest, ytest = splitter(x, y, train_size=0.85, seed=101)

The splitter() function randomly splits the whole data set into training and test sets of the desired size. Here, 85% of the rows are used for training, and the fixed seed makes the split reproducible.
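The same idea can also be sketched by shuffling row indices instead of the stacked data. This is an illustrative alternative, not the article's implementation: split_by_permutation and the demo arrays are hypothetical names, and the data is made up so that the alignment between x and y is easy to check.

```python
import numpy as np

def split_by_permutation(x, y, train_size=0.75, seed=None):
    """Hypothetical alternative to splitter(): shuffle indices, not the data."""
    rng = np.random.default_rng(seed)   # local generator, no global state
    order = rng.permutation(len(x))     # shuffled row indices
    cut = int(len(x) * train_size)      # first index of the test part
    train_idx, test_idx = order[:cut], order[cut:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

# Made-up data where y = 2 * x, so each pair can be verified after the split
x_demo = np.arange(20, dtype=float)
y_demo = 2 * x_demo
xtr, ytr, xte, yte = split_by_permutation(x_demo, y_demo, train_size=0.75, seed=101)

print(len(xtr), len(xte))            # 15 5
print(np.allclose(ytr, 2 * xtr))     # pairs stay aligned after shuffling
```

Using a local generator keeps the split reproducible without changing NumPy's global random state, which may matter when other code also draws random numbers.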

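# Method of least squares
def least_squares(x, y):
    xmean = x.mean()
    ymean = y.mean()
    num = ((x - xmean) * (y - ymean)).sum(axis=0)
    den = ((x - xmean) ** 2).sum(axis=0)
    weight = num / den
    bias = ymean - (weight * xmean)
    return weight, bias

# predict() is called below but not defined in the original listing;
# this is the assumed definition, y = weight * x + bias
def predict(x, weight, bias):
    return (weight * x) + bias

weight, bias = least_squares(xtrain, ytrain)
print("weight :{} , bias : {}".format(weight, bias))
ypred = yscaler.reverse_scaling(predict(xtest, weight, bias))
ytrue = yscaler.reverse_scaling(ytest)
# Output:
# weight :0.8931 , bias : 0.0818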
Figure: Method of Least Squares
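Written out to match the least_squares() code, with m the weight (slope), c the bias (intercept), and \bar{x}, \bar{y} the means of the training x and y, the closed-form least squares estimates are as follows. This is reconstructed from the code rather than copied from the original figure.

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
```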

The snippet above implements the method of least squares, where m is the weight (also called the slope) and c is the bias (also called the intercept): the weight is the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and the bias is the mean of y minus the weight times the mean of x. The training x and y data are used to find the weight and bias of the regression equation. The model is then used to make predictions on the test set in order to evaluate its performance.
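One way to sanity-check a from-scratch fit is to compare it with NumPy's own degree-1 polynomial fit. The sketch below assumes synthetic data (the slope 0.9, intercept 0.1, and noise level are made up for illustration) and repeats the closed-form estimates so that it runs on its own.

```python
import numpy as np

def least_squares(x, y):
    # Same closed-form estimates as in the article's listing
    xmean, ymean = x.mean(), y.mean()
    weight = ((x - xmean) * (y - ymean)).sum() / ((x - xmean) ** 2).sum()
    bias = ymean - weight * xmean
    return weight, bias

# Synthetic data (an assumption for illustration): y = 0.9 * x + 0.1 plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=50)
y_demo = 0.9 * x_demo + 0.1 + rng.normal(0.0, 0.05, size=50)

weight, bias = least_squares(x_demo, y_demo)

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, 1)

print(np.isclose(weight, slope), np.isclose(bias, intercept))
```

Both routes minimise the same sum of squared residuals, so the two pairs of estimates should agree up to floating-point error.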

Root Mean Square Error and R-squared Value (Model Performance Metrics)

The Root Mean Square Error (RMSE) measures how far the actual values typically lie from the predicted values, that is, from the fitted regression line. The square of the RMSE is the Mean Squared Error (MSE). When the errors average to zero, the RMSE can be interpreted as the standard deviation of the errors and the MSE as their variance. The R-squared value is the proportion of the total variance in the data that is explained by the model.
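As a small worked example of these definitions (the numbers are made up so the arithmetic can be checked by hand, and are not from the data set):

```python
import numpy as np

# Made-up values chosen so the arithmetic can be checked by hand
true = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.5, 2.0, 2.5, 4.0])

errors = pred - true                        # [0.5, 0.0, -0.5, 0.0]
mse = np.mean(errors ** 2)                  # (0.25 + 0 + 0.25 + 0) / 4 = 0.125
rmse = mse ** 0.5                           # sqrt(0.125), about 0.3536
mae = np.mean(np.abs(errors))               # (0.5 + 0 + 0.5 + 0) / 4 = 0.25

ss_res = np.sum((true - pred) ** 2)         # residual sum of squares = 0.5
ss_tot = np.sum((true - true.mean()) ** 2)  # total sum of squares = 5.0
r2 = 1 - ss_res / ss_tot                    # 1 - 0.5 / 5.0 = 0.9

print(mse, rmse, mae, r2)
```

Here the model leaves 0.5 of the 5.0 units of total squared variation unexplained, which is where the R-squared of 0.9 comes from.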

# Model performance metrics
def mse(true, pred):
    return np.mean((pred - true) ** 2)

def rmse(true, pred):
    return mse(true, pred) ** 0.5

def mae(true, pred):
    return np.mean(np.abs(pred - true))

def r_squared(true, pred):
    true_mean = true.mean()
    tot = ((true - true_mean) ** 2).sum(axis=0)  # total sum of squares
    obs = ((true - pred) ** 2).sum(axis=0)       # residual sum of squares
    return 1 - (obs / tot)

print("MSE : ", mse(ytrue, ypred))
print("RMSE : ", rmse(ytrue, ypred))
print("MAE : ", mae(ytrue, ypred))
print("R-squared Value", r_squared(ytrue, ypred))
# Output:
# MSE : 0.02091
# RMSE : 0.1446
# MAE : 0.1197
# R-squared Value 0.9174

The fitted model performs quite well. It has an RMSE of 0.1446, which means the actual values typically lie about 0.1446 units away from the predicted values. The R-squared value of 91.74% indicates that the model explains 91.74% of the total variability in the test data. The following graph shows the fitted least squares regression line on the data set.

Figure: Fitted Least Squares Regression Model

Translated from: https://medium.com/analytics-vidhya/simple-linear-regression-using-least-squares-from-scratch-3f56205d63a5
