Data preprocessing with Python
Welcome, guys, to the second article in the series, Preprocessing Data for ML. In this article we are going to discover whether or not centering and scaling can help our model in a logistic regression task, so with that said, let's jump right into it.
In the first article, we explored the fundamental role of preprocessing in the context of machine learning, especially in a classification task, using the k-Nearest Neighbors algorithm (KNN for short). There you saw that centering and scaling numerical data improved the performance of k-NN on a number of model performance measures. But will this always be the case? Unfortunately not. Here we will explore the role of scaling and centering numerical data in another basic machine learning model, logistic regression.
First of all, I will explain regression, which can be used to predict the value of a numerical variable as well as classes.
A brief introduction to regression with Python: Linear regression in Python
In mathematics, regression covers several methods of statistical analysis that allow one variable to be estimated from others that are correlated with it. As mentioned above, regression is commonly used to predict the value of one numerical variable from that of another.
Let's have a toy dataset for it. You will use a house price prediction dataset to investigate this, but this time with two features. The task remains the same, i.e., predicting the house price.
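Below is a minimal sketch of such a toy dataset, assuming two made-up features (house size in square feet and number of rooms) and a noisy price; the names and numbers are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two predictor variables and the target (price).
rng = np.random.default_rng(42)
size_sqft = rng.uniform(500, 3500, size=50)   # feature 1: house size in square feet
n_rooms = rng.integers(1, 6, size=50)         # feature 2: number of rooms
price = 50 * size_sqft + 10_000 * n_rooms + rng.normal(0, 20_000, size=50)

houses = pd.DataFrame({"size_sqft": size_sqft, "n_rooms": n_rooms, "price": price})
print(houses.head())
```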
How does such a regression work? In brief, the mechanics are as follows: we wish to fit a model y = ax + b to the data (xᵢ, yᵢ), that is, we want to find the optimal a and b, given the data. In the ordinary least squares (OLS, by far the most common) formulation, there is an assumption that the error occurs in the dependent variable. For this reason, the optimal a and b are found by minimizing the sum of squared residuals:

∑ᵢ (yᵢ − (a·xᵢ + b))²
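As a sketch, the same minimization can be run with scikit-learn's LinearRegression (which performs ordinary least squares under the hood), here on the toy houses data from above:

```python
from sklearn.linear_model import LinearRegression

X = houses[["size_sqft", "n_rooms"]]
y = houses["price"]

ols = LinearRegression()   # ordinary least squares: minimizes the sum of squared residuals
ols.fit(X, y)
print("coefficients a_i:", ols.coef_)
print("intercept b:     ", ols.intercept_)
```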
This regression captures the general increasing trend of the data but not much more, in order to avoid overfitting (where the model starts to capture the particularities, or noise, of the data).
Logistic Regression in Python
Regression can also be used for classification problems. The first natural example of this is logistic regression. In binary classification (two labels), we can think of the labels as 0 and 1. Once again denoting the predictor variable as x, the logistic regression model is given by the logistic function:

F(x) = 1 / (1 + e^(−(ax + b)))
This is a sigmoidal (S-shaped) curve, and you can see an example below. For any given x, if F(x) < 0.5, then the logistic model predicts y = 0 and, alternatively, if F(x) > 0.5, the model predicts y = 1.
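A quick numeric sketch of this decision rule, with made-up coefficients a = 1 and b = 0:

```python
import numpy as np

def logistic(x, a=1.0, b=0.0):
    """Logistic function F(x) = 1 / (1 + exp(-(a*x + b)))."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

for x in [-3.0, -1.0, 1.0, 3.0]:
    p = logistic(x)
    label = 1 if p > 0.5 else 0   # the 0.5 threshold from the text
    print(f"x = {x:+.1f}  F(x) = {p:.3f}  predicted y = {label}")
```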
Logistic Regression and Data Scaling
Now that we've seen the mechanics of logistic regression, let's implement a logistic regression classifier on our delicious wine dataset.
Let’s now run our logistic regression and see how it performs!
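A minimal sketch of such a classifier, assuming scikit-learn's bundled wine dataset (load_wine) as a stand-in for the article's wine data:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled wine dataset (13 numeric features).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

logreg = LogisticRegression(max_iter=10_000)  # extra iterations: unscaled features converge slowly
logreg.fit(X_train, y_train)
print("accuracy (unscaled):", logreg.score(X_test, y_test))
```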
Out of the box, this logistic regression performs better than K-NN (with or without scaling). Let's now scale our data and perform logistic regression:
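Continuing the sketch above, centering and scaling can be done with StandardScaler inside a pipeline, so the scaler is fit on the training split only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Center and scale, then fit logistic regression, in one pipeline.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))
scaled_logreg.fit(X_train, y_train)           # X_train, y_train from the snippet above
print("accuracy (scaled):", scaled_logreg.score(X_test, y_test))
```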
This is very interesting! The performance of logistic regression did not improve with data scaling. Why not, particularly when we saw that k-Nearest Neighbors performance improved substantially with scaling? The reason is that, if there are predictor variables with large ranges that do not affect the target variable, a regression algorithm will make the corresponding coefficients aᵢ small so that they do not affect predictions very much. K-nearest neighbors does not have such a built-in strategy, and so we very much needed to scale the data.
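One rough way to see this self-correction with the sketch above: compare each feature's raw range to the magnitude of its fitted (unscaled) coefficient. Exact values will vary with the data and the regularization strength, but features with very large ranges tend to get small coefficients:

```python
import numpy as np

# Compare each feature's raw range to the size of its fitted coefficient.
feature_ranges = X_train.max(axis=0) - X_train.min(axis=0)
coef_magnitudes = np.abs(logreg.coef_).mean(axis=0)   # mean over the classes

for rng_, coef in sorted(zip(feature_ranges, coef_magnitudes), reverse=True):
    print(f"feature range {rng_:10.2f}  ->  mean |coefficient| {coef:.4f}")
```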
In the next article, I’ll unpack the vastly different results of centering and scaling in k-NN and logistic regression by synthesizing a dataset, adding noise and seeing how centering and scaling alter the performance of both models as a function of noise strength.
Glossary
Supervised learning: The task of inferring a target variable from predictor variables. For example, inferring the target variable ‘presence of heart disease’ from predictor variables such as ‘age’, ‘sex’, and ‘smoker status’.
Classification task: A supervised learning task is a classification task if the target variable consists of categories (e.g. ‘click’ or ‘not’, ‘malignant’ or ‘benign’ tumour).
Regression task: A supervised learning task is a regression task if the target variable is a continuously varying variable (e.g. price of a house) or an ordered categorical variable such as ‘quality rating of wine’.
k-Nearest Neighbors: An algorithm for classification tasks, in which a data point is assigned the label decided by a majority vote of its k nearest neighbors.
Preprocessing: Any number of operations data scientists will use to get their data into a form more appropriate for what they want to do with it. For example, before performing sentiment analysis of Twitter data, you may want to strip out any HTML tags and white spaces, expand abbreviations, and split the tweets into lists of the words they contain.
Centering and Scaling: These are both forms of preprocessing numerical data, that is, data consisting of numbers (as opposed to, for example, categories or strings). Centering a variable means subtracting the mean of the variable from each data point so that the new variable's mean is 0; scaling a variable means multiplying each data point by a constant in order to alter the range of the data. See the body of the article for the importance of these, along with examples.
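A minimal numeric sketch of both operations (scaling here multiplies by the constant 1/σ, i.e., divides by the standard deviation, as StandardScaler does):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

centered = x - x.mean()            # centering: subtract the mean, so the new mean is 0
scaled = centered * (1 / x.std())  # scaling: multiply each point by a constant (here 1/std)

print(centered.mean())   # 0.0
print(scaled.std())      # 1.0
```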
Translated from: https://medium.com/analytics-vidhya/preprocessing-data-for-machine-learning-in-python-part-2-f818b255cffe