Predicting Weather Temperature Change Using Machine Learning Models
A Practical Machine Learning Workflow Example
Problem Introduction
The problem we will tackle is predicting the average global land and ocean temperature using over 100 years of past weather data. We are going to act as if we don't have access to any weather forecasts. What we do have access to is a century's worth of historical global temperature averages, including global maximum temperatures, global minimum temperatures, and global land and ocean temperatures. Having all of this, we know that this is a supervised, regression machine learning problem.
It's supervised because we have both the features and the target that we want to predict. During training, we will give multiple regression models both the features and targets, and each model must learn how to map the data to a prediction. Moreover, this is a regression task because the target value is continuous (as opposed to the discrete classes in classification).
That’s pretty much all the background we need, so let’s start!
ML Workflow
Before we jump right into programming, we should outline exactly what we want to do. The following steps are the basis of my machine learning workflow now that we have our problem and model in mind:
- State the question and determine the required data (completed)
- Acquire the data
- Identify and correct missing data points/anomalies
- Prepare the data for the machine learning model by cleaning/wrangling
- Establish a baseline model
- Train the model on the training data
- Make predictions on the test data
- Compare predictions to the known test set targets and calculate performance metrics
- If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
- Interpret the model and report results visually and numerically
Data Acquisition
First, we need some data. To use a realistic example, I retrieved temperature data from the Berkeley Earth Climate Change: Earth Surface Temperature Dataset found on Kaggle.com. Because this dataset was created by one of the most prestigious research universities in the world, we will assume the data it contains are truthful.
Dataset link: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
After importing some important libraries and modules, the code below loads the CSV data into a variable we can use later:
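A minimal sketch of this step might look like the following (the file name GlobalTemperatures.csv is an assumption based on the Kaggle dataset):

```python
import pandas as pd

# Load the Berkeley Earth global temperature data into a DataFrame
globalTemp = pd.read_csv('GlobalTemperatures.csv')

# Peek at the first few rows
globalTemp.head()
```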
Following are explanations of each column:
dt: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures
LandAverageTemperature: global average land temperature in Celsius
LandAverageTemperatureUncertainty: the 95% confidence interval around the average
LandMaxTemperature: global average maximum land temperature in Celsius
LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature
LandMinTemperature: global average minimum land temperature in Celsius
LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature
LandAndOceanAverageTemperature: global average land and ocean temperature in Celsius
LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature
Identify Anomalies/Missing Data
Looking through the data (shown above) from Berkeley Earth, I noticed several missing data points, a great reminder that data collected in the real world will never be perfect. Missing data can impact analysis immensely, as can incorrect data or outliers.
To identify anomalies, we can quickly find missing values using the info() method on our DataFrame.
Also, we can use the .isnull() and .sum() methods directly on our DataFrame to find the total number of missing values in each column.
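A quick sketch of both checks:

```python
# Column dtypes and non-null counts in one view
globalTemp.info()

# Total number of missing values per column
globalTemp.isnull().sum()
```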
Data Preparation
Unfortunately, we aren't quite at the point where we can just feed the raw data into a model and have it return an answer (although we could, it would not be the most accurate)! We will need to do some minor modification to put our data into machine-understandable terms.
The exact steps for preparation of the data will depend on the model used and the data gathered, but some amount of data manipulation will be required.
First things first, I will create a function called wrangle(), which we will call on our DataFrame.
We want to make a copy of the dataframe so we do not corrupt the original. After that, we are going to drop columns that hold high cardinality.
High cardinality refers to columns whose values are very uncommon or unique. Given how common high-cardinality data is in most time-series datasets, we are going to address this problem directly by removing these high-cardinality columns from our dataset completely, so as not to confuse our model later.
Next, we are going to create a function within our pending wrangle() function, called convertTemp(). Essentially, this convertTemp() function is just for my own eyes (and maybe yours): being that I am from the United States, our official measurement for temperature is Fahrenheit, while the dataset I have used is measured in Celsius.
So, purely for convenience (it will not affect our model results or predictions in any way), I chose to apply that function to the remaining columns that hold Celsius temperatures:
Finally, the last step in our data-wrangling function is to convert the dt (Date) column to a DateTime object. After that, we will create separate columns for the month and year, eventually dropping the dt and Month columns.
Now, if you remember, we also had missing values, which we saw earlier in our dataset. From analyzing the dataset, and from what I described about the dt column, the LandAverageTemperature column starts in 1750, while the other four columns we chose to keep in our wrangle function start in 1850.
So I think we can solve much of the missing-value problem by simply slicing the dataset by year, creating a new dataset that starts from 1850 onward. We will also call dropna(), just in case there are any other missing values in our dataset:
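Putting all of those steps together, a sketch of the full wrangle() function might look like this (the assumption here is that the four uncertainty columns are the high-cardinality ones being dropped):

```python
import pandas as pd

def wrangle(df):
    """Clean a copy of the raw Berkeley Earth DataFrame."""
    df = df.copy()

    # Drop the high-cardinality uncertainty columns
    df = df.drop(columns=[
        'LandAverageTemperatureUncertainty',
        'LandMaxTemperatureUncertainty',
        'LandMinTemperatureUncertainty',
        'LandAndOceanAverageTemperatureUncertainty',
    ])

    # Convert the remaining Celsius columns to Fahrenheit (readability only)
    def convertTemp(temp):
        return temp * 9 / 5 + 32

    temp_cols = ['LandAverageTemperature', 'LandMaxTemperature',
                 'LandMinTemperature', 'LandAndOceanAverageTemperature']
    df[temp_cols] = df[temp_cols].apply(convertTemp)

    # Parse dt, split out Month/Year, then drop dt and Month
    df['dt'] = pd.to_datetime(df['dt'])
    df['Month'] = df['dt'].dt.month
    df['Year'] = df['dt'].dt.year
    df = df.drop(columns=['dt', 'Month'])

    # Keep rows from 1850 onward and drop any remaining missing values
    df = df[df['Year'] >= 1850].dropna()
    return df

globalTemp = wrangle(globalTemp)
```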
Let's see how it looks:
After calling our wrangle function on our globalTemp DataFrame, we can now see a new, cleaned-up version of our globalTemp DataFrame, free of any missing values.
It looks like we are ready for the next step: setting up our target and features, the train/test split, and establishing our baseline…
Quick Correlation Visualization
One thing I like to do when working with regression problems is to look at the cleaned dataframe and to see if we can truly use one column as our target and the others as our features.
One way I loosely determine this is by plotting a correlation matrix, just to get an understanding of how related each column is to the others:
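A sketch of one way to draw that matrix (the seaborn heatmap is my choice here, not necessarily the original one):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between all remaining columns
corr = globalTemp.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```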
As we can see, and as some of you probably guessed, the columns we chose to keep are HIGHLY correlated to one another. So we should expect pretty strong, positive predictions just from glancing at this plot.
Separating Our Target From Our Features
Now, we need to separate the data into features and a target. The target, also known as y, is the value we want to predict, in this case the actual land and ocean average temperature, and the features are all the columns (minus our target) the model uses to make a prediction:
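In code, that separation is a couple of lines:

```python
# y is the column we want to predict; X is everything else
target = 'LandAndOceanAverageTemperature'
y = globalTemp[target]
X = globalTemp.drop(columns=[target])
```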
Train-Test Split
Now we are on the final step of the data preparation part of our ML workflow: splitting data into training and testing sets.
During training, we let the model ‘see’ the answers, in this case, the actual temperature, so it can learn how to predict the temperature from the features. As we know, there is a relationship between all the features and the target value, and the model’s job is to learn this relationship during training. Then, when it comes time to evaluate the model, we ask it to make predictions on a testing set where it only has access to the features (not the target)!
Generally, when training a regression model, we randomly split the data into training and testing sets to get a representation of all data points.
For example, if we trained the model on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it has not seen any data from those last three months.
Make sense?
The following code splits the data sets:
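A sketch of that split using scikit-learn (the 80/20 split and random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# Randomly split the rows into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```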
We can look at the shape of all the data to make sure we did everything correctly. We expect the number of columns in the training features (X_train) to match the number of columns in the testing features (X_val), and the number of rows to match between the respective features and targets:
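For example:

```python
# Feature matrices should share a column count; rows should pair up
print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)
```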
It looks as if everything is in order! Just to recap, we:
- Got rid of missing values and unneeded columns
- Split the data into features and a target
- Split the data into training and testing sets
These steps may seem tedious at first, but once you get the basic ML workflow down, it will be generally the same for any machine learning problem. It's all about taking human-readable data and putting it into a form that can be understood by a machine learning model.
Establish Baseline Mean Absolute Error
Before we can make and evaluate predictions, we need to establish a baseline, a sensible measure that we hope to beat with our model. If our model cannot improve upon the baseline, then it will be a failure and we should try a different model or admit that machine learning is not right for our problem.
The baseline prediction for our case will be the yearly average temperature. In other words, our baseline is the error we would get if we simply predicted the average temperature of our training target (y_train).
To find the MAE very easily, we can import the mean_absolute_error function from the scikit-learn library, which will calculate it for us:
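A sketch of the baseline calculation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Baseline: always predict the mean of the training target
baseline_pred = np.full(len(y_train), y_train.mean())
baseline_mae = mean_absolute_error(y_train, baseline_pred)
print(f'Baseline MAE: {baseline_mae:.2f} degrees')
```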
We now have our goal! If we can’t beat an average error of 2 degrees, then we need to rethink our approach.
Train Model
After all the work of data preparation, creating and training the model is pretty simple using scikit-learn. For this problem, we could try a multitude of models, but in this situation we are going to use two different models: a Linear Regression model and a Random Forest Regressor model.
Linear Regression Model
Linear regression is a statistical approach that models the relationship between input features and output. Our goal here is to predict the value of the output based on the input features.
In the code below, I created what is called a pipeline, which allows stacking multiple processes into a single scikit-learn estimator. Here the only processes we are using are StandardScaler(), which subtracts the mean from each feature and then scales it to unit variance, and, obviously, the LinearRegression() process:
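A sketch of that pipeline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature, then fit an ordinary least-squares model
lr_model = make_pipeline(StandardScaler(), LinearRegression())
lr_model.fit(X_train, y_train)
```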
Random Forest Regressor Model
A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique commonly known as bagging.
The basic idea behind bagging is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. A Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset, forming sample datasets for every model:
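A sketch of the forest pipeline (the specific hyperparameter values and the f_regression scorer are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

# Score every feature, then fit a bagged ensemble of decision trees
rf_model = make_pipeline(
    SelectKBest(score_func=f_regression, k='all'),
    RandomForestRegressor(n_estimators=100, max_depth=20,
                          n_jobs=-1, random_state=42),
)
rf_model.fit(X_train, y_train)
```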
A little information on what's going on in the code snippet above:
n_estimators represents the number of trees in the random forest.
max_depth represents the depth of each tree in the forest. The deeper the tree, the more splits it has, and the more information it captures about the data.
n_jobs refers to the number of cores the regressor will use. -1 means it will use all cores available to run the regressor.
SelectKBest just scores the features using an internal function. In this case, I chose to score all the features.
After creating our pipelines and fitting our training data to our pipeline models, we now need to make some predictions.
Make Predictions on the Test Set
Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is! To do this we make predictions on the test features and compare the predictions to the known answers.
When evaluating regression predictions, we need to make sure to use the absolute error, because we expect some of our predictions to be low and some to be high. We are interested in how far, on average, our predictions are from the actual values, so we take the absolute value (as we also did when establishing the baseline earlier):
Let's look at our Linear Regression and Random Forest Regressor MAEs:
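A sketch of both evaluations:

```python
from sklearn.metrics import mean_absolute_error

# Compare each model's predictions on the held-out validation set
lr_mae = mean_absolute_error(y_val, lr_model.predict(X_val))
rf_mae = mean_absolute_error(y_val, rf_model.predict(X_val))
print(f'Linear Regression MAE: {lr_mae:.2f}')
print(f'Random Forest MAE: {rf_mae:.2f}')
```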
Our average temperature prediction is off by 0.28 degrees for our Linear Regression model and 0.24 degrees for our Random Forest model. That is almost a 2-degree average improvement over the baseline of 2.03 degrees.
Although this might not seem significant, it is roughly 88% better than the baseline, which, depending on the field and the problem, could represent millions of dollars to a company.
Determine Performance Metrics
To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error.
Linear Regression Train/Test Accuracy:
Random Forest Regressor Train/Test Accuracy:
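A sketch of that calculation for both models (the accuracy helper below is my own, written from the description above):

```python
import numpy as np

def accuracy(model, X, y):
    """Accuracy as 100% minus the mean absolute percentage error."""
    errors = np.abs(model.predict(X) - y)
    return 100 - 100 * np.mean(errors / y)

for name, model in [('Linear Regression', lr_model),
                    ('Random Forest', rf_model)]:
    print(name,
          f'train: {accuracy(model, X_train, y_train):.2f}%',
          f'test: {accuracy(model, X_val, y_val):.2f}%')
```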
By looking at the error metric values we got, we can say that our model performs well and is able to give accurate predictions on a new set of records (y_pred).
Both of our models have learned how to predict the average temperature for the next year with 99% accuracy.
Nice!!
Model Tuning
In the usual machine learning workflow, we would stop here after achieving 99% accuracy. But in most cases, as I stated before, the dataset would not be as clean; this is when you would start hyperparameter tuning the model.
Hyperparameter tuning is a complicated phrase that means “adjust the settings to improve performance”. The most common way to do this is to simply make a bunch of models with different settings, evaluate them all on the same validation set, and see which one does best.
An accuracy of 99% is obviously satisfactory for this problem, but it is known that the first model built will almost never be the model that makes it to production. So let us try to reach 100% accuracy, if that is possible.
RandomizedSearchCV
In the beginning, I decided I wanted to use GridSearchCV to hyper tune my model, but GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters.
A more efficient way to find a good set of hyperparameters for a machine learning model is to use random search. Scikit-learn implements this in a method named RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
Now, being that we only have five columns in total, there is really no need for us to use RandomizedSearchCV, but for blogging purposes we will see how to use it to tune a model.
Let’s see if we have any gains in our prediction accuracy score and MAE:
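A sketch of the search (the parameter distributions below are illustrative assumptions, not the original values):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space over two forest hyperparameters
params = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=params,
    n_iter=10,          # number of settings sampled
    cv=10,              # 10-fold cross-validation
    n_jobs=-1,          # use every available core
    random_state=42,
)
search.fit(X_train, y_train)
```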
A little information on the code snippet above:
n_iter: represents the number of iterations. Each iteration trains a new model on a new draw from your dictionary of hyperparameter distributions.
param_distributions: specifies the parameters and distributions to sample from.
cv: 10-fold cross-validation. The number of folds chosen determines how many times each model is trained on a different subset of the data.
n_jobs: refers to the number of cores the regressor will use. -1 means it will use all cores available to run the regressor.
best_estimator_: an attribute that is an instance of the specified model type, with the 'best' combination of the given parameters from the params variable.
We then use the best set of hyperparameter values chosen by RandomizedSearchCV in the actual model, which we name best_model, as shown:
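For example:

```python
# The refit estimator with the best sampled hyperparameters
best_model = search.best_estimator_
y_pred = best_model.predict(X_val)
```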
As suspected, after running our predict method on best_model, we can see that RandomizedSearchCV outputs the same prediction results and accuracy score as our earlier Random Forest Regressor model.
Although we did not need it here, we have seen how hyperparameter tuning could help improve model scores if needed.
Visualizations
Partial Dependence Plots
PDPbox is a partial dependence plot toolbox written in Python. The goal of pdpbox is to visualize the impact of certain features towards model prediction for any supervised learning algorithm.
The problem is that when using machine learning algorithms like random forest, it is hard to understand the relations between predictors and model outcomes. For example, with a random forest, all we get is the feature importances. Although we can know which features significantly influence the outcome based on the importance calculation, we really don't know in which direction they influence it.
This is where PDPbox comes into play:
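A sketch using the classic pdpbox API (version 0.2.x; the feature choice follows the plot described below):

```python
from pdpbox import pdp

feature = 'LandAverageTemperature'

# Isolate the effect of a single feature on the model's predictions
isolated = pdp.pdp_isolate(
    model=best_model,
    dataset=X_val,
    model_features=X_val.columns.tolist(),
    feature=feature,
)
pdp.pdp_plot(isolated, feature_name=feature)
```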
A little background on what's going on in the code above:
feature: the feature column we want to compare against our model, to see the effect it has on the model prediction (our target).
isolated: pdp_isolate is what we call to create our PDP pipeline; we are only comparing one feature, hence the name isolated.
All other arguments should be self-explanatory.
Now let us look at our plot:
From this plot, we can see that as the average LandAverageTemperature increases, the predicted LandAndOceanAverageTemperature tends to increase as well.
We also created another PDPbox plot in which we used two features (LandMinTemperature and LandMaxTemperature) to see how they affect the model's prediction of our target column (LandAndOceanAverageTemperature):
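A sketch of the two-feature interaction plot, under the same pdpbox 0.2.x assumption:

```python
features = ['LandMinTemperature', 'LandMaxTemperature']

# Joint effect of two features on the predicted target
interaction = pdp.pdp_interact(
    model=best_model,
    dataset=X_val,
    model_features=X_val.columns.tolist(),
    features=features,
)
pdp.pdp_interact_plot(interaction, feature_names=features)
```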
From this plot, we can see the same pattern: as the average LandMaxTemperature and LandMinTemperature increase, the predicted target, LandAndOceanAverageTemperature, tends to increase.
Conclusion
We have now completed an entire end-to-end machine learning example!
At this point, if we wanted to improve our model, we could try different hyperparameters (with RandomizedSearchCV, or something new like GridSearchCV), try a different algorithm, or, best of all, just gather more data! The performance of any model is directly related to how much data it can learn from, and we were using a very limited amount of information for training.
I hope everyone who made it through has seen how accessible machine learning has become and its many uses.
Until next time, my friends…
Translated from: https://medium.com/swlh/predicting-weather-temperature-change-using-machine-learning-models-4f98c8983d08