How to build a data science project from scratch

by Jekaterina Kokatjuhha

A demonstration using an analysis of Berlin rental prices

There are many online courses about data science and machine learning that will guide you through the theory and provide you with some code examples and an analysis of very clean data.

However, to start practising data science, it is better to tackle a real-life problem: digging into the data to find deeper insights, carrying out feature engineering using additional sources of data, and building stand-alone machine learning pipelines.

This blogpost will guide you through the main steps of building a data science project from scratch. It is based on a real-life problem: what are the main drivers of rental prices in Berlin? It will provide an analysis of this situation. It will also highlight the common mistakes beginners tend to make when it comes to machine learning.

These are the steps that will be discussed in detail:


  • finding a topic
  • extracting data from the web and cleaning it
  • gaining deeper insights
  • engineering of features using external APIs
  • common mistakes while carrying out machine learning
  • feature importance: finding the drivers of rental prices
  • building machine learning models.

Finding a topic

There are many problems that can be solved by analyzing data, but it is always better to find a problem that you are interested in and that will motivate you. While searching for a topic, you should definitely concentrate on your preferences and interests.


For instance, if you are interested in healthcare systems, there are many angles from which you could challenge the data provided on that topic. “Exploring the ChestXray14 dataset: problems” is an example of how to question the quality of medical data. Another example: if you are interested in music, you could try to predict the genre of a song from its audio.

However, I suggest that you not only concentrate on your interests but also listen to what people around you are talking about. What bothers them? What are they complaining about? This can be another good source of ideas for a data science project. If people are still complaining about something, this may mean that the problem wasn’t solved properly the first time around. Thus, if you challenge it with data, you could provide an even better solution and have an impact on how this topic is perceived.

This may all sound a bit too abstract, so let’s find out how I came up with the idea to analyze Berlin rental prices.

“If I had known that the rental prices were so high here, I would have negotiated for a higher salary.”

This is just one of the things I heard from people who had recently moved to Berlin for work. Most newcomers complained that they hadn’t imagined Berlin to be so expensive, and that there were no statistics about possible price ranges for apartments. If they had known this beforehand, they could have asked for a higher salary during the job application process or could have considered other options.

I googled, checked several rental apartment websites, and asked several people, but could not find any plausible statistics or visualizations of the current market prices. And this was how I came up with the idea of this analysis.


I wanted to gather the data and build an interactive dashboard where you could select different options, such as a 40 m² apartment situated in Berlin Mitte with a balcony and an equipped kitchen, and it would show you the price ranges. This alone would help people understand apartment prices in Berlin. Also, by applying machine learning, I would be able to identify the drivers of the rental prices and practise with different machine learning algorithms.

Extracting data from the web and cleaning it

Getting the data

Now that you have an idea about your data science project, you can start looking for the data. There are tons of amazing data repositories, such as Kaggle, UCI ML Repository or dataset search engines, and websites containing academic papers with datasets. Alternatively, you could use web scraping.


But be cautious: old data is everywhere. When I was searching for information about rental prices in Berlin, I found many visualizations, but they were old or had no year specified.

For some statistics, they even had a note saying that this price would only be for a 2 room apartment of 50 m2 without furniture. But what if I am searching for a smaller apartment with a furnished kitchen?


As I could find only old data, I decided to web scrape the websites that offered rental apartments. Web scraping is a technique used to extract data from websites through an automated process.


My web scraping blogpost goes into the details of pitfalls and design patterns of web scraping.


Web Scraping Tutorial with Python: Tips and Tricks (hackernoon.com)

Here are the main findings:


  • Before scraping, check if there is a public API available
  • Be kind! Don’t overload the website by sending hundreds of requests per second
  • Save the date when the extraction took place. It will be explained later why this is important.
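The last two points can be sketched in a few lines. This is only an illustration, not the scraper used for this project: the function name `scrape_politely`, the `fetch` callable, and the `example.com` URLs are all assumptions, and the real HTTP call is replaced by a stand-in so the sketch runs offline.

```python
import time
from datetime import date


def scrape_politely(urls, fetch, min_delay_s=1.0, sleep=time.sleep):
    """Fetch each URL with a pause in between and tag it with the extraction date."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(min_delay_s)  # pause between requests so the site is not flooded
        records.append({
            "url": url,
            "html": fetch(url),
            # stored so that newer versions of the same ad can be identified later
            "extracted_on": date.today().isoformat(),
        })
    return records


def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, keeps the sketch offline)."""
    return f"<html>listing at {url}</html>"


pages = scrape_politely(
    ["https://example.com/ad/1", "https://example.com/ad/2"],
    fetch=fake_fetch,
    min_delay_s=0.01,
)
print(len(pages), pages[0]["extracted_on"])
```

Passing `fetch` and `sleep` in as parameters keeps the throttling logic separate from the network code, which also makes it easy to test.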

Data cleaning

Once you start getting the data, it is very important to look at it as early as possible in order to find any possible issues.

While web scraping rental data, I included some small checks, such as the number of missing values for all features. Webmasters could change the HTML of the website, which would result in my program not getting the data anymore.
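A check of this kind might look like the sketch below. The rows and field names are invented for the example, and the 50% threshold is an arbitrary assumption; the idea is simply that a sudden jump in missing values hints at a changed page layout.

```python
def count_missing(rows, fields):
    """Return how many rows have no usable value for each field."""
    missing = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return missing


# Hypothetical scraped rows; the field names are made up for this example.
scraped = [
    {"id": "a1", "rent": 750, "area": 42},
    {"id": "a2", "rent": None, "area": 55},
    {"id": "a3", "rent": "", "area": None},
]

report = count_missing(scraped, ["id", "rent", "area"])
# Fields missing in more than half of the rows deserve a closer look.
suspicious = [field for field, n in report.items() if n / len(scraped) > 0.5]
print(report, suspicious)
```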

Once I had ensured that all technical aspects of web scraping were covered, I thought the data would be almost ideal. However, I ended up cleaning the data for around a week because of not-so-obvious duplicates.

Looking at the data early pays off in other ways, too. For instance, if you web scrape, you could have missed some important fields. If you use a comma as the separator when saving data to a file, and one of the fields also contains commas, you can end up with files whose columns are not separated correctly.
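The comma problem is easy to reproduce. The sketch below uses made-up listing data and compares a naive comma join with Python’s standard `csv` module, which quotes fields that contain the delimiter.

```python
import csv
import io

rows = [
    ["id", "description", "rent"],
    ["a1", "Bright flat, balcony, fitted kitchen", "750"],
]

# Naive approach: joining with commas splits the description into extra columns.
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")

# csv.writer quotes fields containing the delimiter, so they survive a round trip.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)
buffer.seek(0)
parsed = list(csv.reader(buffer))

print(len(naive_fields), "fields with the naive join")
print(len(parsed[1]), "fields with the csv module")
```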

There were several sources of duplicates:


  • Duplicated apartments because they had been online for a while
  • Agencies had input errors, for example in the rental price or the storey of the apartment. They would correct them after a while, or would publish a completely new ad with corrected values and additional description modifications
  • Some prices were changed (increased or decreased) after a month for the same apartment

While the duplicates from the first case were easy to identify by their ID, the duplicates from the second case were much harder to detect. The reason is that an agency could slightly change a description, correct the wrong price, and publish it as a new ad, so that the ID would also be new.

I had to come up with many logic-based rules to filter out the old versions of the ads. Once I was able to identify that these apartments were actual duplicates with slight modifications, I could sort them by extraction date and keep the latest one as the current version.
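A heavily simplified sketch of this step is shown below. The matching rule (same address, area, and number of rooms) is only an assumed stand-in for the many rules mentioned above, and the ads are invented.

```python
def keep_latest(ads, key_fields):
    """Keep only the most recently extracted ad for each combination of key fields."""
    latest = {}
    for ad in ads:
        key = tuple(ad[field] for field in key_fields)
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if key not in latest or ad["extracted_on"] > latest[key]["extracted_on"]:
            latest[key] = ad
    return list(latest.values())


# Hypothetical ads: the first two describe the same flat under different IDs.
ads = [
    {"id": "101", "address": "Example Str. 1", "area": 40, "rooms": 2,
     "rent": 900, "extracted_on": "2018-01-05"},
    {"id": "245", "address": "Example Str. 1", "area": 40, "rooms": 2,
     "rent": 700, "extracted_on": "2018-01-20"},
    {"id": "102", "address": "Sample Weg 7", "area": 65, "rooms": 3,
     "rent": 1100, "extracted_on": "2018-01-06"},
]

deduplicated = keep_latest(ads, key_fields=("address", "area", "rooms"))
print([ad["id"] for ad in deduplicated])
```

This only works because the extraction date was saved during scraping.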

Additionally, some agencies would increase or decrease the price for the same apartment after a month. I was told that if nobody wanted an apartment, the price would decrease. Conversely, I was told that if there were many requests for it, the agencies would increase the price. These sound like good explanations.

Gaining deeper insights

Now that we have everything ready, we can start analyzing the data. I know data scientists love seaborn and ggplot2, as well as many static visualizations from which they can derive some insights.


However, interactive dashboards can help you and other stakeholders to find useful insights. There are many easy-to-use tools for that, such as Tableau and MicroStrategy.

It took me less than 30 minutes to create an interactive dashboard where one can select all the important components and see how the price would change.


A fairly simple dashboard could already give newcomers insights into the prices in Berlin and could help a rental apartment website attract users.

Already from this data visualization you can see that the price distribution of 2.5-room apartments falls within the distribution of 2-room apartments. The reason is that most of the 2.5-room apartments are not situated in the center of the city, which, of course, reduces the price.

This data was gathered in winter 2017/18, and it will also become outdated. However, my point is that rental websites could frequently update their statistics and visualizations to provide more transparency on this question.

Engineering of features using external APIs

Visualization helps you to identify important attributes, or “features,” that could be used by machine learning algorithms. If the features you use are very uninformative, any algorithm will produce bad predictions. With very strong features, even a very simple algorithm can produce pretty decent results.

In the rental price project, price is a continuous variable, so it is a typical regression problem. Taking all extracted information, I collected the following features in order to be able to predict a rental price.


However, there was one feature that was problematic, namely the address. There were 6.6K apartments and around 4.4K unique addresses of different granularity. There were around 200 unique postcodes, which could be converted into dummy variables, but then very precious information about the particular location would be lost.

What do you do when you are given a new address? You google either where it is or how to get there.

Using an external API, the following four additional features could be calculated from the apartment’s address:

  1. duration of a train trip to S-Bahn Friedrichstrasse (central station)
  2. distance to U-Bahn Stadtmitte (city center) by car
  3. duration of a walking trip to the nearest metro station
  4. number of metro stations within one kilometer of the apartment

These four features boosted the performance significantly.

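As a rough illustration of the fourth feature, the sketch below counts stations within one kilometer using the haversine (straight-line) distance on coordinates. This is a swapped-in simplification, not the external API used in the project: it assumes the address has already been geocoded, and all coordinates are invented rather than real station locations.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def stations_within(apartment, stations, radius_km=1.0):
    """Count stations whose straight-line distance to the apartment is within the radius."""
    lat, lon = apartment
    return sum(
        1 for s_lat, s_lon in stations
        if haversine_km(lat, lon, s_lat, s_lon) <= radius_km
    )


# Illustrative coordinates only (latitude, longitude).
apartment = (52.5200, 13.4050)
stations = [(52.5219, 13.4132), (52.5250, 13.3690), (52.5170, 13.4000)]

print(stations_within(apartment, stations))
```

Walking or travel durations, as in the other three features, cannot be derived this way and do need a routing service.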

Common mistakes when carrying out machine learning and data science

After scraping or getting the data, there are many steps to accomplish before applying a machine learning model.


You need to visualize each of the variables to see distributions, find the outliers, and understand why there are such outliers.


What can you do with missing values in certain features?


What would be the best way to convert categorical features into numerical ones?


There are many such questions, but I will give some details on the ones where the majority of beginners make mistakes.

1. Visualization

Firstly, you should visualize the distribution of the continuous features to get a feeling for whether there are many outliers, what the distribution looks like, and whether it makes sense.

There are many ways to visualize it, for example box plots, histograms, cumulative distribution functions, and violin plots. However, one should pick the plot that will give the most information about the data.


To see the shape of the distribution (for example, whether it is normal or bimodal), histograms are the most helpful. Although histograms are a good starting point, box plots might be superior for identifying the number of outliers and seeing where the median and quartiles lie.
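Box plots usually mark as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that common convention to a handful of invented rent values; the factor `k=1.5` and the data are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up monthly rents in euros, with one suspiciously large value.
rents = [450, 520, 560, 600, 640, 700, 750, 820, 900, 3200]
print(iqr_outliers(rents))
```

Finding the outliers is only the first half; the more important step is understanding why they exist before deciding whether to drop them.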

Based on the plots, the most interesting question would be: do you see what you expected to see? Answering this question will help you either in finding insights or finding bugs in the data.


To get inspired and understand which plot will give the most value, I frequently referred to Python’s seaborn gallery. Another good source of inspiration for visualization and finding insights is the kernels on Kaggle. Here is my Kaggle kernel with an in-depth visualization of the Titanic dataset.

In the context of rental prices, I plotted the histograms of each continuous feature and expected to see a long right tail in the distribution of the rent without bills and total area.


Box plots helped me see the number of outliers for each of the features. In fact, most of the outlier apartments based on the rent without bills were either ateliers for small shops with more than 200 m² or student dormitories with very low rent.

2. Do I impute the values based on the whole dataset?

Sometimes there will be missing values, due to various reasons. If we exclude every observation with at least one missing value, we can end up with a very reduced dataset.


There are many ways of imputing the values, such as using the mean or the median. It is up to you how to do it, but make sure to calculate the imputation statistics only on the training data to avoid leaking information from your test set.
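In practice this means the statistic is learned once from the training split and then reused for the test split. A minimal sketch with invented area values, using the median:

```python
import statistics


def fit_median(train_values):
    """Compute the median of the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)


def impute(values, fill_value):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Made-up total areas in square meters; None marks a missing value.
train_area = [40, None, 55, 70, 62]
test_area = [None, 48, 120]

area_median = fit_median(train_area)            # learned from the training split only
train_filled = impute(train_area, area_median)
test_filled = impute(test_area, area_median)    # reused, not recomputed on the test data

print(area_median, test_filled)
```

Computing the median over training and test data together would let the test value of 120 influence what the model sees during training.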

In the rental data, I also extracted a description of the apartment. Whenever the quality, condition, or type of apartment was missing, I would impute it from the description if the description contained this information.


3. How do I transform categorical variables?

Some algorithms, depending on the implementation, wouldn’t work directly with the categorical data, so one would need to somehow transform them into numerical values.


There are many ways of transforming categorical variables into numerical features, such as Label Encoding, One Hot Encoding, bin encoding, and hashing encoding. However, Label Encoding is often used where One Hot Encoding should have been used instead.

Assume, in our rental data, that we have an apartment-type column with the following values: [ground floor, loft, maisonette, loft, loft, ground floor]. A label encoder can turn this into, for example, [3,2,1,2,2,3], introducing ordinality, which means that ground floor > loft > maisonette. For some algorithms, such as decision trees and their derivatives, this type of encoding would be fine for this feature, but for regressions and SVMs it might not make much sense.
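The difference between the two encodings can be shown without any library. The helper functions below are hand-written for illustration; the integer codes here are 0-based and assigned alphabetically, so the exact numbers differ from those in the text, but the artificial ordering they introduce is the same kind of problem.

```python
def label_encode(values):
    """Map each category to an integer; this implies an order that may not exist."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Give each category its own 0/1 column, so no order is implied."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


apartment_type = ["ground floor", "loft", "maisonette", "loft", "loft", "ground floor"]

labels, mapping = label_encode(apartment_type)
one_hot, columns = one_hot_encode(apartment_type)

print(labels)
print(columns)
print(one_hot[0], one_hot[2])
```

With one-hot encoding, a linear model gets a separate coefficient per apartment type instead of a single slope across arbitrary integer codes.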

In the rental price dataset, the condition is encoded as follows:

  • new: 1
  • renovated: 2
  • needs renovation: 3

and the quality as:

  • Luxus: 1
  • better than normal: 2
  • normal: 3
  • simple: 4
  • unknown: 5

4. Do I need to standardize variables?

Standardization brings all continuous variables to the same scale: if one variable has values from 1K to 1M and another from 0.1 to 1, after standardization both will have zero mean and unit variance, so their values become comparable.

L1 and L2 regularization are common ways of reducing overfitting and can be used within many regression algorithms. However, it is important to apply feature standardization before L1 or L2.

The rental price is in euros, so its fitted coefficient would be approximately 100 times larger than the fitted coefficient if the price were in cents. L1 and L2 penalize larger coefficients more, which means they penalize features measured on smaller scales more. To prevent this, the features should be standardized before applying L1 or L2.
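A small sketch of standardization (z-scores) is given below, with two invented features on very different scales. In a real pipeline the mean and standard deviation would be learned on the training data and reused for the test data, as with imputation.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training data."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        raise ValueError("constant feature: nothing to standardize")
    return mean, std


def standardize(values, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]


# Made-up features on very different scales.
area_m2 = [30.0, 50.0, 70.0, 90.0]
distance_m = [500.0, 1500.0, 2500.0, 3500.0]

scaled_area = standardize(area_m2, *fit_scaler(area_m2))
scaled_distance = standardize(distance_m, *fit_scaler(distance_m))

print([round(v, 3) for v in scaled_area])
print([round(v, 3) for v in scaled_distance])
```

After scaling, both features end up with identical values here, so a regularization penalty treats their coefficients on equal terms regardless of the original units.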

Another reason to standardize is that, if your algorithm uses gradient descent, it converges much faster with feature scaling.

5. Do I need to take the logarithm of the target variable?

It took me a while to understand that there is no universal answer.


It depends on many factors:


  • whether you want fractional or absolute error
  • which algorithm you use
  • what residual plots and changes in the metrics tell you

In regression, first pay attention to the residual plots and the metric. Sometimes taking the logarithm of the target variable leads to a better model, and the results of the model are still easy to understand. However, there are other transformations that could be of interest, such as taking the square root.

There are many answers on Stack Overflow regarding this question, and I think “Residual Plots and RMSE on raw and log target variable” explains it very well.

For the rental data, I took the logarithm of the price, as the residual plots looked a bit better.

6. Some more important stuff

Some algorithms, such as regressions, suffer from collinearity in the data because the coefficients become very unstable. An SVM might or might not suffer from collinearity, depending on the choice of kernel.

Decision tree-based algorithms will not suffer from multicollinearity, as they can use the correlated features interchangeably in different trees without affecting performance. However, interpreting feature importance then becomes more difficult, as a correlated variable may appear less important than it actually is.

Machine learning

After you have familiarized yourself with data and cleaned out the outliers, it is the perfect time to get the hang of machine learning. There are many algorithms you could use for this supervised machine learning.


There were three different algorithms I wanted to explore, comparing characteristics such as performance differences and speed. These three were gradient boosted trees with different implementations (XGBoost and LightGBM), Random Forest (RF, scikit-learn), and 3-layer neural networks (NN, TensorFlow). I selected RMSLE (root mean squared logarithmic error) as the metric for the optimization process. I used RMSLE because I took the logarithm of the target variable.
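For reference, RMSLE can be written in a few lines. The sketch uses the common log(1 + x) form of the metric, which is an assumption here since the text does not say which variant was used, and the rents and "predictions" are invented.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using the log(1 + x) form."""
    squared = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


rents = [500.0, 800.0, 1200.0]

# A model trained on the log of the target predicts in log space ...
log_rents = [math.log1p(r) for r in rents]
predicted_log = [v + 0.1 for v in log_rents]   # pretend output, off by 0.1 in log space

# ... so predictions have to be transformed back to euros before reporting them.
predicted = [math.expm1(v) for v in predicted_log]

print(round(rmsle(rents, predicted), 3))
```

Because the error is measured in log space, being 10% off on a cheap apartment costs about as much as being 10% off on an expensive one.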

XGBoost and LightGBM performed comparably, RF slightly worse, whereas NN was the worst.

Decision tree-based algorithms lend themselves well to interpreting features. For example, they produce a feature importance score.

Feature importance: finding the drivers of the rental price

After fitting a decision tree-based model, you can see what features are the most valuable for the price prediction.


Feature importance provides a score that indicates how informative each feature was in the construction of the decision trees within the model. One of the ways to calculate this score is to count how many times a feature is used to split the data across all trees. This score can be computed in different ways.


Feature importance can reveal other insights about the main price drivers.


For the rental price prediction, it isn’t surprising that total area is the most important driver of the price. Interestingly, some features that were engineered with the external API are also among the most important features.

However, as mentioned in “Interpretable Machine Learning with XGBoost”, there can be inconsistencies in feature importance depending on the attribution option. The author of the linked blogpost, and SHAP NIPS paper, proposes a new way of calculating feature importance that will be both accurate and consistent. This uses the shap Python library. SHAP values represent the responsibility of a feature for a change in the model output.


The output of the analysis on the rental price data is shown in the figure below.


The figure incorporates a lot of valuable information (features are sorted by mean(|Tree SHAP|)). Small disclaimer: the data is from the beginning of 2018; districts can evolve, and therefore the factors that drive the price could change.

  • proximity to the city center (kilometers to U-Bahn Stadtmitte by car and duration of a train trip to S-Bahn Friedrichstrasse) increases the predicted rental price
  • total area is the strongest driver of the rental price
  • if the apartment owner requires you to have a low income certificate (WBS in German), the predicted price is lower
  • renting an apartment in these districts would also increase the rental price: Mitte, Prenzlauer Berg, Wilmersdorf, Charlottenburg, Zehlendorf and Friedrichshain
  • districts with lower prices would be: Spandau, Tempelhof, Wedding and Reinickendorf
  • obviously, an apartment in better condition (a lower value is better), of better quality (a lower value is better), with furniture, a built-in kitchen, and an elevator will cost more

The impacts of the following features are interesting:

  • duration to the nearest metro station
  • number of stations within 1 km.

Duration to the nearest metro station: It seems that, for some apartments, a high value of this feature goes with a higher price. The reason is that these apartments are situated in very wealthy residential areas outside of Berlin.

One can also see that proximity to a metro station works in two directions: it lowers the price for some apartments and increases it for others. The reason could be that apartments very close to a metro station suffer from underground noise or vibrations caused by trains but, on the other hand, are well connected to public transportation. However, this feature deserves further investigation, as it captures only the proximity to the nearest metro station and not to tram or bus stations.

Number of stations within 1 km: The same applies to the number of stations within one kilometer of the apartment. Having many metro stations around would, in general, increase the rental price. However, it also had a negative effect: more noise.

Ensemble averaging

After playing around with different models and comparing performance, you could just combine the results of each of the models and build an ensemble!

Bagging (bootstrap aggregating) is a machine learning ensemble technique that aggregates the predictions of several models, typically trained on bootstrap samples of the data, into a final prediction. It is designed to reduce overfitting and the variance of the algorithms.

As I already had predictions from the above-mentioned algorithms, I combined all four models in all possible ways and picked the seven best single and ensemble models based on the RMSLE on the validation set.

Then the RMSLE of those seven models was calculated on the test set.

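The selection procedure described above can be sketched as follows. The validation targets and predictions are invented, plain averaging is assumed as the way of combining models, and RMSLE is again taken in its log(1 + x) form; none of these numbers come from the actual project.

```python
import math
from itertools import combinations


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using the log(1 + x) form."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def average(predictions):
    """Element-wise mean of several prediction lists."""
    return [sum(column) / len(column) for column in zip(*predictions)]


def rank_ensembles(model_predictions, y_true):
    """Score every single model and every averaged combination, best first."""
    results = []
    names = sorted(model_predictions)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            blended = average([model_predictions[name] for name in combo])
            results.append((rmsle(y_true, blended), combo))
    return sorted(results)


# Made-up validation targets and predictions for four models.
y_valid = [600.0, 900.0, 1500.0]
valid_predictions = {
    "xgboost": [620.0, 880.0, 1450.0],
    "lightgbm": [590.0, 930.0, 1520.0],
    "random_forest": [650.0, 850.0, 1400.0],
    "neural_net": [700.0, 1000.0, 1300.0],
}

ranking = rank_ensembles(valid_predictions, y_valid)
top_seven = ranking[:7]  # candidates to re-score on the held-out test set
for score, combo in top_seven:
    print(round(score, 4), combo)
```

Four models give 15 non-empty combinations. Choosing on the validation set and reporting on the test set keeps the final numbers honest.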

The ensemble of the three decision tree-based algorithms performed best, outperforming every single model.

You could also produce a weighted ensemble, assigning more weight to a better single model. The reasoning behind it is that other models could overrule the best model only if they collectively agree on an alternative.

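A weighted ensemble is a small change to plain averaging. In this sketch the model names, weights, and predictions are all invented; how the weights are chosen (for example, by validation performance) is left open.

```python
def weighted_average(predictions, weights):
    """Blend per-model predictions using the given (unnormalized) weights."""
    total = sum(weights[name] for name in predictions)
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n_rows)
    ]


# Made-up predictions; the best single model gets twice the weight of the others.
predictions = {
    "best": [1000.0, 800.0],
    "second": [900.0, 820.0],
    "third": [880.0, 840.0],
}
weights = {"best": 2.0, "second": 1.0, "third": 1.0}

blended = weighted_average(predictions, weights)
print(blended)
```

With these weights, the two weaker models together carry as much weight as the best one, so they shift the result noticeably only when they agree with each other.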

In reality, you never know whether an averaged ensemble will be better than a single model without trying it out.

Stacked models

An averaged or weighted ensemble is not the only way to combine the predictions of different models. You could also stack the models in very different ways!


The idea behind stacked models is to create several base models and a meta model on top of the results from the base models in order to produce final predictions. However, it is not so obvious how to train the meta model because it can be biased towards the best of the base models. A very good explanation of how to do it correctly can be found in the post “Stacking models for improved predictions”.


For the rental price case, stacked models didn’t improve the RMSLE at all; they even increased it. There might be several reasons for this: either I coded it incorrectly ;) or stacking simply introduced too much noise.

If you want to read more about ensembles and stacked models, the Kaggle Ensemble Guide explains many different kinds of ensembling, with performance comparisons and references on how such stacked models got to the top of Kaggle’s competitions.

Final thoughts

  • listen to what people talk about around you; their complaining can serve as a good starting point for solving something big
  • let people find their own insights by providing interactive dashboards
  • don’t restrict yourself to common feature engineering such as multiplying two variables. Try to find additional sources of data or explanations
  • try out ensembles and stacked models, as those methods could improve the performance

And please, provide the date of the data you display!


Sources of figures:
https://www.pinterest.de/minimalcouture/paris-apartments/
https://www.theodysseyonline.com/the-struggles-of-moving-into-your-first-apartment
https://www.fashionbeans.com/content/the-worlds-10-smallest-apartments-are-downright-shocking/

Original article: https://www.freecodecamp.org/news/how-to-build-a-data-science-project-from-scratch-dc4f096a62a1/
