How to build a data science project from scratch

by Jekaterina Kokatjuhha

A demonstration using an analysis of Berlin rental prices

There are many online courses about data science and machine learning that will guide you through the theory and provide you with some code examples and an analysis of very clean data.

However, to start practising data science, it is better to tackle a real-life problem: digging into the data to find deeper insights, carrying out feature engineering using additional sources of data, and building stand-alone machine learning pipelines.

This blogpost will guide you through the main steps of building a data science project from scratch. It is based on a real-life problem: what are the main drivers of rental prices in Berlin? It will provide an analysis of this situation. It will also highlight the common mistakes beginners tend to make when it comes to machine learning.

These are the steps that will be discussed in detail:

finding a topic
extracting data from the web and cleaning it
gaining deeper insights
engineering features using external APIs
common mistakes while carrying out machine learning
feature importance: finding the drivers of rental prices
building machine learning models

Finding a topic

There are many problems that can be solved by analyzing data, but it is always better to find a problem that you are interested in and that will motivate you. While searching for a topic, you should definitely concentrate on your preferences and interests.

For instance, if you are interested in healthcare systems, there are many angles from which you could challenge the data provided on that topic. “Exploring the ChestXray14 dataset: problems” is an example of how to question the quality of medical data. As another example, if you are interested in music, you could try to predict the genre of a song from its audio.

However, I suggest that you not only concentrate on your interests but also listen to what people around you are talking about. What bothers them? What are they complaining about? This can be another good source of ideas for a data science project. If people are still complaining about something, it may mean that the problem wasn’t solved properly the first time around. Thus, if you challenge it with data, you could provide an even better solution and have an impact on how this topic is perceived.

This may all sound a bit too abstract, so let’s find out how I came up with the idea to analyze Berlin rental prices.

“If I had known that the rental prices were so high here, I would have negotiated for a higher salary.”

This is just one of the things I heard from people who had recently moved to Berlin for work. Most newcomers complained that they hadn’t imagined Berlin to be so expensive, and that there were no statistics about possible price ranges of apartments. If they had known this beforehand, they could have asked for a higher salary during the job application process or could have considered other options.

I googled, checked several rental apartment websites, and asked several people, but could not find any plausible statistics or visualizations of the current market prices. And this was how I came up with the idea of this analysis.

I wanted to gather the data and build an interactive dashboard where you could select different options, such as a 40 m² apartment situated in Berlin Mitte with a balcony and an equipped kitchen, and it would show you the price ranges. This alone would help people understand apartment prices in Berlin. Also, by applying machine learning, I would be able to identify the drivers of the rental prices and practise with different machine learning algorithms.

Extracting data from the web and cleaning it

Getting the data

Now that you have an idea for your data science project, you can start looking for the data. There are tons of amazing data repositories, such as Kaggle, the UCI ML Repository, dataset search engines, and websites containing academic papers with datasets. Alternatively, you could use web scraping.

But be cautious — old data is everywhere. When I was searching for the information about the rental prices in Berlin, I found many visualizations but they were old, or without any year specified.

For some statistics, they even had a note saying that the price would only apply to a two-room apartment of 50 m² without furniture. But what if I am searching for a smaller apartment with a furnished kitchen?

As I could find only old data, I decided to web scrape the websites that offered rental apartments. Web scraping is a technique used to extract data from websites through an automated process.

My web scraping blogpost goes into the details of pitfalls and design patterns of web scraping.

Web Scraping Tutorial with Python: Tips and Tricks (hackernoon.com)

Here are the main findings:

Before scraping, check if there is a public API available.
Be kind! Don’t overload the website by sending hundreds of requests per second.
Save the date when the extraction took place. Why this matters is explained in the data cleaning section.
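
The last two findings can be sketched in a few lines. This is only an illustration, not the scraper used for the project: `scrape_politely`, `fake_fetch`, the example URLs, and the `extracted_on` field name are all made up here, and the real HTTP call is replaced by a stand-in function so the sketch runs without network access.

```python
import time
from datetime import date

def scrape_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each URL with a pause in between and record the extraction date."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be kind: space out the requests
        records.append({
            "url": url,
            "html": fetch(url),
            # stored so that the newest version of an ad can be picked later
            "extracted_on": date.today().isoformat(),
        })
    return records

# Hypothetical stand-in for a real HTTP call.
def fake_fetch(url):
    return "<html>listing for " + url + "</html>"

records = scrape_politely(
    ["https://example.com/ad/1", "https://example.com/ad/2"],
    fake_fetch,
    delay_seconds=0.1,
)
```

Passing the fetch function in as a parameter keeps the waiting and date-stamping logic separate from whichever HTTP library is actually used.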

Data cleaning

While web scraping rental data, I included some small checks, such as the number of missing values for each feature. Webmasters could change the HTML of the website, which would result in my program no longer getting the data.
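
A check of this kind could look roughly like the sketch below. The field names, the sample rows, and the 50% threshold are assumptions chosen for illustration; the original post does not show its checks.

```python
def missing_counts(rows, fields):
    """Count how many rows have no usable value for each field."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                counts[field] += 1
    return counts

# Made-up scraped rows.
rows = [
    {"rent": 850, "area": 52, "rooms": 2},
    {"rent": None, "area": 40, "rooms": 1},
    {"rent": 1200, "area": "", "rooms": None},
]
counts = missing_counts(rows, ["rent", "area", "rooms"])

# Flag fields where more than half of the rows are empty (arbitrary threshold).
suspicious = [field for field, n in counts.items() if n / len(rows) > 0.5]
```

A sudden jump in the missing count of one field is a hint that the page layout changed and the extraction rule for that field no longer matches.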

Once I had ensured that all technical aspects of web scraping were covered, I thought the data would be almost ideal. However, I ended up cleaning the data for around a week because of not-so-obvious duplicates.

Once you start getting the data, it is very important to have a look at it as early as possible in order to find any possible issues. For instance, if you web scrape, you could have missed some important fields. If you use a comma separator while saving data into a file, and one of the fields also contains commas, you can end up with files whose fields are not separated correctly.
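
The comma problem is easy to reproduce. The sketch below uses a made-up row and compares joining fields by hand with writing them through Python’s standard `csv` module, which quotes fields that contain the separator.

```python
import csv
import io

# Made-up listing: two of the three fields contain commas.
row = ["Friedrichstrasse 1, 10117 Berlin", "bright, quiet flat", "950"]

naive_line = ",".join(row)            # commas inside fields look like separators
naive_fields = naive_line.split(",")  # more fields come back than went in

buffer = io.StringIO()
csv.writer(buffer).writerow(row)      # fields containing commas get quoted
buffer.seek(0)
parsed = next(csv.reader(buffer))     # reading back restores the original fields
```

Choosing a separator that cannot occur in the data, such as a tab, is another option, but quoting is the more robust one for free-text fields like descriptions.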

There were several sources of duplicates:

Apartments were duplicated because their ads had been online for a while.
Agencies made input errors, for example in the rental price or the storey of the apartment. They would correct them after a while, or would publish a completely new ad with corrected values and additional description modifications.
Some prices were changed (increased or decreased) after a month for the same apartment.

While the duplicates from the first case were easy to identify by their ID, the duplicates from the second case were much harder to spot. The reason is that an agency could slightly change a description, correct the wrong price, and publish it as a new ad, so that the ID would also be new.

I had to come up with many logic-based rules to filter out the old versions of the ads. Once I had identified apartments that were actual duplicates with slight modifications, I could sort them by extraction date and keep the latest version.
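
The "keep the latest version" step can be sketched as follows. The matching rule used here (same address, area, and number of rooms) is an assumption for illustration only; the actual rules used in the project were more involved and are not listed in the post.

```python
def keep_latest(ads, key_fields):
    """For each group of ads sharing key_fields, keep the one extracted last."""
    latest = {}
    for ad in ads:
        key = tuple(ad[field] for field in key_fields)
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or ad["extracted_on"] > latest[key]["extracted_on"]:
            latest[key] = ad
    return list(latest.values())

# Made-up ads: the first two describe the same flat under different IDs.
ads = [
    {"id": "a1", "address": "Torstr. 5", "area": 60, "rooms": 2,
     "rent": 900, "extracted_on": "2018-01-03"},
    {"id": "b7", "address": "Torstr. 5", "area": 60, "rooms": 2,
     "rent": 990, "extracted_on": "2018-01-20"},
    {"id": "c2", "address": "Seestr. 9", "area": 45, "rooms": 1,
     "rent": 600, "extracted_on": "2018-01-05"},
]
unique_ads = keep_latest(ads, ["address", "area", "rooms"])
```

This only works because every record carries its extraction date, which is why saving that date during scraping matters.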

Additionally, some agencies would increase or decrease the price for the same apartment after a month. I was told that if nobody wanted the apartment, the price would decrease. Conversely, I was told that if there were many requests for it, the agencies would increase the price. These sound like good explanations.

Gaining deeper insights

Now that we have everything ready, we can start analyzing the data. I know data scientists love seaborn and ggplot2, as well as many static visualizations from which they can derive some insights.

However, interactive dashboards can help you and other stakeholders to find useful insights. There are many amazing easy-to-use tools for that, such as Tableau and MicroStrategy.

It took me less than 30 minutes to create an interactive dashboard where one can select all the important components and see how the price would change.

A fairly simple dashboard could already provide newcomers with insights into the prices in Berlin and could help a rental apartment website attract users.

Already from this data visualization you can see that the price distribution of 2.5-room apartments falls within the distribution of 2-room apartments. The reason for this is that most of the 2.5-room apartments aren’t situated in the center of the city which, of course, reduces the price.

This data was gathered in winter 2017/18 and it will also get outdated. However, my point is that the rental websites could frequently update their statistics and visualizations to provide more transparency on this question.

Engineering of features using external APIs

Visualization helps you to identify important attributes, or “features,” that could be used by machine learning algorithms. If the features you use are very uninformative, any algorithm will produce bad predictions. With very strong features, even a very simple algorithm can produce pretty decent results.

In the rental price project, price is a continuous variable, so it is a typical regression problem. Taking all extracted information, I collected the following features in order to be able to predict a rental price.

However, there was one feature that was problematic, namely the address. There were 6.6K apartments and around 4.4K unique addresses of different granularity. There were around 200 unique postcodes, which could be converted into dummy variables, but then very precious information about a particular location would be lost.

What do you do when you are given a new address? You google either where it is or how to get there.

Using an external API, the following four additional features could be calculated from the apartment’s address:

1. duration of a train trip to S-Bahn Friedrichstrasse (central station)
2. distance to U-Bahn Stadtmitte (city center) by car
3. duration of a walking trip to the nearest metro station
4. number of metro stations within one kilometer of the apartment

These four features boosted the model performance significantly.
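
To make the fourth feature concrete, here is a minimal sketch that counts stations within one kilometer using the haversine (great-circle) distance. It is an assumption-laden stand-in: the project obtained its features from an external API, whereas this sketch uses straight-line distance, made-up coordinates, and hypothetical function names.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def stations_within(apartment, stations, radius_km=1.0):
    """Number of stations no farther than radius_km from the apartment."""
    return sum(
        1 for station in stations
        if haversine_km(*apartment, *station) <= radius_km
    )

# Made-up (latitude, longitude) pairs.
apartment = (52.5200, 13.4050)
stations = [(52.5215, 13.4110), (52.5250, 13.3690), (52.5100, 13.3800)]
n_close = stations_within(apartment, stations)
```

Trip durations by train, car, or on foot depend on the street and transit network, so they cannot be derived from coordinates alone; that part genuinely needs a routing service.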

Common mistakes when carrying out machine learning and data science

After scraping or getting the data, there are many steps to accomplish before applying a machine learning model.

You need to visualize each of the variables to see distributions, find the outliers, and understand why there are such outliers.

What can you do with missing values in certain features?

What would be the best way to convert categorical features into numerical ones?

There are many such questions, but I will give some details on the ones where most beginners make mistakes.

1. Visualization

Firstly, you should visualize the distribution of the continuous features to get a feeling for whether there are many outliers, what the distribution looks like, and whether it makes sense.

There are many ways to visualize it, for example box plots, histograms, cumulative distribution functions, and violin plots. However, one should pick the plot that will give the most information about the data.

To see the shape of the distribution (for example, whether it is normal or bimodal), histograms are the most helpful. Although histograms are a good starting point, box plots might be superior for identifying the number of outliers and seeing where the median and quartiles lie.

Based on the plots, the most interesting question would be: do you see what you expected to see? Answering this question will help you either in finding insights or finding bugs in the data.

To get inspired and understand which plot will give the most value, I frequently referred to Python’s seaborn gallery. Another good source of inspiration for visualization and finding insights are kernels on Kaggle. Here is my Kaggle kernel with an in-depth visualization of the Titanic dataset.

In the context of rental prices, I plotted the histograms of each continuous feature and expected to see a long right tail in the distributions of the rent without bills and the total area.

Box plots helped me see the number of outliers for each of the features. In fact, most of the outlier apartments based on the rent without bills were either ateliers for small shops with more than 200 m² or student dormitories with very low rent.

2. Do I impute the values based on the whole dataset?

Sometimes there will be missing values, for various reasons. If we exclude every observation with at least one missing value, we can end up with a much smaller dataset.

There are many ways of imputing the values, for example with the mean or the median. It is up to you how to do it, but make sure to calculate the imputation statistics only on the training data to avoid leaking information from your test set.
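
A minimal sketch of leak-free median imputation, using made-up area values and hypothetical helper names: the statistic is computed on the training split only and then reused unchanged for the test split.

```python
from statistics import median

def fit_median(values):
    """Median of the observed (non-missing) values."""
    return median(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing entries with a previously computed statistic."""
    return [fill_value if v is None else v for v in values]

# Made-up total-area values; None marks a missing entry.
train_area = [40, None, 55, 70, 62]
test_area = [None, 48, 300]

fill = fit_median(train_area)          # computed from the training split only
train_filled = impute(train_area, fill)
test_filled = impute(test_area, fill)  # the test split reuses the training value
```

Had the median been computed on the combined data, the unusually large test value of 300 would have influenced the number used to fill the training set, which is exactly the leakage to avoid.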

In the rental data, I also extracted a description of the apartment. Whenever the quality, condition, or type of apartment was missing, I would impute it from the description if the description contained this information.

3. How do I transform categorical variables?

Some algorithms, depending on the implementation, won’t work directly with categorical data, so one needs to transform the categories into numerical values.

There are many ways of transforming categorical variables into numerical features, such as label encoding, one-hot encoding, bin encoding, and hashing encoding. However, most people use label encoding incorrectly when one-hot encoding should have been used instead.

Assume, in our rental data, that we have an apartment-type column with the following values: [ground floor, loft, maisonette, loft, loft, ground floor]. A label encoder could turn this into [3,2,1,2,2,3], introducing ordinality, which would mean that ground floor > loft > maisonette. For some algorithms, like decision trees and their variants, this type of encoding would be fine for this feature, but for regressions and SVM it might not make much sense.
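
For comparison, a one-hot encoding of the same column can be sketched in plain Python. The `one_hot` helper is written here for illustration; libraries such as pandas and scikit-learn offer equivalent functionality.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows

apartment_types = ["ground floor", "loft", "maisonette",
                   "loft", "loft", "ground floor"]
categories, encoded = one_hot(apartment_types)
# categories: ['ground floor', 'loft', 'maisonette']
# encoded[0]: [1, 0, 0]  (ground floor)
# encoded[2]: [0, 0, 1]  (maisonette)
```

Each category gets its own column, so no artificial order between apartment types is implied.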

In the rental price dataset, the condition is encoded as follows:

new: 1
renovated: 2
needs renovation: 3

and the quality as:

Luxus: 1
better than normal: 2
normal: 3
simple: 4
unknown: 5

4. Do I need to standardize variables?

Standardization brings all continuous variables to the same scale. If one variable has values from 1K to 1M and another from 0.1 to 1, after standardization they will be on a comparable scale.

L1 and L2 regularization are common ways of reducing overfitting and can be used within many regression algorithms. However, it is important to apply feature standardization before L1 or L2.

If a price feature is expressed in euros, its fitted coefficient will be approximately 100 times larger than if the same price were expressed in cents. L1 and L2 penalize larger coefficients more, meaning they penalize features on smaller scales more. To prevent this, the features should be standardized before applying L1 or L2.

Another reason to standardize is that if you or your algorithm use gradient descent, it converges much faster with feature scaling.
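
A minimal sketch of z-score standardization with made-up area values. As with imputation, the mean and standard deviation are estimated on the training data and reused for new data; the helper names are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Mean and population standard deviation of the training data."""
    return mean(train_values), pstdev(train_values)

def standardize(values, mu, sigma):
    """Shift to zero mean and scale to unit standard deviation."""
    return [(v - mu) / sigma for v in values]

train_area = [30.0, 50.0, 70.0, 90.0]
mu, sigma = fit_scaler(train_area)
train_scaled = standardize(train_area, mu, sigma)
test_scaled = standardize([60.0], mu, sigma)  # reuse the training statistics
```

After this step the training values have mean 0 and standard deviation 1, regardless of whether the area was originally measured in square meters or square feet.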

5. Do I need to derive the logarithm of the target variable?

It took me a while to understand that there is no universal answer.

It depends on many factors:

whether you want fractional or absolute error
which algorithm you use
what residual plots and changes in the metrics tell you

In regression, pay attention first to the residual plots and the metric. Sometimes taking the logarithm of the target variable leads to a better model, and the results of the model are still easy to understand. However, there are other transformations that could be of interest, such as taking the square root.

There are many answers on Stack Overflow regarding this question, and I think “Residual Plots and RMSE on raw and log target variable” explains it very well.

For the rental data, I derived the logarithm of the price as the residual plots looked a bit better.

6. Some more important stuff

Some algorithms, such as regressions, will suffer from collinearity in the data because the coefficients become very unstable. SVM might or might not suffer from collinearity, depending on the choice of kernel.

Decision tree-based algorithms will not suffer from multicollinearity, as they can use features interchangeably in different trees without it affecting the performance. However, the interpretation of feature importance then gets more difficult, as a correlated variable may not appear to be as important as it is.

Machine learning

After you have familiarized yourself with the data and cleaned out the outliers, it is the perfect time to get the hang of machine learning. There are many algorithms you could use for this supervised machine learning task.

There were three different algorithms I wanted to explore, comparing characteristics such as performance differences and speed. These three were gradient boosted trees with different implementations (XGBoost and LightGBM), Random Forest (RF, scikit-learn), and 3-layer neural networks (NN, TensorFlow). I selected RMSLE (root mean squared logarithmic error) as the metric for optimization. I used RMSLE because I had derived the logarithm of the target variable.
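
For reference, RMSLE can be written in a few lines. This sketch uses the common log(1 + x) convention; the post does not say which exact variant was used, so treat that detail and the sample rents as assumptions.

```python
from math import log1p, sqrt

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; log1p(x) equals log(1 + x)."""
    squared = [(log1p(p) - log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))

# Made-up monthly rents.
rents = [500.0, 1000.0, 2000.0]
perfect = rmsle(rents, rents)
off_by_10_percent = rmsle(rents, [r * 1.1 for r in rents])
```

Because the error is taken on the log scale, overestimating every rent by 10% gives roughly the same penalty for a cheap and an expensive apartment, which is the "fractional error" behaviour mentioned earlier.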

XGBoost and LightGBM performed comparably, RF slightly worse, and NN was the worst.

Decision tree-based algorithms make it easy to interpret features. For example, they produce a feature importance score.

Feature importance: finding the drivers of the rental price

After fitting a decision tree-based model, you can see which features are the most valuable for the price prediction.

Feature importance provides a score that indicates how informative each feature was in the construction of the decision trees within the model. This score can be computed in different ways; one of them is to count how many times a feature is used to split the data across all trees.
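
The split-count idea can be illustrated on toy trees. The nested-dictionary tree format, the feature names, and the two hand-written trees below are assumptions made for this sketch; real libraries store their trees differently and expose importance scores directly.

```python
from collections import Counter

def count_splits(tree, counts):
    """Walk one tree and count how often each feature is used in a split."""
    if not isinstance(tree, dict):  # leaves are plain prediction values
        return
    counts[tree["feature"]] += 1
    count_splits(tree["left"], counts)
    count_splits(tree["right"], counts)

# Two tiny hand-written trees standing in for a fitted ensemble.
trees = [
    {"feature": "total_area", "left": 600.0,
     "right": {"feature": "km_to_center", "left": 1400.0, "right": 1000.0}},
    {"feature": "total_area",
     "left": {"feature": "total_area", "left": 450.0, "right": 700.0},
     "right": 1200.0},
]

importance = Counter()
for tree in trees:
    count_splits(tree, importance)
```

Counting splits ignores how much each split actually improved the predictions, which is one reason different importance options can rank features differently.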

Feature importance can reveal other insights about the main price drivers.

For the rental price prediction, it isn’t surprising that total area is the most important driver of the price. Interestingly, some features that were engineered with the external API are also among the most important features.

However, as mentioned in “Interpretable Machine Learning with XGBoost”, there can be inconsistencies in feature importance depending on the attribution option. The author of that blogpost and of the SHAP NIPS paper proposes a new way of calculating feature importance that is both accurate and consistent. It is implemented in the shap Python library. SHAP values represent the responsibility of a feature for a change in the model output.

The output of the analysis on the rental price data is shown in the figure below.

The figure incorporates a lot of valuable information (features are sorted by mean(|Tree SHAP|)). Small disclaimer: the data is from the beginning of 2018; districts can evolve, and therefore the price-dependent factors could change.

proximity to the city center (kilometers to U-Bahn Stadtmitte by car and duration of a train trip to S-Bahn Friedrichstrasse) increases the predicted rental price
total area is the strongest driver of the rental price
if the apartment owner requires you to have a low income certificate (WBS in German), the predicted price is lower
renting an apartment in these districts would also increase the rental price: Mitte, Prenzlauer Berg, Wilmersdorf, Charlottenburg, Zehlendorf and Friedrichshain
districts with lower prices would be: Spandau, Tempelhof, Wedding and Reinickendorf
obviously, an apartment in better condition (a lower value is better), of better quality (a lower value is better), with furniture, a built-in kitchen, and an elevator will cost more

The impacts of the following features are interesting:

duration to the nearest metro station
number of stations within 1 km

Duration to the nearest metro station: It seems that, for some apartments, a high value of this feature indicates a higher price. The reason for this is that these apartments are situated in very wealthy residential areas outside of Berlin.

One can also see that proximity to a metro station works in two directions: it lowers the price for some apartments and increases it for others. The reason could be that apartments that are very close to a metro station suffer from underground noise or vibrations caused by trains but, on the other hand, are well connected to public transportation. However, one could investigate this feature a bit more, as it shows the proximity only to the nearest metro stations and not to tram or bus stations.

Number of stations within 1 km: The same applies to the number of stations within one kilometer of the apartment. Many metro stations nearby would, in general, increase the rental price. However, they can also have a negative effect: more noise.

Ensemble averaging

After playing around with different models and comparing performance, you could just combine the results of each of the models and build an ensemble!

Bagging (bootstrap aggregating) is an ensemble technique that aggregates the predictions of several models, typically trained on bootstrap samples of the data, into the final predictions. It is designed to reduce the variance of the algorithms and help prevent overfitting.

As I already had predictions from the above-mentioned algorithms, I combined all four models in all possible ways and picked the seven best single and ensemble models based on the RMSLE on the validation set.

Then the RMSLE of those seven models was calculated on the test set.
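
The selection procedure can be sketched as follows. All numbers are made up, plain averaging is assumed as the way of combining models, and the RMSLE variant with log(1 + x) is an assumption as well; the sketch only shows the mechanics of scoring every combination on a validation set.

```python
from itertools import combinations
from math import log1p, sqrt

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error with the log(1 + x) convention."""
    squared = [(log1p(p) - log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))

def average(prediction_lists):
    """Element-wise mean of several models' predictions."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

# Made-up validation targets and per-model predictions.
y_valid = [800.0, 1200.0, 600.0]
predictions = {
    "xgboost": [820.0, 1150.0, 640.0],
    "lightgbm": [790.0, 1260.0, 570.0],
    "rf": [900.0, 1100.0, 650.0],
    "nn": [700.0, 1400.0, 500.0],
}

# Score every non-empty subset of the four models: 15 candidates in total.
scores = {}
for size in range(1, len(predictions) + 1):
    for names in combinations(predictions, size):
        blended = average([predictions[name] for name in names])
        scores[names] = rmsle(y_valid, blended)

best_seven = sorted(scores, key=scores.get)[:7]
```

Only the selected candidates would then be evaluated once on the held-out test set, so that the test score is not used for choosing among them.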

The ensemble of the three decision tree-based algorithms performed better than any single model.

You could also produce a weighted ensemble, assigning more weight to a better single model. The reasoning behind it is that the other models could overrule the best model only if they collectively agree on an alternative.
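
A weighted ensemble is a small change to plain averaging. The weights and predictions below are illustrative only; in practice the weights would be tuned on a validation set.

```python
def weighted_average(prediction_lists, weights):
    """Blend per-model predictions; weights are normalized by their sum."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, values)) / total
        for values in zip(*prediction_lists)
    ]

# Made-up predictions for two apartments.
best_model = [1000.0, 700.0]
other_a = [1100.0, 650.0]
other_b = [1120.0, 640.0]

# The best model gets as much weight as the other two together.
blended = weighted_average([best_model, other_a, other_b], [2, 1, 1])
```

With these weights a single weaker model can only nudge the result, while two weaker models that agree with each other pull it noticeably towards their shared estimate.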

In reality, one would never know if an averaged ensemble would be better than the single model without just trying it out.

Stacked models

An averaged or weighted ensemble is not the only way to combine the predictions of different models. You could also stack the models in very different ways!

The idea behind stacked models is to create several base models and a meta model on top of the results from the base models in order to produce final predictions. However, it is not so obvious how to train the meta model because it can be biased towards the best of the base models. A very good explanation of how to do it correctly can be found in the post “Stacking models for improved predictions”.

For the rental price case, stacked models didn’t improve the RMSLE at all; they even increased it. There might be several reasons for this: either I coded it incorrectly ;) or there was just too much noise introduced by stacking.

If you want to explore more articles on ensembles and stacked models, the Kaggle Ensemble Guide explains many different kinds of ensembling, with performance comparisons and references to how such stacked models got to the top of Kaggle’s competitions.

Final thoughts

listen to what people talk about around you; their complaining can serve as a good starting point for solving something big
let people find their own insights by providing interactive dashboards
don’t restrict yourself to common feature engineering such as multiplying two variables; try to find additional sources of data or explanations
try out ensembles and stacked models, as those methods could improve the performance

And please, provide the date of the data you display!

Sources of figures:
https://www.pinterest.de/minimalcouture/paris-apartments/
https://www.theodysseyonline.com/the-struggles-of-moving-into-your-first-apartment
https://www.fashionbeans.com/content/the-worlds-10-smallest-apartments-are-downright-shocking/

Translated from: https://www.freecodecamp.org/news/how-to-build-a-data-science-project-from-scratch-dc4f096a62a1/

pyecharts——绘制柱形图折线图 2224070247 信息可视化 python java 数据可视化
一、pyecharts概述自2013年6月百度EFE(ExcellentFrontEnd）数据可视化团队研发的ECharts1.0发布到GitHub网站以来，ECharts一直备受业界权威的关注并获得广泛好评，成为目前成熟且流行的数据可视化图表工具，被应用到诸多数据可视化的开发领域。Python作为数据分析领域最受欢迎的语言，也加入ECharts的使用行列，并研发出方便Python开发者使用的数据
回溯算法-重新安排行程 chirou_ 算法数据结构图论 c++图搜索
leetcode332.重新安排行程这题我还没自己ac过，只能现在凭着刚学完的热乎劲把我对题解的理解记下来。本题我认为对数据结构的考察比较多，用什么数据结构去存数据，去读取数据，都是很重要的。classSolution{private:unordered_map>targets;boolbacktracking(intticketNum,vector&result){//1.确定参数和返回值//2
高级 ECharts 技巧：自定义图表主题与样式 SnowMan1993 echarts 信息可视化数据分析
ECharts是一个强大的数据可视化库，提供了多种内置主题和样式，但你也可以根据项目的设计需求，自定义图表的主题与样式。本文将介绍如何使用ECharts自定义图表主题，以提升数据可视化的吸引力和一致性。1.什么是ECharts主题？ECharts的主题是指定义图表样式的配置项，包括颜色、字体、线条样式等。通过预设主题，你可以快速更改图表的整体风格，而自定义主题则允许你在此基础上进行个性化设置。2.
Faiss：高效相似性搜索与聚类的利器网络·魚大数据 faiss
Faiss是一个针对大规模向量集合的相似性搜索库，由FacebookAIResearch开发。它提供了一系列高效的算法和数据结构，用于加速向量之间的相似性搜索，特别是在大规模数据集上。本文将介绍Faiss的原理、核心功能以及如何在实际项目中使用它。Faiss原理：近似最近邻搜索：Faiss的核心功能之一是近似最近邻搜索，它能够高效地在大规模数据集中找到与给定查询向量最相似的向量。这种搜索是近似的，
nosql数据库技术与应用知识点皆过客，揽星河 NoSQL nosql 数据库大数据数据分析数据结构非关系型数据库
Nosql知识回顾大数据处理流程数据采集(flume、爬虫、传感器)数据存储(本门课程NoSQL所处的阶段)Hdfs、MongoDB、HBase等数据清洗(入仓)Hive等数据处理、分析(Spark、Flink等)数据可视化数据挖掘、机器学习应用(Python、SparkMLlib等)大数据时代存储的挑战(三高)高并发(同一时间很多人访问)高扩展(要求随时根据需求扩展存储)高效率(要求读写速度快)
insert into select 主键自增_mybatis拦截器实现主键自动生成 weixin_39521651 insert into select 主键自增 mybatis delete返回值 mybatis insert返回主键 mybatis insert返回对象 mybatis plus insert返回主键 mybatis plus 插入生成id
前言前阵子和朋友聊天，他说他们项目有个需求，要实现主键自动生成，不想每次新增的时候，都手动设置主键。于是我就问他，那你们数据库表设置主键自动递增不就得了。他的回答是他们项目目前的id都是采用雪花算法来生成，因此为了项目稳定性，不会切换id的生成方式。朋友问我有没有什么实现思路，他们公司的orm框架是mybatis，我就建议他说，不然让你老大把mybatis切换成mybatis-plus。mybat
k均值聚类算法考试例题_k均值算法(k均值聚类算法计算题) 寻找你83497 k均值聚类算法考试例题
?算法：第一步：选K个初始聚类中心，z1(1),z2(1)，…，zK(1)，其中括号内的序号为寻找聚类中心的迭代运算的次序号。聚类中心的向量值可任意设定，例如可选开始的K个.k均值聚类：---------一种硬聚类算法，隶属度只有两个取值0或1，提出的基本根据是“类内误差平方和最小化”准则；模糊的c均值聚类算法：--------一种模糊聚类算法，是.K均值聚类算法是先随机选取K个对象作为初始的聚类
Python开发常用的三方模块如下：换个网名有点难 python 开发语言
Python是一门功能强大的编程语言，拥有丰富的第三方库，这些库为开发者提供了极大的便利。以下是100个常用的Python库，涵盖了多个领域：1、NumPy，用于科学计算的基础库。2、Pandas，提供数据结构和数据分析工具。3、Matplotlib，一个绘图库。4、Scikit-learn，机器学习库。5、SciPy，用于数学、科学和工程的库。6、TensorFlow，由Google开发的开源机
ES聚合分析原理与代码实例讲解光剑书架上的书大厂Offer收割机面试题简历程序员读书硅基计算碳基计算认知计算生物计算深度学习神经网络大数据 AIGC AGI LLM Java Python 架构设计 Agent 程序员实现财富自由
ES聚合分析原理与代码实例讲解1.背景介绍1.1问题的由来在大规模数据分析场景中，特别是在使用Elasticsearch（ES）进行数据存储和检索时，聚合分析成为了一个至关重要的功能。聚合分析允许用户对数据集进行细分和分组，以便深入探索数据的结构和模式。这在诸如实时监控、日志分析、业务洞察等领域具有广泛的应用。1.2研究现状目前，ES聚合分析已经成为现代大数据平台的核心组件之一。它支持多种类型的聚
Python实现简单的机器学习算法 master_chenchengg python python 办公效率 python开发 IT
Python实现简单的机器学习算法开篇：初探机器学习的奇妙之旅搭建环境：一切从安装开始必备工具箱第一步：安装Anaconda和JupyterNotebook小贴士：如何配置Python环境变量算法初体验：从零开始的Python机器学习线性回归：让数据说话数据准备：从哪里找数据编码实战：Python实现线性回归模型评估：如何判断模型好坏逻辑回归：从分类开始理论入门：什么是逻辑回归代码实现：使用skl
JVM、JRE和 JDK：理解Java开发的三大核心组件 Y雨何时停T Java java
Java是一门跨平台的编程语言，它的成功离不开背后强大的运行环境与开发工具的支持。在Java的生态中，JVM（Java虚拟机）、JRE（Java运行时环境）和JDK（Java开发工具包）是三个至关重要的核心组件。本文将探讨JVM、JDK和JRE的区别，帮助你更好地理解Java的运行机制。1.JVM：Java虚拟机（JavaVirtualMachine）什么是JVM？JVM，即Java虚拟机，是Ja
推荐算法_隐语义-梯度下降 _feivirus_ 算法机器学习和数学推荐算法机器学习隐语义
importnumpyasnp1.模型实现"""inputrate_matrix:M行N列的评分矩阵，值为P*Q.P:初始化用户特征矩阵M*K.Q:初始化物品特征矩阵K*N.latent_feature_cnt:隐特征的向量个数max_iteration:最大迭代次数alpha:步长lamda:正则化系数output分解之后的P和Q"""defLFM_grad_desc(rate_matrix,l
K近邻算法_分类鸢尾花数据集 _feivirus_ 算法机器学习和数学分类机器学习 K近邻
importnumpyasnpimportpandasaspdfromsklearn.datasetsimportload_irisfromsklearn.model_selectionimporttrain_test_splitfromsklearn.metricsimportaccuracy_score1.数据预处理iris=load_iris()df=pd.DataFrame(data=ir
数据结构 | 栈和队列 TT-Kun 数据结构与算法数据结构栈队列 C语言
文章目录栈和队列1.栈：后进先出（LIFO）的数据结构1.1概念与结构1.2栈的实现2.队列：先进先出（FIFO）的数据结构2.1概念与结构2.2队列的实现3.栈和队列算法题3.1有效的括号3.2用队列实现栈3.3用栈实现队列3.4设计循环队列结论栈和队列在计算机科学中，栈和队列是两种基本且重要的数据结构，它们在处理数据存储和访问顺序方面有着独特的规则和应用。本文将详细介绍栈和队列的概念、结构、实
[Python] 数据结构详解及代码 AIAdvocate 算法 python 数据结构链表
今日内容大纲介绍数据结构介绍列表链表1.数据结构和算法简介程序大白话翻译,程序=数据结构+算法数据结构指的是存储,组织数据的方式.算法指的是为了解决实际业务问题而思考思路和方法,就叫:算法.2.算法的5大特性介绍算法具有独立性算法是解决问题的思路和方式,最重要的是思维,而不是语言,其(算法)可以通过多种语言进行演绎.5大特性有输入,需要传入1或者多个参数有输出,需要返回1个或者多个结果有穷性,执行
Python算法L5：贪心算法小熊同学哦 Python算法算法 python 贪心算法
Python贪心算法简介目录Python贪心算法简介贪心算法的基本步骤贪心算法的适用场景经典贪心算法问题1.**零钱兑换问题**2.**区间调度问题**3.**背包问题**贪心算法的优缺点优点：缺点：结语贪心算法（GreedyAlgorithm）是一种在每一步选择中都采取当前最优或最优解的算法。它的核心思想是，在保证每一步局部最优的情况下，希望通过贪心选择达到全局最优解。虽然贪心算法并不总能得到全
JAVA·一个简单的登录窗口 MortalTom java 开发语言学习
文章目录概要整体架构流程技术名词解释技术细节资源概要JavaSwing是Java基础类库的一部分，主要用于开发图形用户界面（GUI）程序整体架构流程新建项目，导入sql.jar包（链接放在了文末），编译项目并运行技术名词解释一、特点丰富的组件提供了多种可视化组件，如按钮（JButton）、文本框（JTextField）、标签（JLabel）、下拉列表（JComboBox）等，可以满足不同的界面设计
WebMagic：强大的Java爬虫框架解析与实战 Aaron_945 Java java 爬虫开发语言
文章目录引言官网链接WebMagic原理概述基础使用1.添加依赖2.编写PageProcessor高级使用1.自定义Pipeline2.分布式抓取优点结论引言在大数据时代，网络爬虫作为数据收集的重要工具，扮演着不可或缺的角色。Java作为一门广泛使用的编程语言，在爬虫开发领域也有其独特的优势。WebMagic是一个开源的Java爬虫框架，它提供了简单灵活的API，支持多线程、分布式抓取，以及丰富的
GenVisR 基因组数据可视化实战(三) 11的雾
3.genCov画每个突变位点附件的coverage，跟igv有点相似。这个操作起来很复杂，但是图还是挺有用的。可以考虑。由于我的referencegenomebuild是hg38BiocManager::install(c("TxDb.Hsapiens.UCSC.hg38.knownGene","BSgenome.Hsapiens.UCSC.hg38"))library(TxDb.Hsapien
【RabbitMQ 项目】服务端：数据管理模块之绑定管理月夜星辉雪 rabbitmq 分布式
文章目录一.编写思路二.代码实践一.编写思路定义绑定信息类交换机名称队列名称绑定关键字：交换机的路由交换算法中会用到没有是否持久化的标志，因为绑定是否持久化取决于交换机和队列是否持久化，只有它们都持久化时绑定才需要持久化。绑定就好像一根绳子，两端连接着交换机和队列，当一方不存在，它就没有存在的必要了定义绑定持久化类构造函数：如果数据库文件不存在则创建，打开数据库，创建binding_table插入
Dom 周华华 JavaScript html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml&q
【Spark九十六】RDD API之combineByKey bit1129 spark
1. combineByKey函数的运行机制 RDD提供了很多针对元素类型为(K,V)的API，这些API封装在PairRDDFunctions类中，通过Scala隐式转换使用。这些API实现上是借助于combineByKey实现的。combineByKey函数本身也是RDD开放给Spark开发人员使用的API之一首先看一下combineByKey的方法说明：
msyql设置密码报错：ERROR 1372 (HY000): 解决方法详解 daizj mysql 设置密码
MySql给用户设置权限同时指定访问密码时，会提示如下错误： ERROR 1372 (HY000): Password hash should be a 41-digit hexadecimal number；问题原因：你输入的密码是明文。不允许这么输入。解决办法：用select password('你想输入的密码');查询出你的密码对应的字符串，然后
路漫漫其修远兮吾将上下而求索周凡杨学习思索
王国维在他的《人间词话》中曾经概括了为学的三种境界古今之成大事业、大学问者，罔不经过三种之境界。“昨夜西风凋碧树。独上高楼，望尽天涯路。”此第一境界也。“衣带渐宽终不悔，为伊消得人憔悴。”此第二境界也。“众里寻他千百度，蓦然回首，那人却在灯火阑珊处。”此第三境界也。学习技术，这也是你必须经历的三种境界。第一层境界是说，学习的路是漫漫的，你必须做好充分的思想准备，如果半途而废还不如不要开始。这里，注
Hadoop(二)对话单的操作朱辉辉33 hadoop
Debug： 1、 A = LOAD '/user/hue/task.txt' USING PigStorage(' ') AS (col1,col2,col3); DUMP A; //输出结果前几行示例： (>ggsnPDPRecord(21),,) (-->recordType(0),,) (-->networkInitiation(1),,)
web报表工具FineReport常用函数的用法总结（日期和时间函数）老A不折腾 finereport 报表工具 web开发
web报表工具FineReport常用函数的用法总结（日期和时间函数）说明：凡函数中以日期作为参数因子的，其中日期的形式都必须是yy/mm/dd。而且必须用英文环境下双引号(" ")引用。 DATE DATE(year,month,day):返回一个表示某一特定日期的系列数。 Year:代表年，可为一到四位数。 Month:代表月份。
c++ 宏定义中的##操作符墙头上一根草 C++
#与##在宏定义中的--宏展开 #include <stdio.h> #define f(a,b) a##b #define g(a) #a #define h(a) g(a) int main() { &nbs
分析Spring源代码之，DI的实现 aijuans spring DI 现源代码
(转) 分析Spring源代码之，DI的实现 2012/1/3 by tony 接着上次的讲，以下这个sample [java] view plain copy print
for循环的进化 alxw4616 JavaScript
// for循环的进化 // 菜鸟 for (var i = 0; i < Things.length ; i++) { // Things[i] } // 老鸟 for (var i = 0, len = Things.length; i < len; i++) { // Things[i] } // 大师 for (var i = Things.le
网络编程Socket和ServerSocket简单的使用百合不是茶网络编程基础 IP地址端口
网络编程;TCP/IP协议网络:实现计算机之间的信息共享,数据资源的交换协议:数据交换需要遵守的一种协议,按照约定的数据格式等写出去端口:用于计算机之间的通信每运行一个程序，系统会分配一个编号给该程序，作为和外界交换数据的唯一标识 0~65535 查看被使用的
JDK1.5 生产消费者 bijian1013 java thread 生产消费者 java多线程
ArrayBlockingQueue：一个由数组支持的有界阻塞队列。此队列按 FIFO（先进先出）原则对元素进行排序。队列的头部是在队列中存在时间最长的元素。队列的尾部是在队列中存在时间最短的元素。新元素插入到队列的尾部，队列检索操作则是从队列头部开始获得元素。 ArrayBlockingQueue的常用方法：
JAVA版身份证获取性别、出生日期及年龄 bijian1013 java 性别出生日期年龄
工作中需要根据身份证获取性别、出生日期及年龄，且要还要支持15位长度的身份证号码，网上搜索了一下，经过测试好像多少存在点问题，干脆自已写一个。 CertificateNo.java package com.bijian.study; import java.util.Calendar; import
【Java范型六】范型与枚举 bit1129 java
首先，枚举类型的定义不能带有类型参数，所以，不能把枚举类型定义为范型枚举类，例如下面的枚举类定义是有编译错的 public enum EnumGenerics<T> { //编译错，提示枚举不能带有范型参数 OK, ERROR; public <T> T get(T type) { return null;
【Nginx五】Nginx常用日志格式含义 bit1129 nginx
1. log_format 1.1 log_format指令用于指定日志的格式，格式： log_format name(格式名称) type(格式样式) 1.2 如下是一个常用的Nginx日志格式： log_format main '[$time_local]|$request_time|$status|$body_bytes
Lua 语言 15 分钟快速入门 ronin47 lua 基础
- - 单行注释 - - [[ [多行注释] - - ]] - - - - - - - - - - - 1. 变量 & 控制流 - - - - - - - - - - num = 23 - - 数字都是双精度 str = 'aspythonstring'
java-35.求一个矩阵中最大的二维矩阵 ( 元素和最大 ) bylijinnan java
the idea is from: http://blog.csdn.net/zhanxinhang/article/details/6731134 public class MaxSubMatrix { /**see http://blog.csdn.net/zhanxinhang/article/details/6731134 * Q35 求一个矩阵中最大的二维
mongoDB文档型数据库特点开窍的石头 mongoDB文档型数据库特点
MongoDD: 文档型数据库存储的是Bson文档-->json的二进制特点：内部是执行引擎是js解释器，把文档转成Bson结构，在查询时转换成js对象。 mongoDB传统型数据库对比传统类型数据库：结构化数据，定好了表结构后每一个内容符合表结构的。也就是说每一行每一列的数据都是一样的文档型数据库：不用定好数据结构，
[毕业季节]欢迎广大毕业生加入JAVA程序员的行列 comsci java
一年一度的毕业季来临了。。。。。。。。正在投简历的学弟学妹们。。。如果觉得学校推荐的单位和公司不适合自己的兴趣和专业，可以考虑来我们软件行业，做一名职业程序员。。。软件行业的开发工具中，对初学者最友好的就是JAVA语言了，网络上不仅仅有大量的
PHP操作Excel – PHPExcel 基本用法详解 cuiyadll PHP Excel
导出excel属性设置//Include classrequire_once('Classes/PHPExcel.php');require_once('Classes/PHPExcel/Writer/Excel2007.php');$objPHPExcel = new PHPExcel();//Set properties 设置文件属性$objPHPExcel->getProperties
IBM Webshpere MQ Client User Issue (MCAUSER) darrenzhu IBM jms user MQ MCAUSER
IBM MQ JMS Client去连接远端MQ Server的时候，需要提供User和Password吗？答案是根据情况而定，取决于所定义的Channel里面的属性Message channel agent user identifier (MCAUSER)的设置。 http://stackoverflow.com/questions/20209429/how-mca-user-i
网线的接法 dcj3sjt126com
一、PC连HUB (直连线)A端：（标准568B）：白橙，橙，白绿，蓝，白蓝，绿，白棕，棕。 B端：（标准568B）：白橙，橙，白绿，蓝，白蓝，绿，白棕，棕。二、PC连PC （交叉线）A端：(568A)：白绿，绿，白橙，蓝，白蓝，橙，白棕，棕； B端：（标准568B）：白橙，橙，白绿，蓝，白蓝，绿，白棕，棕。三、HUB连HUB&nb
Vimium插件让键盘党像操作Vim一样操作Chrome dcj3sjt126com chrome vim
什么是键盘党？键盘党是指尽可能将所有电脑操作用键盘来完成，而不去动鼠标的人。鼠标应该说是新手们的最爱，很直观，指哪点哪，很听话！不过常常使用电脑的人，如果一直使用鼠标的话，手会发酸，因为操作鼠标的时候，手臂不是在一个自然的状态，臂肌会处于绷紧状态。而使用键盘则双手是放松状态，只有手指在动。而且尽量少的从鼠标移动到键盘来回操作，也省不少事。在chrome里安装 vimium 插件
MongoDB查询（2）——数组查询[六] eksliang mongodb MongoDB查询数组
MongoDB查询数组转载请出自出处：http://eksliang.iteye.com/blog/2177292 一、概述 MongoDB查询数组与查询标量值是一样的，例如，有一个水果列表，如下所示： > db.food.find() { "_id" : "001", "fruits" : [ "苹
cordova读写文件（1） gundumw100 JavaScript Cordova
使用cordova可以很方便的在手机sdcard中读写文件。首先需要安装cordova插件：file 命令为： cordova plugin add org.apache.cordova.file 然后就可以读写文件了，这里我先是写入一个文件，具体的JS代码为： var datas=null;//datas need write var directory=&
HTML5 FormData 进行文件jquery ajax 上传到又拍云 ileson jquery Ajax html5 FormData
html5 新东西：FormData 可以提交二进制数据。页面test.html <!DOCTYPE> <html> <head> <title> formdata file jquery ajax upload</title> </head> <body> <
swift appearanceWhenContainedIn:(version1.2 xcode6.4) 啸笑天 version
swift1.2中没有oc中对应的方法： + (instancetype)appearanceWhenContainedIn:(Class <UIAppearanceContainer>)ContainerClass, ... NS_REQUIRES_NIL_TERMINATION; 解决方法：在swift项目中新建oc类如下： #import &
java实现SMTP邮件服务器 macroli java 编程
电子邮件传递可以由多种协议来实现。目前，在Internet 网上最流行的三种电子邮件协议是SMTP、POP3 和 IMAP，下面分别简单介绍。　　◆ SMTP 协议　　简单邮件传输协议(Simple Mail Transfer Protocol,SMTP)是一个运行在TCP/IP之上的协议，用它发送和接收电子邮件。SMTP 服务器在默认端口25上监听。SMTP客户使用一组简单的、基于文本的
mongodb group by having where 查询sql qiaolevip 每天进步一点点学习永无止境 mongo 纵观千象
SELECT cust_id, SUM(price) as total FROM orders WHERE status = 'A' GROUP BY cust_id HAVING total > 250 db.orders.aggregate( [ { $match: { status: 'A' } }, { $group: {
Struts2 Pojo（六） Luob. POJO strust2
注意：附件中有完整案例 1.采用POJO对象的方法进行赋值和传值 2.web配置 <?xml version="1.0" encoding="UTF-8"?> <web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee&q
struts2步骤 wuai struts
1、添加jar包 2、在web.xml中配置过滤器 <filter> <filter-name>struts2</filter-name> <filter-class>org.apache.st