Predicting E-Commerce Sales with a Random Forest Regression
The Data Set
In order to demonstrate a random forest regression, a data set of e-commerce sales from the popular online retailer Wish will be used. The data comes from Kaggle and only features sales information on summer clothing. The attributes include product descriptions, ratings, whether ad boosts were used, whether urgency text was added to the product listing, and the number of units sold, among others.
To show the power of the random forest regression, the number of units sold will be predicted. Making accurate predictions would be invaluable not only to inventory planners, who need to estimate how much product to order or produce, but also to sales teams, who need to understand how product moves in an e-commerce setting.
Importing and Cleaning the Data
All data imports and manipulations will be done through Python along with the pandas and NumPy libraries.
import pandas as pd
import numpy as np

# Import the data saved as a CSV
df = pd.read_csv("Summer_Sales_08.2020.csv")
The first two lines simply import the pandas and numpy libraries. The final line reads a CSV file previously saved and renamed to “Summer_Sales_08.2020” and creates a data frame.
df["has_urgency_banner"] = df["has_urgency_banner"].fillna(0)
df["discount"] = (df["retail_price"] - df["price"])/df["retail_price"]
When reviewing the data, the “has_urgency_banner” column, which indicates whether an urgency banner was applied to the product listing, was found to be coded improperly. Instead of using 1’s and 0’s, it simply leaves a blank when a banner wasn’t used. The first line of code fills those blanks with 0's.
The second line creates a new column called “discount” which calculates the discount on the product compared to the listed retail price.
df["rating_five_percent"] = df["rating_five_count"]/df["rating_count"]
df["rating_four_percent"] = df["rating_four_count"]/df["rating_count"]
df["rating_three_percent"] = df["rating_three_count"]/df["rating_count"]
df["rating_two_percent"] = df["rating_two_count"]/df["rating_count"]
df["rating_one_percent"] = df["rating_one_count"]/df["rating_count"]
The original data set includes several columns dedicated to the products’ ratings. In addition to an average rating, it also included the total number of ratings and the number of five, four, three, two, and one star reviews. Since the total number of reviews will already be considered, it’s better to look at star ratings as a percent of total ratings, so direct comparisons between products may be made.
The lines above simply create five new columns giving the percent of five, four, three, two, and one star reviews for every product in the data set.
ratings = [
    "rating_five_percent",
    "rating_four_percent",
    "rating_three_percent",
    "rating_two_percent",
    "rating_one_percent"
]

for rating in ratings:
    df[rating] = df[rating].apply(lambda x: x if x >= 0 and x <= 1 else 0)
While pandas doesn’t throw an error when dividing by 0 (it quietly produces NaN or inf instead), those values create issues when trying to analyze the data. In this case, products with 0 ratings would produce invalid percentages in the previous step.
The above code snippet goes through all the freshly made columns and checks that the values entered are between 0 and 1, inclusive. If they aren’t (note that NaN fails both comparisons), they’re replaced with 0, which is an adequate substitute.
Data Exploration
import seaborn as sns

# Distribution plot on price
# (distplot is deprecated in recent seaborn versions;
#  sns.histplot(df['price'], kde=True) is the modern equivalent)
sns.distplot(df['price'])
The above code produces a distribution plot of the price across all the products in the data set. The most obvious and interesting insight is that there are no products that cost €10. This is probably a deliberate effort made by merchants to get their products on “€10 & Under” lists.
sns.jointplot(x = "rating", y = "units_sold", data = df, kind = "scatter")
The above figure reveals that the vast majority of sales are made on items with between three and four-and-a-half star ratings. It also reveals that most products have fewer than 20,000 units sold, with a few items reaching 60,000 and 100,000 units sold, respectively.
As an aside, the tendency of the scatter plot to organize into lines is evidence that the units sold figures are more likely estimates than hard numbers.
sns.jointplot(x = "rating_count", y = "units_sold", data = df, kind = "reg")
This graph demonstrates the other side of ratings. There’s a loose, but positive relationship between the number of ratings and the likelihood a product sells. This might be because consumers look at both the overall rating and the number of ratings when considering a purchase or because high-selling products just naturally produce more ratings.
Without additional data on when purchases were made and when ratings were posted, and without more domain knowledge, it’s difficult to discern the cause of the correlation.
What is the Random Forest Regression?
In brief, a random forest regression is the average result of a series of decision trees. A decision tree is like a flow chart that asks a bunch of questions and makes a prediction based on the answer to those questions. For example, a decision tree trying to predict if a tennis player will go to the court might ask: Is it raining? If so, is the court indoors? If not, can the player find a partner?
The decision tree will then answer each of those questions before it arrives at a prediction. While decision trees are easy to understand and, according to some experts, model actual human behavior better than other machine learning techniques, they often overfit the data, which means they can give wildly different results on similar data sets.
To address this issue, multiple decision trees are trained on bootstrap samples drawn from the same data set (a technique known as bagging), and the average of their results is returned. This is known as the random forest regression.
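The bagging-and-averaging idea can be sketched directly with scikit-learn’s DecisionTreeRegressor. The tiny synthetic data set and tree count below are illustrative, not from the Wish data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data: a noisy non-linear signal
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Train several trees, each on a bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# The forest's prediction is the average of the individual trees' predictions
forest_pred = np.mean([tree.predict(X) for tree in trees], axis=0)
```

Each individual tree overfits its bootstrap sample, but averaging them smooths out the idiosyncrasies, which is exactly what RandomForestRegressor automates.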
Its main advantage is making accurate predictions on highly non-linear data. In the Wish data set, a non-linear relationship is seen in the ratings. There isn’t a nice, easily seen correlation, but the cutoff below three stars and above four and a half is plainly visible. The random forest regression can recognize this pattern and incorporate it in its results. A more traditional linear regression, however, would only have its predictions muddied by it.
In addition, the random forest regression is efficient, can handle many input variables, and usually makes accurate predictions. It’s an incredibly powerful tool and doesn’t take much code to implement.
Implementing a Random Forest Regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Divide the data between units sold and influencing factors
X = df.filter([
    "price",
    "discount",
    "uses_ad_boosts",
    "rating",
    "rating_count",
    "rating_five_percent",
    "rating_four_percent",
    "rating_three_percent",
    "rating_two_percent",
    "rating_one_percent",
    "has_urgency_banner",
    "merchant_rating",
    "merchant_rating_count",
    "merchant_has_profile_picture"
])
Y = df["units_sold"]

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
Before running any model, the first two lines import the relevant libraries. The next set of lines creates two variables, X and Y, which are then split into training and testing sets. With a test size of 0.33, roughly two-thirds of the data will be used to train the model and one-third will be used to test its accuracy.
# Set up and run the model
RFRegressor = RandomForestRegressor(n_estimators = 20)
RFRegressor.fit(X_train, Y_train)
Next, the model is actually initialized and run. Note that the parameter n_estimators indicates the number of decision trees to be used.
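To get a feel for what this parameter does, a quick sweep on a synthetic data set (the data, tree counts, and variable names here are illustrative) shows how averaging more trees tends to stabilize performance on held-out data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative synthetic data with a non-linear target
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(300, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(0, 0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

# Compare forests with different numbers of trees
scores = {}
for n in (1, 5, 20):
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    scores[n] = model.fit(X_tr, y_tr).score(X_te, y_te)  # R^2 on held-out data
```

More trees cost more computation, so in practice n_estimators is a trade-off between accuracy and training time.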
predictions = RFRegressor.predict(X_test)
error = Y_test - predictions
Finally, the newly fitted random forest regression is applied to the testing data and the difference is taken to produce an error array. That’s all there is to it!
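From here, the error array can be summarized however best suits the problem. A common sketch (with hypothetical error values standing in for Y_test - predictions) is:

```python
import numpy as np

# Hypothetical error values standing in for Y_test - predictions
errors = np.array([120.0, -340.0, 55.0, -10.0, 480.0])

mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
```

MAE reads in the same units as units sold, while RMSE penalizes large misses more heavily, which matters if a single badly under-stocked product is costlier than several small errors.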
Conclusions
The Wish data set presents a playground of numbers that can be used to solve real world problems. With only minimal data manipulation, the random forest regression proved to be an invaluable tool in analyzing this data and providing tangible results.
Translated from: https://towardsdatascience.com/predicting-e-commerce-sales-with-a-random-forest-regression-3f3c8783e49b