

All Python scripts related to this study are available on my GitHub page. The same repository also contains the scripts used for data cleaning and data visualization.

Content

  1. Data Cleaning (Identifying null values, filling missing values and removing outliers)

  2. Data Preprocessing (Standardization or Normalization)

  3. ML Models: Linear Regression, Ridge Regression, Lasso, KNN, Random Forest Regressor, Bagging Regressor, Adaboost Regressor, and XGBoost

  4. Comparison of the performance of the models

  5. Some insights from data


Why is the price feature scaled by a log transformation?

A linear regression model assumes that, for any fixed value of X, Y is normally distributed. In this problem, the target value (price) is not normally distributed; it is right-skewed.

To address this, a log transformation is applied to the skewed target variable before training, and the inverse function is applied to the predicted values to recover the predicted price on the original scale.

Because the model is trained on log-prices, the RMSLE (root mean squared logarithmic error) is calculated to check the error, and the R² score is also calculated to evaluate the accuracy of the model.
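The transformation, its inverse, and the RMSLE can be sketched as follows. This is a minimal illustration, not the study's actual code: the price values and the "predictions" are made up, and `np.log1p` / `np.expm1` (log(1 + x) and its inverse) are an assumed choice that also matches the usual RMSLE definition.

```python
import numpy as np

# Hypothetical right-skewed prices (illustrative values, not from the dataset).
prices = np.array([1500.0, 4200.0, 7900.0, 12500.0, 31000.0, 88000.0])

# Forward transform: the model is trained on log(1 + price) instead of price.
log_prices = np.log1p(prices)

# Stand-in for model output on the log scale (true value plus a small error).
log_errors = np.array([0.10, -0.05, 0.02, -0.08, 0.04, -0.03])
log_predictions = log_prices + log_errors

# Inverse transform: map predictions back to the original price scale.
predicted_prices = np.expm1(log_predictions)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed from original-scale values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


error = rmsle(prices, predicted_prices)
print(f"RMSLE: {error:.4f}")
```

Because the RMSLE compares values on the log scale, it equals the plain RMSE of the log-scale predictions, which is why it pairs naturally with a log-transformed target.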


Some Key Concepts:

  • Learning Rate: The learning rate is a hyperparameter that controls how much the model's weights are adjusted with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. A low learning rate helps make sure that we do not miss any local minima, but it can also mean that convergence takes a long time, especially if we get stuck on a plateau region.

  • n_estimators: The number of trees to build before taking the majority vote or the average of their predictions. A higher number of trees generally gives better performance but makes the code slower.

  • R² Score: A statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean.
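To make the R² definition concrete, here is a small sketch that computes it by hand as 1 − SS_res / SS_tot. The helper name `r2_score_manual` and the toy values are illustrative assumptions; in practice `sklearn.metrics.r2_score` does the same job.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


y = np.array([3.0, 5.0, 7.0, 9.0])

# Always predicting the mean explains none of the variability.
baseline = r2_score_manual(y, np.full_like(y, y.mean()))

# A perfect prediction explains all of it.
perfect = r2_score_manual(y, y)

print(baseline, perfect)
```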

1. The Data:

The dataset used in this project was downloaded from Kaggle.


2. Data Cleaning:

The first step is to remove irrelevant features such as 'URL', 'region_url', 'vin', 'image_url', 'description', 'county', and 'state' from the dataset.

As a next step, check missing values for each feature.
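These two steps might look like the sketch below in pandas. The tiny DataFrame is a hypothetical stand-in for the Kaggle file, and the lowercase column name 'url' is an assumption about how the column is spelled in the data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the Kaggle data.
df = pd.DataFrame({
    "url": ["a", "b", "c"],
    "vin": ["x", None, "z"],
    "price": [4500, 12000, 7300],
    "year": [2008.0, np.nan, 2015.0],
    "odometer": [120000.0, 65000.0, np.nan],
})

irrelevant = ["url", "region_url", "vin", "image_url", "description", "county", "state"]

# errors="ignore" skips names that are absent from this toy frame.
df = df.drop(columns=irrelevant, errors="ignore")

# Number of missing values per remaining feature.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```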


Showing missing values (Image by Panwar Abhash Anil)

Next, the missing values were filled in.

To fill the missing values, the IterativeImputer method is used with several different estimators, and the MSE of each estimator is calculated using cross_val_score:


  1. Mean and Median

  2. BayesianRidge Estimator

  3. DecisionTreeRegressor Estimator

  4. ExtraTreesRegressor Estimator

  5. KNeighborsRegressor Estimator

MSE with Different Imputation Methods (Image by Panwar Abhash Anil)

From the above figure, we can conclude that the ExtraTreesRegressor estimator is the best choice for imputing the missing values.
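A comparison of this kind can be sketched with scikit-learn as follows. This is an assumption-laden illustration rather than the study's script: the data are synthetic, the downstream regressor (BayesianRidge) and the estimator settings are arbitrary choices, and the ranking it prints says nothing about the ranking on the real dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 4 features (two of them correlated), numeric target.
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, 0] + 0.1 * rng.normal(size=200)
y = X_full @ np.array([1.0, -2.0, 0.5, 1.5]) + 0.1 * rng.normal(size=200)

# Knock out roughly 20% of the entries to simulate missing values.
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.2] = np.nan

imputers = {
    "Mean": SimpleImputer(strategy="mean"),
    "Median": SimpleImputer(strategy="median"),
    "BayesianRidge": IterativeImputer(estimator=BayesianRidge(), random_state=0),
    "DecisionTreeRegressor": IterativeImputer(
        estimator=DecisionTreeRegressor(max_features="sqrt", random_state=0),
        random_state=0),
    "ExtraTreesRegressor": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        random_state=0),
    "KNeighborsRegressor": IterativeImputer(
        estimator=KNeighborsRegressor(n_neighbors=15), random_state=0),
}

mse_by_imputer = {}
for name, imputer in imputers.items():
    # Impute, then fit a simple downstream regressor; score by cross-validated MSE.
    pipeline = make_pipeline(imputer, BayesianRidge())
    scores = cross_val_score(pipeline, X_missing, y,
                             scoring="neg_mean_squared_error", cv=5)
    mse_by_imputer[name] = -scores.mean()

for name, mse in sorted(mse_by_imputer.items(), key=lambda item: item[1]):
    print(f"{name:>22}: MSE = {mse:.3f}")
```

Putting the imputer inside the pipeline means it is refitted on each training fold, so the cross-validated MSE is not flattered by information from the held-out fold.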

Finally, after dealing with the missing values, there are zero null values left.

Outliers: The interquartile range (IQR) method is used to remove the outliers from the data.

  • From figure 1, prices whose log is below 6.55 or above 11.55 are outliers.

  • From figure 2, nothing can be concluded visually, so the IQR is calculated to find outliers, i.e. odometer values below 6.55 or above 11.55 are outliers.

  • From figure 3, years below 1995 or above 2020 are outliers.
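The IQR rule behind these cut-offs can be sketched as below. The fence multiplier k=1.5 is the conventional default and an assumption here (the article does not state it), and the log-price values are invented for illustration.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical log-prices with one obvious outlier at each end.
df = pd.DataFrame({"log_price": [2.0, 8.5, 8.9, 9.1, 9.3, 9.6, 9.8, 10.1, 10.4, 14.0]})

lower, upper = iqr_bounds(df["log_price"])

# Keep only the rows that fall inside the fences.
cleaned = df[df["log_price"].between(lower, upper)]
print(f"kept {len(cleaned)} of {len(df)} rows; bounds = ({lower:.2f}, {upper:.2f})")
```

The same helper can be applied column by column (for example to odometer and year) before filtering the DataFrame.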


Finally, the shape of the dataset was (435849, 25) before this process and (374136, 18) after it. In total, 61713 rows and 7 columns were removed.

3. Data preprocessing:

Label Encoder: In our dataset, 12 features are categorical variables and 4 are numerical variables (price column excluded). To apply the ML models, we need to transform these categorical variables into numerical variables. The sklearn LabelEncoder is used for this.

Normalization: The dataset is not normally distributed, and the features have different ranges. Without normalization, the ML model will tend to disregard the coefficients of features that have low values, because their impact is so small compared to features with big values. Hence, the sklearn MinMaxScaler is used to normalize the features.

Train/test split: 90% of the data was used as training data and the remaining 10% was taken as test data.
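The three preprocessing steps can be sketched together as follows. The DataFrame is a small hypothetical sample, and the `random_state` is an arbitrary choice. One deliberate deviation is an assumption of this sketch rather than the article's stated procedure: the scaler is fitted on the training split only, a common precaution against leaking test-set ranges into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Small hypothetical frame; column names mirror features mentioned in this article.
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric", "hybrid",
             "gas", "diesel", "gas", "gas", "hybrid"],
    "transmission": ["automatic", "manual", "automatic", "other", "automatic",
                     "manual", "automatic", "automatic", "other", "manual"],
    "year": [2005, 2012, 2018, 2019, 2015, 1999, 2010, 2016, 2020, 2008],
    "odometer": [150000, 90000, 30000, 12000, 60000,
                 210000, 110000, 45000, 5000, 130000],
    "price": [3500, 14000, 22000, 38000, 17000, 1800, 9500, 16500, 41000, 6000],
})

# 1) Label-encode each categorical column (one encoder per column).
encoders = {}
for column in ["fuel", "transmission"]:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

X = df.drop(columns="price")
y = df["price"]

# 2) Hold out 10% of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# 3) Rescale every feature to [0, 1], fitting the scaler on the training split only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Keeping the fitted encoders in a dictionary makes it possible to call `inverse_transform` later and read the original category names back.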

4. ML Models:

In this section, different machine learning algorithms are used to predict the price (the target variable).

This is a supervised learning problem, so the models are applied in the following order:


  1. Linear Regression


  2. Ridge Regression


  3. Lasso Regression


  4. K-Neighbors Regressor


  5. Random Forest Regressor


  6. Bagging Regressor


  7. Adaboost Regressor


  8. XGBoost


1) Linear Regression:

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In linear regression, the relationships are modelled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models.

Coefficients: The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.


  • A positive sign indicates that as the predictor variable increases, the response variable also increases.

  • A negative sign indicates that as the predictor variable increases, the response variable decreases.

Considering this figure, linear regression suggests that these five variables are the most important: year, cylinder, transmission, fuel, and odometer.
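How coefficient signs are read can be illustrated with a small sketch. The data are synthetic and built so that log-price rises with year and falls with odometer; that relationship is an assumption of the toy example, not a result from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, already-scaled features with a known relationship to log-price.
year = rng.uniform(0, 1, size=200)
odometer = rng.uniform(0, 1, size=200)
log_price = 9.0 + 2.0 * year - 1.5 * odometer + 0.05 * rng.normal(size=200)

X = np.column_stack([year, odometer])
model = LinearRegression().fit(X, log_price)

for name, coef in zip(["year", "odometer"], model.coef_):
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name}: coefficient {coef:+.3f} -> predicted log-price {direction} as {name} grows")
```

Comparing coefficient magnitudes as a measure of importance is only meaningful when the features share a common scale, which is one reason the features are normalized first.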


2) Ridge Regression:

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value.

To find the best alpha value for ridge regression, AlphaSelection from the yellowbrick library was applied.

Graph showing best value of Alpha

From the figure, the best value of alpha to fit the dataset is 20.336.

Note: The value of alpha is not constant; it varies from run to run.

Using this value of alpha, the Ridge regressor is fitted.
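The alpha search can be approximated without yellowbrick by using scikit-learn's RidgeCV directly, as in the sketch below. This swaps in RidgeCV for the AlphaSelection visualizer, so it produces no plot; the data are synthetic and the alpha grid is an assumed choice.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(0)

# Synthetic data with two nearly collinear features (columns 0 and 4).
X = rng.normal(size=(300, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=300)
y = X @ np.array([1.0, 0.5, -2.0, 0.0, 1.0]) + 0.5 * rng.normal(size=300)

# Candidate regularization strengths on a log scale (assumed grid).
alphas = np.logspace(-3, 3, 50)

# RidgeCV evaluates every alpha by cross-validation and keeps the best one.
search = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = search.alpha_

# Refit a plain Ridge model with the selected alpha.
ridge = Ridge(alpha=best_alpha).fit(X, y)
print(f"best alpha on this toy data: {best_alpha:.3f}")
```

The selected alpha depends on the data and on the candidate grid, so a different split or grid can give a different value.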


Graph showing Important Features

Considering this figure, Ridge regression suggests that these five variables are the most important: year, cylinder, transmission, fuel, and odometer.


The performance of ridge regression is almost the same as Linear Regression.


3) Lasso Regression:

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).

Why is Lasso regression used?

The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero.

For this dataset, however, there is no need for lasso regression, as there is not much difference in error.
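The shrink-to-zero behaviour of the lasso can be seen in a small sketch. The data are synthetic, with only two informative features out of six, and alpha=0.1 is an arbitrary illustrative value rather than a tuned one.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: only the first two of six features influence the target.
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)  # alpha=0.1 is an arbitrary illustrative choice

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("features kept by lasso:", np.flatnonzero(lasso.coef_))
```

Ordinary least squares assigns a small non-zero weight to every feature, while the L1 penalty sets the weights of uninformative features exactly to zero, which is what makes the lasso usable for feature selection.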


4) KNeighbors Regressor: Regression based on k-nearest neighbors.

The target is predicted by local interpolation of the targets associated with the nearest neighbours in the training set.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation.

From the above figure, KNN gives the least error for k=5. So the model is trained using n_neighbors=5 and metric='euclidean'.

KNN performs better than the previous models: the error decreases and the accuracy increases.
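Choosing k by scanning a range of values can be sketched as follows. The data are synthetic, the range 1 to 10 is an assumption, and the best k found here applies only to this toy data. Scoring on a single held-out split keeps the example short; a validation split or cross-validation would be a more robust way to pick k.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic nonlinear data standing in for the scaled car features.
X = rng.uniform(0, 1, size=(500, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

errors = {}
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    errors[k] = mean_squared_error(y_test, knn.predict(X_test))

best_k = min(errors, key=errors.get)
print(f"k with the lowest held-out MSE on this toy data: {best_k}")
```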


5) Random Forest:

A random forest is an ensemble algorithm consisting of many decision trees; it can be used for classification and, as here, for regression. It uses bagging and feature randomness when building each tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

In our model, 180 decision trees are created with max_features=0.5.

Performance of Random Forest (True value vs predicted value)

The simple bar plot of feature importances illustrates that year is the most important feature of a car, followed by odometer and then the others.
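A forest with the settings quoted above, and the feature-importance ranking behind such a bar plot, can be sketched like this. The data are synthetic and constructed so that "year" dominates; the feature names and coefficients are assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["year", "odometer", "fuel", "transmission"]

# Synthetic data in which "year" dominates the target by construction.
X = rng.uniform(0, 1, size=(400, 4))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * X[:, 2] + 0.05 * rng.normal(size=400)

# n_estimators and max_features follow the values quoted in the text.
forest = RandomForestRegressor(n_estimators=180, max_features=0.5, random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:>12}: {importance:.3f}")
```

The importances are impurity-based and sum to 1, so they describe relative contribution within this model rather than an absolute effect on price.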


The random forest performs better, and accuracy increases by approx. 10%, which is good. Since the random forest uses bagging when building each tree, the Bagging Regressor is tried next.

6) Bagging Regressor:

A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

In our model, DecisionTreeRegressor with max_depth=20 is used as the base estimator, 50 decision trees are created, and the results are shown below.

The performance of Random Forest is much better than that of the Bagging regressor.

The key difference between Random Forest and Bagging: in random forests, only a random subset of the features is considered at each node, and the best split feature from that subset is used to split the node, unlike in bagging, where all features are considered for splitting a node.
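The difference between the two ensembles shows up directly in how they are configured. The sketch below fits both on the same synthetic data; the data and the forest's n_estimators=50 are assumptions, and the scores it prints are not the study's results.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 6))
y = 3.0 * X[:, 0] - X[:, 1] + np.sin(6 * X[:, 2]) + 0.1 * rng.normal(size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Bagging: each tree sees a bootstrap sample and may split on any feature.
bagging = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                           n_estimators=50, random_state=0)

# Random forest: bootstrap samples plus a random feature subset at every split.
forest = RandomForestRegressor(n_estimators=50, max_features=0.5, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out split
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Which of the two scores higher depends on the data; the feature subsampling in the forest mainly helps when a few strong features would otherwise make all the trees look alike.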


7) AdaBoost Regressor:

AdaBoost can be used to boost the performance of any machine learning algorithm. AdaBoost helps you combine multiple "weak classifiers" into a single "strong classifier". Library used: AdaBoostRegressor.

The simple bar plot illustrates that year is the most important feature of a car, followed by odometer, then model, and so on.

In our model, DecisionTreeRegressor with max_depth=24 is used as the base estimator, 200 trees are created, and the model is trained with learning_rate=0.6. The result is shown below.
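A configuration along these lines can be sketched with scikit-learn as follows. The hyperparameters mirror the ones quoted above, while the data are synthetic and the `random_state` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 3.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300)

# The base tree is passed positionally because its keyword name
# differs between scikit-learn versions.
booster = AdaBoostRegressor(DecisionTreeRegressor(max_depth=24),
                            n_estimators=200, learning_rate=0.6, random_state=0)
booster.fit(X, y)

train_r2 = booster.score(X, y)
print(f"trees fitted: {len(booster.estimators_)}, training R^2: {train_r2:.3f}")
```

n_estimators is an upper bound: boosting can stop early, so the number of trees actually fitted may be smaller than 200.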

8) XGBoost: XGBoost stands for eXtreme Gradient Boosting

XGBoost is an ensemble learning method. It is an implementation of gradient boosted decision trees designed for speed and performance. The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage.

The simple bar plot, sorted in descending order of importance, illustrates which features of a car are the most important.

According to XGBoost, odometer is the most important feature, whereas the previous models ranked year as the most important feature.

In this model, 200 decision trees are created with a max depth of 24, and the model learns with a learning rate of 0.4.


4) Comparison of the performance of the models:

From the above figures, we can conclude that the XGBoost regressor, with 89.662% accuracy, performs better than the other models.


5) Some insights from the dataset:

1. From the pair plot, we can't conclude anything. There is no clear correlation between the variables.

Pair Plot to Find Correlation

2. From the distplot, we can conclude that initially the price increases rapidly, but after a particular point the price starts decreasing.

3. From figure 1, we see that diesel cars have the highest price, followed by electric cars. Hybrid cars have the lowest price.

Bar Plot showing the price of each fuel type

4. From figure 2, we see that the car price for each fuel type also depends upon the condition of the car.

Bar Plot between fuel and price with hue condition

5. From figure 3, we see that car prices increase each year after 1995, and from figure 4, the number of cars also increases each year, and at some point, i.e. in 2012, the number of cars is nearly the same.

Graph showing how the price varies per year

6. From figure 5, we see that the price of a car also depends upon its condition, and from figure 6, the price varies with the condition of the car and also with its size.

Bar Plot showing the price with respect to the condition of the car

7. From figures 7–8, we see that the price of a car also varies with its transmission. People are ready to buy cars with "other" transmission, while the price of cars with "manual" transmission is low.

8. Below there are similar graphs with the same insight but different features.



Conclusion:

By trying different ML models, we aimed to get the lowest error and the highest accuracy. Our purpose was to predict the price of used cars from a dataset with 25 predictors and 509577 data entries.

First, data cleaning was performed to remove the null values and outliers from the dataset; then ML models were implemented to predict the price of cars.

Next, the features were explored in depth with the help of data visualization, and the relations between the features were examined.

From the table below, it can be concluded that XGBoost is the best model for predicting used car prices. XGBoost as a regression model gave the best MSLE and RMSLE values.

