Car Price Prediction with Machine Learning Models (Part 2)
In my last article, I gave a gentle overview of what machine learning is about and the framework for building a machine learning model. The data was cleaned and some exploratory data analysis was carried out to gain insight into the data. In this article, we will continue from there. If you haven't seen the last post, click here to read it so you can follow along easily.
Feature Engineering
Feature engineering is a vital preprocessing step in building any machine learning model. It involves deriving features from the data using domain knowledge. First, the 'Location' and 'Engine' columns were dropped from the dataframe: 'Location' because it does not affect the price, and 'Engine' because its entries were mostly unique, so the algorithm cannot learn anything from the feature.
ML algorithms deal purely with numerical inputs and outputs. Therefore, the string-type/categorical features were converted to numerical data. This was done with one-hot encoding, which creates a separate column for every unique entry in a column and fills in 1 to indicate an entry's presence and 0 otherwise. The get_dummies method performed this operation.
import pandas as pd

data = data.drop(['Location', 'Engine'], axis=1)
# Collect the object-typed (categorical) columns and one-hot encode them
cat_features = [x for x in data.columns if data[x].dtype == 'O']
data = pd.get_dummies(data, columns=cat_features)
The data was then split into train and test sets. The ML model learns from the train data, while the test data, as the name implies, is used to check how well the model has learnt. Outliers and other anomalies in the data distribution were handled with feature scaling, which compresses the data into a particular range of values. Features can either be standardized or normalized. In this case, the data was standardized using the StandardScaler class. The scaler was fitted on the train data alone, since the test data should be treated as unseen data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the features or independent variables
X = data.drop(['Price'], axis=1)
# Define the label or dependent variable
y = data['Price']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the continuous features of the train data only
# (continuous_features holds the numeric columns, carried over from the previous article)
scaler = StandardScaler()
scaler.fit(X_train[continuous_features[0:-1]])

# Transform both the train and test data with the fitted scaler
X_train[continuous_features[0:-1]] = scaler.transform(X_train[continuous_features[0:-1]])
X_test[continuous_features[0:-1]] = scaler.transform(X_test[continuous_features[0:-1]])
Now that our data looks good for the ML algorithms, let’s evaluate how some common regression algorithms will perform on the data.
Model Evaluation
Five algorithms were evaluated: the linear regressor, ridge regressor, lasso regressor, decision tree regressor, and random forest regressor.
- Linear regressor finds a line that best fits all the features and predicts the output for a completely new input
- Ridge regressor uses an L2 regularization technique to reduce the complexity of the model through coefficient shrinkage
- Lasso regressor uses an L1 regularization technique to completely eliminate some features and make predictions
- A decision tree regressor splits the data into smaller subdivisions. It asks questions about the data and provides yes or no answers. Each answer improves the confidence of a correct prediction
- A random forest regressor aggregates the predictions of several decision trees to produce a more stable prediction
K-fold cross-validation was used to check how these models would perform on our data. Cross-validation is necessary because it produces results with low bias and low variance. In cross-validation, the data is split into different folds; each fold is used in turn as test data while the rest serve as train data. The average result across the folds therefore reveals how the algorithm performs across the board.
Ten splits were used, and the scoring metric was the r-squared (also called r2) score, a measure of how well the model's predictions match the actual values.
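For reference, the r2 score compares the model's squared error against that of a naive predictor that always outputs the mean: r2 = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)². A score of 1 indicates a perfect fit, while a score of 0 means the model does no better than predicting the mean.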
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Create a list of ML algorithms
models = []
models.append(('Linear Regression', LinearRegression()))
models.append(('Ridge Regression', Ridge()))
models.append(('Lasso Regression', Lasso()))
models.append(('Decision Tree', DecisionTreeRegressor()))
models.append(('Random Forest', RandomForestRegressor()))

# Evaluate each model with 10-fold cross-validation
for name, model in models:
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    score = cross_val_score(model, X, y, cv=cv, scoring='r2')
    print(f"{name} has an r2 score: {np.round(score.mean() * 100, 2)}%, and SD: {np.round(score.std(), 4)}")
From the result, you will quickly see that the random forest performs best on the data, with an r-squared of 89.55%. Consequently, the random forest regressor was adopted.
from sklearn.metrics import r2_score
from sklearn import metrics

# Check the model's performance on the test data
rf = RandomForestRegressor()
%time rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(f'r2 score: {np.round(r2_score(y_test, y_pred), 4) * 100}%')
print(f'Mean absolute error: {metrics.mean_absolute_error(y_test, y_pred)}')
print(f'Mean squared error: {metrics.mean_squared_error(y_test, y_pred)}')
Upon fitting the model to the data, the r2 score was 87.32%, which was pretty good. Checking other metrics, the mean absolute error (1.60) and the mean squared error (14.87) were quite low, which was indeed impressive.
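For context, the MAE is the average of |yᵢ − ŷᵢ| and the MSE the average of (yᵢ − ŷᵢ)². Since the prices are in Lakh Rupees, an MAE of 1.60 means the predictions are off by about 1.6 Lakh on average.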
Hyperparameter Tuning
An r2 score of 87.32% is good, but it can still be bumped up a little. A great way of improving the result is to tweak the algorithm's hyperparameters. GridSearchCV and RandomizedSearchCV search through different combinations of hyperparameters and return the combination that produces the best result. RandomizedSearchCV is less computationally intensive, however, so it was used to obtain the best hyperparameters.
After a list of values was defined for each parameter, RandomizedSearchCV found the best estimator and parameters for the data. Using these parameters, the r-squared score increased to 87.93%. This was adopted as the final model.
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['sqrt', 'auto']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(5, 50, num=10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Define the parameter grid
params = {'n_estimators': n_estimators,
          'max_features': max_features,
          'max_depth': max_depth,
          'min_samples_split': min_samples_split,
          'min_samples_leaf': min_samples_leaf,
          'bootstrap': bootstrap}

# Apply RandomizedSearchCV with the defined parameters
model_search = RandomizedSearchCV(rf, param_distributions=params, scoring='r2')
%time model_search.fit(X_train, y_train)
y_pred_op = model_search.predict(X_test)

# Check metrics
print(r2_score(y_test, y_pred_op))
# Print the best combination of parameters
print(model_search.best_params_)
To tie everything together, a function was defined to estimate the price of a car using the final model.
# Define a function that implements the model
def predict_price(name, year, km, fuel, transmission, owner, mileage, power, seats):
    # Find the column positions of the one-hot encoded categorical features
    name_index = np.where(X.columns == 'Name_' + name.upper())[0][0]
    fuel_index = np.where(X.columns == 'Fuel_Type_' + fuel)[0][0]
    transmission_index = np.where(X.columns == 'Transmission_' + transmission)[0][0]
    owner_index = np.where(X.columns == 'Owner_Type_' + owner)[0][0]

    # Assign each of the inputted numerical features its value
    x = np.zeros(len(X.columns))
    x[0] = year
    x[1] = km
    x[2] = mileage
    x[3] = power
    x[4] = seats

    # Set the matching one-hot encoded columns to 1
    if name_index >= 0:
        x[name_index] = 1
    if fuel_index >= 0:
        x[fuel_index] = 1
    if transmission_index >= 0:
        x[transmission_index] = 1
    if owner_index >= 0:
        x[owner_index] = 1

    return f'The estimated price of the car is {model_search.predict([x])[0]} Lakh Rupees'
Check out some of the estimations.
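As a quick illustration, a call to the function might look like the sketch below. The feature values are purely hypothetical and only show the expected input format; the categorical arguments must match entries that actually exist in the one-hot encoded columns of X.

# Hypothetical example values; 'Maruti', 'Petrol', 'Manual' and 'First'
# must correspond to existing one-hot encoded columns in X
predict_price('Maruti', 2015, 41000, 'Petrol', 'Manual', 'First', 19.67, 126.2, 5)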
This final model can now be saved and used on completely unseen data. What's more, it can be deployed to production using any of the available cloud platforms.
import pickle

# Save the model
pickle.dump(model_search, open('model_final.pkl', 'wb'))
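Loading the model back for reuse is just as simple. Here is a minimal sketch, assuming the filename used above:

# Load the saved model and reuse it for predictions
with open('model_final.pkl', 'rb') as f:
    loaded_model = pickle.load(f)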
Conclusion and Future Work
We went through how to build an ML model for car price prediction using five ML algorithms. We started off by cleaning the data, carrying out data visualizations, and preprocessing the data before feeding it to the ML algorithms. It was observed that the decision tree and the random forest both produced good results on the data, but the random forest performed better. This is because a random forest is an aggregation of decision trees; the aggregation reduces bias and limits overfitting, therefore producing a better result. The trade-off is that the random forest takes more time to compute.
Some other ideas can be implemented for this project. Some of the features in the data that are strongly correlated, such as the mileage and age, can be merged together. Principal Component Analysis is an unsupervised learning approach that can be used to reduce the number of features. Also, other hyperparameters can be checked with GridSearchCV using a wider range of parameters. With a larger dataset, the specific model of the car can be used rather than just its brand.
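As a rough sketch of the PCA idea, and assuming the scaled feature matrices from earlier, the dimensionality reduction could look like this (the variance threshold is an arbitrary illustration):

from sklearn.decomposition import PCA

# Keep enough components to explain 95% of the variance (an arbitrary choice)
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)
print(f'Reduced from {X_train.shape[1]} to {pca.n_components_} features')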
There you have it. If you want to try your hand at this project, you will find the notebook file here. Thanks for your time.
Translated from: https://medium.com/analytics-vidhya/car-price-prediction-with-machine-learning-models-part-2-6a1ef555b49b