Support Vector Machines and Regression Analysis

It is a common misconception that support vector machines (SVMs) are useful only for classification problems.

The purpose of using SVMs for regression problems is to define a hyperplane, as in the image above, and fit as many instances as feasible within the margin around this hyperplane while limiting margin violations.

In this respect, SVM regression differs from classification, where the objective is to fit the widest possible margin between two separate classes (while also limiting margin violations).

As a matter of fact, SVMs can handle regression modelling quite effectively. Let's take hotel bookings as an example.

Predicting Average Daily Rates Across Hotel Customers

Suppose that we are building a regression model to predict the average daily rate (ADR, the rate that a customer pays on average per day) for a hotel booking. A model is constructed with the following features:

  • Cancellation (whether a customer cancels their booking or not)
  • Country of Origin
  • Market Segment
  • Deposit Type
  • Customer Type
  • Required Car Parking Spaces
  • Week of Arrival

Note that the ADR values are also populated for customers that cancelled. In those cases, the response variable reflects the ADR that would have been paid had the customer proceeded with the booking.

The original study by Antonio, Almeida and Nunes (2016) can be accessed from the References section below.

Model Building

Using the features outlined above, the SVM model is trained and validated on the training set (H1), and its predictions are then compared to the actual ADR values across the test set (H2).

The model is trained as follows:

>>> from sklearn.svm import LinearSVR
>>> svm_reg = LinearSVR(epsilon=1.5)
>>> svm_reg.fit(X_train, y_train)
LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=None, tol=0.0001, verbose=0)
>>> predictions = svm_reg.predict(X_val)
>>> predictions
array([100.75090575, 109.08222631,  79.81544167, ...,  94.50700112,
        55.65495607,  65.5248653 ])

Now, the same model is applied to the features in the test set to generate predicted ADR values:

bpred = svm_reg.predict(atest)
bpred

Let's compare the predicted ADR to the actual ADR in terms of mean absolute error (MAE) and root mean squared error (RMSE).

>>> import math
>>> from sklearn.metrics import mean_absolute_error, mean_squared_error
>>> mean_absolute_error(btest, bpred)
29.50931462735928
>>> print('mse (sklearn): ', mean_squared_error(btest, bpred))
>>> math.sqrt(mean_squared_error(btest, bpred))
44.60420935095296
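For reference, the two metrics can also be computed by hand. The following is a minimal sketch in plain Python; the function names and the sample values are illustrative assumptions and are not taken from the hotel data.

```python
import math

def mean_absolute_error_manual(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error_manual(actual, predicted):
    """Square root of the average squared difference."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Illustrative values only (assumption), not taken from the hotel data.
actual = [100.0, 80.0, 60.0, 40.0]
predicted = [90.0, 80.0, 60.0, 70.0]

mae = mean_absolute_error_manual(actual, predicted)
rmse = root_mean_squared_error_manual(actual, predicted)
print(mae)   # 10.0
print(rmse)  # larger than the MAE, because the single error of 30 is squared
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss by a large amount.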

Note that the epsilon (ϵ) parameter sets the width of the margin, and with it the model's sensitivity to training instances: instances whose error falls within ϵ incur no loss and so do not affect the fit. The larger ϵ is, the more training instances are ignored in this way.
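As a minimal sketch of the epsilon-insensitive loss named in the LinearSVR output above (loss='epsilon_insensitive'), the per-instance penalty can be written as follows. The function name and the sample numbers are illustrative assumptions, and the sketch omits the regularisation term that the full SVM objective also includes.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-instance loss: zero inside the margin, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Illustrative numbers only (assumption), not taken from the hotel data.
actual = [100.0, 80.0, 60.0]
predicted = [101.0, 83.0, 60.0]

losses = epsilon_insensitive_loss(actual, predicted, epsilon=1.5)
print(losses)  # [0.0, 1.5, 0.0]
```

Only the second instance, whose error of 3.0 exceeds ϵ = 1.5, contributes to the loss.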

In this instance, a relatively large margin of ϵ = 1.5 was used. Here is the model performance when ϵ is reduced to 0.5:

>>> mean_absolute_error(btest, bpred)
29.622491512816826
>>> print('mse (sklearn): ', mean_squared_error(btest, bpred))
>>> math.sqrt(mean_squared_error(btest, bpred))
44.7963000500928

We can see that modifying the ϵ parameter has produced virtually no change in the MAE or RMSE.

That said, we want to ensure that the SVM model is not overfitting. With ϵ = 0 the margin vanishes and every training instance contributes to the loss, so if the best fit were achieved when ϵ = 0, this might be a sign that the model is overfitting.

Here are the results on the test set when we set ϵ = 0:

  • MAE: 31.86
  • RMSE: 47.65

Given that the error is higher, not lower, when ϵ = 0 (an MAE of 31.86 versus 29.51 with ϵ = 1.5), there does not seem to be any evidence that overfitting is an issue in our model, at least not from this standpoint.
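The comparison across ϵ values can be run as a single loop. The sketch below uses synthetic data as a stand-in, since the hotel dataset is not reproduced here; the data-generating process, the array names, and the random_state and max_iter settings are all assumptions made for illustration, so the printed errors will not match the figures reported above.

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (assumption): a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([10.0, -5.0, 2.0]) + 50.0 + rng.normal(scale=5.0, size=500)

# Hold out the last 100 rows for evaluation.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Fit one model per epsilon and record its MAE on the held-out rows.
results = {}
for eps in (0.0, 0.5, 1.5):
    model = LinearSVR(epsilon=eps, random_state=0, max_iter=10000)
    model.fit(X_train, y_train)
    results[eps] = mean_absolute_error(y_test, model.predict(X_test))

for eps, mae in results.items():
    print(f"epsilon={eps}: MAE={mae:.2f}")
```

If the held-out error at ϵ = 0 were clearly the lowest of the three, that would be the pattern the text above treats as a possible sign of overfitting.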

How Does SVM Performance Compare to a Neural Network?

When using the same features, how does the prediction accuracy of the SVM compare to that of a neural network?

Consider the following neural network configuration:

>>> # Assumed import path; the source does not show its imports.
>>> from tensorflow.keras.models import Sequential
>>> from tensorflow.keras.layers import Dense
>>> model = Sequential()
>>> model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='elu'))
>>> model.add(Dense(1669, activation='elu'))
>>> model.add(Dense(1, activation='linear'))
>>> model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 8)                 72
_________________________________________________________________
dense_1 (Dense)              (None, 1669)              15021
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1670
=================================================================
Total params: 16,763
Trainable params: 16,763
Non-trainable params: 0
_________________________________________________________________
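The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper name below is an illustrative assumption.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for each of the three Dense layers in the summary.
layer_sizes = [(8, 8), (8, 1669), (1669, 1)]
counts = [dense_params(i, u) for i, u in layer_sizes]

print(counts)       # [72, 15021, 1670]
print(sum(counts))  # 16763
```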

The model is trained across 30 epochs with a batch size of 150, with 20% of the training data held out for validation:

>>> model.compile(loss='mse', optimizer='adam', metrics=['mse','mae'])
>>> history=model.fit(X_train, y_train, epochs=30, batch_size=150, verbose=1, validation_split=0.2)
>>> predictions = model.predict(X_test)

The following MAE and RMSE are obtained on the test set:

  • MAE: 29.89
  • RMSE: 43.91

We observed that when ϵ was set to 1.5 for the SVM model, the MAE and RMSE came in at 29.5 and 44.6 respectively. In this regard, the SVM has roughly matched the neural network in prediction accuracy on the test set, with a slightly lower MAE and a slightly higher RMSE.
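To put the gap between the two models in relative terms, the reported figures can be compared directly. This is a small arithmetic sketch using the rounded values quoted above; the helper name is an illustrative assumption.

```python
def percent_difference(value, baseline):
    """Signed difference of value from baseline, as a percentage of baseline."""
    return (value - baseline) / baseline * 100.0

svm = {"MAE": 29.5, "RMSE": 44.6}        # SVM with epsilon = 1.5
network = {"MAE": 29.89, "RMSE": 43.91}  # neural network

differences = {name: percent_difference(svm[name], network[name]) for name in svm}
for name, diff in differences.items():
    print(f"{name}: SVM differs from the network by {diff:+.2f}%")
```

On these figures the SVM's MAE is about 1.3% lower and its RMSE about 1.6% higher than the network's.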

Conclusion

It is a common misconception that SVMs are only suitable for classification problems.

However, we have seen in this example that the SVM model has been quite effective at predicting ADR values, with test-set accuracy comparable to that of the neural network.

Many thanks for reading. Any questions or feedback are appreciated.

The GitHub repository for this example and other relevant references are available below.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

Translated from: https://towardsdatascience.com/support-vector-machines-and-regression-analysis-ad5d94ac857f
