The goal of this challenge is to predict whether a customer will make a transaction (“target” = 1) or not (“target” = 0). For that, we get a data set of 200 anonymized variables, and our submission is judged on the Area Under the Receiver Operating Characteristic Curve (ROC AUC), which we have to maximise.
This project is somewhat different from others: you basically get a huge amount of data with no missing values and only numbers. A dream come true for any data scientist. Of course, that sounds too good to be true! Let’s dive in.
I. Set up
We start by loading the data and getting a quick overview of what we’ll have to handle. We do so by calling the describe() and info() functions.
# Load the data sets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Create a merged data set and review initial information
combined_df = pd.concat([train_df, test_df])
print(combined_df.describe())
print(combined_df.info())
We have a total of 400,000 observations, 200,000 of which are in our training set. We can also see that we will have to deal with the class imbalance issue, as the mean of the target column is 0.1.
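To see the imbalance explicitly, we can look at the relative class frequencies in the training set (a quick check, not part of the original script):

# Relative class frequencies: roughly 90% of customers did not make a transaction
print(train_df["target"].value_counts(normalize=True))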
II. Missing values
Let’s check whether we have any missing values. For that, we print the column names that contain missing values.
# Check missing values
print(combined_df.columns[combined_df.isnull().any()])
We have zero missing values. Let’s move forward.
III. Data types
Let’s check the data we have. Are we dealing with categorical variables? Or text? Or just numbers? We print the different data types present and how often each occurs.
# Get the data types
print(Counter([combined_df[col].dtype for col in combined_df.columns.values.tolist()]).items())
Only float data. We don’t have to create dummy variables.
IV. Data cleaning
We don’t want to use our ID column to make our predictions, so we store it in the index.
# Set the ID col as index
for element in [train_df, test_df]:
    element.set_index('ID_code', inplace = True)
We now separate the target variable from our training set and store it in its own dataframe.
# Create X_train_df and y_train_df set
X_train_df = train_df.drop("target", axis = 1)
y_train_df = train_df["target"]
V. Scaling
We haven’t done anything when it comes to data exploration and outlier analysis. It is always highly recommended to conduct these. However, given the nature of the challenge, we suspect that the variables in themselves might not be too interesting.
In order to compensate for our lack of outlier detection, we scale the data using RobustScaler().
# Scale the data and use RobustScaler to minimise the effect of outliers
scaler = RobustScaler()

# Scale the X_train set
X_train_scaled = scaler.fit_transform(X_train_df.values)
X_train_df = pd.DataFrame(X_train_scaled, index = X_train_df.index, columns = X_train_df.columns)

# Scale the X_test set
X_test_scaled = scaler.transform(test_df.values)
X_test_df = pd.DataFrame(X_test_scaled, index = test_df.index, columns = test_df.columns)
We now create X_train, y_train, X_test and y_test sets for training our model and then testing it on hold-out data.
# Split our training sample into train and test, leave 20% for test
X_train, X_test, y_train, y_test = train_test_split(X_train_df, y_train_df, test_size=0.2, random_state = 20)
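Given that only about 10% of observations are positive, it can also help to stratify this split so both sets keep the same class proportions. Here is a small variant of the call above (the stratify argument is standard scikit-learn, not part of the original script):

# Stratified variant of the split: preserve the ~10% positive rate in both sets
X_train, X_test, y_train, y_test = train_test_split(X_train_df, y_train_df, test_size=0.2, random_state=20, stratify=y_train_df)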
VI. Outliers
When it comes to outliers, one could use IsolationForest() to automatically identify and remove rows that are outliers. This technique is often used for data sets with numerous variables. This code chunk has been borrowed from MachineLearningMastery.
# OUTLIERS
# Remove outliers automatically
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
print(yhat)

# Select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train.loc[mask, :], y_train.loc[mask]
Please note that this automated outlier discovery did not add any predictive power to our model and we decided to comment it out.
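For reference, the share of rows flagged depends directly on the contamination parameter; a quick count (not in the original script) shows how many training rows the detector would remove:

# Count the training rows flagged as outliers (roughly 10% with contamination=0.1)
print("Rows flagged as outliers: {}".format((yhat == -1).sum()))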
VII. Class Imbalance
In our data, we have seen that we have far fewer observations that made a transaction than did not. If we want our model to be equally capable of predicting both, we should make sure we don’t feed it skewed data.
We correct for class imbalance by resampling: first by downsampling the majority class, then by upsampling the minority class. These techniques are inspired by this excellent article by Tara Boyle.
# CLASS IMBALANCE
# Downsample majority class
# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

not_transa_down = resample(not_transa,
                           replace = False,          # sample without replacement
                           n_samples = len(transa),  # match minority n
                           random_state = 27)        # reproducible results

# Combine minority and downsampled majority
downsampled = pd.concat([not_transa_down, transa])

# Checking counts
print(downsampled.target.value_counts())

# Create training set again
y_train = downsampled.target
X_train = downsampled.drop('target', axis=1)
print(len(X_train))
Here is the code for upsampling the minority class.
# Upsample minority class
# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

transa_up = resample(transa,
                     replace = True,               # sample with replacement
                     n_samples = len(not_transa),  # match majority n
                     random_state = 27)            # reproducible results

# Combine upsampled minority and majority
upsampled = pd.concat([transa_up, not_transa])

# Checking counts
print(upsampled.target.value_counts())

# Create training set again
y_train = upsampled.target
X_train = upsampled.drop('target', axis=1)
print(len(X_train))
And here is the code for creating synthetic samples with SMOTE.
# Create synthetic samples
sm = SMOTE(random_state=27, sampling_strategy='minority')
# fit_sample() was renamed fit_resample() in recent imblearn versions
X_train, y_train = sm.fit_resample(X_train, y_train)
print(pd.Series(y_train).value_counts())
VIII. Modelling
We now dive deeper into the models. The plan is to create 4 different models and then combine their predictions into an ensemble that yields the final prediction. We do not plan to fine-tune the models to any great extent, leaving GridSearch out of this.
1. Neural Network With Keras
# NEURAL NETWORK
# Build our neural network with input dimension 200
classifier = Sequential()

# First Hidden Layer
classifier.add(Dense(150, activation='relu', kernel_initializer='random_normal', input_dim=200))
# Second Hidden Layer
classifier.add(Dense(350, activation='relu', kernel_initializer='random_normal'))
# Third Hidden Layer
classifier.add(Dense(250, activation='relu', kernel_initializer='random_normal'))
# Fourth Hidden Layer
classifier.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
# Output Layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

# Compile the network
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the data to the training data set
classifier.fit(X_train, y_train, batch_size=100, epochs=150)

# Evaluate the model on training data
eval_model = classifier.evaluate(X_train, y_train)
print(eval_model)

# Make predictions on the hold out data
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Get the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions and create submission file
predictions = (classifier.predict(X_test_df) > 0.5)
predictions = np.concatenate(predictions, axis=0)
my_pred = pd.DataFrame({'ID_code': X_test_df.index, 'target': predictions})

# Set 0 and 1s instead of True and False
my_pred["target"] = my_pred["target"].map({True: 1, False: 0})

# Create CSV file
my_pred.to_csv('pred_ann.csv', index=False)
This model is built upon the excellent review from Renu Khandelwal. We haven’t modified the original script except for adding some layers and increasing the number of neurons per layer.
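To double-check the resulting architecture, Keras can print the layer shapes and parameter counts; a quick sanity check, not part of the original script:

# Review the network architecture and parameter counts
classifier.summary()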
Our first submission with this Neural Network gives us a score of 0.80882.
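Since the competition is scored on ROC AUC, it is also worth computing that metric on our hold-out split from the raw predicted probabilities rather than the thresholded labels. Here is a minimal sketch using roc_auc_score (imported in the packages list below):

# Hold-out ROC AUC for the neural network, computed on probabilities rather than 0/1 labels
y_proba = classifier.predict(X_test).ravel()
print("Hold-out ROC AUC: {:.5f}".format(roc_auc_score(y_test, y_proba)))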
2. LightGBM
# LIGHT GBM
# Get the train and test data for the training sequence
train_data = lgbm.Dataset(X_train, label=y_train)
test_data = lgbm.Dataset(X_test, label=y_test)

# Set parameters
parameters = {'application': 'binary',
              'objective': 'binary',
              'metric': 'auc',
              'is_unbalance': 'true',
              'boosting': 'gbdt',
              'num_leaves': 31,
              'feature_fraction': 0.5,
              'bagging_fraction': 0.5,
              'bagging_freq': 20,
              'learning_rate': 0.05,
              'verbose': 0}

# Train our classifier
classifier = lgbm.train(parameters,
                        train_data,
                        valid_sets=test_data,
                        num_boost_round=5000,
                        early_stopping_rounds=100)

# Make predictions
predictions = classifier.predict(X_test_df.values)

# Create submission file
my_pred_lgbm = pd.DataFrame({'ID_code': X_test_df.index, 'target': predictions})

# Create CSV file
my_pred_lgbm.to_csv('pred_lgbm.csv', index=False)
This code chunk is based on some work from this Kaggle Notebook by E. Zietsman. If you want a complete overview of how LightGBM works and how to optimally tune it, make sure you read this article from Pushkar Mandot.
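One detail worth keeping in mind: with early stopping, the booster records the best round in best_iteration, and we can pass it explicitly when predicting (recent LightGBM versions use it by default once early stopping has triggered). A small sketch:

# Predict with the number of boosting rounds selected by early stopping
print("Best iteration: {}".format(classifier.best_iteration))
predictions = classifier.predict(X_test_df.values, num_iteration=classifier.best_iteration)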
This gives us a score of 0.89217.
3. XGBoost
# XGBOOST
# Instantiate classifier
classifier = XGBClassifier(tree_method = 'hist',
                           objective = 'binary:logistic',
                           eval_metric = 'auc',
                           learning_rate = 0.01,
                           max_depth = 2,
                           colsample_bytree = 0.35,
                           subsample = 0.8,
                           min_child_weight = 53,
                           gamma = 9,
                           silent = 1)

# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold out data
y_pred = (classifier.predict_proba(X_test)[:,1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:,1] >= 0.5).astype(int)

# Create submission file
my_pred_xgb = pd.DataFrame({'ID_code': X_test_df.index, 'target_xgb': predictions})

# Create CSV file
my_pred_xgb.to_csv('pred_xgb.csv', index=False)
We also rely on XGBoost and the helpful insights from Félix Revert.
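To monitor the hold-out AUC while XGBoost trains, one could also pass an evaluation set to fit(). A hedged sketch (the eval_set and verbose arguments are standard in the scikit-learn wrapper; early-stopping options vary between xgboost versions, so only basic monitoring is shown):

# Monitor the hold-out set during training, using the eval_metric set on the classifier
classifier.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(classifier.evals_result())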
This gives us a score of 0.59283.
4. CatBoost
# CATBOOST
# Instantiate classifier
classifier = cb.CatBoostClassifier(loss_function="Logloss",
                                   eval_metric="AUC",
                                   learning_rate=0.01,
                                   iterations=1000,
                                   random_seed=42,
                                   od_type="Iter",
                                   depth=10,
                                   early_stopping_rounds=500)

# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold out data
y_pred = (classifier.predict_proba(X_test)[:,1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:,1] >= 0.5).astype(int)

# Create submission file
my_pred_cat = pd.DataFrame({'ID_code': X_test_df.index, 'target_cat': predictions})

# Create CSV file
my_pred_cat.to_csv('pred_cat.csv', index=False)
This part is inspired by Wakame on Kaggle.
This gives us a score of 0.78769.
5. Ensemble
In this last part, we take the 4 models we created and ensemble them to generate our final answer. For an observation to be labelled 1, we want at least 3 of the 4 models to predict 1.
# ENSEMBLE
# The ANN frame was created with a plain 'target' column and the LightGBM frame holds
# probabilities, so we first align both with the 'target_*' naming used below
my_pred_ann = my_pred.rename(columns={'target': 'target_ann'})
my_pred_lgbm['target_lgbm'] = (my_pred_lgbm['target'] > 0.5).astype(int)
my_pred_lgbm = my_pred_lgbm.drop('target', axis=1)

# Create data frame
my_pred_ens = pd.concat([my_pred_ann, my_pred_xgb, my_pred_cat, my_pred_lgbm], axis = 1, sort=False)

# Review our frame
print(my_pred_ens.describe())

# Sum all the predictions
my_pred_ens["target"] = my_pred_ens["target_ann"] + my_pred_ens["target_xgb"] + my_pred_ens["target_lgbm"] + my_pred_ens["target_cat"]

# Assign a 1 if the sum is higher than 2 (at least 3 of the 4 models agree)
my_pred_ens["target"] = np.where(my_pred_ens["target"] > 2, 1, 0)

# Remove other target cols
my_pred_ens = my_pred_ens.drop(["target_ann", "target_lgbm", "target_xgb", "target_cat"], axis = 1)

# Create submission file
my_pred = pd.DataFrame({'ID_code': X_test_df.index, 'target': my_pred_ens["target"]})

# Create CSV file
my_pred.to_csv('pred_ens.csv', index=False)
This gives us a score of 0.78627.
IX. Conclusion
Our best model was LightGBM. To improve on our score, we could rely on stratified K-fold or another cross-validation technique, as sketched below, and fine-tune our models in more detail.
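As an illustration, here is a minimal sketch of a stratified 5-fold loop around the LightGBM model. It reuses the parameters dict from the LightGBM section and simply averages the test predictions across folds; this is an illustrative setup of ours, not a tuned one.

# Stratified 5-fold cross-validation for the LightGBM model (illustrative sketch)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
oof_preds = np.zeros(len(X_train_df))
test_preds = np.zeros(len(X_test_df))

for trn_idx, val_idx in skf.split(X_train_df, y_train_df):
    trn_data = lgbm.Dataset(X_train_df.iloc[trn_idx], label=y_train_df.iloc[trn_idx])
    val_data = lgbm.Dataset(X_train_df.iloc[val_idx], label=y_train_df.iloc[val_idx])
    booster = lgbm.train(parameters, trn_data, valid_sets=val_data,
                         num_boost_round=5000, early_stopping_rounds=100)
    oof_preds[val_idx] = booster.predict(X_train_df.iloc[val_idx])
    test_preds += booster.predict(X_test_df.values) / skf.n_splits

# Out-of-fold AUC gives a more robust estimate than a single hold-out split
print("Out-of-fold ROC AUC: {:.5f}".format(roc_auc_score(y_train_df, oof_preds)))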
Packages used
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.utils import resample
import lightgbm as lgbm
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import IsolationForest
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import catboost as cb
from catboost import Pool
from sklearn.model_selection import KFold
Translated from: https://medium.com/@invest_gs/predicting-financial-transactions-with-catboost-lgbm-xgboost-and-keras-ede24a6e4a76