FastAI Deep Learning Community
FastAI is a convenient and powerful machine learning library that brings Deep Learning (DL) to the masses. I was motivated to write this article while troubleshooting some issues with training a model for a binary classification task. My objective is to walk you through the steps required to build a simple and effective DL model using FastAI for a tabular, imbalanced dataset while avoiding the mistakes that I made. The discussion in this article is organized according to the sections listed below.
- Dataset
- Sample Code
- Code Breakdown
- Comparison Of FastAI To PySpark ML & Scikit-Learn
- Conclusion
1. Dataset
The dataset comes from the context of ad conversions, where the binary target values 1 and 0 correspond to conversion success and failure. This proprietary dataset (no, I don’t own the rights) has some particularly interesting attributes due to its dimensions, class imbalance and rather weak relationship between the features and the target variable.
First, the dimensions of the data: this tabular dataset contains a large number of records, and its categorical features have very high cardinality.
Note: In FastAI, categorical features are represented using embeddings, which can improve classification performance on high-cardinality features.
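To make the idea of an embedding concrete, here is a minimal pure-Python sketch of a lookup table. The column values, the helper names (`build_vocab`, `make_embedding_table`, `embed`) and the embedding width of 4 are all hypothetical and chosen for illustration only; FastAI handles this internally with trainable PyTorch embedding layers.

```python
import random

def build_vocab(values):
    """Map each distinct category to an integer code; 0 is reserved for unseen/missing."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab

def make_embedding_table(n_codes, dim, seed=0):
    """One small random vector per code; in a real model these values are learned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(n_codes)]

def embed(value, vocab, table):
    """Look up the dense vector for a category (code 0 if it was never seen)."""
    return table[vocab.get(value, 0)]

# Hypothetical high-cardinality column (for example, an anonymized ad ID)
column = ["ad_17", "ad_42", "ad_17", "ad_99"]
vocab = build_vocab(column)
table = make_embedding_table(n_codes=len(vocab) + 1, dim=4)

vec_a = embed("ad_17", vocab, table)
vec_unseen = embed("ad_1000", vocab, table)
```

Each category costs only `dim` numbers rather than one column per distinct value, which is why this representation scales to features with thousands of levels.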
Second, the binary class labels are highly imbalanced since successful ad conversions are relatively rare. In this article we adapt to this constraint via an algorithm-level approach (weighted cross entropy loss functions) as opposed to a data-level approach (resampling).
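As a rough illustration of the algorithm-level approach, the sketch below computes a class-weighted cross entropy by hand. The predictions, labels and weights are made up, and `weighted_cross_entropy` is a hypothetical helper rather than a FastAI or PyTorch function; it normalizes by the total weight, which is how PyTorch's weighted cross entropy averages with its default mean reduction.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of -log p(true class).

    probs         : list of [p_class0, p_class1] per example
    targets       : list of true labels (0 or 1)
    class_weights : [w_0, w_1]
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for p, y in zip(probs, targets):
        w = class_weights[y]
        weighted_sum += -w * math.log(p[y])
        total_weight += w
    return weighted_sum / total_weight

# Hypothetical predictions: three majority-class rows and one minority-class row
probs = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]
targets = [0, 0, 0, 1]

unweighted = weighted_cross_entropy(probs, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, targets, [0.5, 44.0])
```

With equal weights the poorly predicted minority row is averaged away by the three easy majority rows; with a large minority weight it dominates the loss, so the optimizer can no longer ignore it.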
Third, the relationship between the features and the target variable is rather weak. For example, a Logistic Regression model reached a validation area under the ROC curve of only 0.74 after significant model tuning.
Summary Of Dataset Attributes
- Dimensions: 17 features, 1 target variable, 3,738,937 rows
- Binary target classes
- Class imbalance ratio of 1:87
- 6 numerical features
- 8 categorical features
- Categorical features have a combined cardinality of 44,000
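The attributes above imply some quick arithmetic that helps explain why this article scores models with area under the ROC curve rather than accuracy. The sketch below assumes the 1:87 ratio is exact, so the counts are approximations and not figures reported by the author.

```python
# Rough arithmetic from the listed attributes (ratio taken as exactly 1:87)
n_rows = 3_738_937
minority_share = 1 / (1 + 87)

approx_positives = round(n_rows * minority_share)
approx_negatives = n_rows - approx_positives

# Accuracy of a trivial model that always predicts "no conversion"
baseline_accuracy = approx_negatives / n_rows

print(f"approx. conversions:     {approx_positives:,}")
print(f"approx. non-conversions: {approx_negatives:,}")
print(f"always-0 accuracy:       {baseline_accuracy:.4f}")
```

A model that never predicts a conversion would already be right almost 99% of the time, so accuracy says very little about how well the minority class is being separated.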
2. Sample Code
This code was run in a Jupyter Lab notebook on Google Cloud AI Platform with the specs listed below.
- 4 N1-standard vCPUs, 15 GB RAM
- 1 NVIDIA Tesla P4 GPU
- Environment: PyTorch 1.4
- Operating System: Debian 9
The model training pipeline, which will be explained in the next section, is shown below.
"""
Using FastAI to train neural nets on highly imbalanced tabular
data with binary classes. This script provides a template for
training models with weighted cross entropy loss functions.
The weights are constructed based on the class populations to
downweight the contributions from the majority class.
Author : Faiyaz Hasan
Date : September 21, 2020
"""
import pandas as pd
from fastai.tabular.all import *
#############
# Load Data #
#############
df = pd.read_csv('data/training.csv')
#######################################
# Group Columns Names By Feature Type #
#######################################
# Categorical Features
CAT_NAMES = ['col_1', 'col_2', 'col_3', 'col_4',
             'col_5', 'col_6', 'col_7', 'col_8']
# Continuous Features
CONT_NAMES = ['col_9', 'col_10', 'col_11', 'col_12',
              'col_13', 'col_14']
# Target Variable
TARGET = 'target'
############################################
# Cast Target Variable As Categorical Data #
############################################
df[TARGET] = df[TARGET].astype('category') # Very important to ensure that the target variable is cast to the right type.
# Check DataFrame Info
df.info()
##########################
# Instantiate Dataloader #
##########################
# Data Processors
procs = [Categorify, FillMissing, Normalize]
# Training/Validation Dataset 80:20 Split
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
dls = TabularDataLoaders.from_df(df,
                                 y_names=TARGET,
                                 cat_names=CAT_NAMES,
                                 cont_names=CONT_NAMES,
                                 procs=procs,
                                 splits=splits)
# View Transformed Training/Validation Data
dls.xs
#####################
# Construct Weights #
#####################
class_count_df = df.groupby(TARGET).count()
n_0, n_1 = class_count_df.iloc[0, 0], class_count_df.iloc[1, 0]
w_0 = (n_0 + n_1) / (2.0 * n_0)
w_1 = (n_0 + n_1) / (2.0 * n_1)
# Important: Convert Weights To Float Tensor
class_weights=torch.FloatTensor([w_0, w_1]).cuda()
############################
# Model Performance Metric #
############################
# Instantiate RocAucBinary Score
roc_auc = RocAucBinary() # Very important: Use the binary scoring function and not RocAuc()
#################
# Loss Function #
#################
loss_func = CrossEntropyLossFlat(weight=class_weights)
#################
# Model Learner #
#################
learn = tabular_learner(dls, loss_func=loss_func, metrics=roc_auc)
########################################
# Train Model & Check Validation Score #
########################################
learn.fit_one_cycle(3)
3. Code Breakdown
Import Packages
Install both fastai and fastbook via the command terminal. For more details on the setup, check out this link.
conda install -c fastai -c pytorch fastai
git clone https://github.com/fastai/fastbook.git
pip install -Uqq fastbook
Import the FastAI library and pandas into the notebook.
import pandas as pd
from fastai.tabular.all import *
Load Data & Group Column Names By Feature Type
The column names had to be anonymized for privacy reasons.
df = pd.read_csv('data/training.csv')
# Categorical Features
CAT_NAMES = ['col_1', 'col_2', 'col_3', 'col_4',
             'col_5', 'col_6', 'col_7', 'col_8']
# Continuous Features
CONT_NAMES = ['col_9', 'col_10', 'col_11',
              'col_12', 'col_13', 'col_14']
# Target Variable
TARGET = 'target'
Cast Target Variable Data Type
Change the data type of the binary target variable to category.
df[TARGET] = df[TARGET].astype('category')
Pitfall #1: If the target variable is left as a numeric type, FastAI/PyTorch will treat it as a numeric value and raise a runtime error.
Instantiate Dataloader
Next, list the data preprocessors, define the training/validation split and create the tabular dataloader.
# Data Processors
procs = [Categorify, FillMissing, Normalize]
# Training/Validation Dataset 80:20 Split
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
dls = TabularDataLoaders.from_df(df,
                                 y_names=TARGET,
                                 cat_names=CAT_NAMES,
                                 cont_names=CONT_NAMES,
                                 procs=procs,
                                 splits=splits)
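For readers curious about what a random 80:20 split amounts to, here is a minimal pure-Python sketch. `random_split` is a hypothetical stand-in written for this illustration and is not FastAI's implementation; the `seed` argument is an addition that makes the split reproducible between runs.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle row indices and hold out the first `valid_pct` of them for validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # (training indices, validation indices)

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
```

Because the split is purely random, the share of rare positive rows in the validation set matches the full dataset only on average, which is worth keeping in mind when comparing validation scores across runs.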
Use dls.xs to see the transformed training data.
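The three processors can be pictured with the simplified pure-Python sketch below. The helper functions and sample values are hypothetical, and the details (median fill, population standard deviation, code 0 for missing categories) are assumptions made for illustration; FastAI's own transforms may differ.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values and flag what was filled."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def categorify(values):
    """Replace each category with an integer code; 0 is kept for missing values."""
    levels = sorted({v for v in values if v is not None})
    codes = {c: i + 1 for i, c in enumerate(levels)}
    return [codes.get(v, 0) for v in values]

cont = [1.0, None, 3.0, 5.0]   # a continuous column with one gap
cat = ["b", "a", None, "b"]    # a categorical column with one gap

filled, was_missing = fill_missing(cont)
scaled = normalize(filled)
coded = categorify(cat)
```

After these steps the continuous columns are centered and on a comparable scale, and the categorical columns are integer codes ready to index an embedding table.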
Construct Loss Function Weights
The class counts are used to create the weights for the cross entropy loss function, ensuring that the majority class is down-weighted accordingly. The formula for the weights used here is the same as in scikit-learn and PySpark ML.
class_count_df = df.groupby(TARGET).count()
n_0, n_1 = class_count_df.iloc[0, 0], class_count_df.iloc[1, 0]
w_0 = (n_0 + n_1) / (2.0 * n_0)
w_1 = (n_0 + n_1) / (2.0 * n_1)
class_weights=torch.FloatTensor([w_0, w_1]).cuda()
Pitfall #2: Ensure that the class weights are converted to a float tensor and moved to the GPU via .cuda(). Otherwise, you will get a type error.
TypeError: cannot assign 'list' object to buffer 'weight' (torch Tensor or None required)
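To see what the weight formula produces, here is a small worked example in plain Python. The counts are hypothetical and chosen only to match the article's 1:87 ratio, and `balanced_class_weights` is an illustrative helper that generalizes the two-class formula used in the code to any number of classes.

```python
def balanced_class_weights(counts):
    """w_k = n_samples / (n_classes * n_k) for each class count n_k."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * n_k) for n_k in counts]

# Hypothetical counts with a 1:87 ratio (minority class 1 listed second)
n_0, n_1 = 87_000, 1_000
w_0, w_1 = balanced_class_weights([n_0, n_1])

# Each class now contributes the same total weight to the loss
total_0 = n_0 * w_0
total_1 = n_1 * w_1

print(f"w_0 = {w_0:.4f}, w_1 = {w_1:.1f}")
```

The majority class ends up with a weight of roughly 0.5 and the minority class with a weight of 44, so both classes carry equal total weight in the loss despite the imbalance.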
Instantiate Area Under ROC Metric
roc_auc = RocAucBinary()
Pitfall #3: For binary class labels, use RocAucBinary() and NOT RocAuc() in order to avoid a value error.
ValueError: y should be a 1d array, got an array of shape (2000, 2) instead.
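The error message hints at the underlying issue: a binary AUC is computed from a single 1-D score per row, not from a two-column array of class probabilities. The sketch below is a hypothetical pure-Python implementation written for illustration (quadratic in the number of rows, so only suitable for tiny inputs); it is not the code FastAI or scikit-learn runs.

```python
def roc_auc_binary(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    y_true : 1-D list of 0/1 labels
    scores : 1-D list of scores for the positive class (ties count as half)
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Two-column class probabilities, as a network with two outputs produces them
probs = [[0.9, 0.1], [0.6, 0.4], [0.65, 0.35], [0.2, 0.8]]
y_true = [0, 0, 1, 1]

# Keep only the positive-class column before scoring
positive_scores = [p[1] for p in probs]
auc = roc_auc_binary(y_true, positive_scores)
```

Here three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.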
Instantiate Loss Function
loss_func = CrossEntropyLossFlat(weight=class_weights)
Pitfall #4: Use the FastAI cross entropy loss function as opposed to the PyTorch equivalent, torch.nn.CrossEntropyLoss(), in order to avoid errors. The FastAI loss functions are listed here. Using the PyTorch cross entropy loss gave me the following runtime error.
RuntimeError: Expected object of scalar type Long but got scalar type Char for argument #2 'target' in call to _thnn_nll_loss_forward
Instantiate Learner
Use tabular_learner in FastAI to easily instantiate an architecture.
learn = tabular_learner(dls,
                        layers=[500, 250],
                        loss_func=loss_func,
                        metrics=roc_auc)
Double check that learn is using the correct loss function:
learn.loss_func
Out [1]: FlattenedLoss of CrossEntropyLoss()
Model Training & Validation Score
Train the model for the desired number of epochs.
learn.fit_one_cycle(3)
The performance metric and the loss function values for the training and validation sets are shown below.
Area under ROC curve reaches 0.75 in just 3 epochs!
That’s great! With minimal tuning, the FastAI model performs better than the models built painstakingly using PySpark and Scikit-Learn.
4. Comparison Of FastAI To PySpark ML & Scikit-Learn Logistic Regression Model
In this section, we compare the model performance and computation time of these three ML libraries.
Note: While the neural network performed well without extensive hyper-parameter tuning, PySpark ML and Scikit-Learn did not. As a result, I have included their tuning times, since they were relevant to training a decent model.
Model Training Time
- FastAI: 6 minutes
- PySpark ML: 0.7 seconds + 38 minutes for hyper-parameter tuning
- Scikit-Learn: 36 seconds + 8 minutes for hyper-parameter tuning (on a subsample of the data)
Area Under ROC Curve
- FastAI: 0.75
- PySpark ML: 0.74
- Scikit-Learn: 0.73
For more details, check out my GitHub repo.
5. Conclusion
In this article, we saw the power of FastAI when it comes to quickly building DL models. Before playing around with FastAI, I would put off using neural networks until I had already tried Logistic Regression, Random Forests etc. because of the stigma that neural networks are difficult to tune and computationally expensive. However, with increased accessibility to GPUs via Google AI Platform Notebooks and the layered FastAI approach to deep learning, now it will surely be one of the first tools I reach for when it comes to classification tasks on large, complex datasets.
Translated from: https://medium.com/@faiyaz.hasan1/deep-learning-with-weighted-cross-entropy-loss-on-imbalanced-tabular-data-using-fastai-fe1c009e184c