Python Machine Learning Pipeline
Below are the usual steps involved in building an ML pipeline:
- Import Data
- Exploratory Data Analysis (EDA)
- Missing Value Imputation
- Outlier Treatment
- Feature Engineering
- Model Building
- Feature Selection
- Model Interpretation
- Save the model
- Model Deployment
Problem Statement and Getting the Data
I’m using a relatively large and complex data set to demonstrate the process: refer to the Kaggle competition IEEE-CIS Fraud Detection.
Navigate to Data Explorer and you will see something like this:
Select train_transaction.csv and it will show you a glimpse of what the data looks like. Click on the download icon highlighted by a red arrow to get the data.
Other than the usual library import statements, you will need two additional libraries:
pip install pyarrow
pip install fast_ml
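If you want to confirm both packages are available before running the notebook, a minimal check like the following works (this snippet is just a convenience, not part of the original workflow):

import importlib

# Verify the two extra dependencies are importable: pyarrow backs the feather
# format used later, and fast_ml provides reduce_memory_usage.
for pkg in ('pyarrow', 'fast_ml'):
    try:
        importlib.import_module(pkg)
        print(f'{pkg} is installed')
    except ImportError:
        print(f'{pkg} is missing, run: pip install {pkg}')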
Key Highlights
This is the first article in a series on building a machine learning pipeline. In this article, we will focus on optimizations around importing the data into a Jupyter notebook and executing things faster.
There are 3 key things to note in this article:
- Python zipfile
- Reducing the memory usage of the dataset
- A faster way of saving/loading working datasets
1: Import Data
After you have downloaded the zipped file, it is much better to use Python to unzip it.
Tip 1: Use a function from the Python zipfile library to unzip the file.
import zipfile

with zipfile.ZipFile('train_transaction.csv.zip', mode='r') as zip_ref:
    zip_ref.extractall('data/')
This will create a folder data and unzip the CSV file train_transaction.csv into that folder.
We will use the pandas read_csv method to load the data set into a Jupyter notebook.
%time trans = pd.read_csv('train_transaction.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 23.2 s, sys: 7.87 s, total: 31 s
Wall time: 32.5 s
Memory usage of dataframe is 1775.1524047851562 MB
Shape of dataframe is (590540, 394)
This data is ~1.5 GB with more than half a million rows.
Tip 2: We will use a function from fast_ml to reduce this memory usage.
from fast_ml.utilities import reduce_memory_usage

%time trans = reduce_memory_usage(trans, convert_to_category=False)

---- Output ----
Memory usage of dataframe is 1775.15 MB
Memory usage after optimization is: 542.35 MB
Decreased by 69.4%
CPU times: user 2min 25s, sys: 2min 57s, total: 5min 23s
Wall time: 5min 56s
This step took almost 5 minutes, but it reduced the memory footprint by almost 70%, which is quite a significant reduction.
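Under the hood, this kind of memory reduction mostly comes from downcasting numeric columns to the smallest dtype that can safely hold their values. The following is a rough, hand-rolled sketch of the idea; the helper name downcast_numeric is purely illustrative and this is not the actual fast_ml implementation:

import pandas as pd

def downcast_numeric(df):
    """Rough sketch: downcast int and float columns to smaller dtypes where possible."""
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name.startswith('int'):
            # e.g. int64 -> int32/int16/int8 depending on the value range
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif dtype_name.startswith('float'):
            # e.g. float64 -> float32 where precision allows
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

On a wide, mostly numeric data set like this one, shrinking float64 columns accounts for most of the savings.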
For further analysis, we will create a sample dataset of 200k records so that our data processing steps don't take too long to run.
# Take a sample of 200k records
%time trans = trans.sample(n=200000)

# Reset the index because sampling shuffles it
trans.reset_index(inplace=True, drop=True)

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of sample dataframe is {df_size} MB')

---- Output ----
CPU times: user 1.39 s, sys: 776 ms, total: 2.16 s
Wall time: 2.43 s
Memory usage of sample dataframe is 185.20355224609375 MB
Now, we will save this to our local drive in CSV format:

import os

os.makedirs('data', exist_ok=True)
trans.to_csv('data/train_transaction_sample.csv', index=False)
Tip 3: Use the feather format instead of CSV.
import os

os.makedirs('data', exist_ok=True)
trans.to_feather('data/train_transaction_sample')
Once you load the data back from these two files, you will observe a significant performance difference.
Load the saved sample data — CSV Format
%time trans = pd.read_csv('data/train_transaction_sample.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 7.37 s, sys: 1.06 s, total: 8.42 s
Wall time: 8.5 s
Memory usage of dataframe is 601.1964111328125 MB
Shape of dataframe is (200000, 394)
Load the saved sample data — Feather Format
%time trans = pd.read_feather('data/train_transaction_sample')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 1.32 s, sys: 930 ms, total: 2.25 s
Wall time: 892 ms
Memory usage of dataframe is 183.67779541015625 MB
Shape of dataframe is (200000, 394)
Notice 2 things here:
i. Loading the CSV file took almost 10 times as long as loading the feather-format data.
ii. The feather format preserved the reduced memory footprint of the data set, whereas loading from CSV discards the optimized dtypes, so the data set again consumes much more memory and we would have to run reduce_memory_usage again.
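The difference comes down to dtypes: feather stores the column types on disk, while CSV does not. A quick comparison like the following should make this visible, assuming both files were saved in the steps above:

import pandas as pd

# Compare the dtypes that come back from each format
csv_df = pd.read_csv('data/train_transaction_sample.csv')
feather_df = pd.read_feather('data/train_transaction_sample')

print(csv_df.dtypes.value_counts())      # typically back to float64/int64/object
print(feather_df.dtypes.value_counts())  # should retain the downcasted types (e.g. float32)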
Closing Notes:
- Please feel free to write your thoughts/suggestions/feedback.
- We will use the new sample data set we created here for further analysis.
- We will talk about Exploratory Data Analysis in the next article.
Github link
Translated from: https://towardsdatascience.com/building-a-machine-learning-pipeline-part-1-b19f8c8317ae