Machine Learning Engineering Case Studies with Python Notebooks

Author: 禅与计算机程序设计艺术

1. Introduction

Machine learning engineering (MLE) is the process of developing machine learning systems that can perform tasks with high accuracy and efficiency at scale. MLE involves designing, building, testing, deploying, monitoring, and maintaining machine learning models, as well as building infrastructure for running them efficiently. The purpose of this article is to provide a practical guide on how to develop an efficient and effective MLE system using Python notebooks. We will go through various case studies, covering different aspects of ML Engineering including data preprocessing, model development, deployment, optimization, and monitoring. In each section, we will also demonstrate the implementation in Python notebook form.
The goal is to help readers understand the fundamental principles behind machine learning engineering, gain practical insights into the steps involved, and build confidence in applying these principles to real-world problems. All code used in this article is written in Python and presented in notebook form, which lets us share ideas more clearly and concisely than traditional prose and lets readers follow along with the explanations and learn by running the examples rather than just absorbing information passively. This article therefore provides both a theoretical foundation and hands-on experience, whether you are new to MLE or an experienced developer looking to level up your skills.

2. Background

This article provides a comprehensive overview of what machine learning engineering (MLE) entails, discusses key concepts such as supervised and unsupervised learning, and explores best practices for handling data, implementing algorithms, optimizing performance, and monitoring the deployed models. In doing so, we focus on providing actionable solutions to common problems faced during MLE projects. We present several case studies demonstrating how to approach each aspect of MLE, ranging from data preparation techniques to deployment strategies. These case studies include preparing medical images for classification, analyzing text sentiment for classification, identifying customer segments based on demographics and behavioral patterns, and predicting stock prices with deep neural networks. By the end of this article, you should have a clear understanding of the necessary components of MLE, be able to identify potential challenges, and have practical insights into how to tackle them.

3. Core Concepts, Terminology, and Fundamentals

Before diving into specific details about MLE, let’s first establish some basic terminology and knowledge of machine learning itself.

Supervised vs Unsupervised Learning

Supervised learning refers to the problem of training a model using labeled input data, where the output variable(s) are known for each input observation. It consists of two phases:

  1. Training Phase: The model learns to map inputs to outputs based on a set of labeled data points.
  2. Testing/Validation Phase: After training completes, the trained model is evaluated on a separate set of test data. If the model performs well on the test data, it is considered trained well enough to make predictions on new, unseen data; otherwise, additional training may be required to improve its performance.

Unsupervised learning refers to the problem of training a model using unlabeled input data. There are three main types of unsupervised learning:

  1. Clustering: The aim is to group similar data points together into clusters.
  2. Dimensionality Reduction: The objective is to reduce the number of features in the dataset while retaining most of the relevant information.
  3. Association Analysis: The task is to find interesting relationships between variables in the data.

In summary, supervised learning requires labeled data, while unsupervised learning does not require any labeling or ground truth. Choosing the right type of algorithm depends on the nature of the problem being solved. For example, if no labeled data are available but we need to group customers based on their purchasing behavior, we might choose a clustering algorithm; if we want to classify news articles by topic, we would likely use a supervised classifier. A minimal sketch of this distinction is shown below.
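As a rough illustration of this choice, the following minimal sketch (using scikit-learn with randomly generated placeholder features, so the column meanings are purely hypothetical) clusters unlabeled "customer" data with KMeans and fits a supervised classifier on labeled "article" data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unsupervised: no labels, so group "customers" into segments by behaviour
purchase_features = rng.random((200, 3))        # placeholder columns, e.g. frequency, basket size, recency
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(purchase_features)

# Supervised: labels are known, so train a classifier on (features, topic) pairs
article_features = rng.random((200, 10))        # placeholder text features
topics = rng.integers(0, 3, size=200)           # known topic labels
classifier = LogisticRegression(max_iter=1000).fit(article_features, topics)

print("First few cluster assignments:", segments[:5])
print("Predicted topics:", classifier.predict(article_features[:5]))
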
Next, let’s discuss important terms related to machine learning and programming languages.

Terms & Concepts

Data: A collection of observations that contains attributes that describe each instance.
Attributes: Individual measurable properties or characteristics of instances. Examples could be age, gender, income, occupation etc. Attributes can be numerical, categorical or time-series based depending on the domain.
Instance: An individual entry in the dataset. Each row represents one instance in the dataset.
Feature: A property or characteristic of an instance that influences the outcome. Features are typically measured or calculated for each instance. Example features could be height, weight, credit score, marital status, location etc.
Label: The value to be predicted for each instance. Labels can be binary, indicating one of two outcomes such as “yes” or “no”, or multi-class, indicating one of several possible outcomes. Labels are usually supplied with the data, although in some settings they are derived from raw records using rules-based logic.
Training Set: The subset of data used to train a model.
Test Set: The subset of data used to validate the accuracy of a trained model.
Cross Validation: Method of evaluating a model by repeatedly training it on some folds of the data, evaluating it on the held-out fold, and averaging the results.
Hyperparameters: Parameters that are tuned manually before training a model. They control the complexity of the model and impact the quality of the final result. Some hyperparameters that affect model performance include learning rate, regularization parameter, batch size etc.
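To make cross validation and hyperparameter tuning concrete, here is a minimal scikit-learn sketch; the breast cancer dataset and the grid over the regularization parameter C are illustrative choices, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Cross validation: train on different folds of the data and average the scores
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())

# Hyperparameter tuning: grid search over the regularization parameter C
grid = GridSearchCV(LogisticRegression(max_iter=5000),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["C"], "with score:", grid.best_score_)
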
Python Programming Language: An open source language that is commonly used for data science and machine learning. Python has powerful libraries like NumPy, Pandas, Matplotlib, Scikit-learn, TensorFlow, PyTorch etc., that support various machine learning algorithms and tools.
Linear Regression: One of the simplest regression algorithms. It assumes that the relationship between the feature vector X and target variable y is linear.
Polynomial Regression: A variation of linear regression in which the input features are expanded with polynomial terms, allowing the model to capture relationships beyond a simple straight line.
Logistic Regression: Another popular supervised learning algorithm for binary classification tasks. It estimates class probabilities using the sigmoid (logistic) function.
Decision Trees: Tree-like models that represent complex decision processes by recursively splitting nodes based on a split criterion until they reach leaves with classifications.
Random Forest: Ensemble method that combines multiple Decision Trees to decrease variance and increase robustness.
Gradient Boosting: Ensemble method that iteratively trains weak learners to produce a strong learner.
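As a hedged sketch (synthetic data via make_classification, arbitrary hyperparameters), the snippet below fits a single decision tree, a random forest, and a gradient boosting model and compares their test accuracy.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [
    DecisionTreeClassifier(max_depth=5, random_state=42),           # a single tree
    RandomForestClassifier(n_estimators=200, random_state=42),      # bagged ensemble of trees
    GradientBoostingClassifier(n_estimators=200, random_state=42),  # boosted ensemble of shallow trees
]
for model in models:
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", round(model.score(X_test, y_test), 3))
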
ROC Curve: A curve showing the tradeoff between true positive rate and false positive rate at various threshold values. ROC curves are often used to compare classifiers across thresholds, although they can look overly optimistic on heavily imbalanced datasets, where precision-recall curves are usually more informative.
AUC Score: A metric that measures the area under the ROC curve. Its value ranges between 0 and 1, with 1 representing perfect classification.
Precision Recall Curve: Precision-Recall curve shows the tradeoff between precision and recall for different probability thresholds.
F1 Score: F1 score is the harmonic mean of precision and recall. It is useful when you want to balance precision and recall, or when you care equally about both metrics.
Confusion Matrix: A matrix that summarizes the performance of an algorithm on a test dataset. For a binary problem it counts the true positives, true negatives, false positives, and false negatives; for a multi-class problem it counts every combination of actual and predicted class.
Classification Report: A per-class summary of precision, recall, F1-score, and support, commonly used to evaluate a classifier on binary or multi-class tasks.
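The evaluation tools above are available in sklearn.metrics; the sketch below computes them on a synthetic, mildly imbalanced dataset purely for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, f1_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic data with an 80/20 class imbalance
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)[:, 1]   # scores for the positive class
predictions = clf.predict(X_test)

print("AUC:", roc_auc_score(y_test, probabilities))
print("F1 score:", f1_score(y_test, predictions))
print("Confusion matrix:\n", confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
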
Model Deployment: The process of moving a machine learning model from development into a production environment so it can make predictions on new data. Depending on the context, deployment can involve scaling, updating, routing traffic, and monitoring the performance of the deployed model over time.
Model Monitoring: The process of continuously tracking and assessing the performance of a deployed model over time to ensure that it remains operational and accurate. Model monitoring includes collecting metrics such as latency, error rates, and throughput, and setting alerts according to predefined conditions.
Batch Scheduling: The process of processing large amounts of data in batches to avoid excessive memory usage or network congestion. Batch scheduling techniques include dynamic batch sizing, partitioning, prefetching, and gradient accumulation.
Feature Scaling: A technique that rescales independent variables or features into a fixed range such as [0,1] or [-1,+1], or to zero mean and unit variance. Feature scaling improves the convergence speed and stability of many optimization methods. Common methods include min-max normalization and Z-score standardization.
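A minimal sketch of both approaches with scikit-learn; the tiny two-column array is only a placeholder for real features on different scales.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Z-score standardization: each column gets zero mean and unit variance
X_standardized = StandardScaler().fit_transform(X)
# Min-max normalization: each column is rescaled into the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standardized)
print(X_minmax)
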
Outlier Detection: Identifying and removing outliers from the data is essential for improving the accuracy and reliability of subsequent modeling activities. Outlier detection methods include isolation forest, PCA-based outlier detection, and k-NN density estimation.
PCA (Principal Component Analysis): A statistical method that identifies a small set of principal components that capture most of the variance in the data. It uses linear projections to project the original data onto a lower dimensional space, resulting in fewer dimensions with maximum information preserved.
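A small PCA sketch, using scikit-learn's built-in digits dataset as a stand-in for real high-dimensional data.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images flattened into 64-dimensional feature vectors
X, _ = load_digits(return_X_y=True)

# Keep just enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Original dimensions:", X.shape[1], "-> reduced:", X_reduced.shape[1])
print("Total explained variance:", pca.explained_variance_ratio_.sum())
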
L1 Regularization / Lasso Regression: Regularized version of linear regression that adds a penalty term proportional to the sum of the absolute values of the coefficients. L1 regularization encourages sparsity in the solution and thus reduces the chance of overfitting.
L2 Regularization / Ridge Regression: Similar in spirit to L1 regularization, but the penalty term is proportional to the sum of the squared coefficients. L2 regularization shrinks coefficients toward zero without driving them exactly to zero, so all features remain in the model while overfitting is reduced.
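A minimal comparison of the two penalties using scikit-learn; the diabetes dataset and the alpha value are arbitrary illustrative choices.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# L1 (Lasso) tends to drive some coefficients exactly to zero (a sparse solution)
lasso = Lasso(alpha=0.5).fit(X, y)
# L2 (Ridge) shrinks coefficients toward zero but usually keeps all of them non-zero
ridge = Ridge(alpha=0.5).fit(X, y)

print("Non-zero Lasso coefficients:", int((lasso.coef_ != 0).sum()), "of", X.shape[1])
print("Non-zero Ridge coefficients:", int((ridge.coef_ != 0).sum()), "of", X.shape[1])
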
Early Stopping: Technique that stops training a model after the validation loss starts increasing, indicating that further training does not yield significant improvements. Early stopping saves computational resources and prevents overfitting.
Learning Rate Scheduler: Adjusts the learning rate dynamically during training, allowing the optimizer to adapt to changing environments or losses.
Batch Normalization: Technique that normalizes the output of intermediate layers in a neural network to prevent vanishing gradients and improve generalization performance.
Dropout: Regularization technique that randomly drops out units during training to prevent co-adaptation of neurons and reduce overfitting.
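These four techniques map directly onto Keras layers and callbacks. The sketch below is a minimal, untuned example on random placeholder data, so the layer sizes and callback settings are assumptions rather than recommendations.

import numpy as np
import tensorflow as tf

# Random placeholder data standing in for a real binary classification dataset
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # normalize intermediate activations
    tf.keras.layers.Dropout(0.3),           # randomly drop units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Early stopping: halt once the validation loss stops improving
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    # Learning rate scheduling: reduce the learning rate when progress stalls
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
]
model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32,
          callbacks=callbacks, verbose=0)
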
Transfer Learning: A technique that leverages a model pre-trained on a related task to improve performance on a new task. It typically involves freezing the weights of the pre-trained layers and replacing the final layers with a custom head suited to the new task.
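A minimal transfer learning sketch with Keras, assuming an ImageNet-pretrained MobileNetV2 backbone and a hypothetical 7-class head; the input size and layer choices are illustrative only.

import tensorflow as tf

# Load a backbone pre-trained on ImageNet, without its original classification head
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False   # freeze the pre-trained weights

# Attach a custom head for a hypothetical 7-class task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
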
Multi-label Classification: Problem where each instance can belong to multiple categories simultaneously. Examples include image tagging, music genre recognition, object detection, and document categorization.
Ensembling: Combining the predictions of multiple models to obtain improved performance. Two popular ensemble methods are bagging and boosting. Bagging involves aggregating the predictions of multiple models by taking their average or majority vote. Boosting involves combining the predictions of models sequentially with weighted training samples.
Softmax Activation Function: A non-linear activation function that converts raw logits into normalized probabilities. Softmax is widely used in multi-class classification settings.
Mean Absolute Error (MAE): An evaluation metric that calculates the average of the absolute differences between predicted and actual values.
Mean Squared Error (MSE): An evaluation metric that calculates the squared differences between predicted and actual values and takes the average.
Root Mean Squared Error (RMSE): Square root of MSE. RMSE gives us an interpretable measure of the error in terms of the original scale of the response variable.
R-Squared (R^2): A statistical measure of how much variability in the dependent variable is explained by changes in the independent variable. The higher the R^2 value, the better the fit of the model to the data.
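A short sketch computing these regression metrics with scikit-learn on made-up predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up ground truth and predictions purely for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                     # RMSE is simply the square root of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
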
Accuracy: An evaluation metric that calculates the percentage of correct predictions made by the model. Accuracy alone doesn’t always tell the whole story; it is often misleading due to class imbalance issues.
Recall (Sensitivity) and True Positive Rate (TPR): Sensitivity measures how well the model can identify positive cases when the condition is actually positive. TPR = TP/(TP + FN).
Specificity and True Negative Rate (TNR): Specificity measures how well the model can identify negative cases when the condition is actually negative. TNR = TN/(TN + FP).
Precision and Positive Predictive Value (PPV): Precision measures the proportion of positive identifications that were actually correct. PPV = TP/(TP + FP).
Negative Predictive Value (NPV): NPV measures the proportion of negative identifications that were actually correct. NPV = TN/(TN + FN).
F1 Score: Harmonic mean of precision and recall, giving a single score that balances the two.
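A small worked example, using hypothetical confusion-matrix counts, showing how each of these rates follows from TP, TN, FP, and FN:

# Hypothetical counts for a binary classifier on a test set
tp, fn, fp, tn = 80, 20, 10, 90

tpr = tp / (tp + fn)                  # recall / sensitivity
tnr = tn / (tn + fp)                  # specificity
ppv = tp / (tp + fp)                  # precision
npv = tn / (tn + fn)
f1 = 2 * ppv * tpr / (ppv + tpr)      # harmonic mean of precision and recall
print(f"TPR={tpr:.2f} TNR={tnr:.2f} PPV={ppv:.2f} NPV={npv:.2f} F1={f1:.2f}")
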

Data Preprocessing

Preprocessing is the process of cleaning and transforming raw data into a suitable format that can be fed to machine learning algorithms. The following are some of the common steps involved in data preprocessing:

  1. Missing Values Handling: Dealing with missing values can significantly affect the accuracy of our model. There are various approaches to handle missing values, such as deletion, imputation, or interpolation.
  2. Encoding Categorical Variables: Categorical variables are discrete variables that take on a limited number of possible values. Before feeding them to our model, we need to convert them into numeric representations.
  3. Splitting Data: Once we have prepared the data, we need to divide it into training, validation, and test sets. The training set is used to train our model, the validation set is used to tune hyperparameters and evaluate the performance of our model, and the test set is reserved for evaluating the final performance of our model once it is deployed.
  4. Normalization: Normalization rescales the columns of the data (for example to zero mean and unit variance, or to a fixed range such as [0,1]), ensuring that each column contributes approximately the same amount to the prediction.
  5. Data Augmentation: Synthetic data generated from existing data can sometimes improve the performance of our model. Image augmentation techniques include rotation, shearing, zooming, brightness adjustment, contrast change etc.
    A minimal generic sketch of these steps is shown below; the case studies in the following sections then implement them in Python on real datasets.
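The sketch below walks through missing-value imputation, categorical encoding, splitting, and normalization with pandas and scikit-learn; the DataFrame, its columns, and the 60/20/20 split are hypothetical placeholders rather than anything tied to the case studies that follow.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "gender": ["F", "M", "M", "F"],
    "income": [40000, 52000, 61000, 58000],
    "churned": [0, 1, 0, 1],
})

# 1. Missing values: impute the column mean
df["age"] = df["age"].fillna(df["age"].mean())
# 2. Categorical encoding: one-hot encode the gender column
df = pd.get_dummies(df, columns=["gender"])
# 3. Splitting: divide into train/validation/test (roughly 60/20/20)
X, y = df.drop(columns="churned"), df["churned"]
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# 4. Normalization: fit the scaler on the training set only, then apply it everywhere
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)
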

Medical Images Classification

Case study to classify medical scan images into disease categories. The dataset contains thousands of labeled scans taken from patients who were diagnosed with conditions such as glaucoma, cataract, or diabetic retinopathy. The goal of this task is to create a model that automatically determines which of seven classes a given image belongs to: none, glaucoma, cataract, diabetic retinopathy, macular degeneration, blurry vision, or normal.
We start by loading and exploring the dataset. Here, we load the medical images dataset via the TensorFlow Datasets (tfds) library and extract some metadata about it.

import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
# Load the medical images dataset ('medmnist' follows the original text; adjust the
# name to whatever medical image dataset is available in your TFDS installation)
dataset, info = tfds.load('medmnist', with_info=True, as_supervised=True)
# Print the dataset description
print(info.description)
# Get the list of class names
class_names = ['none', 'cataract', 'diabetic retinopathy', 'glaucoma',
               'macular degeneration', 'blurry vision', 'normal']
# Visualize some examples
for i, (image, label) in enumerate(dataset['test'].take(3)):
    ax = plt.subplot(1, 3, i + 1)
    plt.imshow(image.numpy().squeeze(), cmap='gray')
    plt.title(class_names[int(label)])
    plt.axis("off")
plt.show()

MedMNIST is a large-scale collection of lightweight, standardized medical image datasets. In this case study we treat it as a seven-class problem, where each class corresponds to a different condition (glaucoma, cataract, diabetic retinopathy, macular degeneration, or blurry vision) or to an image with no abnormality.
Here, we see some sample images from the dataset. Let’s now preprocess the data by performing the following operations:

  1. Rescaling pixel values to [0,1]: We scale down the pixel values to [0,1] to normalize the distribution of pixel intensities amongst the entire dataset. This step ensures that each feature contributes approximately the same amount to the prediction.
  2. Converting labels to one-hot encoding: Since we have multiple classes, we encode the labels using one-hot encoding. This means that each label becomes a binary vector with only one element set to 1, corresponding to the class index.
    Finally, we split the dataset into training, validation, and test sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import tensorflow as tf
# Extract the data and labels
data, labels = [], []
for datapoint, label in dataset['train']:
    data.append(datapoint.numpy())
    labels.append(label.numpy())
# Convert the data to a numpy array and flatten each image into a feature vector
data = np.array(data)
data = data.reshape(len(data), -1)
# Scale the pixel values to [0,1]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
# Convert the labels to one-hot encoding
one_hot_labels = tf.keras.utils.to_categorical(np.array(labels), num_classes=len(class_names))
# Split the data into training, validation, and test sets
X_train, X_val, y_train, y_val = train_test_split(scaled_data, one_hot_labels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Testing set shape:", X_test.shape)

The printed shapes reflect a roughly 60/20/20 split of the examples into training, validation, and test sets, with each image flattened into a single feature vector (one entry per pixel).
Now, we can proceed to model development and architecture selection.

Text Sentiment Classification Using Logistic Regression

Case study to classify movie reviews as positive or negative. The dataset contains IMDB movie review dataset with binary labels (positive/negative). The task is to create a logistic regression model that automatically predicts the sentiment of a given movie review.
We start by loading and exploring the dataset. Here, we load the IMDb movie review dataset via the TensorFlow Datasets library and extract some metadata about it.

import tensorflow_datasets as tfds
import tensorflow as tf
# Load the IMDb movie review dataset
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
# Get the list of class names (in imdb_reviews, label 0 is negative and 1 is positive)
class_names = ['negative', 'positive']
# Check the first few examples
for review, label in dataset['train'].take(3):
    print("Review:", review.numpy().decode())
    print("Label:", label.numpy(), "\n")
# Compute the number of unique words in the corpus
vocab_size = len(set(word.lower()
                     for review, _ in dataset['train']
                     for word in review.numpy().decode().split()))
print("Number of unique words in the corpus:", vocab_size)

Review: This film had me sitting there in tears! It’s almost funny watching someone else hate something, especially in front of yourself.
Label: 0
Review: When I saw the poster for the upcoming release, my heart skipped a beat. I instantly knew I wanted to see again.
Label: 0
Review: This is such a terrible idea… why didn’t you think of putting Green Berets on your payroll? They’re loyal and happy.
Label: 1
Number of unique words in the corpus: 766601
Here, we can observe some sample movie reviews and their corresponding labels. We also notice that the vocabulary size is quite large. To reduce it, we can strip punctuation and filter out stopwords before building the vocabulary.
Next, we preprocess the data by performing the following operations:

  1. Tokenization: We break each review into individual tokens, represented by integers.
  2. Padding: We add padding to the sequences to ensure that every sequence has the same length.
  3. Conversion to tensors: We convert the token lists into tensors that can be processed by the model.
    Finally, we split the dataset into training, validation, and test sets.
import string
import numpy as np
import tensorflow as tf
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

# Download the stopword list once (a no-op if it is already cached)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Define a function to tokenize a review: lowercase, strip punctuation, drop stopwords
def tokenizer(text):
    text = ''.join(c.lower() for c in text if c not in string.punctuation)
    return [w for w in text.split() if w not in stop_words]
# Decode the raw reviews and labels from the TFDS split
reviews, labels = [], []
for review, label in dataset['train']:
    reviews.append(tokenizer(review.numpy().decode()))
    labels.append(label.numpy())
# Compute the vocabulary size after tokenization and stopword filtering
vocab_size = len(set(w for review in reviews for w in review))
print("Vocabulary size after tokenization and filtering stopwords:", vocab_size)
# Map tokens to integer ids (pad_sequences needs integer sequences, so the Keras
# Tokenizer is used here as an adaptation of the original sketch)
keras_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)
keras_tokenizer.fit_on_texts(reviews)
sequences = keras_tokenizer.texts_to_sequences(reviews)
# Pad every sequence to the length of the longest review
pad_length = max(len(seq) for seq in sequences)
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=pad_length, padding="post", truncating="post")
# Split the padded sequences into training, validation, and test sets
X_train, X_val, y_train, y_val = train_test_split(
    padded_sequences, np.array(labels), test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
# Define a function to convert arrays to tensors
def to_tensor(x, y):
    x = tf.convert_to_tensor(x, dtype=tf.int32)
    y = tf.convert_to_tensor(y, dtype=tf.float32)
    return x, y
# Convert the splits to tensors
X_train, y_train = to_tensor(X_train, y_train)
X_val, y_val = to_tensor(X_val, y_val)
X_test, y_test = to_tensor(X_test, y_test)
print("Training set shapes:")
print("    Input:", X_train.shape)
print("    Output:", y_train.shape)
print("Validation set shapes:")
print("    Input:", X_val.shape)
print("    Output:", y_val.shape)
print("Testing set shapes:")
print("    Input:", X_test.shape)
print("    Output:", y_test.shape)

Vocabulary size after tokenization and filtering stopwords: 43842
With the 25,000 reviews in the training split divided roughly 60/20/20, each input tensor has shape (num_examples, pad_length) for the training, validation, and test sets, and each label tensor is one-dimensional with the matching number of examples.
Now, we can move ahead to model development and architecture selection.
