Sign Language Translation System
The author selected Code Org to receive a donation as part of the Write for DOnations program.
Computer vision is a subfield of computer science that aims to extract a higher-order understanding from images and videos. This powers technologies such as fun video chat filters, your mobile device’s face authenticator, and self-driving cars.
In this tutorial, you’ll use computer vision to build an American Sign Language translator for your webcam. As you work through the tutorial, you’ll use OpenCV, a computer-vision library, PyTorch to build a deep neural network, and onnx to export your neural network. You’ll also apply the following concepts as you build a computer-vision application:
You’ll use the same three-step method as in the How To Apply Computer Vision to Build an Emotion-Based Dog Filter tutorial: preprocess a dataset, train a model, and evaluate the model.
Along the way, you’ll also explore related concepts in machine learning.
By the end of this tutorial, you’ll have both an American Sign Language translator and foundational deep learning know-how. You can also access the complete source code for this project.
To complete this tutorial, you will need the following:
A local development environment for Python 3 with at least 1GB of RAM. You can follow How to Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
(Recommended) Build an Emotion-Based Dog Filter; that tutorial is not required here, but this one reinforces and builds on the same ideas.
Let’s create a workspace for this project and install the dependencies we’ll need.
On Linux distributions, start by preparing your system package manager and install the Python3 virtualenv package. Use:
We’ll call our workspace SignLanguage:
Navigate to the SignLanguage directory:
Then create a new virtual environment for the project:
python3 -m venv signlanguage
Activate your environment:
source signlanguage/bin/activate
Then install PyTorch, a deep-learning framework for Python that we’ll use in this tutorial.
On macOS, install PyTorch with the following command:
On Linux and Windows, use the following commands for a CPU-only build:
Now install prepackaged binaries for OpenCV, numpy, onnx, and onnxruntime, which are libraries for computer vision, linear algebra, AI model exporting, and AI model execution, respectively. OpenCV offers utilities such as image rotations, and numpy offers linear algebra utilities such as matrix inversion:
On Linux distributions, you will need to install libSM.so:
With the dependencies installed, let’s build the first version of our sign language translator: a sign language classifier.
In these next three sections, you’ll build a sign language classifier using a neural network. Your goal is to produce a model that accepts a picture of a hand as input and outputs a letter.
The following three steps are required to build a machine learning classification model:
Preprocess the data: Apply one-hot encoding to your labels and wrap your data in PyTorch Tensors. Train your model on augmented data to prepare it for “unusual” input, like an off-center or rotated hand.
Specify and train the model: Set up a neural network using PyTorch. Define training hyperparameters, such as how long to train for, and run stochastic gradient descent. You’ll also vary one specific training hyperparameter, the learning rate schedule, to boost model accuracy.
In this section of the tutorial, you will accomplish step 1 of 3. You will download the data, create a Dataset object to iterate over your data, and finally apply data augmentation. At the end of this step, you will have a programmatic way of accessing images and labels in your dataset to feed to your model.
First, download the dataset to your current working directory:
Note: On macOS, wget is not available by default. To install it, first install Homebrew by following this DigitalOcean tutorial. Then, run brew install wget.
Unzip the zip file, which contains a data/ directory:
Create a new file, named step_2_dataset.py:
As before, import the necessary utilities and create the class that will hold your data. For data processing here, you will create the train and test datasets. You’ll implement PyTorch’s Dataset interface, allowing you to load and use PyTorch’s built-in data pipeline for your sign language classification dataset:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torchvision.transforms as transforms
import torch.nn as nn
import numpy as np
import torch
from typing import List
import csv


class SignLanguageMNIST(Dataset):
    """Sign Language classification dataset.

    Utility for loading Sign Language dataset into PyTorch. Dataset posted on
    Kaggle in 2017, by an unnamed author with username `tecperson`:
    https://www.kaggle.com/datamunge/sign-language-mnist

    Each sample is 1 x 1 x 28 x 28, and each label is a scalar.
    """
    pass
Delete the pass placeholder in the SignLanguageMNIST class. In its place, add a method to generate a label mapping:
    @staticmethod
    def get_label_mapping():
        """
        We map all labels to [0, 23]. This mapping from dataset labels [0, 23]
        to letter indices [0, 25] is returned below.
        """
        mapping = list(range(25))
        mapping.pop(9)
        return mapping
Labels range from 0 to 25, one per letter. However, letters J (9) and Z (25) are excluded, so there are only 24 valid label values. So that the set of all label values starting from 0 is contiguous, we map all labels to [0, 23]. The get_label_mapping method provides this mapping from dataset labels [0, 23] to letter indices [0, 25].
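To make the mapping concrete, the following sketch rebuilds the same list and uses it to turn a contiguous class index back into a letter. The ALPHABET constant and the index_to_letter helper are illustrative names that do not appear in the tutorial's files.

```python
import string

# Hypothetical constant for illustration: position i holds the i-th letter.
ALPHABET = string.ascii_uppercase


def get_label_mapping():
    """Same logic as the tutorial's method: letter indices 0-24 without J (9)."""
    mapping = list(range(25))
    mapping.pop(9)
    return mapping


def index_to_letter(index: int) -> str:
    """Hypothetical helper: contiguous class index in [0, 23] -> letter."""
    return ALPHABET[get_label_mapping()[index]]


mapping = get_label_mapping()
letters = [index_to_letter(i) for i in range(len(mapping))]
print(len(mapping))      # number of classes
print("".join(letters))  # every letter except J and Z
```

Going the other way, mapping.index(letter_index) is what the CSV reader uses to convert a raw dataset label into a contiguous class index.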
Next, add a method to extract labels and samples from a CSV file. The following assumes that each line starts with the label and is then followed by 784 pixel values. These 784 pixel values represent a 28x28 image:
    @staticmethod
    def read_label_samples_from_csv(path: str):
        """
        Assumes first column in CSV is the label and subsequent 28^2 values
        are image pixel values 0-255.
        """
        mapping = SignLanguageMNIST.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip header
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples
For an explanation of how these 784 values represent an image, see Build an Emotion-Based Dog Filter, Step 4.
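As a rough illustration of that layout, the sketch below parses one made-up CSV row from an in-memory string and folds its 784 values into 28 rows of 28 pixels. The file contents are invented for the example, and the row-major ordering is an assumption that matches how the tutorial later reshapes the samples with NumPy.

```python
import csv
import io

# A made-up two-line CSV: a header row, then one sample with label 3
# followed by 784 pixel values (here simply 0, 1, 2, ... modulo 256).
header = ["label"] + [f"pixel{i}" for i in range(784)]
row = [3] + [i % 256 for i in range(784)]
fake_file = io.StringIO(",".join(header) + "\n" + ",".join(map(str, row)) + "\n")

next(fake_file)  # skip header, as the tutorial's reader does
for line in csv.reader(fake_file):
    label = int(line[0])
    pixels = list(map(int, line[1:]))
    # Row-major layout: the first 28 values form the top row of the image.
    image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]

print(label, len(image), len(image[0]))
```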
Note that each line in the csv.reader iterable is a list of strings; the int and map(int, ...) invocations cast all strings to integers. Directly beneath our static method, add a function that will initialize our data holder:
    def __init__(self,
            path: str="data/sign_mnist_train.csv",
            mean: List[float]=[0.485],
            std: List[float]=[0.229]):
        """
        Args:
            path: Path to `.csv` file containing `label`, `pixel0`, `pixel1`...
        """
        labels, samples = SignLanguageMNIST.read_label_samples_from_csv(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std
This function starts by loading the samples and labels. Then it wraps the data in NumPy arrays. The mean and standard deviation information will be explained shortly, in the __getitem__ section following.
Directly after the __init__ function, add a __len__ function. The Dataset requires this method to determine when to stop iterating over data:
    ...
    def __len__(self):
        return len(self._labels)
Finally, add a __getitem__ method, which returns a dictionary containing the sample and the label:
    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.2)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])

        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }
You use a technique called data augmentation, where samples are perturbed during training, to increase the model’s robustness to these perturbations. In particular, RandomResizedCrop randomly zooms in on the image by varying amounts and at different locations. Note that zooming in should not affect the final sign language class; thus, the label is not transformed. You additionally normalize the inputs: ToTensor rescales pixel values from [0, 255] to [0, 1], and Normalize then subtracts the dataset _mean and divides by _std.
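The arithmetic behind the normalization can be checked by hand for a single pixel. This is a minimal sketch, assuming the default mean and standard deviation from the __init__ signature; normalize_pixel is an illustrative helper, not a torchvision function.

```python
MEAN, STD = 0.485, 0.229  # defaults from the __init__ signature


def normalize_pixel(value: int) -> float:
    """Hypothetical helper mirroring ToTensor followed by Normalize for one pixel."""
    scaled = value / 255.0        # ToTensor: [0, 255] -> [0.0, 1.0]
    return (scaled - MEAN) / STD  # Normalize: subtract mean, divide by std


darkest = normalize_pixel(0)
brightest = normalize_pixel(255)
print(round(darkest, 4), round(brightest, 4))
```

This is why the tensors printed later in this step contain negative values and values above 1.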
Your completed SignLanguageMNIST class will look like the following:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torchvision.transforms as transforms
import torch.nn as nn
import numpy as np
import torch

from typing import List

import csv


class SignLanguageMNIST(Dataset):
    """Sign Language classification dataset.

    Utility for loading Sign Language dataset into PyTorch. Dataset posted on
    Kaggle in 2017, by an unnamed author with username `tecperson`:
    https://www.kaggle.com/datamunge/sign-language-mnist

    Each sample is 1 x 1 x 28 x 28, and each label is a scalar.
    """

    @staticmethod
    def get_label_mapping():
        """
        We map all labels to [0, 23]. This mapping from dataset labels [0, 23]
        to letter indices [0, 25] is returned below.
        """
        mapping = list(range(25))
        mapping.pop(9)
        return mapping

    @staticmethod
    def read_label_samples_from_csv(path: str):
        """
        Assumes first column in CSV is the label and subsequent 28^2 values
        are image pixel values 0-255.
        """
        mapping = SignLanguageMNIST.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip header
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples

    def __init__(self,
            path: str="data/sign_mnist_train.csv",
            mean: List[float]=[0.485],
            std: List[float]=[0.229]):
        """
        Args:
            path: Path to `.csv` file containing `label`, `pixel0`, `pixel1`...
        """
        labels, samples = SignLanguageMNIST.read_label_samples_from_csv(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.2)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])

        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }
As before, you will now verify our dataset utility functions by loading the SignLanguageMNIST dataset. Add the following code to the end of your file after the SignLanguageMNIST class:
def get_train_test_loaders(batch_size=32):
    trainset = SignLanguageMNIST('data/sign_mnist_train.csv')
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)

    testset = SignLanguageMNIST('data/sign_mnist_test.csv')
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)
    return trainloader, testloader
This code initializes the dataset using the SignLanguageMNIST class. Then for the train and validation sets, it wraps the dataset in a DataLoader. This will translate the dataset into an iterable to use later.
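Conceptually, a non-shuffling loader walks the dataset in consecutive chunks of batch_size, with a smaller final chunk when the sizes do not divide evenly. The sketch below imitates only that behavior in plain Python; iterate_batches is a hypothetical stand-in and omits what the real DataLoader adds, such as shuffling and collating samples into tensors.

```python
from typing import Iterator, List, Sequence


def iterate_batches(samples: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a non-shuffling loader: yield consecutive chunks."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


batches = list(iterate_batches(range(5), batch_size=2))
print(batches)
```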
Now you’ll verify that the dataset utilities are functioning. Create a sample dataset loader using DataLoader and print the first element of that loader. Add the following to the end of your file:
if __name__ == '__main__':
    loader, _ = get_train_test_loaders(2)
    print(next(iter(loader)))
You can check that your file matches the step_2_dataset file in this repository. Exit your editor and run the script with the following:
This outputs the following pair of tensors. Our data pipeline outputs two samples and two labels. This indicates that our data pipeline is up and ready to go:
Output
{'image': tensor([[[[ 0.4337, 0.5022, 0.5707, ..., 0.9988, 0.9646, 0.9646],
[ 0.4851, 0.5536, 0.6049, ..., 1.0502, 1.0159, 0.9988],
[ 0.5364, 0.6049, 0.6392, ..., 1.0844, 1.0844, 1.0673],
...,
[-0.5253, -0.4739, -0.4054, ..., 0.9474, 1.2557, 1.2385],
[-0.3369, -0.3369, -0.3369, ..., 0.0569, 1.3584, 1.3242],
[-0.3712, -0.3369, -0.3198, ..., 0.5364, 0.5364, 1.4783]]],
[[[ 0.2111, 0.2796, 0.3481, ..., 0.2453, -0.1314, -0.2342],
[ 0.2624, 0.3309, 0.3652, ..., -0.3883, -0.0629, -0.4568],
[ 0.3309, 0.3823, 0.4337, ..., -0.4054, -0.0458, -1.0048],
...,
[ 1.3242, 1.3584, 1.3927, ..., -0.4054, -0.4568, 0.0227],
[ 1.3242, 1.3927, 1.4612, ..., -0.1657, -0.6281, -0.0287],
[ 1.3242, 1.3927, 1.4440, ..., -0.4397, -0.6452, -0.2856]]]]), 'label': tensor([[24.],
[11.]])}
You’ve now verified that your data pipeline works. This concludes the first step—preprocessing your data—which now includes data augmentation for increased model robustness. Next you will define the neural network and optimizer.
With a functioning data pipeline, you will now define a model and train it on the data. In particular, you will build a neural network with six layers, define a loss, an optimizer, and finally, optimize the loss function for your neural network predictions. At the end of this step, you will have a working sign language classifier.
Create a new file called step_3_train.py:
Import the necessary utilities:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
from step_2_dataset import get_train_test_loaders
Define a PyTorch neural network that includes three convolutional layers, followed by three fully connected layers. Add this to the end of your existing script:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
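The 16 * 5 * 5 input size of fc1 follows from how each layer shrinks a 28x28 image. The sketch below traces the side length through the network, assuming the defaults the code relies on: unpadded, stride-1 convolutions with a 3x3 kernel and 2x2 max pooling. The helper names conv_out and pool_out are illustrative.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Side length after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size: int, window: int = 2) -> int:
    """Side length after non-overlapping max pooling."""
    return size // window


side = 28
side = conv_out(side)            # conv1: 28 -> 26
side = pool_out(conv_out(side))  # conv2 + pool: 26 -> 24 -> 12
side = pool_out(conv_out(side))  # conv3 + pool: 12 -> 10 -> 5
flattened = 16 * side * side     # 16 output channels from conv3
print(side, flattened)
```

If you change the kernel size, add padding, or feed larger images, this number changes and fc1 must be resized to match.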
Now initialize the neural network, define a loss function, and define optimization hyperparameters by adding the following code to the end of the script:
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
Finally, you’ll train for two epochs:
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(2):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
    torch.save(net.state_dict(), "checkpoint.pth")
You define an epoch to be an iteration of training where every training sample has been used exactly once. At the end of the main function, the model parameters will be saved to a file called "checkpoint.pth".
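To get a feel for how many parameter updates an epoch involves, the sketch below counts batches per epoch. The sample count of 27,455 is an assumption about the size of the training CSV, not something stated in this tutorial; substitute the length of your own dataset.

```python
import math

NUM_TRAIN_SAMPLES = 27_455  # assumed size of the training CSV
BATCH_SIZE = 32             # default of get_train_test_loaders

batches_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / BATCH_SIZE)
total_updates = 2 * batches_per_epoch  # two epochs
print(batches_per_epoch, total_updates)
```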
Add the following code to the end of your script to extract image and label from the dataset loader and then wrap each in a PyTorch Variable:
def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))
This code will also run the forward pass and then backpropagate through the loss and neural network.
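To see what the backward pass and optimizer step amount to, here is a minimal one-dimensional sketch of gradient descent with momentum on a toy loss. It uses the same learning rate and momentum as the tutorial's optimizer, but the function sgd_momentum and the quadratic loss are invented for illustration and stand in for the full network.

```python
def sgd_momentum(grad_fn, w: float, lr: float = 0.01, momentum: float = 0.9, steps: int = 500) -> float:
    """Minimize a 1-D function with the update v = m*v + g; w = w - lr*v."""
    velocity = 0.0
    for _ in range(steps):
        grad = grad_fn(w)                       # "backward": gradient of the loss at w
        velocity = momentum * velocity + grad   # accumulate past gradients
        w = w - lr * velocity                   # "step": move against the accumulated gradient
    return w


# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 6))
```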
At the end of your file, add the following to invoke the main function:
if __name__ == '__main__':
    main()
Double-check that your file matches the following:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch

from step_2_dataset import get_train_test_loaders


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)  # 24 classes: every letter except J and Z

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(2):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
    torch.save(net.state_dict(), "checkpoint.pth")


def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))


if __name__ == '__main__':
    main()
Save and exit. Then, launch our proof-of-concept training by running:
You’ll see output akin to the following as the neural network trains:
Output
[0, 0] loss: 3.208171
[0, 100] loss: 3.211070
[0, 200] loss: 3.192235
[0, 300] loss: 2.943867
[0, 400] loss: 2.569440
[0, 500] loss: 2.243283
[0, 600] loss: 1.986425
[0, 700] loss: 1.768090
[0, 800] loss: 1.587308
[1, 0] loss: 0.254097
[1, 100] loss: 0.208116
[1, 200] loss: 0.196270
[1, 300] loss: 0.183676
[1, 400] loss: 0.169824
[1, 500] loss: 0.157704
[1, 600] loss: 0.151408
[1, 700] loss: 0.136470
[1, 800] loss: 0.123326
To obtain lower loss, you could increase the number of epochs to 5, 10, or even 20. However, after a certain period of training time, the network loss will cease to decrease with increased training time. To sidestep this issue, as training time increases, you will introduce a learning rate schedule, which decreases learning rate over time. To understand why this works, see Distill’s visualization at “Why Momentum Really Works”.
Amend your main function with the following two lines, defining a scheduler and invoking scheduler.step. Furthermore, change the number of epochs to 12:
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(12):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
        scheduler.step()
    torch.save(net.state_dict(), "checkpoint.pth")
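With step_size=10 and gamma=0.1, the schedule multiplies the learning rate by 0.1 after every 10 epochs. The sketch below reproduces that rule in plain Python so you can see which rate each of the 12 epochs trains with; step_lr is an illustrative helper, not the PyTorch scheduler itself, and it assumes the scheduler is stepped once per epoch as in the code above.

```python
def step_lr(epoch: int, base_lr: float = 0.01, step_size: int = 10, gamma: float = 0.1) -> float:
    """Learning rate in effect during `epoch` when the scheduler steps once per epoch."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(epoch) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print(epoch, f"{lr:.4f}")
```

Under these settings only the last two epochs run at the reduced rate of 0.001.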
Check that your file matches the step 3 file in this repository. Training will run for around 5 minutes. Your output will resemble the following:
Output
[0, 0] loss: 3.208171
[0, 100] loss: 3.211070
[0, 200] loss: 3.192235
[0, 300] loss: 2.943867
[0, 400] loss: 2.569440
[0, 500] loss: 2.243283
[0, 600] loss: 1.986425
[0, 700] loss: 1.768090
[0, 800] loss: 1.587308
...
[11, 0] loss: 0.000302
[11, 100] loss: 0.007548
[11, 200] loss: 0.009005
[11, 300] loss: 0.008193
[11, 400] loss: 0.007694
[11, 500] loss: 0.008509
[11, 600] loss: 0.008039
[11, 700] loss: 0.007524
[11, 800] loss: 0.007608
The final loss obtained is 0.007608, which is more than two orders of magnitude smaller than the starting loss of 3.20. This concludes the second step of our workflow, where we set up and train the neural network. With that said, as small as this loss value is, it has little meaning on its own. To put the model’s performance in perspective, we will compute its accuracy: the percentage of images the model correctly classified.
You will now evaluate your sign language classifier by computing its accuracy on the validation set, a set of images the model did not see during training. This will provide a better sense of model performance than the final loss value did. Furthermore, you will add utilities to save our trained model at the end of training and load our pre-trained model when performing inference.
Create a new file, called step_4_evaluate.py.
Import the necessary utilities:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import numpy as np
import onnx
import onnxruntime as ort
from step_2_dataset import get_train_test_loaders
from step_3_train import Net
Next, define a utility to evaluate the neural network’s performance. The following function compares the neural network’s predicted letter to the true letter, for a single image:
def evaluate(outputs: Variable, labels: Variable) -> float:
    """Evaluate neural network outputs against non-one-hotted labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))
outputs is a list of class probabilities for each sample. For example, outputs for a single sample may be [0.1, 0.3, 0.4, 0.2]. labels is a list of label classes. For example, the label class may be 3.
Y = ... converts the labels into a NumPy array. Next, Yhat = np.argmax(...) converts the outputs class probabilities into predicted classes. For example, the list of class probabilities [0.1, 0.3, 0.4, 0.2] would yield the predicted class 2, because the index 2 value of 0.4 is the largest value.
Since both Y and Yhat are now classes, you can compare them. Yhat == Y checks if the predicted class matches the label class, and np.sum(...) is a trick that computes the number of truthy values. In other words, np.sum will output the number of samples that were classified correctly.
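The same argmax-and-count logic can be traced without NumPy on a handful of made-up predictions. In this sketch, argmax and count_correct are illustrative pure-Python analogues of np.argmax and evaluate, and the probabilities and labels are invented.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def count_correct(outputs: List[Sequence[float]], labels: List[int]) -> int:
    """Hypothetical pure-Python analogue of evaluate()."""
    predictions = [argmax(row) for row in outputs]
    return sum(pred == label for pred, label in zip(predictions, labels))


outputs = [[0.1, 0.3, 0.4, 0.2],   # predicted class 2
           [0.7, 0.1, 0.1, 0.1],   # predicted class 0
           [0.2, 0.2, 0.1, 0.5]]   # predicted class 3
labels = [2, 1, 3]
print(count_correct(outputs, labels))
```

Summing booleans works here for the same reason np.sum does: each True counts as 1.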
Add the second function batch_evaluate, which applies the first function evaluate to all images:
def batch_evaluate(
        net: Net,
        dataloader: torch.utils.data.DataLoader) -> float:
    """Evaluate neural network in batches, if dataset is too large."""
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n
batch is a group of images stored as a single tensor. First, you increment the total number of images you’re evaluating (n) by the number of images in this batch. Next, you run inference on the neural network with this batch of images, outputs = net(...). The type check if isinstance(...) converts the outputs into a NumPy array if needed. Finally, you use evaluate to compute the number of correctly classified samples. At the conclusion of the function, you compute the fraction of samples you correctly classified, score / n.
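Accumulating the count of correct samples and dividing once at the end matters when the last batch is smaller than the others. The sketch below uses made-up per-batch counts to compare that pooled accuracy with a naive mean of per-batch accuracies, which would give the small batch too much weight.

```python
# (correct, total) per batch; the last batch is smaller, as happens when the
# dataset size is not a multiple of the batch size. Numbers are made up.
batches = [(30, 32), (31, 32), (1, 4)]

score = sum(correct for correct, _ in batches)
n = sum(total for _, total in batches)
pooled_accuracy = score / n  # what batch_evaluate computes

mean_of_batch_accuracies = sum(c / t for c, t in batches) / len(batches)
print(round(pooled_accuracy, 4), round(mean_of_batch_accuracies, 4))
```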
Finally, add the following script to leverage the preceding utilities:
def validate():
    trainloader, testloader = get_train_test_loaders()
    net = Net().float().eval()  # eval mode, matching the complete listing

    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()
This loads a pretrained neural network and evaluates its performance on the provided sign language dataset. Specifically, the script here outputs accuracy on the images you used for training and a separate set of images you put aside for testing purposes, called the validation set.
You will next export the PyTorch model to an ONNX binary. This binary file can then be used in production to run inference with your model. Most importantly, the code running this binary does not need a copy of the original network definition. At the end of the validate function, add the following:
    trainloader, testloader = get_train_test_loaders(1)

    # export to onnx
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check exported model
    model = onnx.load(fname)
    onnx.checker.check_model(model)  # check model is well-formed

    # create runnable session with exported model
    ort_session = ort.InferenceSession(fname)
    net = lambda inp: ort_session.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)
This exports the ONNX model, checks the exported model, and then runs inference with the exported model. Double-check that your file matches the step 4 file in this repository:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import numpy as np

import onnx
import onnxruntime as ort

from step_2_dataset import get_train_test_loaders
from step_3_train import Net


def evaluate(outputs: Variable, labels: Variable) -> float:
    """Evaluate neural network outputs against non-one-hotted labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))


def batch_evaluate(
        net: Net,
        dataloader: torch.utils.data.DataLoader) -> float:
    """Evaluate neural network in batches, if dataset is too large."""
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n


def validate():
    trainloader, testloader = get_train_test_loaders()
    net = Net().float().eval()

    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)

    trainloader, testloader = get_train_test_loaders(1)

    # export to onnx
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check exported model
    model = onnx.load(fname)
    onnx.checker.check_model(model)  # check model is well-formed

    # create runnable session with exported model
    ort_session = ort.InferenceSession(fname)
    net = lambda inp: ort_session.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()
To use and evaluate the checkpoint from the last step, run the following:
This will yield output similar to the following, affirming that your exported model not only works, but also agrees with your original PyTorch model:
Output
========== PyTorch ==========
Training accuracy: 99.9
Validation accuracy: 97.4
========== ONNX ==========
Training accuracy: 99.9
Validation accuracy: 97.4
Your neural network attains a training accuracy of 99.9% and a validation accuracy of 97.4%. This gap between training and validation accuracy indicates that your model is overfitting: instead of learning only generalizable patterns, it has partly memorized the training data. To understand the implications and causes of overfitting, see Understanding Bias-Variance Tradeoffs.
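As a rough illustration of how accuracy figures like these are computed, the sketch below mirrors the argmax-and-compare logic of `evaluate` in plain Python, without PyTorch. The function names `accuracy` and `generalization_gap` and the toy scores are made up for this example and are not part of the tutorial's code.

```python
def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    correct = 0
    for scores, label in zip(outputs, labels):
        # argmax over the class scores of one sample
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        correct += int(predicted == label)
    return correct / len(labels)


def generalization_gap(train_acc, val_acc):
    """Training accuracy minus validation accuracy, in percentage points."""
    return round(train_acc - val_acc, 1)


# Toy scores for three samples over two classes (hypothetical values).
toy_outputs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]

print(accuracy(toy_outputs, toy_labels))   # two of the three predictions match
print(generalization_gap(99.9, 97.4))      # the gap between the reported accuracies
```

A larger gap suggests more memorization of the training set; a gap near zero suggests the model behaves similarly on data it has not seen.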
At this point, we have completed a sign language classifier. In essence, our model can correctly disambiguate between signs almost all the time. This is a reasonably good model, so we move on to the final stage of our application. We will use this sign language classifier in a real-time webcam application.
Your next objective is to link the computer’s camera to your sign language classifier. You will collect camera input, classify the displayed sign language, and then report the classified sign back to the user.
Now create a Python script for the camera application. Create the file step_6_camera.py using nano or your favorite text editor:
Add the following code into the file:
"""Test for sign language classification"""
import cv2
import numpy as np
import onnxruntime as ort


def main():
    pass


if __name__ == '__main__':
    main()
This code imports OpenCV, which contains your image utilities, and the ONNX runtime, which is all you need to run inference with your model. The rest of the code is typical Python program boilerplate.
Now replace pass in the main function with the following code, which initializes a sign language classifier using the parameters you trained previously. Additionally, add a mapping from indices to letters, along with the image statistics used for normalization:
def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")
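The `index_to_letter` list has 24 entries rather than 26. As background (this section does not state it), J and Z are signed with motion, so they cannot be represented by a single static image and are left out of the mapping. The sketch below rebuilds the same list programmatically to make the skipped letters explicit; it is only an illustration and uses nothing beyond the standard library.

```python
import string

# All uppercase letters except J and Z, which are signed with motion.
index_to_letter = [c for c in string.ascii_uppercase if c not in "JZ"]

# Same contents as the literal used in main().
assert index_to_letter == list("ABCDEFGHIKLMNOPQRSTUVWXY")

print(len(index_to_letter))                      # number of classes
print(index_to_letter[8], index_to_letter[9])    # the list jumps from I to K
```

Because J is skipped, a class index is not the same as the letter's position in the alphabet for anything after I.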
You will use elements of this test script from the official OpenCV documentation. Specifically, you will update the body of the main function. Start by initializing a VideoCapture object that is set to capture a live feed from your computer’s camera. Place this at the end of the main function:
def main():
    ...
    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")

    cap = cv2.VideoCapture(0)
Then add a while loop, which reads from the camera at every timestep:
def main():
    ...
    cap = cv2.VideoCapture(0)
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
Write a utility function that takes the center crop of the camera frame. Place this function before main:
def center_crop(frame):
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        frame = frame[start: start + w]
    else:
        frame = frame[:, start: start + h]
    return frame
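To see which pixels `center_crop` keeps, the following sketch computes only the slice bounds, using the same arithmetic but no image data. The helper name `center_crop_bounds` and the 480x640 frame size are assumptions chosen for illustration; your camera may report a different resolution.

```python
def center_crop_bounds(h, w):
    """Return (row_start, row_end, col_start, col_end) of the center square."""
    start = abs(h - w) // 2
    if h > w:
        # Taller than wide: trim rows from the top and bottom.
        return start, start + w, 0, w
    # Wider than tall (or square): trim columns from the left and right.
    return 0, h, start, start + h


print(center_crop_bounds(480, 640))   # landscape frame
print(center_crop_bounds(640, 480))   # portrait frame
```

In both cases the result is a square whose side equals the shorter dimension, which avoids distorting the hand when the frame is later resized to 28x28.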
Next, take the center crop of the camera frame, convert it to grayscale, resize it to 28x28, and normalize it. Place this inside the while loop within the main function:
def main():
    ...
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()

        # preprocess data
        frame = center_crop(frame)
        # OpenCV delivers camera frames in BGR channel order
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        x = cv2.resize(frame, (28, 28))
        # normalize the resized 28x28 image, not the full-size frame
        x = (x - mean) / std
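The normalization step subtracts a mean and divides by a standard deviation, both expressed on the 0 to 255 pixel scale. The sketch below applies the same formula to single pixel values in plain Python to show the resulting range; the helper name `normalize` is hypothetical and exists only for this illustration.

```python
MEAN = 0.485 * 255.0   # same constants as in main()
STD = 0.229 * 255.0


def normalize(pixel):
    """Shift and scale one 0-255 grayscale value the way main() does."""
    return (pixel - MEAN) / STD


for pixel in (0, 128, 255):
    print(pixel, round(normalize(pixel), 3))
```

Black pixels map to roughly -2.1 and white pixels to roughly 2.2, so the network sees inputs centered near zero rather than raw 0 to 255 intensities.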
Still within the while loop, run inference with the ONNX runtime. Convert the outputs to a class index, then to a letter:
        ...
        x = (x - mean) / std

        x = x.reshape(1, 1, 28, 28).astype(np.float32)
        y = ort_session.run(None, {'input': x})[0]

        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index)]
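The conversion from model output to letter can be sketched without NumPy or ONNX. The example below assumes the output is a batch containing one vector of 24 class scores, which matches the `(1, 1, 28, 28)` input used here; the function name `predict_letter` and the score values are invented for illustration.

```python
INDEX_TO_LETTER = list("ABCDEFGHIKLMNOPQRSTUVWXY")


def predict_letter(batch_scores):
    """Map a batch holding one score vector (shape 1 x 24) to a letter."""
    scores = batch_scores[0]
    # argmax: position of the largest score
    index = max(range(len(scores)), key=lambda i: scores[i])
    return INDEX_TO_LETTER[index]


# Hypothetical scores where class 10 is the most likely.
scores = [[0.0] * 24]
scores[0][10] = 5.0
print(predict_letter(scores))
```

Only the position of the largest score matters for the predicted letter, so the raw scores do not need to be converted to probabilities first.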
Display the predicted letter inside the frame, and display the frame back to the user:
        ...
        letter = index_to_letter[int(index)]

        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 255, 0), thickness=2)
        cv2.imshow("Sign Language Translator", frame)
At the end of the while loop, add this code to check whether the user has pressed the q key and, if so, quit the application. The call to cv2.waitKey(1) waits up to 1 millisecond for a key press before the loop continues. Add the following:
        ...
        cv2.imshow("Sign Language Translator", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
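The quit check combines a bit mask with a comparison. In Python, `&` binds more tightly than `==`, so the expression is evaluated as `(cv2.waitKey(1) & 0xFF) == ord('q')`. The sketch below reproduces that test on plain integers; `is_quit_key` is a hypothetical helper, and the key code with extra high bits is a made-up value standing in for what some platforms are reported to return.

```python
def is_quit_key(key_code, quit_char="q"):
    """Return True when the low byte of key_code matches quit_char."""
    return (key_code & 0xFF) == ord(quit_char)


print(is_quit_key(ord("q")))              # plain key code for q
print(is_quit_key(-1))                    # waitKey returns -1 when no key was pressed
print(is_quit_key(0x100000 | ord("q")))   # extra high bits are masked away
```

Masking with 0xFF keeps only the lowest 8 bits, so the comparison still works if the returned integer carries additional flag bits.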
Finally, release the capture and close all windows. Place this outside of the while loop to end the main function.
    ...

    while True:
        ...
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
Double-check your file matches the following or this repository:
import cv2
import numpy as np
import onnxruntime as ort


def center_crop(frame):
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        return frame[start: start + w]
    return frame[:, start: start + h]


def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")

    cap = cv2.VideoCapture(0)
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()

        # preprocess data
        frame = center_crop(frame)
        # OpenCV delivers camera frames in BGR channel order
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        x = cv2.resize(frame, (28, 28))
        x = (x - mean) / std

        x = x.reshape(1, 1, 28, 28).astype(np.float32)
        y = ort_session.run(None, {'input': x})[0]

        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index)]

        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 255, 0), thickness=2)
        cv2.imshow("Sign Language Translator", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == '__main__':
    main()
Exit your file and run the script.
Once the script is running, a window will pop up with your live webcam feed. The predicted sign language letter will be shown in the top left. Hold up your hand and make your favorite sign to see your classifier in action. Here are some sample results showing the letters L and D.
While testing, note that the background needs to be fairly clear for this translator to work. This is an unfortunate consequence of the dataset’s cleanliness. Had the dataset included images of hand signs with miscellaneous backgrounds, the network would be robust to noisy backgrounds. However, the dataset features blank backgrounds and nicely centered hands. As a result, this webcam translator works best when your hand is likewise centered and placed against a blank background.
This concludes the sign language translator application.
In this tutorial, you built an American Sign Language translator using computer vision and a machine learning model. In particular, you saw new aspects of training a machine learning model: data augmentation for model robustness, learning rate schedules for lower loss, and exporting AI models using ONNX for production use. This culminated in a real-time computer vision application that translates sign language into letters using a pipeline you built. The final classifier is brittle, and any or all of the following topics can help address that. For further exploration, try them to improve your application:
Generalization: This isn’t a sub-topic within computer vision, rather, it’s a constant problem throughout all of machine learning. See Understanding Bias-Variance Tradeoffs.
Translated from: https://www.digitalocean.com/community/tutorials/how-to-build-a-neural-network-to-translate-sign-language-into-english