Weekend project: sign language and static-gesture recognition using scikit-learn

by Sreehari

Let’s build a machine learning pipeline that can read the sign language alphabet just by looking at a raw image of a person’s hand.

This problem has two parts to it:

  1. Building a static-gesture recognizer, which is a multi-class classifier that predicts the static sign language gestures.
  2. Locating the hand in the raw image and feeding this section of the image to the static gesture recognizer (the multi-class classifier).

You can get my example code and dataset for this project here.

First, some background.

Gesture recognition is an open problem in the area of machine vision, a field of computer science that enables systems to emulate human vision. Gesture recognition has many applications in improving human-computer interaction, and one of them is in the field of Sign Language Translation, wherein a video sequence of symbolic hand gestures is translated into natural language.

A range of advanced methods have been developed for this task. Here, we’ll look at how to perform static-gesture recognition using the scikit-learn and scikit-image libraries.

Part 1: Building a static-gesture recognizer

For this part, we use a data set comprising raw images and a corresponding CSV file with coordinates indicating the bounding box for the hand in each image. (Use the Dataset.zip file to get the sample data set. Extract it as per the instructions in the readme file.)

This data set is organized user-wise, and its directory structure is as follows. The image names indicate the letter of the alphabet represented by the image.

dataset
   |----user_1
          |---A0.jpg
          |---A1.jpg
          |---A2.jpg
          |---...
          |---Y9.jpg
   |----user_2
          |---A0.jpg
          |---A1.jpg
          |---A2.jpg
          |---...
          |---Y9.jpg
   |---- ...
   |---- ...

The static-gesture recognizer is essentially a multi-class classifier that is trained on input images representing the 24 static sign-language gestures (A-Y, excluding J).

Building a static-gesture recognizer using the raw images and the csv file is fairly simple.

To use the multi-class classifiers from the scikit-learn library, we first need to build the data set. That is, every image has to be converted into a feature vector (X), and every image gets a label (Y) corresponding to the sign language letter that it denotes.

The key now is to use an appropriate strategy to vectorize the image and extract meaningful information to feed to the classifier. Simply using the raw pixel values is unlikely to work well if we plan on using simple multi-class classifiers (as opposed to convolutional networks).

To vectorize our images, we use the Histogram of Oriented Gradients (HOG) approach, as it has been shown to yield good results on problems such as this one. Other feature extractors that can be used include Local Binary Patterns and Haar filters.

Code:

We use pandas in the get_data() function to load the CSV file. Two functions, crop() and convertToGrayToHOG(), are used to get the required HOG vector and append it to the list of vectors that we’re building, in order to train the multi-class classifier.

import pandas as pd
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

# returns the HOG vector of an RGB image array
def convertToGrayToHOG(imgVector):
    grayImage = rgb2gray(imgVector)
    return hog(grayImage)

# returns the region between (x1, y1) and (x2, y2), resized to scale x scale
def crop(img, x1, x2, y1, y2, scale):
    crp = img[y1:y2, x1:x2]
    crp = resize(crp, (scale, scale))
    return crp

# loads data for multiclass classification
def get_data(user_list, img_dict, data_directory):
    X = []
    Y = []

    for user in user_list:
        # each user directory holds a <user>_loc.csv file with the bounding boxes
        boundingbox_df = pd.read_csv(data_directory + user + '/' + user + '_loc.csv')

        for _, row in boundingbox_df.iterrows():
            cropped_img = crop(img_dict[row['image']],
                               row['top_left_x'],
                               row['bottom_right_x'],
                               row['top_left_y'],
                               row['bottom_right_y'],
                               128)
            hogvector = convertToGrayToHOG(cropped_img)

            X.append(hogvector.tolist())
            # the first character of the file name (after the '/') is the label
            Y.append(row['image'].split('/')[1][0])

    # return only after every user has been processed
    return X, Y

The next step is to encode the output labels (the Y-values) to numerical values. We do this using sklearn’s label encoder.

In our code, we have done this as follows:

Y_mul = self.label_encoder.fit_transform(Y_mul)

where the label_encoder object is constructed as follows within the gesture-recognizer class constructor:

self.label_encoder = LabelEncoder().fit(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y'])
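
To make the encoding concrete, here is a small standalone sketch of what the label encoder does with the 24 letters. The variable names are only for illustration and are not taken from the project code.

```python
from sklearn.preprocessing import LabelEncoder

# The 24 static gestures: A to Y without J.
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")
encoder = LabelEncoder().fit(letters)

# Letters are mapped to integers in sorted order.
encoded = encoder.transform(["A", "K", "Y"])
print(encoded.tolist())  # [0, 9, 23]

# inverse_transform turns predictions back into letters.
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())  # ['A', 'K', 'Y']
```

Since J is skipped, K lands on index 9, which is why predictions should always be decoded with the same encoder rather than by counting through the alphabet.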

Once this is done, the model can be trained using any multi-class classification algorithm of your choice from the scikit-learn toolbox. We have trained ours using Support Vector Classification with a linear kernel.

Training a model using sklearn does not involve more than two lines of code. Here’s how you do it:

svcmodel = SVC(kernel='linear', C=0.9, probability=True) 
self.signDetector = svcmodel.fit(X_mul, Y_mul)

The hyperparameters (C=0.9 in this case) can be tuned using a grid search. Read more about this here.
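
As a rough sketch of such a grid search, the snippet below tunes C for a linear SVC with scikit-learn's GridSearchCV. The synthetic data from make_classification merely stands in for the HOG vectors and labels, and the candidate values for C are an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the HOG vectors (X) and encoded labels (y).
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Candidate values for C; the grid itself is an assumption.
param_grid = {"C": [0.1, 0.9, 10.0]}

# 3-fold cross-validation over every value in the grid.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is a model refit on all the data with the best parameters found.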

In this case, we do not know a whole lot about the data as such (i.e., the HOG vectors). So, it’d be a good idea to try algorithms like xgboost (Extreme Gradient Boosting) or Random Forest classifiers and see how they perform.
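
One simple way to compare candidates is cross-validation. The sketch below does this for a linear SVC and a Random Forest on synthetic stand-in data. The model settings are assumptions, and the scores say nothing about how these models do on the real HOG vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the HOG vectors and labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

candidates = {
    "linear SVC": SVC(kernel="linear", C=0.9),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```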

Part 2: Building the Localizer

This part requires slightly more effort than the first.

Broadly, we’ll employ the following steps in completing this task.

  1. Build a data set comprising images of hands and of regions that are not hands, using the given data set and the bounding box values for each image.
  2. Train a binary classifier to detect hand/not-hand images using the above data set.
  3. (Optional) Use Hard Negative Mining to improve the classifier.
  4. Use a sliding window approach at various scales on the query image to isolate the region of interest.

Here, we are not going to use any image processing techniques like filtering or color segmentation. The scikit-image library is used to read, crop, and scale images, convert them to grayscale, and extract HOG vectors.

Building the hand/not-hand dataset:

The data set could be built using any strategy you like. One way to do this is to generate random coordinates and then check the ratio of the area of intersection to the area of union (i.e., the degree of overlap with the given bounding box) to determine whether the region is a non-hand section. (Another approach could be to use a sliding window to determine the coordinates, but this is horribly slow and unnecessary.)
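
The intersection-over-union ratio can be computed as in the sketch below. It is a possible stand-in for the overlapping_area() helper that the function further down calls. The actual helper is not shown in this article, so the box layout [x, y, conf, width, height] is an assumption inferred from how that function builds its arguments.

```python
def overlapping_area(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, conf, width, height]."""
    ax, ay, _, aw, ah = box_a
    bx, by, _, bw, bh = box_b

    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


hand_box = [100, 60, 0, 128, 128]
random_box = [150, 60, 0, 128, 128]
print(overlapping_area(hand_box, random_box))
```

A value of 1.0 means the boxes coincide and 0.0 means they do not touch, so a low threshold keeps only random boxes that mostly miss the hand.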

"""
This function randomly generates bounding boxes 
Returns hog vector of those cropped bounding boxes along with label 
Label : 1 if hand ,0 otherwise 
"""
import random

def buildhandnothand_lis(frame, imgset):
    poslis = []
    neglis = []

    for nameimg in frame.image:
        tupl = frame[frame['image'] == nameimg].values[0]
        x_tl = tupl[1]
        y_tl = tupl[2]
        side = tupl[5]  # assumes a sixth column holding the side of the square hand box
        conf = 0

        arg1 = [x_tl, y_tl, conf, side, side]

        # positive sample: the annotated hand region, resized to 128x128 as in part 1
        poslis.append(convertToGrayToHOG(
            crop(imgset[nameimg], x_tl, x_tl + side, y_tl, y_tl + side, 128)))

        # negative sample: draw random boxes until one overlaps the hand box by at most 0.5
        while True:
            x = random.randint(0, 320 - side)  # images are assumed to be 320x240
            y = random.randint(0, 240 - side)
            arg2 = [x, y, conf, side, side]

            # overlapping_area() is a helper (not shown here) returning intersection over union
            z = overlapping_area(arg1, arg2)
            if z <= 0.5:
                crp = crop(imgset[nameimg], x, x + side, y, y + side, 128)
                neglis.append(convertToGrayToHOG(crp))
                break

    # build the labels once every image has been processed
    label_1 = [1] * len(poslis)
    label_0 = [0] * len(neglis)
    label_1.extend(label_0)
    poslis.extend(neglis)

    return poslis, label_1

Training a binary classifier:

Once the data set is ready, training the classifier can be done exactly as seen before in part 1.

Usually, in this case, a technique called Hard Negative Mining is employed to reduce the number of false positive detections and improve the classifier. One or two iterations of hard negative mining using a Random Forest classifier are enough to ensure that your classifier reaches acceptable classification accuracy, which in this case is anything above 80%.

Take a look at the code here for a sample implementation of hard negative mining.
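
For readers who want the idea in a few lines, here is a minimal sketch of hard negative mining. It is not the project's implementation. The function name, the rounds parameter, and the synthetic feature vectors are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def hard_negative_mining(clf, X_train, y_train, X_background, rounds=2):
    """Retrain clf after adding background samples it wrongly labels as hand.

    X_background holds feature vectors known to contain no hand (label 0),
    so every positive prediction on it is a false positive.
    """
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train)
    X_background = np.asarray(X_background)

    clf.fit(X_train, y_train)
    for _ in range(rounds):
        false_pos = X_background[clf.predict(X_background) == 1]
        if len(false_pos) == 0:
            break  # no hard negatives left to learn from
        # Add the hard negatives with label 0 and retrain.
        X_train = np.vstack([X_train, false_pos])
        y_train = np.concatenate(
            [y_train, np.zeros(len(false_pos), dtype=y_train.dtype)])
        clf.fit(X_train, y_train)
    return clf, X_train, y_train


# Synthetic stand-ins: "hand" vectors around +1, "not-hand" vectors around -1.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(100, 8))
X_neg = rng.normal(-1.0, 1.0, size=(100, 8))
X0 = np.vstack([X_pos, X_neg])
y0 = np.array([1] * 100 + [0] * 100)

# Extra background samples that sit closer to the decision boundary.
X_bg = rng.normal(0.0, 1.0, size=(200, 8))

model, X_final, y_final = hard_negative_mining(
    RandomForestClassifier(n_estimators=50, random_state=0), X0, y0, X_bg)
print(len(X0), "->", len(X_final), "training samples")
```

In the real pipeline, X_background would come from sliding-window crops of training images that do not overlap the annotated hand box.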

Detecting hands in test images:

Now, to actually use the above classifier, we scale the test image by various factors and then use a sliding window approach on all of the scaled versions to pick the window that best captures the region of interest. This is done by selecting the region with the maximum confidence score assigned by the binary (hand/not-hand) classifier across all scales.

The test images need to be scaled because we run a fixed-size window (128x128 in our case) across all images to pick the region of interest, and the region of interest may not fit perfectly into this window size.

Sample implementation and overall detection across all scales.
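
The multi-scale sliding window search can be sketched as follows. All names here (sliding_windows, rescale_nearest, best_window, score_fn) are hypothetical, the step of 32 pixels and the scale factors are assumptions, and the nearest-neighbour rescale is a dependency-free stand-in for a proper resize such as skimage.transform.rescale. In the real pipeline, score_fn would compute the HOG vector of the patch and return the binary classifier's confidence for the hand class.

```python
import numpy as np


def sliding_windows(image, window=128, step=32):
    """Yield (x, y, patch) for every window position that fits in the image."""
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, step):
        for x in range(0, w - window + 1, step):
            yield x, y, image[y:y + window, x:x + window]


def rescale_nearest(image, factor):
    """Crude nearest-neighbour rescale, used here to avoid extra dependencies."""
    h, w = image.shape[:2]
    new_h, new_w = int(round(h * factor)), int(round(w * factor))
    rows = np.minimum((np.arange(new_h) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / factor).astype(int), w - 1)
    return image[rows][:, cols]


def best_window(image, score_fn, scales=(1.0, 1.25, 1.5), window=128, step=32):
    """Return (score, x, y, side) of the best window, in original image coordinates."""
    best = None
    for s in scales:
        scaled = rescale_nearest(image, s)
        for x, y, patch in sliding_windows(scaled, window, step):
            score = score_fn(patch)
            if best is None or score > best[0]:
                # Map the window back to the unscaled image.
                best = (score, int(x / s), int(y / s), int(window / s))
    return best


# Toy example: a bright square plays the role of the hand,
# and the mean brightness plays the role of the classifier confidence.
demo = np.zeros((240, 320))
demo[64:192, 96:224] = 1.0
print(best_window(demo, lambda patch: patch.mean()))
```

Scaling the image up makes the fixed 128x128 window cover a smaller part of the original image, which is how hands of different sizes get a chance to fill the window.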

Putting it all together

After both parts are complete, all that’s left to do is to call them in succession to get the final output when provided with a test image.

That is, given a test image, we first get the various detected regions across different scales of the image and pick the best one among them. This region is then cropped out, rescaled to 128x128, and its corresponding HOG vector is fed to the multi-class classifier (i.e., the gesture recognizer). The gesture recognizer then predicts the gesture denoted by the hand in the image.
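
The chaining of the two parts might look like the sketch below. The function and parameter names are hypothetical. The localizer and the feature extractor are passed in as callables so that the sketch stays independent of how they are implemented.

```python
import numpy as np


def recognize_gesture(image, locate_hand, to_features, recognizer, label_encoder):
    """Chain the localizer and the gesture recognizer on one test image.

    locate_hand(image) -> (x, y, side): best window found by the localizer
    to_features(patch) -> 1-D feature vector (e.g. resize to 128x128, then HOG)
    recognizer: fitted multi-class classifier with a predict() method
    label_encoder: the fitted LabelEncoder used when training the recognizer
    """
    x, y, side = locate_hand(image)
    patch = image[y:y + side, x:x + side]

    # scikit-learn classifiers expect a 2-D array: one row per sample.
    features = np.asarray(to_features(patch)).reshape(1, -1)
    encoded = recognizer.predict(features)

    # Turn the numeric prediction back into a letter.
    return label_encoder.inverse_transform(encoded)[0]
```

Passing the label encoder along with the classifier keeps the numeric prediction and its letter in sync.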

Key points

To summarize, this project involves the following steps. The links refer to the relevant code in the GitHub repository.

  1. Building the hand/not-hand dataset.
  2. Converting all the images (i.e., the cropped sections with the gestures and the hand/not-hand images) to their vectorized form.
  3. Building a binary classifier for detecting the section with the hand and building a multi-class classifier for identifying the gesture using these data sets.
  4. Using the above classifiers one after the other to perform the required task.

Suks and I worked on this project as part of the Machine Learning course that we took up in college. A big shout out to her for all her contributions!

Also, we wanted to mention Pyimagesearch, a wonderful blog that we used extensively while working on the project! Do check it out for content on image processing and OpenCV.

Cheers!

Translated from: https://www.freecodecamp.org/news/weekend-projects-sign-language-and-static-gesture-recognition-using-scikit-learn-60813d600e79/
