Hi folks,
Today we are going to understand how active learning can be used in data labeling.
Machine learning algorithms generally require a large amount of data to be trained. At this stage, humans can obviously label the data by hand. But what happens if there isn't enough money to use services like Amazon Mechanical Turk (AMT)?
If you're stuck in this situation, there is one more way to get your data labeled. And your hero's name is Active Learning!
By the way, this post is my first tutorial on Medium, so I'm not going to talk too much :)
So I'm going to give you a naive active learning labeling strategy that you can implement yourself using Python and Scikit-learn on the FashionMNIST dataset.
Here are the steps;
1- Label only a small part of your data, let's call it "df_labeled"
2- Train a classifier (a Linear SVM will be used here) on these data
3- Using the classifier trained in step 2, predict the class probabilities for your unlabeled data, let's call it "df_unlabeled"
4- For each sample, if the predicted class probability is above your pre-defined threshold (yes, it's a hyperparameter :( ), move that sample from "df_unlabeled" to "df_labeled"
5- Repeat steps 2-4 until some stopping criterion is met (a minimal sketch of this loop follows the list)
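Before the full walkthrough, here is a minimal, self-contained sketch of this loop on plain NumPy arrays. The names x_labeled, y_labeled and x_unlabeled are placeholders rather than the FashionMNIST variables used later in this post, and the sketch assigns the model's own predicted labels to the confident samples:

import numpy as np
from sklearn.svm import LinearSVC

def naive_active_labeling(x_labeled, y_labeled, x_unlabeled, threshold=0.75, n_iterations=10):
    for _ in range(n_iterations):
        # Step 2: train a classifier on the currently labeled pool
        clf = LinearSVC()
        clf.fit(x_labeled, y_labeled)
        # Step 3: predict class probabilities for the unlabeled pool
        probas = clf._predict_proba_lr(x_unlabeled)
        # Step 4: move confident samples, with their predicted labels, to the labeled pool
        confident = probas.max(axis=1) > threshold
        x_labeled = np.vstack([x_labeled, x_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, clf.classes_[probas[confident].argmax(axis=1)]])
        x_unlabeled = x_unlabeled[~confident]
        # Step 5: one possible stopping criterion, nothing left to label
        if len(x_unlabeled) == 0:
            break
    return x_labeled, y_labeled, x_unlabeled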
Of course, many different strategies exist. For example, after step 4 you could define a second threshold as a lower boundary: if the predicted class probability is below that threshold, the sample is labeled manually and then moved to "df_labeled".
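Here is a rough sketch of that two-threshold idea, assuming a fitted classifier clf and an unlabeled pool x_unlabeled as NumPy arrays (both placeholder names); ask_human_for_label is a hypothetical stand-in for whatever manual labeling workflow you have:

# Sketch of the two-threshold variant: confident samples get the model's
# predicted label automatically, very uncertain ones are routed to a human.
import numpy as np

def split_by_confidence(clf, x_unlabeled, upper=0.75, lower=0.20):
    probas = clf._predict_proba_lr(x_unlabeled)
    top = probas.max(axis=1)
    auto_idx = np.where(top > upper)[0]     # accept the predicted label
    manual_idx = np.where(top < lower)[0]   # label these by hand
    auto_labels = clf.classes_[probas[auto_idx].argmax(axis=1)]
    return auto_idx, auto_labels, manual_idx

# Hypothetical usage, where ask_human_for_label is your manual workflow:
# auto_idx, auto_labels, manual_idx = split_by_confidence(clf, x_unlabeled)
# manual_labels = [ask_human_for_label(x_unlabeled[i]) for i in manual_idx]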
I hope the main concept of active labeling is clear. Now the time comes for the coding section.
Import the libraries that will be used in this notebook;
import pandas as pd
import numpy as np
from tensorflow.keras.datasets import fashion_mnist
import matplotlib.pyplot as plt
import random
import cv2
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report
Now import the FashionMNIST dataset;
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()
Now define the HoG feature extractor to transform raw pixels into a feature set;
def hog_feature_extractor(hog_extractor, im):
    descriptor = hog_extractor.compute(im)
    return descriptor

# HoG parameters
winSize = (28, 28)
blockSize = (14, 14)
blockStride = (7, 7)
cellSize = (7, 7)
nbins = 9
derivAperture = 1
winSigma = -1.
histogramNormType = 0
L2HysThreshold = 0.2
gammaCorrection = 1
nlevels = 64
useSignedGradients = True

hog = cv2.HOGDescriptor(winSize, blockSize, blockStride, cellSize, nbins,
                        derivAperture, winSigma, histogramNormType,
                        L2HysThreshold, gammaCorrection, nlevels,
                        useSignedGradients)
To visualize a data sample;
def show_sample(x, y, i):
    print("Label: {}".format(y[i]))
    plt.imshow(x[i], cmap="gray")
For naming convention;
df_x = trainX
df_y = trainY
Let's see our labels;
nclasses = set(df_y)
print(nclasses)
Now it's time to select a subset from our data. Be aware that our dataset is already labeled; otherwise we would have to do this step manually.
# What percentage of the data is labeled initially
percentage = 1

selected_indices = []
for c in nclasses:
    indices_c = list(np.where(df_y == c))[0]
    len_c = len(list(np.where(df_y == c))[0])
    len_c_subset = int(len_c * percentage / 100)
    df_c_subset = random.sample(list(indices_c), len_c_subset)
    selected_indices += df_c_subset
    print("There are '{}' images for class label '{}' and selected only '{}' for active learning.".format(len_c, c, len_c_subset))
    print("----")

df_subset_x = df_x[selected_indices]
df_subset_y = df_y[selected_indices]
Let's see how many samples we have;
print("Subset {}, {}".format(df_subset_x.shape, df_subset_y.shape))
And see what the remaining set is;
df_remainder_x = np.delete(df_x, selected_indices, axis=0)
df_remainder_y = np.delete(df_y, selected_indices, axis=0)

print("Remainder {}, {}".format(df_remainder_x.shape, df_remainder_y.shape))
Now it's time to extract HoG features from the images;
# Feature extraction
df_subset_x_hog = []
for elem in df_subset_x:
    df_subset_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

df_remainder_x_hog = []
for elem in df_remainder_x:
    df_remainder_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

df_subset_y_hog = list(df_subset_y.copy())
df_remainder_y_hog = list(df_remainder_y.copy())
Now check how many samples are treated as labeled;
print("Labeled {}, Unlabelled {}".format(len(df_subset_x_hog),len(df_remainder_x_hog)))
As mentioned in step 5, I used no elaborate evaluation criterion; just a fixed number of iterations is used here. For 10 iterations, we repeat the process while decreasing the upper threshold by 0.05 per iteration, from 0.75 down to 0.30.
for iteration in range(10):
    clf = svm.LinearSVC()
    clf.fit(df_subset_x_hog, df_subset_y_hog)
    res = clf._predict_proba_lr(df_remainder_x_hog)

    # Upper threshold for accepting samples, decreased at every iteration
    threshold = 0.75 - (iteration * 0.05)
    del_indices = []
    for sample_counter in range(len(res)):
        if res[sample_counter][np.argmax(res[sample_counter])] > threshold:
            predicted_label = np.argmax(res[sample_counter])
            df_subset_x_hog.append(list(df_remainder_x_hog[sample_counter]))
            # Note: the dataset's original label is appended here; in a truly
            # unlabeled setting you would append predicted_label instead
            df_subset_y_hog.append(df_remainder_y_hog[sample_counter])
            del_indices.append(sample_counter)

    # Remove the newly labeled samples from the unlabeled pool
    df_remainder_x_hog = [i for j, i in enumerate(df_remainder_x_hog) if j not in del_indices]
    df_remainder_y_hog = [i for j, i in enumerate(df_remainder_y_hog) if j not in del_indices]
    print("Iteration: {} has done...".format(iteration))
And finally, 21009 samples from the unlabeled set are still unlabeled;
print("Remain: {}, Labeled: {}".format(len(df_remainder_x_hog), len(df_subset_x_hog)))
Now we decrease our upper threshold to 0.10 and train again;
# Finally, label with a very low threshold
clf = svm.LinearSVC()
clf.fit(df_subset_x_hog, df_subset_y_hog)
res = clf._predict_proba_lr(df_remainder_x_hog)

# Low threshold so that almost all remaining samples get labeled
threshold = 0.1
del_indices = []
for sample_counter in range(len(res)):
    if res[sample_counter][np.argmax(res[sample_counter])] > threshold:
        predicted_label = np.argmax(res[sample_counter])
        df_subset_x_hog.append(list(df_remainder_x_hog[sample_counter]))
        df_subset_y_hog.append(df_remainder_y_hog[sample_counter])
        del_indices.append(sample_counter)

df_remainder_x_hog = [i for j, i in enumerate(df_remainder_x_hog) if j not in del_indices]
df_remainder_y_hog = [i for j, i in enumerate(df_remainder_y_hog) if j not in del_indices]
Finally, let's see our performance on the test set;
# Test this model with the test set
df_test_x_hog = []
for elem in testX:
    df_test_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

test_res = clf.predict(df_test_x_hog)
print(confusion_matrix(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))
print(classification_report(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))
And this is our active-labeling-based classifier's performance on the test set;
              precision    recall  f1-score   support

           0       0.83      0.81      0.82      1000
           1       0.95      0.96      0.96      1000
           2       0.78      0.80      0.79      1000
           3       0.83      0.86      0.85      1000
           4       0.75      0.83      0.79      1000
           5       0.98      0.96      0.97      1000
           6       0.70      0.58      0.63      1000
           7       0.94      0.97      0.95      1000
           8       0.96      0.97      0.96      1000
           9       0.97      0.96      0.96      1000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000
Actually, you may want to ask: what would happen if we used the whole training set? Now we are going to train another classifier that uses all training samples;
## What happens if we use all training samples
# Test this model with the test set
df_train_x_hog = []
for elem in trainX:
    df_train_x_hog.append(hog_feature_extractor(hog, elem).reshape(-1))

clf = svm.LinearSVC()
clf.fit(df_train_x_hog, trainY)
test_res = clf.predict(df_test_x_hog)
print(confusion_matrix(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))
print(classification_report(testY, test_res, labels=[0,1,2,3,4,5,6,7,8,9]))
And its classification report looks like this;
              precision    recall  f1-score   support

           0       0.84      0.87      0.86      1000
           1       0.99      0.98      0.98      1000
           2       0.84      0.83      0.83      1000
           3       0.88      0.92      0.90      1000
           4       0.80      0.84      0.82      1000
           5       0.98      0.97      0.98      1000
           6       0.74      0.66      0.69      1000
           7       0.94      0.97      0.96      1000
           8       0.97      0.97      0.97      1000
           9       0.97      0.96      0.97      1000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
Conclusion
When we analyze the results: with only 600 images labeled by hand, this strategy obtains a 0.87 F1-score. When all 60k labeled samples are used, we obtain a 0.90 F1-score.
Of course, we have low performance on the 6th class, with a 0.63 F1-score, but fortunately it doesn't change much even when we use all the data.
Thank you for reading. All corrections and contributions are warmly welcome :)
Peace at home, peace in the world!
Original article: https://medium.com/@eroltak/active-learning-for-labeling-in-python-cde06d54baf1