Blog post: https://www.iamzlt.com/?p=336
Reference paper: "Design and Implementation of an Efficient Intelligent Video Classification System" (《高效视频智能分类系统的设计与实现》)
This post uses object detection to classify whether the people in a video are wearing masks, and summarizes the relevant deep-learning algorithms from that angle.
Based on thorough preliminary research, the system consists of the following modules: a training-set acquisition module, a deep-learning training module, a single-image annotation module, and a real-time video detection module.
A key issue for the system is that the small weight matrices are applied repeatedly to every frame of the video: because the model looks at every frame, the number of floating-point operations (FLOPs) remains large even when the memory footprint is small, so the focus is on building a computationally efficient model with fewer FLOPs. SSD uses a pyramid structure: it takes feature maps of different sizes (conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 in SSD300) and performs softmax classification and bounding-box regression on all of them simultaneously. Candidate boxes on the input image can also be produced with a sliding-window approach or with color/texture/shape-based methods, and PCA can be used to reduce the dimensionality of the semantic features. Finally, non-maximum suppression (NMS) is applied: the box with the highest confidence is selected first, the IoU between it and every remaining candidate box is computed, and any candidate whose IoU exceeds a threshold (i.e. overlaps the selected box heavily) is removed by setting its confidence to 0. This filters out overlapping candidates and helps ensure the accuracy of the final video classification results.
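To make the NMS step concrete, here is a minimal sketch in plain NumPy (illustrative only; this is not the exact implementation used inside the SSD code):

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) array of [xmin, ymin, xmax, ymax]; scores: (N,) confidences."""
    order = scores.argsort()[::-1]            # sort candidates by confidence, highest first
    keep = []
    while order.size > 0:
        i = order[0]                          # keep the highest-scoring remaining box
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep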
PyTorch wheels for each version: https://download.pytorch.org/whl/torch_stable.html
Environment used by this system:
cuda10
python3.6
torch1.1
torchvision0.4
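These versions can usually be installed directly from the wheel index above; as a rough example (the exact wheel names depend on your Python/CUDA build, so adjust as needed): pip install torch==1.1.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html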
The WIDER Face Dataset targets face detection. As shown in Figure 1, it contains 21,192 images with 282,692 labeled faces in total, with large variations in factors such as expression, occlusion and illumination. With close to 300,000 faces, it is one of the largest and most challenging face detection datasets available today in terms of image count and scene difficulty. Thanks to its precise annotations and high difficulty, WIDER Face has surpassed FDDB as the benchmark on which companies and research institutes compete in face detection, and many well-known university labs have already reported the performance of their algorithms on it.
This system uses the WIDER Face Dataset to repeatedly test and validate the SSD model. The annotated faces vary in scale and pose (face scales are mostly distributed between 40 and 140 pixels) and show different degrees of occlusion and illumination, making it a widely recognized benchmark for face detection. In this system, 3,114 images from WIDER Face are used for model validation, of which the validation set contains 780 images, and the model is also tested on subsets of the test set with different levels of difficulty.
2. Labeling
The annotation workflow is shown in Figure 2:
Face: marked with a square bounding box. Boxes for faces smaller than 32 pixels, or for blurred or distorted images, are not counted; such boxes are removed and not classified further.
Eyes: the coordinates of the pair of eyes in the face should be marked.
Occluder location: anything occluding the face, such as glasses, is marked with a rectangular bounding box.
Face orientation: five orientations are defined for faces in an image: left, right, front, front-left and front-right.
Occlusion degree: the face is divided into four main regions: chin, mouth, nose and eyes.
Occlusion type: four occlusion types are defined: occlusion by another person, occlusion by the surroundings, occlusion by a particular body part, and occlusion by glasses.
Summary: the MAFA Dataset is a detection dataset of occluded faces, with most of its images collected from the Internet. As shown in Figure 3, MAFA contains 29,700 ordinary images and 24,705 images in which faces are occluded in different ways and to different degrees; the occluded-face images carry the six attributes described above, and all of them were annotated manually by staff.
3. Generate the .txt files (make_txt.py)
#!/opt/conda/envs/python35-paddle120-env/bin/python
import os
import random
import sys
sys.path.append('/home/aistudio/ssd.pytorch-master/data/')
# fraction of the trainval split that goes to validation (the rest is training)
trainval_percent = 0.1
# fraction of all images that go to trainval (the rest is the test set)
train_percent = 0.9
xmlfilepath = 'mask_or_not/Annotations'
txtsavepath = 'mask_or_not/ImageSets'
total_xml = os.listdir(xmlfilepath)
num = len(total_xml)
print(num)
indices = range(num)
train_val_num = int(num * train_percent)
print(train_val_num)
val = int(train_val_num * trainval_percent)
print(val)
trainval = random.sample(indices, train_val_num)
print(len(trainval))
validation = random.sample(trainval, val)
ftrainval = open('/home/aistudio/ssd.pytorch-master/data/mask_or_not/ImageSets/trainval.txt', 'w')
ftest = open('/home/aistudio/ssd.pytorch-master/data/mask_or_not/ImageSets/test.txt', 'w')
ftrain = open('/home/aistudio/ssd.pytorch-master/data/mask_or_not/ImageSets/train.txt', 'w')
fval = open('/home/aistudio/ssd.pytorch-master/data/mask_or_not/ImageSets/val.txt', 'w')
for i in indices:
    name = total_xml[i][:-4] + '\n'   # strip the ".xml" extension, keep the image ID
    if i in trainval:
        ftrainval.write(name)
        if i in validation:
            fval.write(name)
        else:
            ftrain.write(name)
    else:
        ftest.write(name)
ftrainval.close()
ftrain.close()
fval.close()
ftest.close()
The txt files simply list image indices (the base file names), i.e. they determine which images are used for training, validation and testing.
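For example, with 1,000 annotation files and the percentages above, 900 IDs would go to trainval.txt (of which 90 also go to val.txt and the remaining 810 to train.txt), and the other 100 IDs would go to test.txt.
1. Modify config.py by adding a mask configuration (config.py):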
#!/opt/conda/envs/python35-paddle120-env/bin/python
import sys
# config.py
import os.path
# gets home dir cross platform
HOME = os.path.expanduser("/home/aistudio/ssd.pytorch-master")
# for making bounding boxes pretty
COLORS = ((255, 0, 0, 128), (0, 255, 0, 128), (0, 0, 255, 128),
          (0, 255, 255, 128), (255, 0, 255, 128), (255, 255, 0, 128))
MEANS = (104, 117, 123)
# newly added mask config
mask = {
    'num_classes': 3,
    'lr_steps': (80000, 100000, 120000),  # iterations at which the learning rate is decayed
    # train for 120k iterations
    'max_iter': 120000,
    # 6 feature maps
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'min_dim': 300,
    'steps': [8, 16, 32, 64, 100, 300],  # stride of each feature map relative to the 300x300 input
    'min_sizes': [30, 60, 111, 162, 213, 264],
    'max_sizes': [60, 111, 162, 213, 264, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],  # prior box aspect ratios
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'MASK',
}
# SSD300 CONFIGS
voc = {
    'num_classes': 21,
    'lr_steps': (80000, 100000, 120000),
    'max_iter': 120000,
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'min_dim': 300,
    'steps': [8, 16, 32, 64, 100, 300],
    'min_sizes': [30, 60, 111, 162, 213, 264],
    'max_sizes': [60, 111, 162, 213, 264, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'VOC',
}
coco = {
    'num_classes': 201,
    'lr_steps': (280000, 360000, 400000),
    'max_iter': 400000,
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'min_dim': 300,
    'steps': [8, 16, 32, 64, 100, 300],
    'min_sizes': [21, 45, 99, 153, 207, 261],
    'max_sizes': [45, 99, 153, 207, 261, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'COCO',
}
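As a quick sanity check on the settings above, the number of default (prior) boxes these configs produce can be computed directly; a small sketch, assuming the standard SSD300 prior-generation scheme (one box at min_size, one at sqrt(min_size * max_size), plus two boxes per extra aspect ratio):

# Rough check of how many default boxes the mask/voc config above generates.
feature_maps = [38, 19, 10, 5, 3, 1]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
total = 0
for fm, ars in zip(feature_maps, aspect_ratios):
    per_cell = 2 + 2 * len(ars)   # default boxes generated at every feature-map cell
    total += fm * fm * per_cell
print(total)  # 8732 default boxes for a 300x300 input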
2. Modify voc0712.py (saved as mask.py)
from .config import HOME
import os.path as osp
import sys
import os
sys.path.append('/home/aistudio')
import torch
import torch.utils.data as data
import cv2
import numpy as np
if sys.version_info[0] == 2:
    import xml.etree.cElementTree as ET
else:
    import xml.etree.ElementTree as ET
# Two classes: face_mask (wearing a mask) and face (not wearing a mask)
MASK_CLASSES = (  # always index 0
    'face', 'face_mask'
)
# note: if you used our download scripts, this should be right
home = os.path.abspath(os.path.dirname(__file__))
#MASK_ROOT = osp.join(HOME, "data/mask_or_not")
MASK_ROOT = osp.join(home, "mask_or_not")
class MASKAnnotationTransform(object):
    """Transforms a MASK annotation into a Tensor of bbox coords and label index
    Initialized with a dictionary lookup of classnames to indexes
    Arguments:
        class_to_ind (dict, optional): dictionary lookup of classnames -> indexes
            (default: alphabetic indexing of MASK's 2 classes)
        keep_difficult (bool, optional): keep difficult instances or not
            (default: False)
        height (int): height
        width (int): width
    Reads the box coordinates and label from the xml and normalizes the coordinates to [0, 1].
    """
    def __init__(self, class_to_ind=None, keep_difficult=False):
        # map class names to integer labels
        self.class_to_ind = class_to_ind or dict(
            zip(MASK_CLASSES, range(len(MASK_CLASSES))))
        # the xml has a "difficult" flag marking hard targets (usually small or heavily occluded);
        # this switch decides whether such targets are kept or ignored
        self.keep_difficult = keep_difficult

    def __call__(self, target, width, height):
        """
        Arguments:
            target (annotation) : the target annotation to be made usable
                will be an ET.Element
        Returns:
            a list containing lists of bounding boxes [bbox coords, class name]
        For every object in the image, read its box, normalize the coordinates to [0, 1]
        and append its label index.
        :param target: parsed xml
        :return: list with one row per object, each row being [xmin, ymin, xmax, ymax, label_ind]
        """
        res = []
        for obj in target.iter('object'):
            difficult = int(obj.find('difficult').text) == 1
            if not self.keep_difficult and difficult:
                continue
            name = obj.find('name').text.lower().strip()  # class name, lower-cased and stripped
            bbox = obj.find('bndbox')
            pts = ['xmin', 'ymin', 'xmax', 'ymax']
            bndbox = []
            for i, pt in enumerate(pts):
                cur_pt = int(bbox.find(pt).text) - 1
                # scale height or width
                # normalize to [0, 1] so the ground-truth boxes scale with the image when it is resized
                cur_pt = cur_pt / width if i % 2 == 0 else cur_pt / height
                bndbox.append(cur_pt)
            label_idx = self.class_to_ind[name]
            bndbox.append(label_idx)
            res += [bndbox]  # [xmin, ymin, xmax, ymax, label_ind]
            # img_id = target.find('filename').text[:-4]
        return res  # [[xmin, ymin, xmax, ymax, label_ind], ... ]
class MASKDetection(data.Dataset):
    """VOC Detection Dataset Object
    input is image, target is annotation
    Arguments:
        root (string): filepath to VOCdevkit folder.
        image_set (string): imageset to use (eg. 'train', 'val', 'test')
        transform (callable, optional): transformation to perform on the
            input image
            (image preprocessing; heavy data augmentation is applied here)
        target_transform (callable, optional): transformation to perform on the
            target `annotation`
            (eg: take in caption string, return tensor of word indices)
            (ground-truth box preprocessing)
        dataset_name (string, optional): which dataset to load
            (default: 'VOC2007')
    """
    # image_sets=[('2007', 'trainval'), ('2012', 'trainval')],
    def __init__(self, root,
                 image_sets='trainval',
                 transform=None, target_transform=MASKAnnotationTransform(),
                 dataset_name='MASK'):
        self.root = root
        self.image_set = image_sets
        self.transform = transform                # image transform
        self.target_transform = target_transform  # annotation transform
        self.name = dataset_name
        self._annopath = osp.join('%s', 'Annotations', '%s.xml')
        self._imgpath = osp.join('%s', 'JPEGImages', '%s.jpg')
        self.ids = list()
        for line in open(MASK_ROOT + '/ImageSets/' + self.image_set + '.txt'):
            self.ids.append((MASK_ROOT, line.strip()))

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)
        return im, gt

    def __len__(self):
        return len(self.ids)

    def pull_item(self, index):
        img_id = self.ids[index]
        target = ET.parse(self._annopath % img_id).getroot()
        img = cv2.imread(self._imgpath % img_id)
        height, width, channels = img.shape
        if self.target_transform is not None:
            # process the ground-truth boxes
            target = self.target_transform(target, width, height)
        if self.transform is not None:
            # image preprocessing / data augmentation (only used during training, not at test time)
            target = np.array(target)
            # first four columns are the box coordinates, the last one is the class label
            img, boxes, labels = self.transform(img, target[:, :4], target[:, 4])
            # to rgb
            img = img[:, :, (2, 1, 0)]  # OpenCV loads images as BGR; reorder the channels to RGB
            # img = img.transpose(2, 0, 1)
            target = np.hstack((boxes, np.expand_dims(labels, axis=1)))
        return torch.from_numpy(img).permute(2, 0, 1), target, height, width
        # return torch.from_numpy(img), target, height, width

    def pull_image(self, index):
        '''Returns the original image object at index in PIL form
        Note: not using self.__getitem__(), as any transformations passed in
        could mess up this functionality.
        Argument:
            index (int): index of img to show
        Return:
            PIL img
        Used at test time; returns the image as read by OpenCV.
        '''
        img_id = self.ids[index]
        return cv2.imread(self._imgpath % img_id, cv2.IMREAD_COLOR)

    def pull_anno(self, index):
        '''Returns the original annotation of image at index
        Note: not using self.__getitem__(), as any transformations passed in
        could mess up this functionality.
        Argument:
            index (int): index of img to get annotation of
        Return:
            list: [img_id, [(label, bbox coords),...]]
                eg: ('001718', [('dog', (96, 13, 438, 332))])
        Used at test time; returns the image name and the original, unscaled GT boxes.
        '''
        img_id = self.ids[index]
        anno = ET.parse(self._annopath % img_id).getroot()
        gt = self.target_transform(anno, 1, 1)
        return img_id[1], gt

    def pull_tensor(self, index):
        '''Returns the original image at an index in tensor form
        Note: not using self.__getitem__(), as any transformations passed in
        could mess up this functionality.
        Argument:
            index (int): index of img to show
        Return:
            tensorized version of img, squeezed
        '''
        return torch.Tensor(self.pull_image(index)).unsqueeze_(0)
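For reference, here is a minimal sketch of how this dataset class is consumed, assuming the ssd.pytorch repo's SSDAugmentation and detection_collate helpers and that MASKDetection/MASK_ROOT are exported from data/__init__.py:

# Illustrative only: build the dataset and a DataLoader the way train.py would consume it.
import torch.utils.data as data
from data import MASKDetection, MASK_ROOT, detection_collate
from data.config import MEANS, mask
from utils.augmentations import SSDAugmentation

dataset = MASKDetection(root=MASK_ROOT, image_sets='train',
                        transform=SSDAugmentation(mask['min_dim'], MEANS))
loader = data.DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4,
                         collate_fn=detection_collate, pin_memory=True)
images, targets = next(iter(loader))  # images: (16, 3, 300, 300); targets: list of (n_i, 5) tensors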
3. Modify ssd.py and train.py of the SSD framework
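The post does not show these edits in full; roughly speaking (assuming the stock amdegroot/ssd.pytorch code, where the SSD class selects its config by class count), the changes look like the sketch below.

# ssd.py (sketch): let the network pick the new mask config when it is built with 3 classes.
from data import voc, coco, mask   # mask is the config added above
# inside SSD.__init__, replace
#     self.cfg = (coco, voc)[num_classes == 21]
# with something like
#     self.cfg = mask if num_classes == 3 else (coco, voc)[num_classes == 21]
# In train.py, the VOCDetection dataset and the voc config are swapped for MASKDetection
# and mask in the same way (see the DataLoader sketch above).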
4. Load the trained model in eval.py to validate it; a good checkpoint, ssd300_MASK_88000.pth, is obtained
#!/opt/conda/envs/python35-paddle120-env/bin/python
from ssd import build_ssd
from data import MASKDetection, MASKAnnotationTransform, MASK_ROOT
from data import MASK_CLASSES as labels
import os
import sys
# module_path = os.path.abspath(os.path.join('..'))
# if module_path not in sys.path:
# sys.path.append(module_path)
import sys
sys.path.append('/home/aistudio')
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
from torch.autograd import Variable
import numpy as np
import cv2
if torch.cuda.is_available():
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
net = build_ssd('test', 300, len(labels)+1) # initialize SSD
net.load_weights('./weights/ssd300_MASK_88000.pth')
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
image = cv2.imread('mm.jpg', cv2.IMREAD_COLOR)
# testset = MASKDetection(MASK_ROOT, 'test', None, MASKAnnotationTransform())
# img_id = 1
# image = testset.pull_image(img_id)
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# View the sampled input image before transform
plt.figure(figsize=(10,10))
plt.imshow(rgb_image)
plt.show()
# For SSD at test time, we use a custom BaseTransform callable that resizes the image to 300x300, subtracts the dataset's mean RGB values, and swaps the color channels for input to SSD300.
# Pre-process the input.
x = cv2.resize(image, (300, 300)).astype(np.float32)
x -= (104.0, 117.0, 123.0)
x = x.astype(np.float32)
x = x[:, :, ::-1].copy()
plt.imshow(x)
x = torch.from_numpy(x).permute(2, 0, 1)
# Now simply wrap the image in a Variable so it is visible to PyTorch's autograd
# SSD Forward Pass
xx = Variable(x.unsqueeze(0)) # wrap tensor in Variable
if torch.cuda.is_available():
    xx = xx.cuda()
y = net(xx)
# Parse the Detections and View Results
# Filter out detections with confidence scores lower than a threshold; here we choose 60%
top_k=10
plt.figure(figsize=(10,10))
colors = plt.cm.hsv(np.linspace(0, 1, 21)).tolist()
plt.imshow(rgb_image) # plot the image for matplotlib
currentAxis = plt.gca()
detections = y.data
# scale each detection back up to the image
scale = torch.Tensor(rgb_image.shape[1::-1]).repeat(2)
for i in range(detections.size(1)):
    j = 0
    while detections[0, i, j, 0] >= 0.6:
        score = detections[0, i, j, 0]
        label_name = labels[i - 1]
        display_txt = '%s: %.2f' % (label_name, score)
        pt = (detections[0, i, j, 1:] * scale).cpu().numpy()
        coords = (pt[0], pt[1]), pt[2] - pt[0] + 1, pt[3] - pt[1] + 1
        color = colors[i]
        currentAxis.add_patch(plt.Rectangle(*coords, fill=False, edgecolor=color, linewidth=2))
        currentAxis.text(pt[0], pt[1], display_txt, bbox={'facecolor': color, 'alpha': 0.5})
        j += 1
plt.savefig('112.png')
Sample results:
Without a mask (face):
With a mask (face_mask):
5. Real-time video detection (live.py): the script below runs the trained model on a live webcam stream.
from __future__ import print_function
import sys
sys.path.append('/home/aistudio')
import torch
from torch.autograd import Variable
import cv2
import time
from imutils.video import FPS, WebcamVideoStream
import argparse
parser = argparse.ArgumentParser(description='Single Shot MultiBox Detection')
parser.add_argument('--weights', default='../weights/ssd300_MASK_88000.pth',
                    type=str, help='Trained state_dict file path')
parser.add_argument('--cuda', default=False, type=bool,
                    help='Use cuda in live demo')
args = parser.parse_args()
COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
FONT = cv2.FONT_HERSHEY_SIMPLEX
def cv2_demo(net, transform):
    def predict(frame):
        height, width = frame.shape[:2]
        x = torch.from_numpy(transform(frame)[0]).permute(2, 0, 1)
        x = Variable(x.unsqueeze(0))
        y = net(x)  # forward pass
        detections = y.data
        # scale each detection back up to the image
        scale = torch.Tensor([width, height, width, height])
        for i in range(detections.size(1)):
            j = 0
            while detections[0, i, j, 0] >= 0.6:
                score = detections[0, i, j, 0]
                label_name = labels[i - 1]
                display_txt = '%s: %.2f' % (label_name, score)
                pt = (detections[0, i, j, 1:] * scale).cpu().numpy()
                cv2.rectangle(frame,
                              (int(pt[0]), int(pt[1])),
                              (int(pt[2]), int(pt[3])),
                              COLORS[i % 3], 2)
                # cv2.putText(frame, labelmap[i - 1], (int(pt[0]), int(pt[1])),
                #             FONT, 2, (255, 255, 255), 2, cv2.LINE_AA)
                color = COLORS[i]
                cv2.putText(frame, display_txt, (int(pt[0]), int(pt[1])),
                            FONT, 1, (255, 255, 255), 2, cv2.LINE_AA)
                j += 1
        return frame

    # start video stream thread, allow buffer to fill
    print("[INFO] starting threaded video stream...")
    global stream
    stream = WebcamVideoStream(src=0).start()  # default camera
    time.sleep(1.0)
    # start fps timer
    # loop over frames from the video file stream
    while True:
        # grab next frame
        frame = stream.read()
        key = cv2.waitKey(1) & 0xFF
        # update FPS counter
        fps.update()
        frame = predict(frame)
        # keybindings for display
        if key == ord('p'):  # pause
            while True:
                key2 = cv2.waitKey(1) or 0xff
                cv2.imshow('frame', frame)
                if key2 == ord('p'):  # resume
                    break
        cv2.imshow('frame', frame)
        if key == 27:  # exit
            break
if __name__ == '__main__':
    import sys
    from os import path
    sys.path.append(path.dirname(path.dirname(path.abspath(__file__))))
    from data import MASKDetection, MASKAnnotationTransform, MASK_ROOT
    from data import MASK_CLASSES as labels
    from data import BaseTransform, VOC_CLASSES as labelmap
    from ssd import build_ssd

    net = build_ssd('test', 300, 3)  # initialize SSD
    # net.load_state_dict(torch.load('../weights/ssd300_MASK_88000.pth'))
    net.load_weights('../weights/ssd300_MASK_88000.pth')
    transform = BaseTransform(net.size, (104/256.0, 117/256.0, 123/256.0))
    fps = FPS().start()
    cv2_demo(net.eval(), transform)
    # stop the timer and display FPS information
    fps.stop()
    print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
    print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
    # cleanup
    cv2.destroyAllWindows()
    stream.stop()
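A typical invocation of the live demo, using the flags defined in the argparse setup above, would be something like: python live.py --weights ../weights/ssd300_MASK_88000.pth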
Screenshots of the running results:
The code (including the dataset) can be obtained by following the WeChat public account IamZLT and replying "口罩" (mask).
WeChat public account: