
During the COVID-19 quarantine I decided, just for fun, to build my own mask detector: a model able to tell whether a person in an image or video is wearing a mask.


Like every Machine Learning project, the very first step is to collect the necessary data. We’re building a mask detector model that outputs “mask” or “no mask” for an input face image, so we need to collect images of people wearing and not wearing a mask.

I simply asked all my friends to send me one selfie wearing a mask and another without it. I collected around 200 images, which is very little for training an accurate Machine Learning model; the results, however, were quite acceptable.

To build a mask detector let’s first split the problem into 2 main steps:


1. Given an input image, we need to detect the faces in it. This task is called “Object Detection” in the world of Computer Vision: detecting the positions and types of objects in images.

In our problem we only need to detect faces and output the bounding boxes delimiting their positions, so we can pass them to the next step:


2. Given one or more face images, we need to classify them into “mask” or “no mask”. In the Machine Learning vocabulary this is called “binary classification”, where we need to classify some input data into 2 possible classes (in this case [“mask”, “no mask”]). Our input data will be the RGB image representation of human faces.

So, given the 2 steps above, we’re building a processing pipeline: the first step takes an input image and outputs the bounding boxes of the human faces found in it, and the second step takes the cropped face images delimited by those bounding boxes and classifies them as “mask” or “no mask”.
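The two steps can be sketched as a tiny pipeline. This is only an illustration: `run_pipeline`, `fake_detector` and `fake_classifier` are hypothetical names standing in for the real face detector and mask classifier, and the box format (x, y, width, height) is an assumption.

```python
import numpy as np

def run_pipeline(image, detect_faces, classify_face):
    """Two-step pipeline: find face boxes, then classify each cropped face.

    detect_faces and classify_face are hypothetical callables standing in
    for the real face detector and mask classifier.
    """
    results = []
    for (x, y, w, h) in detect_faces(image):
        face = image[y:y + h, x:x + w]           # crop the bounding box
        results.append(((x, y, w, h), classify_face(face)))
    return results

# Dummy stand-ins so the sketch runs on its own.
def fake_detector(image):
    return [(2, 1, 4, 3)]                        # one box: x, y, width, height

def fake_classifier(face):
    return "mask" if face.mean() > 0.5 else "no mask"

image = np.ones((10, 10, 3))
pipeline_output = run_pipeline(image, fake_detector, fake_classifier)
print(pipeline_output)
```

Keeping the two steps behind separate callables means either one can be swapped out without touching the other.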


Let’s start with the second step, the classification problem, as it is the main focus of this article.


Key Concepts

Transfer Learning: Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks. (Source: Wikipedia)

Data Augmentation: Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks.

Preprocessing Face Images

To build a mask detector model that takes faces as input, I needed to crop the faces from the collected images. I’m not pasting the code here to keep this article short, but you can find it here: labelling_images.ipynb

Training the model

Let’s code the model training algorithm step by step. I used Python and Jupyter notebooks in a Google Colab environment with a GPU, but you can run the code in whatever Python environment you prefer.

The training notebook can be found here in my Github repository if you prefer to see the entire code directly: https://github.com/fnandocontreras/mask-detector/blob/master/training_model.ipynb

Let’s import some dependencies:


import os
import pathlib

import numpy as np
import tensorflow as tf
import IPython.display as display
import matplotlib.pyplot as plt
from PIL import Image
# Use the Keras bundled with TensorFlow so everything comes from the same package
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping

I’m using Google Drive to store the training images, but feel free to use your local machine if you want to run the code locally. Let’s mount the Google Drive storage into the notebook, set base_path to the path of the collected images, and get directory info objects with pathlib as follows:


from google.colab import drive
drive.mount('/content/drive')

base_path = '/content/drive/My Drive/Colab Notebooks/mask-detector/'
data_dir_collected = pathlib.Path(os.path.join(base_path, 'data/training'))
data_dir_test = pathlib.Path(os.path.join(base_path, 'data/test'))

image_count_collected = len(list(data_dir_collected.glob('**/*.jpg')))
test_images_count = len(list(data_dir_test.glob('**/*.jpg')))
print('images collected', image_count_collected)
print('test images', test_images_count)

In my Google Drive the images are stored with the following folder structure: a folder named “mask” containing all the images with masks, and a folder named “nomask” containing all the images without masks.
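As a quick sanity check of this layout, here is a minimal sketch that counts the .jpg files in each class folder. The helper name `count_images_per_class` and the temporary directory with empty placeholder files are assumptions made only so the example runs anywhere.

```python
import pathlib
import tempfile

def count_images_per_class(root):
    """Return {class_folder_name: number_of_jpg_files} for each subfolder."""
    root = pathlib.Path(root)
    return {d.name: len(list(d.glob('*.jpg')))
            for d in sorted(root.iterdir()) if d.is_dir()}

# Build a throwaway directory tree that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in [('mask', 3), ('nomask', 2)]:
        class_dir = pathlib.Path(tmp) / class_name
        class_dir.mkdir()
        for i in range(n_files):
            (class_dir / f'img_{i}.jpg').touch()
    counts = count_images_per_class(tmp)

print(counts)
```

On the real data you would pass the training folder instead of the temporary one; a strongly unbalanced count is worth knowing about before training.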

Let’s check that we are loading the images from the right path:


CLASS_NAMES = np.array([item.name for item in data_dir_collected.glob('*')])
print(CLASS_NAMES)

This should print: ['nomask' 'mask']


Let’s define some constants and create the TensorFlow image generator that will load images and feed them to the model training process:


IMG_WIDTH, IMG_HEIGHT = 160, 160  # matches "mobilenetv2_1.00_160" in the model summary further down
IMG_SIZE = (IMG_WIDTH, IMG_HEIGHT)
image_generator = ImageDataGenerator(rescale=1./255)
train_data_gen = image_generator.flow_from_directory(directory=str(data_dir_collected), shuffle=True, target_size=IMG_SIZE, classes=list(CLASS_NAMES))
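To make the `rescale=1./255` argument concrete, here is a small NumPy-only sketch of the same arithmetic on a made-up 2x2 image. It illustrates the idea and is not what the generator runs internally.

```python
import numpy as np

# A fake 2x2 RGB image with 8-bit pixel values (0..255).
raw = np.array([[[0, 128, 255], [64, 64, 64]],
                [[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

# Dividing by 255 is equivalent to rescale=1./255: every pixel lands in [0, 1].
scaled = raw.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())
```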

Let’s show the images being loaded by the training data generator:


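def show_batch(image_batch, label_batch):
    plt.figure(figsize=(10, 10))
    for n in range(25):
        ax = plt.subplot(5, 5, n + 1)
        # The plotting calls below are a reconstruction; the original loop body was cut off
        plt.imshow(image_batch[n])
        plt.title(CLASS_NAMES[label_batch[n] == 1][0])
        plt.axis('off')

image_batch, label_batch = next(train_data_gen)
show_batch(image_batch, label_batch)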

Let’s define some auxiliary functions that will be useful for data preprocessing:


def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    return parts[-2] == CLASS_NAMES

def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size.
    return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])

def process_path(file_path):
    label = get_label(file_path)
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label
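The comparison in get_label produces a one-hot boolean vector, one entry per class. Here is a NumPy-only sketch of the same idea; the class order and the file name `selfie_01.jpg` are assumptions for illustration.

```python
import os
import numpy as np

CLASS_NAMES = np.array(['nomask', 'mask'])   # assumed order, as printed earlier

def label_from_path(file_path):
    """One-hot boolean label taken from the parent folder name."""
    parts = file_path.split(os.path.sep)
    # Element-wise comparison: True only at the position of the matching class.
    return CLASS_NAMES == parts[-2]

example_path = os.path.join('data', 'training', 'mask', 'selfie_01.jpg')
label = label_from_path(example_path)
print(label)
```

This layout is what a 2-unit softmax output trained with categorical cross-entropy expects as a target.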

Let’s now load all the images from storage, including the test images that we will use to evaluate the model:


list_ds_collected = tf.data.Dataset.list_files(str(data_dir_collected/'*/*'))
list_ds_test = tf.data.Dataset.list_files(str(data_dir_test/'*/*'))

Let’s now apply the preprocessing functions defined before to the loaded images:


train_ds = list_ds_collected.map(process_path)
# The validation set must come from the test images, not from the training images
validation_ds = list_ds_test.map(process_path)

Let’s define an ImageDataGenerator that will perform Data Augmentation on the loaded images. It applies operations such as zoom, rotation and horizontal flip, which lets us train the model on a wider distribution of data.


Data augmentation is useful in many cases, when we don’t have enough training data or in the cases when the model is overfitting the training dataset. Intuitively we’re training a model to predict whether people are wearing masks, and our training data containing faces might be augmented by applying transformations to every face picture. In fact these transformations don’t modify the resulting class (“mask”, “nomask”), so let’s go for it.

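def get_data_generator():
    # The original argument list was cut off; these values are illustrative placeholders
    return ImageDataGenerator(rotation_range=20, zoom_range=0.15, horizontal_flip=True)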
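To see why augmentation leaves the label untouched, here is a simplified NumPy-only sketch. It is not what ImageDataGenerator does internally; `random_flip_and_shift` is a hypothetical helper that only mirrors and shifts the image.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=2):
    """Randomly mirror an H x W x C image and shift it horizontally."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                # horizontal flip
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(image, shift, axis=1)         # wrap-around shift

rng = np.random.default_rng(42)
face = np.arange(4 * 6 * 3, dtype=np.float32).reshape(4, 6, 3)
label = 'mask'

# Each call yields a new variant of the same face; the label is unchanged.
augmented = [(random_flip_and_shift(face, rng), label) for _ in range(5)]
```

Every variant has the same shape and the same pixels in a different arrangement, so a masked face stays a masked face.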

Let’s generate a batch of images containing all the collected/training and validation images and fit the image generator with them.


train_ds = train_ds.batch(image_count_collected)
validation_ds = validation_ds.batch(test_images_count)

datagen = get_data_generator()
image_batch, label_batch = next(iter(train_ds))
# Fit the generator on the training batch, as described above
datagen.fit(image_batch)

Now we can start building our Deep Learning model. As mentioned before, we’re using Transfer Learning: we use a pre-trained model as part of our final model. This lets us take advantage of the parameters learned by a general-purpose computer vision model and adapt them to our requirements.

In the code block below we’re loading MobileNetV2. You can find the research paper here if you want deeper details about the network architecture: https://arxiv.org/abs/1801.04381

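IMG_SHAPE = (IMG_WIDTH, IMG_HEIGHT, 3)
# Create the base model from the pre-trained model MobileNet V2
# (include_top=False matches the (None, 5, 5, 1280) output in the summary below)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
base_model.trainable = False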

We’re setting the model property “trainable” to False because we don’t want to retrain the base model.
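The effect of freezing the base can be illustrated without TensorFlow. The sketch below is a toy stand-in, not MobileNetV2: a fixed random matrix `W_base` plays the role of the frozen pre-trained layers, and only a small logistic "head" on top is trained with gradient descent. All names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained feature extractor (hypothetical weights).
W_base = rng.normal(size=(8, 4))
W_base_before = W_base.copy()

def features(x):
    # ReLU features; W_base is never updated, mimicking trainable = False.
    return np.maximum(x @ W_base, 0.0)

# Toy binary task: the label is 1 when the first input coordinate is positive.
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)
F = features(X)

def loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Trainable head: a single logistic unit on top of the frozen features.
w_head, b_head = np.zeros(4), 0.0
initial_loss = loss(w_head, b_head)

for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b_head)))
    w_head -= 0.1 * (F.T @ (p - y) / len(y))     # update the head only
    b_head -= 0.1 * np.mean(p - y)

final_loss = loss(w_head, b_head)
print(initial_loss, final_loss)
```

Only the head changes during training, which is why the model summary reports so few trainable parameters compared with the total.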

So now that we have the base model let’s complete it by adding some layers that we’ll need for our prediction outputs:


model = tf.keras.Sequential([
    base_model, tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(CLASS_NAMES), activation='softmax')])
model.summary()

You should see the model summary as follows:


Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mobilenetv2_1.00_160 (Model) (None, 5, 5, 1280)        2257984
_________________________________________________________________
global_average_pooling2d_4 ( (None, 1280)              0
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 2562
=================================================================
Total params: 2,260,546
Trainable params: 2,562
Non-trainable params: 2,257,984

We are adding a Global Average Pooling layer and a Dense layer with a “softmax” activation function on top of the base model’s output. You can find more details about these layers in the official Keras documentation: GlobalAveragePooling2D, Dense.
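Global average pooling simply averages each feature map over its spatial positions. Here is a minimal NumPy sketch using random numbers in place of real MobileNetV2 features, with the (5, 5, 1280) shape taken from the summary above.

```python
import numpy as np

# Fake feature maps with the shape reported in the model summary:
# (batch, height, width, channels) = (1, 5, 5, 1280).
feature_maps = np.random.default_rng(0).random((1, 5, 5, 1280))

# Global average pooling: average over the two spatial axes.
pooled = feature_maps.mean(axis=(1, 2))

print(pooled.shape)
```

Each of the 1280 channels collapses to a single number, which is why the layer adds no parameters.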

The model’s output is “(None, 2)”, where “None” represents the batch size, which may vary, and “2” is the size of the softmax layer, corresponding to the number of classes. A softmax layer outputs a probability distribution over the possible output classes.
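Here is a small NumPy sketch of what the softmax layer computes, together with the arithmetic behind the 2,562 trainable parameters in the summary. The logits are made-up numbers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, -1.0], [0.0, 0.0]])   # made-up scores for 2 faces
probs = softmax(logits)
print(probs)

# Dense layer parameters: one weight per (input, class) pair plus one bias per class.
dense_params = 1280 * 2 + 2
print(dense_params)
```

Each row of `probs` sums to 1, and equal scores give an even 50/50 split.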

Now that we have the model, let’s proceed to the training. Let’s iterate through the training images 50 times to generate the images that we will transform afterwards with our DataAugmentation generator:

reps = 50
training_ds = train_ds.repeat(reps)
X_training, Y_training = next(iter(training_ds))
for X, Y in training_ds:
    X_training = tf.concat([X, X_training], axis=0)
    Y_training = tf.concat([Y, Y_training], axis=0)
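As far as I can tell from the loop above, X_training starts with one full batch and then receives one more copy per repetition, so it ends up holding reps + 1 copies of the data. Here is a NumPy-only sketch of that bookkeeping, with tiny fake images and a hypothetical helper name:

```python
import numpy as np

def repeat_and_concat(batch, reps):
    """Mimic the loop above: start from one batch, then prepend `reps` more."""
    stacked = batch
    for _ in range(reps):
        stacked = np.concatenate([batch, stacked], axis=0)
    return stacked

batch = np.zeros((4, 8, 8, 3), dtype=np.float32)   # 4 tiny fake images
stacked = repeat_and_concat(batch, reps=3)
print(stacked.shape[0])
```

With 4 images and reps=3 the result has 16 rows, that is (3 + 1) * 4.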

Now we can compile and fit our model. We’re using “Adam” as the optimisation algorithm, training the model for 10 epochs, and using early stopping, which stops the training process if the validation loss doesn’t improve for 6 consecutive epochs.

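model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
initial_epochs = 10
earlystopping = EarlyStopping(monitor='val_loss', patience=6)
batch_size = 100
history = model.fit(
    datagen.flow(X_training, Y_training, batch_size=batch_size),
    steps_per_epoch=len(X_training) / batch_size,
    # The remaining arguments are a reconstruction; the original call was cut off
    epochs=initial_epochs,
    validation_data=validation_ds,
    callbacks=[earlystopping])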
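The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the helper name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without the
    validation loss dropping below the best value seen so far.
    """
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2.
stopped_at = early_stopping_epoch([0.9, 0.5, 0.6, 0.7, 0.55, 0.8], patience=3)
print(stopped_at)
```

With a patience of 6 and only 10 epochs, a run whose validation loss keeps improving never triggers the rule and trains for all 10 epochs.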

At the end of the training I had the following results:

Epoch 10/10
98/97 [==============================] - 46s 466ms/step - loss: 0.0018 - accuracy: 1.0000 - val_loss: 6.2493e-04 - val_accuracy: 1.0000

Accuracy: 1.00 means the model classified the test dataset with 100% accuracy. That doesn’t mean the model performs equally well on every dataset. Remember that for this experiment I’m using only around 200 images for training, which isn’t enough for most Machine Learning problems.

To build a model that performs well and generalizes correctly in most cases, we would probably need thousands of images. However, for a proof of concept it’s more than enough, and the model seems to work very well on the several people I tested it on in real time using the webcam video stream.

Now that we have the trained model, let’s save it and see how we can use it:

model.save('model.h5')

Now you can load the saved model from another python program like this:


from tensorflow.keras.models import load_model
model = load_model('model.h5')

Now that we have a mask detector model, we need the first part of our pipeline: a face detector. Object detection is one of the main tasks of Computer Vision, and you can find many pretrained object detection models out there, some covering several thousand classes. Here I used MTCNN, which stands for “Multi-task Cascaded Convolutional Networks”. You can find the GitHub repository here: https://github.com/ipazc/mtcnn

Let’s import MTCNN and create an instance of the face detector:


from mtcnn import MTCNN
detector = MTCNN()

Let’s load an image with opencv for testing:


import cv2
from matplotlib import pyplot as plt
test_img_path = os.path.join(data_dir_test, 'macron.jpg')
img = cv2.imread(test_img_path,cv2.IMREAD_COLOR)

Let’s run the face detector on the image:


face_boxes = detector.detect_faces(img)
print(face_boxes)

[{'box': [463, 187, 357, 449],
  'confidence': 0.9995754361152649,
  'keypoints': {'left_eye': (589, 346), 'right_eye': (750, 357), 'nose': (678, 442), 'mouth_left': (597, 525), 'mouth_right': (733, 537)}}]

We can see that the face is detected and we have all the relevant information like bounding box, and position of points of interest. In this case we only need the bounding box, which will help us to crop the image delimiting the face.
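Cropping with such a box is plain array slicing. Below is a minimal sketch, assuming the box is given as [x, y, width, height] as in the output above. `crop_face` is a hypothetical helper, not the extract_faces function from tools.py, and the blank image only stands in for the real photo.

```python
import numpy as np

def crop_face(image, box):
    """Crop an H x W x C image to a box given as [x, y, width, height]."""
    x, y, w, h = box
    x, y = max(x, 0), max(y, 0)       # detectors can return slightly negative coords
    return image[y:y + h, x:x + w]

image = np.zeros((1000, 1200, 3), dtype=np.uint8)    # blank stand-in image
face = crop_face(image, [463, 187, 357, 449])        # box from the output above
print(face.shape)
```

Note that rows are indexed by y and columns by x, so the height comes first in the resulting shape.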

And now let’s see how the model performs with the French president. I’m using some auxiliary functions for cropping and drawing faces that you can find in tools.py:


face_boxes, faces = extract_faces(face_boxes, img)
preds = model.predict(tf.data.Dataset.from_tensors(faces))
probas = preds.max(axis=1)
y_preds = [CLASS_NAMES[c] for c in preds.argmax(axis=1)]
draw_boxes(img, face_boxes, (y_preds, probas))

And voilà: the model says the French president Macron is wearing a mask.
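To make the post-processing in the snippet above concrete, here is a NumPy-only sketch of how the softmax outputs become class names and confidences. The probabilities and the class order are made up for illustration.

```python
import numpy as np

CLASS_NAMES = np.array(['nomask', 'mask'])      # assumed order, as printed earlier

# Made-up softmax outputs for three faces (each row sums to 1).
preds = np.array([[0.10, 0.90],
                  [0.80, 0.20],
                  [0.45, 0.55]])

probas = preds.max(axis=1)                      # confidence of the winning class
y_preds = [str(CLASS_NAMES[c]) for c in preds.argmax(axis=1)]
print(y_preds)
```

A prediction like the third one, at 0.55, is barely better than a coin flip, so it can be worth showing the confidence next to the label.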


You can try it yourself with your own images. The whole code and the trained model are available on GitHub: https://github.com/fnandocontreras/mask-detector

You can also run it in real time using OpenCV and your webcam. For details on how to run the program, see the instructions in readme.md


That’s all for this tutorial. Remember this is just an experiment; it is not intended for use in a real-life environment because of its limitations. One important limitation is that the face detector often fails to detect masked faces, which breaks the first step of the pipeline, so the whole detector fails to work as intended.

I hope you enjoyed reading.


Keep calm and wear a mask to help stop #covid19


