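This example demonstrates monocular depth estimation with MidasNet in OpenVINO, using a single image as input. Model information can be found here.
Environment description: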
3-monodepth-imaging
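Depth estimation means estimating the depth of the objects in an RGB image, a difficult step from two dimensions to three. When it comes to measuring distance, stereo cameras or LiDAR come to mind first. Each of these approaches has drawbacks, such as large size (TOF), high power consumption (Kinect ships with a cooling system), sensitivity to the environment (infrared light in sunlight interferes), high algorithmic complexity, and poor real-time performance (TOF has the best real-time performance but lower accuracy). Monocular depth estimation has an inherent limitation: the sensor cannot directly provide accurate distance information. As software algorithms have advanced, however, deep learning can compensate for the hardware shortfall, and the estimated depth also gives other imaging tasks such as semantic segmentation and object recognition additional feature information.
Even with one eye closed, we can still judge how far away the objects in front of us are. In the same way, we hope that deep learning can give a machine a learning ability similar to that of the human brain, so that it can estimate distance from a 2D image.
This demo uses a neural network model called MiDaS, which is described in the following paper: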
R. Ranftl, K. Lasinger, D. Hafner, K. Schindler and V. Koltun, “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.3019967.
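The paper proposes a supervised depth estimation method. Its strategy can be summarized as follows:
1) Train on several depth datasets together (each with its own scale and shift properties), which increases the amount of data and lets the scenes complement one another;
2) Propose a scale- and shift-invariant loss function to supervise the depth regression, so that the existing data can be used more effectively;
3) Extend the datasets with samples taken from 3D movies, further increasing the amount of data;
4) Use a principled multi-objective training method, which gives a more effective optimization procedure.
Combining these strategies, the final model generalizes well and avoids the heavy dependence on dataset-specific scenes seen with some earlier public datasets.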
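To make strategy 2) more concrete, here is a minimal sketch of a scale- and shift-invariant error in Python with NumPy. It is an illustration, not the loss used in the paper: it assumes a plain least-squares fit of one scale and one shift per image followed by a mean squared error, and the function names `align_scale_shift` and `scale_shift_invariant_mse` are hypothetical.

```python
import numpy as np


def align_scale_shift(prediction, target):
    """Return (s, t) minimizing sum((s * prediction + t - target) ** 2)."""
    p = np.asarray(prediction, dtype=np.float64).ravel()
    g = np.asarray(target, dtype=np.float64).ravel()
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return float(s), float(t)


def scale_shift_invariant_mse(prediction, target):
    """Mean squared error after aligning prediction to target (assumed simplification)."""
    s, t = align_scale_shift(prediction, target)
    aligned = s * np.asarray(prediction, dtype=np.float64) + t
    return float(np.mean((aligned - np.asarray(target, dtype=np.float64)) ** 2))
```

With an error of this kind, a prediction that differs from the ground truth only by a global scale and offset scores close to zero, which is why datasets with incompatible depth units can be mixed during training.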
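Overall logic of the code: read the model (ie.read_model) and compile it (ie.compile_model); run inference on the input image (compiled_model([input_image])[output_key]). The result has the same size as the model output. The output is then converted to an RGB image (with the function convert_result_to_image), resized back to the size of the input image, and visualized. The code is as follows: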
import sys
import time
from pathlib import Path
import cv2
import matplotlib.cm
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import (
    HTML,
    FileLink,
    Pretty,
    ProgressBar,
    Video,
    clear_output,
    display,
)
from openvino.runtime import Core
DEVICE = "CPU"
MODEL_FILE = "model/MiDaS_small.xml"
model_xml_path = Path(MODEL_FILE)
def normalize_minmax(data):
    """
    Normalizes the values in `data` between 0 and 1
    """
    return (data - data.min()) / (data.max() - data.min())
def convert_result_to_image(result, colormap="viridis"):
    """
    Convert network result of floating point numbers to an RGB image with
    integer values from 0-255 by applying a colormap.
    `result` is expected to be a single network result in 1,H,W shape
    `colormap` is a matplotlib colormap.
    See https://matplotlib.org/stable/tutorials/colors/colormaps.html
    """
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9
    cmap = plt.get_cmap(colormap)
    # drop the leading batch dimension: (1, H, W) -> (H, W)
    result = result.squeeze(0)
    result = normalize_minmax(result)
    # the colormap returns RGBA in [0, 1]; keep RGB and scale to 0-255
    result = cmap(result)[:, :, :3] * 255
    result = result.astype(np.uint8)
    return result
def to_rgb(image_data) -> np.ndarray:
    """
    Convert image_data from BGR to RGB
    """
    return cv2.cvtColor(image_data, cv2.COLOR_BGR2RGB)
print("1 - Load Model")
ie = Core()
model = ie.read_model(model=model_xml_path, weights=model_xml_path.with_suffix(".bin"))
compiled_model = ie.compile_model(model=model, device_name=DEVICE)
input_key = compiled_model.input(0)
output_key = compiled_model.output(0)
print("- Input layer info: {}".format(input_key))
print("- Output layer info: {}".format(output_key))
network_input_shape = list(input_key.shape)
network_image_height, network_image_width = network_input_shape[2:]
print("2 - Load Image")
IMAGE_FILE = "data/coco_bike.jpg"
image = cv2.imread(IMAGE_FILE)
print("- Input image size: {}".format(image.shape))
# resize to the network input size; cv2.resize expects dsize as (width, height)
resized_image = cv2.resize(src=image, dsize=(network_image_width, network_image_height))
# reshape image to network input shape NCHW
input_image = np.expand_dims(np.transpose(resized_image, (2, 0, 1)), 0)
print("- Image resize into: {}".format(input_image.shape))
print("3 - Model Inference")
result = compiled_model([input_image])[output_key]
print("- Inference result shape: {}".format(result.shape))
print("- convert network result of disparity map to an image that shows distance as colors.")
result_image = convert_result_to_image(result)
# resize back to original image shape. cv2.resize expects shape
# in (width, height), [::-1] reverses the (height, width) shape to match this
result_image = cv2.resize(result_image, image.shape[:2][::-1])
print("- resize back to original image shape from (width, height) to (height, width) based on cv2.resize requirement with final image shape {}".format(result_image.shape))
print("- final results visualization.")
fig, ax = plt.subplots(1, 2, figsize=(20, 15))
ax[0].imshow(to_rgb(image))
ax[1].imshow(result_image)
plt.show()  # needed when running as a script; a notebook displays the figure automatically
Terminal output:
1 - Load Model
- Input layer info:
- Output layer info:
2 - Load Image
- Input image size: (600, 800, 3)
- Image resize into: (1, 3, 256, 256)
3 - Model Inference
- Inference result shape: (1, 256, 256)
- convert network result of disparity map to an image that shows distance as colors.
- resize back to original image shape from (width, height) to (height, width) based on cv2.resize requirement with final image shape (600, 800, 3)
- final results visualization.
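One detail of the listing is worth a note: normalize_minmax divides by data.max() - data.min(), which is zero when the network output is constant. The sketch below, which uses NumPy only, shows a guarded variant together with the HWC to NCHW conversion used before inference. The names `normalize_minmax_safe` and `to_nchw` and the `eps` threshold are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np


def normalize_minmax_safe(data, eps=1e-8):
    """Scale `data` to [0, 1]; return zeros if the value range is (almost) zero."""
    data = np.asarray(data, dtype=np.float32)
    value_range = data.max() - data.min()
    if value_range < eps:
        # A constant map carries no relative depth information.
        return np.zeros_like(data)
    return (data - data.min()) / value_range


def to_nchw(image_hwc):
    """Convert one H,W,C image into a batch of one in N,C,H,W layout."""
    return np.expand_dims(np.transpose(image_hwc, (2, 0, 1)), 0)
```

For example, `to_nchw(np.zeros((256, 256, 3)))` has shape `(1, 3, 256, 256)`, which matches the "Image resize into" line in the output above.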