[OpenCV实战]5 基于深度学习的文本检测

目录

1 网络加载

2 读取图像

3 前向传播

4 处理输出

3结果和代码

3.1结果

3.2 代码

参考


在这篇文章中,我们将逐字逐句地尝试找到图片中的单词!基于最近的一篇论文进行文字检测。

EAST: An Efficient and Accurate Scene Text Detector.

https://arxiv.org/abs/1704.03155v2

https://github.com/argman/EAST

应该注意,文本检测不同于文本识别。在文本检测中,我们只检测文本周围的边界框。但是,在文本识别中,我们实际上找到了框中所写的内容。例如,在下面给出的图像中,文本检测将为您提供单词周围的边界框,文本识别将告诉您该框包含单词STOP。本文只进行文本检测。

本文基于tensorflow模型,基于OpenCV调用tensorflow模型。我们将逐步讨论算法是如何工作的。您将需要OpenCV3.4.3以上版本来运行代码。其他opencv DNN模型读取也类似这样步骤。

涉及的步骤如下:

  1. 下载EAST模型
  2. 将模型加载到内存中
  3. 准备输入图像
  4. 正向传递blob通过网络
  5. 处理输出

1 网络加载

我们将使用cv :: dnn :: readnet(C++版本)或cv2.dnn.ReadNet(python版本)函数将网络加载到内存中。它会根据指定的文件名自动检测配置和框架。在我们的例子中,它是一个pb文件,因此,它将假定要加载Tensorflow网络。和加载图像不大一样,没有模型结构描述文件。

C++

   Net net = readNet(model);

Python

   net = cv.dnn.readNet(model)

2 读取图像

我们需要创建一个4-D输入blob,用于将图像输送到网络。这是使用blobFromImage函数完成的。

C++

   blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);

Python

   blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

我们需要为此函数指定一些参数。它们如下:

  1. 第一个参数是图像本身。
  2. 第二个参数指定每个像素值的缩放。在这种情况下,它不是必需的。因此我们将其保持为1。
  3. 第三个参数是设定网络的默认输入为320×320。因此,我们需要在创建blob时指定它。最好和网络输入一致。
  4. 第四个参数是训练时候设定的模型均值。需要减去模型均值。
  5. 第五个参数是我们是否要交换R和B通道。这是必需的,因为OpenCV使用BGR格式,Tensorflow使用RGB格式,caffe模型使用BGR格式。
  6. 最后一个参数是我们是否要裁剪图像并采取中心裁剪。在这种情况下我们指定False。

3 前向传播

现在我们已准备好输入,我们将通过网络传递它。网络有两个输出。一个指定文本框的位置,另一个指定检测到的框的置信度分数。两个输出层如下:

feature_fusion/concat_3

feature_fusion/Conv_7/Sigmoid

这两个输出可以直接用netron这个软件打开pb模型,看到最后输出结果。Netron是一个模型结构可视化神器,支持tf, caffe, keras,mxnet等多种框架。Netron下载地址:

https://github.com/lutzroeder/Netron

c++读取输出代码如下:

   std::vector outputLayers(2);

outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";

outputLayers[1] = "feature_fusion/concat_3";

python读取输出代码如下:

   outputLayers = []

outputLayers.append("feature_fusion/Conv_7/Sigmoid")

outputLayers.append("feature_fusion/concat_3")

接下来,我们通过将输入图像传递到网络来获得输出。如前所述,输出由两部分组成:置信度和位置。

C++

   std::vector output;

net.setInput(blob);

net.forward(output, outputLayers);

Mat scores = output[0];

Mat geometry = output[1];

python:

   net.setInput(blob)

output = net.forward(outputLayers)

scores = output[0]

geometry = output[1]

4 处理输出

如前所述,我们将使用两个层的输出并解码文本框的位置及其方向。我们可能会得到许多文本框。因此,我们需要从该批次中筛选出看起来最好的文本框。这是使用非极大值抑制算法完成的。

非极大值抑制算法在目标检测中应用很广泛,具体可以参考

http://www.it610.com/article/5215825.htm

https://blog.csdn.net/qq_14845119/article/details/52064928

1 解码

C++:

   std::vector boxes;

std::vector confidences;

decode(scores, geometry, confThreshold, boxes, confidences);

python:

   [boxes, confidences] = decode(scores, geometry, confThreshold)

2 非极大值抑制

我们使用OpenCV函数NMSBoxes(C ++)或NMSBoxesRotated(Python)来过滤掉误报并获得最终预测。

C++:

   std::vector indices;

NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

Python:

   indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)

3结果和代码

3.1结果

在VS2017下运行了C++代码,其中OpenCV版本至少要3.4.5以上。不然模型读取会有问题。模型文件太大,见下载链接:

 

如果没有积分(系统自动设定资源分数)看看参考链接。我搬运过来的,大修改没有。

或者梯子直接下载模型:

https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1

结果如下,效果还不错,速度也还好。

[OpenCV实战]5 基于深度学习的文本检测_第1张图片

[OpenCV实战]5 基于深度学习的文本检测_第2张图片

 

[OpenCV实战]5 基于深度学习的文本检测_第3张图片

3.2 代码

C++代码有所更改,python没有。对文本检测不熟悉,注释较少。

C++代码:

   // text_detection.cpp : 此文件包含 "main" 函数。程序执行将在此处开始并结束。
//

#include "pch.h"
#include 
#include 

using namespace std;
using namespace cv;
using namespace cv::dnn;

//解码
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
	std::vector &detections, std::vector &confidences);

/**
 * @brief
 *
 * @param srcImg 检测图像
 * @param inpWidth 深度学习图像输入宽
 * @param inpHeight 深度学习图像输入高
 * @param confThreshold 置信度
 * @param nmsThreshold 非极大值抑制算法阈值
 * @param net
 * @return Mat
 */
Mat text_detect(Mat srcImg, int inpWidth, int inpHeight, float confThreshold, float nmsThreshold, Net net)
{
	//输出
	std::vector output;
	std::vector outputLayers(2);
	outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
	outputLayers[1] = "feature_fusion/concat_3";

	//检测图像
	Mat frame, blob;
	frame = srcImg.clone();
	//获取深度学习模型的输入
	blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
	net.setInput(blob);
	//输出结果
	net.forward(output, outputLayers);

	//置信度
	Mat scores = output[0];
	//位置参数
	Mat geometry = output[1];

	// Decode predicted bounding boxes, 对检测框进行解码,获取文本框位置方向
	//文本框位置参数
	std::vector boxes;
	//文本框置信度
	std::vector confidences;
	decode(scores, geometry, confThreshold, boxes, confidences);

	// Apply non-maximum suppression procedure, 应用非极大性抑制算法
	//符合要求的文本框
	std::vector indices;
	NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

	// Render detections. 输出预测
	//缩放比例
	Point2f ratio((float)frame.cols / inpWidth, (float)frame.rows / inpHeight);
	for (size_t i = 0; i < indices.size(); ++i)
	{
		RotatedRect &box = boxes[indices[i]];

		Point2f vertices[4];
		box.points(vertices);
		//还原坐标点
		for (int j = 0; j < 4; ++j)
		{
			vertices[j].x *= ratio.x;
			vertices[j].y *= ratio.y;
		}
		//画框
		for (int j = 0; j < 4; ++j)
		{
			line(frame, vertices[j], vertices[(j + 1) % 4], Scalar(0, 255, 0), 2, LINE_AA);
		}
	}

	// Put efficiency information. 时间
	std::vector layersTimes;
	double freq = getTickFrequency() / 1000;
	double t = net.getPerfProfile(layersTimes) / freq;
	std::string label = format("Inference time: %.2f ms", t);
	putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

	return frame;
}

//模型地址
auto model = "./model/frozen_east_text_detection.pb";
//检测图像
auto detect_image = "./image/patient.jpg";
//输入框尺寸
auto inpWidth = 320;
auto inpHeight = 320;
//置信度阈值
auto confThreshold = 0.5;
//非极大值抑制算法阈值
auto nmsThreshold = 0.4;

int main()
{
	//读取模型
	Net net = readNet(model);
	//读取检测图像
	Mat srcImg = imread(detect_image);
	if (!srcImg.empty())
	{
		cout << "read image success!" << endl;
	}
	Mat resultImg = text_detect(srcImg, inpWidth, inpHeight, confThreshold, nmsThreshold, net);
	imshow("result", resultImg);
	waitKey();
	return 0;
}

/**
 * @brief 输出检测到的文本框相关信息
 *
 * @param scores 置信度
 * @param geometry 位置信息
 * @param scoreThresh 置信度阈值
 * @param detections 位置
 * @param confidences 分类概率
 */
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
	std::vector &detections, std::vector &confidences)
{
	detections.clear();
	//判断是不是符合提取要求
	CV_Assert(scores.dims == 4);
	CV_Assert(geometry.dims == 4);
	CV_Assert(scores.size[0] == 1);
	CV_Assert(geometry.size[0] == 1);
	CV_Assert(scores.size[1] == 1);
	CV_Assert(geometry.size[1] == 5);
	CV_Assert(scores.size[2] == geometry.size[2]);
	CV_Assert(scores.size[3] == geometry.size[3]);

	const int height = scores.size[2];
	const int width = scores.size[3];
	for (int y = 0; y < height; ++y)
	{
		//识别概率
		const float *scoresData = scores.ptr(0, 0, y);
		//文本框坐标
		const float *x0_data = geometry.ptr(0, 0, y);
		const float *x1_data = geometry.ptr(0, 1, y);
		const float *x2_data = geometry.ptr(0, 2, y);
		const float *x3_data = geometry.ptr(0, 3, y);
		//文本框角度
		const float *anglesData = geometry.ptr(0, 4, y);
		//遍历所有检测到的检测框
		for (int x = 0; x < width; ++x)
		{
			float score = scoresData[x];
			//低于阈值忽略该检测框
			if (score < scoreThresh)
			{
				continue;
			}

			// Decode a prediction.
			// Multiple by 4 because feature maps are 4 time less than input image.
			float offsetX = x * 4.0f, offsetY = y * 4.0f;
			//角度及相关正余弦计算
			float angle = anglesData[x];
			float cosA = std::cos(angle);
			float sinA = std::sin(angle);
			float h = x0_data[x] + x2_data[x];
			float w = x1_data[x] + x3_data[x];

			Point2f offset(offsetX + cosA * x1_data[x] + sinA * x2_data[x],
				offsetY - sinA * x1_data[x] + cosA * x2_data[x]);
			Point2f p1 = Point2f(-sinA * h, -cosA * h) + offset;
			Point2f p3 = Point2f(-cosA * w, sinA * w) + offset;
			//旋转矩形,分别输入中心点坐标,图像宽高,角度
			RotatedRect r(0.5f * (p1 + p3), Size2f(w, h), -angle * 180.0f / (float)CV_PI);
			//保存检测框
			detections.push_back(r);
			//保存检测框的置信度
			confidences.push_back(score);
		}
	}
}

Python代码:

   # Import required modules
import cv2 as cv
import math
import argparse

parser = argparse.ArgumentParser(description='Use this script to run text detection deep learning networks using OpenCV.')
# Input argument
parser.add_argument('--input', help='Path to input image or video file. Skip this argument to capture frames from a camera.')
# Model argument
parser.add_argument('--model', default="./model/frozen_east_text_detection.pb",
                    help='Path to a binary .pb file of model contains trained weights.'
                    )
# Width argument
parser.add_argument('--width', type=int, default=320,
                    help='Preprocess input image by resizing to a specific width. It should be multiple by 32.'
                   )
# Height argument
parser.add_argument('--height',type=int, default=320,
                    help='Preprocess input image by resizing to a specific height. It should be multiple by 32.'
                   )
# Confidence threshold
parser.add_argument('--thr',type=float, default=0.5,
                    help='Confidence threshold.'
                   )
# Non-maximum suppression threshold
parser.add_argument('--nms',type=float, default=0.4,
                    help='Non-maximum suppression threshold.'
                   )

args = parser.parse_args()


############ Utility functions ############
def decode(scores, geometry, scoreThresh):
    detections = []
    confidences = []

    ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ############
    assert len(scores.shape) == 4, "Incorrect dimensions of scores"
    assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
    assert scores.shape[0] == 1, "Invalid dimensions of scores"
    assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
    assert scores.shape[1] == 1, "Invalid dimensions of scores"
    assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
    assert scores.shape[2] == geometry.shape[2], "Invalid dimensions of scores and geometry"
    assert scores.shape[3] == geometry.shape[3], "Invalid dimensions of scores and geometry"
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(0, height):

        # Extract data from scores
        scoresData = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        anglesData = geometry[0][4][y]
        for x in range(0, width):
            score = scoresData[x]

            # If score is lower than threshold score, move to next x
            if(score < scoreThresh):
                continue

            # Calculate offset
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = anglesData[x]

            # Calculate cos and sin of angle
            cosA = math.cos(angle)
            sinA = math.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]

            # Calculate offset
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x], offsetY - sinA * x1_data[x] + cosA * x2_data[x]])

            # Find points for rectangle
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0],  sinA * w + offset[1])
            center = (0.5*(p1[0]+p3[0]), 0.5*(p1[1]+p3[1]))
            detections.append((center, (w,h), -1*angle * 180.0 / math.pi))
            confidences.append(float(score))

    # Return detections and confidences
    return [detections, confidences]

if __name__ == "__main__":
    # Read and store arguments
    confThreshold = args.thr
    nmsThreshold = args.nms
    inpWidth = args.width
    inpHeight = args.height
    model = args.model

    # Load network
    net = cv.dnn.readNet(model)

    # Create a new named window
    kWinName = "EAST: An Efficient and Accurate Scene Text Detector"
    outputLayers = []
    outputLayers.append("feature_fusion/Conv_7/Sigmoid")
    outputLayers.append("feature_fusion/concat_3")

    # Read frame
    frame = cv.imread("./image/stop1.jpg")


    # Get frame height and width
    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    # Create a 4D blob from frame.
    blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

    # Run the model
    net.setInput(blob)
    output = net.forward(outputLayers)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())

    # Get scores and geometry
    scores = output[0]
    geometry = output[1]
    [boxes, confidences] = decode(scores, geometry, confThreshold)
    # Apply NMS
    indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold,nmsThreshold)
    for i in indices:
        # get 4 corners of the rotated rect
        vertices = cv.boxPoints(boxes[i[0]])
        # scale the bounding box coordinates based on the respective ratios
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        for j in range(4):
            p1 = (vertices[j][0], vertices[j][1])
            p2 = (vertices[(j + 1) % 4][0], vertices[(j + 1) % 4][1])
            cv.line(frame, p1, p2, (0, 255, 0), 2, cv.LINE_AA);
            # cv.putText(frame, "{:.3f}".format(confidences[i[0]]), (vertices[0][0], vertices[0][1]), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv.LINE_AA)

    # Put efficiency information
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

    # Display the frame
    cv.imshow("result",frame)
    cv.waitKey(0)

参考

https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/

你可能感兴趣的:(geek)