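1. Paddle learning site: PaddlePaddle AI Studio - AI learning and training community
2. AI Studio basic operations (1): Notebook
3. PaddlePaddle framework documentation
4. PaddleOCR on GitHub
5. Master PaddleOCR in ten minutes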
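This post follows the AI Talent Creation Camp (AI达人创造营) to learn the basics of PaddlePaddle, and draws on other open-source projects to complete a hands-on PaddleOCR competition entry. Below I walk through using Paddle for a captcha recognition competition, including the concrete training workflow.
Competition link:
2022 Digital China Innovation Contest (2022 DCIC), FinTech sub-track: Transaction Captcha Recognition Based on Text Characters
Project link: https://aistudio.baidu.com/aistudio/projectdetail/3501451?channelType=0&channel=0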
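As a cost-effective security check, captchas are widely used in many settings and effectively stop bots from spoofing identities; static text-character captchas are the most common kind. As their use has deepened, interference techniques such as noise dots, noise lines, overlapping, and distortion have kept emerging, steadily raising the level of protection. RPA is key to enterprise digital transformation and is favored for its non-intrusive deployment, but low captcha recognition rates often limit where RPA can be applied. A captcha model that can filter several kinds of interference at once therefore has some commercial value for extending related automation technology.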
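The competition uses sample character-captcha images with labeled characters as training data. Participants build a model from the provided samples and recognize the character captchas in the test set, extracting the valid character information. Training data is not limited to what is provided; public datasets may be added.
The organizers provide a training set of 15,000 labeled images, each a captcha containing a 4-character text string with its characters annotated; the test set contains 25,000 captcha images.
The training set is provided as the archive train_imgs.zip (each file name is the text label of that image); the test set is provided as the archive test_imgs.zip, which contains the images to recognize.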
File name | Description |
---|---|
train_imgs.zip | Training images: 15,000 captcha images |
test_imgs.zip | Test images: 25,000 captcha images to recognize |
submit_example.csv | Submission example; follow this format when submitting |
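Because each training file name is the label itself, the ground truth can be read straight from the path. A minimal sketch follows; the helper name `label_from_filename` is hypothetical and not part of the competition kit.

```python
from pathlib import Path


def label_from_filename(path: str) -> str:
    """Return the captcha text encoded in an image file name."""
    # Path.stem drops the directory part and the final extension only.
    return Path(path).stem


print(label_from_filename("00IS.png"))
```

Labels are case-sensitive, so the stem is returned unchanged rather than lowercased.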
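The competition is scored by accuracy: a submitted result counts only if the full captcha text is recognized exactly, and entries are ranked from highest to lowest accuracy on the test images.
Ties in accuracy are broken by submission time, with the earlier submission winning.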
$$P(\text{accuracy}) = \frac{\text{number of correctly recognized targets}}{\text{total number of targets to recognize}}$$
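Accuracy here is the fraction of images whose whole string is predicted exactly; a single wrong character makes the sample count as wrong. A small sketch of that metric, with a hypothetical function name:

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of samples whose predicted string equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


# "o1lU" differs from "01jU" in two characters, so it counts as one miss.
print(exact_match_accuracy(["00IS", "o1lU"], ["00IS", "01jU"]))
```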
Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/126477
To run the project, you need to mount this competition dataset.
!ls data/data126477/
# Three files in total
# submit_example.csv test_dataset.zip training_dataset.zip
submit_example.csv test_dataset.zip training_dataset.zip
# Unzip the datasets
!unzip -o data/data126477/training_dataset.zip -d data/
!unzip -o data/data126477/test_dataset.zip -d data/
!cp data/data126477/submit_example.csv data/
Archive: data/data126477/training_dataset.zip
creating: data/training_dataset/
inflating: data/training_dataset/00IS.png
inflating: data/training_dataset/00O3.png
inflating: data/training_dataset/0180.png
inflating: data/training_dataset/01BA.png
......
We can split the 15,000 training images 8:2, using 12,000 for training and 3,000 for validation.
import pandas as pd
import shutil
import os
import glob
from tqdm import tqdm
from sklearn.model_selection import train_test_split
data_path = 'train_data/'
dcic_data_path = './PaddleOCR/train_data/dcic_data/'
dcic_train = './PaddleOCR/train_data/dcic_data/train'
dcic_valid = './PaddleOCR/train_data/dcic_data/valid'
dcic_test = './PaddleOCR/train_data/dcic_data/test'
os.makedirs(dcic_data_path, exist_ok=True)
os.makedirs(dcic_train, exist_ok=True)
os.makedirs(dcic_valid, exist_ok=True)
os.makedirs(dcic_test, exist_ok=True)
# print([filepath for filepath in glob.glob('data/dcic_data/training_dataset/')])
# print(glob.glob('data/dcic_data/training_dataset/*.png'))
# print(os.listdir('data/training_dataset'))
train_images = os.listdir('data/training_dataset')
test_images = os.listdir('data/test_dataset')
train_imgs, valid_imgs = train_test_split(train_images, test_size=0.2, random_state=42, shuffle=True)
print(len(train_imgs), len(valid_imgs))
all_txts = []
# shutil.copy('data/dcic_data/training_dataset/0A5o.png', 'train_data/dcic_data/train/0A5o.png')
# Copy the training images and write one "path<TAB>label" line per image
with open('./PaddleOCR/train_data/dcic_data/rec_gt_train.txt', 'w', encoding='utf-8') as f:
    for image in tqdm(train_imgs):
        shutil.copy(f'data/training_dataset/{image}', f'./PaddleOCR/train_data/dcic_data/train/{image}')
        txt = image.split('.png')[0]
        all_txts.append(txt)
        f.write(f'train/{image}\t{txt}' + '\n')
# Same for the validation images
with open('./PaddleOCR/train_data/dcic_data/rec_gt_valid.txt', 'w', encoding='utf-8') as f:
    for image in tqdm(valid_imgs):
        shutil.copy(f'data/training_dataset/{image}', f'./PaddleOCR/train_data/dcic_data/valid/{image}')
        txt = image.split('.png')[0]
        all_txts.append(txt)
        f.write(f'valid/{image}\t{txt}' + '\n')
# Test images have no labels, so they are only copied
for image in tqdm(test_images):
    shutil.copy(f'data/test_dataset/{image}', f'./PaddleOCR/train_data/dcic_data/test/{image}')
# with open('train_data/dcic_data/captcha.txt', 'w', encoding='utf-8') as f:
#     all_str = ''.join(all_txts)
#     dict_char=sorted(set(all_str))
#     for char in dict_char:
#         f.write(char+'\n')
14%|█▍ | 1736/12000 [00:00<00:00, 17353.23it/s]
12000 3000
100%|██████████| 12000/12000 [00:00<00:00, 17161.60it/s]
100%|██████████| 3000/3000 [00:00<00:00, 17321.81it/s]
100%|██████████| 25000/25000 [00:01<00:00, 17481.43it/s]
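The label files written above hold one `relative/path<TAB>label` line per image. Before training, it can help to confirm that every line has that shape, that no label is longer than the intended 4 characters, and that every character is one the recognizer can emit. The sketch below is only illustrative: `check_label_lines` is a hypothetical helper, and the allowed character set (digits plus ASCII letters) is an assumption; the real set is whatever the dictionary file used for training contains.

```python
import string


def check_label_lines(lines, allowed_chars, max_text_length=4):
    """Return a list of (line_number, reason) for malformed label lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((number, "expected 'path<TAB>label'"))
            continue
        _path, label = parts
        if not 0 < len(label) <= max_text_length:
            problems.append((number, "label length out of range"))
        elif not set(label) <= allowed_chars:
            problems.append((number, "character missing from dictionary"))
    return problems


# Assumed character set: digits plus upper- and lower-case ASCII letters.
allowed = set(string.digits + string.ascii_letters)
sample = [
    "train/00IS.png\t00IS\n",   # well formed
    "train/01BA.png 01BA\n",    # space instead of tab
    "train/0180.png\t01800\n",  # five characters
]
print(check_label_lines(sample, allowed))
```

In practice the lines would come from `open(...)` on rec_gt_train.txt and rec_gt_valid.txt rather than an inline list.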
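import cv2
import matplotlib.pyplot as plt
# Read the image
raw_img = cv2.imread("train_data/dcic_data/valid/01jQ.png")
plt.figure()
plt.subplot(2,1,1)
# Show the original image
plt.imshow(raw_img)
# Resize and normalize (resize_norm_img is a helper that is not defined in this post; it must be defined before this cell runs)
padding_im, draw_img = resize_norm_img(raw_img)
plt.subplot(2,1,2)
# Show the image as fed to the network
plt.imshow(draw_img)
plt.show()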
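The cell above calls `resize_norm_img` without defining it. As a rough guide to what "resize and normalize" means here, the sketch below assumes the helper follows common text-recognition preprocessing: scale the image to the target height while keeping its aspect ratio, cap the width at the target width (padding whatever is left), and map pixel values from [0, 255] to [-1, 1]. The names `resized_width` and `normalize_pixel` are hypothetical, and the default target shape is taken from `image_shape: [3, 32, 96]` in this post's competition config.

```python
import math


def resized_width(height, width, image_shape=(3, 32, 96)):
    """Width after scaling to the target height, capped at the target width."""
    _, target_h, target_w = image_shape
    ratio = width / float(height)
    return min(target_w, int(math.ceil(target_h * ratio)))


def normalize_pixel(value):
    """Map a pixel value from [0, 255] to [-1, 1]."""
    return (value / 255.0 - 0.5) / 0.5


# A 40x100 image keeps its aspect ratio; a 30x120 image hits the width cap.
print(resized_width(40, 100), resized_width(30, 120))
print(normalize_pixel(0), normalize_pixel(255))
```

Any columns between the resized width and the target width would be left as padding, so wide captchas are squeezed while narrow ones are not stretched.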
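PaddleOCR training and validation are configured through a config file. Below we check that the data paths in the config are correct, taking rec_icdar15_train.yml as an example: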
Train:
  dataset:
    name: SimpleDataSet
    # root directory of the training data
    data_dir: ./train_data/ic15_data/
    # training data label file
    label_file_list: ["./train_data/ic15_data/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 100] # [3,32,320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
    use_shared_memory: False
Eval:
  dataset:
    name: SimpleDataSet
    # root directory of the evaluation data
    data_dir: ./train_data/ic15_data
    # evaluation data label file
    label_file_list: ["./train_data/ic15_data/rec_gt_test.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 100]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256
    num_workers: 4
    use_shared_memory: False
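# Copy the config file to use as the dcic competition config
!cp ./PaddleOCR/configs/rec/rec_icdar15_train.yml ./PaddleOCR/configs/rec/rec_dcic_train.yml
Fill ./PaddleOCR/configs/rec/rec_dcic_train.yml with the following content. To make it easier to follow, I have added comments on the key settings: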
Global:
  use_gpu: true
  # number of training epochs
  epoch_num: 300
  log_smooth_window: 20
  print_batch_step: 10
  # directory where models are saved
  save_model_dir: ./output/rec/dcic/
  save_epoch_step: 3
  # evaluation is run every 2000 iterations
  eval_batch_step: [0, 2000]
  cal_metric_during_train: True
  pretrained_model: pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy
  checkpoints:
  save_inference_dir: ./
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  # for data or label process
  character_dict_path: ppocr/utils/en_dict.txt
  max_text_length: 4
  infer_mode: False
  use_space_char: False
  save_res_path: ./output/rec/predicts_dcic.txt
# optimizer settings
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.0005
  regularizer:
    name: 'L2'
    factor: 0
# model architecture
Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    # number of RNN hidden units (hyperparameter)
    hidden_size: 96
  Head:
    name: CTCHead
    fc_decay: 0
Loss:
  name: CTCLoss
PostProcess:
  name: CTCLabelDecode
Metric:
  name: RecMetric
  main_indicator: acc
Train:
  dataset:
    name: SimpleDataSet
    # training set directory
    data_dir: ./train_data/dcic_data/
    # training set label file
    label_file_list: ["./train_data/dcic_data/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 96]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 0
    use_shared_memory: False
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/dcic_data
    label_file_list: ["./train_data/dcic_data/rec_gt_valid.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 96]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256
    num_workers: 4
    use_shared_memory: False
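It is worth working out what these settings mean in iterations, because `eval_batch_step` is counted in iterations while `epoch_num` is counted in epochs. The arithmetic below is a sketch under two assumptions: training runs on a single card, and evaluation fires once every 2000 iterations counted from iteration 0. The helper name `iterations_per_epoch` is hypothetical.

```python
def iterations_per_epoch(num_samples, batch_size, drop_last):
    """Number of batches the loader yields per epoch."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    return full_batches + 1


train_samples = 12000  # size of the training split
batch_size = 256       # Train.loader.batch_size_per_card
epochs = 300           # Global.epoch_num
eval_every = 2000      # second entry of Global.eval_batch_step

per_epoch = iterations_per_epoch(train_samples, batch_size, drop_last=True)
total = per_epoch * epochs
print(per_epoch, total, total // eval_every)  # prints: 46 13800 6
```

Under these assumptions the full run evaluates only about six times, roughly once every 43 epochs, so a smaller `eval_batch_step` may be preferable if `best_accuracy` should track progress more closely.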
Download the pretrained model: to speed up convergence, it is recommended to download a trained model and fine-tune it on the competition data.
%cd PaddleOCR/
# Download the MobileNetV3 pretrained model
!wget -nc -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar
# Extract the model parameters
!tar -xf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar && rm -rf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar
!mv rec_mv3_none_bilstm_ctc_v2.0_train ./pretrain_models
/home/aistudio/PaddleOCR
--2022-02-19 20:43:08-- https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51200000 (49M) [application/x-tar]
Saving to: ‘./pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar’
rec_mv3_none_bilstm 100%[===================>] 48.83M 17.6MB/s in 2.8s
2022-02-19 20:43:11 (17.6 MB/s) - ‘./pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar’ saved [51200000/51200000]
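Starting training is simple: just specify the config file. Values in the config file can also be overridden on the command line with -o. The training command is shown below, where:
Global.pretrained_model: path of the pretrained model to load
Global.character_dict_path: dictionary path (the original note says only the 26 lowercase letters plus digits are supported, which matches PaddleOCR's default when no dictionary is set; the config above points to ppocr/utils/en_dict.txt instead)
Global.eval_batch_step: evaluation frequency
Global.epoch_num: total number of training epochs
!pwd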
!cd PaddleOCR
!pwd
/home/aistudio
/home/aistudio
!python3 tools/train.py -c configs/rec/rec_dcic_train.yml \
-o Global.pretrained_model=./pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train/best_accuracy
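python3: can't open file 'tools/train.py': [Errno 2] No such file or directory
Note: this error appears because `!cd PaddleOCR` runs in a temporary subshell and does not change the notebook's working directory, as the two identical `!pwd` outputs show. Use `%cd PaddleOCR` (the magic used in the download step) so that the training command runs from the PaddleOCR directory.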
The evaluation dataset is set through label_file_list under Eval in configs/rec/rec_dcic_train.yml.
Here we use the dcic validation set by default and load the weights just trained:
!python tools/eval.py -c configs/rec/rec_dcic_train.yml -o Global.checkpoints=output/rec/dcic/best_accuracy
With a model trained by PaddleOCR, the following script gives a quick prediction.
train_data/dcic_data/train/0a1E.png
The image to predict is taken from infer_img by default, and the trained parameters are loaded with -o Global.checkpoints:
!python tools/infer_rec.py -c configs/rec/rec_dcic_train.yml \
-o Global.checkpoints=./output/rec/dcic/best_accuracy \
Global.infer_img=./train_data/dcic_data/valid/01jU.png
[2022/01/24 03:54:25] root INFO: Architecture :
[2022/01/24 03:54:25] root INFO: Backbone :
[2022/01/24 03:54:25] root INFO: model_name : large
[2022/01/24 03:54:25] root INFO: name : MobileNetV3
[2022/01/24 03:54:25] root INFO: scale : 0.5
[2022/01/24 03:54:25] root INFO: Head :
[2022/01/24 03:54:25] root INFO: fc_decay : 0
[2022/01/24 03:54:25] root INFO: name : CTCHead
[2022/01/24 03:54:25] root INFO: Neck :
[2022/01/24 03:54:25] root INFO: encoder_type : rnn
[2022/01/24 03:54:25] root INFO: hidden_size : 96
[2022/01/24 03:54:25] root INFO: name : SequenceEncoder
[2022/01/24 03:54:25] root INFO: Transform : None
[2022/01/24 03:54:25] root INFO: algorithm : CRNN
[2022/01/24 03:54:25] root INFO: model_type : rec
[2022/01/24 03:54:25] root INFO: Eval :
[2022/01/24 03:54:25] root INFO: dataset :
[2022/01/24 03:54:25] root INFO: data_dir : ./train_data/dcic_data
[2022/01/24 03:54:25] root INFO: label_file_list : ['./train_data/dcic_data/rec_gt_valid.txt']
[2022/01/24 03:54:25] root INFO: name : SimpleDataSet
[2022/01/24 03:54:25] root INFO: transforms :
[2022/01/24 03:54:25] root INFO: DecodeImage :
[2022/01/24 03:54:25] root INFO: channel_first : False
[2022/01/24 03:54:25] root INFO: img_mode : BGR
[2022/01/24 03:54:25] root INFO: CTCLabelEncode : None
[2022/01/24 03:54:25] root INFO: RecResizeImg :
[2022/01/24 03:54:25] root INFO: image_shape : [3, 32, 96]
[2022/01/24 03:54:25] root INFO: KeepKeys :
[2022/01/24 03:54:25] root INFO: keep_keys : ['image', 'label', 'length']
[2022/01/24 03:54:25] root INFO: loader :
[2022/01/24 03:54:25] root INFO: batch_size_per_card : 256
[2022/01/24 03:54:25] root INFO: drop_last : False
[2022/01/24 03:54:25] root INFO: num_workers : 4
[2022/01/24 03:54:25] root INFO: shuffle : False
[2022/01/24 03:54:25] root INFO: use_shared_memory : False
[2022/01/24 03:54:25] root INFO: Global :
[2022/01/24 03:54:25] root INFO: cal_metric_during_train : True
[2022/01/24 03:54:25] root INFO: character_dict_path : ppocr/utils/en_dict.txt
[2022/01/24 03:54:25] root INFO: checkpoints : ./output/rec/dcic/best_accuracy
[2022/01/24 03:54:25] root INFO: debug : False
[2022/01/24 03:54:25] root INFO: distributed : False
[2022/01/24 03:54:25] root INFO: epoch_num : 300
[2022/01/24 03:54:25] root INFO: eval_batch_step : [0, 2000]
[2022/01/24 03:54:25] root INFO: infer_img : ./train_data/dcic_data/valid/01jU.png
[2022/01/24 03:54:25] root INFO: infer_mode : False
[2022/01/24 03:54:25] root INFO: log_smooth_window : 20
[2022/01/24 03:54:25] root INFO: max_text_length : 4
[2022/01/24 03:54:25] root INFO: pretrained_model : pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy
[2022/01/24 03:54:25] root INFO: print_batch_step : 10
[2022/01/24 03:54:25] root INFO: save_epoch_step : 3
[2022/01/24 03:54:25] root INFO: save_inference_dir : ./
[2022/01/24 03:54:25] root INFO: save_model_dir : ./output/rec/dcic/
[2022/01/24 03:54:25] root INFO: save_res_path : ./output/rec/predicts_dcic.txt
[2022/01/24 03:54:25] root INFO: use_gpu : True
[2022/01/24 03:54:25] root INFO: use_space_char : False
[2022/01/24 03:54:25] root INFO: use_visualdl : False
[2022/01/24 03:54:25] root INFO: Loss :
[2022/01/24 03:54:25] root INFO: name : CTCLoss
[2022/01/24 03:54:25] root INFO: Metric :
[2022/01/24 03:54:25] root INFO: main_indicator : acc
[2022/01/24 03:54:25] root INFO: name : RecMetric
[2022/01/24 03:54:25] root INFO: Optimizer :
[2022/01/24 03:54:25] root INFO: beta1 : 0.9
[2022/01/24 03:54:25] root INFO: beta2 : 0.999
[2022/01/24 03:54:25] root INFO: lr :
[2022/01/24 03:54:25] root INFO: learning_rate : 0.0005
[2022/01/24 03:54:25] root INFO: name : Adam
[2022/01/24 03:54:25] root INFO: regularizer :
[2022/01/24 03:54:25] root INFO: factor : 0
[2022/01/24 03:54:25] root INFO: name : L2
[2022/01/24 03:54:25] root INFO: PostProcess :
[2022/01/24 03:54:25] root INFO: name : CTCLabelDecode
[2022/01/24 03:54:25] root INFO: Train :
[2022/01/24 03:54:25] root INFO: dataset :
[2022/01/24 03:54:25] root INFO: data_dir : ./train_data/dcic_data/
[2022/01/24 03:54:25] root INFO: label_file_list : ['./train_data/dcic_data/rec_gt_train.txt']
[2022/01/24 03:54:25] root INFO: name : SimpleDataSet
[2022/01/24 03:54:25] root INFO: transforms :
[2022/01/24 03:54:25] root INFO: DecodeImage :
[2022/01/24 03:54:25] root INFO: channel_first : False
[2022/01/24 03:54:25] root INFO: img_mode : BGR
[2022/01/24 03:54:25] root INFO: CTCLabelEncode : None
[2022/01/24 03:54:25] root INFO: RecResizeImg :
[2022/01/24 03:54:25] root INFO: image_shape : [3, 32, 96]
[2022/01/24 03:54:25] root INFO: KeepKeys :
[2022/01/24 03:54:25] root INFO: keep_keys : ['image', 'label', 'length']
[2022/01/24 03:54:25] root INFO: loader :
[2022/01/24 03:54:25] root INFO: batch_size_per_card : 256
[2022/01/24 03:54:25] root INFO: drop_last : True
[2022/01/24 03:54:25] root INFO: num_workers : 0
[2022/01/24 03:54:25] root INFO: shuffle : True
[2022/01/24 03:54:25] root INFO: use_shared_memory : False
[2022/01/24 03:54:25] root INFO: profiler_options : None
[2022/01/24 03:54:25] root INFO: train with paddle 2.2.1 and device CUDAPlace(0)
W0124 03:54:25.561218 8122 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0124 03:54:25.566077 8122 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022/01/24 03:54:30] root INFO: resume from ./output/rec/dcic/best_accuracy
[2022/01/24 03:54:30] root INFO: infer_img: ./train_data/dcic_data/valid/01jU.png
[2022/01/24 03:54:30] root INFO: result: o1lU 0.9584374
[2022/01/24 03:54:30] root INFO: success!
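The prediction appears in the last lines of the log. The file name gives the ground truth 01jU, while the model predicts o1lU, so this sample is misrecognized despite a confidence of about 0.96: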
[2022/01/24 03:54:30] root INFO: infer_img: ./train_data/dcic_data/valid/01jU.png
[2022/01/24 03:54:30] root INFO: result: o1lU 0.9584374
[2022/01/24 03:54:30] root INFO: success!
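Since validation file names carry the ground truth, a results file produced on the validation folder can be scored locally before submitting. The sketch below assumes each line of the results file has the form `path<TAB>text<TAB>score`, as suggested by the log line above; `score_predictions` is a hypothetical helper, and the second sample line is made up for illustration.

```python
from pathlib import Path


def score_predictions(lines):
    """Return (correct, total) for lines shaped 'path<TAB>text<TAB>score'."""
    correct = total = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        image_path, predicted = parts[0], parts[1]
        total += 1
        # The file stem is the label, so an exact match means a correct sample.
        correct += int(predicted == Path(image_path).stem)
    return correct, total


sample = [
    "./train_data/dcic_data/valid/01jU.png\to1lU\t0.9584374\n",  # from the log
    "./train_data/dcic_data/valid/01jQ.png\t01jQ\t0.99\n",       # invented line
]
correct, total = score_predictions(sample)
print(f"{correct}/{total} exact matches")
```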
# Predict the whole test set
!python tools/infer_rec.py -c configs/rec/rec_dcic_train.yml \
-o Global.checkpoints=./output/rec/dcic/best_accuracy \
Global.infer_img=../data/test_dataset
!pwd
/home/aistudio/PaddleOCR
import pandas as pd
submit = pd.read_csv('../data/data126477/submit_example.csv')
# print(submit)
nums = []
results = []
with open('output/rec/predicts_dcic.txt', 'r', encoding='utf-8') as f:
    # Each line holds: image path, predicted text, confidence (tab-separated)
    for line in f:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue
        img, res = parts[0], parts[1]
        # Test file names are numeric, so the stem becomes the num column
        img = img.split('/')[-1].split('.png')[0]
        nums.append(int(img))
        results.append(res)
result_df = pd.DataFrame({'num': nums, 'tag': results})
result_df = result_df.sort_values('num', ascending=True)
result_df.to_csv('baseline.csv', index=False)
result_df
 | num | tag |
---|---|---|
0 | 1 | 01Fb |
1 | 10 | 04xs |
2 | 100 | 0Onx |
113 | 101 | 0OU1 |
224 | 102 | 0p3c |
... | ... | ... |
234 | 10208 | OxkP |
235 | 10209 | oxmH |
237 | 10210 | 0XMy |
238 | 10211 | 0xp6 |
239 | 10212 | 0xq2 |
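Before uploading, a quick structural check of the CSV can catch missing rows or truncated tags. The sketch below assumes the submission has exactly the two columns `num` and `tag` used in the DataFrame above (the authoritative format is submit_example.csv) and that every tag should be 4 characters long; `check_submission` is a hypothetical helper.

```python
import csv
import io


def check_submission(csv_text, expected_rows, tag_length=4):
    """Return a list of problems found in a 'num,tag' submission."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["num", "tag"]:
        return [f"unexpected header: {reader.fieldnames}"]
    problems, seen = [], set()
    for row in reader:
        if row["num"] in seen:
            problems.append(f"duplicate num {row['num']}")
        seen.add(row["num"])
        if len(row["tag"]) != tag_length:
            problems.append(f"num {row['num']}: tag {row['tag']!r} is not {tag_length} characters")
    if len(seen) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(seen)}")
    return problems


# Tiny inline example; the second tag is deliberately one character short.
print(check_submission("num,tag\n1,01Fb\n2,04x\n", expected_rows=2))
```

For the real file, the text would come from reading baseline.csv and `expected_rows` would be 25000, the size of the test set.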
References:
https://aistudio.baidu.com/aistudio/projectdetail/3438655?channelType=0&channel=0
https://aistudio.baidu.com/aistudio/projectdetail/3526082