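Project requirement: recognize the numeric CAPTCHA on the patent office query page
Patent query site: http://cpquery.sipo.gov.cn/. The project needs to crawl this site; its login page looks like this:
After logging in through the public query entry, the query page shows a numeric CAPTCHA.
This kind of CAPTCHA (the characters are not distorted) is relatively easy to recognize: only the digits 0-9 and the plus and minus operators need to be recognized. Zoomed in:
The four main steps of training the OCR model:
First, download some sample CAPTCHA images from: http://cpquery.sipo.gov.cn//freeze.main?txn-code=createImgServlet&freshStept=1.
This link may fail to open if you are not logged in. That is how the patent site works: login is a multi-step flow, and a bare URL is not enough to fetch the resource; otherwise there would be no need to write a dedicated crawler for this site. For commercial reasons the crawling logic is not published here; this post covers only the numeric CAPTCHA.
2.1 Background denoising
The code is as follows: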
private BufferedImage removeBackground(BufferedImage pic) {
    BufferedImage img = pic; // note: the input image is modified in place
    int width = pic.getWidth();
    int height = pic.getHeight();
    // Binarize: light pixels become pure white, everything else pure black.
    // The brightness threshold lives in isWhite() and must be tuned for your CAPTCHA.
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            if (this.isWhite(img.getRGB(x, y)) == 1) {
                img.setRGB(x, y, Color.WHITE.getRGB());
            } else {
                img.setRGB(x, y, Color.BLACK.getRGB());
            }
        }
    }
    // Two denoising passes, then crop the 1-pixel border that removeNoise never visits.
    img = this.removeNoise(this.removeNoise(img));
    img = img.getSubimage(1, 1, img.getWidth() - 2, img.getHeight() - 2);
    return img;
}

// Whitens every dark pixel that has at least 3 white neighbours (left, right, up, down).
private BufferedImage removeNoise(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();
    for (int x = 1; x < width - 1; ++x) {
        for (int y = 1; y < height - 1; ++y) {
            boolean dark = this.getColorBright(image.getRGB(x, y)) < 100;
            int whiteNeighbours = this.isWhite(image.getRGB(x - 1, y)) + this.isWhite(image.getRGB(x + 1, y))
                    + this.isWhite(image.getRGB(x, y - 1)) + this.isWhite(image.getRGB(x, y + 1));
            if (dark && whiteNeighbours >= 3) {
                image.setRGB(x, y, Color.WHITE.getRGB());
            }
        }
    }
    return image;
}

// Sum of the red, green and blue channels (0 to 765).
private int getColorBright(int colorInt) {
    Color color = new Color(colorInt);
    return color.getRed() + color.getGreen() + color.getBlue();
}

private int isWhite(int colorInt) {
    Color color = new Color(colorInt);
    // The threshold must be tuned for your CAPTCHA
    return color.getRed() + color.getGreen() + color.getBlue() > 440 ? 1 : 0;
}

private int isBlack(int colorInt) {
    Color color = new Color(colorInt);
    // The threshold must be tuned for your CAPTCHA
    return color.getRed() + color.getGreen() + color.getBlue() <= 440 ? 1 : 0;
}
At this point the code produces a denoised, black-and-white image.
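To see what the denoising pass does, here is a minimal, self-contained sketch on a synthetic image. The class name `DenoiseDemo` and the 7x7 test image are made up for illustration; the neighbour rule and the thresholds (440 and 100) mirror the code above, and they are assumptions you would tune for other CAPTCHA styles.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

// Hypothetical demo class; not part of the original project.
final class DenoiseDemo {
    static final int WHITE = Color.WHITE.getRGB();
    static final int BLACK = Color.BLACK.getRGB();

    static int brightness(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue();
    }

    static boolean isWhite(int rgb) {
        return brightness(rgb) > 440;
    }

    // Whitens dark pixels that have at least 3 white 4-neighbours (same rule as removeNoise).
    static BufferedImage removeNoise(BufferedImage image) {
        for (int x = 1; x < image.getWidth() - 1; ++x) {
            for (int y = 1; y < image.getHeight() - 1; ++y) {
                if (brightness(image.getRGB(x, y)) >= 100) {
                    continue; // only dark pixels are candidates
                }
                int whiteNeighbours = 0;
                if (isWhite(image.getRGB(x - 1, y))) ++whiteNeighbours;
                if (isWhite(image.getRGB(x + 1, y))) ++whiteNeighbours;
                if (isWhite(image.getRGB(x, y - 1))) ++whiteNeighbours;
                if (isWhite(image.getRGB(x, y + 1))) ++whiteNeighbours;
                if (whiteNeighbours >= 3) {
                    image.setRGB(x, y, WHITE);
                }
            }
        }
        return image;
    }

    // 7x7 white canvas with an isolated speck at (1,1) and a solid 3x3 block at (3..5, 3..5).
    static BufferedImage sample() {
        BufferedImage img = new BufferedImage(7, 7, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < 7; ++x) {
            for (int y = 0; y < 7; ++y) {
                img.setRGB(x, y, WHITE);
            }
        }
        img.setRGB(1, 1, BLACK);
        for (int x = 3; x <= 5; ++x) {
            for (int y = 3; y <= 5; ++y) {
                img.setRGB(x, y, BLACK);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage out = removeNoise(sample());
        System.out.println("speck removed: " + isWhite(out.getRGB(1, 1)));
        System.out.println("block corner kept: " + !isWhite(out.getRGB(3, 3)));
    }
}
```

The isolated speck has four white neighbours and is erased, while every pixel of the solid block has at most two, so character strokes survive the pass.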
Next comes segmentation.
2.2 Vertical and horizontal segmentation
The vertical and horizontal segmentation code is as follows:
// Vertical segmentation: cut the image wherever a column contains no black pixels.
private List<BufferedImage> splitImage(BufferedImage img) {
    int i;
    int length;
    int width = img.getWidth();
    int height = img.getHeight();
    List<BufferedImage> subImgs = new ArrayList<>();
    // weightList.get(x) = number of black pixels in column x
    List<Integer> weightList = new ArrayList<>();
    for (i = 0; i < width; ++i) {
        length = 0;
        for (int y = 0; y < height; ++y) {
            if (this.isBlack(img.getRGB(i, y)) == 1) {
                ++length;
            }
        }
        weightList.add(length);
    }
    for (i = 0; i < weightList.size(); ++i) {
        // Measure the run of consecutive non-empty columns starting at i.
        for (length = 0; i < weightList.size() && weightList.get(i) > 0; ++length) {
            ++i;
        }
        // Runs of 2 columns or fewer are treated as leftover noise.
        if (length > 2) {
            subImgs.add(this.removeBlank(img.getSubimage(i - length, 0, length, height)));
        }
    }
    return subImgs;
}

// Horizontal segmentation: trim the blank rows above and below a single character.
private BufferedImage removeBlank(BufferedImage img) {
    int width = img.getWidth();
    int height = img.getHeight();
    int start = 0;
    int end = 0;
    int y;
    int x;
    // First row from the top that contains a black pixel.
    findTop:
    for (y = 0; y < height; ++y) {
        for (x = 0; x < width; ++x) {
            if (this.isBlack(img.getRGB(x, y)) == 1) {
                start = y;
                break findTop;
            }
        }
    }
    // First row from the bottom that contains a black pixel.
    for (y = height - 1; y >= 0; --y) {
        for (x = 0; x < width; ++x) {
            if (this.isBlack(img.getRGB(x, y)) == 1) {
                end = y;
                return img.getSubimage(0, start, width, end - start + 1);
            }
        }
    }
    return img.getSubimage(0, start, width, end - start + 1);
}
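The run-finding step of the vertical segmentation can be looked at in isolation. The sketch below works on a plain array of per-column black-pixel counts instead of an image; `ColumnRuns`, `findRuns` and the sample projection are hypothetical names and data chosen for illustration, and the "longer than 2 columns" rule is taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; shows only the run-finding step of splitImage.
final class ColumnRuns {
    // Returns {start, length} for every run of non-empty columns longer than minWidth.
    static List<int[]> findRuns(int[] blackPerColumn, int minWidth) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < blackPerColumn.length) {
            int start = i;
            while (i < blackPerColumn.length && blackPerColumn[i] > 0) {
                ++i;
            }
            int length = i - start;
            if (length > minWidth) {
                runs.add(new int[] {start, length});
            }
            ++i; // step over the empty column that ended (or prevented) the run
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two characters (columns 1-3 and 8-11) and a 1-column noise speck at column 6.
        int[] projection = {0, 3, 5, 4, 0, 0, 1, 0, 2, 6, 6, 2, 0};
        for (int[] run : findRuns(projection, 2)) {
            System.out.println("start=" + run[0] + " length=" + run[1]);
        }
    }
}
```

Each `{start, length}` pair corresponds to one `getSubimage(start, 0, length, height)` call in the original method; the speck at column 6 is dropped because its run is too narrow.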
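Save the 4 split images as material for later recognition:
2.3 Character labeling
Rename the image temp0 to 9_0, and likewise temp1 to -_0, temp2 to 5_0, and temp3 to =_0. The first character of the file name is the label of the character in the image; the loading code later reads it back from the file name.
These steps complete one round of sample training. Repeat them to accumulate more samples, and create a folder named num to hold the digit sample images.
Likewise, create a folder named operate to hold the plus and minus operator images.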
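The naming convention is simple enough to capture in two small helpers. This is only a sketch: `SampleNames`, `labelOf` and `sampleName` are hypothetical names, and the `.jpg` extension is an assumption based on the files the code in this post writes.

```java
// Hypothetical helpers for the "<label>_<index>" file naming convention.
final class SampleNames {
    // "9_0.jpg" -> "9", "-_3.jpg" -> "-": the first character is the label.
    static String labelOf(String fileName) {
        if (fileName == null || fileName.isEmpty()) {
            throw new IllegalArgumentException("empty file name");
        }
        return fileName.substring(0, 1);
    }

    // Builds "<label>_<index>.<extension>", e.g. sampleName("9", 0, "jpg") -> "9_0.jpg".
    static String sampleName(String label, int index, String extension) {
        return label + "_" + index + "." + extension;
    }

    public static void main(String[] args) {
        String name = sampleName("9", 0, "jpg");
        System.out.println(name + " is labeled " + labelOf(name));
    }
}
```

The index after the underscore only keeps file names unique, so several samples of the same character can sit side by side in one folder.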
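Recognizing a new CAPTCHA then comes down to splitting the image into four characters with the same pipeline and comparing each character against the named sample files. Pixels at the same position are compared as white or non-white; the sample with the fewest differing pixels is the match, and the first character of its file name is the recognized character.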
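The comparison can be sketched on tiny synthetic glyphs. Everything named here (`TemplateMatchDemo`, `fromRows`, `distance`, `classify`, the 3x5 glyphs) is made up for illustration. As a simplifying assumption, this sketch only compares templates of exactly the same size as the query, whereas the class listed later in this post also accepts samples up to one pixel larger.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; the tiny 3x5 glyphs are made up for illustration.
final class TemplateMatchDemo {
    // Builds a black-on-white image from rows of '#' (black) and '.' (white).
    static BufferedImage fromRows(String... rows) {
        BufferedImage img = new BufferedImage(rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; ++y) {
            for (int x = 0; x < rows[y].length(); ++x) {
                Color c = rows[y].charAt(x) == '#' ? Color.BLACK : Color.WHITE;
                img.setRGB(x, y, c.getRGB());
            }
        }
        return img;
    }

    static boolean isWhite(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue() > 440;
    }

    // Number of positions where one image is white and the other is not.
    static int distance(BufferedImage a, BufferedImage b) {
        int count = 0;
        for (int x = 0; x < a.getWidth(); ++x) {
            for (int y = 0; y < a.getHeight(); ++y) {
                if (isWhite(a.getRGB(x, y)) != isWhite(b.getRGB(x, y))) {
                    ++count;
                }
            }
        }
        return count;
    }

    // Label of the same-sized template with the fewest differing pixels, or null if none fits.
    static String classify(BufferedImage query, Map<String, BufferedImage> templates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, BufferedImage> e : templates.entrySet()) {
            BufferedImage t = e.getValue();
            if (t.getWidth() != query.getWidth() || t.getHeight() != query.getHeight()) {
                continue;
            }
            int d = distance(query, t);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    static Map<String, BufferedImage> demoTemplates() {
        Map<String, BufferedImage> templates = new LinkedHashMap<>();
        templates.put("1", fromRows(".#.", "##.", ".#.", ".#.", "###"));
        templates.put("7", fromRows("###", "..#", ".#.", ".#.", ".#."));
        return templates;
    }

    public static void main(String[] args) {
        // A "7" with its bottom pixel missing: 1 pixel away from "7", 8 pixels away from "1".
        BufferedImage query = fromRows("###", "..#", ".#.", ".#.", "...");
        System.out.println(classify(query, demoTemplates())); // prints 7
    }
}
```

Picking the minimum distance rather than demanding an exact match is what lets a sample tolerate a stray pixel left over from denoising.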
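Since the goal is to read the numeric CAPTCHA and compute the result of the expression, only the first three elements (digit, operator, digit) need to be recognized; training on the equals sign serves little purpose.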
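Once the three symbols are recognized, computing the answer is a one-liner. The sketch below uses hypothetical names (`CaptchaArithmetic`, `evaluate`) and, unlike the class listed later in this post (which treats anything other than "+" as subtraction), it rejects unknown operators; that stricter behavior is a design choice of this example, not the author's.

```java
// Hypothetical helper: evaluates "digit operator digit" as recognized from the CAPTCHA.
final class CaptchaArithmetic {
    static int evaluate(String left, String operator, String right) {
        int a = Integer.parseInt(left);
        int b = Integer.parseInt(right);
        switch (operator) {
            case "+":
                return a + b;
            case "-":
                return a - b;
            default:
                // Surface misrecognized operators instead of silently guessing.
                throw new IllegalArgumentException("unsupported operator: " + operator);
        }
    }

    public static void main(String[] args) {
        // The example from the labeling step: 9 - 5 =
        System.out.println(evaluate("9", "-", "5")); // prints 4
    }
}
```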
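To make the program in this post callable as an interface, both the training samples and the recognition code must be brought into the crawler project. There are two ways to do this: package the sample files and the code as a jar, publish it to a private Maven repository, and import it into the crawler project; or copy the sample files directly into the project. For ease of demonstration this post uses the second approach: create a config directory under the project and put the num and operate folders in it. Note that loadTrainData as written looks for them under config/ocr/ipbankspider/ in the working directory, so either place the folders there or adjust the path in the code.
The training samples must be loaded before recognition. The loading code is as follows (the type parameter selects whether the operator or the digit samples are loaded):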
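private Map<BufferedImage, String> loadTrainData(String type) throws IOException {
    Map<BufferedImage, String> map = new HashMap<>(64);
    File dir = new File(System.getProperty("user.dir") + pathSeparator + "config" + pathSeparator + "ocr" + pathSeparator + "ipbankspider" + pathSeparator + type);
    File[] files = dir.listFiles();
    if (files == null) {
        // listFiles() returns null when the directory is missing or unreadable
        logger.error("Numeric CAPTCHA sample directory is invalid: " + dir);
        throw new IOException("Sample directory not found: " + dir);
    }
    for (File pic : files) {
        BufferedImage sample = ImageIO.read(pic);
        if (sample != null) { // ImageIO.read returns null for files it cannot decode
            // The first character of the file name is the label, e.g. "9_0.jpg" -> "9"
            map.put(sample, pic.getName().substring(0, 1));
        }
    }
    return map;
}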
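As an alternative sketch of the same loading step, the version below uses `java.nio.file` and groups the samples by label, which makes it easy to see how many samples each character has. `SampleLoader` and `load` are hypothetical names, and taking the directory as a `Path` argument (instead of building it from `user.dir`) is an assumption made to keep the example self-contained.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.imageio.ImageIO;

// Hypothetical loader; groups sample images by the first character of their file name.
final class SampleLoader {
    static Map<String, List<BufferedImage>> load(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            throw new IOException("sample directory not found: " + dir);
        }
        Map<String, List<BufferedImage>> byLabel = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip sub-directories
                }
                BufferedImage img = ImageIO.read(file.toFile());
                if (img == null) {
                    continue; // not a decodable image
                }
                String label = file.getFileName().toString().substring(0, 1);
                byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(img);
            }
        }
        return byLabel;
    }
}
```

A call such as `SampleLoader.load(Paths.get("config", "ocr", "ipbankspider", "num"))` would then return one list of images per digit, assuming the folder layout used by the code above.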
The complete interface is as follows:
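import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import javax.imageio.ImageIO;

public class IpBankImageProcess {
    private static final String ADD = "+";
    private String pathSeparator;
    public IpBankImageProcess() {
        pathSeparator = this.getSeparator();
    }
    // Returns 1 if R+G+B is above the threshold (pixel counts as white), else 0.
    private int isWhite(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue() > 440 ? 1 : 0;
    }
    // Returns 1 if R+G+B is at or below the threshold (pixel counts as black), else 0.
    private int isBlack(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue() <= 440 ? 1 : 0;
    }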
    private BufferedImage removeBackground(BufferedImage pic, String path) throws IOException {
        BufferedImage img = pic; // note: the input image is modified in place
        int width = pic.getWidth();
        int height = pic.getHeight();
        // Binarize: light pixels become white, all others black (threshold is in isWhite).
        for (int x = 0; x < width; ++x) {
            for (int y = 0; y < height; ++y) {
                if (this.isWhite(img.getRGB(x, y)) == 1) {
                    img.setRGB(x, y, Color.WHITE.getRGB());
                } else {
                    img.setRGB(x, y, Color.BLACK.getRGB());
                }
            }
        }
        img = this.removeNoise(this.removeNoise(img));
        // Crop the 1-pixel border, which removeNoise never visits.
        img = img.getSubimage(1, 1, img.getWidth() - 2, img.getHeight() - 2);
        // Keep a copy of the intermediate result for inspection.
        storeProcessedPic(img, path + "removeBackground" + pathSeparator + "pic.jpg");
        return img;
    }
    // Whitens every dark pixel that has at least 3 white neighbours (left, right, up, down).
    private BufferedImage removeNoise(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        for (int x = 1; x < width - 1; ++x) {
            for (int y = 1; y < height - 1; ++y) {
                boolean dark = this.getColorBright(image.getRGB(x, y)) < 100;
                int whiteNeighbours = this.isWhite(image.getRGB(x - 1, y)) + this.isWhite(image.getRGB(x + 1, y))
                        + this.isWhite(image.getRGB(x, y - 1)) + this.isWhite(image.getRGB(x, y + 1));
                if (dark && whiteNeighbours >= 3) {
                    image.setRGB(x, y, Color.WHITE.getRGB());
                }
            }
        }
        return image;
    }
    // Sum of the red, green and blue channels (0 to 765).
    private int getColorBright(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue();
    }
    // Returns 1 for near-black or near-white pixels, 0 for anything in between.
    private int isBlackOrWhite(int colorInt) {
        return this.getColorBright(colorInt) >= 30 && this.getColorBright(colorInt) <= 700 ? 0 : 1;
    }
    // Trims the blank rows above and below a single character.
    private BufferedImage removeBlank(BufferedImage img) {
        int width = img.getWidth();
        int height = img.getHeight();
        int start = 0;
        int end = 0;
        int y;
        int x;
        // First row from the top that contains a black pixel.
        findTop:
        for (y = 0; y < height; ++y) {
            for (x = 0; x < width; ++x) {
                if (this.isBlack(img.getRGB(x, y)) == 1) {
                    start = y;
                    break findTop;
                }
            }
        }
        // First row from the bottom that contains a black pixel.
        for (y = height - 1; y >= 0; --y) {
            for (x = 0; x < width; ++x) {
                if (this.isBlack(img.getRGB(x, y)) == 1) {
                    end = y;
                    return img.getSubimage(0, start, width, end - start + 1);
                }
            }
        }
        return img.getSubimage(0, start, width, end - start + 1);
    }
    // Cuts the image wherever a column contains no black pixels.
    private List<BufferedImage> splitImage(BufferedImage img, String path) throws IOException {
        List<BufferedImage> subImgs = new ArrayList<>();
        int width = img.getWidth();
        int height = img.getHeight();
        // weightList.get(x) = number of black pixels in column x
        List<Integer> weightList = new ArrayList<>();
        int i;
        int length;
        for (i = 0; i < width; ++i) {
            length = 0;
            for (int y = 0; y < height; ++y) {
                if (this.isBlack(img.getRGB(i, y)) == 1) {
                    ++length;
                }
            }
            weightList.add(length);
        }
        for (i = 0; i < weightList.size(); ++i) {
            // Measure the run of consecutive non-empty columns starting at i.
            for (length = 0; i < weightList.size() && weightList.get(i) > 0; ++length) {
                ++i;
            }
            // Runs of 2 columns or fewer are treated as leftover noise.
            if (length > 2) {
                subImgs.add(this.removeBlank(img.getSubimage(i - length, 0, length, height)));
            }
        }
        // Save each character as temp0.jpg, temp1.jpg, ... for labeling.
        for (int j = 0; j < subImgs.size(); j++) {
            storeProcessedPic(subImgs.get(j), path + "splitImage" + pathSeparator + "temp" + j + ".jpg");
        }
        return subImgs;
    }
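    private Map<BufferedImage, String> loadTrainData(String type) throws IOException {
        Map<BufferedImage, String> map = new HashMap<>(64);
        File dir = new File(System.getProperty("user.dir") + pathSeparator + "config" + pathSeparator + "ocr" + pathSeparator + "ipbankspider" + pathSeparator + type);
        File[] files = dir.listFiles();
        if (files == null) {
            // listFiles() returns null when the directory is missing or unreadable
            throw new IOException("Numeric CAPTCHA sample directory is invalid: " + dir);
        }
        for (File pic : files) {
            BufferedImage sample = ImageIO.read(pic);
            if (sample != null) { // ImageIO.read returns null for files it cannot decode
                // The first character of the file name is the label, e.g. "9_0.jpg" -> "9"
                map.put(sample, pic.getName().substring(0, 1));
            }
        }
        return map;
    }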
    // Returns the label of the sample that differs from img in the fewest pixels.
    private String getSingleCharOcr(BufferedImage img, Map<BufferedImage, String> map) {
        String result = "";
        int width = img.getWidth();
        int height = img.getHeight();
        int min = width * height; // fewest differing pixels seen so far
        int deviation = 1;        // a sample may be at most this many pixels larger than img
        for (Map.Entry<BufferedImage, String> entry : map.entrySet()) {
            BufferedImage bi = entry.getKey();
            int sampleWidth = bi.getWidth();
            int sampleHeight = bi.getHeight();
            // Skip samples that are smaller than img or more than `deviation` pixels larger.
            if (sampleWidth < width || sampleWidth > width + deviation
                    || sampleHeight < height || sampleHeight > height + deviation) {
                continue;
            }
            int count = 0;
            compare:
            for (int x = 0; x < width; ++x) {
                for (int y = 0; y < height; ++y) {
                    if (this.isWhite(img.getRGB(x, y)) != this.isWhite(bi.getRGB(x, y))) {
                        ++count;
                        if (count >= min) {
                            break compare; // already worse than the best match so far
                        }
                    }
                }
            }
            if (count < min) {
                min = count;
                result = entry.getValue();
            }
        }
        if ("".equals(result)) {
            // Recognition failed: no sample fits. Fall back to a random digit.
            //logger.info("Numeric CAPTCHA recognition failed");
            Random r = new Random();
            result = r.nextInt(10) + "";
        }
        return result;
    }
    public int getResult(BufferedImage image, String path) throws IOException {
        BufferedImage img = this.removeBackground(image, path);
        List<BufferedImage> listImg = this.splitImage(img, path);
        Map<BufferedImage, String> operateMap = this.loadTrainData("operate");
        Map<BufferedImage, String> numMap = this.loadTrainData("num");
        String operatorSymbol = "";
        int a = 0;
        int b = 0;
        // The last character is the equals sign, so it is skipped.
        for (int i = 0; i < listImg.size() - 1; ++i) {
            if (i % 2 == 0) {
                // Even positions hold digits: position 0 is the left operand, position 2 the right.
                int digit = Integer.parseInt(this.getSingleCharOcr(listImg.get(i), numMap));
                if (i == 0) {
                    a = digit;
                }
                b = digit;
            } else {
                operatorSymbol = this.getSingleCharOcr(listImg.get(i), operateMap);
            }
        }
        // Anything other than "+" is treated as subtraction.
        if (ADD.equals(operatorSymbol)) {
            return a + b;
        } else {
            return a - b;
        }
    }
    // Returns "\" on Windows and "/" everywhere else.
    private String getSeparator() {
        String pathSeparator = File.separator;
        if (!"\\".equals(File.separator)) {
            pathSeparator = "/";
        }
        return pathSeparator;
    }
    private void storeProcessedPic(BufferedImage img, String storePath) throws IOException {
        File file = new File(storePath);
        File dir = file.getParentFile();
        // Create missing parent directories; ImageIO.write creates or overwrites the file itself.
        if (dir != null && !dir.exists()) {
            dir.mkdirs();
        }
        ImageIO.write(img, "jpg", file);
    }
}
Test call:
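import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class OcrTest {
    private static final String BASE_PATH = System.getProperty("user.dir");

    public static void main(String[] args) {
        IpBankImageProcess processor = new IpBankImageProcess();
        try {
            // Change this to the path of the image to recognize
            BufferedImage img = ImageIO.read(new File(BASE_PATH + File.separator + "test" + File.separator + "yourpic.jpg"));
            // Intermediate images are written under this directory; the path must end with a separator
            int result = processor.getResult(img, BASE_PATH + File.separator + "data" + File.separator);
            System.out.println("OCR result: \n" + result);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}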
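That yields the recognition result. The code above can be copied locally and used after minor changes (paths and thresholds); on this CAPTCHA the recognition rate is close to 100%.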
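To check the recognition rate on your own samples, one simple approach is to run the recognizer over a set of CAPTCHAs whose answers you have entered by hand and count the matches. The helper below is a hypothetical sketch (`AccuracyCheck` and `accuracy` are not part of the original code), and the numbers in `main` are made-up placeholders, not measured results.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for measuring the recognition rate on hand-labeled CAPTCHAs.
final class AccuracyCheck {
    // Fraction of predictions equal to the expected answers; both lists must have the same size.
    static double accuracy(List<Integer> expected, List<Integer> predicted) {
        if (expected.isEmpty() || expected.size() != predicted.size()) {
            throw new IllegalArgumentException("lists must be non-empty and of equal size");
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); ++i) {
            if (expected.get(i).equals(predicted.get(i))) {
                ++correct;
            }
        }
        return (double) correct / expected.size();
    }

    public static void main(String[] args) {
        // Placeholder data: hand-entered answers vs. values returned by getResult.
        List<Integer> expected = Arrays.asList(4, 11, 2, 7);
        List<Integer> predicted = Arrays.asList(4, 11, 3, 7);
        System.out.println("accuracy = " + accuracy(expected, predicted));
    }
}
```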