Recognizing the Numeric CAPTCHA on the Patent Query Site

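1. Background

Project requirement: recognize the numeric CAPTCHA on the patent site's query page.

Patent query URL: http://cpquery.sipo.gov.cn/. The project needs to crawl this site, whose login page looks like this:

Figure 1: Home page login

After logging in through the public query entry, the query page shows a numeric CAPTCHA.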


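This kind of CAPTCHA (the characters are not distorted) is relatively easy to recognize: every digit is a single character from 0 to 9, and, as the samples in section 2.3 show, the digits are joined by a plus or minus sign and followed by an equals sign.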

 


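First, train the model.

Training the OCR model takes four main steps:

  • Denoising
  • Vertical segmentation
  • Horizontal segmentation (strip the blank rows above and below each character to extract single characters)
  • Character labeling

2. Character training

First, download some CAPTCHA samples of this kind from http://cpquery.sipo.gov.cn//freeze.main?txn-code=createImgServlet&freshStept=1.

The link may not open because you are not logged in. That is how the patent site works: login is a multi-step flow and a URL alone does not return the resource; otherwise there would be no need to write a dedicated crawler for this site. For commercial reasons the crawling logic is not published here, and this article only covers the numeric CAPTCHA.

2.1 Background denoising

The code is as follows: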

private BufferedImage removeBackground(BufferedImage pic) {
	BufferedImage img = pic;
	int width = pic.getWidth();
	int height = pic.getHeight();

	// Binarize: bright pixels become white, everything else black.
	for (int x = 0; x < width; ++x) {
		for (int y = 0; y < height; ++y) {
			// The threshold inside isWhite() must be tuned for your images.
			if (this.isWhite(img.getRGB(x, y)) == 1) {
				img.setRGB(x, y, Color.WHITE.getRGB());
			} else {
				img.setRGB(x, y, Color.BLACK.getRGB());
			}
		}
	}

	// Two denoising passes, then crop the 1-pixel border.
	img = this.removeNoise(this.removeNoise(img));
	img = img.getSubimage(1, 1, img.getWidth() - 2, img.getHeight() - 2);
	return img;
}

private BufferedImage removeNoise(BufferedImage image) {
	int width = image.getWidth();
	int height = image.getHeight();

	for (int x = 1; x < width - 1; ++x) {
		for (int y = 1; y < height - 1; ++y) {
			// Count the white pixels among the four direct neighbours.
			int whiteNeighbours = this.isWhite(image.getRGB(x - 1, y))
					+ this.isWhite(image.getRGB(x + 1, y))
					+ this.isWhite(image.getRGB(x, y - 1))
					+ this.isWhite(image.getRGB(x, y + 1));
			// A dark pixel with at least 3 white neighbours is treated as noise.
			if (this.getColorBright(image.getRGB(x, y)) < 100 && whiteNeighbours >= 3) {
				image.setRGB(x, y, Color.WHITE.getRGB());
			}
		}
	}

	return image;
}

private int getColorBright(int colorInt) {
	Color color = new Color(colorInt);
	return color.getRed() + color.getGreen() + color.getBlue();
}

private int isWhite(int colorInt) {
	Color color = new Color(colorInt);
	// Threshold: tune it for your images.
	return color.getRed() + color.getGreen() + color.getBlue() > 440 ? 1 : 0;
}

private int isBlack(int colorInt) {
	Color color = new Color(colorInt);
	// Threshold: tune it for your images.
	return color.getRed() + color.getGreen() + color.getBlue() <= 440 ? 1 : 0;
}

At this point the code produces the denoised image.
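
To make the effect of the two denoising rules easier to see, here is a minimal sketch on a tiny synthetic image. The class name DenoiseDemo, the 7x7 test image and the helper names are assumptions made for illustration; only the 440 and 100 thresholds and the "at least 3 white neighbours" rule come from the code above.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

class DenoiseDemo {
    // Sum of the red, green and blue channels (0..765).
    static int brightness(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue();
    }

    // Same test as isWhite() in the article; 440 is the article's threshold.
    static boolean isWhite(int rgb) {
        return brightness(rgb) > 440;
    }

    // Binarize in place: bright pixels become white, all others black.
    static void binarize(BufferedImage img) {
        for (int x = 0; x < img.getWidth(); ++x) {
            for (int y = 0; y < img.getHeight(); ++y) {
                Color c = isWhite(img.getRGB(x, y)) ? Color.WHITE : Color.BLACK;
                img.setRGB(x, y, c.getRGB());
            }
        }
    }

    // Turn a dark pixel white when at least 3 of its 4 direct neighbours are white.
    static void removeNoise(BufferedImage img) {
        for (int x = 1; x < img.getWidth() - 1; ++x) {
            for (int y = 1; y < img.getHeight() - 1; ++y) {
                int white = 0;
                if (isWhite(img.getRGB(x - 1, y))) white++;
                if (isWhite(img.getRGB(x + 1, y))) white++;
                if (isWhite(img.getRGB(x, y - 1))) white++;
                if (isWhite(img.getRGB(x, y + 1))) white++;
                if (brightness(img.getRGB(x, y)) < 100 && white >= 3) {
                    img.setRGB(x, y, Color.WHITE.getRGB());
                }
            }
        }
    }

    static int countBlack(BufferedImage img) {
        int n = 0;
        for (int x = 0; x < img.getWidth(); ++x) {
            for (int y = 0; y < img.getHeight(); ++y) {
                if (!isWhite(img.getRGB(x, y))) n++;
            }
        }
        return n;
    }

    // Hypothetical test image: light-gray background, a 3x3 dark block
    // (standing in for a character stroke) and one isolated dark speck.
    static BufferedImage sample() {
        BufferedImage img = new BufferedImage(7, 7, BufferedImage.TYPE_INT_RGB);
        int light = new Color(230, 230, 230).getRGB();
        int dark = new Color(40, 40, 40).getRGB();
        for (int x = 0; x < 7; ++x) {
            for (int y = 0; y < 7; ++y) {
                img.setRGB(x, y, light);
            }
        }
        for (int x = 1; x <= 3; ++x) {
            for (int y = 1; y <= 3; ++y) {
                img.setRGB(x, y, dark);
            }
        }
        img.setRGB(5, 5, dark);
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = sample();
        binarize(img);
        System.out.println("black pixels after binarization: " + countBlack(img));
        removeNoise(img);
        System.out.println("black pixels after denoising: " + countBlack(img));
    }
}
```

The isolated speck disappears while the solid block survives, because every pixel of the block has at most two white neighbours.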

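Next comes vertical segmentation.

2.2 Vertical and horizontal segmentation

The code for vertical and horizontal segmentation is as follows: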

private List<BufferedImage> splitImage(BufferedImage img) throws Exception {
	int i;
	int length;
	int width = img.getWidth();
	int height = img.getHeight();
	List<BufferedImage> subImgs = new ArrayList<>();
	// weightList.get(x) = number of black pixels in column x (vertical projection).
	List<Integer> weightList = new ArrayList<>();

	for (i = 0; i < width; ++i) {
		length = 0;
		for (int y = 0; y < height; ++y) {
			if (this.isBlack(img.getRGB(i, y)) == 1) {
				++length;
			}
		}
		weightList.add(length);
	}

	// Each run of non-empty columns wider than 2 pixels is one character.
	for (i = 0; i < weightList.size(); ++i) {
		for (length = 0; i < weightList.size() && weightList.get(i) > 0; ++length) {
			++i;
		}
		if (length > 2) {
			subImgs.add(this.removeBlank(img.getSubimage(i - length, 0, length, height)));
		}
	}
	return subImgs;
}

private BufferedImage removeBlank(BufferedImage img) throws Exception {
	int width = img.getWidth();
	int height = img.getHeight();
	int start = 0;
	int end = 0;
	int y;
	int x;
	// Scan from the top for the first row that contains a black pixel.
	findTop:
	for (y = 0; y < height; ++y) {
		for (x = 0; x < width; ++x) {
			if (this.isBlack(img.getRGB(x, y)) == 1) {
				start = y;
				break findTop;
			}
		}
	}

	// Scan from the bottom for the last such row, then crop to rows start..end.
	for (y = height - 1; y >= 0; --y) {
		for (x = 0; x < width; ++x) {
			if (this.isBlack(img.getRGB(x, y)) == 1) {
				end = y;
				return img.getSubimage(0, start, width, end - start + 1);
			}
		}
	}
	return img.getSubimage(0, start, width, end - start + 1);
}
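
The projection idea behind splitImage and removeBlank can be tried without a real CAPTCHA. The sketch below is an illustration under assumptions: the class ProjectionSplitDemo, the fromRows helper that draws an image from text rows, and the sample rows are all made up; the "wider than 2 columns" rule and the 440 threshold follow the code above.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

class ProjectionSplitDemo {
    static boolean isBlack(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue() <= 440;
    }

    // Hypothetical helper: build a binary image from text rows ('#' = black).
    static BufferedImage fromRows(String... rows) {
        BufferedImage img = new BufferedImage(rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; ++y) {
            for (int x = 0; x < rows[y].length(); ++x) {
                Color c = rows[y].charAt(x) == '#' ? Color.BLACK : Color.WHITE;
                img.setRGB(x, y, c.getRGB());
            }
        }
        return img;
    }

    static boolean columnHasBlack(BufferedImage img, int x) {
        for (int y = 0; y < img.getHeight(); ++y) {
            if (isBlack(img.getRGB(x, y))) return true;
        }
        return false;
    }

    static boolean rowHasBlack(BufferedImage img, int y) {
        for (int x = 0; x < img.getWidth(); ++x) {
            if (isBlack(img.getRGB(x, y))) return true;
        }
        return false;
    }

    // Horizontal step: crop away blank rows above and below the glyph.
    static BufferedImage trimRows(BufferedImage img) {
        int top = -1;
        int bottom = -1;
        for (int y = 0; y < img.getHeight(); ++y) {
            if (rowHasBlack(img, y)) {
                if (top < 0) top = y;
                bottom = y;
            }
        }
        if (top < 0) return img; // no black pixel at all, nothing to trim
        return img.getSubimage(0, top, img.getWidth(), bottom - top + 1);
    }

    // Vertical step: every run of non-empty columns wider than minWidth is one glyph.
    static List<BufferedImage> split(BufferedImage img, int minWidth) {
        List<BufferedImage> glyphs = new ArrayList<>();
        int start = -1;
        for (int x = 0; x <= img.getWidth(); ++x) {
            boolean ink = x < img.getWidth() && columnHasBlack(img, x);
            if (ink && start < 0) {
                start = x;
            } else if (!ink && start >= 0) {
                if (x - start > minWidth) {
                    glyphs.add(trimRows(img.getSubimage(start, 0, x - start, img.getHeight())));
                }
                start = -1;
            }
        }
        return glyphs;
    }

    public static void main(String[] args) {
        BufferedImage img = fromRows(
                "............",
                ".###..###...",
                ".#.#..#....#",
                ".###..###...",
                "............");
        for (BufferedImage g : split(img, 2)) {
            System.out.println("glyph " + g.getWidth() + "x" + g.getHeight());
        }
    }
}
```

Note that the width filter also discards the single-column mark at the right edge, so a very thin character would be dropped in the same way; the minimum width may need tuning for other fonts.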

Save the four segmented images separately as samples for later recognition.


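2.3 Character labeling

Rename the image temp0 to 9_0, and likewise temp1 to -_0, temp2 to 5_0 and temp3 to =_0. The first character of each file name is the label that the loading code reads back later.

These steps complete one round of sample training. Repeat them to accumulate more samples, and create a folder named num to hold the digit images.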


Likewise, create a folder named operate to hold the plus and minus operator images.
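
Renaming by hand gets tedious once the samples pile up, so a small helper can do the filing. This is only a sketch under assumptions: the class SampleLabeler, its method names, the .jpg extension and the rule "digits go to num, everything else to operate" are illustrative choices, not part of the original project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SampleLabeler {
    // Moves a segmented glyph into <root>/num or <root>/operate as "<label>_<n>.jpg",
    // using the first index n that is not taken yet.
    static Path label(Path glyph, char label, Path root) throws IOException {
        String folder = Character.isDigit(label) ? "num" : "operate";
        Path dir = Files.createDirectories(root.resolve(folder));
        int n = 0;
        Path target;
        do {
            target = dir.resolve(label + "_" + n + ".jpg");
            n++;
        } while (Files.exists(target));
        return Files.move(glyph, target);
    }

    // Reads the label back the same way the loading code does:
    // it is the first character of the file name.
    static char labelOf(Path sample) {
        return sample.getFileName().toString().charAt(0);
    }

    public static void main(String[] args) throws IOException {
        // Demo on an empty placeholder file in a temporary directory.
        Path root = Files.createTempDirectory("captcha-samples");
        Path glyph = Files.createFile(root.resolve("temp0.jpg"));
        Path stored = label(glyph, '9', root);
        System.out.println(stored + " -> label " + labelOf(stored));
    }
}
```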

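Recognition then simply segments the image to be recognized into four characters with the same pipeline and compares each character against every labeled sample file, pixel by pixel at the same positions. The sample with the fewest mismatched pixels is the match, and the first character of its file name is the recognized character.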
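
The comparison described above can be sketched on hand-drawn 3x5 glyphs. The class TemplateMatchDemo, the fromRows helper and the two sample shapes are assumptions for illustration; the matching rule (count positions where white/black disagree, keep the sample with the lowest count) follows the description above, with the size filter simplified to "the sample must be at least as large as the glyph".

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

class TemplateMatchDemo {
    static boolean isWhite(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue() > 440;
    }

    // Hypothetical helper: build a binary image from text rows ('#' = black).
    static BufferedImage fromRows(String... rows) {
        BufferedImage img = new BufferedImage(rows[0].length(), rows.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows.length; ++y) {
            for (int x = 0; x < rows[y].length(); ++x) {
                Color c = rows[y].charAt(x) == '#' ? Color.BLACK : Color.WHITE;
                img.setRGB(x, y, c.getRGB());
            }
        }
        return img;
    }

    // Number of positions where glyph and sample disagree on white vs. black.
    static int mismatches(BufferedImage glyph, BufferedImage sample) {
        int count = 0;
        for (int x = 0; x < glyph.getWidth(); ++x) {
            for (int y = 0; y < glyph.getHeight(); ++y) {
                if (isWhite(glyph.getRGB(x, y)) != isWhite(sample.getRGB(x, y))) {
                    count++;
                }
            }
        }
        return count;
    }

    // Returns the label of the closest sample, or null when no sample is large enough.
    static String classify(BufferedImage glyph, Map<BufferedImage, String> samples) {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<BufferedImage, String> e : samples.entrySet()) {
            BufferedImage s = e.getKey();
            if (s.getWidth() < glyph.getWidth() || s.getHeight() < glyph.getHeight()) {
                continue; // comparing would read outside the sample
            }
            int count = mismatches(glyph, s);
            if (count < bestCount) {
                bestCount = count;
                best = e.getValue();
            }
        }
        return best;
    }

    // Two made-up samples standing in for files such as 1_0.jpg and 7_0.jpg.
    static Map<BufferedImage, String> demoSamples() {
        Map<BufferedImage, String> samples = new LinkedHashMap<>();
        samples.put(fromRows(".#.", "##.", ".#.", ".#.", "###"), "1");
        samples.put(fromRows("###", "..#", ".#.", ".#.", ".#."), "7");
        return samples;
    }

    public static void main(String[] args) {
        // A "7" with one flipped pixel in the fourth row.
        BufferedImage noisy = fromRows("###", "..#", ".#.", "##.", ".#.");
        System.out.println("recognized: " + classify(noisy, demoSamples()));
    }
}
```

Because the closest sample wins rather than an exact match, a few stray pixels left over from denoising do not break the lookup.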

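Since the goal is to recognize the CAPTCHA and compute the result of the expression, only the first three elements need to be recognized; training the equals sign adds little value.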
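
Once the characters are recognized, computing the answer is a one-liner per operator. A minimal sketch, assuming the recognized characters arrive as a list of one-character strings (the class CaptchaArithmetic and the method evaluate are hypothetical names):

```java
import java.util.Arrays;
import java.util.List;

class CaptchaArithmetic {
    // Evaluates <digit> <operator> <digit>; anything after the third element
    // (such as the trailing "=") is ignored.
    static int evaluate(List<String> chars) {
        if (chars.size() < 3) {
            throw new IllegalArgumentException("need digit, operator, digit but got " + chars);
        }
        int a = Integer.parseInt(chars.get(0));
        String op = chars.get(1);
        int b = Integer.parseInt(chars.get(2));
        switch (op) {
            case "+":
                return a + b;
            case "-":
                return a - b;
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(Arrays.asList("9", "-", "5", "=")));
    }
}
```

This sketch rejects an unrecognized operator instead of guessing, so the caller can request a fresh CAPTCHA rather than submit a wrong answer.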

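So that this program can be called as an interface, the training samples and the recognition code must be imported into the crawler project. There are two ways: package the sample files and the code as a jar, publish it to a private Maven repository and import it into the crawler project; or copy the sample files directly into the project. For ease of demonstration this article uses the second way: create a config directory in the project and put the num and operate folders into it.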


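The training samples must be loaded before recognition. The loading code is as follows (the type parameter selects whether operator or digit samples are loaded). Note that it resolves the folder as config/ocr/ipbankspider/<type> under the working directory, so either place num and operate there or adjust the path: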

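private Map<BufferedImage, String> loadTrainData(String type) throws IOException {
    Map<BufferedImage, String> map = new HashMap<>(64);
    File dir = new File(System.getProperty("user.dir") + pathSeparator + "config" + pathSeparator + "ocr" + pathSeparator + "ipbankspider" + pathSeparator + type);
    File[] files = dir.listFiles();
    if (files == null) {
        // listFiles() returns null when the directory is missing or unreadable.
        logger.error("Wrong sample directory for the numeric CAPTCHA!");
        throw new IOException("Sample directory not found: " + dir);
    }

    for (File pic : files) {
        BufferedImage sample = ImageIO.read(pic);
        // ImageIO.read returns null for files it cannot decode as images.
        if (sample != null) {
            // The first character of the file name is the label, e.g. 9_0 -> "9".
            map.put(sample, String.valueOf(pic.getName().charAt(0)));
        }
    }

    return map;
}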

The complete interface is as follows:

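import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import javax.imageio.ImageIO;

public class IpBankImageProcess {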
    private static final String ADD = "+";
    private String pathSeparator; 

    public IpBankImageProcess() {
    	pathSeparator = this.getSeparator();
    }

    private int isWhite(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue() > 440 ? 1 : 0;
    }

    private int isBlack(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue() <= 440 ? 1 : 0;
    }

    private BufferedImage removeBackground(BufferedImage pic, String path) throws IOException {
        BufferedImage img = pic;
        int width = pic.getWidth();
        int height = pic.getHeight();

        // Binarize: bright pixels become white, everything else black.
        for (int x = 0; x < width; ++x) {
            for (int y = 0; y < height; ++y) {
                if (this.isWhite(img.getRGB(x, y)) == 1) {
                    img.setRGB(x, y, Color.WHITE.getRGB());
                } else {
                    img.setRGB(x, y, Color.BLACK.getRGB());
                }
            }
        }

        // Two denoising passes, then crop the 1-pixel border and save the intermediate result.
        img = this.removeNoise(this.removeNoise(img));
        img = img.getSubimage(1, 1, img.getWidth() - 2, img.getHeight() - 2);
        storeProcessedPic(img, path + "removeBackground" + pathSeparator + "pic.jpg");

        return img;
    }

    private BufferedImage removeNoise(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();

        for(int x = 1; x < width - 1; ++x) {
            for(int y = 1; y < height - 1; ++y) {
                if (this.getColorBright(image.getRGB(x, y)) < 100 && this.isWhite(image.getRGB(x - 1, y)) + this.isWhite(image.getRGB(x + 1, y)) + this.isWhite(image.getRGB(x, y - 1)) + this.isWhite(image.getRGB(x, y + 1)) >= 3) {
                    image.setRGB(x, y, Color.WHITE.getRGB());
                }
            }
        }

        return image;
    }

    private int getColorBright(int colorInt) {
        Color color = new Color(colorInt);
        return color.getRed() + color.getGreen() + color.getBlue();
    }

    private int isBlackOrWhite(int colorInt) {
        return this.getColorBright(colorInt) >= 30 && this.getColorBright(colorInt) <= 700 ? 0 : 1;
    }

    private BufferedImage removeBlank(BufferedImage img) throws IOException {
        int width = img.getWidth();
        int height = img.getHeight();
        int start = 0;
        int end = 0;

        int y;
        int x;
        label42:
        for(y = 0; y < height; ++y) {
            for(x = 0; x < width; ++x) {
                if (this.isBlack(img.getRGB(x, y)) == 1) {
                    start = y;
                    break label42;
                }
            }
        }

        for(y = height - 1; y >= 0; --y) {
            for(x = 0; x < width; ++x) {
                if (this.isBlack(img.getRGB(x, y)) == 1) {
                    end = y;
                    return img.getSubimage(0, start, width, end - start + 1);
                }
            }
        }

        return img.getSubimage(0, start, width, end - start + 1);
    }

    private List<BufferedImage> splitImage(BufferedImage img, String path) throws IOException {
        List<BufferedImage> subImgs = new ArrayList<>();
        int width = img.getWidth();
        int height = img.getHeight();
        // weightList.get(x) = number of black pixels in column x (vertical projection).
        List<Integer> weightList = new ArrayList<>();

        int i;
        int length;
        for (i = 0; i < width; ++i) {
            length = 0;

            for (int y = 0; y < height; ++y) {
                if (this.isBlack(img.getRGB(i, y)) == 1) {
                    ++length;
                }
            }

            weightList.add(length);
        }

        // Each run of non-empty columns wider than 2 pixels is one character.
        for (i = 0; i < weightList.size(); ++i) {
            for (length = 0; i < weightList.size() && weightList.get(i) > 0; ++length) {
                ++i;
            }

            if (length > 2) {
                subImgs.add(this.removeBlank(img.getSubimage(i - length, 0, length, height)));
            }
        }
        // Save each character as temp<j>.jpg so it can be labeled later.
        for (int j = 0; j < subImgs.size(); j++) {
            storeProcessedPic(subImgs.get(j), path + "splitImage" + pathSeparator + "temp" + j + ".jpg");
        }
        return subImgs;
    }

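    private Map<BufferedImage, String> loadTrainData(String type) throws IOException {
        Map<BufferedImage, String> map = new HashMap<>(64);
        File dir = new File(System.getProperty("user.dir") + pathSeparator + "config" + pathSeparator + "ocr" + pathSeparator + "ipbankspider" + pathSeparator + type);
        File[] files = dir.listFiles();
        if (files == null) {
            // listFiles() returns null when the directory is missing or unreadable.
            throw new IOException("Sample directory for the numeric CAPTCHA not found: " + dir);
        }

        for (File pic : files) {
            BufferedImage sample = ImageIO.read(pic);
            // ImageIO.read returns null for files it cannot decode as images.
            if (sample != null) {
                // The first character of the file name is the label, e.g. 9_0 -> "9".
                map.put(sample, String.valueOf(pic.getName().charAt(0)));
            }
        }

        return map;
    }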

    private String getSingleCharOcr(BufferedImage img, Map<BufferedImage, String> map) {
        String result = "";
        int width = img.getWidth();
        int height = img.getHeight();
        int min = width * height; // lowest mismatch count found so far
        int deviation = 1;        // allowed size difference in pixels

        for (Map.Entry<BufferedImage, String> entry : map.entrySet()) {
            BufferedImage bi = entry.getKey();
            int sampleWidth = bi.getWidth();
            int sampleHeight = bi.getHeight();
            // Only compare samples at least as large as the character
            // and at most `deviation` pixels larger in each dimension.
            if (sampleWidth < width || sampleWidth > width + deviation
                    || sampleHeight < height || sampleHeight > height + deviation) {
                continue;
            }

            // Count the pixels whose white/black state differs; stop early
            // once this sample cannot beat the current best.
            int count = 0;
            compare:
            for (int x = 0; x < width; ++x) {
                for (int y = 0; y < height; ++y) {
                    if (this.isWhite(img.getRGB(x, y)) != this.isWhite(bi.getRGB(x, y))) {
                        ++count;
                        if (count >= min) {
                            break compare;
                        }
                    }
                }
            }

            if (count < min) {
                min = count;
                result = entry.getValue();
            }
        }

        if ("".equals(result)) {
            // No sample matched (recognition failure): fall back to a random digit,
            // which is a guess rather than a recognition result.
            result = String.valueOf(new Random().nextInt(10));
        }
        return result;
    }

    public int getResult(BufferedImage image, String path) throws IOException {
        BufferedImage img = this.removeBackground(image, path);
        List listImg = this.splitImage(img, path);
        if (listImg.size() < 3) {
            throw new IOException("Expected at least 3 characters (digit, operator, digit) but found " + listImg.size());
        }
        Map operateMap = this.loadTrainData("operate");
        Map numMap = this.loadTrainData("num");

        // Layout is <digit> <operator> <digit> '='; the trailing '=' is ignored.
        int a = Integer.parseInt(this.getSingleCharOcr((BufferedImage)listImg.get(0), numMap));
        String operatorSymbol = this.getSingleCharOcr((BufferedImage)listImg.get(1), operateMap);
        int b = Integer.parseInt(this.getSingleCharOcr((BufferedImage)listImg.get(2), numMap));

        // Anything other than "+" is treated as subtraction.
        if (ADD.equals(operatorSymbol)) {
            return a + b;
        } else {
            return a - b;
        }
    }

    private String getSeparator() {
        // File.separator is "\" on Windows and "/" elsewhere.
        return File.separator;
    }
    
    private void storeProcessedPic(BufferedImage img, String storePath) throws IOException {
        File file = new File(storePath);
        File dir = file.getParentFile();
        if (dir != null && !dir.exists()) {
            dir.mkdirs();
        }
        // ImageIO.write creates the file, or overwrites it if it already exists.
        ImageIO.write(img, "jpg", file);
    }
}

Test call:

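import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class OcrTest {

    private static final String BASE_PATH = System.getProperty("user.dir");

    public static void main(String[] args) {
        IpBankImageProcess processor = new IpBankImageProcess();
        try {
            // Change this to the path of the image to recognize.
            BufferedImage img = ImageIO.read(new File(BASE_PATH + File.separator + "test" + File.separator + "yourpic.jpg"));
            // Intermediate images are written under this directory; the trailing separator is required.
            int result = processor.getResult(img, BASE_PATH + File.separator + "data" + File.separator);
            System.out.println("OCR result: \n" + result);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}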

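That yields the recognition result. The code above can be copied locally and used after minor changes, and the recognition rate is close to 100%.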

 

Reference blogs

https://blog.csdn.net/problc/article/details/5794460

https://blog.csdn.net/zhulier1124/article/details/80606647

https://www.cnblogs.com/nayitian/p/3282862.html
