[Image Recognition] Recognizing image CAPTCHAs without tess4j

Once again, this is a follow-up to a crawler problem.

I'll use my crawl of http://www.digifilm.com.cn/index.php/index/index.html as the example.

Downloading files requires logging in, and logging in requires a CAPTCHA.

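First, you need to understand how this login works. The server randomly generates an image of a 4-digit number with interference lines and stores the corresponding digits in the session. At verification time it compares the digits you submitted with the code stored in the session; if they match, the CAPTCHA is accepted.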

So, the key point: you need the session.

Getting a session in a crawler is pretty basic: request the page with Jsoup.

Connection.Response resultImageResponse = Jsoup.connect("http://www.digifilm.com.cn/index.php/public/checklogin.html").ignoreContentType(true).execute();
Map<String, String> cookies = resultImageResponse.cookies();

Then remember to send these cookies with every other request; that is what ties those requests to the same session.

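Next step: to recognize an image you first need the image. When downloading it, be sure to send the cookies you just obtained; otherwise the image has no connection to your session, and the code you read from it will never validate. It is like opening the page in Firefox, copying the image URL and pasting it into Chrome, refreshing there to get a new CAPTCHA, and then typing that code into Firefox: it fails with a CAPTCHA error every single time.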
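To make the session binding concrete, here is a minimal in-memory sketch of what the server presumably does. The class `CaptchaSessionDemo`, its methods, and the session ids are all hypothetical names made up for illustration; the real site's implementation is not known.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a session-bound CAPTCHA check (not the real site's code).
class CaptchaSessionDemo {
    // session id -> code drawn into the most recent CAPTCHA image for that session
    private final Map<String, String> codeBySession = new HashMap<>();

    // Called when a client holding the given session cookie requests the CAPTCHA image.
    void issueCaptcha(String sessionId, String code) {
        codeBySession.put(sessionId, code);
    }

    // Called at login: the typed code must match the one stored for the same session.
    boolean verify(String sessionId, String typed) {
        return typed != null && typed.equals(codeBySession.get(sessionId));
    }

    public static void main(String[] args) {
        CaptchaSessionDemo server = new CaptchaSessionDemo();
        server.issueCaptcha("firefox-session", "1234"); // page opened in Firefox
        server.issueCaptcha("chrome-session", "5678");  // image opened separately in Chrome

        // Typing Chrome's code into the Firefox session fails; the matching session succeeds.
        System.out.println(server.verify("firefox-session", "5678")); // false
        System.out.println(server.verify("chrome-session", "5678"));  // true
    }
}
```

The point of the sketch is only that the code is looked up by session, so an image fetched without the session cookies belongs to a different entry.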

Here is download code you can use as-is. You could also change it to take the download path as a parameter; that part doesn't really matter.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public static String downloadImg(String url, Map<String, String> cookies) throws IOException
	{
		Connection connect = Jsoup.connect(url);
		connect.cookies(cookies); // send the session cookies when fetching the image
		connect.timeout(5 * 10000); // 50 seconds, in milliseconds
		Connection.Response response = connect.ignoreContentType(true).execute();
		byte[] img = response.bodyAsBytes();
		// Directory where the file is stored
		String directory = "f://test1//";
		saveImage(img, directory, "yzm.png");
		return directory + "yzm.png";
	}
public static void saveImage(byte[] img, String filePath, String fileName)
	{
		File dir = new File(filePath);
		// Create the directory if it does not exist
		if (!dir.exists()) {
			dir.mkdirs();
		}
		File file = new File(dir, fileName);
		// try-with-resources closes the stream even if the write fails
		try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file))) {
			bos.write(img);
//			System.out.println("CAPTCHA downloaded to: " + filePath);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

Of course, you don't have to go to this much trouble; you can also write the file straight from the Jsoup response.

Connection.Response resultImageResponse = Jsoup.connect(url).cookies(cookies).ignoreContentType(true).execute();
// try-with-resources closes the stream even if the write fails
try (FileOutputStream out = new FileOutputStream(new java.io.File("f://test//yzm.png"))) {
	out.write(resultImageResponse.bodyAsBytes());
}

With the image in hand, we get to the real point: image recognition.

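I had never worked with image recognition before, so the first thing I did was search Baidu. Most of the results were about Tess4j, with a few about OpenCV. Both are simple to use. If you are on Windows, just look those up and use them: pull in a ready-made jar and recognition works right away, very reliably.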

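On Linux, though, your next search will be how to install tess4j on Linux... emmmm. I can recommend one link that looked reliable to me, although I never managed to get it installed myself: http://www.cnblogs.com/dajianshi/p/4932882.html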

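So when that whole pile of tess4j dependencies won't install on Linux, things get awkward. If you don't want your boss to come away thinking "you spent a week just to tell me it can't be done, you're useless", you have to look into how image recognition actually works.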

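The first step in image recognition is reading the image and converting it into a 2-D matrix, in other words a black-and-white image: the part to be recognized becomes black and the background white. Then the image is split into pieces. How to split depends on the image; printing a few matrices and studying them is enough. If the characters sit at irregular positions, you have to write your own segmentation algorithm.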
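As a rough illustration of the "image to 2-D matrix" step, here is a minimal sketch. The class `BinaryMatrixDemo` and the fixed cutoff of 128 on the channel average are assumptions chosen for brevity; they are not what the full code in this post does, which brightens the image and picks the threshold automatically.

```java
import java.awt.image.BufferedImage;

// Minimal sketch: turn an image into a boolean matrix (true = dark pixel).
class BinaryMatrixDemo {
    // Assumption for illustration: a pixel is "dark" if the average of R, G, B is below 128.
    static boolean[][] toMatrix(BufferedImage img) {
        boolean[][] m = new boolean[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int avg = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                m[y][x] = avg < 128;
            }
        }
        return m;
    }

    // Render the matrix as text, one row per line, '*' for dark and ' ' for light.
    static String render(boolean[][] m) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : m) {
            for (boolean dark : row) {
                sb.append(dark ? '*' : ' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build a tiny 4x3 white image with two black pixels in the middle row.
        BufferedImage img = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 3; y++) {
            for (int x = 0; x < 4; x++) {
                img.setRGB(x, y, 0xFFFFFF);
            }
        }
        img.setRGB(1, 1, 0x000000);
        img.setRGB(2, 1, 0x000000);
        System.out.print(render(toMatrix(img)));
    }
}
```

Printing the matrix like this for a handful of downloaded CAPTCHAs is what makes the column layout visible.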

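Again taking my site as the example: the downloaded CAPTCHA is 22*50. Ignore the height; the width of 50 breaks down as 5+8+2+8+2+8+2+8+7. The leading 5 and trailing 7 are background border and can be ignored, each 2 is a separator that can also be ignored, and each 8 is the actual width of one digit matrix.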
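The column arithmetic can be checked with a few lines. This is only a sketch; `DigitColumns` and `starts` are hypothetical names, and the margin, width, and gap values are the ones stated for this particular site.

```java
import java.util.Arrays;

// Sketch: compute the first column of each digit from the layout 5+8+2+8+2+8+2+8+7.
class DigitColumns {
    // Start column of digit i = left margin + i * (digit width + gap).
    static int[] starts(int leftMargin, int digitWidth, int gap, int count) {
        int[] s = new int[count];
        for (int i = 0; i < count; i++) {
            s[i] = leftMargin + i * (digitWidth + gap);
        }
        return s;
    }

    public static void main(String[] args) {
        // For this site: margin 5, digit width 8, gap 2, four digits.
        System.out.println(Arrays.toString(starts(5, 8, 2, 4))); // [5, 15, 25, 35]
    }
}
```

The last digit therefore covers columns 35 to 42, which leaves exactly the 7 trailing columns of a 50-pixel-wide image.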

Printing the digit matrices and checking them against each of 0-9 shows that the digits aren't even slanted. Ohoho.

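The next idea was to crop each of the four 22*8 digits down to 10*8 and compare it with the standard digit matrices; the most similar one is the recognition result. There are many ways to measure similarity (I saw cosine-something mentioned online). I used a simpler approach: compare the candidate matrix with each standard matrix cell by cell, add 1 to a counter for every cell that matches, divide the count by 80 at the end, and take the digit with the highest score as the result.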
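The cell-counting score is easy to try on a toy case. The sketch below uses made-up 3*3 "templates" instead of the real 10*8 ones, and the names `MatchScoreDemo`, `similarity`, and `best` are hypothetical.

```java
// Sketch of the cell-by-cell similarity score, using toy 3x3 templates.
class MatchScoreDemo {
    // Fraction of cells in which two equally sized matrices agree.
    static double similarity(String[] a, String[] b) {
        int same = 0;
        int total = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length(); x++) {
                if (a[y].charAt(x) == b[y].charAt(x)) {
                    same++;
                }
                total++;
            }
        }
        return (double) same / total;
    }

    // Index of the template with the highest similarity to the sample.
    static int best(String[] sample, String[][] templates) {
        int bestIndex = 0;
        double bestScore = -1;
        for (int i = 0; i < templates.length; i++) {
            double score = similarity(sample, templates[i]);
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        String[][] templates = {
            {"***", "* *", "***"}, // toy "0"
            {" * ", " * ", " * "}, // toy "1"
        };
        String[] noisy = {"***", "* *", "** "}; // toy "0" with one cell flipped
        System.out.println(best(noisy, templates)); // 0
    }
}
```

Here the noisy sample agrees with the toy "0" in 8 of 9 cells and with the toy "1" in 3 of 9, so a single flipped pixel does not change the result.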

The code, tidied up, is as follows:

Standard matrices

public static final String[][] ziro = new String[][] 
	{
		{" "," "," ","*","*"," "," "," "},
		{" "," ","*","*","*","*"," "," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
		{" "," "," ","*","*"," "," "," "},
	};
	public static final String[][] one = new String[][] 
	{
		{" "," "," ","*","*"," "," "," "},
		{" "," ","*","*","*"," "," "," "},
		{" ","*","*","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" ","*","*","*","*","*","*"," "},
	};
	public static final String[][] two = new String[][] 
	{
		{" "," ","*","*","*","*"," "," "},
		{"*","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," ","*","*"," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," ","*","*"," "," "," "," "},
		{" ","*","*"," "," "," "," "," "},
		{"*","*","*","*","*","*","*","*"},
	};
	public static final String[][] three = new String[][] 
	{
		{" ","*","*","*","*","*"," "," "},
		{"*","*"," "," "," ","*","*"," "},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," ","*","*","*"," "," "},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{"*","*"," "," "," ","*","*"," "},
		{" ","*","*","*","*","*"," "," "},
	};
	public static final String[][] four = new String[][] 
	{
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," ","*","*","*"," "},
		{" "," "," ","*","*","*","*"," "},
		{" "," ","*","*"," ","*","*"," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," ","*","*"," "},
		{"*","*","*","*","*","*","*","*"},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," "," ","*","*"," "},
	};
	public static final String[][] five = new String[][] 
	{
		{"*","*","*","*","*","*","*"," "},
		{"*","*"," "," "," "," "," "," "},
		{"*","*"," "," "," "," "," "," "},
		{"*","*"," ","*","*","*"," "," "},
		{"*","*","*"," "," ","*","*"," "},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
	};
	public static final String[][] six = new String[][] 
	{
		{" "," ","*","*","*","*"," "," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*"," "},
		{"*","*"," "," "," "," "," "," "},
		{"*","*"," ","*","*","*"," "," "},
		{"*","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
	};
	public static final String[][] seven = new String[][] 
	{
		{"*","*","*","*","*","*","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{" "," "," "," "," ","*","*"," "},
		{" "," "," "," ","*","*"," "," "},
		{" "," "," ","*","*"," "," "," "},
		{" "," ","*","*"," "," "," "," "},
		{" ","*","*"," "," "," "," "," "},
		{"*","*"," "," "," "," "," "," "},
		{"*","*"," "," "," "," "," "," "},
	};	
	public static final String[][] eight = new String[][] 
	{
		{" "," ","*","*","*","*"," "," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
	};
	public static final String[][] nine = new String[][] 
	{
		{" "," ","*","*","*","*"," "," "},
		{" ","*","*"," "," ","*","*"," "},
		{"*","*"," "," "," "," ","*","*"},
		{"*","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*","*"},
		{" "," ","*","*","*"," ","*","*"},
		{" "," "," "," "," "," ","*","*"},
		{" ","*"," "," "," "," ","*","*"},
		{" ","*","*"," "," ","*","*"," "},
		{" "," ","*","*","*","*"," "," "},
	};
	public static final String[][][] nums = new String[][][] {ziro,one,two,three,four,five,six,seven,eight,nine};

Converting the image to a 2-D matrix

public static String cleanImage(File sfile)throws IOException
	{
		BufferedImage bufferedImage = ImageIO.read(sfile);
		int h = bufferedImage.getHeight();
		int w = bufferedImage.getWidth();
	
		// Convert to grayscale
		int[][] gray = new int[w][h];
		for (int x = 0; x < w; x++)
		{
			for (int y = 0; y < h; y++)
			{
				int argb = bufferedImage.getRGB(x, y);
				// Brighten the image (adjusting the brightness gives a very high recognition rate)
				int r = (int) (((argb >> 16) & 0xFF) * 1.1 + 30);
				int g = (int) (((argb >> 8) & 0xFF) * 1.1 + 30);
				int b = (int) (((argb >> 0) & 0xFF) * 1.1 + 30);
				if (r >= 255)
				{
					r = 255;
				}
				if (g >= 255)
				{
					g = 255;
				}
				if (b >= 255)
				{
					b = 255;
				}
				gray[x][y] = (int) Math.pow((Math.pow(r, 2.2) * 0.2973 + Math.pow(g, 2.2)* 0.6274 + Math.pow(b, 2.2) * 0.0753), 1 / 2.2);
			}
		}
		// Binarize
		int threshold = ostu(gray, w, h);
		BufferedImage binaryBufferedImage = new BufferedImage(w, h,BufferedImage.TYPE_BYTE_BINARY);
		for (int x = 0; x < w; x++)
		{
			for (int y = 0; y < h; y++)
			{
				// Store pure white or pure black so the binary image is unambiguous
				if (gray[x][y] > threshold)
				{
					gray[x][y] = 0xFFFFFF;
				} else
				{
					gray[x][y] = 0x000000;
				}
				binaryBufferedImage.setRGB(x, y, gray[x][y]);
			}
		}
		// Print the matrix
		for (int y = 0; y < h; y++)
		{
			for (int x = 0; x < w; x++)
			{
				if (isBlack(binaryBufferedImage.getRGB(x, y)))
				{
					System.out.print("*");
				} else
				{
					System.out.print(" ");
				}
			}
			System.out.println("");
		}
		return getNum(binaryBufferedImage,h,w);
	}
	
	
	public static boolean isBlack(int colorInt)
	{
		Color color = new Color(colorInt);
		if (color.getRed() + color.getGreen() + color.getBlue() <= 300)
		{
			return true;
		}
		return false;
	}
 
	public static int ostu(int[][] gray, int w, int h)
	{
		// One bin per gray level (0-255); sizing this by w * h would overflow on images under 256 pixels
		int[] histData = new int[256];
		for (int x = 0; x < w; x++)
		{
			for (int y = 0; y < h; y++)
			{
				int red = 0xFF & gray[x][y];
				histData[red]++;
			}
		}
		int total = w * h;
		float sum = 0;
		for (int t = 0; t < 256; t++) 
		{
			sum += t * histData[t];
		}
		float sumB = 0;
		int wB = 0;
		int wF = 0;
	
		float varMax = 0;
		int threshold = 0;
	
		for (int t = 0; t < 256; t++)
		{
			wB += histData[t]; // Weight Background
			if (wB == 0)
				continue;
			wF = total - wB; // Weight Foreground
			if (wF == 0)
				break;
	
			sumB += (float) (t * histData[t]);
	
			float mB = sumB / wB; // Mean Background
			float mF = (sum - sumB) / wF; // Mean Foreground
	
			// Calculate Between Class Variance
			float varBetween = (float) wB * (float) wF * (mB - mF) * (mB - mF);
	
			// Check if new maximum found
			if (varBetween > varMax)
			{
				varMax = varBetween;
				threshold = t;
			}
		}
		return threshold;
	}
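To see what this kind of automatic threshold produces on concrete data, here is a self-contained sketch of the same between-class-variance search on a flat array of gray values. `OtsuDemo` is a hypothetical name, and the six-pixel input is invented purely for illustration.

```java
// Minimal sketch of Otsu thresholding on a flat array of gray values (0-255).
class OtsuDemo {
    static int otsu(int[] pixels) {
        int[] hist = new int[256];
        for (int p : pixels) {
            hist[p & 0xFF]++;
        }
        int total = pixels.length;
        double sum = 0;
        for (int t = 0; t < 256; t++) {
            sum += t * hist[t];
        }

        double sumB = 0;
        double varMax = 0;
        int wB = 0;
        int threshold = 0;
        for (int t = 0; t < 256; t++) {
            wB += hist[t];            // number of pixels at or below t
            if (wB == 0) continue;
            int wF = total - wB;      // number of pixels above t
            if (wF == 0) break;
            sumB += t * hist[t];
            double mB = sumB / wB;            // mean of the lower class
            double mF = (sum - sumB) / wF;    // mean of the upper class
            double between = (double) wB * wF * (mB - mF) * (mB - mF);
            if (between > varMax) {           // keep the first t with the largest separation
                varMax = between;
                threshold = t;
            }
        }
        return threshold;
    }

    public static void main(String[] args) {
        int[] pixels = {10, 10, 10, 200, 200, 200}; // three dark and three light pixels
        System.out.println(otsu(pixels)); // 10
    }
}
```

With a threshold of 10, every pixel strictly above it (the 200s) becomes white and the rest black, which is the same `>` comparison used in the binarization loop.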

Splitting into 4 digits and comparing each one

/**
	 * Recognize the digits from the binarized matrix.
	 * The width is split as 5 8 2 8 2 8 2 8 7:
	 * the leading 5 columns are unused, each 8 is a digit, each 2 is a gap, and the trailing 7 is blank.
	 * @param binaryBufferedImage the binarized CAPTCHA image
	 * @param h image height
	 * @param w image width
	 */
	private static String getNum(BufferedImage binaryBufferedImage, int h, int w)
	{
		String result = "";
		// Left edge of each of the four digits: 5, 15, 25, 35
		for (int start = 5; start <= 35; start += 10)
		{
			String[][] toCompare = new String[h][8];
			for (int y = 0; y < h; y++)
			{
				for (int x = start; x < start + 8; x++)
				{
					toCompare[y][x - start] = isBlack(binaryBufferedImage.getRGB(x, y)) ? "*" : " ";
				}
			}
			// Compare this digit against the 0-9 templates
			result += compare(toCompare);
		}
		return result;
	}
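	/**
	 * Compare against the 0-9 templates.
	 * The middle of this method was garbled in the original post; the loops below are
	 * reconstructed from the description above (crop to 10*8, count equal cells, divide by 80).
	 * @param original h*8 matrix of one digit
	 */
	private static String compare(String[][] original) 
	{
		String[][] toCompare = new String[10][8];
		// Find the start: the first row, from y=1, that contains a black pixel
		int st = 0;
		for (int y = 1; y < original.length && st == 0; y++)
		{
			if (java.util.Arrays.asList(original[y]).contains("*"))
				st = y;
		}
		// Crop 10 rows from st, clamped so the crop stays inside the matrix
		st = Math.max(0, Math.min(st, original.length - 10));
		System.arraycopy(original, st, toCompare, 0, 10);
		int res = 0;
		double similar = -1;
		for (int numIndex = 0; numIndex < nums.length; numIndex++)
		{
			int same = 0;
			for (int i = 0; i < 80; i++)
				if (nums[numIndex][i / 8][i % 8].equals(toCompare[i / 8][i % 8])) same++;
			double thisSimilar = same / 80.0;
			if (thisSimilar > similar) 
			{
				similar = thisSimilar;
				res = numIndex;
			}
		}
//		System.out.println("Recognition result: " + res);
		return res+"";
	}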

 
