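This is another follow-up on web-scraping problems.
I'll use the data I scrape from http://www.digifilm.com.cn/index.php/index/index.html as the example.
Downloading files requires logging in, and logging in requires a CAPTCHA.
First you need to understand how the login works. The server randomly generates an image of a 4-digit number with interference lines and stores the digits in the session. On verification, it compares the digits you submit against the code in the session; if they match, the CAPTCHA is accepted.
So, key point: you need the session.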
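To make the point concrete, here is a minimal sketch of what the server side presumably does. The class name, method names, and the in-memory map are all hypothetical; the real site's code is not known to me, this only models the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the server side; not the site's real code.
class CaptchaSessionSketch {
    // session id -> digits the server drew into the image for that session
    private final Map<String, String> codeBySession = new HashMap<>();

    // Called when a client requests the CAPTCHA image.
    void issue(String sessionId, String digits) {
        codeBySession.put(sessionId, digits);
    }

    // Called on login: the answer is only checked against the caller's own session.
    boolean verify(String sessionId, String answer) {
        return answer != null && answer.equals(codeBySession.get(sessionId));
    }
}
```

An answer read from an image that was issued to a different session never matches, which is why every request has to carry the same session cookie.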
Getting the session in a crawler is pretty basic: just request the page with Jsoup.
Response resultImageResponse = Jsoup.connect("http://www.digifilm.com.cn/index.php/public/checklogin.html").ignoreContentType(true).execute();
Map<String, String> cookies = resultImageResponse.cookies();
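After that, remember to send these cookies with every other request; that is what ties you to the session.
On to the next step: to recognize an image you first need the image. Be sure to send the cookies from above when downloading it, otherwise the image has nothing to do with your session and will never match. It's like opening the page in Firefox, copying the image URL, pasting it into Chrome, refreshing to get a new CAPTCHA, and then typing that one into Firefox: it is 100% guaranteed to fail with a CAPTCHA error.
Here is download code you can use as-is. You can of course change it to take the download path as a parameter; do whatever suits you, it doesn't matter.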
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public static String downloadImg(String url, Map<String, String> cookies) throws IOException
{
    Connection connect = Jsoup.connect(url);
    connect.cookies(cookies);// send the session cookies when fetching the image
    connect.timeout(5 * 10000);// 50 seconds
    Connection.Response response = connect.ignoreContentType(true).execute();
    byte[] img = response.bodyAsBytes();
    // where the file is stored
    String directory = "f://test1//";
    saveImage(img, directory, "yzm.png");
    return "f://test1//yzm.png";
}
public static void saveImage(byte[] img, String filePath, String fileName) {
    File dir = new File(filePath);
    // create the directory (and any missing parents) if it does not exist yet
    if (!dir.exists()) {
        dir.mkdirs();
    }
    File file = new File(dir, fileName);
    // try-with-resources closes the stream even if the write fails
    try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file))) {
        bos.write(img);
        // System.out.println("CAPTCHA downloaded to: " + filePath);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Of course, you can skip all that and download the file directly with Jsoup.
Response resultImageResponse = Jsoup.connect(url).cookies(cookies).ignoreContentType(true).execute();
FileOutputStream out = (new FileOutputStream(new java.io.File("f://test//yzm.png")));
out.write(resultImageResponse.bodyAsBytes());
out.close();
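With the image in hand, we get to the main topic: image recognition.
I had never touched image recognition before, so I started with Baidu. Most of the search results were about Tess4j, with a few about OpenCV. They are simple to use; if you are on Windows, just search for those and use them, since you only have to add a ready-made jar and recognition works right away, very reliably.
On Linux, though, the next thing you search for is how to install tess4j on Linux... emmmm. Here is a link I found credible, although I never got it installed myself: http://www.cnblogs.com/dajianshi/p/4932882.html
With the whole tess4j stack refusing to install on Linux, things get awkward. If you don't want your boss to remember you as "you spent a week and now tell me it can't be done, you're useless", you have to look into how image recognition actually works.
The first step in image recognition is reading the image and turning it into a 2D matrix, in other words a black-and-white image: the part to recognize becomes black and the background becomes white. Then you segment the image. How to segment depends on the image; printing a few matrices and studying them is enough. If the characters sit at irregular positions, you have to write your own segmentation algorithm.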
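As a rough illustration of the "read the image into a matrix" step, here is a minimal sketch. It uses a fixed brightness threshold that the caller passes in, which is a simplification of mine; the class and method names are made up for this example.

```java
import java.awt.image.BufferedImage;

// Illustrative helper, not part of the original code.
class BinaryMatrixSketch {
    // Returns matrix[y][x] == true where the pixel is darker than the threshold.
    static boolean[][] toMatrix(BufferedImage image, int threshold) {
        int h = image.getHeight();
        int w = image.getWidth();
        boolean[][] matrix = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // average brightness below the threshold counts as "ink"
                matrix[y][x] = (r + g + b) / 3 < threshold;
            }
        }
        return matrix;
    }

    // Renders the matrix as text: '*' for ink, ' ' for background.
    static String render(boolean[][] matrix) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : matrix) {
            for (boolean ink : row) {
                sb.append(ink ? '*' : ' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Printing `render(toMatrix(image, 128))` for a few downloaded CAPTCHAs is the kind of output you can stare at to work out where each digit sits.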
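Sticking with my site: the downloaded CAPTCHA is 22*50 (height 22, width 50). The height can be ignored; the width of 50 breaks down as 5+8+2+8+2+8+2+8+7. The leading 5 and trailing 7 are background border and can be ignored, each 2 is a gap and can be ignored, and 8 is the actual width of one digit matrix.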
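The arithmetic behind that layout can be written down as a small sketch. The constant and method names are mine; the numbers are the ones given above.

```java
// Illustrative constants for the 5 8 2 8 2 8 2 8 7 column layout.
class DigitLayoutSketch {
    static final int LEFT_MARGIN = 5;   // background border on the left
    static final int DIGIT_WIDTH = 8;   // width of one digit
    static final int GAP = 2;           // blank columns between digits
    static final int RIGHT_MARGIN = 7;  // background border on the right
    static final int DIGITS = 4;

    // First column (inclusive) of digit i, for i = 0..3.
    static int startColumn(int i) {
        return LEFT_MARGIN + i * (DIGIT_WIDTH + GAP);
    }

    // Sum of all segments; should equal the image width.
    static int totalWidth() {
        return LEFT_MARGIN + DIGITS * DIGIT_WIDTH + (DIGITS - 1) * GAP + RIGHT_MARGIN;
    }
}
```

The digits therefore start at columns 5, 15, 25 and 35, and the segments add up to the full width of 50.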
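Then, printing the digit matrices and checking them against each of 0-9, it turns out the digits aren't even slanted. Oh ho ho.
The next idea was to take each of the four 22*8 digits, crop it down to 10*8, and compare it with the reference digit matrices; the most similar one is the recognition result. There are many ways to measure similarity; I saw things online like cos-something. I went with a simpler approach: compare the candidate matrix with a reference matrix cell by cell, add 1 to a counter for each identical cell, divide by 80 at the end, and take the digit with the highest score as the result.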
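The scoring idea can be sketched on its own like this. The class and method names are made up for the example; it divides by the actual number of cells, which is 80 for a 10*8 matrix.

```java
// Illustrative cell-by-cell matcher; assumes both matrices have the same dimensions.
class SimilaritySketch {
    // Fraction of cells that are identical in both matrices.
    static double similarity(String[][] a, String[][] b) {
        int same = 0;
        int total = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x].equals(b[y][x])) {
                    same++;
                }
                total++;
            }
        }
        return total == 0 ? 0.0 : (double) same / total;
    }

    // Index of the template with the highest similarity to the candidate.
    static int bestMatch(String[][] candidate, String[][][] templates) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < templates.length; i++) {
            double score = similarity(candidate, templates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

With ten 10*8 templates indexed 0-9, the returned index is the recognized digit. On a tie the lower index wins, which is a choice of this sketch.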
The cleaned-up code follows:
Reference matrices
public static final String[][] ziro = new String[][]
{
{" "," "," ","*","*"," "," "," "},
{" "," ","*","*","*","*"," "," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
{" "," "," ","*","*"," "," "," "},
};
public static final String[][] one = new String[][]
{
{" "," "," ","*","*"," "," "," "},
{" "," ","*","*","*"," "," "," "},
{" ","*","*","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," "," ","*","*"," "," "," "},
{" ","*","*","*","*","*","*"," "},
};
public static final String[][] two = new String[][]
{
{" "," ","*","*","*","*"," "," "},
{"*","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," ","*","*"," "},
{" "," "," "," ","*","*"," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," ","*","*"," "," "," "," "},
{" ","*","*"," "," "," "," "," "},
{"*","*","*","*","*","*","*","*"},
};
public static final String[][] three = new String[][]
{
{" ","*","*","*","*","*"," "," "},
{"*","*"," "," "," ","*","*"," "},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," ","*","*"," "},
{" "," "," ","*","*","*"," "," "},
{" "," "," "," "," ","*","*"," "},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," "," ","*","*"},
{"*","*"," "," "," ","*","*"," "},
{" ","*","*","*","*","*"," "," "},
};
public static final String[][] four = new String[][]
{
{" "," "," "," "," ","*","*"," "},
{" "," "," "," ","*","*","*"," "},
{" "," "," ","*","*","*","*"," "},
{" "," ","*","*"," ","*","*"," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," ","*","*"," "},
{"*","*","*","*","*","*","*","*"},
{" "," "," "," "," ","*","*"," "},
{" "," "," "," "," ","*","*"," "},
{" "," "," "," "," ","*","*"," "},
};
public static final String[][] five = new String[][]
{
{"*","*","*","*","*","*","*"," "},
{"*","*"," "," "," "," "," "," "},
{"*","*"," "," "," "," "," "," "},
{"*","*"," ","*","*","*"," "," "},
{"*","*","*"," "," ","*","*"," "},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
};
public static final String[][] six = new String[][]
{
{" "," ","*","*","*","*"," "," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*"," "},
{"*","*"," "," "," "," "," "," "},
{"*","*"," ","*","*","*"," "," "},
{"*","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
};
public static final String[][] seven = new String[][]
{
{"*","*","*","*","*","*","*","*"},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," "," ","*","*"},
{" "," "," "," "," ","*","*"," "},
{" "," "," "," ","*","*"," "," "},
{" "," "," ","*","*"," "," "," "},
{" "," ","*","*"," "," "," "," "},
{" ","*","*"," "," "," "," "," "},
{"*","*"," "," "," "," "," "," "},
{"*","*"," "," "," "," "," "," "},
};
public static final String[][] eight = new String[][]
{
{" "," ","*","*","*","*"," "," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
};
public static final String[][] nine = new String[][]
{
{" "," ","*","*","*","*"," "," "},
{" ","*","*"," "," ","*","*"," "},
{"*","*"," "," "," "," ","*","*"},
{"*","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*","*"},
{" "," ","*","*","*"," ","*","*"},
{" "," "," "," "," "," ","*","*"},
{" ","*"," "," "," "," ","*","*"},
{" ","*","*"," "," ","*","*"," "},
{" "," ","*","*","*","*"," "," "},
};
public static final String[][][] nums = new String[][][] {ziro,one,two,three,four,five,six,seven,eight,nine};
Reading the image into a 2D matrix
public static String cleanImage(File sfile)throws IOException
{
BufferedImage bufferedImage = ImageIO.read(sfile);
int h = bufferedImage.getHeight();
int w = bufferedImage.getWidth();
// Convert to grayscale
int[][] gray = new int[w][h];
for (int x = 0; x < w; x++)
{
for (int y = 0; y < h; y++)
{
int argb = bufferedImage.getRGB(x, y);
// Brighten the image (adjusting the brightness gives a very high recognition rate)
int r = (int) (((argb >> 16) & 0xFF) * 1.1 + 30);
int g = (int) (((argb >> 8) & 0xFF) * 1.1 + 30);
int b = (int) (((argb >> 0) & 0xFF) * 1.1 + 30);
if (r >= 255)
{
r = 255;
}
if (g >= 255)
{
g = 255;
}
if (b >= 255)
{
b = 255;
}
gray[x][y] = (int) Math.pow((Math.pow(r, 2.2) * 0.2973 + Math.pow(g, 2.2)* 0.6274 + Math.pow(b, 2.2) * 0.0753), 1 / 2.2);
}
}
// Binarize
int threshold = ostu(gray, w, h);
BufferedImage binaryBufferedImage = new BufferedImage(w, h,BufferedImage.TYPE_BYTE_BINARY);
for (int x = 0; x < w; x++)
{
for (int y = 0; y < h; y++)
{
if (gray[x][y] > threshold)
{
gray[x][y] |= 0x00FFFF;
} else
{
gray[x][y] &= 0xFF0000;
}
binaryBufferedImage.setRGB(x, y, gray[x][y]);
}
}
// Print the matrix
for (int y = 0; y < h; y++)
{
for (int x = 0; x < w; x++)
{
if (isBlack(binaryBufferedImage.getRGB(x, y)))
{
System.out.print("*");
} else
{
System.out.print(" ");
}
}
System.out.println("");
}
return getNum(binaryBufferedImage,h,w);
}
public static boolean isBlack(int colorInt)
{
Color color = new Color(colorInt);
if (color.getRed() + color.getGreen() + color.getBlue() <= 300)
{
return true;
}
return false;
}
public static int ostu(int[][] gray, int w, int h)
{
int[] histData = new int[256]; // one bin per gray level 0-255 (w * h would overflow for images under 256 pixels)
for (int x = 0; x < w; x++)
{
for (int y = 0; y < h; y++)
{
int red = 0xFF & gray[x][y];
histData[red]++;
}
}
int total = w * h;
float sum = 0;
for (int t = 0; t < 256; t++)
{
sum += t * histData[t];
}
float sumB = 0;
int wB = 0;
int wF = 0;
float varMax = 0;
int threshold = 0;
for (int t = 0; t < 256; t++)
{
wB += histData[t]; // Weight Background
if (wB == 0)
continue;
wF = total - wB; // Weight Foreground
if (wF == 0)
break;
sumB += (float) (t * histData[t]);
float mB = sumB / wB; // Mean Background
float mF = (sum - sumB) / wF; // Mean Foreground
// Calculate Between Class Variance
float varBetween = (float) wB * (float) wF * (mB - mF) * (mB - mF);
// Check if new maximum found
if (varBetween > varMax)
{
varMax = varBetween;
threshold = t;
}
}
return threshold;
}
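The ostu method above picks the binarization threshold by maximizing the between-class variance (the approach usually called Otsu's method). Here is a minimal standalone sketch of the same idea on a plain array of gray values, so it can be tried without an image; the class name and the flat-array input are my own simplifications.

```java
// Illustrative standalone version; takes gray values (0-255) instead of an image.
class OtsuSketch {
    // Returns the threshold t that maximizes between-class variance;
    // values <= t form one class, values > t the other.
    static int threshold(int[] grayValues) {
        int[] hist = new int[256];
        for (int v : grayValues) {
            hist[v & 0xFF]++;
        }
        int total = grayValues.length;
        double sumAll = 0;
        for (int t = 0; t < 256; t++) {
            sumAll += (double) t * hist[t];
        }
        double sumLow = 0;
        int countLow = 0;
        double bestVariance = 0;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            countLow += hist[t];
            if (countLow == 0) continue;       // nothing in the low class yet
            int countHigh = total - countLow;
            if (countHigh == 0) break;         // everything is in the low class
            sumLow += (double) t * hist[t];
            double meanLow = sumLow / countLow;
            double meanHigh = (sumAll - sumLow) / countHigh;
            double diff = meanLow - meanHigh;
            double variance = (double) countLow * countHigh * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }
}
```

For a cleanly two-valued input such as three pixels at 10 and three at 200, the first threshold that separates the two groups is 10, and pixels brighter than that are treated as background.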
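Splitting into 4 digits and comparing each one
/**
* Recognize the digits from the matrix.
* Columns are split as 5 8 2 8 2 8 2 8 7:
* the first 5 are unused, each 8 is a digit, each 2 is a gap, and the last 7 are the trailing blank part.
* @param binaryBufferedImage the binarized CAPTCHA image
* @param h image height
* @param w image width
*/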
private static String getNum(BufferedImage binaryBufferedImage, int h, int w)
{
String result = "";
// First digit
String[][] toCompare = new String[h][8];
for (int y = 0; y < h; y++)
{
for (int x = 5; x < 13; x++)
{
if (isBlack(binaryBufferedImage.getRGB(x, y)))
{
toCompare[y][x-5] = "*";
} else
{
toCompare[y][x-5] = " ";
}
}
}
// Compare this digit against the 0-9 arrays
result += compare(toCompare);
for (int y = 0; y < h; y++)
{
for (int x = 15; x < 23; x++)
{
if (isBlack(binaryBufferedImage.getRGB(x, y)))
{
toCompare[y][x-15] = "*";
} else
{
toCompare[y][x-15] = " ";
}
}
}
result += compare(toCompare);
for (int y = 0; y < h; y++)
{
for (int x = 25; x < 33; x++)
{
if (isBlack(binaryBufferedImage.getRGB(x, y)))
{
toCompare[y][x-25] = "*";
} else
{
toCompare[y][x-25] = " ";
}
}
}
result += compare(toCompare);
for (int y = 0; y < h; y++)
{
for (int x = 35; x < 43; x++)
{
if (isBlack(binaryBufferedImage.getRGB(x, y)))
{
toCompare[y][x-35] = "*";
} else
{
toCompare[y][x-35] = " ";
}
}
}
result += compare(toCompare);
return result;
}
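The four copy loops in getNum differ only in their starting column, so they could be folded into one helper. A minimal sketch, assuming the image has already been turned into a "*"/" " matrix indexed as matrix[y][x]; the class and method names are hypothetical.

```java
// Illustrative helper that replaces the four near-identical loops.
class DigitSlicerSketch {
    // Copies the block of `width` columns starting at startX out of the matrix.
    static String[][] slice(String[][] matrix, int startX, int width) {
        String[][] digit = new String[matrix.length][width];
        for (int y = 0; y < matrix.length; y++) {
            for (int x = 0; x < width; x++) {
                digit[y][x] = matrix[y][startX + x];
            }
        }
        return digit;
    }

    // Slices all four digits using the 5 8 2 8 2 8 2 8 7 layout.
    static String[][][] sliceAll(String[][] matrix) {
        String[][][] digits = new String[4][][];
        for (int i = 0; i < 4; i++) {
            // digit i starts at column 5 + i * (8 + 2)
            digits[i] = slice(matrix, 5 + i * 10, 8);
        }
        return digits;
    }
}
```

Each returned block can then be passed to the comparison step one at a time.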
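/**
* Compare one digit (h x 8) against the 0-9 reference arrays.
* NOTE: the source is truncated between "y" and "similar)"; the row search and scoring
* loop are reconstructed from the article's description, not the author's exact code.
*/
private static String compare(String[][] original)
{
    String[][] toCompare = new String[10][8];
    // Find the start: the first row (from row 1) that contains a "*"
    int st = 0;
    for (int y = 1; y < original.length; y++)
    {
        if (String.join("", original[y]).contains("*")) { st = y; break; }
    }
    // Keep the 10 rows from st on, padding with blanks past the bottom edge
    for (int y = 0; y < 10; y++)
        for (int x = 0; x < 8; x++)
            toCompare[y][x] = st + y < original.length ? original[st + y][x] : " ";
    int res = 0;
    double similar = -1;
    for (int numIndex = 0; numIndex < nums.length; numIndex++)
    {
        int same = 0;
        for (int y = 0; y < 10; y++)
            for (int x = 0; x < 8; x++)
                if (nums[numIndex][y][x].equals(toCompare[y][x])) same++;
        double thisSimilar = same / 80.0;
        if (thisSimilar > similar)
        {
            similar = thisSimilar;
            res = numIndex;
        }
    }
    // System.out.println("Recognized: " + res);
    return res+"";
}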