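This step uses Apache HttpClient. The downloaded archive contains several libraries, but we only need three of them: httpcore-4.4.9.jar, httpclient-4.5.5.jar, and commons-logging-1.2.jar.
Next we need the URL of the captcha served by the ZhengFang (正方) academic affairs system. In most deployments it is the system's base URL followed by /CheckCode.aspx.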
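As a small illustration of that URL rule, the sketch below joins a base URL and the captcha path. The class name `CaptchaUrl` and the helper `captchaUrl` are hypothetical names introduced here, and the host is the same placeholder used in MySetting; check the result against your own school's deployment.

```java
final class CaptchaUrl {
    // Hypothetical helper: appends /CheckCode.aspx to the base URL,
    // dropping one trailing slash first so the result has no "//".
    static String captchaUrl(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return trimmed + "/CheckCode.aspx";
    }

    public static void main(String[] args) {
        // Placeholder host, not a real deployment.
        System.out.println(captchaUrl("http://jwxt.domain.edu.cn/"));
    }
}
```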
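The MySetting class keeps the core parameters in one place so they are easy to change. A note on workflow: I suggest first fetching 100 captcha images and working through the tutorial with them, then repeating the whole process with 500 images, and finally with 1,000. Building up step by step like this reduces the manual effort and the human errors along the way, and leaves you with a rich glyph template library at the end.
You also need to run the downloader one extra time, saving the images to RES\imgTEST (that is, with MySetting.IMG_TEST as the output directory), so that a separate set is available for the final accuracy check.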
package getCode;

public class MySetting {
    // GIF captcha images fetched directly from the ZhengFang system
    public static String IMG_1K = "RES\\img1K\\";
    // Test set for checking recognition accuracy
    public static String IMG_TEST = "RES\\imgTEST\\";
    // URL of the ZhengFang captcha
    public static String SECRETCODE_URL = "http://jwxt.domain.edu.cn/CheckCode.aspx";
}
package getCode;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
public class getCodeIMG {
    public static void main(String[] args) throws IOException {
        // Reuse one client for all requests and close it when the loop is done.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            for (int i = 0; i < 1000; i++) {
                HttpGet secretCodeGet = new HttpGet(MySetting.SECRETCODE_URL);
                File target = new File(MySetting.IMG_1K + "Code" + i + ".gif");
                // try-with-resources closes the response and the file stream
                // even if a request or a write fails.
                try (CloseableHttpResponse responseSecret = client.execute(secretCodeGet);
                        FileOutputStream fileOutputStream = new FileOutputStream(target)) {
                    responseSecret.getEntity().writeTo(fileOutputStream);
                }
                System.out.println("Code" + i + ".gif");
            }
        }
        System.out.println("Finish!");
    }
}
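At this point we have quickly collected 100, 500, or 1,000 captcha images in GIF format.
The next step is probably optional. I originally wanted to use Tencent YouTu to help label the captcha images, so I first converted them from GIF to PNG.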
package getCode;

public class MySetting {
    // GIF captcha images fetched directly from the ZhengFang system
    public static String IMG_1K = "RES\\img1K\\";
    // Captcha images converted to PNG
    public static String IMG_PNG_1K = "RES\\imgPNG1K\\";
    // Test set for checking recognition accuracy
    public static String IMG_TEST = "RES\\imgTEST\\";
    // Test set converted to PNG
    public static String IMG_PNG_TEST = "RES\\imgPNGTEST\\";
    // URL of the ZhengFang captcha
    public static String SECRETCODE_URL = "http://jwxt.domain.edu.cn/CheckCode.aspx";
}
package getCode;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.imageio.ImageIO;
public class gifToPNG {
    public static void main(String[] args) throws IOException {
        for (int i = 0; i < 1000; i++) {
            File gif = new File(MySetting.IMG_1K + "Code" + i + ".gif");
            // try-with-resources closes the stream even if the conversion throws.
            try (OutputStream out = new FileOutputStream(MySetting.IMG_PNG_1K + "Code" + i + ".png")) {
                ImageIO.write(ImageIO.read(gif), "png", out);
            }
            System.out.println("Now:" + i);
        }
        System.out.println("Finish");
    }
}
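All the PNG files are now in the RES\imgPNG1K folder.
Overall, ZhengFang captchas are fairly easy to process. All four characters are drawn in the same solid blue, so we only need to remove the background noise, binarize the image to black and white, and split it to obtain the glyph templates.
Labeling means renaming each captcha file to its correct answer. You can now use the glyph templates I generated to produce the labels automatically and then simply check them once. (Because so many captchas had to be verified by hand, the 1,000 labeled captchas in the attachment may contain a few mistakes, which would affect the accuracy of the templates. If you find a wrong label in RES\imgPNG1K, please leave a comment below the article.)
Image preprocessing has two key parts: removing the background noise and binarizing the image to black and white.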
// Removes the background and binarizes in one pass:
// glyph-coloured pixels become black, everything else becomes white.
public static BufferedImage removeBackgroud(String picFile) throws Exception {
    BufferedImage img = ImageIO.read(new File(picFile));
    int width = img.getWidth();
    int height = img.getHeight();
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            if (isBlue(img.getRGB(x, y)) == 1) {
                img.setRGB(x, y, Color.BLACK.getRGB());
            } else {
                img.setRGB(x, y, Color.WHITE.getRGB());
            }
        }
    }
    return img;
}

// Returns 1 when the channel sum is exactly 153, which is the case for the glyph blue RGB(0, 0, 153).
public static int isBlue(int colorInt) {
    Color color = new Color(colorInt);
    int rgb = color.getRed() + color.getGreen() + color.getBlue();
    if (rgb == 153) {
        return 1;
    }
    return 0;
}

// Returns 1 for near-black pixels (channel sum <= 100).
public static int isBlack(int colorInt) {
    Color color = new Color(colorInt);
    if (color.getRed() + color.getGreen() + color.getBlue() <= 100) {
        return 1;
    }
    return 0;
}

// Returns 1 for near-white pixels (channel sum > 600).
public static int isWhite(int colorInt) {
    Color color = new Color(colorInt);
    if (color.getRed() + color.getGreen() + color.getBlue() > 600) {
        return 1;
    }
    return 0;
}
(Figure: a captcha after background removal and binarization.)
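To inspect a binarized captcha without an image viewer, it can be dumped as text. The sketch below is only an illustration: `BinarizeDemo`, `isGlyphPixel`, `toAscii`, and `sample` are hypothetical names, and the 4x2 image is synthetic rather than a real captcha. It applies the same "channel sum equals 153" rule as isBlue. Note that this rule accepts any colour whose channels sum to 153 (for example the grey RGB(51, 51, 51)), not only RGB(0, 0, 153), so it is worth checking that the noise in your own captchas never hits that sum.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

final class BinarizeDemo {
    // Same rule as isBlue in the article: the channel sum must be exactly 153.
    static boolean isGlyphPixel(int rgb) {
        Color c = new Color(rgb);
        return c.getRed() + c.getGreen() + c.getBlue() == 153;
    }

    // Renders the image row by row: '#' for glyph pixels, '.' for everything else.
    static String toAscii(BufferedImage img) {
        StringBuilder sb = new StringBuilder();
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                sb.append(isGlyphPixel(img.getRGB(x, y)) ? '#' : '.');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Synthetic 4x2 image standing in for a real captcha (an assumption for this demo).
    static BufferedImage sample() {
        BufferedImage img = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 2; y++) {
            for (int x = 0; x < 4; x++) {
                img.setRGB(x, y, Color.WHITE.getRGB());
            }
        }
        int glyphBlue = new Color(0, 0, 153).getRGB();
        img.setRGB(1, 0, glyphBlue);
        img.setRGB(2, 1, glyphBlue);
        img.setRGB(3, 0, new Color(120, 120, 120).getRGB()); // grey noise pixel
        return img;
    }

    public static void main(String[] args) {
        System.out.print(toAscii(sample()));
    }
}
```

For a real captcha, pass the image returned by ImageIO.read(...) to toAscii instead of sample().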
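Because ZhengFang captchas are quite regular, no sophisticated segmentation algorithm is needed; splitting the image into equal-width blocks is enough. An image editor such as Paint helps to inspect a captcha and determine the split parameters.
The characters lie within rows 0-23 and columns 5-53. Each character takes a quarter of that width (12 pixels), so the first character spans columns 5-17, the second 17-29, the third 29-41, and the fourth 41-53. This split works for most captchas. A few end up slightly mis-cut, but because the basic features of the characters stay intact, recognition is largely unaffected, and with 4,000 templates in the final library the method tolerates such errors fairly well. So I simply use this crude approach.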
public static List<BufferedImage> splitImage(BufferedImage img) throws Exception {
    List<BufferedImage> subImgs = new ArrayList<BufferedImage>();
    // getSubimage(x, y, width, height): four 12x23 blocks starting at x = 5, 17, 29, 41
    subImgs.add(img.getSubimage(5, 0, 12, 23));
    subImgs.add(img.getSubimage(17, 0, 12, 23));
    subImgs.add(img.getSubimage(29, 0, 12, 23));
    subImgs.add(img.getSubimage(41, 0, 12, 23));
    return subImgs;
}
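The four hard-coded boxes follow from three numbers: the left margin (5), the character width ((53 - 5) / 4 = 12), and the height (23). A minimal sketch of that arithmetic is given below; `SplitBoxes` and `boxes` are hypothetical names, not part of the original code, and the parameters would need re-measuring if a deployment uses a different captcha size.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

final class SplitBoxes {
    // Computes `count` equal-width boxes laid side by side, starting at column `left`.
    static List<Rectangle> boxes(int left, int charWidth, int height, int count) {
        List<Rectangle> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(new Rectangle(left + i * charWidth, 0, charWidth, height));
        }
        return result;
    }

    public static void main(String[] args) {
        // Values measured in the article: columns 5-53, rows 0-23, four characters.
        int left = 5;
        int right = 53;
        int count = 4;
        int charWidth = (right - left) / count;
        for (Rectangle r : boxes(left, charWidth, 23, count)) {
            System.out.println("columns " + r.x + "-" + (r.x + r.width)
                    + ", rows " + r.y + "-" + (r.y + r.height));
        }
    }
}
```

Each Rectangle maps directly onto a call of the form img.getSubimage(r.x, r.y, r.width, r.height).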
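Splitting the 1,000 labeled captchas yields a library of 4,000 glyph templates.
Interestingly, the ZhengFang system never generates captchas containing the characters '9', 'o', or 'z', which greatly improves recognition accuracy. In addition, the templates for 'm' and 'w' lose their left and right edges in the split, but their distinguishing features are not affected.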
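One way to check which characters never occur is to count the templates per label. The sketch below assumes the naming scheme produced by trainData (label character, a hyphen, an index, then ".png") and an alphabet of digits plus lowercase letters; `LabelHistogram`, `countLabels`, and `missing` are hypothetical names, and the file names in main are made up for the demo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LabelHistogram {
    // Assumed alphabet: digits and lowercase letters. Adjust if your captchas differ.
    static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Counts templates per label; the first character of each file name is the label.
    static Map<Character, Integer> countLabels(List<String> fileNames) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String name : fileNames) {
            if (name.isEmpty()) {
                continue;
            }
            counts.merge(name.charAt(0), 1, Integer::sum);
        }
        return counts;
    }

    // Lists the characters of ALPHABET that have no template at all.
    static String missing(Map<Character, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (char c : ALPHABET.toCharArray()) {
            if (!counts.containsKey(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up names; in practice use e.g. Arrays.asList(new File(MySetting.TRAIN_ROOT).list()).
        List<String> names = Arrays.asList("a-0.png", "3-1.png", "a-2.png", "x-3.png");
        Map<Character, Integer> counts = countLabels(names);
        System.out.println(counts);
        System.out.println("missing: " + missing(counts));
    }
}
```

Run against the full template folder, the "missing" line should show whichever characters the system never generates, and the per-label counts reveal labels that have too few templates.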
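Recognition likewise starts by removing the background and binarizing the image, followed by splitting. Each sub-image is then compared pixel by pixel against every template in the library, and the label of the template with the fewest differing pixels is the result.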
public static String getSingleCharOcr(BufferedImage img, Map<BufferedImage, String> map) {
    String result = "#"; // returned when no template is close enough
    int width = img.getWidth();
    int height = img.getHeight();
    int min = width * height; // smallest number of differing pixels seen so far
    for (BufferedImage bi : map.keySet()) {
        int count = 0;
        // Skip templates whose width differs by more than 2 pixels.
        if (Math.abs(bi.getWidth() - width) > 2)
            continue;
        int widthmin = Math.min(width, bi.getWidth());
        int heightmin = Math.min(height, bi.getHeight());
        Label1: for (int x = 0; x < widthmin; ++x) {
            for (int y = 0; y < heightmin; ++y) {
                if (isBlack(img.getRGB(x, y)) != isBlack(bi.getRGB(x, y))) {
                    count++;
                    // Already no better than the best match so far: stop early.
                    if (count >= min)
                        break Label1;
                }
            }
        }
        if (count < min) {
            min = count;
            result = map.get(bi);
        }
    }
    return result;
}
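The comparison above is a nearest-template search under a pixel-difference count (a Hamming distance). To make the idea easy to trace by hand, the sketch below applies it to tiny 3x3 bitmaps written as 9-character strings. `NearestTemplate`, `distance`, and `classify` are hypothetical names and the two templates are invented for the demo; like the original, a candidate must beat an initial bound equal to the total pixel count, otherwise "#" is returned.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NearestTemplate {
    // Counts the positions at which two equally long bitmaps differ.
    static int distance(String a, String b) {
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    // Returns the label of the template with the fewest differing pixels, or "#" if none qualifies.
    static String classify(String glyph, Map<String, String> templates) {
        String best = "#";
        int min = glyph.length();
        for (Map.Entry<String, String> e : templates.entrySet()) {
            if (e.getKey().length() != glyph.length()) {
                continue; // only compare bitmaps of the same size in this sketch
            }
            int d = distance(glyph, e.getKey());
            if (d < min) {
                min = d;
                best = e.getValue();
            }
        }
        return best;
    }

    // Invented 3x3 templates, written row by row ('#' = black, '.' = white).
    static Map<String, String> demoTemplates() {
        Map<String, String> templates = new LinkedHashMap<>();
        templates.put(".#." + ".#." + ".#.", "1");
        templates.put("###" + "..#" + "..#", "7");
        return templates;
    }

    public static void main(String[] args) {
        String noisyOne = ".#." + ".#." + ".##"; // a "1" with one stray pixel
        System.out.println(classify(noisyOne, demoTemplates()));
    }
}
```

Here the noisy glyph differs from the "1" template in one pixel and from the "7" template in five, so the "1" label wins.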
MySetting.java
package getCode;

public class MySetting {
    // GIF captcha images fetched directly from the ZhengFang system
    public static String IMG_1K = "RES\\img1K\\";
    // Captcha images converted to PNG (ROOT is the same folder without the trailing separator)
    public static String IMG_PNG_1K_ROOT = "RES\\imgPNG1K";
    public static String IMG_PNG_1K = "RES\\imgPNG1K\\";
    // Test set for checking recognition accuracy
    public static String IMG_TEST = "RES\\imgTEST\\";
    // Test set converted to PNG
    public static String IMG_PNG_TEST = "RES\\imgPNGTEST\\";
    // Recognition results for the test set
    public static String IMG_PNG_TEST_RE = "RES\\imgPNGTESTRe\\";
    // Glyph template library (ROOT is the same folder without the trailing separator)
    public static String TRAIN_ROOT = "RES\\train";
    public static String TRAIN = "RES\\train\\";
    // URL of the ZhengFang captcha
    public static String SECRETCODE_URL = "http://jwxt.domain.edu.cn/CheckCode.aspx";
}
ImagePreProcess.java
package getCode;
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.imageio.ImageIO;
public class ImagePreProcess {
    private static Map<BufferedImage, String> trainMap = null;
    private static int index = 0;

    /*
     * To build the template library: comment out startOCR() and run trainData().
     * To run recognition: comment out trainData() and run startOCR().
     */
    public static void main(String[] args) throws Exception {
        //trainData();
        startOCR();
        System.out.println("Finish All");
    }

    /*
     * Entry point for captcha recognition
     */
    private static void startOCR() throws Exception {
        int sameCount = 0; // captchas whose recognized text already exists as a result file
        for (int i = 0; i < 1000; i++) {
            String resultStr = getAllOcr(MySetting.IMG_PNG_TEST + "Code" + i + ".png");
            System.out.println(i + ".png = " + resultStr);
            File source = new File(MySetting.IMG_PNG_TEST + "Code" + i + ".png");
            // Save a copy named after the recognized text.
            File dest = new File(MySetting.IMG_PNG_TEST_RE + resultStr + ".png");
            if (dest.exists()) {
                sameCount++;
            } else {
                Files.copy(source.toPath(), dest.toPath());
            }
        }
        System.out.println("Same Result=" + sameCount);
    }

    /*
     * Entry point for generating the glyph templates
     */
    public static void trainData() throws Exception {
        File dir = new File(MySetting.IMG_PNG_1K_ROOT);
        File[] files = dir.listFiles();
        for (File file : files) {
            BufferedImage img = removeBackgroud(MySetting.IMG_PNG_1K + file.getName());
            List<BufferedImage> listImg = splitImage(img);
            if (listImg.size() == 4) {
                for (int j = 0; j < listImg.size(); j++) {
                    // The j-th character of the file name is the label of the j-th glyph.
                    // Build the name once so the logged name matches the written file
                    // and index advances only once per glyph.
                    String glyphName = file.getName().charAt(j) + "-" + (index++) + ".png";
                    ImageIO.write(listImg.get(j), "PNG", new File(MySetting.TRAIN + glyphName));
                    System.out.println(file.getName() + "\t" + glyphName);
                }
            }
        }
    }
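
    /*
     * Loads the template library once and caches it; the first character of each file name is the label
     */
    public static Map<BufferedImage, String> loadTrainData() throws Exception {
        if (trainMap == null) {
            Map<BufferedImage, String> map = new HashMap<BufferedImage, String>();
            File dir = new File(MySetting.TRAIN_ROOT);
            File[] files = dir.listFiles();
            for (File file : files) {
                map.put(ImageIO.read(file), file.getName().charAt(0) + "");
            }
            trainMap = map;
        }
        return trainMap;
    }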

    /*
     * Recognizes a whole captcha: preprocess, split, then recognize each character
     */
    public static String getAllOcr(String file) throws Exception {
        BufferedImage img = removeBackgroud(file);
        List<BufferedImage> listImg = splitImage(img);
        Map<BufferedImage, String> map = loadTrainData();
        String result = "";
        for (BufferedImage bIMG : listImg) {
            result += getSingleCharOcr(bIMG, map);
        }
        return result;
    }

    /*
     * Removes the captcha background and binarizes the image
     */
    public static BufferedImage removeBackgroud(String picFile) throws Exception {
        BufferedImage img = ImageIO.read(new File(picFile));
        int width = img.getWidth();
        int height = img.getHeight();
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                if (isBlue(img.getRGB(x, y)) == 1) {
                    img.setRGB(x, y, Color.BLACK.getRGB());
                } else {
                    img.setRGB(x, y, Color.WHITE.getRGB());
                }
            }
        }
        return img;
    }

    // Returns 1 when the channel sum is exactly 153, which is the case for the glyph blue RGB(0, 0, 153).
    public static int isBlue(int colorInt) {
        Color color = new Color(colorInt);
        int rgb = color.getRed() + color.getGreen() + color.getBlue();
        if (rgb == 153) {
            return 1;
        }
        return 0;
    }

    // Returns 1 for near-black pixels (channel sum <= 100).
    public static int isBlack(int colorInt) {
        Color color = new Color(colorInt);
        if (color.getRed() + color.getGreen() + color.getBlue() <= 100) {
            return 1;
        }
        return 0;
    }

    // Returns 1 for near-white pixels (channel sum > 600).
    public static int isWhite(int colorInt) {
        Color color = new Color(colorInt);
        if (color.getRed() + color.getGreen() + color.getBlue() > 600) {
            return 1;
        }
        return 0;
    }

    /*
     * Splits the captcha into four 12x23 character images
     */
    public static List<BufferedImage> splitImage(BufferedImage img) throws Exception {
        List<BufferedImage> subImgs = new ArrayList<BufferedImage>();
        subImgs.add(img.getSubimage(5, 0, 12, 23));
        subImgs.add(img.getSubimage(17, 0, 12, 23));
        subImgs.add(img.getSubimage(29, 0, 12, 23));
        subImgs.add(img.getSubimage(41, 0, 12, 23));
        return subImgs;
    }

    /*
     * Recognizes a single split character by finding the template with the fewest differing pixels
     */
    public static String getSingleCharOcr(BufferedImage img, Map<BufferedImage, String> map) {
        String result = "#"; // returned when no template is close enough
        int width = img.getWidth();
        int height = img.getHeight();
        int min = width * height; // smallest number of differing pixels seen so far
        for (BufferedImage bi : map.keySet()) {
            int count = 0;
            // Skip templates whose width differs by more than 2 pixels.
            if (Math.abs(bi.getWidth() - width) > 2)
                continue;
            int widthmin = Math.min(width, bi.getWidth());
            int heightmin = Math.min(height, bi.getHeight());
            Label1: for (int x = 0; x < widthmin; ++x) {
                for (int y = 0; y < heightmin; ++y) {
                    if (isBlack(img.getRGB(x, y)) != isBlack(bi.getRGB(x, y))) {
                        count++;
                        // Already no better than the best match so far: stop early.
                        if (count >= min)
                            break Label1;
                    }
                }
            }
            if (count < min) {
                min = count;
                result = map.get(bi);
            }
        }
        return result;
    }
}
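Repeating this process enlarges the template library until the required accuracy is reached.
The following tool is built on JavaFX and helps with part of the manual work. When checking recognition results, press Enter if the result is correct to move on to the next captcha. If the result is wrong, type the correct value into the text field and press Enter: the image is saved under the corrected name, and the misrecognized original is moved to a separate folder for later analysis.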
package getCode;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javafx.application.Application;
import javafx.event.EventHandler;
import javafx.scene.Scene;
import javafx.scene.control.Label;
import javafx.scene.control.TextField;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.input.KeyCode;
import javafx.scene.input.KeyEvent;
import javafx.scene.layout.VBox;
import javafx.stage.Stage;
public class checkTool extends Application {
    private int doneCount = 0;
    private int rightCount = 0;
    private boolean finished = false;
    private String nowFStr = "";
    private ImageView imgV;
    private Image img;

    @Override
    public void start(Stage primaryStage) {
        Stage splash = new Stage();
        File f = new File(MySetting.IMG_PNG_TEST_RE);
        File[] files = f.listFiles();
        VBox root = new VBox(5);
        nowFStr = files[doneCount].getName();
        String URL = files[doneCount].getAbsoluteFile().toURI().toString();
        img = new Image(URL);
        imgV = new ImageView(img);
        doneCount++;
        Label ocrL = new Label(nowFStr);
        ocrL.setPrefSize(200, 20);
        TextField inputTF = new TextField();
        Label rightL = new Label("Correct: 0.00");
        root.getChildren().addAll(imgV, ocrL, inputTF, rightL);
        root.setOnKeyPressed(new EventHandler<KeyEvent>() {
            @Override
            public void handle(KeyEvent event) {
                if (event.getCode() == KeyCode.ENTER && !finished) {
                    String name = inputTF.getText();
                    if (name == null || name.equals("")) {
                        // Empty input means the recognized result was correct.
                        rightCount++;
                    } else {
                        inputTF.setText("");
                        File file = new File(MySetting.IMG_PNG_TEST_RE + nowFStr);
                        // Copy saved under the corrected name
                        File aimA = new File("RES\\wrong\\" + name + ".png");
                        // Copy of the misrecognized original
                        File aimB = new File("RES\\wrongOrg\\" + nowFStr);
                        try {
                            Files.copy(file.toPath(), aimA.toPath());
                            Files.copy(file.toPath(), aimB.toPath());
                            file.delete();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                    System.out.println(
                            "Total: " + doneCount + "\tRight: " + rightCount + "\tWrong: " + (doneCount - rightCount));
                    rightL.setText("Correct: " + (1.0 * rightCount / doneCount));
                    // Stop after the last file instead of reading past the end of the array.
                    if (doneCount >= files.length) {
                        finished = true;
                        ocrL.setText("Done");
                        return;
                    }
                    nowFStr = files[doneCount].getName();
                    String URL = files[doneCount].getAbsoluteFile().toURI().toString();
                    img = new Image(URL);
                    imgV.setImage(img);
                    ocrL.setText(nowFStr);
                    doneCount++;
                }
            }
        });
        Scene scene = new Scene(root);
        splash.setScene(scene);
        splash.setTitle("Src");
        splash.show();
    }

    public static void main(String[] args) {
        Application.launch(args);
    }
}
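With a small modification, the tool above can also measure the recognition accuracy: a random number generator makes the captcha images appear in random order instead of sequentially.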
package getCode;
import java.io.File;
import java.util.ArrayList;
import java.util.Random;
import javafx.application.Application;
import javafx.event.EventHandler;
import javafx.scene.Scene;
import javafx.scene.control.Label;
import javafx.scene.control.TextField;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.input.KeyCode;
import javafx.scene.input.KeyEvent;
import javafx.scene.layout.VBox;
import javafx.stage.Stage;
public class resultTool extends Application {
    private int doneCount = 0;
    private int rightCount = 0;
    private boolean finished = false;
    private String nowFStr = "";
    private ImageView imgV;
    private Image img;

    @Override
    public void start(Stage primaryStage) {
        Stage splash = new Stage();
        File f = new File(MySetting.IMG_PNG_TEST_RE);
        File[] files = f.listFiles();
        // Indices of the images that have already been shown
        ArrayList<Integer> tempA = new ArrayList<Integer>();
        VBox root = new VBox(5);
        nowFStr = files[doneCount].getName();
        String URL = files[doneCount].getAbsoluteFile().toURI().toString();
        img = new Image(URL);
        imgV = new ImageView(img);
        tempA.add(doneCount);
        doneCount++;
        Label ocrL = new Label(nowFStr);
        ocrL.setPrefSize(200, 20);
        TextField inputTF = new TextField();
        Label rightL = new Label("Correct: 0.0");
        root.getChildren().addAll(imgV, ocrL, inputTF, rightL);
        Random ran = new Random();
        root.setOnKeyPressed(new EventHandler<KeyEvent>() {
            @Override
            public void handle(KeyEvent event) {
                if (event.getCode() == KeyCode.ENTER && !finished) {
                    String name = inputTF.getText();
                    if (name == null || name.equals("")) {
                        // Empty input means the recognized result was correct.
                        rightCount++;
                    } else {
                        inputTF.setText("");
                    }
                    System.out.println(
                            "Total: " + doneCount + "\tRight: " + rightCount + "\tWrong: " + (doneCount - rightCount));
                    rightL.setText("Correct: " + (1.0 * rightCount / doneCount));
                    // Every image has been shown once: stop instead of looping forever below.
                    if (tempA.size() >= files.length) {
                        finished = true;
                        ocrL.setText("Done");
                        return;
                    }
                    int index = ran.nextInt(files.length);
                    while (tempA.contains(index)) {
                        index = ran.nextInt(files.length);
                    }
                    // Remember the index so the same image is not shown twice.
                    tempA.add(index);
                    nowFStr = files[index].getName();
                    String URL = files[index].getAbsoluteFile().toURI().toString();
                    img = new Image(URL);
                    imgV.setImage(img);
                    doneCount++;
                    ocrL.setText(nowFStr);
                }
            }
        });
        Scene scene = new Scene(root);
        splash.setScene(scene);
        splash.setTitle("Src");
        splash.show();
    }

    public static void main(String[] args) {
        Application.launch(args);
    }
}
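Drawing random indices and retrying on repeats gets slower as fewer unseen images remain. An alternative, shown here only as a sketch, is to shuffle the list of indices once and then walk through it in order. `ShuffledOrder` and `order` are hypothetical names, and the fixed seed is an assumption made so that a run can be reproduced; drop it for a different order on every run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffledOrder {
    // Returns the indices 0..n-1 in random order, each exactly once.
    static List<Integer> order(int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }

    public static void main(String[] args) {
        // With 1000 result files, the first 300 entries would form the sample to check by hand.
        List<Integer> order = order(1000, 42L);
        System.out.println(order.subList(0, 300));
    }
}
```

In the tool, the key handler would then show files[order.get(k)] for k = 0, 1, 2, ... and stop when k reaches the sample size.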
Because ZhengFang captchas are relatively simple, the recognition rate on 300 samples drawn from the 1,000 recognized test captchas was 100%.
https://download.csdn.net/download/swiftmx/10485009