Code Example: Simple Login, Page Navigation, and Data Extraction with HtmlUnit (Example Site: RT-Mart)

First, the target content we want to fetch is the merchant order query result:
[Figure 1: screenshot of the merchant order query result]

The following is the login module code (CAPTCHA parsing is not implemented yet, so the CAPTCHA has to be read manually):
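/**
 * Logs in to the supplier site.
 *
 * @param username   user name
 * @param password   password
 * @param otherParam other parameter: the region; must be numeric, e.g. 0: East China; 1: North China; 2: Northeast China; 3: Central China; 4: South China
 * @return a CrawlerMessage describing the login result
 */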
@Override
public CrawlerMessage login(String username, String password, String otherParam) {
    userName = username;
    CrawlerMessage message = new CrawlerMessage(E10001, "Login succeeded!");
    try {
        webClient = new WebClientTool().getHttpsWebClient();
        HtmlPage home = webClient.getPage("https://supplier.rt-mart.com.cn");

        // Save the CAPTCHA image to disk so it can be read manually.
        // The code takes the 10th <img> element on the page (index 9).
        DomNodeList list = home.getElementsByTagName("img");
        HtmlImage codeImage = (HtmlImage) list.item(9);
        codeImage.saveAs(new File("E:/rtmartCode.jpg"));

        // logger.info(home.asText());
        // Select the region by its numeric index.
        HtmlSelect select = home.getElementByName("area");
        HtmlOption option = select.getOption(Integer.parseInt(otherParam));
        option.setSelected(true);
        // Fill in the user name, password, and CAPTCHA fields.
        HtmlInput idInput = home.getElementByName("userid");
        idInput.setValueAttribute(username);
        HtmlInput pwdInput = home.getElementByName("passwd");
        pwdInput.setValueAttribute(password);
        HtmlInput codeInput = home.getElementByName("checkstr");
        String code = ScannerTool.getCode();
        codeInput.setValueAttribute(code);
        // Alternative: obtain the next page by executing JavaScript
        // ScriptResult res = home.executeJavaScript("act");
        // HtmlPage page2 = (HtmlPage) res.getNewPage();
        // Click the login button to obtain the next page
        HtmlInput loginBut = home.getElementByName("image");
        HtmlPage userMain = loginBut.click();
        // Work around garbled Chinese text returned by the client:
        // re-decode the string as gb2312
        String userMainPage = new String(userMain.asText().getBytes("ISO-8859-1"), "gb2312");
        System.out.println("User Main:" + userMainPage);
        // Treat the login as failed if the page does not contain "登出" ("log out")
        if (!userMainPage.contains("登出")) {
            message.setCode(E10002);
            message.setMessage("Login failed!");
        }
    } catch (IOException e) {
        message.setCode(E10003);
        message.setMessage("Login error!");
    }
    return message;
}
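
The transcoding step in the login code can be hard to follow, so here is a minimal, standalone sketch of the idea using only the JDK. It assumes the page bytes are GB2312 and were wrongly decoded as ISO-8859-1; the class and method names (`TranscodeDemo`, `misdecode`, `repair`) are hypothetical and not part of the original project. The test string is written with Unicode escapes and stands for 登出, the logout marker the login code looks for. Whether this workaround is still needed depends on how the client decodes the response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeDemo {
    // Assumed page encoding; the original code names "gb2312".
    private static final Charset PAGE_CHARSET = Charset.forName("GB2312");

    /** Simulates a client that wrongly decodes GB2312 bytes as ISO-8859-1. */
    public static String misdecode(String original) {
        byte[] raw = original.getBytes(PAGE_CHARSET);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the mistake: recover the raw bytes, then decode them as GB2312. */
    public static String repair(String garbled) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost here.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, PAGE_CHARSET);
    }

    public static void main(String[] args) {
        String logout = "\u767b\u51fa"; // the two-character logout marker
        String garbled = misdecode(logout);
        System.out.println(garbled.equals(logout));         // false
        System.out.println(repair(garbled).equals(logout)); // true
    }
}
```

The round trip is lossless only because ISO-8859-1 covers all 256 byte values; the same trick would not work with a charset that replaces unknown bytes.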

The WebClientTool class:
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;

public class WebClientTool {

    /** Returns a client for plain HTTP sites. */
    public WebClient getHttpClient() {
        return createClient(false);
    }

    /** Returns a client for HTTPS sites that accepts any SSL certificate. */
    public WebClient getHttpsWebClient() {
        return createClient(true);
    }

    // Shared setup; the two public methods differ only in the SSL setting.
    private static WebClient createClient(boolean useInsecureSsl) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        // Do not throw on HTTP error status codes or on JavaScript errors
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        if (useInsecureSsl) {
            webClient.getOptions().setUseInsecureSSL(true);
        }
        return webClient;
    }

}

After a successful login, you can start fetching the target data:
// ----------------------------------- extraction method -----------------------------------
public void showOrder(String path, String date) throws IOException {
    String orderUrl = "https://supplier.rt-mart.com.cn/php/scm_orders_form_1.php?status=1";
    String orderDetailUrl = "https://supplier.rt-mart.com.cn/php/";
    HtmlPage orderMain = webClient.getPage(orderUrl);
    // Same transcoding workaround as in login()
    String orderMainStr = new String(orderMain.asText().getBytes("ISO-8859-1"), "gb2312");
    // Extract the order numbers, which sit between two markers in the page text
    String temp = orderMainStr.substring(orderMainStr.indexOf("免费样机退货)"));
    // 7 is the length of the start marker "免费样机退货)"
    String orderIdList = temp.substring(7, temp.indexOf("腾讯大润发")).trim();
    orderIdList = orderIdList.replaceAll("\t", "").replaceAll("\r\n", "");
    // Build the order-number array: drop the leading '*' and split on '*'
    String[] orderList = orderIdList.substring(1).split("\\*");
    HtmlForm orderForm = orderMain.getFormByName("order");
    DomNodeList orderHrefList = orderForm.getElementsByTagName("a");
    int index = orderHrefList.getLength();
    String saveFile = FileUtil.getSaveFile(path, date, "rtmart", userName, Order.name());
    String href = "";
for(int i=0; i
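
The order-number extraction above relies on a hard-coded marker length and assumes both markers are present. As a rough illustration, the same idea can be sketched with the JDK alone. The class name `OrderIdParser`, the ASCII markers `START` and `END`, and the sample page text are all assumptions made for this sketch; the real page uses the Chinese markers shown in the code above.

```java
import java.util.Arrays;

final class OrderIdParser {
    /**
     * Returns the '*'-separated IDs found between startMarker and endMarker.
     * Returns an empty array if a marker is missing or no IDs are found.
     */
    public static String[] parse(String pageText, String startMarker, String endMarker) {
        int start = pageText.indexOf(startMarker);
        if (start < 0) {
            return new String[0];
        }
        // Use the marker's own length instead of a hard-coded offset.
        int from = start + startMarker.length();
        int end = pageText.indexOf(endMarker, from);
        if (end < 0) {
            return new String[0];
        }
        String body = pageText.substring(from, end)
                .replace("\t", "")
                .replace("\r", "")
                .replace("\n", "")
                .trim();
        // Drop the leading separator so split() does not yield an empty first item.
        if (body.startsWith("*")) {
            body = body.substring(1);
        }
        if (body.isEmpty()) {
            return new String[0];
        }
        return body.split("\\*");
    }

    public static void main(String[] args) {
        // Hypothetical page text; the real layout may differ.
        String page = "header START\r\n*1001\t\r\n*1002\t\r\n*1003\r\nEND footer";
        System.out.println(Arrays.toString(parse(page, "START", "END")));
        // prints [1001, 1002, 1003]
    }
}
```

Returning an empty array when a marker is missing avoids the `StringIndexOutOfBoundsException` that `substring(indexOf(...))` throws when `indexOf` returns -1.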
