HtmlUnit is an open-source Java library that issues HTTP requests and imitates browser behavior. It is mostly used for integration testing on top of unit test frameworks such as JUnit or TestNG: a test requests web pages and asserts on the results.
    @Test
    public void testGoogle() throws Exception { // getPage() can throw IOException
        WebClient webClient = new WebClient();
        HtmlPage currentPage = webClient.getPage("http://www.google.com/");
        assertEquals("Google", currentPage.getTitleText());
    }
As you can see in the example, the WebClient is the starting point. It is the browser simulator.
WebClient.getPage() is just like typing an address in the browser. It returns an HtmlPage object.
HtmlPage represents a single web page along with all of its client-side content (HTML, JavaScript, CSS, ...).
HtmlPage gives you access to much of a web page's content:
You can get the page content as plain text or as XML.
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    String textSource = currentPage.asText(); // the page's visible text
    String xmlSource = currentPage.asXml();   // the page's markup as XML
HtmlPage lets you access any of the page's HTML elements along with all of their attributes and sub-elements. This includes tables, images, input fields, divs, or any other HTML element.
Use the method getHtmlElementById() to get a page element by its id.
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    // Look up the element with id "logo" and print its src attribute
    HtmlImage imgElement = (HtmlImage) currentPage.getHtmlElementById("logo");
    System.out.println(imgElement.getAttribute("src"));
HtmlAnchor is the representation of the HTML tag <a href="...">link</a>.
Use the methods getAnchorByName(), getAnchorByHref() and getAnchorByText() to easily access any of the anchors in the page.
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    HtmlAnchor advancedSearchAn = currentPage.getAnchorByText("Advanced Search");
    // Clicking the anchor returns the page it leads to
    currentPage = advancedSearchAn.click();
    assertEquals("Google Advanced Search", currentPage.getTitleText());
You can access any of the page elements by using XPath (http://www.w3schools.com/xpath/xpath_syntax.asp).
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/search?q=avi");
    // Using XPath to get the first result in Google query
    HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
    DomNode result = element.getChildNodes().get(0);
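To see what an expression such as //h3 selects without depending on a live site, the following sketch evaluates it with the JDK's own javax.xml.xpath API against a small hand-written snippet. This is not HtmlUnit code: the class name, the method name and the sample markup are made up for illustration, and the JDK parser needs well-formed XML, whereas HtmlUnit's getByXPath() works on real HTML pages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathSketch {

    // Hypothetical, well-formed markup standing in for a search results page.
    static final String SAMPLE =
        "<html><body>"
        + "<div><h3><a href='/one'>First result</a></h3></div>"
        + "<div><h3><a href='/two'>Second result</a></h3></div>"
        + "</body></html>";

    // Returns the text content of every node the expression selects, in document order.
    static List<String> textOfAll(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOfAll(SAMPLE, "//h3"));         // [First result, Second result]
        System.out.println(textOfAll(SAMPLE, "//h3/a/@href")); // [/one, /two]
    }
}
```

The expression //h3 matches every h3 element anywhere in the document, which is why the HtmlUnit example takes get(0) to reach the first result.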
A large part of controlling your HTML page is to control the form elements:
HtmlForm
HtmlTextInput
HtmlSubmitInput
HtmlCheckBoxInput
HtmlHiddenInput
HtmlPasswordInput
HtmlRadioButtonInput
HtmlFileInput
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    //Get the query input text
    HtmlInput queryInput = currentPage.getElementByName("q");
    queryInput.setValueAttribute("aviyehuda");
    //Submit the form by pressing the submit button
    HtmlSubmitInput submitBtn = currentPage.getElementByName("btnG");
    currentPage = submitBtn.click();
    // Walk the rows and cells of a table
    currentPage = webClient.getPage("http://www.google.com/search?q=htmlunit");
    final HtmlTable table = currentPage.getHtmlElementById("nav");
    for (final HtmlTableRow row : table.getRows()) {
        System.out.println("Found row");
        for (final HtmlTableCell cell : row.getCells()) {
            System.out.println(" Found cell: " + cell.asText());
        }
    }
HtmlUnit uses the Mozilla Rhino JavaScript engine (http://www.mozilla.org/rhino/).
This lets you load pages that use JavaScript, or even execute JavaScript code on demand.
    // javaScriptCode is a String holding the script to run
    ScriptResult result = currentPage.executeJavaScript(javaScriptCode);
By default, JavaScript errors make your tests fail. If you wish to ignore JavaScript exceptions, use this:
    webClient.setThrowExceptionOnScriptError(false);
If you would like to turn off JavaScript altogether, use this:
    currentPage.getWebClient().setJavaScriptEnabled(false);
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.co.uk/search?q=htmlunit");
    // URL of the request that produced this page
    URL url = currentPage.getWebResponse().getRequestSettings().getUrl();
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    // HTTP status code and status message of the response
    assertEquals(200, currentPage.getWebResponse().getStatusCode());
    assertEquals("OK", currentPage.getWebResponse().getStatusMessage());
    // Print every cookie the client currently holds
    Set<Cookie> cookies = webClient.getCookieManager().getCookies();
    for (Cookie cookie : cookies) {
        System.out.println(cookie.getName() + " = " + cookie.getValue());
    }
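The name and value printed for each cookie originate in a Set-Cookie response header sent by the server. As a rough illustration, the following sketch parses such a header with the JDK's java.net.HttpCookie rather than HtmlUnit's Cookie class; the header value, the class name and the method name are invented for the example.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.Map;

final class CookieSketch {

    // Parses one Set-Cookie header line and returns its cookies as name -> value.
    static Map<String, String> namesAndValues(String setCookieHeader) {
        Map<String, String> result = new LinkedHashMap<>();
        for (HttpCookie cookie : HttpCookie.parse(setCookieHeader)) {
            result.put(cookie.getName(), cookie.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical header; attributes such as Path and HttpOnly are not part of the value.
        namesAndValues("Set-Cookie: NID=abc123; Path=/; HttpOnly")
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Attributes such as Path or HttpOnly describe how the cookie is stored and sent; only the first name=value pair is what getName() and getValue() report.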
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/search?q=htmlunit");
    // Print every response header
    List<NameValuePair> headers = currentPage.getWebResponse().getResponseHeaders();
    for (NameValuePair header : headers) {
        System.out.println(header.getName() + " = " + header.getValue());
    }
    // Print every request parameter
    List<NameValuePair> parameters = currentPage.getWebResponse().getRequestSettings().getRequestParameters();
    for (NameValuePair parameter : parameters) {
        System.out.println(parameter.getName() + " = " + parameter.getValue());
    }
HtmlUnit comes with a set of assertions in its WebAssert class:
assertTitleEquals(HtmlPage, String)
assertTitleContains(HtmlPage, String)
assertTitleMatches(HtmlPage, String)
assertElementPresent(HtmlPage, String)
assertElementPresentByXPath(HtmlPage, String)
assertElementNotPresent(HtmlPage, String)
assertElementNotPresentByXPath(HtmlPage, String)
assertTextPresent(HtmlPage, String)
assertTextPresentInElement(HtmlPage, String, String)
assertTextNotPresent(HtmlPage, String)
assertTextNotPresentInElement(HtmlPage, String, String)
assertLinkPresent(HtmlPage, String)
assertLinkNotPresent(HtmlPage, String)
assertLinkPresentWithText(HtmlPage, String)
assertLinkNotPresentWithText(HtmlPage, String)
assertFormPresent(HtmlPage, String)
assertFormNotPresent(HtmlPage, String)
assertInputPresent(HtmlPage, String)
assertInputNotPresent(HtmlPage, String)
assertInputContainsValue(HtmlPage, String, String)
assertInputDoesNotContainValue(HtmlPage, String, String)
You can of course still use your test framework's own assertions. For example, if you are using JUnit, you can still use assertTrue() and so on.
Here are a few examples:
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/search?q=htmlunit");
    // Response status
    assertEquals(200, currentPage.getWebResponse().getStatusCode());
    assertEquals("OK", currentPage.getWebResponse().getStatusMessage());
    // HtmlUnit's own assertions
    WebAssert.assertTextPresent(currentPage, "htmlunit");
    WebAssert.assertTitleContains(currentPage, "htmlunit");
    WebAssert.assertLinkPresentWithText(currentPage, "Advanced search");
    // Framework assertions
    assertTrue(currentPage.getByXPath("//h3").size() > 0); //result number
    assertNotNull(webClient.getCookieManager().getCookie("NID"));
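The difference between a "contains" check and a "matches" check on a title is easy to trip over. The sketch below uses only plain java.lang.String methods, not WebAssert itself, and it assumes that a regular-expression check has to match the whole title the way String.matches() does. The class name, method names and sample title are made up for illustration.

```java
final class TitleCheckSketch {

    // True when the fragment appears anywhere in the title.
    static boolean titleContains(String title, String fragment) {
        return title.contains(fragment);
    }

    // True only when the regular expression matches the entire title.
    static boolean titleMatches(String title, String regex) {
        return title.matches(regex);
    }

    public static void main(String[] args) {
        String title = "htmlunit - Google Search"; // hypothetical page title
        System.out.println(titleContains(title, "htmlunit"));  // true
        System.out.println(titleMatches(title, "htmlunit"));   // false: not a full match
        System.out.println(titleMatches(title, "htmlunit.*")); // true
    }
}
```

Under that assumption, a pattern passed to a regex-based title assertion usually needs a trailing .* (or a leading one) when only part of the title is known.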
    // A fuller example: fetch a page through a proxy with custom headers and cookies
    String url = "http://outofmemory.cn/"; // the URL to fetch
    String refer = "http://outofmemory.cn/";
    URL link = new URL(url);
    WebClient wc = new WebClient();
    WebRequest request = new WebRequest(link);
    request.setCharset("UTF-8");
    request.setProxyHost("120.120.120.x");
    request.setProxyPort(8080);
    request.setAdditionalHeader("Referer", refer); // set the Referer request header
    // Set the User-Agent request header
    request.setAdditionalHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
    //wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
    // wc.addRequestHeader and request.setAdditionalHeader should do the same thing; pick one.
    // Add other request headers as needed.
    wc.getCookieManager().setCookiesEnabled(true); // enable cookie management
    wc.getOptions().setJavaScriptEnabled(true);    // enable JavaScript; required for script-heavy pages
    wc.getOptions().setCssEnabled(true);           // enable CSS; required for some awkward pages
    wc.getOptions().setThrowExceptionOnFailingStatusCode(false);
    wc.getOptions().setThrowExceptionOnScriptError(false);
    wc.getOptions().setTimeout(10000);
    // Set cookies. If you already have cookies, add them to this set.
    // (The original initialized it to null, which throws a NullPointerException.)
    Set<Cookie> cookies = new HashSet<Cookie>();
    for (Cookie cookie : cookies) {
        wc.getCookieManager().addCookie(cookie);
    }
    // Setup is done; fetch the page
    HtmlPage page = wc.getPage(request);
    if (page == null) {
        System.out.println("Fetching " + url + " failed!");
        return;
    }
    String content = page.asText(); // the page content is stored in content
    if (content == null) {
        System.out.println("Fetching " + url + " failed!");
        return;
    }
    // Done
    CookieManager CM = wc.getCookieManager(); //WC = Your WebClient's name
    // The returned cookies are here; they may be useful for the next request.
    Set<Cookie> cookies_ret = CM.getCookies();