Web Crawlers

1. The HtmlUnit approach

WebClient wc = new WebClient(BrowserVersion.FIREFOX_31); // emulate a browser engine
wc.getOptions().setJavaScriptEnabled(true);
wc.getOptions().setCssEnabled(true);
wc.getOptions().setThrowExceptionOnFailingStatusCode(false);
wc.getOptions().setThrowExceptionOnScriptError(false);
wc.getOptions().setTimeout(10000);
wc.getOptions().setRedirectEnabled(true);
HtmlPage hp = wc.getPage(url);
System.out.println("Sleeping so the JavaScript-generated data has time to load");
Thread.sleep(4000); // the wait is the key step: the page's JS also needs time to execute
System.out.println("Done sleeping");
String pageText = hp.asText(); // visible text of the rendered page (asXml() returns the markup)

For a one-off scrape, letting the JS finish by sleeping the thread is acceptable. If you scrape repeatedly, or the JS load time is hard to predict, this approach is not advisable!
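A fixed Thread.sleep is brittle when the JavaScript load time varies. A more robust option, sketched below and assuming a reasonably recent HtmlUnit 2.x (verify the API against the version you actually use), is to let HtmlUnit itself wait for background JavaScript:

// tell HtmlUnit to resynchronize AJAX calls instead of relying on Thread.sleep
wc.setAjaxController(new NicelyResynchronizingAjaxController());
// after wc.getPage(url): block up to 10 s for background JS jobs;
// the return value is the number of jobs still pending when the wait ends
int pending = wc.waitForBackgroundJavaScript(10000);
if (pending > 0) {
    System.out.println(pending + " background JavaScript jobs did not finish in time");
}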

2. The HttpClient approach

HttpClient httpClient = new DefaultHttpClient();
HttpGet get = new HttpGet(urls);
get.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
get.addHeader("Accept-Encoding", "gzip, deflate, sdch");
get.addHeader("Accept-Language", "zh-CN,zh;q=0.8");
get.addHeader("Cache-Control", "max-age=0");
get.addHeader("Connection", "keep-alive");
get.addHeader("Cookie", "Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1456796413,1456796523,1456796663,1456800833; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1456800879");
get.addHeader("Host", "www.tianyancha.com");
get.addHeader("Upgrade-Insecure-Requests", "1");
get.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"); // the request fails if this header is removed
HttpResponse response = httpClient.execute(get);
int statusCode = response.getStatusLine().getStatusCode(); // read the status code rather than setting it
HttpEntity entity = response.getEntity();
if (statusCode == 200 && entity != null) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent()));
    StringBuilder body = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null) {
        body.append(line).append('\n'); // collect the response body instead of discarding it
    }
    reader.close();
}
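If the whole body is only needed as a String, a simpler variant (a sketch using org.apache.http.util.EntityUtils, which ships with HttpClient 4.x) avoids the manual reader loop. Also note that DefaultHttpClient does not transparently decompress gzip responses, so if the server honors the Accept-Encoding: gzip header you may need to wrap the stream in a GZIPInputStream or drop that header.

// reads and closes the entity in one call; "UTF-8" here is an assumed fallback charset
String html = EntityUtils.toString(entity, "UTF-8");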

To scrape content loaded via AJAX, open the page in a browser and use the F12 developer tools to see which requests actually produce the data, then set the request headers accordingly.
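In practice that means finding the XHR request in the Network tab and replaying it with the same headers. A minimal sketch follows; the endpoint URL and header values are placeholders copied-by-hand from DevTools, not a real API:

HttpGet ajaxGet = new HttpGet("http://www.example.com/api/search?key=foo"); // hypothetical AJAX endpoint taken from the Network tab
ajaxGet.addHeader("X-Requested-With", "XMLHttpRequest"); // many sites check this marker on AJAX calls
ajaxGet.addHeader("Referer", "http://www.example.com/search");
ajaxGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36");
HttpResponse ajaxResponse = httpClient.execute(ajaxGet);
String json = EntityUtils.toString(ajaxResponse.getEntity(), "UTF-8"); // AJAX endpoints usually return JSON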
