A Simple Web Crawler in Java

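The principle is simple: use Apache HttpClient to request a web page and read its content, then extract the desired result by matching a start marker and an end marker in the returned HTML.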
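The extraction step can be sketched with nothing but the JDK. This is a minimal illustration, not the post's own code: the class name TextBetween and the method between are hypothetical, and the sketch assumes that only the first match is wanted.

```java
// Hypothetical helper (not from the original post): returns the text found
// between the first occurrence of a start marker and the next end marker.
final class TextBetween {

    private TextBetween() {
    }

    /**
     * @param html  the page content to search
     * @param start the marker that precedes the wanted text
     * @param end   the marker that follows the wanted text
     * @return the text between the markers, or null if either marker is missing
     */
    static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return null; // start marker not found
        }
        from += start.length(); // skip past the start marker itself
        int to = html.indexOf(end, from); // search for the end marker after the start
        if (to < 0) {
            return null; // end marker not found
        }
        return html.substring(from, to);
    }
}
```

For the input <ul><li>first</li><li>second</li></ul> with markers <li> and </li>, between returns "first". StringUtils.substringBetween from Commons Lang, which the post imports, does a similar job.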

 

First, import the required classes from the Apache Commons Lang and Apache HttpClient libraries:

import org.apache.commons.lang.StringUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;
import org.apache.http.client.HttpResponseException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

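Next, set the URL and the HTML fragments that mark the start and end of the content to extract. You can find these fragments by viewing the page source.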

Note: the URL must begin with http://; otherwise the request fails with a "Target host is not specified" error.
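One way to guard against this error is to check the scheme before building the request. The sketch below is only an illustration: the class UrlCheck and its methods are hypothetical names, and it assumes that https:// is as acceptable as http://.

```java
import java.util.Locale;

// Hypothetical helper (not from the original post): makes sure a URL carries
// an explicit http:// or https:// prefix before it is handed to the HTTP client.
final class UrlCheck {

    private UrlCheck() {
    }

    /** Returns true if the URL starts with http:// or https:// (case-insensitive). */
    static boolean hasHttpScheme(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    /** Prepends http:// when no scheme is present; otherwise returns the URL unchanged. */
    static String withHttpScheme(String url) {
        return hasHttpScheme(url) ? url : "http://" + url;
    }
}
```

For example, withHttpScheme("xg.xgsggzy.com/website") yields "http://xg.xgsggzy.com/website", while a URL that already starts with http:// is returned as it is.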

private final String url = "http://xg.xgsggzy.com/website//ztbinfo/showlist.aspx?xiaqucode=420901";
private final String txtStart = "";
private final String txtEnd = "
