A Java web crawler starter example: fetching the HTML source of the Baidu homepage

I have recently been teaching myself web crawling, using the book 《自己动手写网络爬虫》 (Write Your Own Web Crawler) as the textbook and Eclipse as my IDE. The introductory example in the book seemed to have some problems, so I followed the HttpClient 3.1 documentation instead and fetched the HTML of the Baidu homepage. It is very much a beginner exercise, but I would like to share the code and a few notes.

HttpClient 3.1 download address: http://archive.apache.org/dist/httpcomponents/commons-httpclient/3.0/

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.*;

public class HttpClientTutorial {

	private static String url = "https://www.baidu.com/";
  
	public static void main(String[] args) {
		
		// Create an instance of HttpClient.
		HttpClient client = new HttpClient();
		
		// Create a method instance.
		GetMethod method = new GetMethod(url);
		
		// Provide a custom retry handler if necessary.
		method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler(3, false));
		
		try {
			// Execute the method.
			int statusCode = client.executeMethod(method);
			
			if (statusCode != HttpStatus.SC_OK) {
			System.err.println("Method failed: " + method.getStatusLine());
			}
			
			// Read the response body.
			byte[] responseBody = method.getResponseBody();
			
			// Deal with the response.
			// Use caution: ensure the correct character encoding and that the data is not binary.
			System.out.println(new String(responseBody));
		} catch (HttpException e) {
			System.err.println("Fatal protocol violation: " + e.getMessage());
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("Fatal transport error: " + e.getMessage());
			e.printStackTrace();
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
	}
}

Run result: the console prints the HTML source of the Baidu homepage.


If the Chinese characters in the output come out garbled, you can set the text file encoding in Eclipse under Window > Preferences > General > Workspace (screenshot below); a more robust fix is to decode the response with the charset the server reports, as in the sketch after the screenshot:

(Screenshot: Eclipse Window > Preferences > General > Workspace encoding setting)
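As a minimal sketch of that alternative: if I read the HttpClient 3.x API correctly, getResponseBodyAsString() decodes the body using the charset from the response's Content-Type header (falling back to ISO-8859-1), so the result does not depend on the platform or workspace default encoding. Inside the try block of the example above, the two response-handling lines could be replaced with:

// Decode using the charset reported in the Content-Type header
// (Baidu sends utf-8), independent of the platform default encoding.
String html = method.getResponseBodyAsString();
System.out.println(html);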

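Separately, getResponseBody() buffers the whole response in memory, and HttpClient 3.x warns about this for large or unknown-size bodies; the documentation recommends getResponseBodyAsStream() in that case. Below is a small self-contained variation along those lines; the class name HttpClientStreamDemo and the line-by-line printing are illustrative choices of mine, not part of the original example.

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;

import java.io.*;

public class HttpClientStreamDemo {

	public static void main(String[] args) {
		HttpClient client = new HttpClient();
		GetMethod method = new GetMethod("https://www.baidu.com/");
		try {
			int statusCode = client.executeMethod(method);
			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + method.getStatusLine());
			}
			// Stream the body instead of buffering it all in memory.
			InputStream in = method.getResponseBodyAsStream();
			if (in != null) {
				// Decode with the charset the server reported (defaults to ISO-8859-1).
				BufferedReader reader = new BufferedReader(
						new InputStreamReader(in, method.getResponseCharSet()));
				String line;
				while ((line = reader.readLine()) != null) {
					System.out.println(line);
				}
				reader.close();
			}
		} catch (IOException e) {
			// HttpException is a subclass of IOException, so both are handled here.
			System.err.println("Fatal error: " + e.getMessage());
			e.printStackTrace();
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
	}
}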
References:

http://hc.apache.org/httpclient-legacy/tutorial.html 

https://zhidao.baidu.com/question/625740030659487804.html 

