Implementing a Simple Web Crawler in Java

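Preface:

This article is an exercise I wrote after reading an article about Java crawlers by 团长. The code crawls data from the JD.com (京东) website and parses it.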

The program structure is shown below:
[Figure 1: program structure diagram]

Note: the comments in the code already explain it clearly, so I will not go into more detail here.

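JdongMain is the program entry point. JdongBook models a book sold on JD.com. URLHandle takes the URL and the client and returns the processed data. HTTPUtils sends the actual HTTP request and returns the response message. JdParse parses the entity body of the response.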
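The division of work described above can be sketched without any third-party library. This is only an illustration under stated assumptions: the class CrawlerFlowSketch, its methods, and the injected fetcher function are hypothetical stand-ins, and the regex replaces the jsoup-based parsing that the real code uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the flow: handler -> fetcher -> parser -> result list.
class CrawlerFlowSketch {

    // Stand-in for JdParse: collects every numeric data-sku attribute value.
    // (The article's code uses jsoup selectors instead of a regex.)
    static List<String> parseSkus(String html) {
        List<String> skus = new ArrayList<>();
        Matcher m = Pattern.compile("data-sku=\"(\\d+)\"").matcher(html);
        while (m.find()) {
            skus.add(m.group(1));
        }
        return skus;
    }

    // Stand-in for URLHandle: asks the fetcher (the HTTPUtils role) for the
    // page, then hands the HTML to the parser. A null page yields an empty list.
    static List<String> crawl(Function<String, String> fetcher, String url) {
        String html = fetcher.apply(url);
        return html == null ? new ArrayList<>() : parseSkus(html);
    }
}
```

Because the fetcher is passed in, the parsing step can be tried against a canned HTML string without touching the network, for example CrawlerFlowSketch.crawl(u -> "<li data-sku=\"123\"></li>", "http://example.invalid").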

Code:

1. JdongMain.java

package main;

import java.io.IOException;
import java.util.List;

import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;

import model.JdongBook;
import util.URLHandle;

/**
 * Program entry point: creates the client and sends the request to the server.
 * @author 康茜
 *
 */
public class JdongMain {
	public static void main(String[] args) {
		// Create a client; it sends a request for the URL to the server and receives the response
		HttpClient client = new DefaultHttpClient();
		String url = "http://search.jd.com/Search?keyword=Python&enc=utf-8&book=y&wq=Python&pvid=33xo9lni.p4a1qb";
		try {
			// Print inside the try block so a failed request cannot leave a null list to iterate over
			List<JdongBook> bookList = URLHandle.urlParser(client, url);
			for (JdongBook book : bookList) {
				System.out.println(book);
			}
		} catch (ParseException | IOException e) {
			e.printStackTrace();
		}
	}
}

2. URLHandle.java

package util;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.util.EntityUtils;

import model.JdongBook;
import parse.JdParse;

/**
 * Processes the data returned for a request, given a URL and a client.
 * @author 康茜
 *
 */
public class URLHandle {
	/**
	 * 
	 * @param client the HTTP client
	 * @param url the request URL
	 * @return the parsed data as a List of JdongBook; empty if the status code is not 200
	 * @throws ParseException
	 * @throws IOException
	 */
	public static List<JdongBook> urlParser(HttpClient client, String url) throws ParseException, IOException {
		List<JdongBook> data = new ArrayList<>();
		
		// Fetch the response
		HttpResponse response = HTTPUtils.getHtml(client, url);
		// Read the status code of the response
		int statusCode = response.getStatusLine().getStatusCode();
		if (statusCode == 200) { // 200 means success
			// Read the response entity and decode it as a UTF-8 string
			String entity = EntityUtils.toString(response.getEntity(), "utf-8");
			data = JdParse.getData(entity);
		} else {
			EntityUtils.consume(response.getEntity()); // release the entity's resources
		}
		return data;
	}
}

3. HTTPUtils.java

package util;

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;

public class HTTPUtils {
	public static HttpResponse getHtml(HttpClient client, String url) throws IOException {
		// Fetch the response document (the HTML) with a GET request
		HttpGet getMethod = new HttpGet(url);
		// Execute the GET through the client. An IOException is passed on to the caller
		// rather than swallowed, so a failed request is never reported as a "200 OK" placeholder.
		return client.execute(getMethod);
	}
}

4. JdParse.java

package parse;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import model.JdongBook;

public class JdParse {
	/**
	 * Extracts the data the program needs from the entity.
	 * @param entity the entity body of the HTTP response
	 * @return the books found on the page
	 */
	public static List<JdongBook> getData(String entity) {
		List<JdongBook> data = new ArrayList<>();
		// Parse with jsoup; for jsoup usage, see the summary below
		Document doc = Jsoup.parse(entity);
		
		// Select the elements we need, based on an inspection of the page markup
		Elements elements = doc.select("ul[class=gl-warp clearfix]").select("li[class=gl-item]");
		for (Element element : elements) {
			JdongBook book = new JdongBook();
			book.setBookId(element.attr("data-sku"));
			book.setBookName(element.select("div[class=p-name p-name-type-2]").select("em").text());
			book.setBookPrice(element.select("div[class=p-price]").select("strong").select("i").text());
			
			data.add(book);
		}
		return data;
	}
}

5. JdongBook.java

package model;

public class JdongBook {
	private String bookId;
	private String bookName;
	private String bookPrice;

	public JdongBook() {
	}

	public String getBookId() {
		return bookId;
	}

	public void setBookId(String bookId) {
		this.bookId = bookId;
	}

	public String getBookName() {
		return bookName;
	}

	public void setBookName(String bookName) {
		this.bookName = bookName;
	}

	public String getBookPrice() {
		return bookPrice;
	}

	public void setBookPrice(String bookPrice) {
		this.bookPrice = bookPrice;
	}

	@Override
	public String toString() {
		return "Book [bookId=" + bookId + ", bookName=" + bookName + ", bookPrice=" + bookPrice + "]";
	}
}

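Summary:

1. Through this exercise I learned how HttpClient, HttpResponse, and HttpGet relate to each other and how to use them together.

2. Basic usage of jsoup for parsing HTML data: http://www.open-open.com/jsoup/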
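As a rough companion to point 1, the same three roles (a request object, a client that executes it, and a response that carries a status code and a body) can be sketched with the JDK's built-in java.net.http API (Java 11 and later). Note that this is a substitute library, not the Apache HttpClient used in this article; the class name JdkHttpSketch and its two methods are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch using the JDK HTTP client in place of Apache HttpClient.
class JdkHttpSketch {

    // Builds the GET request object. Nothing is sent yet; this is the role
    // HttpGet plays in the article's code.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // The client executes the request and returns a response. The body is
    // returned only for status 200; any other status yields an empty string.
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildGet(url), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 ? response.body() : "";
    }
}
```

A call would look like JdkHttpSketch.fetch(HttpClient.newHttpClient(), url). Separating buildGet from fetch mirrors the split between HttpGet and HttpClient: the request is just a description until a client sends it.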
