Big Data Study Notes: A Java Web Crawler and Its Key Code

General steps:

  • private static List<Films> films=new ArrayList<Films>();
  • List<Films> films=GetData.getData(client, url); (in the Test class)
    public static List<Films> getData(HttpClient client,String url)throws Exception{
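  1. Obtain an HTTP client by building an HttpClient object. (You can think of it as needing a browser first; note that HttpClient is not actually the same thing as a browser.)
    (in the Test class)
    HttpClientBuilder builder=HttpClientBuilder.create();
    HttpClient client=builder.build();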
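  2. Create a GET request by constructing an HttpGet object.
    HttpGet get=new HttpGet(url);
  3. Set the User-Agent request header to a browser identification string, so that the request looks like it comes from a browser.
    get.setHeader("User-Agent",userAgent);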
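  4. An HTTP response is the message the server sends back to the client after receiving and interpreting the request.
    Its first line contains the protocol version, followed by a numeric status code and the associated text phrase.
    HttpResponse response=new BasicHttpResponse(HttpVersion.HTTP_1_1,HttpStatus.SC_OK,"ok");
    (This placeholder object is optional, because step 5 replaces it with the real response.)
  5. Call execute(HttpUriRequest request) on the HttpClient object to send the request (that is, execute the HttpGet); the method returns an HttpResponse.
    response=client.execute(get);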
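  6. Read the status code of the HTTP response. 200 means the request succeeded; any other code, such as 500 (server error), is treated as a failure here.
    int status=response.getStatusLine().getStatusCode();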
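  7. If status==200:
    7.1 Get the returned entity.
    HttpEntity httpEntity=response.getEntity();
    7.2 Convert the entity to a string.
    String html=EntityUtils.toString(httpEntity,"utf-8");
    7.3 Parse the HTML document.
    Document doc = Jsoup.parse(html);
    7.4 Use an Elements object to find elements in the HTML document.
    Elements elements=doc.select("CSS selector for the target elements");
    7.5 Films film=null;
    7.6 for(Element element : elements) (iterate over the matched elements)
    7.6.1 film=new Films();
    7.6.2 Call the film object's set... methods to store the values extracted from the HTML document.
    7.6.3 films.add(film);
    7.7 Get the URL of the next page and repeat the steps above.
    7.8 Release the resources held by the response.
    EntityUtils.consume(response.getEntity());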
  8. If status!=200:
    Release the resources held by the response.
    EntityUtils.consume(response.getEntity());
  9. Return the films list.
    }
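
The control flow of steps 5 to 9 (fetch a page, collect its items, follow the next link, stop when there is none) can be sketched without any HTTP or HTML library. This is only an illustration under stated assumptions: CrawlLoopSketch, Page, PageSource, and crawlAll are hypothetical names that do not appear in these notes, the in-memory map stands in for real HTTP responses, and a loop replaces the recursive call so the results live in a local list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CrawlLoopSketch {
    // Hypothetical stand-in for one parsed page: the items found on it plus the
    // relative link to the next page ("" when there is no next page).
    static final class Page {
        final List<String> items;
        final String nextHref;

        Page(List<String> items, String nextHref) {
            this.items = items;
            this.nextHref = nextHref;
        }
    }

    // Hypothetical abstraction over "send the GET, check the status, parse the HTML".
    interface PageSource {
        Page fetch(String url); // returns null when the status is not 200
    }

    static List<String> crawlAll(PageSource source, String baseUrl) {
        List<String> results = new ArrayList<>();
        String url = baseUrl;
        while (url != null) {
            Page page = source.fetch(url);
            if (page == null) {
                break; // non-200 response: stop and return what was collected so far
            }
            results.addAll(page.items);
            // An empty link means this was the last page.
            url = page.nextHref.isEmpty() ? null : baseUrl + page.nextHref;
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up two-page site used only to exercise the loop.
        Map<String, Page> fakeSite = Map.of(
            "https://example.org/top", new Page(List.of("film A", "film B"), "?start=2"),
            "https://example.org/top?start=2", new Page(List.of("film C"), ""));
        System.out.println(crawlAll(fakeSite::get, "https://example.org/top"));
    }
}
```

Keeping the list local means that calling crawlAll twice does not mix the results of the two runs, which a shared static list would do.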

Example: crawling the Douban Top 250 movie data (be sure to create a Maven web project, and create a file named 123.txt inside the project)

Code:
Class Films:

package come.sun.bean;
public class Films {
	// Rank of the film in the Top 250 list.
	private Integer location;
	// Film title.
	private String fname;
	// Rating score.
	private double score;
	public Integer getLocation() {
		return location;
	}
	public void setLocation(Integer location) {
		this.location = location;
	}
	public String getFname() {
		return fname;
	}
	public void setFname(String fname) {
		this.fname = fname;
	}
	public double getScore() {
		return score;
	}
	public void setScore(double score) {
		this.score = score;
	}
	@Override
	public String toString(){
		return "Films [location="+location+",fname="+fname+",score="+score+"]";
	}
}

Class GetData:

package come.sun.get;

import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import come.sun.bean.Films;

public class GetData {
	// Shared across the recursive calls so that every page appends to the same list.
	private static List<Films> films=new ArrayList<Films>();

	public static List<Films> getData(HttpClient client,String url)throws Exception{
		String userAgent="Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2840.87 Safari/537.36";
		HttpGet get=new HttpGet(url);
		// The HTTP header name is "User-Agent".
		get.setHeader("User-Agent",userAgent);
		// execute() returns the response, so no placeholder object is needed.
		HttpResponse response=client.execute(get);
		int status=response.getStatusLine().getStatusCode();
		if(status==200){
			// Read the body as text, falling back to UTF-8 when the response declares no charset.
			String html=EntityUtils.toString(response.getEntity(),"UTF-8");
			Document doc=Jsoup.parse(html);
			// Each film on the page sits in a <div class="item">.
			Elements elements=doc.select("div[class=item]");
			Films film=null;
			for(Element element : elements){
				film=new Films();
				film.setLocation(Integer.valueOf(element.select("div[class=pic]").select("em").text()));
				film.setFname(element.select("div[class=hd]").select("span").eq(0).text());
				film.setScore(Double.valueOf(element.select("div[class=star]").select("span[class=rating_num]").text()));
				films.add(film);
			}
			// The href of the "next" link is relative; attr() returns "" when no link matches.
			String nexturl=doc.select("div[class=paginator]").select("span[class=next]").select("a").attr("href");
			System.out.println(nexturl);
			nexturl="https://movie.douban.com/top250"+nexturl;
			// Only recurse when a query string was appended, i.e. a next page exists.
			if(nexturl.indexOf("?")!=-1){
				getData(client, nexturl);
			}
		}
		// Release the resources held by the response in both the success and failure case.
		EntityUtils.consume(response.getEntity());
		return films;
	}
}
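
One fragile spot in the loop above is the direct use of Integer.valueOf and Double.valueOf: when a selector matches nothing, jsoup's text() returns an empty string and both calls throw NumberFormatException. A minimal sketch of a more defensive conversion follows. ParseSketch, parseRank, and parseScore are hypothetical helper names, not part of the original code, and returning null or a fallback value is just one possible policy.

```java
final class ParseSketch {
    // Returns the rank as an Integer, or null when the text is blank or not a whole number.
    // The text must not be null (jsoup's text() never returns null).
    static Integer parseRank(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Returns the score, or the given fallback when the text is blank or not numeric.
    static double parseScore(String text, double fallback) {
        try {
            return Double.parseDouble(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRank("1"));
        System.out.println(parseRank(""));
        System.out.println(parseScore("9.7", 0.0));
        System.out.println(parseScore("", 0.0));
    }
}
```

With helpers like these, a film whose rank or score cannot be read could be skipped or logged instead of aborting the whole crawl.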

Class Test:

package com.sun.test;

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.List;

import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.client.HttpClient;

import come.sun.bean.Films;
import come.sun.get.GetData;

public class Test {
	public static void main(String[] args)throws Exception{
		String url="https://movie.douban.com/top250";
		// Build the HTTP client (step 1).
		HttpClientBuilder builder=HttpClientBuilder.create();
		HttpClient client=builder.build();
		// Typed list, so the loop below can iterate over Films directly.
		List<Films> films=GetData.getData(client, url);
		// Append one line per film to 123.txt (the second argument, true, selects append mode).
		File file=new File("123.txt");
		Writer writer=new FileWriter(file,true);
		for(Films film : films){
			writer.write(film.toString()+"\n");
		}
		writer.close();
	}
}
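
The file-writing part of Test can also be written with try-with-resources, so the writer is closed even when a write fails, and with an explicit charset. The following is a sketch rather than the original author's code: WriteSketch, appendLines, and writeAndReadBack are hypothetical names, the sample line is made up, and a temporary file is used in place of 123.txt.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

final class WriteSketch {
    // Appends one line per entry; the file is created if it does not exist yet.
    static void appendLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        } // the writer is closed here, even if write() threw
    }

    // Demo helper: writes the lines to a fresh temporary file and reads them back.
    static List<String> writeAndReadBack(List<String> lines) {
        try {
            Path file = Files.createTempFile("films", ".txt");
            appendLines(file, lines);
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample line in the format produced by Films.toString().
        System.out.println(writeAndReadBack(List.of("Films [location=1,fname=sample,score=9.0]")));
    }
}
```

In the Test class, the same idea would mean building the list of lines from films with toString() and passing it to a method like appendLines together with the path of 123.txt.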
