Java Crawler Example

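Notes to myself, summarizing something newly learned.
I recently needed to crawl posts and replies from Baidu Tieba, so I learned how to write a crawler in Java to fetch the data.
Let's go straight to the code!
If you only need to fetch the page source, the httpclient.jar package alone is enough; if you also want to parse it, you need to add the jsoup.jar package. I'll write about parsing later when I have time.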

1. Add the Maven dependency


<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.1</version>
    </dependency>
</dependencies>

2. Create the class
The code is as follows:

package com.myself.crawl;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

/*
 * @author sjia
 * @Date April 14, 2017, 12:40:50 PM
 */
public class HttpGetUtils {
    public static void main(String[] args) {
        String str=get("http://www.baidu.com");
        System.out.println(str);
    }
    /**
     * Sends a GET request and returns the response body as a string.
     * @param url the URL to fetch
     * @return the response body, or an empty string if the request fails
     */
    public static String get(String url) {
        String result = "";
        // Build the GET request
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes the response first, then the client,
        // even when execute() throws
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpGet)) {
            // Only read the body when the server answers 200 OK
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                // Read the result from the entity's input stream
                result = readResponse(entity, "utf-8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }
    /**
     * Reads the response body from the entity's stream using the given charset.
     * @param resEntity the response entity, may be null
     * @param charset the charset name used to decode the body, e.g. "utf-8"
     * @return the body as a string, or an empty string if there is no entity
     */
    private static String readResponse(HttpEntity resEntity, String charset) {
        if (resEntity == null) {
            return "";
        }
        StringBuilder res = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                resEntity.getContent(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add it back
                res.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return res.toString();
    }

}
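
The get method above always decodes the body as "utf-8", but readResponse accepts any charset name, and some pages are served in another encoding such as GBK. One possible way to choose the charset is to read it from the response's Content-Type header value. The following is only a rough sketch using the JDK alone: the class name CharsetSniffer and the method fromContentType are hypothetical names made up for illustration, not part of HttpClient, and the UTF-8 fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of HttpClient.
public class CharsetSniffer {

    /**
     * Extracts the charset from a Content-Type header value such as
     * "text/html; charset=GBK". Falls back to UTF-8 (an assumption)
     * when the parameter is missing or names an unknown charset.
     */
    public static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        // Parameters follow the media type, separated by ';'
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=GBK"));
        System.out.println(fromContentType("text/html"));
    }
}
```

With HttpClient 4.x, the header value could probably be taken from entity.getContentType().getValue() (after a null check), and the resulting charset's name() passed to readResponse in place of the fixed "utf-8".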

3. Test
Run the main method. Using Baidu as the example, the output looks like this:
[Screenshot: output]

And that's it!
