Three Ways to Write a Java Web Crawler

Table of Contents

  • Preface
  • 1. JDK
  • 2. HttpClient
  • 3. Jsoup
  • Summary


Preface

This post records three ways to implement a web crawler in Java.


1. JDK

Use the JDK's built-in URLConnection to implement a crawler. A GET request with HttpURLConnection looks like this:

public void testGet() throws Exception {
        // 1. Build the URL to visit/crawl
        URL url = new URL("https://blog.csdn.net/weixin_40298650/article/details/118490147?spm=1001.2014.3001.5501");

        // 2. Open the connection
        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();

        // 3. Configure the connection: request method, headers, timeouts...
        urlConnection.setRequestMethod("GET"); // GET is the default; the method name must be upper case
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36");
        urlConnection.setConnectTimeout(30000); // connect timeout in milliseconds

        // 4. Read the response body
        InputStream in = urlConnection.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append("\n");
        }
        System.out.println(html);

        // 5. Release resources (closing the reader also closes the underlying stream)
        reader.close();
    }
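In a real crawler it is safer to close the stream with try-with-resources, so it is released even if an exception is thrown mid-read. A minimal sketch of the same GET under that pattern (https://example.com/ is a placeholder URL, not part of the original example):

HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/").openConnection();
conn.setRequestMethod("GET");
conn.setConnectTimeout(30000);
conn.setReadTimeout(30000); // also bound how long a single read may block

StringBuilder html = new StringBuilder();
// The reader (and the stream underneath it) is closed automatically when the block exits
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        html.append(line).append('\n');
    }
}
System.out.println(html);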

2. HttpClient

Use Apache HttpClient to implement a crawler. The example below uses the HttpClient 4.x API (the org.apache.httpcomponents:httpclient artifact); note that response.getStatusLine() no longer exists in HttpClient 5.

public void testGet() throws Exception {
        // 1. Create the HttpClient instance
        // DefaultHttpClient is the legacy client, deprecated since 4.3:
        // DefaultHttpClient httpClient = new DefaultHttpClient();
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the GET request and set its headers
        HttpGet httpGet = new HttpGet("https://blog.csdn.net/weixin_40298650/article/details/118490147?spm=1001.2014.3001.5501");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36");

        // 3. Execute the request
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 4. Check the status code and read the response body
        if (response.getStatusLine().getStatusCode() == 200) { // 200 means success
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println(html);
        }

        // 5. Release resources: close the response before the client
        response.close();
        httpClient.close();
    }
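As with the JDK example, try-with-resources makes the cleanup automatic: both CloseableHttpClient and CloseableHttpResponse implement Closeable, so they are closed in reverse order when the blocks exit. A minimal sketch, assuming HttpClient 4.5+ and a placeholder URL:

try (CloseableHttpClient client = HttpClients.createDefault()) {
    HttpGet get = new HttpGet("https://example.com/"); // placeholder URL
    get.setHeader("User-Agent", "Mozilla/5.0");
    try (CloseableHttpResponse resp = client.execute(get)) {
        if (resp.getStatusLine().getStatusCode() == 200) {
            System.out.println(EntityUtils.toString(resp.getEntity(), "UTF-8"));
        } else {
            EntityUtils.consume(resp.getEntity()); // drain the body so the connection can be reused
        }
    }
}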

3. Jsoup

Use Jsoup to parse the page. Jsoup can fetch and parse a document straight from a URL, or parse HTML you already have in a file or a string:

public void testGetDocument() throws Exception {
        // Fetch and parse in one call (equivalent to the line below):
        // Document doc = Jsoup.connect("https://blog.csdn.net/weixin_40298650/article/details/118490147?spm=1001.2014.3001.5501").get();
        Document doc = Jsoup.parse(new URL("https://blog.csdn.net/weixin_40298650/article/details/118490147?spm=1001.2014.3001.5501"), 1000); // timeout in milliseconds
        // Parse a local file instead:
        // Document doc = Jsoup.parse(new File("jsoup.html"), "UTF-8");
        // Or parse an HTML string (read here with Commons IO's FileUtils):
        // String htmlStr = FileUtils.readFileToString(new File("jsoup.html"), "UTF-8");
        // Document doc = Jsoup.parse(htmlStr);
        System.out.println(doc);

        // Extract the <title> element and print its text
        Element titleElement = doc.getElementsByTag("title").first();
        String title = titleElement.text();
        System.out.println(title);
    }
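Jsoup's real strength is extraction with CSS selectors. A minimal sketch that pulls every link out of the fetched document (the selector is generic, not specific to the page above):

// Select every <a> tag that has an href attribute
for (Element link : doc.select("a[href]")) {
    // "abs:href" resolves relative links against the document's base URL
    System.out.println(link.text() + " -> " + link.attr("abs:href"));
}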

Summary

This post is only a brief introduction to crawling pages in Java, meant as a starting point for further exploration.
