[Technical Application] Using Java to Crawl Official Account Article Content from a URL

    • I. Preface
    • II. Approach
    • III. Crawler Tools
    • IV. Code Implementation
      • 1. Crawling official account articles
      • 2. Crawling CSDN articles
    • V. Summary

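I. Preface

When you send a message in WeChat or DingTalk, a URL in the message is automatically converted into a card, which makes the message easier to read and is a good user experience. An application we have been developing over the past couple of days needs the same feature. Because of front-end limitations, it has to rely on a back-end API to implement it.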

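II. Approach

The back end exposes a conversion API that takes a URL as its input. The back-end code fetches the article information behind the URL (title, an excerpt of the content, and the cover image) and returns it to the front end, which uses it to render the card. The return value has the following format: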

{
	"title": "From Getting Started to Giving Up",
	"content": "Java object lifecycle",
	"image_url": "http://..."
}
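
As a minimal sketch of how the back end might model this payload, the class below holds the three fields and serializes them by hand. The class name `ArticleCard`, the `excerpt` helper, and the character limit are assumptions made for illustration; they are not part of the original project, and a real service would more likely use a JSON library.

```java
import java.util.Objects;

/** Hypothetical DTO for the card payload; the JSON keys follow the format above. */
final class ArticleCard {
    final String title;
    final String content;
    final String imageUrl;

    ArticleCard(String title, String content, String imageUrl) {
        this.title = Objects.requireNonNull(title);
        this.content = Objects.requireNonNull(content);
        this.imageUrl = Objects.requireNonNull(imageUrl);
    }

    /** Collapses whitespace and keeps at most maxChars code points of the body text. */
    static String excerpt(String fullText, int maxChars) {
        String collapsed = fullText.trim().replaceAll("\\s+", " ");
        if (collapsed.codePointCount(0, collapsed.length()) <= maxChars) {
            return collapsed;
        }
        return collapsed.substring(0, collapsed.offsetByCodePoints(0, maxChars)) + "...";
    }

    /** Escapes the characters that may not appear raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes the card with the keys title, content and image_url. */
    String toJson() {
        return "{\"title\":\"" + escape(title)
                + "\",\"content\":\"" + escape(content)
                + "\",\"image_url\":\"" + escape(imageUrl) + "\"}";
    }

    public static void main(String[] args) {
        ArticleCard card = new ArticleCard(
                "From Getting Started to Giving Up",
                excerpt("Java   object\nlifecycle", 50),
                "http://example.com/cover.jpg"); // placeholder URL
        System.out.println(card.toJson());
    }
}
```

Truncating the excerpt on the server keeps the payload small, so the front end only has to render the three fields.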

Note: this post does not cover how a URL is turned into a card. It focuses on how Java can fetch, or crawl, article information from a URL.

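III. Crawler Tools

Web crawling was a hot topic a couple of years ago. When crawlers come up, most people think of Python first, and crawlers written in Java are comparatively rare. This follows from the characteristics of the two languages: for crawling, Python's syntax is simpler and the code more concise, whereas Java's syntax is stricter than Python's and the code more verbose.

The crawling we need is very simple, so this post focuses on implementing it in Java.

First, here are three crawler frameworks/programs: Phantomjs/Casperjs, HtmlUnit, and Selenium. This post does not describe them in detail (search online if you need more) and only compares their characteristics, which also helps explain the technology choice: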

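| Framework | JavaScript engine | Cookie | Request [received] URL | Browser | Speed, stability, extensibility |
|---|---|---|---|---|---|
| Phantomjs/Casperjs | Based on WebKit | Supported | Supported | Based on WebKit | Fairly fast; the program sometimes crashes; supports various JS frameworks. Drawback: limited JS support |
| HtmlUnit | Rhino | Supported | Supported | Firefox or Internet Explorer | Fastest; fairly stable; supports various JS frameworks; can simulate URL requests from page content. Drawback: limited JS support |
| Selenium | Most engines | Supported | Not supported | Most browsers | Too slow, and the speed is unstable; has a UI; cross-platform use requires RemoteWebDriver. Advantage: supports most browsers |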

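Given these characteristics, and because we only need to get an article's title, content, and image from a URL, which does not involve JS or UI rendering, we chose HtmlUnit.
A brief introduction: HtmlUnit is an open-source Java page-analysis tool. After it loads a page, you can use it to analyze the page's content effectively. It can simulate a running browser and has been called an open-source Java implementation of a browser. Because this browser has no UI, it also runs very fast.

To make the crawled HTML easier to parse, we use jsoup, a Java HTML parsing tool that mainly parses HTML and XML files and makes it easy to extract the content you want from HTML/XML.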

IV. Code Implementation

This post covers two kinds of crawling: official account articles and CSDN articles.
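
Because the two kinds of pages are parsed differently, a conversion API would need to decide which crawler to use for a given URL. The sketch below does this by host name. The enum `ArticleSource` and its method are hypothetical and not part of the original project; the host names are taken from the sample URLs used later in this post.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical classifier that picks a crawler type from the URL's host. */
enum ArticleSource {
    WECHAT, CSDN, UNKNOWN;

    static ArticleSource fromUrl(String url) {
        String host;
        try {
            host = URI.create(url.trim()).getHost();
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI
            return UNKNOWN;
        }
        if (host == null) {
            return UNKNOWN;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.equals("mp.weixin.qq.com")) {
            return WECHAT;
        }
        if (host.equals("blog.csdn.net")) {
            return CSDN;
        }
        return UNKNOWN;
    }
}
```

For example, `ArticleSource.fromUrl("https://mp.weixin.qq.com/s/s2Txc0xDZ8KTnZk-5vp3JQ")` yields `WECHAT`, and any other host falls through to `UNKNOWN`, so the caller can return an error or a generic card.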

First, add the dependencies to pom.xml:

        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.55.0</version>
        </dependency>

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>

1. Crawling official account articles

package com.example.tts.service.impl;

import com.example.tts.service.ICrawlingService;
import com.example.tts.utils.ToSpeach;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

@Service
public class WXCrawlingService implements ICrawlingService {
    static String[] userAgent = {
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
            "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            "Openwave/ UCWEB7.0.2.37/28/999",
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
            "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
    };
    static BrowserVersion browser =
            new BrowserVersion.BrowserVersionBuilder(BrowserVersion.CHROME)
                    .setUserAgent(userAgent[new Random().nextInt(userAgent.length)])
                    .build();
    Map<String, Integer> proxy = new HashMap<>();
    ToSpeach toSpeach = new ToSpeach();

    /**
     * Loads the page at the given URL, prints the article title, body text and
     * cover image URL, and returns the page XML (null if the page failed to load).
     */
    public static String crawling02(String url) {
        String pageXml;
        // A WebClient cannot be reused after close(), so open one per call and
        // read the page before try-with-resources closes the client.
        try (WebClient webClient = new WebClient(browser)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setActiveXNative(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());

            HtmlPage page = webClient.getPage(url);
            // Give background JavaScript up to 30 seconds to finish
            webClient.waitForBackgroundJavaScript(30000);
            pageXml = page.asXml();
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }

        System.out.println(pageXml);
        Document document = Jsoup.parse(pageXml);
        String title = document.title();
        System.out.println(title);

        // The article title is the <h1> inside the element with class rich_media_title
        Elements rich_media_title = document.getElementsByClass("rich_media_title");
        System.out.println("=================title=====================");
        rich_media_title.forEach(element -> System.out.println(element.getElementsByTag("h1").text()));

        // The article body is the <p> text inside the element with class rich_media_content
        Elements infoListEle = document.getElementsByClass("rich_media_content");
        System.out.println("==================content====================");
        infoListEle.forEach(element -> System.out.println(element.getElementsByTag("p").text()));

        // The cover image URL is the first quoted string after the cdn_url_1_1 marker
        System.out.println("================获取封面图=====================");
        int marker = pageXml.indexOf("cdn_url_1_1");
        if (marker >= 0) {
            int start = pageXml.indexOf('"', marker) + 1;
            int end = pageXml.indexOf('"', start);
            if (start > 0 && end > start) {
                System.out.println(pageXml.substring(start, end));
            }
        }
        return pageXml;
    }

    public static void main(String[] args) {
        //String url = "https://mp.weixin.qq.com/s/mcUJt29Dq9g4EGH1GgcsGQ";

        String url = "https://mp.weixin.qq.com/s/s2Txc0xDZ8KTnZk-5vp3JQ";

        String xml = crawling02(url);
       // System.out.println(readXMLName02(xml,"title"));

    }


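    /** Returns the text between the first <name> and the following </name>, or null if absent. */
    public static String readXMLName02(String s, String name) {
        String openTag = "<" + name + ">";
        String closeTag = "</" + name + ">";
        int start = s.indexOf(openTag);
        if (start < 0) {
            return null;
        }
        int end = s.indexOf(closeTag, start);
        if (end < 0) {
            return null;
        }
        return s.substring(start + openTag.length(), end);
    }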


    @Override
    public ArrayList<String> crawling(String URL) {
        return null;
    }

}

Official account article:
[Image: screenshot of the official account article]
Crawl result:

=================title=====================
Spring中获取bean的八种方式,你get了几种?
==================content====================
大家好,我是锋哥! Java1234 VIP大特惠(仅限100名额)... (truncated)
================获取封面图=====================
https://mmbiz.qpic.cn/mmbiz_jpg/JfTPiahTHJhogG8qc16pF4gePH9FfnnLTia8m58vwDztbQCKcpxaoa44htlfuhVBtA9JicRns25WrofQTqzP2icGxw/0?wx_fmt=jpeg
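
The cover image lookup above relies on chained `substring` calls, which throw if the `cdn_url_1_1` marker is missing. As a hedged alternative, the sketch below does the same "first quoted string after the marker" lookup with a regular expression and returns an empty `Optional` when nothing is found. The class and method names are hypothetical, and the sample string in `main` only imitates the shape assumed by the original code, not the real page source.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the first double-quoted value that follows a marker. */
final class CoverImageExtractor {

    static Optional<String> firstQuotedValueAfter(String text, String marker) {
        // marker, then any non-quote characters, then "captured value"
        Pattern pattern = Pattern.compile(Pattern.quote(marker) + "[^\"]*\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only
        String sample = "var cdn_url_1_1 = \"https://example.com/cover.jpg\";";
        System.out.println(firstQuotedValueAfter(sample, "cdn_url_1_1").orElse("(no cover image)"));
    }
}
```

Returning an `Optional` lets the caller decide what to put in `image_url` when a page has no cover image.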

2. Crawling CSDN articles

package com.example.tts.service.impl;

import com.example.tts.service.ICrawlingService;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;

@Service
public class CSDNCrawlingService implements ICrawlingService {

    public static void crawling02(String URL){
        String pageXml;
        // A WebClient cannot be reused after close(), so open one per call and
        // read the page before try-with-resources closes the client.
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setActiveXNative(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());

            HtmlPage page = webClient.getPage(URL);
            // Give background JavaScript up to 30 seconds to finish
            webClient.waitForBackgroundJavaScript(30000);
            pageXml = page.asXml();
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        Document document = Jsoup.parse(pageXml);

        Elements titleListEle = document.getElementsByTag("title");
        String title = titleListEle.text();
        System.out.println("============文章标题=================");
        System.out.println(title);

        /*Elements infoListEle = document.getElementById("title").getElementsByTag("li");
        infoListEle.forEach(element -> {
            System.out.println(element.getElementsByClass("txt-box").first().getElementsByTag("h3").text());
            System.out.println(element.getElementsByClass("txt-box").first().getElementsByTag("a").attr("href"));
        });*/
    }

    public static void main(String[] args) {

        String url = "https://blog.csdn.net/Edward1027/article/details/124859901";
        crawling02(url);

    }

    @Override
    public ArrayList<String> crawling(String URL) {
        return null;
    }
}

Article:
[Image: screenshot of the CSDN article]
Crawl result:

============文章标题=================
CSDN插入表格_码农爱德华的博客-CSDN博客_csdn表格
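
The crawled `<title>` above carries a blog name and site suffix after the article title. If the card should show only the article title, one option is to strip that suffix. The sketch below assumes the title has the shape `article title_blog name<site marker>...`, which is inferred from this single sample and may not hold for every page; the class and method names are hypothetical.

```java
/** Hypothetical helper that strips a "_blog name<site marker>..." suffix from a page title. */
final class TitleCleaner {

    static String stripSiteSuffix(String rawTitle, String siteMarker) {
        String title = rawTitle.trim();
        int site = title.lastIndexOf(siteMarker);
        if (site < 0) {
            // No site marker: return the title unchanged
            return title;
        }
        String head = title.substring(0, site);
        // Assumption: the last underscore before the marker separates title and blog name
        int separator = head.lastIndexOf('_');
        return separator > 0 ? head.substring(0, separator) : head;
    }

    public static void main(String[] args) {
        String raw = "CSDN插入表格_码农爱德华的博客-CSDN博客_csdn表格";
        System.out.println(stripSiteSuffix(raw, "-CSDN博客"));
    }
}
```

Titles without the marker are returned unchanged, so the helper is safe to apply to pages from other sites.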

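V. Summary

I rarely work with crawlers, so this post only summarizes these two use cases. If you are interested, explore further. Studying alone gives a fairly one-sided understanding, so it should be combined with practical application.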