When you send a message in WeChat or DingTalk, a URL is automatically turned into a card with preview information, which makes the message far more readable and is a nice piece of user experience. An application we are currently building needs the same feature, and because of front-end limitations it has to be implemented with help from a back-end API. The back end exposes a conversion endpoint that takes a URL, fetches the article information (title, an excerpt of the content, and the cover image) from it, and returns that to the front end, which renders the card. The response format looks like this:
{
"title": "从入门到放弃",
"content": "java对象生命周期",
"image_url": "http://..."
}
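On the Java side that payload can be modeled as a small value object. A minimal sketch follows; the class and field names are my own invention (in a real Spring service you would let Jackson serialize the object rather than build the JSON by hand):

```java
// Hypothetical response object for the card API; names are illustrative only.
public class ArticleCard {
    private final String title;
    private final String content;
    private final String imageUrl;

    public ArticleCard(String title, String content, String imageUrl) {
        this.title = title;
        this.content = content;
        this.imageUrl = imageUrl;
    }

    // Builds the JSON payload shown above by hand, purely for illustration.
    public String toJson() {
        return String.format(
            "{\"title\": \"%s\", \"content\": \"%s\", \"image_url\": \"%s\"}",
            title, content, imageUrl);
    }
}
```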
Note: this post is not about turning a URL into a card. The focus is on how Java fetches article information from a URL, i.e. how Java crawls an article given its URL.
Web crawling was all the rage a couple of years ago. When crawlers come up, Python is usually the first tool people think of, and Java crawlers are comparatively rare. That follows from the two languages' characteristics: Python crawler code is simpler and more concise, while Java is stricter and more verbose. The crawling we need here is very simple, though, so this post focuses on doing it in Java.
First, a look at three crawling frameworks/tools: PhantomJS/CasperJS, HtmlUnit, and Selenium. We won't cover them in depth here (search the web if you want details); a feature comparison is enough to motivate the technology choice:
Framework | JavaScript engine | Cookies | request/received URL | Browser | Speed, stability, extensibility |
---|---|---|---|---|---|
PhantomJS/CasperJS | WebKit-based | Supported | Supported | WebKit-based | Fairly fast; the process occasionally crashes; works with many JS frameworks. Drawback: limited JS support |
HtmlUnit | Rhino | Supported | Supported | Firefox or Internet Explorer | Fastest and quite stable; works with many JS frameworks and can simulate URL requests from page content. Drawback: limited JS support |
Selenium | Most engines | Supported | Not supported | Most browsers | Slow, and the speed is unstable; drives a real UI, and cross-platform use requires RemoteWebDriver. Advantage: supports most browsers |
Given these characteristics, and since all we need is to fetch an article's title, content, and image from a URL, with no JavaScript execution or UI rendering involved, HtmlUnit is the natural choice.
A quick introduction: HtmlUnit is an open-source Java page-analysis tool. After loading a page, you can use HtmlUnit to analyze its content effectively. It simulates a running browser and is often described as the open-source Java browser. Being headless, it is also very fast.
To make it easy to parse the crawled HTML we use Jsoup, a Java HTML parser that handles HTML and XML documents and makes it simple to pull exactly the content you want out of them.
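A tiny self-contained example of what Jsoup gives us; the HTML string is inline purely for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><h1>Heading</h1><p>first</p><p>second</p></body></html>";
        // Parse the markup into a DOM and query it with CSS-style selectors.
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());             // Demo
        System.out.println(doc.select("p").text());  // first second
    }
}
```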
Today we crawl two kinds of articles: WeChat official-account articles and CSDN posts.
First, add the dependencies to pom.xml:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.55.0</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
package com.example.tts.service.impl;
import com.example.tts.service.ICrawlingService;
import com.example.tts.utils.ToSpeach;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
@Service
public class WXCrawlingService implements ICrawlingService {
static String[] userAgent = {
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
};
static BrowserVersion browser =
new BrowserVersion.BrowserVersionBuilder(BrowserVersion.CHROME)
.setUserAgent(userAgent[new Random().nextInt(userAgent.length)])
.build();
Map<String, Integer> proxy = new HashMap<>();
private static final WebClient webClient = new WebClient(browser);
ToSpeach toSpeach = new ToSpeach();
public static String crawling02(String url) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setActiveXNative(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage page = null;
try {
page = webClient.getPage(url);
// let asynchronous JavaScript finish before reading the DOM
webClient.waitForBackgroundJavaScript(30000);
} catch (Exception e) {
e.printStackTrace();
} finally {
// close the client only after the page has been fetched and its JS has run
webClient.close();
}
if (page == null) {
return null;
}
String pageXml = page.asXml();
Document document = Jsoup.parse(pageXml);
String title = document.title();
System.out.println(title);
// rich_media_title
Elements rich_media_title = document.getElementsByClass("rich_media_title");
System.out.println("=================title=====================");
//System.out.println(rich_media_title);
rich_media_title.forEach(element -> System.out.println(element.getElementsByTag("h1").text()));
Elements infoListEle = document.getElementsByClass("rich_media_content");
System.out.println("==================content====================");
//System.out.println(infoListEle);
infoListEle.forEach(element -> System.out.println(element.getElementsByTag("p").text()));
System.out.println("================ cover image =====================");
// the cover image URL is embedded in an inline script as: cdn_url_1_1 = "...";
// take the first quoted string after that marker
String urlXml = pageXml.substring(pageXml.indexOf("cdn_url_1_1"));
urlXml = urlXml.substring(urlXml.indexOf("\"") + 1);
String url_01_01 = urlXml.substring(0, urlXml.indexOf("\""));
System.out.println(url_01_01);
return pageXml;
}
public static void main(String[] args) {
//String url = "https://mp.weixin.qq.com/s/mcUJt29Dq9g4EGH1GgcsGQ";
String url = "https://mp.weixin.qq.com/s/s2Txc0xDZ8KTnZk-5vp3JQ";
String xml = crawling02(url);
// System.out.println(readXMLName02(xml,"title"));
}
// Extract the text between <name> and </name> in the given markup.
public static String readXMLName02(String s, String name) {
if (s.contains(name)) {
int start = s.indexOf("<" + name + ">") + name.length() + 2;
int end = s.indexOf("</" + name + ">");
return s.substring(start, end);
}
return null;
}
@Override
public ArrayList<String> crawling(String URL) {
return null;
}
}
=================title=====================
Spring中获取bean的八种方式,你get了几种?
==================content====================
大家好,我是锋哥! Java1234 VIP大特惠(仅限100名额)!... (omitted)
================ cover image =====================
https://mmbiz.qpic.cn/mmbiz_jpg/JfTPiahTHJhogG8qc16pF4gePH9FfnnLTia8m58vwDztbQCKcpxaoa44htlfuhVBtA9JicRns25WrofQTqzP2icGxw/0?wx_fmt=jpeg
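As an aside, the substring gymnastics used above to pull out `cdn_url_1_1` can be written more defensively with a regular expression. A sketch follows; the sample script line is made up, but it mirrors the shape of what WeChat pages embed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoverUrlExtractor {
    // Matches: cdn_url_1_1 = "https://...";  capturing the quoted URL.
    private static final Pattern COVER =
        Pattern.compile("cdn_url_1_1\\s*=\\s*\"([^\"]+)\"");

    public static String extract(String pageXml) {
        Matcher m = COVER.matcher(pageXml);
        return m.find() ? m.group(1) : null;  // null when no cover is present
    }

    public static void main(String[] args) {
        String sample = "var cdn_url_1_1 = \"https://mmbiz.qpic.cn/mmbiz_jpg/abc/0?wx_fmt=jpeg\";";
        System.out.println(extract(sample));  // prints the captured URL
    }
}
```

Unlike the raw `indexOf` approach, this returns `null` instead of throwing when the marker is missing.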
package com.example.tts.service.impl;
import com.example.tts.service.ICrawlingService;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
@Service
public class CSDNCrawlingService implements ICrawlingService {
private static WebClient webClient= new WebClient(BrowserVersion.CHROME);
public static void crawling02(String URL){
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setActiveXNative(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage page = null;
try {
page = webClient.getPage(URL);
// let asynchronous JavaScript finish before reading the DOM
webClient.waitForBackgroundJavaScript(30000);
} catch (Exception e) {
e.printStackTrace();
} finally {
webClient.close();
}
if (page == null) {
return;
}
String pageXml = page.asXml();
Document document = Jsoup.parse(pageXml);
Elements titleListEle = document.getElementsByTag("title");
String title = titleListEle.text();
System.out.println("============ article title =================");
System.out.println(title);
/*Elements infoListEle = document.getElementById("title").getElementsByTag("li");
infoListEle.forEach(element -> {
System.out.println(element.getElementsByClass("txt-box").first().getElementsByTag("h3").text());
System.out.println(element.getElementsByClass("txt-box").first().getElementsByTag("a").attr("href"));
});*/
}
public static void main(String[] args) {
String url = "https://blog.csdn.net/Edward1027/article/details/124859901";
crawling02(url);
}
@Override
public ArrayList<String> crawling(String URL) {
return null;
}
}
============ article title =================
CSDN插入表格_码农爱德华的博客-CSDN博客_csdn表格
Crawling doesn't come up much in everyday work, so these two cases are all I'll summarize here. If you're interested, dig further on your own; studying in isolation stays one-sided unless it is combined with real applications.