网络爬虫,就是在浏览器上,代替人类爬取数据,Java网络爬虫就是通过Java编写爬虫代码,代替人类从网络上爬取信息数据。程序员通过设定既定的规则,让程序代替我们从网络上获取海量我们需要的数据,比如图片,企业信息等。爬虫的关键是对于网页信息的解析。
什么是jsoup:
jsoup是一个用于处理现实世界HTML的Java库。它提供了一个非常方便的API,用于获取URL以及提取和操作数据,使用最好的HTML5 DOM方法和CSS选择器
我们以解析京东网页,红框数据为例
package com.xhf;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.URL;
public class JsoupTest {
static String url = "https://search.jd.com/Search?keyword=%E9%A4%90%E5%B7%BE%E7%BA%B8";
public static void main(String[] args) throws IOException {
// 解析网页, document就代表网页界面
Document document = Jsoup.parse(new URL(url), 5000);
// 打印获取前端代码
System.out.println(document);
}
}
但是,经常遇到弹出京东安全的界面
doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=no,maximum-scale=1.0,viewport-fit=cover">
<title>京东安全title>
<link href="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/css/app.6f723501.css" rel="preload" as="style">
<link href="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/js/app.js" rel="preload" as="script">
<link href="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/js/chunk-vendors.js" rel="preload" as="script">
<link href="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/css/app.6f723501.css" rel="stylesheet">
head>
<body>
<div class="ipaas-floor-app">div>
<script type="text/javascript" src="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/js/chunk-vendors.js">script>
<script type="text/javascript" src="https://cfe.m.jd.com/privatedomain/risk_handler/03101900/js/app.js">script>
body>
html>
这算是对于爬取数据的一种反制措施。以前可以通过添加header解决,现在得添加cookie了。获取cookie的方式如下
以下,就是修正后的代码
package com.xhf;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
public class JsoupTest {
static String url = "https://search.jd.com/Search?keyword=%E9%A4%90%E5%B7%BE%E7%BA%B8";
public static void main(String[] args) throws IOException {
// 设置cookie
Map<String, String> cookies = new HashMap<String, String>();
cookies.put("thor", "03F9B0325C5DCD2FCCDB435C227FD474D0B53C9143EB5DDA60599BDB9AE7A415B7CFEB4418F01DDEB8B8B9DD502D366A4E0BA2D84A0FE6CB6658061484CA95D230C7B76A36E31F4B329D2EFAC7DCD1E526F3C416CC50617276FED57FAF618892895784CB6446F6B8468A807290C12C3BA1C99DD0C0939C48C4E69681CA900EA9");
// 解析网页, document就代表网页界面
Document document = Jsoup.connect(url).cookies(cookies).get();
System.out.println(document);
}
}
doctype html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="renderer" content="webkit"> <meta http-equiv="Cache-Control" content="max-age=300"> <link rel="dns-prefetch" href="//search.jd.com"> <link rel="dns-prefetch" href="//item.jd.com"> <link rel="dns-prefetch" href="//list.jd.com"> <link rel="dns-prefetch" href="//p.3.cn"> <link rel="dns-prefetch" href="//misc.360buyimg.com"> <link rel="dns-prefetch" href="//nfa.jd.com"> <link rel="dns-prefetch" href="//d.jd.com"> <link rel="dns-prefetch" href="//img12.360buyimg.com"> <link rel="dns-prefetch" href="//img13.360buyimg.com"> <link rel="dns-prefetch" href="//static.360buyimg.com"> <link rel="dns-prefetch" href="//csc.jd.com"> <link rel="dns-prefetch" href="//mercury.jd.com"> <link rel="dns-prefetch" href="//x.jd.com"> <link rel="dns-prefetch" href="//wl.jd.com"> <title>餐巾纸 - 商品搜索 - 京东title> <meta name="Keywords" content="餐巾纸,京东餐巾纸"> <meta name="description" content="在京东找到了餐巾纸305051件餐巾纸的类似商品,其中包含了餐巾纸价格、餐巾纸评论、餐巾纸导购、餐巾纸图片等相关信息"> <style>
jsoup中的document可以当作js中的document使用,解析网站内容就是在js中操作document,获取信息
我们发现,所有的商品数据都是通过ul标签进行渲染
每单个数据,则是用li标签渲染
所以,如果我们要获取每个商品数据,我们可以先通过class,获取ul元素,然后选择出ul元素内包含的所有li元素
// 通过class获取ul标签
Elements ul = document.getElementsByClass("gl-warp clearfix");
// 获取ul标签下的所有li标签
Elements liList = ul.select("li");
for (Element element : liList) {
System.out.println("------------------");
System.out.println(element);
System.out.println();
}
------------------ <li data-sku="1297484" data-spu="1297484" ware-type="10" bybt="0" class="gl-item"> <div class="gl-i-wrap"> <div class="p-img"> <a target="_blank" title="【纸选维达,实力出发】爆品低至6.6折,抢新品低价试用 【神券疯狂领】满199减40神券 【会员福利送】下单满1元赢手机好礼,直达开抢!" href="//item.jd.com/1297484.html" onclick="searchlog(1, '1297484','28','2','','flagsClk=2097575');"> <img width="220" height="220" data-img="1" data-lazy-img="//img14.360buyimg.com/n7/jfs/t1/96373/19/43919/202641/64eb0ad6F1109a3ef/6d8d78fabae02163.jpg"> a> <div data-lease="" data-catid="15908" data-venid="1000001683" data-presale="0">div> div> <div class="p-price"> <strong class="J_1297484" data-presale="0" data-done="1"> <em>¥em><i data-price="1297484">78.90i> strong> div> <div class="p-name p-name-type-2"> <a target="_blank" title="【纸选维达,实力出发】爆品低至6.6折,抢新品低价试用 【神券疯狂领】满199减40神券 【会员福利送】下单满1元赢手机好礼,直达开抢!" href="//item.jd.com/1297484.html" onclick="searchlog(1, '1297484','28','1','','flagsClk=2097575');"> <em><img class="p-tag3" src="//m.360buyimg.com/cc/jfs/t1/113659/27/28361/2962/62ecb1f0E6c5fc50c/b914680e87a2c8e9.png"> 维达(Vinda)抽纸 超韧150抽*24包S码 湿水不易破 卫生纸 纸巾 <font class="skcolor_ljg">餐巾纸font> 整箱em> <i class="promo-words" id="J_AD_1297484">【纸选维达,实力出发】爆品低至6.6折,抢新品低价试用 【神券疯狂领】满199减40神券 【会员福利送】下单满1元赢手机好礼,直达开抢!i> a> div> <div class="p-commit"> <strong><a id="J_comment_1297484" target="_blank" href="//item.jd.com/1297484.html#comment" onclick="searchlog(1, '1297484','28','3','','flagsClk=2097575');">a>strong> div> <div class="p-shop" data-dongdong="" data-selfware="1" data-score="5" data-reputation="99"> <span class="J_im_icon"><a target="_blank" class="curr-shop hd-shopname" onclick="searchlog(1,'1000001683',0,58)" href="//mall.jd.com/index-1000001683.html?from=pc" title="维达京东自营官方旗舰店">维达京东自营官方旗舰店a>span> div> <div class="p-icons" id="J_pro_1297484" data-done="1"> <i class="goods-icons J-picon-tips J-picon-fix" data-idx="1" data-tips="京东自营,品质保障">自营i> <i class="goods-icons4 J-picon-tips" data-tips="本商品参与满件促销">2件9折i> div> <div class="p-operate"> <a class="p-o-btn contrast J_contrast contrast" data-sku="1297484" href="javascript:;" onclick="searchlog(1, '1297484','28','6','','flagsClk=2097575')"><i>i>对比a> <a class="p-o-btn focus J_focus" data-sku="1297484" href="javascript:;" onclick="searchlog(1, '1297484','28','5','','flagsClk=2097575')"><i>i>关注a> <a class="p-o-btn addcart" data-stocknew="1297484" href="//cart.jd.com/gate.action?pid=1297484&pcount=1&ptype=1" target="_blank" onclick="searchlog(1, '1297484','28','4','','flagsClk=2097575')" data-limit="0"><i>i>加入购物车a> div> <div class="p-stock hide" data-stocknew="1297484" data-province="山西">div> div> li> ------------------ <li data-sku="3092062" data-spu="3092062" ware-type="10" bybt="0" class="gl-item"> <div class="gl-i-wrap"> <div class="p-img"> <a target="_blank" title="【洁柔新品来袭】洁柔爱马仕设计师联名款重磅上线!爆款好物空前钜惠,爆品低至6.6折!【洁柔大会员】抢神券,会员臻享八大特权go" href="//item.jd.com/3092062.html" onclick="searchlog(1, '3092062','29','2','','flagsClk=2097574');"> <img width="220" height="220" data-img="1" data-lazy-img="//img12.360buyimg.com/n7/jfs/t1/97596/10/33191/189837/64ecc704F8cbfe25a/9015a6baf21bd1b9.jpg"> a> <div data-lease="" data-catid="15908" data-venid="1000001901" data-presale="0">div> div> <div class="p-price"> <strong class="J_3092062" data-presale="0" data-done="1"> <em>¥em><i data-price="3092062">54.90i> strong> div> <div class="p-name p-name-type-2"> <a target="_blank" title="【洁柔新品来袭】洁柔爱马仕设计师联名款重磅上线!爆款好物空前钜惠,爆品低至6.6折!【洁柔大会员】抢神券,会员臻享八大特权go" href="//item.jd.com/3092062.html" onclick="searchlog(1, '3092062','29','1','','flagsClk=2097574');"> <em><img class="p-tag3" src="//m.360buyimg.com/cc/jfs/t1/113659/27/28361/2962/62ecb1f0E6c5fc50c/b914680e87a2c8e9.png"> 洁柔抽纸 活力阳光橙3层120抽面巾纸*24包 母婴可用 全家适用em> <i class="promo-words" id="J_AD_3092062">【洁柔新品来袭】洁柔爱马仕设计师联名款重磅上线!爆款好物空前钜惠,爆品低至6.6折!【洁柔大会员】抢神券,会员臻享八大特权goi> a> div> <div class="p-commit"> <strong><a id="J_comment_3092062" target="_blank" href="//item.jd.com/3092062.html#comment" onclick="searchlog(1, '3092062','29','3','','flagsClk=2097574');">a>strong> div> <div class="p-shop" data-dongdong="" data-selfware="1" data-score="5" data-reputation="99"> <span class="J_im_icon"><a target="_blank" class="curr-shop hd-shopname" onclick="searchlog(1,'1000001901',0,58)" href="//mall.jd.com/index-1000001901.html?from=pc" title="洁柔京东自营官方旗舰店">洁柔京东自营官方旗舰店a>span> div> <div class="p-icons" id="J_pro_3092062" data-done="1"> <i class="goods-icons J-picon-tips J-picon-fix" data-idx="1" data-tips="京东自营,品质保障">自营i> div> <div class="p-operate"> <a class="p-o-btn contrast J_contrast contrast" data-sku="3092062" href="javascript:;" onclick="searchlog(1, '3092062','29','6','','flagsClk=2097574')"><i>i>对比a> <a class="p-o-btn focus J_focus" data-sku="3092062" href="javascript:;" onclick="searchlog(1, '3092062','29','5','','flagsClk=2097574')"><i>i>关注a> <a class="p-o-btn addcart" data-stocknew="3092062" href="//cart.jd.com/gate.action?pid=3092062&pcount=1&ptype=1" target="_blank" onclick="searchlog(1, '3092062','29','4','','flagsClk=2097574')" data-limit="0"><i>i>加入购物车a> div> <div class="p-stock hide" data-stocknew="3092062" data-province="山西">div> div> li> ...其余数据不做展示
通过遍历上述代码中出现的liList
,可以获取到每一个li
元素。每个元素都代表了商品的一组信息。具体如下所示。
如果我们要获取更为具体的信息,比如价格,图片,介绍等信息。我们就需要对li标签所封装的对象进行数据的截取。
我们可以用getElementsByTag("img")
来获取带有img标签的对象,然后获取其data-lazy-img
属性的数据
String pict = element.getElementsByTag("img").first().attr("data-lazy-img");
我们可以通过getElementsByClass("p-price")
的方式获取对象,然后获取其中内容
String price = element.getElementsByClass("p-price").first().text();
shop名称,类似价格获取方式
package com.xhf;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
/**
* 解析京东界面, 爬取商品数据
*/
public class JsoupTest {
static String url = "https://search.jd.com/Search?keyword=%E9%A4%90%E5%B7%BE%E7%BA%B8";
public static void main(String[] args) throws IOException {
// 设置cookie
Map<String, String> cookies = new HashMap<String, String>();
cookies.put("thor", "03F9B0325C5DCD2FCCDB435C227FD474D0B53C9143EB5DDA60599BDB9AE7A415B7CFEB4418F01DDEB8B8B9DD502D366A4E0BA2D84A0FE6CB6658061484CA95D230C7B76A36E31F4B329D2EFAC7DCD1E526F3C416CC50617276FED57FAF618892895784CB6446F6B8468A807290C12C3BA1C99DD0C0939C48C4E69681CA900EA9");
// 解析网页, document就代表网页界面
Document document = Jsoup.connect(url).cookies(cookies).get();
// 通过class获取ul标签
Elements ul = document.getElementsByClass("gl-warp clearfix");
// 获取ul标签下的所有li标签
Elements liList = ul.select("li");
for (Element element : liList) {
System.out.println("------------------");
String pict = element.getElementsByTag("img").first().attr("data-lazy-img");
String price = element.getElementsByClass("p-price").first().text();
String shopName = element.getElementsByClass("p-shop").first().text();
System.out.println(pict);
System.out.println(price);
System.out.println(shopName);
}
}
}