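Concepts
- jsoup is an HTML parser for Java that can directly parse a URL or a string of HTML text. It provides a convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.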
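Quick start
- Steps:
  - Add the jsoup jar to the project
  - Obtain a Document object
  - Obtain the Element object for the tag you want
  - Read the data
- Code:
  - XML document (student.xml)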
<students>
    <student number="duing-001">
        <name>tom</name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="duing-002">
        <name>jack</name>
        <age>18</age>
        <sex>female</sex>
    </student>
</students>
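import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo1 {
    public static void main(String[] args) throws IOException {
        // Locate student.xml on the classpath; decode the path in case it contains spaces or non-ASCII characters
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        path = java.net.URLDecoder.decode(path, "utf-8");
        // Parse the file into a Document (the in-memory DOM tree)
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Collect every <name> element
        Elements elements = document.getElementsByTag("name");
        System.out.println(elements.size());
        // Take the first <name> element and print its text
        Element element = elements.get(0);
        String name = element.text();
        System.out.println(name);
    }
}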
- Output:
  2
  tom
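One detail worth knowing when the input is XML rather than HTML: Jsoup.parse(File, String) uses the HTML parser, which wraps the content in html, head, and body elements. jsoup also has an XML parser that keeps the structure as written. A minimal sketch, assuming the markup is inlined as a string; the class name XmlParserDemo and the helper rootTag are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class XmlParserDemo {
    // Inlined sample markup (an assumption; the notes read it from student.xml)
    static final String XML =
            "<students><student number=\"duing-001\"><name>tom</name></student></students>";

    // Hypothetical helper: tag name of the document's first child element.
    static String rootTag(Document doc) {
        return doc.child(0).tagName();
    }

    public static void main(String[] args) {
        Document asHtml = Jsoup.parse(XML);                        // default HTML parser
        Document asXml = Jsoup.parse(XML, "", Parser.xmlParser()); // XML parser, empty base URI
        System.out.println(rootTag(asHtml)); // html
        System.out.println(rootTag(asXml));  // students
        // Tag lookups work the same way under either parser
        System.out.println(asXml.getElementsByTag("name").text()); // tom
    }
}
```

For the simple lookups in these notes either parser gives the same elements; the difference matters mainly when the root element or the exact tree shape is important.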
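The Jsoup class
- Jsoup: a utility class that parses an HTML or XML document and returns a Document
- parse: a static method that parses an HTML or XML document and returns a Document
  - parse(File in, String charsetName): parses an HTML or XML file (shown in the example above)
  - parse(String html): parses an HTML or XML string
  - parse(URL url, int timeoutMillis): fetches the HTML or XML document at the given URL and parses it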
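import java.io.IOException;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo1 {
    public static void main(String[] args) throws IOException {
        URL url = new URL("https://baidu.com");
        Document doc = Jsoup.parse(url, 1000); // timeout in milliseconds
        System.out.println(doc);
    }
}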
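Of the three parse overloads, parse(String html) is the only one without a demo in these notes. A minimal sketch, assuming the student markup is inlined as a string; the class name ParseStringDemo and the helper firstName are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseStringDemo {
    // Hypothetical helper: parse markup held in a string and
    // return the text of the first <name> tag, or null if there is none.
    static String firstName(String markup) {
        Document doc = Jsoup.parse(markup);
        Element name = doc.getElementsByTag("name").first(); // first() returns null on an empty result
        return name == null ? null : name.text();
    }

    public static void main(String[] args) {
        String markup = "<students>"
                + "<student number=\"duing_001\"><name>tom</name><age>18</age></student>"
                + "<student number=\"duing_002\"><name>jack</name><age>18</age></student>"
                + "</students>";
        System.out.println(firstName(markup)); // tom
    }
}
```

Parsing from a string needs no file or network access, which makes it handy for quick experiments and unit tests.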
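Document object
- Document: the document object, representing the DOM tree in memory
  - Getting Element objects
    - getElementById(String id): returns the single element with the given id attribute value
    - getElementsByTag(String tagName): returns the collection of elements with the given tag name
    - getElementsByAttribute(String key): returns the collection of elements that have the given attribute
    - getElementsByAttributeValue(String key, String value): returns the collection of elements whose attribute key has the given value
- student.xml document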
<students>
    <student number="duing_001">
        <name id="id01">tom</name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="duing_002">
        <name>jack</name>
        <age>18</age>
        <sex>female</sex>
    </student>
</students>
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo1 {
    public static void main(String[] args) throws IOException {
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        path = java.net.URLDecoder.decode(path, "utf-8");
        Document doc = Jsoup.parse(new File(path), "utf-8");
        // By tag name: every <student> element
        Elements elements = doc.getElementsByTag("student");
        System.out.println(elements);
        System.out.println("------------------------");
        // By attribute name: every element that has an id attribute
        Elements elements1 = doc.getElementsByAttribute("id");
        System.out.println(elements1);
        System.out.println("------------------------");
        // By attribute name and value: elements with number="duing_002"
        Elements elements2 = doc.getElementsByAttributeValue("number","duing_002");
        System.out.println(elements2);
        System.out.println("------------------------");
        // By id: the single element with id="id01"
        Element element = doc.getElementById("id01");
        System.out.println(element);
    }
}
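Element object
- Elements: a collection of Element objects; it can be used like an ArrayList<Element>
- Element: an element object
  - Getting child elements
    - getElementById(String id): returns the single element with the given id attribute value
    - getElementsByTag(String tagName): returns the collection of elements with the given tag name
    - getElementsByAttribute(String key): returns the collection of elements that have the given attribute
    - getElementsByAttributeValue(String key, String value): returns the collection of elements whose attribute key has the given value
  - Getting attribute values
    - String attr(String key): returns the value of the attribute with the given name
  - Getting text content (note the difference between the two)
    - String text(): returns the plain text of the element and all of its child tags, with the markup stripped
    - String html(): returns everything inside the tag body, including child tags and their text
- student.xml document: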
<students>
    <student number="duing_001">
        <name id="id01">
            <xing>zhang</xing>
            <ming>san</ming>
        </name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="duing_002">
        <name>jack</name>
        <age>18</age>
        <sex>female</sex>
    </student>
</students>
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo3 {
    public static void main(String[] args) throws IOException {
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        path = java.net.URLDecoder.decode(path, "utf-8");
        Document doc = Jsoup.parse(new File(path), "utf-8");
        // Searching from the Document finds every <name> in the file
        Elements elements = doc.getElementsByTag("name");
        System.out.println(elements.size());
        System.out.println("-----------------");
        // Searching from an Element only finds <name> tags inside that element
        Element element_student = doc.getElementsByTag("student").get(0);
        Elements ele_name =element_student.getElementsByTag("name");
        System.out.println(ele_name.size());
        System.out.println("-----------------");
        // Read the value of the number attribute
        String key = element_student.attr("number");
        System.out.println(key);
        System.out.println("-----------------");
        // text() strips the child tags; html() keeps them
        String text = ele_name.text();
        String html = ele_name.html();
        System.out.println(text);
        System.out.println("-----------------");
        System.out.println(html);
    }
}
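The points above about Elements behaving like a list, and about attr(), text(), and html(), can be tried without a file on disk. A minimal sketch, assuming the markup is inlined as a string; the class name TextVsHtmlDemo and the helper studentNumbers are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class TextVsHtmlDemo {
    // Inlined sample markup (an assumption; the notes read it from student.xml)
    static final String MARKUP = "<students>"
            + "<student number=\"duing_001\"><name id=\"id01\"><xing>zhang</xing><ming>san</ming></name></student>"
            + "<student number=\"duing_002\"><name>jack</name></student>"
            + "</students>";

    // Hypothetical helper: collect the number attribute of every <student>.
    static List<String> studentNumbers(Document doc) {
        List<String> numbers = new ArrayList<>();
        Elements students = doc.getElementsByTag("student");
        for (Element student : students) { // Elements iterates like any List<Element>
            numbers.add(student.attr("number"));
        }
        return numbers;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(MARKUP);
        Element name = doc.getElementById("id01");
        System.out.println(name.text()); // text only, child tags stripped
        System.out.println(name.html()); // inner markup, <xing> and <ming> tags included
        System.out.println(studentNumbers(doc));
    }
}
```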
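Node: the node object
- The base class of both Document and Element
- Not a focus of these notes
Shortcut queries
- selector: CSS selectors
  - Method: Elements select(String cssQuery)
  - student.xml document: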
<students>
    <student number="duing_001">
        <name id="id01">
            <xing>zhang</xing>
            <ming>san</ming>
        </name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="duing_002">
        <name id="id02">jack</name>
        <age>18</age>
        <sex>female</sex>
    </student>
</students>
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupDemo4 {
    public static void main(String[] args) throws IOException {
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        path = java.net.URLDecoder.decode(path, "utf-8");
        Document doc = Jsoup.parse(new File(path), "utf-8");
        // Tag selector: every <name> element
        Elements elements = doc.select("name");
        System.out.println(elements);
        System.out.println("----------------");
        // Id selector: the element with id="id02"
        Elements elements1 = doc.select("#id02");
        System.out.println(elements1);
        System.out.println("----------------");
        // Attribute selector: <student> elements with number="duing_001"
        Elements elements2 = doc.select("student[number='duing_001']");
        System.out.println(elements2);
        System.out.println("----------------");
        // Child combinator: <age> elements directly under that <student>
        Elements elements3 = doc.select("student[number='duing_001'] > age");
        System.out.println(elements3);
    }
}
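Each selector form corresponds to one of the getElementsBy... methods covered earlier. A minimal sketch that checks the correspondence on inlined markup; the class name SelectorEquivalentsDemo and the helper sameResult are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorEquivalentsDemo {
    // Inlined sample markup (an assumption; the notes read it from student.xml)
    static final String MARKUP = "<students>"
            + "<student number=\"duing_001\"><name id=\"id01\">tom</name><age>18</age></student>"
            + "<student number=\"duing_002\"><name id=\"id02\">jack</name><age>18</age></student>"
            + "</students>";

    // Hypothetical helper: true if the CSS query returns exactly the expected elements, in order.
    static boolean sameResult(Document doc, String cssQuery, Elements expected) {
        return doc.select(cssQuery).equals(expected);
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(MARKUP);
        // "name" matches by tag, like getElementsByTag
        System.out.println(sameResult(doc, "name", doc.getElementsByTag("name")));
        // "[id]" matches by attribute presence, like getElementsByAttribute
        System.out.println(sameResult(doc, "[id]", doc.getElementsByAttribute("id")));
        // "[number=duing_002]" matches by attribute value, like getElementsByAttributeValue
        System.out.println(sameResult(doc, "[number=duing_002]",
                doc.getElementsByAttributeValue("number", "duing_002")));
        // "#id02" matches by id, like getElementById
        System.out.println(doc.select("#id02").first() == doc.getElementById("id02"));
    }
}
```

The selector form becomes more useful once conditions are combined, as in the "student[number='duing_001'] > age" query in the demo, which would otherwise take several chained method calls.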
- XPath: the XML Path Language, a language for addressing parts of an XML document.
  - Using XPath with jsoup requires an additional jar
  - Methods: sel, selN, selNOne, selOne
  - Consult the w3cschool reference manual for XPath syntax when writing queries
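import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Assumption: package names as in the older JsoupXpath 0.3.x jar; adjust them to the jar you imported
import cn.wanghaomiao.xpath.exception.XpathSyntaxErrorException;
import cn.wanghaomiao.xpath.model.JXDocument;
import cn.wanghaomiao.xpath.model.JXNode;

public class JsoupDemo5 {
    public static void main(String[] args) throws IOException, XpathSyntaxErrorException {
        String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        path = java.net.URLDecoder.decode(path, "utf-8");
        Document doc = Jsoup.parse(new File(path), "utf-8");
        // Wrap the jsoup Document so it can be queried with XPath
        JXDocument jxDocument = new JXDocument(doc);
        // Every <student> element
        List<JXNode> jxNodes = jxDocument.selN("//student");
        for (JXNode jxNode : jxNodes) {
            System.out.println(jxNode);
        }
        System.out.println("----------------");
        // Every <name> element directly under a <student>
        List<JXNode> jxNodes1 = jxDocument.selN("//student/name");
        for (JXNode jxNode : jxNodes1) {
            System.out.println(jxNode);
        }
        System.out.println("----------------");
        // <name> elements under a <student> that have an id attribute
        List<JXNode> jxNodes2 = jxDocument.selN("//student/name[@id]");
        for (JXNode jxNode : jxNodes2) {
            System.out.println(jxNode);
        }
        System.out.println("----------------");
        // <name> elements under a <student> whose id attribute equals id02
        List<JXNode> jxNodes3 = jxDocument.selN("//student/name[@id='id02']");
        for (JXNode jxNode : jxNodes3) {
            System.out.println(jxNode);
        }
    }
}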