Use a CSS or jQuery-like selector syntax to find and manipulate data in an HTML document.
Use the methods Element.select(String selector) and Elements.select(String selector).
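import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements with src ending in .png
Element masthead = doc.select("div.masthead").first(); // first div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3.r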
jsoup elements support a CSS (or jQuery) like selector syntax to find matching elements, which allows very powerful and robust queries.
The select method is available on a Document, an Element, or Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
select returns a list of elements (as Elements), which provides a range of methods to extract and manipulate the results.
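The contextual behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original cookbook: the class name ContextualSelectDemo and the sample markup are hypothetical, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ContextualSelectDemo {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<div id='nav'><a href='/home'>Home</a></div>"
      + "<div id='content'><p><a href='/a'>A</a></p><p><a href='/b'>B</a></p></div>";

    // Selecting from a specific element only searches inside that element,
    // so the link in div#nav is not counted.
    static int linksInContent() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.select("div#content").first();
        return content.select("a[href]").size();
    }

    // Chained select calls: each call searches within the previous results.
    static String chainedHrefs() {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select("div#content").select("p").select("a[href]");
        return String.join(",", links.eachAttr("href"));
    }

    public static void main(String[] args) {
        System.out.println(linksInContent());
        System.out.println(chainedHrefs());
    }
}
```

Both methods ignore the navigation link because the search starts from div#content rather than from the whole document.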
See the Selector API reference for the full list of supported selectors and details.
Original: http://jsoup.org/cookbook/extracting-data/selector-syntax
-------------------------------------------------------------------------------------------------------------------------------
A CSS-like element selector that finds elements matching a query.
The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).
| Pattern | Matches | Example |
|---|---|---|
| `*` | any element | `*` |
| `tag` | elements with the given tag name | `div` |
| `ns\|E` | elements of type E in the namespace ns | `fb\|name` finds `<fb:name>` elements |
| `#id` | elements with attribute ID of "id" | `div#wrap`, `#logo` |
| `.class` | elements with a class name of "class" | `div.left`, `.result` |
| `[attr]` | elements with an attribute named "attr" (with any value) | `a[href]`, `[title]` |
| `[^attrPrefix]` | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | `[^data-]`, `div[^data-]` |
| `[attr=val]` | elements with an attribute named "attr", and value equal to "val" | `img[width=500]`, `a[rel=nofollow]` |
| `[attr^=valPrefix]` | elements with an attribute named "attr", and value starting with "valPrefix" | `a[href^=http:]` |
| `[attr$=valSuffix]` | elements with an attribute named "attr", and value ending with "valSuffix" | `img[src$=.png]` |
| `[attr*=valContaining]` | elements with an attribute named "attr", and value containing "valContaining" | `a[href*=/search/]` |
| `[attr~=regex]` | elements with an attribute named "attr", and value matching the regular expression | `img[src~=(?i)\\.(png\|jpe?g)]` |
| | The above may be combined in any order | `div.header[title]` |
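A few of the attribute patterns can be tried against a small document. The sketch below is illustrative only: the class AttributeSelectorDemo, its count helper, and the sample markup are hypothetical, and jsoup is assumed to be on the classpath. Note that the doubled backslash in the regex example is Java string-literal escaping; the selector jsoup receives contains a single backslash.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeSelectorDemo {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<a href='http://example.com/search/q'>search</a>"
      + "<a href='/local' rel='nofollow'>local</a>"
      + "<img src='a.PNG'><img src='b.jpeg'><img src='c.gif'>"
      + "<div data-id='7' title='t'>d</div>";

    // Returns how many elements in the sample document match the query.
    static int count(String query) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(query).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href^=http:]"));     // value starts with "http:"
        System.out.println(count("a[href*=/search/]"));  // value contains "/search/"
        System.out.println(count("a[rel=nofollow]"));    // value equals "nofollow"
        System.out.println(count("[^data-]"));           // attribute name starts with "data-"
        // "\\." in Java source is the regex \. (a literal dot).
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]"));
    }
}
```

The regex query matches a.PNG and b.jpeg because of the (?i) flag, and skips c.gif.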

Combinators

| Pattern | Matches | Example |
|---|---|---|
| `E F` | an F element descended from an E element | `div a`, `.logo h1` |
| `E > F` | an F direct child of E | `ol > li` |
| `E + F` | an F element immediately preceded by sibling E | `li + li`, `div.head + div` |
| `E ~ F` | an F element preceded by sibling E | `h1 ~ p` |
| `E, F, G` | all matching elements E, F, or G | `a[href], div, h3` |
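The difference between the combinators is easiest to see by counting matches on a nested list. This is a rough sketch under stated assumptions: CombinatorDemo, its count helper, and the markup are hypothetical names introduced here, and jsoup is assumed to be available.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    // Hypothetical sample markup: an ordered list whose second item
    // contains a nested unordered list.
    static final String HTML =
        "<div><h1>Title</h1><p>one</p>"
      + "<ol><li>a</li><li>b<ul><li>nested</li></ul></li><li>c</li></ol>"
      + "<p>two</p></div>";

    // Returns how many elements in the sample document match the query.
    static int count(String query) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(query).size();
    }

    public static void main(String[] args) {
        System.out.println(count("ol li"));      // descendants, including the nested li
        System.out.println(count("ol > li"));    // direct children only
        System.out.println(count("li + li"));    // li immediately after a sibling li
        System.out.println(count("h1 + p"));     // only the p right after h1
        System.out.println(count("h1 ~ p"));     // every later sibling p of h1
        System.out.println(count("h1, ol, p"));  // union of the three selectors
    }
}
```

The nested li is counted by "ol li" but not by "ol > li", since it is a child of the ul rather than of the ol.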

Pseudo selectors

| Pattern | Matches | Example |
|---|---|---|
| `:lt(n)` | elements whose sibling index (zero-based) is less than n | `td:lt(3)` finds the first 3 cells of each row |
| `:gt(n)` | elements whose sibling index is greater than n | `td:gt(1)` finds cells after skipping the first two |
| `:eq(n)` | elements whose sibling index is equal to n | `td:eq(0)` finds the first cell of each row |
| `:has(selector)` | elements that contain at least one element matching the selector | `div:has(p)` finds divs that contain p elements |
| `:not(selector)` | elements that do not match the selector. See also Elements.not(String) | `div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs. |
| `:contains(text)` | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. | `p:contains(jsoup)` finds p elements containing the text "jsoup". |
| `:matches(regex)` | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. | `td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively. |
| `:containsOwn(text)` | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. | `p:containsOwn(jsoup)` finds p elements with own text "jsoup". |
| `:matchesOwn(regex)` | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | `td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively. |
| | The above may be combined in any order and with other selectors | `.light:contains(name):eq(0)` |
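The index-based and text-based pseudo selectors can be checked with a two-row table and a pair of divs. The following is a minimal sketch, assuming jsoup is on the classpath; PseudoSelectorDemo, count, and the markup are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorDemo {
    // Hypothetical sample markup: two rows of four cells, plus two divs.
    static final String HTML =
        "<table>"
      + "<tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>"
      + "<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>"
      + "</table>"
      + "<div class='logo'><p>jsoup <b>rocks</b></p></div>"
      + "<div><span>other</span></div>";

    // Returns how many elements in the sample document match the query.
    static int count(String query) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(query).size();
    }

    public static void main(String[] args) {
        // Sibling indexes are zero-based, so :lt(3) keeps indexes 0, 1 and 2.
        System.out.println(count("td:lt(3)"));
        System.out.println(count("td:gt(1)"));
        System.out.println(count("td:eq(0)"));
        // :contains looks at descendant text too; :containsOwn does not.
        System.out.println(count("p:contains(rocks)"));
        System.out.println(count("p:containsOwn(rocks)"));
        // "\\d+" in Java source is the regex \d+.
        System.out.println(count("td:matches(\\d+)"));
    }
}
```

The word "rocks" sits inside the b element, so the p matches :contains(rocks) but not :containsOwn(rocks).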

Structural pseudo selectors

| Pattern | Matches | Example |
|---|---|---|
| `:root` | the element that is the root of the document. In HTML, this is the html element | `:root` |
| `:nth-child(an+b)` | elements that have an+b-1 siblings before them in the document tree, for any zero or positive integer value of n, and that have a parent element. `:nth-child()` can also take odd and even as arguments: odd is equivalent to 2n+1, and even is equivalent to 2n. | `tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` finds the 9th, 19th, 29th, etc. element. `li:nth-child(5)` finds the 5th li |
| `:nth-last-child(an+b)` | elements that have an+b-1 siblings after them in the document tree. Otherwise like `:nth-child()` | `tr:nth-last-child(-n+2)` finds the last two rows of a table |
| `:nth-of-type(an+b)` | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and that have a parent element | `img:nth-of-type(2n+1)` |
| `:nth-last-of-type(an+b)` | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and that have a parent element | `img:nth-last-of-type(2n+1)` |
| `:first-child` | elements that are the first child of some other element | `div > p:first-child` |
| `:last-child` | elements that are the last child of some other element | `ol > li:last-child` |
| `:first-of-type` | elements that are the first sibling of their type in the list of children of their parent element | `dl dt:first-of-type` |
| `:last-of-type` | elements that are the last sibling of their type in the list of children of their parent element | `tr > td:last-of-type` |
| `:only-child` | elements that have a parent element and whose parent element has no other element children | |
| `:only-of-type` | elements that have a parent element and whose parent element has no other element children with the same expanded element name | |
| `:empty` | elements that have no children at all | |
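The an+b notation is plain arithmetic: a child at 1-based position p is selected when p = a*n + b for some integer n ≥ 0. The sketch below works this out in plain Java without calling jsoup; NthChildIndexes and its methods are hypothetical helpers written for this illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

class NthChildIndexes {
    // True when pos = a*n + b for some integer n >= 0.
    static boolean matches(int a, int b, int pos) {
        if (a == 0) {
            return pos == b;
        }
        int diff = pos - b;
        return diff % a == 0 && diff / a >= 0;
    }

    // Returns the 1-based positions in 1..childCount selected by an+b.
    static List<Integer> matching(int a, int b, int childCount) {
        List<Integer> result = new ArrayList<>();
        for (int pos = 1; pos <= childCount; pos++) {
            if (matches(a, b, pos)) {
                result.add(pos);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matching(2, 1, 6));    // 2n+1 (odd): [1, 3, 5]
        System.out.println(matching(2, 0, 6));    // 2n (even): [2, 4, 6]
        System.out.println(matching(10, -1, 30)); // 10n-1: [9, 19, 29]
        System.out.println(matching(-1, 2, 6));   // -n+2: [1, 2]
    }
}
```

For :nth-last-child and :nth-last-of-type the same positions are counted from the end, which is why -n+2 selects the last two rows in the table example.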