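Today I looked at jsoup, an HTML parser for Java. I had always used HTMLParser before, and it handles HTML well enough, but jsoup is more powerful. In short, jsoup is a Java take on jQuery: the method names and the selectors are very similar, for example the commonly used jQuery methods html(), text(), and attr(). If you already know jQuery, jsoup is easy to pick up; if you don't, working through a few examples alongside the official tutorial and API docs will get you there quickly.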
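To make the jQuery comparison concrete, here is a minimal sketch of the three accessors mentioned above. The class name JqueryStyleDemo and the sample markup are made up for this illustration; only Jsoup.parse, select, first, attr, text, and html are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JqueryStyleDemo {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div id=\"wrap\"><a href=\"http://jsoup.org/\" title=\"home\">jsoup <b>site</b></a></div>";

    // Parse the sample and return the first <a> element, much like $("a").first() in jQuery.
    static Element firstLink() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println("attr(\"href\"): " + link.attr("href")); // attribute value
        System.out.println("text():       " + link.text());         // text with tags stripped
        System.out.println("html():       " + link.html());         // inner HTML, tags kept
    }
}
```

As in jQuery, text() drops the nested tags while html() keeps them.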
Official website:
http://jsoup.org/
My site 导航189 uses jsoup to parse the HTML when its "Hot Posts" section scrapes data from Tianya:
http://www.dh189.com/
The attachment contains three files: jsoup-1.2.3.jar, jsoup-1.2.3-javadoc.jar, and jsoup-1.2.3-sources.jar.

Getting started with jsoup:
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.dh189.com/");
        // Fetch and parse the page; the second argument is the timeout in milliseconds
        Document doc = Jsoup.parse(url, 3 * 1000);
        // Select every link on the page
        Elements test = doc.select("a");
        for (Element element : test) {
            // element.outerHtml() and element.toString() give the same result
            System.out.println("Link source: " + element.outerHtml());
            System.out.println("Link URL: " + element.attr("href") + " Link text: " + element.text());
        }
    }
}
jsoup supports the following selectors:
| Pattern | Example |
| --- | --- |
| * | * |
| E | h1 |
| ns\|E | fb\|name finds <fb:name> elements |
| E#id | div#wrap, #logo |
| E.class | div.left, .result |
| E[attr] | a[href], [title] |
| E[^attrPrefix] | [^data-], div[^data-] |
| E[attr=val] | img[width=500], a[rel=nofollow] |
| E[attr^=valPrefix] | a[href^=http:] |
| E[attr$=valSuffix] | img[src$=.png] |
| E[attr*=valContaining] | a[href*=/search/] |
| E[attr~=regex] | img[src~=(?i)\\.(png\|jpe?g)] |
| Any combination of the above | div.header[title] |
| E F | div a, .logo h1 |
| E > F | ol > li |
| E + F | li + li, div.head + div |
| E ~ F | h1 ~ p |
| E, F, G | a[href], div, h3 |
| E:lt(n) | td:lt(3) finds the first 3 cells of each row (the index is zero-based) |
| E:gt(n) | td:gt(1) finds cells after skipping the first two |
| E:eq(n) | td:eq(0) finds the first cell of each row |
| E:has(selector) | div:has(p) finds divs that contain p elements |
| E:contains(text) | p:contains(jsoup) finds p elements containing the text "jsoup" |
| E:matches(regex) | td:matches(\\d+) finds table cells containing digits; div:matches((?i)login) finds divs containing the text, case insensitively |
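A quick way to get a feel for these patterns is to run them against a small in-memory page and count the matches. The sketch below does that; the class name SelectorDemo, the count helper, and the sample HTML are all made up for this illustration, and the match counts depend entirely on that sample.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical sample page, made up for this illustration.
    static final String HTML =
        "<div id=\"wrap\">"
        + "<h1>Title</h1>"
        + "<p>about jsoup</p>"
        + "<a href=\"http://example.com/a.html\" rel=\"nofollow\">A</a>"
        + "<a href=\"/search/q\">B</a>"
        + "<img src=\"logo.png\" width=\"500\">"
        + "<img src=\"photo.JPG\">"
        + "<table><tr><td>1</td><td>two</td><td>3</td><td>four</td></tr></table>"
        + "</div>";

    // Parse the sample page and return how many elements the selector matches.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a[href^=http:]",                  // attribute value prefix
            "a[href*=/search/]",               // attribute value contains
            "img[src$=.png]",                  // attribute value suffix
            "img[src~=(?i)\\.(png|jpe?g)]",    // attribute value matches a regex
            "div > a",                         // direct children
            "h1 ~ p",                          // following sibling
            "td:lt(3)",                        // sibling index less than 3
            "td:gt(1)",                        // sibling index greater than 1
            "td:matches(\\d+)",                // text matches a regex
            "div:has(p)",                      // contains a matching descendant
            "p:contains(jsoup)"                // contains the given text
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector) + " match(es)");
        }
    }
}
```

With this sample row of four cells, td:lt(3) matches three cells and td:gt(1) matches two, which is an easy way to remember that the index is zero-based.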
The official example, ListLinks:
package org.jsoup.examples;

import org.apache.commons.lang.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

import java.net.URL;
import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        URL url = new URL(args[0]);
        print("Fetching %s...", url.toExternalForm());

        Document doc = Jsoup.parse(url, 3*1000);
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}
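The abs: prefix used in ListLinks is worth a closer look: attr("abs:href") resolves a relative link against the document's base URI, while plain attr("href") returns the value exactly as written in the markup. Here is a small sketch that needs no network access; the class name AbsUrlDemo, its helper methods, and the example.com URLs are assumptions made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlDemo {
    // Return the href of the first link exactly as it appears in the markup.
    static String rawHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link.attr("href");
    }

    // Return the href of the first link resolved against the base URI.
    static String absoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link.attr("abs:href");
    }

    public static void main(String[] args) {
        // Hypothetical markup and base URI, for illustration only.
        String html = "<a href=\"/about.html\">About</a>";
        String base = "http://example.com/docs/";
        System.out.println("raw:      " + rawHref(html, base));
        System.out.println("absolute: " + absoluteHref(html, base));
    }
}
```

When the page is fetched from a URL, as in ListLinks, jsoup takes the base URI from that URL, so no extra argument is needed there.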
Some other examples:
// Setting text escapes it, while prepend() and append() add content around it
Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>
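The fragment above assumes an existing doc. Below is a self-contained sketch of the same steps, starting from an empty div; the class name ModifyDemo and the buildDiv helper are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyDemo {
    // Build the div step by step, as in the fragment above.
    static Element buildDiv() {
        Document doc = Jsoup.parse("<div></div>");
        Element div = doc.select("div").first();
        div.text("five > four"); // replaces the content; '>' is stored as plain text
        div.prepend("First ");   // inserts before the existing content
        div.append(" Last");     // inserts after the existing content
        return div;
    }

    public static void main(String[] args) {
        Element div = buildDiv();
        System.out.println(div.text());      // unescaped text content
        System.out.println(div.outerHtml()); // serialized HTML, with '>' escaped as &gt;
    }
}
```

Note the difference between the two outputs: text() returns the literal ">" character, while the serialized HTML carries the escaped entity.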

// Parse a local file; the third argument is the base URI used to resolve relative links
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // a that is a direct child of h3 with class=r
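To try the file-based variant without preparing /tmp/input.html by hand, the sketch below writes a small sample page to a temporary file and runs the same four selectors on it. The class name FileParseDemo, the parseSample helper, and the sample markup are made up for this illustration; Jsoup.parse(File, String, String) is the same call used above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FileParseDemo {
    // Hypothetical sample page, made up for this illustration.
    static final String SAMPLE =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"pic.gif\"></div>"
        + "<h3 class=\"r\"><a href=\"/one\">One</a></h3>"
        + "<h3 class=\"r\"><span><a href=\"/two\">Two</a></span></h3>"
        + "<a name=\"anchor\">no href</a>";

    // Write the sample to a temporary file, parse it, then delete the file.
    static Document parseSample() {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.write(tmp, SAMPLE.getBytes(StandardCharsets.UTF_8));
                return Jsoup.parse(tmp.toFile(), "UTF-8", "http://example.com/");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        System.out.println("a[href]:        " + doc.select("a[href]").size());
        System.out.println("img[src$=.png]: " + doc.select("img[src$=.png]").size());
        System.out.println("div.masthead:   " + (doc.select("div.masthead").first() != null));
        System.out.println("h3.r > a:       " + doc.select("h3.r > a").size());
    }
}
```

The second h3 wraps its link in a span, so "h3.r > a" skips it: the > combinator only matches direct children, whereas "h3.r a" would match both links.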