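Jsoup is an HTML parser for Java. It offers convenient methods for extracting and manipulating HTML documents, and lets you locate nodes and read their data with DOM traversal, CSS selectors, and jQuery-like methods.
Its select and pipeline APIs are as powerful as jQuery's.
As an example, the following Scala program extracts rental listings from 58同城 (58.com) to show how it is used: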
package test

import java.util.HashMap

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

/**
 * Author: fuliang
 * http://fuliang.iteye.com
 */
// One rental listing taken from a single result row.
class HouseEntry(var title: String, var link: String, var price: Int, var houseType: String, var date: String) {
  override def toString(): String =
    "title: %s\tlink:%s\tprice:%d\thouseType:%s\tdate:%s\n".format(title, link, price, houseType, date)
}

class HouseRentCrawler {
  // Fetch the search result page, then extract one HouseEntry per listing row.
  def crawl(url: String, keyword: String, lowRange: Int, highRange: Int): List[HouseEntry] = {
    val doc = fetch(url, keyword, lowRange, highRange)
    extract(doc)
  }

  // Send the search as a GET request with the query parameters below.
  private def fetch(url: String, keyword: String, lowRange: Int, highRange: Int): Document = {
    val params = new HashMap[String, String]()
    params.put("final", "1")
    params.put("jump", "2")
    params.put("searchtype", "3")
    params.put("key", keyword)
    params.put("MinPrice", s"${lowRange}_${highRange}")
    Jsoup.connect(url).data(params)
      .userAgent("Mozilla")
      .timeout(10000) // milliseconds
      .get()
  }

  private def extract(doc: Document): List[HouseEntry] = {
    // Rows directly under #infolist, skipping rows with class "dev".
    val elements = doc.select("#infolist > tr:not(.dev)")
    // Build the list in page order (the earlier `::=` loop returned it reversed).
    (0 until elements.size()).map { i =>
      val fields = elements.get(i).select("td")
      val title = fields.get(0).text()
      val link = fields.get(0).select("a[class=t]").attr("href")
      val price = fields.get(1).text().toInt
      val houseType = fields.get(2).text()
      val date = fields.get(3).text()
      new HouseEntry(title, link, price, houseType, date)
    }.toList
  }
}

object HouseRentCrawler {
  def main(args: Array[String]): Unit = {
    val url = "http://bj.58.com/zufang"
    val crawler = new HouseRentCrawler()
    val houseEntries = crawler.crawl(url, "智学苑", 2000, 3500)
    for (entry <- houseEntries) {
      println(entry)
    }
  }
}
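The `extract` method above calls `toInt` on the raw text of the price cell, which throws a `NumberFormatException` if the cell holds anything besides digits. A minimal sketch of a more tolerant conversion could look like this; `PriceParsing` and `parsePrice` are hypothetical names that are not part of the original crawler, and whether the real page ever puts non-numeric text in that cell is an assumption.

```scala
import scala.util.Try

// Hypothetical helper, not part of the original crawler.
object PriceParsing {
  // Parses the first run of ASCII digits found in the cell text.
  // Returns None when there are no digits or the number does not fit in an Int.
  def parsePrice(text: String): Option[Int] =
    """\d+""".r.findFirstIn(text).flatMap(digits => Try(digits.toInt).toOption)
}
```

With a helper like this, `extract` could skip a row whose price comes back as `None` instead of aborting the whole crawl.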
Selector overview
* tagname: find elements by tag, e.g. a
* ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
* #id: find elements by ID, e.g. #logo
* .class: find elements by class name, e.g. .masthead
* [attribute]: elements with attribute, e.g. [href]
* [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
* [attr=value]: elements with attribute value, e.g. [width=500]
* [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
* [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* *: all elements, e.g. *
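The basic selectors listed above can be tried offline by parsing a small HTML string instead of fetching a page. The sketch below is only an illustration: the markup is made up, and `SelectorOverviewDemo` and `count` are hypothetical names. It relies only on `Jsoup.parse` and `Document.select`.

```scala
import org.jsoup.Jsoup

object SelectorOverviewDemo {
  // Made-up markup, for illustration only.
  val html: String =
    """<div id="logo" class="masthead" data-role="banner">
      |  <a href="/path/home">Home</a>
      |  <img src="photo.JPG" width="500">
      |  <img src="icon.gif">
      |</div>""".stripMargin

  private val doc = Jsoup.parse(html)

  // Number of elements in the sample document that match the query.
  def count(query: String): Int = doc.select(query).size()

  def main(args: Array[String]): Unit = {
    // Each of these queries matches exactly one element of the sample markup.
    val queries = List(
      "a", "#logo", ".masthead", "[href]", "[^data-]", "[width=500]",
      "[href^=/path]", "[src$=.gif]", "[href*=/path/]",
      """img[src~=(?i)\.(png|jpe?g)]"""
    )
    for (q <- queries) println(q + " -> " + count(q))
  }
}
```

The regular-expression query is written as a triple-quoted string so that the backslash reaches Jsoup unchanged; the `(?i)` flag is what lets it match `photo.JPG`.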
Selector combinations
* el#id: elements with ID, e.g. div#logo
* el.class: elements with class, e.g. div.masthead
* el[attr]: elements with attribute, e.g. a[href]
* Any combination, e.g. a[href].highlight
* ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
* parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
* siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
* siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
* el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
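To see how the combinators differ, it helps to run them against the same small document. The following sketch uses made-up markup, and `CombinatorDemo`, `count`, and `texts` are hypothetical names; `texts` returns the combined text of the matched elements via `Elements.text()`.

```scala
import org.jsoup.Jsoup

object CombinatorDemo {
  // Made-up markup: a content block with a nested paragraph, plus an empty logo div.
  val html: String =
    """<div class="content"><h1>Title</h1><p>one</p>""" +
      """<section><p>two</p></section><p class="highlight">three</p></div>""" +
      """<div id="logo" class="masthead"></div>"""

  private val doc = Jsoup.parse(html)

  // Number of elements matching the query.
  def count(query: String): Int = doc.select(query).size()

  // Text of all matched elements, joined by single spaces in document order.
  def texts(query: String): String = doc.select(query).text()

  def main(args: Array[String]): Unit = {
    println(texts(".content p"))      // descendant: one two three
    println(texts("div.content > p")) // direct children only: one three
    println(texts("h1 + p"))          // immediately following sibling: one
    println(texts("h1 ~ p"))          // any following sibling: one three
    println(texts("h1, p.highlight")) // grouped selectors: Title three
    println(count("body > *"))        // direct children of body: 2
  }
}
```

The paragraph inside `<section>` shows the difference most clearly: it is matched by the descendant query but not by the child or sibling queries.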
Pseudo selectors
* :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
* :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
* :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
* :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
* :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
* :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
* :containsOwn(text): find elements that directly contain the given text
* :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
* :matchesOwn(regex): find elements whose own text matches the specified regular expression
* Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
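The pseudo selectors, and in particular the 0-based sibling index, can be checked the same way. This is a sketch on made-up markup; `PseudoSelectorDemo`, `count`, and `texts` are hypothetical names.

```scala
import org.jsoup.Jsoup

object PseudoSelectorDemo {
  // Made-up markup: one table row with four cells, and two divs.
  val html: String =
    """<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>""" +
      """<div><p>jsoup is handy</p></div>""" +
      """<div class="logo"><span>Login</span></div>"""

  private val doc = Jsoup.parse(html)

  // Number of elements matching the query.
  def count(query: String): Int = doc.select(query).size()

  // Text of all matched elements, joined by single spaces in document order.
  def texts(query: String): String = doc.select(query).text()

  def main(args: Array[String]): Unit = {
    println(texts("td:lt(3)"))                // sibling index 0, 1, 2: a b c
    println(texts("td:gt(2)"))                // sibling index 3: d
    println(texts("td:eq(1)"))                // second cell: b
    println(count("div:has(p)"))              // 1
    println(count("div:not(.logo)"))          // 1
    println(count("p:contains(JSOUP)"))       // case-insensitive: 1
    println(count("div:containsOwn(Login)"))  // text is inside the span: 0
    println(count("span:containsOwn(Login)")) // 1
    println(count("div:matches((?i)login)"))  // matches descendant text too: 1
  }
}
```

The two `containsOwn` queries show the difference between an element's own text and the text of its descendants: the word sits in the `span`, so only the `span` matches.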
For more information, see [http://jsoup.org/|http://jsoup.org/]