Continuing from the previous post.
The previous post covered basic scraping of web page content; this one builds on that work with some refinements.
[Screenshot of the result]
As in the previous post, we analyze the returned HTML, pull out the data we want, put it into an Adapter, and display it in a ListView.
Runnable runnable = new Runnable() {
    @Override
    public void run() {
        Message message = new Message();
        try {
            if (url.isEmpty()) {
                return; // nothing to fetch
            }
            String u = url;
            // Connect with a desktop Firefox user agent so the site serves the full page.
            Connection conn = Jsoup.connect(u);
            conn.header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0");
            // Execute the GET request and parse the response into a Document.
            Document doc = conn.get();
            // Grab every <span id="text110"> element.
            Elements elements = doc.select("span[id=text110]");
            Log.v(TAG, "size " + elements.size());
            all = elements.toString();
            message.what = WebActivity.FG;
        } catch (Exception x) {
            x.printStackTrace();
        }
        // new MyTask().execute();
        handler.sendMessage(message);
    }
};
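For context, here is a minimal sketch of the receiving side on the UI thread. It assumes the Activity holds the handler and the all string assigned above, plus a List<String> named items and an ArrayAdapter named adapter backing the ListView; items and adapter are illustrative names, not taken from the original code.

// Sketch of the Handler on the UI thread (inside WebActivity).
// Requires android.os.Handler, android.os.Looper, android.os.Message.
// "items" and "adapter" are illustrative names for the ListView's backing data.
private final Handler handler = new Handler(Looper.getMainLooper()) {
    @Override
    public void handleMessage(Message msg) {
        if (msg.what == WebActivity.FG) {
            items.clear();
            items.add(all);                 // "all" holds the extracted span content
            adapter.notifyDataSetChanged(); // refresh the ListView
        }
    }
};

The Runnable itself has to run off the main thread, for example via new Thread(runnable).start(), since Android does not allow network calls on the UI thread.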
Let's take a look at some of Jsoup's source code:
/**
 * Creates a new {@link Connection} to a URL. Use to fetch and parse a HTML page.
 * <p>
 * Use examples:
 * <ul>
 *  <li><code>Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();</code></li>
 *  <li><code>Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();</code></li>
 * </ul>
 * @param url URL to connect to. The protocol must be {@code http} or {@code https}.
 * @return the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.
 */
public static Connection connect(String url) {
    return HttpConnection.connect(url);
}
This is how we obtain the HTML from the URL we supply.
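As the Javadoc examples hint, the returned Connection is chainable, so the header/get steps used in the Runnable above could equally be written as a single expression. A minimal sketch; the ten-second timeout is an illustrative addition, not part of the original app:

// Equivalent chained form of the connect / header / get sequence used above.
// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.IOException.
try {
    Document doc = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0")
            .timeout(10000)   // milliseconds; illustrative value
            .get();           // executes the GET and parses the response
} catch (IOException e) {
    e.printStackTrace();      // mirrors the error handling in the Runnable above
}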
/**
 * Set a request header.
 * @param name header name
 * @param value header value
 * @return this Connection, for chaining
 * @see org.jsoup.Connection.Request#headers()
 */
public Connection header(String name, String value);
This sets a request header. The code in our app is shown below; from the header value you can see that we present ourselves as the Firefox browser (Gecko/20100101 Firefox/32.0):
conn.header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0");
Next comes fetching the Document:
/**
 * Execute the request as a GET, and parse the result.
 * @return parsed Document
 * @throws IOException on error
 */
public Document get() throws IOException;
Once we have the Document we can start the analysis: we query the whole HTML document for the elements that match our conditions.
/**
 * Find elements that match the {@link Selector} query, with this element as the starting context. Matched elements
 * may include this element, or any of its children.
 * <p/>
 * This method is generally more powerful to use than the DOM-type {@code getElementBy*} methods, because
 * multiple filters can be combined, e.g.:
 * <ul>
 * <li>{@code el.select("a[href]")} - finds links ({@code a} tags with {@code href} attributes)
 * <li>{@code el.select("a[href*=example.com]")} - finds links pointing to example.com (loosely)
 * </ul>
 * <p/>
 * See the query syntax documentation in {@link org.jsoup.select.Selector}.
 *
 * @param query a {@link Selector} query
 * @return elements that match the query (empty if none match)
 * @see org.jsoup.select.Selector
 */
public Elements select(String query) {
    return Selector.select(query, this);
}
The query we use in the app is:
Elements elements = doc.select("span[id=text110]");
Checking this against the original web page, you can see that the content we want is pinned down by exactly two conditions: the span tag and the attribute id=text110.
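Incidentally, span[id=text110] can also be written with the shorthand span#text110. And instead of collapsing everything with elements.toString(), the matched elements could be walked one by one so the ListView gets one row per match. The sketch below assumes a hypothetical rows list for that purpose:

// Sketch: one ListView row per matched <span id="text110">.
// Requires org.jsoup.nodes.Element, org.jsoup.select.Elements, java.util.ArrayList, java.util.List.
// "rows" is an illustrative name; the original code keeps a single String ("all").
Elements elements = doc.select("span[id=text110]");   // equivalent to doc.select("span#text110")
List<String> rows = new ArrayList<String>();
for (Element e : elements) {
    rows.add(e.text());   // text() returns the element's visible text without the tags
}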