http://lobobrowser.org/cobra.jsp
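Pages that depend on JavaScript logic are a major obstacle for web crawlers. The DOM tree is complete only after the page's JavaScript has run, and sometimes the crawler has to parse the DOM as modified by that JavaScript. After searching through a lot of material, I found an open-source project, Cobra. Cobra supports JavaScript: its built-in engine is Mozilla's Rhino, and it uses the Rhino API to interpret and execute the JavaScript embedded in HTML. Test case: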
js.html
<html>
  <title>test javascript</title>
  <script language="javascript">
    var go = function(){
      document.getElementById("gg").innerHTML="google";
    }
  </script>
  <body onLoad="javascript:go();">
    <a id = "gg" onClick="javascript:go();" href="#">baidu</a>
  </body>
</html>
Test.java
package net.cooleagle.test.cobra;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import org.lobobrowser.html.UserAgentContext;
import org.lobobrowser.html.domimpl.HTMLDocumentImpl;
import org.lobobrowser.html.parser.DocumentBuilderImpl;
import org.lobobrowser.html.parser.InputSourceImpl;
import org.lobobrowser.html.test.SimpleUserAgentContext;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
public class Test {
    // URL of the test page (js.html served by a local web server)
    private static final String TEST_URI = "http://localhost/js.html";

    public static void main(String[] args) throws Exception {
        UserAgentContext uacontext = new SimpleUserAgentContext();
        DocumentBuilderImpl builder = new DocumentBuilderImpl(uacontext);
        URL url = new URL(TEST_URI);
        InputStream in = url.openConnection().getInputStream();
        try {
            Reader reader = new InputStreamReader(in, "ISO-8859-1");
            InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);
            // Parse the page; the resulting DOM reflects the changes made by go()
            Document d = builder.parse(inputSource);
            HTMLDocumentImpl document = (HTMLDocumentImpl) d;
            // Look up the link whose text the script replaces
            Element ele = document.getElementById("gg");
            System.out.println(ele.getTextContent());
        } finally {
            in.close();
        }
    }
}
Output:
google
The test succeeds.
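One detail worth noting in Test.java: the stream is decoded with a hardcoded "ISO-8859-1". For pages served in another encoding (UTF-8 or GBK, for example), the text handed to the parser would be garbled. One possible approach is to take the charset from the HTTP Content-Type header instead. The following is only a minimal sketch using the standard library; `CharsetSniffer` and `charsetFromContentType` are hypothetical names introduced for illustration and are not part of Cobra.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Cobra: picks a charset from a Content-Type value.
final class CharsetSniffer {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=UTF-8". Falls back to ISO-8859-1 (the encoding
     * Test.java hardcodes) when the header is null, has no charset
     * parameter, or names a charset this JVM does not know.
     */
    static Charset charsetFromContentType(String contentType) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The parameter value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=UTF-8"));
    }
}
```

In Test.java this could be wired in by keeping the `URLConnection` in a variable, say `conn`, and creating the reader as `new InputStreamReader(in, CharsetSniffer.charsetFromContentType(conn.getContentType()))`. This only covers the HTTP header; a charset declared in a `<meta>` tag inside the page is not handled by this sketch.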
============================================
I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. While looking for a better engine for extracting the HTML of rendered pages, I found the Cobra Toolkit, which is part of the Lobo Project. The project consists of the Cobra Toolkit, which renders HTML, and the LoboBrowser built on top of it. The code is pure Java.
My initial comparison of JRex and Cobra found the following salient facts: