Preface:
Bored stiff today, I decided to build my own blog. The front end needs a good-looking page, and the back end needs to crawl XXX. Since XXX offers RSS, I assumed grabbing the feed page would be trivial. But I never expected..... and even less did I expect...
Page:
First I put together a page, taking some lessons from the CSS masters.
Robot:
I figured URLConnection could simply fetch the xml page and that would be that, but tragedy struck: my request was flatly rejected by XXX.
<body>
  <div style="padding:50px 0 0 300px">
    <h1>Your access has been denied</h1>
    <p>You may be using a web crawler!</p>
    XXXXXXXXX
  </div>
</body>
- -! So naturally the next step was to hand-craft HTTP packets and send them straight to XXX's port 80. After several hours of fiddling, XXX no longer rejected me outright, but my grasp of the HTTP protocol wasn't deep enough, so the page request still failed, as shown below:
www.XXXXX.com/XXX.XXX.XXX.XXX 80

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: text/html
Content-Length: 349
Date: Sat, 24 Jul 2010 16:52:47 GMT
Server: lighttpd/1.4.20

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title>400 - Bad Request</title>
</head>
<body>
  <h1>400 - Bad Request</h1>
</body>
</html>
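In hindsight, a 400 from lighttpd usually points to a malformed request line or missing headers; HTTP/1.1 in particular requires a Host header. For reference, a minimal raw-socket sketch with the basics in place (the host and path here are placeholders, not the real feed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and path; forgetting the Host header is a
        // classic way to earn a "400 Bad Request" from an HTTP/1.1 server.
        Socket socket = new Socket("www.example.com", 80);
        PrintWriter out = new PrintWriter(socket.getOutputStream());
        out.print("GET /rss.xml HTTP/1.1\r\n");
        out.print("Host: www.example.com\r\n");
        out.print("Connection: close\r\n");
        out.print("\r\n"); // blank line ends the header block
        out.flush();

        // Dump the raw response, status line and all.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        socket.close();
    }
}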
Fed up with raw HTTP packets, I went back to URLConnection and faked a User-Agent. To my surprise, the feed actually came through. Sweat!!
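A minimal sketch of the disguise (the feed URL is a placeholder and the User-Agent is just a common browser string):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class RssFetcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL; swap in the real RSS address.
        URL url = new URL("http://www.example.com/rss.xml");
        URLConnection conn = url.openConnection();
        // Pose as an ordinary browser so the server's crawler check passes.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}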
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
</rss>
XML (to be continued)
With the blog's InputStream in hand, the next step is to parse the XML stream and load it into the backend database.
package org.blog.xml;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * Parses the RSS stream and pulls out the <item> entries.
 *
 * @author cjcj
 */
public class XMLParser {

    public Document parser(InputStream is)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = f.newDocumentBuilder();
        Document doc = builder.parse(is);
        getItems(doc.getDocumentElement());
        return doc;
    }

    // Collect every <item> element; for now only the title is extracted,
    // keyed by its position in the feed (the final schema is still open).
    private Map<String, String> getItems(Element n) {
        if (n == null) throw new NullPointerException();
        Map<String, String> items = new HashMap<String, String>();
        NodeList nl = n.getElementsByTagName("item");
        for (int i = 0; nl != null && i < nl.getLength(); ++i) {
            Element et = (Element) nl.item(i);
            String title = getTextValue(et, "title");
            System.out.println(title); // get the title....
            items.put(String.valueOf(i), title);
        }
        return items;
    }

    // Text of the first child element with the given tag name, or null if absent.
    private String getTextValue(Element e, String tagNm) {
        NodeList nl = e.getElementsByTagName(tagNm);
        return nl != null && nl.getLength() > 0
                ? nl.item(0).getFirstChild().getNodeValue()
                : null;
    }
}
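Wiring the two steps together, the disguised URLConnection from above feeds straight into the parser (again with a placeholder URL and User-Agent):

import java.net.URL;
import java.net.URLConnection;
import org.blog.xml.XMLParser;

public class Main {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/rss.xml");
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0");
        new XMLParser().parser(conn.getInputStream()); // prints each item title
    }
}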
Filter
Compression
DB
Smart update detection and timers
Approach 1: detect updates by comparing the <pubDate></pubDate> tags.
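A sketch of that approach, assuming pubDate arrives in the usual RFC 822 form ("Sat, 24 Jul 2010 16:52:47 GMT") and polling with java.util.Timer; the 30-minute interval and the feed-fetch step are placeholders:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Timer;
import java.util.TimerTask;

public class UpdateChecker {
    // RSS 2.0 pubDate uses the RFC 822 date format.
    private static final SimpleDateFormat RFC822 =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.ENGLISH);

    private Date lastSeen; // newest pubDate already stored in the database

    // True when the feed's pubDate is newer than the one saved last time.
    public synchronized boolean hasUpdate(String pubDate) throws ParseException {
        Date current = RFC822.parse(pubDate);
        if (lastSeen == null || current.after(lastSeen)) {
            lastSeen = current;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        final UpdateChecker checker = new UpdateChecker();
        // Poll every 30 minutes; the actual feed download is left out here.
        new Timer().schedule(new TimerTask() {
            public void run() {
                System.out.println("checking feed at " + new Date());
                // String pubDate = ...fetch the feed and read its <pubDate>...;
                // if (checker.hasUpdate(pubDate)) { /* re-parse and save items */ }
            }
        }, 0L, 30L * 60L * 1000L);
    }
}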