If you can already use a bit of simple code to grab data from a website, you may not realize it, but you've taken your first step into crawling. Don't think of crawlers as anything lofty. Simply put, a "web crawler" analyzes and filters web pages or data according to a set of rules and strategies, and pulls out the data you want. The most prominent examples are the big search engines: whenever you type a keyword and hit search, they crawl data from sites all over the web according to their own strategies and present the results to you. As for which search engine to use these days: if you can use Google, use Google. Baidu's algorithm isn't as strong as Google's, and besides, Baidu will put your content in the top spot as long as you pay for it... Try searching for the same images on Baidu and Google and compare the results yourself.
But I digress!
Ever since my graduation project, I've turned into a bit of a crawler bug myself: whenever I come across an interesting site I can't resist crawling it, for example scraping jokes from Qiushibaike. But I was writing the code from scratch every time, and I kept wondering whether there was a good framework where I could just write the crawl rules and be done with it. I went looking a while ago and, after a lot of searching, came across crawler4j. It works quite well, so here is a quick introduction to how to use it.
You need to write two files yourself.
The first is XXXXCrawler.java. This class extends WebCrawler and overrides two of its methods: shouldVisit(Page referringPage, WebURL url), which decides whether a URL matches your crawl rules, and visit(Page page), which processes each page that gets crawled.

The second file is XXXXController.java, where you configure a few parameters for your little crawler:
```java
String crawlStorageFolder = "/data/crawl/root";
int numberOfCrawlers = 7;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);

/*
 * Instantiate the controller for this crawl.
 */
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

/*
 * For each crawl, you need to add some seed urls. These are the first
 * URLs that are fetched and then the crawler starts following links
 * which are found in these pages
 */
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/");

/*
 * Start the crawl. This is a blocking operation, meaning that your code
 * will reach the line after this only when crawling is finished.
 */
controller.start(MyCrawler.class, numberOfCrawlers);
```

Put these two files together and you've built yourself a little crawler.
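For completeness, here is roughly what the MyCrawler referenced by controller.start() above looks like. This is a bare-bones sketch adapted from crawler4j's basic example, so the binary-file filter regex and the ics.uci.edu prefix are just the upstream example's placeholders, not anything specific to this post:

```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip common binary/static resources.
    private final static Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    // Decide whether a discovered URL should be fetched.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    // Called once for every page that has been fetched and parsed.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();
            System.out.println("Visited: " + url + " (" + links.size() + " outgoing links)");
        }
    }
}
```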
Below is the code I used to crawl Zhihu users (don't ask me why Zhihu; everyone crawls it, so I crawled it too)... My entry point is any question link with a lot of followers, for example "http://www.zhihu.com/question/20894671" (the exact entry doesn't matter, as long as the program can reach links to individual profiles).
MyController.java
```java
public static void main(String[] args) throws Exception {
    String crawlStorageFolder = "F:/spider";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setFollowRedirects(true);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("http://www.zhihu.com/question/20381470");
    controller.addSeed("http://www.zhihu.com/question/23049278#answer-4608421");
    controller.addSeed("http://www.zhihu.com/question/37667007");
    controller.addSeed("http://www.zhihu.com/question/21578177#answer-2649268");
    controller.addSeed("http://www.zhihu.com/question/20162455#answer-830100");

    controller.start(MyCrawler.class, numberOfCrawlers);
}
```

Then comes MyCrawler.java (the rules here are pretty simple):
```java
private final static Pattern FILTERS = Pattern.compile("-\\d+");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return FILTERS.matcher(href).find()
            && href.startsWith("http://www.zhihu.com")
            && !href.endsWith("posts")
            && !href.endsWith("answers")
            && !href.endsWith("asks")
            && !href.endsWith("collections");
}
```
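In other words, the crawler only keeps zhihu.com URLs that contain a hyphen followed by digits, and drops the list-style pages ending in posts/answers/asks/collections. A quick standalone sanity check of that logic, using made-up URLs purely for illustration (not necessarily real Zhihu routes):

```java
import java.util.regex.Pattern;

public class FilterCheck {

    private final static Pattern FILTERS = Pattern.compile("-\\d+");

    // Same boolean expression as shouldVisit(), minus the crawler4j types.
    static boolean wouldVisit(String href) {
        href = href.toLowerCase();
        return FILTERS.matcher(href).find()
                && href.startsWith("http://www.zhihu.com")
                && !href.endsWith("posts")
                && !href.endsWith("answers")
                && !href.endsWith("asks")
                && !href.endsWith("collections");
    }

    public static void main(String[] args) {
        System.out.println(wouldVisit("http://www.zhihu.com/question/20894671#answer-123456")); // true
        System.out.println(wouldVisit("http://www.zhihu.com/people/some-user-42"));             // true
        System.out.println(wouldVisit("http://www.zhihu.com/people/some-user-42/answers"));     // false: list page
        System.out.println(wouldVisit("http://www.zhihu.com/question/20894671"));               // false: no "-digits"
    }
}
```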
```java
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        Document doc = Jsoup.parse(html);

        // Pull the profile fields out of the page with Jsoup selectors.
        zhUser zu = new zhUser();
        Elements eles1 = doc.select("div.title-section");
        zu.setName(eles1.select(".name").text());
        zu.setSlogan(eles1.select(".bio").text());

        Elements bodyEles = doc.select(".body");
        for (Element item : bodyEles) {
            zu.setHeadImgUrl(item.select(".zm-profile-header-avatar-container img").attr("src"));
            zu.setAddress(item.select(".info-wrap .location").attr("title"));
            zu.setWork(item.select(".info-wrap .business").attr("title"));
            zu.setSchool(item.select(".info-wrap .education").attr("title"));
            zu.setMajor(item.select(".info-wrap .education-extra").attr("title"));
        }
        zu.setUserUrl(url);

        // Persist the user record.
        insertUserToDB(zu);
    }
}
```
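The zhUser bean and insertUserToDB() aren't shown in the post; they are just a data holder plus a plain database insert. A rough sketch of what they might look like, assuming JDBC with MySQL and a hypothetical zhihu_user table (the table name, columns, and connection details are all my assumptions, not from the original code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Plain data holder for one Zhihu user profile.
public class zhUser {
    private String name, slogan, headImgUrl, address, work, school, major, userUrl;

    public void setName(String name)             { this.name = name; }
    public void setSlogan(String slogan)         { this.slogan = slogan; }
    public void setHeadImgUrl(String headImgUrl) { this.headImgUrl = headImgUrl; }
    public void setAddress(String address)       { this.address = address; }
    public void setWork(String work)             { this.work = work; }
    public void setSchool(String school)         { this.school = school; }
    public void setMajor(String major)           { this.major = major; }
    public void setUserUrl(String userUrl)       { this.userUrl = userUrl; }

    // Hypothetical JDBC insert; in the real crawler this would more likely live in
    // MyCrawler or a small DAO class. Table/column names and the connection URL are assumptions.
    public static void insertUserToDB(zhUser zu) {
        String sql = "INSERT INTO zhihu_user "
                + "(name, slogan, head_img_url, address, work, school, major, user_url) "
                + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/spider", "root", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, zu.name);
            ps.setString(2, zu.slogan);
            ps.setString(3, zu.headImgUrl);
            ps.setString(4, zu.address);
            ps.setString(5, zu.work);
            ps.setString(6, zu.school);
            ps.setString(7, zu.major);
            ps.setString(8, zu.userUrl);
            ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```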
So far I've only used the data to make a chart of where Zhihu users are located by region; I haven't gotten around to anything else.
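The counts behind a region distribution chart are just a GROUP BY over the address field; against the hypothetical zhihu_user table sketched above, it would be something like:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RegionStats {
    public static void main(String[] args) throws Exception {
        // Count users per location, skipping profiles with no address filled in.
        String sql = "SELECT address, COUNT(*) AS cnt FROM zhihu_user "
                + "WHERE address <> '' GROUP BY address ORDER BY cnt DESC";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/spider", "root", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("address") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```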
Once this busy stretch is over, I'll think about doing some more analysis on the data~~~~