Crawler Empire

The body of knowledge a crawler needs:

  • http://www.zhihu.com/question/20899988

  • URL fetching:

  • URL deduplication: Bloom filter (see the sketch after this list)

  • Distributed crawling
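As a sketch of the deduplication idea above: a Bloom filter answers "have I seen this URL?" in constant space, at the cost of occasional false positives (and no false negatives). The bit-array size and hash count below are illustrative assumptions, not tuned values.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL deduplication (illustrative sizes)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive several bit positions from salted MD5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # All bits set -> "probably seen"; any bit clear -> definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
for url in ["http://example.com/a", "http://example.com/a", "http://example.com/b"]:
    if url in seen:
        print("skip duplicate:", url)
    else:
        seen.add(url)
        print("fetch:", url)
```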

Scrapy — the typical duties of an item pipeline:

  1. Cleaning the HTML data

  2. Validating the parsed data (checking that items contain the required fields)

  3. Checking for duplicates (and dropping them)

  4. Storing the parsed data in a database
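A minimal sketch of a Scrapy item pipeline covering steps 1-3; the `title` and `url` fields are hypothetical, and the "cleaning" here is deliberately trivial.

```python
from scrapy.exceptions import DropItem

class CleanAndDedupePipeline:
    """Cleans, validates, and deduplicates items (assumes 'title'/'url' fields)."""

    def open_spider(self, spider):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # 1. Clean the HTML data (here: just strip whitespace).
        item["title"] = item.get("title", "").strip()

        # 2. Validate: required fields must be present.
        if not item["title"] or not item.get("url"):
            raise DropItem("missing required field")

        # 3. Drop duplicates by URL.
        if item["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen_urls.add(item["url"])

        # 4. Storage would typically happen in a later pipeline,
        #    e.g. an insert into a database.
        return item
```

To turn it on, register the class under ITEM_PIPELINES in the project's settings.py.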

Anyway:

  • spider.py
  • Here is a tutorial: http://blog.csdn.net/HanTangSongMing/article/details/24454453
    for those of you who want to give it a try.

And here is the Scrapy documentation online: http://doc.scrapy.org/en/0.24/. By the way, how do I download the html.zip?

Assignment: crawl all the blog pages of zhangxx. (A sketch follows below.)
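A minimal sketch of what such a spider might look like, written against a recent Scrapy release (the 0.24 docs linked above use slightly older APIs). The start URL and CSS selectors are placeholders, since the actual blog layout is not given here.

```python
import scrapy

class BlogSpider(scrapy.Spider):
    """Crawl every page of a blog, following pagination links."""
    name = "zhangxx_blog"
    start_urls = ["http://blog.example.com/zhangxx/"]  # hypothetical URL

    def parse(self, response):
        # Extract each post's title and link (hypothetical selectors).
        for post in response.css("div.post"):
            yield {
                "title": post.css("h2 a::text").get(),
                "url": response.urljoin(post.css("h2 a::attr(href)").get()),
            }
        # Follow the "next page" link until pagination runs out.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Run it with `scrapy runspider spider.py -o posts.json` to dump the scraped posts to JSON.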

http://github.windwild.net/2013/03/scrapy002/

That one goes a little bit deeper.

And here is a Chinese version of the documentation: http://download.csdn.net/detail/jasonding1354/8393855

Time to start becoming a crawler guru! 2015-03-29
