Scrapy

0. Basics:

1) Introduction to search-engine crawlers --> incremental crawlers and distributed crawlers

http://www.zouxiaoyang.com/archives/386.html

http://docs.pythontab.com/scrapy/scrapy0.24/intro/overview.html

scrapy crawl -s LOG_FILE=./logs/liter.log -s MONGODB_COLLECTION=literature literatureSpider

#http://doc.scrapy.org/en/latest/topics/jobs.html

scrapy crawl douban8590Spider -s JOBDIR=crawls/douban8590Spider -s MONGODB_DB=douban -s MONGODB_COLLECTION=book8590
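Per the jobs documentation linked above, JOBDIR persists the scheduler state to disk, so the spider can be stopped safely (Ctrl-C or a kill signal) and resumed later by re-running the same command with the same JOBDIR.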

1. Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:

from scrapy.spider import BaseSpider  # legacy name; use scrapy.Spider on newer Scrapy

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip newlines so Requests get clean URLs
            with open(filename, 'r') as f:
                self.start_urls = [line.strip() for line in f]

2. Scrapy can be configured so the crawler does not shut down automatically when the crawl finishes. How?
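One common approach (a sketch on my part; strictly it is the spider_idle signal rather than a setting, and the spider name here is hypothetical): raise DontCloseSpider from an idle handler, which keeps the engine running when the request queue empties.

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveSpider(scrapy.Spider):
    name = 'keepalive'  # hypothetical name, for illustration only

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(KeepAliveSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Fire our handler whenever the engine runs out of requests
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        # Raising DontCloseSpider keeps the crawler alive instead of closing
        raise DontCloseSpider()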

3. Problems that frequently come up with the Kuaidaili (快代理) SVIP proxy:

TCP connection timed out: 60: Operation timed out.

Connection was refused by other side: 61: Connection refused.

An error occurred while connecting: 65: No route to host.

504 Gateway Time-out

404 Not Found

501 Not Implemented
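A settings sketch for coping with these errors (the numbers are assumptions to tune, not recommendations). Scrapy's built-in RetryMiddleware already retries connection errors such as timeouts and refused connections; RETRY_HTTP_CODES extends retrying to the status codes listed above.

# settings.py -- sketch; note that setting RETRY_HTTP_CODES replaces the default list
RETRY_ENABLED = True
RETRY_TIMES = 5                      # extra attempts per request
RETRY_HTTP_CODES = [404, 501, 504]   # proxy-flakiness statuses seen above
DOWNLOAD_TIMEOUT = 30                # fail faster than the 180s default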

4. AttributeError: 'Response' object has no attribute 'body_as_unicode'

This error appears mainly when the site's response headers lack a Content-Type field; Scrapy then cannot tell what kind of page it fetched and hands your callback a plain Response (which has no body_as_unicode) instead of an HtmlResponse. The fix is simple:

Rewrite the parse method slightly:

from scrapy import Request
from scrapy.selector import Selector

def parse(self, response):
    # Build a Selector by hand, since a plain Response lacks body_as_unicode
    hxs = Selector(text=response.body.decode('utf-8'))
    detail_url_list = hxs.xpath('//li[@class="good-list"]/@href').extract()
    for url in detail_url_list:
        if 'goods' in url:
            yield Request(url, callback=self.parse_detail)

# This code snippet comes from: http://www.sharejs.com/codes/python/9049
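An alternative sketch (my assumption, not the original author's fix): wrap the raw body in an HtmlResponse so the usual response shortcuts work again.

from scrapy import Request
from scrapy.http import HtmlResponse

def parse(self, response):
    # Assumption: the page is UTF-8 HTML; re-wrap it so xpath() works
    html = HtmlResponse(url=response.url, body=response.body, encoding='utf-8')
    for url in html.xpath('//li[@class="good-list"]/@href').extract():
        if 'goods' in url:
            yield Request(url, callback=self.parse_detail)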

5. Speeding up a web scraper

Here's a collection of things to try:

- use the latest Scrapy version (if you are not already)

- check whether any non-standard middlewares are in play

- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs; see the settings sketch after this list)

- turn off logging with LOG_ENABLED = False (docs)

- try yielding items in a loop instead of collecting them into a list and returning it

- use a local caching DNS (see this thread)

- check whether the site throttles downloads and limits your speed (see this thread)

- log CPU and memory usage during the spider run to see if there are any problems there

- try running the same spider under the scrapyd service

- see if grequests + lxml will perform better (ask if you need any help implementing this solution)

- try running Scrapy on PyPy; see Running Scrapy on PyPy
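A minimal settings sketch for the concurrency and logging tips above (values are assumptions; tune them against the target site and your hardware):

# settings.py -- sketch of the tuning knobs mentioned above
CONCURRENT_REQUESTS = 100              # global cap (default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 32    # per-domain cap (default is 8)
LOG_ENABLED = False                    # skip logging overhead entirely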
