3 Scrapy Crawling (1)

Using the page http://quotes.toscrape.com/ as an example.
Command:
scrapy shell 'http://quotes.toscrape.com/'

In [4]: response.xpath('//*[@class="quote"]')
Out[4]: 
[... one selector per quote block on the page (output abridged; only the text of the first block is recoverable):
 \u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d
 by (about)
 Tags: change, deep-thoughts, thinking, world
 ...]

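Processing a single quote (here quote is one selector taken from the result above; the assignment itself is not shown):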

In [9]: quote.xpath('.//*[@class="text"]')
Out[9]: [<Selector xpath='.//*[@class="text"]' data=u'<span class="text" itemprop="text">\u201cThe '>]

In [10]: quote.xpath('.//*[@class="text"]/text()').extract()
Out[10]: [u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d']

In [11]: quote.xpath('.//*[@class="text"]/text()').extract_first()
Out[11]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

The examples above select by class; itemprop works as well:

text = quote.xpath('.//*[@itemprop="text"]/text()').extract_first()

In [13]: text
Out[13]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

For a single quote, if the leading dot . is left out:

In [16]: quote.xpath('//*[@itemprop="text"]/text()').extract()
Out[16]: 
[u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d',
 u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d',
 u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d',
 u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d',
 u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d",
 u'\u201cTry not to become a man of success. Rather become a man of value.\u201d',
 u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d',
 u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d",
 u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d",
 u'\u201cA day without sunshine is like, you know, night.\u201d']

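This looks a bit magical at first, but it follows from XPath rules: a path that starts with // is absolute and searches the whole document no matter which selector it is called on, whereas .// searches only the descendants of the current node. That is why all ten quotes on the page come back here.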
To get the content attribute of a meta tag, the syntax is slightly different:

quote.xpath('.//*[@itemprop="keywords"]/@content').extract()
Out[20]: [u'change,deep-thoughts,thinking,world']
