Common selector syntax

# Run the following command to start the Scrapy shell, which provides a shell variable named response
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

# Get the base URL and some image links, using both XPath and CSS:
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']
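
The hrefs extracted above are relative, while the base URL is absolute. As a minimal sketch of how the two could be combined, the snippet below joins them with the standard library's urljoin. It assumes Python 3 (in Python 2 the function lives in the urlparse module), and the helper name absolutize is hypothetical, not part of Scrapy.

```python
from urllib.parse import urljoin

# Values copied from the shell session above.
base_url = "http://example.com/"
hrefs = ["image1.html", "image2.html", "image3.html"]


def absolutize(base, relative_urls):
    """Resolve each relative URL against the base URL (hypothetical helper)."""
    return [urljoin(base, url) for url in relative_urls]


absolute = absolutize(base_url, hrefs)
for url in absolute:
    print(url)
```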

# Nested selectors
# First, select all links whose URL contains "image"
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
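[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

# Then query each link with XPath expressions relative to that link
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args
...
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']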
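
The nested pattern (select a set of elements first, then query relative to each one) can also be tried without Scrapy. The sketch below is not Scrapy's API: it swaps in the standard library's xml.etree.ElementTree, whose limited XPath subset has no contains() and no attribute steps such as /@href, so the substring filter and attribute reads are done in Python. The SAMPLE markup is an assumed, shortened stand-in for the sample page, and image_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the sample page, shortened to two image links
# plus one unrelated link so the filter has something to reject.
SAMPLE = """
<html>
 <head><base href='http://example.com/' /></head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='about.html'>About</a>
  </div>
 </body>
</html>
"""


def image_links(markup):
    """Return (href, [img src, ...]) for each link whose href contains 'image'."""
    root = ET.fromstring(markup)
    results = []
    # Step 1: select every <a> element that carries an href attribute.
    for link in root.findall(".//a[@href]"):
        href = link.get("href")
        # ElementTree lacks contains(), so the substring test happens here.
        if "image" not in href:
            continue
        # Step 2: query relative to this link only (its direct <img> children).
        srcs = [img.get("src") for img in link.findall("img")]
        results.append((href, srcs))
    return results


for index, (href, srcs) in enumerate(image_links(SAMPLE)):
    print("Link number %d points to url %s and image %s" % (index, href, srcs))
```

The relative path "img" in step 2 plays the same role as link.xpath('img/@src') in the shell session: it is evaluated against the current link rather than the whole document.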

The examples above cover only part of the selector syntax. For details, see:
http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html#topics-selectors
