Python: Learning and Teaching Web Scraping (6): Using Selectors in Scrapy

Reference: the official Scrapy documentation, https://docs.scrapy.org/en/latest/

Sample page to scrape: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

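The complete HTML of the page:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
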
First, open the page in the Scrapy shell:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Then, at the shell prompt, try the following.

Using the xpath() method:

In [4]: response.xpath("//title/text()")
Out[4]: [<Selector xpath='//title/text()' data='Example website'>]

Extracting the text:

In [5]: response.xpath("//title/text()").extract_first()
Out[5]: 'Example website'

In [7]: response.xpath("//title/text()").extract()
Out[7]: ['Example website']

Alternatively:

In [8]: response.xpath("//title/text()").get()
Out[8]: 'Example website'

In [10]: response.xpath("//title/text()").getall()
Out[10]: ['Example website']

Using the css() method:

In [12]: response.css("title::text")
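Out[12]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. Because a SelectorList offers the same two methods, calls can be chained to select nested data quickly: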

In [17]: response.css("img").xpath("@src")
Out[17]:
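[<Selector xpath='@src' data='image1_thumb.jpg'>,
 <Selector xpath='@src' data='image2_thumb.jpg'>,
 <Selector xpath='@src' data='image3_thumb.jpg'>,
 <Selector xpath='@src' data='image4_thumb.jpg'>,
 <Selector xpath='@src' data='image5_thumb.jpg'>]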

In [18]: response.css("img").xpath("@src").get()
Out[18]: 'image1_thumb.jpg'

Selecting by attribute with XPath:

In [20]: response.xpath("//div[@id='images']/a/text()")
Out[20]:
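[<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 1 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 2 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 3 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 4 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 5 '>]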

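A similar selection with CSS (here taking the href attribute and the img elements rather than the link text):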

In [28]: response.css("div[id='images'] a").xpath("@href")
Out[28]:
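[<Selector xpath='@href' data='image1.html'>,
 <Selector xpath='@href' data='image2.html'>,
 <Selector xpath='@href' data='image3.html'>,
 <Selector xpath='@href' data='image4.html'>,
 <Selector xpath='@href' data='image5.html'>]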

In [32]: response.css("div[id='images'] a>img")
Out[32]:
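[<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image5_thumb.jpg">'>]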

In [33]: response.css("div[id='images']>a>img")
Out[33]:
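[<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image5_thumb.jpg">'>]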


.get() returns None if no element was found:

In [38]: response.xpath('//div[@id="not-exists"]/text()').get() is None
Out[38]: True

A default value can be supplied for the case where nothing is found:

In [40]: response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
Out[40]: 'not-found'

Instead of an XPath such as '@src', attributes can also be read through the .attrib property of a Selector. In other words, besides the XPath form [@name], you can write selector.attrib['name']:

In [41]: [img.attrib['src'] for img in response.css('img')]
Out[41]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

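Note the difference: when .attrib is used on the SelectorList itself, only the attributes of the first matching element are returned: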

In [42]: response.css("img").attrib['src']
Out[42]: 'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:

>>> response.css('base').attrib['href']
'http://example.com/'

Three equivalent ways to get an attribute value:

In [43]: response.xpath("//base/@href").get()
Out[43]: 'http://example.com/'

In [44]: response.css("base::attr(href)").get()
Out[44]: 'http://example.com/'

In [48]: response.css("base").attrib["href"]
Out[48]: 'http://example.com/'

Selecting elements by a condition on an attribute value (here, href contains "image"):

With XPath:

In [50]: response.xpath("//a[contains(@href,'image')]/@href").getall()
Out[50]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [52]: response.xpath("//a[contains(@href,'image')]/img/@src").getall()
Out[52]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

With CSS:

In [54]: response.css("a[href*=image]::attr(href)").getall()
Out[54]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [55]: response.css("a[href*=image] img::attr(src)").getall()
Out[55]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']


Scrapy's extensions to CSS selectors:

  • to select text nodes, use ::text, e.g. title::text

  • to select attribute values, use ::attr(name), where name is the name of the attribute whose value you want

*::text selects all descendant text nodes of the current selector context:

In [62]: response.css("#images *::text").getall()
Out[62]:
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']


Iteration example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
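['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']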
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

In [65]: [a.xpath("@href") for a in response.css("a")]
Out[65]:
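[[<Selector xpath='@href' data='image1.html'>],
 [<Selector xpath='@href' data='image2.html'>],
 [<Selector xpath='@href' data='image3.html'>],
 [<Selector xpath='@href' data='image4.html'>],
 [<Selector xpath='@href' data='image5.html'>]]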

In [66]: [a.xpath("@href").get() for a in response.css("a")]
Out[66]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [68]: [a.attrib["href"] for a in response.css("a")]
Out[68]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']


Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
>>> xp("(//li)[1]")
['<li>1</li>']

     
