Reference: the official Scrapy documentation, https://docs.scrapy.org/en/latest/
Sample page used in the examples: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
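Full HTML source of the page:
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>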
First, open the page in the Scrapy shell:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Then, at the shell prompt:
The xpath method:
In [4]: response.xpath("//title/text()")
Out[4]: [<Selector xpath='//title/text()' data='Example website'>]
To get the text:
In [5]: response.xpath("//title/text()").extract_first()
Out[5]: 'Example website'
In [7]: response.xpath("//title/text()").extract()
Out[7]: ['Example website']
Alternatively, use .get() and .getall():
In [8]: response.xpath("//title/text()").get()
Out[8]: 'Example website'
In [10]: response.xpath("//title/text()").getall()
Out[10]: ['Example website']
The css method:
In [12]: response.css("title::text")
Out[12]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:
In [17]: response.css("img").xpath("@src")
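Out[17]:
[<Selector xpath='@src' data='image1_thumb.jpg'>,
 <Selector xpath='@src' data='image2_thumb.jpg'>,
 <Selector xpath='@src' data='image3_thumb.jpg'>,
 <Selector xpath='@src' data='image4_thumb.jpg'>,
 <Selector xpath='@src' data='image5_thumb.jpg'>]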
In [18]: response.css("img").xpath("@src").get()
Out[18]: 'image1_thumb.jpg'
Finding elements by attribute with XPath:
In [20]: response.xpath("//div[@id='images']/a/text()")
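Out[20]:
[<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 1 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 2 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 3 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 4 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 5 '>]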
The CSS counterparts:
In [28]: response.css("div[id='images'] a").xpath("@href")
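Out[28]:
[<Selector xpath='@href' data='image1.html'>,
 <Selector xpath='@href' data='image2.html'>,
 <Selector xpath='@href' data='image3.html'>,
 <Selector xpath='@href' data='image4.html'>,
 <Selector xpath='@href' data='image5.html'>]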
In [32]: response.css("div[id='images'] a>img")
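Out[32]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image5_thumb.jpg">'>]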
In [33]: response.css("div[id='images']>a>img")
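Out[33]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image5_thumb.jpg">'>]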
.get() returns None if no element was found:
In [38]: response.xpath('//div[@id="not-exists"]/text()').get() is None
Out[38]: True
Set a default value to return when nothing is found:
In [40]: response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
Out[40]: 'not-found'
Instead of using e.g. '@src' in XPath, it is possible to query for attributes using the .attrib property of a Selector. In other words, besides XPath's @name syntax, you can also write element.attrib['name']:
In [41]: [img.attrib['src'] for img in response.css('img')]
Out[41]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
Note the difference: called on the whole SelectorList, .attrib returns only the first match's attributes:
In [42]: response.css("img").attrib['src']
Out[42]: 'image1_thumb.jpg'
This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:
>>> response.css('base').attrib['href']
'http://example.com/'
Three ways to get an attribute value:
In [43]: response.xpath("//base/@href").get()
Out[43]: 'http://example.com/'
In [44]: response.css("base::attr(href)").get()
Out[44]: 'http://example.com/'
In [48]: response.css("base").attrib["href"]
Out[48]: 'http://example.com/'
Selecting elements whose attribute matches a condition:
XPath:
In [50]: response.xpath("//a[contains(@href,'image')]/@href").getall()
Out[50]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In [52]: response.xpath("//a[contains(@href,'image')]/img/@src").getall()
Out[52]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
CSS:
In [54]: response.css("a[href*=image]::attr(href)").getall()
Out[54]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In [55]: response.css("a[href*=image] img::attr(src)").getall()
Out[55]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
To select text nodes, use ::text, e.g. title::text.
To select attribute values, use ::attr(name), where name is the name of the attribute whose value you want.
*::text selects all descendant text nodes of the current selector context:
In [62]: response.css("#images *::text").getall()
Out[62]:
['\n ',
'Name: My image 1 ',
'\n ',
'Name: My image 2 ',
'\n ',
'Name: My image 3 ',
'\n ',
'Name: My image 4 ',
'\n ',
'Name: My image 5 ',
'\n ']
Iteration example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
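['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']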
>>> for index, link in enumerate(links):
... args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
... print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'
In [65]: [a.xpath("@href") for a in response.css("a")]
Out[65]:
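[[<Selector xpath='@href' data='image1.html'>],
 [<Selector xpath='@href' data='image2.html'>],
 [<Selector xpath='@href' data='image3.html'>],
 [<Selector xpath='@href' data='image4.html'>],
 [<Selector xpath='@href' data='image5.html'>]]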
In [66]: [a.xpath("@href").get() for a in response.css("a")]
Out[66]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In [68]: [a.attrib["href"] for a in response.css("a")]
Out[68]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
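>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
>>> xp("(//li)[1]")
['<li>1</li>']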