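Selectors

When you scrape web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this:

BeautifulSoup is a very popular web-parsing library among Python programmers. It builds a Python object based on the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow.

lxml is an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy has its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified by either XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.

Using selectors

Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing either a string of text or a TextResponse object. The selector automatically chooses the best parsing rules (XML vs HTML) based on the input type: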

>>> from scrapy.selector import Selector

>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'

>>> Selector(text=body).xpath('//span/text()').extract()

[u'good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com', body=body)

>>> Selector(response=response).xpath('//span/text()').extract()

[u'good']

For convenience, response objects expose a selector through the .selector attribute, and it is fine to use this shortcut whenever possible:

>>> response.selector.xpath('//span/text()').extract()

[u'good']

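Using selectors

To explain how to use the selectors, we will use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server:

http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here is its HTML code: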

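<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>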

First, open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

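Then, after the shell loads, the response is available as the response shell variable, with its attached selector in the response.selector attribute.

Since we are dealing with HTML, the selector automatically uses an HTML parser.

So, by looking at the HTML code of that page, let's construct an XPath for selecting the text inside the title tag: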

>>> response.selector.xpath('//title/text()')

[<Selector (text) xpath=//title/text()>]

Querying responses with XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')

[<Selector (text) xpath=//title/text()>]

>>> response.css('title::text')

[<Selector (text) xpath=//title/text()>]

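As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data.

To actually extract the textual data, you must call the selector's .extract() method, as follows: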

>>> response.xpath('//title/text()').extract()

[u'Example website']

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

>>> response.css('title::text').extract()

[u'Example website']

Now we are going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()

[u'http://example.com/']

>>> response.css('base::attr(href)').extract()

[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()

[u'image1.html',

u'image2.html',

u'image3.html',

u'image4.html',

u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()

[u'image1.html',

u'image2.html',

u'image3.html',

u'image4.html',

u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()

[u'image1_thumb.jpg',

u'image2_thumb.jpg',

u'image3_thumb.jpg',

u'image4_thumb.jpg',

u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()

[u'image1_thumb.jpg',

u'image2_thumb.jpg',

u'image3_thumb.jpg',

u'image4_thumb.jpg',

u'image5_thumb.jpg']

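Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: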

>>> links = response.xpath('//a[contains(@href, "image")]')

>>> links.extract()

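[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',

u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',

u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',

u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',

u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     # each link is a Selector, so these queries are relative to that <a>
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)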

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']

Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']

Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']

Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']

Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

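Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of unicode strings, so you cannot construct nested .re() calls.

Here is an example that extracts image names from the HTML code above: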

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

[u'My image 1',

u'My image 2',

u'My image 3',

u'My image 4',

u'My image 5']

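Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document and not relative to the Selector you are calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the whole document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

For more details about relative XPaths, see the Location Paths section in the XPath specification.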

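Using EXSLT extensions

Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:

prefix   namespace                               usage
re       http://exslt.org/regular-expressions    regular expressions
set      http://exslt.org/sets                   set manipulation

Regular expressions

The test() function, for example, can prove quite useful when XPath's starts-with() or contains() are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit: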

>>> from scrapy import Selector

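>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """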

>>> sel = Selector(text=doc, type="html")

>>> sel.xpath('//li//@href').extract()

[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']

>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()

[u'link1.html', u'link2.html', u'link4.html', u'link5.html']

>>>

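Warning

The C library libxslt does not natively support EXSLT regular expressions, so lxml's implementation uses hooks to Python's re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

Set operations

These can be handy for excluding parts of a document tree before extracting text elements, for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops: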

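>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """

>>> sel = Selector(text=doc, type="html")  # rebuild the selector for the new document

>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope: %s" % scope.xpath('@itemtype').extract())
...     # all itemprops under this scope, minus those owned by a nested itemscope
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print("    properties: %s" % props.extract())
...     print("")
...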

current scope: [u'http://schema.org/Product']

properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']

properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']

properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']

properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']

properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']

properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']

properties: [u'worstRating', u'ratingValue', u'bestRating']

>>>

Here we first iterate over the itemscope elements, and for each one we look for all itemprops elements and exclude those that are themselves inside another itemscope.

Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub's blog. If you are not very familiar with XPath yet, you may want to take a look at this XPath tutorial first.

Using text nodes in a condition

When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or starts-with(), it results in the text for the first element only.

Example:

>>> from scrapy import Selector

>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Converting a node-set to string:

>>> sel.xpath('//a//text()').extract() # take a peek at the node-set

[u'Click here to go to the ', u'Next Page']

>>> sel.xpath("string(//a[1]//text())").extract() # convert it to string

[u'Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> sel.xpath("//a[1]").extract() # select the first node

[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

>>> sel.xpath("string(//a[1])").extract() # convert it to string

[u'Click here to go to the Next Page']

So, using the .//text() node-set won't select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()

[]

But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()

[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

Beware the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

>>> from scrapy import Selector

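>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")

>>> xp = lambda x: sel.xpath(x).extract()

This gets all first <li> elements under whatever their parent is:

>>> xp("//li[1]")

[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")

[u'<li>1</li>']

This gets all first <li> elements under a <ul> parent:

>>> xp("//ul/li[1]")

[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element under a <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")

[u'<li>1</li>']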


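Selector examples on HTML response

Select all <h1> elements from the HTML response body, returning a list of Selector objects (that is, a SelectorList object):

sel.xpath("//h1")

Extract the text of all <h1> elements from the HTML response body, returning a list of unicode strings:

sel.xpath("//h1").extract()         # this includes the h1 tag

sel.xpath("//h1/text()").extract()  # this excludes the h1 tag

Iterate over all <p> tags and print their class attribute:

for node in sel.xpath("//p"):
    print(node.xpath("@class").extract())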
