The core technique for extracting data from a page is parsing the HTML text of an HTTP response. In Python, the following modules are commonly used for this task:

| BeautifulSoup | lxml |
|---|---|
| A very popular HTML parsing library with a clean, easy-to-use API, but relatively slow parsing. | An XML/HTML parsing library built on libxml2 (written in C); much faster parsing, but a comparatively complex API. |

Scrapy combines the strengths of both in its Selector class, which is built on lxml and offers a simplified API. In Scrapy, you use a Selector object to extract data from a page: first select the part of the page you need with an XPath or CSS selector expression, then extract it. The following introduces how Selector objects are used.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = '''
<html lang="en">
<head>
    <title>Scrapy Study</title>
</head>
<body>
    <h1>Hello World</h1>
    <h2>ayouleyang</h2>
    <b>yangyou</b>
    <ul>
        <li>Python</li>
        <li>Scrapy</li>
        <li>html</li>
    </ul>
</body>
</html>
'''
Construct a Selector object from a Response object by passing the response to the Selector constructor's response parameter:
>>> result = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
>>> selector = Selector(response=result)
>>> print(selector)
<Selector xpath=None data='<html lang="en">\n<head>\n    <title>Scra'>
>>>
Call the Selector object's xpath or css method to select an element or part of the document:
>>> selector_h1 = selector.xpath('//h1')
>>> print(selector_h1)
[<Selector xpath='//h1' data='<h1>Hello World</h1>'>]
>>> selector_li = selector.xpath('//li')
>>> print(selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>,
 <Selector xpath='//li' data='<li>Scrapy</li>'>,
 <Selector xpath='//li' data='<li>html</li>'>]
>>>
The xpath and css methods return a SelectorList object. SelectorList supports the list interface, so you can iterate over the Selector objects it contains with a for loop:
>>> for li in selector_li:
...     print(li.xpath('./text()'))
...
[<Selector xpath='./text()' data='Python'>]
[<Selector xpath='./text()' data='Scrapy'>]
[<Selector xpath='./text()' data='html'>]
>>>
SelectorList objects also have xpath and css methods:
>>> lis = selector.xpath('.//ul').css('li').xpath('./text()')
>>> print (lis)
[<Selector xpath='./text()' data='Python'>,
<Selector xpath='./text()' data='Scrapy'>,
<Selector xpath='./text()' data='html'>]
>>>
Calling one of the following methods on a Selector or SelectorList object extracts the selected content.
The extract method:
>>> selector_li = selector.xpath('//li')
>>> print(selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>,
 <Selector xpath='//li' data='<li>Scrapy</li>'>,
 <Selector xpath='//li' data='<li>html</li>'>]
>>>
>>> print (selector_li[0].extract())
<li>Python</li>
>>>
>>> li = selector.xpath('.//li/text()')
>>> print (li)
[<Selector xpath='.//li/text()' data='Python'>,
<Selector xpath='.//li/text()' data='Scrapy'>,
<Selector xpath='.//li/text()' data='html'>]
>>>
>>> print(li.extract())
['Python', 'Scrapy', 'html']
>>>
>>> print (li[0].extract())
Python
>>>
>>> print (li[1].extract())
Scrapy
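Note that indexing a SelectorList, as in li[0].extract(), raises an IndexError when nothing matched. Scrapy's SelectorList also provides extract_first(default=None) for this case; its behavior can be sketched in plain Python (first_or_default below is a hypothetical helper for illustration, not a Scrapy API):

```python
def first_or_default(extracted, default=None):
    """Return the first extracted string, or a default when the list is empty."""
    return extracted[0] if extracted else default

# Mirrors SelectorList.extract_first(): no IndexError on an empty result.
print(first_or_default(['Python', 'Scrapy', 'html']))  # → Python
print(first_or_default([], default='N/A'))             # → N/A
```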
Extract the title text:
>>> title = selector.xpath('.//title/text()')
>>> print (title)
[<Selector xpath='.//title/text()' data='Scrapy Study'>]
>>> print (title.extract())
['Scrapy Study']
>>> print (title[0].extract())
Scrapy Study
>>>
Targeted extraction of the content of ul > li:
>>> html = '''
<ul>
    <li>Python编程<b>价格:32.00元</b></li>
    <li>精通Scrapy<b>价格:12.00元</b></li>
    <li>html知识<b>价格:52.00元</b></li>
</ul>
'''
>>> selector = Selector(text=html)
>>> li = selector.xpath('.//ul/li/text()')
>>> print (li)
[<Selector xpath='.//ul/li/text()' data='Python编程'>,
<Selector xpath='.//ul/li/text()' data='精通Scrapy'>,
<Selector xpath='.//ul/li/text()' data='html知识'>]
>>> li = selector.xpath('.//ul/li/text()').extract()
>>> print (li)
['Python编程', '精通Scrapy', 'html知识']
>>> li = selector.xpath('.//ul/li/b/text()').extract()
>>> print (li)
['价格:32.00元', '价格:12.00元', '价格:52.00元']
>>> li = selector.xpath('.//ul/li/b/text()').re(r'\d+\.\d+')  # extract only the numbers
>>> print (li)
['32.00', '12.00', '52.00']
>>>
>>> li = selector.xpath('.//ul/li/b/text()').re_first(r'\d+\.\d+')
>>> print(li)
32.00
>>>
>>> li = selector.xpath('.//ul/li[2]/b/text()').re(r'\d+\.\d+')  # li[2] selects the second li element
>>> print(li[0])  # [0] takes the first item of the list
12.00
>>>
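The re and re_first methods essentially apply a regular expression to each extracted string. The same extraction can be sketched with only the standard library's re module (the price strings are copied from the example above):

```python
import re

prices = ['价格:32.00元', '价格:12.00元', '价格:52.00元']

# Equivalent of SelectorList.re(r'\d+\.\d+'): collect every match across the strings.
all_matches = [m for s in prices for m in re.findall(r'\d+\.\d+', s)]
print(all_matches)  # → ['32.00', '12.00', '52.00']

# Equivalent of re_first: stop at the first match found.
first = next((m for s in prices for m in re.findall(r'\d+\.\d+', s)), None)
print(first)  # → 32.00
```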
For a complete XPath syntax reference, see the Runoob (菜鸟教程) XPath tutorial. First create an HTML document; the examples below then demonstrate what XPath can do.
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> html = '''
<html lang="en">
<head>
    <title>Xpath study</title>
</head>
<body>
<div id="images">
    <a href="image1.html">Name:图片1<br><img src="image1.jpg"></a>
    <a href="image2.html">Name:图片2<br><img src="image2.jpg"></a>
    <a href="image3.html">Name:图片3<br><img src="image3.jpg"></a>
    <a href="image4.html">Name:图片4<br><img src="image4.jpg"></a>
    <a href="image5.html">Name:图片5<br><img src="image5.jpg"></a>
</div>
</body>
</html>
'''
>>> response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
>>> response.xpath('/html')
[<Selector xpath='/html' data='<html lang="en">\n<head>\n    <title>Xpat'>]
>>>
>>> response.xpath('/html/body/div/a')
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name:图片1<br><img'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name:图片2<br><img'>,
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name:图片3<br><img'>,
 <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name:图片4<br><img'>,
 <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name:图片5<br><img'>]
>>>
>>> name = response.xpath('.//a/text()')
>>> name
[<Selector xpath='.//a/text()' data='Name:图片1'>,
<Selector xpath='.//a/text()' data='Name:图片2'>,
<Selector xpath='.//a/text()' data='Name:图片3'>,
<Selector xpath='.//a/text()' data='Name:图片4'>,
<Selector xpath='.//a/text()' data='Name:图片5'>]
>>>
>>> name.extract()
['Name:图片1', 'Name:图片2', 'Name:图片3', 'Name:图片4', 'Name:图片5']
>>>
# select all child elements of each a (here, a br and an img per link)
>>> response.xpath('/html/body/div/a/*')
[<Selector xpath='/html/body/div/a/*' data='<br>'>,
 <Selector xpath='/html/body/div/a/*' data='<img src="image1.jpg">'>,
 <Selector xpath='/html/body/div/a/*' data='<br>'>,
 <Selector xpath='/html/body/div/a/*' data='<img src="image2.jpg">'>,
 <Selector xpath='/html/body/div/a/*' data='<br>'>,
 <Selector xpath='/html/body/div/a/*' data='<img src="image3.jpg">'>,
 <Selector xpath='/html/body/div/a/*' data='<br>'>,
 <Selector xpath='/html/body/div/a/*' data='<img src="image4.jpg">'>,
 <Selector xpath='/html/body/div/a/*' data='<br>'>,
 <Selector xpath='/html/body/div/a/*' data='<img src="image5.jpg">'>]
# select all img elements that are grandchildren of div
>>> response.xpath('//div/*/img')
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
>>>
# select the src attribute of every img
>>> response.xpath('//img/@src')
[<Selector xpath='//img/@src' data='image1.jpg'>,
<Selector xpath='//img/@src' data='image2.jpg'>,
<Selector xpath='//img/@src' data='image3.jpg'>,
<Selector xpath='//img/@src' data='image4.jpg'>,
<Selector xpath='//img/@src' data='image5.jpg'>]
>>>
>>> response.xpath('//img/@src').extract()
['image1.jpg', 'image2.jpg', 'image3.jpg', 'image4.jpg', 'image5.jpg']
>>>
# select every href attribute in the document
>>> response.xpath('//@href')
[<Selector xpath='//@href' data='image1.html'>,
<Selector xpath='//@href' data='image2.html'>,
<Selector xpath='//@href' data='image3.html'>,
<Selector xpath='//@href' data='image4.html'>,
<Selector xpath='//@href' data='image5.html'>]
>>>
# get all attributes of the img inside the first a (here there is only a src attribute)
>>> response.xpath('//a[1]/img/@*')
[<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]
>>>
# get a Selector object for the first a
>>> img = response.xpath('//a')[0]
>>> img
<Selector xpath='//a' data='<a href="image1.html">Name:图片1<br><img'>
>>>
>>>
# trying to find every img inside the first a, but getting every img in the document,
# because // is an absolute path that searches from the document root
>>> img.xpath('//img')
[<Selector xpath='//img' data='<img src="image1.jpg">'>,
 <Selector xpath='//img' data='<img src="image2.jpg">'>,
 <Selector xpath='//img' data='<img src="image3.jpg">'>,
 <Selector xpath='//img' data='<img src="image4.jpg">'>,
 <Selector xpath='//img' data='<img src="image5.jpg">'>]
>>>
>>>
# use .// to select every img among the current node's descendants
>>> img.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
>>>
# .. selects the parent of each img (here, the enclosing a elements)
>>> response.xpath('//img/..')
[<Selector xpath='//img/..' data='<a href="image1.html">Name:图片1<br><img'>,
 <Selector xpath='//img/..' data='<a href="image2.html">Name:图片2<br><img'>,
 <Selector xpath='//img/..' data='<a href="image3.html">Name:图片3<br><img'>,
 <Selector xpath='//img/..' data='<a href="image4.html">Name:图片4<br><img'>,
 <Selector xpath='//img/..' data='<a href="image5.html">Name:图片5<br><img'>]
# select the third of all a elements
>>> response.xpath('//a[3]')
[<Selector xpath='//a[3]' data='<a href="image3.html">Name:图片3<br><img'>]
>>>
>>>
# use the last() function to select the last one
>>> response.xpath('//a[last()]')
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name:图片5<br><img'>]
>>>
>>>
# use the position() function to select the first three
>>> response.xpath('//a[position()<=3]')
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name:图片1<br><img'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name:图片2<br><img'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name:图片3<br><img'>]
>>>
>>>
# select all div elements that have an id attribute
>>> response.xpath('//div[@id]')
[<Selector xpath='//div[@id]' data='<div id="images">\n    <a href="image1.ht'>]
>>>
>>>
# select all div elements whose id attribute equals "images"
>>> response.xpath('//div[@id="images"]')
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n    <a href="image1.ht'>]
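The predicates shown above are not Scrapy-specific. As a cross-check, here is a minimal sketch of a few of the same expressions using only the standard library's xml.etree.ElementTree, which supports a limited subset of XPath (note it does not support position(), and the document here is a reduced stand-in for the example above):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('''
<html><body>
<div id="images">
    <a href="image1.html"><img src="image1.jpg"/></a>
    <a href="image2.html"><img src="image2.jpg"/></a>
    <a href="image3.html"><img src="image3.jpg"/></a>
</div>
</body></html>''')

# //div/*/img equivalent: img grandchildren of div
imgs = doc.findall('.//div/*/img')
print([img.get('src') for img in imgs])  # → ['image1.jpg', 'image2.jpg', 'image3.jpg']

# //a[3] equivalent: positional predicate
third = doc.findall('.//a[3]')
print([a.get('href') for a in third])    # → ['image3.html']

# //a[last()] equivalent
last = doc.findall('.//a[last()]')
print([a.get('href') for a in last])     # → ['image3.html']

# //div[@id="images"] equivalent: attribute-value predicate
divs = doc.findall(".//div[@id='images']")
print([d.get('id') for d in divs])       # → ['images']
```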
2.2 Common Functions
XPath also provides many functions, covering numbers, strings, time, dates, statistics, and more.
- string(arg): returns the string value of the argument.
>>> from scrapy.selector import Selector
>>> html = '<a><b>阿优乐扬</b>的博客</a>'
>>> sel = Selector(text=html)
>>> sel
<Selector xpath=None data='<html><body><a><b>阿优乐扬</b>的博客</a></bod'>
>>> sel.xpath('/html/body/a/text()')
[<Selector xpath='/html/body/a/text()' data='的博客'>]
>>> sel.xpath('/html/body/a/b/text()')
[<Selector xpath='/html/body/a/b/text()' data='阿优乐扬'>]
# to get the whole string inside a (阿优乐扬的博客) at once, text() alone is not enough
>>> sel.xpath('/html/body/a//text()').extract()
['阿优乐扬', '的博客']
>>>
# in this case, use the string() function
>>> sel.xpath('string(/html/body/a)').extract()
['阿优乐扬的博客']
>>>
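XPath's string() behaves like concatenating all descendant text nodes in document order. A minimal stdlib sketch of the same idea, using xml.etree.ElementTree's itertext() on the fragment above:

```python
import xml.etree.ElementTree as ET

# The same fragment as above (already well-formed XML).
a = ET.fromstring('<a><b>阿优乐扬</b>的博客</a>')

# //text() equivalent: the individual text nodes.
parts = list(a.itertext())
print(parts)           # → ['阿优乐扬', '的博客']

# string() equivalent: all descendant text concatenated in document order.
print(''.join(parts))  # → 阿优乐扬的博客
```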
- contains(str1, str2): returns whether str1 contains str2 (a boolean).
>>> from scrapy.selector import Selector
>>> html = '''
<p class="Nickname">阿优乐扬</p>
<p class="name">Youle</p>
'''
>>> sel = Selector(text=html)
# select p elements whose class attribute contains "Nic"
>>> sel.xpath('//p[contains(@class,"Nic")]')
[<Selector xpath='//p[contains(@class,"Nic")]' data='<p class="Nickname">阿优乐扬</p>'>]
# select p elements whose class attribute contains "name"
>>> sel.xpath('//p[contains(@class,"name")]')
[<Selector xpath='//p[contains(@class,"name")]' data='<p class="Nickname">阿优乐扬</p>'>,
 <Selector xpath='//p[contains(@class,"name")]' data='<p class="name">Youle</p>'>]
>>>
3. CSS Selectors
CSS (Cascading Style Sheets) selectors are a language for locating parts of an HTML document. CSS selector syntax is somewhat simpler than XPath's, but less powerful. In fact, when you call a Selector object's css method, Scrapy internally uses the Python library cssselect to translate the CSS selector expression into an XPath expression, and then calls the Selector object's xpath method.
First, create an HTML document and construct an HtmlResponse object from it.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = '''
<html lang="en">
<head>
    <title>CSS选择器 study</title>
</head>
<body>
    <div id="images1" style="width: 50%;">
        <a href="image1.html">Name:图片1<br><img src="image1.jpg"></a>
        <a href="image2.html">Name:图片2<br><img src="image2.jpg"></a>
        <a href="image3.html">Name:图片3<br><img src="image3.jpg"></a>
    </div>
    <div id="images2">
        <a href="image4.html">Name:图片4<br><img src="image4.jpg"></a>
        <a href="image5.html">Name:图片5<br><img src="image5.jpg"></a>
    </div>
</body>
</html>
'''
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
- E: select elements of type E
# select all img elements
>>> response.css('img')
[<Selector xpath='descendant-or-self::img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image5.jpg">'>]
>>>
- E1,E2: select both E1 and E2 elements
# select all title and div elements
>>> response.css('title,div')
[<Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<title>CSS选择器 study</title>'>,
 <Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<div id="images1" style="width: 50%;">\n'>,
 <Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<div id="images2">\n        <a href="ima'>]
>>>
- E1 E2: select E2 elements among the descendants of E1
# img elements among the descendants of div
>>> response.css('div img')
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image5.jpg">'>]
>>>
- E1>E2: select E2 elements among the children of E1
>>> response.css('body>div')
[<Selector xpath='descendant-or-self::body/div' data='<div id="images1" style="width: 50%;">\n'>,
 <Selector xpath='descendant-or-self::body/div' data='<div id="images2">\n        <a href="ima'>]
- [ATTR]: select elements that have an ATTR attribute
>>> response.css('[style]')
[<Selector xpath='descendant-or-self::*[@style]' data='<div id="images1" style="width: 50%;">\n'>]
>>>
- [ATTR=VALUE]: select elements whose ATTR attribute equals VALUE
>>> response.css('[id="images1"]')
[<Selector xpath="descendant-or-self::*[@id = 'images1']" data='<div id="images1" style="width: 50%;">\n'>]
>>>
- E:nth-child(n): select E elements that are the n-th child of their parent
# select the first a of each div
>>> response.css('div>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image1.html">Name:图片1<br><img'>,
 <Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name:图片4<br><img'>]
>>>
# select the first a of the second div
>>> response.css('div:nth-child(2)>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 1]/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name:图片4<br><img'>]
>>>
- E:first-child: select E elements that are the first child of their parent
- E:last-child: select E elements that are the last child of their parent
# select the last a of the first div
>>> response.css('div:first-child>a:last-child')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 0]/a[count(following-sibling::*) = 0]' data='<a href="image3.html">Name:图片3<br><img'>]
>>>
- E::text: select the text nodes of E elements
# select the text of all a elements
>>> response.css('a::text')
[<Selector xpath='descendant-or-self::a/text()' data='Name:图片1'>,
<Selector xpath='descendant-or-self::a/text()' data='Name:图片2'>,
<Selector xpath='descendant-or-self::a/text()' data='Name:图片3'>,
<Selector xpath='descendant-or-self::a/text()' data='Name:图片4'>,
<Selector xpath='descendant-or-self::a/text()' data='Name:图片5'>]
>>> response.css('a::text').extract()
['Name:图片1', 'Name:图片2', 'Name:图片3', 'Name:图片4', 'Name:图片5']
>>>
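Conceptually, a::text just collects the text nodes inside each a element. For comparison, here is a minimal hand-rolled sketch of the same extraction using only the standard library's html.parser (an illustration of the idea, not how Scrapy implements it):

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the text found inside <a> elements, one entry per <a>."""
    def __init__(self):
        super().__init__()
        self.in_a = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            self.texts.append('')  # start a new entry for this link

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.texts[-1] += data  # accumulate text nodes of the current <a>

parser = LinkTextCollector()
parser.feed('<div><a href="image1.html">Name:图片1<br><img src="image1.jpg"></a>'
            '<a href="image2.html">Name:图片2<br><img src="image2.jpg"></a></div>')
print(parser.texts)  # → ['Name:图片1', 'Name:图片2']
```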
The material above is based on *精通Scrapy网络爬虫* (Mastering Scrapy Web Crawling) by 刘硕 (Liu Shuo).