Comparing the BeautifulSoup, pyquery, and XPath parsing libraries

As the saying goes, the palest ink beats the best memory: scattered bits of knowledge that are never summarized into a coherent system feel chaotic and unrewarding. So here, once more, is a comparison of the common usage of the three parsing libraries.

Main references:
BeautifulSoup official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
pyquery official documentation: https://pythonhosted.org/pyquery/index.html
XPath tutorial at W3Schools: https://www.w3schools.com/xml/xpath_intro.asp

1. The BeautifulSoup library

When analyzing a web page, the usual approach is to locate the nodes and then extract the data.
The HTML document below is used throughout to illustrate how the three parsing approaches compare:

html_doc = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""

Instantiation

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

"lxml"是解析方式,常用的有
Beautifulsoup,pyquery、xpath解析库比较_第1张图片
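Parsers also differ in how they repair invalid markup. A small sketch, using the "<a></p>" example from the official docs (assuming lxml and html5lib are installed):

from bs4 import BeautifulSoup

broken = "<a></p>"                            # deliberately invalid HTML
print(BeautifulSoup(broken, "html.parser"))   # <a></a>
print(BeautifulSoup(broken, "lxml"))          # <html><body><a></a></body></html>
print(BeautifulSoup(broken, "html5lib"))      # <html><head></head><body><a><p></p></a></body></html>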
Locating a node: soup.tagname
For example, soup.title returns <title>The Dormouse's story</title>
If you don't pass a parser, BeautifulSoup picks the best one available, but it is best to pass one explicitly, otherwise you get this warning:

UserWarning: No parser was explicitly specified, so I’m using the best available HTML parser for this system (“lxml”). This usually isn’t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

This returns the first <p> node.

Using a tag name as an attribute will give you only the first tag by that name:

soup.tag is shorthand for soup.find('tag') and returns only the first matching object.
soup.findAll('tag') returns a list of all matching nodes.
findAll() has been renamed find_all(); likewise findParent(s) -> find_parent(s).
How do you locate one specific node?
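For example, a quick sketch against the html_doc above:

soup.p                      # the first <p> only (shorthand for soup.find('p'))
soup.find_all('a')[1]       # the second <a>, by position in the result list
soup.find('a', id='link3')  # one specific <a>, by attribute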

There are four kinds of objects: Tag, NavigableString, Comment, and BeautifulSoup.

1. Tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Use .name to get the tag's name; use tag.attrs or tag['attribute'] to access its attributes.
2. NavigableString

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

3. Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

4. BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
# <class 'bs4.BeautifulSoup'>

The BeautifulSoup object is itself an object type; its .name is '[document]'.
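For example:

soup.name
# '[document]'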


Locate nodes by their position in the result list, or by their attributes!

Traversal: .contents, .children, .descendants

A tag’s children are available in a list called .contents:
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator
The .contents and .children attributes only consider a tag's direct children. Use .descendants to iterate over all descendants, recursively.
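A quick sketch of the three, using the soup parsed from html_doc above:

head = soup.head
print(head.contents)           # list of direct children: [<title>The Dormouse's story</title>]
for child in head.children:    # generator over the same direct children
    print(child)
for node in head.descendants:  # recursive: the <title> tag, then the string inside it
    print(node)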

Searching: find_all(name, attrs, recursive, string, limit, **kwargs)

Because find_all() is the most commonly used method, there is a shortcut; the two calls below are equivalent:

soup.find_all('a')
soup('a')

BeautifulSoup distinguishes tags, attributes, multi-valued attributes, navigable strings, and comments.

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

Distinguishing .contents, .children, and .descendants:
.children yields only the direct children.
.descendants yields all descendant nodes, recursively.

find() and find_all()

You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
A string: soup.find_all('p')
A regular expression: soup.find_all(re.compile('^b'))  # tag names starting with "b"
A list: soup.find_all(['a', 'b'])  # matches tags that are either <a> or <b>
True: matches every tag
A function: a custom filter
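Putting the five filter types together in one sketch (the lambda is just an illustrative custom filter):

import re

soup.find_all('p')                             # string: every <p>
soup.find_all(re.compile('^b'))                # regex: tag names starting with "b" (<body>, <b>)
soup.find_all(['a', 'b'])                      # list: every <a> and every <b>
soup.find_all(True)                            # True: all tags
soup.find_all(lambda tag: tag.has_attr('id'))  # function: tags that have an id attribute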

Key points:

name is the tag name

attrs

Commonly soup.find_all(id="link1"), or pass the attributes as a dict: soup.find_all(attrs={'id': 'link1'}).
Note: class is a reserved word in Python, so use class_="sister" instead.

Getting attributes

You can also use CSS selector syntax: soup.select('p.story')

An <a> tag with the attribute id="link1":
soup.find('a', id="link1")
soup.find('a', attrs={"id": "link1"})

string

Searches by string value and returns the string itself:

soup.find(string="The Dormouse's story")
soup.find(string=re.compile("story"))
# find <a> tags whose string contains "ie":
s = soup.find_all('a', string=re.compile("ie"))

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

limit

Limits how many results are returned: soup.find_all('a', limit=2)

recursive

recursive=True is the default and searches all descendants; with recursive=False, only direct children are searched.
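For example, <title> is a descendant of <html> but not a direct child, so:

soup.html.find_all('title')
# [<title>The Dormouse's story</title>]
soup.html.find_all('title', recursive=False)
# []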

find() is effectively find_all(..., limit=1), except that it returns the element itself rather than a list.

find_parents() searches a tag's parents.

The soup.select() method

soup.select('p a')
soup.select('p a#link2')
soup.select('a[href]')    # equivalent to soup.find_all('a', href=True)

The get_text() method: get_text(strip=True)

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

How it differs from .string and .strings:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
If the tag has exactly one NavigableString child, .string returns that string; if there are several strings, .string returns None.
Use .strings as a generator over all contained strings; use .stripped_strings to strip surrounding whitespace.
Use get_text() to get all the text at once.
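A short sketch of the differences against html_doc:

soup.title.string               # "The Dormouse's story": the only child is a string
soup.body.string                # None: <body> contains more than one thing
list(soup.stripped_strings)     # every string in the document, whitespace stripped
soup.get_text(' ', strip=True)  # the whole text as one flat string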

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
css_soup.select("p.strikeout.body")      # several classes can be matched at once

2. The pyquery library

Three ways to initialize

from pyquery import PyQuery as pq
doc = pq(html)                          # parse an HTML string directly
doc = pq(url="https://www.baidu.com")   # from a URL
doc = pq(filename=r"D:\demo\html")      # from a local file

By default it uses Python's urllib. If requests is installed, it will be used instead, which lets you pass most of requests' parameters:

You can add request headers: pq(url, headers={'user-agent': 'pyquery'})
You can specify the parser: pq('<p>toto</p>', parser='xml')

How do you locate nodes?

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''
doc = pq(html)
doc('a')                        # returns every node matching the selector
doc('a#link1')                  # the <a> node with id="link1"
doc('a').attr('id', 'newlink')  # modify the id attribute of the <a> node
doc('a').attr['class'] = 'brother'
doc('a.sister').eq(0)           # use eq() to pick out a single node

How do you match several classes at once?

doc('ul .t.clearfix')

items() returns each match as a PyQuery object:

for a in doc('a').items():
    print(a.text())

text() returns all the text of the selection.
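On a multi-element selection, text() joins the text of every match, while items() lets you handle each match separately; a sketch against the html above:

print(doc('a').text())          # 'Elsie Lacie Tillie': all matches joined
for a in doc('a').items():
    print(a.attr('href'), a.text())  # each link separately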

Encoding issues in pyquery?

pyquery wraps requests (when requests is installed), so you can pass an encoding:

url = 'http://www.weather.com.cn/weather/101040100.shtml'
doc = pq(url,encoding='utf-8')

3. How to use XPath

from lxml import etree

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''
doc = etree.HTML(html)

Common XPath path expressions: nodename selects child nodes of the named node; / selects from the root; // selects matching nodes anywhere in the document; . is the current node; .. is the parent; @ selects attributes.
Note: / selects from the root node (an absolute path); // selects matching nodes anywhere in the document (a relative path).
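A few contrasting examples, run against the doc built above:

doc.xpath('/html/body/p')    # absolute: <p> elements along this exact path
doc.xpath('//p')             # relative: every <p> anywhere in the document
doc.xpath('//p/a')           # <a> elements that are direct children of a <p>
doc.xpath('//p//text()')     # all text under <p>, at any depth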

doc.xpath('//a[@id="link1"]/text()')              # the text of the <a> whose id is "link1"
doc.xpath('//a/@href')                            # the href attribute of every <a>
doc.xpath('//a[contains(@id, "link1")]')          # <a> nodes whose id attribute contains "link1"
doc.xpath('//a[contains(text(), "cie")]/text()')  # text of <a> nodes whose text contains "cie"
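Run against the document above, these queries return plain Python lists, for example:

print(doc.xpath('//a[@id="link1"]/text()'))  # ['Elsie']
print(doc.xpath('//a/@href'))                # ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']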
