As the saying goes, a good memory is no match for a worn pen: scattered bits of knowledge that are never summarized and organized into a system feel chaotic and yield little. So here I once again compare the common usage of the three parsing libraries.
Main references:
BeautifulSoup official documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/
pyquery official documentation https://pythonhosted.org/pyquery/index.html
XPath W3Schools tutorial https://www.w3schools.com/xml/xpath_intro.asp
When analyzing a web page, the general approach is to locate the nodes and then extract the data.
The following HTML text (the "three sisters" example from the BeautifulSoup docs) is used to illustrate the similarities and differences of the three parsing approaches:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Instantiation
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
"lxml"是解析方式,常用的有
Locating a node: soup.tagname
e.g. soup.title
If no parser is passed, the best available one is used by default, but it is best to pass one explicitly!
UserWarning: No parser was explicitly specified, so I’m using the best available HTML parser for this system (“lxml”). This usually isn’t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
Returns the first <p> node.
Using a tag name as an attribute will give you only the first tag by that name:
soup.tagname is really shorthand for soup.find('tagname') and returns only the first matching Tag object.
soup.findAll('tagname') returns a list of all matching nodes.
findAll() has been renamed find_all(); likewise findParent(s) -> find_parent(s). A quick sketch of the difference follows.
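A minimal check, assuming soup was built from the html_doc above:

```python
# soup built from html_doc as above
print(soup.p)                                   # first <p> only
print(soup.p == soup.find('p'))                 # True: the attribute is just shorthand
print(len(soup.find_all('p')))                  # 3: every <p> in the document, as a list
print(soup.find_all('p') == soup.findAll('p'))  # True: findAll is the legacy alias
```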
How do we pin down one specific node? BeautifulSoup models a document with four object types:
1. Tag
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
tag.name accesses the tag's name; tag.attrs or tag['attribute_name']
accesses its attributes.
2. NavigableString
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
3. Comment
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
4. BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
# <class 'bs4.BeautifulSoup'>
BeautifulSoup is itself an object type; its .name is '[document]'.
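All four object types in one runnable sketch ('lxml' is an assumption; any installed parser works):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
print(type(soup), soup.name)     # <class 'bs4.BeautifulSoup'> [document]
print(type(soup.b))              # <class 'bs4.element.Tag'>
print(type(soup.b.string))       # <class 'bs4.element.NavigableString'>

commented = BeautifulSoup('<b><!--a comment--></b>', 'lxml')
print(type(commented.b.string))  # <class 'bs4.element.Comment'>
```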
Nodes can be located by their position in a list, or by their attributes!
A tag’s children are available in a list called .contents:
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator
The .contents and .children attributes only consider a tag's direct children. Use .descendants to get the children recursively.
Since find_all is the most commonly used method, a shortcut is provided; the following two calls are equivalent:
soup.find_all('a')
soup('a')
BeautifulSoup distinguishes Tag, attributes, multi-valued attributes, NavigableString, and Comment.
The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:
Distinguishing .contents, .children, and .descendants:
.children returns only the direct children;
.descendants returns all descendants, recursively.
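A small sketch of the three attributes on a hypothetical nested snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p><b>hi</b></p></div>', 'lxml')
div = soup.div
print(div.contents)                     # [<p><b>hi</b></p>]: a real list, direct children only
print([c.name for c in div.children])   # ['p']: a generator over the same direct children
print(list(div.descendants))            # [<p><b>hi</b></p>, <b>hi</b>, 'hi']: fully recursive
```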
find() and find_all()
You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
String: soup.find_all('p')
Regular expression: soup.find_all(re.compile('^b'))  # tags whose names start with b
List: soup.find_all(['a', 'b'])  # matches tags that are either <a> or <b>
True: matches every tag
Function: a custom filter function (see the sketch below)
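Each filter type in action; a minimal sketch assuming soup was built from the html_doc above:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: the "three sisters" snippet above

def has_class_but_no_id(tag):
    # custom filter: tags that define class but not id
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))  # the three <p> tags (the <a> tags all have ids)
print(soup.find_all(re.compile('^b')))     # <body> and <b>: names starting with "b"
print(soup.find_all(['a', 'b']))           # the <b> tag plus the three <a> tags
print(len(soup.find_all(True)))            # every tag in the document
```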
Commonly used: soup.find_all(id="link1"),
or pass the attributes as a dict: soup.find_all(attrs={'id': 'link1'}).
Note: class is a reserved word in Python,
so use class_="sister" instead (see the sketch below).
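Attribute filtering on the example document, again assuming soup as above:

```python
# soup built from html_doc as above
print(soup.find_all(class_="sister"))            # class_ with the trailing underscore
print(soup.find_all(attrs={"class": "sister"}))  # equivalent dict form
print(soup.find_all(id="link1"))                 # [<a class="sister" ... id="link1">Elsie</a>]
```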
Getting an attribute: tag['attribute_name'] or tag.attrs, as in item 1 above.
CSS selectors can also be used: soup.select('p.story')
An <a> tag with the attribute id="link1":
soup.find('a',id="link1")
soup.find('a',attrs={"id":"link1"})
Searching by string value returns the matching string itself:
soup.find(string="The Dormouse's story")
soup.find(string=re.compile("story"))
Find <a> tags whose string contains "ie":
s = soup.find_all('a',string=re.compile("ie"))
The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:
Limit the number of matches: soup.find_all('a', limit=2)
By default recursive=True searches all descendants; recursive=False searches only the direct children (sketch below).
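The standard illustration of recursive, assuming soup as above (<title> is a grandchild of <html>, not a direct child):

```python
# soup built from html_doc as above
print(soup.html.find_all('title'))                   # [<title>The Dormouse's story</title>]
print(soup.html.find_all('title', recursive=False))  # []: <title> lives under <head>
```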
find() is effectively find_all(..., limit=1), except that it returns the element itself instead of a one-item list.
find_parents() searches the parent nodes.
soup.select('p a')
soup.select('p a#link2')
soup.select('a[href]')  # equivalent to soup.find_all('a', href=True)
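The CSS-selector routes next to their find_all equivalents, assuming soup as above:

```python
# soup built from html_doc as above
print(soup.select('p a'))        # every <a> nested anywhere under a <p>
print(soup.select('p a#link2'))  # the <a> with id="link2" under a <p>
print(soup.select('a[href]') == soup.find_all('a', href=True))  # True
```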
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
Distinguish .string from .strings:
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
If a tag has exactly one NavigableString child, .string returns it; with multiple strings, .string returns None.
Use **.strings as a generator**; use .stripped_strings to strip the extra whitespace.
Use get_text() to get all the text (sketch below).
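The three text accessors side by side, assuming soup as above:

```python
# soup built from html_doc as above
print(soup.title.string)                # "The Dormouse's story": exactly one string child
print(soup.body.string)                 # None: several children, so .string is ambiguous
print(list(soup.stripped_strings)[:3])  # generator of whitespace-stripped strings
print(soup.get_text()[:40])             # the whole document text as one string
```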
css_soup.select("p.strikeout.body")  # several CSS classes can be required at once
Three ways to initialize pyquery:
from pyquery import PyQuery as pq
doc = pq(html)  # parse markup directly
doc = pq(url="https://www.baidu.com")  # from a URL
doc = pq(filename=r"D:\demo\html")  # from a local file
By default it uses Python's urllib.
If requests is installed then it will use it. This allows you to use most of requests' parameters:
You can add request headers, and you can also specify the parser explicitly; a sketch of both follows.
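A hedged sketch of both options (the header value is arbitrary; pyquery forwards requests parameters such as headers, and accepts a parser keyword):

```python
from pyquery import PyQuery as pq

# Custom request header, forwarded to requests when it is installed:
doc = pq(url='https://www.baidu.com', headers={'user-agent': 'pyquery'})

# Explicit parser instead of the default HTML parser:
d = pq('<p>toto</p>', parser='xml')
```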
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
doc('tagname')  # returns every node with that tag name
doc('a#link1')  # the <a> node with id="link1"
doc('a').attr('id', 'newlink')  # set the id attribute of the <a> nodes
doc('a').attr['class'] = 'brother'  # item-style attribute assignment also works
doc('a.sister').eq(0)  # eq() picks a single element out of the selection by index
How do you require several classes at the same time?
doc('ul .t.clearfix')
items()  # yields each match as a PyQuery object
for a in doc('a').items():
    print(a.text())
text() returns all of the text (see the sketch below).
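A small end-to-end traversal of the example html with pyquery:

```python
from pyquery import PyQuery as pq

doc = pq(html)  # html: the "three sisters" snippet above
for a in doc('a.sister').items():  # each <a> wrapped as its own PyQuery object
    print(a.attr('id'), a.attr('href'), a.text())

print(doc('a').text())             # "Elsie Lacie Tillie": text of the whole selection at once
```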
Encoding problems with pyquery?
pyquery wraps the HTTP request (urllib, or requests if installed), so the encoding can be specified:
url = 'http://www.weather.com.cn/weather/101040100.shtml'
doc = pq(url,encoding='utf-8')
Finally, XPath with lxml:
from lxml import etree
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
doc = etree.HTML(html)
doc.xpath('//a[@id="link1"]/text()')  # text of the <a> whose id equals "link1"
doc.xpath('//a/@href')  # every href attribute value
doc.xpath('//a[contains(@id, "link1")]')  # has an id attribute whose value contains "link1"
doc.xpath('//a[contains(text(), "cie")]/text()')  # text of <a> nodes whose text contains "cie"
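To close the comparison, the same node located with all three libraries; a minimal side-by-side sketch using the html snippet above:

```python
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
from lxml import etree

# html: the "three sisters" snippet defined above
print(BeautifulSoup(html, 'lxml').find('a', id='link1').string)  # 'Elsie' via BeautifulSoup
print(pq(html)('a#link1').text())                                # 'Elsie' via pyquery (CSS)
print(etree.HTML(html).xpath('//a[@id="link1"]/text()')[0])      # 'Elsie' via lxml (XPath)
```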