Python and Web Scraping 02: Advanced HTML Parsing

Preface: standard and creative ways to select tags by position, context, attributes, and content.

1. Scraping pages further with BeautifulSoup

(1) The code

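from urllib.request import urlopen
from bs4 import BeautifulSoup
# Download the page and parse it with Python's built-in HTML parser
html = urlopen('https://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(),'html.parser')
# Collect every <span class="green"> tag (find_all is the current name of findAll)
nameList = bs.find_all('span',{'class':'green'})
for name in nameList:
    print(name.get_text())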

Output:

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna

Part of the original page looks like this:
[Image 1: Python and Web Scraping 02: Advanced HTML Parsing]

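(2) Explanation
Cascading Style Sheets (CSS) make it easier to style a page, and the class attribute they rely on lets a web scraper tell otherwise identical tags apart. The tags this code can target look like this: <span class="green"></span> and <span class="red"></span>.
With a BeautifulSoup object, find_all can extract only the text inside span tags whose class is green, which yields the list of character names!
Calling bs.tagName, as before, returns only the first matching tag on the page, whereas bs.find_all(tagName,tagAttributes) returns every matching tag on the page.
get_text(): strips all tags from the HTML it is called on and returns a Unicode string containing only the text.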
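To make the difference between bs.tagName and find_all concrete, here is a minimal sketch. The inline HTML string is a made-up stand-in for the real page (an assumption, so the example runs without a network connection); only the BeautifulSoup calls themselves come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the War and Peace page
html = """
<div id="text">
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted.
<span class="green">Anna Pavlovna</span> smiled.
</div>
"""

bs = BeautifulSoup(html, "html.parser")

# bs.span returns only the first <span> in the document, whatever its class
first_span_text = bs.span.get_text()
print(first_span_text)

# find_all returns every <span> whose class is "green"
green_spans = bs.find_all("span", {"class": "green"})

# get_text() drops the tags and keeps only the text
names = [span.get_text() for span in green_spans]
print(names)
```

Note that get_text() is best called last: once the tags are stripped, there is nothing left to search by tag name or attribute.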

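PS: The line breaks in the scraped results seem to come from the page source itself! When I view the source in my browser, those names are split across lines there too!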
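If those line breaks are unwanted, one possible way to remove them is to collapse whitespace after calling get_text(). This is only a sketch: the inline HTML is a hypothetical example of a name split across two source lines, and the split/join idiom is plain Python rather than anything specific to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the name is broken across two lines in the source
html = '<span class="green">Anna\nPavlovna Scherer</span>'
bs = BeautifulSoup(html, "html.parser")

# get_text() keeps the newline exactly as it appears in the source
raw = bs.span.get_text()
print(repr(raw))

# split() with no arguments splits on any run of whitespace,
# so joining with a single space puts the name on one line
clean = " ".join(raw.split())
print(clean)
```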

2. Important BeautifulSoup functions

(1) Worked examples
Still using the same example page as before, the remaining code is:

# Pass a list of tag names to match any heading level
nameList = bs.find_all(['h1','h2','h3','h4','h5','h6'])
for name in nameList:
    print(name.get_text())

Output:

War and Peace
Chapter 1

Running the following code instead:

# Match <span> tags whose class is either green or red
nameList = bs.find_all('span',{'class':{'green','red'}})
for name in nameList:
    print(name.get_text())

Output:

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
the prince
Anna Pavlovna
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?
Anna Pavlovna
You are
staying the whole evening, I hope?
And the fete at the English ambassador's? Today is Wednesday. I
must put in an appearance there,
the prince
My daughter is
coming for me to take me there.
I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome.
If they had known that you wished it, the entertainment would
have been put off,
the prince
Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything.
What can one say about it?
the prince
What has been decided? They have decided that
Buonaparte has burnt his boats, and I believe that we are ready to
burn ours.
Prince Vasili
Anna Pavlovna
Anna Pavlovna
Oh, don't speak to me of Austria. Perhaps I don't understand
things, but Austria never has wished, and does not wish, for war.
She is betraying us! Russia alone must save Europe. Our gracious
sovereign recognizes his high vocation and will be true to it. That is
the one thing I have faith in! Our good and wonderful sovereign has to
perform the noblest role on earth, and he is so virtuous and noble
that God will not forsake him. He will fulfill his vocation and
crush the hydra of revolution, which has become more terrible than
ever in the person of this murderer and villain! We alone must
avenge the blood of the just one.... Whom, I ask you, can we rely
on?... England with her commercial spirit will not and cannot
understand the Emperor Alexander's loftiness of soul. She has
refused to evacuate Malta. She wanted to find, and still seeks, some
secret motive in our actions. What answer did Novosiltsev get? None.
The English have not understood and cannot understand the
self-abnegation of our Emperor who wants nothing for himself, but only
desires the good of mankind. And what have they promised? Nothing! And
what little they have promised they will not perform! Prussia has
always declared that Buonaparte is invincible, and that all Europe
is powerless before him.... And I don't believe a word that Hardenburg
says, or Haugwitz either. This famous Prussian neutrality is just a
trap. I have faith only in God and the lofty destiny of our adored
monarch. He will save Europe!
I think,
the prince
that if you had been
sent instead of our dear Wintzingerode you would have captured the
King of Prussia's consent by assault. You are so eloquent. Will you
give me a cup of tea?
Wintzingerode
King of Prussia
In a moment. A propos,
I am
expecting two very interesting men tonight, le Vicomte de Mortemart,
who is connected with the Montmorencys through the Rohans, one of
the best French families. He is one of the genuine emigres, the good
ones. And also the Abbe Morio. Do you know that profound thinker? He
has been received by the Emperor. Had you heard?
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
I shall be delighted to meet them,
the prince
But tell me,
is it true that the Dowager Empress wants Baron Funke
to be appointed first secretary at Vienna? The baron by all accounts
is a poor creature.
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
Baron Funke has been recommended to the Dowager Empress by her
sister,
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
Now about your family. Do you know that since your daughter came
out everyone has been enraptured by her? They say she is amazingly
beautiful.
The prince
I often think,
I often think how unfairly sometimes the
joys of life are distributed. Why has fate given you two such splendid
children? I don't speak of Anatole, your youngest. I don't like
him,
Anatole
Two such charming children. And really you appreciate
them less than anyone, and so you don't deserve to have them.
I can't help it,
the prince
Lavater would have said I
lack the bump of paternity.
Don't joke; I mean to have a serious talk with you. Do you know I
am dissatisfied with your younger son? Between ourselves
he was mentioned at Her
Majesty's and you were pitied....
The prince
What would you have me do?
You know I did all
a father could for their education, and they have both turned out
fools. Hippolyte is at least a quiet fool, but Anatole is an active
one. That is the only difference between them.
And why are children born to such men as you? If you were not a
father there would be nothing I could reproach you with,
Anna
Pavlovna
I am your faithful slave and to you alone I can confess that my
children are the bane of my life. It is the cross I have to bear. That
is how I explain it to myself. It can't be helped!
Anna Pavlovna

Next, change the code to:

# Count the text nodes that are exactly 'the prince'
nameList = bs.find_all(text='the prince')
print(len(nameList))

Output: 7
Now run the following code:

title = bs.find_all(id='title',class_='text')
print(title)

Output: []
Changing class_ to class gives: File "", line 1 title = bs.find_all(id='title',class='text') ^ SyntaxError: invalid syntax. So the class_ in the book's code is not a typo!
Now run the following code:

title = bs.find(id='title')
print(title)

Output: None
(2) Explanation
BeautifulSoup's find() and find_all() filter an HTML page by the different properties of its tags. Their definitions are:
find_all(tag,attributes,recursive,text,limit,keywords)
find(tag,attributes,recursive,text,keywords)

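Parameter | Example value | Type and meaning
tag | ['h1','h2','h3'] | A tag name (string) or a list of tag names
attributes | {'class':{'green','red'}} | Dictionary of attribute names and the values to match, used together with a tag, e.g. 'span',{'class':{'green','red'}}
recursive | - | Boolean, default True: search all descendants of the tag; False searches only its direct children
text | 'the prince' | Match on the text content of tags (newer BeautifulSoup versions call this argument string)
limit | - | Return only the first N matches, in page order; find() behaves like find_all() with limit set to 1
keywords | - | Select tags that have the given attributes; largely redundant with attributes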
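The parameters in the table can be tried out on a small document. The sketch below uses a made-up inline HTML string (an assumption, not the real page), passes the class values as a list, and uses string= where the table says text, since that is the name newer BeautifulSoup versions prefer.

```python
from bs4 import BeautifulSoup

# Hypothetical document with headings and nested spans
html = """
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div><p><span class="green">the prince</span> bowed.</p>
<p><span class="green">Anna Pavlovna</span> and <span class="red">Well, Prince</span></p>
<p><span class="green">the prince</span></p></div>
</body>
"""
bs = BeautifulSoup(html, "html.parser")

# tag: a list of names matches any of them
headings = [h.get_text() for h in bs.find_all(["h1", "h2"])]

# attributes: a dictionary; a list of values matches any of them
both_colors = bs.find_all("span", {"class": ["green", "red"]})

# limit: stop after the first N matches in document order
first_two = [s.get_text() for s in bs.find_all("span", limit=2)]

# recursive=False: look only at direct children of <body>
direct_spans = bs.body.find_all("span", recursive=False)
direct_headings = bs.body.find_all(["h1", "h2"], recursive=False)

# string (called text in older versions): match on text content
princes = bs.find_all(string="the prince")

print(headings, len(both_colors), first_two)
print(len(direct_spans), len(direct_headings), len(princes))
```

The spans are nested inside div and p, so the non-recursive search from body finds none of them, while the headings, which are direct children, are still found.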

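Note: class is a reserved keyword in Python and cannot be used as a variable or argument name, which is why BeautifulSoup spells the keyword argument class_, with a trailing underscore.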
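As a small illustration of the class_ keyword, and of what find() and find_all() return when nothing matches, here is a sketch on a hypothetical inline snippet (the HTML is an assumption, not the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one id and two classes
html = (
    '<div id="text">'
    '<span class="green">Anna Pavlovna</span>'
    '<span class="red">Heavens!</span>'
    '</div>'
)
bs = BeautifulSoup(html, "html.parser")

# Keyword form: class_ avoids the reserved word class
by_keyword = bs.find_all(class_="green")

# Dictionary form: the key is an ordinary string, so "class" is fine
by_attrs = bs.find_all(attrs={"class": "green"})

# Both forms select the same tags
same = by_keyword == by_attrs

# When nothing matches, find_all gives an empty list and find gives None
no_matches = bs.find_all(id="title", class_="text")
no_match = bs.find(id="title")

print(same, no_matches, no_match)
```

So an output of [] or None is not an error in itself: it only means that no tag in the parsed document satisfied all of the filters.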
PS: It has been so long since I learned Python that I had forgotten all this! Sigh~~~~

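3. Other BeautifulSoup objects

There are four kinds of object: the BeautifulSoup object (bs), Tag objects (for example bs.body), NavigableString objects (the text inside a tag), and Comment objects (HTML comments).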
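A quick way to see the four object types is to parse a tiny document and inspect each piece. The HTML below is a hypothetical example; the class names are the ones bs4 exports.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical document containing a tag, some text, and a comment
html = "<body><h1>Title</h1><!-- a comment --><p>Some text</p></body>"
bs = BeautifulSoup(html, "html.parser")

# The whole parsed document
doc_type = type(bs).__name__

# A single tag reached by attribute access
tag_type = type(bs.h1).__name__

# The text inside a tag
text_node = bs.h1.string
text_type = type(text_node).__name__

# The comment is the second child of <body> in this snippet
comment = bs.body.contents[1]
comment_type = type(comment).__name__

print(doc_type, tag_type, text_type, comment_type)

# Comment is a special kind of NavigableString
print(isinstance(comment, NavigableString), isinstance(bs.h1, Tag))
```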

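4. Navigating trees

find_all() finds tags by their name and attributes; to find tags by their position in the document, you need tree navigation.
(1) Example page: https://www.pythonscraping.com/pages/page3.html
[Image 2: Python and Web Scraping 02: Advanced HTML Parsing]
Dealing with children and other descendants
Run the following code: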

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(),'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

Output:


Item Title

Description

Cost

Image



Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00





Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52





Fish Painting

If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!

$10,005.00





Dead Parrot

This is an ex-parrot! Or maybe he's only resting?

$0.50





Mystery Box

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!

$1.50



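Explanation: a child tag is exactly one level below its parent, while descendants are the tags at every level below it. The code above prints every product row in the giftList table, including the header row of column names at the top!
Note: there are two attributes, children and descendants. Replacing children with descendants, as in for child in bs.find('table',{'id':'giftList'}).descendants: print(child), also runs!!!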
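The difference between children and descendants is easier to see by counting. The sketch below uses a cut-down, whitespace-free table as a hypothetical stand-in for giftList, so the counts are not affected by newlines between tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row version of the giftList table, with no whitespace
html = (
    '<table id="giftList">'
    '<tr><th>Item Title</th><th>Cost</th></tr>'
    '<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '</table>'
)
bs = BeautifulSoup(html, "html.parser")
table = bs.find("table", {"id": "giftList"})

# children: only the nodes directly under <table>
child_names = [child.name for child in table.children]

# descendants: every node at any depth, text nodes included
all_descendants = list(table.descendants)

# Keep only the tags among the descendants
descendant_tags = [d.name for d in all_descendants if isinstance(d, Tag)]

print(child_names)
print(len(all_descendants), descendant_tags)
```

On the real page the numbers would be larger, partly because the newlines between tags are also counted as text nodes.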
Dealing with siblings
Example code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(),'html.parser')
for sibling in bs.find('table',{'id':'giftList'}).tr.next_siblings:
    print(sibling)

Output:


Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00





Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52





Fish Painting

If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!

$10,005.00





Dead Parrot

This is an ex-parrot! Or maybe he's only resting?

$0.50





Mystery Box

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!

$1.50



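The attribute used here is next_siblings (a property, not a method call); the example selects every row of the table except the header row. A tag's siblings never include the tag itself.
The related attributes can be tried the same way:
for sibling in bs.find('table',{'id':'giftList'}).tr.previous_siblings: print(sibling)
for sibling in bs.find('table',{'id':'giftList'}).tr.next_sibling: print(sibling)
for sibling in bs.find('table',{'id':'giftList'}).tr.previous_sibling:
The last two, however, return a single node rather than a sequence. I tried them and nothing visible was printed, most likely because the node right next to the header row is just the whitespace between tags!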
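To check what the singular sibling attributes hold, here is a sketch on a hypothetical three-row table that keeps the newlines between rows, as a real page usually does. find_next_sibling is a standard BeautifulSoup method that skips over such text nodes; using it here is a suggestion, not something the original code does.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table with a newline between every row
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr class="gift"><td>Vegetable Basket</td></tr>
<tr class="gift"><td>Russian Nesting Dolls</td></tr>
</table>"""
bs = BeautifulSoup(html, "html.parser")
header = bs.find("table", {"id": "giftList"}).tr

# The singular attributes return one node: here, the newline text node
after_header = header.next_sibling
before_header = header.previous_sibling
print(repr(after_header), repr(before_header))

# next_siblings yields every following node; keep only the tags
rows = [s for s in header.next_siblings if isinstance(s, Tag)]
titles = [row.get_text() for row in rows]
print(titles)

# find_next_sibling jumps straight to the next matching tag
next_row = header.find_next_sibling("tr")
print(next_row.get_text())
```

Printing a newline looks like printing nothing, which would explain an apparently empty result.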
Dealing with parents
Example code:

print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())

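Output: $15.00
The earlier attempt apparently failed because there is nothing to loop over: a single value is returned, not a list or an array!
Parent tags are reached with the attributes parent and parents.
Code: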

for sibling in bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent:
    print(sibling)

Output:
Code:

for sibling in bs.find('img',{'src':'../img/gifts/img1.jpg'}).parents:
    print(sibling)

Output: only a screenshot here, because there is too much of it!
[Image 3: Python and Web Scraping 02: Advanced HTML Parsing]
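The parent and parents attributes can be compared on a small example. The one-row table below is a hypothetical stand-in for the real page; the image path is the one used in the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table; the price cell sits just before the image cell
html = (
    '<table id="giftList"><tr class="gift">'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)
bs = BeautifulSoup(html, "html.parser")
img = bs.find("img", {"src": "../img/gifts/img1.jpg"})

# parent is a single tag: the <td> that wraps the image
parent_name = img.parent.name

# Its previous sibling is the <td> holding the price
price = img.parent.previous_sibling.get_text()

# parents walks upward through every ancestor, ending at the document itself
ancestor_names = [ancestor.name for ancestor in img.parents]

print(parent_name, price)
print(ancestor_names)
```

Printing each ancestor in full, as the loop over parents does, repeats the same content at every level, which is why that output is so long.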
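PS: I will carry on with the rest tomorrow and stop at the end of chapter 2! The later chapters can wait until I have time!! For this material you really need solid Python basics!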
