[Python Web Scraping] Getting Started with the Beautiful Soup Library

Installing the Beautiful Soup library

Install

pip install beautifulsoup4

 

Test whether the installation succeeded

Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.text
'...\r\n ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93 ...\r\n'
>>> r.encoding='UTF-8'
>>> r.text
'...\r\n 百度一下,你就知道 ... 关于百度 About Baidu ... ©2017 Baidu 使用百度前必读  意见反馈 京ICP证030173号 ...\r\n'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
>>>

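The check above needs network access to Baidu. As a minimal offline sketch, the installation can also be verified without fetching a page. The helper name `is_installed` and the inline HTML string are illustrative assumptions, not part of the original session.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    from bs4 import BeautifulSoup

    # Parse a tiny inline document instead of downloading one.
    soup = BeautifulSoup("<p>hello</p>", "html.parser")
    print(soup.p.string)
else:
    print("bs4 not found; run: pip install beautifulsoup4")
```

Note that the package is installed as `beautifulsoup4` but imported as `bs4`.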

Basic elements of the Beautiful Soup library
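
The basic elements can be inspected on a small inline document. The sketch below assumes a hypothetical HTML string modeled on the demo page used in this post, plus one HTML comment added purely for illustration. It touches a Tag, its name, its attributes, a NavigableString, and a Comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline HTML modeled on the demo page; the comment node is an added assumption.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '<b><!--This is a comment--></b>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                      # Tag: the first <a> element
print(type(tag) is Tag)
print(tag.name)                   # Name: the tag's name as a string
print(tag.attrs)                  # Attributes: a dict; class is multi-valued, so it is a list
print(tag.string)                 # NavigableString: the text inside the tag

# The second <b> holds only a comment, so .string returns a Comment object.
comment = soup.find_all("b")[1].string
print(isinstance(comment, Comment))
```

A Comment prints like ordinary text, so checking its type is a dependable way to tell it apart from a NavigableString.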

Practice log

>>> import requests
>>> r=requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> soup=BeautifulSoup("data","html.parser")
>>> soup2=BeautifulSoup(open("D://MyProject//Python学习//爬虫学习//demo.html"),"html.parser")
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag=soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> newsoup=BeautifulSoup(" ...

>>> r=requests.get("https://www.baidu.com/")
>>> r.text
'...\r\n ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93 ...\r\n'
>>> r.encoding='UTF-8'
>>> r.text
'...\r\n 百度一下,你就知道 ... 关于百度 About Baidu ... ©2017 Baidu 使用百度前必读  意见反馈 京ICP证030173号 ...\r\n'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下,你就知道</title></head>
>>> soup.head.contents
[<meta content="text/html;charset=utf-8" http-equiv="content-type"/>, <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>, <meta content="always" name="referrer"/>, <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>, <title>百度一下,你就知道</title>]
>>> soup.body.contents
[' ', <div id="wrapper"> ... </div>, ' ']
>>> len(soup.body.contents)
3
>>> soup.body.contents[1]
<div id="wrapper"> ... </div>
>>>

 

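The `.contents` calls in the log above walk downward through the tree. As a small sketch under the assumption of a hypothetical inline document (not the Baidu page), the three downward accessors `.contents`, `.children`, and `.descendants` can be compared side by side:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML; the spaces around <div> become whitespace text nodes.
html = "<body> <div><p>one</p><p>two</p></div> </body>"
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children, whitespace strings included.
direct = soup.body.contents

# .children iterates over the same direct children without building a list.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants visits every node below the tag, depth-first.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(direct), child_tags, descendant_tags)
```

Filtering with `isinstance(c, Tag)` skips the whitespace strings, which is why `len(soup.body.contents)` in the log is 3 even though only one element is a tag.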

 

Exercise

Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.encoding='UTF-8'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下,你就知道</title></head>
>>> soup.html.parent
<html> <head> ... <title>百度一下,你就知道</title></head> <body link="#0000cc"> <div id="wrapper"> ... </div> </body> </html>
>>> soup.parent
>>>

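The `.parent` calls above climb one level at a time. A minimal sketch of walking all the way up with `.parents` follows, assuming a hypothetical inline document shaped like the demo page:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML; only the nesting matters here.
html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object itself.
names = [parent.name for parent in soup.a.parents]
print(names)

# The BeautifulSoup object is the root, so it has no parent.
print(soup.parent is None)
```

This matches the session above: `soup.html.parent` is the whole document, and `soup.parent` prints nothing because it is `None`.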

 

Practice


 

Exercise:

>>> soup.a.next_sibling
' '
>>> soup.a.next_sibling.next_sibling
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
>>> soup.a.previous_sibling
' '
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div>
>>>

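As the exercise shows, the immediate sibling of a tag is often a whitespace string rather than the next tag. The sketch below assumes a hypothetical, shortened version of the `u1` navigation block and shows two ways to skip over those strings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened copy of the navigation block; attributes trimmed for brevity.
html = (
    '<div id="u1"> <a name="tj_trnews">新闻</a> '
    '<a name="tj_trhao123">hao123</a> '
    '<a name="tj_trmap">地图</a> </div>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The direct neighbors are whitespace strings, not tags.
print(repr(first.next_sibling), repr(first.previous_sibling))

# Option 1: iterate over .next_siblings and keep only Tag objects.
later_links = [s.get_text() for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: let find_next_sibling() search for the next matching tag.
next_tag = first.find_next_sibling("a")
print(later_links, next_tag.get_text())
```

Sibling traversal only covers nodes under the same parent, so stepping back twice from the first link returns `None`, as in the exercise.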

 

HTML formatting and encoding with the bs4 library


 

Exercise:

Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.encoding='UTF-8'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'...\n   百度一下,你就知道\n ...\n 关于百度\n ...\n About Baidu\n ...\n ©2017\xa0Baidu\n ...\n 使用百度前必读\n ...\n 意见反馈\n ...\n 京ICP证030173号\n ...\n'
>>> print(soup.prettify())
...
  <title>
   百度一下,你就知道
  </title>
...
>>>

 


 

Use "UTF-8" as the encoding: set r.encoding='UTF-8' before reading r.text, as in the sessions above.
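
A small offline sketch of why the encoding matters follows. It assumes hypothetical page bytes rather than a live response. Decoding UTF-8 bytes as ISO-8859-1 (presumably the fallback that produced the garbled `r.text` earlier) yields mojibake, while decoding as UTF-8 recovers the Chinese text.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes, encoded as UTF-8 the way a server might send them.
raw = "<p>百度一下,你就知道</p>".encode("utf-8")

# Decoding with the wrong codec gives mojibake; UTF-8 gives readable text.
wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")

soup = BeautifulSoup(right, "html.parser")
print(soup.p.string)

# prettify() returns a str by default; with an encoding it returns bytes.
as_bytes = soup.prettify(encoding="utf-8")
print(type(as_bytes))
```

Beautiful Soup works with Unicode strings internally, so once the input is decoded correctly the output of `prettify()` can be written out as UTF-8 without further conversion.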
