Installing the BeautifulSoup library
Installation
pip install beautifulsoup4
Testing whether the installation succeeded
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.text
'...<title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title>...\r\n'
>>> r.encoding='UTF-8'
>>> r.text
'...<title>百度一下,你就知道</title>...\r\n'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
>>>
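The garbled first r.text above is what UTF-8 bytes look like when they are decoded with a single-byte codec. The sketch below reproduces that effect with the standard library only; it assumes the wrong codec was Latin-1 (ISO-8859-1), which requests commonly falls back to when a text response declares no charset, and it uses the short sample string "百度" rather than the real page.

```python
# Sample text (an illustration, not the full Baidu page), encoded as UTF-8 bytes.
original = "百度"
raw_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 (assumed wrong codec) produces mojibake
# of the same shape as the first r.text output in the session above.
garbled = raw_bytes.decode("latin-1")
print(repr(garbled))

# Decoding the same bytes as UTF-8 gives readable text, which is the effect
# of setting r.encoding = 'UTF-8' before reading r.text.
fixed = raw_bytes.decode("utf-8")
print(fixed)

# Text that is already garbled this way can be repaired by reversing the mistake.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The same bytes give two different strings depending on the codec, so the encoding has to be set before r.text is handed to BeautifulSoup.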
Basic elements of the Beautiful Soup library
Practice log
>>> import requests
>>> r=requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
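The basic elements this section practises (Tag, its name, its attrs, and NavigableString) can be sketched without network access. The block below is a minimal illustration that assumes an inline copy of the demo page in a variable named DEMO_HTML (a hypothetical name); it is not the page as fetched live.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Inline copy of the demo page (an assumption, so no request is needed).
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>\n'
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses:\n'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\n'
    '</body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# soup.<tagname> returns the first tag with that name.
tag = soup.a
print(type(tag))          # a bs4 Tag object
print(tag.name)           # the tag's name
print(tag.attrs)          # attributes as a dict; class is a list of values
print(tag.attrs["href"])  # one attribute value

# .string is the text inside a tag, wrapped in NavigableString.
title_text = soup.title.string
first_paragraph_text = soup.p.string
print(type(first_paragraph_text))

# A misspelled attribute such as .paernt is treated as a search for a child
# tag with that name, so it evaluates to None rather than failing at once.
misspelled = soup.a.paernt
```

This is why a typo like the one in the log surfaces as an AttributeError on NoneType only when .name is read from the result.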
>>> from bs4 import BeautifulSoup
>>> soup=Beautiful("data","html.parser")
Traceback (most recent call last):
  File "", line 1, in <module>
    soup=Beautiful("data","html.parser")
NameError: name 'Beautiful' is not defined
>>> soup=BeautifulSoup("data","html.parser")
>>> soup2=BeautifulSoup(open("D://MyProject//Python学习//爬虫学习//demo.html")
4444
SyntaxError: invalid syntax
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag=soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.a.name
'a'
>>> soup.a.paernt.name
Traceback (most recent call last):
  File "", line 1, in <module>
    soup.a.paernt.name
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> newsoup=BeautifulSoup("…
>>> r.encoding='UTF-8'
>>> r.text
'...<title>百度一下,你就知道</title>...\r\n'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下,你就知道</title></head>
>>> soup.head.contents
[<meta content="text/html;charset=utf-8" http-equiv="content-type"/>, <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>, <meta content="always" name="referrer"/>, <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>, <title>百度一下,你就知道</title>]
>>> soup.body.contents
[' ', <div id="wrapper"> ... </div>, ' ']
>>> len(soup.body.contents)
3
>>> soup.body.contents[1]
<div id="wrapper"> <div id="head"> ... <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> ... <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> ... <div id="ftCon"> ... <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> ... ©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a> <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号 <img src="//www.baidu.com/img/gs.gif"/> ... </div> </div>
>>>
Practice
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.encoding='UTF-8'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下,你就知道</title></head>
>>> soup.html.parent
<html> <head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下,你就知道</title></head> <body link="#0000cc"> <div id="wrapper"> ... </div> </body> </html>
>>> soup.parent
>>>
Hands-on
Practice:
>>> soup.a.next_sibling
' '
>>> soup.a.next_sibling.next_sibling
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
>>> soup.a.previous_sibling
' '
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div>
>>>
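The three traversal directions practised in these sessions (downward through contents, upward through parents, sideways through siblings) can be compared side by side. The sketch below assumes a small hand-written fragment modelled on Baidu's navigation bar rather than the live page, so the values are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written fragment (an assumption) shaped like the navigation bar above.
html = (
    '<html><head><title>Demo</title></head><body>'
    '<div id="u1"> '
    '<a class="mnav" href="http://news.baidu.com">新闻</a> '
    '<a class="mnav" href="https://www.hao123.com">hao123</a> '
    '<a class="mnav" href="http://map.baidu.com">地图</a> '
    '</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Downward: .contents lists direct children, whitespace text nodes included.
div_children = soup.div.contents
# .descendants walks every node below a tag; keep only the tags here.
body_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

# Upward: .parents yields each enclosing node up to the BeautifulSoup object.
ancestor_names = [parent.name for parent in first.parents]

# Sideways: the immediate sibling of a tag is often a text node such as ' ',
# so filter by type or use find_next_sibling() to reach the next tag.
immediate = first.next_sibling
second = first.next_sibling.next_sibling
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")

print(ancestor_names)
print([t.string for t in later_tags])
```

The top-level html tag has the BeautifulSoup object as its parent, and that object's own parent is None, which matches the empty result of soup.parent in the log.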
HTML formatting and encoding with the bs4 library
Practice:
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r=requests.get("https://www.baidu.com/")
>>> r.encoding='UTF-8'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'...\n  <title>\n   百度一下,你就知道\n  </title>\n...关于百度\n...About Baidu\n...©2017\xa0Baidu\n...使用百度前必读\n...意见反馈\n...京ICP证030173号\n...'
>>> print(soup.prettify())
...
  <title>
   百度一下,你就知道
  </title>
...
>>>
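The difference between evaluating soup.prettify() and printing it is easier to see on a tiny document. The following sketch assumes a one-paragraph snippet instead of the Baidu page: prettify() returns an ordinary str with each tag and text node on its own line, and print() only renders the newlines.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (an assumption) instead of the fetched page.
soup = BeautifulSoup("<p>中文</p>", "html.parser")

# prettify() works on the whole soup or on a single tag and returns a str.
pretty = soup.p.prettify()
print(repr(pretty))   # the raw string, with newlines shown as \n
print(pretty)         # the same string rendered line by line

# Strip the indentation to compare the logical lines.
lines = [line.strip() for line in pretty.splitlines()]
```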
The encoding must be set to UTF-8.
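An alternative to fixing the encoding on the response is to hand BeautifulSoup the raw bytes and name the encoding through its from_encoding argument. This is a hedged sketch, assuming a hand-made UTF-8 byte string in place of r.content; whether it suits a given page depends on what the server actually sends.

```python
from bs4 import BeautifulSoup

# Hand-made UTF-8 bytes standing in for r.content (an assumption).
raw = "<html><head><title>百度</title></head><body></body></html>".encode("utf-8")

# from_encoding tells the parser how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
title = soup.title.string
print(title)

# Internally the tree holds Unicode text; encode() serialises it back to bytes.
encoded = soup.encode("utf-8")
print(type(encoded))
```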