This content was collected from the web and only lightly organized here; if any copyright issue arises, all rights belong to 小甲鱼.

0. Write down what you learned in this lesson. Any format is fine; recalling and restating is a good way to reinforce memory!

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html
Case 1: scraping Baidu Baike

import urllib.request
from bs4 import BeautifulSoup
import re
import ssl


def main():
    # Skip HTTPS certificate verification (convenient for a demo, unsafe in production).
    ssl._create_default_https_context = ssl._create_unverified_context
    url = "https://baike.baidu.com/view/284853.htm"
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")

    # Every tag whose href attribute matches the regex "view" is treated as an entry link.
    for each in soup.find_all(href=re.compile("view")):
        print(each.text, "->", ''.join(["https://baike.baidu.com", each["href"]]))

if __name__ == "__main__":
    main()

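Beautiful Soup extracts data from HTML or XML files, so we first use the urllib.request module to fetch the HTML from the target URL.
soup = BeautifulSoup(html, "html.parser") takes two arguments: the first is the HTML or XML content to parse, and the second names the parser. find_all(href=re.compile("view")) then returns every tag whose href attribute contains "view", and a for loop iterates over the results.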
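To see the filtering step in isolation, here is a minimal offline sketch. It assumes bs4 is installed; SAMPLE_HTML and extract_view_links are hypothetical names made up for illustration, and the fragment only imitates the shape of a real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/view/1.htm">First entry</a>
  <a href="/item/2.htm">Unrelated link</a>
  <a href="/view/3.htm">Second entry</a>
  <p>No link here</p>
</body></html>
"""


def extract_view_links(html, base="https://baike.baidu.com"):
    """Return (text, absolute_url) pairs for tags whose href contains "view"."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # A compiled regex passed as an attribute filter is matched with search(),
    # so "view" may appear anywhere inside the href value.
    for tag in soup.find_all(href=re.compile("view")):
        results.append((tag.text, base + tag["href"]))
    return results


links = extract_view_links(SAMPLE_HTML)
for text, url in links:
    print(text, "->", url)
```

Only the two /view/ links should be printed; the /item/ link and the paragraph are skipped because their tags have no href matching the pattern.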
We can download the page's HTML source to check this.

import urllib.request
import ssl
 
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html
 
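def saveHtml(file_name, file_content):
    # "/" is one of the characters Windows forbids in file names, so replace it.
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

# Skip HTTPS certificate verification (convenient for a demo, unsafe in production).
ssl._create_default_https_context = ssl._create_unverified_context
url = "https://baike.baidu.com/view/284853.htm"
html = getHtml(url)
saveHtml("code", html)  # saveHtml appends ".html", so this writes code.html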
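The save helper above replaces only "/", while Windows reserves several other characters as well. The following is a rough sketch of a broader sanitizer; safe_file_name is a hypothetical helper, not part of the original lesson, and it does not handle reserved device names such as CON or trailing dots.

```python
import re

# Characters that Windows does not allow in file names: < > : " / \ | ? *
_FORBIDDEN = re.compile(r'[<>:"/\\|?*]')


def safe_file_name(name, replacement="_"):
    """Replace every forbidden character in name with the replacement string."""
    return _FORBIDDEN.sub(replacement, name)


print(safe_file_name("view/284853"))
```

A name produced this way could be passed to saveHtml in place of the raw string.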

Case 2: let the user enter a search keyword, visit each entry in the results, check whether the entry has a subtitle, and print it if so.

import urllib.request
import urllib.parse
import re
import ssl
from bs4 import BeautifulSoup


def main():
    keyword = input("Enter a keyword: ")
    keyword = urllib.parse.urlencode({"word": keyword})
    # Skip HTTPS certificate verification (demo only); setting it once is enough.
    ssl._create_default_https_context = ssl._create_unverified_context
    # The URL must not contain spaces, neither after "?" nor inside the %s placeholder.
    response = urllib.request.urlopen("https://baike.baidu.com/search/word?%s" % keyword)
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")

    for each in soup.find_all(href=re.compile("view")):
        content = each.text
        url2 = ''.join(["https://baike.baidu.com", each["href"]])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        # Append the subtitle (the first <h2>) if the entry page has one.
        if soup2.h2:
            content = ''.join([content, soup2.h2.text])
        content = ''.join([content, "->", url2])
        print(content)


if __name__ == "__main__":
    main()
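The second case relies on two URL-building steps: encoding the keyword into a query string and turning each relative href into an absolute address. The sketch below shows both using only the standard library. build_search_url and absolute_url are hypothetical helper names, the keyword is an invented example, and urljoin is swapped in here as an alternative to the string concatenation used in the listing.

```python
from urllib.parse import urlencode, urljoin

BASE = "https://baike.baidu.com"


def build_search_url(keyword):
    """Build the search URL; urlencode escapes spaces and non-ASCII characters."""
    return BASE + "/search/word?" + urlencode({"word": keyword})


def absolute_url(href):
    """Resolve a relative href against BASE; absolute hrefs pass through unchanged."""
    return urljoin(BASE, href)


search_url = build_search_url("web crawler")
print(search_url)
print(absolute_url("/view/284853.htm"))
```

One reason to prefer urljoin is that an href which is already absolute is left alone, whereas plain concatenation would prepend the base a second time.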

Lesson 56: The Self-Cultivation of a Web Crawler: Beautiful Soup
