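Purpose: parse HTML content.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # html_doc is a string of HTML markup
BeautifulSoup converts the raw data into a parse tree using the specified parser, which makes the HTML easy to query.
Example: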
import requests
from bs4 import BeautifulSoup
url = "https://movie.douban.com/"
headers = {
"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36"
}
# Parse the response body
soup = BeautifulSoup(requests.get(url,headers=headers).content,"html.parser")
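The same constructor also accepts a plain string, which is handy for trying things out without a network request. A minimal sketch, assuming the bs4 package is installed; the html_doc markup is made up for illustration and is not taken from the page above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html_doc = """
<html><head><title>Sample page</title></head>
<body><p class="intro">Hello, <b>world</b>!</p></body></html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string  # text of the <title> tag
print(page_title)
```

Once the soup object exists, the lookups work the same way whether the markup came from requests, a file, or a string.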
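find() returns the first tag that matches the target. With only a tag name, it is equivalent to soup.tagName.
It can also locate a tag by attribute: soup.find(tagName, attribute=value).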
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('123.html'), "lxml")
list1 = soup.find('div', class_='index-left')  # 'div' is the target tag; class_='index-left' is the target attribute value
print(list1)
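One detail worth seeing in action is that find() returns None when nothing matches, so the result should be checked before use. A small sketch on made-up markup (the class names here are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two columns.
html_doc = """
<div class="index-left"><p>left column</p></div>
<div class="index-right"><p>right column</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ has a trailing underscore because "class" is a Python keyword.
left = soup.find("div", class_="index-left")
missing = soup.find("div", class_="no-such-class")

if left is not None:
    print(left["class"])  # class is multi-valued, so it comes back as a list
print(missing)            # None, because no tag matched
```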
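find_all() returns every tag that matches the target, as a list.
It can also filter by attribute to collect all tags with a given attribute value: soup.find_all(tagName, attribute=value).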
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('123.html'), "lxml")
list1 = soup.find_all('div', class_='index-left')  # 'div' is the target tag; class_='index-left' is the target attribute value
print(list1)
list2 = soup.find_all('img')  # every <img> tag in the document
print(list2)
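Because find_all() returns a list-like result, it is usually looped over to pull out an attribute from each tag. A sketch on invented markup (the file names are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching <div> blocks holding three images.
html_doc = """
<div class="index-left"><img src="a.png"><img src="b.png"></div>
<div class="index-left"><img src="c.png"></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

divs = soup.find_all("div", class_="index-left")
srcs = [img["src"] for img in soup.find_all("img")]  # read one attribute per tag
first_two = soup.find_all("img", limit=2)            # stop after two matches

print(len(divs), srcs, len(first_two))
```

When nothing matches, find_all() returns an empty list rather than None, so a loop over the result simply does nothing.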
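select() picks tags with CSS selectors: '>' matches direct children of the given tag, and a space matches descendants at any depth.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('123.html'), "lxml")
first_img = soup.select('.high-quality-list > ul > li > a > img')[0]  # path: .high-quality-list class -> ul -> li -> a -> img
links = soup.select('.high-quality-list > ul a')  # the space matches <a> at any depth below the <ul>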
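The difference between '>' and a space is easiest to see on a small tree where one link is nested a level deeper than the other. A sketch with made-up markup that reuses the .high-quality-list class name from the example above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> is wrapped in an extra <span>.
html_doc = """
<div class="high-quality-list">
  <ul>
    <li><a href="/one"><img src="one.png"></a></li>
    <li><span><a href="/two">two</a></span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# '>' requires each step to be a direct child, so the wrapped link is skipped.
direct = [a["href"] for a in soup.select(".high-quality-list > ul > li > a")]
# A space accepts any depth below <ul>, so both links match.
nested = [a["href"] for a in soup.select(".high-quality-list > ul a")]

print(direct)
print(nested)
```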
get_text() returns the text inside a tag.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('123.html'), "lxml")
text1 = soup.select('.high-quality-list > ul > li > a')[0].get_text()  # path: .high-quality-list class -> ul -> li -> a; stops at <a> because an <img> has no text of its own
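get_text() joins the text of a tag and all of its children, and its strip and separator arguments control the whitespace. A sketch on invented markup, which also shows that an <img> yields an empty string:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding spaces and a nested <b> tag.
html_doc = '<li><a href="/one"> First <b>item</b> </a><img src="x.png"></li>'
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.select("li > a")[0]
raw_text = link.get_text()                               # keeps the original whitespace
clean_text = link.get_text(separator=" ", strip=True)    # strips each piece, joins with a space
img_text = soup.img.get_text()                           # <img> has no children, so no text

print(repr(raw_text), repr(clean_text), repr(img_text))
```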
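soup.tagName returns the first tag with that name, where tagName is the target tag name.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('123.html'), "lxml")
first_li = soup.li  # a single Tag, or None if the page has no <li>
print(f"The first <li> has {len(first_li)} child nodes")  # len() of a Tag counts its children, not the number of matches
print(first_li)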
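Since soup.tagName yields a single tag rather than a list, it helps to compare it with find() and find_all() side by side. A sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical list with two items; the first contains a nested <b> tag.
html_doc = "<ul><li>one <b>1</b></li><li>two</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

first_li = soup.li               # attribute access: first <li> only
same_li = soup.find("li")        # the equivalent find() call
all_lis = soup.find_all("li")    # use find_all() to count matches
no_table = soup.table            # None when the tag does not exist

child_count = len(first_li)      # children of the first <li>: a text node and <b>
print(child_count, len(all_lis), no_table)
```

To count how many tags were found, take len() of the find_all() result; len() of a single tag only reports how many child nodes it holds.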