12. bs4 in Practice: Douban Top250 Crawler (Continued)
What to crawl
The Douban Top250 movie list
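Notes
1. headers: send a browser-like User-Agent with every request
2. Encoding: write the output file as UTF-8
3. Parse the pages with BeautifulSoup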
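To try the BeautifulSoup calls without hitting the site, here is a minimal sketch that runs against a hand-written HTML snippet. The markup, the example.com links, and the helper names extract_links and extract_score are assumptions made for illustration; the real page structure may differ. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration, not copied from the site)
SAMPLE_HTML = """
<ol class="grid_view">
  <li><a href="https://example.com/subject/1/">Movie A</a></li>
  <li><a href="https://example.com/subject/2/">Movie B</a></li>
</ol>
<strong class="ll rating_num">9.0</strong>
"""

def extract_links(html):
    """Return the href of the first link in each <li> of the grid_view list."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol", class_="grid_view")
    if ol is None:  # list container missing, e.g. a blocked or empty page
        return []
    links = []
    for li in ol.find_all("li"):
        a = li.find("a", href=True)
        if a is not None:
            links.append(a["href"])
    return links

def extract_score(html):
    """Return the rating text, or None when the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches a single class even when the tag carries several
    tag = soup.find("strong", class_="rating_num")
    return tag.get_text(strip=True) if tag is not None else None

links = extract_links(SAMPLE_HTML)
score = extract_score(SAMPLE_HTML)
print(links, score)
```

Checking for None before indexing keeps the crawler from raising an exception when a page comes back without the expected tags.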
Site:
https://movie.douban.com/top250
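The list is paginated through the start query parameter. As a rough sketch, the page URLs can be generated like this; the page size of 25 and the total of 250 are assumptions taken from the step used in the crawler code and from the name of the list, and page_urls is a hypothetical helper.

```python
BASE_URL = "https://movie.douban.com/top250?start={}&filter="
PAGE_SIZE = 25   # assumed number of movies per list page
TOTAL = 250      # assumed total number of movies

def page_urls():
    """Build one URL per list page: start=0, 25, ..., 225."""
    return [BASE_URL.format(start) for start in range(0, TOTAL, PAGE_SIZE)]

urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Stopping the range at 250 rather than 251 avoids requesting an eleventh page that would hold no movies.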
Sample code:
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}

# Collect the detail-page URLs from one list page
def get_detail_urls(url):
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    html = resp.text
    soup = BeautifulSoup(html, 'lxml')
    # Each movie is one <li> inside <ol class="grid_view">
    lis = soup.find('ol', class_='grid_view').find_all('li')
    detail_urls = []
    for li in lis:
        # The first link in each <li> points to the detail page
        detail_url = li.find('a')['href']
        # print(detail_url)
        detail_urls.append(detail_url)
    return detail_urls

# Parse one detail page and append a row to the open file f
def parse_detail_url(url, f):
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    html = resp.text
    soup = BeautifulSoup(html, 'lxml')
    # Movie title
    name = list(soup.find('div', id='content').find('h1').stripped_strings)
    # Join the list of strings into a single string
    name = ''.join(name)
    # print(name)
    # Director
    director = list(soup.find('div', id='info').find('span').find('span', class_='attrs').stripped_strings)
    director = ''.join(director)
    # print(director)
    # Screenwriter
    screenwriter = list(soup.find('div', id='info').find_all('span')[3].find('span', class_='attrs').stripped_strings)
    screenwriter = ''.join(screenwriter)
    # print(screenwriter)
    # Leading actors
    actor = list(soup.find('span', class_='actor').find('span', class_='attrs').stripped_strings)
    # print(actor)
    # Rating (the tag carries two classes: "ll" and "rating_num")
    score = soup.find('strong', class_='ll rating_num').string
    print(score)
    f.write('{},{},{},{},{}\n'.format(name, director, screenwriter, ''.join(actor), score))

def main():
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    with open('Top250.csv', 'a', encoding='utf-8') as f:
        # One list page per step: start = 0, 25, ..., 225
        for x in range(0, 250, 25):
            url = base_url.format(x)
            # Call get_detail_urls to collect this page's detail URLs
            detail_urls = get_detail_urls(url)
            for detail_url in detail_urls:
                parse_detail_url(detail_url, f)

if __name__ == '__main__':
    main()
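One thing to watch with the comma-joined output: a title or name that itself contains a comma would shift the columns. A possible alternative, sketched below, is the standard-library csv module, which quotes such fields automatically. The helper name write_movie_row, the "/" separator for actors, and the sample values are assumptions for illustration, not part of the original code.

```python
import csv
import io

def write_movie_row(fileobj, name, director, screenwriter, actors, score):
    """Write one movie as a CSV row; csv.writer quotes fields containing commas."""
    writer = csv.writer(fileobj)
    writer.writerow([name, director, screenwriter, "/".join(actors), score])

# An in-memory buffer stands in for the real Top250.csv file
buffer = io.StringIO()
write_movie_row(buffer, "Example, The Movie", "Director A", "Writer B",
                ["Actor C", "Actor D"], "9.0")

# Reading the row back shows the comma in the title survived intact
row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(row)
```

When writing to a real file, the csv documentation recommends opening it with newline='' (here together with encoding='utf-8') so that line endings are not doubled on Windows.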
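Previous article: Chapter 3, Data Parsing (12), 2019-12-22. Link:
https://www.jianshu.com/p/a70e17e2e7c9
Next article: Chapter 3, Data Parsing (13), 2019-12-24. Link:
https://www.jianshu.com/p/3303a724cd67
The material above was collected from the internet and is for study and exchange only. If it infringes on your rights, please message me privately and I will remove it. Thank you.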