Chapter 3: Data Parsing (12), continued 2019-12-23

12. bs4 in practice: Douban Top250 crawler (continued)

 

 

What to scrape

 

Scrape the Douban Top250


Things to note


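1. headers (send a browser User-Agent with every request)

2. Encoding (the output file is opened with encoding='utf-8')

3. Using BeautifulSoup (with the lxml parser)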
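
On the encoding note: requests guesses the text encoding from the response headers, and a wrong guess shows up as garbled Chinese text. Below is a minimal sketch of decoding the raw bytes explicitly, assuming the page is served as UTF-8. decode_html is a hypothetical helper written for illustration; it is not part of requests or bs4.

```python
def decode_html(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode response bytes explicitly instead of relying on a guessed encoding.

    `raw` stands in for `resp.content` from requests.
    The UTF-8 default is an assumption about the target page.
    """
    # errors="replace" substitutes U+FFFD for a bad byte instead of raising
    return raw.decode(encoding, errors="replace")


sample = "肖申克的救赎".encode("utf-8")
print(decode_html(sample))
```

With requests, the equivalent would be resp.content.decode('utf-8'), or setting resp.encoding = 'utf-8' before reading resp.text.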


Site:

https://movie.douban.com/top250
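
The list is paginated through the start query parameter, which the example code's base_url fills in. Below is a small sketch of generating the list-page URLs, assuming 25 films per page and 250 films in total. page_urls is a hypothetical helper name.

```python
BASE_URL = "https://movie.douban.com/top250?start={}&filter="


def page_urls(total=250, per_page=25):
    """Build one list-page URL per page of results.

    The defaults (250 films, 25 per page) are assumptions about the site layout.
    """
    # start takes the offsets 0, 25, 50, ... up to but not including `total`
    return [BASE_URL.format(start) for start in range(0, total, per_page)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Stopping the range before 250 avoids requesting an offset past the last film.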



Example code:

import requests

from bs4 import BeautifulSoup

 

headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}

 

# Collect the detail-page URLs from one list page

def get_detail_urls(url):

    resp = requests.get(url, headers=headers)

    # print(resp.text)

    html = resp.text

    soup = BeautifulSoup(html, 'lxml')

    lis = soup.find('ol',class_='grid_view').find_all('li')

    detail_urls = []

    for li in lis:

        detail_url = li.find('a')['href']

        # print(detail_url)

        detail_urls.append(detail_url)

    return detail_urls

 

# Parse one detail page and write its fields as a row to the open file f

def parse_detail_url(url, f):

    # Fetch the detail page

    resp = requests.get(url, headers=headers)

    # print(resp.text)

    html = resp.text

    soup = BeautifulSoup(html, 'lxml')

    # Movie title

    name = list(soup.find('div',id='content').find('h1').stripped_strings)

    # Join the list of strings into a single string

    name = ''.join(name)

    # print(name)

    # Director

    director = list(soup.find('div',id='info').find('span').find('span', class_='attrs').stripped_strings)

    director = ''.join(director)

    # print(director)

    # Screenwriter

    screenwriter = list(soup.find('div',id='info').find_all('span')[3].find("span",class_='attrs').stripped_strings)

    screenwriter = ''.join(screenwriter)

    # print(screenwriter)

    # Lead actors

    actor = list(soup.find('span',class_='actor').find('span', class_='attrs').stripped_strings)

    # print(actor)

    # Rating

    score = soup.find('strong', class_='ll rating_num').string

    print(score)

    f.write('{},{},{},{},{}\n'.format(name,director, screenwriter, ''.join(actor), score))

 

def main():

    base_url ='https://movie.douban.com/top250?start={}&filter='

    with open('Top250.csv', 'a',encoding='utf-8') as f:

        # Walk the list pages (start = 0, 25, ..., 225) and call get_detail_urls on each

        for x in range(0, 250, 25):

            url = base_url.format(x)

            detail_urls = get_detail_urls(url)

            for detail_url in detail_urls:

                parse_detail_url(detail_url, f)

 

 

if __name__ == '__main__':

    main()
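
The example writes each row with a comma-joined format string, so a field that itself contains a comma would shift the columns. Below is a minimal sketch of the same write using the standard csv module, which quotes such fields. write_row is a hypothetical helper, and the sample values are made up for illustration.

```python
import csv
import io


def write_row(f, name, director, screenwriter, actors, score):
    """Write one film as a CSV row; csv.writer quotes fields that contain commas."""
    writer = csv.writer(f)
    # actors is a list of strings, joined the same way the example code joins it
    writer.writerow([name, director, screenwriter, "".join(actors), score])


# An in-memory buffer stands in for the open Top250.csv file
buf = io.StringIO()
write_row(buf, "Example, The", "A. Director", "B. Writer", ["X", "/", "Y"], "9.0")
print(buf.getvalue())
```

When writing to a real file with the csv module, open it with newline='' so that row terminators are not translated a second time.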



Previous article: Chapter 3: Data Parsing (12) 2019-12-22 URL:

https://www.jianshu.com/p/a70e17e2e7c9

Next article: Chapter 3: Data Parsing (13) 2019-12-24 URL:

https://www.jianshu.com/p/3303a724cd67



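The material above comes from the internet and is for study and exchange only. If it infringes on your rights, message me privately and I will remove it. Thank you.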
