python_crawler: Scraping the Kugou TOP500 Chart with Requests and BeautifulSoup

This script crawls the Kugou TOP500 chart and prints each song's rank, singer, title, and play time.

# -*- coding: utf-8 -*-

"""
Date: 2018-7-12
Auther: Liam
python_version: py3.6
package: requests, Beautifulsoup, time
platform: W10

function: 爬取酷狗Top500的排行榜信息,包括排行,歌手,歌名,播放时间
"""

import requests
from bs4 import BeautifulSoup
import time

# Request headers with a browser User-Agent so the site serves the normal page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}


# Spider class with a get_info() method that extracts the chart data from one page
class Spider:
    def __init__(self):
        print('The spider is crawling...')

    def get_info(self, url):
        # Fetch the page source with a browser-like GET request
        response = requests.get(url, headers=headers)
        # Parse the HTML with the lxml parser
        bf = BeautifulSoup(response.text, 'lxml')
        # Rank labels
        ranks = bf.select('span.pc_temp_num')
        # "singer - song" title links
        titles = bf.select('div.pc_temp_songlist > ul > li > a')
        # Play-time labels
        play_times = bf.select('span.pc_temp_tips_r > span')
        for rank, title, play_time in zip(ranks, titles, play_times):
            # Split only on the first " - " so a hyphen inside a
            # song name does not truncate the title
            singer, _, song = title.get_text().strip().partition(' - ')
            # Collect the details for one song into a dict
            data = {
                'rank': rank.get_text().strip(),
                'singer': singer.strip(),
                'song': song.strip(),
                'time': play_time.get_text().strip()
            }
            print(data)


if __name__ == '__main__':
    # Instantiate the spider
    spider1 = Spider()
    # The chart is paginated; pages 1-23 cover the full TOP500
    urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html?from=homepage'
            .format(number) for number in range(1, 24)]
    for url in urls:
        spider1.get_info(url)
        # Pause between requests to avoid hammering the server
        time.sleep(2)
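The singer/song split in get_info() assumes each chart row renders its title as "singer - song". That split logic can be checked in isolation; a minimal sketch (the helper name split_title is hypothetical, not part of the script above):

```python
def split_title(text):
    # Chart rows render as "singer - song"; partition on the first
    # " - " (with spaces) so a bare hyphen inside a name survives.
    singer, sep, song = text.strip().partition(' - ')
    if not sep:
        # Unexpected format: keep everything as the song title
        return '', text.strip()
    return singer.strip(), song.strip()

print(split_title('Eagles - Hotel California'))  # → ('Eagles', 'Hotel California')
print(split_title('A-Lin - Give Me a Reason'))   # → ('A-Lin', 'Give Me a Reason')
```

Because partition() only splits on the first occurrence of the separator, a hyphenated artist name like "A-Lin" stays intact, which a plain split('-') would break.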

A few notes on the Requests package
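One useful feature of Requests is that you can build a request and inspect exactly what will be sent before anything goes over the network. A small sketch, reusing the chart URL and User-Agent idea from the script above (no network access needed):

```python
import requests

# Build (but do not send) a GET request to see what Requests will transmit
req = requests.Request('GET',
                       'http://www.kugou.com/yy/rank/home/1-8888.html',
                       headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()

print(prepared.method)                   # GET
print(prepared.url)
print(prepared.headers['User-Agent'])    # the header we set above
```

When you do send the request (as get_info() does with requests.get()), passing a timeout and calling response.raise_for_status() are worth adding so a dead connection or an HTTP error page fails loudly instead of silently producing empty results.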
