A simple Python crawler: scraping the top 50 most popular streamers on Douyu

1. URL analysis

I chose the Douyu directory page for the Honor of Kings (王者荣耀) category: https://www.douyu.com/directory/game/wzry

I play Honor of Kings myself and occasionally watch the streams.

2. Fetching the page

First, import two modules (search online if you need installation instructions; installing packages through PyCharm is much easier):

from bs4 import BeautifulSoup
import requests

Then give requests the URL:

url = 'https://www.douyu.com/directory/game/wzry'

Because the site blocks naive crawlers, we also need to fake a request header (you can find your own User-Agent string in the browser's developer tools):

header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
        }
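Before making any network call, you can check that the header is actually attached to the request object. A quick sketch using urllib (the same library the complete code below relies on); note that urllib normalizes the capitalization of header names:

```python
import urllib.request

# The same spoofed User-Agent header as above
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
url = 'https://www.douyu.com/directory/game/wzry'

req = urllib.request.Request(url, headers=header)
# urllib stores header names with only the first letter capitalized
print(req.get_header('User-agent'))
print(req.full_url)
```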

The complete page-fetching code is:

import urllib.request
from bs4 import BeautifulSoup


def fetch_content(url):
    # The URL is passed in from the main block, so it is no longer hard-coded here
    print(url)
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    # The site blocks naive crawlers, so build a realistic HTTP request header
    req = urllib.request.Request(url, headers=header)

    # Fetch the page content
    r = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(r, 'html.parser')

3. Extracting the information we need

Find each streamer's name, popularity count, and room link:

divList = soup.findAll("span", attrs={"class": "dy-name ellipsis fl"})
name = soup.findAll("span", attrs={"class": "dy-num fr"})
link = soup.findAll("a", attrs={"class": "play-list-link"})
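The class-based lookups above can be tried out on a small HTML fragment first. The markup below only imitates Douyu's listing structure with made-up sample data (the real page's classes may have changed since this was written):

```python
from bs4 import BeautifulSoup

# A tiny fragment imitating Douyu's listing markup (hypothetical sample data)
html = '''
<a class="play-list-link" href="/123456">
  <span class="dy-name ellipsis fl">StreamerA</span>
  <span class="dy-num fr">88.8万</span>
</a>
<a class="play-list-link" href="/654321">
  <span class="dy-name ellipsis fl">StreamerB</span>
  <span class="dy-num fr">66.6万</span>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# Matching on the exact full class string works when the page's
# class attribute matches it verbatim, as in the fragment above
names = soup.find_all("span", attrs={"class": "dy-name ellipsis fl"})
nums = soup.find_all("span", attrs={"class": "dy-num fr"})
links = soup.find_all("a", attrs={"class": "play-list-link"})

for n, p, a in zip(names, nums, links):
    print(n.string, p.string, "https://www.douyu.com" + a.get("href"))
```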

Use a for loop to print out the top 50 streamers by popularity:

for i in range(0, 50):
    print(divList[i].string)
    print(name[i].string)
    print("https://www.douyu.com" + link[i].get("href"))
    print("-------------")


4. Save the data as a text file:
with open('D:\\douyu.txt', mode='a', encoding='utf-8') as jb:
    jb.write(divList[i].string)
    jb.write("\n")
    jb.write(name[i].string)
    jb.write("\n")
    jb.write("https://www.douyu.com" + link[i].get("href"))
    jb.write("\n")
    jb.write("\n")
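The hard-coded `D:\\douyu.txt` path only works on Windows. A more portable sketch builds the path with `os.path` and `tempfile` instead; the records here are hypothetical stand-ins for the scraped data:

```python
import os
import tempfile

# Hypothetical scraped records: (name, popularity, room URL)
records = [
    ("StreamerA", "88.8万", "https://www.douyu.com/123456"),
    ("StreamerB", "66.6万", "https://www.douyu.com/654321"),
]

# Build an OS-independent path instead of hard-coding a Windows drive letter
path = os.path.join(tempfile.gettempdir(), "douyu.txt")
with open(path, mode="w", encoding="utf-8") as f:
    for name, num, url in records:
        f.write(name + "\n" + num + "\n" + url + "\n\n")

# Read the file back to confirm what was written
with open(path, encoding="utf-8") as f:
    print(f.read())
```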

5. Complete code

import urllib.request
from bs4 import BeautifulSoup


def fetch_content(url):
    # The URL is passed in from the main block, so it is no longer hard-coded here
    print(url)
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    # The site blocks naive crawlers, so build a realistic HTTP request header
    req = urllib.request.Request(url, headers=header)

    # Fetch the page content
    r = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(r, 'html.parser')

    # Find each streamer's name, popularity count, and room link
    divList = soup.findAll("span", attrs={"class": "dy-name ellipsis fl"})
    name = soup.findAll("span", attrs={"class": "dy-num fr"})
    link = soup.findAll("a", attrs={"class": "play-list-link"})

    # Print the top 50 streamers and their room links
    for i in range(0, 50):
        print(divList[i].string)
        print(name[i].string)
        print("https://www.douyu.com" + link[i].get("href"))
        print("-------------")

        # Save the data as text
        with open('D:\\douyu.txt', mode='a', encoding='utf-8') as jb:
            jb.write(divList[i].string)
            jb.write("\n")
            jb.write(name[i].string)
            jb.write("\n")
            jb.write("https://www.douyu.com" + link[i].get("href"))
            jb.write("\n")
            jb.write("\n")


if __name__ == "__main__":
    url = 'https://www.douyu.com/directory/game/wzry'
    fetch_content(url)
















