Scraping Every Video in a YouTube Channel with Python

Life handed me this requirement.

The task: for a given YouTube channel and a given date range, scrape each video's information — title, like count, dislike count, view count, and so on.

First you need a Google account and a proxy or VPN that can reach YouTube. Open YouTube, search for the target channel, and go to its channel page.

(Screenshot 1: the channel page on YouTube)

Then view the page source and search for "channel" — that gets you the channel ID.
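If you'd rather not dig through page source, the channel ID can also be resolved through the API itself. A minimal sketch, assuming the channel still has a legacy username (as "Samsung" does) and that you already have an API key:

```python
import requests

def extract_channel_id(payload):
    """Pull the first channel ID out of a channels.list response body."""
    items = payload.get('items', [])
    return items[0]['id'] if items else None

def resolve_channel_id(username, api_key):
    """Resolve a legacy YouTube username (e.g. 'Samsung') to its channel ID
    with one channels.list call."""
    resp = requests.get(
        'https://www.googleapis.com/youtube/v3/channels',
        params={'part': 'id', 'forUsername': username, 'key': api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return extract_channel_id(resp.json())
```

Note that `forUsername` only works for channels that have a legacy username; for others you are back to reading the page source.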

You also need your own API key for the Google Data API: create one in the Google Cloud Console and enable the YouTube Data API v3 for it.

The two Google API endpoints involved are search.list (to list a channel's videos) and videos.list (to fetch per-video statistics); every request also needs your own API key:

https://www.googleapis.com/youtube/v3/search
https://www.googleapis.com/youtube/v3/videos

My original plan was to collect every video in the channel and then filter by publish date. It turns out Google caps the API at (roughly) 500 search results, so videos went missing. I puzzled over this for quite a while before finally Googling my way to the answer (Googling a Google problem…).

A relevant excerpt:

"We can't serve more than 500 search results for an arbitrary YouTube query through the API without a serious drop in result quality (duplicates, etc.).

The v1/v2 GData APIs were updated in November to cap the number of returned search results at 500. If you specify a start index of 500 or higher, you won't get any results."

So, to retrieve every video published in the target period, add publish-date bounds to the query parameters and search one time slice at a time. Note that each slice's search is still capped at 500 results!

publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z
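The slicing can be sketched like this — a helper that splits the target period into consecutive windows formatted as the RFC 3339 timestamps that publishedAfter/publishedBefore expect. The window length is an arbitrary choice; a window that still yields more than 500 results would need to be split further:

```python
from datetime import datetime, timedelta

def date_windows(start, end, days=30):
    """Split [start, end) into consecutive windows of at most `days` days,
    as (publishedAfter, publishedBefore) RFC 3339 timestamp pairs."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    windows = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        windows.append((cur.strftime(fmt), nxt.strftime(fmt)))
        cur = nxt
    return windows

# One search.list pass per window covers the whole period:
windows = date_windows(datetime(2016, 11, 1), datetime(2017, 12, 31), days=60)
```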

The full code:

# -*- coding: UTF-8 -*-
import requests

channel = "Samsung"  # channel name
channel_id = 'UCWwgaK7x0_FR1goeSRazfsQ'  # channel ID


class YouTubeCrawler:
    def __init__(self):
        self.video_ids = []
        self.max_results = 50  # results per page (the API caps this at 50)
        self.app_key = 'your own API key'
        self.search_api = 'https://www.googleapis.com/youtube/v3/search'
        self.info_api = 'https://www.googleapis.com/youtube/v3/videos'

    def get_all_video_in_channel(self, channel_id):
        params = {
            'key': self.app_key,
            'channelId': channel_id,
            'part': 'snippet,id',
            'publishedAfter': '2016-11-01T00:00:00Z',
            'publishedBefore': '2017-12-31T00:00:00Z',
            'order': 'date',
            'maxResults': self.max_results,
        }
        while True:
            result = requests.get(self.search_api, params=params).json()
            for item in result.get('items', []):
                try:
                    self.video_ids.append(item['id']['videoId'])  # collect video IDs
                except KeyError:
                    pass  # playlist/channel results carry no videoId

            if 'nextPageToken' not in result:
                print("no nextPageToken")
                break
            params['pageToken'] = result['nextPageToken']  # fetch the next page

    def get_videos_info(self):  # fetch each video's statistics
        count = 0
        with open(channel_id + '.txt', 'w', encoding='utf-8') as f:
            print(len(self.video_ids))
            for video_id in self.video_ids:
                try:
                    count += 1
                    resp = requests.get(self.info_api,
                                        params={'id': video_id,
                                                'part': 'snippet,statistics',
                                                'key': self.app_key})
                    for video in resp.json()['items']:
                        stats = video['statistics']
                        like_count = int(stats.get('likeCount', 0))
                        # dislikeCount is no longer public since late 2021; default to 0
                        dislike_count = int(stats.get('dislikeCount', 0))
                        # publishedAt is RFC 3339, so its first 10 chars are YYYY-MM-DD
                        day = video['snippet']['publishedAt'][:10]
                        print(day, count)
                        if '2016-11-01' <= day <= '2017-12-31':
                            print(video['snippet']['title'], day, like_count,
                                  int(stats['viewCount']))
                            f.write("%s\t%s\t%s\t%s\t%s\n" % (
                                video['snippet']['title'], day, like_count,
                                dislike_count, stats['viewCount']))
                except Exception as e:
                    print(e, count)
        return 1

    def main(self):
        self.get_all_video_in_channel(channel_id)
        return self.get_videos_info()


if __name__ == "__main__":
    c = YouTubeCrawler()
    c.main()
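One efficiency note on fetching the statistics: videos.list accepts up to 50 comma-separated IDs per call, so querying one ID at a time makes roughly 50x more requests (and burns 50x more quota) than necessary. A sketch of the batching, with the chunk size of 50 taken from the API's documented limit:

```python
def chunked(seq, size=50):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Each batch then becomes a single videos.list call, e.g.:
#   requests.get(info_api, params={'id': ','.join(batch),
#                                  'part': 'snippet,statistics',
#                                  'key': app_key})
```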

Google API docs: https://developers.google.com/youtube/v3/docs/channelSections

API key console: https://console.developers.google.com/
