Life handed me a requirement: crawl the videos published by a given YouTube channel within a given date range and collect information such as the title, like count, dislike count, and view count.
First you need a Google account and a proxy tool to get past the firewall. Open YouTube, search for the target channel, and go to the channel page.
Then view the page source and search for "channel" to find the channel's channel ID.
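If you prefer not to dig through the source by hand, a minimal sketch like the one below can pull the ID out of the channel page. It assumes the HTML still embeds a /channel/UC... URL (for example in the canonical link); the helper name is mine, and the regex may need adjusting if the markup differs.

# Hedged sketch: fetch the channel page and regex out the channel ID.
# Assumes the HTML contains a "/channel/UC..." URL; illustration only.
import re
import requests

def find_channel_id(channel_url):
    html = requests.get(channel_url).text
    match = re.search(r'channel/(UC[0-9A-Za-z_-]{22})', html)
    return match.group(1) if match else None

# e.g. print(find_channel_id('https://www.youtube.com/<the channel page>'))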
You also need your own personal API key for the Google Data APIs: create an API key in the console (following any of the usual tutorials) and enable the YouTube Data API for it.
Here are the Google API addresses used below:
self.app_key = 'your own API key'
self.channel_api = 'https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.channels.list?part=snippet,contentDetails&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&id='+ channel_id + '&key=' + self.app_key
self.info_api = 'https://www.googleapis.com/youtube/v3/videos'
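Note that the channel_api string above is the API Explorer page rather than a callable REST endpoint, and publishedAfter/publishedBefore are search.list parameters that channels.list does not take. A minimal sketch of calling channels.list at its real endpoint (the key value is a placeholder):

# Hedged sketch: call channels.list directly instead of via the API Explorer URL.
import requests

APP_KEY = 'your own API key'
resp = requests.get('https://www.googleapis.com/youtube/v3/channels',
                    params={'part': 'snippet,contentDetails',
                            'id': 'UCWwgaK7x0_FR1goeSRazfsQ',
                            'key': APP_KEY})
print(resp.json()['items'][0]['snippet']['title'])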
My original plan was to list all the video URLs first and then filter them by publish date. Unfortunately, Google caps the API at roughly 500 search results per query, so some videos went missing. I spent a long time looking for a workaround and eventually Googled the answer (Googling a Google problem = =).
Relevant excerpt:
"We can't provide more than 500 search results for arbitrary YouTube queries via the API without the quality of the search results degrading significantly (duplicates, etc.).
The v1/v2 GData APIs were updated in November to cap the number of search results returned at 500. If you specify a start index of 500 or higher, you won't get back any results."
So, to capture every video published in the target period, add the publish-date bounds to the request parameters and search the period in slices. Each slice's search is still capped at 500 results, so pay special attention to how wide you make the slices!
publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z
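As a sketch of the slicing idea (the slice widths and the helper below are illustrative assumptions, not part of the original script):

# Hedged sketch: run one search.list query per date slice, since every
# individual query is still capped at roughly 500 results.
import requests

APP_KEY = 'your own API key'
CHANNEL_ID = 'UCWwgaK7x0_FR1goeSRazfsQ'

def search_slice(published_after, published_before):
    # Collect the videoIds published inside one [after, before) slice.
    params = {'key': APP_KEY, 'channelId': CHANNEL_ID, 'part': 'id',
              'type': 'video', 'order': 'date', 'maxResults': 50,
              'publishedAfter': published_after,
              'publishedBefore': published_before}
    ids = []
    while True:
        data = requests.get('https://www.googleapis.com/youtube/v3/search',
                            params=params).json()
        ids += [item['id']['videoId'] for item in data.get('items', [])
                if 'videoId' in item['id']]
        if 'nextPageToken' not in data:
            return ids
        params['pageToken'] = data['nextPageToken']

# A few months per slice keeps each query safely under the ~500-result cap.
slices = [('2016-11-01T00:00:00Z', '2017-04-01T00:00:00Z'),
          ('2017-04-01T00:00:00Z', '2017-09-01T00:00:00Z'),
          ('2017-09-01T00:00:00Z', '2017-12-31T00:00:00Z')]
video_ids = []
for after, before in slices:
    video_ids += search_slice(after, before)
print(len(video_ids))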
The full code follows:
# -*- coding: UTF-8 -*-
import urllib2
import time
import urllib
import json
import datetime
import requests
import sys
import xlsxwriter

reload(sys)
sys.setdefaultencoding("utf-8")

channel = "Samsung"                       # channel name
channel_id = 'UCWwgaK7x0_FR1goeSRazfsQ'   # channel ID


class YoukuCrawler:
    def __init__(self):
        self.video_ids = []
        self.maxResults = 50  # results returned per request
        self.app_key = 'your own API key'
        # API Explorer link, kept for reference (not called below)
        self.channel_api = 'https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.channels.list?part=snippet,contentDetails&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&id=' + channel_id + '&key=' + self.app_key
        # self.info_api = 'https://www.googleapis.com/youtube/v3/videos?maxResults=50&part=snippet,statistics' + '&key=' + self.app_key
        self.info_api = 'https://www.googleapis.com/youtube/v3/videos'

    def get_all_video_in_channel(self, channel_id):
        # Page through search.list and collect every videoId in the date range.
        base_search_url = 'https://www.googleapis.com/youtube/v3/search?'
        first_url = base_search_url + 'key={}&channelId={}&part=snippet,id&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&order=date&maxResults=25'.format(self.app_key, channel_id)
        url = first_url
        while True:
            print url
            request = urllib2.Request(url=url)
            response = urllib2.urlopen(request)
            page = response.read()
            result = json.loads(page, encoding="utf-8")
            for i in result['items']:
                try:
                    self.video_ids.append(i['id']['videoId'])  # collect the video ID
                except KeyError:
                    pass  # channel/playlist results carry no videoId
            try:
                next_page_token = result['nextPageToken']  # token for the next page of results
                url = first_url + '&pageToken={}'.format(next_page_token)
            except KeyError:
                print "no nextPageToken"
                break

    def main(self):
        self.get_all_video_in_channel(channel_id)
        return self.get_videos_info()

    def get_videos_info(self):
        # Fetch snippet + statistics for every collected video ID.
        url = self.info_api
        count = 0
        f = open(channel_id + '.txt', 'w')
        print len(self.video_ids)
        for query in self.video_ids:
            try:
                count += 1
                results = requests.get(url,
                                       params={'id': query, 'maxResults': self.maxResults,
                                               'part': 'snippet,statistics',
                                               'key': self.app_key})
                page = results.content
                videos = json.loads(page, encoding="utf-8")['items']
                for video in videos:
                    try:
                        like_count = int(video['statistics']['likeCount'])
                    except KeyError:
                        like_count = 0
                    try:
                        dislike_count = int(video['statistics']['dislikeCount'])
                    except KeyError:
                        dislike_count = 0
                    temp = time.mktime(time.strptime(video['snippet']['publishedAt'], "%Y-%m-%dT%H:%M:%S.000Z"))
                    dateArray = datetime.datetime.utcfromtimestamp(int(temp))
                    otherStyleTime = dateArray.strftime("%Y-%m-%d")
                    print otherStyleTime, count
                    if '2016-11-01' <= otherStyleTime <= '2017-12-31':
                        print video['snippet']['title'], otherStyleTime, like_count, dislike_count, int(video['statistics']['viewCount'])
                        f.write("%s\t%s\t%s\t%s\t%s\n" % (video['snippet']['title'], otherStyleTime,
                                                          str(like_count), str(dislike_count),
                                                          video['statistics']['viewCount']))
                    if otherStyleTime <= '2016-10-01':
                        f.close()
                        return 1
            except Exception, e:
                print e, count
        f.close()
        return 1


if __name__ == "__main__":
    c = YoukuCrawler()
    c.main()
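The script imports xlsxwriter but never uses it. If you would rather have the results in a spreadsheet than in the tab-separated .txt file, a minimal sketch along these lines should do (the column headers and .xlsx filename are my assumptions):

# Hedged sketch: convert the tab-separated output into an .xlsx workbook.
import xlsxwriter

workbook = xlsxwriter.Workbook(channel_id + '.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write_row(0, 0, ['title', 'published', 'likes', 'dislikes', 'views'])
with open(channel_id + '.txt') as f:
    for row, line in enumerate(f, start=1):
        worksheet.write_row(row, 0, line.rstrip('\n').split('\t'))
workbook.close()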
Google API reference: https://developers.google.com/youtube/v3/docs/channelSections
API Key console: https://console.developers.google.com/