Scraping TED Talk Videos with Python (Code)

Environment: Windows + Python 3.6 + PyCharm (optional)

Python libraries/modules used: requests, bs4, os, random, you-get

Prerequisites: basic use of requests, BeautifulSoup's find_all(), os.system("cmd command"), and you-get
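As a quick refresher on find_all(), here is a minimal sketch that runs offline against a small hand-written HTML table. The table contents and the helper name extract_addresses are made up for illustration; real proxy-list pages may use a different layout.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (hypothetical data, not taken from any real site).
SAMPLE_HTML = """
<table>
  <tr><th>Flag</th><th>IP</th><th>Port</th></tr>
  <tr><td>CN</td><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>CN</td><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""

def extract_addresses(html):
    """Return 'ip:port' strings from every row that has at least three <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    addresses = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) < 3:
            # Header rows only contain <th> cells, so they are skipped here.
            continue
        addresses.append(cells[1].text.strip() + ':' + cells[2].text.strip())
    return addresses

print(extract_addresses(SAMPLE_HTML))
```

find_all() returns a list of matching tags, so an empty result simply means the row has no cells of that type.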

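Steps:

1. For scrapers I habitually use an IP proxy pool. Some sites have no anti-scraping measures, but using one does no harm. Wrapping the proxy pool in its own module lets you call it from anywhere.

Here is the code, get_ip.py: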

import requests
from bs4 import BeautifulSoup
import random

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}

def get_ip_list():                                   # scrape a batch of IPs directly from a proxy-list site
    url = 'http://www.xicidaili.com/nn/'             # proxy-list site
    response = requests.get(url, headers=head).text
    bs = BeautifulSoup(response, 'html.parser')
    ips = bs.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):                     # start at 1 to skip the header row
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        if len(tds) < 3:                             # skip rows without IP and port cells
            continue
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip():                                   # pick a random IP address from the pool
    ips_list = get_ip_list()
    proxy_list = []
    for ip in ips_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies
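One thing to keep in mind: requests selects a proxy by the scheme of the target URL, so a dict with only an 'http' key is not used for https:// pages. A minimal sketch of building a mapping for both schemes follows. The function name build_proxies and the sample addresses are hypothetical, and it assumes the chosen proxy is able to tunnel HTTPS traffic.

```python
import random

def build_proxies(ip_list, rng=random):
    """Pick one 'ip:port' entry and map it to both URL schemes."""
    if not ip_list:
        raise ValueError('proxy list is empty')
    proxy_url = 'http://' + rng.choice(ip_list)
    # The same proxy endpoint is registered for plain and TLS requests.
    return {'http': proxy_url, 'https': proxy_url}

# Hypothetical addresses, for illustration only.
sample_ips = ['10.0.0.1:8080', '10.0.0.2:3128']
proxies = build_proxies(sample_ips)
print(proxies)
```

The resulting dict can be passed as the proxies= argument of requests.get.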

2. Now let's implement the TED scraper

(1) Analyze the TED pages. I'll just state the pattern here:

        TED home page: https://www.ted.com/

        TED talk list page: https://www.ted.com/talks?page=1, where the trailing page=1 means the first page of the list, page=2 the second, and so on
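The page pattern above can be turned into a list of URLs with a short helper. This is only a sketch; the function name list_page_urls is hypothetical.

```python
def list_page_urls(count, base='https://www.ted.com/talks'):
    """Return the URLs of list pages 1..count."""
    return ['{}?page={}'.format(base, page) for page in range(1, count + 1)]

for page_url in list_page_urls(3):
    print(page_url)
```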

(2) Here is the code, get_TED.py:

import requests
from get_ip import get_random_ip
from bs4 import BeautifulSoup
import os

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}
proxies = get_random_ip()                # passed to requests via proxies=, not as a header
path = r'F:\TED'

def get_TED(url, count):
    page_part = url.split('=')
    for i in range(1, count+1):
        url_ted = page_part[0] + '=' + str(i)
        # headers= sends the dict as request headers (params= would append it to the query string)
        response = requests.get(url_ted, headers=head, proxies=proxies)
        html = response.text
        bs = BeautifulSoup(html, 'html.parser')
        talks_list = bs.find_all('div', attrs={'class': 'media__message'})
        for j in range(len(talks_list)):
            ted_a = talks_list[j].find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
            if not ted_a:                # no talk link in this block
                continue
            ted_url = 'https://www.ted.com' + ted_a[0]['href']
            print("TED talk title: " + ted_a[0].text)
            os.system(r'you-get -o {} {}'.format(path, ted_url))

if __name__ == '__main__':
    url = 'https://www.ted.com/talks?page=1'
    count = int(input("Enter the number of pages to download (36 TED talks per page): "))
    get_TED(url, count)
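os.system builds a single shell string, so an output directory or URL containing spaces or shell metacharacters can break the command. One possible alternative, sketched here, is to pass the arguments as a list to subprocess.run. The helper names build_you_get_command and download_talk, the sample directory, and the sample talk URL are hypothetical; only the -o option already used above is assumed.

```python
import subprocess

def build_you_get_command(output_dir, talk_url):
    """Return the you-get invocation as an argument list (no shell quoting needed)."""
    return ['you-get', '-o', output_dir, talk_url]

def download_talk(output_dir, talk_url):
    """Run you-get and report whether it exited successfully."""
    # Assumes the you-get executable is installed and on PATH.
    result = subprocess.run(build_you_get_command(output_dir, talk_url))
    return result.returncode == 0

# Only builds and prints the command; nothing is downloaded here.
print(build_you_get_command(r'F:\TED talks', 'https://www.ted.com/talks/example_talk'))
```

Checking the return code also makes it possible to log which talks failed instead of silently moving on.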


The code is available on my GitHub: https://github.com/goodloving/python
