Douban Crawler

1. Get the User-Agent header value that the crawler will send:
[Image 1: Douban crawler]
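By default, urllib identifies itself as `Python-urllib/x.y`, which some sites may refuse to serve. The sketch below shows one way to attach a browser-like User-Agent to a request and confirm it was set, without sending anything over the network. The helper name `build_request` and the constant `BROWSER_UA` are hypothetical, and the User-Agent value is only an example; copy the real string from your own browser.

```python
from urllib import request

# Assumed example value; replace it with the string your browser sends.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object carrying a browser-like User-Agent header."""
    return request.Request(url=url, headers={"User-Agent": user_agent})


req = build_request("https://movie.douban.com/top250")
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_header("User-agent"))
```

Nothing is fetched until the request is passed to `request.urlopen`.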
2. Then wrap it in the request headers:

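python
from urllib import request

def DouBanSpide(i):
    # Page i starts at offset i*9; the parser reads 9 entries per page.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    # Send a browser-like User-Agent instead of urllib's default.
    user_agent = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    # Decode the response body and hand it to the parser.
    Douban_data_wash(html.read().decode())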
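The function above turns a page index into a `start` offset by string concatenation. A minimal sketch of the same idea with `urllib.parse.urlencode` follows; `page_url`, `BASE_URL`, and the `page_size` parameter are hypothetical names, and the default of 9 simply mirrors the `i*9` step used in this post.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_url(page_index, page_size=9):
    """Build the list URL for a zero-based page index.

    page_size=9 mirrors the i*9 step in the post; it is exposed as a
    parameter here only for illustration.
    """
    return BASE_URL + "?" + urlencode({"start": page_index * page_size})


for i in range(3):
    print(page_url(i))
```

Keeping the offset arithmetic in one small function makes it easy to change the step later without touching the request code.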

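3. Slice the Douban page data with split to cut out the fields you want: the rank, the movie title, the rating, the number of ratings, and the recommendation quote.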

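python
# The separator strings passed to split() were HTML tags; they are blank in the source text.
rank = text1.split('')[i+1].split("")[0]                      # rank
title = text1.split('')[i].split('>')[-1].strip()             # movie title
rate = text1.split('v:average\">')[i+1].split('')[0]          # rating
number = text1.split('star')[i+1].split('')[1].split('')[0]   # number of ratings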
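To make the split technique concrete, here is a small self-contained sketch that runs on a hand-written HTML fragment. The fragment, its tag names and attributes, and the helper `between` are all assumptions made for illustration; the markup of the real page may differ, so check the page source before reusing the separators.

```python
# Hypothetical fragment shaped like one list entry; the real markup may differ.
SAMPLE = (
    '<em class="">1</em>'
    '<span class="title">Example Movie</span>'
    '<span class="rating_num" property="v:average">9.7</span>'
    '<div class="star"><span>12345人评价</span></div>'
)


def between(text, start, end, index=1):
    """Return the text after the index-th `start` marker, up to the next `end`."""
    return text.split(start)[index].split(end)[0]


rank = between(SAMPLE, '<em class="">', '</em>')
title = between(SAMPLE, '<span class="title">', '</span>')
rate = between(SAMPLE, 'v:average">', '</span>')
number = between(SAMPLE, '<span>', '</span>')
print(rank, title, rate, number)
```

Each call splits once on an opening marker, picks the piece after it, and then cuts that piece at the closing marker. This is the same pattern as the chained `split(...)[n].split(...)[0]` expressions in step 3.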

4. The complete code:

python
import random
import time
from urllib import request


def Douban_data_wash(text1):
    # Keep only the part of the page after the list heading.
    # The markup around the heading text, and the tag separators in the
    # split('') calls below, are blank in the source text.
    text1 = text1.split('豆瓣电影 Top 250')[1]
    for i in range(0, 9):
        rank = text1.split('')[i+1].split("")[0]
        title = text1.split('')[i].split('>')[-1].strip()
        rate = text1.split('v:average\">')[i+1].split('')[0]
        number = text1.split('star')[i+1].split('')[1].split('')[0]
        try:
            # As written, this always reads the first 'inq' block on the page.
            quote = text1.split('inq')[1].split('>')[1].split('<')[0]
        except IndexError:
            print(rank + "该处评价为空")  # "no quote for this entry"
            quote = " "
        # Append one line per movie to the output file.
        with open("豆瓣数据", "a", encoding="utf-8") as file:
            file.write(
                "排名:" + rank + ",豆瓣评分" + rate + ",评价人数:" + number
                + "。推荐理由:" + quote + "\n")
        print("排名{},《{}》,豆瓣评分{},{}。推荐理由:{}".format(
            rank, title, rate, number, quote))


def DouBanSpide(i):
    # Page i starts at offset i*9; the parser reads 9 entries per page.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    user_agent = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    Douban_data_wash(html.read().decode())


if __name__ == '__main__':
    # Empty the output file before the run.
    with open("豆瓣数据", "w", encoding="utf-8") as file:
        file.write("")
    for i in range(0, 10):
        DouBanSpide(i)
        # Pause for a random 2 to 10 seconds between requests.
        time.sleep(random.randint(2, 10))
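Some entries may have no recommendation quote, so indexing quotes by position across the whole page can drift out of step with the ranks. One possible way around this, sketched below, is to cut the page into per-movie chunks first and then read each field inside its own chunk. The markup in `SAMPLE_PAGE` and the function `parse_items` are hypothetical and only stand in for the real page.

```python
# Hypothetical markup: two entries, the second one without a quote.
SAMPLE_PAGE = (
    '<div class="item"><em class="">1</em>'
    '<span class="inq">A classic.</span></div>'
    '<div class="item"><em class="">2</em></div>'
)


def parse_items(page):
    """Split the page into per-movie chunks, then read fields from each chunk."""
    records = []
    # Element 0 is whatever precedes the first item, so skip it.
    for chunk in page.split('<div class="item">')[1:]:
        rank = chunk.split('<em class="">')[1].split('</em>')[0]
        if 'class="inq">' in chunk:
            quote = chunk.split('class="inq">')[1].split('<')[0]
        else:
            quote = ""  # this entry has no quote
        records.append((rank, quote))
    return records


print(parse_items(SAMPLE_PAGE))
```

Because every lookup happens inside a single chunk, a missing quote affects only that entry and leaves the others aligned.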

5. The final result:
[Image 2: Douban crawler]
