Spider_maoyantop100

A first foray into web scraping: crawl Maoyan's Top 100 movies (download each movie poster to a local directory, and save the film title, stars, and release date to a local file in JSON format).

Scraping process

Generate the page URLs in a loop (to simulate clicking through the pagination) -- set the request headers (Maoyan is fairly easy to scrape and has essentially no anti-scraping measures) -- send the request with requests -- check the response status code (return None on error) -- write regular expressions to extract the fields from the page -- save the poster images as byte streams -- convert the data and write it to disk.
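
The pagination is driven entirely by an offset query parameter, ten films per page, so ten requests cover the whole list. A quick sketch of the URLs the loop in main() below walks through:

# offsets 0, 10, ..., 90 correspond to the ten board pages of ten films each
for i in range(0, 10):
    print('http://maoyan.com/board/4?offset=' + str(i * 10))
# http://maoyan.com/board/4?offset=0
# http://maoyan.com/board/4?offset=10
# ...
# http://maoyan.com/board/4?offset=90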

import json
import os
import re

import requests


def get_page(url):
    """Request one board page and return its HTML, or None on failure."""
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
    }
    # headers must be passed as a keyword argument; passed positionally it would
    # be treated as the params argument and the header would never be sent
    # response = requests.get(url, headers)
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        return response.content.decode('utf-8')
    return None


def pares_page(html):
    """Extract rank, title, poster URL, stars and release date from one page."""
    # rank, e.g. <i class="board-index board-index-1">1</i>
    pattern = re.compile('<i class="board-index.*?">(.*?)</i>')
    item_rank = re.findall(pattern, html)
    # film title, taken from the <a> inside <p class="name">
    pattern = re.compile('movieId.*?>.*?<p class="name"><a.*?>(.*?)</a>', re.S)
    item_film = re.findall(pattern, html)
    # poster URL, lazy-loaded via the data-src attribute
    pattern = re.compile('<img data-src="(.*?)"', re.S)
    item_image = re.findall(pattern, html)
    # stars
    pattern = re.compile('<p class="star">(.*?)</p>', re.S)
    item_star = re.findall(pattern, html)
    # release date
    pattern = re.compile('<p class="releasetime">(.*?)</p>', re.S)
    item_time = re.findall(pattern, html)

    # zip the parallel lists into one dict per film
    items = []
    for i in range(len(item_film)):
        film = {}
        film['rank'] = item_rank[i]
        film['film'] = item_film[i]
        film['image'] = item_image[i]
        film['star'] = item_star[i].strip()
        film['time'] = item_time[i]
        items.append(film)
    return items


def write_image(item):
    """
    Download one poster image into the ./upload directory; the part after "@"
    is Maoyan's thumbnail-size suffix, so it is stripped when building the filename.
    """
    os.makedirs("./upload", exist_ok=True)
    url_parts = item.split("@")
    url_result = url_parts[0]
    filename = "./upload/%s" % url_result.split("/")[-1]
    r = requests.get(item)
    with open(filename, "wb") as f:
        f.write(r.content)


def write_json(items):
    """Append the ten records of the current page to the local JSON file."""
    movie_json = json.dumps(items, ensure_ascii=False, check_circular=True)
    filename = './猫眼top100'
    # open with "a" so each page's records are appended instead of overwriting the file
    with open(filename, "a", encoding='utf-8') as f:
        f.write(movie_json)


def main():
    for i in range(0, 10):
        page = str(i * 10)
        url = 'http://maoyan.com/board/4?offset=' + page
        print(url)
        html = get_page(url)
        items = pares_page(html)
        # loop over the extracted records and download every poster on this page
        for item in items:
            write_image(item['image'])
        write_json(items)


if __name__ == '__main__':
    main()
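
For reference, each page appends one JSON array of ten records shaped roughly like this (the values are illustrative, not real scraped output):

[{"rank": "1", "film": "霸王别姬", "image": "https://p1.meituan.net/movie/....jpg@160w_220h_1e_1c", "star": "主演:张国荣,张丰毅,巩俐", "time": "上映时间:1993-01-01"}, ...]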

Summary
Posters are saved ten at a time, and the data is likewise appended ten records at a time (note that the file is opened with "a" rather than "w", otherwise later writes would overwrite the earlier ones). I was not very clear about what the page extraction actually returns, and nesting and looping over lists and dictionaries is still unfamiliar to me, so getting the JSON output complete and well-formed took quite a while. The overall approach is still a bit muddled; hopefully the next attempt will go better!
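
Because ten records are appended at a time, the output file ends up holding ten back-to-back JSON arrays rather than one valid JSON document. A minimal alternative sketch (write_all_json is a hypothetical helper, not part of the script above): collect the records from all pages into a single list in main() and dump them once.

import json

def write_all_json(all_items, filename='./猫眼top100.json'):
    # hypothetical helper: dump all 100 records in one go as a single valid JSON array
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(all_items, f, ensure_ascii=False, indent=2)
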
GitHub repository
