A first venture into web scraping: crawl the Maoyan Movies Top 100 board, download each movie's poster to a local directory, and save the title, cast, and release date to a local file in JSON format.
The scraping process
Generate the page URLs dynamically in a loop (simulating the pagination) -- set the request headers (Maoyan is easy to scrape and has essentially no anti-scraping measures) -- send the request with requests -- check the response status (on any failure, just return None) -- write regexes to extract the fields from the page -- save the poster images as byte streams -- convert the records to JSON and save them to disk.
import json
import os
import re

import requests


def get_page(url):
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
    }
    # headers must be passed by keyword; passed positionally
    # (requests.get(url, headers)) it would land in the params slot
    # and be sent as a query string instead of as headers.
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        return response.content.decode('utf-8')
    return None

def parse_page(html):
    """Extract rank, title, poster URL, cast, and release date with regexes.

    The patterns below are written against the markup of the
    maoyan.com/board/4 page (one <dd> block per movie) and may need
    adjusting if the page structure changes.
    """
    # Rank: the number inside <i class="board-index board-index-N">
    pattern = re.compile('board-index.*?>(.*?)</i>', re.S)
    item_rank = re.findall(pattern, html)
    # Movie title: the link text inside <p class="name">
    pattern = re.compile('<p class="name"><a.*?>(.*?)</a>', re.S)
    item_film = re.findall(pattern, html)
    # Poster URL: the data-src attribute of the lazy-loaded board image
    pattern = re.compile('<img data-src="(.*?)"', re.S)
    item_image = re.findall(pattern, html)
    # Cast: the text of <p class="star">
    pattern = re.compile('<p class="star">(.*?)</p>', re.S)
    item_star = re.findall(pattern, html)
    # Release date: the text of <p class="releasetime">
    pattern = re.compile('<p class="releasetime">(.*?)</p>', re.S)
    item_time = re.findall(pattern, html)
    items = []
    for i in range(len(item_film)):
        film = {}
        film['rank'] = item_rank[i]
        film['film'] = item_film[i]
        film['image'] = item_image[i]
        film['star'] = item_star[i].strip()
        film['time'] = item_time[i]
        items.append(film)
    return items

def write_image(url):
    """
    Download one poster image to the ./upload directory.
    Maoyan appends a resize directive to poster URLs after an "@"
    (e.g. ....jpg@160w_220h_1e_1c); splitting it off requests the
    full-size image.
    """
    url = url.split("@")[0]
    os.makedirs("./upload", exist_ok=True)  # make sure the target directory exists
    filename = "./upload/%s" % url.split("/")[-1]
    r = requests.get(url)
    with open(filename, "wb") as f:
        f.write(r.content)

def write_json(items):
    # ensure_ascii=False keeps the Chinese text readable in the file;
    # check_circular=True in the original call is simply json's default.
    movie_json = json.dumps(items, ensure_ascii=False)
    filename = './猫眼top100'
    # Open in append mode ("a"): each call adds one page's ten records.
    # "w" would make every batch overwrite the previous one.
    with open(filename, "a", encoding='utf-8') as f:
        f.write(movie_json)

def main():
    for i in range(0, 10):
        # The board is paginated with an offset parameter: 0, 10, ..., 90
        page = str(i * 10)
        url = 'http://maoyan.com/board/4?offset=' + page
        print(url)
        html = get_page(url)
        items = parse_page(html)
        # Loop over this page's results: download the ten posters,
        # then append the ten records to the JSON file
        for item in items:
            write_image(item['image'])
        write_json(items)


if __name__ == '__main__':
    main()
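For reference, each element of the list parse_page returns is a flat dict. A record looks roughly like the following (the values are illustrative, not copied from the live page; note that the 主演：/上映时间： prefixes survive because the regexes capture the whole paragraph text):

{
    'rank': '1',
    'film': '霸王别姬',
    'image': 'http://p1.meituan.net/movie/xxx.jpg@160w_220h_1e_1c',
    'star': '主演：张国荣,张丰毅,巩俐',
    'time': '上映时间：1993-07-26',
}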
Summary:
The posters are saved ten at a time, and the records are likewise appended ten at a time (note that the file is opened in "a" mode rather than "w"; otherwise each batch saved later would overwrite the one saved before it). I was not very clear about what the scraped page actually contains, and nesting lists and dicts inside loops is still unfamiliar territory, so getting the JSON output complete and well-formed took quite a while. My overall approach is still fuzzy; hopefully the next attempt goes better!
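One caveat worth adding: appending one JSON array per page does preserve all the data, but the resulting file holds ten arrays back to back, which is not a single valid JSON document. A minimal alternative sketch (the crawl_all/all_items names and the .json filename are mine, not part of the script above): collect the records from all pages first, then serialize once; at that point "w" mode is safe because the file is written exactly once.

import json

def crawl_all():
    all_items = []                        # accumulate all 100 records first
    for i in range(0, 10):
        url = 'http://maoyan.com/board/4?offset=' + str(i * 10)
        html = get_page(url)              # reuse the functions defined above
        all_items.extend(parse_page(html))
    # A single dump yields one valid JSON array instead of ten concatenated ones
    with open('./猫眼top100.json', 'w', encoding='utf-8') as f:
        json.dump(all_items, f, ensure_ascii=False, indent=2)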
GitHub address