Beginner Python Web Scraping: Maoyan Movies

**Scraping the Maoyan movie rankings step by step**

1. Fetch a single page: the simple approach

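import requests
import re

# Send a browser-like User-Agent so the request looks like a normal visit.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
r = requests.get('https://maoyan.com/board/4', headers=headers)
# Groups: rank, poster URL, title, cast, release time, score integer part, score fraction part.
items = re.findall(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
    r'.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    r.text, re.S)
print(items)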

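First fetch the page with the requests library, then use a regular expression to extract the relevant information from it, and finally print the result:
[Image 1 from the original post]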
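To see what `re.findall` returns without hitting the network, here is a minimal sketch that runs the same pattern against a hand-written HTML fragment. The fragment, the `example.com` URL, and the movie and actor names are all made up for illustration; they only imitate the structure of one `<dd>` entry on the board page, and the real markup may differ.

```python
import re

# Hypothetical fragment imitating one <dd> entry of the board page (not real site data).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="Example Movie" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
      主演:Actor A,Actor B
  </p>
  <p class="releasetime">上映时间:1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# re.S lets "." match newlines, so one pattern can span the whole multi-line entry.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>'
    r'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# findall returns a list with one 7-tuple per matched <dd> block.
items = PATTERN.findall(SAMPLE_HTML)
for item in items:
    print(item)
```

Each tuple holds the seven capture groups in order. The cast field still carries the surrounding whitespace and its label, which is why later steps clean it up.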

2. Tidy up the code so it is cleaner, more consistent, and ready for paginated scraping

import requests
import re

def get_page(url):
    """Fetch a page and return its HTML text."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text

def parse_page(html):
    """Return a list with one tuple of captured fields per film."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    return items

def main():
    url = 'https://maoyan.com/board/4'
    html = get_page(url)
    results = parse_page(html)
    print(results)

main()

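The code above wraps the logic in custom functions to give it structure. The behavior is essentially unchanged, and the output is still a list. Next, let's tidy up the output: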

import requests
import re

def get_page(url):
    """Fetch a page and return its HTML text."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text

def parse_page(html):
    """Return a list with one tuple of captured fields per film."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    return items

def main():
    url = 'https://maoyan.com/board/4'
    html = get_page(url)
    for item in parse_page(html):
        # [3:] drops the 3-character cast label, [5:] the 5-character release-time label;
        # the score is the integer part joined with the fraction part.
        print(item[0], item[1], item[2], item[3].strip()[3:], item[4].strip()[5:], item[5] + item[6])

main()

[Image 2 from the original post]
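The slices `[3:]` and `[5:]` in the code above work because the cast label `主演:` is 3 characters long and the release-time label `上映时间:` is 5. A slightly more defensive variant, sketched below, removes the label only when it is actually present. The helper name `strip_prefix` and the sample strings are assumptions made for this illustration and are not part of the original code.

```python
def strip_prefix(text, prefix):
    """Trim whitespace, then drop `prefix` only if the text starts with it."""
    text = text.strip()
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# Hypothetical raw values shaped like the regex captures (not real site data).
raw_cast = '\n      主演:Actor A,Actor B\n  '
raw_release = '上映时间:1993-01-01'

cast = strip_prefix(raw_cast, '主演:')
release = strip_prefix(raw_release, '上映时间:')
print(cast)
print(release)
```

With fixed-width slicing, a field that lacks its label would silently lose its first characters. Checking the prefix first avoids that at the cost of a few extra lines.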
3. Final cleanup, with paginated scraping added to fetch the top 100. The complete code:

import re
import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Return the page HTML, or None if the request fails or is not a 200."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per film found in the page HTML."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # Fetch failed; skip this page rather than passing None to the regex.
        return
    for item in parse_one_page(html):
        print(item)


if __name__ == '__main__':
    # 10 pages of 10 films each: offsets 0, 10, ..., 90.
    for i in range(10):
        main(offset=i * 10)

[Image 3 from the original post]
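The pagination in the final code comes down to the `offset` query parameter: each page lists 10 films, so stepping the offset by 10 across 10 pages covers the top 100. A small sketch of the URLs that loop requests is below. The helper `page_urls` and its parameters are hypothetical names introduced here; nothing is fetched.

```python
BASE_URL = 'https://maoyan.com/board/4'

def page_urls(pages=10, page_size=10):
    """Build the list of board URLs, one per page, without requesting them."""
    return [f'{BASE_URL}?offset={i * page_size}' for i in range(pages)]

urls = page_urls()
print(urls[0])    # first page
print(urls[-1])   # last page
print(len(urls))  # number of pages
```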
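Summary: work through it one step at a time. Only by first understanding the principle behind each step and how to use the relevant pieces do you actually learn something. The final code uses the requests library to fetch pages and the re library for regular-expression matching. On the output side, a generator handles the iteration, so each result comes out as a dictionary.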
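To make the generator point concrete, here is a minimal sketch of the same idea in isolation: a function that takes tuples shaped like the regex output and yields one dictionary at a time. The name `to_records` and the sample tuple are assumptions for illustration only.

```python
def to_records(items):
    """Lazily convert 7-field tuples into dictionaries, one per film."""
    for index, image, title, actor, time, integer, fraction in items:
        yield {
            'index': index,
            'image': image,
            'title': title,
            'actor': actor.strip()[3:],   # drop the 3-character cast label
            'time': time.strip()[5:],     # drop the 5-character release-time label
            'score': integer + fraction,  # e.g. '9.' + '5'
        }

# Hypothetical tuple in the same shape the regex produces (not real site data).
sample = [('1', 'https://example.com/poster.jpg', 'Example Movie',
           '主演:Actor A', '上映时间:1993-01-01', '9.', '5')]

records = to_records(sample)   # nothing is computed yet
first = next(records)          # the first dict is built only now
print(first)
```

Because the function uses `yield`, calling it returns a generator object right away and each dictionary is built only when the caller asks for the next item, which is what the `for item in parse_one_page(html)` loop does.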
