Crawler demo: scraping the movies now showing on Douban

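This is my first small demo while learning to write crawlers in Python. These are my notes, kept for later review.
A Python crawler breaks down into two main parts: 1. requesting the content of the target page; 2. organizing the content that comes back.
The notes for each step come first, followed by the complete source code.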

I. Import the required libraries

import requests
from lxml import etree

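II. Request the content of the target page

1. Set the request headers

Some sites have anti-crawler mechanisms. Sending browser-like request headers can get past some of these checks so that the data is returned successfully.
In general, set User-Agent and Referer. If those two are not enough to get past the site's anti-crawler checks, open Chrome's developer tools, look at the headers the browser sends, and set more of them.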

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
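
When the two headers above are not enough, the extra headers read from the browser can be merged on top of them. The following is a minimal sketch of that idea; `BASE_HEADERS` and `build_headers` are hypothetical names introduced for illustration, and the `Accept` and `Accept-Language` values are assumed examples, not headers that Douban is known to require.

```python
# Baseline headers, as used in these notes.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/",
}


def build_headers(extra=None):
    """Return a new dict: the baseline headers plus any extras (extras win on conflict)."""
    merged = dict(BASE_HEADERS)  # copy, so the baseline is never mutated
    if extra:
        merged.update(extra)
    return merged


# Illustrative extras; the values are assumptions for this example only.
headers = build_headers({
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.9",
})
```

The resulting `headers` dict can be passed to `requests.get` exactly like the hand-written one.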

2. Request the page content

douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text
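
The request above assumes the server always answers with the expected page. A slightly more defensive variant might add a timeout and fail loudly on an HTTP error status, so that a block page is not parsed as if it were the movie list. This is only a sketch: `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the optional `session` parameter is an assumption added so that the function can be tried without network access.

```python
def fetch_html(url, headers, timeout=10, session=None):
    """Fetch `url` and return the decoded body; raise if the request fails."""
    if session is None:
        # Imported lazily so the helper can also be exercised with a stub session.
        import requests
        session = requests.Session()
    response = session.get(url, headers=headers, timeout=timeout)
    # With requests, this raises requests.HTTPError on a 4xx/5xx status
    # instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

With the variables defined above, it would be called as `douban_text = fetch_html(douban_url, headers)`.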

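III. Organize the requested content

This step mainly uses lxml together with XPath syntax: each movie's information is collected into a dictionary, and all the movies are finally stored in a list.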

html_element = etree.HTML(douban_text)
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)

for movie in movies:
    print(movie)
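
Each `xpath(...)[0]` in the loop raises an `IndexError` if an attribute is missing from an `<li>`, for example if the page markup changes. One possible way to make the extraction more tolerant is sketched below. The names `first_or_default`, `FIELDS`, and `parse_movie` are hypothetical, and the empty-string default is an assumption; the helper only relies on the node having an `xpath` method, which lxml elements do.

```python
def first_or_default(node, path, default=""):
    """Return the first XPath match on `node`, or `default` when nothing matches."""
    matches = node.xpath(path)
    return matches[0] if matches else default


# The data-* attributes read from each <li> in the loop above.
FIELDS = ("title", "score", "star", "duration", "region", "director", "actors")


def parse_movie(li):
    """Build one movie dict from an <li> element, tolerating missing attributes."""
    movie = {name: first_or_default(li, f"./@data-{name}") for name in FIELDS}
    movie["post"] = first_or_default(li, ".//img/@src")  # poster image URL
    return movie
```

With the `lis` list obtained above, the loop could then be reduced to `movies = [parse_movie(li) for li in lis]`.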

IV. Complete code

# Import the required libraries
import requests
from lxml import etree

# 1. Request the content of the target page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text

# 2. Process the scraped data
html_element = etree.HTML(douban_text)
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)

for movie in movies:
    print(movie)
