[Python] Data extraction with XPath and the lxml module (a crawler for the Douban movie chart)

xpath

XPath: a query language for selecting nodes, and thereby extracting data, from HTML (and XML) documents

XPath syntax

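1. Selecting nodes (tags): /html/head/meta selects every meta tag under head under html.

2. //: selects matching nodes anywhere in the document, regardless of depth. //li selects every li tag on the current page; //html/head/link selects every link tag under head.

3. Uses of the @ symbol:

1) Filtering by attribute: //div[@class='feed']/ul/li selects the li tags under the ul under the div whose class is 'feed'.

2) Reading an attribute value: a/@href selects the href value of a.

4. Getting text:

/a/text(): gets the text directly under a.

/a//text(): gets all the text under a, including the text of its descendant tags.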
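To make the rules above concrete, here is a minimal sketch that runs each kind of expression against a small HTML string. The sample markup and the variable names are made up for this illustration; only etree.HTML and xpath come from lxml.

```python
from lxml import etree

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <meta name="author" content="demo">
    <meta name="description" content="sample page">
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div class="feed">
      <ul>
        <li><a href="/a">First <b>bold</b> link</a></li>
        <li><a href="/b">Second link</a></li>
      </ul>
    </div>
    <div class="other"><ul><li>ignored</li></ul></div>
  </body>
</html>
"""

doc = etree.HTML(SAMPLE_HTML)

# 1. Absolute path: every meta tag under /html/head.
metas = doc.xpath("/html/head/meta")

# 2. "//" matches at any depth: every li on the page.
all_li = doc.xpath("//li")

# 3. "@" inside brackets filters by attribute value ...
feed_li = doc.xpath("//div[@class='feed']/ul/li")
# ... and "@" as the last step returns the attribute values themselves.
hrefs = doc.xpath("//div[@class='feed']//a/@href")

# 4. text() returns only the text directly inside the tag,
#    while //text() also descends into child tags such as <b>.
direct_text = doc.xpath("//a[@href='/a']/text()")
all_text = doc.xpath("//a[@href='/a']//text()")

print(len(metas), len(all_li), len(feed_li))
print(hrefs)
print(direct_text)
print(all_text)
```

Note that every call returns a list, even when only one node matches.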

lxml

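Installation: pip install lxml

Usage:

from lxml import etree

# Parse an HTML string into an element tree.
element = etree.HTML("HTML string")

# Evaluate an XPath expression against the tree; the result is a list of matches.
element.xpath("XPath expression")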
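As a small sketch of what these two calls give back, the snippet below parses a deliberately incomplete fragment and inspects the results. The fragment and variable names are assumptions made for the example.

```python
from lxml import etree

# Hypothetical fragment; the closing </div> is missing on purpose.
html_str = "<div><p class='intro'>Hello</p><p>World</p>"

# etree.HTML is lenient: it repairs the markup and wraps it in html/body.
element = etree.HTML(html_str)

# xpath() returns a list of Element objects for node expressions.
paragraphs = element.xpath("//p")
texts = [p.text for p in paragraphs]      # text directly inside each <p>
intro_class = paragraphs[0].get("class")  # read an attribute from an Element

# No match yields an empty list rather than an error or None.
missing = element.xpath("//span")

# Serialize the repaired tree back to a string to see what lxml built.
repaired = etree.tostring(element, encoding="unicode")

print(texts, intro_class, missing)
print(repaired)
```

Because an unmatched expression yields an empty list, indexing the result with [0] raises IndexError when nothing matches, which is worth remembering when scraping real pages.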

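from lxml import etree
import requests
import json

url = "https://movie.douban.com/chart"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}

# Fetch the page; the timeout keeps the script from hanging on a stalled connection.
response = requests.get(url, headers=headers, timeout=10)
html_str = response.content.decode()  # decode() defaults to UTF-8
# print(html_str)

html = etree.HTML(html_str)

# One table per movie inside the div with class 'indent'.
ret1 = html.xpath("//div[@class='indent']//table")

# Open the output file once and append one JSON record per movie.
with open("豆瓣电影榜.txt", "a", encoding="utf-8") as f:
    for table in ret1:
        item = {}
        # Paths starting with ".//" are evaluated relative to the current table.
        # xpath() returns a list; [0] takes the first match and raises
        # IndexError if the expected node is missing.
        item["title"] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/", "").strip()
        item["href"] = table.xpath(".//div[@class='pl2']/a/@href")[0]
        item["img"] = table.xpath(".//a[@class='nbg']/img/@src")[0]
        item["main actors"] = table.xpath(".//div[@class='pl2']/p[@class='pl']/text()")[0]
        item["rating_nums"] = table.xpath(".//div[@class='pl2']//span[@class='rating_nums']/text()")[0]
        item["people_nums"] = table.xpath(".//div[@class='pl2']//span[@class='pl']/text()")[0]
        f.write(json.dumps(item, ensure_ascii=False, indent=2))
        f.write("\n")  # keep consecutive records from running together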
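The crawler indexes every xpath() result with [0], so a single missing field stops the whole run. One possible way to soften that, sketched here without any network access, is a small helper that returns a default when nothing matches. The helper name first, the SAMPLE markup, and the example.com URLs are all hypothetical; the markup only imitates the structure the crawler expects and is not copied from the real page.

```python
import json
from lxml import etree

# Hypothetical markup imitating the layout the crawler's XPath expressions assume.
SAMPLE = """
<div class="indent">
  <table><tr><td>
    <a class="nbg" href="https://example.com/m/1"><img src="https://example.com/1.jpg"></a>
    <div class="pl2">
      <a href="https://example.com/m/1">Movie One / <span>Alias</span></a>
      <p class="pl">Actor A / Actor B</p>
      <span class="rating_nums">8.5</span>
      <span class="pl">(1000 ratings)</span>
    </div>
  </td></tr></table>
  <table><tr><td>
    <div class="pl2">
      <a href="https://example.com/m/2">Movie Two</a>
    </div>
  </td></tr></table>
</div>
"""


def first(node, path, default=None):
    """Return the first XPath match with whitespace stripped, or default if none."""
    results = node.xpath(path)
    return results[0].strip() if results else default


html = etree.HTML(SAMPLE)
items = []
for table in html.xpath("//div[@class='indent']//table"):
    title = first(table, ".//div[@class='pl2']/a/text()", default="")
    items.append({
        "title": title.replace("/", "").strip(),
        "href": first(table, ".//div[@class='pl2']/a/@href"),
        "img": first(table, ".//a[@class='nbg']/img/@src"),
        "rating_nums": first(table, ".//div[@class='pl2']//span[@class='rating_nums']/text()"),
    })

# One compact JSON object per line is easy to read back record by record.
lines = [json.dumps(item, ensure_ascii=False) for item in items]
print("\n".join(lines))
```

With this approach the second movie, which lacks a cover image and a rating, is still recorded, with None in the missing fields.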
