XPath: a query language for selecting nodes in XML and HTML documents, used here to extract data from HTML
XPath syntax
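1. Selecting nodes (tags): /html/head/meta selects all meta tags under head under html
2. //: starts the selection from any node. //li selects all li tags on the current page; //html/head/link selects all link tags under head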
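3. Uses of the @ symbol:
1) Selecting a specific element: //div[@class='feed']/ul/li selects the li tags under the ul under the div whose class attribute is 'feed'.
2) a/@href: selects the value of the href attribute of a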
4. Getting text:
/a/text(): gets the text directly under a
/a//text(): gets all the text under a, including text inside nested tags
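The four rules above can be tried on a small document. The following is a minimal sketch using lxml; the HTML snippet and the variable names are made up for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical HTML snippet, invented for this illustration.
html_str = """
<html>
  <head>
    <meta name="description" content="demo">
    <link rel="stylesheet" href="a.css">
  </head>
  <body>
    <div class="feed">
      <ul>
        <li><a href="/one">First <b>item</b></a></li>
        <li><a href="/two">Second</a></li>
      </ul>
    </div>
    <div class="other"><ul><li>ignored</li></ul></div>
  </body>
</html>
"""

root = etree.HTML(html_str)

# 1. Absolute path from the root.
metas = root.xpath("/html/head/meta")

# 2. // matches at any depth: every li on the page.
all_li = root.xpath("//li")

# 3. @ inside [] filters by attribute; @ at the end returns attribute values.
feed_li = root.xpath("//div[@class='feed']/ul/li")
hrefs = root.xpath("//div[@class='feed']//a/@href")   # ['/one', '/two']

# 4. text() returns direct text only; //text() also descends into nested tags.
direct_text = root.xpath("//a[@href='/one']/text()")  # ['First ']
all_text = root.xpath("//a[@href='/one']//text()")    # ['First ', 'item']

print(len(metas), len(all_li), len(feed_li), hrefs, direct_text, all_text)
```

Note that the nested b tag is what makes text() and //text() differ for the first link.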
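Installation: pip install lxml
Usage:
from lxml import etree
element = etree.HTML("html string")  # parse an HTML string into an element tree
element.xpath("//li")  # pass any XPath expression; the result is a list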
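As a rough illustration of what these calls return, the sketch below parses a tiny made-up fragment (the markup and variable names are assumptions for the example). It shows that xpath() returns a list in every case, and that an element taken from that list can be queried again with a relative expression.

```python
from lxml import etree

# Hypothetical markup, invented for this illustration.
element = etree.HTML("<div><p class='x'>hello</p><p>world</p></div>")

paragraphs = element.xpath("//p")      # list of element objects
texts = element.xpath("//p/text()")    # list of strings
missing = element.xpath("//span")      # no match gives an empty list

# An element returned by xpath() can itself be queried; a leading "."
# makes the expression relative to that element.
first_class = paragraphs[0].xpath("./@class")

print(len(paragraphs), texts, missing, first_class)
```

Because a query with no match yields an empty list rather than an error, indexing the result with [0] is only safe when a match is guaranteed.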
# Example: scrape the Douban movie chart and append each entry to a text file as JSON.
from lxml import etree
import requests
import json

url = "https://movie.douban.com/chart"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}
response = requests.get(url, headers=headers)
html_str = response.content.decode()  # decode the response bytes (UTF-8 by default)
# print(html_str)
html = etree.HTML(html_str)
# Each movie on the chart sits in its own table inside the div with class 'indent'.
ret1 = html.xpath("//div[@class='indent']//table")
for table in ret1:
    item = {}
    # A leading "." keeps each query relative to the current table.
    # Indexing with [0] raises IndexError if a field is missing from the page.
    item["title"] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/", "").strip()
    item["href"] = table.xpath(".//div[@class='pl2']/a/@href")[0]
    item["img"] = table.xpath(".//a[@class='nbg']/img/@src")[0]
    item["main actors"] = table.xpath(".//div[@class='pl2']/p[@class='pl']/text()")[0]
    item["rating_nums"] = table.xpath(".//div[@class='pl2']//span[@class='rating_nums']/text()")[0]
    item["people_nums"] = table.xpath(".//div[@class='pl2']//span[@class='pl']/text()")[0]
    # Append this record to the output file, one JSON object after another.
    with open("豆瓣电影榜.txt", "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False, indent=2) + "\n")
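The script indexes every XPath result with [0], which fails with IndexError when a movie lacks a field such as a rating. One possible way to guard against that is a small helper like the one sketched below. The name first_or_none and the sample row are hypothetical, introduced only for this illustration; they are not part of lxml or of the script above.

```python
from lxml import etree


def first_or_none(node, expr):
    """Return the first XPath result (stripped if it is a string), or None."""
    results = node.xpath(expr)
    if not results:
        return None
    first = results[0]
    return first.strip() if isinstance(first, str) else first


# Hypothetical table row, invented for this illustration.
row = etree.HTML(
    "<table><tr><td><div class='pl2'>"
    "<a href='/m/1'> Title / Alt </a>"
    "</div></td></tr></table>"
)

title = first_or_none(row, ".//div[@class='pl2']/a/text()")
rating = first_or_none(row, ".//span[@class='rating_nums']/text()")

print(title, rating)
```

With a helper of this kind, a missing field ends up as None (written as null by json.dumps) instead of stopping the whole run.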