Several Other Good Ways to Parse Web Pages in Python

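Besides the Requests + BeautifulSoup combination and the general-purpose BeautifulSoup.find() method for matching HTML elements, there are several other common and effective ways to parse and match web pages.
These include lxml, pyquery, and the CSS selectors built into BeautifulSoup or Scrapy. This article covers three of them in detail.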

Parsing with lxml
Parsing with PyQuery
The BeautifulSoup select() method

The lxml approach

from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests
from lxml import etree

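When parsing pages with lxml, the two most commonly used functions are etree.HTML() and etree.parse().
etree.HTML() parses a string, while etree.parse() parses a file.
Either way the result is an element tree, similar to what BeautifulSoup gives you; calling etree.tostring() on it serializes the tree back to HTML as bytes.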

user_agent= "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
headers={"User-Agent":user_agent}
url_name = "https://www.baidu.com"

html_r = requests.get(url_name,headers = headers)
html = etree.HTML(html_r.text)
result = etree.tostring(html,pretty_print = True)
print(result[0:100])  # first 100 bytes of the serialized HTML

b'\n\n    \n    \n

 
 To parse a string of HTML text, see the following example:
 from lxml import etree
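text = '''
<div>
    <ul>
         <li><a>first item</a></li>
         <li><a>second item</a></li>
         <li><a>third item</a></li>
         <li><a>fourth item</a></li>
         <li><a>fifth item</a></li>
     </ul>
 </div>
'''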
html = etree.HTML(text)
result = etree.tostring(html)
print(result)
 
 (The output is a bytes object holding the serialized tree. Because etree.HTML() repairs incomplete markup, the fragment comes back wrapped in html and body elements.)
 
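 To parse a file, pass an HTML parser explicitly, because etree.parse() uses the XML parser by default:
 etree.parse("test.html", etree.HTMLParser())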
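
The file-parsing path can be tried end to end with a throwaway file. The following is a minimal sketch, not the only way to do it: the sample markup, the file name sample.html, and the helper parse_html_file are all made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample page, written to a temporary file for the demonstration.
SAMPLE_HTML = (
    "<html><body><ul>"
    "<li>first item</li><li>second item</li>"
    "</ul></body></html>"
)


def parse_html_file(path):
    """Parse an HTML file with lxml's lenient HTML parser and return the root element."""
    parser = etree.HTMLParser()
    tree = etree.parse(path, parser)  # etree.parse() returns an ElementTree
    return tree.getroot()


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_HTML)
    root = parse_html_file(sample_path)
    # Collect the text of every <li> element in document order.
    items = [li.text for li in root.iter("li")]

print(root.tag)
print(items)
```

Unlike etree.HTML(), which returns the root element directly, etree.parse() returns an ElementTree, hence the getroot() call.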
 
 
 
 To extract elements with lxml, XPath is the usual tool,
 which makes XPath lxml's natural companion.
 from lxml import etree
from lxml import html
import requests

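text = '''
<html>
<body>
<div>
    <ul>
         <li><a>first item</a></li>
         <li><a>second item</a></li>
         <li><a>third item</a></li>
         <li><a>fourth item</a></li>
         <li><a>fifth item</a></li>
     </ul>
 </div>
</body>
</html>
'''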
html_text = etree.HTML(text)
li_list = html_text.xpath('/html/body/div/ul/li/a')
for item in li_list:
    print(item.text)
 
 first item
second item
third item
fourth item
fifth item
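
XPath can also return text and attribute values directly, without looping over elements. The sketch below assumes a small made-up fragment; the class names and href values are illustrative and not taken from any real page.

```python
from lxml import etree

# Hypothetical fragment; class names and links are invented for the example.
SAMPLE = '''
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0"><a href="link3.html">third item</a></li>
  </ul>
</div>
'''

root = etree.HTML(SAMPLE)

# text() selects the text nodes of the matched elements.
texts = root.xpath('//li/a/text()')
# @href selects the attribute value instead of the element.
hrefs = root.xpath('//li/a/@href')
# A predicate in square brackets filters elements by attribute value.
item0_texts = root.xpath('//li[@class="item-0"]/a/text()')

print(texts)
print(hrefs)
print(item0_texts)
```

A leading // searches the whole document, which is usually less brittle than spelling out an absolute path such as /html/body/div/ul/li/a.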
 
 Taking Douban as an example:
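 from lxml import html
import requests

# Some sites may reject requests that lack a browser-like User-Agent, so send one.
headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get("https://movie.douban.com/subject/4920389/", headers=headers)
tree = html.fromstring(page.text)
html_s = tree.xpath('//*[@id="info"]')
# Print the direct text of each child element of the #info block.
for item in html_s[0]:
    if item.text is not None:
        print(item.text)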
 
 类型:
动作
科幻
冒险
官方网站:
readyplayeronemovie.com
制片国家/地区:
语言:
上映日期:
2018-03-30(中国大陆)
2018-03-11(西南偏南电影节)
2018-03-29(美国)
片长:
140分钟
又名:
IMDb链接:
tt1677720
 
 In Scrapy, response.xpath() applies the same kind of XPath matching directly to the response object,
 as in the following example:
 '''
for info in response.xpath('//div[@class="item"]'):
      item = Movie250Item()
      item['rank'] = info.xpath('div[@class="pic"]/em/text()').extract()
      item['title'] = info.xpath('div[@class="pic"]/a/img/@alt').extract()
      item['link'] = info.xpath('div[@class="pic"]/a/@href').extract()
      item['star'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/em/text()').extract()
      item['rate'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()
      item['quote'] = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
      yield item
'''
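
The per-item pattern above (select each container, then run relative XPath expressions inside it) can be tried without a Scrapy project. The following sketch uses plain lxml instead of Scrapy's response object; the sample markup, the example.com links, and the helper extract_items are assumptions made for illustration.

```python
from lxml import html

# Hypothetical markup that mimics the structure assumed by the spider above.
SAMPLE = (
    '<html><body>'
    '<div class="item"><div class="pic"><em>1</em>'
    '<a href="https://example.com/m/1"><img alt="Movie A"></a></div></div>'
    '<div class="item"><div class="pic"><em>2</em>'
    '<a href="https://example.com/m/2"><img alt="Movie B"></a></div></div>'
    '</body></html>'
)


def extract_items(page_source):
    """Return one dict per div.item, using XPath relative to each container."""
    tree = html.fromstring(page_source)
    rows = []
    for info in tree.xpath('//div[@class="item"]'):
        rows.append({
            # No leading slash: these paths are evaluated relative to `info`.
            "rank": info.xpath('div[@class="pic"]/em/text()'),
            "title": info.xpath('div[@class="pic"]/a/img/@alt'),
            "link": info.xpath('div[@class="pic"]/a/@href'),
        })
    return rows


rows = extract_items(SAMPLE)
for row in rows:
    print(row)
```

As with Scrapy's extract(), each field here is a list of strings, so a field that matches nothing is an empty list rather than an error.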
 
 
 The PyQuery approach
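 PyQuery's greatest strength is that it brings jQuery-style manipulation of HTML nodes to Python.
 Parsing is simple: pass the PyQuery constructor an HTML string, a URL, or a filename= argument.
 PyQuery pairs naturally with CSS selectors, so once a page is parsed you can extract from it with CSS selectors directly.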
 from pyquery import PyQuery as pq
doc = pq("")
doc_baidu = pq("http://www.baidu.com")
doc_file = pq(filename = "test.html")
 
 Applied to the same Douban page, a CSS selector extracts the content matched with XPath above in much less code.
 from pyquery import PyQuery as pq
from lxml import etree

doc = pq("")
doc_douban = pq("https://movie.douban.com/subject/4920389/")

info = doc_douban("div #info")
print(info.text())
 
 导演: 史蒂文·斯皮尔伯格
编剧: 扎克·佩恩 / 恩斯特·克莱恩
主演: 泰伊·谢里丹 / 奥利维亚·库克 / 本·门德尔森 / 马克·里朗斯 / 丽娜·维特 / 森崎温 / 赵家正 / 西蒙·佩吉 / T·J·米勒 / 汉娜·乔恩-卡门 / 拉尔夫·尹爱森 / 苏珊·林奇 / 克莱尔·希金斯 / 劳伦斯·斯佩尔曼 / 佩蒂塔·维克斯 / 艾萨克·安德鲁斯
类型: 动作 / 科幻 / 冒险
官方网站: readyplayeronemovie.com
制片国家/地区: 美国
语言: 英语
上映日期: 2018-03-30(中国大陆) / 2018-03-11(西南偏南电影节) / 2018-03-29(美国)
片长: 140分钟
又名: 玩家一号 / 挑战者1号(港) / 一级玩家(台) / 一号玩家
IMDb链接: tt1677720
 
 More_Comments = doc_douban("div #hot-comments > a")
Long_Comments = doc_douban("#topic-items > div > div:nth-child(3) > div.posts > span > a")

#topic-items > div
#hot-comments > a
print(More_Comments)
print(Long_Comments)
 
 更多短评122443条
 
 In Scrapy, response.css() is commonly used for the same kind of extraction, as in this example:
 '''
def parse(self, response):
    for quote in response.css('div.item'):
        yield {
            "电影名": quote.css('div.info div.hd a span.title::text').extract_first(),
            "评分":quote.css('div.info div.bd div.star span.rating_num::text').extract(),
            "引言": quote.css('div.info div.bd p.quote span.inq::text').extract()
        }
'''
 
 
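 The BeautifulSoup.select() method
 This approach is similar to the previous one: it uses BeautifulSoup's built-in CSS selectors to extract elements, as the simple example below shows.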
 html_soup = BeautifulSoup(html_r.text,'lxml')
html_soup_li = html_soup.select("li")
print(html_soup_li)
 
 [手写
, 拼音
, 
, 关闭]
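
select() accepts more than bare tag names. The sketch below shows class, id, and child selectors on a made-up fragment; the markup is an assumption for illustration, and it uses the standard-library html.parser backend so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids, classes and links are invented for the example.
SAMPLE = '''
<ul id="menu">
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-0"><a href="link3.html">third item</a></li>
</ul>
'''

soup = BeautifulSoup(SAMPLE, "html.parser")

# select() returns a list of every element matching the CSS selector.
all_items = [li.get_text(strip=True) for li in soup.select("li")]

# Class selector plus child combinator: <a> directly inside li.item-0.
item0_links = [a["href"] for a in soup.select("li.item-0 > a")]

# select_one() returns only the first match, or None if nothing matches.
first = soup.select_one("#menu li")

print(all_items)
print(item0_links)
print(first.get_text(strip=True))
```

select() always returns a list, so an unmatched selector gives an empty list; select_one() is the convenient form when a single element is expected.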
