Since I had never used the lxml + XPath combination before, this post walks through my process of learning lxml + XPath with Python. It mainly follows a tutorial I was reading.
w3school has a detailed XPath tutorial you can consult whenever something is unclear.
pip install lxml
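To confirm the install worked, a quick hedged check (lxml exposes a version string on lxml.etree):

>>> from lxml import etree
>>> etree.__version__   # e.g. '4.9.3'; any recent version is fine for what follows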
Generally speaking, the two usages shown below — parsing from a string and parsing from a file — are all we need when parsing web pages.
>>> from lxml import etree
>>> text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
</div>
'''
>>> html = etree.HTML(text)   # parse straight from a string; note the last li above is missing its closing tag
>>> html
<Element html at 0x...>
>>> type(html)
<class 'lxml.etree._Element'>
>>> result = etree.tostring(html)   # serialize the tree; lxml has completed the missing tag
>>> result
b'<html><body><div>\n    <ul>\n         <li class="item-0"><a href="link1.html">first item</a></li>\n         <li class="item-1"><a href="link2.html">second item</a></li>\n         <li class="item-inactive"><a href="link3.html">third item</a></li>\n         <li class="item-1"><a href="link4.html">fourth item</a></li>\n         <li class="item-0"><a href="link5.html">fifth item</a></li>\n     </ul>\n</div>\n</body></html>'
>>> type(result)
<class 'bytes'>
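As a side note, tostring returns bytes by default; if a readable str is wanted, lxml's encoding and pretty_print options help (a quick hedged sketch):

>>> s = etree.tostring(html, pretty_print=True, encoding='unicode')   # str instead of bytes, indented
>>> type(s)
<class 'str'>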
Besides parsing a string as above, etree can also read from a file, with the same result.
>>> html = etree.parse('hello.html')   # assumes hello.html holds the same markup as text above
>>> result = etree.tostring(html)
>>> result   # same serialized document as before
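For that snippet to run, hello.html has to exist on disk; the original file is not shown here, so as a hedged sketch you could simply write the earlier text string out first:

>>> with open('hello.html', 'w', encoding='utf8') as f:
...     f.write(text)   # reuse the HTML string from the first example
...
>>> html = etree.parse('hello.html')
>>> type(html)   # note: parse() gives an ElementTree, not an Element
<class 'lxml.etree._ElementTree'>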
>>> li = html.xpath('//li')   # find every li tag in the document, returned as a list
>>> li
[<Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>]
>>> type(li)
<class 'list'>
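Each entry in that list is a live Element, so attributes and nested text are one step away; a small hedged loop over them (Element.get reads an attribute):

>>> for item in li:
...     print(item.get('class'), item.xpath('./a/text()'))
...
item-0 ['first item']
item-1 ['second item']
item-inactive ['third item']
item-1 ['fourth item']
item-0 ['fifth item']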
>>> li_class = html.xpath('//li/@class')
>>> li_class
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
>>>
>>> li_class_0 = html.xpath("//li[@class='item-0']")
>>> li_class_0
[<Element li at 0x...>, <Element li at 0x...>]
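The same path style also drills down to text and links directly, which saves a loop; a hedged sketch against the same document (the href values follow the placeholder links in the snippet above):

>>> html.xpath('//li/a/text()')
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
>>> html.xpath('//li/a/@href')
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']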
Next, a smaller snippet to contrast // with /, where the span sits inside an a tag under the li:

>>> text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span class="bold">first item</span></a></li>
     </ul>
</div>
'''
>>> html = etree.HTML(text)
>>> span = html.xpath('//li//span')   # span is a descendant of li (nested inside the a)
>>> span
[<Element span at 0x...>]
>>> span1 = html.xpath('//li/span')   # but span is not a direct child of li
>>> span1
[]
Note: the double slash before span means "find span tags anywhere inside any li tag", while the single slash only finds spans that are direct children of an li. The // form is shorthand for the descendant-or-self axis, so //li//span is equivalent to //li/descendant::span.

Predicates can also index by position. Switching back to the earlier five-item document:

>>> html = etree.parse('hello.html')   # the five-item document again
>>> res = html.xpath('//li[last()]/a')   # the last li
>>> res[0].text
'fifth item'
>>> res = html.xpath('//li[last()-1]/a')   # the second-to-last li
>>> res[0].text   # show the text inside the a tag
'fourth item'
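The comparison drawn in the next sentence is presumably between contains() and an exact attribute match; the original example is not shown, so here is a minimal hedged sketch with a made-up element whose class is 'external link':

>>> demo = etree.HTML('<div><a class="external link" href="#">demo</a></div>')
>>> demo.xpath("//a[contains(@class, 'external')]")   # matches: the class merely contains 'external'
[<Element a at 0x...>]
>>> demo.xpath("//a[@class='external']")              # no match: @class must equal 'external' exactly
[]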
The former only requires the class name to contain 'external', whereas the latter demands that the class be exactly 'external', a full match.
Scraping the same content with XPath, BeautifulSoup, and re (regular expressions) side by side.
Code:
from lxml import etree
import urllib.request
from bs4 import BeautifulSoup as BS
import re

headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}

def get_html(url):
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def for_xpath(html):
    print('this method is for xpath...')
    result = etree.HTML(html)
    # grab every hot destination: all a tags under any div whose
    # class attribute is 'hot-list clearfix'
    hot_city = result.xpath("//div[@class='hot-list clearfix']//a")
    mdd = []
    for i in range(len(hot_city)):
        mdd.append(hot_city[i].text)
    print(mdd)
    return len(mdd)

def for_BS(html):
    print('this method is for BeautifulSoup...')
    soup = BS(html, 'html.parser')
    # take the first two tags whose class is col, which cover all domestic spots
    div_col_in = soup.select('.col')[:2]
    #print(div_col_in)
    mdd = []
    for col_in in div_col_in:
        a = col_in.select('a')
        for i in range(len(a)):
            mdd.append(a[i].text)
    print(mdd)
    return len(mdd)

def for_re(html):
    print('this method is for re ...')
    mdd_all = re.findall('target="_blank">(.*?)</a>', html)   # the closing </a> anchors the lazy match
    mdd = mdd_all[1:137]
    print(mdd)
    return len(mdd)

def main():
    url = 'http://www.mafengwo.cn/mdd/'
    html = get_html(url)
    html = html.decode('utf8')
    #print(html)
    i = for_xpath(html)
    j = for_BS(html)
    k = for_re(html)
    print('for xpath: total is %d mdd !' % i)
    print('for BS: total is %d mdd !' % j)
    print('for re: total is %d mdd !' % k)

if __name__ == '__main__':
    main()
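As an aside, the XPath version could skip the append loop by asking for the text nodes directly; a hedged one-liner that should drop into for_xpath under the same page structure:

mdd = result.xpath("//div[@class='hot-list clearfix']//a/text()")   # text() returns the strings themselves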
In the end all three counts come out the same, and the three parsing approaches feel roughly equivalent in difficulty. One snag: when a class name contains a space (e.g. hot-list clearfix), BeautifulSoup seemed unable to fetch it — in a CSS selector a space is the descendant combinator, so the selector has to be written as .hot-list.clearfix rather than with the space.
Which method to use comes down to the situation at hand, or simply personal preference!
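A hedged sketch of that multi-class selection, assuming the same soup object as in for_BS above:

div_hot = soup.select('div.hot-list.clearfix')                 # chain the classes with dots
div_hot2 = soup.find_all('div', class_='hot-list clearfix')    # or match the full attribute string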