Web Scraping: Requests + lxml

This is the more common approach:

# -*- coding: utf-8 -*-
import requests
from lxml import etree

url = "http://econpy.pythonanywhere.com/ex/001.html"
page = requests.get(url)
html = page.text

# etree.HTML() parses the markup into an element tree we can query
selector = etree.HTML(html)

# text() returns the text content of every matching node as a list
buyers = selector.xpath('//div[@title="buyer-name"]/text()')
prices = selector.xpath('//span[@class="item-price"]/text()')

print(buyers)
print(prices)
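The two XPath queries above return parallel lists, so the results can be paired up with zip(). Here is a sketch that runs offline against a small hand-written snippet mimicking the structure of the example page (the names and prices below are illustrative, not fetched from the site):

```python
from lxml import etree

# Hand-written markup with the same structure as the example page
html = '''
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span>
'''
selector = etree.HTML(html)
buyers = selector.xpath('//div[@title="buyer-name"]/text()')
prices = selector.xpath('//span[@class="item-price"]/text()')

# zip() pairs each buyer with the price at the same position
for buyer, price in zip(buyers, prices):
    print(buyer, price)
```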

This variant is used less often:

# -*- coding: utf-8 -*-
import requests
from lxml import html

url = "http://econpy.pythonanywhere.com/ex/001.html"
page = requests.get(url)

# lxml.html.fromstring() returns an element with the same xpath() API
tree = html.fromstring(page.text)

buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print(buyers)
print(prices)
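Besides text nodes, xpath() can also return attribute values directly with the @attr syntax, which is handy for collecting links. A small offline sketch (the markup below is hypothetical):

```python
from lxml import html

# Hypothetical snippet: extract both the link target and its label
tree = html.fromstring('''
<p><a href="/ex/002.html" class="next">Next page</a></p>
''')

# /@href selects the attribute value; /text() selects the text child
links = tree.xpath('//a[@class="next"]/@href')
labels = tree.xpath('//a[@class="next"]/text()')
print(links, labels)
```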
  1. XPath syntax reference:
    http://www.w3school.com.cn/xpath/xpath_syntax.asp
  2. In Chrome, the XPath Helper extension is handy for testing expressions.
  3. See also these notes on writing a Python crawler with requests and lxml:
    http://www.tuicool.com/articles/vABNRbR
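A few of the XPath patterns from the syntax reference above, demonstrated on a small made-up snippet: exact attribute matching with [@attr="v"], substring matching with contains(), and 1-indexed positional selection:

```python
from lxml import etree

# Made-up markup just to illustrate the patterns
doc = etree.HTML('''
<ul id="menu">
  <li class="item">Home</li>
  <li class="item active">News</li>
  <li>About</li>
</ul>
''')

# [@class="item"] matches the attribute value exactly, so it skips
# the element whose class is "item active"
first = doc.xpath('//li[@class="item"]/text()')

# contains() matches when the attribute merely contains the substring
active = doc.xpath('//li[contains(@class, "active")]/text()')

# positional predicates are 1-indexed in XPath, not 0-indexed
third = doc.xpath('//ul[@id="menu"]/li[3]/text()')

print(first, active, third)
```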

Advanced uses of XPath in Python
See: http://blog.csdn.net/winterto1990/article/details/47903653

However, on Chinese-language pages, the Chinese text may come out garbled (mojibake):

req = requests.get("http://news.sina.com.cn/")
print(req.text)

To solve this problem, see this article:
http://blog.csdn.net/chaowanghn/article/details/54889835
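The usual cause: when the server's headers don't declare a charset, requests often falls back to ISO-8859-1, while the page body is actually UTF-8 or GBK. Setting .encoding to the real charset (or to .apparent_encoding, which sniffs the body) before reading .text fixes it. The sketch below builds a Response by hand so it runs offline; filling _content manually is only for demonstration (normally requests.get() does this), and the sample text is made up:

```python
import requests

# Build a Response by hand to reproduce the problem offline
resp = requests.models.Response()
resp._content = "新浪新闻:这是一段中文示例文本。".encode("utf-8")

# Simulate the wrong fallback guess: decoding UTF-8 bytes as
# ISO-8859-1 turns each Chinese character into garbage
resp.encoding = "ISO-8859-1"
garbled = resp.text

# The fix: set the real charset (resp.apparent_encoding can also
# detect it from the body) before reading .text
resp.encoding = "utf-8"
fixed = resp.text
print(fixed)
```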
