Python Web Scraping: Getting the XPath of a Piece of Content on a Page

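When scraping web pages in bulk, my usual approach is: 1. find where the target content sits in the page, that is, its XPath; 2. download the pages in bulk, then use that XPath to pull the needed content out of each page.

Here we use the Python module lxml.

Take Google Translate as an example. I want to scrape translations in bulk, so first I need to know the XPath of the translated text. The code is as follows: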

    # Python 2 code: urllib2 and the print statement do not exist in Python 3.
    import urllib
    import urllib2
    import lxml
    import lxml.html as HTML
    import lxml.etree as etree

    # Set the URL parameters
    lin = 'en'
    lout = 'zh-CN'
    text = 'my apple 123'
    values = {'hl': 'zh-CN', 'ie': 'UTF-8', 'text': text, 'sl': lin, 'tl': lout}
    url = 'http://translate.google.cn/translate_t'

    # Passing data to Request makes this a POST with a form-encoded body
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    req.add_header('User-Agent', "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)")
    response = urllib2.urlopen(req, timeout=10)
    shtml = response.read()
    response.close()

    # Parse the page and wrap it in an ElementTree so that getpath() is available
    hdoc = HTML.fromstring(shtml)
    htree = etree.ElementTree(hdoc)

    # Print the XPath and the text content of every element in hdoc, in turn
    for t in hdoc.iter():
        print htree.getpath(t)
        print t.text_content()
        raw_input()  # pause; press Enter to move on to the next element

 

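Running this code shows that the XPath of the translated text "我的苹果123" is "/html/body/div[2]/div[2]/div[2]/div/div/div[2]/div".

Now the XPath can be used to extract the translation. The function below takes the English source text, calls Google Translate, and returns the Chinese translation. The code is as follows: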
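The discovery step can also be tried offline. The sketch below is only an illustration, not the script above: it parses a small inline HTML string and returns the path of every element whose own text contains a search string. The sample markup `SAMPLE_HTML` and the helper name `find_xpaths` are assumptions made for this example, and the real Google Translate markup is not reproduced. Matching on `element.text` instead of `text_content()` skips the ancestors that merely contain the target text somewhere below them. It avoids the network and the Python 2 only modules, so it should run on a current Python with lxml installed.

```python
import lxml.html
from lxml import etree

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div id="header">Translate</div>
    <div id="content">
      <div class="source">my apple 123</div>
      <div class="result">translated text 123</div>
    </div>
  </body>
</html>
"""


def find_xpaths(html_text, needle):
    """Return the XPath of every element whose own text contains needle."""
    root = lxml.html.fromstring(html_text)
    tree = etree.ElementTree(root)  # getpath() lives on the tree, not the element
    paths = []
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(element.tag, str):
            continue
        # element.text is the text directly inside the element, before its
        # first child; it is None when there is no such text.
        if element.text and needle in element.text:
            paths.append(tree.getpath(element))
    return paths


print(find_xpaths(SAMPLE_HTML, "translated text"))
```

Searching for a string you already know is in the target (here part of the expected translation) narrows the output to a handful of candidate paths instead of one screen per element.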

    # -*- coding:utf-8 -*-
    # Python 2 code: urllib2 does not exist in Python 3.
    import urllib
    import urllib2
    import lxml
    import lxml.html as HTML
    import lxml.etree as etree

    def g_trans(str_text):
        lin = 'en'
        lout = 'zh-CN'
        values = {'hl': 'zh-CN', 'ie': 'UTF-8', 'text': str_text, 'sl': lin, 'tl': lout}
        url = 'http://translate.google.cn/translate_t'
        data = urllib.urlencode(values)
        req = urllib2.Request(url, data)
        req.add_header('User-Agent', "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)")
        response = urllib2.urlopen(req, timeout=10)
        # HTML.parse() reads from the file-like response and returns an ElementTree
        htree = HTML.parse(response)
        response.close()
        # Note: xpath() returns a list here
        emts = htree.xpath('/html/body/div[2]/div[2]/div[2]/div/div/div[2]/div')
        return emts[0].text_content()
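Step 2 of the workflow, applying one known XPath to many downloaded pages, can be sketched as follows. This is a hedged illustration and not part of the original code: the helper name `extract_first`, its `default` parameter, the `RESULT_XPATH` value, and the inline `pages` dictionary standing in for downloaded HTML are all assumptions. It also guards against the case where the XPath matches nothing (for example after a layout change), in which `emts[0]` would raise an `IndexError`.

```python
import lxml.html


def extract_first(html_text, xpath, default=None):
    """Return the text of the first element matching xpath, or default."""
    root = lxml.html.fromstring(html_text)
    matches = root.xpath(xpath)  # a list of elements for an element query
    if not matches:
        # The path matched nothing, e.g. because the page layout differs.
        return default
    return matches[0].text_content().strip()


# Hypothetical stand-ins for pages that were downloaded earlier.
pages = {
    "page1": "<html><body><div>menu</div><div><p>first result</p></div></body></html>",
    "page2": "<html><body><div>menu</div><div><p>second result</p></div></body></html>",
    "page3": "<html><body><div>layout changed</div></body></html>",
}

# Hypothetical path, found beforehand with the discovery step.
RESULT_XPATH = "/html/body/div[2]/p"

results = {name: extract_first(html, RESULT_XPATH) for name, html in pages.items()}
print(results)
```

Returning a default instead of raising lets a long batch run finish, and the pages that came back empty can be checked afterwards to see whether the XPath needs to be rediscovered.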
