Using and, or, not, contains, starts-with, and string(.) in XPath Locators

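This post summarizes the XPath constructs used most often: text(), and, or, not, and contains. Related ones include position(), last(), starts-with, and ends-with (the last of which exists only in XPath 2.0).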

data1 = selector.xpath("//input[@type='submit' and @name='fuck']")
data2 = selector.xpath("//input[@type='submit' or @name='fuck']")
data3 = selector.xpath("//input[@type='submit' and not(contains(@name,'fuck'))]")
data4 = selector.xpath("//input[starts-with(@id,'fuck')]")
# ends-with() is an XPath 2.0 function; lxml implements XPath 1.0, so compare the end of the string with substring() instead
data5 = selector.xpath("//input[substring(@id, string-length(@id) - string-length('fuck') + 1) = 'fuck']")
data6 = selector.xpath("//input[contains(@id,'fuck')]")
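
The expressions above can be tried on a small page. The following is a minimal sketch, assuming lxml is installed; the form markup, the attribute values ('login', 'other') and the helper ids() are invented for this illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical form markup, invented for this illustration.
HTML = """
<form>
  <input id="login_first" type="submit" name="login">
  <input id="second_login" type="submit" name="other">
  <input id="third" type="text" name="login">
</form>
"""

selector = etree.HTML(HTML)


def ids(expr):
    """Return the id attribute of every node matched by expr (hypothetical helper)."""
    return [node.get("id") for node in selector.xpath(expr)]


# and: both conditions must hold.
both = ids("//input[@type='submit' and @name='login']")
# or: at least one condition must hold.
either = ids("//input[@type='submit' or @name='login']")
# not(...): negates the inner condition.
negated = ids("//input[@type='submit' and not(contains(@name,'login'))]")
# starts-with: prefix match on the attribute value.
prefix = ids("//input[starts-with(@id,'login')]")
# XPath 1.0 has no ends-with(); compare the tail of the string instead.
suffix = ids(
    "//input[substring(@id, string-length(@id) - string-length('login') + 1) = 'login']"
)
# position() and last() filter by index within the parenthesized node-set.
after_first = ids("(//input)[position() > 1]")
last_one = ids("(//input)[last()]")

print(both, either, negated, prefix, suffix, after_first, last_one)
```

Wrapping //input in parentheses before applying position() or last() makes the index refer to the whole result set rather than to each parent's children.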

In addition, here is an example that illustrates how string(.) is used:


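(HTML sample: a div with id="test3" whose text is spread across nested child elements, including a list item; the tags themselves are not shown here.)

    我左青龙,
        右白虎,
    上朱雀,
  • 下玄武。
        老牛在腰间,
    龙头在胸前。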

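Note that selector.xpath returns a list. Because an id is supposed to be unique within a page, the list indexed with [] below always holds at most one element; if nothing matches, the list is empty and [0] raises an IndexError.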

data = selector.xpath('//div[@id="test3"]')[0]
info = data.xpath('string(.)')

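At this point info holds the text of the div and all of its descendants. Ignoring the whitespace and line breaks between nodes, it reads "我左青龙,右白虎,上朱雀,下玄武。老牛在腰间,龙头在胸前。"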
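
To make the difference between text() and string(.) concrete, here is a minimal sketch. Only the id "test3" is taken from the text above; the nested div, ul and li tags are assumptions made for this sketch, so the real markup may well differ.

```python
from lxml import etree

# Hypothetical markup: only the id "test3" comes from the article;
# the nested div/ul/li structure is assumed for this sketch.
HTML = """
<div id="test3">
    我左青龙,
    <div class="inner">
        右白虎,
        <ul>上朱雀,
            <li>下玄武。</li>
        </ul>
        老牛在腰间,
    </div>
    龙头在胸前。
</div>
"""

selector = etree.HTML(HTML)

# xpath() returns a list, so check it before indexing.
matches = selector.xpath('//div[@id="test3"]')
if not matches:
    raise LookupError("no element with id test3")
data = matches[0]

# text() yields only the text nodes that are direct children of the div.
direct_text = data.xpath('text()')
# string(.) concatenates the text of the div and of every descendant.
full_text = data.xpath('string(.)')

# Both results keep the whitespace between nodes, so strip it for display.
direct_compact = "".join("".join(direct_text).split())
full_compact = "".join(full_text.split())

print(direct_compact)
print(full_compact)
```

In this sketch text() misses everything inside the nested elements, while string(.) returns the whole passage in document order.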

A small combined example:

# -*- coding: utf-8 -*-
import requests
from lxml import etree


class Bugzilla(object):
    def __init__(self):
        self.base_url = 'https://bugs.winehq.org/show_bug.cgi?id='
        self.user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}

    def getPage(self, url):
        # Download the page and decode the response as UTF-8.
        html = requests.get(url, headers=self.user_agent)
        html.encoding = 'utf-8'
        return html.text

    def getSelector(self, page):
        # Parse the HTML text into an element tree that supports xpath().
        return etree.HTML(page)

    def getUrl(self, number):
        # Build the URL of a bug from its number.
        return self.base_url + str(number)

    def getDescription(self, selector):
        # The description is read from the element with id "c0"; xpath() returns a list.
        description = selector.xpath('//*[@id="c0"]/pre/text()')
        if not description:
            return ""
        return description[0]

    def getComments(self, selector):
        #comments = selector.xpath('//*[@id="comments"]/table/tbody/tr/td/div[position() > 1]/pre/text()')
        #comments = selector.xpath('//*[@class="bz_comment" or not(contains(@id,"c0"))]/pre/text()')
        data = selector.xpath('//*[@class="bz_comment"]/pre')
        comments = []
        for comms in data:
            # string(.) joins all text inside the <pre>, including text in nested tags.
            comments.append(comms.xpath('string(.)'))
        return comments  # '. '.join(comments)


if __name__ == '__main__':
    bug = Bugzilla()
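
The reason getComments calls string(.) on each pre element, rather than selecting pre/text() directly, can be seen without any network access. The sketch below is only an illustration: the markup in PAGE is hypothetical and much simpler than a real bug page, and the two helper functions are invented names, not part of the class above.

```python
from lxml import etree

# Hypothetical, simplified markup; a real bug page may be structured differently.
PAGE = """
<div id="c0" class="bz_comment bz_first_comment"><pre>Initial description</pre></div>
<div id="c1" class="bz_comment"><pre>See <a href="#c0">comment 0</a> for details</pre></div>
"""

selector = etree.HTML(PAGE)


def comments_via_text(sel):
    """Select pre/text(): only text nodes that are direct children of <pre>."""
    return [str(t) for t in sel.xpath('//*[@class="bz_comment"]/pre/text()')]


def comments_via_string(sel):
    """Call string(.) per <pre>: one flattened string for each comment."""
    pres = sel.xpath('//*[@class="bz_comment"]/pre')
    return [str(pre.xpath('string(.)')) for pre in pres]


fragments = comments_via_text(selector)
comments = comments_via_string(selector)

print(fragments)
print(comments)
```

In this sketch the exact comparison @class="bz_comment" skips the first div because its class attribute holds two names, and pre/text() breaks the remaining comment into fragments around the link, whereas string(.) returns it as a single string.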
 
