Multithreaded Crawler 2

Special uses of XPath

1. Attribute values that start with the same characters

• starts-with(@attribute_name, "shared prefix of the attribute value")

Example:

<div id="test-1">wanted content 1</div>
<div id="test-2">wanted content 2</div>
<div id="testfault">wanted content 3</div>

    # -*- coding: utf-8 -*-
    from lxml import etree

    html1 = '''
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <div id="test-1">wanted content 1</div>
        <div id="test-2">wanted content 2</div>
        <div id="testfault">wanted content 3</div>
    </body>
    </html>
    '''

    # Parse the HTML string into an element tree.
    selector = etree.HTML(html1)
    # Select the text of every <div> whose id begins with "test".
    content = selector.xpath('//div[starts-with(@id,"test")]/text()')
    for each in content:
        print(each)
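
Note that the prefix "test" also matches id="testfault". If only the hyphenated ids are wanted, a longer prefix narrows the match. The following is a minimal sketch, assuming the same three ids as in the example above; the names `html_doc`, `loose`, and `strict` are illustrative and do not come from the original post.

```python
from lxml import etree

# Sample data: the same three ids as in the example above.
html_doc = '''
<html><body>
    <div id="test-1">content 1</div>
    <div id="test-2">content 2</div>
    <div id="testfault">content 3</div>
</body></html>
'''

selector = etree.HTML(html_doc)

# The prefix "test" matches all three ids, including "testfault".
loose = selector.xpath('//div[starts-with(@id, "test")]/text()')

# The longer prefix "test-" keeps only the ids that continue with a hyphen.
strict = selector.xpath('//div[starts-with(@id, "test-")]/text()')

print(loose)
print(strict)
```

Choosing the longest prefix that all wanted elements share is usually the simplest way to keep unrelated elements out of the result.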

2. Tags nested inside tags

Example:

<div id="class3">Hey beautiful,
    <font color=red>what is your WeChat ID?</font>
</div>

    # -*- coding: utf-8 -*-
    from lxml import etree

    html2 = '''
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <div id="test3">
            Green Dragon on my left,
            <span id="tiger">
                White Tiger on my right,
                <ul>Vermilion Bird above,
                    <li>Black Tortoise below.</li>
                </ul>
                an old ox in the middle,
            </span>
            a dragon head on my chest.
        </div>
    </body>
    </html>
    '''

    # Parse the HTML string into an element tree.
    selector = etree.HTML(html2)
    # Take the outer <div> first.
    data = selector.xpath('//div[@id="test3"]')[0]
    # string(.) gathers the text of the element and all of its descendants.
    info = data.xpath('string(.)')
    # Collapse newlines and indentation into single spaces.
    content_2 = ' '.join(info.split())
    print(content_2)
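
To see why string(.) is used here rather than text(), it can help to compare the two on a small nested fragment. This is a minimal sketch on a simplified fragment; `fragment`, `node`, `direct`, and `full` are illustrative names and not part of the original post.

```python
from lxml import etree

# A small nested fragment: text sits at three different depths.
fragment = '<div id="outer">A,<span>B,<b>C,</b>D,</span>E</div>'
selector = etree.HTML(fragment)
node = selector.xpath('//div[@id="outer"]')[0]

# text() returns only the text nodes that are direct children of the <div>.
direct = node.xpath('text()')

# string(.) concatenates the text of the element and all of its
# descendants, in document order.
full = node.xpath('string(.)')

print(direct)
print(full)
```

With text(), the content inside the nested <span> and <b> is skipped; string(.) returns everything as one string, which is why the example above cleans up whitespace afterwards.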
