Web-Harvest examples

Example 1:
<config charset="utf-8">
  <!-- Fetch the board index page and convert the HTML to XML -->
  <var-def name="start">
    <html-to-xml>
      <http url="http://www.tianya.cn/bbs/index.shtml" charset="utf-8" />
    </html-to-xml>
  </var-def>
  <!-- Select every board-group block from the page -->
  <var-def name="ulList">
    <xpath expression="//div[@class='bankuai_list']">
      <var name="start" />
    </xpath>
  </var-def>
  <!-- Write one board element per group, with its boards nested inside -->
  <file action="write" path="tianya/siteboards.xml" charset="utf-8">
    <![CDATA[ <site> ]]>
    <loop item="item" index="i">
      <list><var name="ulList"/></list>
      <body>
        <xquery>
          <xq-param name="item">
            <var name="item"/>
          </xq-param>
          <xq-expression><![CDATA[
            declare variable $item as node() external;
            <board boardname="{normalize-space(data($item//h3/text()))}" boardurl="">
            {
              for $row in $item//li return
                <board boardname="{normalize-space(data($row//a/text()))}" boardurl="{normalize-space(data($row/a/@href))}" />
            }
            </board>
          ]]></xq-expression>
        </xquery>
      </body>
    </loop>
    <![CDATA[ </site> ]]>
  </file>
</config>
This configuration file is divided into three parts:
1. Define the crawler entry point:
<var-def name="start">
  <html-to-xml>
    <http url="http://www.tianya.cn/bbs/index.shtml" charset="utf-8" />
  </html-to-xml>
</var-def>
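The entry point can also take its URL from a variable, using the same ${...} substitution that the second example below applies to requestURL. A minimal sketch, assuming a hypothetical variable name startUrl that does not appear in the original configuration:

```xml
<config charset="utf-8">
  <!-- startUrl is a hypothetical name chosen for this sketch -->
  <var-def name="startUrl">http://www.tianya.cn/bbs/index.shtml</var-def>

  <!-- Same entry point as above, with the URL resolved from the variable -->
  <var-def name="start">
    <html-to-xml>
      <http url="${startUrl}" charset="utf-8" />
    </html-to-xml>
  </var-def>
</config>
```

Keeping the URL in its own variable means the crawl target can be changed in one place without touching the download step.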


Example 2:
<var-def name="requestURL">
    http://www.informatik.uni-trier.de/~ley/db/conf/IEEEscc/scc2009.html
</var-def>
<var-def name="confXML">
    http://dblp.uni-trier.de/rec/bibtex/conf/IEEEscc/2009.xml
</var-def>
<var-def name="article_link">
    <xquery>
        <xq-param name="doc">
            <html-to-xml>
                <http url="${requestURL}"/>
            </html-to-xml>
        </xq-param>
        <xq-param name="confXML" type="string">
            <var name="confXML"/>
        </xq-param>
        <xq-expression><![CDATA[
            declare variable $doc as node() external;
            declare variable $confXML as xs:string external;
            <asdfasd>
                {
                    for $x in $doc//a
                    where $x/@href = $confXML and matches($x/@href,"http:.*\.xml")
                    return
                        $x/@href
                }
            </asdfasd>
        ]]></xq-expression>
    </xquery>
</var-def>
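1. Variables defined earlier in the configuration cannot be used directly inside XQuery. Each one has to be declared again as an xq-param, which takes its value from the variable defined in the context.
2. To use the variable inside xq-expression, declare it as external: declare variable $name as xs:string external.
3. The type in that declaration (declare variable $name as xs:string external) needs the xs: prefix, as in xs:string; without it an error is reported.
4. When the return value is
<asdfasd>
    {
        for $x in $doc//a
        where $x/@href = $confXML and matches($x/@href,"http:.*\.xml")
        return
            $x/@href
    }
</asdfasd>
the result is the evaluated for expression:
<asdfasd href="http://dblp.uni-trier.de/rec/bibtex/conf/IEEEscc/2009.xml"/>
With the curly braces removed, the result is
<asdfasd>
for $x in $doc//a
where $x/@href = $confXML and matches($x/@href,"http:.*\.xml")
return
$x/@href
</asdfasd>
In a word: odd. It is standard XQuery behavior, though: inside an element constructor only the content within curly braces is evaluated, and everything else is copied as literal text. The attribute node returned by the for expression then becomes an attribute of <asdfasd>.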
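
The four notes above can be seen together in one small configuration. This is only a sketch: the variable name greeting and the element names out, evaluated and literal are hypothetical and chosen for illustration, and it assumes the XQuery engine supports the standard upper-case function.

```xml
<config charset="utf-8">
  <!-- greeting is a hypothetical context variable used only in this sketch -->
  <var-def name="greeting">hello</var-def>

  <var-def name="result">
    <xquery>
      <!-- Notes 1: the context variable is handed to XQuery through xq-param -->
      <xq-param name="greeting" type="string">
        <var name="greeting"/>
      </xq-param>
      <xq-expression><![CDATA[
        (: Notes 2 and 3: external declaration with an xs:-prefixed type :)
        declare variable $greeting as xs:string external;
        <out>
          <!-- Note 4: braces mark an expression to evaluate -->
          <evaluated>{ upper-case($greeting) }</evaluated>
          <!-- Without braces the same characters are plain text -->
          <literal>upper-case($greeting)</literal>
        </out>
      ]]></xq-expression>
    </xquery>
  </var-def>
</config>
```

Under those assumptions, the evaluated element should carry the computed value while the literal element should carry the expression text unchanged, which mirrors the difference seen with and without braces in the notes above.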
