Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Repost]

I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:
Maybe someone has done this already, but I don't see it in the comments.
I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
using System.Xml.XPath;
using System.IO;
namespace HtmlAgilityPack
{
    public class XslExtension
    {
        public XmlDocument loadhtmlasxml(string url)
        {
            // Create an instance of the HtmlWeb object
            HtmlWeb web = new HtmlWeb();
            // Declare the in-memory stream and the XML writer that targets it
            MemoryStream m = new MemoryStream();
            XmlTextWriter xtw = new XmlTextWriter(m, null);
            // Load the content into the writer
            web.LoadHtmlAsXml(url, xtw);
            // Flush any buffered output, then rewind the memory stream
            xtw.Flush();
            m.Position = 0;
            // Create, fill, and return the xml document
            XmlDocument xdoc = new XmlDocument();
            xdoc.LoadXml((new StreamReader(m)).ReadToEnd());
            return xdoc;
        }
    }
}
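
Before wiring the class into a style sheet, it can be handy to call it directly and look at what comes back. The following is only a rough sketch: the QuickCheck class is a hypothetical name, it assumes the XslExtension class above and HtmlAgilityPack are referenced by the project, and it needs network access to reach the page.

```csharp
using System;
using System.Xml;
using HtmlAgilityPack;

// Hypothetical console harness; not part of the original extension.
class QuickCheck
{
    static void Main()
    {
        // Call the extension method directly, exactly as the XSLT engine would
        XslExtension ext = new XslExtension();
        XmlDocument doc = ext.loadhtmlasxml("http://www.cnn.com");

        // Count every element in the converted document
        XmlNodeList elements = doc.SelectNodes("//*");
        Console.WriteLine("Element count: " + elements.Count);

        // List the link targets, to confirm XPath works against the result
        foreach (XmlNode link in doc.SelectNodes("//a[@href]"))
        {
            Console.WriteLine(link.Attributes["href"].Value);
        }
    }
}
```

If this prints a plausible element count and a list of links, the same XPath expressions should behave the same way inside the style sheet.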

Then, I used NXSLT from http://www.xmllab.net to load the custom extension function from the command line, so that the following XSL style sheet can be used directly:
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:hap="http://smourier.blogspot.com"
 xmlns:msxsl="urn:schemas-microsoft-com:xslt"
      version="1.0">
 <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>
 <xsl:template match="/">
  <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>
  <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>
  <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>
  <h3>CNN.com has this many nodes:</h3>
  <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
  <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class "cnnLSSpecialCovBoxContent" that have an HREF that starts with /2005/.</h2>
   <h3>Special Coverage</h3>
    <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
   <div>
    <h3><xsl:copy-of select="." /></h3>
    <!-- Now get the images from each story if they exist -->
    <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
    <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
   <br /><br />
   </div>
   </xsl:for-each>
  <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>
 </xsl:template>
</xsl:stylesheet>

The command for NXSLT to perform this is:

nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXslExtension.dll
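
If NXSLT is not at hand, the same extension object can most likely be registered in-process through the standard .NET XslCompiledTransform and XsltArgumentList.AddExtensionObject calls. This is a minimal sketch rather than what NXSLT does internally: the Program class is a hypothetical name, source.xml and source.xsl are assumed to sit in the working directory, and the namespace URI passed to AddExtensionObject has to match whatever URI the style sheet binds to the hap prefix.

```csharp
using System;
using System.Xml.Xsl;
using HtmlAgilityPack;

// Hypothetical host program; assumes XslExtension and HtmlAgilityPack are referenced.
class Program
{
    static void Main()
    {
        // Compile the style sheet
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load("source.xsl");

        // Bind the extension object to the namespace URI used for the hap prefix
        XsltArgumentList args = new XsltArgumentList();
        args.AddExtensionObject("http://smourier.blogspot.com", new XslExtension());

        // Run the transform and write the HTML result to standard output
        xslt.Transform("source.xml", args, Console.Out);
    }
}
```

The public methods of the registered object become callable from XPath under the bound prefix, which is what makes hap:loadhtmlasxml(...) resolve inside the style sheet.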
The style sheet connects to CNN.com using the syntax:
select="hap:loadhtmlasxml('http://www.cnn.com')"
Then, further down, as it processes each of the selected A elements, it connects to each of the linked stories and retrieves any images with a height of 168, copying them into the HTML result tree.
This approach could follow links to any depth. I haven't worked out the automatic form processor yet, but I think that could perhaps be an XSLT extension too...
Let me know what you think...
http://blogs.wdevs.com/ultravioletconsulting/archive/2005/09/10/10506.aspx