HtmlAgilityPack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't have to understand either one to use it). It is a .NET code library for parsing HTML files found "out on the web". The parser is very tolerant of malformed "real world" HTML. The object model is very similar to the one System.Xml provides, but for HTML documents (or streams).
PM> Install-Package HtmlAgilityPack -Version 1.8.0
// html is a string holding the page source
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNode rootNode = doc.DocumentNode;
// SelectSingleNode returns null when the XPath matches nothing
HtmlAgilityPack.HtmlNode row = rootNode.SelectSingleNode("//*[@id='content']/div[3]/div[1]");
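The snippet above assumes the XPath always matches. A minimal sketch of the same calls with a null check is shown below; the class name `HapExample`, the helper `FirstText`, and the sample markup are hypothetical and only serve the illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class HapExample
{
    // Hypothetical helper: returns the trimmed inner text of the first node
    // matching the XPath, or null when nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when no node matches the expression.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }

    public static void Main()
    {
        // Made-up sample markup, not taken from any real page.
        string html = "<div id='content'><p>first</p><p>second</p></div>";

        string text = FirstText(html, "//*[@id='content']/p[2]");
        Console.WriteLine(text ?? "(no match)");

        string missing = FirstText(html, "//*[@id='missing']");
        Console.WriteLine(missing ?? "(no match)");
    }
}
```

Note that `SelectNodes` behaves similarly: it returns null rather than an empty collection when nothing matches, so it is worth checking for null before iterating.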
ScrapySharp is a scraping framework containing:
- a web client able to simulate a web browser
- an HtmlAgilityPack extension to select elements using CSS selectors (like jQuery)
PM> Install-Package ScrapySharp -Version 2.6.2
// html is an HtmlNode here (for example doc.DocumentNode); CssSelect requires "using ScrapySharp.Extensions;"
html.CssSelect("div"); // all div elements
html.CssSelect("div.content"); // all div elements with CSS class 'content'
html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes 'widget' and 'monthlist'
html.CssSelect("#postPaging"); // all HTML elements with the id 'postPaging'
html.CssSelect("div#postPaging.testClass"); // all div elements with the id 'postPaging' and CSS class 'testClass'
html.CssSelect("div.content > p.para"); // p elements with CSS class 'para' that are direct children of div elements with CSS class 'content'
html.CssSelect("input[type = text].login"); // text inputs with CSS class 'login'
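To show where the `html` node in these calls comes from, here is a minimal sketch that loads a string with HtmlAgilityPack and then runs one of the selectors through ScrapySharp. The class name `CssSelectExample`, the helper `ParagraphTexts`, and the sample markup are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class CssSelectExample
{
    // Hypothetical helper: returns the text of every p.para element that is
    // a direct child of a div.content element.
    public static string[] ParagraphTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // CssSelect is an extension method on HtmlNode and yields HtmlNode items.
        return doc.DocumentNode
                  .CssSelect("div.content > p.para")
                  .Select(p => p.InnerText.Trim())
                  .ToArray();
    }

    public static void Main()
    {
        // Made-up sample markup: only the first paragraph is both a direct
        // child of div.content and carries the 'para' class.
        string html =
            "<div class='content'>" +
            "<p class='para'>one</p>" +
            "<div class='inner'><p class='para'>nested</p></div>" +
            "<p>plain</p>" +
            "</div>";

        foreach (string text in ParagraphTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Materializing the result with `ToArray()` is a choice made here so the matches can be reused after the document goes out of scope; iterating the `CssSelect` result directly works as well.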
For more on how to use CSS selectors, see the W3 page: CSS Selector Reference.