Writing a Crawler in .NET Core: HtmlAgilityPack Usage Explained

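In the previous post, "Writing a Crawler in .NET Core: HttpClient Usage Explained", we saw how to send HTTP requests and fetch data. The next step is parsing that data and extracting the information we want. In Python the usual parsing libraries are PyQuery, BeautifulSoup, and lxml; the .NET counterpart is HtmlAgilityPack. It works by evaluating XPath expressions against the nodes of the DOM tree. XPath is simple, and because it is a language-independent standard, the same expressions carry over to parsers in other languages.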

1. Introduction to HtmlAgilityPack

HtmlAgilityPack, HAP for short, is a third-party library written in C# for parsing HTML DOM and XML. The official site describes it as: "HAP is an HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT." The official site is https://html-agility-pack.net/

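2. Using HtmlAgilityPack

According to the official documentation, the API is organized into several basic groups: Parser, Selectors, Manipulation, Traversing, Writer, Utilities, and Attributes. The names largely explain themselves: Parser covers loading and parsing documents, Selectors covers selecting nodes, Manipulation covers modifying nodes, and so on. The official documentation is clear and its English is simple, so you can follow most of it without a dictionary.

Official documentation: https://html-agility-pack.net/documentation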
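
To make those groups concrete, here is a minimal sketch that touches several of them in turn. The markup and the HapTour class name are made up for illustration and do not come from the crawled site.

```csharp
using System;
using HtmlAgilityPack;

public static class HapTour
{
    // Hypothetical sample markup, not taken from any real site.
    public const string SampleHtml =
        "<html><body><ul id='list'><li class='item'>one</li><li class='item'>two</li></ul></body></html>";

    public static string Run()
    {
        // Parser: build a DOM from a string.
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        // Selectors: XPath query; SelectSingleNode returns null when nothing matches.
        HtmlNode list = doc.DocumentNode.SelectSingleNode("//ul[@id='list']");

        // Traversing: walk the <li> element children of the selected node.
        foreach (HtmlNode li in list.Elements("li"))
        {
            Console.WriteLine(li.InnerText);
        }

        // Attributes: read with a default value, then overwrite.
        string oldId = list.GetAttributeValue("id", "");
        Console.WriteLine("old id: " + oldId);
        list.SetAttributeValue("id", "renamed");

        // Manipulation: append a new node built from an HTML fragment.
        list.AppendChild(HtmlNode.CreateNode("<li class='item'>three</li>"));

        // Writer: serialize the modified node back to HTML.
        return list.OuterHtml;
    }
}
```

Calling HapTour.Run() prints the text of the two original items and returns the serialized list, which now carries the new id and a third item.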

2.1 Getting the HTML string

Before parsing we need the HTML document. Here I fetch it with HttpClient.

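using System;
using System.Net.Http;
using System.Threading.Tasks;

static string urlRoot = "https://www.haolizi.net/examples/csharp_{0}.html";

// One shared HttpClient; creating a new instance per request can exhaust sockets.
static readonly HttpClient httpClient = new HttpClient();

/// <summary>
/// Download an HTML page.
/// </summary>
/// <param name="requestUrl">URL of the page</param>
/// <returns>The page body as a string</returns>
public static async Task<string> HtmlRequest(string requestUrl)
{
	// "Method" and "KeepAlive" are not HTTP headers: the verb is set on the request,
	// and "Connection: close" is the equivalent of KeepAlive = false.
	using (var request = new HttpRequestMessage(HttpMethod.Get, requestUrl))
	{
		request.Headers.ConnectionClose = true;
		// The header name is "User-Agent", not "UserAgent".
		request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
		using (var response = await httpClient.SendAsync(request))
		{
			response.EnsureSuccessStatusCode();
			return await response.Content.ReadAsStringAsync();
		}
	}
}

string requestUrl = string.Format(urlRoot, 1);
Console.WriteLine(requestUrl);
// Blocking on .Result keeps this sample synchronous; prefer "await" inside an async method.
string html = HtmlRequest(requestUrl).Result;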

Of course, we can also load the HTML document directly with HAP's built-in method:

HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(requestUrl);
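
HtmlWeb also offers an asynchronous loader, LoadFromWebAsync, which avoids blocking the calling thread. The following is a minimal sketch rather than the author's code; the HapWebLoader class and its method names are hypothetical. The title extraction is kept separate from the download so it can be run on any HtmlDocument.

```csharp
using System;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class HapWebLoader
{
    // Downloads and parses a page without blocking the calling thread.
    public static async Task<string> GetTitleAsync(string url)
    {
        var web = new HtmlWeb();
        HtmlDocument doc = await web.LoadFromWebAsync(url);
        return ExtractTitle(doc);
    }

    // Returns the decoded, trimmed <title> text, or an empty string if there is none.
    public static string ExtractTitle(HtmlDocument doc)
    {
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? string.Empty : HtmlEntity.DeEntitize(title.InnerText).Trim();
    }
}
```

From an async method this would be called as `string title = await HapWebLoader.GetTitleAsync(requestUrl);`.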

2.2 Parsing the HTML string

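Next we parse the HTML DOM and extract the element data we need. First locate the element tags with the browser's developer tools (F12), as shown in the figure. The DOM tree structure quickly becomes clear, and we can then pull out the useful information with XPath. The elements outlined in red in the screenshot are the ones to extract: every hyperlinked string is content we want.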
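
Since the targets are the hyperlinked strings, the idea can be sketched with a single XPath expression, `//a[@href]`, which selects every anchor that has an href attribute. The LinkExtractor class below is hypothetical and only illustrates the technique. It resolves relative links against a base URL instead of concatenating strings.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns (text, absolute URL) pairs for every <a> element that has an href.
    public static List<(string Text, string Url)> ExtractLinks(string html, string baseUrl)
    {
        var results = new List<(string Text, string Url)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return results;

        var baseUri = new Uri(baseUrl);
        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            // Resolve the href against the base; skip values that are not valid URIs.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                results.Add((HtmlEntity.DeEntitize(a.InnerText).Trim(), absolute.ToString()));
            }
        }
        return results;
    }
}
```

Anchors without an href are ignored by the XPath predicate itself, so no extra filtering is needed in the loop.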

  • Locating the DOM elements with F12:
    [Figure 1: browser developer tools screenshot with the target elements outlined in red]

  • Extraction and parsing source code:

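using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using Newtonsoft.Json;

/// <summary>
/// Parse the listing page and extract the fields of each example.
/// </summary>
/// <param name="htmlStr">HTML source of the listing page</param>
public static void GetExampleData(string htmlStr)
{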

	#region Fields
	string rootUrl = @"https://www.haolizi.net";
	string name = string.Empty;
	string detailUrl = string.Empty;
	string category = string.Empty;
	string categoryUrl = string.Empty;
	int hotNum = -1;
	int downloadCount = -1;
	int needScore = 0;
	string devLanguage = string.Empty;
	string downloadSize = string.Empty;
	string pubdate = string.Empty;
	string pubPersion = string.Empty;
	string downloadUrl = string.Empty;
	#endregion
	
	HtmlDocument htmlDoc = new HtmlDocument();
	htmlDoc.LoadHtml(htmlStr);
	
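	var liNodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='content-box']/ul/li");
	// SelectNodes returns null (not an empty collection) when nothing matches.
	if (liNodes == null) return;
	foreach (HtmlNode node in liNodes)
	{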
		List<string> tags = new List<string>();
		
		#region Extract elements
		// Example title
		HtmlNode aNode = node.SelectSingleNode("./div[@class='baseinfo']/h3/a");
		name = aNode.InnerText;
		detailUrl = rootUrl + aNode.Attributes["href"].Value;
		// Example category
		HtmlNode categoryNode = node.SelectSingleNode("./div[@class='baseinfo']/a");
		category = categoryNode.InnerText;
		categoryUrl = rootUrl + categoryNode.Attributes["href"].Value;
		// Popularity
		HtmlNode hotNumNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[@class='rq']/em");
		hotNum = Convert.ToInt32(hotNumNode.InnerText);
		// Download count
		HtmlNode downloadCountNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[2]");
		downloadCount = Convert.ToInt32(downloadCountNode.InnerText);
		// Points required to download
		HtmlNode needScoreNode = node.SelectSingleNode("./div[@class='baseinfo']/div[@class='xj']/span[3]");
		needScore = Convert.ToInt32(needScoreNode.InnerText);
		// Development language: the value is the text node that follows the first span
		HtmlNode devLanguageNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[1]");
		devLanguage = devLanguageNode.NextSibling.InnerText.Replace(" ", "").Replace("|", "");
		// Download size
		HtmlNode downloadSizeNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[2]");
		downloadSize = downloadSizeNode.InnerText;
		// Publication date
		HtmlNode pubdateNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[3]");
		pubdate = pubdateNode.InnerText;
		// Publisher
		HtmlNode pubPersionNode = node.SelectSingleNode("./div[@class='sinfo']/div/p[@class='fun']/span[4]/a");
		pubPersion = pubPersionNode.InnerText;
		// Related tags (SelectNodes returns null when there are none)
		var tagNodes = node.SelectNodes("./div[@class='sinfo']/div/p[@class='fun']/span[contains(@class , 'zwch')]");
		if (tagNodes != null)
		{
			foreach (var tnode in tagNodes)
			{
				tags.Add(tnode.SelectSingleNode("./a").InnerText);
				// Console.WriteLine(name + " tag:" + tnode.SelectSingleNode("./a").InnerText);
			}
		}
		#endregion
		
		string jsonStr = JsonConvert.SerializeObject(new {
			Name = name,
			Category = category,
			DevLanguage = devLanguage,
			DownloadCount = downloadCount,
			DownloadSize = downloadSize.Replace("大小:", "").Trim(),
			HotNum = hotNum,
			NeedScore = needScore,
			Pubdate = Convert.ToDateTime(pubdate.Replace("发布时间:", "").Trim()),
			PubPersion = pubPersion
		});
		Console.WriteLine(jsonStr);
	}
}
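
The extraction code above assumes every node exists and every number is clean: SelectSingleNode returns null when nothing matches, and Convert.ToInt32 throws on text that is not a plain integer. One way to harden it is a pair of small helpers. This is a sketch under the assumption that a fallback value is acceptable for missing fields; the SafeExtract class is hypothetical and not part of HAP.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class SafeExtract
{
    // Text of the first node matching the XPath, or the fallback when nothing matches.
    public static string Text(HtmlNode root, string xpath, string fallback = "")
    {
        HtmlNode node = root.SelectSingleNode(xpath);
        return node == null ? fallback : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // First run of digits in the matched text, or the fallback when there is none.
    public static int Int(HtmlNode root, string xpath, int fallback = -1)
    {
        Match m = Regex.Match(Text(root, xpath), @"\d+");
        return m.Success && int.TryParse(m.Value, out int value) ? value : fallback;
    }
}
```

With these, a line such as `hotNum = SafeExtract.Int(node, "./div[@class='baseinfo']/div[@class='xj']/span[@class='rq']/em");` leaves hotNum at -1 instead of throwing when the markup changes.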

  • Result of running the parser:
    [Figure 2: screenshot of the console output]

3. Project code link

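This is the C# script I wrote earlier to crawl haolizi.net. It was written with synchronous calls rather than async ones. I plan to refactor it into an async version later, which is also part of the learning process.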
https://github.com/Dahlin/MarsCrawler
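
As a rough idea of what an async version could look like, the sketch below downloads several listing pages concurrently with Task.WhenAll. It is not the code in the repository: the AsyncCrawler class is hypothetical, and the fetch function is passed in as a parameter so that the page download can be swapped out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class AsyncCrawler
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads pages firstPage..lastPage concurrently with the given fetch function
    // and returns the bodies in page order.
    public static async Task<string[]> FetchPagesAsync(
        string urlTemplate, int firstPage, int lastPage, Func<string, Task<string>> fetch)
    {
        IEnumerable<Task<string>> downloads = Enumerable
            .Range(firstPage, lastPage - firstPage + 1)
            .Select(page => fetch(string.Format(urlTemplate, page)));
        return await Task.WhenAll(downloads);
    }

    // Default fetcher backed by the shared HttpClient.
    public static Task<string> HttpFetch(string url)
    {
        return Client.GetStringAsync(url);
    }
}
```

A call such as `string[] pages = await AsyncCrawler.FetchPagesAsync(urlRoot, 1, 3, AsyncCrawler.HttpFetch);` would fetch the first three pages, after which each body can be passed to GetExampleData. The sketch starts all requests at once, so for a larger page range some form of throttling would be advisable to avoid overloading the site.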
