For all sorts of reasons we often need to scrape information from other sites. The relevant techniques under .NET are mature: WebRequest fetches the page, with support for a custom Referer header and for cookies, and the page is usually parsed with regular expressions. But whenever the target site changes its structure, the code has to be modified, recompiled, and redeployed.

With IronPython, the scraping and parsing logic can live in a Python script. If the target page's structure changes, you only edit the script; no recompilation is needed. C# handles the UI and interaction, and Python wraps the parts expected to change often.
After installing IronPython and Visual Studio 2010, you also need to download SgmlReader (see the reference links). This component converts loosely formatted HTML into well-formed XML, and can even add DTD validation.
Taking a Baidu Tieba page as our example, create a new Console project and reference the IronPython, Microsoft.Dynamic, Microsoft.Scripting, and SgmlReaderDll assemblies. Copy Html.dtd from the SgmlReader package into the project directory; without it, SgmlReader will go to the network to find the DTD named by the doctype. Then create a file named baidu.py, and finally add the following to the project's build events so both files get copied to the output directory:
copy $(ProjectDir)\*.py $(TargetDir)
copy $(ProjectDir)\*.dtd $(TargetDir)
In baidu.py, first reference the necessary .NET assemblies:
import clr, sys
clr.AddReference("SgmlReaderDll")
clr.AddReference("System.Xml")
Then import the classes we need:
from Sgml import *
from System.Net import *
from System.IO import TextReader, StreamReader
from System.Xml import *
from System.Text import Encoding
Using SgmlReader, write a function that converts HTML to XML. Note that the SystemLiteral property must be set; otherwise SgmlReader will go looking for the DTD on the network, wasting time.
def fromHtml(textReader):
    sgmlReader = SgmlReader()
    sgmlReader.SystemLiteral = "html.dtd"
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All
    sgmlReader.CaseFolding = CaseFolding.ToLower
    sgmlReader.InputStream = textReader
    doc = XmlDocument()
    doc.PreserveWhitespace = True
    doc.XmlResolver = None
    doc.Load(sgmlReader)
    return doc
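SgmlReader only exists on .NET, but the normalization it performs can be sketched in plain CPython with the standard library's html.parser: parse loose HTML and re-emit it as well-formed, lower-cased XML. This is only an illustrative analogue of what fromHtml does, not the SgmlReader API:

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no closing tag

class HtmlToXml(HTMLParser):
    """Re-emit loose HTML as well-formed, lower-cased XML (rough SgmlReader analogue)."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attr_s = "".join(' %s="%s"' % (k, escape(v or "")) for k, v in attrs)
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_s))  # self-close void elements
        else:
            self.out.append("<%s%s>" % (tag, attr_s))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # close any tags left open above the one being closed
        while self.stack and self.stack[-1] != tag:
            self.out.append("</%s>" % self.stack.pop())
        if self.stack:
            self.stack.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

def from_html(text):
    p = HtmlToXml()
    p.feed(text)
    p.close()
    while p.stack:                       # close anything still open at EOF
        p.out.append("</%s>" % p.stack.pop())
    return "".join(p.out)

print(from_html("<TABLE><TR><TD>hit<BR><td>42"))
# -> <table><tr><td>hit<br/><td>42</td></td></tr></table>
```

The output parses cleanly with any XML parser, which is exactly the property fromHtml relies on when it hands SgmlReader's output to XmlDocument.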
Using WebRequest, write a page-fetching method that supports cookies and page encodings:
def getWebData(url, method, data=None, cookie=None, encoding="UTF-8"):
    req = WebRequest.Create(url)
    req.Method = method
    if cookie is not None:
        req.CookieContainer = cookie
    if data is not None:
        stream = req.GetRequestStream()
        stream.Write(data, 0, data.Length)
    rsp = req.GetResponse()
    reader = StreamReader(rsp.GetResponseStream(), Encoding.GetEncoding(encoding))
    return reader
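The same request shape can be sketched in plain CPython with urllib (a hypothetical stand-in, not the .NET WebRequest API): build a request, attach the cookie, and only later read and decode the body with the explicit encoding. The setup step can be checked without touching the network:

```python
import urllib.request

def build_request(url, method, data=None, cookie=None):
    """Mirror of getWebData's setup step using urllib (illustrative sketch)."""
    req = urllib.request.Request(url, data=data, method=method)
    if cookie is not None:
        # here the cookie is passed as a raw header string, not a CookieContainer
        req.add_header("Cookie", cookie)
    return req

req = build_request("http://tieba.baidu.com/f?kw=seo", "GET", cookie="BDUSS=xyz")
print(req.get_method(), req.get_header("Cookie"))
# -> GET BDUSS=xyz
```

To actually fetch, you would pass the request to urllib.request.urlopen and decode the bytes with the page's encoding, e.g. `urlopen(req).read().decode("gbk")`, which is what StreamReader plus Encoding.GetEncoding does in one step above.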
Write a class to represent a scraping result. This class needs no counterpart in the C# project; C# 4.0's dynamic keyword can consume it directly.
class Post:
    def __init__(self, hit, comments, title, link, author):
        self.hit = hit
        self.comments = comments
        self.title = title
        self.link = link
        self.author = author
Define the class that does the main work. __init__ is roughly the constructor: we pass in the encoding and initialize the cookie container and the result list. [] is a Python list, roughly C#'s List&lt;T&gt;.
class BaiDu:
    def __init__(self, encoding):
        self.cc = CookieContainer()
        self.encoding = encoding
        self.posts = []
Next, define the scraping method: call getWebData to fetch the page, convert it to XML with fromHtml, and the rest is ordinary .NET XML manipulation:
    # still inside class BaiDu
    def getPosts(self, url):
        reader = getWebData(url, "GET", None, self.cc, self.encoding)
        doc = fromHtml(reader)
        trs = doc.SelectNodes("html//table[@id='thread_list_table']/tbody/tr")
        self.parsePosts(trs)
    def parsePosts(self, trs):
        for tr in trs:
            tds = tr.SelectNodes("td")
            hit = tds[0].InnerText
            comments = tds[1].InnerText
            title = tds[2].ChildNodes[1].InnerText
            link = tds[2].ChildNodes[1].Attributes["href"].Value
            author = tds[3].InnerText
            post = Post(hit, comments, title, link, author)
            self.posts.append(post)
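The extraction logic above can be exercised outside .NET with xml.etree on a tiny hand-made snippet. The table markup below is invented for illustration (real Tieba markup differs), and note one DOM difference: in .NET's XmlDocument the leading text inside a cell is child node 0, so the link is ChildNodes[1], while in ElementTree text is not a child node at all:

```python
import xml.etree.ElementTree as ET

# Invented miniature of the thread-list table, just to show the selection logic.
xml = """<table id="thread_list_table"><tbody>
<tr><td>321</td><td>12</td><td>x<a href="/p/1">First post</a></td><td>alice</td></tr>
</tbody></table>"""

doc = ET.fromstring(xml)
posts = []
for tr in doc.findall("tbody/tr"):        # like SelectNodes(".../tbody/tr")
    tds = tr.findall("td")                # like SelectNodes("td")
    hit = tds[0].text
    comments = tds[1].text
    a = list(tds[2])[0]                   # first element child; ChildNodes[1] in .NET
    title = a.text
    link = a.get("href")
    author = tds[3].text
    posts.append((hit, comments, title, link, author))

print(posts[0])
# -> ('321', '12', 'First post', '/p/1', 'alice')
```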
The C# code creates a script runtime with debugging enabled, executes baidu.py, then creates an instance of the BaiDu class and holds it through a dynamic reference:
Dictionary&lt;string, object&gt; options = new Dictionary&lt;string, object&gt;();
options["Debug"] = true;
ScriptEngine engine = Python.CreateEngine(options);
ScriptScope scope = engine.ExecuteFile("baidu.py");
dynamic baidu = engine.Operations.Invoke(scope.GetVariable("BaiDu"), "GBK");
Next, call the Python class's method to scrape the page, then print the results:
baidu.getPosts("http://tieba.baidu.com/f?kw=seo");
dynamic posts = baidu.posts;
foreach (dynamic post in posts)
{
    Console.WriteLine("{0} (replies: {1}) (hits: {2}) [author: {3}]",
        post.title, post.comments, post.hit, post.author);
}
Reference links: