Data Scraping with Ruby scrAPI

 

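Two days ago I watched a Railscasts screencast on the Ruby scrAPI library, "Screen Scraping with ScrAPI". It walks through an example of using scrAPI to scrape content from other websites by working with the HTML DOM, and the whole approach is simple and effective. scrAPI's HTML parsing relies on CSS-style selectors and feels very much like jQuery's Selectors: a selector such as html>body>div#container>div#articles>div.item>div.title can pick elements out of an HTML structure like the one below.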

[Code] HTML code

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  <title>This is the scrAPI collect demo page</title>
</head>
<body>
  <div id="header">
    <h1 id="demo">Demo</h1>
    <ul id="nav">
      <li><a href="/">Home</a></li>
      <li class="current"><a href="/articles">Articles</a></li>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
  <div id="container">
    <div id="articles">
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 1</a>
        </div>
        <div class="summary">
          There is the summary text 1.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 2</a>
        </div>
        <div class="summary">
          There is the summary text 2.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 3</a>
        </div>
        <div class="summary">
          There is the summary text 3.
        </div>
      </div>
    </div>
  </div>
</body>
</html>
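
To make the selector path concrete, here is a minimal sketch in plain Ruby of how a chain like html>body>div#container>div#articles>div.item>div.title narrows down to the title elements. This is not scrAPI's API and does not parse real HTML: the names Node, parse_step, step_matches?, select_path, item and DOC are all hypothetical, and the tree is a hand-built, simplified mirror of the sample page above (the head is omitted, the header is left empty, and each title link is collapsed to its text).

```ruby
# Hypothetical stand-in for a parsed HTML element (not a scrAPI class).
class Node
  attr_reader :tag, :id, :classes, :text, :children

  def initialize(tag, id: nil, classes: [], text: nil, children: [])
    @tag, @id, @classes, @text, @children = tag, id, classes, text, children
  end
end

# Split one selector step such as "div#articles" or "div.item" into its parts.
# Only the simple tag, tag#id and tag.class forms are supported in this sketch.
def parse_step(step)
  m = step.match(/\A(?<tag>[a-z][a-z0-9]*)(?:\#(?<id>[\w-]+))?(?:\.(?<klass>[\w-]+))?\z/i)
  raise ArgumentError, "unsupported selector step: #{step}" unless m
  { tag: m[:tag], id: m[:id], klass: m[:klass] }
end

# True when the node satisfies the tag and any id/class named in the step.
def step_matches?(node, step)
  node.tag == step[:tag] &&
    (step[:id].nil? || node.id == step[:id]) &&
    (step[:klass].nil? || node.classes.include?(step[:klass]))
end

# Resolve a chain of child selectors ("a>b>c"), starting at the root node.
# Each ">" moves exactly one level down, keeping only matching children.
def select_path(root, selector)
  first, *rest = selector.split(">").map { |s| parse_step(s.strip) }
  matches = step_matches?(root, first) ? [root] : []
  rest.each do |step|
    matches = matches.flat_map { |n| n.children.select { |c| step_matches?(c, step) } }
  end
  matches
end

# Build one article entry, mirroring a div.item from the sample page.
def item(n)
  Node.new("div", classes: ["item"], children: [
    Node.new("div", classes: ["title"], text: "Sample article title #{n}"),
    Node.new("div", classes: ["summary"], text: "There is the summary text #{n}.")
  ])
end

DOC = Node.new("html", children: [
  Node.new("body", children: [
    Node.new("div", id: "header"),
    Node.new("div", id: "container", children: [
      Node.new("div", id: "articles", children: [item(1), item(2), item(3)])
    ])
  ])
])

TITLES = select_path(DOC, "html>body>div#container>div#articles>div.item>div.title").map(&:text)
puts TITLES
```

Each step of the selector filters the children of the previous matches, so div#header is discarded at the div#container step and the div.summary elements are discarded at the final div.title step. A real scraper would get the tree from an HTML parser rather than building it by hand.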
