Abstract
The Rich Site Summary (RSS) format, previously known as the RDF Site Summary, has quietly become the dominant format for distributing news headlines on the Web.
In this Mother of Perl tutorial, we will write a short Perl script (fewer than 100 lines) that retrieves an XML RSS file from the Web or local file system and converts it to HTML. Using a Server Side Include (SSI) or similar method, you can easily add news headlines from any number of sources to your Web site.
History
Where did RSS come from, you ask? Netscape invented the RSS format for "channels" on Netscape Netcenter (http://my.netscape.com). It was released to the public in March of 1999. The first non-Netscape Web site to incorporate the new format was Scripting News, a popular technology news site run by Dave Winer, president of Userland Software (think Frontier). Interestingly enough, Scripting News had been using its own XML format, scriptingNews, since December of 1997.
In May of 1999, Dave Winer released a new version of the scriptingNews XML format, which added new content-rich elements. Netscape followed suit by adopting most of the new scriptingNews elements into RSS 0.91, which was released in July of 1999.
Userland Software also rolled out their own flavor of my.netscape.com. If you haven't already guessed, it's available at http://my.userland.com.
As far as I know, RSS is the most widely used XML format on the Web today. RSS headlines are available for many popular news sites like Slashdot, Forbes, and CNET News.com, and the list is growing daily.
In a time when "stickiness" is a good thing, displaying news headlines on your Web site can give it the extra "oomph" that encourages users to return. After all, users can read your president's bio only so many times.
Required Modules
For rss2html.pl to work on your system, you should have a recent version of Perl installed, 5.003 or better. 5.005 is recommended. You will also need the XML::Parser and XML::RSS modules installed.
To install the modules on a *nix system, type:
perl -MCPAN -e "install XML::Parser"
perl -MCPAN -e "install XML::RSS"
If you're using a win32 machine (Win95/98/NT), you will need a recent installation of ActiveState Perl. If you don't have a recent version, visit http://www.activestate.com.
To install XML::Parser on a win32 machine, type:
ppm install XML-Parser
To install XML::RSS on a win32 machine (you must have a C compiler and nmake):
- Download the module from: http://search.cpan.org/dist/XML-RSS/
- Uncompress the zip file and cd to the XML-RSS-0.5 directory
- type: perl Makefile.PL
- type: nmake
- type: nmake install
Next, we'll examine the RSS format in more detail.
rss2html.pl
Get the source
This script converts an RSS file on the Web or local file system to HTML.
RSS 0.9
The first public version of RSS, 0.9, includes basic headline information. Below is an example RSS file for Freshmeat.net, a popular news site for Linux software:
<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>freshmeat.net</title>
    <link>http://freshmeat.net</link>
    <description>the one-stop-shop for all your Linux software needs</description>
  </channel>
  <image>
    <title>freshmeat.net</title>
    <url>http://freshmeat.net/images/fm.mini.jpg</url>
    <link>http://freshmeat.net</link>
  </image>
  <item>
    <title>Geheimnis 0.59</title>
    <link>http://freshmeat.net/news/1999/06/21/930004162.html</link>
  </item>
  <item>
    <title>Firewall Manager 1.3 PRO</title>
    <link>http://freshmeat.net/news/1999/06/21/930004148.html</link>
  </item>
  <textinput>
    <title>quick finder</title>
    <description>Use the text input below to search the fresh meat application database</description>
    <name>query</name>
    <link>http://core.freshmeat.net/search.php3</link>
  </textinput>
</rdf:RDF>
The first major element is channel, which contains the following elements:
- title: the title of the channel
- link: the link to the channel Web site
- description: a short description of the channel
An RSS channel may also contain an image element, as in the example above, which contains the following elements:
- title: the text describing the image
- url: the URL of the image
- link: the URL that the image is linked to
The item element contains the real channel content, which consists of a title and a link element. An RSS file may contain up to 15 items.
An RSS 0.9 file may optionally contain a textinput element, which allows users to type a string into an HTML text input field and submit it via the HTTP GET method to the URL specified in the link element.
Next, we will examine RSS 0.91 which was released by Netscape in July of 1999.
RSS 0.91
RSS 0.91, the latest version of RSS, adds a few new elements. Below is a sample RSS file from XML.com, an excellent XML resource site:
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel> <title>XML News and Features from XML.com</title> <description>XML.com features a rich mix of information and services for the XML community.</description> <language>en-us</language> <link>http://xml.com/pub</link> <copyright>Copyright 1999, O'Reilly and Associates and Seybold Publications</copyright> <managingEditor>[email protected] (Dale Dougherty)</managingEditor> <webMaster>[email protected] (Peter Wiggin)</webMaster>
<image> <title>XML News and Features from XML.com</title> <url>http://xml.com/universal/images/xml_tiny.gif</url> <link>http://xml.com/pub</link> <width>88</width> <height>31</height> </image>
<item> <title>Issue: XML Data Servers</title> <link>http://xml.com/pub?wwwrrr_rss</link> <description>Although not everyone agrees that XML should become a full-fledged data-management discipline, object-database vendors are busy repositioning their object-database products as XML data servers. Jon Udell looks at one of these, Object Design's eXcelon and finds it a solid product.</description> </item>
<item> <title>O'Reilly Labs Review: Object Design's eXcelon 1.1</title> <link>http://xml.com/pub/1999/08/excelon/index.html?wwwrrr_rss</link> <description>Jon Udell takes a look at eXcelon, Object Design's XML data servers, and explains its user interface and general approach to XML. </description> </item>
<item> <title>Report from Montreal</title> <link>http://xml.com/pub/1999/08/excelon/montreal.html?wwwrrr_rss</link> <description>Lisa Rein reports from MetaStructures 99 and XML Developers' Day.</description> </item>
<item> <title>Reviews: Bluestone Software's XML Suite: Promising App, Rough Around the Edges</title> <link>http://xml.com/pub/1999/08/bluestone/index.html?wwwrrr_rss</link> <description>Our reviewer tested Bluestone's XML Suite (XML Server and Visual XML) on the Windows NT platform, simulating a two-way exchange of business information between a book publisher and book stores. The results were encouraging (with a few caveats).</description> </item>
<item> <title>Interviews: CBL: Ecommerce Componentry</title> <link>http://xml.com/pub/1999/08/glushko/glushko.html?wwwrrr_rss</link> <description>In this audio interview, Bob Glushko of Commerce One talks about the Common Business Library (CBL) as a set of building blocks for XML document types and schemas used in ecommerce.</description> </item>
<item> <title>Backends Sharing Data</title> <link>http://xml.com/pub/1999/08/rpc/index.html?wwwrrr_rss</link> <description>What if you could script remote procedure calls between web sites as easily as you can between programs? Edd Dumbill shows how it can be done in PHP.</description> </item>
<item> <title>Back Issue: XML Suite</title> <link>http://xml.com/pub/1999/08/18/index.html?wwwrrr_rss</link> <description> Barry Nance runs Bluestone's XML Suite through the paces. The tools show promise for passing data between databases and XML. But there are still a few kinks to be worked out.</description> </item>
<item> <title>Back Issue: XML-RPC</title> <link>http://xml.com/pub/1999/08/11/index.html?wwwrrr_rss</link> <description>A major promise of XML is its ability to pass data simply from one place to another, regardless of platform. In this issue, Edd Dumbill shows how to use XML-RPC in PHP to pass data from a web site to a PDA.</description> </item>
<item> <title>News: InDelv XML/XSL Client Version 0.4.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-a?wwwrrr_rss</link> <description> A posting from Rob Brown reports on the public availability of the new InDelv XML Client version 0.4. This version represent an upgrade to InDelv's previously released XML Browser, but "it has been renamed as a 'Client' to reflect the fact that it now contains both an XML/XSL browser and an XML/XSL editor. The browser is available free for all uses. The editor comes packaged with the browser as a demo, which can later be upgraded to a full commercial version. This is a 100% Java appl... </description> </item>
<item> <title>News: OpenJade Development Team Releases OpenJade 1.3pre1 (Beta).</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-g?wwwrrr_rss</link> <description> A recent posting from Avi Kivity and the OpenJade Development Team announced the release of OpenJade 1.3pre1 (Beta). "OpenJade is the DSSSL user community's open source implementation of DSSSL, Document Style Semantics and Specification Language, an ISO standard for rendering SGML and XML documents. OpenJade is based on James Clark's widely used Jade. OpenJade 1.3pre1 is a more complete implementation of the DSSSL standard, and introduces many new features, including (1) Implementat... </description> </item>
<item> <title>News: IBM XML Parser Update: XML4C2 Version 2.3.1 Released.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-b?wwwrrr_rss</link> <description> Dean Roddey posted an announcement for the update of XML4C. IBM's XML for C++ parser (XML4C) "is a validating XML parser written in a portable subset of C++. XML4C makes it easy to give an application the ability to read and write XML data. Its two shared libraries provide classes for parsing, generating, manipulating, and validating XML documents. XML4C is faithful to the XML 1.0 Recommendation and associated standards (DOM 1.0, SAX 1.0). Source code, samples and API documentation ... </description> </item>
<item> <title>News: Platform for Privacy Preferences (P3P) Specification Working Draft.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-h?wwwrrr_rss</link> <description> As part of the W3C P3P Activity, a fifth public working draft of the Platform for Privacy Preferences (P3P) Specification has been published for review by W3C members. The working draft "describes the Platform for Privacy Preferences (P3P). P3P enables Web sites to express their privacy practices and enables users to exercise preferences over those practices. P3P compliant products will allow users to be informed of site practices (in both machine and human readable formats), to deleg... </description> </item>
<item> <title>News: Extended XLink with XSLT.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-c?wwwrrr_rss</link> <description> Nikita Ogievetsky (President, Cogitech, Inc.) posted an announcement for the availability of slides from the Metastructures '99 presentation "HTML Form Templates with XML. All in One and One for All. XSLT template library for WEB applications." The paper describes building XSLT template library for web applications. The goal was to "demonstrate data processing on the web made easy with XSL transformations: Generate a data maintenance web with data-structure controlled by XML, scree... </description> </item>
<item> <title>News: HyBrick Web Site Reopens.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-d?wwwrrr_rss</link> <description> A posting from Toshimitsu Suzuki (Fujitsu Laboratories Ltd.) to the XLXP-DEV mailing list recently announced the reopening of the HyBrick Web site. 'HyBrick' is "an advanced SGML/XML browser developed by Fujitsu Laboratories, the research arm of Fujitsu. HyBrick is based on an architecture that supports advanced linking and formatting capabilities. HyBrick includes a DSSSL renderer and XLink/XPointer engine running on top of James Clark's SP and Jade. HyBrick supports: (1) Both v... </description> </item>
<item> <title>News: Extended DocBook Synopses Version 1.0.</title> <link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-e?wwwrrr_rss</link> <description> Norman Walsh has posted an announcement for a preliminary release of 'Extended DocBook Synopses'. Extended DocBook Synopses is a customization layer that extends DocBook, "adding a function synopsis element, ClassSynopsis for modern, mostly object-oriented, programming languages such as Java, C++, Perl, and IDL." DocBook is an SGML [and XML] DTD maintained by the DocBook Technical Committee of OASIS that particularly well suited to books and papers about computer hardware and softwar... </description> </item>
</channel> </rss>
Notice that there are more descriptive elements for the channel, image, and item elements. These are referred to as "fat elements" because they contain a more detailed description of each channel item.
The XML::RSS Module
Now that you've had a chance to glance at two RSS examples, it's time to introduce the XML::RSS module. XML::RSS is a subclass of XML::Parser, a Perl module maintained by Clark Cooper that utilizes James Clark's Expat C library. XML::RSS was developed to simplify the task of manipulating and parsing RSS files. A deep understanding of XML is not a prerequisite for using XML::RSS, since the XML details are hidden inside the class interface.
While XML::RSS is capable of creating RSS files, we will focus on parsing existing RSS files in this column. You can read more about the capabilities of XML::RSS in the module's documentation or by typing:
perldoc XML::RSS
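To give a feel for that class interface, here is a minimal sketch that parses a small RSS 0.9 document held in a string and reads a few fields back. The channel name, headlines, and example.org URLs are made up for illustration; the sketch assumes XML::RSS is installed as described earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# A tiny, made-up RSS 0.9 document (hypothetical channel and URLs).
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>Example Channel</title>
    <link>http://example.org/</link>
    <description>Made-up headlines</description>
  </channel>
  <item>
    <title>First headline</title>
    <link>http://example.org/1.html</link>
  </item>
  <item>
    <title>Second headline</title>
    <link>http://example.org/2.html</link>
  </item>
</rdf:RDF>
END_RSS

# Parse the string; the results are stored inside the object.
my $rss = XML::RSS->new;
$rss->parse($xml);

# Channel fields are available through the channel() accessor,
# and the items are kept as a list of hashes.
my $channel_title = $rss->channel('title');
my @items         = @{ $rss->{'items'} };

print "Channel: $channel_title\n";
foreach my $item (@items) {
    print "  $item->{'title'} => $item->{'link'}\n";
}
```

Note that no XML handling appears in the script beyond the call to parse(); everything else is ordinary hash and list access.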
The Code
Well, let's look at the code, shall we? Lines 16-17 load the XML::RSS and LWP::Simple modules. We've already talked about XML::RSS in brief, but what does LWP::Simple do? Good question! The answer is simple (pun intended). It's a procedural interface for interacting with a Web server. It's also the little cousin of LWP::UserAgent, a fuller object-oriented interface. We'll be using one of the library's subroutines later in the code to fetch an RSS file from the Web.
In lines 20-21 we initialize two variables that we're going to use later.
Line 25 starts the main code body. The first thing we do is verify that the user typed exactly one command-line parameter. This parameter is then assigned to the $arg variable in line 28.
Next we create a new instance of the XML::RSS class and assign the reference to the $rss variable on line 31.
Now we must determine whether the command-line parameter the user entered is an HTTP URL or a file on the local file system (lines 34-46). On line 34, we use a regular expression to look for the characters http:.
If the command-line argument starts with these characters, we can safely assume that the user intends to retrieve an RSS file from a Web server. On line 35 we pass the argument to the get() function, which is part of LWP::Simple, and assign the result to the $content variable. On line 36 we call die() if $content is empty, which means there was an error retrieving the RSS file. If the RSS file was downloaded successfully, $rss->parse($content) is called, which parses the RSS file and stores the results in the object's internal structure (line 38).
If the command-line argument does not contain the http: characters, we assume the argument is a file instead of a URL (lines 41-46). The first thing we do is assign the value of $arg to the $file variable and test for the existence of the file (lines 42-43).
Then we call $rss->parsefile($file) (line 45), which parses the RSS file and stores the results in the object's internal structure. The parsefile() method parses a file, whereas the parse() method parses the string that's passed to it.
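The branching just described can be sketched roughly as follows. This is not the listing from rss2html.pl itself: load_rss is a hypothetical helper name, and anchoring the pattern to the start of the argument is an assumption made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# load_rss is a hypothetical helper, not a subroutine from rss2html.pl.
# It returns a parsed XML::RSS object for either a URL or a local file.
sub load_rss {
    my ($arg) = @_;
    my $rss = XML::RSS->new;

    if ($arg =~ /^http:/i) {
        # Looks like a URL: fetch it, then parse the returned string.
        my $content = get($arg);
        die "Could not retrieve $arg\n" unless $content;
        $rss->parse($content);
    }
    else {
        # Otherwise treat the argument as a path on the local file system.
        die "File \"$arg\" does not exist\n" unless -e $arg;
        $rss->parsefile($arg);
    }
    return $rss;
}

# Usage: perl load_rss.pl <RSS file or URL>
if (@ARGV == 1) {
    my $rss = load_rss($ARGV[0]);
    print $rss->channel('title'), "\n";
}
```

Wrapping the two branches in one subroutine keeps the caller unaware of where the feed came from; either way it gets back an object that is ready to be converted to HTML.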
Lastly, we call the print_html subroutine on line 49, which converts the RSS object into nicely formatted HTML.
print_html
As you examine this subroutine, you will begin to understand the internal structure of the XML::RSS object. The critical portion of the subroutine is contained on lines 76-79. In this foreach loop, we iterate over each of the RSS items.
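The listing for print_html is not reproduced here, so the following is only a sketch of what such a loop might look like. The subroutine name render_html, the HTML layout, and the escape_html helper are assumptions; the hash is hand-built to mimic the channel and items layout of a parsed XML::RSS object, so the sketch runs without fetching a feed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape characters that are special in HTML (hypothetical helper).
sub escape_html {
    my $text = defined $_[0] ? $_[0] : '';
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build an HTML fragment from the channel title and the list of items.
sub render_html {
    my ($rss) = @_;
    my $html = '<h3>' . escape_html($rss->{'channel'}{'title'}) . "</h3>\n<ul>\n";

    # One list entry per RSS item: the title, linked to the item's URL.
    foreach my $item (@{ $rss->{'items'} }) {
        $html .= sprintf qq{<li><a href="%s">%s</a></li>\n},
            escape_html($item->{'link'}), escape_html($item->{'title'});
    }
    return $html . "</ul>\n";
}

# Hand-built stand-in for a parsed feed (made-up data).
my $fake_rss = {
    channel => { title => 'Example & Co.' },
    items   => [
        { title => 'First headline',  link => 'http://example.org/1.html' },
        { title => 'Second headline', link => 'http://example.org/2.html' },
    ],
};

my $html = render_html($fake_rss);
print $html;
```

Returning the fragment as a string, rather than printing inside the loop, makes it easy to redirect the output to a file for inclusion elsewhere.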
Next, let's take a look at rss2html.pl in action.
rss2html.pl in Action
I've added the following cron jobs that run once per hour on the Webreference server (Scheduler is the NT counterpart):
rss2html.pl http://slashdot.org/slashdot.rdf > slashdot.html
rss2html.pl http://freshmeat.net/backend/fm.rdf > freshmeat.html
rss2html.pl http://www.linuxtoday.com/backend/my-netscape.rdf > linuxtoday.html
rss2html.pl http://www.xml.com/xml/news.rdf > xmlnews.html
rss2html.pl http://www.perlxml.com/rdf/moperl.rdf > mop.html
The commands above fetch the RSS files off the Web and convert them to HTML. Using Server-Side Includes (SSI), I've included the results below:
Slashdot:
WiMax Technology Could Blanket the US? Hitchhiker's Guide to the Galaxy Trailer Microsoft Anti-Spyware to Be Free of Charge ACM to Honor TCP/IP Creators with Turing Award New Rules Proposed on Electronic Evidence Intel From Behind the Curtain Kyoto Protocol Comes Into Force Cory Doctorow's 'I, Robot' Posted Straczynski Offers To Re-Boot Star Trek Building The MareNostrum COTS Supercomputer Search Slashdot stories
freshmeat.net announcements (Global)
Zolera SOAP Infrastructure 1.7 (Default branch) XBible 3.0 (Default branch) PDFdirectory 0.2.04 (Default branch) XC-AST 0.7.0 (Default branch) Imagero Reader 1.73 (Default branch) GNU ccAudio2 0.4.0 (Testing branch) quisp 1.27 (Default branch) shsql 1.27 (Default branch) samhain 2.0.4 (Default branch) CANDIDv2 2.40 (Default branch) ADV: Dialing for Dollars libferris 1.1.46 (Default branch) FUDforum 2.6.10 (Stable branch) HORRORss 1.0 (Default branch) Roxen WebServer 4.0.325-release 4 (Default branch) Configuration File Library 1.0 (Default branch) Goggles 0.7.11 (Default branch) Pluto DCE library 2.0.0.9 (Default branch) Pluto Bi-Directional Comm library 2.0.0.9 (Default branch) zen Platform 2.0.4 (Default branch) ADV: Gimme Shelter MIME Email message class 2005.02.15 (Default branch) ELF statifier 1.6.3 (Default branch) SekHost 1.2 (Default branch) ulogd 1.21 (Default branch) Journaled Files LIBrary 0.1.0-0.0.0 (Default branch) FastTemplate.php3 1.2.0 (Default branch) iptables 1.3.0 (Default branch) Very Simple Control Protocol Daemon 0.1.4 (Default branch) C Parameters 0.9.0 (Default branch) ADV: Dialing for Dollars eXtreme Project Management Tool 0.7beta1 (Development branch) gccc 1.099 (Default branch) Magellan Metasearch 1.00-RC3 (Default branch) CAN Abstraction Layer 0.1.4 (Default branch) TreeLine 0.11.1 (Default branch) GNOME Sensors Applet 0.6.1 (Default branch) iODBC Driver Manager and SDK 3.52.2 (Default branch) DISLIN 8.3 (Default branch) Pluto Home 2.0.0.9 (Default branch) ADV: Dialing for Dollars Expense Report Software 1.07 (Default branch) Yzis M3 (Default branch) Q Light Controller 2.4.1 (Default branch) Menc 0.3 (Default branch) Another File Integrity Checker 2.7-0 (Default branch) BibShelf 1.4.0-1 (Default branch) Eleven 1.0 (Default branch) Linice 2.5 (Default branch) JDirt 1.3 (Default branch) ADV: Dialing for Dollars Nazghul 0.4.0 (Default branch) Rush 2005 0.4.10 (Default branch) Monesa 0.24.1 (Stable branch) Persist.NET 0.9.1 beta (Default 
branch) Roundup 0.8 (Default branch) Aquarium Web Application Framework 2.0 (Default branch) sn9c102 Video Grabber 1.7.0 (Default branch) GRAVEMAN 0.3.8 (Default branch) viewurpmi 0.2 (Default branch) ADV: Dialing for Dollars NuFW 1.0-rc1 (Stable branch) OpenSceneGraph Editor 0.6.0 (Default branch) HPGS - HPGl Script 0.6.0 (Default branch) lustre 1.4.1-rc1 (Default branch) IBM HeapAnalyzer 1.3.3 (Default branch) CANDIDv2 2.3.6 (Default branch) NetSPoC 2.5 (Default branch) Metal Mech 0.0.3 (Default branch) radmind 1.5.0 (Default branch) ADV: Dialing for Dollars iPodBackup 1.4 (Default branch) db4o 4.3 (Mono branch) web2ldap 0.15.9 (Default branch) Mantissa 5.6 (Default branch) Drone IRC Bot 1.2 (Default branch) NoFuss POS 0.06 (Default branch) xlog 1.1 (Stable branch) ActiveBPEL 1.0.7 (Default branch) Java Embedded Python 1.1 (Default branch) ADV: Dialing for Dollars Neveredit 0.8 (Default branch) The friendly interactive shell 1.1 (Default branch) Webmatic 2.0.3 (Default branch) JTMOS Operating System Build 7700 (Default branch) BIRD 1.0.10 (Default branch) Tune in 2 Me 050215 (Default branch) HMSCalc 3.0 (Default branch) Information Currency Web Services 0.0.4 (Default branch) Nitro + Og 0.10.0 (Default branch) ADV: Dialing for Dollars Just For Fun Network Management System 0.8.0 (Stable branch) rxvt-unicode 5.1 (Default branch) PHPEmaillist 0.3 (Default branch) ulogd-php 1.0 (Default branch) mod_access_rbl2 1.0 (Default branch) 5lack10.1 0.8 (Default branch) profusemail 0.9.1 (Default branch) |
Linux Today
LWN.net: FSF Announces New Executive Director LinuxPlanet: Novell Takes Enterprise Security Focus CNET News: HP: Don't Like Software Patents? Learn to Deal internetnews.com: CA Chief: Innovate, Cooperate Boston Herald: Linux Show Plans BCEC Move Search Linux Today:
XML.com
Features: Very Dynamic Web Interfaces Features: Comparing CSS and XSL: A Reply from Norm Walsh Features: Top 10 XForms Engines Features: An Introduction to TMAPI XML Tourist: The Silent Soundtrack Transforming XML: The XPath 2.0 Data Model Features: SIMILE: Practical Metadata for the Semantic Web Features: Hacking Open Office Features: Formal Taxonomies for the U.S. Government Features: Reviewing the Architecture of the World Wide Web Features: Printing XML: Why CSS Is Better than XSL Python and XML: Introducing the Amara XML Toolkit Features: Introducing Comega Features: SAML 2: The Building Blocks of Federated Identity The Restful Web: Amazon's Simple Queue Service Copyright 2004, O'Reilly Media, Inc.
Conclusion
Well, we've shown in this column that Perl can really pack a wallop in a short amount of code. With rss2html.pl, anyone can automatically add a news feed to their Web site.
For more information on RSS, you might try visiting the following sites:
- http://my.userland.com
- http://www.scripting.com
- http://www.perlxml.com