Using RSS News Feeds with Perl


Abstract

The Rich Site Summary (RSS) format, previously known as the RDF Site Summary, has quietly become the dominant format for distributing news headlines on the Web.

In this Mother of Perl tutorial, we will write a short Perl script (fewer than 100 lines) that retrieves an XML RSS file from the Web or the local file system and converts it to HTML. Using a Server Side Include (SSI) or similar method, you can easily add news headlines from any number of sources to your Web site.

History

Where did RSS come from, you ask? Netscape invented the RSS format for "channels" on Netscape Netcenter (http://my.netscape.com). It was released to the public in March of 1999. The first non-Netscape Web site to incorporate the new format was Scripting News, a popular technology news site run by Dave Winer, president of Userland Software (think Frontier). Interestingly enough, Scripting News had been using its own XML format, scriptingNews, since December of 1997.

In May of 1999, Dave Winer released a new version of the scriptingNews XML format, which added new content-rich elements. Netscape followed suit by adopting most of the new scriptingNews elements into RSS 0.91, which was released in July of 1999.

Userland Software also rolled out their own flavor of my.netscape.com. If you haven't already guessed, it's available at http://my.userland.com.

As far as I know, RSS is the most widely used XML format on the Web today. RSS headlines are available for many popular news sites like Slashdot, Forbes, and CNET News.com, and the list is growing daily.

In a time when "stickiness" is prized, displaying news headlines on your Web site can give it the extra "oomph" that encourages users to return. After all, users can read your president's bio only so many times.

Required Modules

For rss2html.pl to work on your system, you should have a recent version of Perl installed, 5.003 or better. 5.005 is recommended. You will also need the XML::Parser and XML::RSS modules installed.

To install the modules on a *nix system, type:
perl -MCPAN -e "install XML::Parser"
perl -MCPAN -e "install XML::RSS"

If you're using a win32 machine (Win95/98/NT), you will need a recent installation of ActiveState Perl. If you don't have a recent version, visit http://www.activestate.com.

To install XML::Parser on a win32 machine type:
ppm install XML-Parser

To install XML::RSS on a win32 machine (you must have a C compiler and nmake):

  • Download the module from: http://search.cpan.org/dist/XML-RSS/
  • Uncompress the zip file and cd to the XML-RSS-0.5 directory
  • type: perl Makefile.PL
  • type: nmake
  • type: nmake install

Next, we'll examine the RSS format in more detail.

rss2html.pl Get the source
This script converts an RSS file on the Web or local file system to HTML.

RSS 0.9

The first public version of RSS, 0.9, includes basic headline information. Below is an example RSS file for Freshmeat.net, a popular news site for Linux software:

<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

<channel>
<title>freshmeat.net</title>
<link>http://freshmeat.net</link>
<description>the one-stop-shop for all your Linux software needs</description>
</channel>

<image>
<title>freshmeat.net</title>
<url>http://freshmeat.net/images/fm.mini.jpg</url>
<link>http://freshmeat.net</link>
</image>

<item>
<title>Geheimnis 0.59</title>
<link>http://freshmeat.net/news/1999/06/21/930004162.html</link>
</item>

<item>
<title>Firewall Manager 1.3 PRO</title>
<link>http://freshmeat.net/news/1999/06/21/930004148.html</link>
</item>

<textinput>
<title>quick finder</title>
<description>Use the text input below to search the fresh
meat application database</description>
<name>query</name>
<link>http://core.freshmeat.net/search.php3</link>
</textinput>

</rdf:RDF>

The first major element is channel, which contains the following elements:

  • title - the title of the channel
  • link - the link to the channel Web site
  • description - short description of the channel

An RSS channel may also include an image element, as in the example above, which contains the following elements:

  • title - the text describing the image
  • url - the URL of the image
  • link - the URL that the image is linked to

The item elements carry the real channel content; each one consists of a title and a link element. An RSS file may contain up to 15 items.

An RSS 0.9 file may optionally contain a textinput element, which lets users type a string into an HTML text input field and submit it via the HTTP GET method to the URL specified in the link element.
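To make the element layout concrete, here is a minimal sketch that parses a trimmed copy of the Freshmeat sample with XML::RSS and prints each piece. It assumes XML::RSS is installed and that the parsed data is reachable through the hash keys the module documents (channel, image, items, textinput). The variable name $sample is only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# A shortened copy of the RSS 0.9 sample shown above (illustrative data only).
my $sample = <<'END_RSS';
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">
<channel>
<title>freshmeat.net</title>
<link>http://freshmeat.net</link>
<description>the one-stop-shop for all your Linux software needs</description>
</channel>
<image>
<title>freshmeat.net</title>
<url>http://freshmeat.net/images/fm.mini.jpg</url>
<link>http://freshmeat.net</link>
</image>
<item>
<title>Geheimnis 0.59</title>
<link>http://freshmeat.net/news/1999/06/21/930004162.html</link>
</item>
<item>
<title>Firewall Manager 1.3 PRO</title>
<link>http://freshmeat.net/news/1999/06/21/930004148.html</link>
</item>
<textinput>
<title>quick finder</title>
<description>Use the text input below to search the fresh meat application database</description>
<name>query</name>
<link>http://core.freshmeat.net/search.php3</link>
</textinput>
</rdf:RDF>
END_RSS

# parse() takes the XML document as a string.
my $rss = XML::RSS->new;
$rss->parse($sample);

# channel, image and textinput are hashes; items is a list of hashes.
printf "Channel: %s (%s)\n", $rss->{channel}{title}, $rss->{channel}{link};
printf "Image:   %s\n", $rss->{image}{url};
foreach my $item (@{ $rss->{items} }) {
    printf "Item:    %s -> %s\n", $item->{title}, $item->{link};
}
printf "Search:  %s (field name: %s)\n",
    $rss->{textinput}{link}, $rss->{textinput}{name};
```

Each item is a small hash with title and link keys, which is all that RSS 0.9 provides per headline.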

Next, we will examine RSS 0.91, which was released by Netscape in July of 1999.

RSS 0.91

The latest version of RSS added a few new elements. Below is a sample RSS file from XML.com, an excellent XML resource site:

<?xml version="1.0"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">

<channel>
<title>XML News and Features from XML.com</title>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<link>http://xml.com/pub</link>
<copyright>Copyright 1999, O'Reilly and Associates and Seybold Publications</copyright>
<managingEditor>[email protected] (Dale Dougherty)</managingEditor>
<webMaster>[email protected] (Peter Wiggin)</webMaster>

<image>
<title>XML News and Features from XML.com</title>
<url>http://xml.com/universal/images/xml_tiny.gif</url>
<link>http://xml.com/pub</link>
<width>88</width>
<height>31</height>
</image>

<item>
<title>Issue: XML Data Servers</title>
<link>http://xml.com/pub?wwwrrr_rss</link>
<description>Although not everyone agrees that XML should become a full-fledged data-management discipline, object-database vendors are busy repositioning their object-database products as XML data servers. Jon Udell looks at one of these, Object Design's eXcelon and finds it a solid product.</description>
</item>

<item>
<title>O'Reilly Labs Review: Object Design's eXcelon 1.1</title>
<link>http://xml.com/pub/1999/08/excelon/index.html?wwwrrr_rss</link>
<description>Jon Udell takes a look at eXcelon, Object Design's XML data servers, and explains its user interface and general approach to XML. </description>
</item>

<item>
<title>Report from Montreal</title>
<link>http://xml.com/pub/1999/08/excelon/montreal.html?wwwrrr_rss</link>
<description>Lisa Rein reports from MetaStructures 99 and XML Developers' Day.</description>
</item>

<item>
<title>Reviews: Bluestone Software's XML Suite: Promising App, Rough Around the Edges</title>
<link>http://xml.com/pub/1999/08/bluestone/index.html?wwwrrr_rss</link>
<description>Our reviewer tested Bluestone's XML Suite (XML Server and Visual XML) on the Windows NT platform, simulating a two-way exchange of business information between a book publisher and book stores. The results were encouraging (with a few caveats).</description>
</item>

<item>
<title>Interviews: CBL: Ecommerce Componentry</title>
<link>http://xml.com/pub/1999/08/glushko/glushko.html?wwwrrr_rss</link>
<description>In this audio interview, Bob Glushko of Commerce One talks about the Common Business Library (CBL) as a set of building blocks for XML document types and schemas used in ecommerce.</description>
</item>

<item>
<title>Backends Sharing Data</title>
<link>http://xml.com/pub/1999/08/rpc/index.html?wwwrrr_rss</link>
<description>What if you could script remote procedure calls between web sites as easily as you can between programs? Edd Dumbill shows how it can be done in PHP.</description>
</item>

<item>
<title>Back Issue: XML Suite</title>
<link>http://xml.com/pub/1999/08/18/index.html?wwwrrr_rss</link>
<description> Barry Nance runs Bluestone's XML Suite through the paces. The tools show promise for passing data between databases and XML. But there are still a few kinks to be worked out.</description>
</item>

<item>
<title>Back Issue: XML-RPC</title>
<link>http://xml.com/pub/1999/08/11/index.html?wwwrrr_rss</link>
<description>A major promise of XML is its ability to pass data simply from one place to another, regardless of platform. In this issue, Edd Dumbill shows how to use XML-RPC in PHP to pass data from a web site to a PDA.</description>
</item>

<item>
<title>News: InDelv XML/XSL Client Version 0.4.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-a?wwwrrr_rss</link>
<description> A posting from Rob Brown reports on the public availability of the new InDelv XML Client version 0.4. This version represent an upgrade to InDelv's previously released XML Browser, but "it has been renamed as a 'Client' to reflect the fact that it now contains both an XML/XSL browser and an XML/XSL editor. The browser is available free for all uses. The editor comes packaged with the browser as a demo, which can later be upgraded to a full commercial version. This is a 100% Java appl...
</description>
</item>

<item>
<title>News: OpenJade Development Team Releases OpenJade 1.3pre1 (Beta).</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-g?wwwrrr_rss</link>
<description> A recent posting from Avi Kivity and the OpenJade Development Team announced the release of OpenJade 1.3pre1 (Beta). "OpenJade is the DSSSL user community's open source implementation of DSSSL, Document Style Semantics and Specification Language, an ISO standard for rendering SGML and XML documents. OpenJade is based on James Clark's widely used Jade. OpenJade 1.3pre1 is a more complete implementation of the DSSSL standard, and introduces many new features, including (1) Implementat...
</description>
</item>

<item>
<title>News: IBM XML Parser Update: XML4C2 Version 2.3.1 Released.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-b?wwwrrr_rss</link>
<description> Dean Roddey posted an announcement for the update of XML4C. IBM's XML for C++ parser (XML4C) "is a validating XML parser written in a portable subset of C++. XML4C makes it easy to give an application the ability to read and write XML data. Its two shared libraries provide classes for parsing, generating, manipulating, and validating XML documents. XML4C is faithful to the XML 1.0 Recommendation and associated standards (DOM 1.0, SAX 1.0). Source code, samples and API documentation ...
</description>
</item>

<item>
<title>News: Platform for Privacy Preferences (P3P) Specification Working Draft.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-h?wwwrrr_rss</link>
<description> As part of the W3C P3P Activity, a fifth public working draft of the Platform for Privacy Preferences (P3P) Specification has been published for review by W3C members. The working draft "describes the Platform for Privacy Preferences (P3P). P3P enables Web sites to express their privacy practices and enables users to exercise preferences over those practices. P3P compliant products will allow users to be informed of site practices (in both machine and human readable formats), to deleg...
</description>
</item>

<item>
<title>News: Extended XLink with XSLT.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-c?wwwrrr_rss</link>
<description> Nikita Ogievetsky (President, Cogitech, Inc.) posted an announcement for the availability of slides from the Metastructures '99 presentation "HTML Form Templates with XML. All in One and One for All. XSLT template library for WEB applications." The paper describes building XSLT template library for web applications. The goal was to "demonstrate data processing on the web made easy with XSL transformations: Generate a data maintenance web with data-structure controlled by XML, scree...
</description>
</item>

<item>
<title>News: HyBrick Web Site Reopens.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-d?wwwrrr_rss</link>
<description> A posting from Toshimitsu Suzuki (Fujitsu Laboratories Ltd.) to the XLXP-DEV mailing list recently announced the reopening of the HyBrick Web site. 'HyBrick' is "an advanced SGML/XML browser developed by Fujitsu Laboratories, the research arm of Fujitsu. HyBrick is based on an architecture that supports advanced linking and formatting capabilities. HyBrick includes a DSSSL renderer and XLink/XPointer engine running on top of James Clark's SP and Jade. HyBrick supports: (1) Both v...
</description>
</item>

<item>
<title>News: Extended DocBook Synopses Version 1.0.</title>
<link>http://xml.com/pub/coverpage/newspage.html#ni1999-08-27-e?wwwrrr_rss</link>
<description> Norman Walsh has posted an announcement for a preliminary release of 'Extended DocBook Synopses'. Extended DocBook Synopses is a customization layer that extends DocBook, "adding a function synopsis element, ClassSynopsis for modern, mostly object-oriented, programming languages such as Java, C++, Perl, and IDL." DocBook is an SGML [and XML] DTD maintained by the DocBook Technical Committee of OASIS that particularly well suited to books and papers about computer hardware and softwar...
</description>
</item>

</channel>
</rss>

Notice that there are more descriptive elements for the channel, image, and item elements. These are referred to as "fat elements" because they contain a more detailed description of each channel item.
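As a rough illustration of the "fat" elements, the sketch below parses a single-item excerpt of the XML.com sample and prints the channel language and the item description. It assumes XML::RSS is installed and exposes these fields under the same hash keys as the element names. The DOCTYPE line is left out of the excerpt for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# A one-item excerpt of the RSS 0.91 sample above (illustrative data only).
my $sample = <<'END_RSS';
<?xml version="1.0"?>
<rss version="0.91">
<channel>
<title>XML News and Features from XML.com</title>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<link>http://xml.com/pub</link>
<item>
<title>Report from Montreal</title>
<link>http://xml.com/pub/1999/08/excelon/montreal.html?wwwrrr_rss</link>
<description>Lisa Rein reports from MetaStructures 99 and XML Developers' Day.</description>
</item>
</channel>
</rss>
END_RSS

my $rss = XML::RSS->new;
$rss->parse($sample);

# Channel-level metadata added in 0.91.
printf "%s [%s]\n", $rss->{channel}{title}, $rss->{channel}{language};

# Items may now carry a description; fall back gracefully if one is missing.
foreach my $item (@{ $rss->{items} }) {
    my $desc = defined $item->{description}
             ? $item->{description}
             : '(no description)';
    print "$item->{title}\n  $desc\n";
}
```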

The XML::RSS Module

Now that you've had a chance to glance at two RSS examples, it's time to introduce the XML::RSS module. XML::RSS is a subclass of XML::Parser, a Perl module maintained by Clark Cooper that utilizes James Clark's Expat C library. XML::RSS was developed to simplify the task of manipulating and parsing RSS files. A deep understanding of XML is not a prerequisite for using XML::RSS since the XML details are hidden inside the class interface.

While XML::RSS is capable of creating RSS files, we will be focusing on parsing existing RSS files in this column. You can read more about the capabilities of XML::RSS in the module's documentation by typing:
perldoc XML::RSS

The Code

Well, let's look at the code, shall we? Lines 16-17 load the XML::RSS and LWP::Simple modules. We've already talked about XML::RSS in brief, but what does LWP::Simple do? Good question! The answer is simple (pun intended). It's a procedural interface for interacting with a Web server. It's also the little cousin of LWP::UserAgent, a fuller object-oriented interface. We'll be using one of the library's subroutines later in the code to fetch an RSS file from the Web.

In lines 20-21 we initialize two variables that we're going to use later.

Line 25 starts the main code body. The first thing we do is verify that the user typed exactly one command-line parameter. This parameter is then assigned to the $arg variable in line 28.
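The argument check might look something like the sketch below. The helper name single_arg and the wording of the usage message are assumptions made for illustration; they are not taken from rss2html.pl itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the one expected argument,
# or die with a usage message if the count is wrong.
sub single_arg {
    my @args = @_;
    die "Usage: rss2html.pl (<RSS file> | <URL>)\n" unless @args == 1;
    return $args[0];
}

# In the real script this would be called as single_arg(@ARGV).
my $arg = single_arg('http://slashdot.org/slashdot.rdf');
print "Will process: $arg\n";
```

Dying early with a usage line keeps the rest of the script free of checks for a missing or surplus argument.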

Next we create a new instance of the XML::RSS class and assign the reference to the $rss variable on line 31.

Now we must determine whether the command-line parameter the user entered is an HTTP URL or a file on the local file system (lines 34-46). On line 34, we use a regular expression to look for the characters http:.

If the command-line argument starts with these characters, we can safely assume that the user intends to retrieve an RSS file from a Web server. On line 35 we pass the argument to the get() function, which is a part of LWP::Simple, and assign the results to the $content variable. On line 36 we call die() if $content is empty. If this happens, it means there was an error retrieving the RSS file. If the RSS file was downloaded successfully, $rss->parse($content) is called which parses the RSS file and stores the results in the object's internal structure (line 38).

If the command-line argument does not contain the http: characters, we assume it names a file rather than a URL (lines 41-46). The first thing we do is assign the value of $arg to the $file variable and test for the existence of the file (lines 42-43).

Then we call $rss->parsefile($file) (line 45), which parses the RSS file and stores the results in the object's internal structure. The parsefile() method parses a file, whereas the parse() method parses the string that's passed to it.
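The two branches described above can be sketched as a single helper. The name load_rss and the error messages are assumptions for illustration, and the pattern only matches arguments that begin with http:, so an https: URL would fall through to the file branch unless the pattern is widened. The sketch assumes both XML::RSS and LWP::Simple are installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;
use LWP::Simple qw(get);

# Hypothetical helper covering both branches: returns a parsed XML::RSS object.
sub load_rss {
    my ($arg) = @_;
    my $rss = XML::RSS->new;

    if ($arg =~ /^http:/i) {
        # Remote feed: get() returns undef when the download fails.
        my $content = get($arg);
        die "Could not retrieve $arg\n" unless $content;
        $rss->parse($content);      # parse() takes the XML as a string
    }
    else {
        # Local feed: make sure the file exists before parsing it.
        die "File \"$arg\" does not exist.\n" unless -e $arg;
        $rss->parsefile($arg);      # parsefile() takes a file name
    }
    return $rss;
}

# Example call (the file name fm.rdf is an assumption for illustration):
# my $rss = load_rss('fm.rdf');
# print $rss->{channel}{title}, "\n";
```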

Lastly, we call the print_html subroutine on line 49, which converts the RSS object into nicely formatted HTML.

print_html

As you examine this subroutine, you will begin to understand the internal structure of the XML::RSS object. The critical portion of the subroutine is contained on lines 76-79. In this foreach loop, we iterate over each of the RSS items.
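The item loop can be approximated with the sketch below. It is not the original print_html: the names rss_to_html and escape_html, the markup it emits, and the escaping step are all assumptions. To keep the example self-contained it runs on a plain hash that uses the same channel and items keys that XML::RSS documents, so a parsed $rss object could be passed in its place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build an HTML fragment from a structure shaped like a parsed XML::RSS object.
sub rss_to_html {
    my ($rss) = @_;

    # Channel title, linked to the channel's Web site.
    my $html = sprintf qq{<p><a href="%s">%s</a></p>\n<ul>\n},
        escape_html($rss->{channel}{link}),
        escape_html($rss->{channel}{title});

    # One list entry per headline.
    foreach my $item (@{ $rss->{items} }) {
        $html .= sprintf qq{<li><a href="%s">%s</a></li>\n},
            escape_html($item->{link}),
            escape_html($item->{title});
    }
    return $html . "</ul>\n";
}

# Stand-in data with the same keys XML::RSS exposes (values from the 0.9 sample).
my %feed = (
    channel => { title => 'freshmeat.net', link => 'http://freshmeat.net' },
    items   => [
        { title => 'Geheimnis 0.59',
          link  => 'http://freshmeat.net/news/1999/06/21/930004162.html' },
        { title => 'Firewall Manager 1.3 PRO',
          link  => 'http://freshmeat.net/news/1999/06/21/930004148.html' },
    ],
);
print rss_to_html(\%feed);
```

Returning the HTML as a string rather than printing inside the loop makes it easy to redirect the result to a file or include it in a larger page.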

Next, let's take a look at rss2html.pl in action.

rss2html.pl in Action

I've added the following cron jobs that run once per hour on the Webreference server (Scheduler is the NT counterpart):

rss2html.pl http://slashdot.org/slashdot.rdf > slashdot.html
rss2html.pl http://freshmeat.net/backend/fm.rdf > freshmeat.html
rss2html.pl http://www.linuxtoday.com/backend/my-netscape.rdf > linuxtoday.html
rss2html.pl http://www.xml.com/xml/news.rdf > xmlnews.html
rss2html.pl http://www.perlxml.com/rdf/moperl.rdf > mop.html

The commands above fetch the RSS files off the Web and convert them to HTML. Using Server-Side Includes (SSI), I've included the results below:

Slashdot:

  • WiMax Technology Could Blanket the US?
  • Hitchhiker's Guide to the Galaxy Trailer
  • Microsoft Anti-Spyware to Be Free of Charge
  • ACM to Honor TCP/IP Creators with Turing Award
  • New Rules Proposed on Electronic Evidence
  • Intel From Behind the Curtain
  • Kyoto Protocol Comes Into Force
  • Cory Doctorow's 'I, Robot' Posted
  • Straczynski Offers To Re-Boot Star Trek
  • Building The MareNostrum COTS Supercomputer

freshmeat.net announcements (Global):

  • Zolera SOAP Infrastructure 1.7 (Default branch)
  • XBible 3.0 (Default branch)
  • PDFdirectory 0.2.04 (Default branch)
  • XC-AST 0.7.0 (Default branch)
  • Imagero Reader 1.73 (Default branch)
  • GNU ccAudio2 0.4.0 (Testing branch)
  • quisp 1.27 (Default branch)
  • shsql 1.27 (Default branch)
  • samhain 2.0.4 (Default branch)
  • CANDIDv2 2.40 (Default branch)
  • ADV: Dialing for Dollars
  • libferris 1.1.46 (Default branch)
  • FUDforum 2.6.10 (Stable branch)
  • HORRORss 1.0 (Default branch)
  • Roxen WebServer 4.0.325-release 4 (Default branch)
  • Configuration File Library 1.0 (Default branch)
  • Goggles 0.7.11 (Default branch)
  • Pluto DCE library 2.0.0.9 (Default branch)
  • Pluto Bi-Directional Comm library 2.0.0.9 (Default branch)
  • zen Platform 2.0.4 (Default branch)
  • ADV: Gimme Shelter
  • MIME Email message class 2005.02.15 (Default branch)
  • ELF statifier 1.6.3 (Default branch)
  • SekHost 1.2 (Default branch)
  • ulogd 1.21 (Default branch)
  • Journaled Files LIBrary 0.1.0-0.0.0 (Default branch)
  • FastTemplate.php3 1.2.0 (Default branch)
  • iptables 1.3.0 (Default branch)
  • Very Simple Control Protocol Daemon 0.1.4 (Default branch)
  • C Parameters 0.9.0 (Default branch)
  • ADV: Dialing for Dollars
  • eXtreme Project Management Tool 0.7beta1 (Development branch)
  • gccc 1.099 (Default branch)
  • Magellan Metasearch 1.00-RC3 (Default branch)
  • CAN Abstraction Layer 0.1.4 (Default branch)
  • TreeLine 0.11.1 (Default branch)
  • GNOME Sensors Applet 0.6.1 (Default branch)
  • iODBC Driver Manager and SDK 3.52.2 (Default branch)
  • DISLIN 8.3 (Default branch)
  • Pluto Home 2.0.0.9 (Default branch)
  • ADV: Dialing for Dollars
  • Expense Report Software 1.07 (Default branch)
  • Yzis M3 (Default branch)
  • Q Light Controller 2.4.1 (Default branch)
  • Menc 0.3 (Default branch)
  • Another File Integrity Checker 2.7-0 (Default branch)
  • BibShelf 1.4.0-1 (Default branch)
  • Eleven 1.0 (Default branch)
  • Linice 2.5 (Default branch)
  • JDirt 1.3 (Default branch)
  • ADV: Dialing for Dollars
  • Nazghul 0.4.0 (Default branch)
  • Rush 2005 0.4.10 (Default branch)
  • Monesa 0.24.1 (Stable branch)
  • Persist.NET 0.9.1 beta (Default branch)
  • Roundup 0.8 (Default branch)
  • Aquarium Web Application Framework 2.0 (Default branch)
  • sn9c102 Video Grabber 1.7.0 (Default branch)
  • GRAVEMAN 0.3.8 (Default branch)
  • viewurpmi 0.2 (Default branch)
  • ADV: Dialing for Dollars
  • NuFW 1.0-rc1 (Stable branch)
  • OpenSceneGraph Editor 0.6.0 (Default branch)
  • HPGS - HPGl Script 0.6.0 (Default branch)
  • lustre 1.4.1-rc1 (Default branch)
  • IBM HeapAnalyzer 1.3.3 (Default branch)
  • CANDIDv2 2.3.6 (Default branch)
  • NetSPoC 2.5 (Default branch)
  • Metal Mech 0.0.3 (Default branch)
  • radmind 1.5.0 (Default branch)
  • ADV: Dialing for Dollars
  • iPodBackup 1.4 (Default branch)
  • db4o 4.3 (Mono branch)
  • web2ldap 0.15.9 (Default branch)
  • Mantissa 5.6 (Default branch)
  • Drone IRC Bot 1.2 (Default branch)
  • NoFuss POS 0.06 (Default branch)
  • xlog 1.1 (Stable branch)
  • ActiveBPEL 1.0.7 (Default branch)
  • Java Embedded Python 1.1 (Default branch)
  • ADV: Dialing for Dollars
  • Neveredit 0.8 (Default branch)
  • The friendly interactive shell 1.1 (Default branch)
  • Webmatic 2.0.3 (Default branch)
  • JTMOS Operating System Build 7700 (Default branch)
  • BIRD 1.0.10 (Default branch)
  • Tune in 2 Me 050215 (Default branch)
  • HMSCalc 3.0 (Default branch)
  • Information Currency Web Services 0.0.4 (Default branch)
  • Nitro + Og 0.10.0 (Default branch)
  • ADV: Dialing for Dollars
  • Just For Fun Network Management System 0.8.0 (Stable branch)
  • rxvt-unicode 5.1 (Default branch)
  • PHPEmaillist 0.3 (Default branch)
  • ulogd-php 1.0 (Default branch)
  • mod_access_rbl2 1.0 (Default branch)
  • 5lack10.1 0.8 (Default branch)
  • profusemail 0.9.1 (Default branch)

Linux Today:

  • LWN.net: FSF Announces New Executive Director
  • LinuxPlanet: Novell Takes Enterprise Security Focus
  • CNET News: HP: Don't Like Software Patents? Learn to Deal
  • internetnews.com: CA Chief: Innovate, Cooperate
  • Boston Herald: Linux Show Plans BCEC Move

XML.com:

  • Features: Very Dynamic Web Interfaces
  • Features: Comparing CSS and XSL: A Reply from Norm Walsh
  • Features: Top 10 XForms Engines
  • Features: An Introduction to TMAPI
  • XML Tourist: The Silent Soundtrack
  • Transforming XML: The XPath 2.0 Data Model
  • Features: SIMILE: Practical Metadata for the Semantic Web
  • Features: Hacking Open Office
  • Features: Formal Taxonomies for the U.S. Government
  • Features: Reviewing the Architecture of the World Wide Web
  • Features: Printing XML: Why CSS Is Better than XSL
  • Python and XML: Introducing the Amara XML Toolkit
  • Features: Introducing Comega
  • Features: SAML 2: The Building Blocks of Federated Identity
  • The Restful Web: Amazon's Simple Queue Service

Copyright 2004, O'Reilly Media, Inc.

Conclusion

Well, we've shown in this column that Perl can really pack a wallop in a short amount of code. With rss2html.pl, anyone can automatically add a news feed to their Web site.

For more information on RSS, you might try visiting the following sites:

  • http://my.userland.com
  • http://www.scripting.com
  • http://www.perlxml.com
