Recently a few people have asked me if I had a script that could store the content of a RSS feeds in a public folder. Initially I was puzzled as to why you would want to do this but when you start to look at the problems RSS can cause in large networks it started to make a lot more sense. After going though the issues of building the script and testing how some RSS aggregators work and the way different people support publishing feeds it became a lot clearer that RSS as a standard can cause a lot of problems. I guess because of the fast pace of RSS adoption its shown up holes in the initial design. If your interested do a search on Google for bandwidth usage of RSS the Blogspear has been bashing this out for the last couple of years.
An overview of what this script does is it takes a RSS feed and a pubic folder as command-line parameter and then synchronizes the content of the feed with a public folder by creating or modifying posts. The script uses the Msxml2.XMLHTTP.4.0 object to access the RSS feeds. The main reason this object was used vs others is that it supports decompressing gzip content automatically. To create the post in the public folder CDOEX is used because of this fact the script must be run locally on an Exchange server where there is an instance of the public folder you want to create the feed in.
Keeping the Bandwidth Lean
This was the most challenging part of this script initially I was just pulling down the whole feed to work out if anything had changed. The problem is that doing this several times a day over a lot of feeds meant you start consuming a lot of bandwidth. The solution to fixing this was two fold the first was to use conditional gets. A Conditional get allows you to make a normal get request with the addition of two headers If-Modified-Since and If-None-Match that means if the content has not changed since the last request it will return a status of 304 and no content. To use a conditional get the values from the previous get request must be stored to do this the script creates a custom property on the public folder itself named after the URL of the blog your aggregating. The value of the Last-Modified and Etag headers are stored within this property and used on future requests.
The other thing that is done to attempt to kept the bandwidth used to a minimum is to request http compression be used. For this the Accept-Encoding header is used. With the amount of bloat in XML feeds this can have quite a large saving during the initial synchronization of feeds.
Unfortunately some content providers don’t support either of the standards most do support conditional get (although I did find a number that didn’t). I only found around 40% of the blogs I tried supported compression.
Reading the feed’s XML
This was the second most challenging part of the script dealing with all the different formats that syndication feeds come in. There are 3 main feed formats that are used Atom, RSS version 2.0 and RSS 1.0 RDF feeds. The real pain comes from the fact that most elements in a feed are optional so when you’re trying to read a lot of feeds from different sources you can never be to sure what elements are going be used in a feed. For example pubdate is an option element most rss feeds have it but some don’t. Without pubdate working out if an item is new in a feed creates a bit of a problem. Atom feeds are a lot better but they still have a lot of optional elements and the way in which the content is published in an atom feed can also vary (especially the content fields). Basically to parse this there are three separate subs in the script that handle the different feeds and do a best effort to work out if a post has been changed based on whether a date can be retrieved. This is one part of the script that may need re-engineering for it to support different type of feeds you wish to argregate.
The last section of the code does the synchronization with a public folder. The Sync basically works by trying to use one of unique elements from the Feed entry to create a Href value in a public folder. If its comes down to not being able to work out if a item has been modified or not the createpost function will try to open a the item at calculated href if this fails it will instead create a new item. If the item does open it will do a comparison of the bodytext to detect if any changes have been made and update the post if necessary.
Running the script
To run the script you need to give it the name of the blog you want to aggregate as the first command line parameter and the name of the public folder as second command line parameter eg (to aggregate this blog to a public folder called rssfeed you would do)
cscript readfeed.vbs "http://gsexdev.blogspot.com/atom.xml" http://servername/public/rssFeeds
The script is designed so you can have multiple feeds being feed into one public folder and they shouldn’t affect each other. (I’ve got up to 15 going into one folder). As the script runs its writes a fairly verbose log to a logfile at c:\temp\rssfeedlog.txt this can be used to help diagnose problems with the script.
The script is a little on the large side to post verbatim (around 450 lines) I’ve put a downloadable copy of the script here.
If you wish to aggregate a number of blogs there are a few options when running the script. The first is use a batch file and include a line for each blog you want to aggregate. Jörg-Stefan Sell has also come with another great idea which is to create a script that reads a XML config file which contains the blogs and the public folders you want to aggregate and then it shells out to the readfeed script. You can download a copy of Jörg-Stefan script here.
Special thanks to Bill Pogue from Aztec Systems, Inc. for his help with the idea and the code.