Tuesday, February 07, 2006

Aggregating RSS feeds into a public folder via a script

Recently a few people have asked me if I had a script that could store the content of RSS feeds in a public folder. Initially I was puzzled as to why you would want to do this, but when you start to look at the problems RSS can cause on large networks it makes a lot more sense. After going through the process of building the script, testing how some RSS aggregators work and seeing the different ways people publish feeds, it became a lot clearer that RSS as a standard can cause a lot of problems. I guess the fast pace of RSS adoption has shown up holes in the initial design. If you're interested, do a search on Google for the bandwidth usage of RSS; the blogosphere has been bashing this out for the last couple of years.

The Script

An overview of what this script does: it takes an RSS feed and a public folder as command-line parameters and then synchronizes the content of the feed with the public folder by creating or modifying posts. The script uses the Msxml2.XMLHTTP.4.0 object to access the RSS feeds. The main reason this object was used over the alternatives is that it supports decompressing gzip content automatically. CDOEX is used to create the posts in the public folder; because of this, the script must be run locally on an Exchange server that holds an instance of the public folder you want the feed created in.
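To give an idea of the fetch side of this, here is a cut-down sketch (not the readfeed.vbs code itself; the feed URL is just an example):

' Minimal sketch: download a feed with Msxml2.XMLHTTP.4.0
Dim req
Set req = CreateObject("Msxml2.XMLHTTP.4.0")
req.Open "GET", "http://gsexdev.blogspot.com/atom.xml", False
req.Send
If req.Status = 200 Then
    WScript.Echo "Feed retrieved, " & Len(req.responseText) & " characters"
End If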

Keeping the Bandwidth Lean

This was the most challenging part of the script. Initially I was just pulling down the whole feed to work out if anything had changed. The problem is that doing this several times a day over a lot of feeds means you start consuming a lot of bandwidth. The solution was twofold. The first part was to use conditional gets. A conditional get lets you make a normal get request with the addition of two headers, If-Modified-Since and If-None-Match, which means that if the content has not changed since the last request the server returns a status of 304 and no content. To use a conditional get, the values from the previous get request must be stored; to do this the script creates a custom property on the public folder itself, named after the URL of the blog you're aggregating. The values of the Last-Modified and ETag headers are stored in this property and used on future requests.
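A sketch of what the conditional get looks like (the stored values shown here are placeholders; in the script they come from the custom property on the public folder):

' Sketch of a conditional get using the validators saved from the last run
Dim req, lastModified, lastEtag
lastModified = "Sat, 01 Apr 2006 06:16:07 GMT"   ' placeholder value
lastEtag = """abc123"""                          ' placeholder value
Set req = CreateObject("Msxml2.XMLHTTP.4.0")
req.Open "GET", "http://gsexdev.blogspot.com/atom.xml", False
If lastModified <> "" Then req.setRequestHeader "If-Modified-Since", lastModified
If lastEtag <> "" Then req.setRequestHeader "If-None-Match", lastEtag
req.Send
If req.Status = 304 Then
    WScript.Echo "Feed not modified - nothing to download"
Else
    ' Store the new validators for the next run
    lastModified = req.getResponseHeader("Last-Modified")
    lastEtag = req.getResponseHeader("ETag")
End If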

The other thing that is done to try to keep the bandwidth used to a minimum is to request that HTTP compression be used. For this the Accept-Encoding header is used. With the amount of bloat in XML feeds this can give quite a large saving during the initial synchronization of feeds.
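On the same request object as above this is just one extra header; the Msxml2.XMLHTTP.4.0 object inflates the gzipped response transparently, so no extra decoding code is needed (again, only a sketch):

' Ask the server to gzip the feed before sending the request
req.setRequestHeader "Accept-Encoding", "gzip"
' After the response comes back you can see whether compression was actually used
Dim contentEnc
contentEnc = req.getResponseHeader("Content-Encoding") & ""   ' & "" makes a missing header an empty string
If LCase(contentEnc) = "gzip" Then WScript.Echo "Compressed Stream: " & contentEnc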

Unfortunately, some content providers don't support either of these standards. Most do support conditional gets (although I did find a number that didn't), but only around 40% of the blogs I tried supported compression.

Reading the feed’s XML

This was the second most challenging part of the script: dealing with all the different formats that syndication feeds come in. There are three main feed formats in use: Atom, RSS 2.0 and RSS 1.0 RDF feeds. The real pain comes from the fact that most elements in a feed are optional, so when you're trying to read a lot of feeds from different sources you can never be too sure what elements are going to be used. For example, pubDate is an optional element; most RSS feeds have it, but some don't. Without pubDate, working out whether an item in a feed is new becomes a bit of a problem. Atom feeds are a lot better, but they still have a lot of optional elements and the way content is published in an Atom feed can also vary (especially the content fields). To parse all of this there are three separate subs in the script that handle the different feed formats and make a best effort to work out whether a post has changed, based on whether a date can be retrieved. This is one part of the script that may need re-engineering to support other types of feeds you wish to aggregate.
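To give a flavour of the problem, here is a simplified sketch (not the three subs from the script; it assumes the feed XML from a fetch like the ones above is in req.responseText):

' Rough sketch: work out the feed format and make a best effort to find a date element
Dim xmlDoc, itemNodes
Set xmlDoc = CreateObject("Msxml2.DOMDocument.4.0")
xmlDoc.async = False
xmlDoc.loadXML req.responseText

If LCase(xmlDoc.documentElement.nodeName) = "feed" Then
    Set itemNodes = xmlDoc.getElementsByTagName("entry")   ' Atom
Else
    Set itemNodes = xmlDoc.getElementsByTagName("item")    ' RSS 2.0 or RSS 1.0 RDF
End If

' pubDate (RSS 2.0), dc:date (RDF) and issued/modified/updated (Atom) are all optional,
' so just take whichever one the feed happens to include
Function GetItemDate(itemNode)
    Dim child
    GetItemDate = ""
    For Each child In itemNode.childNodes
        Select Case LCase(child.nodeName)
            Case "pubdate", "dc:date", "issued", "modified", "updated"
                GetItemDate = child.text
                Exit Function
        End Select
    Next
End Function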

The last section of the code does the synchronization with the public folder. The sync basically works by using one of the unique elements from the feed entry to create an Href value in the public folder. If it isn't possible to work out whether an item has been modified, the createpost function will try to open the item at the calculated Href; if this fails it will create a new item instead. If the item does open, it does a comparison of the body text to detect whether any changes have been made and updates the post if necessary.
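A heavily simplified sketch of that open-or-create logic with CDOEX follows; the folder URL, item name, subject and body here are placeholders, and the real createpost function builds the Href from a unique element of the feed entry and handles a lot more detail:

' Sketch: open an existing post at the calculated Href, or create it if it isn't there
Const adModeReadWrite = 3
Dim postUrl, newBody, msgObj
postUrl = "file://./backofficestorage/yourdomain.com/Public Folders/rssFeeds/someuniqueid.eml"
newBody = "Entry content would go here"

Set msgObj = CreateObject("CDO.Message")
On Error Resume Next
msgObj.DataSource.Open postUrl, , adModeReadWrite
If Err.Number <> 0 Then
    ' Nothing at that Href yet - create a new post
    Err.Clear
    On Error GoTo 0
    msgObj.Subject = "Feed item title"
    msgObj.TextBody = newBody
    msgObj.Fields("urn:schemas:httpmail:read") = False
    msgObj.Fields.Update
    msgObj.DataSource.SaveTo postUrl
Else
    On Error GoTo 0
    ' The item already exists - update it only if the body text has changed
    If msgObj.TextBody <> newBody Then
        msgObj.TextBody = newBody
        msgObj.Fields("urn:schemas:httpmail:read") = False
        msgObj.Fields.Update
        msgObj.DataSource.Save
    End If
End If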

Running the script

To run the script you need to give it the feed URL of the blog you want to aggregate as the first command-line parameter and the URL of the public folder as the second, e.g. to aggregate this blog into a public folder called rssFeeds you would do:

cscript readfeed.vbs "http://gsexdev.blogspot.com/atom.xml" http://servername/public/rssFeeds

The script is designed so you can have multiple feeds being fed into one public folder without them affecting each other (I've got up to 15 going into one folder). As the script runs it writes a fairly verbose log to a logfile at c:\temp\rssfeedlog.txt, which can be used to help diagnose problems with the script.


The script is a little on the large side to post verbatim (around 450 lines), so I've put a downloadable copy of the script here.

If you wish to aggregate a number of blogs there are a few options when running the script. The first is to use a batch file with a line for each blog you want to aggregate. Jörg-Stefan Sell has also come up with another great idea: a script that reads an XML config file containing the blogs and the public folders you want to aggregate and then shells out to the readfeed script. You can download a copy of Jörg-Stefan's script here.
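The batch-file option is just one readfeed line per feed, for example (the feed URLs and folder name are only illustrative):

cscript readfeed.vbs "http://gsexdev.blogspot.com/atom.xml" http://servername/public/rssFeeds
cscript readfeed.vbs "http://blogs.msdn.com/MainFeed.aspx?GroupID=2" http://servername/public/rssFeeds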

Special thanks to Bill Pogue from Aztec Systems, Inc. for his help with the idea and the code.

14 comments:

Colin Walker said...

Excellent Glen! This is something I've been wondering about for a while.

A server based RSS solution is ideal for cutting down bandwidth usage in an Enterprise environment - just have one machine checking feeds so that they are available for all.

It also means that the feeds can be constantly checked without having to worry about having a client application open which is a big pain.

Thanks for this, I'll certainly be testing it :)

Colin Walker said...

Hi Glen,

This is working nicely but just a couple of questions:

1 - Are the items supposed to be marked as unread? If not, how can this be achieved?

2 - When receiving updates you compare the body of the item, but is it possible to also change the timestamp and then also mark it as unread?

Glen said...

Hi Colin,

Thanks for the feedback

1. That's a great idea and very easy to do; it only requires the addition of msgobj.fields("urn:schemas:httpmail:read") = false at three places in the script. I've updated the download to do this and it seems to work fine.
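For anyone curious, the change amounts to something like this at each of the three save points (just a sketch; msgobj is the CDO.Message object the script already uses):

' Mark the post as unread before it is saved
msgobj.fields("urn:schemas:httpmail:read") = false
msgobj.fields.update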

2. The timestamps of the actual posts were one problem I couldn't really solve with CDOEX. Initially I wanted the creation time to reflect the creation time of the RSS post if it was available, but these fields are read-only for a number of reasons, so the actual time the script runs will always be used. Setting the post as unread is easy; I've already included the change for this. If you just wanted to make the timestamp newer, what you could do once you have worked out the item is newer is delete the item and then recreate it.

Cheers
Glen

Colin Walker said...

Marvellous :)

I'll get the new version up and running straight away. If I think of anything else I'll let you know.

Ronny Ong said...

This is great, but it probably doesn't make Hexamail too happy. They sell a product implementing this functionality for $299. Admittedly, RSS2Exchange is more polished and incorporates additional features, but your aggregator script can be combined with your RSS feed event sink from 2004 and another free tool from rssextender.com to achieve most of the publishing functionality and HTML scraping functionality of RSS2Exchange.

It's too bad that Public Folders won't live past Exchange 2003. I'm a bit of a packrat when it comes to feeds. I subscribe to plenty, and I like to retain articles forever. If you've ever tried this with most feed reader client apps, you know that absolutely all of them are designed with the assumption that nobody would want to retain more than a few weeks or months worth of history. Therefore, they tend to use stores which bog down once you get around 15,000 total articles or 50MB of feed data. I haven't found any readers which won't keel over and become completely unusable after 30,000 articles or 100MB.

In contrast, 30,000 items or 100MB is trivial to the Exchange store. That's like a single day's worth of spam for many organizations!

Anonymous said...

Is there any way to configure this in a single-server site using Forms-based Auth?

Glen said...

This script will run fine in a single-server site; there is no real difference here. This is not a WebDAV script, so FBA has no real bearing authentication-wise. The script itself uses ExOLEDB to access the Exchange store and create posts, which means it needs to be run locally on the server. The ExOLEDB code could be converted to WebDAV, but it's a lot of time and effort to do this.

Anonymous said...

Thanks for confirming. I'm just getting the following errors, which is why I ask (I should have posted this before!):

*****Version 1.4*****
http://feeds.feedburner.com/arstechnica/BAaf
Public Folder Property Value:
Full Sync
200
Compressed Stream: gzip
Sat, 01 Apr 2006 06:16:07 GMT
eRXhfimwQVA+HaRaS73llwoS4oE
*****Version 1.4*****
http://blogs.msdn.com/MainFeed.aspx?GroupID=2
Public Folder Property Value:
Full Sync
200
Compressed Stream: gzip
Sat, 01 Apr 2006 21:01:00 GMT
4/1/2006 4:01:00 PM

Nothing ever appears in the PF I've specified. The Public VD does require HTTPS if that makes a difference (I've specified with and without in the feedlist.xml using the master script).

Glen said...

I've tested both those feeds with the script and they worked okay. Some things to check: because the script uses the Internet Explorer browser control, if the browser is configured to work offline this can cause the script to fail. The other major thing to check is permissions on the public folder; make sure the account you're running the script under has MAPI owner rights on the public folder. Just because you're using an admin account doesn't mean this is the case; you need to check and set this using Outlook.

Cheers
Glen

James said...

First off, awesome script!
I think the "If-Modified-Since" and "If-None-Match" headers are keeping this script from updating my public folders. The first time I run it it updates fine, but it won't update after that. The sites are getting new content but the script is still logging no updates. Any suggestions?

Glen said...

Are they public sites you're having a problem with? If so, can you email me the links at glenscales@yahoo.com and I can give them a test. It could be to do with the type of servers that are serving the pages or maybe the way they are clustered. You should see a verbose output of what's happening; are the timestamps getting updated? It could also be one of the functions failing; you could try remming out On Error Resume Next in the script and see if it generates any errors anywhere.

RealityMasque said...

Hi Glen,

I'm developing a reader, & it requires that we know if a feed's been updated, an item is new, and an item is updated. As you said, the date fields in the RSS standards are often optional. Supposing that the date fields ARE missing, can you suggest a means of tracking those requirements?

- O8

Glen said...

This is challenging if there isn't any date field you can use. The way I would do it would be to check the length of the entry; it would be unusual for a modified entry to maintain the same character count. The other thing you could do, which would be fancier, would be to hash the entry and then compare the hashes to work out if an update has happened.
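Something along these lines would do it (just a rough sketch; newEntryText, storedLength and storedChecksum are placeholders for whatever you keep from the previous run, and a real implementation might use a proper hash instead of this crude checksum):

' Very crude change detection for items with no date element
Function SimpleChecksum(text)
    Dim i
    SimpleChecksum = 0
    For i = 1 To Len(text)
        SimpleChecksum = (SimpleChecksum * 31 + AscW(Mid(text, i, 1))) Mod 1000003
    Next
End Function

If Len(newEntryText) <> storedLength Or SimpleChecksum(newEntryText) <> storedChecksum Then
    ' Treat the item as updated
End If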

Cheers
Glen

Anonymous said...

Hi Glen

I was looking for a way to redirect an RSS feed to a DL in Exchange 2007, if that is possible.

It was a real help running your script on the PF.

Thank you

Niloy