
Aggregating RSS feeds into a public folder via a script

Recently a few people have asked me whether I had a script that could store the content of an RSS feed in a public folder. Initially I was puzzled as to why you would want to do this, but when you look at the problems RSS can cause in large networks it starts to make a lot more sense. After working through the issues of building the script, testing how some RSS aggregators work and seeing the different ways people publish feeds, it became a lot clearer that RSS as a standard can cause a lot of problems. I guess the fast pace of RSS adoption has shown up holes in the initial design. If you’re interested, do a search on Google for bandwidth usage of RSS; the blogosphere has been bashing this out for the last couple of years.

The Script

In overview, the script takes an RSS feed and a public folder as command-line parameters and then synchronizes the content of the feed with the public folder by creating or modifying posts. The script uses the Msxml2.XMLHTTP.4.0 object to access the RSS feeds. The main reason this object was used over others is that it supports decompressing gzip content automatically. CDOEX is used to create the posts in the public folder, which means the script must be run locally on an Exchange server that holds an instance of the public folder you want to create the feed in.

Keeping the Bandwidth Lean

This was the most challenging part of the script. Initially I was just pulling down the whole feed to work out whether anything had changed. The problem is that doing this several times a day over a lot of feeds consumes a lot of bandwidth. The solution was twofold. The first part was to use conditional GETs. A conditional GET is a normal GET request with two additional headers, If-Modified-Since and If-None-Match; if the content has not changed since the last request, the server returns a status of 304 and no content. To use a conditional GET, the values from the previous GET response must be stored. To do this the script creates a custom property on the public folder itself, named after the URL of the blog you’re aggregating. The values of the Last-Modified and ETag headers are stored in this property and sent on future requests.
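As a rough illustration of the conditional GET idea, here is a minimal sketch in Python rather than the VBScript the script is actually written in. The function names and the `validators` dictionary (standing in for the custom property on the public folder) are assumptions made for the example, not part of the original script.

```python
import urllib.error
import urllib.request


def conditional_headers(validators):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    return headers


def fetch_if_changed(url, validators):
    """Return (body, new_validators); body is None when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(validators))
    try:
        with urllib.request.urlopen(request) as response:
            new_validators = {
                "last_modified": response.headers.get("Last-Modified"),
                "etag": response.headers.get("ETag"),
            }
            return response.read(), new_validators
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified: nothing to download, keep old validators
            return None, validators
        raise
```

On the first run `validators` would be empty, so a plain GET is sent; on later runs the saved values are passed back and an unchanged feed costs only the response headers.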

The other thing done to keep bandwidth use to a minimum is to request HTTP compression by sending the Accept-Encoding header. With the amount of bloat in XML feeds, this can give quite a large saving during the initial synchronization of feeds.

Unfortunately some content providers don’t support either of these mechanisms. Most do support conditional GET (although I did find a number that didn’t), but only around 40% of the blogs I tried supported compression.
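Because not every server honours the compression request, the client has to cope with both compressed and plain responses. The original script gets this for free from Msxml2.XMLHTTP.4.0; the Python sketch below only illustrates the same idea by hand, and `REQUEST_HEADERS` and `decode_body` are names invented for the example.

```python
import gzip

# Sent with every request to ask the server for a gzip-compressed response.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(raw_body, content_encoding):
    """Decompress the body only when the server says it actually used gzip."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw_body)
    return raw_body  # server ignored the request and sent plain content
```

The value passed as `content_encoding` would be the response's Content-Encoding header, which is absent when the server does not compress.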

Reading the feed’s XML

This was the second most challenging part of the script: dealing with all the different formats that syndication feeds come in. There are three main feed formats in use: Atom, RSS 2.0 and RSS 1.0 (RDF). The real pain comes from the fact that most elements in a feed are optional, so when you’re trying to read a lot of feeds from different sources you can never be too sure which elements are going to be used in a feed. For example, pubDate is an optional element; most RSS feeds have it but some don’t. Without pubDate, working out whether an item in a feed is new becomes a bit of a problem. Atom feeds are a lot better, but they still have a lot of optional elements, and the way the content is published in an Atom feed can also vary (especially the content fields). To parse all this there are three separate subs in the script that handle the different feed types and make a best effort to work out whether a post has changed, based on whether a date can be retrieved. This is one part of the script that may need re-engineering to support the particular types of feeds you wish to aggregate.
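To make the format problem concrete, the following Python sketch tells the three feed types apart by their root element and looks for a date on an item without assuming one exists. It is an illustration only, not the logic of the three subs in the script; the helper names and the list of date element names are assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def detect_format(root):
    """Classify a parsed feed by the name of its root element."""
    name = local_name(root.tag)
    if name == "feed":
        return "atom"
    if name == "RDF":
        return "rss1"
    if name == "rss":
        return "rss2"
    return "unknown"


def item_date(item):
    """Return the first date-like child text found, or None if the feed omits it."""
    wanted = {"pubDate", "date", "updated", "modified", "published"}
    for child in item:
        if local_name(child.tag) in wanted and child.text:
            return child.text.strip()
    return None
```

When `item_date` returns None the caller has no date to compare, which is exactly the case where a fallback such as comparing the body text is needed.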

The last section of the code does the synchronization with the public folder. The sync works by using one of the unique elements from the feed entry to create an href value in the public folder. If it cannot work out whether an item has been modified, the createpost function tries to open the item at the calculated href; if this fails, it creates a new item instead. If the item does open, it compares the body text to detect whether any changes have been made and updates the post if necessary.
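The create-or-update decision can be sketched in a few lines of Python. Here a plain dictionary stands in for the public folder, and the href is derived by hashing the unique value; both are assumptions for the sake of a self-contained example, and the real script builds its href and opens items through CDOEX instead.

```python
import hashlib


def item_href(folder_url, unique_value):
    """Derive a stable item URL from a unique feed value (guid, id or link)."""
    digest = hashlib.sha1(unique_value.encode("utf-8")).hexdigest()
    return f"{folder_url.rstrip('/')}/{digest}.eml"


def sync_item(store, folder_url, unique_value, body_text):
    """Create, update or skip one post; returns the action taken."""
    href = item_href(folder_url, unique_value)
    existing = store.get(href)
    if existing is None:       # nothing at the calculated href: new item
        store[href] = body_text
        return "created"
    if existing != body_text:  # item exists but the body differs
        store[href] = body_text
        return "updated"
    return "unchanged"
```

Because the href depends only on the entry's unique value, running the sync repeatedly over the same feed does not create duplicate posts.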

Running the script

To run the script, give it the URL of the feed you want to aggregate as the first command-line parameter and the URL of the public folder as the second. For example, to aggregate this blog into a public folder called rssFeeds you would run:

cscript readfeed.vbs "http://gsexdev.blogspot.com/atom.xml" http://servername/public/rssFeeds

The script is designed so that you can have multiple feeds being fed into one public folder without them affecting each other (I’ve got up to 15 going into one folder). As the script runs it writes a fairly verbose log to c:\temp\rssfeedlog.txt, which can be used to help diagnose problems with the script.


The script is a little on the large side to post verbatim (around 450 lines), so I’ve put a downloadable copy of the script here.

If you wish to aggregate a number of blogs there are a few options when running the script. The first is to use a batch file and include a line for each blog you want to aggregate. Jörg-Stefan Sell has also come up with another great idea, which is to create a script that reads an XML config file containing the blogs and the public folders you want to aggregate and then shells out to the readfeed script. You can download a copy of Jörg-Stefan’s script here.
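The config-file approach might look something like the Python sketch below. The `<feeds>`/`<feed>` layout and its `url` and `folder` attributes are a made-up format for illustration, not the format Jörg-Stefan’s script actually reads; only the readfeed.vbs command line comes from this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout: one <feed> element per blog/folder pair.
SAMPLE_CONFIG = """
<feeds>
  <feed url="http://gsexdev.blogspot.com/atom.xml"
        folder="http://servername/public/rssFeeds" />
</feeds>
"""


def build_commands(config_xml, script="readfeed.vbs"):
    """Return one cscript argument list per <feed> element in the config."""
    root = ET.fromstring(config_xml.strip())
    return [
        ["cscript", script, feed.get("url"), feed.get("folder")]
        for feed in root.findall("feed")
    ]
```

Each returned list could then be handed to something like `subprocess.run` on the Exchange server, one feed at a time.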

Special thanks to Bill Pogue from Aztec Systems, Inc. for his help with the idea and the code.

All sample scripts and source code are provided for illustrative purposes only. All examples are untested in different environments and therefore I cannot guarantee or imply reliability, serviceability, or function of these programs.

All code contained herein is provided to you "AS IS" without any warranties of any kind. The implied warranties of non-infringement, merchantability and fitness for a particular purpose are expressly disclaimed.