Every couple of months, there is another uprising about the bandwidth usage of RSS…the most recent one has been going on in the last couple of days, and this post from Robert Scoble is right in the middle of it, along with its associated comments. In another post, he even says “RSS is broken.”
As you could probably surmise, NewsGator’s own RSS feeds (such as the News/Updates feed) generate an enormous amount of traffic. This isn’t unexpected, and our network is designed for this…but I understand what people are seeing with their feeds. We use HTTP caching mechanisms to dramatically reduce the total bandwidth requirements, and other internal caching mechanisms to reduce overall server load.
90% of the discussion on the bandwidth issue centers around RSS aggregators, and how they are allegedly abusing servers relentlessly. Robert makes a rough estimate that hits increase 20x by having an RSS feed on a site like MSDN. He also surmises that this will get worse and worse over time:
This gets worse over time because on most sites HTML traffic will go down as people move away (at least until the site reposts interesting content that’ll bring back more traffic) while RSS just grows and grows even if new content doesn’t get posted because people subscribe and don’t move away.
Let’s look at two cases. Let’s make the assumption that the “average” aggregator will default to polling once an hour. Let’s also assume that the server implements HTTP caching headers in some way – really, I consider this a minimum entry criterion for RSS on a busy site.
Case 1 – the content on the site doesn’t update often (let’s say once a day). If the feed only updates once a day, 96% of the requests for the feed (23/24) will return a 304 Not Modified response. The other 4% of requests will respond with the entire contents of the feed. For the 304’s, the bandwidth is small (not negligible on an extremely busy feed, but low enough to not be a huge concern)…total number of connections are something to worry about, but typically not a big issue in most environments.
And that 4% drops even lower if the content is updated less often than once a day.
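To make the 304 path concrete, here is a minimal sketch of the server side – not NewsGator’s actual code, and the feed file name and port are placeholders – that answers a conditional GET with a headers-only 304 when the feed hasn’t changed:

```python
# Minimal sketch (not NewsGator's implementation) of a feed endpoint that
# honors If-Modified-Since. FEED_PATH and the port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer
from email.utils import formatdate, parsedate_to_datetime
import os

FEED_PATH = "feed.xml"   # hypothetical pre-rendered feed file

class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        mtime = int(os.path.getmtime(FEED_PATH))
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                if int(parsedate_to_datetime(ims).timestamp()) >= mtime:
                    self.send_response(304)   # headers only - a few hundred bytes
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass                          # bad date header: serve the full feed
        with open(FEED_PATH, "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), FeedHandler).serve_forever()
```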
Case 2 – the content on the site is updated often, such that there are almost always changes from hour to hour. Assuming the feed updates in real time, every request to the RSS feed in our example will return the entire feed. This is the case that’s worth worrying about.
Given case 2, there are a number of things that can be done. Fewer items in the feed, excerpts versus full content; all of these have their issues. Some folks have suggested serving incremental content changes based on if-modified-since headers, which not only violates the HTTP specification, but breaks in common caching proxy scenarios. So what can you do?
One possible thing you could do is use caching headers to limit the potential “exposure” of a shorter-than-ideal aggregator polling interval. Nick Bradbury describes one such way to do that here.
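One generic way to do that – not necessarily the exact technique Nick describes – is to send freshness headers with the feed, so well-behaved aggregators and caching proxies treat the copy they already have as good for a while. A sketch:

```python
# Sketch: attach freshness headers so clients and proxies won't re-fetch the
# feed more often than we'd like. max_age_minutes is a publisher's choice,
# not a number taken from Nick's article.
from email.utils import formatdate
import time

def freshness_headers(max_age_minutes=120):
    max_age = max_age_minutes * 60
    return {
        "Cache-Control": "max-age=%d" % max_age,
        "Expires": formatdate(time.time() + max_age, usegmt=True),
    }

# e.g. merge freshness_headers() into the 200 response in the earlier sketch
# to suggest a two-hour floor between full fetches.
```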
Another similar option would be to batch feed updates to once or twice a day. All of the RSS feed requests would return a 304, except for those that occur just after the daily update(s). If there is one update a day, you cut 96% of the required bandwidth in our example. But wait – isn’t the point of RSS to get quick updates to site changes?
Now it gets interesting in a different way.
Back to Robert’s example, he assumes that users without RSS will break down as follows:
20% will visit at least once a day
40% will visit at least once a week
20% will visit at least once a month
20% will not visit in any one month (assuming these folks visited before but just aren’t revisiting)
But look at it this way – 80% of users could be as much as a week or more behind on new content, and 40% could be a month or more behind.
So do you care about these users? Do you have content that you think they would be interested in, if only they knew about it? Would you benefit in some way if these users were reading your content more often? If yes to any of these, RSS helps.
You’re distributing incremental content to users who might be interested. From a business perspective, you can’t compare the bandwidth required by that process to the bandwidth required if these users only occasionally come to your site.
Further, the RSS hits will generally be smaller than the corresponding HTML pages, and also have less ancillary impact (such as images on the site, layout, etc.). For example, my weblog front page is 58KB right now, and the RSS feed is 19KB. Add images and such to the HTML version and call it 80KB – approximately 4x the RSS size.
So I’m finally getting to the point. :-) Assume there is benefit to having users read your content every day. If you had some way to convince your interested users to do this (which of course you’re trying to do), going back to Robert’s example for the HTML site:
1000 users x 30 visits/month = 30,000 visits/mo (assuming once/day)
This is the ideal case for the site – assuming more exposure for your content is better. We’re not counting ancillary hits here, which will certainly add to the server load.
With RSS, let’s say we set it up to update/publish the feed 4x per day – which gives aggregator users an average 3 hour delay before they learn of new content (vs. 24 hours for the HTML):
1000 users x 120 hits/month = 120,000 hits/mo
Remember, all of the other hits (potentially 20 per day per user) are negligible in terms of bandwidth due to cache header implementation.
So we have 4x as many hits, but 1/4 the overall size…so it’s a wash in terms of bandwidth. And users are exposed to your content multiple times per day, which is good for you and them both.
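If you want to check that claim, the back-of-the-envelope arithmetic with the numbers above (80KB page, 19KB feed, 1000 users) looks like this:

```python
# Rough sketch of the monthly bandwidth comparison using the figures above.
html_bytes = 1000 * 30 * 80 * 1024        # 30 page views per user per month
rss_bytes  = 1000 * 4 * 30 * 19 * 1024    # 4 full feed pulls per user per day
print(round(html_bytes / 2**30, 2), "GB of HTML per month")  # ~2.29 GB
print(round(rss_bytes / 2**30, 2), "GB of RSS per month")    # ~2.17 GB (304s excluded)
```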
If quicker updates are important for your users, then there is an incremental bandwidth cost to pay for that…but you as the publisher can control this, based on the information you’re trying to push.
Anyway, many of these numbers are pulled out of the air…but the point is, most mature aggregators (like NewsGator and NewsGator Online) support the HTTP caching mechanisms, so take advantage of them. And further, there are things you can do on the server side to manage the bandwidth load, depending on the goals you have for your feed.
Comments welcome as always!
Greg,
The problem remains that you do not “solve” case 2. Your suggestions don’t preserve the idea of timely delivery of new content.
Ultimately the straight-from-source polling platform will have to be put to pasture, but for now the major problem for the high-update/many-readers situation is the granularity of the feed. The incredible waste of serving the same content over and over again to clients that already have it is the immediate killer. A basic staged indexing scheme, changing the one-shot list into a tree of items, seems like a nice compromise: it stays compatible with current clients (the non-leaf item nodes can contain summaries with links to the full articles) while giving some breathing space to currently overloaded infrastructures.
The problem is really the busy sites that generate RSS feeds dynamically (like .Text). This is a rookie mistake for server-side code. They can’t take advantage of any of the well-known caching techniques because they have to respond to every request. Don’t blame the aggregators for this problem.
John,
I do not agree that this is a CPU problem. If it is, the server is badly implemented and the solution is obvious. It is an I/O problem.
Thinking off the top of my head here, but I’d try to explore something like this:
In RSS 2.0 there is the <category> element of <item>, specifically:
“<category> is an optional sub-element of <item>.
It has one optional attribute, domain, a string that identifies a categorization taxonomy.
The value of the element is a forward-slash-separated string that identifies a hierarchic location in the indicated taxonomy. Processors may establish conventions for the interpretation of categories.”
I would propose to define a magic value: ***RSS2.0+***. A category element whose value is the magic value carries, in its domain attribute, the URL where the full content of the parent <item> can be retrieved.
e.g. <category domain="http://www.example.com/item111">***RSS2.0+***</category>
Feeds can then be summary feeds, fully backward compatible with all current clients, while the adaptations needed to turn existing clients and servers into incremental full-content serving/retrieving systems remain manageable.
This is just an example. I am sure more experienced minds on this subject will do far better.
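To make the proposal concrete, here is a rough sketch of the aggregator side – one possible reading of the convention, with an invented feed URL:

```python
# Rough sketch: an item carrying a ***RSS2.0+*** category is treated as a
# summary, and the full content is fetched from that category's domain
# attribute. The feed URL is invented for illustration.
import urllib.request
import xml.etree.ElementTree as ET

MAGIC = "***RSS2.0+***"

def fetch_items(feed_url="http://www.example.com/summary.xml"):
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.parse(resp).getroot()
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        full_url = None
        for cat in item.findall("category"):
            if (cat.text or "").strip() == MAGIC:
                full_url = cat.get("domain")         # URL of the full article
        if full_url:
            with urllib.request.urlopen(full_url) as full:
                yield title, full.read().decode("utf-8", "replace")
        else:
            yield title, summary                     # plain item: use as-is
```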
The other obvious thing is to compress the content. IIS6 will supposedly do this out of the box. And then there’s the slashdot solution – ban clients that poll too often.
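A sketch of the same idea for a hand-rolled feed endpoint (this is what IIS6’s built-in compression would do for you): compress only when the client says it can handle gzip.

```python
# Sketch: compress the feed body only for clients that advertise gzip support.
import gzip

def maybe_compress(body: bytes, accept_encoding: str = ""):
    if "gzip" in accept_encoding.lower():
        return gzip.compress(body), {"Content-Encoding": "gzip",
                                     "Vary": "Accept-Encoding"}
    return body, {}

# XML compresses well - a 19KB feed typically shrinks several-fold - and this
# stacks on top of whatever the 304s already save.
```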
All that said, weblogs.asp.net is a pretty huge site, and I could see them publishing extracts on the main feed – it’s just unlikely that someone would want to read even a majority of that content.
Danny Ayers also posted on the idea of “rewarding” clients that accept compressed encodings.
http://dannyayers.com/archives/2004/09/10/rewarding-clients/
I didn’t mean to imply that it was a CPU problem at all. And I really should have said that it’s a design mistake (rather than a coding mistake). I don’t think the solution to this problem is all that complicated.
First, some assumptions: The aggregate feed has value to people (lots of folks are subscribed to it). Supporting the feed has a significant cost for Microsoft. The current solution (limiting items to 500 characters in the aggregate feed) makes the feed less valuable.
Now, if I were assigned to deal with this problem, here’s what I would propose:
Let’s put together a 1-2 day meeting where MS brings in (i.e. MS pays the expenses) the main aggregator developers, the .Text developer, some IIS guys, and maybe some of the IE team, and let’s just hammer out a solution. This is a pretty small community (heck, I can list off the first names of folks and most people reading this will know who I’m talking about: Scott, Greg, Dare, Nick, etc.). RSS/HTTP is a cooperative system. Let’s have a little cooperation.
One thing I think you didn’t address is the comment I’ve seen elsewhere that some (how many?) aggregators are hitting the server at predetermined times (on the hour, etc). I’ve seen this referred to in several columns.
That is something that I read into Scoble’s comments, although I don’t recall if he drew attention to it in the post you referenced.
Stephan, you’re right, that’s a potential issue. Most aggregators won’t do this, though – for example, NewsGator for Outlook starts its interval depending on when you start Outlook. And NewsGator Online spreads traffic out across the hour interval.
Not sure how many aggregators actually poll every hour at the top of the hour, but if there are any, I would encourage them to change that behavior.
I document on my blog a method, conforming to existing HTTP specifications, that permits per-request minimization of the items delivered without breaking caches, etc. The method relies on the use of RFC3229. For more info read:
http://bobwyman.pubsub.com/main/2004/09/using_rfc3229_w.html
Your comments would be appreciated.
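To make the mechanics concrete, here is a rough sketch of the client side of the RFC3229 “feed” approach as described at the link above (placeholder URL and ETag handling; a server that doesn’t understand A-IM simply returns a normal 200 or 304):

```python
# Rough sketch of an RFC3229 "feed" delta request (placeholder URL/ETag).
# A server that supports the instance manipulation answers 226 IM Used with
# only the new items; any other server falls back to a full 200 or a 304.
import urllib.request, urllib.error

def poll(feed_url, last_etag=None):
    req = urllib.request.Request(feed_url, headers={"A-IM": "feed"})
    if last_etag:
        req.add_header("If-None-Match", last_etag)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return last_etag, None, False       # nothing new at all
        raise
    body = resp.read()
    etag = resp.headers.get("ETag", last_etag)
    is_delta = resp.status == 226               # 226 IM Used: body = new items only
    return etag, body, is_delta                 # otherwise 200: full feed as today
```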
Ultimately, the most efficient syndication systems are going to be push-based. Nonetheless, we can still improve the efficiency of the current, older “pull”-based systems.
bob wyman
It seems to me that using If-Modified-Since to throttle RSS bandwidth can still be effective by introducing a minimum number of items. My feed currently contains 25 items. If I introduced If-Modified-Since throttling with a minimum of 7 items, then I’d still get a substantial savings in bandwidth without unduly penalizing readers behind a cache proxy. …[more]
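One possible reading of that idea, in sketch form (not necessarily the linked post’s exact algorithm):

```python
# Sketch: on a conditional request, return the items newer than
# If-Modified-Since, but never fewer than MIN_ITEMS, so readers behind a
# shared cache proxy still see a reasonable window of recent entries.
from email.utils import parsedate_to_datetime

MIN_ITEMS = 7

def select_items(items, if_modified_since=None):
    """items: newest-first list of (published, xml) with timezone-aware datetimes."""
    if not if_modified_since:
        return [xml for _, xml in items]               # full feed, e.g. 25 items
    try:
        cutoff = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return [xml for _, xml in items]               # unparseable date: full feed
    fresh = [xml for published, xml in items if published > cutoff]
    if len(fresh) < MIN_ITEMS:
        fresh = [xml for _, xml in items[:MIN_ITEMS]]  # enforce the minimum window
    return fresh
```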