NewsGator’s sync platform details

For the vast majority of NewsGator users (including folks using NetNewsWire, FeedDemon, or any of our other applications), NewsGator’s sync system works totally transparently. But there are some nuances of our implementation that are sometimes visible to users. So, in the hopes of giving people a definitive place to go to understand this, I offer the following in-depth explanation of NewsGator’s sync platform.

The Mechanics

All content is stored on NewsGator’s servers. When an application like NetNewsWire does a sync, it sends some bookkeeping information up to the NewsGator system (a “sync token”), and the system returns a list of feeds that NNW needs to update. NNW then requests updates for each of those feeds (generally a subset of the full list of subscribed feeds), again using a sync token, and the system returns the new (not yet seen) or updated articles for that particular feed.
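
To make that shape concrete, here’s a minimal sketch of the client side of the exchange, in Python. The endpoint paths, parameter names, and response fields are all hypothetical – this is not the actual NewsGator API – but the two-step, token-based flow is the idea:

    import requests  # third-party HTTP client (pip install requests)

    API = "https://example.newsgator.test"  # hypothetical base URL

    def store_locally(items):
        """Stand-in for the client's local database write."""
        print(f"stored {len(items)} items")

    def sync(sync_token):
        # Step 1: send our bookkeeping token; the server answers with
        # just the feeds that changed since that token was issued.
        changed = requests.get(f"{API}/sync/feeds",
                               params={"token": sync_token}).json()

        # Step 2: request updates feed by feed -- usually a small
        # subset of the full subscription list.
        for feed_id in changed["feed_ids"]:
            feed = requests.get(f"{API}/sync/feeds/{feed_id}",
                                params={"token": sync_token}).json()
            store_locally(feed["items"])  # only new or updated articles

        # The server hands back a fresh token for the next sync.
        return changed["next_token"]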

This system is extremely efficient. For feeds that haven’t updated, NNW will not even have to request them. For feeds that only have a single new item, that’s the only data that will be returned to the client. For scenarios like mobile applications (like NewsGator Go! or NetNewsWire for iPhone), this is pretty close to the minimum theoretical bandwidth required to deliver the content.

It’s actually possible to reduce the number of calls even further, but at the cost of a potentially large (and expensive to process) response. Our APIs are instead optimized around the case I describe above.

[note: this is somewhat simplified; for example, metadata also travels both ways during a sync, but I’m leaving discussion of that out for purposes of this article.]

Details

NewsGator’s online platform processes about 3.5 million feeds, and stores about 9 million new articles per day, as of this writing. There are a total of about 3 billion articles in the system.

Suppose you subscribe to a feed from CNN.com, and further assume that that feed publishes 100 new articles per day (I have no idea how many it actually publishes – just go with me here). Now imagine you go on vacation for a month, and you come back, fire up FeedDemon, and sync. There are now 3000 articles you haven’t seen. Should we deliver them all to you?

Probably not.

Do we?

No, for several reasons. First and foremost, the user experience would totally suck; no one wants to wade through 3000 articles from a single feed. Second, it’s pretty inefficient to deliver all of that content through the API – it would take a lot of bandwidth to retrieve, and a lot of work to process. Using a mobile phone? This might well lock it up.

Third, our system allows you to mark individual articles as read, and that state is synchronized throughout the system and across all of the clients you use. But we don’t store your read state for all time – we only store it for fairly recent data. Do you really care whether you marked an article read two years ago? Probably not.

So what do we do? We have a limit of how many articles will have their metadata state synced through the system. Here’s the rule we currently use:

The number of articles currently in the feed itself,

OR

All articles from the last 14 days, up to a maximum of 200.

Whichever of the two conditions yields more articles is applied.

Here are three examples of applying this rule:

Imagine a blog that publishes 5 times per month, and its feed has the most recent 10 items in it. This feed would sync 10 articles.

Now take the hypothetical CNN feed above, which publishes 100 items per day, and imagine its feed holds the 20 most recent items. In this case, we would sync 200 items (the 14-day window, capped at 200).

And finally, if a feed published 10 times per day, and held the most recent 20 items in the feed, we would sync 140 items.
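
In code, the rule boils down to taking the larger of those two quantities. Here’s a hypothetical sketch (the names and the function are mine, not NewsGator’s), checked against the three examples:

    SYNC_CAP = 200   # maximum articles synced under the 14-day rule
    SYNC_DAYS = 14

    def articles_to_sync(items_in_feed, items_per_day):
        """How many articles get their metadata state synced for one feed."""
        recent = min(int(items_per_day * SYNC_DAYS), SYNC_CAP)
        # Whichever condition yields more articles is applied.
        return max(items_in_feed, recent)

    assert articles_to_sync(10, 5 / 30) == 10   # quiet blog: 5/month, 10 items in feed
    assert articles_to_sync(20, 100) == 200     # CNN-style feed: 14 * 100 hits the cap
    assert articles_to_sync(20, 10) == 140      # 10/day feed: 14 * 10 = 140, under the cap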

If you really want to go back and browse through all 3000 articles you missed, you still can – they’re all stored in NewsGator Online, and you can view them from the web site. In fact, you can go back all the way to when we first discovered the feed – over 4 years ago, in many cases.

The Gotcha

For most feeds, the algorithm described above makes things completely transparent, and articles and unread counts across all NewsGator-integrated applications will match up perfectly.

The gotcha is with feeds that have a lot of articles. For example, I have a smart feed for the term “NewsGator”, and I see probably 400 new articles there per day, 200 at a time. So in this case, only 200 articles have state synchronized.

What can happen is the following:

1. NetNewsWire downloads the feed, and shows 200 items, all unread.

2. 3 hours later, you sync from FeedDemon, and you see 200 items, all unread. You read them all, and mark them read.

3. An hour after that, you sync again from NetNewsWire…it syncs state from the old articles, and downloads say 25 new ones. You see 47 unread articles. You immediately sync again with FeedDemon, and it shows 25 unread. What gives?

The problem is the 200-article limit, and the fact that some of the articles fell off that ledge while still showing as unread in one application…and thus don’t have all of their state synchronized. In the example above, the extra 22 unread articles in NetNewsWire are the ones that dropped out of the window after FeedDemon marked them read, but before NetNewsWire ever saw that state.
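
A toy simulation shows the mechanism. This is my own model, not NewsGator’s code, and it assumes an idealized timeline where exactly 25 items arrive between syncs (so the counts come out slightly different from the anecdote above – 50 rather than 47 – but the divergence is the same):

    WINDOW = 200  # only the newest 200 items per feed keep synced state

    class Server:
        def __init__(self, item_ids):
            self.items = list(item_ids)   # oldest first, newest last
            self.read = set()             # ids whose read state is stored

        def window(self):
            return set(self.items[-WINDOW:])

        def mark_read(self, ids):
            # Read state is only kept for items still inside the window.
            self.read |= set(ids) & self.window()

        def publish(self, new_ids):
            self.items += list(new_ids)
            # State for items that fell out of the window is forgotten.
            self.read &= self.window()

    class Client:
        def __init__(self, name):
            self.name = name
            self.local = {}               # id -> True if read locally

        def sync(self, server):
            for i in server.items[-WINDOW:]:
                self.local.setdefault(i, False)   # download new items as unread
            for i in server.read:
                if i in self.local:
                    self.local[i] = True          # apply synced read state

        def unread(self):
            return sum(not r for r in self.local.values())

    server = Server(range(200))           # the feed starts with 200 items

    nnw, fd = Client("NetNewsWire"), Client("FeedDemon")
    nnw.sync(server)                      # 1. NNW sees 200 unread
    fd.sync(server)                       # 2. FD sees the same 200 items...
    fd.local = {i: True for i in fd.local}           # ...you read them all
    server.mark_read(fd.local)                       # read state syncs up

    server.publish(range(200, 225))       # 25 new items arrive; 25 old ones
                                          # fall off the window, state dropped
    nnw.sync(server)                      # 3. NNW gets the new items plus
    fd.sync(server)                       #    read state for in-window items

    print(nnw.name, nnw.unread())         # 50: 25 new + 25 orphaned old items
    print(fd.name, fd.unread())           # 25: just the new items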

This problem used to be much more acute – it’s rarer now that we’ve raised the article-sync limit to 200. But it’s still possible to run into it, specifically with very prolific feeds.

We’ve experimented with various limits – the current 200 articles seems to be a good compromise, perfectly syncing the vast majority of feeds while maintaining the efficiency that our client applications demand. We’re also working on some things that will make this invisible to API clients, while still working within the constraints.

8 thoughts on “NewsGator’s sync platform details”

  1. Rolphus

    Now *this* is a system I’d love to get my teeth into. Fascinating writeup, thanks. As a platform, the NG sync system has a huge amount of potential in the future.

  2. Elliot Shank

    And this is precisely why I don’t use syncing. I want all entries, period, locally. Even if I can’t read the 3000 entry backlog, it’s still there for me to search later.

  3. Jonathon McDougall

    @Elliot – I think you are missing the point – all posts are still stored locally *from when you subscribe*; it is the metadata that isn’t synced, i.e. read state and flag state.

    E.g. if you subscribe to a feed that has 20 posts/day, at the end of the year you’ll have over 6000 posts with or without sync – depending on the limits of your client. Only the last 200 posts will sync read state, but the others will still exist locally and be searchable.

  4. Elliot Shank

    This has not been the case in the past. I haven’t tried it in a while, but it used to be the case that subscribing to a feed while syncing wouldn’t download all the items in the feed. Say a feed had the last 50 items in it; if I subscribed via NewsGator, I’d only get the last 20 items (or some other low number like that). I just don’t trust that I’m not missing anything.

  5. gregr Post author

    @Elliot – it did indeed work differently in the past. With the current system, though, there is no scenario I can think of in which you’ll get less data using sync than you would downloading directly…and in fact, in most cases, you’ll get more data using sync.

