Anatomy of an outage

The last 12 hours will not go down in the books as NewsGator’s happiest hours…as posted here, there was an outage and NewsGator Online was down for several hours.  On a normal day, I’d be thrilled that “NewsGator” was the #1 search on Technorati; but not today. :-(

But while this was painful for us, it was painful for our users too.  So I’m going to walk through what actually happened, in quite some detail, and why it took so long to come back online.  And there are still a lot of questions here – our platform/operations folks are on the phone with EMC and Microsoft both as I type this post, trying to get to the root cause of all this.  But I’m getting ahead of myself.

Warning – lots of detail ahead.  Feel free to skip below if you don’t want to read all the details.

Monday 5:00pm

A query starts running against the database that contains user information, kicked off by an automated agent job. This query is run periodically, and is used for some reporting data.  It normally takes a minute or so to run…but this time, for some reason (I don’t have the details), it was going to block on something and take 300+ minutes.
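
For the technically curious, here’s a rough sketch of the kind of check that catches this sort of thing – querying SQL Server’s sys.dm_exec_requests view for sessions that are blocked or have been running far too long. This isn’t our actual monitoring code; the connection string and threshold are placeholders.

```python
# Rough sketch: flag long-running or blocked SQL Server sessions.
# The connection string and threshold below are illustrative placeholders.
import pyodbc

CONN_STR = "DRIVER={SQL Server};SERVER=db01;DATABASE=master;Trusted_Connection=yes"
MAX_ELAPSED_MS = 30 * 60 * 1000   # 30 minutes -- far longer than this job should take

SQL = """
SELECT session_id, blocking_session_id, command,
       total_elapsed_time / 60000.0 AS elapsed_minutes
FROM sys.dm_exec_requests
WHERE session_id > 50                 -- skip system sessions
  AND (blocking_session_id <> 0       -- blocked by another session
       OR total_elapsed_time > ?)     -- or simply running too long
"""

def find_suspicious_requests():
    conn = pyodbc.connect(CONN_STR)
    try:
        rows = conn.cursor().execute(SQL, MAX_ELAPSED_MS).fetchall()
    finally:
        conn.close()
    for spid, blocker, command, minutes in rows:
        print(f"session {spid}: {command}, {minutes:.1f} min, "
              f"blocked by {blocker or 'nobody'}")

if __name__ == "__main__":
    find_suspicious_requests()
```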

Monday 6:00pm

SQL transaction log files are growing much more quickly than normal, due to the errant query.  It’s not yet noticed, however, as it’s still below the relatively high log monitoring thresholds we have set up.
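
To illustrate the sort of log monitoring involved: DBCC SQLPERF(LOGSPACE) reports the size and utilization of every database’s transaction log, so a threshold check can be as simple as the sketch below. The threshold and connection details are made up – this is not our real monitoring configuration.

```python
# Rough sketch of a transaction-log growth check.  The threshold and connection
# details are assumptions, not the real monitoring configuration.
import pyodbc

CONN_STR = "DRIVER={SQL Server};SERVER=db01;DATABASE=master;Trusted_Connection=yes"
LOG_ALERT_MB = 50 * 1024   # hypothetical 50GB alert threshold

def check_log_sizes():
    conn = pyodbc.connect(CONN_STR)
    try:
        # DBCC SQLPERF(LOGSPACE) returns one row per database:
        # database name, log size (MB), percent of log space used, status.
        rows = conn.cursor().execute("DBCC SQLPERF(LOGSPACE)").fetchall()
    finally:
        conn.close()
    for db_name, log_size_mb, pct_used, _status in rows:
        if log_size_mb > LOG_ALERT_MB:
            print(f"ALERT: {db_name} log is {log_size_mb / 1024:.1f} GB "
                  f"({pct_used:.0f}% in use)")

if __name__ == "__main__":
    check_log_sizes()
```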

Monday 7:00pm 

I’m sitting in the office with Darryl, who runs our platform team, and owns all of the non-GUI portions of the online platform.  We’re noticing large I/O bottlenecks on one file group in one of our databases, and some quick calculations indicate we are at basically 100% of the hardware capacity.  This has been consistent for the past week or so.

We use an EMC fiber-channel SAN for database storage.  The database file group that was running into problems is striped across 20 disk spindles.  We had a plan in place to increase the number of spindles to increase I/O capacity, which was going to happen later this week…but given the data at hand, we decided to schedule it for Monday night.  The immediate plan was to add 10 more disks to the stripe for that file group, bringing the total up to 30, and increasing I/O capacity for the group by 50%.  (For techies out there, we’re measuring I/O operations per second; physical disks have a maximum number of these operations they can handle per second, and the SAN caching increases the effective rate in some cases…but generally, the only way to really increase I/O capacity is to add drives).
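
The arithmetic behind that 50% figure is straightforward – aggregate random I/O capacity scales roughly linearly with the number of spindles in the stripe. The per-disk number below is an assumed ballpark, not a measured value:

```python
# Back-of-the-envelope IOPS math for the spindle expansion described above.
# The per-spindle figure is an assumption (a rough ballpark for a fibre-channel
# drive doing random I/O); the point is that capacity scales with spindle count.
IOPS_PER_SPINDLE = 150            # assumed sustained random IOPS per physical disk

before = 20 * IOPS_PER_SPINDLE    # 20-spindle stripe -> ~3,000 IOPS
after = 30 * IOPS_PER_SPINDLE     # 30-spindle stripe -> ~4,500 IOPS

print(f"before: ~{before} IOPS, after: ~{after} IOPS, "
      f"gain: {100 * (after - before) / before:.0f}%")   # gain: 50%
```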

So…we allocate the extra drives, and begin the re-striping process.  This happens transparently to the applications (i.e. SQL Server), as the SAN has the ability to re-stripe on the fly.  We’ve expanded storage a number of times, exactly like this, with no problems.  Total I/O capacity is impacted a bit during the process, but everything stays up and running.

Or at least it’s supposed to.

Monday 10:00pm

System starts throwing intermittent errors when applications try to connect to the database in question.  Monitor alerts are sent.  It’s pretty critical, as this database contains all user-related data (such as authentication info) for all system users.

Monday 10:30pm

Darryl discovers the errant query (the one that started at 5pm), and terminates it.  At this point the SQL transaction logs are well over 300GB.  SQL starts to recover, but intermittent errors still occur.
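
For reference, terminating a runaway session once it has been identified comes down to a KILL against its session id. A bare-bones sketch – the connection details are placeholders:

```python
# Rough sketch of terminating a runaway SQL Server session by its session id.
# Connection details are placeholders.
import pyodbc

CONN_STR = "DRIVER={SQL Server};SERVER=db01;DATABASE=master;Trusted_Connection=yes"

def kill_session(spid):
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    try:
        # KILL takes a literal session id, so the value is validated and inlined.
        conn.cursor().execute(f"KILL {int(spid)}")
    finally:
        conn.close()
```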

Tuesday 2:00am

SQL Server shuts down – the cause is as yet unknown, although there are some suspicious log entries prior to the failure.  It automatically restarts, and begins the recovery process to bring the database online, including replaying this huge 300GB log.  Other cluster members don’t take over, because the problem is that the database requires recovery – not that a server has failed.

However, while it’s trying to recover, a huge amount of traffic is still coming in from the web sites (mostly API and non-US sites at this time), which causes a large number of SQL transactions.  Some of these transactions deadlock with the recovery process, which causes SQL to terminate again and restart recovery.

At this point, every request coming into the system that hits that particular database fails.

Tuesday 2:30am

It’s clear by now that the only way to bring the database back online is to eliminate all database traffic.  “Maintenance” pages go up, all sites come down, and the database restarts its recovery process.
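
Conceptually the maintenance switch is nothing fancy – if a flag is set, the front end answers with a static page and never touches the database. Here’s a stripped-down sketch; the flag path and handler shape are made up for illustration, not our actual front-end code:

```python
# Stripped-down sketch of a maintenance-mode gate: when the flag file exists,
# requests get a static page and generate no database traffic at all.
# The flag path and handler shape are illustrative assumptions.
import os

MAINTENANCE_FLAG = "/etc/newsgator/maintenance.flag"   # hypothetical path
MAINTENANCE_PAGE = "<html><body>We'll be right back.</body></html>"

def handle_request(request, query_database):
    """Serve a request, short-circuiting to the maintenance page if flagged."""
    if os.path.exists(MAINTENANCE_FLAG):
        return 503, MAINTENANCE_PAGE        # service unavailable; no DB access
    return 200, query_database(request)     # normal path
```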

Tuesday 10:15am

SQL Server completes the recovery process and brings this database online.  Back-end services are turned back on in priority order, and sites are brought back online by 10:30am.

So there you have it. I’m not a database expert myself, so I may have some of these things not quite right, but this is more or less what happened.  The root cause – and which of these details are actually relevant – is still unclear.

The big question is: why did SQL Server fail completely?  One possible cause would be if the SAN expansion caused some disk errors and left the database in an inconsistent state.  So as you can imagine, we have tickets open with both EMC and Microsoft, and we will get to the bottom of this.  There are also other issues – such as the intermittent failures – that are cause for concern.

Frankly, this was a pretty frustrating experience.  We have a lot of redundant systems – pretty much any piece of hardware in our data center could fail, and we can absorb it without a significant outage.  For example, if an entire SQL box had lost power, fallen on the floor, and broken into pieces, no problem – we’d have had an approximately 10-second outage.  But this case, where the database gets into an inconsistent state, wasn’t helped by the redundant systems.

Moving forward, we’re doing a number of things. Specifically, we still need to get to the root cause of this particular problem, but in general, we’re working a lot on our back-end systems to build additional redundancy, scalability, performance, and resiliency.  We’re going to partition data in different ways, so a failure like this will only affect a portion of the user base, a portion of the content, or a portion of the functionality.  Our back-end cache servers will expand to be able to serve data while the underlying data store is offline.  And lots more, which I won’t go into just now, since this post is getting a little long.
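
To give a flavor of what I mean by the cache serving data while the data store is offline: the pattern is a read-through cache that falls back to stale entries when the database call fails. A simplified sketch – the names and in-memory store are illustrative, not our actual cache servers:

```python
# Simplified sketch of a read-through cache that serves stale data when the
# underlying data store is unavailable.  Names are illustrative assumptions.
import time

class FallbackCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}                      # key -> (value, fetched_at)

    def get(self, key, load_from_db):
        entry = self.store.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                  # fresh cache hit
        try:
            value = load_from_db(key)        # normal path: refresh from the database
            self.store[key] = (value, time.time())
            return value
        except Exception:
            if entry:
                return entry[0]              # data store offline: serve stale data
            raise                            # nothing cached; surface the failure
```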

But for now – yep, we screwed up.  To our customers who were affected – you have my personal apology, and my assurance that we’re doing everything we can to ensure this doesn’t happen again.

33 thoughts on “Anatomy of an outage”

  1. Dan

    Thanks for letting us know what happened Greg – it sounds like you’ve managed to gather some useful data as a result of the down time, like it or not! It certainly is nice to have my subs back though…

  2. bernard

    Hi Greg,

    Maybe I’ll be the first to thank you for this pretty in-depth post.

    The outage was pretty frustrating for people, as, probably like me, NWO is their feed reader away from their home machine (this is my case).

    I did have access to another feed reader, but well, it just wasn’t the same!

    On the SQL problems, well, I can empathise with you. I have been in a similar situation, watching “runaway SQL DBs” access and hog all memory and CPU cycles. I work at a web-based company that runs SQL DBs in a clustered 64-bit env. Bleeding edge ;)

    You said you were in a clustered SQL environment… if the other nodes didn’t take over, it could be DB issues.

    These could also be the reasons for the intermittent outages people experience.

    Anyway, the important part is that you have given users some information.

    And I think it explains the intermittent issues too.

    Well, I hope that EMC and MS help you guys as much as you need!

    Thanks for a great service. Just keep the new stuff coming!

    bernard

    runningwithbulls.com/blog

  3. NewsGator outage

    Greg Reinacker: To our customers who were affected – you have my personal apology, and my assurance that we’re doing everything we can to ensure this doesn’t happen again. Thanks for the detailed account of what happened and for your…[more]

  4. Ashleigh

    Thanks for the heads up, I know how you feel. I’ve had similar days! I remember one where I was awake for 3 days…

    Well done Dan on the first reply ;) (we work together)

  5. Tomas Restrepo

    Greg,

    Thanks for sharing the information with us, I sure hope you get to the root cause of the problem (it can be very frustrating not finding out what the issue is).

    Here’s a suggestion for future newsgator builds that might alleviate this some: I honestly didn’t mind that newsgator was down for a few hours. What really was somewhat frustrating was newsgator Inbox/Outlook/whatever it is called now simply terminating sync without saying something was up. It would be very, very nice if you could somehow let the Outlook add-in know that the services platform is down, so that it can let the user know. That by itself would make it far easier on the user!

  6. David

    Greg – Stuff happens, but it’s never fun when stuff happens to you! I was frustrated with not getting my feeds, but I’m very pleased to see your blog entry about the problem.

    I expect that things will happen occasionally, but to hear quickly (and in some detail) about what exactly happened and how you are working to keep it from happening again is very reassuring. Thanks.

  7. chatoyer

    I’ve been evaluating FeedDemon for about 17 days now, and your post has confirmed it for me: I’ll definitely be buying it. The fact that you blog about what happened, in detail even, shows you guys actually give a stuff about what is happening. As we say here in NZ: ‘Good on ya, mate’.

  8. Ian Betteridge

    Greg, you guys have given outstanding service for a long time – you’ve certainly built up enough karma points with me to survive one outage :) What’s more, the fact that you’ve been completely transparent about what happened is a credit to you and your company. Well done!

  9. Jason J. Thomas

    Nice to see you give a nice synopsis of what happened, Greg. As a big user of FeedDemon, I was a little worried when FD threw a weird error at me. I then investigated further, and I saw the maintenance page go up.

    As for your redundancy, sometimes it does not matter how redundant you make something. If something else fails, you are just plain and simple going to be waiting until that software process recovers (and with 300GB, that will take a while).

  10. Gabe

    I feel much better about outages when explained like you did.

    I wish more in the industry would follow your example.

    Thank you

  11. Serdar Kilic

    I concur with Gabe, thank you for being so public about it all. Stuff like this happens, hope you guys get to the bottom of this nasty bug, we live and learn :) Best of luck

  12. Markus Merz

    First of all, this bad news is good news, and your public communication approach with this detailed blog entry is exemplary.

    Splitting and parallelizing databases is the right way to go. The size of transaction logs is critical for replay operations and has to be monitored.

    Re. storage … EMC had a big issue here in Germany before (i.e. search for: emc strato ausfall).
