Server down 12/28

Santa Clause’s gift to us this Christmas was four hours of down time. The issue was caused by a deletion of all stale notifications in the queued_notifications table. The table had a little bit over 3 million records, all of which were stale and needed to be deleted (they were occupying more than 25% of the sql db footprint or about 0.5GB).

Following is the sequence of events for the record:

  • At around 4am I kicked off a delete on all records in that table and went to bed.
  • The deletion finished about 3 hours later (the mysql client process exited successfully after finishing the delete)
  • at around 8:48am John managed to get a hold of me to tell me that the server was down.At that stage here’s what was happening:
    • mysql was taking an unusual amount of CPU (40-50%)
    • simple queries were taking many seconds to finish sometimes tens of seconds
  • I put up the maintenance screen to bring the database back to idle state, but the database still used 10% of CPU on average and show 70-80% of CPU was in iowait state. This is highly unusual especially that the database was not being used at this point.
  • I also noticed something. While querying for count(*) on queued_notifications resulted about 39k records, show table status showed the original number of records before the delete was started (above 3 million records). This led me to believe that the database was still re-arranging data based on the large delete (not sure what that means yet, but will be investigating further later)
  • I dropped the queued_notifications table and recreated it with its index and the database started behaving.
  • Brought all the app servers back online and all seemed to work fine.

What should we learn from this:

  • We should avoid large data creation/deletion
  • We should insure tables that hold transient data get cleaned periodically (currently notifications and feeds)
  • A slave DB would have avoided us a lot of downtime in a case like this.

Sigh!

Leave a Reply