Server down 12/28
Friday, December 28th, 2007Santa Clause’s gift to us this Christmas was four hours of down time. The issue was caused by a deletion of all stale notifications in the queued_notifications table. The table had a little bit over 3 million records, all of which were stale and needed to be deleted (they were occupying more than 25% of the sql db footprint or about 0.5GB).
Following is the sequence of events for the record:
- At around 4am I kicked off a delete on all records in that table and went to bed.
- The deletion finished about 3 hours later (the mysql client process exited successfully after finishing the delete)
- at around 8:48am John managed to get a hold of me to tell me that the server was down.At that stage here’s what was happening:
- mysql was taking an unusual amount of CPU (40-50%)
- simple queries were taking many seconds to finish sometimes tens of seconds
- I put up the maintenance screen to bring the database back to idle state, but the database still used 10% of CPU on average and show 70-80% of CPU was in iowait state. This is highly unusual especially that the database was not being used at this point.
- I also noticed something. While querying for count(*) on queued_notifications resulted about 39k records, show table status showed the original number of records before the delete was started (above 3 million records). This led me to believe that the database was still re-arranging data based on the large delete (not sure what that means yet, but will be investigating further later)
- I dropped the queued_notifications table and recreated it with its index and the database started behaving.
- Brought all the app servers back online and all seemed to work fine.
What should we learn from this:
- We should avoid large data creation/deletion
- We should insure tables that hold transient data get cleaned periodically (currently notifications and feeds)
- A slave DB would have avoided us a lot of downtime in a case like this.
Sigh!