Archive for the ‘database’ Category

Memcache and database are out of sync

Wednesday, May 20th, 2009

We heavily use caching in our models using cache_fu.  However, we started to notice that a small number of objects in our system were out of sync between MySQL and memcached.
In one case, this led to model validation exceptions as a user appeared unregistered in memcache, but when we went to register the the user the db would complain that the user is already registered.

What was the bug?

Lets look at the log file…

Ex) user.save!
BEGIN
Update (0.7ms) UPDATE `users` SET `token` = '665456562', `updated_at` = '2009-05-13 17:22:26' WHERE `id` = 4
MemCache Delete (0.000285)  users:2
COMMIT

Uh oh!
We use InnoDB in MySQL to get transaction support and the way rails implements ActiveRecord calls is that it wraps a transaction around everything within an ActiveRecord method call.  This includes the after_save callback to expire_cache.

This leads to the following scenario with two threads where one is reading and another writing:

t1: BEGIN t1
t1: update user in db
t1: expire user in memcache
t2: get user from memcache
t2: cache miss -> get user from db
t2: set user in memcache
t1: COMMIT

Since t2 is reading from the database before t1 has committed it will read the old value and save the old value into memcache. The db is updating, but memcache is out of sync.

What’s the solution?
We need a way for the expire_cache calls after save to be outside the transaction.  This way, t2’s read will either be before the db update and cache expiry or after the db update on the expired cache. The three possible cases are listed below. In each case, memcache and the database remain in sync.

Case 1: t2 is before the COMMIT
t1: BEGIN t1
t1: update user in db
t2: get user from memcache
t1: COMMIT
t1: expire user in memcache

Case 2: t2 is after the COMMIT, but before cache expiry
t1: BEGIN t1
t1: update user in db
t1: COMMIT
t2: get user from memcache
t1: expire user in memcache

Case 3: t2 is after the COMMIT and cache expiry
t1: BEGIN t1
t1: update user in db
t1: COMMIT
t1: expire user in memcache
t2: get user from memcache
t2: cache miss -> get user from db
t2: set user in memcache (new user)

Luckily, there is a great plug-in called after_commit that does what we need. This plugin implements additional callbacks like “after_commit” which run after save, but are outside the commit.

ActiveRecord and includes maxing my ethernet

Tuesday, February 5th, 2008

We recently ran into an issue where using multiple includes were making a huge join on the backend and returning 1000s of rows which was taking all the bandwidth between servers.

Now, as we start to aggressively cache computed data we may run into a similar problem. For example, we cache some html pages in the db that change rarely. These pages can be huge. We would not want to return 10s of MBs of data per view of a list of pages (ie, table of contents or index)

The solution is that ActiveRecord has a :select option which does what SQL SELECT does. We should consider using this when the amount of data returned is very large.

Server down 12/28

Friday, December 28th, 2007

Santa Clause’s gift to us this Christmas was four hours of down time. The issue was caused by a deletion of all stale notifications in the queued_notifications table. The table had a little bit over 3 million records, all of which were stale and needed to be deleted (they were occupying more than 25% of the sql db footprint or about 0.5GB).

Following is the sequence of events for the record:

  • At around 4am I kicked off a delete on all records in that table and went to bed.
  • The deletion finished about 3 hours later (the mysql client process exited successfully after finishing the delete)
  • at around 8:48am John managed to get a hold of me to tell me that the server was down.At that stage here’s what was happening:
    • mysql was taking an unusual amount of CPU (40-50%)
    • simple queries were taking many seconds to finish sometimes tens of seconds
  • I put up the maintenance screen to bring the database back to idle state, but the database still used 10% of CPU on average and show 70-80% of CPU was in iowait state. This is highly unusual especially that the database was not being used at this point.
  • I also noticed something. While querying for count(*) on queued_notifications resulted about 39k records, show table status showed the original number of records before the delete was started (above 3 million records). This led me to believe that the database was still re-arranging data based on the large delete (not sure what that means yet, but will be investigating further later)
  • I dropped the queued_notifications table and recreated it with its index and the database started behaving.
  • Brought all the app servers back online and all seemed to work fine.

What should we learn from this:

  • We should avoid large data creation/deletion
  • We should insure tables that hold transient data get cleaned periodically (currently notifications and feeds)
  • A slave DB would have avoided us a lot of downtime in a case like this.

Sigh!