Archive for the ‘performance’ Category

Ruby on Rails: A look back

Friday, May 30th, 2008

It has been a little over a year since we started rewriting MedHelp’s software and had to answer a very simple question: which platform should we use?

After much exploration and deliberation, I decided that Ruby on Rails was the way to go. At that time, the debate on whether RoR was scalable or mature enough was raging (and still is), with a few high-profile stories adding to the drama (a Twitter dev dissing RoR for what seemed to be architecture failures was a classic).

Just like anyone making an investment decision, I followed the various blogs arguing that RoR is a terrible platform, that it couldn’t scale and that it was obviously a bad choice, starting with Twitter of course and working through the various voices for and against.

To my surprise (or not), the issues people faced as they scaled RoR were not specific to RoR. In fact, they were issues I had seen people grapple with for years: bottlenecked (and sometimes not truly stateless) app servers, expensive database queries, single points of failure, centralized databases.

For some reason many people in the debate assumed that there are platforms that scale and others that don’t. And that by picking the right platform you will be able to serve millions of users. Unfortunately, it is never that simple. Scaling is a continuous exercise of understanding the bottlenecks in your system and the limitations of your architecture and finding ways to gracefully get beyond them.

Another argument against Ruby on Rails was that Ruby is a slow language or that it consumes too much memory. But wasn’t this the argument against Java when the world was dominated by C++ fanatics? Wait, wasn’t this also the argument against C when Assembly developers were the coolest kids on the block? What about machine code? You get the picture!

The answer to this argument is twofold. The first part is economic: developers are way more expensive than hardware. This has held true for years, and becomes truer every minute. The other part of the answer is that today’s architecture (thanks to the 90’s) puts completely stateless software at the heart of your system, allowing you to scale horizontally. So it is not really that important how fast each machine is (as long as the difference is not noticeable to the end user); you can always add another piece of hardware and double your capacity.

So not finding any challenges with RoR that I didn’t expect to face with any other platform, and having been sold on its design philosophy (long live conventions), the elegance of its architecture and the elasticity of the Ruby language, I decided that MedHelp is going to be a Ruby on Rails shop.

Fast forward one year, and MedHelp is up and running. We were able to rewrite the entire application in RoR in about four weeks. We transformed the site from a simple forum application to a vibrant community. We added tons of features, some of which are complex Ajax applications such as trackers. We swapped out the site’s interface in favor of better flow and aesthetics. And we did all that while growing our traffic from 2 million unique visitors to 5.5 million.

Our average team size during this year was 3.5 people (we are 6 now). And while all of them are seasoned engineers with a lot of experience in building and scaling server software (people I knew or worked with prior to MedHelp, and am proud to keep working with today), all of them learned Ruby on Rails on the job.

After all this, I am now taking a deep breath and asking myself again: have I made the right choice? The answer for me is clearly yes.

The ride was not an easy one, and we had our share of emergencies, head scratching and nervous moments. But none of the mistakes made or the bugs found were caused by Ruby on Rails, except in the sense that the platform’s flexibility made it easy to make some mistakes. The mistakes were ours. When made, they often reflected a misunderstanding of how a certain feature worked, a flaw in our database schema, or a problem in how our components were distributed across our servers.

Now that we’ve gone through those pains to grow the site, I think I am ready to share many of the things that we learned or had to re-learn as we grew MedHelp. Each week or two I will share one of the big pitfalls that we managed to fall into, and what lessons we learned as we climbed out of it and started marching for the next pitfall.

ActiveRecord and includes maxing my ethernet

Tuesday, February 5th, 2008

We recently ran into an issue where using multiple includes produced a huge join on the backend that returned thousands of rows, which took up all the bandwidth between servers.

Now, as we start to aggressively cache computed data, we may run into a similar problem. For example, we cache some HTML pages in the db that change rarely. These pages can be huge. We would not want to return tens of MBs of data per view of a list of pages (i.e., a table of contents or index).

The solution is ActiveRecord’s :select option, which limits the returned columns the way the column list of a SQL SELECT does. We should consider using it whenever the amount of data returned is very large.
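To make the saving concrete, here is a minimal plain-Ruby sketch that simulates the effect of selecting only a few columns. The rows, column names and sizes are hypothetical stand-ins, not our schema. In a Rails application of this era the equivalent call would look something like Page.find(:all, :select => "id, title"), where Page is an assumed model name.

```ruby
# Hypothetical rows standing in for a table of cached HTML pages.
PAGES = [
  { id: 1, title: "Home",  body: "x" * 50_000 },
  { id: 2, title: "About", body: "y" * 80_000 },
]

# Keep only the named columns, as a SELECT with an explicit column list would.
def project(rows, columns)
  rows.map { |row| row.select { |name, _value| columns.include?(name) } }
end

# Rough payload size: the bytes of every value rendered as a string.
def payload_bytes(rows)
  rows.sum { |row| row.values.sum { |value| value.to_s.bytesize } }
end

full  = payload_bytes(PAGES)
index = payload_bytes(project(PAGES, [:id, :title]))
puts "all columns: #{full} bytes, id and title only: #{index} bytes"
```

For an index or table of contents view, only the small columns need to cross the network; the large body column can be fetched when a single page is actually displayed.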

Caching Lessons Learned

Tuesday, January 22nd, 2008

We have a set of bugs with caching:

Versioning:

  • We must version whenever we cache, so that when we upgrade, the app uses the updated revision of the cached object
  • In acts_as_cached, there is a version. However, in page caching and fragment caching there is no version number.
  • In page caching, the cached page is saved on disk. Our deployment method overwrites the directory which will refresh this cache.
  • In this case, we should use the same workaround we use in css, icon, and js includes where we define the key as name?<version_number>
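The name?<version_number> workaround can be sketched in a few lines of Ruby. This is only an illustration: APP_VERSION and versioned_key are hypothetical names, and where the version number comes from (a deploy revision, a constant bumped by hand) is an assumption.

```ruby
# Hypothetical version marker, for example the current deploy revision.
APP_VERSION = "42"

# Build a cache key of the form name?<version_number>, so that entries
# written by an older release are never read by a newer one.
def versioned_key(name, version = APP_VERSION)
  "#{name}?#{version}"
end

puts versioned_key("views/sidebar")
```

Old entries are not deleted by this scheme; they are simply never requested again once the version changes, so the cache store still needs some way to evict them.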

Includes:

  • Early on, we used :include in acts_as_cached to minimize the number of database calls. However, by over-including you can accidentally max out your network. The include is implemented as one big join, so if you have m includes that each match n rows, and one column holds a large amount of data, you will transfer that column n^m times.
  • We have seen this where we had a column that returns results on the order of 10K, but instead of transferring 60 rows of 10K (600K), we were transferring 10^6 * 10K rows (100MB). Now, that’s a huge difference!
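The multiplication described above can be reproduced in plain Ruby. The sketch below uses made-up numbers (one parent row with a 10 KB column and two included associations of 10 rows each), not the figures from our incident, and the association names are hypothetical.

```ruby
# One parent row with a wide column, plus two included associations.
parent   = { id: 1, body: "x" * 10_000 } # hypothetical 10 KB column
comments = Array.new(10) { |i| { comment_id: i } }
tags     = Array.new(10) { |i| { tag_id: i } }

# A single join returns every combination of parent, comment and tag,
# and each combination repeats the parent's wide column.
joined = [parent].product(comments, tags)

# Rows returned per parent when each of `includes` associations matches
# `rows_per_include` rows.
def joined_row_count(rows_per_include, includes)
  rows_per_include**includes
end

wide_bytes = joined.sum { |p, _comment, _tag| p[:body].bytesize }
puts "rows: #{joined.size}, wide column bytes sent: #{wide_bytes}"
puts "with three includes: #{joined_row_count(10, 3)} rows"
```

Loading each association with its own query would transfer the wide column once plus 20 small rows, which is why dropping an include can be cheaper than keeping it.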

Expiry:

  • We use fragment caching to save rendering in our views. However, the base implementation from Rails does not support expiry.
  • Thus, we need to either expire fragments explicitly or use another technique, such as TTL-based expiry or sweepers
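As an illustration of the TTL option, here is a minimal in-memory sketch. TtlCache is a hypothetical class written for this post, not part of Rails or of any caching plugin; a real setup would more likely rely on the expiry support of the cache store itself.

```ruby
# A tiny cache whose entries expire after a time-to-live given in seconds.
class TtlCache
  # The clock is injectable so that expiry can be exercised without sleeping.
  def initialize(clock = -> { Time.now.to_f })
    @clock = clock
    @store = {}
  end

  # Return the cached value if it is still fresh; otherwise run the block,
  # cache its result for ttl seconds and return it.
  def fetch(key, ttl)
    entry = @store[key]
    return entry[:value] if entry && entry[:expires_at] > @clock.call

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + ttl }
    value
  end

  # Explicit expiry, for callers that know the underlying data changed.
  def expire(key)
    @store.delete(key)
  end
end

cache = TtlCache.new
html = cache.fetch("sidebar", 300) { "<ul>rendered once</ul>" }
puts html
```

TTL expiry trades freshness for simplicity: a fragment can be stale for up to ttl seconds, whereas explicit expiry or a sweeper keeps it current at the cost of tracking every place the data changes.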

Server down 12/28

Friday, December 28th, 2007

Santa Claus’s gift to us this Christmas was four hours of downtime. The issue was caused by a deletion of all stale notifications in the queued_notifications table. The table had a little over 3 million records, all of which were stale and needed to be deleted (they were occupying more than 25% of the SQL db footprint, or about 0.5GB).

Following is the sequence of events for the record:

  • At around 4am I kicked off a delete on all records in that table and went to bed.
  • The deletion finished about 3 hours later (the mysql client process exited successfully after finishing the delete)
  • At around 8:48am John managed to get hold of me to tell me that the server was down. At that stage here’s what was happening:
    • mysql was taking an unusual amount of CPU (40-50%)
    • simple queries were taking many seconds to finish, sometimes tens of seconds
  • I put up the maintenance screen to bring the database back to an idle state, but the database still used 10% of CPU on average, and the system showed 70-80% of CPU in the iowait state. This is highly unusual, especially since the database was not being used at this point.
  • I also noticed something. While querying for count(*) on queued_notifications returned about 39k records, show table status showed the original number of records from before the delete was started (above 3 million records). This led me to believe that the database was still re-arranging data after the large delete (not sure what that means yet, but I will be investigating further later)
  • I dropped the queued_notifications table and recreated it with its index and the database started behaving.
  • Brought all the app servers back online and all seemed to work fine.

What we should learn from this:

  • We should avoid large data creation/deletion
  • We should ensure that tables that hold transient data get cleaned periodically (currently notifications and feeds)
  • A slave DB would have saved us a lot of downtime in a case like this.

Sigh!
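One common way to avoid a single huge delete is to remove rows in small, bounded batches. We have not settled on this; the sketch below only shows the shape of the loop. delete_in_batches is a hypothetical helper, and the block stands in for whatever issues the real statement (in MySQL, a DELETE with a LIMIT clause) and returns the number of rows removed.

```ruby
# Delete rows in bounded batches. The block receives the batch size, performs
# one limited delete and returns how many rows it actually removed.
def delete_in_batches(batch_size, pause_seconds = 0)
  total = 0
  loop do
    deleted = yield(batch_size)
    total += deleted
    break if deleted < batch_size # a short batch means nothing is left

    sleep(pause_seconds) if pause_seconds > 0 # let other queries through
  end
  total
end

# In-memory stand-in for a table of stale rows.
stale = Array.new(2_500) { |i| i }
removed = delete_in_batches(1_000) { |limit| stale.shift(limit).size }
puts "removed #{removed} rows, #{stale.size} left"
```

Each batch is a short operation, so the cleanup can be paused or stopped as soon as the database shows signs of strain.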

memcache marshaling pain

Monday, October 22nd, 2007

So I was trying to use the memcache increment method during my implementation of the view counter. Whenever I issue a CACHE.incr, CACHE.get starts to fail on that key. I found out what the problem is.

When set is called without specifying the raw argument, it defaults to false, which means that the memcache-client lib will marshal the value passed before sending it to memcache. Then, when we issue an incr to the memcached server, it tries to increment the marshaled version, which is a binary sequence representing a Ruby object (a Fixnum in this case), and messes it up.

The solution is to pass the raw parameter to MemCache.set as true, which prevents the marshaling and gives memcached an integer (although treated as a string on the Ruby side) that it can increment safely. Later calls to get should pass raw as true as well.

I created a raw_get, raw_put and raw_incr that can be used together. They are nicely contained in a Cache module (yes, I’m using the same name as the memcache_util.rb module) under our lib folder.
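The difference between the marshaled and the raw value is easy to see without a memcached server. In the sketch below, FakeMemcached is a hypothetical in-memory stand-in that, like the real server, only increments values stored as decimal digit strings; what a real server returns for a non-numeric value may differ from the nil used here.

```ruby
# Hypothetical stand-in for a memcached server: it stores raw byte strings and
# can only increment values that consist of decimal digits.
class FakeMemcached
  def initialize
    @data = {}
  end

  def set(key, bytes)
    @data[key] = bytes
  end

  def get(key)
    @data[key]
  end

  # Return the new counter value, or nil when the stored bytes are not numeric.
  def incr(key, amount = 1)
    value = @data[key]
    return nil unless value.is_a?(String) && value.match?(/\A\d+\z/)

    @data[key] = (value.to_i + amount).to_s
    @data[key].to_i
  end
end

server = FakeMemcached.new

# Default (non-raw) path: the client marshals the Ruby object first.
server.set("views", Marshal.dump(1))
puts "marshaled bytes: #{server.get('views').inspect}"
puts "incr on marshaled value: #{server.incr('views').inspect}"

# Raw path: the counter is stored as a plain decimal string.
server.set("views", 1.to_s)
puts "incr on raw value: #{server.incr('views')}"
```

Because the raw value comes back as a string, the reader has to convert it with to_i, which is the kind of detail a raw_get helper can hide.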