Monday, May 12, 2008

I Poop on Designing for Scalability

Dear John -

Is it OK that I answer your question with a blog post? It seems very web 2.0, and my whole goal in life is to be more web 2.0 than any other web 2.0 weenie out there. Plus, maybe there is a reader out there, and maybe they have a better response.

Unfortunately, I've done a bunch of scaling in my short time on this planet (especially working with the notoriously performance-angsty Plone) and I have to side with the camp that says you should rarely spend time on scaling architecture in the beginning. Here are some yummy reasons why you should wait to pop your scaling cherry:
  • your web app may never take off and you have wasted time scaling when you could have spent time writing a killer feature
  • if it does take off, pay someone else to worry about it. If you read anything about scaling twitter, read about how they "fired" Blaine Cook and hired a bunch of scalability experts instead. Zing!
  • caching goes a loooong way. I like squid myself - it's a sexy beast, if a little hard to configure - and I hear varnish is pretty top notch too. Hell, Apache httpd has a nice accelerator built in too if you need a quick fix. Let's take a moment and remember WHY caching works so well: it serves up content that an app server like RoR or Django could give two craps about, such as javascript, css and images. Have you ever looked at how much time your browser spends loading this stuff? It's a lot, and cache servers know when to expire, refresh and reload content based on http headers. Your cache server may be your users' browsers' best friend.
  • you can and should throw more hardware at the problem first. It's cheap enough these days and it will buy you the time you need to develop a real scaling solution
  • Jared Spool gave a great talk at SXSW about actual performance vs. perceived performance. The actual performance of your pages (i.e. load time) almost always takes a back seat to perceived performance, the time it takes your user to complete a task. For example, amazon.com takes an unthinkably long time to load pages on average, but it is commonly perceived as the fastest site out there because of its 1-click functionality to get things done.
  • network latency is a factor. If you are integrating with any external site (who isn't these days?), chances are you will see more lag from waiting for its responses than from any kludgey code you can write
  • it will rarely clog where you think it will clog so take that pipe dream to the dump and let the bums sleep in it
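To make the caching bullet a little more concrete, here's a minimal sketch of the http headers game. The extension list and max-age are arbitrary picks for illustration, and in real life you'd usually do this in your squid/varnish/httpd config rather than in app code:

```python
# Sketch: stamp cache headers so squid/varnish (or the browser itself)
# can serve static stuff without ever bothering the app server.
# The extension list and max_age are made-up values for illustration.

STATIC_EXTENSIONS = (".js", ".css", ".png", ".gif", ".jpg")

def cache_headers(path, max_age=86400):
    """Return headers telling caches how long they may keep `path`."""
    if path.endswith(STATIC_EXTENSIONS):
        # static assets: any shared cache may hold this for a day
        return {"Cache-Control": f"public, max-age={max_age}"}
    # dynamic pages: force caches to revalidate every time
    return {"Cache-Control": "no-cache"}
```

Every request for javascript, css and images that your cache answers is a request your app server never sees.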
That being said, here are some things you can think about now if you are worried:
  • don't write stupid code. if you notice yourself writing 4 nested loops, think - hey, that could be nasty later on.
  • write good db queries and for the love of god don't filter the results in your code. In the same respect, don't fetch database values more often than needed - stash them in a higher level variable and reference that forever. Watch out for this in ORMs - they make code fast to write but can be very inefficient for scaling. But read bullet 1 above first.
  • know your language/tools. If you are using RoR or something else known to be slow, anticipate it being a problem later and plan on rewriting key parts in a language like C. Oh, your language doesn't have C wrappers? That could be a problem ...
  • think about concurrency while you are writing KNOWN expensive operations. Think: does this action rely on another action before going to the next step? For example, we have a process which needs to create an appointment in Exchange and then attach 2 files to the resulting appointment. Instead of writing a sequential piece of code like so: create appointment (2s) > attach file 1 (60s) > attach file 2 (120s) > report results (2s), totaling 184 seconds, think of doing a threaded version: create appointment (2s) > report results (2s) > kick off concurrent attach threads (120s), totaling 124 seconds. Web peeps don't think about concurrency enough. Be different.
  • put a round robin queue in front that diverts traffic from dead parents, so you can hot patch stuff live if your code requires a restart for changes to take effect (i.e. compiled code). This has saved me a hundred times over and eliminates downtime while still allowing you to respond to heinous bugs.
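The appointment example can be sketched in python. The timings are scaled down 100x so it runs in a couple seconds, and the function names are made-up stand-ins for the real Exchange calls:

```python
import threading
import time

# Scaled-down stand-ins for the real operations: create (0.02s),
# attach file 1 (0.6s), attach file 2 (1.2s), report (0.02s).

def create_appointment():
    time.sleep(0.02)

def attach_file(seconds):
    time.sleep(seconds)

def report_results():
    time.sleep(0.02)

def sequential():
    """Do everything one step at a time: total is the SUM of all steps."""
    start = time.time()
    create_appointment()
    attach_file(0.6)
    attach_file(1.2)
    report_results()
    return time.time() - start

def threaded():
    """Report right after creating, then run both attachments at once:
    total wait for the attachments is only the SLOWER of the two."""
    start = time.time()
    create_appointment()
    report_results()
    threads = [threading.Thread(target=attach_file, args=(s,))
               for s in (0.6, 1.2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start
```

Run both and the threaded version finishes in roughly the time of the slowest attachment instead of the sum of both - same 184s vs 124s math as above, just 100x faster.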
Last but not least, don't forget Knuth's famous words of wisdom: "Premature optimization is the root of all evil!!!"
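And since somebody will ask what "don't filter the results in your code" looks like, here's a toy sketch with sqlite - the table and data are made up, but the shape of the mistake is universal:

```python
import sqlite3

# Made-up table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "open"), (2, "closed"), (3, "open")])

# Bad: drag EVERY row out of the database, then filter in your code.
open_slow = [row for row in conn.execute("SELECT id, status FROM orders")
             if row[1] == "open"]

# Good: let the database do the filtering - that's its whole job.
open_fast = conn.execute(
    "SELECT id, status FROM orders WHERE status = 'open'").fetchall()
```

Both give the same answer, but the first one hauls the whole table across the wire no matter how big it gets.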

This data streams crap seems like it was written by an architecture astronaut. Yeah right! Who really writes code like that? (hint: no one).

Hth,

Liz

>>>
...
So, although I truly want to know how you're doing, I also have a question for ya. It stems from thinking long and hard how to go about building the architecture for my latest startup. I've read interesting posts such as Two data streams for a happy website (http://gojko.net/2008/03/03/two-data-streams-for-a-happy-website/) and Scalability (http://romeda.org/blog/2008/05/scalability.html) from Twitter's Blaine Cook, but it's hard to figure out where to go from here.

Basically... how much time do I spend planning how the architecture can scale before I know the metrics? I've read that I shouldn't worry about it until I'm there, while others say to keep 2 separate data streams (one that requires users to be logged in and one that doesn't). I expect to have to tweak the solution if we do reach limits, but the more I know before I start writing code the better.

Have you (or friends you know) hit this problem? What advice could you provide?

Thanks in advance!

Best,
John
>>>

4 comments:

Sandy said...

Nice post...for whatever reason I was reading this related stuff in the last couple of days:

From the front lines of a web 2.0 app launch:
http://www.expatsoftware.com/articles/2008/03/6-million-hits-day-time-to-think-scale.html

Classic slides on performance optimization, reminding us to measure things and then be smart about how we fix them instead of blindly "optimizing":
http://www.gnome.org/~federico/docs/2005-GNOME-Summit/html/img11.html
http://www.gnome.org/~federico/docs/2005-GNOME-Summit/html/img18.html

Of course those are more desktop-centric, but they still apply.

John Brennan said...

"Let's take a moment and remember WHY caching works so well: it serves up content that an app server like RoR or Django could give two craps about such as javascript, css and images"

@Liz: really informative post!

I completely agree that the first thing to do is move those huge js files to the bottom of the page. Chances are your behavior code is waiting for the DOM to finish anyway, so why load it prematurely, especially before your css, which breathes life into your app (without it the page would look like a pine email client.. yuck!)

However, I would disagree that caching is only for the stuff RoR or whatever your dynamic language may be doesn't care about. Sure, some data is easier to cache than others (twitter is a hard problem), but why recalculate something that doesn't change often? And even when it does change, users probably don't care if a category has 10,000 items or 10,005, so that would be a good candidate to cache for X minutes too.

Also, I'm surprised you didn't mention memcache at all. What do you use to cache db results?

eleddy said...

@john memcache is def the bomb but if you are already thinking about memcache then you are thinking prematurely. I've actually never had the need to cache db results because my main bottlenecks are elsewhere.