Sunday, March 17, 2013

Simple Scala habits saved us on launch

Months after it happened, the launch of Egraphs still feels like a blur. There was the crazy rush to finish the site in the weeks leading up, publishing the site from Tropicana Field as the Rays played the Red Sox, and watching baseball fans show up and actually start spending money.

I would have preferred to avoid any development death marches, but we were committed to July 12 because we were working with external partners including MLB, the Rays, and an initial lineup of celebrity partners including David Ortiz and Pedro Martinez. How many startups out there can say that they have so much support on day one?

One thing that naturally occurred is that we stopped writing tests in the six weeks leading up to launch. It’s not something I’m proud of, but amazingly the site didn’t croak when customers started using it. We saw zero null pointer exceptions and very few logic bugs, and I attribute that to easy Scala habits we adopted to great effect.

For example, to achieve null-safety, we avoided calling Option.get like the plague. Scala lets you safely access a value that might or might not be present by using map, match, or a for-comprehension. That one habit is responsible for the rarity of NPEs for us.
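Here is a minimal sketch of that habit. The Celebrity type and its fields are hypothetical and exist only for illustration; the point is that none of these three styles ever calls Option.get.

```scala
// Hypothetical domain type, for illustration only.
final case class Celebrity(name: String, twitterHandle: Option[String])

object OptionHabits {
  // map: transform the value only if it is present.
  def profileLink(c: Celebrity): Option[String] =
    c.twitterHandle.map(handle => s"https://twitter.com/$handle")

  // match: handle the present and absent cases explicitly.
  def displayHandle(c: Celebrity): String =
    c.twitterHandle match {
      case Some(handle) => "@" + handle
      case None         => "(no handle)"
    }

  // for-comprehension: combine several optional values;
  // the result is None if any step is missing.
  def greeting(maybeCelebrity: Option[Celebrity]): Option[String] =
    for {
      c      <- maybeCelebrity
      handle <- c.twitterHandle
    } yield c.name + " tweets as @" + handle
}
```

In every case the "nothing there" path is handled by the type rather than by a runtime exception.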

Secondly, we relied on type-safety as much as possible. If branching on the state of something to determine which logic path to execute, we got specific with types. It’s a beautiful thing to be able to write a lot of code and feel confident that if it compiles, then it likely works.
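As a rough sketch of what "getting specific with types" can look like, here is a sealed trait that models the states of an order. The state names and fields are hypothetical, not taken from our codebase. Because the trait is sealed, the compiler warns when a match forgets one of the states.

```scala
// Hypothetical order states; each state carries only the data valid for it.
sealed trait OrderState
case object AwaitingSignature extends OrderState
final case class Fulfilled(egraphId: Long) extends OrderState
final case class Rejected(reason: String) extends OrderState

object OrderMessages {
  // Branch on the type instead of on flags or nullable fields.
  def statusMessage(state: OrderState): String = state match {
    case AwaitingSignature => "Waiting for the celebrity to sign"
    case Fulfilled(id)     => "Egraph " + id + " is ready"
    case Rejected(reason)  => "Order rejected: " + reason
  }
}
```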

(Code examples of these simple Scala habits can be found in this presentation.)

Building Faceted Search With PostgreSQL

Sig wrote a great post about how he built our fully-functional marketplace. It’s pretty cool how quickly things can come together with just one engineer making good decisions and using cool technologies.

Check it out in full detail here.

Controller filters in Play and Scala


We were chilling with other students enrolled in the Coursera Scala class, wondering aloud why one would ever use currying. Admittedly, the ability to organize parameter lists for your functions feels somewhat academic, but we have managed to use function currying in ways that we’re really happy with.
In our web app, we have many controllers that respond to different kinds of requests. Some of those controllers respond to simple GET requests that are not expected to alter state on the server. Some respond to POST requests that are expected to change state and also check CSRF tokens. And some other controllers respond to POST requests to API endpoints for which CSRF tokens are not relevant.

We have a class called ControllerMethod that is flexible enough to handle these general types of requests, whether GET, POST, or API POST, and yet can be tailored with the specific logic that should execute for that route.

class ControllerMethod {
  // The first parameter list configures the filter;
  // the second takes the action to wrap.
  def apply[A](dbSettings: ControllerDBSettings = WithoutDBConnection)
              (action: Action[A]): Action[A] = {
    Action(action.parser) { request =>
      dbSettings match {
        case WithoutDBConnection => action(request)
        // … other cases depending on backend connection requirements …
      }
    }
  }
}
Treating ControllerMethod as a “root method” for controllers, we can use it to streamline our controllers like, for example, the one that returns an egraph:

// The empty parentheses select the default dbSettings (WithoutDBConnection).
def getEgraph(id: Long) = controllerMethod() {
  Action { implicit request =>
    egraphStore.findFulfilledEgraph(id) match {
      case Some(egraph) => Ok(renderEgraphPage(egraph))
      case None => NotFound("No Egraph found")
    }
  }
}

This simple pattern eliminates tons of boilerplate code that would otherwise need to be written into each controller. Moreover, the POST controller that handles changes applied to an egraph uses a slightly modified version of controllerMethod. So with minimal variation in code, we give the POST controller the functionality you would expect it to have, such as a writable database context and protection against CSRF attacks.

Currying is not a technique I envisioned using much when I was first introduced to it. But as we matured as Scala devs and left old Java habits behind, we have used this technique to write dozens of controllers (and other classes) across our web app. Based on experience, currying has proven to be an extremely powerful pattern for bringing organizational sanity to our codebase. =)
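To see the shape of the pattern without Play in the way, here is a minimal, framework-free sketch. Everything in it (Request, Response, DbSettings, postControllerMethod) is a hypothetical stand-in rather than our real code; it only illustrates how a curried wrapper separates configuration from the handler, and how a POST variant can be layered on top.

```scala
// Hypothetical stand-ins for the framework's request and response types.
final case class Request(method: String, csrfToken: Option[String])
final case class Response(status: Int, body: String)

sealed trait DbSettings
case object WithoutDb extends DbSettings
case object WritableDb extends DbSettings

object Filters {
  type Handler = Request => Response

  // First parameter list: configuration. Second: the handler to wrap.
  def controllerMethod(db: DbSettings = WithoutDb)(handler: Handler): Handler =
    request =>
      db match {
        case WithoutDb  => handler(request)
        case WritableDb => handler(request) // real code would open a transaction here
      }

  // The POST variant reuses controllerMethod and adds a CSRF check.
  def postControllerMethod(expectedToken: String, db: DbSettings = WritableDb)
                          (handler: Handler): Handler =
    controllerMethod(db) { request =>
      if (request.csrfToken.contains(expectedToken)) handler(request)
      else Response(403, "Invalid CSRF token")
    }
}
```

A controller then reads as configuration followed by a block, for example Filters.controllerMethod() { request => ... } for a GET or Filters.postControllerMethod(token) { request => ... } for a POST.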

Learning by doing: HTTPS requests


Because we value the security of our customers’ and celebrities’ data, we decided early on that both outbound and inbound requests would be served over SSL.

To configure SSL for a site, your DNS and web hosting vendors likely have clear instructions to follow. As for connecting to third-party services via HTTPS, I often felt incredulous that I didn’t come across any good and comprehensive guides online when I was figuring this stuff out, which is a bit weird since this must be a common task for many online companies. Like much of my recently acquired tech ops knowledge, I feel like each nugget of learning was hard-won, and I wish to share what I know here.
To make requests via HTTPS, your app needs a truststore that contains certificates from the parties that you expect to communicate with or from Certificate Authorities that you trust to identify other parties. There is also another concept called a keystore that contains private keys, supposedly for implementing server-side SSL. Now, already things can get confusing. I got those definitions from this Stack Overflow discussion that says that truststore and keystore are different concepts, and yet in practice the two terms are used interchangeably. I’ve decided not to let this ambiguity ruin my life.

Download certificates
The first step is to download certificates from the sites and services with which your server will communicate via HTTPS. For example, visit https://api.stripe.com using Chrome, click the lock icon in the browser bar, and look for “Certificate Information.” You should see a certificate tree that looks something like:
[image: certificate tree]
Each of those is a certificate with its own expiration date. Root-level certificates usually have expiration dates farthest in the future, whereas leaf certificates expire more frequently.

I recommend downloading all of the available certificates so that they can be imported into the keystore. You don’t want to have your app’s communication with crucial services broken without warning because a leaf certificate expired or was replaced. Ahem, yes, I learned that by experience.
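One way to avoid being surprised by an expiring certificate is to check the store periodically. The following is a minimal sketch using the standard java.security.KeyStore API; the object and method names are my own invention, and how you load the store and where you report the results are left open.

```scala
import java.security.KeyStore
import java.security.cert.X509Certificate
import java.util.Date

object CertificateExpiry {
  // Returns (alias, expiry date) for every X.509 certificate in the
  // store that expires before the given cutoff.
  def expiringBefore(store: KeyStore, cutoff: Date): List[(String, Date)] = {
    val aliases = store.aliases() // a java.util.Enumeration[String]
    Iterator
      .continually(aliases)
      .takeWhile(_.hasMoreElements)
      .map(_.nextElement())
      .toList
      .flatMap { alias =>
        store.getCertificate(alias) match {
          case cert: X509Certificate if cert.getNotAfter.before(cutoff) =>
            Some(alias -> cert.getNotAfter)
          case _ => None
        }
      }
  }
}
```

Running something like this with a cutoff a month or two out gives you time to download a replacement certificate before the old one lapses.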

Import certificates into truststore
In a Scala/Java environment, the javax.net.ssl.trustStore system property will need to point to a keystore that includes the certificates of trusted third-party sites. (See? Very interchangeable.) 
To prepare the truststore, you can run a command that looks like:
keytool -import -alias stripe.com -file api.stripe.com.cer -keystore keystore
Do this on each certificate you want to import. There are plenty of resources out there on keytool commands.

Then fire up your app and you should be able to communicate securely with APIs all over the interwebs!
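If you would rather set the system property from code than with a -D flag on the command line, a sketch might look like the following. The object and method names are hypothetical, the path and password are whatever you gave keytool, and this assumes the store was created in the JDK’s default keystore format. The properties need to be set before the app makes its first HTTPS connection.

```scala
import java.io.FileInputStream
import java.security.KeyStore

object TrustStoreSetup {
  // Point the JVM's default SSL machinery at our truststore.
  def useTrustStore(path: String, password: String): Unit = {
    System.setProperty("javax.net.ssl.trustStore", path)
    System.setProperty("javax.net.ssl.trustStorePassword", password)
  }

  // Sanity check: load the file and count its entries, so that a bad
  // path or password fails at startup instead of on the first request.
  def entryCount(path: String, password: String): Int = {
    val store = KeyStore.getInstance(KeyStore.getDefaultType)
    val in = new FileInputStream(path)
    try store.load(in, password.toCharArray)
    finally in.close()
    store.size()
  }
}
```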

Learning by doing: managing servers

I think writing software is fun. Less fun is stressing out when the servers running your code crap out. As tech teams go, my compatriots and I are more app developers than tech ops/dev ops/whatever. Our first several months of running a site were more eventful than I was hoping.

First, there was a huge AWS outage in June that took out both our app servers and our database servers several days before our launch date. On our launch date, we roundhouse kicked our own site when we reacted to high traffic by spinning up too many app servers and accidentally overloading the number of available database connections… we now know better. Excitement of this sort continued for a few months as we continued to have site issues almost every week.
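The arithmetic we should have done beforehand is simple: every app server holds its own connection pool, so the pools of all servers together must fit under the database’s connection limit. A tiny sketch follows; the function name and all numbers are hypothetical and are not our actual settings.

```scala
object ConnectionBudget {
  // Largest number of app servers whose connection pools fit within the
  // database's connection limit, leaving some connections in reserve.
  def maxAppServers(dbMaxConnections: Int,
                    poolSizePerServer: Int,
                    reservedForAdmin: Int = 0): Int = {
    require(poolSizePerServer > 0, "pool size must be positive")
    math.max(0, (dbMaxConnections - reservedForAdmin) / poolSizePerServer)
  }
}
```

With a limit of 100 connections, 10 held in reserve, and pools of 30, the fourth app server is the one that tips the database over.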

A few doozies were particularly memorable.

One night during what should have been a routine deployment, I ran an upgrade script to change the database from version 23 to 24. Then I tried to deploy new app code, but the app servers failed to start because of an inconsistent database version, and that was a non-starter for a website built on the Play 1.2 framework. That’s funny, how could that possibly be? I double-checked that the database schema was the correct version, and yet the app servers failed to start several more times. Meanwhile, our site was down and I started to hyperventilate because I felt crazy. I downed a bottle of pinot noir to stave off a heart attack. Eventually, I noticed that the database’s load balancer incorrectly reported version 23 while the database itself was on version 24.

So replication was broken!! Our database load balancer diverts many SELECT queries to the replica, which was still a version behind, and that incorrect version was what the application read. To fix the site, I pointed the application directly at the master database instead of at the load balancer, thereby cutting the replica out of the picture. And we ran our site directly on the master database for the next few days until we fixed replication. The whole ordeal probably lasted less than an hour, but time slows down to make it feel much longer. I swore to never write non-backward-compatible schema changes no matter how trivial a column rename feels, and to always keep a bottle of pinot on hand.

Doozy #2 is a story from a day when MLB sent emails to millions of baseball fans on our behalf. It was to be our highest traffic day yet, and it went a bit like this except that our site was much less prepared. In the earlier part of the day some visitors to our site were experiencing half-minute page loads omgz =(. It turned out that a CSS file was often taking a long time to download even though we had just started using Amazon’s CloudFront CDN to serve that file, implying that CloudFront was making roundtrips to the application server. We increased the cache period of that asset on CloudFront, and that little change got our site back to loading pages within a second even under load! Actually it was our friends at CloudBees, our app server PaaS vendor, who made that observation.

And there are more stories like that chronicled in wiki pages that we write about each site incident.

Learning occurred, to say the least. But probably the most valuable thing I learned was not any technical kernel of knowledge but rather how to not freak out about site issues. They’re going to happen when you run a web business. The key is to chill out and use them as training moments, especially while our site is still young.

Despite all of this, I’ve calculated that we maintained >99.98% uptime. Not bad for a few app developers, if I do say so myself.

iPad app distribution for the stars

Celebrities on Egraphs use a special iPad app to connect with their fans. We didn’t want to put this app on the Apple App Store because it is not meant for general usage, and also because we wanted to distribute upgrades without going through Apple. For us, security is crucial because much of the value of an egraph is its authenticity.

The recipe is actually quite simple.

Step 1 is initial distribution via email: To get the initial app to the iPad, we send an email to the celebrity with a link that installs the initial version using the itms-services protocol.

Step 2 is to install the actual app via authenticated requests: When the celebrity logs into the initial app with a pre-provisioned account and password, that kicks off the process to download the actual app. The iPad makes a request to an egraphs.com server to get the locations of the .plist and .ipa files of the latest Egraphs iPad app. The .ipa lives privately on Amazon’s S3 and is only accessible through a short-lived authenticated REST request generated by the server.

Step 3 is to enable easy app upgrades: To distribute updates of the Egraphs iPad app, we just update the .ipa URL on our servers. The next time a celebrity user logs into the iPad app, step 2 will kick off again and encourage the celebrity to allow the upgrade.
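To give a feel for the "short-lived authenticated request" in step 2, here is a generic sketch of an expiring, HMAC-signed link. This is not S3’s actual signing scheme and not our production code; the object name, the query parameter names, and the message layout are all made up for illustration. The idea is simply that the server signs the path together with an expiry time, so the link cannot be altered and stops working once it expires.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object ExpiringLinks {
  // Hex-encoded HMAC-SHA256 of the message under the shared secret.
  private def hmacHex(secret: String, message: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(message.getBytes(UTF_8)).map(b => "%02x".format(b & 0xff)).mkString
  }

  // The signature covers both the path and the expiry time.
  def signature(path: String, expiresAtEpochSec: Long, secret: String): String =
    hmacHex(secret, path + "\n" + expiresAtEpochSec)

  def signedLink(path: String, expiresAtEpochSec: Long, secret: String): String =
    path + "?expires=" + expiresAtEpochSec +
      "&signature=" + signature(path, expiresAtEpochSec, secret)

  // Valid only if the link has not expired and the signature matches.
  def isValid(path: String, expiresAtEpochSec: Long, providedSignature: String,
              secret: String, nowEpochSec: Long): Boolean = {
    val expected = signature(path, expiresAtEpochSec, secret)
    nowEpochSec <= expiresAtEpochSec &&
      MessageDigest.isEqual(expected.getBytes(UTF_8), providedSignature.getBytes(UTF_8))
  }
}
```

Because the expiry is part of the signed message, a client cannot extend the lifetime of a link by editing the query string.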

The problem of controlling distribution to iPad apps actually reminds me of what enterprises do with custom Box apps (I used to work at Box). I like to describe our iPad app distribution system as an enterprise-grade solution for your favorite celebrities.

Why I went into computer science


I just came across this video of my computer science lecturer.

“You are like geometers and you’re living in the time of Euclid.”

He said the same thing to conclude the final lecture when I took this class in winter 2005, and the big picture of why we study this stuff just clicked for me. I do believe that Mehran Sahami has influenced generations of Stanford students.