Sunday, March 17, 2013

Simple Scala habits saved us on launch

Months after it happened, the launch of Egraphs still feels like a blur. There was the crazy rush to finish the site in the weeks leading up, publishing the site from Tropicana Field as the Rays played the Red Sox, and watching baseball fans show up and actually start spending money.

I would have preferred to avoid any development death marches, but we were committed to July 12 because we were working with external partners including the MLB, the Rays, and an initial lineup of celebrity partners including David Ortiz and Pedro Martinez. How many startups out there can say that they have so much support on day one?

One thing that naturally occurred is that we stopped writing tests in the six weeks leading up to launch. It’s not something I’m proud of, but amazingly the site didn’t croak when customers started using it. We saw zero null pointer exceptions and very few logic bugs, and I attribute that to easy Scala habits we adopted to great effect.

For example, to achieve null-safety, we avoided calling Option.get like the plague. Scala allows you to access objects that might or might not have been instantiated with map, or match, or a for-comprehension. That one habit is responsible for the rarity of NPEs for us.
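As a minimal sketch of that habit (the Celebrity model, the in-memory map, and the method names are hypothetical stand-ins, not our production code), here are the three ways to reach inside an Option without ever calling Option.get:

```scala
object NullSafety {
  // Hypothetical model and lookup table, for illustration only.
  final case class Celebrity(id: Long, publicName: String)

  private val celebrities = Map(
    1L -> Celebrity(1L, "David Ortiz"),
    2L -> Celebrity(2L, "Pedro Martinez")
  )

  // A lookup that may or may not find something returns an Option, never null.
  def findCelebrity(id: Long): Option[Celebrity] = celebrities.get(id)

  // map: transform the value only if it is present.
  def greeting(id: Long): Option[String] =
    findCelebrity(id).map(c => s"Hello, ${c.publicName}")

  // match: handle the present and absent cases explicitly.
  def displayName(id: Long): String =
    findCelebrity(id) match {
      case Some(c) => c.publicName
      case None    => "Unknown celebrity"
    }

  // for-comprehension: chain lookups; any None short-circuits the whole result.
  def pairing(a: Long, b: Long): Option[String] =
    for {
      first  <- findCelebrity(a)
      second <- findCelebrity(b)
    } yield s"${first.publicName} and ${second.publicName}"
}
```

In each case the missing-value path has to be handled somewhere, so there is no line on which a null can sneak through.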

Secondly, we relied on type-safety as much as possible. If branching on the state of something to determine which logic path to execute, we got specific with types. It’s a beautiful thing to be able to write a lot of code and feel confident that if it compiles, then it likely works.
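A rough illustration of getting specific with types, assuming a made-up set of order states (OrderState and its cases are hypothetical names, not our real domain model): model each state as its own type under a sealed trait, so the compiler can check that every branch is handled.

```scala
object TypedStates {
  // Hypothetical states, for illustration only.
  sealed trait OrderState
  case object AwaitingSignature extends OrderState
  final case class Signed(signedBy: String) extends OrderState
  final case class Rejected(reason: String) extends OrderState

  // Because OrderState is sealed, the compiler warns if a case is missing here.
  def describe(state: OrderState): String = state match {
    case AwaitingSignature => "Waiting for the celebrity to sign"
    case Signed(by)        => s"Signed by $by"
    case Rejected(reason)  => s"Rejected: $reason"
  }
}
```

Compare that with branching on a status string or an integer code, where a forgotten case only shows up at runtime.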

(Code examples of these simple Scala habits can be found in this presentation.)

Building Faceted Search With PostgreSQL

Sig wrote a great post about how he built our fully-functional marketplace. It’s pretty cool how quickly things can come together with just one engineer making good decisions and using cool technologies.

Check it out in full detail here.

Controller filters in Play and Scala


We were chilling with other students enrolled in the Coursera Scala class wondering aloud why one would ever use currying. Admittedly, the ability to organize parameter lists for your functions feels somewhat academic, but we have managed to use function currying in ways that we’re really happy with.

In our web app, we have many controllers that respond to different kinds of requests. Some of those controllers respond to simple GET requests that are not expected to alter state on the server. Some respond to POST requests that are expected to change state and also check CSRF tokens. And some other controllers respond to POST requests to API endpoints for which CSRF tokens are not relevant.

We have a class called ControllerMethod that is flexible enough to handle these general types of requests whether GET or POST or Api POST, and yet can be tailored with the specific logic that should execute for that route.

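import play.api.mvc.Action

// ControllerDBSettings and WithoutDBConnection are defined elsewhere in our codebase.
class ControllerMethod {
  // Curried: the first parameter list configures the filter,
  // the second takes the route-specific action to wrap.
  def apply[A](dbSettings: ControllerDBSettings = WithoutDBConnection)
              (action: Action[A]): Action[A] =
  {
    Action(action.parser) { request =>
      dbSettings match {
        case WithoutDBConnection => action(request)
        // … other cases depending on backend connection requirements …
      }
    }
  }
}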
Treating ControllerMethod as a “root method” for controllers, we can use it to streamline our controllers like, for example, the one that returns an egraph:

// The empty first parameter list picks up the default WithoutDBConnection setting.
def getEgraph(id: Long) = controllerMethod() {
  Action { implicit request =>
    egraphStore.findFulfilledEgraph(id) match {
      case Some(egraph) => Ok(renderEgraphPage(egraph))
      case None => NotFound("No Egraph found")
    }
  }
}

This simple pattern eliminates tons of boilerplate code that would otherwise need to be written into each controller. Moreover, the POST controller that handles changes applied to an egraph uses a slightly modified version of controllerMethod. So with minimal variation in code, we give the POST controller the functionality you would expect it to have, such as a writable database context and protection against CSRF attacks.

Currying is not a technique I envisioned using much when I was first introduced to it. But as we matured as Scala devs and left old Java habits behind, we have used this technique to write dozens of controllers (and other classes) across our web app. In our experience, currying has proven to be an extremely powerful pattern for bringing organizational sanity to our codebase. =)
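To make the shape of the pattern concrete without pulling in Play, here is a self-contained sketch. Everything in it is a simplified assumption: Request, Result, Handler, CsrfPolicy, and the hard-coded token are stand-ins for illustration, not Play's API and not our actual ControllerMethod.

```scala
object CurriedFilters {
  // Simplified stand-ins for a web framework's request, result, and action types.
  final case class Request(method: String, csrfToken: Option[String] = None)
  final case class Result(status: Int, body: String)
  type Handler = Request => Result

  sealed trait CsrfPolicy
  case object NoCsrfCheck extends CsrfPolicy
  case object RequireCsrfToken extends CsrfPolicy

  // Hypothetical fixed token; a real app would compare against a per-session value.
  val expectedToken = "secret-token"

  // First parameter list: how the route should be guarded.
  // Second parameter list: the route-specific logic to wrap.
  def controllerMethod(csrf: CsrfPolicy = NoCsrfCheck)(handler: Handler): Handler =
    request =>
      csrf match {
        case NoCsrfCheck => handler(request)
        case RequireCsrfToken =>
          if (request.csrfToken.contains(expectedToken)) handler(request)
          else Result(403, "Invalid CSRF token")
      }

  // A GET route takes the defaults.
  val getEgraph: Handler = controllerMethod() { _ => Result(200, "egraph page") }

  // A POST route changes only the first parameter list.
  val postEgraph: Handler =
    controllerMethod(RequireCsrfToken) { _ => Result(200, "egraph updated") }

  // Supplying just the first list yields a reusable wrapper for many routes.
  val csrfProtected: Handler => Handler = controllerMethod(RequireCsrfToken)
}
```

The payoff of the two parameter lists is visible at the call sites: the policy reads like configuration, and the trailing block reads like an ordinary controller body.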

Learning by doing: HTTPS requests


Because we value the security of our customers’ and celebrities’ data, we decided early on that both outbound and inbound requests would be served over SSL.

To configure SSL for a site, your DNS and web hosting vendors likely have clear instructions to follow. As for connecting to third-party services via HTTPS, I was surprised that I never came across a good, comprehensive guide online when I was figuring this stuff out, which is a bit weird since this must be a common task for many online companies. Like much of my recently acquired tech ops knowledge, each nugget of learning felt hard-won, and I wish to share what I know here.

To make requests via HTTPS, your app needs a truststore that contains certificates from the parties that you expect to communicate with or from Certificate Authorities that you trust to identify other parties. There is also another concept called a keystore that contains private keys, supposedly for implementing server-side SSL. Now, already things can get confusing. I got those definitions from this Stack Overflow discussion that says that truststore and keystore are different concepts, and yet it really seems that the two terms are used interchangeably, probably because in Java both are the same kind of file managed by the same keytool utility. I’ve decided not to allow this ambiguity to ruin my life.

Download certificates
The first step is to download certificates from the sites and services with which your server will communicate via HTTPS. For example, visit https://api.stripe.com using Chrome, click the lock icon in the browser bar, and look for “Certificate Information.” You should see a certificate tree that looks something like:
[Image: certificate tree for api.stripe.com]
Each of those is a certificate with its own expiration date. Root-level certificates usually have expiration dates farthest in the future, whereas leaf certificates expire more frequently.

I recommend downloading all of the available certificates so that they can be imported into the keystore. You don’t want to have your app’s communication with crucial services broken without warning because a leaf certificate expired or was replaced. Ahem, yes, I learned that by experience.

Import certificates into truststore
In a Scala/Java environment, the javax.net.ssl.trustStore system property will need to point to a keystore that includes the certificates of trusted third-party sites. (See? Very interchangeable.) 
To prepare the truststore, you can run a command that looks like:
keytool -import -alias stripe.com -file api.stripe.com.cer -keystore keystore
Do this on each certificate you want to import. There are plenty of resources out there on keytool commands.

Then fire up your app and you should be able to communicate securely with APIs all over the interwebs!
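If you would rather check the truststore from code than squint at keytool output, something like the following sketch can help. It only uses the JDK's java.security.KeyStore API and the standard javax.net.ssl.* system properties; the object name, the helper names, and any paths or passwords you pass in are assumptions for illustration.

```scala
import java.io.FileInputStream
import java.security.KeyStore
import scala.collection.mutable.ListBuffer

object TrustStoreCheck {
  // Load a keystore file such as the one produced by the keytool command above.
  // Assumes the file is in a format the JVM's default keystore type can read.
  def loadTrustStore(path: String, password: String): KeyStore = {
    val store = KeyStore.getInstance(KeyStore.getDefaultType)
    val in = new FileInputStream(path)
    try store.load(in, password.toCharArray)
    finally in.close()
    store
  }

  // List the aliases (e.g. "stripe.com") so you can confirm what was imported.
  def aliases(store: KeyStore): List[String] = {
    val found = ListBuffer.empty[String]
    val all = store.aliases()
    while (all.hasMoreElements) found += all.nextElement()
    found.toList
  }

  // Point the JVM's default SSL machinery at the truststore.
  // This needs to happen before the first HTTPS connection is made.
  def useTrustStore(path: String, password: String): Unit = {
    System.setProperty("javax.net.ssl.trustStore", path)
    System.setProperty("javax.net.ssl.trustStorePassword", password)
  }
}
```

The same two properties can also be passed on the command line as -Djavax.net.ssl.trustStore=… and -Djavax.net.ssl.trustStorePassword=… when starting the JVM.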

Learning by doing: managing servers

I think writing software is fun. Less fun is stressing out when the servers running your code crap out. As tech teams go, my compatriots and I are more app developers than tech ops/dev ops/whatever. Our first several months of running a site were more eventful than I was hoping.

First, there was a huge AWS outage in June that took out both our app servers and our database servers several days before our launch date. On our launch date, we roundhouse kicked our own site when we reacted to high traffic by spinning up too many app servers and accidentally overloading the number of available database connections… we now know better. Excitement of this sort continued for a few months as we continued to have site issues almost every week.
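The launch-day lesson boils down to a back-of-the-envelope check that is worth doing before scaling out. The sketch below is illustrative only: the function name and every number in the usage note are hypothetical, not our actual settings.

```scala
object ConnectionBudget {
  // How many app servers can run before their connection pools, taken together,
  // exceed what the database will accept. Some connections are held back for
  // admin tools, migrations, and monitoring.
  def maxAppServers(dbMaxConnections: Int,
                    reservedConnections: Int,
                    poolSizePerServer: Int): Int = {
    require(poolSizePerServer > 0, "pool size must be positive")
    math.max(0, (dbMaxConnections - reservedConnections) / poolSizePerServer)
  }
}
```

For example, with a made-up limit of 100 database connections, 10 held in reserve, and a pool of 15 per app server, maxAppServers(100, 10, 15) comes out to 6; a seventh server would start failing to get connections.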

A few doozies were particularly memorable.

One night during what should have been a routine deployment, I ran an upgrade script to change the database from version 23 to 24. Then I tried to deploy new app code, but the app servers failed to start because of an inconsistent database version, and that was a non-starter for a website built on the Play 1.2 framework. That’s funny, how could that possibly be? I double-checked that the database schema was the correct version, and yet the app servers failed to start several more times. Meanwhile, our site is down and I start to hyperventilate because I feel crazy. I down a bottle of pinot noir to stave off a heart attack. Eventually, I noticed that the database’s load balancer incorrectly reported version 23 while the database itself was on version 24.

So replication was broken!! Our database load balancer diverts many SELECT queries to the replica, which was still a version behind, and that incorrect version was what the application read. To fix the site, I pointed the application directly at the master database instead of at the load balancer, thereby cutting the replica out of the picture. And we ran our site directly on the master database for the next few days until we fixed replication. The whole ordeal probably lasted less than an hour, but time slows down to make it feel much longer. I swore to never write non-backward-compatible schema changes no matter how trivial a column rename feels, and to always keep a bottle of pinot on hand.

Doozy #2 is a story from a day when MLB sent emails to millions of baseball fans on our behalf. It was to be our highest traffic day yet, and it went a bit like this except that our site was much less prepared. In the earlier part of the day some visitors to our site were experiencing half-minute page loads omgz =(. It turned out that a CSS styles file was often taking a long time to download even though we had just started using Amazon’s CloudFront CDN to serve that file, implying that CloudFront was making roundtrips to the application server. We increased the cache period of that asset on CloudFront, and that little change got our site back to loading pages within a second even under load! Actually it was our friends at CloudBees, our app server PaaS vendor, who made that observation.

And there are more stories like that chronicled in wiki pages that we write about each site incident.

Learning occurred, to say the least. But probably the most valuable thing I learned was not any technical kernel of knowledge but rather how to not freak out about site issues. They’re going to happen when you run a web business. The key is to chill out and use them as training moments, especially while our site is still young.

Despite all of this, I’ve calculated that we maintained >99.98% uptime. Not bad for a few app developers, if I do say so myself.
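For a sense of scale, here is the arithmetic behind an uptime figure as a small sketch. The 30-day window in the usage note is an assumption picked for illustration; it is not the period the number above was measured over.

```scala
object Uptime {
  // Percentage of a window during which the site was up.
  def uptimePercent(totalMinutes: Double, downtimeMinutes: Double): Double =
    100.0 * (totalMinutes - downtimeMinutes) / totalMinutes

  // Minutes of downtime a window can absorb while still meeting a target percentage.
  def downtimeBudgetMinutes(totalMinutes: Double, targetPercent: Double): Double =
    totalMinutes * (100.0 - targetPercent) / 100.0
}
```

A 30-day window has 30 × 24 × 60 = 43,200 minutes, so a 99.98% target leaves 43,200 × 0.0002 = 8.64 minutes of downtime. An hour-long outage eats several months of that budget.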

iPad app distribution for the stars

Celebrities on Egraphs use a special iPad app to connect with their fans. We didn’t want to put this app on the Apple App Store because it is not meant for general usage, and also because we wanted to distribute upgrades without going through Apple. For us, security is crucial because much of the value of an egraph is its authenticity.

The recipe is actually quite simple.

Step 1 is initial distribution via email: To get the initial app to the iPad, we send an email to the celebrity with a link that installs the initial version using the itms-services protocol.

Step 2 is to install the actual app via authenticated requests: When the celebrity logs into the initial app with a pre-provisioned account and password, that kicks off the process to download the actual app. The iPad makes a request to an egraphs.com server to get the locations of the .plist and .ipa files of the latest Egraphs iPad app. The .ipa lives privately on Amazon’s S3 and is only accessible through a short-lived authenticated REST request generated by the server.

Step 3 is to enable easy app upgrades: To distribute updates of the Egraphs iPad app, we just update the .ipa URL on our servers. The next time a celebrity user logs into the iPad app, step 2 will kick off again and encourage the celebrity to allow the upgrade.
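The short-lived authenticated link in step 2 is the interesting bit. The sketch below shows the general idea with a plain HMAC over the path and an expiry time. To be clear about what it is not: this is not S3's actual request-signing scheme (in practice you would let the AWS SDK generate the pre-signed URL), and the object name, function names, URL layout, and secret are all made up for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object SignedLinks {
  // Hex-encoded HMAC-SHA256 of the message under the shared secret.
  private def hmacHex(secret: String, message: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(message.getBytes(UTF_8)).map(b => f"${b & 0xff}%02x").mkString
  }

  // Server side: produce a link that is only good until expiresAtEpochSec.
  def sign(secret: String, path: String, expiresAtEpochSec: Long): String = {
    val signature = hmacHex(secret, s"$path\n$expiresAtEpochSec")
    s"$path?expires=$expiresAtEpochSec&signature=$signature"
  }

  // Storage side: accept the request only if the signature matches
  // and the link has not expired.
  def isValid(secret: String, path: String, expiresAtEpochSec: Long,
              signature: String, nowEpochSec: Long): Boolean = {
    val expected = hmacHex(secret, s"$path\n$expiresAtEpochSec")
    val matches = MessageDigest.isEqual(expected.getBytes(UTF_8), signature.getBytes(UTF_8))
    matches && nowEpochSec <= expiresAtEpochSec
  }
}
```

Because the expiry time is part of the signed message, a client cannot extend a link's life by editing the query string, and a leaked link stops working on its own.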

The problem of controlling distribution to iPad apps actually reminds me of what enterprises do with custom Box apps (I used to work at Box). I like to describe our iPad app distribution system as an enterprise-grade solution for your favorite celebrities.

Why I went into computer science


I just came across this video of my computer science lecturer.

“You are like geometers and you’re living in the time of Euclid.”

He said the same thing to conclude the final lecture when I took this class in winter 2005, and the big picture of why we study this stuff just clicked for me. I do believe that Mehran Sahami has influenced generations of Stanford students.

On starting up in Seattle

I’ve been living in Seattle for a year and a half now. Does that qualify me to talk about the Seattle startup scene? Let’s find out together.

Four of the five cofounders moved from the SF Bay Area to start Egraphs in a Seattle house, where I learned that the Central District is cheap and has great Ethiopian food. It may seem strange that we left San Francisco and Palo Alto to start a company, and around the same time my former CEO even wrote about how he had to leave Seattle to get started in 2005. But Seattle was a new city for us and a new adventure, and it has worked out pretty well.

For one, it is way cheaper to run a business in Seattle. We have very roughly estimated that it would have cost twice as much to operate in San Francisco due to higher business expenses and taxes and rents and salaries. For example, we now have an office in Pioneer Square, which has a high concentration of startups, and the rent we pay would get us an office one-fourth the size in Palo Alto. For comparing other expenses, just plug in higher numbers across the board for operating in SF. We have essentially been able to run lean since we started.

When I started checking out the local tech scene, I was pleasantly surprised to find a thriving startup community. The monthly Seattle Hacker News meetup is regularly attended by 100+ industry folks, and many of them run their own startups. I remember chatting in a circle of attendees and finding that every person in that group was running something. Predictably, none of them were interested in joining Egraphs, and also predictably many of them had previous experience at Microsoft and Amazon. At other events, I was thankful to find people to give me a crash course on AWS topics. The scene is not quite as bustling as the Bay Area, where you can completely book your schedule with industry shindigs if you really wanted to, but you can definitely find startup types to socialize with in Seattle.

With respect to engineering recruiting, we have had good fortune to find some awesome engineers to join our team. We didn’t find any of them through the traditional avenues of online job postings (though I did learn that Resumator is a cinch to use), nor did we find them through meetups or campus recruiting. We met them in informal social settings, and one we met through a study group we hosted for Martin Odersky’s Coursera class on Scala. The beautiful thing about startup recruiting in Seattle is the deep reserve of engineering talent quietly languishing at Microsoft and Amazon, waiting for a greater purpose. But they’re probably not exploring the startup scene, so you need to do the legwork to connect with them on a meaningful level. If I needed to recruit more engineers right now, I would get a rock climbing membership and socialize regularly at Beer and Code and host house parties to extend my social circle. And then pitch them on the cool stuff we’re working on.

It certainly would have been a different experience starting in the Bay Area versus Seattle, but I’ll shy away from saying which is better. Between two workable environments, it really isn’t meaningful to make statements like “City A is better than City B” anyway. Starting in Seattle is definitely possible, and moreover this town has been good to us especially in terms of recruiting. Beyond that, location is mostly a matter of personal preferences.

As a postscript, I will say that I was completely caught off guard by winter blues. Again, personal preferences.

Free startup idea

The way to think of startup ideas is not to think of startup ideas but rather to look for problems. So says Paul Graham. If you can provide a focused solution to a high-pain problem shared by many, then you have yourself a business.

For example, Stripe does that for the many websites that want to accept credit card payments without the hassle of merchant accounts or PCI compliance. Papertrail does that for sites that want to be able to make sense of application server log statements and exceptions. Crashlytics filled a critical need for mobile developers who otherwise had limited view of how their apps fared in the hands of actual users.

A friend of mine was reading a book called ObamaCare Survival Guide and had all but given up trying to understand ObamaCare compliance. The trouble is that there are lots of small businesses out there who don’t have that luxury. Think restaurants. Think small contracting firms. The owners of these companies are busy people who now have many new rules to follow, and those rules are complex and differ state-to-state and sometimes by city (e.g., San Francisco’s Health Care Security Ordinance). Oh yeah, and expect that those rules will be changed by Congress at some point.

So, create a service that handles everything small businesses need to comply with employee healthcare laws. As a user, I would want to be able to enter in my business address and info about my employees, and this service would take care of the rest down to recordkeeping and sending payments and forms where they need to go.

This one’s for free, along with the zero hard research that’s gone into it. In fact, from now on, whenever you hear about a survival guide for anything, just assume that there is a business idea.

The brave world of video encoding

One of my recent projects was to build the new egraph page. As explained on the company blog, the new format delivers the egraph’s audio and image together as a video, versus as separate image and audio assets.

Much of the motivation for this exploration was that we figured out that many people who viewed egraphs completely missed that there is an audio experience. By presenting the egraph as a video with a BAPB (big ass play button), we hope that it is obvious to egraph viewers that there is something to experience by hitting the play button. Hear your star speak directly to you omgz.

Enter the brave world of video encoding. As this was my first foray into that world, I was in for a crash course. For starters, there are currently three major codecs. As Dive Into HTML5 recommends for maximum compatibility, a website should provide a video in all three formats as well as a Flash fallback. So much for hoping that video would be a limited scope problem.

And secondly, even combining a single image with an audio track turned out to be more difficult than I initially anticipated. Major open-source video libraries like FFmpeg run on native code, and all of my encoding attempts involved using a Java library called Xuggle, which wraps FFmpeg. I found Xuggle to be extremely finicky, and given how frequently that library is referenced, I would say that the JVM world does not have great video support.

I managed to create an MP4 from egraph assets, and those MP4s were what I shipped out the door. Currently, the egraph page displays the MP4 with Flash fallback provided by Video JS (Zencoder) in a single resolution. We shipped it because, judging from the browsers they use, a very small percentage of our users would not be able to play the video, and besides our beautiful classic egraph view is still available. But if time and money were unlimited, video encoding is clearly a project we could spend more time on.
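For comparison, here is a sketch of a different route from the Xuggle one described above: shelling out to the ffmpeg command-line tool from Scala. It assumes an ffmpeg binary built with libx264 is on the PATH, and the object name, function names, and file names are hypothetical. It is not the code we shipped.

```scala
import scala.sys.process._

object StillImageVideo {
  // Build the ffmpeg arguments for turning one image plus one audio track into an MP4.
  def ffmpegCommand(imagePath: String, audioPath: String, outputPath: String): Seq[String] =
    Seq(
      "ffmpeg",
      "-loop", "1", "-i", imagePath,             // repeat the still image as the video stream
      "-i", audioPath,                           // the audio track
      "-c:v", "libx264", "-tune", "stillimage",  // H.264 video tuned for a static picture
      "-pix_fmt", "yuv420p",                     // pixel format most players can decode
      "-c:a", "aac",                             // AAC audio, the usual MP4 companion
      "-shortest",                               // stop when the audio ends
      outputPath
    )

  // Run ffmpeg and return its exit code (0 means success).
  def encode(imagePath: String, audioPath: String, outputPath: String): Int =
    ffmpegCommand(imagePath, audioPath, outputPath).!
}
```

Keeping the command construction separate from the process call makes it easy to log or test the exact arguments before anything is encoded.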

Complexity in video encoding is introduced on one axis by the competing video formats, on a second axis by the wide range of viewing devices that span iPhones, Android phones, browsers, tablets, and TVs, and on a third axis by the speed of the viewing device’s internet connection. Did you know that Netflix encodes each video 120 times so that you can watch Archer wherever you want? And the next time I upload an arbitrary video to YouTube, I will admire the fact that it just works. For the rest of us who are not large tech companies, there are services like Zencoder and now Amazon Elastic Transcoder. I tip my hat to Zencoder for being ahead of the curve in recognizing the need for sophisticated video encoding services.

When AWS introduced their service, they wrote: “Implementing a transcoding system on your own can be fairly complex. You might need to use a combination of open source and commercial code, and you’ll need to deal with scaling and storage…. In short, it isn’t as easy as it could be.” That sounds about right to me. :-)

But the deeper you explore hard problems, the more obvious business opportunities seem.

Play Framework and our asynchronous future

There have been many exciting announcements recently about the Play Framework, including new releases and that LinkedIn will adopt Play across the company. So I took this as an opportunity to reflect on my own experience with Play.

It has been 18 months since we chose Play and Scala for the Egraphs technology stack, and I remember that our decision at the time was Play versus Rails or Django or Lift. One factor that influenced our decision was that we prefer strongly and statically typed languages because of how heavily you can lean on the compiler to prevent many bugs. Need to rewrite several classes or even rename types across your codebase? Rewrite dozens of files and when it compiles, it probably works as expected, and besides your IDE will do much of the heavy lifting because it can infer so much about your code from static types. For these reasons, it is easier to maintain typed codebases. On the flip side, I have witnessed how easy it is to let bad code in dynamic codebases remain unrefactored for years.

It was also reassuring to see the close relationship between Typesafe, the Scala company, and Zenexity, the company behind Play. And Scala, with its expressiveness and lambda functions and everything else, was a welcome alternative to Java that vanquished my envy of Ruby and Python developers.

Fast forward to now, we have been running production instances with mostly smooth sailing. There were a few bumps along the way, but they’ve all been resolved, often due to how rapidly Play and its ecosystem are maturing. One thing that I was not expecting is just how exciting it is to be part of this all. We are apparently one of the more visible startups running Play and Scala, and our little startup is even featured on playframework.com in the same breath as LinkedIn and Gilt. I mean, lol. (Egraphs was headquartered in a basement a year ago.)

It is companies that determine the success of an open-source framework. Serious deployments of a web framework make it worth talking about until eventually it becomes the go-to standard for that language. The long list of recognizable companies that use Rails and Django determined the success of those respective frameworks, and so it will be with Play for the JVM world, except that Play is built on a type-safe and scalable language with modern constructs for asynchronous processes. I think we are in the middle of the Play Framework’s ascent toward general popularity.

While we’re on this topic, I will make another prediction about Play. When high-volume websites get to a certain size, there is a general desire to squeeze more performance out of application servers. To that end, Twitter migrated off Ruby and now runs on Scala. Same with Gilt, who also uses Play. Foursquare was rewritten from PHP to Scala. Facebook sidestepped the performance limitations of PHP by using HipHop to translate PHP to C++.

What each of these examples has in common is the search for performance offered by compiled languages. By focusing on scale-oriented features like Akka actors and non-blocking I/O, Play has positioned itself well to run high-traffic websites, so I wouldn’t be surprised to see more major websites migrate to Play in the coming years.
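For a taste of what those asynchronous constructs look like, here is a minimal sketch using only the Scala standard library's Future. The two fetch functions are hypothetical stand-ins for backend calls, and this is plain Scala rather than Play's controller API.

```scala
import scala.concurrent.{ExecutionContext, Future}

object AsyncPage {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical backend calls; real ones would hit a database or a remote service.
  def fetchEgraphOwner(id: Long): Future[String] = Future { s"fan-$id" }
  def fetchCelebrityName(id: Long): Future[String] = Future { "David Ortiz" }

  // Start both calls before combining them, so they can run concurrently
  // and no request thread sits blocked waiting on either one.
  def renderPage(id: Long): Future[String] = {
    val owner = fetchEgraphOwner(id)
    val celebrity = fetchCelebrityName(id)
    for {
      o <- owner
      c <- celebrity
    } yield s"Egraph for $o from $c"
  }
}
```

The result is itself a Future, which is the kind of value an asynchronous web framework can hand back without tying up a thread per request.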