sudo fry rolls /* This blog is actually mobile-friendly! */

How I almost screwed up the Esplorio iOS launch and fixed it with duct tape

Team Esplorio officially launched the iOS app

Polo

Meet Polo - The Esplorio GPS Kitty

We first built our tracking app a long time ago. In the past few months, we put a beautiful UI on it and re-engineered the whole platform in the process.

We went from this simple one-page tracker prototype:

to a beautiful trip recording/sharing app:

this awesome app

With a bit of luck, we got Hunted and were featured at the top of the Tech featured page for the day. We’ve now launched on the App Store and earned a shiny ProductHunt badge, which is pretty awesome.

What happened behind the scenes?

For the 2 days leading up to the launch, we camped at Tim’s place to work our asses off. The first day we called it a day at 3am, and the second day we pulled an all-nighter trying to get all the launch stuff together, then stayed up until late afternoon to respond to all the new traffic. That was almost 40 hours of work in 2 days - pretty much a week’s worth for most people. It is insane! I do not recommend it.

And I almost f*cked it up

When shit hits the fan just before launch, it hits real hard.

About 13 hours before launch, I was doing the usual maintenance on our servers, restarting some machines since the OS required a restart for some security updates. One faulty restart then took out our whole database cluster. The cluster seemed to get into a very bad race condition and never recovered, no matter what we did to save it. We decommissioned it and spun up a new production cluster to replace it, using the latest backup that we had at the time. However, by the time the backup data was in place, it was already 3 hours before launch time and our database views had still not finished indexing - which meant the site and the app were both unusable.

Tim, Essa and I then had to make a call on whether we should keep going with the launch. It was a Thursday, and the coming weekend would be the last weekend before Christmas, so we thought launching any later than this (even on the Friday) would be a bad idea. At this point, we realised that we still had a staging database, with the data replicated from production and all the views already warmed up, lying there ready to be used. We quickly tested it and everything seemed to work. The only risk was that, since these were just staging servers, we had no replication set up, so we ran the risk of an even bigger screw-up if one of the boxes failed.

We bit the bullet and used that cluster anyway. It worked flawlessly for the whole launch period. During the launch, we ran an XDCR (Cross-DataCenter Replication) from this substitute staging cluster to the new production cluster that we had built overnight, to make sure it always had the newest data, in the hope that the view indices would be ready later in the day, or the day after at worst.

This afternoon, we confirmed that the new production cluster was ready. We made sure all the data was in place, switched all our servers to use that cluster and reversed the XDCR so that it runs as it did before (production -> staging).

Yes, that’s right. We just fixed our app launch with duct tape and it worked - you can now get it at https://home.esplor.io!

Startup life is fun.

All our base are belong to Google (with a few gotchas)

Workday

The Esplorio team has moved into the same town resulting in me sharing a flat with Essa, and we took our servers along with us (I kid, I kid)

A while back we managed to get into the Google Launch programme, which includes a $100k voucher of Google Compute Engine (GCE) credits. The idea is that Google helps exciting new startups scale using the many resources at its command, and beefy servers are just one of its specialties. There are a few technical gotchas I will mention at the end, so if you want to skip the BS, go all the way down to The gotchas.

The move

These credits sat around for quite some time because our whole team (3+1) was dead focused on getting the iOS app out. About 3 weeks ago, we asked our friend George Hickman to join Esplorio once more to help us with this huge switch, which involved a lot of moving parts:

  • API servers serving the webapp and iOS app
  • Web frontend server
  • A single-node 8GB database box that we needed to convert into a proper distributed cluster, as it was designed to run (Couchbase minimum requirements state 16GB of RAM for each node in the cluster)
  • A myriad of other servers to process geodata, images and queued tasks

By the end of the move, with George’s tremendous help, we rewrote all of our deployment scripts using the awesome Apache Libcloud. Spinning up a whole database cluster only takes one single deploy.db_cluster:node_count=100 line in the terminal. After all the scripts were rewritten, it took me another couple of days to complete the switch, tighten our firewalls, and scrub all the old servers. As careful as I was, some parts of the system still went down for about half an hour because of a DNS change.
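To give a feel for what sits behind a one-line command like that, here is a rough Python sketch. It is not our actual deployment script: `plan_db_cluster`, `db_cluster`, the `db` name prefix and the `EXAMPLE-SIZE` / `EXAMPLE-IMAGE` placeholders are all hypothetical, and the `create_node` argument stands in for whatever wraps the cloud driver's node-creation call (with Libcloud, presumably something around a compute driver's `create_node`).

```python
def plan_db_cluster(node_count, prefix="db",
                    size="EXAMPLE-SIZE", image="EXAMPLE-IMAGE"):
    """Return one spec dict per node in the cluster (placeholder values)."""
    # Task arguments given on the command line typically arrive as strings.
    node_count = int(node_count)
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # Zero-pad the index so that node names sort in creation order.
    width = len(str(node_count))
    return [
        {"name": f"{prefix}-{i:0{width}d}", "size": size, "image": image}
        for i in range(1, node_count + 1)
    ]


def db_cluster(node_count, create_node):
    """Create every planned node through the supplied callable."""
    return [create_node(**spec) for spec in plan_db_cluster(node_count)]


def fake_create_node(name, size, image):
    """Dry-run stand-in for a real driver call: just echo the node name."""
    return name


dry_run_nodes = db_cluster("3", fake_create_node)
```

Keeping the planning step separate from the creation step means the whole thing can be dry-run with a fake callable before any real (and billable) machines are created.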

We now even have a staging database cluster, which makes us feel a bit more like a proper software company, and plenty of firepower to prep for growth :fingers crossed:

It was a great experience. After months and months of writing way too much JavaScript and Swift, I finally got my hands on some much-needed DevOps stuff. It also serves as a reminder to myself: Esplorio is totally not a simple system to run!

The gotchas

However, we ran into several problems during the move:

1. Unusual traffic from China

When we first set up some test GCE boxes, we noticed some suspicious traffic hitting our Django servers. Fortunately, Django’s ALLOWED_HOSTS check rejects requests with invalid Host headers before they reach most of our endpoints, and on top of that it sends us alerts about these repeated spoofing attempts, like this:

ERROR: Invalid HTTP_HOST header: 'azzvxgoagent5.appspot.com'.You may need to add u'azzvxgoagent5.appspot.com' to ALLOWED_HOSTS

No stack trace available

Request repr() unavailable.

(The HOST header may vary: azzvxgoagent5.appspot.com, azzvxgoagent3.appspot.com, azzvxgoagent1.appspot.com, www.google.com.hk)

After some investigation, it turned out that all of these requests came from a program called GoAgent, which snoops around Google App Engine (GAE) servers and uses them as a free resource to create a proxy service. As you would have guessed, it is apparently used by many people in China to bypass the Great Firewall. Our Compute Engine boxes must have fallen into the same IP range that GAE boxes use, and we’ve had thousands of these requests coming our way.

We decided to filter out these requests before they reach our Django instances, returning HTTP 444 (a non-standard nginx status that closes the connection without sending any response) right when they hit our HTTP server.
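The filtering idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the real filter sits in the HTTP server rather than in Python, the names `EXAMPLE_ALLOWED_HOSTS`, `is_allowed_host` and `filter_request` are hypothetical, and the allow-list entries are example values (the leading-dot wildcard follows the convention Django uses for ALLOWED_HOSTS).

```python
# Example allow-list (assumed values). A leading dot matches the domain
# itself and any of its subdomains.
EXAMPLE_ALLOWED_HOSTS = ["home.esplor.io", ".example.com"]

DROP = 444  # non-standard status: close the connection, send nothing back


def is_allowed_host(host_header, allowed=EXAMPLE_ALLOWED_HOSTS):
    """Return True if the Host header matches an allowed pattern."""
    # Strip an optional port and normalise case before comparing.
    # (IPv6 literals are not handled in this sketch.)
    host = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    for pattern in allowed:
        pattern = pattern.lower()
        if pattern.startswith("."):
            if host == pattern[1:] or host.endswith(pattern):
                return True
        elif host == pattern:
            return True
    return False


def filter_request(host_header):
    """Return None to pass the request on, or DROP to discard it."""
    return None if is_allowed_host(host_header) else DROP
```

With this allow-list, `filter_request("azzvxgoagent5.appspot.com")` gives back `DROP`, while a request for `home.esplor.io` passes through untouched.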

2. Funky network setup by Google

To bring our database over to the new infrastructure without any downtime, we used a Couchbase feature called XDCR (Cross-DataCenter Replication). The process is to first build the new cluster, then set up an automatic unidirectional copy of the data from the old cluster into the new one. Every single document in the old cluster is sent over as part of the copy (each copy request is called an XDCR op). Once all the data is in place, one can simply flip the switch for the application to use the new cluster, and all the precious data will be there, ready to use. When all of the left-over XDCR ops finish, we can make a backup of the old server and then archive it.
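To make the cut-over sequence concrete, here is a toy simulation in Python. It is not Couchbase code and uses no Couchbase API: the clusters are plain dicts, the documents are made-up placeholders, and `ToyCluster`, `ToyXDCR`, `write` and `drain` are hypothetical names for illustration only.

```python
from collections import deque


class ToyCluster:
    """Stand-in for a database cluster: just a dict of documents."""

    def __init__(self, name, docs=None):
        self.name = name
        self.docs = dict(docs or {})


class ToyXDCR:
    """One-way replication from a source cluster to a target cluster."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        # Every existing document is queued as one replication op.
        self.pending_ops = deque(source.docs.items())

    def write(self, key, value):
        """Store a write on the source and queue it for replication."""
        self.source.docs[key] = value
        self.pending_ops.append((key, value))

    def drain(self):
        """Apply all queued ops to the target; return how many ran."""
        applied = 0
        while self.pending_ops:
            key, value = self.pending_ops.popleft()
            self.target.docs[key] = value
            applied += 1
        return applied


old = ToyCluster("old", {"doc:1": "a", "doc:2": "b"})
new = ToyCluster("new")

xdcr = ToyXDCR(old, new)             # 1. start the one-way copy
initial_ops = xdcr.drain()           # 2. wait until the data is in place
xdcr.write("doc:3", "c")             # 3. a late write still lands on old
active = new                         # 4. flip the application over
leftover_ops = xdcr.drain()          # 5. let the left-over ops finish
```

After step 5 both toy clusters hold the same documents, which is the point at which the old one can be backed up and archived.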

For this to work, all nodes within the 2 clusters need to be able to talk to each other. We first set up the new cluster so that its nodes could all talk to each other using Google’s internal IP addresses, leaving only one box exposed to the old cluster, because we thought pointing the XDCR target at this “leader” box would be enough. XDCR failed, of course, because Couchbase treats every node in a cluster equally, so every node in the old cluster needs to be able to reach every node in the new one. I did some further digging into the GCE network structure and found that Google have done some funky setup where the IP address of eth0 is the internal one, and the external address is apparently generated and configured elsewhere. The idea is that the nodes are connected to the Internet not directly but via a different layer, and as a result external IP addresses can be changed either at creation time or even on the fly. It’s quite cool.

I suspect our database cluster would perform even better if we used an all-internal setup; however, that is a task for another day.

3. Quotas

The last gotcha we hit during the migration was the quotas. I assume these are enforced by Google to prevent abuse of their system. Basically, our whole setup required a total of a few dozen CPUs and a number of terabytes of SSD to run, so we had to ask for quota raises twice. This was, thankfully, not much trouble since we are a totally legit startup (yay!) and Google’s support was very quick and receptive about it.

Conclusion

I normally consider myself a (somewhat) full-stack developer, but much of the fun I’ve had still comes from back-end and DevOps. Building these new servers was a bit like assembling many parts of a big puzzle, and the end result was very satisfying. Now on to the tonne of other stuff waiting for me to complete while I procrastinate by writing this blog entry.

San Francisco stories (no.5)

Not your kind of people

Supermoon observers

POV of an outsider

On one hand, I got to meet exceptional individuals on my trip to San Francisco and the Bay Area. These people are the ones in the driving seat who push the limits of technology, constantly at the forefront of innovation, trying to have a shot at the impossible, and keeping the wealth flowing in. No exaggeration: they left me in awe of their intelligence and talents. Oh yes, there is way, way more talent to tap into outside of Silicon Valley - but these superhumans are a totally different breed, seriously…

On the other hand, there are homeless folks roaming almost every street we walked/drove by, and many of them are mentally ill. One night, when I was waiting for the BART in SF centre to get to our place in Oakland, an old man came around and kept saying a lot of gibberish to everybody nearby - amongst which I made out the part where he said he was a vet in ‘Nam (funnily enough…). He then proceeded to sing a song that I could not really understand either, danced along with it in a deranged way, spoke some more gibberish and walked away. I had a glance into his eyes and found them… well, soulless. It would have made a great photograph, but it was all just really sad, so I hesitated and decided not to take the shot. Ever since, I have been reading more and more about the homelessness and mental health problems of San Francisco - fascinating stuff.

All of it was just like a dream - SF is now 8 timezones away. Jet lag is like leaving your heart and soul somewhere else. It is time to readjust.

From High Wycombe, England

San Francisco stories (no.4)

winery

What to do in San Francisco:

After TechCrunch Disrupt Day 2, go for random drinks where you talk about travel startups, hotel pricing algorithms and your dream Arduino builds, and chat up girls about your tech stack.

I can’t imagine doing the last bit in many other places I’ve been to without getting weird looks, and people slowly walking away from this nutty Asian guy.

It's time to just do it

Next destination: San Francisco

3 months ago, the reaction was “Yeah, let’s go to Disrupt SF!”; now it has changed to “Holy shit, it’s next week…”

We’ve now got a solid platform, an upcoming iOS app that we have always wanted to build and that is nearing its launch date, and (positive - woot!) feedback flooding in from beta users and early adopters. In the past 4 months, I have written many thousands of lines of code in at least 5 different programming languages, and I am currently on a GitHub streak of over a month, with 70 pull requests within the last 7 days alone. Recruiters who spam me with LinkedIn messages about “new exciting challenges”, please take note: for something to pique my interest, it needs to be Esplorio-level challenging and addictive.

just-do-it

There is still a crazy amount of work to do. The more we get finished, the more new stuff there is. However, it is safe to say that I am having the time of my life right now, and the best is yet to come.

Let’s hope all the long hours and all those “Okay, I’ll stay back to finish this, maybe one fewer day/night out/game…” moments will eventually pay off. See you again soon, San Francisco.

If you are going to attend TechCrunch Disrupt SF 2015, you can find us showing off all the awesome things we’ve built in the Startup Alley on the 21st Sep. Otherwise, I’ll stay in town with my team until the 30th if you fancy a catch-up.

PS: Fun fact. While I’m writing these lines, Esplorio servers are under attack by Chinese hackers.