Deploying an application is hard. Deploying an application that has only one real chance at success, coinciding with a live event that can’t be repeated, is really hard! It’s almost like the difference between writing a good song and being able to perform live in front of a huge audience.
If you haven’t been following the story from Iowa, the short version is that a code issue in an app developed to report the caucus results led to incomplete caucus data being reported from 1800-plus precincts. This issue was compounded by volunteers who weren't comfortable using the new system and a failsafe that wasn’t scalable—when trouble arose, the backup plan was for each precinct to phone in their results, leading to long hold times and even more confusion.
Production hiccups can and do happen, but Iowa’s wide-ranging issues remind us that any number of things can go wrong when a brand-new solution is deployed for the first time. And watching the headlines fly, I couldn’t help but think of one of my own stories of going live, and what I learned from it.
My Own "Going Live" Story
Chicago’s Chiditarod is nothing short of a wild spectacle, even when you go in with the knowledge that it’s a charity bar-crawl/shopping-cart race based on Alaska’s famous Iditarod dogsled marathon. Racers arrive in teams and often in coordinated costumes, with a decorated cart full of food donations as an entry fee. After the amusing scene at the starting line, racers visit a series of checkpoints (read: bars, with a minimum stop of 25 minutes at each) along their assigned route, before converging on the finish line.
So, we volunteered and built a timing app. Yes: it’s really, actually a race, and some people really do try to win, but most competitors are just there to have fun. The timing software does determine a winner, but the primary use-case is more to enable checkpoint volunteers to deal with the chaos of a couple-dozen teams idling at their checkpoint, all with different departure schedules.
Like the Iowa caucus app, this app is used during a live event. Volunteer users come with a wide range of technical aptitudes, and are asked to use the app correctly in a chaotic live environment. And like a live performance, there are no do-overs, so if you really want to nail it, you have to rehearse beforehand.
One of the key factors in our success was non-technical preparation. By the time I came around, the organizers had done this rodeo at least a half-dozen times before, and they had a robust system in place to onboard volunteers. Each year they held a big meeting a month or so before the race where they would go over training material and documentation with all the checkpoint leads, who would then fan out and go over the same with their own teams.
One nice thing about this approach is that it gave us a chance to run a pilot with real users—people who would be using the software on race day. We could test the quality of our documentation, answer questions, explain the system and some of the rules it enforced, and gather valuable feedback. We also put together a training environment (usually populated with the previous year’s data), deployed for users in advance, and let them use it when meeting with their own teams. If people ran into any issues, we were aware before race day.
Failing Well
It wasn’t all sunshine and daisies, however. One year, as the race added more teams and more checkpoints, the app got bogged down. We hadn’t prepared for the additional load of more people reading more data, and we found the threshold where our first, eager design would no longer work cheaply and reliably.
One thing that made this failure easier to swallow was the race’s backup system. The race organizers had used a paper-and-ink system to track team progress before they digitized, and wisely kept it in place as a backup. This system was scalable and reliable enough, so recovery was swift and the proceedings were able to continue as planned, with no long holds and no delays. (Side note: this experience alone has made me an advocate for paper-trails in elections!)
We made a few large changes to the system after that debacle. First, we swapped the data store from a disk-backed relational database to an in-memory Redis instance. While doing that, we pulled the majority of the data (teams, checkpoints, routes, etc.) into immutable configuration that would be loaded whenever the app started up, limiting persisted data solely to race-day concerns, like team check-in and departure times. Lastly, we improved our timing display, so that load or connectivity issues would cause graceful, mostly transparent degradation within the software, rather than outright errors and broken behavior.
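To make that split concrete, here is a minimal sketch of the idea, not the code we actually ran. Every name in it (Team, load_config, TimingBoard, InMemoryStore, the key format) is hypothetical, and a dict subclass stands in for Redis so the example runs without a server. Static race data is parsed once into a read-only mapping, only check-in times go to the store, and the display falls back to its last good snapshot when the store is unreachable.

```python
import json
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Team:
    """Static team data, fixed before race day (hypothetical shape)."""
    number: int
    name: str
    route: tuple  # ordered checkpoint ids


def load_config(raw_json):
    """Parse race configuration once at startup into a read-only mapping."""
    data = json.loads(raw_json)
    teams = {
        t["number"]: Team(t["number"], t["name"], tuple(t["route"]))
        for t in data["teams"]
    }
    return MappingProxyType(teams)


class InMemoryStore(dict):
    """Stand-in for the real data store; set down=True to simulate an outage."""
    down = False

    def __contains__(self, key):
        if self.down:
            raise ConnectionError("store unreachable")
        return super().__contains__(key)


class TimingBoard:
    """Serves checkpoint arrivals, falling back to the last good snapshot."""

    def __init__(self, teams, store):
        self.teams = teams      # immutable configuration
        self.store = store      # mutable race-day state only
        self._last_good = {}    # per-checkpoint cache for degraded reads

    def check_in(self, team_number, checkpoint, minute):
        if team_number not in self.teams:
            raise KeyError(f"unknown team {team_number}")
        self.store[f"checkin:{team_number}:{checkpoint}"] = minute

    def arrivals(self, checkpoint):
        """Return (rows, stale); stale=True means cached data is being shown."""
        try:
            rows = {
                team.name: self.store[f"checkin:{number}:{checkpoint}"]
                for number, team in self.teams.items()
                if f"checkin:{number}:{checkpoint}" in self.store
            }
        except ConnectionError:
            # Degrade gracefully: show the last known data instead of an error.
            return self._last_good.get(checkpoint, {}), True
        self._last_good[checkpoint] = rows
        return rows, False
```

The stale flag is what lets the display stay honest: a checkpoint volunteer keeps seeing the last known arrivals, with a marker that the data may be out of date, rather than a broken page.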
More importantly, before race-day, we did production-grade load-testing on the application to ensure it could handle all the connections it would be receiving and much, much more. The next year, the app barely broke a sweat, and even better, because we had done the testing ourselves in advance, the success was no surprise, and the day was stress-free.
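For a sense of what a basic load test involves, here is a small sketch using only the Python standard library. It is an illustration under assumptions, not the tooling we used: load_test and its parameters are hypothetical, and request_fn stands for whatever single request you want to hammer, such as an HTTP call against a staging deployment.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def load_test(request_fn, total_requests, concurrency):
    """Call request_fn(i) total_requests times across worker threads.

    Returns the error count and latency figures in seconds.
    """
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")

    def timed_call(i):
        start = time.perf_counter()
        try:
            request_fn(i)
            ok = True
        except Exception:
            ok = False  # count failures rather than aborting the run
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(latency for latency, _ in results)
    errors = sum(1 for _, ok in results if not ok)

    def percentile(p):
        # Nearest-rank percentile over the sorted latencies.
        rank = math.ceil(p * len(latencies) / 100)
        return latencies[max(0, rank - 1)]

    return {
        "requests": total_requests,
        "errors": errors,
        "p50": percentile(50),
        "p95": percentile(95),
        "max": latencies[-1],
    }
```

Running this with a request count and concurrency well above the expected race-day peak, and watching the error count and the 95th-percentile latency, is the kind of check that turns race day from a surprise into a confirmation.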
Bringing It All Together
With the major technical issues ironed out, we entered a cycle of minor, year-by-year refinements. We'd go on to address simple issues of human error (the volunteers are working in a bar, after all) that we could prevent with better design. We improved our error flows to give volunteers a better, more flexible way of dealing with common mistakes (for example, a previous checkpoint forgetting to check out a team) than calling me from a loud, boisterous room.
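As a rough illustration of that kind of forgiving error flow, consider the forgotten check-out. The sketch below is hypothetical: Visit, TeamLog, and the rule for estimating a missing departure are assumptions made for the example, not the app's real logic. Instead of rejecting the next check-in, it closes the open visit with an estimated time, flags it for later review, and lets the volunteer carry on.

```python
from dataclasses import dataclass, field
from typing import Optional

MIN_STOP_MINUTES = 25  # minimum stay at each checkpoint, per the race rules


@dataclass
class Visit:
    checkpoint: str
    arrived: int                       # minutes since race start
    departed: Optional[int] = None
    inferred_departure: bool = False   # True when the check-out was estimated


@dataclass
class TeamLog:
    visits: list = field(default_factory=list)

    def check_in(self, checkpoint, minute):
        """Record an arrival, closing a forgotten check-out instead of failing."""
        warnings = []
        if self.visits and self.visits[-1].departed is None:
            previous = self.visits[-1]
            # Assumption: the earliest legal departure is the best estimate.
            previous.departed = min(minute, previous.arrived + MIN_STOP_MINUTES)
            previous.inferred_departure = True
            warnings.append(
                f"{previous.checkpoint}: check-out was missing; "
                "an estimate was recorded for review"
            )
        self.visits.append(Visit(checkpoint, minute))
        return warnings

    def check_out(self, minute):
        """Record a departure, enforcing the minimum stop."""
        current = self.visits[-1] if self.visits else None
        if current is None or current.departed is not None:
            raise ValueError("team is not checked in anywhere")
        earliest = current.arrived + MIN_STOP_MINUTES
        if minute < earliest:
            raise ValueError(f"too early; earliest departure is minute {earliest}")
        current.departed = minute
```

The design choice is to keep the volunteer moving and surface the anomaly as a warning, while rules that affect fairness, like the minimum stop, still fail loudly.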
Systems like these will continue to change with time. Software products are often more of a process, like this, than a traditional one-off product. Our ability to iterate year-to-year allowed us to grow in our understanding of our users and the entire operation of the race. This kind of partnership, strengthened by good feedback loops, is probably the most important element of project success.
But a “live” product doesn’t always come with that luxury. Even before we could iterate, the Chiditarod Organization did the non-code groundwork to ensure we could prepare their volunteers, to ensure we could hear their feedback, and to ensure we could all keep getting better together. The Iowa Democratic Party, for whatever reason (and I’m not here to comment on that!), couldn’t get that groundwork done in time, and it caused delays and a ton of avoidable confusion.
“Code is risk,” the popular adage says, and it’s absolutely true; any operational change carries risk. So while “going live” can be a big, scary moment for any developer, there are proven ways to mitigate the fear and the uncertainty. Take a note from those who have to perform live on a regular basis: rehearse and practice before opening night, coordinate with the rest of the production team, and maybe include a safety net or two. The show must go on, after all!
Note: The 2020 Chiditarod takes place on March 7th. To donate or find out more information, visit chiditarod.org