Our team greeted the end of the workday with a long, collective sigh. After banging their heads against a wall for two days, Nathan and Connor had finally found a proper way to implement a new feature amid a mess of existing logic. Elizabeth had just wrapped up an exhausting QA task that spanned many different services, and I had spent the afternoon implementing some 11th-hour requirements on a monthly batch job that was scheduled to run within the next 36 hours. We were minutes away from packing up to head home for the day when messages from the product manager came across my chat client:
PM: Are you still here?
PM: We have a high priority issue with [the company's highest revenue-generating feature].
PM: A third party supplied us with some incomplete customer data, so not a single person can get successfully matched to their promotional offer. We have a legal obligation to get them these offers.
me: Hmm… I think we can fix that by matching them up on the data that we do have.
PM: Or alternatively... we shut the site down.
Whatever end-of-day repose had started to settle in quickly disappeared with the words "shut the site down." I've worked with this client for a year, and we have never had to bring the site down.
me: Give us 15-30 minutes.
The clock started ticking… and our craftsmanship kicked in.
Four heads are better than one
As soon as the exchange ended, I engaged the most powerful weapon available: the team sitting around me. I relayed the situation to Elizabeth, Connor, and Nathan, and Connor quickly replied, "Yeah, we can totally do this." So, I fired up the monitors on the wall of our team room and showed the team where I thought the fix could go. After a few rapid back-and-forth questions, I typed out some pseudo-code to make sure we were all on the same page. We soon had a minimally invasive solution that could stay in place indefinitely without breaking anything, but would be easy to remove as soon as the third party fixed things on their end.
Time elapsed: 5 minutes
The only way to go fast is to go well
Our client does a great job of continuously deploying. With many teams working on the same services, there are several deploys a day. Deploying is almost always a non-event, and even a hotfix should be no different. Connor took over the keyboard so that he and Nathan could start writing a unit test. The test would help flesh out any of the sticky implementation details and also help drive out any low-level bugs that we hadn't thought of. Concurrently, Elizabeth—who knows the ins and outs of this particular feature better than anyone—mirrored the code changes and set up her own machine so that she could manually verify it with an end-to-end test.
Time elapsed: 8 minutes
If you maintain a high signal-to-noise ratio, you can never over-communicate
Next up, we pulled in the client's lead developer to run the fix by him. Again, a chat message containing the words "business wants us to take down the site" had him in the office before I could even look up from my monitor. He was good with the solution, and camped out in our room to help us get it into production as soon as possible. Then, while the team worked on the fix, I set off to find the room of rightfully panicked business stakeholders so that we could talk face-to-face. The conversation I walked in on was about getting clean data from the third party, which would be, at a minimum, a four-hour wait. I interrupted as politely as I could and provided a quick, broad overview of our fix. I probably spoke a little too quickly, but I know the words "we can have the fix live in another 15-20 minutes" sank in.
Time elapsed: 12 minutes
Done means Done
Then I went back to the team room, which had quickly grown from the four of us to include the lead dev, three stakeholders and another senior dev from the client. With Connor at the keyboard and the code and tests up on the monitor, we all 'pair programmed' through the red and green of the unit tests, calling out an edge case here and a fix to the implementation details there. As soon as we saw green, Elizabeth got the code fix on her machine to quickly confirm that yes, everything went through as expected. After passing tests, a very public code review, and an end-to-end verification, we knew we were good to go. Once again, we let out a collective sigh and confidently handed our code over to the lead dev to be sent into the wild. He updated the production nodes, and we had him run one final post-release check just to confirm the fix worked as expected—all good.
Time elapsed: 25 minutes
Everyone filed out of our room, and we started packing up for the day. When our product manager pinged us a couple of minutes later to let us know that customers were successfully getting matched to their promotional offers, we were very happy to hear it. However, we already knew that as a team (which included everyone from the client's side as well) we had implemented the best possible solution by collaborating, relying on our disciplines, and making sure we crossed our t's and dotted our i's. By the time the message came through that "it worked!" we were well into our denouement, riding down the adrenaline rush to what would hopefully be a nice, quiet evening.
Total time elapsed: 35 minutes