I Was There: Stories of Production Incidents

Code[ish] - A podcast by Salesforce Engineering

Categories:

Corey Martin values storytelling. It's just one way developers can share their experiences in order for others to take lessons. To that end, this episode takes a close look at production issues from two different applications to examine what went wrong and how it was fixed. Meg Viar is a Senior Software Developer at Nomadic Learning, an e-learning platform. One day, they noticed that, for a certain group of users, a column of information in their database row was nulled. It didn't look like any user--either internally or externally--intentionally changed these values, and there hadn't been any new code deployed in days. The only clue was that the data was all changed at the same time. It turned out that a weekly cron job was deleting some data on an in-memory list. However, the database ORM they use also overloads the delete keyword, and was actually deleting the production data. Restoring the data from a backup was easy, and reworking the code to not use the data was a quick fix. However, going forward, Meg and her team came up with several ways to adjust the process around code changes like this from occurring again. Brendan Hennessy is the co-founder and CTO at Launchpad Lab, a studio that builds custom web and mobile applications. One of their clients is an SAT/ACT test prep app, and students complained that the app was extraordinarily slow. Brendan was accustomed to seeing such feedback on testing days, when heavy volume brought added strain to servers, and they accounted for this by increasing capacity. But this was different: there weren't any tests scheduled during the period. Instead, one of their own services was inadvertently DDOSing an endpoint, expecting a response; when one didn't arrive, it just kept making requests. They reworked this code to make a request once and simply wait for a response without trying again. In the future, they committed themselves to doing more in-person blitzes of new features, since issues like this only arise after multiple users use the app--something automated tests have trouble simulating. Links from this episode Nomadic Learning builds digital academies Launchpad Lab builds custom web and mobile applications for startups and established businesses