June 30, 2023

This is a story about how we broke our website by sending a single email. But it's also a story about the rabbit hole that followed, which made us realise that we had even bigger issues.

Goodbye tinyletter

A few months ago tinyletter announced that they were shutting down. It was a free service that allowed us to send our newsletter, but since it was shutting down we needed an alternative. As we looked around we quickly learned that newsletter-y services have become quite extensive and expensive. We considered buttondown, which certainly looked like the best option, but since our newsletter has thousands of subscribers we figured that it might be better to just build our own. Paying for every service out there quickly adds up and we're trying to run a lean operation in our spare time.

Luckily, it turned out that it wasn't the hardest thing to build. When tinyletter went offline we removed the links to their service on our site and we made backups of our subscribers. Folks wouldn't be able to sign up for the newsletter until we wrote something new ourselves, but we chose not to rush the feature. We're running this project in our spare time, after all.

Calmcode runs on top of Django, so it wasn't hard to integrate with an SES client that could send emails on our behalf. We built some forms, added a mechanism that could confirm an email address and we were off. The newsletter did not have to integrate with any other part of the site, so it was a relatively simple task.
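
The post doesn't describe how the confirmation mechanism works. One common approach is to put a signed token in the confirmation link, and the sketch below shows that idea using only the standard library. The secret and the function names are hypothetical; in a Django project the signing helpers that ship with the framework would be a more natural choice.

```python
import hashlib
import hmac

# Hypothetical secret; in a Django project this role is usually played by SECRET_KEY.
SECRET = b"replace-with-a-real-secret"


def make_token(email: str) -> str:
    """Return a hex HMAC-SHA256 signature for the normalised email address."""
    normalised = email.strip().lower().encode("utf-8")
    return hmac.new(SECRET, normalised, hashlib.sha256).hexdigest()


def is_valid(email: str, token: str) -> bool:
    """Check a token taken from a confirmation link, using a constant-time compare."""
    return hmac.compare_digest(make_token(email), token)


token = make_token("Reader@example.com")
print(is_valid("reader@example.com", token))   # True
print(is_valid("someone@example.com", token))  # False
```

The subscriber only becomes active once they click a link that carries a valid token, which keeps typos and unwanted signups out of the list.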

The email

We tested an email or two to our private email addresses and everything looked fine. So we decided to send out the first newsletter. We clicked the button and the email was sent. That's when the site became unresponsive.

We figured we'd look at our telemetry and this confirmed that there was a spike in our CPU.

CPU spike
The screenshot was taken when we were back online. Notice how the CPU spike isn't at 100%.

The spike only showed 20% CPU usage, but the site simply would not respond. We knew that the timing matched the moment the email was sent out, though, so the email was the cause even though it wasn't heavy on the CPU. The container was probably spending most of its time waiting.
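
To illustrate how a process can be unresponsive while barely using the CPU, here is a minimal sketch. It assumes the emails are sent one after another inside a single worker, and it replaces the real email call with a short sleep; `send_email`, the 10 ms delay and the 50 addresses are all made up for the example.

```python
import time


def send_email(address: str) -> None:
    """Stand-in for a network call to an email API; it only waits."""
    time.sleep(0.01)  # hypothetical 10 ms round trip


def send_newsletter(addresses: list[str]) -> tuple[float, float]:
    """Send to every address in sequence; return (wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for address in addresses:
        send_email(address)
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


wall, cpu = send_newsletter([f"reader{i}@example.com" for i in range(50)])
print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s")
```

The wall-clock time grows with every subscriber while the CPU time stays close to zero. A lone worker that is stuck in a loop like this cannot answer any other request until the loop is done.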

Ah, right, workers

Calmcode runs Django in a simple Docker container. When we made this transition we also configured workers in the container. This way, when a single worker is busy, another worker can still receive and handle traffic. The hope was that this would be sufficient and that we wouldn't need to also run a job queue like Celery.

However, as we quickly remembered, we never actually bothered to configure more than one worker because the Docker container was already running near the memory limit. The container reported 75% memory use, so adding another worker would require more memory. To be completely honest, we had been lazy here. When you only have a few hours per week for a project you accept that you've got to take shortcuts, so we considered the memory limit a backlog item to be reviewed later. All this time, calmcode had been running on a single worker just fine, until it broke, of course.

The simple solution at this point would be to allocate more memory to the container in order to run more workers. But when you get bitten by a production issue like this, it's better to take a step back and make sure that you're not just fixing the symptoms.

The rabbit hole

To understand what was happening we looked at the memory use of the Django app locally. We used memray for this and this showed us that the culprit was our search engine lunr. It was using 224MiB of memory per worker.

Memory use of lunr.
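
For readers who want to try this kind of measurement without installing anything, here is a minimal sketch that uses the standard library's tracemalloc instead of memray. It measures a toy inverted index, not lunr itself, and the 1000 synthetic pages are invented for the example, so the numbers it prints say nothing about calmcode's real index.

```python
import tracemalloc


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Build a toy inverted index: word -> set of page names containing it."""
    index: dict[str, set[str]] = {}
    for name, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index


# Hypothetical corpus: 1000 small synthetic pages.
pages = {
    f"page-{i}": f"lesson {i} about topic{i % 50} and tool{i % 7}"
    for i in range(1000)
}

tracemalloc.start()
index = build_index(pages)
current, peak = tracemalloc.get_traced_memory()  # bytes allocated since start()
tracemalloc.stop()

print(f"{len(index)} words, {current / 1024:.0f} KiB live, {peak / 1024:.0f} KiB peak")
```

Because the corpus is created before tracing starts, the reported memory belongs to the index alone. Every worker that builds such an index holds its own copy.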

The thing with lunr is that it is meant as a simple-enough search engine for simple documentation pages. It powers things like mkdocs material which can run as a static site. The search engine is meant to be run in the browser in situations where the entire index can fit in memory on the client side. For 95% of all documentation pages with maybe 50 pages of content this is perfectly fine. But calmcode has close to 1000 pages of content. And lunr was creating a large JSON index for all this content in memory. No wonder hundreds of megabytes of memory were being used.

The fix

The thing with our search engine is that it wasn't really being used. Our traffic logs confirmed that some people were using it, but most people prefer to explore the content via our tracks page. The main reason we kept it around was that it was easier to do nothing than to make the effort to remove it, but now we had a good reason to do so. There is one course on playwright that uses the search bar to explain some testing, so we'll keep part of the search bar around for that course. But we're dropping it for everything else. You can still reach the bar with the right URL going forward, but the links to it will be removed.

When we made this change, we saw a nice drop in memory use.

Memory use of calmcode dropping
Memory use of calmcode after the fix.

This is what the memory looked like when we added extra workers to the container.

Memory use of calmcode now
Memory use of calmcode after we added multiple workers.
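
The trade-off between per-worker memory and the number of workers comes down to simple arithmetic, sketched below. Only the 224 MiB figure comes from this post; the container limit, the base overhead and the per-worker footprint without the index are hypothetical values picked to make the example concrete.

```python
def workers_that_fit(limit_mib: int, base_mib: int, per_worker_mib: int) -> int:
    """Return how many workers fit in the limit after reserving base memory."""
    if per_worker_mib <= 0:
        raise ValueError("per_worker_mib must be positive")
    return max(0, (limit_mib - base_mib) // per_worker_mib)


LIMIT_MIB = 512  # hypothetical container limit
BASE_MIB = 64    # hypothetical overhead outside the workers
APP_MIB = 100    # hypothetical per-worker footprint without the search index
LUNR_MIB = 224   # per-worker index size that memray reported in this post

print(workers_that_fit(LIMIT_MIB, BASE_MIB, APP_MIB + LUNR_MIB))  # 1
print(workers_that_fit(LIMIT_MIB, BASE_MIB, APP_MIB))             # 4
```

Under these assumed numbers, dropping the index frees enough room for several workers without raising the container limit.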

The finale

All of this might sound like the end of the story. After all, we fixed our issue and reminded ourselves of good practices along the way. But unfortunately we've omitted the biggest issue with everything that we did here.

Remember that in the beginning, we removed the links to tinyletter? Well, after all of this was said and done we never put the links back. When we sent emails to our own email addresses we did so via the Django admin interface. Our Docker container was now running much lighter and we could totally send emails to our heart's content. But people who came to our website wouldn't be able to find the signup link to the newsletter because we completely forgot to put the new links in.

With that in mind it seems like there are a few lessons worth repeating.

  • When working on new features, make sure you test them in person. The Django admin might make it super easy to automate tasks, but this can also introduce a disconnect with the user experience.
  • If you have limited resources it might be better to remove features than to keep them around indefinitely. The search bar was a nice feature but it was also a feature that was barely used.
  • It can be fine to backlog important items when you're working on a hobby project. But when the backlog item starts to cause big issues you should try to resolve it by properly going down the rabbit hole. A quick fix can often work when you're pressed for time, but you shouldn't ignore reality when it hits you in the face.
  • When you're working on a production issue, make sure that you take plenty of screenshots for a post-mortem blogpost. Without them, this blogpost would've been a whole lot less interesting.

That said, please sign up for our newsletter. We promise not to break the site again.

