A postmortem of the May 17, 2022 outage on /learn

A couple of days ago, we had a high-severity incident affecting our learning platform, resulting in an unpleasant experience for anyone using it.

This write-up discusses what happened in the spirit of transparency with our users, kind supporters, and incredible contributors.

Quick recap of the series of events leading up to the outage:

  1. New Responsive Web Design (RWD) Certification release.

    Last week (Week 16, 2022), we released a new certification and new features on the interactive coding platform.

    The changes include a brand-new editor that lets you work through the project-based curriculum cohesively without needing to leave freeCodeCamp. We tested this new stack for months in safe environments: developers' own machines and a testing platform known as staging.

    After a good few weeks, we confirmed the UX was acceptable. We released everything to our production environment (the main learning platform at https://freecodecamp.org/learn) in an early access mode for our community enthusiasts.

    Finally, after an extensive evaluation of the features, we moved it out of early access and made it the first certification on the learning map.

  2. First signs of performance overhead.

    During this period, our MongoDB instances started complaining about a small oplog size.

    The oplog in our context usually doesn’t affect end users like you directly. Regardless, we started fixing some of our code to mitigate the issue, and we increased the oplog size as well, just in case (see the sketch right after this list). More on this later.

  3. Dealing with operating our infrastructure at scale and the costs that come with it.

    Like any organization, we optimize our infrastructure operating costs (paid staff hours + actual cloud bills) to get the most value. Operating costs are an ongoing dilemma for us: as a non-profit, our resources are limited, yet we still need to serve millions of users like you every day.

    To balance this, we keep some slack between the infrastructure capacity we need and the capacity we provision. We have automation to scale our infrastructure up as required, and safeguards to prevent accidental overspending.

    Here is an example. Before the latest release, we doubled the storage on the MongoDB cluster because we knew we would need a little more space.

    In hindsight, this wasn’t enough, and we should have looked at other areas to avoid the outage.

    On average, we have 10 million successful challenge submissions each day. Every instance of “Submit and move to the next challenge” is a successful submission.

    Until now, an average user could submit a challenge at best once every 2-5 minutes or so. The new practice-focused editor means you can submit orders of magnitude faster than before: the challenges are now tiny steps that reinforce your knowledge with practice, and there are hundreds of them in a given project.
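
For those curious about the oplog tweak mentioned in step 2, here is a minimal sketch, written with the official MongoDB Node.js driver, of how one might check the oplog's size and grow it online. This is not our actual ops tooling; the connection string and the 16 GB target are placeholder values.

```ts
// Minimal sketch (not our actual ops tooling): inspect and resize the
// replica set oplog using the official MongoDB Node.js driver.
import { MongoClient } from "mongodb";

async function inspectAndResizeOplog(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();

    // On a replica set, the oplog is the capped collection `local.oplog.rs`.
    const stats = await client.db("local").command({ collStats: "oplog.rs" });
    console.log(`oplog max size: ${(stats.maxSize / 1024 / 1024).toFixed(0)} MB`);
    console.log(`oplog used:     ${(stats.size / 1024 / 1024).toFixed(0)} MB`);

    // MongoDB 3.6+ can resize the oplog online (size is in megabytes).
    // 16384 MB (16 GB) is an arbitrary example target, not our real value.
    await client.db("admin").command({ replSetResizeOplog: 1, size: 16384 });
  } finally {
    await client.close();
  }
}

// Placeholder connection string for a local replica set.
inspectAndResizeOplog("mongodb://localhost:27017/?replicaSet=rs0").catch(console.error);
```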

So, what happened?

In the few days leading up to the outage, our API slowly clogged up until it failed miserably.

Overhauling the most popular certification that we have had for seven years was bound to generate some anticipation. Twitter and this forum have countless comments about how much you all like it. As word spread about the newly updated certification, most of you started trying it out.

Our API is a Node.js app that runs on several virtual machines on Microsoft Azure. Until the latest release, these apps had never been used anywhere near the capacity they can handle; we had seen only 15%-20% utilization on average for at least a year.

As users complete challenges faster than before, more simultaneous submissions need to be validated by the API. The web app (a React frontend) sends more requests to the API, which sends more updates to the database, and so on. HTTP requests started resolving more and more slowly until they eventually timed out.

This slow clogging did not happen instantly; it took about half a week to tip over, so it went unnoticed.
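
To make "the same users, far more traffic" concrete, here is a rough back-of-envelope sketch. Only the 10 million submissions per day figure comes from the recap above; the per-user submission intervals are assumptions for illustration.

```ts
// Back-of-envelope arithmetic (illustrative assumptions, not measured data):
// shrinking the time between submissions multiplies the request rate the API
// and the database have to absorb, even with the same number of users.
const submissionsPerDay = 10_000_000;           // figure from the recap above
const baselineRps = submissionsPerDay / 86_400; // ≈ 116 submissions/second on average

const oldIntervalSec = 180; // assumed: one submission every ~3 minutes per user
const newIntervalSec = 10;  // assumed: tiny steps can be submitted every few seconds
const multiplier = oldIntervalSec / newIntervalSec;

console.log(`baseline: ~${baselineRps.toFixed(0)} submissions/s`);
console.log(`same users on the new editor: ~${(baselineRps * multiplier).toFixed(0)} submissions/s (x${multiplier})`);
```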

So, how did we fix it?

When we make a new release, we monitor our platform for signs of unusual activity. If we see something that we do not like, we roll back to a previous version as a rule of our DevOps protocol.

This outage was not caused by a single code bug, nor by our strategy for handling load on the infrastructure. We could not simply revert a change and work on a quick fix.

We went into all-hands-on-deck mode to understand the problems I summarized earlier.

It took us a while because we were getting all the wrong signals. MongoDB was completing queries slowly, but there was no significant spike in memory or compute usage on the database cluster. Our API instances were at capacity (serving as many requests per second as they could), yet the VMs were still underutilized overall (no severe spike in memory or compute).

The only things we knew at the time were:

  1. Our requests were slow and timing out.
  2. We could identify some commands that were taking longer than they should (see the sketch after this list).
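
As an example of that second point, here is a minimal sketch, assuming the official MongoDB Node.js driver and suitable privileges, of how one might list in-progress operations that have been running longer than a threshold using the `$currentOp` aggregation stage.

```ts
// Minimal sketch (assumed tooling, not our actual scripts): list in-progress
// MongoDB operations running longer than a threshold via $currentOp.
import { MongoClient } from "mongodb";

async function listSlowOps(uri: string, minSeconds = 5): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();

    // $currentOp must be the first stage of an aggregation on the admin database.
    const slowOps = await client
      .db("admin")
      .aggregate([
        { $currentOp: { allUsers: true, idleConnections: false } },
        { $match: { active: true, secs_running: { $gte: minSeconds } } },
        { $project: { op: 1, ns: 1, secs_running: 1, command: 1 } },
      ])
      .toArray();

    for (const op of slowOps) {
      console.log(`${op.ns} ${op.op} has been running for ${op.secs_running}s`);
    }
  } finally {
    await client.close();
  }
}

listSlowOps("mongodb://localhost:27017").catch(console.error);
```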

So, we scrambled to optimize the logic we used in our code. We even dropped some endpoints that were used heavily but not mission-critical.

Things improved slightly, but our APIs still could not handle the loads.

This incident was particularly bizarre because we did not see any significant increase in the number of users online working through the challenges. We realized a bit late that the same number of users could generate far higher traffic than before by powering through the new certification.

This experience was bittersweet for us. It was a clear indication that we need to make changes, and that you are using the new curriculum at an unprecedented scale.

We decided to bite the bullet and spin up more servers than our budget allows. Combine that with the fact that we are working on about half a dozen new project-based certifications, and this approach is clearly not financially viable long-term.

We need to get back to the drawing board.

Going forward from here:

Over the years, we have gathered a lot of cruft. Our API is old, and our DB design has not changed since its inception.

We use MongoDB with a single collection for users. This design works well because our mental model for the platform has only one entity: you, the user, with your profile and your submissions. That means a single document per user containing all of that data. However, the way we interact with that document is not optimal, at least not with the new approach to project-based learning, which results in very frequent updates to the document.

Traditionally, we have never put any blocking mechanism between you submitting a challenge and it being saved. So each submission request results in an immediate database update, as sketched below.
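
Here is an illustrative sketch of that pattern. The collection and field names are hypothetical, not our real schema; the point is that every tiny completed step immediately appends to the same user document.

```ts
// Illustrative sketch (hypothetical field names, not our real schema):
// a single document per user, with one write per challenge submission.
import { Collection, ObjectId } from "mongodb";

interface CompletedChallenge {
  id: string;            // challenge/step identifier
  completedDate: number; // epoch milliseconds
}

interface UserDoc {
  _id: ObjectId;
  email: string;
  completedChallenges: CompletedChallenge[]; // grows with every submission
}

// One immediate write per submission. With hundreds of tiny steps per project,
// and hundreds of API instances doing this in parallel, the same user document
// gets updated very frequently.
async function recordSubmission(
  users: Collection<UserDoc>,
  userId: ObjectId,
  challengeId: string
): Promise<void> {
  await users.updateOne(
    { _id: userId },
    { $push: { completedChallenges: { id: challengeId, completedDate: Date.now() } } }
  );
}
```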

But this has a few problems.

At the scale that we operate, the new practice-based approach needs hundreds of instances of the API, all running in parallel and updating the user records. Plus, the DB needs to guarantee the transaction for each request.

In theory, our simple and proven mental model for the API means we can scale it out indefinitely. In practice, we cannot: it would mean a very, very high operational cost.

So, as you read this, our team is hard at work rejigging parts of the architecture. This time we are treating scale as a first-class citizen of the design. We are looking at integrating more observability tooling, measuring performance more closely than before, and doing the chaos engineering that we haven’t done earlier.

We promise what we have in the works is very exciting. The platform will become much more stable than ever before, scaling up for use by hundreds of millions of users.

So, above everything, we sincerely appreciate your patience with us during the outage that lasted over 30 hours on the learning platform.

The community stayed vibrant across all its other pillars during this incident. And in case you did not notice, our other platforms, like Developer News, the forum, and the chat, which run closely alongside the learning platform, were not affected at all.

We are grateful to every member of the mod team for helping manage the support requests.

Cheers & Happy Coding!
