India’s COVID registration system crashed minutes after opening. India has over a billion people.
So I am curious what combination of technologies would be able to handle hundreds of millions of requests at the same time and not crash. Please share your thoughts, just to help my understanding. Thanks
I guess theoretically you can have a system that can handle unlimited requests given enough resources (massive amounts of resources). But in practice, something like a queue system might be used. There are third-party systems I have seen implemented on some websites. You basically just enter a queue and it handles delegating access to the site.
If you expect a billion people, you probably shouldn’t allow them to hit the site all at once. The API itself can likely be protected but that won’t help if the site used to communicate with it is overrun.
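To make the queue idea concrete, here is a minimal sketch of a "waiting room" in Python. It is not how any particular third-party queue product works; the class name `WaitingRoom`, the `capacity` parameter and the method names are all made up for illustration, and a real system would keep this state in a shared store rather than in one process's memory.

```python
from collections import deque


class WaitingRoom:
    """Lets at most `capacity` visitors use the site at once (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()     # visitors currently allowed on the site
        self.waiting = deque()  # everyone else, in arrival order

    def arrive(self, visitor_id):
        """Return 0 if admitted straight away, otherwise the position in the queue."""
        if len(self.active) < self.capacity and not self.waiting:
            self.active.add(visitor_id)
            return 0
        self.waiting.append(visitor_id)
        return len(self.waiting)

    def leave(self, visitor_id):
        """Free a slot and admit the longest-waiting visitors into any free slots."""
        self.active.discard(visitor_id)
        while self.waiting and len(self.active) < self.capacity:
            self.active.add(self.waiting.popleft())
```

With `WaitingRoom(capacity=2)`, the first two arrivals get in immediately and the third is told it is first in line; it is admitted as soon as one of the first two leaves. The point is that the site behind the queue only ever sees as many visitors as it was sized for.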
Most systems should handle it as long as they’re built that way: if a very large number of requests is expected, then you build a system that handles very large numbers of requests. At a basic level, you either
just have more servers and when someone hits a url it gets routed to one of them, or
you just use a beefier machine (as an extreme example, if the server has say 32 CPU cores and 6 TB of RAM – I think that’s the current maximum of an Intel Xeon server – and the application is the only thing running on it, then you can throw almost anything at that machine).
The former is more sensible, as you can add servers or scale down more easily as and when needed. If one server goes down it doesn’t matter, because the others keep handling requests. It’s what you would normally do: if you need to handle more requests, use more computers.
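As a rough illustration of the "more servers" option, here is a small round-robin router in Python. It is only a sketch of the routing idea, not a real load balancer: the class name `RoundRobinBalancer`, its methods and the server names are hypothetical, and real load balancers also do health checks, timeouts and so on.

```python
class RoundRobinBalancer:
    """Hands each incoming request to the next healthy server in turn (sketch)."""

    def __init__(self, servers):
        self.healthy = list(servers)
        self._count = 0  # number of requests routed so far

    def add(self, server):
        """Scale up: start sending traffic to a new server."""
        self.healthy.append(server)

    def mark_down(self, server):
        """Stop routing to a server that has failed or been scaled down."""
        if server in self.healthy:
            self.healthy.remove(server)

    def route(self):
        """Return the server that should handle the next request."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        server = self.healthy[self._count % len(self.healthy)]
        self._count += 1
        return server
```

If one of three servers is marked down, `route()` simply keeps cycling through the remaining two, which is why losing a single machine doesn’t take the site offline.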
The latter is naturally very expensive, and if that machine goes down you’re still up a creek: dedicated, incredibly powerful maxed-out servers are what you want if you’re, say, Pixar, and you want to render some frames of your new film, or you’re some engineering company running simulations.
There’s normally a combination of the two things: more servers + more powerful servers for something that will have a lot of requests. Plus, if the portal’s developers were using a cloud service (something like AWS), then that service should handle much of the scaling basically automatically (albeit at huge cost).
It is a bit hard to find out what went wrong with the portal, as they’re denying anything went wrong. But they will have set it up to handle a huge number of requests, so I assume they just didn’t set something up correctly, given how fast they fixed it.
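The kind of arithmetic automatic scaling boils down to can be sketched like this. It is not the API or the actual scaling policy of AWS or any other provider; the function name `servers_needed`, the `headroom` parameter and the numbers in the example are all assumptions made up for illustration.

```python
import math


def servers_needed(requests_per_second, capacity_per_server, headroom=0.2, minimum=1):
    """Estimate how many servers to run for a given load (hypothetical sketch).

    `headroom` is the fraction of each server's capacity kept in reserve
    so that a sudden spike doesn't immediately overload it.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = capacity_per_server * (1 - headroom)
    return max(minimum, math.ceil(requests_per_second / usable))
```

For example, if each server comfortably handles 100 requests per second and you keep half of that in reserve, then 1,000 requests per second needs `servers_needed(1000, 100, headroom=0.5)`, which is 20 servers. An autoscaler re-runs a calculation like this as traffic changes and adds or removes machines to match.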