Working on Scale: Failover, timeouts, throttling and retries, and why they are so important

Hey everyone, this is the third article in the Working on Scale series. If you haven’t read the previous ones, please give them a look. In this article we will talk about things that people generally tend to ignore: timeouts, retries, throttling and failover. Together, these four are very important when it comes to handling huge traffic.

Let’s first see what these terms stand for. We will understand them with the example of you being hungry (the client) at a restaurant (the server) that is supposed to serve you a meal.

Failover, timeouts, throttling and retries

Timeout:

Let’s say you went to a restaurant and ordered a dish. It was getting late, and after 1 hour you decided to leave and go to another restaurant. Here the 1 hour limit was your timeout. In terms of network requests, you set a timeout that says: if I don’t get a response within this much time, I will stop waiting and take further action, which can be a retry or a failover.
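To make this concrete, here is a minimal Python sketch of a client-side timeout. The helper name call_with_timeout and the thread-based approach are assumptions made for illustration; real network clients usually accept a timeout parameter directly.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread and wait at most timeout_s seconds.

    Returns (True, result) if fn finished in time, (False, None) otherwise.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        # We stopped waiting; the caller can now retry or fail over.
        return False, None
    finally:
        # Do not block on the slow call; the worker finishes on its own.
        executor.shutdown(wait=False)
```

The caller gets a clear signal that the wait is over, which is the point where it decides between a retry and a failover. Note that this sketch only stops waiting; it does not cancel the work already running on the other side.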

Failovers:

So now that you didn’t get food at your favorite restaurant, you decide to go back home and cook something for yourself. Here your home cooking is your failover option. In terms of network calls, if your request is not successful, you handle the failure (for example by catching the exception) and take whatever alternative step is necessary in that case.

Retries:

Now say your luck is very bad and whatever you cook gets burnt by mistake. You are sad, but you are brave enough to retry. Similarly, in network calls, if a request fails we retry it. The retries may use exponential back-off, where the wait before each retry grows, to give the system time to relax. It is like being tired after your first attempt, so you take a 20 minute rest and then start cooking again.
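A retry with exponential back-off can be sketched in a few lines of Python. The function name retry_with_backoff and its default values are assumptions chosen for illustration, not recommended settings.

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay_s=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    The wait before each retry grows by `factor`: 1s, 2s, 4s, ...
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller fail over
            sleep(delay)
            delay *= factor
```

The growing delay is what relaxes the struggling system: each failed attempt pushes the next one further away instead of hammering the server at a fixed rate.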

Throttling:

What should the restaurant have done during this time? It should have started to throttle its clients. That is, it should not accept orders until it is sure it will be able to serve them. This is called throttling. If your system is struggling to serve the current rate of traffic, it’s better to throttle the traffic and thus protect your systems. This ensures you are able to serve some traffic instead of serving none at all.
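One simple way to express the restaurant’s side is a limit on how many orders are in progress at once. The class name InFlightThrottle and the limit value are assumptions for this sketch; it is one of several possible throttling strategies.

```python
import threading


class InFlightThrottle:
    """Accept new work only while fewer than max_in_flight items are in progress."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_accept(self):
        """Reserve a slot and return True, or return False to turn the client away."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def finish(self):
        """Release a slot once the order has been served."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1
```

A server would call try_accept() before starting a request, reject the request right away when it returns False, and call finish() when the request is done.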

 

Why are these important?

Large scale systems are prone to delays from one cause or another, and that is where timeouts help. If timeouts are in place and you keep track of them, you can work out which server instance is taking longer and failing to serve requests in time.

Now your application decides to retry, and this time the request is assigned to another server, which serves it faster. There is generally a load balancer sitting between the client and the servers which does this. Load balancers can be configured so that if some of the requests to a server are timing out, that server is taken out of rotation. This ensures that the bad server does not serve traffic.
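The "take it out of rotation" rule can be sketched as a toy round-robin balancer. The class name, the max_timeouts threshold and the server names are all hypothetical, and this is not how any particular load balancer is configured. A real one would also run health checks to bring a recovered server back, which this sketch leaves out.

```python
class LoadBalancer:
    """Round-robin over servers, skipping any with too many consecutive timeouts."""

    def __init__(self, servers, max_timeouts=3):
        self.servers = list(servers)
        self.max_timeouts = max_timeouts
        self.timeouts = {s: 0 for s in self.servers}
        self.next_index = 0

    def healthy(self):
        """Servers that are still in rotation."""
        return [s for s in self.servers if self.timeouts[s] < self.max_timeouts]

    def pick(self):
        """Return the next healthy server in round-robin order."""
        healthy = self.healthy()
        if not healthy:
            raise RuntimeError("no healthy servers in rotation")
        server = healthy[self.next_index % len(healthy)]
        self.next_index += 1
        return server

    def report(self, server, timed_out):
        """Record the outcome of a request; a success resets the counter."""
        if timed_out:
            self.timeouts[server] += 1
        else:
            self.timeouts[server] = 0
```

Because the client’s retry goes through pick() again, it lands on a server that is still in rotation rather than on the one that just timed out repeatedly.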

Now say some of your servers are down and your application cannot serve from them. For example, your caching servers are down and are not serving the cache. For this case you have written a failover strategy: if the cache cannot be read, go back to the actual data source and read from there. Thus your application will keep giving proper results on subsequent retries, though there will be extra load on the database. This load can bring down the database.
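The cache-to-database failover can be sketched as follows. The names CacheDown, cache_get and db_get are hypothetical stand-ins for whatever your cache client and data layer actually provide.

```python
class CacheDown(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


def read_with_failover(key, cache_get, db_get):
    """Try the cache first; on an outage or a miss, read from the database.

    Returns (value, source) so callers can see how often they fail over.
    """
    try:
        value = cache_get(key)
        if value is not None:
            return value, "cache"
    except CacheDown:
        pass  # cache is unavailable: fall through to the data source
    return db_get(key), "database"
```

Returning the source along with the value makes it easy to count failovers, which matters because every failover here is extra load on the database.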

This is where throttling comes in: you now start dropping a few of the requests to the database to keep it safe. This will bring down the performance of the system and the number of clients it can serve, but it keeps the overall system running, which is of utmost importance.
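A common way to decide which database requests to drop is a token bucket, sketched below. This is one possible technique rather than the only way to throttle. The class name and the rate and capacity values are assumptions, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow roughly rate_per_s requests per second, with bursts up to capacity."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be dropped."""
        now = self.clock()
        # Refill tokens for the time that has passed, up to the capacity.
        refill = (now - self.last) * self.rate_per_s
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests for which allow() returns False never reach the database, so the database sees at most the configured rate even while the cache is down.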

Throttling will save your systems from sudden spikes in traffic which may bring down the whole system.

 

How not handling them properly may impair your systems

We will talk about this with an example. Let’s say you have a NAT box (a machine through which your outbound traffic goes) in place, and you make external calls to, say, third party APIs. All these calls go through your NAT box. Now let’s define the bandwidth of your NAT network as, say, 10 Gb per sec, and say you have outbound traffic of 5 Gb per sec. In your system you have put a timeout of 10ms and retries after 10s, 20s and 30s. Suppose that, through some unintended circumstances, these calls start failing. The retries alone can then add up to 3 times your normal outbound traffic, that is 15 Gb per sec on top of the original 5 Gb per sec. Since your network cannot handle it, your system will be impaired. Just because some third party system is not handling the requests, your own systems are impaired.
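The arithmetic behind this example can be written out as a small Python sketch. It assumes a steady state in which failures last longer than the 30s retry window and every retry of a failing call also fails; the function name offered_load_gbps is made up for illustration.

```python
def offered_load_gbps(base_gbps, retries, failure_ratio):
    """Outbound load when a fraction of calls fail and each is retried `retries` times."""
    first_attempts = base_gbps
    retry_traffic = base_gbps * failure_ratio * retries
    return first_attempts + retry_traffic


NAT_CAPACITY_GBPS = 10

# All calls failing, three retries each (after 10s, 20s and 30s).
load = offered_load_gbps(base_gbps=5, retries=3, failure_ratio=1.0)
overloaded = load > NAT_CAPACITY_GBPS
```

Under these assumptions the three retries alone add 15 Gb per sec, and together with the 5 Gb per sec of first attempts the load is well above the 10 Gb per sec the NAT network can carry.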

So what can you do in such situations? Keep track of the failing calls, and once you are sure that a lot more calls than usual are failing, start throttling the requests to the NAT box. This way you will be able to save your system.
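Tracking failures and throttling on them can be sketched like this. The class name, window size, threshold and probe interval are all assumptions for illustration. The idea resembles what is often called a circuit breaker: once too many recent calls have failed, only an occasional probe call is let through so the system can notice when the third party recovers.

```python
from collections import deque


class FailureAwareThrottle:
    """Throttle outbound calls once too many recent calls have failed."""

    def __init__(self, window=100, max_failure_ratio=0.5, probe_every=10):
        self.results = deque(maxlen=window)  # True means the call failed
        self.max_failure_ratio = max_failure_ratio
        self.probe_every = probe_every
        self.skipped = 0

    def record(self, failed):
        """Record the outcome of a call that was actually sent."""
        self.results.append(bool(failed))

    def failure_ratio(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def allow(self):
        """Return True if the next call may go out through the NAT box."""
        if self.failure_ratio() < self.max_failure_ratio:
            self.skipped = 0
            return True
        # Too many failures: let only every Nth call through as a probe.
        self.skipped += 1
        if self.skipped >= self.probe_every:
            self.skipped = 0
            return True
        return False
```

With this in front of the retry logic, a failing third party costs you only the probe traffic instead of a multiple of your normal outbound load.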

These are very small examples of what these things can do. Try to understand them in depth before making any architectural decisions for your system.

If you like the articles please share and subscribe. Read more on this series here.

Working on Scale: What is replication and sharding and where to use which one?


Gaurav Yadav

Gaurav is a cloud infrastructure engineer, full stack web developer and blogger. A sportsperson at heart, he loves football. Scale is something he loves to work on, and he is always keen to learn new tech. He is experienced with CI/CD, distributed cloud infrastructure, build systems and a lot of SRE stuff.
