In recent times, I was working with Redis clusters which have very high throughput. While doing this we came across a problem which was Redis bgsave was taking a lot of memory sometimes almost as the same memory as the data present in the memory. In this small writeup, we will see why this happened.
There is a 6 node Redis cluster with 3 master and a replication factor 1. This means each master has a replica so that when it goes down the replica can take over as a master. If you want to see how to do this setup you can follow this.
Now we started putting a load on the server. Now you can see as the throughput increases the memory that Redis takes to take a rdb snapshot keeps increasing. Now here comes the question of why is this happening.
The first guess that we made is since there is a high throughput the number of keys that are changing will be high and thus Redis may need to use memory to save those keys. So what we did is we make this snapshot process more frequently so that less number of keys will change in that span.
But it was a failure we still saw that the amount of memory that is being used is still the same. So it proved our hypothesis wrong. Now was the time to read more about bgsave in Redis.
Copy on Write:
It is a method in which the child process has the access to the memory block of parents and it can read the data from the same pages. Once anyone either parent or child needs to write the data then only a different copy is created and saved in memory. You can read more about it here.
Now the problem is since our throughput is very high. When we start the fork and till the fork is complete. Redis try to change a lot of pages due to high throughput. Thus more the pages changed more copies have to be created resulting in more usage of memory.
Do you get it?
Think like this. You got homework. Now you have all the notebooks in your rack. You will only take out those and put it on table(RAM) in which you have to work. In our problem, it’s like they have given something to change in all the notebooks. This your table(RAM) will be full(OOM)
Well, it’s tough to find a solution to this. What you can do is in a high throughput system either give the same amount of memory for bgsave. Or run the save at a time when throughput is very less.
If you like the article please share and subscribe. Also if you are preparing for an interview for DevOps and SRE, we have a book for you.