Hey everyone, I am starting a new article series in which we will talk about working at scale. Today is the first article of this series. I don't know how many articles it will contain, as working at scale is a subject in itself and I am a mere beginner. So let's start today's article with what replication and sharding are.
Every now and then, people who work with databases and data hear the buzzwords replication and sharding. In this article we will see what these are and when to use which one.
As the name suggests, replication means keeping copies (replicas) of your data in different places. It's as simple as that. The problem comes when you have to keep those copies in sync and consistent. This is what different databases try to solve. Some favor consistency at the cost of availability, while others prefer to be more available than consistent. It all depends on the use case you have for your system.
What are the advantages of replication?
Keeping several copies of the data helps make sure that your data is not at risk if a few of your machines go down, as there is a replica available to recover from. It also provides a way to make reads scalable.
How will it make reads scalable?
Simple: you can read from any of the replicas. The more replicas you have, the more options there are to read from, so a few machines no longer take all the load. More servers are available to handle the read load.
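To make this concrete, here is a minimal sketch of spreading reads across replicas in round-robin order. The `ReplicaPool` class and the replica names are made up for illustration; a real setup would usually rely on a driver or a load balancer for this.

```python
import itertools


class ReplicaPool:
    """Hypothetical helper that hands out replicas in round-robin order."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        # cycle() repeats the list forever, so each replica gets its turn.
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        return next(self._cycle)


# Replica names are placeholders, not real hosts.
pool = ReplicaPool(["replica-1", "replica-2", "replica-3"])
picked = [pool.next_replica() for _ in range(6)]
print(picked)
```

With three replicas, six reads land twice on each replica instead of six times on one machine.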
Problems with replicas?
One problem with replicas is consistency: if you want a strongly consistent system, network latency can get in the way. Your changes may show up on a replica with some delay. Different databases solve this with different approaches, so you just have to choose the one that fits.
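The delay can be shown with a toy, in-memory sketch. It assumes the replica copies changes from a log kept by the primary, and only when `catch_up()` is called. The `Primary` and `Replica` classes are hypothetical and do not model any particular database.

```python
class Primary:
    """Toy primary: stores data and keeps an ordered log of writes."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Toy replica: only sees writes after it has caught up."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def read(self, key):
        return self.data.get(key)

    def catch_up(self):
        # Apply every log entry we have not seen yet, in order.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


primary = Primary()
replica = Replica(primary)

primary.write("user:1", "Alice")
stale = replica.read("user:1")   # replica has not caught up yet
replica.catch_up()
fresh = replica.read("user:1")   # now the write is visible

print(stale, fresh)
```

The read before `catch_up()` misses the write, which is exactly the window in which a user could see old data.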
Too much replication can also put load on the master, where the writes happen, as it needs to propagate the changes to a larger number of servers.
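A rough back-of-the-envelope sketch of that cost, assuming the master sends every write directly to every replica (many systems chain or batch replication instead, so treat this as an upper bound). The function name is made up for illustration.

```python
def propagation_messages(writes, replicas):
    """Messages the master sends if each write goes to each replica once."""
    if writes < 0 or replicas < 0:
        raise ValueError("writes and replicas must be non-negative")
    return writes * replicas


few = propagation_messages(writes=1000, replicas=2)
many = propagation_messages(writes=1000, replicas=10)
print(few, many)
```

Under this assumption the same 1000 writes cost 2000 messages with two replicas and 10000 with ten.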
So this was a quick look at replication, which helps you scale reads. Now what about writes? Here comes sharding.
With sharding, instead of saving the whole data set on one server, we split the data on some basis, say the geographic location of the data, and save the pieces on different servers, thus making more servers available to accept writes.
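As a small sketch of the geo-location idea, the mapping below sends each region's data to its own server. The region codes and server names are invented for the example.

```python
# Hypothetical mapping from region to the server that owns its data.
REGION_TO_SERVER = {
    "eu": "db-eu",
    "us": "db-us",
    "asia": "db-asia",
}


def server_for(region):
    """Return the server that stores data for the given region."""
    try:
        return REGION_TO_SERVER[region]
    except KeyError:
        raise ValueError(f"no shard configured for region {region!r}")


print(server_for("eu"))
print(server_for("asia"))
```

A write for a European user goes to `db-eu` and never touches the other two servers.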
How will it make writes scalable?
Data is no longer written to a single server, which may be exhausted by a huge write load. Instead, writes are distributed across different servers, which allows more writes to happen in parallel. This makes writes more scalable.
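Here is a minimal sketch of how writes spread out, assuming we pick the shard by hashing the key. That is one common choice, not the only one. The shard names and the `shard_for` helper are hypothetical.

```python
import hashlib
from collections import Counter

# Placeholder shard names.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key):
    """Pick a shard from a stable hash of the key."""
    # hashlib gives the same result on every run, unlike the built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


# Count how 1000 simulated writes are spread across the shards.
writes_per_shard = Counter(shard_for(f"user:{i}") for i in range(1000))
print(writes_per_shard)
```

Each shard should end up with only a fraction of the writes, and the same key always lands on the same shard.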
Problem with sharding?
Sharding is tough to deploy, as it needs code changes, and you may need to put logic in the application to decide which server each piece of data is written to. This is the tricky part and may cause a lot of problems. Your application needs to be designed to support data sharding.
Sharding also needs queries to be handled by an extra layer which knows where the data may live and talks to that server when reading.
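Such a layer could look roughly like the sketch below. The `ShardRouter` class is hypothetical, and plain dictionaries stand in for real database servers. The point is that callers only talk to the router, which finds the right shard for both writes and reads.

```python
import hashlib


class ShardRouter:
    """Hypothetical routing layer in front of several shards."""

    def __init__(self, shard_count):
        if shard_count < 1:
            raise ValueError("need at least one shard")
        # Each dict stands in for one database server.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash, so reads find the shard the write went to.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)


router = ShardRouter(shard_count=3)
router.put("order:42", {"total": 99})
result = router.get("order:42")
print(result)
```

Note that changing `shard_count` later would move most keys to a different shard, which is part of why sharding is hard to retrofit.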
As a rule of thumb, sharding is to scale writes and replication is to scale reads.
So you have seen what data replication and sharding are. Normally reads outnumber writes in a system, and they are also usually cheaper in terms of time. So you should try to design a system with as few writes as possible, which makes it easier to scale.
Hope you liked the article. If you find something wrong here, please mention it in the comments and I will try to fix it at the earliest. Any suggestions are welcome.