Logging infrastructure is an essential part of any system: sooner or later you need to look into logs. When a large number of applications talk to each other, they produce an enormous volume of logs, and handling that volume can be costly and painful. Let's look at how to design a logging infrastructure that handles logs at scale.
There are three problems to solve here:
- How to ship logs from the machine
- How to process and store them
- How to visualize them
How to ship logs from the machine
You need a log shipper that can read logs from your directories or containers and ship them to the next stage. The shipper should have a small resource footprint; you don't want your logging agent to take down your main application.
There are a few good log shippers that can help you here: Fluentd, rsyslog, and Logstash are all good choices.
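As a rough sketch of the shipping step, here is roughly what a Fluentd configuration could look like, assuming the application writes JSON lines under /var/log/app/ and that the fluent-plugin-kafka output plugin is installed. The paths, tag, broker addresses, and topic name are placeholders, not values from the article.

```
# Sketch: tail application log files and forward them to Kafka.
# Paths, tag, brokers, and topic are illustrative placeholders.
<source>
  @type tail                        # follow files as they grow
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  <parse>
    @type json                      # assumes one JSON object per line
  </parse>
</source>

<match app.logs>
  @type kafka2                      # provided by fluent-plugin-kafka
  brokers kafka-1:9092,kafka-2:9092
  default_topic raw-logs
  <format>
    @type json
  </format>
</match>
```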
How to process and store the logs
Your log shipper needs to push the logs to a queue, because they cannot be processed on the fly given the rate at which they are generated. You need to buffer them somewhere and process them later.
The queue used here is Kafka; few other queues handle this kind of volume at scale. Once the log objects are saved in Kafka, a worker consumes them, extracts metrics data, fixes the log object format, and then either pushes them to the next Kafka topic for further processing or saves them somewhere like S3.
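As a rough illustration of what such a worker might look like, here is a minimal Python sketch using the kafka-python and boto3 libraries. The broker addresses, topic names, S3 bucket, and the normalization logic are all assumptions made for the example, not part of the original design.

```python
# Minimal sketch of a log-processing worker, assuming kafka-python and boto3
# are installed and raw logs arrive as JSON on a "raw-logs" topic.
import json
import time

import boto3
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["kafka-1:9092", "kafka-2:9092"]   # placeholder broker addresses
s3 = boto3.client("s3")

consumer = KafkaConsumer(
    "raw-logs",
    bootstrap_servers=BROKERS,
    group_id="log-processors",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    log = message.value

    # Normalize the log object into a fixed schema (illustrative only).
    event = {
        "timestamp": log.get("time", time.time()),
        "level": log.get("level", "INFO").upper(),
        "service": log.get("service", "unknown"),
        "message": log.get("message", ""),
    }

    # Push the cleaned-up event to the next topic for indexing/visualization.
    producer.send("processed-logs", event)

    # Also archive errors to S3 for cheaper long-term storage (bucket is a placeholder).
    if event["level"] == "ERROR":
        key = f"errors/{int(event['timestamp'])}-{message.offset}.json"
        s3.put_object(Bucket="my-log-archive", Key=key, Body=json.dumps(event))
```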
Visualizing the data
Once the data is back in the queue after the initial processing, the next worker can pick it up and index it into Elasticsearch, from where Kibana can be used to visualize it. Alternatively, you can use Graylog to search and visualize the logs.
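To make the indexing step concrete, here is a minimal sketch of the second worker using kafka-python and the official Elasticsearch Python client. The hosts, topic, batch size, and daily index naming are assumptions for illustration only.

```python
# Sketch: consume processed logs from Kafka and bulk-index them into a
# daily Elasticsearch index for Kibana to query. Hosts, topic, and index
# naming are illustrative assumptions.
import json
from datetime import datetime, timezone

from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer

es = Elasticsearch("http://elasticsearch:9200")
consumer = KafkaConsumer(
    "processed-logs",
    bootstrap_servers=["kafka-1:9092"],
    group_id="log-indexers",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def actions(batch):
    # One daily index (e.g. logs-2024.01.31) keeps retention and deletion simple.
    index = "logs-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
    for event in batch:
        yield {"_index": index, "_source": event}

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:            # flush in bulk to reduce indexing overhead
        helpers.bulk(es, actions(batch))
        batch.clear()
```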
Have a look at the diagram below to see how the components interact with each other.
This was a very basic logging infrastructure design, and you will face more problems as you dig deeper into each of these components. For example, the volume of logs can be overwhelming, so you may have to build things like log throttling on top of this, as sketched below.
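As a rough idea of what log throttling could mean here, below is a small token-bucket sketch in Python that drops logs from any source exceeding its allowed rate. The rate and burst numbers are arbitrary example values; a real setup would more likely rely on rate limiting built into the shipper or the processing workers.

```python
# Toy token-bucket throttle: drop (or sample) logs from a source that exceeds
# its allowed rate. Rates and burst sizes are arbitrary example values.
import time
from collections import defaultdict

RATE = 100.0    # log lines allowed per second, per source
BURST = 500.0   # maximum burst size per source

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(source: str) -> bool:
    """Return True if a log line from `source` should be forwarded."""
    bucket = _buckets[source]
    now = time.monotonic()
    # Refill tokens based on elapsed time, capped at the burst size.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False    # over the limit: drop or divert to a sampled stream

# Usage inside the processing worker (hypothetical names from the sketch above):
# if allow(event["service"]):
#     producer.send("processed-logs", event)
```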
Read about how to implement real-time tracking in applications like Swiggy, Uber, and Dunzo.
Suggestions are welcome, and reach out to me if you have any other ideas about system design.
Comments
You should also talk about the different storage types available and the costs involved in using them. Elasticsearch can be quite costly for storing TBs of logs generated every hour, especially if you want to keep data for more than a year. That said, it is helpful for analyzing real-time logs, holding the last 12 hours of data.
Yeah, perfect, thanks for the suggestion. Will update the blog.
A good read about the comparison between Fluentd and Fluent Bit and the way they complement each other:
https://logz.io/blog/fluentd-vs-fluent-bit/
“Fluent Bit acting as a lightweight shipper collecting data from the different nodes in the cluster and forwarding the data to Fluentd for aggregation, processing and routing to any of the supported output destinations.”
And the other thing to note is that Elasticsearch may have performance issues if the logs can change; I think this implies re-indexing in Elasticsearch. So unless you have a separate box for Elasticsearch, you should probably go for MongoDB or something else. See this for more info: https://www.smashingmagazine.com/2018/05/building-central-logging-service/?fbclid=IwAR0foKHYojy6oLeyU8OUkdgyhlzkI-Ky0S-bx3fG5sQpkKGeF-uKyRWWQW0