Monitoring Infrastructure System Design


This article on monitoring infrastructure system design covers how to collect metrics from any running application or piece of software, ship them to a central store, and present them in a form that is easier to understand. We will use open-source software wherever possible.

A monitoring setup like this has three major components:

  1. Who will generate the metrics?
  2. Where to save these metrics?
  3. How to visualize them?

Let’s start with our article on monitoring infrastructure system design. 

When it comes to collecting and displaying metrics, a major player in the open-source world is Prometheus. Prometheus is a powerful and efficient time-series database used by large-scale enterprises. It provides a flexible query language, PromQL, which makes it the first choice for many users.

Metrics are mostly time-series data, and Prometheus is very good at storing them. Graphite is also a great alternative to Prometheus. Now let’s look at our three questions and see how to answer each one.

Who will generate metrics?

Prometheus needs metrics in a particular text format that it can consume. Exporters already exist for most software, from MySQL to MongoDB. You run the exporter alongside the software, and it exposes metrics over HTTP on a specific port, from which Prometheus can consume them.
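To make the format concrete, here is a minimal sketch that renders one metric in the Prometheus text exposition format by hand. The function name `render_metric` and the metric `app_requests_total` are made up for illustration, and the sketch does not escape special characters in label values; real exporters use a client library rather than building strings like this.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(name + "{" + pairs + "} " + str(value))
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, printed the way an exporter would serve it.
print(render_metric(
    "app_requests_total",
    "Total requests.",
    "counter",
    [({"method": "GET"}, 3)],
))
```

Each sample is one line: the metric name, optional labels in braces, and the value. The `# HELP` and `# TYPE` lines describe the metric to whoever reads the page.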

There are two ways to get metrics into a collector: a push mechanism and a pull mechanism.

Pull Mechanism: In this model, the metrics generator exposes the metrics on a port and the metrics collector fetches them from there. Prometheus is very good at this.

Push Mechanism: In this model, the generator itself pushes the metrics to the collector. Graphite is the better fit for this.
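As a rough illustration of the push model, the sketch below formats a data point for Graphite's plaintext protocol (one `path value timestamp` line per sample) and sends it over TCP. The helper names, the metric path `app.requests.count`, and the host are assumptions made for this example; 2003 is the default plaintext port of Graphite's Carbon listener.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def push_to_graphite(line, host="localhost", port=2003):
    """Send one plaintext line to a Carbon listener over TCP.

    host and port are assumptions; adjust them for your own setup.
    """
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


# Formatting works without a server; sending needs a running Carbon listener,
# so that call is left commented out.
print(format_graphite_line("app.requests.count", 42, 1700000000), end="")
# push_to_graphite(format_graphite_line("app.requests.count", 42))
```

Note the difference in direction: here the application decides when to send, while in the pull model the collector decides when to fetch.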

You can read about this comparison below.


https://www.learnsteps.com/prometheus-vs-graphite/

If you don’t find an exporter for your software, you can write your own, and Prometheus will be able to read from it. Prometheus client libraries are available for most programming languages to help you write and run exporters. Below is a code snippet that exposes metrics.

from prometheus_client import start_http_server, Summary
import random
import time

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(9000)
    # Generate some requests.
    while True:
        process_request(random.random())

This example is taken from the Prometheus Python client repository on GitHub. When you run it, it serves the metrics over HTTP on port 9000, where you can view them.
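To see what a collector does with that port, here is a small sketch that fetches the page and parses it into a dictionary, using only the standard library. The function names are made up for this example, and the parser is deliberately simplified: it skips comment lines and assumes samples carry no trailing timestamp.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse text-format samples into a {series: value} dict.

    Simplification: skips '#' comment lines and assumes no trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def scrape(url="http://localhost:9000/metrics", timeout=5):
    """Fetch and parse the metrics page served by the example above."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# With the exporter above running, this would return its current values:
# print(scrape())
```

For the `Summary` in the example, the parsed keys would include `request_processing_seconds_count` and `request_processing_seconds_sum`, which are the two series a summary exposes.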



Where to save these metrics?

Prometheus stores these metrics after scraping them from all the machines. You list the machines you want to collect metrics from in the Prometheus config, and Prometheus does the rest. To learn how Prometheus stores data, see https://prometheus.io/docs/prometheus/1.8/storage/
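As an illustration of what that config looks like, the sketch below generates a minimal `prometheus.yml` with one static scrape job. The helper name, the job name `app`, and the target address are placeholders for this example; `scrape_configs`, `job_name`, `static_configs`, and `targets` are the config keys Prometheus uses for a static list of machines.

```python
def render_scrape_config(job_name, targets, scrape_interval="15s"):
    """Render a minimal prometheus.yml with one static scrape job."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "scrape_configs:",
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    # One list entry per host:port that exposes metrics.
    lines += [f"          - '{target}'" for target in targets]
    return "\n".join(lines) + "\n"


# Placeholder target: a machine running the exporter on port 9000.
print(render_scrape_config("app", ["10.0.0.5:9000"]), end="")
```

Generating the file from a list of hosts is only one option; for a handful of machines, writing the YAML by hand is just as reasonable.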

Once Prometheus has stored the data, you can run queries to get it back out. Read about the basics of Prometheus queries here: https://prometheus.io/docs/prometheus/latest/querying/basics/
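Queries can also be run programmatically through the Prometheus HTTP API. The sketch below builds an instant-query URL for `/api/v1/query` and extracts values from the JSON response. The function names and the server address `http://localhost:9090` are assumptions for this example, and `extract_values` only handles the instant-vector result shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def extract_values(response_body):
    """Pull (labels, value) pairs out of an instant-vector JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def run_query(base_url, promql, timeout=5):
    """Run an instant query against a Prometheus server (needs one running)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        return extract_values(response.read().decode("utf-8"))


# Per-second request rate over the last 5 minutes for the exporter example:
print(build_query_url("http://localhost:9090", "rate(request_processing_seconds_count[5m])"))
```

The same PromQL expression can be typed directly into the Prometheus web UI or used in a Grafana panel.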

How to visualize it?

For visualization, there is an excellent open-source project you have probably heard of: Grafana. It can visualize data from Prometheus, Graphite, or many other sources. Install Grafana, add your Prometheus or Graphite instance as a data source, and then write queries and display the results in Grafana. It offers many types of charts and graphs and is very easy to use.
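Data sources are usually added through the Grafana UI, but it can also be done through Grafana's HTTP API. The following sketch builds (without sending) a request that would register a Prometheus data source. The function names, URLs, and token are placeholders for this example, and the payload fields shown are a minimal set; check the API documentation for your Grafana version before relying on them.

```python
import json
from urllib.request import Request


def prometheus_datasource_payload(name, prometheus_url):
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend talks to Prometheus
    }


def build_datasource_request(grafana_url, api_token, payload):
    """Prepare a POST to Grafana's data source endpoint (not sent here)."""
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


# Placeholder URLs and token; pass the request to urllib.request.urlopen to send it.
request = build_datasource_request(
    "http://localhost:3000",
    "YOUR_API_TOKEN",
    prometheus_datasource_payload("Prometheus", "http://localhost:9090"),
)
print(request.get_method(), request.full_url)
```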

This was a very basic overview of how to set up your monitoring infrastructure. Prometheus can become a problem as your scale increases; in that case you can use Thanos.

Thanos helps you deploy Prometheus in a way that works at scale. There is also the Trickster project, which you can use as a caching layer between Prometheus and Grafana.

If you like the article, please share and subscribe. Also, if you want me to write in depth about any of these components, feel free to ping me.


Gaurav Yadav

Gaurav is a cloud infrastructure engineer, full-stack web developer, and blogger. A sportsperson at heart, he loves football. He loves working on scale and is always keen to learn new tech. He is experienced with CI/CD, distributed cloud infrastructure, build systems, and a lot of SRE work.
