How to debug issues and performance in production: The USE method

In this post, we look at a methodology you can follow to debug issues: the USE method. We will see how to use it and why it is worth incorporating into our day-to-day debugging.

USE [Utilization, Saturation, and Errors]

The USE method is used to debug performance issues. In short, you check utilization, saturation, and errors for every resource. Take disks as an example: we look at how heavily the disk is being used, a tool such as iostat tells us about the saturation level, and then we check whether any errors are occurring. These errors will generally show up in the system logs (syslog).

If you are seeing performance issues, you can run these checks for every resource on the system, which makes it much easier to find the cause.
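As a rough illustration of what such a check could look like in code, here is a minimal Python sketch using only the standard library. The function names and the 80%/95% thresholds are assumptions made up for this example and are not part of the USE method itself. It covers disk capacity (utilization) and uses the 1-minute load average per CPU as a crude saturation hint.

```python
import os
import shutil


def utilization_percent(used: int, total: int) -> float:
    """Return used/total as a percentage (0.0 if total is not positive)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def classify(percent: float, warn: float = 80.0, crit: float = 95.0) -> str:
    """Map a percentage to a status. The thresholds are illustrative only."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


def disk_capacity_check(path: str = "/") -> dict:
    """Utilization check for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = utilization_percent(usage.used, usage.total)
    return {
        "resource": f"disk {path}",
        "percent": round(percent, 1),
        "status": classify(percent),
    }


def cpu_saturation_hint() -> dict:
    """Crude saturation hint: 1-minute load average divided by CPU count.

    A ratio above 1.0 suggests tasks are queueing. os.getloadavg() is only
    available on Unix-like systems, and on Linux the load average also
    counts tasks blocked on I/O, so treat this as a hint, not a measurement.
    """
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {"resource": "cpu", "load_per_cpu": round(load1 / cpus, 2)}


if __name__ == "__main__":
    print(disk_capacity_check("/"))
    if hasattr(os, "getloadavg"):
        print(cpu_saturation_hint())
```

Errors are not covered here, because they usually live in logs or device counters that differ from system to system.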

This methodology is so useful because, in most cases, the issue comes down to one resource or another being consumed heavily. Since we follow a predefined way of checking the different resources, it is easy to reach the problem. There are instances where people don’t follow a standard procedure to track down issues, and that becomes messy very quickly. You should define a set of rules and procedures that say what to test and check, and in what order.
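One way to make "what to check, and in what order" concrete is to write the procedure down as data. The sketch below is a hypothetical example, not a standard tool: `Check` and `run_checklist` are names invented here, and the probes return canned values where a real setup would query the system.

```python
from typing import Callable, List, NamedTuple, Optional


class Check(NamedTuple):
    resource: str                        # e.g. "disk", "cpu", "network"
    kind: str                            # "utilization", "saturation" or "errors"
    probe: Callable[[], Optional[str]]   # returns a finding, or None if healthy


def run_checklist(checks: List[Check]) -> List[str]:
    """Run every check in the given order and collect the findings.

    A probe that raises is reported as a finding as well, so a missing
    metric is visible instead of being silently skipped.
    """
    findings = []
    for check in checks:
        try:
            result = check.probe()
        except Exception as exc:
            result = f"probe failed: {exc}"
        if result is not None:
            findings.append(f"{check.resource}/{check.kind}: {result}")
    return findings


if __name__ == "__main__":
    # Hypothetical probes with canned results, for illustration only.
    checklist = [
        Check("disk", "utilization", lambda: "98% full"),
        Check("cpu", "saturation", lambda: None),
    ]
    for finding in run_checklist(checklist):
        print(finding)
```

Because the list is fixed and ordered, anyone on the team runs the same checks in the same sequence, and a probe that cannot run still shows up in the output.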

If you are able to do this, it will be easier for any new joiner to debug an issue, and you will not have to rely on the people who have worked closely with the system.

There is a great post on this on Brendan Gregg’s blog. You should check it out here: BLOG POST LINK

Points to keep in mind while debugging production issues

Story: Outage: Did you check the disk usage?

This actually happened to me some time back. All of a sudden, jobs on our Jenkins system stopped getting executed. The team started checking every possibility, from plugin updates to broken scripts to dependencies. After around 4 hours, someone checked the disk usage and found that the disk was full. This should have been caught with one look at the system metrics, but it was missed. You don’t want to be in this situation.

The bad part was that there were no metrics in place for disk usage. That is why blackbox monitoring is so important, and this is where the USE method might have saved debugging time. It may seem like a problem that will never happen to you. But trust me, there are a lot of issues that any new joiner could point out, yet even senior engineers spend a long time debugging them because they are fixed on a hypothesis. Remember, it’s always about first principles; you just have to figure out the steps that form your first principles.

That day, people realized that the USE method could have saved them, and that better metrics and alerts were needed.

