DevOps Interview Questions: How Will You Approach a Network Issue?

Network issues are among the most common problems you have to debug if you are involved in infrastructure setup and day-to-day operations. In this article, we will see how to approach one.

Let’s say we have a few boxes that need to connect to a third-party API outside our network, and all our outbound traffic goes through a NAT box. We will see how to debug this setup. Below is the diagram of how the connection flows.

[Diagram: traffic from the internal boxes goes through the NAT box to the third-party API]

We should generally debug in three simple steps: check whether DNS is working, check whether the traffic is taking the correct route (and, if not, where it is breaking), and check whether the service is up.

First, we need to check whether the external service's DNS name resolves. To check this, run the command below.

dig external_service_dns

Running telnet against the hostname would also tell you if the name is not resolvable. We are working through the checks one at a time, though, so we will leave telnet for a later step.

If the name resolves, the issue is not here. Otherwise, check your /etc/resolv.conf file and debug at the DNS level.
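The DNS check above can be sketched as two small helpers. This is a minimal sketch, assuming a POSIX shell with awk; the function names has_dns_answer and list_nameservers are made up for illustration and are not part of dig.

```shell
# Succeeds if `dig +short` output on stdin contains at least one record.
# Lines starting with ";" are dig diagnostics (for example a timeout
# message), not answers, so they are ignored.
has_dns_answer() {
    awk '!/^;/ && NF { found = 1 } END { exit(found ? 0 : 1) }'
}

# Prints the resolver IPs from a resolv.conf-style file.
# Defaults to /etc/resolv.conf when no file is given.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage (external_service_dns is the placeholder name used in this article):
#   if dig +short external_service_dns | has_dns_answer; then
#       echo "DNS is fine, move on to the next step"
#   else
#       echo "resolution failed, resolvers in use:"
#       list_nameservers
#   fi
```

Filtering out the ";" lines matters because a dig timeout would otherwise look like a non-empty answer.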

Next, we will check whether we can reach the NAT box, and then whether the external service is reachable. To do so, we can use traceroute to see the route the packets take.

traceroute external_service_dns

The output will look something like the sample below (a trace to google.com). The second line of the output should show your NAT IP.

traceroute to google.com (172.217.166.174), 30 hops max, 60 byte packets
 1  _gateway (192.168.1.1)  5.145 ms  5.269 ms  5.334 ms
 2  abts-mh-dynamic-001.33.169.122.airtelbroadband.in (122.169.33.1)  8.484 ms  8.751 ms  8.721 ms
 3  125.16.217.229 (125.16.217.229)  9.018 ms aes-static-249.51.22.125.airtel.in (125.22.51.249)  8.991 ms  9.277 ms
 4  116.119.55.33 (116.119.55.33)  22.388 ms 116.119.57.137 (116.119.57.137)  18.453 ms 182.79.154.71 (182.79.154.71)  20.789 ms
 5  142.250.161.56 (142.250.161.56)  20.392 ms  18.865 ms  18.834 ms
 6  10.23.221.190 (10.23.221.190)  23.658 ms 10.23.221.222 (10.23.221.222)  14.676 ms 10.23.206.126 (10.23.206.126)  24.185 ms
 7  74.125.243.97 (74.125.243.97)  14.236 ms 72.14.232.56 (72.14.232.56)  14.244 ms 108.170.251.113 (108.170.251.113)  14.171 ms
 8  74.125.243.98 (74.125.243.98)  20.370 ms 108.170.251.124 (108.170.251.124)  18.547 ms 108.170.251.107 (108.170.251.107)  14.245 ms
 9  72.14.233.107 (72.14.233.107)  21.057 ms 108.170.238.146 (108.170.238.146)  33.528 ms 142.250.63.117 (142.250.63.117)  12.968 ms
10  108.170.248.193 (108.170.248.193)  34.578 ms 72.14.235.154 (72.14.235.154)  30.827 ms 72.14.232.138 (72.14.232.138)  35.468 ms
11  216.239.48.64 (216.239.48.64)  39.609 ms 74.125.253.107 (74.125.253.107)  37.134 ms 108.170.248.209 (108.170.248.209)  33.761 ms
12  bom07s20-in-f14.1e100.net (172.217.166.174)  52.450 ms 216.239.57.189 (216.239.57.189)  52.390 ms 108.170.248.193 (108.170.248.193)  48.849 ms
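Scanning a long trace by eye is error-prone, so here is a rough sketch of a helper that looks for the NAT IP among the hops. It assumes output in the format shown above and a POSIX shell with awk; route_via_nat is a made-up name, and the NAT IP you pass in is whatever your own NAT box uses.

```shell
# Succeeds if the given IP shows up as a hop in traceroute output read
# from stdin. Only lines starting with a hop number are inspected, so
# the destination IP in the "traceroute to ..." header is not counted.
route_via_nat() {
    awk -v nat="$1" '
        $1 ~ /^[0-9]+$/ {
            for (i = 2; i <= NF; i++) {
                field = $i
                gsub(/[()]/, "", field)   # hop IPs are printed as (a.b.c.d)
                if (field == nat) { found = 1 }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage (NAT_IP is a placeholder for your NAT box address):
#   traceroute external_service_dns | route_via_nat "$NAT_IP" \
#       || echo "traffic is NOT going through the NAT box"
```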

The traceroute output shows the route the packets take to reach the external service, and this route should go through your NAT box. If it does not, that is another issue, and you have to check the routes of the subnet in which your internal apps are deployed. The external service may be rejecting the request because it is not coming from the NAT box. To fix this, add a route in your app’s subnet so that any outbound traffic goes through the NAT box.

For anything destined to 0.0.0.0/0, the next hop should be the NAT box.
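If the route is configured on the Linux hosts themselves, the default route can be inspected with the ip command from iproute2. The sketch below rests on that assumption; in a cloud network the equivalent entry usually lives in the subnet's route table instead, and you would check it there. The name default_next_hop is made up for illustration.

```shell
# Prints the next hop of the default route, given `ip route` output on
# stdin. Prints nothing if there is no default route with a gateway.
default_next_hop() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "via") { print $(i + 1); exit }
        }
    }'
}

# Usage (NAT_IP is a placeholder for your NAT box address):
#   hop="$(ip route show default | default_next_hop)"
#   [ "$hop" = "$NAT_IP" ] || echo "default route points at '$hop', not the NAT box"
#
# A host-level fix would then look like:
#   ip route replace default via "$NAT_IP"
```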

By now, we have established that the traffic goes through the NAT box, or we have fixed the route so that it does. Next, if you still cannot reach the external service on a given port, try to telnet to it and see whether you can connect. If you cannot, you will get an error like the one below.

telnet external_dns port
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused

This means that the service is not active on this port and you have to talk to the vendor about this.
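Not every failed connection is a refusal, so it helps to read the error text before calling the vendor. Here is a small sketch that sorts the message into rough buckets. The function name and the wording of the buckets are my own; a timeout usually (not always) points at a firewall or routing problem rather than at the service itself.

```shell
# Maps the error text printed by telnet (or a similar client) to a
# rough diagnosis. The message is passed as the first argument.
classify_connect_error() {
    case "$1" in
        *"Connection refused"*)
            echo "refused: nothing is listening on that port" ;;
        *"timed out"*)
            echo "timeout: packets are likely being dropped on the way" ;;
        *)
            echo "unknown: inspect the message manually" ;;
    esac
}

# Usage, with the error shown in this article:
#   classify_connect_error "telnet: Unable to connect to remote host: Connection refused"
```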

This was a very basic walkthrough of how to approach such problems. Many other issues on the NAT box or the external service may need more in-depth debugging. For example, IP forwarding may not be active on the NAT box, or some headers may be getting dropped along the way.

Remember one basic principle: check the path step by step and debug each step. Don't skip steps. It is the same as a clogged pipe: you need to clean it from end to end to allow water to flow.
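The IP forwarding case is quick to rule out on a Linux NAT box, because the kernel exposes the setting as a file. A minimal sketch, assuming a Linux NAT box; the function name is made up, and the optional file argument exists only to make the helper easy to try out.

```shell
# Succeeds if IPv4 forwarding is enabled. Reads the kernel setting from
# /proc by default; a different file can be passed for testing.
ip_forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# Usage, run on the NAT box:
#   ip_forwarding_enabled || echo "IP forwarding is off"
#
# It can be switched on until the next reboot with:
#   sysctl -w net.ipv4.ip_forward=1
```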
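The step-by-step principle can also be put into a tiny runner that stops at the first failing check. This is only a sketch: run_checks is a made-up name, and the step names in the usage comment are hypothetical functions you would write yourself for DNS, route and port checks.

```shell
# Runs the given check functions in order and stops at the first one
# that fails, so later steps are never checked on top of a broken one.
run_checks() {
    for step in "$@"; do
        if "$step"; then
            echo "ok: $step"
        else
            echo "FAILED: $step (fix this before checking later steps)"
            return 1
        fi
    done
}

# Usage (check_dns, check_route and check_port are hypothetical):
#   run_checks check_dns check_route check_port
```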



Gaurav Yadav

Gaurav is a cloud infrastructure engineer, full stack web developer and blogger. He is a sportsperson at heart and loves football. He loves working at scale and is always keen to learn new tech. He is experienced with CI/CD, distributed cloud infrastructure, build systems and a lot of SRE work.

2 COMMENTS
  • Cyber Joe
    Except if you are letting trace route go through your firewall you have already failed.

  • Sambasivarao
    Very interested topic
