Running Kubernetes in Production: Part 1

There are tons of articles on the internet about the basics of running a tool, but there is a huge gap between running something just to test it out and running it in production. A lot of tools break when they start working at scale. In this series, based on my experience, we are going to talk about how you can effectively run these tools in production. This is the first article in the Running in Prod series, and in it we will talk about Kubernetes. We will cover common things you have to do irrespective of the cloud provider you choose.

If you first want to read about Kubernetes basics, you can follow this list.

https://www.learnsteps.com/tag/basics-on-kubernetes/

DNS

DNS resolution is the first thing that happens when requesting any resource, but it is often the last thing on your mind when configuring Kubernetes for production. Making sure you can serve all DNS queries at large scale becomes very tricky. There are two ways to approach this problem.

The first is to horizontally scale the CoreDNS deployment so that it can serve the request volume. The other is to run NodeLocal DNSCache, which runs a caching agent on each node so that most DNS queries are answered locally instead of leaving the node; only cache misses are forwarded to the cluster DNS service.
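As an illustrative sketch of the first approach, the cluster-proportional-autoscaler add-on (assuming you deploy it alongside CoreDNS) can scale DNS replicas with cluster size via a ConfigMap; the numbers below are placeholder values, not recommendations:

```yaml
# Sketch: ConfigMap read by cluster-proportional-autoscaler to size CoreDNS.
# "linear" mode picks replicas = max(ceil(cores/coresPerReplica),
#                                    ceil(nodes/nodesPerReplica), min).
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2
    }
```

Tune `coresPerReplica` and `nodesPerReplica` based on observed DNS QPS per replica in your own cluster.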

Proper resource allocation for Kubelet and OS operations

Apart from pods, there are two more important things running on any node. The first is the operating system: you have to make sure resources are available for the underlying OS to perform its operations. The second is the kubelet, which is responsible for making sure your pods are up, launching new pods and containers, and restarting them when needed. It also performs health checks and reports status back to the API server. If the kubelet is down for some time, all the pods on that node are marked unhealthy and get rescheduled onto other nodes. There are other Kubernetes-related processes that run as well, such as the CNI plugin.

To reserve resources for Kubernetes-related processes, you can use the kubelet flag below:

--kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

To reserve resources for the OS, you can use the kubelet flag below:

--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
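The same reservations can also be expressed in a kubelet configuration file instead of command-line flags. A minimal sketch mirroring the values above (the numbers are the example values from the flags, not recommendations; adjust them to your node sizes):

```yaml
# Sketch: KubeletConfiguration with reservations matching the flags above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:           # for kubelet, container runtime, etc.
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
systemReserved:         # for the OS and system daemons
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
evictionHard:           # kubelet starts evicting pods below this threshold
  memory.available: "100Mi"
```

Allocatable capacity advertised to the scheduler becomes node capacity minus these reservations and the eviction threshold.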

Kube Proxy Mode

If you don’t know about kube-proxy, you can read more about it here. It can run in three modes: userspace, iptables, and IPVS. Using the correct mode is very important, as it will affect your traffic when the number of services increases. For example, iptables mode with a huge number of services can cause issues, since the rules are evaluated sequentially. In short, if you have more than about 1000 services, IPVS mode will give you a good performance gain. You can read about it in depth here: https://www.tigera.io/blog/comparing-kube-proxy-modes-iptables-or-ipvs/
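If you decide to switch modes, kube-proxy reads it from its configuration. A minimal sketch of a KubeProxyConfiguration selecting IPVS (the scheduler choice here is illustrative; round-robin is the default):

```yaml
# Sketch: kube-proxy configuration enabling IPVS mode.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; "lc" (least connection) is another option
```

Note that IPVS mode requires the IPVS kernel modules to be loaded on the nodes; kube-proxy falls back to iptables mode if they are missing.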

Choosing Networking model

Choosing a networking model is a very important decision. Ideally there are two to choose from: flat networking, such as the VPC CNI in AWS, or overlay networking. Both have their pros and cons, and this can be a very important decision while running Kubernetes in production.

If you go with flat networking, you may end up with limited IPs on some cloud providers, but networking stays very fast and simple. With overlay networking, you take a small latency hit from the IP-in-IP (or VXLAN) encapsulation layer.
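As an example of softening that overlay trade-off, Calico (if that happens to be your CNI) can limit IP-in-IP encapsulation to traffic that crosses subnet boundaries, so same-subnet pod traffic avoids the encapsulation cost. A sketch with an assumed pod CIDR:

```yaml
# Sketch: Calico IPPool that only encapsulates across subnet boundaries.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16      # assumed pod CIDR, pick your own
  ipipMode: CrossSubnet     # "Always" encapsulates everything; "Never" disables it
  natOutgoing: true         # SNAT pod traffic leaving the pool
```

With `CrossSubnet`, nodes in the same subnet route pod traffic natively and only inter-subnet traffic pays the encapsulation latency.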

VPC Design

This one is really important: a lot of the time, while creating a Kubernetes cluster, you end up choosing smaller subnets and thus limit the number of pods you can create. Make sure you plan for the future and choose CIDRs that have more than enough IPs to run pods in the cluster.
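A quick back-of-the-envelope check helps here: with a flat CNI, every pod consumes one IP from the subnet, so subnet size directly caps pod count. You can sketch the capacity with Python's standard `ipaddress` module (the reserved count of 5 models AWS-style per-subnet reservations and is an assumption; other providers differ):

```python
import ipaddress

def usable_ips(cidr: str, reserved: int = 5) -> int:
    """Addresses usable for pods in a subnet; 'reserved' models addresses
    the cloud provider keeps back (AWS reserves 5 per subnet, for example)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - reserved

print(usable_ips("10.0.1.0/24"))  # 251  -> quickly exhausted by pods
print(usable_ips("10.0.0.0/18"))  # 16379 -> far more headroom
```

Remember that nodes, load balancers, and other resources also draw from the same subnets, so the pod budget is even smaller in practice.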

These were a few considerations for running your Kubernetes clusters in production. In the next part of this series we will look at more pointers from a security point of view and also from an execution point of view.

If you like the write-up, please share it in the community so that others can also gain from it.


Gaurav Yadav

Gaurav is a cloud infrastructure engineer, a full stack web developer, and a blogger. Sportsperson by heart who loves football. Scale is something he loves to work on, and he is always keen to learn new tech. Experienced with CI/CD, distributed cloud infrastructure, build systems, and a lot of SRE stuff.
