Kubernetes Upgrade Failure Story: How a CNI config change caused havoc in our Redis Infrastructure

At one of the organizations I have worked with, most of our Redis infrastructure runs on Kubernetes. If you are familiar with Redis Cluster, clients connect to the shards directly using the IPs of the machines they run on. Since we run Redis in StatefulSets, we expose the pod IPs inside our infrastructure. The Kubernetes cluster is EKS, and the pod IPs are reachable thanks to the external SNAT setting that EKS provides. Remember this setting, because it turned out to be the main culprit.
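
That setting is an environment variable, AWS_VPC_K8S_CNI_EXTERNALSNAT, on the aws-node DaemonSet that ships with the AWS VPC CNI. As a rough sketch (your own manifest may wire it in differently), enabling it looks like this:

    kubectl -n kube-system set env daemonset aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true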

So we started upgrading our Kubernetes cluster from 1.15 to 1.16. If you remember, Kubernetes 1.16 removed a lot of deprecated APIs, and we spent a lot of time fixing those. We launch and configure our Kubernetes clusters with Terraform.

Upgrade Path

The upgrade path was as follows: upgrade the cluster with Terraform, then upgrade the CNI, kube-proxy, and CoreDNS manually by applying the manifests provided by AWS. This is the standard path you have to follow when upgrading an EKS cluster.
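
In practice, the add-on half of that path boils down to a couple of kubectl commands. A rough sketch (the manifest file name is a placeholder; take the exact manifest for your target version from the EKS upgrade docs):

    # check which CNI version is currently running
    kubectl -n kube-system describe daemonset aws-node | grep amazon-k8s-cni:

    # apply the CNI manifest published by AWS for the target version
    kubectl apply -f aws-k8s-cni.yaml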

We completed the EKS upgrade, and now it was time to update the CNI. Following the EKS upgrade documentation, we downloaded the manifest for the new CNI and applied it. All of our AWS CNI pods were replaced. Within minutes we started getting alerts for connection failures to our Redis infrastructure, and our applications began seeing intermittent connection errors to Redis. We run around 20+ high-throughput Redis Clusters.

Culprit

We started going through the changes we had made and identified that the external SNAT parameter was no longer set the way it was before: we run with AWS_VPC_K8S_CNI_EXTERNALSNAT=true, but the manifest we applied reset it to its default, and as a result our connections were being terminated intermittently. Now, the question is: why intermittently?

When external SNAT is off, traffic from your pods to destinations outside the VPC is source-NATed to the primary IP of the EC2 instance. So for a connection coming in from outside the VPC, if the pod has been assigned the primary IP the connection succeeds, but if the pod has a secondary IP the connection fails, because the response leaves from the primary IP and gets dropped by the client. You can read more about it in depth here:

https://docs.aws.amazon.com/eks/latest/userguide/external-snat.html
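
Following that explanation, one quick way to eyeball the symptom is to compare a Redis pod's IP with its node's primary IP; clients connect to the pod IP, but SNATed replies leave from the node IP. The pod name here is hypothetical:

    # podIP is what clients dial; hostIP is the node's primary IP that SNAT rewrites the source to
    kubectl get pod redis-cluster-0 -o jsonpath='{.status.podIP} {.status.hostIP}{"\n"}'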

So when we applied the new CNI manifest, we did not carry over our external SNAT setting, and that is how this issue started.

The lesson here is that anything you apply directly from AWS while upgrading should be verified first, or issues like this can happen. The external SNAT issue is a very common one, and you should be aware of it if you are trying to reach your pods from a VPC other than the one your Kubernetes cluster runs in. Since you are applying something taken directly from a cloud vendor, you should verify it against the source of truth you keep in your own repository.
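
A lightweight guard rail we could have used: after applying a vendor manifest, read back the settings you care about and diff them against what your repository says they should be. A minimal sketch for the setting that bit us:

    # read the live value from the aws-node DaemonSet and compare it with the repo's manifest
    kubectl -n kube-system get daemonset aws-node -o yaml | grep -A 1 AWS_VPC_K8S_CNI_EXTERNALSNAT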

If you like this, please share and subscribe for more such articles.


Gaurav Yadav

Gaurav is a cloud infrastructure engineer, full-stack web developer, and blogger. A sportsperson at heart who loves football. He loves working on scale and is always keen to learn new tech. Experienced with CI/CD, distributed cloud infrastructure, build systems, and a lot of SRE stuff.
