After we started migrating, we observed many performance and functional issues in our cluster caused by incorrect configuration. One effect of that was adding large buffers to resource requests and limits, to rule out resource constraints as a cause of performance degradation.
Our learning is that operating Kubernetes is complex. There are a lot of moving parts. And learning how to operate Kubernetes is most likely not core to your business. Offload as much as possible to cloud service providers (EKS, GKE, AKS). There is no value in doing this yourself.
However, we initially wasted an enormous amount of resources while migrating. Our inability to tune our self-managed Kubernetes cluster correctly led to numerous performance issues, so we ended up requesting far more resources in our pods than needed, as a buffer and a form of insurance to reduce the chance of outages or performance issues caused by a lack of compute or memory.
Deploying Open Policy Agent to build the right controls helped automate the entire change management process and build the right safety nets for our developers. With Open Policy Agent, we can restrict scenarios like the one just mentioned: it is possible to block Service objects from being created unless the right annotation is present, so that developers don’t accidentally create public ELBs.
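A sketch of what such a policy might look like in Rego, assuming OPA is wired up as a validating admission webhook (the package name and denial message are illustrative, not the actual policy described above):

```rego
package kubernetes.admission

# Deny LoadBalancer Services that lack the internal-ELB annotation,
# so a public-facing ELB cannot be created by accident.
deny[msg] {
    input.request.kind.kind == "Service"
    input.request.object.spec.type == "LoadBalancer"
    annotations := object.get(input.request.object.metadata, "annotations", {})
    not annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]
    msg := "LoadBalancer Services must carry the internal ELB annotation"
}
```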
Sometimes this independence can pose severe risks. For example, using a LoadBalancer-type Service in EKS provisions a public-network-facing ELB by default; adding a specific annotation ensures that an internal ELB is provisioned instead. We made some of these mistakes early on.
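For reference, the annotation in question looks like this on a Service manifest (the service name and ports here are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical service name
  annotations:
    # Without this annotation, EKS provisions a public-facing ELB by default.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: my-app
```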
One of the first observations was pod evictions due to memory constraints on nodes. The cause was resource limits that were disproportionately high compared to resource requests. With a surge in traffic, increased memory consumption could saturate memory on nodes, leading to pod evictions.
This does not apply to non-production environments (such as development, staging and CI). These environments don’t get spikes in traffic. Theoretically, you can run an unlimited number of containers if you set CPU requests to zero and set a high enough CPU limit for your containers. If your containers start utilizing a lot of CPU, they will get throttled. You can do the same with memory requests and limits as well. However, the behaviour on reaching a memory limit is different from that of CPU: if you utilize more than the set memory limit, your containers get OOM killed and restart. If your memory limit is abnormally high (say, higher than the node’s capacity), you can keep using memory, but eventually the scheduler will start evicting pods when the node runs out of available memory.
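For example, a non-production container spec along these lines overcommits heavily (the values are illustrative, not our actual settings):

```yaml
# Illustrative non-production resource settings: near-zero requests let many
# pods pack onto a node; generous limits mean CPU gets throttled under
# contention, while exceeding the memory limit gets the container OOM killed.
resources:
  requests:
    cpu: "0"        # scheduler places the pod regardless of CPU pressure
    memory: 32Mi
  limits:
    cpu: "2"
    memory: 2Gi
```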
Even when using a managed Kubernetes service, invest early in infrastructure-as-code setup to make disaster recovery and upgrade process relatively less painful in the future and be able to recover fast in face of disasters.
In non-production environments, we safely overcommit resources as much as possible by keeping resource requests extremely low and limits extremely high. The limiting factor in this case is memory: no matter how low the memory request is and how high the memory limit is, pod eviction is a function of the sum of memory utilized by all containers scheduled on a node.
Our learning was to keep resource requests high enough, but not so high that we waste resources during low-traffic hours, and to keep resource limits relatively close to resource requests, allowing some breathing room for spiky traffic without pod evictions due to memory pressure on nodes. How close the limits must be to the requests depends on your traffic patterns.
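In production, that learning translates into requests sized for steady-state load with limits only modestly above them (the numbers below are placeholders to show the shape; tune them to your own traffic patterns):

```yaml
# Illustrative production settings: requests cover steady-state usage;
# limits sit close above them to absorb short spikes without inviting
# node memory pressure and pod evictions.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 750m
    memory: 640Mi
```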
Kubernetes is meant to unlock the cloud platform for developers, make them more independent and promote a DevOps culture. Opening up the platform to developers, reducing intervention by cloud engineering teams (or sysadmins) and making development teams independent should be one of the important goals.
This was the most obvious one. Our infrastructure today has far less compute, memory and storage provisioned than we had before. Apart from better capacity utilisation due to better packing of containers/processes, we were also able to make better use of shared services, such as observability processes (metrics, logs), than before.
You can push towards GitOps if you wish. If you can’t do that, reducing manual steps to a bare minimum is a great start. We use a combination of eksctl, terraform and our cluster configuration manifests (including manifests for platform services) to set up what we call the “Grofers Kubernetes Platform”. To make the setup and deployment process simpler and repeatable, we have built an automated pipeline to set up new clusters and deploy changes to existing ones.
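As one piece of such a pipeline, a minimal eksctl cluster definition might look like this (the cluster name, region and node group sizes are hypothetical, not our actual configuration):

```yaml
# cluster.yaml — create with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: platform-cluster   # hypothetical cluster name
  region: ap-south-1       # hypothetical region
nodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 2
    maxSize: 6
```

Keeping a file like this in version control means a cluster can be rebuilt from scratch with a single command, which is what makes disaster recovery and upgrades far less painful.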