In Part 1 — Why Use Kubernetes, I explained why I believe Kubernetes is the future of DevOps. I went over Monolithic vs Microservices Architecture and the benefits of Immutable Infrastructure.
In Part 2 — Best Practices & Lessons Learned, I covered some of my takeaways and suggestions for implementing Kubernetes.
In this Part 3 post, I want to share a checklist to help make sure you have everything covered in the initial project planning of a migration to Kubernetes. Project planning is essential when introducing a new tool into your environment: if the goal is to get it into production, you want to identify every aspect you can up front, because you will almost always discover things along the way that you did not anticipate. By sharing this list, I hope to give others some of that insight in advance. These are just my recommendations and there are plenty of alternatives, so if anyone has feedback, I would love to hear your comments.
Production Checklist
- Provision & Deploy
- Installation, Configuration, and Automation
- Security
- Monitoring & Performance
- Logging
- Backup/Restore
- Networking
- HA, DR, and Scalability
- Cost Optimization
- Documentation and Testing
Provision & Deploy — K8s should be provisioned with HashiCorp’s Terraform into the cloud provider of your choice, unless for some reason you choose to build and manage your own Kubernetes control plane. Terraform is great because it promotes reproducible infrastructure and lets you keep your infrastructure as code under version control. Remarkably, at the time of this writing Terraform has not yet reached a 1.0 release, yet plenty of companies are already using it in production; that says a lot about the confidence people have in the product and how useful it is. What makes Terraform great is that it enforces Immutable Infrastructure and encourages a declarative style of code as opposed to a procedural one.
Chef and Ansible encourage a procedural style of code that specifies how to achieve some desired end state.
Terraform, CloudFormation, SaltStack, and Puppet encourage a declarative style of code that specifies your desired end state and the tool is responsible for figuring out how to achieve that state.
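Kubernetes itself follows the same declarative model: you describe the desired state in a manifest and the control plane works out how to converge on it. As a minimal, hypothetical illustration (the name, image, and replica count below are placeholders):

```yaml
# Declarative desired state: "three replicas of this container should exist."
# The Deployment controller figures out how to get there and keep it that way.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example/app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```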
Once your K8s cluster is ready, you can deploy your applications and services onto it using Helm charts. Helm is the best way to package, configure, and deploy your applications and services onto Kubernetes. As a package manager, Helm charts can be used to install software and its dependencies, fetch packages from repositories, and configure deployments.
Installation, Configuration, and Automation — Helm charts can also be used to manage your cluster configuration. You can keep different templates and environment-based configuration (values) files, which I have found works great for deployments; a sketch of that pattern follows below. This does not cover the container image build process, however, because there are many ways to implement that workflow.
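As a rough sketch of the idea (the chart name, image, and values below are all hypothetical), you might keep one values file per environment and pass it at deploy time:

```yaml
# values-prod.yaml -- hypothetical production overrides for a chart
replicaCount: 3
image:
  repository: example/app
  tag: "1.0.0"
ingress:
  enabled: true
  host: app.example.com

# Deployed with something like:
#   helm upgrade --install my-app ./chart -f values.yaml -f values-prod.yaml
```

Staging and development would get their own values files, so the chart templates stay identical across environments and only the overrides change.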
Security — This is a very important component that easily gets missed. You want to make sure your environment is secure. Kubernetes has RBAC, a built-in feature for role-based access control. You generally want to apply least-privilege access control and make sure access is only given to the people who need it. Kubernetes also has a built-in Secrets feature for storing keys and passwords; the catch is that Secrets are only Base64 encoded, so they are not actually encrypted. I would suggest using Helm secrets, which I covered briefly in my previous post.
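As a minimal sketch of least-privilege RBAC (the namespace, group name, and resource list are assumptions for illustration), a namespaced Role granting read-only access can be bound to a specific group rather than handing out cluster-admin:

```yaml
# Read-only access to Pods and their logs in a single namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-app          # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind the Role to a group instead of individual users.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-app
subjects:
  - kind: Group
    name: app-developers     # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```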
Monitoring & Performance — There are a handful of options for monitoring your K8s cluster: Prometheus, the Kubernetes dashboard, AppDynamics, and New Relic, just to name a few. If the goal is to use open source, Prometheus is probably the winner, but there are plenty of other choices to consider. The basic Kubernetes dashboard is probably the easiest to set up and is great to have, not just for monitoring but as a GUI where you can do just about anything you can do with the kubectl CLI. Monitoring application performance is also very important so that you can properly tune your settings for an optimal setup and for scaling purposes.
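If you go the Prometheus route, one common pattern (assuming your Prometheus scrape configuration uses the conventional annotation-based pod discovery, which is not universal) is to annotate the pod template so the metrics endpoint is picked up automatically; the port and path below are placeholders for your application:

```yaml
# Fragment of a pod template's metadata; only meaningful if Prometheus is
# configured to honor these conventional annotations.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```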
Logging — Similar to monitoring, there are a handful of options for K8s; the most popular are probably Fluentd/Graylog and Splunk. Graylog would probably win out of the two because it is open source.
Backup/Restore — Having a proper backup and restore process is key for any application infrastructure. You want frequent database backups, stored in a safe, secure, and redundant location such as S3 if you are in AWS. Run automated jobs using CronJobs for backups, indexing, and cleanups. Don’t forget to test your restore process and make sure it is properly documented.
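A rough sketch of a scheduled backup job is below; the image, command, bucket, and Secret name are all placeholders, and a real job would source credentials from a Secret or an IAM role rather than anything baked into the image:

```yaml
# Nightly database dump pushed to S3 -- hypothetical image and bucket.
apiVersion: batch/v1            # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 3 * * *"         # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # Placeholder image assumed to contain pg_dump and the AWS CLI.
              image: example/db-backup:latest
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump "$DATABASE_URL" | gzip | aws s3 cp - "s3://example-backups/db-$(date +%F).sql.gz"
              envFrom:
                # Hypothetical Secret holding DATABASE_URL and AWS credentials.
                - secretRef:
                    name: db-backup-credentials
```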
Networking — K8s has built-in Services that work well for communication between pods and containers. On top of the basics, it allows more complex routing through features such as Ingress, and if you need even more advanced networking, Istio is probably the best add-on for that. Istio is a service mesh that uses sidecars to manage network traffic and control the flow of traffic and API calls between services. It can enable service-to-service encryption. Other benefits include additional testing and deployment features such as A/B testing, canary deployments, and blue/green deployments with percentage-based traffic splits. Additional security benefits include the ability to secure pod-to-pod or service-to-service traffic at both the network and application layers. It also supports HTTP/2.
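For example, a weighted traffic split for a canary rollout might look roughly like this (the host, subset names, and weights are hypothetical, and this assumes matching DestinationRule subsets already exist; the exact apiVersion depends on your Istio release):

```yaml
# Send 90% of traffic to the stable version and 10% to the canary.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-app
spec:
  hosts:
    - example-app              # hypothetical in-mesh service host
  http:
    - route:
        - destination:
            host: example-app
            subset: stable     # defined in a corresponding DestinationRule
          weight: 90
        - destination:
            host: example-app
            subset: canary
          weight: 10
```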
For general networking purposes, you can leverage your existing cloud provider tools and services.
HA, DR, and Scalability — High availability and disaster recovery are important for your production environment and require proper planning and documentation. The good thing about Kubernetes is that many built-in features provide high availability for you. By using ReplicaSets, you already ensure more than one pod/container is available to support your application. With the Horizontal Pod Autoscaler (HPA), if your pods are over-utilized, Kubernetes will scale them out for you; a minimal HPA manifest is sketched below. At the worker-node level, you can use the Cluster Autoscaler, which adds nodes when there aren’t enough resources for additional pods.
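A minimal HPA sketch (the target Deployment name and thresholds are placeholders, and the apiVersion varies with cluster version):

```yaml
# Scale the example-app Deployment between 2 and 10 replicas based on CPU.
apiVersion: autoscaling/v2     # autoscaling/v2beta2 or v1 on older clusters
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```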
These K8s tools I just mentioned give you HA and scalability with little effort. Disaster recovery requires a proper plan along with solid backup and restore processes. You can also leverage your cloud provider’s tools to get the best replication and redundancy; in AWS, for example, you can use multi-AZ, multi-region, and ELBs. With EKS, AWS manages your master nodes for you, so at least there’s one less thing to worry about.
Cost Optimization — Migrating your environment to Kubernetes should ideally save some money, since it makes better use of the resources you run. Choosing the right instance types for your K8s worker nodes is probably the biggest lever; in AWS, for example, purchasing Reserved Instances can save a significant amount if you are willing to pay more upfront.
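Right-sizing the workloads themselves also matters: resource requests drive how many nodes the scheduler needs, so realistic values directly affect the bill. The numbers below are purely illustrative and should be tuned from observed usage:

```yaml
# Fragment of a container spec -- illustrative values only; over-requesting
# wastes node capacity, under-limiting risks noisy-neighbor problems.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```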
Documentation and Testing — Last but not least, proper documentation is very important and is often overlooked. A proper project plan for the Kubernetes implementation helps ensure you cover every aspect of your environment and avoid surprises and unexpected incidents. Creating that plan beforehand also means you are not writing the documentation from scratch once the project is complete; without it, documentation tends to get skipped as you move on to other tasks and projects.
Proper testing covers not just bugs but also performance and scaling, especially if you are migrating an application from existing infrastructure. If your application was not containerized before, testing it in containers will be crucial, because you may discover issues you didn’t have prior to containerization.
Implementing a resiliency tool that injects random failures, such as Chaos Monkey, will help you plan for unexpected issues and make sure you have a stable and reliable system.
Hopefully this list helps get you one step closer to putting Kubernetes in production. There may be things I have not covered here, and of course every environment and architecture is different. Moving to Kubernetes will definitely be one of the best decisions you can make for your roadmap, and hopefully this post helps you along the way.