Summary of RKE project
One of the projects I was recently responsible for was implementing Rancher and RKE cluster management infrastructure in our private cloud data centers. It was a great learning experience to reflect on, given the complexities involved.
The project started with the goal of using Rancher to manage RKE clusters. Rancher Kubernetes Engine (RKE) is Rancher's CNCF-certified Kubernetes distribution that runs on bare-metal and virtualized servers. The big benefit of this distribution is being able to use the same management tool to provision and manage clusters in our private data centers as well as in GCP/GKE and AWS/EKS. Centralized, automated installation and operation of K8s clusters lets us treat RKE as an internal product that my team supports. The eventual goal is a self-service model where our customers, which are internal teams, can easily provision new RKE K8s clusters and namespaces in an automated fashion; to the end user, RKE would feel similar to EKS or GKE. We considered a handful of options but felt that Rancher was the best fit for us since we were going with the Enterprise Support license. A nice aspect of this product is that the open source and enterprise solutions are the same version of the application; the only difference is the support model. Leveraging Rancher support as a resource significantly increased our productivity in implementing the solution.
In the diagram above, I outlined the high-level details of our provisioning process. Once the physical hardware is racked and stacked in the data center, we install ESXi on top so that it is ready to provision VMs. Since the hosts are intended to run VMs, we purchase hardware with plenty of CPU and memory. The development and staging environments share the large ESXi hosts for VMs. In the production environment, ESXi hosts are mapped 1:1 to K8s control plane, etcd, and worker nodes.
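As an illustration of the VM layer, here is a minimal sketch of how a worker VM could be described with the Terraform vsphere provider; the datacenter, cluster, datastore, template names, and sizing are placeholders rather than our actual configuration.

```hcl
# Minimal sketch: cloning K8s worker VMs from a Packer-built template
# with the Terraform vsphere provider. All names and sizes are illustrative.

provider "vsphere" {
  vsphere_server       = "vcenter.example.internal"
  user                 = "terraform-svc@vsphere.local"
  password             = var.vcenter_password
  allow_unverified_ssl = true
}

variable "vcenter_password" {
  type      = string
  sensitive = true
}

data "vsphere_datacenter" "dc" {
  name = "dc-prod"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "prod-esxi-cluster"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_datastore" "ds" {
  name          = "prod-datastore"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "net" {
  name          = "prod-vm-network"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_virtual_machine" "template" {
  name          = "rhel-rke-template-2020-06"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "rke_worker" {
  count            = 3
  name             = "rke-worker-${count.index}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id

  num_cpus = 8
  memory   = 32768
  guest_id = data.vsphere_virtual_machine.template.guest_id

  network_interface {
    network_id = data.vsphere_network.net.id
  }

  disk {
    label = "disk0"
    size  = data.vsphere_virtual_machine.template.disks[0].size
  }

  # New nodes are always cloned from the latest patched template
  # rather than being modified in place.
  clone {
    template_uuid = data.vsphere_virtual_machine.template.id
  }
}
```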
A big benefit of using Packer to create templates is that it significantly reduced our RHEL licensing footprint, since we only need licenses for template creation. A license isn't required to run the RHEL OS in production if the server never needs updating, so subscriptions are only used for installing packages and updates. All patches and updates are applied to the template and rolled out with a rolling-upgrade strategy so that the production environment always runs the latest patched templates.
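For reference, below is a minimal sketch of what such a Packer template build could look like, assuming the vsphere-iso builder with shell and Ansible provisioners; the vCenter endpoint, ISO path, and playbook names are hypothetical, and the kickstart/boot configuration is omitted.

```hcl
# Minimal sketch of a Packer build that produces a patched RHEL vSphere
# template. Endpoints, credentials, and playbook names are placeholders.

variable "vcenter_password" {
  type      = string
  sensitive = true
}

variable "ssh_password" {
  type      = string
  sensitive = true
}

locals {
  template_version = formatdate("YYYY-MM", timestamp())
}

source "vsphere-iso" "rhel_rke" {
  vcenter_server      = "vcenter.example.internal"
  username            = "packer-svc@vsphere.local"
  password            = var.vcenter_password
  insecure_connection = true

  datacenter = "dc-nonprod"
  cluster    = "build-esxi-cluster"
  datastore  = "build-datastore"

  vm_name       = "rhel-rke-template-${local.template_version}"
  guest_os_type = "rhel8_64Guest"
  CPUs          = 4
  RAM           = 8192

  storage {
    disk_size             = 40960
    disk_thin_provisioned = true
  }

  network_adapters {
    network = "build-vm-network"
  }

  iso_paths    = ["[build-datastore] iso/rhel-8.iso"]
  ssh_username = "packer"
  ssh_password = var.ssh_password

  # Kickstart / boot_command configuration omitted for brevity.

  convert_to_template = true
}

build {
  sources = ["source.vsphere-iso.rhel_rke"]

  # Apply OS updates while the subscription is attached, then run CIS
  # hardening with Ansible; the resulting template ships fully patched
  # so running VMs never need to be modified in place.
  provisioner "shell" {
    inline = ["sudo dnf -y update"]
  }

  provisioner "ansible" {
    playbook_file = "./playbooks/cis-hardening.yml"
  }
}
```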
The initial design and testing of the provisioning automation were all done with Terraform Enterprise. There is no single-click provisioning yet, but it will be part of the next phase of improvements using Rundeck or Jenkins; Terraform Enterprise is unfortunately not the best tool for CI-style provisioning, and Jenkins is still an amazing tool for that.
The diagram above shows our Rancher design. The Rancher management clusters are 3-node HA instances provisioned in AWS and automated with Terraform Enterprise. Rancher currently advises against running the management cluster on EKS because it is a less controllable K8s platform, which makes provisioning the cluster more work to automate. Each Rancher management cluster supports the RKE clusters in the corresponding environment in the data centers. The Rancher base image is pulled from Rancher's public repository and stored locally in the non-production Artifactory to keep the cluster secure when patching; the image is promoted to the production Artifactory once it has been tested and verified.
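As a rough illustration (not our exact code), the Rancher server chart could be installed onto the HA cluster with Terraform's helm provider, pointing at an internal Artifactory mirror instead of the public repository; the URLs, hostname, and version below are placeholders.

```hcl
# Sketch: installing the Rancher server chart on the management cluster
# via the Terraform helm provider. The chart repository points at an
# internal Artifactory mirror (URL and hostname are placeholders).

provider "helm" {
  kubernetes {
    config_path = "~/.kube/rancher-mgmt.yaml"
  }
}

resource "helm_release" "rancher" {
  name             = "rancher"
  namespace        = "cattle-system"
  create_namespace = true

  repository = "https://artifactory.example.internal/artifactory/helm-rancher"
  chart      = "rancher"
  version    = "2.4.5" # promoted from non-prod Artifactory after testing

  set {
    name  = "hostname"
    value = "rancher.example.internal"
  }

  set {
    name  = "replicas"
    value = "3" # 3-node HA
  }
}
```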
The development, staging, and production environments are kept separately contained for testing purposes and use separate Git branches.
Okta integration provides RBAC for Rancher and the RKE clusters by mapping the roles provided by Rancher to specific Okta groups.
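Below is a sketch of what one such role mapping could look like, assuming the rancher2 Terraform provider is used; the cluster ID, role, and Okta group principal are made-up examples.

```hcl
# Sketch: granting an Okta group a Rancher-provided role on one RKE
# cluster via the rancher2 Terraform provider. IDs/names are examples.

provider "rancher2" {
  api_url   = "https://rancher.example.internal"
  token_key = var.rancher_api_token
}

variable "rancher_api_token" {
  type      = string
  sensitive = true
}

# Bind the built-in "Cluster Member" role to an Okta group on the
# development RKE cluster.
resource "rancher2_cluster_role_template_binding" "dev_team" {
  name               = "dev-team-cluster-member"
  cluster_id         = "c-abc123"                  # example cluster ID
  role_template_id   = "cluster-member"            # Rancher built-in role
  group_principal_id = "okta_group://k8s-dev-team" # Okta group principal (format illustrative)
}
```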
The original proof of concept for using K8s in production actually led to going into production without being properly production ready. It was meant to deliver an initial new service on K8s in the data center. It leveraged Ansible scripts for provisioning but didn't have immutability and automation in mind since it was a POC. The goal of an immutable infrastructure is one where servers are never modified after being deployed; as part of any upgrade or patching, servers are rebuilt from templates and replaced to stay consistent and prevent configuration drift.
The goal of the RKE project was to improve upon the previous implementation of K8s in the data center and have a stable production infrastructure. There were a lot of learnings to take from the POC. Some of the improvements put in place include:
- A thought-out design/plan that addresses most of the concerns and risks
- A backup/restore/patch strategy
- HA/DR
- Centralized monitoring/alerting/logging
- K8s cluster management
- Terraform Enterprise as the centralized automation tool (a workspace sketch follows this list)
- VM templates created with Packer, with Ansible and Chef applied on top to stay current with CIS security hardening
- Proper access control using the Rancher and Okta integration for RBAC
- Properly sized hardware
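To give a flavor of the Terraform Enterprise piece, here is a minimal sketch of how a per-environment workspace could be defined with the tfe provider; the organization, hostname, repository, and workspace names are hypothetical.

```hcl
# Sketch: one Terraform Enterprise workspace per environment, wired to
# that environment's Git branch. Names and IDs are placeholders.

provider "tfe" {
  hostname = "tfe.example.internal"
}

variable "vcs_oauth_token_id" {
  type = string
}

resource "tfe_workspace" "rke_dev" {
  name              = "rke-clusters-dev"
  organization      = "example-org"
  working_directory = "environments/dev"
  auto_apply        = false # plans are reviewed before applying

  vcs_repo {
    identifier     = "example-org/rke-infrastructure"
    branch         = "develop"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```

One workspace per environment, each tied to its own branch, mirrors how the development, staging, and production environments are kept separately contained in Git.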
Improvements to make for the next phase would include:
- Centralized one-click automation and CI: tools like Jenkins/Rundeck to trigger existing Terraform Enterprise workspace jobs.
- A CI/CD cluster that constantly tests the infrastructure automation code. Since a lot of changes are constantly being made, a CI/CD cluster can flag breakage right away and save time and effort in discovering bugs and errors.
- Centralized configmaps/secrets. Deciding on a consistent strategy before implementation will prevent future confusion; HashiCorp Vault could be an ideal goal (see the sketch after this list).
- Self-service. The eventual goal is to give our users/customers (internal to the company, i.e. application teams, QA, etc.) the ability to provision a cluster/namespace through automation behind proper RBAC.
- Consistent configuration management.
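On the Vault idea mentioned above, below is a minimal sketch of how centralized secrets could be managed with the Vault Terraform provider; the mount, paths, and values are purely illustrative assumptions, not an existing setup.

```hcl
# Sketch: a centralized KV secrets mount managed with the Vault
# Terraform provider. Paths and values are illustrative only.

provider "vault" {
  address = "https://vault.example.internal"
}

resource "vault_mount" "rke" {
  path    = "rke"
  type    = "kv"
  options = { version = "2" } # KV v2 for secret versioning
}

resource "vault_generic_secret" "registry_creds" {
  # KV v2 secrets are written under the mount's data/ path.
  path = "${vault_mount.rke.path}/data/dev/registry"

  data_json = jsonencode({
    username = "artifactory-pull"
    password = "changeme" # placeholder; a real value would come from a secure source
  })
}
```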
Lessons learned if I had to redo a K8s infrastructure in the data center:
- Present the project plan, scope, and deliverables for visibility and exposure, as well as to gather feedback and concerns.
- Commit to a demo of the POC or MVP that showcases accomplishments and progress. This also helps set expectations with leadership and the organization, and builds trust with stakeholders and customers.
- Be well aware that implementing solutions in a private cloud data center is a lot more complex than in the public cloud. The public cloud has many features and tools already available that you may need to build yourself in the data center. Some of the extra concerns include hardware provisioning, VPN connections if the data center needs connectivity to other data centers or public cloud environments, availability of IP addresses, DNS, load balancing, custom ingress, etc.
- Implement a CI/CD cluster early on so that constant code changes are validated.
- Request proper sign-off on acceptance criteria from the customer and/or leadership. Identify the scope, time, and resources assigned, and make clear that any changes can and will affect delivery. Plan properly and allow enough time for completion. Ask for clear acceptance criteria for delivery and completion, along with a timeframe. Deliver the project plan before starting the implementation, make sure you have enough resources, and confirm that the expected delivery date is acceptable. Identify risks and allow time for them; new risks WILL be identified and SCOPE will always be added. If a new tool is introduced or implemented, plan extra time to learn and troubleshoot it. Since Rancher/RKE was a new tool for us, a lot of unknowns and new discoveries came up during discovery and implementation.
- More visibility! This project significantly lacked exposure. The project was extremely complex, but that might not have been clear to the customer, product owner, or leadership, and new risks, blockers, and scope were continually introduced. This could have been improved by allocating more time to planning and design early on during the assessment phase, and by identifying up front whether the knowledge, expertise, and additional support for the tools being used were available.
- Identify the proper timeframe and phases early on to keep track of momentum, velocity, and slippage. If the project uses sprints, having a guideline before implementation starts is extremely helpful. Leave room for delays so that future sprints are not completely thrown off track. Commit to daily stand-ups, sprint planning, midpoint check-ins, retrospectives, and backlog grooming. Additionally, identify the project owner and scrum master, and decide whether the same person will serve as scrum master for the duration of the project or the role will rotate. Make sure these expectations are set before implementation starts.
- Make sure the team working on the project is familiar with Agile/Scrum/sprint methodology and follows it properly.