Scaling SRE and Platform Infrastructure Team into Distributed Squads

Feb 11, 2025

I want to share some reflections on establishing and supporting the centralized K8s platform and managed services team at VMware. VMware as I knew it is gone, since it was acquired by Broadcom and dismantled, but I really enjoyed what I helped build with an amazing team and culture over 3 years.

We had a lot of potential: I helped grow the team from 12 engineers to 50+ across 7 countries, providing 24/7 support and maintaining a 99.99% uptime SLO. Unfortunately, as we planned our next steps of growth and scaling, we hit a huge roadblock and had to adapt to extreme changes in business direction and a pending acquisition that took 18 months to close due to regulatory approvals.

One specific topic I want to discuss is how I adopted the distributed squads concept: building subject matter expertise (SME) within each squad so that engineers could focus on one area and work more efficiently.

I did something similar in my previous role at Splunk. As I helped grow the team there from 9 to 18, it made sense to split it into a build team and a run team to establish some focus. The entire team was doing too many things, and engineers weren’t able to focus on projects. Assigning leads and narrowing the scope brought more clarity, and it empowered the leads with an opportunity to make decisions and to mentor and coach other engineers.

Innovative Squad-Based Structure

In my role as the people manager for North America and global head of SRE at VMware, I led a team responsible for a complex and rapidly growing infrastructure platform with 250+ clusters and over 11,000 Kubernetes nodes across multi-cloud and on-prem environments. As our customer base expanded, so did the breadth of our service offerings, resulting in a significant increase in support demand. Team feedback highlighted that managing this wide range of responsibilities required constant context-switching, impacting our ability to consistently deliver high-quality support.

Problem and Innovation

To address these challenges, I proposed a specialized squad structure, organizing our team into focused areas: Development, CI/CD, Observability, Kafka, and General SRE. We kept each squad small (3–6 engineers) to promote collaboration and expertise development, aligning with the “two-pizza team” approach for agility. Engineers could also participate across multiple squads, creating flexibility and cross-functional knowledge.

Development Squads: Focused on in-house Kubernetes platform development on AWS and on-prem. These squads consistently delivered updates, bug fixes, and optimizations, ensuring that our control plane and platform stayed stable, reliable, secure, and compliant. Some of the big projects we worked on included bringing our custom K8s platform and control plane from AWS to an on-prem environment, and introducing a management UI for the platform. The UI offered self-service features for creating your own clusters and adding services such as an Observability package, Kafka, Vault, Istio, and RabbitMQ via operators.

CI/CD Squad: Managed our end-to-end CI/CD pipeline in GitLab, which was critical for supporting hundreds of daily commits across environments. The squad maintained seamless release processes and kept the pipeline healthy with automation in Rundeck, minimizing release risks. Because the entire team shared the responsibility of doing releases, this squad was crucial in ensuring smooth, on-time releases.

Observability Squad: Built and maintained observability tools that provided visibility into metrics and logging, both for our platform health and for our customers. Our observability operator provided a monitoring stack consisting of Thanos, Prometheus, Grafana, Alertmanager, and Wavefront Proxy. We also provided Kube-fluentd-operator, a managed multi-tenant logging service that shipped SaaS service logs to a variety of targets, including Logz.io, Log Insight, S3, Loggly, and Elasticsearch. Kube-fluentd-operator is an open source project that our team actively maintained, not just for internal VMware usage but also for the community.

Kafka Squad: Built and operated our managed Kafka service, which was both highly reliable and cost-effective. This offering resulted in a 50% cost reduction compared to AWS MSK. The service was upgraded regularly with zero downtime, and the team wrote a custom Kafka monitoring solution. We found the custom SLO exporter to be extremely useful and open sourced it for community use.

Kafka is highly reliable but has its limits, such as 4,000 partitions per broker or 200,000 partitions per cluster. We noticed that metadata could persist even after topics were deleted, creating unnecessary clutter and inefficiencies. We developed and open sourced a metadata reaper in Go that automatically identifies and removes unused topics, while applying filters to avoid accidentally deleting important topics, such as those with registered schemas. This automation kept our Kafka clusters clean, efficient, and scalable over time. I discussed more about these two open source tools here.

General SRE Squad: Maintained critical infrastructure components along with our other service offerings that included Istio Service Mesh, Vault Secret Management, RabbitMQ, VPN, and hosted certificate management. This team also ensured platform stability and handled critical customer requests.

Impact and Results

The squad-based structure provided several key benefits:

Subject Matter Expertise: Engineers in each squad gained deep specialization and domain knowledge, improving support speed and quality. By having a dedicated squad maintain the health of the pipelines, our CI/CD uptime increased by 30%.

Enhanced Ownership and Morale: Small, focused squads fostered ownership and pride. Engineers reported an increase in job satisfaction in internal surveys, validating the impact of clearer responsibilities and stronger team dynamics. When an engineer is overburdened with responsibilities and ownership on top of incidents, escalations, on-call rotations, and release duties, burnout can easily follow.

Efficiency and Scalability: Each squad maintained its own sprint board, stories, and deliverables, enabling faster alignment with business goals. With clear accountability, project management, and resource allocation, we saw a 15% increase in engineer productivity and a 20% reduction in our backlog tasks. Clearly defined ownership gave each squad areas of focus and gave both team members and leadership transparency into priorities. That visibility matters to management and leadership, who rely on key performance indicators (KPIs) and other quantifiable metrics to measure success against goals. Clear targets and milestones also make it easier to recognize and promote your team's accomplishments.

Customer Satisfaction: Our dedicated squads offered faster, direct support, enhancing customer satisfaction and reducing incident outage times.

By implementing this squad-based structure, I aligned my team with our business growth, empowered engineers to take ownership, and fostered a culture of expertise and continuous improvement. I am passionate about ensuring that my team consistently delivers high-value service and support to our customers while maintaining a healthy and positive work environment.

This is just an example of what worked for us. Whether it fits your team depends on a few variables:

Consider dedicated squads if:

Your team is too large (>10 engineers).
You have distinct areas of expertise requiring deep focus.
The on-call burden is overwhelming across unrelated systems.
You need to accelerate roadmap execution without interruptions.

Keep a unified team if:

The team is still small (<8 engineers) and can handle cross-functional responsibilities.
The operational workload is not overwhelming.
There is not enough ongoing work to justify permanent squads.


Written by Alex Ho
