Implementing Thanos for Long-Term Storage in Large-Scale Kubernetes Clusters
As part of managing the central Kubernetes platform at VMware, I was responsible for handling the observability stack for large-scale clusters with high traffic. One significant challenge we faced was Prometheus outages caused by the sheer volume of metrics being processed by our clusters. The traditional approach of running two Prometheus replica nodes couldn’t scale effectively under the increasing load, resulting in instability despite horizontal and vertical scaling attempts.
To provide a more robust, scalable solution for metrics monitoring, we decided to roll out Thanos for long-term storage retention and implement Prometheus sharding to manage high-volume metrics. Our ultimate goal was an observability stack that could support SaaS and internal development teams while reliably powering API monitoring, health checks, and Service Level Objective (SLO) dashboards.
Initial Challenges with Prometheus in Large Clusters
In large clusters, Prometheus struggled to keep up with the load. We initially set up two replicas of Prometheus for High Availability (HA), where each instance independently scraped and stored metrics in its local Prometheus Time Series Database (TSDB). While this provided some redundancy, it didn’t fully solve the scaling problem, and during heavy loads, Grafana often encountered delays and inconsistencies when pulling data from the local Prometheus instances.
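For illustration, an HA pair of this kind is typically configured with identical scrape configuration on both replicas plus an external label that identifies each replica, so the duplicate series can be told apart later. The sketch below assumes plain Prometheus configuration with hypothetical cluster and replica values, not our exact setup:

```yaml
# prometheus.yml fragment for an HA pair: both replicas scrape the same
# targets independently; the external replica label distinguishes their
# otherwise identical series (names and values here are illustrative).
global:
  scrape_interval: 30s
  external_labels:
    cluster: prod-cluster-01   # hypothetical cluster name
    replica: prometheus-0      # set to prometheus-1 on the second replica

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
```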
Additionally, when one Prometheus instance failed (due to overload or crash), Grafana would rely on the remaining instance, causing potential data inconsistencies and delays. When the crashed instance recovered, there would be periods of missing data and crash loops. This was particularly problematic for larger workloads, where the sheer volume of metrics could overwhelm the system.
Thanos Architecture for Scalability and Long-Term Storage
To address these challenges, we introduced Thanos to handle long-term metric storage. Short-term metrics remained available from each Prometheus instance, while Thanos enabled us to centralize metric data in S3, providing robust long-term retention and better data consistency across clusters.
The Thanos architecture consists of several components that help aggregate, store, and query metrics in a highly efficient manner:
• Thanos Sidecar: Each Prometheus node runs a Thanos sidecar that uploads its TSDB blocks to S3 and exposes the local data for querying. It operates alongside the existing Prometheus servers without interfering with them (a configuration sketch for the main components follows this list).
• Thanos Query: This stateless application aggregates data from various Prometheus instances and Thanos stores, providing a deduplicated view of the data in Grafana. It queries metrics across distributed sources to present unified data to dashboards.
• Thanos Store: This store gateway exposes the historical metric data in S3 to the query layer. It uses little local disk space (mainly index and cache data) and stays in sync with the bucket, ensuring that long-term metric data is always available for querying.
• Thanos Compact: This component is responsible for compacting and downsampling the data stored in S3, applying retention policies to manage storage efficiently.
• Thanos Ruler: A critical part of the architecture, Thanos Ruler evaluates alerting and recording rules against the Thanos query layer, so the rules see data from all Prometheus instances. When Prometheus sharding is enabled, alerting and recording rules need to move to the Thanos Ruler, as the data is distributed across multiple shards and no single shard sees the full picture.
• Thanos Receiver: This component ingests data written by the Prometheus shards via remote write, builds TSDB blocks from it, and uploads them to S3. It is essential for managing the shards and ensuring that their data is reliably ingested into long-term storage.
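To make the wiring concrete, here is a minimal configuration sketch for the bucket configuration, the sidecar, and the querier. It assumes a Kubernetes deployment with placeholder bucket, path, and service names rather than our production values:

```yaml
# objstore.yml -- shared S3 bucket configuration, passed to each Thanos
# component via --objstore.config-file (bucket and endpoint are placeholders)
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
---
# Thanos Sidecar container args (runs in the same pod as Prometheus)
args:
  - sidecar
  - --tsdb.path=/prometheus
  - --prometheus.url=http://localhost:9090
  - --objstore.config-file=/etc/thanos/objstore.yml
---
# Thanos Query container args; --query.replica-label names the external
# label that distinguishes HA replicas so duplicate series are collapsed
args:
  - query
  - --query.replica-label=replica
  - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc   # sidecars
  - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc     # store gateway
```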
Implementing Prometheus Sharding for High-Volume Metrics
To better manage the load, we also introduced Prometheus sharding. In a sharded Prometheus setup, the scraping workload is distributed across multiple Prometheus instances. This approach prevents any single Prometheus instance from becoming a bottleneck when handling large-scale metrics.
Here’s how we implemented sharding and the benefits it brought to our observability stack:
• Distributed Scraping: By leveraging multiple Prometheus instances, we could distribute the scrape loads across different servers, ensuring no single instance would be overwhelmed. This improves overall system resilience and helps scale the observability stack without needing to heavily over-provision resources.
• High Availability: With multiple Prometheus replicas deployed across different regions or clusters, we ensured HA for scraping and monitoring even if one instance went down. This setup reduces downtime and enhances system availability during failures.
• Consistent Hashing: By using hashing techniques to assign scrape targets to shards, we could distribute targets deterministically across the Prometheus instances, keeping the metric load evenly spread and avoiding overload on any single instance (a minimal relabeling sketch follows this list).
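One common way to realize this target distribution is Prometheus's built-in hashmod relabel action (the Prometheus Operator's shards field automates the same idea). The fragment below is a sketch for one shard out of an assumed three; the job name and shard count are illustrative:

```yaml
# Scrape config fragment for shard 0 of 3. Each shard keeps only the
# targets whose address hashes to its own shard number; the other shards
# use regex "1" and "2" respectively.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash every target address into one of 3 buckets
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      # Keep only the bucket that belongs to this shard
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
```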
Benefits of the New Stack
With the implementation of Thanos and Prometheus sharding, we achieved the following benefits:
• Improved Scalability: The observability stack was now able to handle high volumes of metrics, even as traffic increased, by distributing the load across multiple Prometheus instances and centralizing long-term data storage in S3.
• Better Data Retention: By utilizing Thanos for long-term storage, we were able to retain data for extended periods without overloading local Prometheus instances, which helped reduce local storage costs and improved data consistency.
• Resilience and Fault Tolerance: If any Prometheus instance failed, the system could continue working without significant delays, as Thanos ensured data was always available via S3 and Thanos Query provided aggregated and deduplicated data for Grafana.
• Simplified Alerting and Rules Management: Moving alerting and recording rules to Thanos Ruler ensured that we could evaluate alerts and metrics without worrying about data fragmentation caused by sharded Prometheus instances.
• Downsampling: Downsampling reduces the resolution or volume of data by aggregating it over longer time intervals while preserving overall trends and patterns. It is commonly used in time-series databases and monitoring systems to cut storage costs and speed up queries over historical data. Prometheus does not support downsampling; it simply retains raw data for the configured retention period. Thanos, however, downsamples data in long-term storage and creates multiple resolutions, each with its own retention (configured on the compactor, as sketched after this list):
- Raw Data: Retained for recent periods (e.g., weeks).
- 5-Minute Resolution: For intermediate time ranges.
- 1-Hour Resolution: For long-term historical data.
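As an illustration, the per-resolution retention windows are set on Thanos Compact; the values below are examples, not our production settings:

```yaml
# Thanos Compact container args (retention windows are illustrative)
args:
  - compact
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d    # keep raw samples for 30 days
  - --retention.resolution-5m=90d     # keep 5-minute downsamples for 90 days
  - --retention.resolution-1h=365d    # keep 1-hour downsamples for a year
  - --wait                            # run continuously instead of one-shot
```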
Conclusion
Integrating Thanos for long-term storage and implementing Prometheus sharding has significantly improved our ability to handle high-volume metrics in large-scale Kubernetes clusters. This solution has provided the necessary scalability, resilience, and efficiency to meet the demands of our growing observability needs while reducing the risk of outages, inconsistencies, and data loss.
As we continued to scale and onboard more customers, these architectural improvements ensured that we could offer a robust, compliant, and highly available observability stack to both SaaS and internal development teams, empowering them with reliable metrics and performance data.