Comparing Thanos to VictoriaMetrics cluster

--

Thanos is well known as a long-term storage solution for Prometheus, while the cluster version of VictoriaMetrics was open sourced only recently. Both solutions provide the following features:

  • Long-term storage with arbitrary retention.
  • Global query view over data collected from multiple Prometheus instances.
  • Horizontal scalability.

Let’s compare different aspects of Thanos and VictoriaMetrics, starting with their architecture and then comparing the insert and select paths by the following properties:

  • Setup and operational complexity
  • Reliability and availability
  • Consistency
  • Performance
  • Scalability

High availability setups and hosting costs are covered at the end of the article.

The architecture

Thanos consists of the following components:

  • Sidecar — deployed alongside each Prometheus instance. It performs the following tasks: 1) uploads Prometheus data older than 2 hours to object storage such as Amazon S3 or Google Cloud Storage; 2) serves queries over recently added data (younger than 2 hours) from the local Prometheus.
  • Store gateway — processes queries over data stored in object storage such as S3 or GCS.
  • Query — implements the Prometheus query API and provides a global query view across data obtained from Sidecars and Store gateways.
  • Compactor — by default Sidecars upload data to object storage in 2-hour blocks. The Compactor gradually merges these blocks into bigger ones in order to improve query efficiency and reduce the required storage size.
  • Rule — evaluates Prometheus recording rules and alerting rules over data obtained from Query (aka the global query view). The Rule component usually has an elevated failure rate due to the lower reliability of Query and the underlying components:
Ruler has conceptual tradeoffs that might not be favorable for most use cases. The main tradeoff is its dependence on query reliability. For Prometheus it is unlikely to have alert/recording rule evaluation failure as evaluation is local. For Ruler the read path is distributed, since most likely Ruler is querying Thanos Querier which gets data from remote Store APIs. This means that query failures are more likely to happen, that’s why a clear strategy on what will happen to alerts during query unavailability is the key.
  • Receiver — an experimental component that accepts data from Prometheus via the remote_write API. It isn’t production-ready yet as of Thanos v0.5.0.

Thanos architecture at a glance:

Thanos architecture

Now let’s look at VictoriaMetrics cluster architecture. It contains the following components:

  • vmstorage — stores data
  • vminsert — accepts data from Prometheus via remote_write API and spreads it across available vmstorage nodes
  • vmselect — executes incoming queries over the Prometheus query API by fetching and merging the required data from vmstorage nodes

Each component may independently scale to multiple nodes with the most appropriate hardware configuration.

VictoriaMetrics cluster architecture:

VictoriaMetrics cluster architecture
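For a rough idea of how these pieces are wired together, a minimal single-host sketch might look like this (flag names follow the VictoriaMetrics cluster docs; the ports are default values and the paths are placeholders):

    # storage node keeps the data on local disk
    ./vmstorage -storageDataPath=/var/lib/vmstorage -retentionPeriod=12
    # vminsert accepts remote_write data and routes it to vmstorage nodes
    ./vminsert -storageNode=127.0.0.1:8400
    # vmselect serves the Prometheus query API by merging data from vmstorage nodes
    ./vmselect -storageNode=127.0.0.1:8401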

The VictoriaMetrics cluster box alongside the Load balancer boxes in the image above may run in Kubernetes and may be managed by a Helm chart. Additionally, it may be substituted with single-node VictoriaMetrics for those who don’t need horizontal scalability, since the single-node version fits the majority of users with small-to-medium Prometheus setups. See vertical scalability benchmarks for more info.
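For example, an illustrative Helm 3 install could look like the following (the chart repo URL and chart name are taken from the VictoriaMetrics Helm charts project and may change; tune the values for production use):

    helm repo add vm https://victoriametrics.github.io/helm-charts/
    helm repo update
    helm install my-vm vm/victoria-metrics-cluster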

Insert path: setup and operational complexity

Thanos requires the following steps for setting up the insert path:

  • to disable local data compaction for each Prometheus instance by setting the --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration command-line flags to equal values (a sketch follows this list). Thanos may fail to upload data blocks to object storage when local data compaction is enabled in Prometheus. See this issue for more details. Note that disabled data compaction may hurt Prometheus query performance if --storage.tsdb.retention.time is much higher than 2 hours.
  • to install Sidecars alongside each Prometheus instance, so they upload Prometheus data to object storage.
  • to set up Sidecars monitoring.
  • to install Compactors for each object storage bucket.
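For illustration, the corresponding Prometheus flags and Sidecar invocation might look roughly like this (paths, URLs and the object storage config file are placeholders; see the Thanos docs for the exact options):

    # Prometheus: pin block duration to 2h to disable local compaction
    ./prometheus --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h
    # Thanos Sidecar: uploads completed blocks and serves recent data via StoreAPI
    ./thanos sidecar --tsdb.path=/prometheus/data --prometheus.url=http://localhost:9090 --objstore.config-file=bucket.yml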

VictoriaMetrics requires setting up a remote_write section in the Prometheus config, so Prometheus replicates all the scraped data to the VictoriaMetrics remote storage. See these instructions for details. There is no need to run any sidecars or to disable local data compaction in Prometheus.
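A minimal remote_write section in prometheus.yml might look like this (the URL assumes the cluster version behind vminsert with tenant ID 0; the single-node version uses a different endpoint):

    remote_write:
      - url: http://<vminsert-host>:8480/insert/0/prometheus/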

Insert path: reliability and availability

Thanos Sidecar uploads local Prometheus data in 2-hour blocks. This means it may lose up to 2 hours of recently added data on each Prometheus instance in the event of local disk corruption or accidental data removal.

Incoming queries from the Query component to a Sidecar may negatively interfere with the data upload process, since both tasks are performed by a single Sidecar process. In theory it is possible to run separate Sidecars for data upload to object storage and for queries.

VictoriaMetrics. Each Prometheus instance immediately replicates all the scraped data to remote storage such as VictoriaMetrics via the remote_write API. There may be a lag of a few seconds between scraping data and writing it to remote storage. This means that Prometheus may lose only a few seconds of data on local disk corruption or accidental data removal, since the rest of the data is already replicated to remote storage.

Prometheus v2.8.0+ replicates scraped data to remote storage from the write-ahead log (WAL). This means that it doesn’t lose data on temporary connection errors to remote storage or on temporary unavailability of the remote storage. A quote from this article:

So if the endpoint is having an issue, we simply stop where we are in the write-ahead log and attempt to resend the failed batch of samples. It won’t drop data or cause memory issues because it won’t continue reading the write-ahead log until it successfully sends the data. The 2.8 update effectively uses a constant amount of memory, and the buffer is virtually indefinite, depending only on the size of your disk.

Insert path: consistency

Thanos. There are races between Compactor and Store gateway, which may result in inconsistent data or query failures. A few examples from this proposal:

* Thanos sidecar or compactor crashes during the process of uploading the block. It uploaded index, 2 chunk files and crashed. How to ensure readers (e.g compactor, store gateway) will handle this gracefully?
* Thanos compactor uploads compacted block and deletes source blocks. After next sync iteration it does not see a new block (read after write eventual consistency). It sees gap, wrongly plans next compaction and causes non-resolvable overlap.
* Thanos compactor uploads compacted block and deletes source blocks. Thanos Store Gateway syncs every 3m so it missed that fact. Next query that hits store gateway tries to fetch deleted source blocks and fails.

VictoriaMetrics. The data is always consistent thanks to the storage architecture.

Insert path: performance

Thanos. In general insert performance is good, since the Sidecar just uploads local data blocks created by Prometheus to object storage. Heavy queries from the Query component may slow down the data upload process a bit. The Compactor may become a limiting factor for ingestion speed per object storage bucket if newly uploaded blocks outpace its performance.

VictoriaMetrics. Prometheus uses additional CPU time for replicating local data to remote storage. This CPU time is quite small compared to the CPU time spent on other tasks performed by Prometheus such as data scraping, local storage housekeeping and rules evaluation. On the receiving side, VictoriaMetrics has good ingestion performance per CPU core.

Insert path: scalability

Thanos Sidecars rely on object storage scalability when uploading data blocks. S3 and GCS have excellent scalability.

VictoriaMetrics. Just increase capacity for vminsert and vmstorage. The capacity may be increased either by adding new nodes or by switching to beefier hardware.

Select path: setup and operational complexity

Thanos requires setting up and monitoring the following components for select path:

  • A Sidecar for each Prometheus instance with the Store API enabled for the Query component (see below).
  • A Store gateway for each object storage bucket containing data uploaded by Sidecars.
  • A Query component connected to all the Sidecars and all the Store gateways in order to provide a global query view via the Prometheus query API. It may be quite difficult to set up secure and reliable connections between the Query component and Sidecars located in different datacenters. A minimal Query invocation is sketched after this list.
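A sketch of such an invocation (endpoints are placeholders; 10901 is assumed as the default StoreAPI gRPC port, and TLS/service discovery options are omitted):

    ./thanos query \
      --http-address=0.0.0.0:19192 \
      --store=prometheus-0-sidecar:10901 \
      --store=prometheus-1-sidecar:10901 \
      --store=store-gateway:10901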

VictoriaMetrics provides the Prometheus query API out of the box, so there is no need to set up any components outside the VictoriaMetrics cluster. Just point Grafana and other Prometheus query API clients at VictoriaMetrics.

Select path: reliability and availability

Thanos. The Query component needs working connections to all the Sidecars and all the Store gateways in order to compute full valid responses for incoming queries. This may be quite problematic for large-scale setups with many Prometheus instances in different datacenters or availability zones.

A Store gateway may start slowly when working with a big object storage bucket, since it needs to load all the metadata from the bucket before starting. See this issue for details. This may become a pain point during upgrade procedures.

The VictoriaMetrics select path touches only local connections between vmselect and vmstorage nodes inside the cluster. Such local connections have much higher reliability and availability compared to the inter-datacenter connections in a Thanos setup.

All the VictoriaMetrics components are optimized for short startup times, so they may be upgraded quickly.

Select path: consistency

Thanos by default allows partial responses when certain Sidecars or Store gateways are unavailable:

Partial response is a potentially missed result within query against QueryAPI or StoreAPI. This can happen if one of StoreAPIs is returning error or timeout whereas couple of others returns success.

VictoriaMetrics also prefers availability over consistency by returning partial responses when certain vmstorage nodes are unavailable. This behavior may be disabled by setting the -search.denyPartialResponse option.
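For example, an illustrative vmselect invocation with partial responses disabled (flag syntax as in the cluster docs; the vmstorage address is a placeholder):

    ./vmselect -storageNode=vmstorage-host:8401 -search.denyPartialResponse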

Overall, VictoriaMetrics should return a much lower number of partial responses than Thanos because of the higher availability described in the previous section.

Select path: performance

Thanos. Select performance for the Query component is limited by the slowest Sidecar or Store gateway, since the Query component waits for responses from all Sidecars and Store gateways before returning the result.

Usually select performance is uneven across Sidecars and Store gateways because it depends on many factors:

  • the amount of data each Prometheus instance collects
  • the amount of data in each object storage bucket behind Store gateway
  • hardware configuration for each Prometheus+Sidecar and Store gateway
  • network latency between the Query component and each Sidecar or Store gateway. The latency may be quite high if Query and Sidecars are located in different datacenters (availability zones).
  • latency for operations on object storage. Usually object storage latency (S3, GCS) is much higher compared to block storage latency (GCE disks, EBS).

VictoriaMetrics select performance is limited by the number of vmselect and vmstorage nodes and their hardware configuration. Just increase the number of these nodes or switch to beefier hardware in order to scale select performance. vminsert evenly spreads incoming data among the available vmstorage nodes, which provides even performance among vmstorage nodes. VictoriaMetrics is optimized for speed, so it should give much better performance numbers compared to Thanos.

Select path: scalability

Thanos. It is possible to spread load among multiple Query components with identical configs in order to scale select performance, and to spread load among multiple Store gateways for the same object storage bucket in order to improve its performance. But it is quite hard to scale the performance of a single Prometheus instance behind a Sidecar. So select path scalability for Thanos is limited by the slowest Sidecar+Prometheus pair.

VictoriaMetrics cluster provides much better select path scalability than Thanos, since vmselect and vmstorage components may scale independently to any number of nodes. Network bandwidth inside the cluster may become the limiting factor for scalability. VictoriaMetrics is optimized for low network bandwidth usage in order to reduce this limiting factor.

High availability setup

Thanos. Run multiple Query components in distinct availability zones.

If a single zone becomes unavailable, then the Query component from another zone will continue serving queries. It is likely to return partial results, since certain Sidecars or Store gateways may be located in the unavailable zone.

VictoriaMetrics. Run multiple clusters in distinct availability zones. Configure each Prometheus to replicate data to all the clusters simultaneously. See this example.
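For instance, an illustrative prometheus.yml fragment replicating data to clusters in two availability zones (hostnames and tenant ID are placeholders):

    remote_write:
      - url: http://vminsert-az1:8480/insert/0/prometheus/
      - url: http://vminsert-az2:8480/insert/0/prometheus/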

If you have Prometheus HA pairs with replica1 and replica2 in each pair, then configure replica1 to replicate data to the first cluster, while replica2 should replicate data to the second cluster.

If a single availability zone becomes unavailable, then the VictoriaMetrics cluster from another zone will continue receiving new data and serving queries with full responses.

Hosting costs

Thanos stores data in object storage buckets. The most popular options — GCS and S3 — have the following monthly costs:

  • GCS — from $4/TB for Coldline storage to $36/TB for standard storage. Additionally the following resources are billed: egress network at $10/TB for internal traffic and $80-$230/TB for external traffic; storage API calls (read, write) at $0.4-$10 per million calls. See the pricing for details.
  • S3 — from $4/TB for Glacier storage to $23/TB for standard storage. Additionally the following resources are billed: egress network at $2-$10/TB for internal traffic and $50-$90/TB for external traffic; storage API calls at $0.4-$100 per million calls. See the pricing for details.

The total storage cost for Thanos depends not only on the data size, but also on the amount of egress traffic and the number of API calls.

VictoriaMetrics stores data on block storage. The most popular cloud options — durable replicated GCE disks and EBS — have the following monthly costs:

  • GCE disks — from $40/TB for regional HDD to $240/TB for multi-regional SSD. See the pricing for details.
  • EBS — from $45/TB for HDD to $125/TB for SSD. See the pricing for details.

VictoriaMetrics is optimized for HDD, so there is no need to overpay for SSD. VictoriaMetrics provides up to 10x better on-disk compression for real-world data compared to Thanos, which is based on Prometheus tsdb — see this article for details. This means it requires less disk space than Thanos, which usually results in lower costs for storing the same amount of data.
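As a rough illustration based on the list prices above and an assumed 5x reduction in stored data size (the actual ratio depends on the data):

    10 TB of Thanos data in S3 standard storage: 10 TB × $23/TB ≈ $230/month, plus egress traffic and API calls
    the same data compressed to ~2 TB on EBS HDD: 2 TB × $45/TB ≈ $90/month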

Conclusions

Thanos and VictoriaMetrics use different approaches for providing long-term storage, global query view and horizontal scalability:

  • VictoriaMetrics accepts data from Prometheus instances via the standard remote_write API and writes it to durable replicated persistent volumes such as Google Compute Engine HDD disks, Amazon EBS or any other persistent disks, while Thanos disables compaction for local Prometheus data and uses non-standard Sidecars for uploading the data to S3 or GCS. It also requires setting up Compactors for merging small data blocks in object storage buckets into bigger ones.
  • VictoriaMetrics implements the Prometheus query API for global query view out of the box. It doesn’t need any external connections outside the cluster for the select path, since Prometheus replicates the scraped data to remote storage in real time. Thanos requires setting up Store gateways, Sidecars and a Query component for the select path. The Query component must be connected to all the Sidecars and Store gateways in order to get a global query view. It is quite hard to provide reliable and secure connections between the Query component and Sidecars located in different datacenters (availability zones) for large-scale Prometheus setups. Select performance for the Query component is limited by the slowest Sidecar or Store gateway.
  • VictoriaMetrics cluster is easy to run in Kubernetes thanks to its simple architecture and Helm chart, while Thanos is harder to configure and operate in K8S because of the higher number of moving parts and possible configurations.

Full disclosure: I’m the core developer of VictoriaMetrics, so the article may be biased. Though I tried hard to be fair :) Feel free to comment on the article if you think it should be extended with additional information.

The article didn’t cover unique features of Thanos and VictoriaMetrics. For instance, Thanos supports deduplication for Prometheus HA pairs and two-level downsampling — to 5-minute and 1-hour intervals — while VictoriaMetrics provides PromQL extensions with common table expressions and accepts data ingestion via Graphite, OpenTSDB and Influx protocols. It is quite easy to set up deduplication for Prometheus HA pairs in VictoriaMetrics with Promxy assistance — see these docs.
