When running a single OpenTelemetry Collector instance to ingest Prometheus metrics, there are two main areas to tune: machine resources and Collector processor settings. Because the Collector is more sensitive to memory limits than CPU limits, this topic provides guidance on how to manage memory effectively. It also recommends how to best configure the Collector processor settings.
An example configuration, along with the load test setup and results, is also provided.
In general, the OpenTelemetry Collector is more sensitive to memory limits than CPU limits, so high-memory instances are ideal. We strongly recommend enabling the memory_limiter processor. When memory usage rises above the soft limit, which defaults to 80% of the limit_mib setting, the Collector starts dropping data and applying back pressure to the pipeline. When memory usage rises above the hard limit (100% of limit_mib), the Collector starts repeatedly forcing garbage collection. While this behavior can prevent out-of-memory situations, ongoing dropped data and frequent garbage collection are not ideal conditions. If you see dropped points or unusually frequent GCs in the dashboard, the Collector needs more memory and a higher limit_mib setting.
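As a rough sketch, a memory_limiter block for a high-memory instance might look like the following; the specific values are illustrative and should be sized to your instance, not treated as recommendations:

```yaml
processors:
  memory_limiter:
    # How often memory usage is checked against the limits.
    check_interval: 1s
    # Hard limit: above this, the Collector repeatedly forces garbage collection.
    limit_mib: 8000
    # Soft limit = limit_mib - spike_limit_mib; above it, data is dropped and
    # back pressure is applied. Defaults to 20% of limit_mib if unset.
    spike_limit_mib: 1600
```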
The batch processor is also highly recommended. It batches outgoing data for better compression and fewer network connections. The processor has three parameters: two determine when batches are sent, and the third caps how large a batch can be.

- send_batch_size (default 8192 items): A batch is sent once at least this many items (spans, metric points, or logs) are in the processor’s queue.
- timeout (default 200ms): A batch is sent at minimum this often if there are any items in the queue.
- send_batch_max_size (no default): If set, batches contain no more than this many items. By default, there is no maximum batch size.

In general, larger batches and longer timeouts lead to better compression (and therefore less network usage), but also require more memory. If the Collector is experiencing memory pressure, try lowering the batch size and/or timeout settings. If you need to decrease Collector traffic, try increasing the batch size. Finally, if the Collector logs show messages being rejected for being too large (for example, “grpc: received message larger than max”), try setting or decreasing the send_batch_max_size setting.
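For reference, a batch processor entry using the defaults plus an explicit maximum might look like this; the send_batch_max_size value is only an example:

```yaml
processors:
  batch:
    # Send a batch once this many items have accumulated.
    send_batch_size: 8192
    # Otherwise, send whatever has accumulated at least this often.
    timeout: 200ms
    # Optional cap; oversized batches are split before export.
    send_batch_max_size: 10000
```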
Load tests for the data below were performed in Google Kubernetes Engine (GKE) with a single OpenTelemetry Collector instance running the Prometheus receiver. We used Avalanche to generate metrics.
If you are attempting to replicate this load test in Cloud Observability, consider creating a separate Cloud Observability project for this purpose to isolate the auto-generated Avalanche metrics from other “real” metric data.
The Collector ran on an e2-standard-4 node.

See Tuning above for more information about the memory_limiter and batch processors, which are recommended for basic performance. We also ran the resourcedetection and resource processors to mimic real-life scenarios where label enrichment would likely be occurring on incoming metrics within a Kubernetes environment.
The test used the following Collector configuration:

- Receiver: prometheusreceiver, configured with scrape_targets copied from a running Prometheus server’s configuration.
- Processors:
  - memory_limiter with limit_mib: 8000
  - resourcedetection
  - resource
  - batch with send_batch_size: 1000, send_batch_max_size: 1500, and timeout: 1s
- Exporter: otlp
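Assembled into a full Collector configuration, the setup above would look roughly like the sketch below. The scrape job, detector list, resource attribute, and OTLP endpoint and header are placeholders standing in for the values used in the test, not the exact ones.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape jobs copied from the running Prometheus server's configuration.
        - job_name: avalanche              # placeholder job
          scrape_interval: 15s
          static_configs:
            - targets: ["avalanche:9001"]  # placeholder target

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 8000
  resourcedetection:
    detectors: [env, gcp]                  # assumed detectors for a GKE environment
  resource:
    attributes:
      - key: k8s.cluster.name              # example label enrichment
        value: load-test-cluster
        action: insert
  batch:
    send_batch_size: 1000
    send_batch_max_size: 1500
    timeout: 1s

exporters:
  otlp:
    endpoint: ingest.example.com:443       # placeholder backend endpoint
    headers:
      "lightstep-access-token": "${LS_ACCESS_TOKEN}"   # assumed auth header

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, resourcedetection, resource, batch]
      exporters: [otlp]
```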
When testing the OpenTelemetry Collector running with the Prometheus receiver, we observed the following performance:
| Active time series (ATS) per scrape target | # of scrape targets | CPU (cores) | Memory usage |
|---|---|---|---|
| 100,000 | 4 | 1 | 3.5 GB |
| 100,000 | 7 | 1.7 | 5 GB |
| 100,000 | 10 | 3.2 | 7 GB |
| 20,000 | 50 | 1.3 | 2.5 GB |