Plan an OpenTelemetry Collector Deployment in Kubernetes


Cloud Observability recommends running the OpenTelemetry Collector with the Prometheus receiver to ingest infrastructure metrics. This approach provides the highest data quality and completeness, and it lets the Collector scrape any target exposed through the Prometheus ecosystem of exporters.

Choosing a deployment type

There are three approaches to deploying the Collector in Kubernetes: a single-replica Deployment, a DaemonSet, and a StatefulSet.

When first deploying the OpenTelemetry Collector, you can start with a single-replica Deployment within a Kubernetes cluster or, for additional scalability, deploy Collectors as a DaemonSet to scrape application metrics. The two modes can be combined. For example, a DaemonSet can scrape application metrics while a single-replica Deployment scrapes infrastructure metrics and static targets.

For clusters that need more performance than adding resources to a single Collector can provide, a StatefulSet of Collectors can shard a Prometheus scrape configuration across its replicas. The StatefulSet can be scaled both horizontally (more replicas) and vertically (more resources per replica).

The StatefulSet Collector is under active development; some features may not yet be available.
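To make sharding concrete, here is a minimal Python sketch. It assumes each target is identified by its address and assigned to a replica with a hash-modulo scheme modeled loosely on Prometheus's `hashmod` relabel action. The addresses, replica count, and function names are hypothetical, and the assignment strategy a StatefulSet deployment actually uses may differ. The property that matters is that every target is owned by exactly one replica, so no target is scraped twice.

```python
import hashlib


def shard_for(target: str, replicas: int) -> int:
    """Return the replica index (0..replicas-1) that should scrape `target`.

    Loosely modeled on Prometheus's hashmod relabel action: hash the
    identifying label value (here, just the address) with MD5 and take
    the result modulo the number of shards.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as a big-endian unsigned integer.
    value = int.from_bytes(digest[8:], "big")
    return value % replicas


def assign(targets: list[str], replicas: int) -> dict[int, list[str]]:
    """Group targets by owning replica; each target lands in exactly one shard."""
    shards: dict[int, list[str]] = {i: [] for i in range(replicas)}
    for target in targets:
        shards[shard_for(target, replicas)].append(target)
    return shards


if __name__ == "__main__":
    # Hypothetical scrape targets (pod IP:port pairs).
    targets = [f"10.0.0.{i}:9100" for i in range(1, 11)]
    for replica, owned in assign(targets, replicas=3).items():
        print(f"replica {replica}: {owned}")
```

Because the hash depends only on the target and the replica count, each replica can compute the same assignment independently. The trade-off of plain modulo hashing is that changing the replica count reassigns most targets.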

Once the Collector has been deployed for a subset of services, it’s much easier to estimate metrics traffic for the remaining services, and then plan and scale the other deployments appropriately. In any deployment mode, the Collector is configured to scrape Prometheus metrics from scrape targets. Compare and contrast the deployment types below:

Single Replica Deployment

Use case: Trying out the OpenTelemetry Collector for the first time for a subset of services. Enables scraping application metrics, infrastructure metrics, and static targets in a Kubernetes cluster.

Benefits:

  • Simplest to deploy.
  • Straightforward to scale vertically (increasing resources for the single Collector pod).
  • Can be more resource-efficient than a DaemonSet deployment: nodes can have very different workloads, and because there is only one Collector pod, you can provision resources for the total workload rather than for the busiest node.

Drawbacks:

  • Single point of failure: if the Collector fails, metrics from all nodes are lost.
  • Harder to scale horizontally: with multiple replicas, every Collector scrapes the same targets and produces duplicate data, unless the Collector deployment is sharded (see StatefulSet Deployment).
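When a single replica is first run against a subset of services, the traffic estimate for the remaining services can start as a simple extrapolation. This Python sketch assumes services emit roughly similar sample rates; the function name, the input numbers, and the 1.5x headroom factor are all hypothetical, not recommendations.

```python
def estimate_total_rate(observed_rate: float, observed_services: int,
                        total_services: int, headroom: float = 1.5) -> float:
    """Extrapolate an observed ingest rate (samples/second) to all services.

    Assumes every service emits roughly the same number of samples, so the
    per-service average from the observed subset applies to the rest.
    `headroom` pads the estimate for growth and traffic spikes.
    """
    if observed_services <= 0:
        raise ValueError("observed_services must be positive")
    per_service = observed_rate / observed_services
    return per_service * total_services * headroom


if __name__ == "__main__":
    # Hypothetical: the 10 services scraped so far produce 20,000 samples/s.
    estimate = estimate_total_rate(20_000, observed_services=10, total_services=50)
    print(f"plan for about {estimate:,.0f} samples/s")  # prints "plan for about 150,000 samples/s"
```

If services vary widely in size, extrapolating per service underestimates or overestimates badly; in that case, measure a more representative subset before sizing.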

DaemonSet Deployment

Use case: Scraping application metrics on each node. Best when the workload on each node is constant or has a fixed limit, or when each node needs similar resources. Pods in a DaemonSet share the same configuration, so if all nodes have similar resource needs, resources are used more efficiently.

Benefits:

  • No single point of failure for the Collector: if one of the pods in the DaemonSet fails, only a subset of metric data is lost.
  • Reduced node-to-node network traffic as the Collector is scraping metrics locally.
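Scraping locally means each DaemonSet pod keeps only the targets running on its own node. A common pattern is to inject the node name with the Kubernetes Downward API (`spec.nodeName`) and filter on the `__meta_kubernetes_pod_node_name` label that Prometheus Kubernetes service discovery attaches to pod targets. The Python sketch below models only that filtering step; the `NODE_NAME` environment variable, the `Target` type, and the sample targets are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    address: str    # host:port to scrape
    node_name: str  # node the pod runs on (cf. __meta_kubernetes_pod_node_name)


def local_targets(targets: list[Target], node_name: str) -> list[Target]:
    """Keep only the targets scheduled on this Collector's node."""
    return [t for t in targets if t.node_name == node_name]


if __name__ == "__main__":
    # Hypothetical: NODE_NAME injected via the Downward API (fieldRef: spec.nodeName).
    node = os.environ.get("NODE_NAME", "node-a")
    discovered = [
        Target("10.0.1.5:8080", "node-a"),
        Target("10.0.2.7:8080", "node-b"),
        Target("10.0.1.9:9100", "node-a"),
    ]
    for t in local_targets(discovered, node):
        print(t.address)
```

Because every pod applies the same filter with a different node name, the DaemonSet as a whole covers each node-local target once without any shared coordination.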

Drawbacks:

  • DaemonSet pods can’t be scaled individually.
  • Can be resource-inefficient: if nodes have different workloads, every pod in the DaemonSet must be sized for the most demanding node in the cluster.
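The efficiency trade-off between a single replica and a DaemonSet comes down to simple arithmetic. This Python sketch uses hypothetical per-node loads and ignores fixed per-pod overhead: a single replica is sized for the sum of all node loads, while a DaemonSet, whose pods share one spec, reserves the busiest node's load on every node.

```python
def single_replica_capacity(node_loads: list[float]) -> float:
    """One Collector pod sized for the combined load of every node."""
    return sum(node_loads)


def daemonset_capacity(node_loads: list[float]) -> float:
    """Every DaemonSet pod shares one spec, so each is sized for the busiest node."""
    return max(node_loads) * len(node_loads)


if __name__ == "__main__":
    # Hypothetical per-node scrape load, in samples/second.
    uniform = [1000.0, 1000.0, 1000.0, 1000.0]
    skewed = [4000.0, 500.0, 500.0, 500.0]
    for name, loads in (("uniform", uniform), ("skewed", skewed)):
        print(name, single_replica_capacity(loads), daemonset_capacity(loads))
```

With uniform load, both come to 4000 samples/s. With one busy node, the single replica needs capacity for 5500 samples/s, while the DaemonSet reserves 16000.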

StatefulSet Deployment

Use case: Horizontally scalable, sharded Prometheus scraping. The provided Prometheus scrape configuration is sharded across a set of OpenTelemetry Collectors. Best for most topologies because it provides more fault tolerance.

Benefits:

  • No single point of failure for the Collector: if one of the pods in the StatefulSet fails, only a subset of metric data is lost.
  • Resource usage can be tuned to match your load.
  • Can scale horizontally and vertically.

Drawbacks:

  • Autoscaling is still in beta.
  • High availability is not yet supported.
  • More architectural complexity.
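Scaling a StatefulSet horizontally raises the question of how many replicas to run. A rough starting point is a ceiling division, assuming sharding spreads series evenly and that you have measured how many active series one replica can handle (for example, through performance testing). The per-replica budget and series count below are hypothetical.

```python
def replicas_needed(total_series: int, series_per_replica: int) -> int:
    """Smallest replica count that keeps each shard within its series budget.

    Assumes sharding spreads series roughly evenly across replicas.
    """
    if series_per_replica <= 0:
        raise ValueError("series_per_replica must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    needed = (total_series + series_per_replica - 1) // series_per_replica
    return max(1, needed)


if __name__ == "__main__":
    # Hypothetical: 1.2M active series, each replica tuned to handle ~500k.
    print(replicas_needed(1_200_000, 500_000))  # prints 3
```

Scaling vertically corresponds to raising `series_per_replica` by giving each replica more resources, which lowers the replica count for the same load.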

See also

Ingest Prometheus metrics with an OpenTelemetry Collector on Kubernetes

Performance test and tune the Collector

Updated Aug 30, 2022