This topic is about our Classic Satellites. If you installed Satellites after 4/06/2021, you are probably running Microsatellites.

About load balancing

For on-premises Satellites, span traffic is generally sent to a pool of Satellites behind a load balancer. Effective load balancing is important to ensure a consistent and adjustable recall window for the Satellite pool and to allow for efficient use of Satellite computing resources.

To achieve that goal, there are two important metrics to consider.

Satellite recall

Each Satellite holds 100% of unsampled recent spans in memory, discarding older spans as newer spans arrive. The length of time between the newest and the oldest span currently held in memory is the “recall” or “recall window”. At any given time, each Satellite has its own recall value and each Satellite pool has a distribution of recall values.

Recall implications:

  • For a Stream to reliably assemble representative traces, every Satellite (and every project it hosts) must maintain a recall of at least 5 minutes at all times.
  • For Explorer queries, Service Diagram, and Correlations to return comprehensive results from 100% of the span data over the past N minutes, the recall at each Satellite must be at least N minutes.

In general, a standard target for recall is 5 - 10 minutes, depending on how much history you want to see in the Explorer view. However, there is no technical limitation to maintaining a longer recall window if you need to query the full set of span data longer into the past.

You can’t configure the recall window directly; it’s determined by the volume of span traffic sent from all tracer clients and the memory available to the Satellite pool. To lengthen recall, either reduce the amount of span traffic or increase the pool’s available memory (by adding memory per instance, or by adding instances). Significant variance in recall across Satellites within a pool can be a symptom of load imbalance: it limits your ability to tune the recall window to the desired length, and it indicates suboptimal resource usage, because the useful capacity of the overall pool is bounded by its lower-recall members.
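As a rough illustration of that proportionality (this is a back-of-the-envelope sketch, not a Lightstep formula; the memory size, span rate, and average span size below are all hypothetical):

```python
def estimate_recall_seconds(pool_memory_bytes, spans_per_sec, avg_span_bytes):
    """Back-of-the-envelope recall estimate for a Satellite pool.

    Assumes spans are held in memory until the buffer is full, after which
    the oldest are discarded. All inputs are illustrative; real recall also
    depends on indexing overhead and how evenly the pool is balanced.
    """
    ingress_bytes_per_sec = spans_per_sec * avg_span_bytes
    return pool_memory_bytes / ingress_bytes_per_sec

# Example: 8 GiB of span buffer, 20,000 spans/sec at ~2 KiB per span
# gives roughly 3.5 minutes of recall. Doubling memory or halving
# traffic doubles the window.
print(round(estimate_recall_seconds(8 * 2**30, 20_000, 2048) / 60, 1))  # 3.5
```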

Dropped spans

Even when the recall is balanced, there are situations where spans can be dropped at either the tracer or the Satellite (that is, discarded before a Satellite can process them in memory). Dropped spans can lead to incomplete traces, Streams that undercount events, and missing data in Explorer.

To resolve dropped spans from the Satellite, you need to adjust your configuration to increase the number of Satellites, adjust the amount of memory allocated to that project in the Satellite pool, or tune auto-scaling settings.

To resolve dropped spans from the client, you may need to change the configuration of the tracers.

The remainder of this topic focuses on approaches for achieving balanced recall. To address dropped spans, you need to tune the amount of memory allocated to the Satellites.

Check out this topic for info on how to configure Datadog to monitor Satellite recall and dropped spans from both the client and Satellites.

Choosing a load balancer

There are many options for balancing load across a Lightstep Satellite pool. Broadly, these options can be categorized into either protocol-specific (often referred to as L7) or TCP (often referred to as L4) balancers.

L7 balancers can decode the specific protocol used and can provide more sophisticated features. Specifically for Satellite pools, there are two features that L7 balancers provide that can be helpful:

  • TLS termination: Allows encryption of the traffic between tracers and the balancer without having to configure the Satellite pool with TLS certificates.
  • Per-request balancing: Can lead to a more even distribution of traffic across the cluster. Some tracers will establish long-lived connections with the Satellites to save on the overhead of establishing a connection per request. L7 balancers enable requests sent over these long-lived connections to be balanced across the pool and decrease the likelihood of hot spots in the pool.

The disadvantage of L7 balancers is that they often require more careful selection and configuration: TLS certificates must be made available to the balancer, and compatibility issues are more common. Considerations are stricter still when using an L7 balancer with gRPC; see the gRPC section below.

Lightstep recommends using an L7 balancer for Satellite pools if such a configuration is feasible. It will lead to a healthier and more balanced pool.


HTTP

When using HTTP as a transport, there are few requirements for L7 balancers.

Many cloud providers offer L7 HTTP balancing solutions, and there are also many hardware and software options for HTTP(S) balancing.

The one consideration is whether the balancer supports HTTP/2. Many of the tracers use HTTP/2 to reuse persistent connections for subsequent requests, which saves resources by not establishing a new connection per request.


gRPC

When using gRPC as a transport, it’s important to ensure that the balancer chosen is compatible with gRPC. gRPC is a set of standards and open-source implementations for communication between server and client built on top of HTTP/2. The gRPC client and server create HTTP/2 connections and communicate over those connections.

The only Lightstep-recommended gRPC L7 load balancer is Envoy. Envoy is an open-source software load balancer that can be deployed either as a separate application or as a sidecar running alongside an application with a tracer, and it supports gRPC natively.

Because gRPC is HTTP/2 on the wire, it is theoretically possible to use an HTTP L7 balancer for gRPC connections. In practice, most HTTP L7 balancers don’t work with gRPC. Often this is because they only partially support HTTP/2 (usually accepting HTTP/2 connections from the client to the balancer but sending HTTP/1.1 traffic from the balancer to the server), or because they make other alterations to the requests (such as headers that appear differently). For this reason, Lightstep doesn’t recommend using an HTTP L7 balancer for gRPC traffic.


TCP (L4)

When using either HTTP or gRPC as a transport, there are not many requirements for an L4 TCP balancer.

Many cloud providers offer L4 TCP balancing solutions, and there are also many hardware and software options for TCP balancing.

Because TCP balancers have minimal interaction with the protocols used, there are few restrictions in the selection process. However, note two caveats:

  • Because L4 balancers don’t do per-request balancing, usage of an L4 balancer can result in uneven distribution of traffic across a Satellite pool. This is because persistent connections established by a tracer will be pinned to a single Satellite. This could lead to overwhelming that Satellite with too many spans.
  • L4 balancers can’t terminate TLS. To encrypt transport, provide a certificate to the Satellite itself.

Service mesh

A service mesh distributes the responsibility of routing and balancing traffic within a data center to software running on each application host. Many service meshes support HTTP routing and balancing, and some support gRPC routing and balancing.

For data centers already running a service mesh, using that mesh to route traffic between tracers and the Satellite Pool can be a simple configuration. Routing traffic using a service mesh has many of the considerations discussed above in the L7 routing sections.

For HTTP traffic, any service mesh that supports L7 HTTP routing should work well for routing Lightstep tracing traffic.

For gRPC traffic, we only recommend routing traffic with a service mesh based on Envoy, such as Istio, because Envoy has native gRPC support.

Balance and tune tracers

The tracer client libraries are engineered for minimal impact on the processes they are tracing while still collecting and reporting the tracing data intended for collection. This requires striking a balance between the use of various resources: memory, network, and CPU.

The use of the network is managed by buffering the data to be reported: spans and the associated attributes, events, and payloads. Buffering shifts some burden onto memory to hold this buffered data until the client flushes the content of the buffer and reports to the Satellite.

If the Reporting Status dashboard indicates that tracers are dropping spans, it may be necessary to modify parameters that can be configured to help control throughput.

OpenTelemetry tracers

The OpenTelemetry specification describes a built-in batch processor that batches finished spans and forwards them to the exporter. Three of its parameters can be tuned to control throughput.

From the spec:

  • scheduledDelayMillis: The delay interval in milliseconds between two consecutive exports. The default value is 5000.
  • exportTimeoutMillis: How long the export can run before it is cancelled. The default value is 30000.
  • maxExportBatchSize: The maximum batch size of every export. It must be less than or equal to maxQueueSize. The default value is 512.

You can use maxExportBatchSize and scheduledDelayMillis to estimate the approximate maximum throughput per tracer instance: Approximate Max Throughput (spans/second) = maxExportBatchSize / (scheduledDelayMillis / 1000)
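With the spec’s default values, that equation works out as follows:

```python
# Approximate max throughput per tracer instance, using the spec defaults.
max_export_batch_size = 512     # spans per export
scheduled_delay_millis = 5000   # delay between consecutive exports

# One batch of up to 512 spans every 5 seconds.
throughput = max_export_batch_size / (scheduled_delay_millis / 1000)
print(throughput)  # 102.4 spans/second
```

If your services produce spans faster than this per tracer instance, the queue fills and the tracer starts dropping spans.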

When to adjust configuration

During instrumentation:
You can always change these values, but it is wise to think about the throughput you expect and use the “back of the envelope” calculation above to determine good starting values. Note that this doesn’t translate directly into a fixed amount of memory: along with the spans themselves, the buffer holds the attributes, events, and payloads that instrumentation attaches to each span. The shape of the instrumentation load also matters. If the rate of span creation is relatively uniform, a smaller buffer will suffice; if the rate is bursty, the client may drop spans even when the buffer is sized well relative to the average rate.

When there are client dropped spans:
Client-dropped spans can mean a few things, but the most likely cause is that the data sent to the tracer exceeds its maximum throughput. By comparing the Spans Sent and Client Dropped values in the Reporting Status dashboard, you can estimate the required increase in buffer size. Most apps will be fine with a buffer sized at 1 to 2 times the expected rate of spans per second.
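That estimate can be sketched as follows (the rates are hypothetical values you would read from the Spans Sent and Client Dropped panels of the Reporting Status dashboard):

```python
def suggested_buffer_size(spans_sent_per_sec, client_dropped_per_sec, headroom=2.0):
    """Suggest a span buffer size from Reporting Status dashboard rates.

    Rule of thumb from above: size the buffer at 1-2x the total expected
    spans/second, where the total includes both delivered and dropped
    spans (dropped spans were still produced by the instrumentation).
    """
    total_rate = spans_sent_per_sec + client_dropped_per_sec
    return int(total_rate * headroom)

# e.g. 800 spans/s delivered plus 200 spans/s dropped, with 2x headroom:
print(suggested_buffer_size(800, 200))  # 2000
```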

The easiest way to resolve dropped client spans is to increase the maxExportBatchSize (MaxBufferedSpans) parameter. Start by changing this parameter rather than scheduledDelayMillis (ReportingPeriod): because sending each report has a certain amount of processing overhead, increasing the amount of data sent per report is generally more performant than decreasing the reporting period. That said, tuning scheduledDelayMillis (ReportingPeriod) based on observed performance is also an acceptable path.

Another configuration option to be aware of is exportTimeoutMillis (ReportTimeout; not all tracers support this). This is the duration the tracer waits for a response from the Satellite when sending a report. We recommend starting with the default; after your system has been running for a while, set it to the 99.99th percentile latency of those reports.

Balancing and tuning Satellites

To determine if the span traffic is well balanced between Satellites, evaluate these metrics:

  • Span Ingress Rate
  • Recall
  • Dropped Spans

Check out this topic for info on how to configure Datadog to monitor Satellite recall and dropped spans from both the client and Satellites.

Each Satellite can only process a maximum of X ingress spans/bytes in the steady state. That rate depends on the size of spans, wire format, and whether the traffic is encrypted. Each Satellite reserves a maximum of Y bytes of memory as an index to store that span data temporarily. The length of time represented by the spans stored in memory (recall) is proportional to the span ingress rate. Higher ingress rates mean lower recall, all else being equal. When the ingress rate exceeds the maximum that a Satellite can handle, it will start dropping some spans.

Lightstep requires a recall window of > 5 minutes to reliably assemble traces and provide a good UI experience. In order to use memory efficiently, without losing data, it’s important to keep span ingress balanced and avoid dropping spans while maintaining a 5+ minute recall window across the Satellite pool. See below for details about how to track these metrics and perform troubleshooting.

Span ingress rate

One factor for effective Satellite load balancing is the ingress rate of spans. You should roughly balance this rate across all Satellites in a pool. A good rule of thumb is (max - min) / min < 0.05 (that is, less than 5% load imbalance between Satellite instances).
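The rule of thumb can be computed directly from per-Satellite ingress rates (the rates below are illustrative):

```python
def imbalance_ratio(ingress_rates):
    """Load-imbalance ratio across a Satellite pool: (max - min) / min."""
    lo, hi = min(ingress_rates), max(ingress_rates)
    return (hi - lo) / lo

# Hypothetical spans/sec per Satellite in a four-node pool.
rates = [10_200, 10_050, 9_980, 10_110]
print(imbalance_ratio(rates) < 0.05)  # True: within the 5% threshold
```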

Option 1: These ratios can be observed on the Recommended Customer Satellite Dashboard, using Satellite StatsD metrics and Datadog.

Option 2: If you already have network stats for ingress traffic to your Satellites, you may be able to apply the same formula for ingress bytes.


Recall

Option 1: The more reliable way to track Satellite Recall (max, min, and imbalance ratio) is to enable Satellite StatsD metrics.

Option 2: The easier, faster way to track current Satellite recall ONLY (no history) is on the Satellite Pool report that provides metrics for all Satellite Pools that are currently receiving spans.

The Satellites page doesn’t provide historical information. To monitor Satellite metrics over time, we recommend enabling the option to publish Satellite metrics to StatsD.

This page shows recent status information for all your Satellites and projects. If traffic is well balanced between Satellites, the difference between Min Recall and Max Recall should be <5% of Min Recall.

If the difference between Min Recall and Max Recall is >5% of Min Recall for any project, the load balancing is uneven and needs adjusting. Click View Satellite to see the recall values from each individual Satellite, which will reveal whether the imbalance is due to a single instance or is widespread throughout the pool.

When there is an em-dash (“—”) in the Min Recall and Max Recall columns, this indicates that the Satellite buffer is not full yet and thus a recall period cannot be calculated. This does not mean that the Satellite is not receiving spans.

Dropped spans

Option 1: The most reliable way to track dropped spans is to enable StatsD Satellite metrics.

Option 2: Use the Reporting Status page. This page shows the number of recently dropped spans and is a great way to spot-check your Satellites’ performance.

In a steady state, both Client Dropped Spans and Satellite Dropped Spans should be zero. Dropped spans are not indexed or counted as part of the Satellite ingestion pipeline.

Check general Satellite health

You can use the Satellite Diagnostics page to review individual Satellite health, and load balancers can use the health check endpoint to determine whether a Satellite is currently healthy and available to handle incoming span traffic. This endpoint is available at: http(s)://{satellite host}:{admin port}/_ready.

A 200 (OK) response indicates that the Satellite is currently able to accept incoming span traffic. Any other response, including a timeout, indicates that either the Satellite isn’t currently running, or it has too many queued span reports and can’t handle any more at the moment. In this case, the load balancer should send the request to a different Satellite.
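A minimal readiness probe against that endpoint might look like this (the host and admin port are placeholders for your deployment; only the `/_ready` path comes from the docs above):

```python
import urllib.error
import urllib.request

def satellite_ready(host, admin_port, timeout=2.0):
    """Return True if the Satellite's /_ready endpoint answers 200 (OK).

    Any other response, a timeout, or a connection failure is treated as
    "not ready", matching how a load balancer should route around it.
    """
    url = f"http://{host}:{admin_port}/_ready"  # host/port are deployment-specific
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# A host with nothing listening is reported as not ready.
print(satellite_ready("127.0.0.1", 1))  # False
```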

Resolve Satellite imbalances

This flow chart provides resolutions to the above issues (click the image to enlarge it).

This flow chart provides resolutions for when your Satellites are running out of memory (click the image to enlarge it).