Load balance Cloud Observability

About load balancing

For on-premises Microsatellites, span traffic is generally sent to a pool of Microsatellites behind a load balancer. Effective load balancing is important for efficient use of Microsatellite computing resources.

To achieve that goal, there are important metrics to consider.

Dropped spans

Spans can be dropped either at the tracer or at the Microsatellite (that is, discarded before a Microsatellite can process them in memory). Dropped spans can lead to incomplete traces, Streams that undercount events, and missing data in Explorer.

To resolve spans dropped at the Microsatellite, you may need to increase the number of Microsatellites, adjust the amount of memory allocated to that project in the Microsatellite pool, or tune auto-scaling settings.

To resolve spans dropped at the client, you may need to change the tracers' configuration (see Balance and tune tracers below).

Choosing a load balancer

There are many options for balancing load across a Cloud Observability Microsatellite pool. Broadly, these options can be categorized into either protocol-specific (often referred to as L7) or TCP (often referred to as L4) balancers.

L7 balancers can decode the specific protocol used and can provide more sophisticated features. Specifically for Microsatellite pools, there are two features that L7 balancers provide that can be helpful:

  • TLS termination: Allows encryption of the traffic between tracers and the balancer without having to configure the Microsatellite pool with TLS certificates.
  • Per-request balancing: Can lead to a more even distribution of traffic across the cluster. Some tracers will establish long-lived connections with the Microsatellites to save on the overhead of establishing a connection per request. L7 balancers enable requests sent over these long-lived connections to be balanced across the pool and decrease the likelihood of hot spots in the pool.

The disadvantage of L7 balancers is that they often require more careful selection and configuration. For example, TLS certificates must be made available to the balancer, and compatibility issues are more common. There are stricter considerations when using an L7 balancer with gRPC; see the section below.

Cloud Observability recommends using an L7 balancer for Microsatellite pools if such a configuration is feasible. It will lead to a healthier and more balanced pool.

L7 HTTP

When using HTTP as a transport, there are few requirements for L7 balancers.

Many cloud providers offer L7 HTTP balancing solutions, and there are also many hardware and software HTTP(S) balancers that you can use.

The main consideration is whether the balancer supports HTTP/2. Many tracers use HTTP/2 to reuse persistent connections for subsequent requests, which saves resources by not establishing a new connection per request.

L7 gRPC

When using gRPC as a transport, it’s important to ensure that the balancer chosen is compatible with gRPC. gRPC is a set of standards and open-source implementations for communication between server and client built on top of HTTP/2. The gRPC client and server create HTTP/2 connections and communicate over those connections.

The only Cloud Observability-recommended gRPC L7 load balancer is Envoy (https://www.envoyproxy.io/). Envoy is an open-source software load balancer that can be deployed either as a separate application or as a sidecar running alongside an application with a tracer. Envoy supports gRPC natively.

Because gRPC is HTTP/2 on the wire, it is theoretically possible to use an HTTP L7 balancer for gRPC connections. In practice, most HTTP L7 balancers don't work with gRPC. Often this is because they only partially support HTTP/2 (for example, accepting HTTP/2 connections from the client to the balancer but sending HTTP/1.1 traffic from the balancer to the server), or because they alter requests in other ways (such as modifying headers or not forwarding the HTTP/2 trailers that gRPC relies on). For this reason, Cloud Observability doesn't recommend trying to use an HTTP L7 balancer for gRPC traffic.

L4 HTTP or gRPC

When using either HTTP or gRPC as a transport, there are not many requirements for an L4 TCP balancer.

Many cloud providers offer L4 TCP balancing solutions, and there are also many hardware and software TCP balancers that you can use.

Because TCP balancers have minimal interaction with the protocols used, there are few restrictions in the selection process. However, keep two tradeoffs in mind:

  • Because L4 balancers don’t do per-request balancing, using one can result in uneven distribution of traffic across a Microsatellite pool: persistent connections established by a tracer are pinned to a single Microsatellite, which can overwhelm that Microsatellite with too many spans (see the sketch after this list).
  • L4 balancers cannot terminate TLS. To encrypt the transport, provide a TLS certificate to the Microsatellite itself.
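
To make the first tradeoff concrete, the short sketch below simulates per-connection (L4) versus per-request (L7) balancing across a hypothetical four-Microsatellite pool. The backend names, tracer count, and request counts are made up for illustration only.

    import collections
    import random

    backends = ["sat-1", "sat-2", "sat-3", "sat-4"]  # hypothetical Microsatellite pool
    tracers = 8                                      # tracer processes, each with one persistent connection
    requests_per_tracer = 1000

    # L4 (per-connection): each tracer's long-lived connection is pinned to a single
    # backend, so all of that tracer's reports land on the same Microsatellite.
    l4 = collections.Counter()
    for _ in range(tracers):
        pinned = random.choice(backends)
        l4[pinned] += requests_per_tracer

    # L7 (per-request): each report can be routed independently, spreading load evenly.
    l7 = collections.Counter()
    for _ in range(tracers * requests_per_tracer):
        l7[random.choice(backends)] += 1

    print("L4 per-connection distribution:", dict(l4))
    print("L7 per-request distribution:", dict(l7))

With only a handful of long-lived connections, the L4 distribution is often noticeably lopsided, while the per-request distribution stays close to even.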

Service mesh

A service mesh distributes the responsibility of routing and balancing traffic within a data center to software running on each application host (https://www.nginx.com/blog/what-is-a-service-mesh/). Many service meshes support HTTP routing and balancing and some support gRPC routing and balancing.

For data centers already running a service mesh, using that mesh to route traffic between tracers and the Microsatellite pool can be a simple configuration. Routing traffic with a service mesh shares many of the considerations discussed above in the L7 sections.

For HTTP traffic, any service mesh that supports L7 HTTP routing should work well for routing Cloud Observability tracing traffic.

For gRPC traffic, we only recommend routing traffic with a service mesh based on Envoy (https://www.envoyproxy.io/), such as Istio (https://istio.io/), because Envoy has native gRPC support.

Balance and tune tracers

The tracer client libraries are engineered for minimal impact on the processes they are tracing while still collecting and reporting the tracing data intended for collection. This requires striking a balance between the use of various resources: memory, network, and CPU.

The use of the network is managed by buffering the data to be reported: spans and the associated attributes, events, and payloads. Buffering shifts some burden onto memory to hold this buffered data until the client flushes the content of the buffer and reports to the Microsatellite.

If the Reporting Status dashboard indicates that tracers are dropping spans, you may need to adjust the parameters that control throughput.

OpenTelemetry tracers

The OpenTelemetry specification describes a built-in batching span processor that batches finished spans and passes them to the exporter. Three of its parameters can be tuned to control performance.

From the spec:

  • scheduledDelayMillis: The delay interval in milliseconds between two consecutive exports. The default value is 5000.
  • exportTimeoutMillis: How long the export can run before it is cancelled. The default value is 30000.
  • maxExportBatchSize: The maximum batch size of every export. It must be smaller or equal to maxQueueSize. The default value is 512.

You can use maxExportBatchSize and scheduledDelayMillis to estimate the approximate maximum throughput per tracer instance: Approximate max throughput (spans per second) = maxExportBatchSize / (scheduledDelayMillis / 1000). With the default values, that is 512 / 5 ≈ 100 spans per second.
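
As a concrete sketch, the OpenTelemetry Python SDK exposes these parameters on its BatchSpanProcessor. The endpoint (collector-lb.example.com:4317) and the values below are illustrative placeholders, and the sketch assumes the balancer in front of your Microsatellite pool accepts OTLP over gRPC; other language SDKs use slightly different option names.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Placeholder balancer endpoint in front of the Microsatellite pool.
    exporter = OTLPSpanExporter(endpoint="collector-lb.example.com:4317")

    # Batch processor tuned with the spec parameters described above.
    # 512 spans exported at most every 5 seconds is roughly 100 spans/second per tracer instance.
    processor = BatchSpanProcessor(
        exporter,
        max_queue_size=2048,           # upper bound on buffered spans (maxQueueSize)
        schedule_delay_millis=5000,    # scheduledDelayMillis
        max_export_batch_size=512,     # maxExportBatchSize (must be <= max_queue_size)
        export_timeout_millis=30000,   # exportTimeoutMillis
    )

    provider = TracerProvider()
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)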

When to adjust configuration

During instrumentation:
You can always change these values, but it is wise to think about the throughput you expect and use the “back of the envelope” calculation above to determine good starting values. However, this doesn’t translate directly into a fixed amount of memory, because along with the number of spans, the buffer also holds the attributes, events, and payloads that the instrumentation attaches to each span. The nature of the instrumentation load is also important to take into account when setting the buffer size. If the rate of span creation is relatively uniform, a smaller buffer will suffice. However, if the rate is bursty, the client may drop spans even when the buffer is sized well relative to the average rate.
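
For example, here is one way to turn that back-of-the-envelope reasoning into starting values. The expected span rate and burst factor are assumptions you would replace with your own estimates.

    # Illustrative sizing; the inputs are assumptions, not measurements.
    expected_spans_per_second = 200   # estimated steady-state span rate for this service
    burst_factor = 3                  # how much higher a burst can spike above the average
    schedule_delay_millis = 5000      # planned export interval (scheduledDelayMillis)

    interval_seconds = schedule_delay_millis / 1000

    # Each export batch should keep up with the average rate...
    max_export_batch_size = int(expected_spans_per_second * interval_seconds)          # 1000

    # ...while the queue (buffer) should be able to absorb a bursty interval.
    max_queue_size = int(expected_spans_per_second * burst_factor * interval_seconds)  # 3000

    print(max_export_batch_size, max_queue_size)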

When there are client dropped spans:
Client-dropped spans can mean a few things, but the most likely cause is that the data sent to the tracer exceeds its maximum throughput. By looking at the Spans Sent and Client Dropped values in the Reporting Status dashboard, you can estimate the increase in buffer size that is required. Most apps will be fine with a buffer size that is 1 to 2 times the expected rate of spans per second.

The easiest way to resolve dropped client spans is to increase the maxExportBatchSize (MaxBufferedSpans) parameter. Start by changing this parameter rather than scheduledDelayMillis (ReportingPeriod): because each report carries a certain amount of processing overhead, increasing the amount of data sent per report is generally more performant than decreasing the reporting period. That said, tuning scheduledDelayMillis (ReportingPeriod) based on observed performance is also an acceptable path.
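
As a rough, hedged illustration of that estimate, you can scale the current batch size by the ratio of offered load (sent plus dropped) to delivered load. The dashboard figures below are placeholders.

    # Placeholder numbers read from the Reporting Status dashboard over the same time window.
    spans_sent = 90_000
    client_dropped = 30_000

    current_max_export_batch_size = 512

    # Offered load / delivered load shows how far short the current throughput falls.
    scale = (spans_sent + client_dropped) / spans_sent   # ~1.33 in this example

    # Add some headroom so bursts don't immediately cause drops again.
    headroom = 1.5
    new_max_export_batch_size = int(current_max_export_batch_size * scale * headroom)  # ~1024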

Another configuration option to be aware of is exportTimeoutMillis (ReportTimeout; not all tracers support this). This is the duration the tracer waits for a response from the Microsatellite when sending a report. We recommend starting with the default. After your system has been running for a while, set it to the 99.99th percentile latency of those reports.
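
As an illustrative sketch of that last step, the snippet below derives a timeout from a set of observed report latencies. Loading those latencies from your metrics system is a hypothetical helper; the sample values are placeholders.

    import math

    def load_recent_report_latencies_ms():
        # Hypothetical helper: fetch recent report round-trip times (in ms) from your metrics system.
        return [120.0, 135.5, 150.2, 180.7, 240.1, 310.9, 95.3, 410.4]  # placeholder data

    latencies_sorted = sorted(load_recent_report_latencies_ms())

    # With a large enough sample, the 99.99th percentile is approximately the value at
    # index ceil(0.9999 * N) - 1 of the sorted latencies (falls back to the max for small samples).
    idx = min(len(latencies_sorted) - 1, math.ceil(0.9999 * len(latencies_sorted)) - 1)
    export_timeout_millis = int(latencies_sorted[idx])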

Updated Apr 6, 2021