Load Balance LightStep

About Load Balancing

For on-premises Satellites, span traffic is generally sent to a pool of Satellites behind a load balancer. Effective load balancing is important to ensure a consistent and adjustable recall window for the Satellite pool and to allow for efficient use of Satellite computing resources.

To achieve that goal, there are two important metrics to consider.

  • Satellite recall
  • Dropped spans

Satellite Recall

Each Satellite holds 100% of unsampled recent spans in memory, discarding older spans as newer spans arrive. The length of time between the newest and the oldest span currently held in memory is the "recall" or "recall window". At any given time, each Satellite has its own recall value and each Satellite pool has a distribution of recall values.
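For example, if the oldest span a Satellite currently holds in memory arrived 6 minutes ago and the newest arrived just now, that Satellite's recall is 6 minutes.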

Recall Implications:

  • In order for a Stream to reliably assemble representative traces, the recall must be at least 5 minutes across all Satellites (and all the projects they host) at all times.
  • In order for Explorer queries, the Service Diagram, and Correlations to return comprehensive results from 100% of the span data over the past N minutes, the recall at each Satellite must be at least N minutes.

In general, a standard target for recall is 5 - 10 minutes, depending on how much history you want to see in the Explorer view. However, there is no technical limitation to maintaining a longer recall window if you need to query the full set of span data longer into the past.

You can't configure the recall window directly; it's determined by the amount of span traffic sent from all tracer clients and the memory available to the Satellites. You can achieve longer recall either by reducing the amount of span traffic or by increasing the available memory of the Satellite pool (either more memory per instance, or more instances). Significant variance in recall across Satellites within a pool can be a symptom of load imbalance: it limits your ability to tune the recall window to the desired length and indicates suboptimal resource usage, because the useful capacity of the overall pool is limited by the lower bound.
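
As a rough, purely hypothetical illustration: if tracers send about 50 MB of span data per second to a pool whose Satellites together dedicate 30 GB of memory to span storage, recall is roughly 30,000 MB / (50 MB/s) = 600 seconds, or 10 minutes. Doubling the span traffic without adding memory would cut recall to about 5 minutes, while doubling the pool's memory would extend it to about 20 minutes.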

Dropped Spans

Even when recall is balanced, there are situations where spans can be dropped at either the tracer or the Satellite (that is, discarded before a Satellite can process them in memory). Dropped spans can lead to incomplete traces, Streams that undercount events, and missing data in Explorer. To resolve dropped spans, you may need to increase the number of Satellites, adjust the amount of memory allocated to that project in the Satellite pool, or tune auto-scaling settings.

The remainder of this topic focuses on approaches for achieving balanced recall. To address dropped spans, you need to tune the amount of memory allocated to Satellites.

Using Datadog to monitor Satellites?

Check out this topic for info on how to configure Datadog to monitor Satellite recall and dropped spans from both the client and Satellites.

Choosing a Load Balancer

There are many options for balancing load across a LightStep Satellite pool. Broadly, these options can be categorized into either protocol-specific (often referred to as L7) or TCP (often referred to as L4) balancers.

L7 balancers can decode the specific protocol used and can provide more sophisticated features. Specifically for Satellite pools, there are two features that L7 balancers provide that can be helpful:

  • TLS termination: Allows encryption of the traffic between tracers and the balancer without having to configure the Satellite pool with TLS certificates.
  • Per-request balancing: Can lead to a more even distribution of traffic across the cluster. Some tracers will establish long-lived connections with the Satellites to save on the overhead of establishing a connection per request. L7 balancers enable requests sent over these long-lived connections to be balanced across the pool and decrease the likelihood of hot spots in the pool.

The disadvantage of L7 balancers is that they often require more careful selection and configuration. For example, they need TLS certificates to be made available to the balancer, and compatibility issues are more common. There are also stricter considerations when using an L7 balancer with gRPC; see the section below.

LightStep recommends using an L7 balancer for Satellite pools if such a configuration is feasible. It will lead to a healthier and more balanced pool.

L7 HTTP

When using HTTP as a transport, there are few requirements for L7 balancers.

Many cloud providers offer L7 HTTP balancing solutions that you can use, and there are also many hardware and software HTTP(S) balancers available.

The one consideration is whether the balancer supports HTTP/2. Many LightStep tracers use HTTP/2 to reuse persistent connections for subsequent requests, which saves resources by not establishing a new connection per request.

L7 gRPC

When using gRPC as a transport, it’s important to ensure that the balancer chosen is compatible with gRPC. gRPC is a set of standards and open-source implementations for communication between server and client built on top of HTTP/2. The gRPC client and server create HTTP/2 connections and communicate over those connections.

The only LightStep recommended gRPC L7 load balancer is Envoy (https://www.envoyproxy.io/). Envoy is an open-source software load balancer that can be deployed either as a separate application or as a sidecar running alongside an application with a tracer. Envoy supports gRPC natively.

Because gRPC is HTTP/2 on the wire, it is theoretically possible to use an HTTP L7 balancer for gRPC connections. In practice, most HTTP L7 balancers don't work with gRPC, either because they only partially support HTTP/2 (typically accepting HTTP/2 connections from the client to the balancer but sending HTTP/1.1 traffic from the balancer to the server) or because they make other alterations to requests (such as rewriting headers). For this reason, LightStep doesn't recommend using an HTTP L7 balancer for gRPC traffic.

L4 HTTP or gRPC

When using either HTTP or gRPC as a transport, there are few requirements for an L4 TCP balancer.

Many cloud providers offer L4 TCP balancing solutions that you can use, and there are also many hardware and software TCP balancers available.

Because TCP balancers have minimal interaction with the protocols used, there are few restrictions in the selection process. However, keep the following caveats in mind:

  • Because L4 balancers don’t do per-request balancing, usage of an L4 balancer can result in uneven distribution of traffic across a Satellite pool. This is because persistent connections established by a tracer will be pinned to a single Satellite. This could lead to overwhelming that Satellite with too many spans.
  • L4 balancers cannot terminate SSL. To get encrypted transport, provide a certificate to the Satellite itself.

Service Mesh

A service mesh distributes the responsibility of routing and balancing traffic within a data center to software running on each application host (https://www.nginx.com/blog/what-is-a-service-mesh/). Many service meshes support HTTP routing and balancing and some support gRPC routing and balancing.

For data centers already running a service mesh, using that mesh to route traffic between LightStep tracers and the Satellite Pool can be a simple configuration. Routing traffic using a service mesh has many of the considerations discussed above in the L7 routing sections.

For HTTP traffic, any service mesh that supports L7 HTTP routing should work well for routing LightStep tracing traffic.

For gRPC traffic, we only recommend routing traffic with a service mesh based on Envoy (https://www.envoyproxy.io/), such as Istio (https://istio.io/), because Envoy has native gRPC support.

Balance and Tune Tracers

The LightStep tracer client libraries are engineered for minimal impact on the processes they are tracing while still collecting and reporting the tracing data intended for collection. This requires striking a balance between the use of various resources: memory, network, and CPU.

Network use is managed by buffering the data to be reported: spans and their associated tags, logs, and payloads. Buffering shifts some of the burden onto memory, which holds the buffered data until the client flushes the buffer and reports to the Satellite.

If the Reporting Status dashboard indicates that tracers are dropping spans, it may be necessary to modify parameters that can be configured to help control throughput. The primary ones are MaxBufferedSpans and ReportingPeriod.

Each tracer has its own name for these variables and a few subtle differences, as well as other parameters that you can tune to affect tracer performance; these are enumerated in the Tracer Specific Options table below.

You can use MaxBufferedSpans and ReportingPeriod to dictate the approximate max throughput for your tracer using this equation:
Approximate Max Throughput = MaxBufferedSpans / ReportingPeriod
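
For example, with the Go tracer defaults shown in the table below (MaxBufferedSpans = 1000, ReportingPeriod = 2.5 seconds), the approximate max throughput is 1000 / 2.5 = 400 spans per second; a process that creates spans faster than this for a sustained period will eventually drop spans at the client.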

When to Adjust Configuration

During instrumentation:
You can always change these values but it is wise to think about the throughput you expect and use the above “back of the envelope” calculation to determine good starting values. However, this doesn't translate directly into a fixed amount of memory because along with the number of spans, the buffer also holds the tags, logs, and payloads that the instrumentation attaches to each span. The nature of the instrumentation load is also important to take into account when setting the buffer size. If the rate of spans created is relatively uniform, a lower buffer size will suffice. However, if the rate is bursty, the client may drop spans even when the buffer is sized well relative to the average rate.

When there are client dropped spans:
Client-dropped spans can mean a few things, but the most likely cause is that span data is being produced faster than the tracer's max throughput. By comparing the Spans Sent and Client Dropped values in the Reporting Status dashboard, you should be able to estimate how much the buffer size needs to increase. Most apps will be fine with a buffer size that is 1 to 2 times the expected rate of spans per second.

The easiest way to resolve dropped client spans is to increase the MaxBufferedSpans parameter. Start with changing this parameter rather than ReportingPeriod. Because sending each report has a certain amount of processing overhead, increasing the amount of data sent per report is generally more performant than decreasing the reporting period. That being said, tuning the ReportingPeriod based on observed performance is also an acceptable path.

Another configuration option to be aware of is ReportTimeout (not all tracers support this). This is the duration the tracer waits for a response from the Satellite when sending a report. We recommend starting with the default; after your system has been running for a while, set it to the 99.99th percentile latency observed for reports.
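
As a rough sketch (not a recommendation of specific values), here is how these parameters might be set with the Go tracer, using the Go option names listed in the table below. The access token, collector host, port, and chosen values are placeholders.

```go
package tracing

import (
	"time"

	lightstep "github.com/lightstep/lightstep-tracer-go"
)

// newTracer builds a LightStep tracer with explicit throughput-related settings.
// Approximate max throughput = MaxBufferedSpans / ReportingPeriod
// (here: 2000 / 2.5s = 800 spans per second).
func newTracer() lightstep.Tracer {
	return lightstep.NewTracer(lightstep.Options{
		AccessToken: "YOUR_ACCESS_TOKEN", // placeholder
		Collector: lightstep.Endpoint{
			Host: "satellite-pool.example.com", // placeholder load balancer address
			Port: 8383,                          // placeholder port
		},
		// Raise MaxBufferedSpans first when the Reporting Status page shows
		// client-dropped spans; it is usually more efficient than shortening
		// the reporting period.
		MaxBufferedSpans: 2000,
		ReportingPeriod:  2500 * time.Millisecond,
		// Start with the default ReportTimeout; tune it later based on the
		// observed 99.99th percentile report latency.
		ReportTimeout: 30 * time.Second,
	})
}
```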

Tracer Specific Options

This table shows the language-specific name of the variables described above as well as other parameters that you can tune.

Each group below lists one tracer's names and defaults for MaxBufferedSpans, ReportingPeriod, and ReportTimeout, along with any additional options.

MaxBufferedSpans
default = 1000

In Go, a report is sent either when the MaxBufferedSpans buffer fills up or when the ReportingPeriod elapses. To prevent reports from being sent too frequently when span volume is high, there is also a MinReportingPeriod: even if the buffer is full, the tracer waits at least MinReportingPeriod between reports.

ReportingPeriod (Duration) - when a report will be sent if the buffer is not full.
default = 2500 * time.Millisecond

MinReportingPeriod (Duration) - the minimum time between reports, even if the buffer fills up beforehand.
default = 500 * time.Millisecond

ReportTimeout (Duration)
default = 30 * time.Second

ReconnectPeriod (Duration) - The duration the tracer waits before disconnecting the open transport connection with a Satellite and opening a new one. This matters because connections are rebalanced across the Satellites at this interval. If, when you start up all of your Satellites, you notice it takes a long time for the load to rebalance, you can decrease this value. Decreasing it too far will add more resource overhead on the tracer.
default = 5 * time.Minute

maxBufferedSpans
default = 1000

maxReportingIntervalMillis
default = 3000

deadlineMillis
default = 30000

withResetClient If true, the gRPC client connection is reset at regular intervals, which allows connections to be rebalanced across Satellites.
default = true

disableReportingLoop If true, the background reporting loop is disabled and reports are sent only on explicit calls to Flush().
default = false

withGrpcRoundRobin Instructs gRPC to round-robin between Satellite instances in a pool when sending traces. If not enabled, gRPC picks the first record returned by the operating system's resolver.
default = false

max_span_records
default = 4096

max_reporting_interval_millis
default = 2500

report_timeout_millis
default=30000

error_throttle_millis
optional, default = 60000 - when verbosity is set to 1, this is the minimum time between logged errors.

delay_initial_report_millis
optional, default=1000 - maximum additional delay of the initial report in addition to the normal reporting interval. A value between zero and this maximum will be selected as the actual delay. This can be useful when concurrently launching a large number of new processes and there is a desire to distribute the initial reports over a window of time.

disable_reporting_loop
optional, default=false - if true, the timer that automatically sends reports to the Satellite (collector) will be disabled. This option is independent of disable_report_on_exit.

disable_report_on_exit
optional, default=false - if true, the final report that is automatically sent at process exit in Node or page unload in the browser will not be sent.

gzip_json_requests
optional, default=true - if true, reports are gzipped before being sent to the Satellite (collector).

disable_meta_event_reporting - if set, the tracer disables meta event reporting even if the Satellite requests it.

max_span_records
default = 1000

periodic_flush_seconds
default = 2.5

timeout_seconds
default = 30

maxSpanRecords
default = 5000

flushIntervalSeconds
default = 30

Set to 0 for no automatic background flushing.

max_span_records
default = 1000

period (seconds)
default = 3.0

WithMaxBufferedSpans
default = int.MaxValue()

WithReportPeriod
default = 5s

WithReportTimeout
default = 30s

WithAutomaticReporting If false, disables the automatic flushing of buffered spans.

max_buffered_spans
default = 2000

reporting_period
default = 500 milliseconds

report_timeout
default = 5s

See Java

See Java

See Java

See Java

max_span_records
default = 1000

max_reporting_period_secs
default = 5.0

min_reporting_period_secs
default = 0.1

(see Go for more explanation on the meaning of these two values)

Advanced Configuration: Client-side Load Balancing

Some LightStep tracers can balance requests across a Satellite pool themselves. For example, the gRPC documentation references balancing-aware clients that you can configure.

In the Go tracer, it is possible to override gRPC options using the resolver and balancer options. This configuration can result in an extremely well-balanced pool. Reach out if you have additional questions about this option.

In the Java tracer, setting withGrpcRoundRobin instructs gRPC to round-robin between Satellite instances in a pool when sending traces.
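
As an illustration of the gRPC mechanism these options rely on (not the tracer's own API), here is a minimal Go sketch of client-side round-robin balancing over a DNS name that resolves to every Satellite in the pool. The target address and port are placeholders; consult your tracer's documentation for how to pass resolver and balancer settings through to gRPC.

```go
package tracing

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialSatellitePool opens a gRPC connection that round-robins requests across
// every address the DNS name resolves to, instead of pinning to the first one.
func dialSatellitePool() (*grpc.ClientConn, error) {
	return grpc.Dial(
		// Placeholder DNS name that resolves to all Satellite instances in the pool.
		"dns:///satellite-pool.internal:8383",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Built-in round_robin balancing policy, enabled via the default service config.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```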

Balancing and Tuning Satellites

To determine whether span traffic is well balanced between Satellites, evaluate these three primary metrics:

  • Span Ingress Rate
  • Recall
  • Dropped Spans

Using Datadog?

Check out this topic for info on how to configure Datadog to monitor Satellite recall and dropped spans from both the client and Satellites.

Each satellite can only process a maximum of X ingress spans/bytes in the steady state. That rate depends on the size of spans, wire format, and whether the traffic is encrypted. Each satellite reserves a maximum of Y bytes of memory as an index to store that span data temporarily. The length of time represented by the spans stored in memory (recall) is proportional to the span ingress rate. Higher ingress rates mean lower recall, all else being equal. When the ingress rate exceeds the maximum that a satellite can handle, it will start dropping some spans.

LightStep requires a recall window of > 5 minutes to reliably assemble traces and provide a good UI experience. In order to use memory efficiently, without losing data, it's important to keep span ingress balanced and avoid dropping spans while maintaining a 5+ minute recall window across the satellite pool. See below for details about how to track these metrics and perform troubleshooting.

Span Ingress Rate

One factor for effective Satellite load balancing is the ingress rate of spans. You should roughly balance this rate across all Satellites in a pool. A good rule of thumb is (max - min) / min < 0.05 (i.e. less than 5% load imbalance between Satellite instances).
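
For example, if the busiest Satellite ingests 10,500 spans per second and the least busy ingests 10,000, the imbalance is (10,500 - 10,000) / 10,000 = 0.05, right at the 5% threshold.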

Option 1: These ratios can be observed on the Recommended Customer Satellite Dashboard, using Satellite StatsD metrics and Datadog.
Option 2: If you already have network stats for ingress traffic to your Satellites, you may be able to apply the same formula for ingress bytes.

Recall

Option 1: The more reliable way to track Satellite Recall (max, min, and imbalance ratio) is to enable Satellite StatsD metrics.
Option 2: The easier, faster way to track current Satellite recall ONLY (no history) is on the Satellite Pool report that provides metrics for all Satellite Pools that are currently receiving spans.

The Satellites page doesn't provide historical information. To monitor Satellite metrics over time, we recommend enabling the option to publish Satellite metrics to StatsD.

This page shows recent status information for all your satellites and projects. If traffic is well balanced between satellites, the difference between Min Recall and Max Recall should be <5% of Min Recall.

If the difference between Min Recall and Max Recall is >5% of Min Recall for any project, the load balancing is uneven and needs adjusting. Click View Satellite to see the recall values from each individual Satellite, which will reveal whether the imbalance is due to a single instance or is widespread throughout the pool.

When there is an em-dash ("—") in the Min Recall and Max Recall columns, this indicates that the Satellite buffer is not full yet and thus a recall period cannot be calculated. This does not mean that the Satellite is not receiving spans.

Dropped Spans

Option 1: The most reliable way to track dropped spans is to enable StatsD Satellite metrics.
Option 2: Use the Reporting Status page. This page shows the number of recently dropped spans and is a great way to spot-check your Satellites' performance.

In a steady state, both Client Dropped Spans and Satellite Dropped Spans should be zero. Dropped spans are not indexed or counted as part of the Satellite ingestion pipeline.

Check General Satellite Health

Load balancers can use the Satellite Diagnostics endpoint to determine whether an individual Satellite is currently healthy and available to handle incoming span traffic. This endpoint is available at: http(s)://{satellite host}:{admin port}/_ready.

A 200 (OK) response indicates that the satellite is currently able to accept incoming span traffic. Any other response, including a timeout, indicates that either the Satellite isn't currently running, or it has too many queued span reports and can't handle any more at the moment. In this case, the load balancer should send the request to a different satellite.
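
For illustration, here is a minimal Go sketch of such a readiness probe; the host and admin port are placeholders, and in practice the load balancer's own HTTP health check would be configured to hit this path.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// satelliteReady reports whether a Satellite answers its readiness endpoint with
// 200 (OK). Any other status, an error, or a timeout is treated as "not ready",
// and the load balancer should route spans to a different Satellite.
func satelliteReady(host string, adminPort int) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/_ready", host, adminPort))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Placeholder host and admin port.
	fmt.Println(satelliteReady("satellite-1.internal", 8000))
}
```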

Resolve Satellite Imbalances

This flow chart provides resolutions to the above issues.
