This topic is about our Classic Satellites. If you installed Satellites after 4/06/2021, you are probably running Microsatellites.

Lightstep Satellites generate several helpful StatsD-based metrics that you can send to any compliant monitoring system. If you already use Datadog as your monitoring system, you can add tags to give your metrics more context, and you can use our pre-configured dashboard to quickly view performance.

Enabling StatsD Satellite Metrics

You can turn on metrics reporting when you configure your Satellites. Here are examples of that configuration using StatsD or Datadog, for both AWS/Debian and Docker.

StatsD Example

Start tabs

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_STATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_statsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"

End code tabs

Datadog Example

Start tabs

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_DOGSTATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary
COLLECTOR_STATSD_DOGSTATSD_TAGS="env:prod,pool:us-west-1,canary:true"

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_dogstatsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"
    dogstatsd_tags: "env:prod,pool:us-west-1,canary:true"

End code tabs
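Because the Datadog tags are passed as a single comma-separated string, it can be useful to sanity-check that string before deploying it. The following is a minimal sketch, not part of the Satellite itself: `parse_dogstatsd_tags` is a hypothetical helper name, and it assumes each tag has the `key:value` shape used in the examples above.

```python
def parse_dogstatsd_tags(raw: str) -> dict:
    """Split a comma-separated "key:value" tag string into a dict.

    Hypothetical helper for checking a tag string by eye; the Satellite
    does its own parsing, which may differ in edge cases.
    """
    tags = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        # Split on the first colon only, so values may contain colons.
        key, _, value = item.partition(":")
        tags[key] = value
    return tags


print(parse_dogstatsd_tags("env:prod,pool:us-west-1,canary:true"))
```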

Available Metrics

Satellites report the following metrics. Metrics that are important to Satellite and Lightstep health are noted, with advice on when to alert and how to resolve the issue.

A note about project names in metrics:
* Many of these metrics are automatically labeled with a Lightstep project name, so the resulting time series can be grouped by project, if desired.
* For basic StatsD metrics, the Lightstep project becomes part of the metric name itself, for example: satellite.spans.received.my_lightstep_project_name
* For Datadog metrics, the project name is attached using a tag called lightstep_project on the relevant metrics. The syntax to indicate a tag is {tag_name}.
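The naming difference between the two formats can be sketched as follows. This is an illustration only: the function names are hypothetical, and the prefix values are taken from the configuration examples earlier on this page.

```python
def statsd_metric_name(prefix, component_prefix, metric, project=None):
    """Plain StatsD: the project name is appended to the metric name."""
    parts = [prefix, component_prefix, metric]
    if project:
        parts.append(project)
    return ".".join(p for p in parts if p)


def datadog_metric(prefix, component_prefix, metric, project=None):
    """Datadog: the project name travels as a lightstep_project tag."""
    name = ".".join(p for p in (prefix, component_prefix, metric) if p)
    tags = {"lightstep_project": project} if project else {}
    return name, tags


print(statsd_metric_name("lightstep.prod.us-west-1", "satellite-canary",
                         "spans.received", "my_lightstep_project_name"))
print(datadog_metric("lightstep.prod.us-west-1", "satellite-canary",
                     "spans.received", "my_lightstep_project_name"))
```

Metrics without a project label, such as the bytes.received metrics, would simply omit the `project` argument.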

client.spans.dropped

The number of spans dropped at the client because its outgoing queue is full while it is still trying to send earlier spans to a Satellite.

Values are cumulative and can be aggregated across Satellites and projects.

Consider monitoring this metric
Why Monitor: The value of this metric represents how many spans the client can’t send to Satellites because its outgoing queue is full. When tracer clients can’t send spans to Satellites, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. Check out this sample Datadog monitor.
Remediations: First try tuning the buffer size of the tracer client library by following these instructions. If the problem persists, audit your instrumentation to ensure you aren’t “over-instrumenting” by sending too many low value (or accidental) spans.
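The "remains above 0 for an extended period" condition can be expressed as a check over consecutive evaluation intervals. The sketch below is an assumption about how you might evaluate it yourself, not a description of the sample Datadog monitor; `sustained_above_zero` and the window size are hypothetical.

```python
def sustained_above_zero(samples, window):
    """Return True if the last `window` samples are all above 0.

    `samples` holds one dropped-span count per evaluation interval,
    oldest first. A single non-zero interval does not trigger an alert.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough history to call it sustained
    return all(value > 0 for value in samples[-window:])


# One count per minute; alert only after three non-zero minutes in a row.
print(sustained_above_zero([0, 3, 1, 2], window=3))
print(sustained_above_zero([0, 3, 0, 2], window=3))
```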

Type: Count
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<client_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<client_prefix>.spans.dropped
{lightstep_project}

End code tabs

satellite.access_tokens.invalid

The number of reports (i.e., batches of spans) that have been rejected by the Satellite due to an invalid access token.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-11-19_17-15-06Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.access_tokens.invalid.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.access_tokens.invalid
{lightstep_project}

End code tabs

satellite.bytes.received.thrift

The total bytes of Thrift span traffic received over the network by the Satellite. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.bytes.received.thrift

End code tabs

satellite.bytes.received.grpc

The total bytes of gRPC span traffic received by the Satellite over the network. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.bytes.received.grpc

End code tabs

satellite.spans.received

The total number of spans that the Satellite received and decoded. This value is counted before any sampling you may have configured (the post-sampling count is reported by <satellite_prefix>.spans.indexed) and also includes any spans that the Satellite may yet drop due to insufficient resources (<satellite_prefix>.spans.dropped).

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.spans.received.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.received
{lightstep_project}

End code tabs

satellite.spans.dropped

The total number of spans that the Satellite dropped due to insufficient resources (after being received and decoded). These spans are not indexed or added to the statistics for streams.

Values are cumulative and can be aggregated across Satellites and projects.

Consider monitoring this metric
Why Monitor: The value of this metric represents how many spans the Satellite is unable to process due to insufficient resources. When spans are not able to be processed, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. It might also be helpful to alert when the fraction of received spans that are subsequently dropped exceeds 2% (adjust this to your tolerance), that is, when satellite.spans.dropped / satellite.spans.received > 0.02. Check out these sample Datadog monitors.
Remediations: First verify that your bytes_per_project_overrides settings match the recommended values here, then check whether the recall number is consistent across your Satellites. If they do, check your load balance settings. If the problem persists, try adding more Satellites.
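The ratio-based threshold above is a simple division of two counts taken over the same time window. A minimal sketch, assuming you have already fetched both counts from your monitoring system (the function names are hypothetical):

```python
def drop_ratio(dropped, received):
    """Fraction of received spans that the Satellite then dropped."""
    if received <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return dropped / received


def should_alert(dropped, received, threshold=0.02):
    """True when the drop ratio exceeds the threshold (2% by default)."""
    return drop_ratio(dropped, received) > threshold


# Counts for satellite.spans.dropped and satellite.spans.received
# over the same time window.
print(should_alert(dropped=30, received=1000))
print(should_alert(dropped=20, received=1000))
```

The comparison is strict, matching the `> 0.02` expression above, so a ratio of exactly 2% does not alert.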

Type: Count
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.dropped
{lightstep_project}

End code tabs

satellite.index.queue.length

The number of reports (i.e., batches of spans) that have been read from the network and are currently waiting to be indexed.

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.index.queue.length.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.length
{lightstep_project}

End code tabs

satellite.index.queue.bytes

The total size, in bytes, of the reports that are currently waiting to be indexed (that is, the size in bytes of the reports counted by index.queue.length).

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.index.queue.bytes.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.bytes
{lightstep_project}

End code tabs

satellite.spans.indexed

The number of spans that are successfully ingested by the Satellite and can be viewed in Lightstep or assembled into traces.

If Satellites are configured to use the sample_one_in_n parameter, this metric represents the number of spans that remain after down-sampling. See spans.received for pre-sampled counts.

Values are cumulative and can be aggregated across instances and projects.

Aggregate statistics in Streams and Histograms will be scaled up automatically to account for the sampling ratio.
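As a rough illustration of what scaling by the sampling ratio means, the sketch below multiplies the indexed count by n. This is an assumption made for illustration: `estimate_presampled` is a hypothetical helper, and the exact scaling Lightstep applies to Streams and Histograms is not documented here.

```python
def estimate_presampled(indexed, sample_one_in_n):
    """Estimate the pre-sampling count from a post-sampling count.

    Assumes one span in every `sample_one_in_n` is kept, so each indexed
    span stands in for roughly that many original spans.
    """
    if sample_one_in_n < 1:
        raise ValueError("sample_one_in_n must be at least 1")
    return indexed * sample_one_in_n


# With sample_one_in_n = 4, 250 indexed spans represent about 1000 spans.
print(estimate_presampled(indexed=250, sample_one_in_n=4))
```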

Type: Count
Since: 2021-01-26_23-02-36Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.spans.indexed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.indexed
{lightstep_project}

End code tabs

satellite.bytes.indexed

The total bytes for spans that are successfully ingested by the Satellite and can be viewed in Lightstep or assembled into traces.

If Satellites are configured to use the sample_one_in_n parameter, this metric represents the total size in bytes that remains after down-sampling. See spans.received for pre-sampled counts.

Values are cumulative and can be aggregated across instances and projects.

Aggregate statistics in Streams and Histograms will be scaled up automatically to account for the sampling ratio.

Type: Count
Since: 2021-01-26_23-02-36Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.bytes.indexed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.bytes.indexed
{lightstep_project}

End code tabs

satellite.starts

The number of times this Satellite has started, including the initial start. The count increments by one on each restart.

Type: Count
Since: 2021-01-26_23-02-36Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.starts.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.starts
{lightstep_project}

End code tabs

satellite.current.recall.seconds

The number of seconds between now and the oldest span still indexed in the Satellite’s memory. This indicates how much history is currently available to facilitate trace assembly for the UI.

Values are instantaneous (non-cumulative) and aggregation across instances and/or projects is only meaningful with a “minimum” operator.

Consider monitoring this metric
Why Monitor: The value of this metric represents how much history is currently available to facilitate trace assembly. If this value drops too low, the product experience will be compromised.
Alert Thresholds: A value below 3 minutes signals a degraded state. A value between 3 and 5 minutes signals partial degradation. Check out this sample Datadog monitor.
Remediations: First verify that your bytes_per_project_overrides settings match the recommended values here, then check whether the recall number is consistent across your Satellites. If they do, check your load balance settings. If the problem persists, try adding more Satellites.
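The thresholds above, combined with the "minimum" aggregation, can be sketched like this. The function name and the "healthy" label are hypothetical, and treating exactly 3 and exactly 5 minutes as partial degradation is an assumption about the boundary cases.

```python
def recall_status(recall_seconds_by_satellite):
    """Classify recall using the lowest value across Satellites.

    Aggregation uses min(), since that is the only operator that is
    meaningful for this gauge across instances.
    """
    if not recall_seconds_by_satellite:
        raise ValueError("need at least one recall value")
    worst = min(recall_seconds_by_satellite)
    if worst < 3 * 60:
        return "degraded"
    if worst <= 5 * 60:
        return "partially degraded"
    return "healthy"


# Recall in seconds as reported by three Satellites.
print(recall_status([600, 240, 900]))
```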

Type: Gauge
Since: 2018-10-03_18-47-12Z

Start tabs

StatsD

<prefix>.<satellite_prefix>.current.recall.seconds.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.current.recall.seconds
{lightstep_project}

End code tabs