Understand StatsD Microsatellite metrics

Microsatellites emit several StatsD-based metrics that you can send to any StatsD-compliant monitoring system. If you use Datadog as your monitoring system, you can also add tags to give your metrics more context.

Enabling StatsD Microsatellite metrics

You can turn on metrics reporting when you configure your Microsatellites. The following examples show that configuration for StatsD and for Datadog, on both Docker and AWS/Debian.

StatsD example

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_STATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_statsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"

Datadog example

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_DOGSTATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary
COLLECTOR_STATSD_DOGSTATSD_TAGS="env:prod,pool:us-west-1,canary:true"

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_dogstatsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"
    dogstatsd_tags: "env:prod,pool:us-west-1,canary:true"
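To check that a Microsatellite configured as in the examples above is actually emitting metrics, it can help to point it at a throwaway listener before wiring up a real agent. The following is a minimal sketch, not part of the product: it assumes metrics arrive over UDP in the common StatsD line format (name:value|type, with DogStatsD tags appended as |#key:value,...). The function names parse_statsd_line and listen are hypothetical.

```python
import socket


def parse_statsd_line(line):
    """Split one StatsD/DogStatsD line into name, value, type, and tags."""
    head, *sections = line.strip().split("|")
    # The value follows the last colon of the first section.
    name, _, raw_value = head.rpartition(":")
    metric_type = sections[0] if sections else ""
    tags = {}
    for section in sections[1:]:
        # DogStatsD appends tags as "#key:value,key:value".
        if section.startswith("#"):
            for tag in section[1:].split(","):
                key, _, value = tag.partition(":")
                tags[key] = value
    return {
        "name": name,
        "value": float(raw_value),
        "type": metric_type,
        "tags": tags,
    }


def listen(host="127.0.0.1", port=8125):
    """Print every metric line received on the given UDP address (runs forever)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _ = sock.recvfrom(65535)
        for line in data.decode("utf-8", errors="replace").splitlines():
            if line:
                print(parse_statsd_line(line))
```

Calling listen() with the same host and port as in the configuration (and with no other agent bound to that port) should print one dictionary per metric line received.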

Available metrics

The following metrics are reported by Microsatellites. Metrics that are important to the health of your Microsatellites and Cloud Observability are noted, with advice on when to alert and how to resolve the issue.

A note about project names in metrics:
* Many of these metrics are automatically labeled with a Cloud Observability project name, so the resulting time series can be grouped by project, if desired.
* For basic StatsD metrics, the project becomes part of the metric name itself, for example: satellite.spans.received.my_lightstep_project_name
* For Datadog metrics, the project name is attached to the relevant metrics as a tag called lightstep_project. In the metric listings below, tags are shown as {tag_name}.
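The naming rules above can be sketched as a small helper that builds the name you would search for in your monitoring system. This is only an illustration: metric_identity and its parameters are hypothetical, and it assumes the pieces are joined with dots exactly as in the listings below. A few metrics (for example bytes.received.thrift) are listed without a project, so the helper does not apply to them.

```python
def metric_identity(prefix, component_prefix, metric, project, dogstatsd=False):
    """Return (metric_name, tags) for a project-labeled metric.

    prefix            -- the configured prefix, e.g. "lightstep.prod.us-west-1"
    component_prefix  -- the configured satellite_prefix or client_prefix
    metric            -- the metric suffix, e.g. "spans.received"
    project           -- the Cloud Observability project name
    """
    base = ".".join(part for part in (prefix, component_prefix, metric) if part)
    if dogstatsd:
        # Datadog: the project travels as a tag, not in the name.
        return base, {"lightstep_project": project}
    # Basic StatsD: the project is the last component of the name.
    return f"{base}.{project}", {}


name, tags = metric_identity(
    "lightstep.prod.us-west-1", "satellite-canary", "spans.received", "my_project"
)
print(name, tags)
```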

client.spans.dropped

The number of spans dropped at the client because its outgoing queue is full while it is still trying to send earlier spans to a Microsatellite.

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many spans the client can’t send to Microsatellites because its outgoing queue is full. When tracer clients can’t send spans to Microsatellites, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period.
Remediations: First try tuning the buffer size of the tracer client library by following these instructions. If the problem persists, audit your instrumentation to ensure you aren’t “over-instrumenting” by sending too many low-value (or accidental) spans.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<client_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<client_prefix>.spans.dropped
{lightstep_project}
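The alert guidance for client.spans.dropped (a value that remains above 0 for an extended period) could be expressed roughly as follows. This is a sketch under assumptions: samples is a list of per-interval drop counts already fetched from your monitoring system, oldest first, and min_consecutive is whatever you decide "extended" means. Both names are hypothetical.

```python
def sustained_above_zero(samples, min_consecutive):
    """Return True if the most recent `min_consecutive` samples are all above 0."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > 0 for value in samples[-min_consecutive:])


# A single spike does not trigger; three consecutive non-zero intervals do.
print(sustained_above_zero([0, 5, 0, 0], 3))
print(sustained_above_zero([0, 3, 1, 2], 3))
```

Requiring several consecutive non-zero intervals avoids paging on a brief burst while still catching ongoing data loss.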

satellite.access_tokens.invalid

The number of reports (i.e., batches of spans) that have been rejected by the Microsatellite due to an invalid access token.

Values are cumulative and can be aggregated across Microsatellites and projects.

Type: Count
Since: 2018-11-19_17-15-06Z

StatsD

<prefix>.<satellite_prefix>.access_tokens.invalid.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.access_tokens.invalid
{lightstep_project}

satellite.bytes.received.thrift

The total bytes of Thrift span traffic received over the network by the Microsatellite. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Microsatellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.bytes.received.thrift

satellite.bytes.received.grpc

The total bytes of gRPC span traffic received by the Microsatellite over the network. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Microsatellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.bytes.received.grpc

satellite.spans.received

The total number of spans that the Microsatellite received and decoded. This value is counted before any sampling you may have configured (the post-sampling count is reported by <satellite_prefix>.spans.indexed) and includes any spans that Microsatellites may yet drop due to insufficient resources (<satellite_prefix>.spans.dropped).

Values are cumulative and can be aggregated across Microsatellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.spans.received.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.received
{lightstep_project}

satellite.spans.dropped

The total number of spans that the Microsatellite dropped due to insufficient resources (after being received and decoded). These spans are not indexed or added to the statistics for streams.

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many spans the Microsatellite is unable to process due to insufficient resources. When spans can’t be processed, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. It might also be helpful to alert when the percentage of received spans that are subsequently dropped exceeds 2% (adjust this to your tolerance), for example: satellite.spans.dropped / satellite.spans.received > 0.02
Remediations: If the problem persists, try adding more Microsatellites.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.dropped
{lightstep_project}
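The ratio-based threshold for satellite.spans.dropped (satellite.spans.dropped / satellite.spans.received > 0.02) can be sketched as below. The function names are hypothetical, and the inputs are assumed to be counts for the same time window, summed across Microsatellites.

```python
def drop_ratio(dropped, received):
    """Fraction of received spans that were dropped; 0.0 when nothing was received."""
    if received <= 0:
        return 0.0
    return dropped / received


def should_alert(dropped, received, threshold=0.02):
    """True when the drop ratio exceeds the threshold (2% by default)."""
    return drop_ratio(dropped, received) > threshold


# 30 dropped out of 1000 received is 3%, which is above the 2% threshold.
print(should_alert(30, 1000))
```

Guarding against a zero denominator matters in practice: an idle Microsatellite reports no received spans, and that should not be treated as a failure.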

satellite.index.queue.length

The number of reports (i.e., batches of spans) that have been read from the network and are currently waiting to be indexed.

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.index.queue.length.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.length
{lightstep_project}

satellite.index.queue.bytes

The number of bytes worth of reports that are currently waiting to be indexed (size of index.queue.length in bytes).

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.index.queue.bytes.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.bytes
{lightstep_project}

satellite.spans.indexed

The number of spans that are successfully ingested by the Microsatellite and can be viewed in Cloud Observability or assembled into traces.

If Microsatellites are configured to use the sample_one_in_n parameter, this metric represents the number of spans that remain after down-sampling. See spans.received for pre-sampled counts.

Values are cumulative and can be aggregated across instances and projects.

Aggregate statistics in Streams and Histograms will be scaled up automatically to account for the sampling ratio.

Type: Count
Since: 2021-01-26_23-02-36Z

StatsD

<prefix>.<satellite_prefix>.spans.indexed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.indexed
{lightstep_project}
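When sample_one_in_n is set, spans.indexed counts only the spans kept after down-sampling. As a rough cross-check against spans.received, you can scale the indexed count back up. This sketch assumes the sampler keeps, on average, one span out of every n; how the product scales its own aggregate statistics is not specified here, and the function name is hypothetical.

```python
def estimate_pre_sampling_count(indexed, sample_one_in_n):
    """Estimate how many spans existed before down-sampling."""
    if sample_one_in_n < 1:
        raise ValueError("sample_one_in_n must be at least 1")
    return indexed * sample_one_in_n


# With sample_one_in_n = 4, 250 indexed spans suggest about 1000 before sampling.
print(estimate_pre_sampling_count(250, 4))
```

A large gap between this estimate and spans.received over the same window may point to spans being dropped rather than sampled.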

satellite.bytes.indexed

The total bytes for spans that are successfully ingested by the Microsatellite and can be viewed in Cloud Observability or assembled into traces.

If Microsatellites are configured to use the sample_one_in_n parameter, this metric represents the total size in bytes that remains after down-sampling. See spans.received for pre-sampled counts. Values are cumulative and can be aggregated across instances and projects.

Aggregate statistics in Streams and Histograms will be scaled up automatically to account for the sampling ratio.

Type: Count
Since: 2021-01-26_23-02-36Z

StatsD

<prefix>.<satellite_prefix>.bytes.indexed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.bytes.indexed
{lightstep_project}

satellite.starts

The number of times this Microsatellite has been restarted (including the initial start). Increments by one for each restart.

Type: Count
Since: 2021-01-26_23-02-36Z

StatsD

<prefix>.<satellite_prefix>.starts.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.starts
{lightstep_project}

forward_spans.dropped

The total number of spans dropped in transit between the Microsatellite and the Cloud Observability platform. These spans won’t be available for trace assembly.

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many spans the Microsatellite is unable to forward to the Cloud Observability SaaS for analysis.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period.
Remediations: If the problem persists, try increasing the memory or adding more Microsatellites.

Type: Count
Since: 2021-03-22_13-16-05z

StatsD

<prefix>.<satellite_prefix>.forward_spans.dropped.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.forward_spans.dropped
{lightstep_project}

forward_spans.dropped.size_exceeded

Spans dropped when being sent to Cloud Observability because they exceed the maximum span size (128 KB).

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many spans the Microsatellite is unable to process because the span size is over 128 KB. When spans are not able to be processed, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. It might also be helpful to alert when the percentage of received spans that are subsequently dropped exceeds a value of 2% (configurable given your tolerance). forward_spans.dropped.size_exceeded / satellite.spans.received > 0.02
Remediations: If the problem persists, try reducing the size of the spans.

Type: Count
Since: 2021-03-22_13-16-05z

StatsD

<prefix>.<satellite_prefix>.forward_spans.dropped.size_exceeded.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.forward_spans.dropped.size_exceeded
{lightstep_project}
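If forward_spans.dropped.size_exceeded is non-zero, a first step toward reducing span size is finding which fields make a span large. The sketch below is only an approximation: it represents a span as a plain dictionary and measures its JSON encoding, which is not the wire format the Microsatellite measures, and it reads 128 KB as 131,072 bytes. All names are hypothetical.

```python
import json

# Assumption: the 128 KB limit is interpreted as 128 * 1024 bytes.
MAX_SPAN_BYTES = 128 * 1024


def approximate_span_bytes(span):
    """Approximate a span's size as the length of its compact JSON encoding."""
    return len(json.dumps(span, separators=(",", ":")).encode("utf-8"))


def largest_fields(span, top=3):
    """Return the `top` largest fields as (name, approximate_bytes), biggest first."""
    sizes = {
        key: len(json.dumps(value).encode("utf-8")) for key, value in span.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


span = {"operation": "checkout", "payload": "x" * 200_000}
if approximate_span_bytes(span) > MAX_SPAN_BYTES:
    print("span is likely over the limit; largest fields:", largest_fields(span))
```

Large payloads attached as tags or log fields are a common culprit, so truncating or omitting them at instrumentation time is usually the simplest fix.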

forward_spans.request.compressed_bytes.sum

Total bytes (compressed) emitted from the Microsatellite.

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many compressed bytes of span data the Microsatellite has emitted, which can be useful when looking at network egress costs.

Type: Count
Since: 2022-04-22_21-58-06Z

StatsD

<prefix>.<satellite_prefix>.forward_spans.request.compressed_bytes.sum.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.forward_spans.request.compressed_bytes.sum
{lightstep_project}
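To relate forward_spans.request.compressed_bytes.sum to network egress costs, you can convert a period's byte total into a cost estimate. This is a back-of-the-envelope sketch: the function name is hypothetical, the price per GiB is a placeholder you would replace with your provider's actual rate, and the calculation ignores pricing tiers and protocol overhead.

```python
def estimate_egress_cost(compressed_bytes, price_per_gib):
    """Estimate egress cost for a byte total at a flat price per GiB."""
    gib = compressed_bytes / (1024 ** 3)
    return gib * price_per_gib


# Hypothetical example: 2 GiB emitted at a placeholder price of 0.09 per GiB.
print(round(estimate_egress_cost(2 * 1024 ** 3, 0.09), 2))
```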

forward_spans.request.compressed_bytes.failed

The amount of span bytes (compressed) that failed to send from the Microsatellite to the Cloud Observability SaaS.

Values are cumulative and can be aggregated across Microsatellites and projects.

Consider monitoring this metric
Why monitor: The value of this metric represents how many bytes of compressed span data the Microsatellite is unable to send due to errors. The metric includes the code tag, whose value is the error code received.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. It might also be helpful to alert when the percentage of compressed bytes that fail to send exceeds 10% (adjust this to your tolerance), for example: forward_spans.request.compressed_bytes.failed / forward_spans.request.compressed_bytes.sum > 0.10

Type: Count
Since: 2022-04-22_21-58-06Z

StatsD

<prefix>.<satellite_prefix>.forward_spans.request.compressed_bytes.failed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.forward_spans.request.compressed_bytes.failed
{lightstep_project}
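The 10% threshold for forward_spans.request.compressed_bytes.failed, together with the code tag, could be evaluated along these lines. The sketch assumes you have already grouped the failed bytes for a time window by the value of the code tag and have the compressed_bytes.sum total for the same window. The function names and the sample codes are hypothetical.

```python
def failed_byte_ratio(failed_by_code, total_bytes):
    """Failed bytes (all codes combined) divided by compressed_bytes.sum."""
    if total_bytes <= 0:
        return 0.0
    return sum(failed_by_code.values()) / total_bytes


def dominant_error_code(failed_by_code):
    """Return the code responsible for the most failed bytes, or None."""
    if not failed_by_code:
        return None
    return max(failed_by_code, key=failed_by_code.get)


failed = {"503": 900, "429": 300}  # hypothetical codes and byte counts
if failed_byte_ratio(failed, 10_000) > 0.10:
    print("alert; most failures have code", dominant_error_code(failed))
```

Reporting the dominant code alongside the alert gives whoever responds a starting point for diagnosis.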

Updated Apr 6, 2021