LightStep Satellites generate several helpful StatsD-based metrics that you can send to any StatsD-compliant monitoring system. If you use Datadog, you can add tags to give your metrics more context, and you can use our pre-configured dashboard to quickly view performance.

Enabling StatsD Satellite Metrics

You can turn on metrics reporting when you configure your Satellites. Here are examples of that configuration using StatsD or Datadog, for both AWS/Debian and Docker.

StatsD Example

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_STATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_statsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"

Datadog Example

Docker

# Required
COLLECTOR_STATSD_HOST=127.0.0.1
COLLECTOR_STATSD_PORT=8125
COLLECTOR_STATSD_EXPORT_DOGSTATSD=true

# Recommended
COLLECTOR_STATSD_PREFIX=lightstep.prod.us-west-1

# Optional
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite-canary
COLLECTOR_STATSD_CLIENT_PREFIX=client-via-canary
COLLECTOR_STATSD_DOGSTATSD_TAGS="env:prod,pool:us-west-1,canary:true"

AWS or Debian

statsd:
    # Required
    host: 127.0.0.1
    port: 8125
    export_dogstatsd: true

    # Recommended
    prefix: "lightstep.prod.us-west-1"

    # Optional
    satellite_prefix: "satellite-canary"
    client_prefix: "client-via-canary"
    dogstatsd_tags: "env:prod,pool:us-west-1,canary:true"

Available Metrics

Following are the metrics that LightStep Satellites report. Important metrics that affect Satellite and LightStep health are noted, with advice on when to alert and how to resolve the issue.

A note about project names in metrics:
* Many of these metrics are automatically labeled with a LightStep project name, so the resulting time-series can be grouped by project, if desired.
* For basic StatsD metrics, the lightstep project becomes part of the metric name itself, for example: satellite.spans.received.my_lightstep_project_name
* For Datadog metrics, the project name is attached to the relevant metrics using a tag called lightstep_project. The syntax used below to indicate a tag is {tag_name}.
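
The naming difference between the two backends can be sketched in Python. This is illustrative only; the helper names and example values are placeholders, not part of the Satellite itself:

```python
def statsd_metric_name(prefix, satellite_prefix, metric, project):
    """Basic StatsD: the project name is appended to the metric name."""
    return f"{prefix}.{satellite_prefix}.{metric}.{project}"

def dogstatsd_metric(prefix, satellite_prefix, metric, project):
    """Datadog: the metric name stays fixed; the project travels as a tag."""
    name = f"{prefix}.{satellite_prefix}.{metric}"
    tags = [f"lightstep_project:{project}"]
    return name, tags

print(statsd_metric_name("lightstep.prod.us-west-1", "satellite-canary",
                         "spans.received", "my_lightstep_project_name"))
# lightstep.prod.us-west-1.satellite-canary.spans.received.my_lightstep_project_name
```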

client.spans.dropped

The number of spans dropped at the client because its outgoing queue is full while earlier spans are still being sent to a Satellite.

Values are cumulative and can be aggregated across Satellites and projects.

Consider monitoring this metric
Why Monitor: The value of this metric represents how many spans the client can’t send to Satellites because its outgoing queue is full. When tracer clients can’t send spans to Satellites, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. Check out this sample Datadog monitor.
Remediations: First try tuning the buffer size of the LightStep tracer client library by following these instructions. If the problem persists, audit your instrumentation to ensure you aren’t “over-instrumenting” by sending too many low value (or accidental) spans.
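
The "remains above 0 for an extended period" guidance can be expressed as a simple heuristic over successive counter samples. A minimal sketch, assuming evenly spaced samples of client.spans.dropped; the function name and window size are illustrative, not LightStep's official monitor definition:

```python
def should_alert(dropped_samples, window=5):
    """Fire when client.spans.dropped has been above 0 for the last
    `window` consecutive samples (a sustained-loss heuristic)."""
    if len(dropped_samples) < window:
        return False
    return all(count > 0 for count in dropped_samples[-window:])

# Sustained drops trigger the alert; intermittent blips do not.
print(should_alert([0, 0, 7, 3, 5, 2, 1]))  # True
print(should_alert([4, 0, 2, 0, 1]))        # False
```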

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<client_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<client_prefix>.spans.dropped
{lightstep_project}

satellite.access_tokens.invalid

The number of reports (i.e., batches of spans) that have been rejected by the Satellite due to an invalid access token.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-11-19_17-15-06Z

StatsD

<prefix>.<satellite_prefix>.access_tokens.invalid.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.access_tokens.invalid
{lightstep_project}

satellite.bytes.received.thrift

The total bytes of Thrift span traffic received over the network by the Satellite. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.bytes.received.thrift

satellite.bytes.received.grpc

The total bytes of gRPC span traffic received by the Satellite over the network. You can use this metric to tune your tracer if you’re seeing dropped spans from the client.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.bytes.received.grpc

satellite.spans.received

The total number of spans that the Satellite received and decoded. This value includes any spans that the Satellite may later drop due to insufficient resources. See <satellite_prefix>.spans.dropped for more information.

Values are cumulative and can be aggregated across Satellites and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.spans.received.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.received
{lightstep_project}

satellite.spans.dropped

The total number of spans that the Satellite dropped due to insufficient resources after being received and decoded. These spans are not indexed or added to the statistics for streams.

Values are cumulative and can be aggregated across Satellites and projects.

Consider monitoring this metric
Why Monitor: The value of this metric represents how many spans the Satellite is unable to process due to insufficient resources. When spans are not able to be processed, the product experience may be compromised due to incomplete traces and incomplete statistics.
Alert Thresholds: Any value above 0 indicates some amount of data loss. We recommend setting alerts for when the value remains above 0 for an extended period. It can also be helpful to alert when the percentage of received spans that are subsequently dropped exceeds 2% (adjust to your tolerance): satellite.spans.dropped / satellite.spans.received > 0.02. Check out these sample Datadog monitors.
Remediations: First verify that your bytes_per_project_overrides settings match the recommended values here, then check whether the recall number is consistent across your Satellites. If it isn't, check your load balancer settings. If the problem persists, try adding more Satellites.
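
The ratio-based threshold above can be checked directly from the two counters. A minimal sketch; the function name and 2% default are illustrative and should be tuned to your own tolerance:

```python
def drop_ratio_exceeded(dropped, received, threshold=0.02):
    """Return True when the fraction of received spans that were
    subsequently dropped exceeds the threshold (2% by default,
    matching the guidance above)."""
    if received == 0:
        return False  # no traffic; nothing to alert on
    return dropped / received > threshold

print(drop_ratio_exceeded(3, 100))  # True  (3% dropped)
print(drop_ratio_exceeded(1, 100))  # False (1% dropped)
```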

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.spans.dropped.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.dropped
{lightstep_project}

satellite.index.queue.length

The number of reports (i.e., batches of spans) that have been read from the network and are currently waiting to be indexed.

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.index.queue.length.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.length
{lightstep_project}

satellite.index.queue.bytes

The total size, in bytes, of the reports that are currently waiting to be indexed (the byte-size counterpart of index.queue.length).

This value is instantaneous (non-cumulative).

Type: Gauge
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.index.queue.bytes.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.index.queue.bytes
{lightstep_project}

satellite.spans.indexed

The number of spans that the Satellite has indexed and added to stream statistics.

Values are cumulative and can be aggregated across instances and projects.

Type: Count
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.spans.indexed.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.spans.indexed
{lightstep_project}

satellite.current.recall.seconds

The number of seconds between now and the oldest span still indexed in the Satellite’s memory. This indicates how much history is currently available to facilitate trace assembly for the UI.

Values are instantaneous (non-cumulative) and aggregation across instances and/or projects is only meaningful with a “minimum” operator.

Consider monitoring this metric
Why Monitor: The value of this metric represents how much history is currently available to facilitate trace assembly. If this value drops too low, the product experience will be compromised.
Alert Thresholds: A value below 3 minutes signals a degraded state. A value between 3 and 5 minutes signals partial degradation. Check out this sample Datadog monitor.
Remediations: First verify that your bytes_per_project_overrides settings match the recommended values here, then check whether the recall number is consistent across your Satellites. If it isn't, check your load balancer settings. If the problem persists, try adding more Satellites.
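
The 3-minute and 5-minute thresholds above can be turned into a simple health classification. A sketch under the stated thresholds; the function and label names are illustrative. Note that when aggregating across Satellites, only the minimum recall is meaningful:

```python
def recall_health(recall_seconds):
    """Classify satellite.current.recall.seconds: below 3 minutes is
    degraded, 3 to 5 minutes is partially degraded, above that is ok."""
    if recall_seconds < 180:
        return "degraded"
    if recall_seconds < 300:
        return "partially degraded"
    return "ok"

# Across a fleet, take the minimum recall before classifying.
fleet_recall = [650, 420, 230]
print(recall_health(min(fleet_recall)))  # partially degraded
```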

Type: Gauge
Since: 2018-10-03_18-47-12Z

StatsD

<prefix>.<satellite_prefix>.current.recall.seconds.<lightstep_project>

Datadog

<prefix>.<satellite_prefix>.current.recall.seconds
{lightstep_project}