Datadog Dashboard Definition

If you are exporting your StatsD metrics from the Satellites to Datadog, you can use the dashboard definition below as a good starting point for visualizing the operation of your Satellites.

If you are using the standard prefix values shown in the code samples below, you can copy dash.json directly.

Docker

. . .
COLLECTOR_STATSD_EXPORT_DOGSTATSD=true
COLLECTOR_STATSD_PREFIX=lightstep
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite
COLLECTOR_STATSD_CLIENT_PREFIX=client

AWS or Debian

statsd:
    . . .
    prefix: "lightstep"
    satellite_prefix: "satellite"
    client_prefix: "client"
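
These prefixes determine the metric names that the dashboard and monitor queries reference. As a rough illustration, the names in those queries look like the top-level prefix, the satellite or client prefix, and a metric suffix joined with dots. The helper name `metric_name` is hypothetical and not part of any Lightstep or Datadog tool.

```python
def metric_name(prefix: str, component_prefix: str, suffix: str) -> str:
    """Join the configured prefixes and a metric suffix with dots."""
    return ".".join([prefix, component_prefix, suffix])

# Standard prefix values from the configuration above.
PREFIX = "lightstep"
SATELLITE_PREFIX = "satellite"
CLIENT_PREFIX = "client"

received = metric_name(PREFIX, SATELLITE_PREFIX, "spans.received")
client_dropped = metric_name(PREFIX, CLIENT_PREFIX, "spans.dropped")
print(received)        # lightstep.satellite.spans.received
print(client_dropped)  # lightstep.client.spans.dropped
```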

If you are instead using custom prefix values, you can generate a correctly populated dash.json from the dash.json.mustache template below by following the instructions in Generating Files with Mustache.

dash.json

{
  "graphs": [
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.satellite.spans.received{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of spans received by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.satellite.spans.indexed{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of spans indexed by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.client.spans.dropped{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of client dropped spans by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.satellite.spans.dropped{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of satellite dropped spans by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.satellite.access_tokens.invalid{$pool}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of reports with invalid access token"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:lightstep.satellite.bytes.received.grpc{$pool}.as_count().rollup(sum), sum:lightstep.satellite.bytes.received.thrift{$pool}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of gRPC/Thrift bytes received, all projects"
    },
    {
      "definition": {
        "yaxis": {
          "scale": "log",
          "includeZero": false
        },
        "markers": [
          {
            "type": "error dashed",
            "value": "y = 180",
            "label": "Minimum Recommended Recall"
          }
        ],
        "viz": "timeseries",
        "requests": [
          {
            "q": "min:lightstep.satellite.current.recall.seconds{$pool,$project} by {pool,lightstep_project}.as_count().rollup(min)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Satellite recall, minimum by project in seconds (log scale)"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "max:lightstep.satellite.index.queue.length{$pool,$project} by {lightstep_project}.as_count().rollup(max)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Spans queued for indexing by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "max:lightstep.satellite.index.queue.bytes{$pool,$project} by {lightstep_project}.as_count().rollup(max)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Spans queued for index (in bytes) by project"
    }
  ],
  "template_variables": [
    {
      "default": "*",
      "prefix": "lightstep_project",
      "name": "project"
    },
    {
      "default": "*",
      "prefix": "pool",
      "name": "pool"
    }
  ],
  "title": "Lightstep > Recommended Satellite Dashboard",
  "description": "See https://docs.lightstep.com/docs/satellite-metrics for more information about each metric."
}

dash.json.mustache

{
  "graphs": [
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{satellite_prefix}}.spans.received{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of spans received by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{satellite_prefix}}.spans.indexed{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of spans indexed by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{client_prefix}}.spans.dropped{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of client dropped spans by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{satellite_prefix}}.spans.dropped{$pool,$project} by {lightstep_project}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of satellite dropped spans by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{satellite_prefix}}.access_tokens.invalid{$pool}.as_count().rollup(sum)",
            "style": {
              "palette": "warm",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of reports with invalid access token"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "sum:{{prefix}}.{{satellite_prefix}}.bytes.received.grpc{$pool}.as_count().rollup(sum), sum:{{prefix}}.{{satellite_prefix}}.bytes.received.thrift{$pool}.as_count().rollup(sum)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "bars"
          }
        ]
      },
      "title": "# of gRPC/Thrift bytes received, all projects"
    },
    {
      "definition": {
        "yaxis": {
          "scale": "log",
          "includeZero": false
        },
        "markers": [
          {
            "type": "error dashed",
            "value": "y = 180",
            "label": "Minimum Recommended Recall"
          }
        ],
        "viz": "timeseries",
        "requests": [
          {
            "q": "min:{{prefix}}.{{satellite_prefix}}.current.recall.seconds{$pool,$project} by {pool,lightstep_project}.as_count().rollup(min)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Satellite recall, minimum by project in seconds (log scale)"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "max:{{prefix}}.{{satellite_prefix}}.index.queue.length{$pool,$project} by {lightstep_project}.as_count().rollup(max)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Spans queued for indexing by project"
    },
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {
            "q": "max:{{prefix}}.{{satellite_prefix}}.index.queue.bytes{$pool,$project} by {lightstep_project}.as_count().rollup(max)",
            "style": {
              "palette": "dog_classic",
              "width": "normal",
              "type": "solid"
            },
            "type": "line"
          }
        ]
      },
      "title": "Spans queued for index (in bytes) by project"
    }
  ],
  "template_variables": [
    {
      "default": "*",
      "prefix": "lightstep_project",
      "name": "project"
    },
    {
      "default": "*",
      "prefix": "pool",
      "name": "pool"
    }
  ],
  "title": "Lightstep > Recommended Satellite Dashboard",
  "description": "See https://docs.lightstep.com/docs/satellite-metrics for more information about each metric."
}
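
Before uploading a generated dash.json, it can help to confirm that every graph query references the prefix you expect. The following is a minimal sketch using only the Python standard library. The function name `queries_missing_prefix` and the trimmed inline sample are illustrative assumptions; in practice you would load your own dash.json with `json.load`.

```python
import json

def queries_missing_prefix(dashboard: dict, prefix: str) -> list:
    """Return the graph title for each request whose query lacks `prefix`."""
    missing = []
    for graph in dashboard.get("graphs", []):
        for request in graph["definition"]["requests"]:
            # Metric names follow the aggregator, e.g. "sum:<prefix>.<...>".
            if f":{prefix}." not in request["q"]:
                missing.append(graph["title"])
    return missing

# Trimmed sample with the same shape as dash.json (one graph only).
sample = json.loads("""
{
  "graphs": [
    {
      "definition": {
        "viz": "timeseries",
        "requests": [
          {"q": "sum:lightstep.satellite.spans.received{$pool,$project} by {lightstep_project}.as_count().rollup(sum)"}
        ]
      },
      "title": "# of spans received by project"
    }
  ]
}
""")

print(queries_missing_prefix(sample, "lightstep"))         # []
print(queries_missing_prefix(sample, "lightstep_custom"))  # ['# of spans received by project']
```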

Creating a Dashboard Using the Datadog API

Once you have downloaded or generated a dash.json file that contains the proper prefixes, you can use the Datadog API to create the dashboard in your Datadog project.

The command below requires the DATADOG_API_KEY and DATADOG_APP_KEY environment variables. Both keys can be found or created in the Datadog project settings.

curl -X POST -H "Content-type: application/json" -d @dash.json "https://app.datadoghq.com/api/v1/dash?api_key=${DATADOG_API_KEY}&application_key=${DATADOG_APP_KEY}"
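
The same request can be sketched in Python with the standard library. This mirrors the curl command above (same endpoint and query parameters) and makes no claim about other Datadog API versions. The function name `build_dashboard_request` and the placeholder dashboard body are assumptions for illustration. The sketch only builds the request; the commented-out `urlopen` call is what would send it.

```python
import json
import os
import urllib.parse
import urllib.request

def build_dashboard_request(dashboard: dict, api_key: str, app_key: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    query = urllib.parse.urlencode({"api_key": api_key, "application_key": app_key})
    return urllib.request.Request(
        url=f"https://app.datadoghq.com/api/v1/dash?{query}",
        data=json.dumps(dashboard).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )

# Keys come from the same environment variables the curl command uses.
# The dashboard body here is a placeholder; load dash.json in real use.
request = build_dashboard_request(
    {"title": "Lightstep > Recommended Satellite Dashboard", "graphs": []},
    os.environ.get("DATADOG_API_KEY", ""),
    os.environ.get("DATADOG_APP_KEY", ""),
)
print(request.get_method(), request.full_url.split("?")[0])
# with urllib.request.urlopen(request) as response:  # uncomment to send
#     print(response.status)
```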

Datadog Monitor Definitions

If you are exporting your StatsD metrics from the Satellites to Datadog, we recommend complementing the above dashboard with some basic monitoring. The following sections provide sample Datadog monitor definitions you can use to create alerts on useful metrics.

If you are using the standard prefix values shown in the code samples below, you can copy the *.json files directly.

Docker

. . .
COLLECTOR_STATSD_EXPORT_DOGSTATSD=true
COLLECTOR_STATSD_PREFIX=lightstep
COLLECTOR_STATSD_SATELLITE_PREFIX=satellite
COLLECTOR_STATSD_CLIENT_PREFIX=client

AWS or Debian

statsd:
    . . .
    prefix: "lightstep"
    satellite_prefix: "satellite"
    client_prefix: "client"

If you are instead using custom prefix values, you can generate correctly populated JSON files from the .json.mustache templates below by following the instructions in Generating Files with Mustache.

Lightstep Client Spans Dropped: Too many spans being dropped

client_dropped.json

{
  "name": "Lightstep Client Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "avg(last_30m):sum:lightstep.client.spans.dropped{*} by {lightstep_project}.as_rate() > 0",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#client-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Client spans dropped has returned to 0 (normal level) {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0
    }
  }
}

client_dropped.json.mustache

{
  "name": "Lightstep Client Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "avg(last_30m):sum:{{prefix}}.{{client_prefix}}.spans.dropped{*} by {lightstep_project}.as_rate() > 0",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#client-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Client spans dropped has returned to 0 (normal level) {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0
    }
  }
}

Lightstep Satellite Spans Dropped: Too many spans being dropped (by percentage)

satellite_dropped_percent.json

{
  "name": "Lightstep Satellite Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "sum(last_30m):sum:lightstep.satellite.spans.dropped{*} by {lightstep_project}.as_count() / sum:lightstep.satellite.spans.received{*} by {lightstep_project}.as_count() > 0.02",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Satellite spans dropped has returned to a normal level. {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0.02
    }
  }
}

satellite_dropped_percent.json.mustache

{
  "name": "Lightstep Satellite Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "sum(last_30m):sum:{{prefix}}.{{satellite_prefix}}.spans.dropped{*} by {lightstep_project}.as_count() / sum:{{prefix}}.{{satellite_prefix}}.spans.received{*} by {lightstep_project}.as_count() > 0.02",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Satellite spans dropped has returned to a normal level. {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0.02
    }
  }
}
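
To make the 0.02 threshold concrete: the query divides the spans dropped by the spans received over the last 30 minutes, per project, and alerts when that ratio exceeds 2%. A small worked sketch follows. The span counts are made-up illustrative numbers, and `drop_ratio` is a hypothetical helper.

```python
CRITICAL_RATIO = 0.02  # matches "critical": 0.02 in the monitor definition

def drop_ratio(dropped: int, received: int) -> float:
    """Ratio of dropped spans to received spans, as in the monitor query."""
    if received == 0:
        return 0.0  # assumption: no traffic is treated as no drops
    return dropped / received

# Illustrative 30-minute totals for two hypothetical projects.
healthy = drop_ratio(dropped=1_500, received=100_000)    # 1.5%
unhealthy = drop_ratio(dropped=2_500, received=100_000)  # 2.5%

print(healthy > CRITICAL_RATIO)    # False: 1.5% is under the threshold
print(unhealthy > CRITICAL_RATIO)  # True: 2.5% would trigger the alert
```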

Lightstep Satellite Spans Dropped: Too many spans being dropped (by count)

satellite_dropped_count.json

{
  "name": "Lightstep Satellite Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "avg(last_30m):sum:lightstep.satellite.spans.dropped{*} by {lightstep_project}.as_rate() > 0",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Satellite spans dropped has returned to a normal level. {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0
    }
  }
}

satellite_dropped_count.json.mustache

{
  "name": "Lightstep Satellite Spans Dropped: Too many spans being dropped",
  "type": "query alert",
  "query": "avg(last_30m):sum:{{prefix}}.{{satellite_prefix}}.spans.dropped{*} by {lightstep_project}.as_rate() > 0",
  "message": "{{#is_alert}} To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-spans-dropped).\n{{/is_alert}}\n\n{{#is_recovery}} Satellite spans dropped has returned to a normal level. {{/is_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 0
    }
  }
}

Lightstep Satellite Recall: Recall is too low

satellite_recall.json

{
  "name": "Lightstep Satellite Recall: Recall is too low",
  "type": "query alert",
  "query": "avg(last_30m):min:lightstep.satellite.current.recall.seconds{*} by {lightstep_project} < 180",
  "message": "{{#is_alert}} Satellite recall is too low. Expect service degradation. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_alert}}\n\n{{#is_alert_recovery}} Satellite recall has returned to an acceptable level, though partial service degradation may still exist. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_alert_recovery}}\n\n{{#is_warning}} Satellite recall is too low. Expect partial service degradation. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_warning}}\n\n{{#is_warning_recovery}} Satellite recall has returned to a healthy level.{{/is_warning_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 180,
      "warning": 300
    }
  }
}

satellite_recall.json.mustache

{
  "name": "Lightstep Satellite Recall: Recall is too low",
  "type": "query alert",
  "query": "avg(last_30m):min:{{prefix}}.{{satellite_prefix}}.current.recall.seconds{*} by {lightstep_project} < 180",
  "message": "{{#is_alert}} Satellite recall is too low. Expect service degradation. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_alert}}\n\n{{#is_alert_recovery}} Satellite recall has returned to an acceptable level, though partial service degradation may still exist. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_alert_recovery}}\n\n{{#is_warning}} Satellite recall is too low. Expect partial service degradation. To fix, follow [these instructions](https://docs.lightstep.com/docs/satellite-metrics#satellite-current-recall-seconds).{{/is_warning}}\n\n{{#is_warning_recovery}} Satellite recall has returned to a healthy level.{{/is_warning_recovery}}\n\nNotify: @ops-oncall",
  "tags": [
    "LightStep"
  ],
  "options": {
    "notify_audit": false,
    "locked": false,
    "timeout_h": 0,
    "new_host_delay": 300,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": "0",
    "escalation_message": "",
    "no_data_timeframe": null,
    "include_tags": true,
    "thresholds": {
      "critical": 180,
      "warning": 300
    }
  }
}
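
The two thresholds in the recall monitor (critical at 180 seconds, warning at 300 seconds) split recall into three bands. The sketch below shows how an averaged recall value would map to a state under those thresholds. The function `recall_state` is hypothetical, and it assumes the warning band uses the same strict less-than comparison that the query uses for the critical threshold.

```python
CRITICAL_SECONDS = 180  # "critical" threshold in satellite_recall.json
WARNING_SECONDS = 300   # "warning" threshold in satellite_recall.json

def recall_state(avg_recall_seconds: float) -> str:
    """Map an averaged recall value to the state the thresholds imply."""
    if avg_recall_seconds < CRITICAL_SECONDS:
        return "alert"
    if avg_recall_seconds < WARNING_SECONDS:
        return "warning"
    return "ok"

for value in (120, 240, 600):
    print(value, recall_state(value))
# 120 alert
# 240 warning
# 600 ok
```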

Creating Monitors Using the Datadog API

Once you have downloaded or generated the relevant monitor definitions (for example, satellite_recall.json) that contain the proper prefixes, you can use the Datadog API to create the monitors in your Datadog project.

The command below requires the DATADOG_API_KEY and DATADOG_APP_KEY environment variables. Both keys can be found or created in the Datadog project settings.

To create a monitor based on the satellite_recall.json definition, for example, run the following:

curl -X POST -H "Content-type: application/json" -d @satellite_recall.json "https://app.datadoghq.com/api/v1/monitor?api_key=${DATADOG_API_KEY}&application_key=${DATADOG_APP_KEY}"

Generating Files with Mustache

If you are using custom prefix values, you can generate correctly populated JSON dashboard and monitor definitions from the supplied mustache templates (the code samples with the .json.mustache suffix in these docs). To do so, first provide the custom prefixes you are using in a data file.

data.json

{
  "prefix": "lightstep_custom",
  "satellite_prefix": "satellite_custom",
  "client_prefix": "client_custom"
}

Then you can use a Go implementation of mustache to generate the correctly populated JSON. For example, you can generate dash.json with your custom prefix values by running the following command.

mustache data.json dash.json.mustache > dash.json
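
If the mustache command-line tool is not available, the three prefix placeholders can also be filled with a plain string substitution. This is a minimal sketch, not a Mustache implementation: it replaces only the tags named in the data (here `{{prefix}}`, `{{satellite_prefix}}`, and `{{client_prefix}}`) and leaves every other double-brace token, such as the `{{#is_alert}}` markers in the monitor messages, untouched. The function name `fill_prefixes` is hypothetical, and the template string is a shortened excerpt of client_dropped.json.mustache.

```python
import json

def fill_prefixes(template: str, data: dict) -> str:
    """Replace {{key}} tags for the keys in `data`; leave other tags as-is."""
    rendered = template
    for key, value in data.items():
        rendered = rendered.replace("{{" + key + "}}", value)
    return rendered

# Same values as data.json above.
data = {
    "prefix": "lightstep_custom",
    "satellite_prefix": "satellite_custom",
    "client_prefix": "client_custom",
}

# Shortened excerpt of client_dropped.json.mustache (message abbreviated).
template = (
    '{"query": "avg(last_30m):sum:{{prefix}}.{{client_prefix}}.spans.dropped{*}'
    ' by {lightstep_project}.as_rate() > 0",'
    ' "message": "{{#is_alert}} Spans dropped. {{/is_alert}}"}'
)

monitor = json.loads(fill_prefixes(template, data))
print(monitor["query"])
print(monitor["message"])
```

To process a whole file, read the template with `open(path).read()` and write the result back out. Because a full Mustache renderer may interpret tokens such as `{{#is_alert}}` as section tags, it is worth checking the message field of any generated monitor definition, whichever tool you use.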