Two key components of the LightStep architecture are involved with LightStep’s performance: the LightStep tracers and the Satellites. LightStep tracers collect span data from your instrumentation, hold it in a buffer, and then send that data to the LightStep Satellites, based on their configured reporting period. Satellites collect 100% of that data and hold on to it for a period of time (called the recall window). The SaaS LightStep engine queries the Satellites as you use the UI to retrieve data necessary to build the span data into traces, service diagrams, Streams, and other meaningful reports.
LightStep performance depends on the ingress and egress of data from the tracers to the Satellites, the amount of memory the Satellites have to store that data, and the length of the recall window. If the tracers collect more data than their buffers can hold, they may drop spans. If the Satellites don’t have enough memory to store the data sent by the tracers, they may also drop spans. And if the recall window is too short, you may not see enough meaningful information in the UI.
The LightStep tracer client libraries are engineered for minimal impact on the processes they are tracing while still collecting and reporting the tracing data you intend to capture. Network use is managed by buffering the data to be reported: spans and the associated tags, logs, and payloads. Buffering shifts some burden onto memory, which holds the buffered data until the client flushes the buffer and reports its contents to the Satellite. You set the buffer size when you instantiate the tracer in your code. If the size is too small, you may start to see the client tracer dropping spans.
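The relationship between buffer size, reporting period, and dropped spans can be illustrated with a toy model. This is a minimal sketch and not the LightStep client: the class `BoundedSpanBuffer`, the function `simulate`, and all parameter names are hypothetical, and the sketch assumes that a full buffer rejects new spans until the next flush.

```python
from collections import deque


class BoundedSpanBuffer:
    """Toy model of a tracer-side span buffer (hypothetical, not the LightStep client)."""

    def __init__(self, max_spans):
        self.max_spans = max_spans
        self.spans = deque()
        self.dropped = 0

    def record(self, span):
        # Assumption: when the buffer is full, the incoming span is dropped and counted.
        if len(self.spans) >= self.max_spans:
            self.dropped += 1
            return False
        self.spans.append(span)
        return True

    def flush(self):
        # Return everything buffered so far and empty the buffer,
        # as a client would when it reports to a Satellite.
        batch = list(self.spans)
        self.spans.clear()
        return batch


def simulate(max_spans, spans_per_period, periods):
    """Return (reported, dropped) after a number of reporting periods."""
    buf = BoundedSpanBuffer(max_spans)
    reported = 0
    for _ in range(periods):
        for i in range(spans_per_period):
            buf.record({"id": i})
        reported += len(buf.flush())
    return reported, buf.dropped


if __name__ == "__main__":
    # A buffer smaller than the per-period span volume drops spans.
    print(simulate(max_spans=100, spans_per_period=150, periods=4))
    # A buffer at least as large as the per-period volume drops none.
    print(simulate(max_spans=200, spans_per_period=150, periods=4))
```

In this model, drops disappear once the buffer holds at least as many spans as the service produces in one reporting period, which is why a larger buffer or a shorter reporting period both reduce client-side drops.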
Satellites are responsible for collecting the spans generated by the tracers, and then processing and temporarily storing that data during trace assembly for the UI. When a Satellite receives spans, it places them in a temporary buffer for a period of time known as the recall window. The recall window can’t be configured directly; it depends on the amount of span traffic sent from all tracer clients and on the available Satellite memory: more traffic shortens it, and more memory lengthens it. When either the memory allocation is too low or there are too few Satellites, you may start to see Satellites dropping spans. Longer recall can be achieved either by reducing the amount of span traffic or by increasing the available memory of Satellites in the pool (either by increasing the available memory per instance, or the overall number of instances). Additionally, the Satellites in a pool may have varied recall windows. Balancing that time period between Satellites also protects against dropped spans.
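As a rough illustration of how traffic and memory trade off, the recall window can be approximated as pool memory divided by the rate at which span data arrives. This is a back-of-the-envelope sketch under a simplifying assumption (all memory is used for span storage, with no overhead), not LightStep's actual sizing formula; the function name and parameters are hypothetical.

```python
def estimate_recall_seconds(satellites, memory_bytes_per_satellite,
                            spans_per_second, avg_span_bytes):
    """Approximate recall window in seconds for a Satellite pool.

    Assumption: recall = total pool memory / incoming bytes per second.
    Real Satellites have overhead, so treat the result as an upper bound.
    """
    if satellites <= 0 or memory_bytes_per_satellite <= 0:
        raise ValueError("pool must have positive memory")
    if spans_per_second <= 0 or avg_span_bytes <= 0:
        raise ValueError("traffic must be positive")
    total_memory = satellites * memory_bytes_per_satellite
    bytes_per_second = spans_per_second * avg_span_bytes
    return total_memory / bytes_per_second


if __name__ == "__main__":
    base = estimate_recall_seconds(2, 1_000_000_000, 10_000, 500)
    more_satellites = estimate_recall_seconds(4, 1_000_000_000, 10_000, 500)
    less_traffic = estimate_recall_seconds(2, 1_000_000_000, 5_000, 500)
    print(base, more_satellites, less_traffic)
```

Under this model, doubling the number of Satellites and halving the span traffic lengthen the estimated recall window by the same factor, which matches the two options described above.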
If you’re using the LightStep free trial, then LightStep manages your Satellites. If you see dropped spans from the Satellite, please contact customer service.
LightStep provides a number of different ways to monitor tracer and Satellite performance, especially regarding dropped spans:
Reporting Status Dashboard:
See, per project and by service, the platform, the library, the number of instances of that service currently reporting, the number of spans dropped by the client tracer and by the Satellite, and the pool those Satellites belong to.
Satellite Pool Report:
See a high-level overview of all Satellite pools and individual Satellites, their current performance and configuration, and the projects reporting into them.
Provides a health check for a Satellite, along with configuration information.
StatsD Reporting Metrics:
Provides detailed StatsD metrics that you can import into a monitoring tool.