Lightstep ingests metrics from a number of sources as time series and saves them to a time series database. When you use Lightstep to view that data, it transforms it into data points plotted on a chart. Different sources send different data. Some may roll up to a 60-second average. Others might report multiple individual points and timestamps, or send an array of points. Some may even aggregate the data before sending it. Lightstep stores the data in the form in which it was sent. When you query that data, Lightstep aligns the timestamps so that the data is transformed uniformly, based on the metric kind and the time series operators chosen for the chart. Once that happens, the aggregation method you select for the chart determines the values to display.
A time series is a sequence of measurements over a period of time. For example, if you recorded the temperature outside every hour, the time series would be the value at each hour: 8:00 am - 55°, 9:00 am - 57°, 10:00 am - 58°, and so on. Time series allow you to easily see changes over time when displayed as a chart.
Here’s an example of a time series for the http.request metric being reported by two agents (the first number in each array is a timestamp):
[1607986123, 50.0] [1607986125, 100.0]
When displaying time series data, you may want the visualization of the data to be different depending on what you’re measuring. For example, if you’re measuring the temperature, you want to see the actual temperature at a particular point in time. Seeing just the change between the points in time would not be very helpful. A 5° change between 50° and 55° is not a big deal, but it might be if it was between 95° and 100°. But if you were measuring HTTP requests to an endpoint, then you might only want to know about the change, or even the rate of change since the last point in time. Did the request rate go up or down? And by how much?
Because of these differences, Lightstep supports three metric kinds: gauges and two types of counters, delta and cumulative.
Metrics must be labeled as their kind before they are ingested.
Gauges represent an observed value at a specific point in time or over a specified range of time. Temperature readings are an example of a gauge metric. CPU usage is another example; you want to know exactly how much of an available resource is being used at a given point in time. Gauges are best when you don’t much care about the degree of change over time - you want the actual number at that time.
Deltas show how the values change from one reporting period (point on the graph) to the next. HTTP requests is an example of a delta metric. You want to see if requests are going up or down, and by how much.
Cumulative metrics add their value to the last value. They count the total number of things at a specific point in time, but as opposed to deltas, each value uses the same “start” timestamp to determine the value. An example of a cumulative metric is total web page hits. The value at each point in time increases from the last value, and you want to know how many of something you have accumulated at a given point in time.
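The relationship between the two counter kinds can be sketched in a few lines. This is an illustration only, not Lightstep’s implementation: the function name `cumulative_to_deltas` and the sample values are hypothetical, and the sketch ignores counter resets.

```python
def cumulative_to_deltas(values):
    """Turn cumulative readings into the change between consecutive readings."""
    deltas = []
    previous = None
    for value in values:
        if previous is not None:
            deltas.append(value - previous)
        previous = value
    return deltas


# Hypothetical cumulative "total web page hits" readings.
page_hits_total = [100, 130, 130, 175]
page_hits_delta = cumulative_to_deltas(page_hits_total)
print(page_hits_delta)
```

A gauge needs no such conversion: each reading is already the value you want to see.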
A metric type represents the value type being reported. Lightstep supports the following metric types:
A distribution type returns a set of values for a point in time and performs aggregation on those values before charting the points. Lightstep supports percentile aggregation and can display the 50th, 95th, 99th, and 99.9th percentiles.
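To make percentile aggregation concrete, here is a minimal sketch. The text does not say which percentile method Lightstep uses, so the nearest-rank method below is an assumption, and the `percentile` helper and the sample latencies are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty distribution is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical distribution reported for one point in time: 1 ms through 100 ms.
latencies_ms = list(range(1, 101))
for p in (50, 95, 99, 99.9):
    print(p, percentile(latencies_ms, p))
```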
Existing Lightstep customers interested in tracking distribution metrics need to opt in. For new Lightstep customers, this feature is already enabled in your account.
Time Series Operators
Time series operators determine how a value is reported over time.
The latest operator can be used only with gauge-type metrics. It stores the last value reported to the database for a time period. For example, if Lightstep receives a gauge time series array of [71, 71, 71.5, 72] for a time period, Lightstep stores the value 72.
The count operator computes the total change since the previous value was reported. If Lightstep ingests a time series array for a delta metric kind of [1,1,1,2,2,2,3,3] and the time series operator is set to count, Lightstep stores the value 15, the total amount that value changed since it was last reported. Counts are useful when you want to know how much something has changed over a period of time. If you use count for something like HTTP requests, you’d be able to see how the number of requests has changed at a particular point in time.
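The two operators described so far can be sketched as small functions applied to the arrays from the examples. The function names are hypothetical, and this is a simplified illustration under the assumption that each array holds the values received for one time period.

```python
def latest(points):
    """Gauge operator: keep the last value reported in the period."""
    return points[-1]


def count(points):
    """Delta operator: total the changes reported in the period."""
    return sum(points)


gauge_points = [71, 71, 71.5, 72]
delta_points = [1, 1, 1, 2, 2, 2, 3, 3]

print(latest(gauge_points))
print(count(delta_points))
```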
Rather than telling you how much something has changed at a particular point in time, rate tells you how many happened over a time period. Think of a car that went 50 miles in an hour (you’re interested in the number of miles) versus the fact that it went 50 miles per hour (you’re interested in how fast it goes).
In the requests time series above ([1,1,1,2,2,2,3,3]), if the time series operator is set to rate and the time period is 10 seconds, instead of storing 15 (the count), Lightstep stores the value 1.5 (15 divided by 10 seconds).
When using counter metrics (delta or cumulative), you can choose between displaying that metric as a count or as a rate when you create your chart.
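The difference between count and rate comes down to one division. The sketch below assumes the rate is the period’s count divided by the length of the period in seconds; the `rate` function and its parameter names are hypothetical.

```python
def rate(points, period_seconds):
    """Per-second rate: the period's total change divided by its length."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return sum(points) / period_seconds


requests = [1, 1, 1, 2, 2, 2, 3, 3]
requests_per_second = rate(requests, period_seconds=10)
print(requests_per_second)
```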
Alignment is the process of taking the metrics stored in the database and converting them into the data points shown on the chart by determining the granularity (the time interval between those points). The alignment is based on the time period currently being displayed and the time series operator applied to the metric.
Lightstep displays 120 data points in every chart. It queries the database with the number of seconds each data point represents, and the database returns a “bucket” of time series aligned to those points. So if a chart is displaying the last hour, there will be 2 data points per minute, or one every 30 seconds. The timestamped values from the database that are closest to each 30-second interval are used for that data point.
The maximum interval is 45 seconds.
What exact value is displayed at that data point depends on the aggregation method you choose.
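The bucket arithmetic behind alignment can be sketched as follows. The 120-point figure comes from the text; the helper names and the sample points are hypothetical, and picking the single closest stored point is a simplification of whatever the database does internally.

```python
NUM_POINTS = 120  # data points per chart, as stated in the text


def seconds_per_point(window_seconds, num_points=NUM_POINTS):
    """Number of seconds each chart data point represents."""
    return window_seconds / num_points


def nearest_value(points, target_ts):
    """Value of the stored (timestamp, value) pair closest to target_ts."""
    return min(points, key=lambda point: abs(point[0] - target_ts))[1]


one_hour = 60 * 60
print(seconds_per_point(one_hour))

# Hypothetical stored points, with timestamps in seconds.
stored = [(0, 1.0), (28, 2.0), (61, 3.0)]
print(nearest_value(stored, 30))
```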
Aggregation determines the actual value to display. Lightstep determines the value for a point in time based on the aggregation method you choose for the chart:
- Count (non-null): The number of values found that are not null. For example, given the values of [10, 15, null, 50] the count is 3.
- Count (non-zero): The number of values found that are not zero (null is counted). For example, given the values of [10, 15, null, 0, 50] the count is 4.
- Mean: The average (sum of the data divided by the count) of the data.
For example, given the values of [10, 15, 50] the mean is 25.
- Min: The lowest point in the data.
For example, given the values of [10, 15, 50] the min is 10.
- Max: The highest point in the data.
For example, given the values of [10, 15, 50] the max is 50.
- Sum: The total of all points in the data.
For example, given the values of [10, 15, 50] the sum is 75.
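The aggregation methods in the list above can be sketched as one function. The method names passed as strings are hypothetical, and skipping null values for mean, min, max, and sum is an assumption, because the text only states how the two count methods treat null.

```python
def aggregate(values, method):
    """Reduce the values found for one data point to a single number."""
    non_null = [v for v in values if v is not None]
    if method == "count_non_null":
        return len(non_null)
    if method == "count_non_zero":
        # null (None) is counted; only zeros are excluded.
        return sum(1 for v in values if v != 0)
    if method == "mean":
        return sum(non_null) / len(non_null)
    if method == "min":
        return min(non_null)
    if method == "max":
        return max(non_null)
    if method == "sum":
        return sum(non_null)
    raise ValueError(f"unknown aggregation method: {method}")


print(aggregate([10, 15, None, 50], "count_non_null"))
print(aggregate([10, 15, None, 0, 50], "count_non_zero"))
print(aggregate([10, 15, 50], "mean"))
```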
Distribution type metrics are automatically summed and then aggregated into percentiles.
All this computation so far returns the time series as single values for the entire metric (for example, on a line chart, as a single line). You can use tags on your metric data (also called labels in some metric sources) to filter the query and restrict what gets displayed or to group the query by tag values.
Metric data often uses tags to annotate the data with descriptions that help in getting your data to tell a more exact story. For example, metrics might use the service tag to show what service a metric was emitted from, or the customer tag to show which customer made the request. You can use tags from your metric data to filter your query (to explicitly include/exclude metrics with certain tag/value pairs) and to break out a single line of data points into separate data points for each value of a tag.
Filter the Query
For a metric like HTTP requests, a single chart probably isn’t going to be very helpful. You’d see large numbers, and if there’s a troubling spike, you probably wouldn’t be able to pinpoint where the issue is coming from or who it’s affecting. By using tags to filter your query, you can create concise charts that include only the data that’s important for that chart.
When you filter a query, Lightstep retrieves only the data from the metric database that matches the tag/value pair (include) or that doesn’t match it (exclude).
For example, as part of your service level agreements with your VIP accounts, you need to keep a close eye on performance for them. You might use the customer tag to create a chart for each VIP customer.
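Include and exclude filters can be sketched as a predicate over each time series’ tags. Everything here is hypothetical (the data layout, the function names, and the customer names) and only illustrates the matching rule, not how Lightstep queries its database.

```python
def matches(tags, include=None, exclude=None):
    """True if tags contain every include pair and none of the exclude pairs."""
    include = include or {}
    exclude = exclude or {}
    has_all = all(tags.get(key) == value for key, value in include.items())
    has_excluded = any(tags.get(key) == value for key, value in exclude.items())
    return has_all and not has_excluded


def filter_series(series, include=None, exclude=None):
    """Keep only the time series whose tags pass the filter."""
    return [s for s in series if matches(s["tags"], include, exclude)]


series = [
    {"tags": {"customer": "acme", "service": "iOS"}, "points": [(0, 12)]},
    {"tags": {"customer": "globex", "service": "iOS"}, "points": [(0, 7)]},
    {"tags": {"customer": "acme", "service": "android"}, "points": [(0, 3)]},
]

vip_only = filter_series(series, include={"customer": "acme"})
without_android = filter_series(series, exclude={"service": "android"})
print(len(vip_only), len(without_android))
```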
Group the Query
Like filtering, grouping gives you more insight into your data by allowing you to “split apart” the metrics into groups. When you choose to group by a tag, Lightstep creates separate data points (separate lines on a line chart or different color boxes on a bar chart) for each value of the tag. Now you can get a sense of the distribution of the metric across the different tag values.
Say you’ve added the filter service: iOS so that the query only returns HTTP requests from the iOS service. But seeing all requests made by the service might not give you the details you need to find an issue. If you group by the customer tag, then you can see how the service is performing for each customer.
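Grouping can be sketched as splitting the filtered time series into one list per tag value. The data layout, function name, and customer names below are hypothetical.

```python
from collections import defaultdict


def group_by_tag(series, tag):
    """Split time series into separate lists, one per value of the given tag."""
    groups = defaultdict(list)
    for s in series:
        groups[s["tags"].get(tag)].append(s)
    return dict(groups)


# Hypothetical series already filtered to service: iOS.
ios_series = [
    {"tags": {"customer": "acme", "service": "iOS"}, "points": [(0, 12)]},
    {"tags": {"customer": "globex", "service": "iOS"}, "points": [(0, 7)]},
    {"tags": {"customer": "acme", "service": "iOS"}, "points": [(0, 4)]},
]

by_customer = group_by_tag(ios_series, "customer")
for customer, members in by_customer.items():
    print(customer, len(members))
```

Each key in the result corresponds to one line (or one color of bar) on the chart.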
Let’s look at an example metric query starting with the time series reported by the agent and finishing with a useful chart in Lightstep. We’ll use requests, which is a delta metric kind.
Here are four time series flushed from two different agents for the requests metric, shown in a table. Both agents report every 30 seconds.
For simplicity in this example, we’ll assume Lightstep is only charting the first three minutes. Instead of showing a valid timestamp, we use a time relative to when Lightstep queried the metric database (now = the time of the request).
|agent1||[now-15, 100], [now+15, 100]||method: post|
|agent1||[now-15, 150], [now+15, 125]||method: get|
|agent2||[now-15, 50], [now+15, 50]||method: post|
|agent2||[now-15, 100], [now+15, 100]||method: get|
Lightstep ingests these time series and stores them in the metric database. When you build a chart, the metric requests is now available to select in the query builder.
Now Lightstep considers the time series operator.
Apply the Time Series Operator
Because requests is a delta metric, you can choose to display either the count (the number of requests for that point in time) or the rate (the number of requests per second).
For example, this chart uses count and shows there are a total of 364.474 requests at 11:45 am.
This chart has the same query but uses rate, and shows there are 8.099 requests per second at 11:45 am.
For this example, we’ll use count. Now Lightstep knows to count the number of requests reported for each time series and next checks to see if any filters or groups have been applied.
Group the Data by Tag Values
Let’s say you want the chart to display a line for each value of the method tag. Now Lightstep knows it needs to create separate data points for each value of the method tag.
Next, it needs to align the time series for each method to match the time being displayed in the chart.
Align Time to the Chart
Remember that for this example, we’re just charting the first three minutes. Lightstep requests time buckets of 60 seconds, so it asks for metric values from now-60 secs, now, now+60 secs, and now+120 secs for each method value.
Since these times are different from the timestamps stored in the database, Lightstep needs to align the time. The first time series in the database is:
[now-15, 100], [now+15, 100]
Lightstep aligns that by interpolating it to:
[-60, 0], [0, 100], [60, 0], [120, 0]
This alignment allows the data points from all four time series to line up on unified points. Here’s the alignment of all four time series:
|agent1||[-60, 0], [0, 100], [60, 0], [120, 0]||method: post|
|agent1||[-60, 0], [0, 125], [60, 0], [120, 0]||method: get|
|agent2||[-60, 0], [0, 50], [60, 0], [120, 0]||method: post|
|agent2||[-60, 0], [0, 100], [60, 0], [120, 0]||method: get|
Now it needs to determine the value to display at each data point for each method using the selected aggregation.
Aggregate the Data
Let’s say we want the chart to display the maximum value at each time point.
We end up with the following data points:
For the method post, we get the following:
[-60, 0], [0, 100], [60, 0], [120, 0]
Lightstep drops the value 50 from agent2, because it only reports the maximum value at any time.
For the method get, we get the following:
[-60, 0], [0, 125], [60, 0], [120, 0]
Lightstep drops the value 100 from agent2.
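The last two steps of this walkthrough, grouping by the method tag and taking the maximum at each aligned time, can be sketched as follows. The aligned values are copied from the table above; the tuple layout and the function name `max_by_tag` are hypothetical, and the sketch assumes the series have already been aligned to the same timestamps.

```python
from collections import defaultdict

# (agent, tags, aligned points) taken from the alignment table.
aligned = [
    ("agent1", {"method": "post"}, [(-60, 0), (0, 100), (60, 0), (120, 0)]),
    ("agent1", {"method": "get"}, [(-60, 0), (0, 125), (60, 0), (120, 0)]),
    ("agent2", {"method": "post"}, [(-60, 0), (0, 50), (60, 0), (120, 0)]),
    ("agent2", {"method": "get"}, [(-60, 0), (0, 100), (60, 0), (120, 0)]),
]


def max_by_tag(series, tag):
    """Group aligned series by a tag and keep the maximum value at each timestamp."""
    grouped = defaultdict(dict)  # tag value -> {timestamp: max value seen so far}
    for _agent, tags, points in series:
        bucket = grouped[tags[tag]]
        for timestamp, value in points:
            bucket[timestamp] = max(value, bucket.get(timestamp, value))
    return {key: sorted(bucket.items()) for key, bucket in grouped.items()}


chart_lines = max_by_tag(aligned, "method")
for method, points in chart_lines.items():
    print(method, points)
```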