How Cloud Observability displays metrics

Cloud Observability ingests metrics from a number of sources as time series and saves them to a time series database. When you view that data, Cloud Observability transforms it into data points plotted on a chart. Different sources send different data. Some roll values up into a 60-second average. Others report multiple individual points and timestamps, or send an array of points. Some even aggregate the data before sending it. Cloud Observability stores the data in the form in which it was sent. When you query that data, it aligns the points to uniform time intervals, based on the metric kind and the time series operator chosen for the chart. The aggregation method you select for the chart then determines the values to display.

Time series

A time series is a sequence of measurements over a period of time. For example, if you recorded the temperature outside every hour, the time series would be the value at each hour: 8:00 am - 55°, 9:00 am - 57°, 10:00 am - 58°, and so on. Displayed as a chart, a time series makes changes over time easy to see.

Example:
Here’s an example of a time series for the http.request metric being reported by two agents (the first number in each array is a timestamp):

agent1: [1607986123, 100.0]

agent2: [1607986123, 50.0] [1607986125, 100.0]

Metric kinds

When displaying time series data, you may want the visualization to differ depending on what you’re measuring. For example, if you’re measuring temperature, you want to see the actual temperature at a particular point in time. Seeing just the change between points would not be very helpful: a 5° change between 50° and 55° is not a big deal, but it might be between 95° and 100°. If you’re measuring HTTP requests to an endpoint, however, you might only want to know about the change, or even the rate of change, since the last point in time. Did the request rate go up or down? And by how much?

Because of these differences, Cloud Observability supports gauges and two types of counters: delta and cumulative.

Metrics must be labeled with their kind before they are ingested. If the kind sent with a metric doesn’t match the kind already recorded for that metric name, the metric is rejected and an error is sent back to the client. If you want to change a metric’s kind, contact Customer Success at support@lightstep.com.

Gauges

Gauges represent an observed value at a specific point in time or over a specified range of time. Temperature readings are an example of a gauge metric. CPU usage is another example: you want to know exactly how much of an available resource is being used at a given point in time. Gauges are best when you care less about the degree of change over time than about the actual value at that time.

Delta

Deltas show how the values change from one reporting period (point on the graph) to the next. HTTP requests are an example of a delta metric: you want to see whether requests are going up or down, and by how much.

Cumulative

Cumulative metrics add each new value to the previous total, so each value is the running total at a specific point in time. Unlike deltas, every value is measured from the same “start” timestamp. An example of a cumulative metric is total web page hits. The value at each point in time increases from the last value, and you want to know how many of something you have accumulated at a given point in time.
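
To make the relationship between the two counter kinds concrete, here is a minimal Python sketch that converts a cumulative series into per-period deltas. The function name cumulative_to_deltas and the page-hit values are illustrative assumptions, not part of Cloud Observability, and the sketch does not handle counter resets.

```python
def cumulative_to_deltas(points):
    """Convert cumulative (timestamp, total) points to (timestamp, change) points.

    The first point has no predecessor, so this sketch skips it.
    """
    deltas = []
    for (_, prev_total), (ts, total) in zip(points, points[1:]):
        deltas.append((ts, total - prev_total))
    return deltas


# Hypothetical total page hits sampled every 30 seconds.
page_hits = [(0, 100), (30, 130), (60, 190)]
print(cumulative_to_deltas(page_hits))  # [(30, 30), (60, 60)]
```

Both series describe the same traffic. The cumulative form answers "how many so far?", while the delta form answers "how many since the last report?".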

Metric types

A metric type represents the value type being reported. Cloud Observability supports the following metric types:

  • Integer
  • Float
  • Distribution
    A distribution type returns a set of values for a point in time and performs aggregation on those values before charting the points. Cloud Observability supports percentile aggregation and can display the 50th, 95th, 99th, and 99.9th percentiles.
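
As a rough illustration of percentile aggregation over a distribution, the following Python sketch uses the nearest-rank method. How Cloud Observability actually computes percentiles is not described here, so the method, the function name percentile, and the sample latencies are all assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies (ms) reported for one point in time.
latencies_ms = [12, 15, 11, 240, 18, 14, 13, 16, 17, 19]
for p in (50, 95, 99, 99.9):
    print(p, percentile(latencies_ms, p))
```

With these values the 50th percentile is 15 ms, while the higher percentiles are all pulled up to the single 240 ms outlier.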

Aggregation

Aggregation is the process of converting the metrics stored in the database, which may be irregularly spaced, into the evenly spaced data points shown on the chart. Cloud Observability calculates that spacing (also called the output period) so that, regardless of the query duration, each chart contains roughly 120 data points.

For example, if a chart is displaying an hour of data, there will be 2 data points per minute, or one every 30 seconds. What exact values are displayed every 30 seconds depends on the aggregation method you choose.

The minimum output period is 30 seconds.
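
The arithmetic above can be sketched in Python as follows. This assumes the output period is simply the query duration divided by 120, with a floor of 30 seconds. The exact rounding Cloud Observability applies is not documented here, and the names below are illustrative.

```python
MIN_OUTPUT_PERIOD_S = 30  # minimum output period stated above
TARGET_POINTS = 120       # approximate number of data points per chart


def output_period_seconds(query_duration_s):
    """Approximate spacing between chart points for a query of the given length."""
    return max(MIN_OUTPUT_PERIOD_S, query_duration_s / TARGET_POINTS)


print(output_period_seconds(60 * 60))      # one hour: a point every 30 seconds
print(output_period_seconds(6 * 60 * 60))  # six hours: a point every 180 seconds
print(output_period_seconds(10 * 60))      # ten minutes: held at the 30-second minimum
```

Under this assumption, queries shorter than an hour show fewer than 120 points, because the period cannot drop below 30 seconds.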

Below, we will describe three different ways of performing aggregation.

Latest

The latest operator can only be used with gauge-type metrics. It displays the last value reported to the database for a time period. For example, if Cloud Observability receives a gauge time series array of [71, 71, 71.5, 72] for a time period, it displays the value 72 and drops the others.
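
A minimal Python sketch of this behavior, using the gauge values from the example. The function name latest is an assumption for illustration.

```python
def latest(values):
    """Return the last value reported in the period, or None if nothing was reported."""
    return values[-1] if values else None


print(latest([71, 71, 71.5, 72]))  # 72
```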

Delta

The delta operator computes how much a value changed since it was last reported. If Cloud Observability ingests a time series array of [1,1,1,2,2,2,3,3] for a delta metric and the time series operator is set to delta, Cloud Observability computes the value 15, the total amount the value changed over that period. Deltas are useful when you want to know how much something has changed over a period of time. If you use delta for something like HTTP requests, you can see how the number of requests changed at a particular point in time.
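
For a delta metric, the computation in this example amounts to summing the values reported in the period. A minimal Python sketch (the function name delta is illustrative):

```python
def delta(values):
    """Total change over the period: the sum of the delta values reported in it."""
    return sum(values)


print(delta([1, 1, 1, 2, 2, 2, 3, 3]))  # 15
```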

Rate

Rather than telling you how much something has changed at a particular point in time, rate tells you how quickly it changed over a time period. Think of a car that went 50 miles in an hour (you’re interested in the number of miles) versus the fact that it went 50 miles per hour (you’re interested in how fast it goes).

In the requests time series above ([1,1,1,2,2,2,3,3]), if the time series operator is set to rate and the time period is 10 seconds, instead of using 15 (as for delta), Cloud Observability computes the value 1.5 (15/10).

When using counter metrics (delta or cumulative), you can choose between displaying that metric as a delta or as a rate when you create your chart.
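
The difference between the two operators can be sketched in Python as follows, using the requests values from above. The function name rate and its parameters are assumptions for illustration.

```python
def rate(values, period_seconds):
    """Change per second: the total change in the period divided by its length."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return sum(values) / period_seconds


requests = [1, 1, 1, 2, 2, 2, 3, 3]
print(sum(requests))       # delta: 15 requests in the period
print(rate(requests, 10))  # rate: 1.5 requests per second
```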

Attributes (tags/labels)

Metric data often uses attributes to annotate the data with descriptions that help in getting your data to tell a more exact story. For example, metrics might use the service attribute to show what service a metric was emitted from, or the customer attribute to show which customer made the request. You can use attributes from your metric data to filter your query (to explicitly include/exclude metrics with certain attribute/value pairs) and to break out a single line of data points into separate data points for each value of an attribute.

Filter the query

For a metric like HTTP requests, one chart probably isn’t going to be very helpful. You’d see large numbers and if there’s a troubling spike, probably wouldn’t be able to pinpoint where the issue is coming from or who it’s affecting. By using attributes to filter your query, you can create concise charts that include only the data that’s important for this chart.

When you filter a query, Cloud Observability retrieves only the data from the metric database that matches the attribute/value pair (include) or that doesn’t match it (exclude).

For example, as part of your service level agreements with your VIP accounts, you need to keep a close eye on performance for them. You might use the customer attribute to create a chart for each VIP customer.
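
Include and exclude filtering can be sketched in Python as follows. The data layout, the function name filter_series, and the customer and service values are hypothetical, chosen only to illustrate the idea.

```python
def filter_series(series, include=None, exclude=None):
    """Keep series whose attributes match every include pair and no exclude pair."""
    include = include or {}
    exclude = exclude or {}
    kept = []
    for s in series:
        attrs = s["attributes"]
        if any(attrs.get(key) != value for key, value in include.items()):
            continue  # missing a required attribute/value pair
        if any(attrs.get(key) == value for key, value in exclude.items()):
            continue  # carries an excluded attribute/value pair
        kept.append(s)
    return kept


# Hypothetical http.requests series annotated with attributes.
series = [
    {"name": "http.requests", "attributes": {"customer": "acme", "service": "iOS"}},
    {"name": "http.requests", "attributes": {"customer": "globex", "service": "iOS"}},
    {"name": "http.requests", "attributes": {"customer": "acme", "service": "android"}},
]
vip_only = filter_series(series, include={"customer": "acme"})
print(len(vip_only))  # 2
```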

Group the query

Like filtering, grouping gives you more insight into your data by allowing you to “split apart” the metrics into groups. When you choose to group by an attribute, Cloud Observability creates separate data points (separate lines on a line chart or different color boxes on a bar chart) for each value of the attribute. Now you can get a sense of the distribution of the metric across the different attribute values.

Say you’ve added the filter service: iOS so that the query only returns HTTP requests from the iOS service. But seeing all of that service’s requests as a single line might not give you the details you need to find an issue. If you group by the customer attribute, then you can see how the service is performing for each customer.
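
Grouping can be sketched in Python as splitting the filtered series by the value of one attribute. The data layout, the function name group_by, and the customer names are hypothetical.

```python
from collections import defaultdict


def group_by(series, attribute):
    """Split time series into groups keyed by the value of one attribute."""
    groups = defaultdict(list)
    for s in series:
        groups[s["attributes"].get(attribute)].append(s)
    return dict(groups)


# Hypothetical series already filtered to service: iOS.
ios_requests = [
    {"attributes": {"service": "iOS", "customer": "acme"}, "points": [(0, 100)]},
    {"attributes": {"service": "iOS", "customer": "globex"}, "points": [(0, 40)]},
    {"attributes": {"service": "iOS", "customer": "acme"}, "points": [(0, 25)]},
]
for customer, members in group_by(ios_requests, "customer").items():
    print(customer, len(members))
```

Each key in the result would become its own line (or bar color) on the chart.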

The aggregation method you select determines how the values at each point are combined into the value displayed:

  • count of non-null values: The number of values found that are not null.
    For example, given the values of [10, 15, null, 50] the count is 3.
  • count of non-zero values: The number of values found that are not zero (null is counted).
    For example, given the values of [10, 15, null, 0, 50] the count is 4.
  • maximum value: The highest point in the data.
    For example, given the values of [10, 15, 50] the max is 50.
  • mean of all values: The average (sum of the data divided by the count) of the data.
    For example, given the values of [10, 15, 50] the mean is 25.
  • minimum value: The lowest point in the data.
    For example, given the values of [10, 15, 50] the min is 10.
  • sum of all values: The total of all points in the data.
    For example, given the values of [10, 15, 50] the sum is 75.

    Distribution type metrics are automatically summed and then aggregated into percentiles.
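
The aggregation methods listed above can be sketched in Python as follows, with None standing in for null. The method keys and the choice to ignore nulls for maximum, mean, minimum, and sum are assumptions of this sketch.

```python
def aggregate(values, method):
    """Combine a list of values (None = null) using one of the methods listed above."""
    non_null = [v for v in values if v is not None]
    if method == "count_non_null":
        return len(non_null)
    if method == "count_non_zero":
        return sum(1 for v in values if v != 0)  # null is counted, zero is not
    if method == "sum":
        return sum(non_null)
    if not non_null:
        return None  # max, mean, and min are undefined without data
    if method == "max":
        return max(non_null)
    if method == "min":
        return min(non_null)
    if method == "mean":
        return sum(non_null) / len(non_null)
    raise ValueError(f"unknown aggregation method: {method}")


print(aggregate([10, 15, None, 50], "count_non_null"))     # 3
print(aggregate([10, 15, None, 0, 50], "count_non_zero"))  # 4
print(aggregate([10, 15, 50], "max"))                      # 50
print(aggregate([10, 15, 50], "mean"))                     # 25.0
```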

End-to-end example

Let’s look at an example metric query starting with the time series reported by the agent and finishing with a useful chart in Cloud Observability. We’ll use requests, which is a delta metric kind.

Time series

Here are four time series flushed from two different agents for the requests metric, shown in a table. Both agents report every 30 seconds.

For simplicity in this example, we’ll assume Cloud Observability is only charting the first three minutes. Instead of showing a valid timestamp, we use a time relative to when Cloud Observability queried the metric database (now = the time of the request).

http.requests kind=delta

Agent    Time/Metric Value               Attribute/Value
agent1   [now-15, 100], [now+15, 100]    method: post
agent1   [now-15, 150], [now+15, 125]    method: get
agent2   [now-15, 50], [now+15, 50]      method: post
agent2   [now-15, 100], [now+15, 100]    method: get

Cloud Observability ingests these time series and stores them in the metric database. When you build a chart, the metric requests is now available to select in the query builder.

Now Cloud Observability considers the time series operator.

Apply aggregation

Because requests is a delta metric, you can choose to display either the delta (the number of requests for that point in time) or the rate (the number of requests per second).

Let’s choose delta. Now Cloud Observability knows to count the number of requests reported for each time series.

Remember for this example, we’re just charting the first three minutes. Cloud Observability requests time buckets of 60 seconds, so it asks for metric values from now-60 secs, now, now+60 secs, and now+120 secs for each method value.

Since these times differ from the timestamps stored in the database, Cloud Observability needs to align the data. The first time series in the database is:
[now-15, 100], [now+15, 100]

Cloud Observability aligns that by interpolating it to:
[-60, 0], [0, 100], [60, 0], [120, 0]

This alignment allows the data points from all four time series to line up on unified points. Here’s the alignment of all four time series:

Agent    Time/Metric Value                         Attribute/Value
agent1   [-60, 0], [0, 100], [60, 0], [120, 0]     method: post
agent1   [-60, 0], [0, 125], [60, 0], [120, 0]     method: get
agent2   [-60, 0], [0, 50], [60, 0], [120, 0]      method: post
agent2   [-60, 0], [0, 100], [60, 0], [120, 0]     method: get

Group the data by attribute values

Let’s say you want the chart to display a line for each value of the method attribute. Now Cloud Observability knows it needs to create separate data points for each value of the method attribute.

Let’s say we want the chart to display the maximum value at each time point.
We end up with the following data points:

For the method post, we get the following:
[-60, 0], [0, 100], [60, 0], [120, 0]
Cloud Observability drops the value 50 from agent2, because the chart shows only the maximum value at each time point.

For the method get, we get the following:
[-60, 0], [0, 125], [60, 0], [120, 0]
Cloud Observability drops the value 100 from agent2.
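
The grouping and maximum steps above can be reproduced with a short Python sketch. It starts from the aligned values shown in the table. The data layout and the function name max_by_method are illustrative assumptions, not Cloud Observability internals.

```python
from collections import defaultdict

# Aligned time series from the table above: (agent, method, [(time, value), ...]).
aligned = [
    ("agent1", "post", [(-60, 0), (0, 100), (60, 0), (120, 0)]),
    ("agent1", "get", [(-60, 0), (0, 125), (60, 0), (120, 0)]),
    ("agent2", "post", [(-60, 0), (0, 50), (60, 0), (120, 0)]),
    ("agent2", "get", [(-60, 0), (0, 100), (60, 0), (120, 0)]),
]


def max_by_method(series):
    """Group by the method attribute, then keep the maximum value at each time point."""
    buckets = defaultdict(lambda: defaultdict(list))
    for _agent, method, points in series:
        for t, value in points:
            buckets[method][t].append(value)
    return {
        method: [(t, max(values)) for t, values in sorted(by_time.items())]
        for method, by_time in buckets.items()
    }


result = max_by_method(aligned)
print(result["post"])  # [(-60, 0), (0, 100), (60, 0), (120, 0)]
print(result["get"])   # [(-60, 0), (0, 125), (60, 0), (120, 0)]
```

Swapping max for another aggregation, such as sum, would instead combine the values from both agents at each time point.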

See also

Create and manage dashboards

Create and manage panels

Create alerts

Updated Sep 23, 2021