Liz Heym Catching Waves With Time-Series Data, SF Bay Area Ruby Meetup July 18 2024

Slide 1

Slide 1 text

Catching Waves With Time-Series Data Liz Heym Cisco Meraki

Slide 2

Slide 2 text

We’ll cover: - How to select a tool for managing time-series data - How to organize, query, and aggregate time-series data - How to translate your design to API constraints

Slide 3

Slide 3 text

But first! What is time-series data? Time-series data is a collection of observations recorded over consistent intervals of time.

Slide 4

Slide 4 text

A surfer’s goal Liz has just taken her first surf lesson, and she’s keen on learning how her surfing will improve over time. She’s decided to record this data in a time-series database and to access it via an API endpoint. But where does she start?

Slide 5

Slide 5 text

Selecting the right board for the conditions 1. Surf a board you already have: use a time-series DB already in your tech stack 2. Use the old board, but add a new set of fins: use an extension for a DB you already use 3. Buy a new board: adopt a new DB technology 4. Shape your own board: design your own DB

Slide 6

Slide 6 text

1. Surf a board you already have If you already have a database that’s well-suited for time-series data, why change? Maybe you just need to adjust your techniques!

Slide 7

Slide 7 text

2. Keep the old board, but add a new set of fins ● Old board = Postgres ● New fins = Postgres extension ● A few options: pg_timeseries or TimescaleDB

Slide 8

Slide 8 text

3. Buy a new board Sometimes, your existing tools don’t cut it, and you need to invest in something entirely new. ClickHouse is a fast, open-source analytical database, designed around time-series data.

Slide 9

Slide 9 text

4. Shape your own board Sometimes, no available database seems suited to your highly specific needs. In 2008, the engineers at Meraki found themselves in this position, and LittleTable was born.

Slide 10

Slide 10 text

4. Shape your own board: LittleTable ● Relational database ● Optimized for time-series data ● Data clustered for continuous disk access ● SQL interface for querying Reference: LT White Paper

Slide 11

Slide 11 text

The Perfect Technique ● Now that you have a board, you need to learn how to surf it! ● Much like in surfing, there are tried-and-true techniques for best handling time-series data. ● We’ll cover: 1. Data arranged by time 2. Hierarchically-delineated key 3. Querying by index 4. Aggregation and Compression

Slide 12

Slide 12 text

1. Data arranged by time ● Key feature of a time-series DB ● ClickHouse automatically generates an index on the ts column ● Performant when accessing a range of time ● LittleTable is append-only

Slide 13

Slide 13 text

2. Hierarchically-delineated key ● In addition to being grouped by time, data is organized according to this composite key. ● Crucial to understand how this data is going to be accessed—not every query will be efficient

Slide 14

Slide 14 text

2. Hierarchically-delineated key In this example, the composite key is: Network Id, Device Id

Slide 15

Slide 15 text

2. Hierarchically-delineated key ● Organize by increasing specificity ● Cisco Meraki’s example from the previous slide: Network, Device ● For Liz’s surfing application: Surfer, Region, Break
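
As a rough illustration of "increasing specificity", the sketch below sorts surf records by a composite key of surfer, region, break, and then timestamp. The `WaveKey` type, its field names, and the sample rows are assumptions made for this example; they are not taken from LittleTable or ClickHouse.

```python
from typing import NamedTuple


class WaveKey(NamedTuple):
    """Hypothetical composite key, ordered from least to most specific."""
    surfer: str
    region: str
    break_name: str  # "break" is a reserved word in Python
    ts: int          # Unix timestamp in seconds


rows = [
    WaveKey("liz", "la", "malibu", 1_720_000_300),
    WaveKey("liz", "humboldt", "moonstone", 1_720_000_100),
    WaveKey("kai", "la", "malibu", 1_720_000_200),
    WaveKey("liz", "la", "malibu", 1_720_000_000),
]

# Tuples compare field by field, so sorting clusters rows by surfer,
# then region, then break, and finally by time within each break.
clustered = sorted(rows)

for row in clustered:
    print(row)
```

With this ordering, all of one surfer's rows sit next to each other, and within a surfer all rows for one region are adjacent, which is what makes lookups by a leading part of the key cheap.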

Slide 16

Slide 16 text

3. Querying by index: LittleTable ● LittleTable is organized across two axes: composite key and time ○ Only need a prefix ● Performant query for LittleTable: ○ Surfer ○ Region ○ Timestamp
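
To see why a key prefix can be enough, here is a minimal sketch of a prefix lookup over rows kept in sorted order. It is only an in-memory analogy, not LittleTable's actual implementation; the function name `prefix_scan` and the tuple layout `(surfer, region, break, ts)` are assumptions for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_rows, prefix, start_ts, end_ts):
    """Return rows whose key starts with `prefix` and whose
    timestamp (last field) falls in [start_ts, end_ts)."""
    n = len(prefix)
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so this finds the first row that could match.
    i = bisect_left(sorted_rows, prefix)
    matches = []
    while i < len(sorted_rows) and sorted_rows[i][:n] == prefix:
        if start_ts <= sorted_rows[i][-1] < end_ts:
            matches.append(sorted_rows[i])
        i += 1
    return matches


rows = sorted([
    ("kai", "la", "malibu", 200),
    ("liz", "humboldt", "moonstone", 100),
    ("liz", "la", "malibu", 0),
    ("liz", "la", "malibu", 300),
    ("liz", "la", "venice", 150),
])

# Surfer + region is a valid prefix of (surfer, region, break).
print(prefix_scan(rows, ("liz", "la"), 100, 400))
```

The matching rows form one contiguous run in the sorted data, so the scan stops as soon as the prefix no longer matches. A query on region alone would not have this property, because region is not a leading part of the key.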

Slide 17

Slide 17 text

3. Querying by index: ClickHouse ● ClickHouse includes the timestamp at the end of the composite index ○ So you must query with the full key ● Non-performant query ○ Surfer, Timestamp ● Performant query ○ Surfer, Region, Break, Timestamp ○ Examples: Liz, LA, Malibu, over the past month; Liz, Humboldt, Moonstone, over the past month Two weeks … Two weeks

Slide 18

Slide 18 text

4. Aggregation and Compression ● Time-series data can pile up fast ● Two needs: ○ Don’t have infinite storage ○ Also want to show as much data as possible

Slide 19

Slide 19 text

4. Aggregation and Compression ● Don’t have infinite storage ○ Data retention ○ Time-to-live ● Also want to show as much data as possible ○ Compression ○ Aggregation

Slide 20

Slide 20 text

4. Compression: TimescaleDB

Slide 21

Slide 21 text

4. Aggregation: LittleTable ● Base table and aggregate table ● Base table (data per wave): ○ Distance, Duration ● Aggregate table (data per interval of time): ○ Total distance, total duration, max speed, wave count
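
The base-to-aggregate relationship can be sketched as a simple rollup. The example below groups per-wave rows by calendar day and computes the four aggregate columns named on the slide. The class names, field names, and units are hypothetical, and deriving speed as distance divided by duration is an assumption, since the slide does not say how max speed is obtained.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Wave:
    """One base-table row: a single wave (hypothetical schema)."""
    ts: datetime
    distance_m: float
    duration_s: float


@dataclass
class DailyAggregate:
    """One aggregate-table row: all waves in one day."""
    total_distance_m: float = 0.0
    total_duration_s: float = 0.0
    max_speed_mps: float = 0.0
    wave_count: int = 0


def aggregate_by_day(waves):
    days = defaultdict(DailyAggregate)
    for w in waves:
        agg = days[w.ts.date()]
        agg.total_distance_m += w.distance_m
        agg.total_duration_s += w.duration_s
        # Assumption: a wave's speed is its average speed.
        if w.duration_s > 0:
            agg.max_speed_mps = max(agg.max_speed_mps,
                                    w.distance_m / w.duration_s)
        agg.wave_count += 1
    return dict(days)


waves = [
    Wave(datetime(2024, 7, 18, 9, 0, tzinfo=timezone.utc), 50.0, 10.0),
    Wave(datetime(2024, 7, 18, 9, 30, tzinfo=timezone.utc), 120.0, 20.0),
    Wave(datetime(2024, 7, 19, 8, 0, tzinfo=timezone.utc), 30.0, 10.0),
]
daily = aggregate_by_day(waves)
```

The same function shape works for weekly or monthly tables by swapping `w.ts.date()` for a coarser bucket.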

Slide 22

Slide 22 text

4. Aggregation: LittleTable ● We can aggregate the data over the following intervals: ○ Base table—with a TTL of 1 month ○ One day—with a TTL of 6 months ○ One week—with a TTL of 1 year ○ One month—with a TTL of 5 years
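
One way to picture the tiered TTLs is to ask which tables still hold data for an observation of a given age. The sketch below encodes the four tiers from the slide. The table names are made up, and approximating a month as 30 days, six months as 182 days, and a year as 365 days is an assumption.

```python
from datetime import datetime, timedelta, timezone

# TTL per table, from the slide; day counts are approximations.
TTL_BY_TABLE = {
    "base": timedelta(days=30),           # 1 month
    "one_day": timedelta(days=182),       # 6 months
    "one_week": timedelta(days=365),      # 1 year
    "one_month": timedelta(days=5 * 365), # 5 years
}


def tables_covering(observed_at, now):
    """Names of the tables that have not yet expired data
    recorded at `observed_at`, finest grain first."""
    age = now - observed_at
    return [name for name, ttl in TTL_BY_TABLE.items() if age <= ttl]


now = datetime(2024, 7, 18, tzinfo=timezone.utc)
print(tables_covering(now - timedelta(days=10), now))   # all four tables
print(tables_covering(now - timedelta(days=400), now))  # only the monthly table
```

Older data survives only at coarser grain, which is how the scheme trades detail for retention.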

Slide 23

Slide 23 text

Getting out there We have our data: ● Stored ● Aggregated ● Easily accessible Now we design an API endpoint that Liz can use to easily query her surf data.

Slide 24

Slide 24 text

Getting out there: Query params ● Required ○ Surfer ○ Timespan ● Optional ○ Region ○ Break
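
A minimal sketch of how these parameters might be validated is shown below. The function name and error strings are hypothetical. The rule that a break may only be given together with a region is an assumption, inferred from the composite key order (surfer, region, break); the slide itself only marks both as optional.

```python
def validate_params(params):
    """Return a list of validation errors (empty if the request is valid)."""
    errors = []
    for required in ("surfer", "timespan"):
        if params.get(required) is None:
            errors.append(f"{required} is required")
    # Assumption: filters must form a prefix of (surfer, region, break),
    # so a break without its region is rejected.
    if params.get("break") is not None and params.get("region") is None:
        errors.append("break requires region")
    return errors


print(validate_params({"surfer": "liz", "timespan": 86_400}))
print(validate_params({"surfer": "liz", "timespan": 86_400, "break": "malibu"}))
```

Rejecting non-prefix filters at the API layer keeps every accepted request on the efficient query path.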

Slide 25

Slide 25 text

Getting out there: Timespan and interval ● timespan = the full period of time over which we want data. ○ Our longest TTL is 5 years: that’s the max timespan ● interval = the grain at which the data is aggregated ○ Calculated based on the timespan ● The interval options are: ○ One day (TTL 6 months) ○ One week (TTL 1 year) ○ One month (TTL 5 years)
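
The slide says the interval is calculated from the timespan but does not give the rule. One plausible rule, sketched below, picks the finest interval whose TTL still covers the requested timespan. It assumes the timespan ends at the present moment. The names and the day-count approximations for months and years are assumptions.

```python
from datetime import timedelta

# (TTL, interval name), finest grain first; day counts are approximations.
INTERVALS = [
    (timedelta(days=182), "one_day"),        # TTL 6 months
    (timedelta(days=365), "one_week"),       # TTL 1 year
    (timedelta(days=5 * 365), "one_month"),  # TTL 5 years
]
MAX_TIMESPAN = INTERVALS[-1][0]


def interval_for(timespan):
    """Pick the finest interval whose table retains the whole timespan."""
    if timespan <= timedelta(0):
        raise ValueError("timespan must be positive")
    if timespan > MAX_TIMESPAN:
        raise ValueError("timespan exceeds the longest TTL (5 years)")
    for ttl, name in INTERVALS:
        if timespan <= ttl:
            return name


print(interval_for(timedelta(days=30)))    # one_day
print(interval_for(timedelta(days=200)))   # one_week
print(interval_for(timedelta(days=1000)))  # one_month
```

Deriving the interval on the server means the caller never has to know which aggregate tables exist or how long each one retains data.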