Interpolating Time Series Data On Demand (Lakshmanan Velusamy, Stealth Startup) | RTA Summit 2023

Many real-world datasets are time series in nature, continuously recording measurements and the status of entities over time. In such datasets, missing data and gaps occur frequently due to system failures or data-processing errors. Time series interpolation (aka gap filling) is a feature introduced in Pinot 0.11 that lets you fill in missing data in your time series dataset.

When using gap filling in Pinot, you can configure the time window and bucket size over which gaps are filled, the fill strategy to use (such as carrying the previous value forward or substituting a default value), and limits on how many rows are filled. Pinot also provides options to customize gap filling for different use cases, such as filling each series of a multi-series dataset independently.

Gap filling is a powerful feature in Pinot that helps you work with incomplete datasets more effectively and produce more accurate results. It can also let you poll less frequently and store fewer data points, improving storage cost and performance, because missing points can be reconstructed at query time.
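
At its core, the feature is exposed as a GapFill wrapper around a time-bucketed sub-query. Below is a minimal sketch of the clause, in the shape used by the queries in the transcript below (the window placeholders are illustrative; refer to the Pinot documentation for the exact argument list):

    GapFill(
      time_col,                             -- time-bucketed column produced by the inner query
      '1:MILLISECONDS:EPOCH',               -- format of time_col and of the window bounds
      <window_start>, <window_end>,         -- time range to fill, in the same format
      '1:MINUTES',                          -- raw bucket size: one row per bucket per series after filling
      '1:DAYS',                             -- aggregation bucket size used by the outer query
      Fill(status, 'FILL_PREVIOUS_VALUE'),  -- fill strategy: carry the previous observed value forward
      TimeSeriesOn(store_id)                -- key column(s) identifying each individual series
    )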

StarTree

May 23, 2023

Transcript

  1. Time Series Data
     • Track change of value / state mutation over time.
     • Discrete (e.g., orders) vs. continuous (e.g., status) events.
     • Append vs. upsert.
  2. Time Series Data: Examples
     • Real-time vehicle data: speed, braking, acceleration, etc.
     • Server performance metrics: CPU, memory, network usage, etc.
     • Industrial equipment: telemetry from IoT sensors.
  3. Time Series Data
     • Sparse data.
     • Missing data.
     • Higher polling frequency => finer granularity => more resources => higher cost.
  4. Time Series Charts: What / Why / Should, across Past, Present, Future
     • Past: What happened? Why did it happen? What should I have done?
     • Present: What is happening? Why is it happening? What should I do now?
     • Future: What will happen? Why will it happen? What should I do next?
  5. Computing Store Uptime at Scale
     • High-fidelity pre-aggregation of events (a hypothetical sketch follows this item):
       ◦ Less flexible
       ◦ High cost
     • Interpolating raw state transitions on demand:
       ◦ More flexibility
       ◦ Low cost
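
     For contrast, a purely hypothetical sketch of the pre-aggregation alternative: a pipeline would materialize one row per store per minute into a table such as store_status_per_minute (the table and column names here are invented for illustration), so uptime becomes a simple filter-and-count, at the cost of computing and storing every minute-level row up front:

        SELECT store_id, COUNT(*) AS uptime_minutes
        FROM store_status_per_minute   -- hypothetical pre-aggregated table: one row per store per minute
        WHERE minute_ts BETWEEN 1656633600 AND 1656720000
          AND is_open = true
        GROUP BY store_id
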
  6. Total Expected Store Uptime (timeline chart, 2022/07/01): store_id1 11:00–21:00 = 10 hours; store_id2 10:00–16:00 = 6 hours; store_id3 11:00–21:00 = 10 hours; store_id4 10:00–16:00 = 6 hours.
  7. Actual Store Uptime (timeline chart, 2022/07/01): store_1 11:00–21:00 = 10 hours; store_2 10:00–16:00 = 6 hours; store_3 11:00–21:00 with a 13:00–13:30 closure = 9.5 hours; store_4 10:00–16:00 = 6 hours.
  8. Sample Dataset
     store_id | event_ts         | is_open | should_be_open
     ---------+------------------+---------+---------------
     store_1  | 2023-04-01 11:00 | true    | true
     store_1  | 2023-04-01 21:00 | false   | false
     store_2  | 2023-04-01 10:00 | true    | true
     store_2  | 2023-04-01 16:00 | false   | false
     store_3  | 2023-04-01 11:00 | true    | true
     store_3  | 2023-04-01 13:00 | false   | true
     store_3  | 2023-04-01 13:30 | true    | true
     store_3  | 2023-04-01 21:00 | false   | false
     store_4  | 2023-04-01 10:00 | true    | true
     store_4  | 2023-04-01 16:00 | false   | false
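
     To make the uptime numbers concrete, consider store_3: with one row per minute and the previous value carried forward between events, its four events expand roughly as follows (illustrative only):

        2023-04-01 11:00 – 12:59   is_open = true    (120 minutes)
        2023-04-01 13:00 – 13:29   is_open = false   ( 30 minutes)
        2023-04-01 13:30 – 20:59   is_open = true    (450 minutes)
        2023-04-01 21:00 onwards   is_open = false

        actual uptime   = 120 + 450 = 570 minutes = 9.5 hours
        expected uptime = 11:00 – 21:00 (should_be_open) = 600 minutes = 10 hours
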
  9. Total Store Uptime With Gapfill (chart, window 2023/04/01 10:00–21:00): store_s1 10 hours, store_s2 6 hours, store_s3 10 hours, store_s4 6 hours.
  10. Total Expected Store Uptime With Gapfill (chart, window 2022/07/01 10:00–21:00): store_s1 10 hours, store_s2 6 hours, store_s3 10 hours, store_s4 6 hours.
  11. Query: Uptime of a specific store
      SELECT time_col, COUNT(*) AS total_store_uptime_minutes
      FROM (
        SELECT GapFill(time_col, '1:MILLISECONDS:EPOCH', 1656633600, 1656720000, '1:MINUTES', '1:DAYS',
                       Fill(status, 'FILL_PREVIOUS_VALUE'), TimeSeriesOn(store_id)), status
        FROM (
          SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
                 LastWithTime(is_open, created_at, 'BOOLEAN') AS status
          FROM store_status_events_tbl
          WHERE (event_ts BETWEEN 1656633600 AND 1656720000) AND (store_id = 'store_id1')
          GROUP BY time_col
          ORDER BY time_col ASC
          LIMIT 100000
        )
        LIMIT 2000000
      )
      WHERE status = 'true'
      GROUP BY time_col
      LIMIT 2000
  12. Step 1: Select raw events (the innermost sub-query)
      SELECT time_col, COUNT(*) AS total_store_uptime_minutes
      FROM (
        SELECT GapFill(time_col, '1:MILLISECONDS:EPOCH', 1656633600, 1656720000, '1:MINUTES', '1:DAYS',
                       FILL(status, 'FILL_PREVIOUS_VALUE'), TIMESERIESON(store_id)), status
        FROM (
          SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
                 LastWithTime(is_open, created_at, 'BOOLEAN') AS status
          FROM store_status_events_tbl
          WHERE (event_ts BETWEEN 1656633600 AND 1656720000) AND (store_id = 'store_id1')
          GROUP BY time_col
          ORDER BY time_col ASC
          LIMIT 100000
        )
        LIMIT 2000000
      )
      WHERE status = 'true'
      GROUP BY time_col
      LIMIT 2000
  13. Step 2: Fill the gaps (the GapFill layer around the raw events)
      SELECT time_col, COUNT(*) AS total_store_uptime_minutes
      FROM (
        SELECT GapFill(time_col, '1:MILLISECONDS:EPOCH', 1656633600, 1656720000, '1:MINUTES', '1:DAYS',
                       Fill(status, 'FILL_PREVIOUS_VALUE'), TimeSeriesOn(store_id)), status
        FROM (
          SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
                 LastWithTime(is_open, created_at, 'BOOLEAN') AS status
          FROM store_status_events_tbl
          WHERE (event_ts BETWEEN 1656633600 AND 1656720000) AND (store_id = 'store_id1')
          GROUP BY time_col
          ORDER BY time_col ASC
          LIMIT 100000
        )
        LIMIT 2000000
      )
      WHERE status = 'true'
      GROUP BY time_col
      LIMIT 2000
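
      The fill strategy at this step is pluggable: FILL_PREVIOUS_VALUE carries the last observed status forward into each empty bucket, while Pinot's gapfill also documents a default-value strategy that fills empty buckets with the column's default value instead, e.g.:

        Fill(status, 'FILL_DEFAULT_VALUE')   -- fill gaps with the column's default value rather than the previous observation
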
  14. Step 3: Aggregate (the outer SELECT over the gap-filled rows)
      SELECT time_col, COUNT(*) AS total_store_uptime_minutes
      FROM (
        SELECT GapFill(time_col, '1:MILLISECONDS:EPOCH', 1656633600, 1656720000, '1:MINUTES', '1:DAYS',
                       Fill(status, 'FILL_PREVIOUS_VALUE'), TimeSeriesOn(store_id)), status
        FROM (
          SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
                 LastWithTime(is_open, created_at, 'BOOLEAN') AS status
          FROM store_status_events_tbl
          WHERE (event_ts BETWEEN 1656633600 AND 1656720000) AND (store_id IN ('store_id1'))
          GROUP BY time_col
          ORDER BY time_col ASC
          LIMIT 100000
        )
        LIMIT 2000000
      )
      WHERE status = 'true'
      GROUP BY time_col
      LIMIT 2000
  15. Query: Total online stores
      SELECT time_col, COUNT(*) AS total_online_stores
      FROM (
        SELECT GapFill(time_col, '1:MILLISECONDS:EPOCH', 1656633600, 1656720000, '1:MINUTES', '1:DAYS',
                       FILL(status, 'FILL_PREVIOUS_VALUE'), TIMESERIESON(store_id)), status, store_id
        FROM (
          SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
                 LastWithTime(is_open, event_time, 'BOOLEAN') AS status, store_id
          FROM enriched_store_status_events
          WHERE (event_ts BETWEEN 1656633600 AND 1656720000)
          GROUP BY time_col, store_id
          ORDER BY time_col ASC
          LIMIT 100000
        )
        LIMIT 2000000
      )
      WHERE status = 'true'
      GROUP BY time_col
      LIMIT 2000
  16. Request Flow (diagram): (1) the client sends the gapfill SQL query to the Pinot broker; (2) the broker strips the gapfill clause, and its QueryRouter sends the stripped SQL query to Pinot server 0 … server n; (3) each server returns its result set, and the BrokerReducerService reduces them into a single reduced result; (4) the GapfillProcessor performs gapfilling (4.1) and filtering (4.2); (5) aggregation (5.1) and filtering (5.2) produce the gap-filled result returned to the client.
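
      Roughly, the "stripped" SQL query that the broker sends to the servers corresponds to the inner selection of the gapfill query, without the GapFill wrapper and the outer aggregation; a sketch based on the single-store example above (not the broker's exact rewrite):

        SELECT DATETIMECONVERT(event_ts, '1:SECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') AS time_col,
               LastWithTime(is_open, created_at, 'BOOLEAN') AS status
        FROM store_status_events_tbl
        WHERE (event_ts BETWEEN 1656633600 AND 1656720000) AND (store_id = 'store_id1')
        GROUP BY time_col
        ORDER BY time_col ASC
        LIMIT 100000
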
  17. Execution Plan: Server Side (diagram): each Pinot server (server 0, server 1, …) runs the stripped query through its Projection, Transform, Filter, Aggregate, and Combine operators, producing the per-minute state-transition rows, e.g. (2023-04-01 11:00, store_id_0, true), (2023-04-01 11:00, store_id_1, true), (2023-04-01 13:00, store_id_1, false), (2023-04-01 13:30, store_id_1, true), (2023-04-01 21:00, store_id_0, false), (2023-04-01 21:00, store_id_1, false), which are sent to the broker.
  18. Execution Plan: Broker Side (diagram): the broker's Reducer merges the servers' state-transition rows (the six rows above); the Gapfill step expands them into one row per minute per store, e.g. (2023-04-01 11:00, store_id_0, true), (2023-04-01 11:01, store_id_0, true), (2023-04-01 11:02, store_id_0, true), …, (2023-04-01 11:00, store_id_1, true), (2023-04-01 11:01, store_id_1, true), (2023-04-01 11:02, store_id_1, true), …, (2023-04-01 13:00, store_id_1, false), (2023-04-01 13:01, store_id_1, false), …; the Filter and Aggregate steps then produce the final result.
  19. Future Work
      • Push down the gapfill logic from the Pinot broker to the Pinot servers:
        ◦ Leverage data partitioning if events are co-located.
        ◦ Run the gapfill logic on the Pinot servers where the data lives.
        ◦ Reduce data transmission.
        ◦ Increase the parallelism and performance of gapfill.
      • Leverage the V2 (multi-stage) query engine.
      • Add support for advanced interpolations: linear, polynomial, spline, etc. (a hypothetical sketch follows).
      • Time series extrapolation.
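
      Linear interpolation is not available today; purely as an illustration of where it would plug in, a hypothetical fill clause for a numeric metric (the 'FILL_LINEAR' mode and the temperature column are invented for this sketch and do not exist in Pinot):

        Fill(temperature, 'FILL_LINEAR')   -- hypothetical: fill missing buckets by interpolating linearly between surrounding observations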