Slide 1

Slide 1 text

2019 DevDay Deep Dive Into LINE's Time Series Database 'Flash' > Xuanhuy Do > LINE Observability Infrastructure Team Senior Software Engineer

Slide 2

Slide 2 text

Agenda > Observability Introduction > Road to Our New TSDB Flash > Flash Architecture > Challenges and Lessons Learned

Slide 3

Slide 3 text

LINE’s Observability Team

Slide 4

Slide 4 text

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs

Slide 5

Slide 5 text

Help our engineers to keep their services healthy

Slide 6

Slide 6 text

Provide Multiple Infrastructures > Metrics collecting, alert on issues > Distributed tracing for root cause analysis > Log collecting, alert on WARN / ERROR

Slide 7

Slide 7 text

Provide Multiple Infrastructures > Metrics collecting, alert on issues > Distributed tracing for root cause analysis > Log collecting, alert on WARN / ERROR

Slide 8

Slide 8 text

Provide Multiple Infrastructures > Metrics collecting, alert on issues > Distributed tracing for root cause analysis > Log collecting, alert on WARN / ERROR

Slide 9

Slide 9 text

Metrics Collecting, Alert on Issues

Slide 10

Slide 10 text

Metrics Are So Important: Trending, Alerting, Anomaly Detection, Root Cause Analysis, Impact Measurement

Slide 11

Slide 11 text

Without metrics, you know nothing about your application

Slide 12

Slide 12 text

Number of Metrics per Minute Over the Years (~2016 to 2020): from 10M to 200M to 1B

Slide 13

Slide 13 text

So Many Metrics?? Culture, Microservice Boom, Convenient Libraries, Business Growth

Slide 14

Slide 14 text

So Many Metrics?? 1,000 Servers × 10,000 Metrics Each = 10,000,000 Metrics Total

Slide 15

Slide 15 text

Our Metrics Platform: Agent Push Metrics > Intake Layer > Metric Storage > Graph UI / Alert Setting

Slide 16

Slide 16 text

Agent Push Metrics > Intake Layer > Metric Storage > Graph UI / Alert Setting; Focus on Metric Storage

Slide 17

Slide 17 text

Our Metric Storage History: 2011 MySQL > 2012 Sharding MySQL > 2013~2016 Super big sharding MySQL > 2017 OpenTSDB > 2018 Start developing new storage > 2019 Release new storage

Slide 18

Slide 18 text

MySQL Era: OOM, Latency (Swap), Disk Full, Complex Sharding, No Tag Support

Slide 19

Slide 19 text

MySQL Era: Covered by Operations

Slide 20

Slide 20 text

Our Metric Storage History: 2011 MySQL > 2012 Sharding MySQL > 2013~2016 Super big sharding MySQL > 2017 OpenTSDB > 2018 Start developing new storage > 2019 Release new storage

Slide 21

Slide 21 text

OpenTSDB Era: Complex Modules, Latency (Cache Miss), Latency (GC), Some Query Patterns Are Slow

Slide 22

Slide 22 text

OpenTSDB Era: Covered by Operations

Slide 23

Slide 23 text

We Decided To Build Our Own Storage!

Slide 24

Slide 24 text

Our Metric Storage History: 2011 MySQL > 2012 Sharding MySQL > 2013~2016 Super big sharding MySQL > 2017 OpenTSDB > 2018 Start developing new storage > 2019 Release new storage

Slide 25

Slide 25 text

Why Not Use Existing OSS?
> InfluxDB (EE): does not satisfy the expected write throughput, plus high cost
> ClickHouse: does not satisfy the expected read throughput
> TimescaleDB: does not satisfy the expected read throughput
> HeroicDB: does not satisfy the expected write throughput
> FB Beringei: only an in-memory module, no support for tag-based queries
> Netflix Atlas: only an in-memory module
> M3DB: unstable, no documentation (a year ago)

Slide 26

Slide 26 text

Why Not Use a Vendor Solution? Usage cost, impossible to customize, migration cost

Slide 27

Slide 27 text

What We Needed: Polyglot Protocol, Low Cost / Easy Maintenance, Massive Scale-Out for Write/Read

Slide 28

Slide 28 text

What We Needed: Polyglot Protocol. Users can use any client protocol: HTTP, UDP, Prometheus Remote Write/Read

Slide 29

Slide 29 text

What We Needed: Massive Scale-Out for Write/Read. How massive? Vertical scale-out (could serve up to tens of billions of metrics), low latency (P99 < 100ms) for read/write

Slide 30

Slide 30 text

What We Needed: Low Cost / Easy Maintenance. Binary deployment; limit the number of modules to maintain

Slide 31

Slide 31 text

What We Achieved: 4 million datapoints written per second, 1,000 queries per second, write P99 < 100ms, read P99 < 100ms

Slide 32

Slide 32 text

How We Did That!?

Slide 33

Slide 33 text

What Is a Metric Storage, or Time Series Database (TSDB)?

Slide 34

Slide 34 text

Glossary CPU

Slide 35

Slide 35 text

Glossary CPU { “Host” = “Host1” } { “Zone” = “JP” } Labels

Slide 36

Slide 36 text

Glossary 12:00 12:01 12:02 CPU 0.5 CPU 1.0 CPU 0.7 Datapoint

Slide 37

Slide 37 text

Glossary 12:00 12:01 12:02 Serie

Slide 38

Slide 38 text

TSDB Storage

type Serie struct {
    Id         uint64
    Label      map[string]string
    Datapoints []Datapoint
}

type Datapoint struct {
    Value     float64
    Timestamp int64
}

type Storage interface {
    Put(series []Serie)
    QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint
}

Slide 39

Slide 39 text

TSDB Storage

type Serie struct {
    Id         uint64
    Label      map[string]string
    Datapoints []Datapoint
}

type Datapoint struct {
    Value     float64
    Timestamp int64
}

type Storage interface {
    Put(series []Serie)
    QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint
}

Slide 40

Slide 40 text

How We Achieve This Scale? 4,000,000 Series per Sec / 1,000 Range Queries per Sec

type Serie struct {
    Id         uint64
    Label      map[string]string
    Datapoints []Datapoint
}

type Datapoint struct {
    Value     float64
    Timestamp int64
}

type Storage interface {
    Put(series []Serie)
    QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint
}
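
A minimal sketch of how the Storage interface above could be satisfied by a naive in-memory implementation, for illustration only; it reuses the Serie and Datapoint types from the slide, and names such as memStorage and matches are hypothetical, not part of Flash.

// memStorage is a toy in-memory implementation of the Storage interface.
type memStorage struct {
    series map[uint64]*Serie // keyed by serie ID
}

func newMemStorage() *memStorage {
    return &memStorage{series: make(map[uint64]*Serie)}
}

// Put appends the incoming datapoints to the stored series.
func (s *memStorage) Put(series []Serie) {
    for _, in := range series {
        existing, ok := s.series[in.Id]
        if !ok {
            copied := in
            s.series[in.Id] = &copied
            continue
        }
        existing.Datapoints = append(existing.Datapoints, in.Datapoints...)
    }
}

// QueryRange returns datapoints of series whose labels contain every
// requested pair and whose timestamps fall within [fromTs, toTs].
func (s *memStorage) QueryRange(label map[string]string, fromTs int64, toTs int64) []Datapoint {
    var out []Datapoint
    for _, serie := range s.series {
        if !matches(serie.Label, label) {
            continue
        }
        for _, dp := range serie.Datapoints {
            if dp.Timestamp >= fromTs && dp.Timestamp <= toTs {
                out = append(out, dp)
            }
        }
    }
    return out
}

// matches reports whether have contains every key/value pair in want.
func matches(have, want map[string]string) bool {
    for k, v := range want {
        if have[k] != v {
            return false
        }
    }
    return true
}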

Slide 41

Slide 41 text

Technical Decisions
> In-memory base: fit the most recent 28 hours of data “entirely” into memory
> Microservice: split into multiple services for local optimization, with gRPC as the main communication protocol
> Employ the Go language: ease of deployment and good tooling
> Utilize open source: don't try to reinvent all the wheels; combine with other open source to reduce development cost

Slide 42

Slide 42 text

Technical Decisions
> In-memory base: fit the most recent 28 hours of data “entirely” into memory
> Microservice: split into multiple services for local optimization, with gRPC as the main communication protocol
> Employ the Go language: ease of deployment and good tooling
> Utilize open source: don't try to reinvent all the wheels; combine with other open source to reduce development cost

Slide 43

Slide 43 text

Microservice-Based Architecture. Idea: Split the Label Storage Service and the Datapoints Storage Service

Slide 44

Slide 44 text

CPU { “Host” = “Host1” } { “Zone” = “JP” } 12:00 0.5

Slide 45

Slide 45 text

CPU { “Host” = “Host1” } { “Zone” = “JP” } Hash Uint64: 1234 Serie ID
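
A hedged sketch of how a metric name plus its labels could be hashed into a uint64 serie ID, as the slide illustrates. FNV-1a and the sorted-key scheme below are illustrative assumptions, not necessarily the hash Flash uses.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// serieID hashes a metric name and its labels into a uint64 ID.
// Keys are sorted so the same label set always produces the same hash.
func serieID(name string, labels map[string]string) uint64 {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    h := fnv.New64a()
    h.Write([]byte(name))
    for _, k := range keys {
        h.Write([]byte{0}) // separator to avoid ambiguous concatenations
        h.Write([]byte(k))
        h.Write([]byte{0})
        h.Write([]byte(labels[k]))
    }
    return h.Sum64()
}

func main() {
    id := serieID("CPU", map[string]string{"Host": "Host1", "Zone": "JP"})
    fmt.Println(id) // a stable uint64 ID, playing the role of the 1234 in the slide
}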

Slide 46

Slide 46 text

1234 CPU { “Host” = “Host1” } { “Zone” = “JP” } 1234 12:00 0.5

Slide 47

Slide 47 text

1234 CPU { “Host” = “Host1” } { “Zone” = “JP” } 1234 12:00 0.5 Label Storage Datapoints Storage

Slide 48

Slide 48 text

1234 CPU { “Host” = “Host1” } { “Zone” = “JP” } 1234 12:00 0.5 Optimized For Text Store Optimized For Number Store

Slide 49

Slide 49 text

Simplified Architecture: Metric Input Layer, Datapoints Storage, Label Storage, Metric Query Layer, connected via gRPC

Slide 50

Slide 50 text

Result: We split into separate services with different requirements, so we could optimize each in a different way

Slide 51

Slide 51 text

Technical Decisions
> In-memory base: fit the most recent 28 hours of data “entirely” into memory
> Microservice: split into multiple services for local optimization, with gRPC as the main communication protocol
> Employ the Go language: ease of deployment and good tooling
> Utilize open source: don't try to reinvent all the wheels; combine with other open source to reduce development cost

Slide 52

Slide 52 text

We Want Both Fast Write + Fast Read

Slide 53

Slide 53 text

The Only Way To Achieve This Is To Use DRAM

Slide 54

Slide 54 text

Datapoints Storage Idea: Split into an In-Memory Datapoints Storage and a Persistent Datapoints Storage; Fit the Latest 28 Hours of Data “Entirely” Into Memory

Slide 55

Slide 55 text

Metric Storage: Metric Input Layer, Label Storage, Metric Query Layer, In-Memory Datapoint Storage (28h), Persistent Datapoint Storage

Slide 56

Slide 56 text

In-Memory Datapoint Storage
> Store 28 hours of data entirely in memory
> Use the delta-delta XOR algorithm to compress datapoints
> Vertical scale by hash-based serie ID distribution
> High availability with RocksDB as the log backend and the Raft consensus protocol for replication

Slide 57

Slide 57 text

Delta-Delta XOR Algorithm: CPU datapoints 0.5, 0.6, 0.7, 0.8 at consecutive minutes (12:00, 12:01, 12:02, ...)

Slide 58

Slide 58 text

Delta-Delta XOR Algorithm: values 0.5, 0.6, 0.7, 0.8 > deltas 0.1, 0.1, 0.1 > delta-of-deltas 0, 0

Slide 59

Slide 59 text

Delta-Delta XOR Algorithm: values 0.5, 0.6, 0.7, 0.8 > deltas 0.1, 0.1, 0.1 > delta-of-deltas 0, 0

Slide 60

Slide 60 text

Delta-Delta XOR Algorithm: values 0.5, 0.6, 0.7, 0.8 > deltas 0.1, 0.1, 0.1 > delta-of-deltas 0, 0
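
A small sketch of the idea shown above: regular series compress well because their deltas and delta-of-deltas are mostly identical or zero, and XOR-ing the IEEE-754 bits of similar float values leaves many leading zero bits that can be bit-packed. This is only a conceptual illustration of the Gorilla-style scheme, not Flash's encoder; the helper names are assumptions.

package main

import (
    "fmt"
    "math"
)

// deltaOfDelta turns a sequence of values into first and second differences,
// as in the 0.5, 0.6, 0.7, 0.8 example on the slides.
func deltaOfDelta(values []float64) (deltas, dods []float64) {
    for i := 1; i < len(values); i++ {
        deltas = append(deltas, values[i]-values[i-1])
    }
    for i := 1; i < len(deltas); i++ {
        dods = append(dods, deltas[i]-deltas[i-1])
    }
    return deltas, dods
}

// xorBits shows the float-compression half of the idea: XOR the raw bits of
// consecutive values; close values share the sign, exponent, and leading
// mantissa bits, so the XOR starts with a long run of zeros.
func xorBits(prev, cur float64) uint64 {
    return math.Float64bits(prev) ^ math.Float64bits(cur)
}

func main() {
    deltas, dods := deltaOfDelta([]float64{0.5, 0.6, 0.7, 0.8})
    fmt.Println(deltas) // approximately [0.1 0.1 0.1], up to float rounding
    fmt.Println(dods)   // approximately [0 0], up to float rounding
    fmt.Printf("%064b\n", xorBits(0.6, 0.7)) // XOR with many leading zero bits
}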

Slide 61

Slide 61 text

In-Memory Datapoint Storage
> Store 28 hours of data entirely in memory
> Use the delta-delta XOR algorithm to compress datapoints
> Vertical scale by hash-based serie ID distribution
> High availability with RocksDB as the log backend and the Raft consensus protocol for replication

Slide 62

Slide 62 text

Storage Unit
> Store data by “Shard” unit
> Each shard == 3 physical machines (replication factor 3) with large DRAM + SSD
> Each machine replicates data using “Raft”

Slide 63

Slide 63 text

Data Distributed by Hashing: Client computes Hash(Serie 1) = 1 > Insert(Serie 1) goes to Shard 1; the shard topology comes from CentralDogma
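
A hedged sketch of the routing step in the slide: the client hashes the serie against the shard topology (which Flash fetches from CentralDogma) and sends the insert to the selected shard. The modulo scheme and the type names are illustrative assumptions, not Flash's actual routing code.

package main

import "fmt"

// Shard describes one replication group, e.g. 3 physical machines forming a
// single Raft group as described on the Storage Unit slide.
type Shard struct {
    Name  string
    Nodes []string
}

// pickShard routes a serie to a shard from its hashed ID. A plain modulo is
// used here only for illustration.
func pickShard(serieID uint64, topology []Shard) Shard {
    return topology[serieID%uint64(len(topology))]
}

func main() {
    topology := []Shard{
        {Name: "shard-1", Nodes: []string{"m1", "m2", "m3"}},
        {Name: "shard-2", Nodes: []string{"m4", "m5", "m6"}},
    }
    fmt.Println(pickShard(1234, topology).Name) // serie 1234 lands on shard-1
}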

Slide 64

Slide 64 text

Raft Consensus Algorithm: Client > Leader > Followers (replicated via Raft)

Slide 65

Slide 65 text

RocksDB-Backed Write: Client > Leader, which writes the Raft log to RocksDB and the data to memory
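
A rough sketch of the write path shown above: the leader makes the entry durable in a log (RocksDB-backed in Flash) before applying it to the in-memory storage and acknowledging the client. The LogStore and InMemoryStore interfaces are assumptions standing in for the real components, not Flash's or RocksDB's actual APIs, and Raft replication to followers is omitted.

// LogStore stands in for the RocksDB-backed Raft log.
type LogStore interface {
    Append(entry []byte) error // must be durable before the write is acknowledged
}

// InMemoryStore stands in for the 28-hour in-memory datapoint storage.
type InMemoryStore interface {
    Apply(entry []byte)
}

// Leader sketches the happy path on the Raft leader: log first, then apply.
type Leader struct {
    log LogStore
    mem InMemoryStore
}

// Write persists the entry to the log before touching memory, so a crash
// after the acknowledgement can be recovered by replaying the log.
func (l *Leader) Write(entry []byte) error {
    if err := l.log.Append(entry); err != nil {
        return err // nothing was applied; the client can retry
    }
    l.mem.Apply(entry)
    return nil
}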

Slide 66

Slide 66 text

Result: High Availability in the In-Memory Layer; Easy to Scale Both READ and WRITE

Slide 67

Slide 67 text

Technical Decisions
> In-memory base: fit the most recent 28 hours of data “entirely” into memory
> Microservice: split into multiple services for local optimization, with gRPC as the main communication protocol
> Employ the Go language: ease of deployment and good tooling
> Utilize open source: don't try to reinvent all the wheels; combine with other open source to reduce development cost

Slide 68

Slide 68 text

Go Programming Language
> Simplicity: writing code, deployment (binary)
> Very good tooling for profiling (CPU/memory, ...): very helpful for building memory-intensive applications
> Has GC, but still able to control memory layout

Slide 69

Slide 69 text

Result: Able To Deliver the Result in a Very Short Time

Slide 70

Slide 70 text

Technical Decisions
> In-memory base: fit the most recent 28 hours of data “entirely” into memory
> Microservice: split into multiple services for local optimization, with gRPC as the main communication protocol
> Employ the Go language: ease of deployment and good tooling
> Utilize open source: don't try to reinvent all the wheels; combine with other open source to reduce development cost

Slide 71

Slide 71 text

Metric Storage Metric Input Layer Label Storage Metric Query Layer In Memory Metric Storage (28h) Persistent Metric Storage

Slide 72

Slide 72 text

Result: We Don’t Have To Reinvent “ALL” the Wheels

Slide 73

Slide 73 text

Final words: It was a long road to build this storage

Slide 74

Slide 74 text

It took just 2 months to build the first prototype and beta release

Slide 75

Slide 75 text

But it took a year to release to production

Slide 76

Slide 76 text

It takes time to make a usable thing, but it takes a lot of time to make the best thing

Slide 77

Slide 77 text

Current Achievement Is Not Good Enough: 4 million datapoints written per second, 1,000 queries per second, write P99 < 100ms, read P99 < 100ms 100 10000

Slide 78

Slide 78 text

We Still Have a Lot To Do
> Better performance for the label index layer
> More performance, more reliability, better cost optimization
> Multi-region deployment for disaster recovery
> A new storage-based ecosystem (Alert, On-Call…)

Slide 79

Slide 79 text

Thank You