Deep Dive into LINE's Time Series Database 'Flash'

Xuanhuy Do
LINE Observability Infrastructure Team, Senior Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/B1-1


LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay Deep Dive Into LINE's Time Series Database 'Flash'

    > Xuanhuy Do > LINE Observability Infrastructure Team Senior Software Engineer
  2. Agenda > Observability Introduction > Road to our new TSDB

    Flash > Flash Architecture > Challenges and Lessons Learned
  3. LINE’s Observability Team

  4. In control theory, observability is a measure of how well

    internal states of a system can be inferred from knowledge of its external outputs
  5. Help our engineers to keep their services healthy

  6. Provide Multiple Infrastructures

    > Metrics collecting, alert on issues > Distributed tracing for root-cause analysis > Log collecting, alert on WARN / ERROR
  7. Provide Multiple Infrastructures

    > Metrics collecting, alert on issues > Distributed tracing for root-cause analysis > Log collecting, alert on WARN / ERROR
  9. Metrics Collecting, Alert on Issues

  10. Metrics Are So Important: Trending, Alerting, Anomaly Detection, Root-Cause Analysis, Impact Measurement
  11. Without metrics, you know nothing about your application

  12. Number of Metrics per Minute Over the Years (chart: from ~10M around 2016, to 200M, to 1B by 2019~2020)
  13. So Many Metrics?? Culture Microservice boom Convenient Libraries Business Growth

  14. So Many Metrics?? 1,000 Servers × 10,000 Metrics Each = 10,000,000 Metrics Total
  15. Our Metrics Platform: Agents push metrics → Intake Layer → Metric Storage → Graph UI / Alert Setting
  16. Our Metrics Platform: focus on the Metric Storage layer
  17. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  18. MySQL Era: OOM, Latency (Swap), Disk Full, Complex Sharding, No Tag Support
  19. MySQL Era: problems covered by manual operation

  20. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  21. OpenTSDB Era: Complex Modules, Latency (Cache Miss), Latency (GC), Some Query Patterns Are Slow
  22. OpenTSDB Era: problems covered by manual operation

  23. We Decided To Build Our Own Storage!

  24. Our Metric Storage History: 2011 MySQL → 2012 Sharding MySQL → 2013~2016 Super big sharding MySQL → 2017 OpenTSDB → 2018 Start developing new storage → 2019 Release new storage
  25. Why Not Use Existing OSS?

    > InfluxDB (EE): does not satisfy expected write throughput, and high cost > ClickHouse: does not satisfy expected read throughput > TimescaleDB: does not satisfy expected read throughput > HeroicDB: does not satisfy expected write throughput > FB Beringei: in-memory module only, no tag-based queries > Netflix Atlas: in-memory module only > M3DB: unstable and undocumented (a year ago)
  26. Why Not Use a Vendor Solution? Usage cost, impossible to customize, migration cost
  27. What We Needed: Polyglot Protocol, Low Cost / Easy Maintenance, Massive Scale-Out for Write/Read
  28. What We Needed

    > Polyglot Protocol: users can push with any client protocol - HTTP - UDP - Prometheus Remote Write/Read
  29. What We Needed

    > Massive Scale-Out for Write/Read. How massive? - Scale out to serve up to tens of billions of metrics - Low latency (P99 < 100ms) for read/write
  30. What We Needed

    > Low Cost / Easy Maintenance - Binary deployment - Limit the number of modules to maintain
  31. What We Achieved: 4 million datapoint writes per second, 1,000 queries per second, Write P99 < 100ms, Read P99 < 100ms
  32. How We Did That!?

  33. What Is a Metrics Storage, or a Time Series Database (TSDB)?

  34. Glossary CPU

  35. Glossary CPU { “Host” = “Host1” } { “Zone” =

    “JP” } Labels
  36. Glossary 12:00 12:01 12:02 CPU 0.5 CPU 1.0 CPU 0.7

    Datapoint
  37. Glossary 12:00 12:01 12:02 Serie

  38. TSDB Storage

    type Serie struct {
        Id         uint64
        Label      map[string]string
        Datapoints []Datapoint
    }

    type Datapoint struct {
        Value     float64
        Timestamp int64
    }

    type Storage interface {
        Put(Series []Serie)
        QueryRange(Label map[string]string, fromTs int64, toTs int64) []Datapoint
    }
  40. How We Achieve This Scale? The same Storage interface, driven at 4,000,000 Series per second (Put) and 1,000 QueryRange calls per second
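The Storage interface on the slide is small enough to exercise with a toy implementation. A minimal sketch, assuming a naive slice-backed store with a linear scan and exact-match label filtering (the `memStorage` type is an illustrative invention, not Flash's actual design, which shards and compresses data):

```go
package main

import "fmt"

// Datapoint and Serie mirror the slide's definitions.
type Datapoint struct {
	Value     float64
	Timestamp int64
}

type Serie struct {
	Id         uint64
	Label      map[string]string
	Datapoints []Datapoint
}

// memStorage is a toy, in-memory implementation of the slide's Storage interface.
type memStorage struct {
	series []Serie
}

func (s *memStorage) Put(series []Serie) {
	s.series = append(s.series, series...)
}

// QueryRange returns the datapoints of every series whose labels contain
// all queried label pairs, filtered to the [fromTs, toTs] window.
func (s *memStorage) QueryRange(label map[string]string, fromTs, toTs int64) []Datapoint {
	var out []Datapoint
	for _, serie := range s.series {
		matched := true
		for k, v := range label {
			if serie.Label[k] != v {
				matched = false
				break
			}
		}
		if !matched {
			continue
		}
		for _, dp := range serie.Datapoints {
			if dp.Timestamp >= fromTs && dp.Timestamp <= toTs {
				out = append(out, dp)
			}
		}
	}
	return out
}

func main() {
	st := &memStorage{}
	st.Put([]Serie{{
		Id:    1234,
		Label: map[string]string{"name": "CPU", "Host": "Host1", "Zone": "JP"},
		Datapoints: []Datapoint{
			{Value: 0.5, Timestamp: 720}, // 12:00
			{Value: 1.0, Timestamp: 721}, // 12:01
			{Value: 0.7, Timestamp: 722}, // 12:02
		},
	}})
	got := st.QueryRange(map[string]string{"Host": "Host1"}, 720, 721)
	fmt.Println(len(got)) // prints 2: only 12:00 and 12:01 fall inside the range
}
```

A scan like this is O(series × datapoints) per query; the point of the talk's dedicated label and datapoint storages is precisely to avoid it at scale.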
  41. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  43. Microservice-Based Architecture. Idea: split the Label Storage service and the Datapoints Storage service
  44. CPU { “Host” = “Host1” } { “Zone” = “JP”

    } 12:00 0.5
  45. CPU { “Host” = “Host1” } { “Zone” = “JP”

    } Hash Uint64: 1234 Serie ID
  46. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5
  47. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5 Label Storage Datapoints Storage
  48. 1234 CPU { “Host” = “Host1” } { “Zone” =

    “JP” } 1234 12:00 0.5 Optimized For Text Store Optimized For Number Store
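The hashing step in the slides above (a metric name plus its labels mapped to a uint64 Serie ID such as 1234) can be sketched as follows. The FNV-1a hash and the sorted "key=value" encoding are assumptions for illustration; the talk does not name the actual hash function:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// serieID derives a stable uint64 series ID from a metric name and its
// label set. Keys are sorted so that label iteration order cannot change
// the resulting ID.
func serieID(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{'='})
		h.Write([]byte(labels[k]))
		h.Write([]byte{';'})
	}
	return h.Sum64()
}

func main() {
	a := serieID("CPU", map[string]string{"Host": "Host1", "Zone": "JP"})
	b := serieID("CPU", map[string]string{"Zone": "JP", "Host": "Host1"})
	fmt.Println(a == b) // true: label order does not matter
}
```

Once both storages key everything by this ID, the text-heavy label data and the number-heavy datapoint data can be stored and optimized independently, exactly as the slide suggests.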
  49. Simplified Architecture: Metric Input Layer, Datapoints Storage, Label Storage, and Metric Query Layer, communicating over gRPC
  50. Result: we split off services with different requirements, so we can optimize each in a different way
  51. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  52. We Want Both Fast Write + Fast Read

  53. The Only Way to Achieve That Is to Use DRAM

  54. Datapoints Storage. Idea: split into an in-memory Datapoints Storage and a persistent Datapoints Storage; fit the latest 28 hours of data “entirely” into memory
  55. Metric Storage: Metric Input Layer, Label Storage, Metric Query Layer, In-Memory Datapoint Storage (28h), Persistent Datapoint Storage
  56. In-Memory Datapoint Storage

    > Store 28 hours of data entirely in memory > Use the delta-delta XOR algorithm to compress datapoints > Scale by hash-based series-ID distribution > High availability with RocksDB as the log backend and the Raft consensus protocol for replication
  57. Delta-Delta XOR Algorithm: 12:00 CPU 0.5, 12:01 CPU 0.6, 12:02 CPU 0.7, 12:03 CPU 0.8
  58. Delta-Delta XOR Algorithm: values 0.5 0.6 0.7 0.8; first deltas 0.1 0.1 0.1; delta-of-deltas 0 0
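The three number rows on the slide (values, first-order deltas, second-order deltas) can be sketched in Go. One note of hedging: the slide demonstrates deltas on the sample values, but in the Gorilla-style scheme this kind of storage typically follows, delta-of-delta is applied to timestamps and XOR to the float values; both quantities are computed below, and the function names are illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// deltaOfDelta returns the second-order timestamp differences that an
// encoder would bit-pack: for regular scrape intervals they are zero
// and compress down to almost nothing.
func deltaOfDelta(ts []int64) []int64 {
	out := make([]int64, 0, len(ts))
	for i := 2; i < len(ts); i++ {
		out = append(out, (ts[i]-ts[i-1])-(ts[i-1]-ts[i-2]))
	}
	return out
}

// xorBits returns the XOR of adjacent float64 bit patterns; similar
// values share most of their bits, so each XOR has long runs of zeros.
func xorBits(vals []float64) []uint64 {
	out := make([]uint64, 0, len(vals))
	for i := 1; i < len(vals); i++ {
		out = append(out, math.Float64bits(vals[i])^math.Float64bits(vals[i-1]))
	}
	return out
}

func main() {
	// Timestamps one minute apart: 12:00, 12:01, 12:02, 12:03 (as seconds).
	fmt.Println(deltaOfDelta([]int64{43200, 43260, 43320, 43380})) // [0 0]
	// A repeated value XORs to exactly zero bits.
	fmt.Println(xorBits([]float64{0.5, 0.5, 0.7})[0]) // 0
}
```

A real encoder then writes these residuals with variable-length bit patterns, which is how 28 hours of datapoints fit "entirely" into DRAM.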
  61. In-Memory Datapoint Storage

    > Store 28 hours of data entirely in memory > Use the delta-delta XOR algorithm to compress datapoints > Scale by hash-based series-ID distribution > High availability with RocksDB as the log backend and the Raft consensus protocol for replication
  62. Storage Unit > Store data by “Shard” unit > Each shard = 3 physical machines (replication factor 3) with large DRAM + SSD > Each machine replicates data using “Raft”
  63. Data Distributed by Hashing: the client computes Hash(Serie 1) = 1 and sends Insert(Serie 1) to Shard 1; the shard topology is kept in Central Dogma
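The hash-based distribution in this slide can be sketched as a shard-picking function. The plain modulo below is an assumption for illustration; production systems often prefer consistent or jump hashing so that changing the shard count moves less data:

```go
package main

import "fmt"

// pickShard maps a series ID to one of numShards shard indexes.
// Because the series ID is already a hash, a modulo spreads the
// series roughly evenly across shards.
func pickShard(serieID uint64, numShards int) int {
	return int(serieID % uint64(numShards))
}

func main() {
	// Hash(Serie 1) = 1 in the slide's example; with 2 shards the
	// series lands on shard index 1 (i.e. "Shard 2" if counting from 1).
	fmt.Println(pickShard(1, 2)) // 1
}
```

In the talk's design, the client consults the shard topology (stored in Central Dogma) to resolve the chosen shard index to its 3-machine Raft group.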
  64. Raft Consensus Algorithm: Client → Leader → Followers, replicated via Raft
  65. RocksDB-Backed Write: the leader persists the Raft log to RocksDB before the data is applied in memory
  66. Result: high availability in the memory layer; easy to scale both READ and WRITE
  67. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  68. Go Programming Language

    > Simplicity: writing code and deployment (single binary) > Very good tooling for profiling (CPU/memory, …): very helpful when building memory-intensive applications > Has a GC, but still allows control over memory layout
  69. Result: able to deliver in a very short time

  70. Technical Decisions

    > In-memory based: fit the most recent 28 hours of data “entirely” into memory > Microservice: split into multiple services for local optimization > gRPC as the main communication protocol > Go language: ease of deployment and good tooling > Utilize open source: don’t try to reinvent all the wheels; combine with other open source projects to reduce development cost
  71. Metric Storage: Metric Input Layer, Label Storage, Metric Query Layer, In-Memory Metric Storage (28h), Persistent Metric Storage
  72. Result: we don’t have to reinvent “ALL” the wheels

  73. Final words: we traveled a long road to build this storage
  74. It took just two months to build the first prototype and beta release
  75. But it took a year to release it to production

  76. It takes time to make a usable thing, but it takes a lot more time to make the best thing
  77. Current Achievement Is Not Good Enough: 4 million datapoint writes per second, 1,000 queries per second, Write P99 < 100ms, Read P99 < 100ms
  78. We Still Have a Lot To Do

    > Better performance for the label index layer > More performance, more reliability, better cost optimization > Multi-region deployment for disaster recovery > A new storage-based ecosystem (Alert, On-Call…)
  79. Thank You