InfluxDB & LevelDB Inside-out

This talk is about time series databases for storing metrics. Part 1: InfluxDB. Part 2: LevelDB.

InfluxDB is under very active development (NOT production ready). Discard these slides once the information is out of date.

InfluxDB Part1.

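InfluxDB is a time series database built for storing metrics
• Developed by Errplane, a startup funded by Y Combinator
• Developed as the backend database for launching a monitoring and alerting SaaS
• Attracting attention for its distinctive vision and features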

(from https://angel.co/errplane)

The distinctive features are introduced in order:
• HTTP API
• Powerful query language
• Built-in explorer

Operations and queries over the HTTP API
The database can be operated and queried directly from the browser's JavaScript interpreter.
# Create a database
$ curl -sL -w "%{http_code}\n" \
    -X POST \
    'http://db:8086/db?u=root&p=root' \
    -d '{"name": "test1"}'
201
A 2xx response code means success.
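The same request can be assembled with Python's standard library. This is only a minimal sketch: `build_create_db_request` is a hypothetical helper, the host `db`, port 8086 and the root/root credentials are copied from the curl example above, and the request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_create_db_request(base_url, user, password, name):
    """Build (but do not send) the POST request that creates a database."""
    query = urllib.parse.urlencode({"u": user, "p": password})
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/db?{query}", data=body, method="POST"
    )


req = build_create_db_request("http://db:8086", "root", "root", "test1")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req).status (2xx means success)
```

Keeping construction separate from sending makes it easy to inspect the URL and body before touching a real server.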

# Get list of databases
$ curl -s \
    'http://db:8086/dbs?u=root&p=root' \
    | python -mjson.tool
[
    { "name": "test1" },
    { "name": "dummy" }
]
* NOTE: 2013-11-20 Undocumented feature
Results are returned as JSON in the response body.

# Add database user
$ curl -sL -w "%{http_code}\n" \
    -X POST \
    'http://db:8086/db/test1/users?u=root&p=root' \
    -d '{"username": "smly", "password": "yeah"}'
200
# Get list of database users
$ curl -s \
    'http://db:8086/db/test1/users?u=root&p=root' \
    | python -mjson.tool
[
    { "username": "smly" }
]

# Write points
$ curl -sL -w "%{http_code}\n" \
    -X POST \
    'http://db:8086/db/test1/series?u=smly&p=yeah' \
    -d '[{
      "name": "events",
      "columns": ["uid", "action", "item", "price"],
      "points": [
        ["194016", "BUY", "watch", 19999],
        ["194016", "CLICK", "watch", 19999]
      ]
    },{
      "name": "error",
      "columns": ["uid", "action", "errorName"],
      "points": [
        ["194016", "BUY", "NullPointException"]
      ]
    }]'
200
u and p in the URL: database username/password. "name": series name (like MySQL's table).

# Write points
$ curl -sL -w "%{http_code}\n" -X POST \
    'http://db:8086/db/test1/series?u=smly&p=yeah' \
    -d '[{
      "name": "events",
      "columns": ["uid", "action", "item", "price"],
      "points": [
        ["5485", "CLICK", "watch", 19999]
      ]
    },{
      "name": "events",
      "columns": ["uid", "action"],
      "points": [
        ["5485", "LEAVE"]
      ]
    }]'
200
Schema-less!! The two entries write to the same series with different column sets.
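The payload shape used above can be generated from ordinary dicts. The sketch below assumes a hypothetical helper `series_payload` that groups rows by their column set, so rows with different columns in the same series end up in separate entries, as in the schema-less example.

```python
import json


def series_payload(name, rows):
    """Group dict rows by their column set and emit one entry per set."""
    grouped = {}
    for row in rows:
        columns = tuple(row)  # dicts keep insertion order
        grouped.setdefault(columns, []).append([row[c] for c in columns])
    return [
        {"name": name, "columns": list(columns), "points": points}
        for columns, points in grouped.items()
    ]


payload = series_payload("events", [
    {"uid": "5485", "action": "CLICK", "item": "watch", "price": 19999},
    {"uid": "5485", "action": "LEAVE"},
])
print(json.dumps(payload, indent=2))
```

The printed JSON could be passed as the `-d` body of the curl call.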

A "time" column is added to every point automatically. It can also be specified explicitly.
# Write points
$ curl -sL -w "%{http_code}\n" -X POST \
    'http://db:8086/db/test1/series?u=smly&p=yeah&time_precision=s' \
    -d '[{
      "name": "events",
      "columns": ["time", "uid", "action"],
      "points": [
        [1385390220, "5485", "LEAVE"]
      ]
    }]'
200
The precision of the time value must be specified: seconds (s), milliseconds (m) or microseconds (u).
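To make the three precisions concrete, here is a small sketch that converts a timezone-aware datetime into the integer expected for each `time_precision` value. `to_precision` and `PRECISION_FACTOR` are hypothetical names used only for illustration.

```python
import calendar
from datetime import datetime, timezone

# Units per second for each time_precision value named above.
PRECISION_FACTOR = {"s": 1, "m": 1_000, "u": 1_000_000}


def to_precision(dt, precision):
    """Convert an aware datetime to an integer timestamp in the given precision."""
    seconds = calendar.timegm(dt.utctimetuple())  # integer seconds since the epoch
    micros = seconds * 1_000_000 + dt.microsecond
    return micros * PRECISION_FACTOR[precision] // 1_000_000


when = datetime(2013, 11, 25, 14, 37, 0, tzinfo=timezone.utc)
print(to_precision(when, "s"))  # 1385390220, the value used in the example above
print(to_precision(when, "m"))
print(to_precision(when, "u"))
```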

Gist: https://gist.github.com/smly/7711109 (requires influxdb-python)
• Generate and write 1 point every 10 seconds.
• Total points: 6 * 60 * 24 * 30 = 259,200 (30 days)
$ git clone gist.github.com/7711109
$ cd 7711109
$ python write.py
...........................done

Time series data can be handled with an SQL-like query language
Supports many useful aggregation/summarization functions!!
$ python
> from influxdb import InfluxDBClient
> c = InfluxDBClient('root', 'root', 'smly', 'secret', 'testdb')
# Get points in last hour
> c.query("""
    SELECT * FROM markov
  """)

# Get points in recent 10 minutes
> c.query("""
    SELECT * FROM markov
    WHERE time > now() - 10m
  """)
[{
  "name": "foo",
  "columns": [ "time", "sequence_number", "value" ],
  "points": [
    [1385399228, 369291, 100],
    [1385399218, 369292, 100.677049],
    ...
  ]
}]
now() and a time literal select the data from the last 10 minutes.
The `time` and `sequence_number` columns are special built-in columns.

# Get mean value grouped by 5 minutes
> c.query("""
    SELECT MEAN(value) FROM markov
    GROUP BY time(5m)
    WHERE time > now() - 1h
  """)
[{
  "name": "foo",
  "columns": [ "time", "sequence_number", "mean" ],
  "points": [
    [1385399100, 1, 103.01160280567923],
    [1385398800, 1, 102.77647754870585],
    ...
  ]
}, {…}]
The query groups by time in 5-minute buckets and computes the mean of value. Various other functions found in SQL, such as COUNT and DISTINCT, are also implemented.
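What GROUP BY time(5m) with MEAN computes can be emulated in a few lines. This is a sketch of the semantics only, not of how InfluxDB evaluates the query; `mean_by_time_bucket` is a hypothetical name, and the sample values are made up.

```python
from collections import defaultdict


def mean_by_time_bucket(points, bucket_seconds):
    """Return [(bucket_start, mean)] with the newest bucket first."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items(), reverse=True)
    ]


points = [(1385399228, 100.0), (1385399218, 102.0), (1385398990, 104.0)]
print(mean_by_time_bucket(points, 300))
# [(1385399100, 101.0), (1385398800, 104.0)]
```

The bucket starts 1385399100 and 1385398800 are the same 5-minute boundaries that appear in the query result above.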

Regular expressions and selection of multiple series are supported
Because InfluxDB is schema-less, it can select from multiple series at once, which SQL cannot do (when no time is specified, the last hour is selected).
SELECT * FROM events, errors;
SELECT * FROM /stats\..*/i;
SELECT * FROM events WHERE state == 'NY';
SELECT * FROM log_lines WHERE line =~ /error/i;
SELECT * FROM response_times WHERE value > 500;
SELECT * FROM nagios_checks WHERE status != 0;
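The regex form of FROM can be pictured as filtering the list of series names. The sketch below uses Python's `re` module; the helper `select_series`, the sample series names, and the choice of unanchored matching are assumptions for illustration, not a description of InfluxDB's matcher.

```python
import re


def select_series(series_names, pattern, ignore_case=False):
    """Return the series names matched by a /pattern/ or /pattern/i selector."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [name for name in series_names if regex.search(name)]


names = ["stats.cpu", "Stats.mem", "events", "errors"]
print(select_series(names, r"stats\..*", ignore_case=True))
# ['stats.cpu', 'Stats.mem']
```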

Issue queries and visualize the results from the built-in explorer (from http://obfuscurity.com/2013/11/My-Impressions-of-InfluxDB)

Handy, but I have not verified it
• InfluxDB v0.3.0
• Chromium 30.0.1599.114
• Mozilla Firefox 25.0.1

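It looks like scheduled execution will become possible, so periodic deletion will not need a cron job.
It also looks like queries issued from a dashboard will be able to run as a user that has only read permission.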

Distributed coordination with Raft
These are impressions from reading PR #20 as of 2013-11-30. It has not been merged into master yet, so it may change.
• Data is distributed by a RingLocation value obtained by hashing "database, series, timestamp"
  - Doesn't partitioning by timestamp hurt efficiency?
• Servers that share the same RingLocation value coordinate and replicate using Raft
  - ref: http://raftconsensus.github.io/
  - ref: https://github.com/influxdb/influxdb/pull/20
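The idea of hashing "database, series, timestamp" to a ring position can be sketched as follows. Everything concrete here is an assumption made for illustration: the SHA-1 hash, the ring size of 16, the one-hour time window and the name `ring_location` do not come from the pull request.

```python
import hashlib


def ring_location(database, series, timestamp, ring_size=16, window_seconds=3600):
    """Map (database, series, time window) to a ring slot in [0, ring_size)."""
    # Assumption: timestamps are rounded down to a window before hashing.
    window = timestamp - timestamp % window_seconds
    key = f"{database}\x00{series}\x00{window}".encode("utf-8")
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % ring_size


print(ring_location("test1", "events", 1385390220))
print(ring_location("test1", "events", 1385390221))  # same window, same slot
```

With a scheme like this, a query over a long time range touches many windows and therefore potentially many ring locations, which is the efficiency concern raised in the bullet above.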

[Architecture diagram: HTTP API (port 8086), Web UI (port 8083), Query Engine, Coordinator (Raft), Storage Engine, Storage Engine, …]

A Sinatra style pattern muxer for Go's net/http (https://github.com/bmizerany/pat.go)

HTTPServer implements a registerEndpoint method that wraps pat.go's Get, Post and Del.

Flex (Fast lexical analyzer generator) + YACC (Compiler-Compiler) + Bison (CFG parser generator; LALR)
Tokens: 1. DURATION, 2. REGEX_STRING, 3. INSENSITIVE_REGEX_STRING

A struct is defined and registered in a hash at initialization (nothing interesting here, so omitted).

Uses Levigo, a wrapper around LevelDB (*1).
The LRUCache size and the Bloom filter bits size, which affect performance, are fixed values (this looks provisional).
*1: The developer has stated that this "may change". "Introduction to InfluxDB by Paul Dix" http://g33ktalk.com/introduction-to-influxdb

Only one LevelDB data store is created. Two kinds of key:value pairs are stored in it (not exact):
Key/value pair for managing Column / Column ID
  Key: "[PREFIX]~[DB]~[SERIES]~[COLUMN]", Value: ColumnID
Key/value pair for managing Point / Data
  Key: [TIMESTAMP][SEQ_NUM][ColumnID], Value: Data

Storage example
[{
  "name": "events",
  "columns": ["uid", "action", "item", "price"],
  "points": [
    ["5485", "CLICK", "watch", 19999]
  ]
},{
  "name": "events",
  "columns": ["uid", "action"],
  "points": [
    ["5485", "LEAVE"]
  ]
}]
The data above is recorded as follows (not exact).
Column/ColumnID key:value
  [PREFIX]~test1~events~uid: 1
  [PREFIX]~test1~events~action: 2
  [PREFIX]~test1~events~item: 3
  [PREFIX]~test1~events~price: 4
Point/Data key:value
  1384951400_0000001_1: "5485"
  1384951400_0000001_2: "CLICK"
  1384951400_0000001_3: "watch"
  1384951400_0000001_4: "19999"
  1384951400_0000001_5: "5485"
  1384951400_0000001_6: "LEAVE"

key:value for Column/ColumnID
A unique ID is GET/PUT for each combination of DB, series and column. The key is built as [db]~[series]~[column].

key:value for Point/Data
A point is keyed by timestamp, seqNum and column ID. Two 8-byte buffers are created and concatenated to build the key.
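Why fixed-width binary fields are a good fit for such keys can be shown with `struct`. This is a sketch under assumptions: the slide only says that 8-byte buffers are concatenated, so packing all three fields as 8-byte big-endian unsigned integers, and the name `point_key`, are illustrative choices rather than the actual layout.

```python
import struct


def point_key(timestamp, seq_num, column_id):
    """Concatenate three 8-byte big-endian fields into one sortable key."""
    return struct.pack(">QQQ", timestamp, seq_num, column_id)


keys = [
    point_key(1384951401, 1, 1),
    point_key(1384951400, 2, 1),
    point_key(1384951400, 1, 2),
]
# Bytewise order equals numeric order of (timestamp, seq_num, column_id).
for key in sorted(keys):
    print(struct.unpack(">QQQ", key))
```

Because big-endian integers compare bytewise in numeric order, a store that keeps keys sorted also keeps the points in time order.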

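• With the HTTP API and Web UI, it can be used from a browser right away
• An SQL-like query language with handy extensions is available
• Combined with statsd, collectd and the like, a dashboard could be built with HTML, CSS and JavaScript alone (someone please build something like Kibana) (ref: https://github.com/obfuscurity/tasseo)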

LevelDB Part2.

LevelDB features
A database library optimized for write performance that adopts an LSM-tree
• Fast
• Key-value storage rather than a relational data model
• Keys and values are arbitrary byte sequences
• Keys are stored in sorted order
• Supports forward/backward iteration
Paper: "The Log-Structured Merge-Tree (LSM-Tree)" http://www.cs.umb.edu/~poneil/lsmtree.pdf

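LSM-tree data structure
By cascading B-trees, an index can be maintained in real time at low cost even when inserts are frequent. It consists of a collection of trees with bounded depth, ranging from a small tree to exponentially larger ones.
[Figure: C0 in memory; C1, C2, C3 on disk, each larger than the last]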

A lookup may have to search several trees (C0, C1, C2, …), so it is at a disadvantage compared with a B-tree.
[Figure: C0 in memory; C1, C2, C3 on disk]

Range queries are the same: they may have to search several trees (C0, C1, C2, …), so they are at a disadvantage compared with a B-tree.
[Figure: one iterator (Iter) on each of C0, C1, C2, C3]
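A range query over several sorted levels amounts to merging one iterator per level. The sketch below uses `heapq.merge` over plain dicts standing in for the trees; `range_scan` is a hypothetical name and the rule that the newest level wins on duplicate keys is the usual LSM convention, assumed here.

```python
import heapq


def range_scan(levels, start, end):
    """Yield (key, value) for start <= key < end; levels[0] is the newest."""
    def tagged(level_no, level):
        for key in sorted(level):
            if start <= key < end:
                yield key, level_no, level[key]

    # One sorted iterator per level, merged by (key, level number).
    merged = heapq.merge(*(tagged(i, lvl) for i, lvl in enumerate(levels)))
    last_key = None
    for key, _, value in merged:
        if key != last_key:  # the first hit comes from the newest level
            yield key, value
            last_key = key


levels = [{"b": "new"}, {"a": 1, "b": "old", "c": 3}]
print(list(range_scan(levels, "a", "c")))
# [('a', 1), ('b', 'new')]
```

Every level contributes an iterator even when it holds no matching keys, which is the extra cost compared with a single B-tree.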

INSERT: when a tree grows to its size limit, it is merged into the tree at the next level. Using trees that are not deep reduces the cost of balancing.
[Figure sequence (C0 in memory; C1, C2, C3 on disk): INSERT; INSERT (size limit reached); Merge; INSERT]
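The insert-then-merge cycle can be simulated with a toy model. This sketch deliberately replaces the trees with Python dicts and uses made-up size limits (2, 8, 32, 128 keys); `ToyLSM` is a hypothetical name and the model ignores disk I/O entirely.

```python
class ToyLSM:
    """Levels C0, C1, ... as dicts; level i holds at most base * ratio**i keys."""

    def __init__(self, base=2, ratio=4, depth=4):
        self.limits = [base * ratio ** i for i in range(depth)]
        self.levels = [{} for _ in range(depth)]

    def put(self, key, value):
        self.levels[0][key] = value
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) <= self.limits[i]:
                break
            # Merge the overfull level into the next one; newer values win.
            self.levels[i + 1].update(self.levels[i])
            self.levels[i] = {}

    def get(self, key):
        for level in self.levels:  # newest level first
            if key in level:
                return level[key]
        return None


db = ToyLSM()
for i in range(10):
    db.put(i, i * i)
print([len(level) for level in db.levels])  # [1, 0, 9, 0]
print(db.get(3), db.get(9), db.get(42))     # 9 81 None
```

Inserts only ever touch C0; the cost of moving data to larger levels is paid in batches when a level overflows.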

Update: the node hit by a point search is updated.
[Figure: C0 in memory; C1, C2, C3 on disk]

Delete: at request time only a deletion flag is set. The node is actually removed at merge time, which reduces the cost of balancing.
[Figure sequence (C0 in memory; C1, C2, C3 on disk): DELETE sets the deletion flag; Merge removes the node]
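Flag-then-remove deletion can be sketched with a tombstone marker. The names `TOMBSTONE`, `delete` and `merge`, and the `is_last_level` parameter, are assumptions for illustration; the sketch drops a flagged key only when merging into the last level, because an older level could otherwise still expose the deleted value.

```python
TOMBSTONE = object()  # marker stored in place of a deleted value


def delete(newer, key):
    """A delete only records a flag; nothing is removed yet."""
    newer[key] = TOMBSTONE


def merge(newer, older, is_last_level):
    """Merge newer into older; drop flagged keys once no older level remains."""
    merged = {**older, **newer}  # entries from the newer level win
    if is_last_level:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return merged


c0 = {"a": 1}
c1 = {"b": 2, "c": 3}
delete(c0, "b")
print(c0["b"] is TOMBSTONE)             # True: only flagged so far
print(merge(c0, c1, is_last_level=True))  # {'c': 3, 'a': 1}
```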

Even though it is an LSM-tree, LevelDB uses a LockFreeSkipList (aka ConcurrentSkipList; CSL) for C0.
[Figure: C0 in memory marked with "?"; C1, C2, C3 on disk]

A skip list is a linked list with shortcuts provided at each level for traversing it. For details, see Chapter 14 of The Art of Multiprocessor Programming.
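The shortcut structure is easiest to see in code. Below is a minimal single-threaded skip list, written as a sketch: it is not lock-free and is not LevelDB's implementation, and the maximum height of 12 and promotion probability of 0.25 are illustrative parameters.

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height  # next[i] is the successor on level i


class SkipList:
    MAX_HEIGHT = 12

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.25:
            height += 1
        return height

    def _predecessors(self, key):
        """Walk from the top level down, recording the last node before key."""
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]  # follow the shortcut on this level
            preds[level] = node
        return preds

    def insert(self, key, value):
        preds = self._predecessors(key)
        found = preds[0].next[0]
        if found is not None and found.key == key:
            found.value = value  # overwrite an existing key
            return
        node = Node(key, value, self._random_height())
        for level in range(len(node.next)):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def get(self, key):
        node = self._predecessors(key)[0].next[0]
        return node.value if node is not None and node.key == key else None

    def __iter__(self):
        node = self.head.next[0]
        while node is not None:  # level 0 is a plain sorted linked list
            yield node.key, node.value
            node = node.next[0]


sl = SkipList()
for k in [5, 1, 3]:
    sl.insert(k, str(k))
print(list(sl))  # [(1, '1'), (3, '3'), (5, '5')]
```

Note that an insert only rewires a handful of `next` pointers and never rebalances anything, and that iterating level 0 yields the keys in sorted order.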

Why a SkipList in an LSM-tree?
"I still don't quite understand why they decided to use this (SkipList) in an LSM-tree. Is there an explanation somewhere?"
("LevelDB の雑感 – くまメモ" http://d.hatena.ne.jp/kumagi/20111201/1322735619)
A contributor gives (1) sorting and (2) concurrency as the reasons for not using a hash for the C0 (memtable) of the LSM-tree. http://t1825.db-leveldb.dbtalk.us/a-question-about-leveldb-s-skip-list-t1825.html
In more detail:

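(1) Sorting: a hash is not sorted, so it is rejected
Without sorted order, it is difficult to support an Iterator. Therefore a hash is rejected.
[Figure: an iterator (Iter) over C0 in memory]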

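What the LSM-tree paper assumes for C0
• For the C0 tree it assumes something like an AVL-tree rather than a B-tree-like structure
• Because this tree always resides in memory, its nodes do not need to be the size of a disk page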

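(2) Concurrency: SkipLists perform well under concurrency
Multi-core and multi-threaded execution is now the norm, and the LockFreeSkipList has started to attract attention.
Structures such as AVL-trees require rebalancing to keep their depth logarithmic. In concurrent programs, rebalancing can create bottlenecks and contention.
(The Art of Multiprocessor Programming, "14. Skiplists and Balanced Search")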

SkipLists perform well under concurrency (continued)
"Good high contention performance, comparable single-thread performance. In the multithreaded case (12 workers), CSL tested 10x faster than a RWSpinLocked std::set for an averaged sized list (1K - 1M nodes)."
https://github.com/facebook/folly/blob/master/folly/ConcurrentSkipList.h
According to folly, it is "10x faster in the multithreaded case".

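Conclusion on the SkipList
Iterator support is wanted, so a sorted data structure is needed. Taking concurrency into account, a LockFreeSkipList appears to be a reasonable choice in terms of performance.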

LevelDB tuning in Riak
• This is an old story. LevelDB 1.2 was released on 2011-05-16
• Increase the on-disk size to reduce file opens
• Introduce a Bloom filter to raise the baseline throughput
Ref: http://basho.com/leveldb-in-riak-1-2/

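Making lookups more efficient with a Bloom filter
A search can be skipped by an O(1) Bloom filter operation (pruning).
[Figure: C0 in memory; C1, C2, C3 on disk; for two of the trees the filter decides in O(1) that no search is needed]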

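Efficiency through a Bloom filter
A Bloom filter tolerates wasted trips but never misses a key, which makes it well suited to pruning an LSM-tree.

                        Filter answers "key exists"               Filter answers "key absent"
Truth: key exists       True positive (correct)                   False negative = 0 (never misses)
Truth: key absent       False positive != 0 (wasted trip allowed) True negative (correct)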
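The asymmetry in the table can be reproduced with a small Bloom filter. This is a sketch under assumptions: the bit count, the number of hashes, the use of SHA-256 and the names `BloomFilter` and `lookup` are illustrative and unrelated to LevelDB's own filter policy.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # A fixed number of hash probes, independent of how many keys are stored.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True may be a false positive."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def lookup(levels, key):
    """levels: (BloomFilter, dict) pairs, newest first. Returns (value, trees searched)."""
    searched = 0
    for bloom, table in levels:
        if not bloom.might_contain(key):
            continue  # pruned: this tree cannot hold the key
        searched += 1
        if key in table:
            return table[key], searched
    return None, searched


older = BloomFilter()
older.add(b"uid")
levels = [(BloomFilter(), {}), (older, {b"uid": 1})]
print(lookup(levels, b"uid"))  # (1, 1): the empty newer level was skipped
```

A key that was added always tests positive, so pruning never hides data; a false positive only costs one unnecessary search.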

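• InfluxDB makes dashboard development easy. It is a time series DB from which a browser can fetch data with SQL-like queries.
• Storing metrics is expected to involve a high rate of inserts, so using the LSM-tree based LevelDB seems reasonable.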
