Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
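To make predicate and projection pushdown concrete, here is a minimal sketch in plain Python. It does not use the Parquet library: the row-group dictionaries, the `make_row_group` and `scan` helpers, and the sample data are all hypothetical stand-ins. The idea it illustrates is that per-row-group min/max statistics let a reader skip whole row groups (predicate pushdown) and that only the requested columns are touched (projection pushdown).

```python
# Hypothetical in-memory stand-in for a columnar file: a list of row groups,
# each holding one list of values per column plus min/max statistics.

def make_row_group(columns):
    """Build a row group and precompute (min, max) statistics per column."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, projection, column, lower_bound):
    """Return projected rows where `column` >= lower_bound.

    Row groups whose max statistic is below the bound are skipped without
    reading any values; columns outside `projection` are never read.
    """
    rows = []
    skipped = 0
    for rg in row_groups:
        _, col_max = rg["stats"][column]
        if col_max < lower_bound:
            skipped += 1  # predicate pushdown: the whole row group is pruned
            continue
        filter_values = rg["columns"][column]
        projected = [rg["columns"][name] for name in projection]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                rows.append(tuple(col[i] for col in projected))
    return rows, skipped


row_groups = [
    make_row_group({"user": ["a", "b"], "clicks": [1, 3], "country": ["FR", "US"]}),
    make_row_group({"user": ["c", "d"], "clicks": [10, 12], "country": ["FR", "DE"]}),
]

result, skipped = scan(row_groups, projection=["user"], column="clicks", lower_bound=5)
print(result, skipped)
```

In this sketch the first row group is pruned from its statistics alone, and the `country` column is never read.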
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
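The compression advantage can be sketched in a few lines of plain Python. The sample rows, the column names, and the `run_length_encode` helper are hypothetical and chosen only for illustration. Run-length encoding is used here as one simple example of an encoding that benefits from a columnar layout, because values of the same column sit next to each other and tend to repeat.

```python
from itertools import groupby

# Hypothetical row-oriented records: (country, event, id).
rows = [
    ("FR", "click", 1),
    ("FR", "click", 2),
    ("FR", "view", 3),
    ("US", "view", 4),
]
names = ("country", "event", "id")

# Pivot the rows into one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


encoded_country = run_length_encode(columns["country"])
encoded_event = run_length_encode(columns["event"])
print(encoded_country)
print(encoded_event)
```

A query that only needs `country` reads a single list here, and that list shrinks from four values to two runs.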
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null.
- Repetition level: stores the level at which a new list starts in the column values.
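As a rough illustration of the two levels, the sketch below handles the simplest possible case: a single top-level repeated string field, here called `tags` (a hypothetical name), so both the maximum repetition level and the maximum definition level are 1. The `shred` and `assemble` helpers are illustrative only and do not reflect the actual Parquet implementation, which generalizes this to arbitrarily nested schemas.

```python
def shred(records):
    """Flatten lists of values into (value, repetition, definition) triples.

    repetition = 0 marks the start of a new record, 1 a continuation of the
    current list. definition = 0 marks an empty list (stored as a null),
    1 marks a value that is actually present.
    """
    triples = []
    for tags in records:
        if not tags:
            triples.append((None, 0, 0))  # empty list: null at level 0
            continue
        for i, tag in enumerate(tags):
            repetition = 0 if i == 0 else 1
            triples.append((tag, repetition, 1))
    return triples


def assemble(triples):
    """Rebuild the original records from the flat column of triples."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            records.append([])  # a new record starts here
        if definition == 1:
            records[-1].append(value)
    return records


records = [["a", "b"], [], ["c"]]
triples = shred(records)
print(triples)
print(assemble(triples))
```

The round trip shows why both levels are needed: the values alone (`a`, `b`, `c`) would not say where each record begins or that the second record is empty.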
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES