Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection push-down
- Awesome performance
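The push-down idea in the list above can be sketched in plain Python. This is a toy model under stated assumptions, not Parquet's reader: the `make_row_group` and `scan` helpers and the `day` and `clicks` columns are hypothetical. What it borrows from Parquet is the principle that data is split into row groups carrying per-column min/max statistics, so a reader can skip whole groups (predicate push-down) and touch only the columns a query names (projection push-down).

```python
def make_row_group(columns):
    """Bundle column data with per-column (min, max) statistics."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


# Hypothetical data: three row groups of three rows each.
row_groups = [
    make_row_group({"day": [1, 2, 3], "clicks": [5, 0, 2]}),
    make_row_group({"day": [4, 5, 6], "clicks": [1, 7, 0]}),
    make_row_group({"day": [7, 8, 9], "clicks": [3, 3, 4]}),
]


def scan(groups, project, column, lower_bound):
    """Return `project` values for rows where `column` >= lower_bound."""
    result, groups_read = [], 0
    for group in groups:
        _, col_max = group["stats"][column]
        if col_max < lower_bound:
            continue  # predicate push-down: skip the group using stats alone
        groups_read += 1
        filter_values = group["columns"][column]
        projected = group["columns"][project]  # projection: only the needed columns
        result.extend(p for f, p in zip(filter_values, projected) if f >= lower_bound)
    return result, groups_read


clicks, groups_read = scan(row_groups, project="clicks", column="day", lower_bound=6)
print(clicks, groups_read)  # [0, 3, 3, 4] 2
```

Only two of the three row groups are read here; the first is ruled out by its statistics without looking at any values.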
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
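A minimal sketch of why a columnar layout helps, using only the Python standard library. The rows and field names are made up for illustration, and run-length encoding stands in for the family of lightweight encodings a columnar format can apply; this is not Parquet's actual on-disk encoding.

```python
from itertools import groupby

# Hypothetical ad-impression rows; the field names are illustrative only.
rows = [
    {"country": "FR", "device": "mobile", "clicks": 0},
    {"country": "FR", "device": "mobile", "clicks": 1},
    {"country": "FR", "device": "desktop", "clicks": 0},
    {"country": "US", "device": "desktop", "clicks": 0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query that only needs `country` reads one column instead of every row.
country = columns["country"]

# Values in a column share a type and often repeat, so simple encodings pay off.
encoded_country = run_length_encode(country)
print(encoded_country)  # [('FR', 3), ('US', 1)]
```

Four stored values shrink to two pairs here; in a row layout the same values would be interleaved with other fields, and the runs would not exist.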
Parquet Model
Example Parquet Schema
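Definition and Repetition Levels
- Definition level: records how many of the optional or repeated fields on a column's path are actually defined, and therefore the level at which a value is null.
- Repetition level: records the level at which a new list starts in the column's values.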
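The two levels can be illustrated with a small sketch. It assumes a hypothetical address-book-style schema, which may differ from the one on the slides: `contacts` is a repeated group and `phoneNumber` is an optional string inside it, so for the column `contacts.phoneNumber` the maximum definition level is 2 and the maximum repetition level is 1. The `shred_phone_numbers` function is hand-written for this one column, not a general Dremel shredder.

```python
def shred_phone_numbers(address_books):
    """Flatten contacts.phoneNumber into (repetition, definition, value) triples."""
    triples = []
    for book in address_books:
        contacts = book.get("contacts", [])
        if not contacts:
            # No contacts at all: nothing on the path is defined.
            triples.append((0, 0, None))
            continue
        for index, contact in enumerate(contacts):
            # 0 starts a new record; 1 continues the `contacts` list.
            repetition = 0 if index == 0 else 1
            phone = contact.get("phoneNumber")
            if phone is None:
                # `contacts` is defined but `phoneNumber` is missing.
                triples.append((repetition, 1, None))
            else:
                triples.append((repetition, 2, phone))
    return triples


# Hypothetical records: one book with two contacts, one empty book.
books = [
    {"contacts": [{"phoneNumber": "555 987 6543"}, {}]},
    {},
]

for triple in shred_phone_numbers(books):
    print(triple)
# (0, 2, '555 987 6543')
# (1, 1, None)
# (0, 0, None)
```

With these triples a reader can rebuild the nesting without storing the structure itself: a definition level below the maximum marks a null and says how deep the record was defined, and a repetition level of 0 marks the start of a new record.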
Numbers. Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES