Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Technology
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
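The predicate and projection pushdown mentioned above can be sketched in plain Python. This is a toy model and not the Parquet API: the `make_row_group` and `scan` helpers, the column names, and the data are all hypothetical. It only assumes that each row group keeps min/max statistics per column, which lets a reader skip whole groups that cannot match a filter and touch only the columns a query asks for.

```python
def make_row_group(columns):
    """Build a toy row group: column data plus per-column (min, max) statistics."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


# Hypothetical data, invented for illustration.
row_groups = [
    make_row_group({"ts": [1, 2, 3], "price": [10, 20, 30], "url": ["a", "b", "c"]}),
    make_row_group({"ts": [4, 5, 6], "price": [40, 50, 60], "url": ["d", "e", "f"]}),
]


def scan(groups, project, column, lower_bound):
    """Return values of `project` for rows where `column` >= lower_bound."""
    result = []
    groups_read = 0
    for group in groups:
        _, col_max = group["stats"][column]
        if col_max < lower_bound:
            # Predicate pushdown: the statistics prove no row can match,
            # so the whole group is skipped without reading its data.
            continue
        groups_read += 1
        filter_values = group["columns"][column]
        # Projection pushdown: only the requested column is read;
        # the "price" column is never touched by this query.
        projected = group["columns"][project]
        result.extend(p for p, f in zip(projected, filter_values) if f >= lower_bound)
    return result, groups_read


urls, groups_read = scan(row_groups, project="url", column="ts", lower_bound=5)
```

Here only the second row group is read, because the first one has a maximum `ts` below the bound.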
Columnar Storage 101
Advantages:
- Limits I/O to only the data that is needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
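A minimal sketch of why the columnar layout gives these advantages, using hypothetical records invented for illustration. Pivoting rows into columns means an aggregate reads a single list, and consecutive equal values in one column collapse under a simple run-length encoding. The encoding here is a stand-in for the idea, not Parquet's actual encoder.

```python
from itertools import groupby

# Hypothetical row-oriented records, invented for illustration.
rows = [
    {"user": "a", "country": "FR", "clicks": 3},
    {"user": "b", "country": "FR", "clicks": 5},
    {"user": "c", "country": "FR", "clicks": 1},
    {"user": "d", "country": "US", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    names = records[0].keys()
    return {name: [r[name] for r in records] for name in names}


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field only needs that column, not whole rows.
total_clicks = sum(columns["clicks"])

# Values of one column sit next to each other, so repeats compress well.
encoded_country = run_length_encode(columns["country"])
```

In the row layout the same query would have to walk every record and skip over the `user` and `country` fields to reach `clicks`.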
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
Definition level: stores the level at which the field is null, i.e. how many of the optional or repeated fields on the column's path are actually defined.
Repetition level: stores the level at which a new list starts in the column's values.
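To make the two levels concrete, here is a small sketch for one hypothetical column: a list of lists of integers, i.e. two nested repeated fields, so both the maximum repetition level and the maximum definition level are 2. The schema, the `shred` and `assemble` helpers, and the sample records are assumptions made for illustration; they are not taken from the slides or from a Parquet library.

```python
def shred(records):
    """Flatten list-of-lists records into (repetition, definition, value) triples."""
    out = []
    for record in records:
        if not record:
            # Empty outer list: nothing on the path is defined.
            out.append((0, 0, None))
            continue
        for i, inner in enumerate(record):
            # 0 starts a new record, 1 starts a new inner list in the same record.
            outer_rep = 0 if i == 0 else 1
            if not inner:
                # Outer list is defined, inner list is empty.
                out.append((outer_rep, 1, None))
                continue
            for j, value in enumerate(inner):
                # 2 means the value continues the current inner list.
                rep = outer_rep if j == 0 else 2
                out.append((rep, 2, value))
    return out


def assemble(triples):
    """Rebuild the nested records from the triples produced by shred."""
    records = []
    for rep, definition, value in triples:
        if rep == 0:
            records.append([])
        if definition == 0:
            continue
        if rep <= 1:
            records[-1].append([])
        if definition == 2:
            records[-1][-1].append(value)
    return records


# Hypothetical records, invented for illustration.
records = [[[1, 2], [3]], [], [[], [4]]]
triples = shred(records)
```

The flat list of triples is all that needs to be stored for the column: `assemble(triples)` recovers the original nesting, including the empty lists.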
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES