Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
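As a rough illustration of predicate pushdown, the sketch below keeps min/max statistics for each group of rows and skips any group whose statistics rule out a match. It is a toy in plain Python, not the Parquet API: `make_row_groups`, `scan_greater_than`, and the group size of 25 are names and values assumed for this example.

```python
# Toy model of predicate pushdown using per-group min/max statistics.
# Hypothetical helpers for illustration only; this is not the Parquet API.

def make_row_groups(values, size):
    """Split a column into groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold and the number of groups actually read."""
    matches = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            # The statistics prove that no value in this group can match,
            # so the group is skipped without reading its values.
            continue
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


column = list(range(100))               # assumed sample data: 0..99
row_groups = make_row_groups(column, 25)
matches, groups_read = scan_greater_than(row_groups, 80)
print(len(matches), "matches, read", groups_read, "of", len(row_groups), "groups")
```

With 100 sorted values in groups of 25, the filter `> 80` only needs to read the last group; the first three are ruled out by their maximum alone.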
Columnar Storage 101
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings, better compression, and faster, low-overhead encodings
- Enables vectorized engines
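The first two advantages can be sketched in a few lines of plain Python. The sample rows, the `to_columns` and `run_length_encode` helpers, and the field names are all assumptions made for this example; run-length encoding stands in here for the family of encodings a columnar format can apply, not for Parquet's exact on-disk layout.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and data below are hypothetical and for illustration only.

rows = [
    {"user_id": 1, "country": "FR", "clicks": 3},
    {"user_id": 2, "country": "FR", "clicks": 5},
    {"user_id": 3, "country": "FR", "clicks": 1},
    {"user_id": 4, "country": "US", "clicks": 7},
    {"user_id": 5, "country": "US", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows)

# Projection: an aggregate over "clicks" touches one column only,
# instead of deserializing every field of every row.
total_clicks = sum(columns["clicks"])

# Compression: values of the same type and meaning sit next to each other,
# so repeated values collapse into short runs.
country_runs = run_length_encode(columns["country"])
print(total_clicks, country_runs)
```

Five country strings shrink to two runs here; on real data the gain depends on how values are sorted and how many distinct values a column holds.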
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null
- Repetition level: stores the level at which a new list starts in the column values
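To make the two levels concrete, here is a minimal sketch for a single column in a two-level nested list, written as `repeated group level1 { repeated string level2; }`. The schema, the `shred` and `assemble` helpers, and the sample records are assumptions for this example; a real Parquet writer handles arbitrary schemas and stores the levels in encoded form.

```python
# Toy shredding of nested lists into (repetition, definition, value) triples
# for the column level1.level2. Hypothetical helpers, not a Parquet API.

def shred(records):
    """Flatten lists of lists of strings into one column with levels."""
    triples = []
    for record in records:
        if not record:
            # level1 is empty: nothing on the path is defined (d = 0).
            triples.append((0, 0, None))
            continue
        for i, inner in enumerate(record):
            # r = 0 starts a new record, r = 1 starts a new level1 entry.
            outer_level = 0 if i == 0 else 1
            if not inner:
                # level1 exists but level2 is empty (d = 1).
                triples.append((outer_level, 1, None))
                continue
            for j, value in enumerate(inner):
                # r = 2 continues the current level2 list.
                repetition = outer_level if j == 0 else 2
                triples.append((repetition, 2, value))
    return triples


def assemble(triples):
    """Rebuild the nested records from the triples produced by shred."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            records.append([])
        if definition == 0:
            continue
        if repetition <= 1:
            records[-1].append([])
        if definition == 2:
            records[-1][-1].append(value)
    return records


records = [[["a", "b", "c"], ["d", "e"]], [[], ["f"]], []]
triples = shred(records)
for triple in triples:
    print(triple)
```

The levels alone are enough to rebuild the nesting, which is why the values can be stored as one flat column: `assemble(shred(records))` returns the original records, including the empty lists.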
Numbers
Example: AppNexus
- 2 MM logs of ads impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES