Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
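To make predicate and projection pushdown concrete, here is a minimal sketch in plain Python. It is a toy model, not Parquet's file layout or any library's API: the `RowGroup` class, the `scan` function, and the column names `ts`, `url`, and `bytes` are all hypothetical. The idea it illustrates is that per-group statistics let a reader skip whole groups (predicate pushdown) and that only the requested columns are materialized (projection pushdown).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A horizontal slice of a table, stored column by column (toy model)."""
    columns: dict  # column name -> list of values

    def max_of(self, name):
        # Assumption for brevity: the statistic is computed on the fly.
        # A real columnar file would store it as metadata.
        return max(self.columns[name])


def scan(row_groups, projection, filter_column, lower_bound):
    """Return projected rows where filter_column >= lower_bound."""
    rows = []
    groups_read = 0
    for group in row_groups:
        # Predicate pushdown: skip the group if no value can match.
        if group.max_of(filter_column) < lower_bound:
            continue
        groups_read += 1
        for i, value in enumerate(group.columns[filter_column]):
            if value >= lower_bound:
                # Projection pushdown: touch only the requested columns.
                rows.append(tuple(group.columns[name][i] for name in projection))
    return rows, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "url": ["/a", "/b", "/c"], "bytes": [10, 20, 30]}),
    RowGroup({"ts": [4, 5, 6], "url": ["/d", "/e", "/f"], "bytes": [40, 50, 60]}),
]
rows, groups_read = scan(groups, projection=["url"], filter_column="ts", lower_bound=5)
print(rows, groups_read)
```

In this sketch the first group is never scanned because its largest `ts` is below the bound, and the `bytes` column is never read at all.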
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
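The advantages above can be illustrated with a small plain-Python sketch. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration. Run-length encoding stands in here for the kind of lightweight encoding that works well on a column of similar values; it is not a description of the exact encodings Parquet uses.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


rows = [
    {"user": "u1", "country": "FR", "clicks": 3},
    {"user": "u2", "country": "FR", "clicks": 1},
    {"user": "u3", "country": "FR", "clicks": 4},
    {"user": "u4", "country": "US", "clicks": 2},
]

columns = to_columns(rows)

# A query on one column reads that column only, not every full row.
total_clicks = sum(columns["clicks"])

# Values of one column sit next to each other, so runs compress well.
encoded_country = run_length_encode(columns["country"])

print(total_clicks, encoded_country)
```

Summing `clicks` never touches `user` or `country`, and the four country values shrink to two (value, count) pairs.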
Parquet Model
Example Parquet Schema
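Definition and Repetition Levels
- Definition level: stores the nesting level at which a field is null, that is, how many optional or repeated fields along the column's path are actually defined.
- Repetition level: stores the level at which a new list starts in the column's values.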
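A worked sketch of the two levels, following the Dremel-style encoding described above. It assumes a hypothetical column that is a list of lists of integers (two repeated levels, so both levels range from 0 to 2); the function name `shred` and the sample records are illustrative only.

```python
def shred(record):
    """Flatten one record (a list of lists of ints) into
    (value, repetition_level, definition_level) triples."""
    if not record:
        # Outer list is empty: nothing along the path is defined.
        return [(None, 0, 0)]
    triples = []
    for i, inner in enumerate(record):
        # The first outer element starts a new record (level 0);
        # later outer elements start a new list at level 1.
        outer_rep = 0 if i == 0 else 1
        if not inner:
            # Outer level is defined, inner list is empty.
            triples.append((None, outer_rep, 1))
            continue
        for j, value in enumerate(inner):
            # Further values in the same inner list repeat at level 2.
            rep = outer_rep if j == 0 else 2
            triples.append((value, rep, 2))
    return triples


for record in ([[1, 2], [3]], [], [[]], [[4], []]):
    print(record, shred(record))
```

A repetition level of 0 always marks the start of a new record, and a definition level below the maximum (2 here) marks a null or empty position, which is what lets the nested structure be rebuilt from a flat column.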
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project