Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
800
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
知覚とデザイン
rinchoku
1
630
OSSで50の競合と戦うためにやったこと
yamadashy
3
1k
OTEPsで知るOpenTelemetryの未来 / Observability Conference Tokyo 2025
arthur1
0
320
SOTA競争から人間を超える画像認識へ
shinya7y
0
610
「タコピーの原罪」から学ぶ間違った”支援” / the bad support of Takopii
piyonakajima
0
150
ヘンリー会社紹介資料(エンジニア向け) / company deck for engineer
henryofficial
0
420
AWSが好きすぎて、41歳でエンジニアになり、AAIを経由してAWSパートナー企業に入った話
yama3133
1
180
serverless team topology
_kensh
3
240
データとAIで明らかになる、私たちの課題 ~Snowflake MCP,Salesforce MCPに触れて~ / Data and AI Insights
kaonavi
0
100
ハノーファーメッセ2025で見た生成AI活用ユースケース.pdf
hamadakoji
1
500
CREが作る自己解決サイクルSlackワークフローに組み込んだAIによる社内ヘルプデスク改革 #cre_meetup
bengo4com
0
360
CLIPでマルチモーダル画像検索 →とても良い
wm3
0
510
Featured
See All Featured
Practical Orchestrator
shlominoach
190
11k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
Large-scale JavaScript Application Architecture
addyosmani
514
110k
Designing for Performance
lara
610
69k
Why You Should Never Use an ORM
jnunemaker
PRO
59
9.6k
YesSQL, Process and Tooling at Scale
rocio
173
15k
How STYLIGHT went responsive
nonsquared
100
5.9k
Building an army of robots
kneath
305
46k
Six Lessons from altMBA
skipperchong
29
4k
Automating Front-end Workflow
addyosmani
1371
200k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
A designer walks into a library…
pauljervisheath
209
24k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None