Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
260
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
330
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
47
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
日本版とグローバル版のモバイルアプリ統合の開発の裏側と今後の展望
miichan
1
120
2024年にチャレンジしたことを振り返るぞ
mitchan
0
130
【re:Invent 2024 アプデ】 Prompt Routing の紹介
champ
0
140
生成AIをより賢く エンジニアのための RAG入門 - Oracle AI Jam Session #20
kutsushitaneko
4
210
Turing × atmaCup #18 - 1st Place Solution
hakubishin3
0
470
Oracle Cloud Infrastructure:2024年12月度サービス・アップデート
oracle4engineer
PRO
0
160
フロントエンド設計にモブ設計を導入してみた / 20241212_cloudsign_TechFrontMeetup
bengo4com
0
1.9k
アップデート紹介:AWS Data Transfer Terminal
stknohg
PRO
0
170
watsonx.ai Dojo #5 ファインチューニングとInstructLAB
oniak3ibm
PRO
0
160
podman_update_2024-12
orimanabu
1
260
10分で学ぶKubernetesコンテナセキュリティ/10min-k8s-container-sec
mochizuki875
3
320
20241214_WACATE2024冬_テスト設計技法をチョット俯瞰してみよう
kzsuzuki
3
440
Featured
See All Featured
Docker and Python
trallard
41
3.1k
How GitHub (no longer) Works
holman
311
140k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
365
25k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
Practical Orchestrator
shlominoach
186
10k
Unsuck your backbone
ammeep
669
57k
Agile that works and the tools we love
rasmusluckow
328
21k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
Visualization
eitanlees
146
15k
Testing 201, or: Great Expectations
jmmastey
40
7.1k
Why You Should Never Use an ORM
jnunemaker
PRO
54
9.1k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
48
2.2k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None