Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
270
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
49
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
re:Invent2024 KeynoteのAmazon Q Developer考察
yusukeshimizu
1
150
2025年に挑戦したいこと
molmolken
0
160
PaaSの歴史と、 アプリケーションプラットフォームのこれから
jacopen
7
1.5k
ドメイン駆動設計の実践により事業の成長スピードと保守性を両立するショッピングクーポン
lycorptech_jp
PRO
12
1.9k
デジタルアイデンティティ人材育成推進ワーキンググループ 翻訳サブワーキンググループ 活動報告 / 20250114-OIDF-J-EduWG-TranslationSWG
oidfj
0
530
.NET 最新アップデート ~ AI とクラウド時代のアプリモダナイゼーション
chack411
0
200
なぜfreeeはハブ・アンド・スポーク型の データメッシュアーキテクチャにチャレンジするのか?
shinichiro_joya
2
460
JuliaTokaiとJuliaLangJaの紹介 for NGK2025S
antimon2
1
110
ゼロからわかる!!AWSの構成図を書いてみようワークショップ 問題&解答解説 #デッカイギ #羽田デッカイギおつ
_mossann_t
0
1.5k
【NGK2025S】動物園(PINTO_model_zoo)に遊びに行こう
kazuhitotakahashi
0
230
2025年のARグラスの潮流
kotauchisunsun
0
790
今年一年で頑張ること / What I will do my best this year
pauli
1
220
Featured
See All Featured
Building Adaptive Systems
keathley
38
2.4k
Mobile First: as difficult as doing things right
swwweet
222
9k
Why Our Code Smells
bkeepers
PRO
335
57k
Why You Should Never Use an ORM
jnunemaker
PRO
54
9.1k
jQuery: Nuts, Bolts and Bling
dougneiner
62
7.6k
Fantastic passwords and where to find them - at NoRuKo
philnash
50
2.9k
Code Reviewing Like a Champion
maltzj
521
39k
Writing Fast Ruby
sferik
628
61k
The Art of Programming - Codeland 2020
erikaheidi
53
13k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
330
21k
Build The Right Thing And Hit Your Dates
maggiecrowley
33
2.5k
Gamification - CAS2011
davidbonilla
80
5.1k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None