Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
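The pushdown idea in the list above can be sketched with a toy in-memory model. This is not the Parquet file format or any real library API: `make_row_group`, `scan`, and the sample columns are hypothetical names chosen for illustration. The sketch assumes each row group keeps per-column min/max statistics, so a filter can skip whole groups (predicate pushdown) and only the requested columns are touched (projection pushdown).

```python
# Toy model only: a "row group" stores each column separately plus min/max stats.
# Names and data are hypothetical, not part of any Parquet API.

def make_row_group(rows, columns):
    """Turn a list of dict rows into per-column lists with min/max statistics."""
    data = {c: [row[c] for row in rows] for c in columns}
    stats = {c: (min(vals), max(vals)) for c, vals in data.items()}
    return {"data": data, "stats": stats}

def scan(row_groups, projection, column, lower_bound):
    """Return the projected columns for rows where row[column] > lower_bound."""
    result = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group["stats"][column]
        if col_max <= lower_bound:
            continue  # predicate pushdown: statistics prove no row can match
        groups_read += 1
        filter_values = group["data"][column]
        # projection pushdown: only the requested columns are materialized
        projected = [group["data"][c] for c in projection]
        for i, value in enumerate(filter_values):
            if value > lower_bound:
                result.append(tuple(col[i] for col in projected))
    return result, groups_read

columns = ["user", "clicks", "country"]
rows = [{"user": f"u{i}", "clicks": i, "country": "FR"} for i in range(6)]
row_groups = [make_row_group(rows[:3], columns), make_row_group(rows[3:], columns)]

# Roughly: SELECT user WHERE clicks > 3
matches, groups_read = scan(row_groups, ["user"], "clicks", 3)
print(matches, groups_read)
```

Here the first row group is never read, because its maximum `clicks` value already rules out a match, and the `country` column is never touched at all.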
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
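As a minimal sketch of these advantages, the snippet below pivots a few rows into columns and applies run-length encoding to one of them. The sample rows and the `run_length_encode` helper are made up for illustration; this is not Parquet's actual encoder, only the general idea that values of one column stored together compress well and can be read without touching the others.

```python
from itertools import groupby

# Hypothetical sample data: (date, country, clicks)
rows = [
    ("2015-06-01", "FR", 10),
    ("2015-06-01", "FR", 12),
    ("2015-06-01", "US", 7),
    ("2015-06-02", "US", 9),
]

# Row layout -> column layout: one list per column.
date_col, country_col, clicks_col = (list(col) for col in zip(*rows))

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# Repetitive columns shrink once their values sit next to each other.
encoded_dates = run_length_encode(date_col)
encoded_countries = run_length_encode(country_col)

# An aggregate over one column reads that column only.
total_clicks = sum(clicks_col)

print(encoded_dates, encoded_countries, total_clicks)
```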
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores how deep the field's path is actually defined, which identifies the level at which a null occurs.
- Repetition level: stores the level at which a new list starts in the column values.
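A small sketch may make the two levels concrete. It assumes a hypothetical schema with a repeated group `a` containing a repeated integer field `b`, models each record as a list of lists, and follows the level rules described in the Dremel paper for the single column `a.b`. The `shred` function is an illustrative name, not part of any Parquet API.

```python
# Hypothetical schema: message Doc { repeated group a { repeated int32 b; } }
# Column a.b has a maximum repetition level of 2 and a maximum definition level of 2.

def shred(records):
    """Flatten records into (value, repetition_level, definition_level) triples."""
    triples = []
    for record in records:
        if not record:
            # No 'a' at all: nothing on the path is defined.
            triples.append((None, 0, 0))
            continue
        for i, inner in enumerate(record):
            # 0 starts a new record, 1 starts a new 'a' within the same record.
            outer_rep = 0 if i == 0 else 1
            if not inner:
                # 'a' exists but holds no 'b': defined down to level 1 only.
                triples.append((None, outer_rep, 1))
                continue
            for j, value in enumerate(inner):
                # 2 means another 'b' inside the same 'a'.
                rep = outer_rep if j == 0 else 2
                triples.append((value, rep, 2))
    return triples

records = [
    [[1, 2], [3]],  # two 'a' groups
    [],             # record without any 'a'
    [[4], []],      # second 'a' has no 'b'
]
column = shred(records)
for triple in column:
    print(triple)
```

A reader can rebuild the nesting from the triples alone: a repetition level of 0 opens a new record, and a definition level below 2 marks where the path stops being defined.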
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES