Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
800
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
SoccerNet GSRの紹介と技術応用:選手視点映像を提供するサッカー作戦盤ツール
mixi_engineers
PRO
1
180
Access-what? why and how, A11Y for All - Nordic.js 2025
gdomiciano
1
110
動画データのポテンシャルを引き出す! Databricks と AI活用への奮闘記(現在進行形)
databricksjapan
0
150
Findy Team+のSOC2取得までの道のり
rvirus0817
0
360
Goに育てられ開発者向けセキュリティ事業を立ち上げた僕が今向き合う、AI × セキュリティの最前線 / Go Conference 2025
flatt_security
0
350
Why Governance Matters: The Key to Reducing Risk Without Slowing Down
sarahjwells
0
110
【Oracle Cloud ウェビナー】クラウド導入に「専用クラウド」という選択肢、Oracle AlloyとOCI Dedicated Region とは
oracle4engineer
PRO
3
110
GopherCon Tour 概略
logica0419
2
190
生成AIで「お客様の声」を ストーリーに変える 新潮流「Generative ETL」
ishikawa_satoru
1
320
Green Tea Garbage Collector の今
zchee
PRO
2
390
AI駆動開発を推進するためにサービス開発チームで 取り組んでいること
noayaoshiro
0
190
VCC 2025 Write-up
bata_24
0
180
Featured
See All Featured
Designing for Performance
lara
610
69k
Build your cross-platform service in a week with App Engine
jlugia
232
18k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
140
34k
Product Roadmaps are Hard
iamctodd
PRO
54
11k
A better future with KSS
kneath
239
17k
KATA
mclloyd
32
15k
Context Engineering - Making Every Token Count
addyosmani
5
190
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
252
21k
Imperfection Machines: The Place of Print at Facebook
scottboms
269
13k
Rails Girls Zürich Keynote
gr2m
95
14k
VelocityConf: Rendering Performance Case Studies
addyosmani
332
24k
Done Done
chrislema
185
16k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None