Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
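Predicate and projection pushdown can be pictured with a toy model. The sketch below is not Parquet's implementation: `RowGroup`, `stats`, and `scan` are hypothetical names, and the min/max statistics are computed on the fly here, whereas a real file would carry them as precomputed metadata. It only shows the idea of skipping whole chunks whose statistics rule out a match, and of touching only the requested columns.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of rows stored column by column."""
    columns: dict  # column name -> list of values

    def stats(self, name):
        # Assumption: computed on demand; a real format would store these.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, project, column, lower):
    """Return projected rows where `column >= lower`, skipping groups via stats."""
    rows, skipped = [], 0
    for rg in row_groups:
        _, highest = rg.stats(column)
        if highest < lower:
            # Predicate pushdown: no value in this group can match.
            skipped += 1
            continue
        for i, value in enumerate(rg.columns[column]):
            if value >= lower:
                # Projection pushdown: read only the requested columns.
                rows.append(tuple(rg.columns[c][i] for c in project))
    return rows, skipped


groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "bytes": [10, 20, 30]}),
    RowGroup({"ts": [4, 5, 6], "user": ["d", "e", "f"], "bytes": [40, 50, 60]}),
]
rows, skipped = scan(groups, project=["user"], column="ts", lower=5)
print(rows, skipped)
```

In this example the first group is never scanned, and the `bytes` column is never read at all.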
Columnar Storage 101
Columnar Storage 101 - Advantages:
- Limits I/O to only the data that is needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
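The advantages listed above can be illustrated with a small sketch, using made-up sample rows and hypothetical helper names (`to_columns`, `rle`). It pivots row-oriented records into columns, shows that an aggregate only needs to touch one column, and applies a simple run-length encoding to a column with repeated values. The encoding here is a plain illustration, not Parquet's actual on-disk encoding.

```python
from itertools import groupby

# Made-up sample data: (day, country, clicks)
rows = [
    ("2015-06-01", "FR", 3),
    ("2015-06-01", "FR", 5),
    ("2015-06-01", "US", 2),
    ("2015-06-02", "US", 7),
]


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}


def rle(values):
    """Run-length encode consecutive equal values as (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["day", "country", "clicks"])

# An aggregate over one field reads a single contiguous column.
total_clicks = sum(columns["clicks"])

# Values of the same type and similar content sit together, so they compress well.
day_rle = rle(columns["day"])
country_rle = rle(columns["country"])

print(total_clicks, day_rle, country_rle)
```

Sorted or low-cardinality columns benefit most from this kind of encoding, which is one reason column layouts tend to save space compared with row layouts.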
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null, i.e. how many optional or repeated fields on the column's path are actually defined.
- Repetition level: stores the level at which a new list starts in the column values.
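As a minimal sketch of how these levels come out in practice, assume a hypothetical schema with an optional group `a` containing a repeated field `b`, so the column path is `a.b`, the maximum definition level is 2, and the maximum repetition level is 1. The function name `shred` and the dict-based records are illustrative assumptions, not part of any Parquet API.

```python
def shred(records):
    """Return (repetition, definition, value) triples for the column a.b.

    Assumed schema: optional group a { repeated b }.
    Definition level: 0 if a is null, 1 if a exists but b is empty,
    2 if a value of b is present.
    Repetition level: 0 for the first entry of a record, 1 when the
    value continues the list b within the same record.
    """
    triples = []
    for record in records:
        a = record.get("a")
        if a is None:
            triples.append((0, 0, None))  # a itself is missing
            continue
        b = a.get("b") or []
        if not b:
            triples.append((0, 1, None))  # a is defined, b has no values
            continue
        for index, value in enumerate(b):
            repetition = 0 if index == 0 else 1
            triples.append((repetition, 2, value))
    return triples


records = [
    {"a": {"b": [1, 2]}},
    {"a": {"b": []}},
    {"a": None},
    {"a": {"b": [3]}},
]
for triple in shred(records):
    print(triple)
```

A repetition level of 0 marks the start of a new record, which is what lets a reader reassemble nested records from a flat column of values.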
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES