Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
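To make the predicate and projection pushdown bullet concrete, here is a minimal pure-Python simulation. It does not use the Parquet API: `RowGroup`, `scan`, and the column names `ts` and `user` are hypothetical names chosen for illustration. The idea it sketches is that a reader can consult per-row-group min/max statistics to skip whole groups that cannot match a filter (predicate pushdown) and materialize only the requested columns (projection pushdown).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group: column name -> list of values."""
    columns: dict

    def stats(self, name):
        # Computed on the fly here for simplicity; a real columnar file
        # would typically keep such min/max statistics in its metadata.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, select, column, lower_bound):
    """Return the `select` columns for rows where `column` >= lower_bound.

    Also returns how many row groups actually had to be read.
    """
    results = []
    groups_read = 0
    for group in row_groups:
        _, max_value = group.stats(column)
        if max_value < lower_bound:
            # Predicate pushdown: no row in this group can match, skip it.
            continue
        groups_read += 1
        for i, value in enumerate(group.columns[column]):
            if value >= lower_bound:
                # Projection pushdown: only touch the requested columns.
                results.append(tuple(group.columns[name][i] for name in select))
    return results, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"]}),
    RowGroup({"ts": [4, 5, 6], "user": ["d", "e", "f"]}),
]
rows, groups_read = scan(groups, select=["user"], column="ts", lower_bound=5)
```

With this sample data, the first group is skipped entirely because its maximum `ts` is below the bound, so only one of the two groups is read.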
Columnar Storage 101
Columnar Storage 101 Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
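The advantages above can be illustrated with a small pure-Python sketch. The sample records and the helper names `to_columns` and `run_length_encode` are assumptions made for this example, and run-length encoding stands in here for the general family of column encodings. The sketch pivots row-oriented records into columns, reads a single column for an aggregate, and shows why values of one type stored together compress well.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "u1", "country": "FR", "clicks": 3},
    {"user": "u2", "country": "FR", "clicks": 1},
    {"user": "u3", "country": "FR", "clicks": 4},
    {"user": "u4", "country": "US", "clicks": 2},
    {"user": "u5", "country": "US", "clicks": 5},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


columns = to_columns(rows)

# An aggregate over one column only needs that column's values.
total_clicks = sum(columns["clicks"])

# Similar values sit next to each other, so simple encodings shrink them.
encoded_country = run_length_encode(columns["country"])
```

In the row layout, computing `total_clicks` would mean walking every full record; in the column layout only the `clicks` list is touched, and the `country` column collapses to two runs.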
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null
- Repetition level: stores the level at which a new list starts in the column values
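As a rough illustration of these two levels, the sketch below handles only the simplest case: a single top-level repeated field, called `tags` here as an assumption, so both levels are at most 1. The function names `shred_repeated` and `assemble_repeated` are hypothetical, and real Parquet generalizes the same idea to arbitrarily nested schemas. A repetition level of 0 marks the first entry of a new record, 1 marks a continuation of the current list, and a definition level of 0 marks a record whose list is empty.

```python
def shred_repeated(records):
    """Flatten lists into (value, repetition_level, definition_level) triples."""
    triples = []
    for tags in records:
        if not tags:
            # Empty list: emit a placeholder so the record is not lost.
            # Definition level 0 means the field is not defined here.
            triples.append((None, 0, 0))
            continue
        for index, tag in enumerate(tags):
            # Repetition level 0 starts a new record, 1 continues the list.
            repetition = 0 if index == 0 else 1
            triples.append((tag, repetition, 1))
    return triples


def assemble_repeated(triples):
    """Rebuild the original lists from the flattened triples."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


original = [["a", "b"], [], ["c"]]
shredded = shred_repeated(original)
rebuilt = assemble_repeated(shredded)
```

The round trip shows why both levels are needed: the values alone (`a`, `b`, `c`) would lose both the record boundaries and the empty record in the middle.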
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES