Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
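The predicate and projection pushdown mentioned above can be illustrated with a minimal pure-Python sketch. It is not the Parquet reader API: the `RowGroup` class, the `scan` function, and the column names are hypothetical, and the min/max statistics are computed on demand here, whereas a real file would store them as metadata at write time.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A horizontal slice of a table, stored column by column (hypothetical layout)."""
    columns: Dict[str, List[int]]

    def stats(self, name: str) -> Tuple[int, int]:
        # Assumption: computed on demand for the sketch; a real format would
        # keep min/max in metadata so values need not be decoded at all.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, wanted, filter_column, lower_bound):
    """Return the wanted columns for rows where filter_column >= lower_bound."""
    rows, skipped = [], 0
    for group in row_groups:
        _, high = group.stats(filter_column)
        if high < lower_bound:
            # Predicate pushdown: no row in this group can match, so skip it whole.
            skipped += 1
            continue
        # Projection pushdown: touch only the filter column and the wanted columns.
        filter_values = group.columns[filter_column]
        picked = [group.columns[name] for name in wanted]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                rows.append(tuple(column[i] for column in picked))
    return rows, skipped


row_groups = [
    RowGroup({"ts": [1, 2, 3], "user": [10, 11, 12], "bytes": [5, 6, 7]}),
    RowGroup({"ts": [4, 5, 6], "user": [13, 14, 15], "bytes": [8, 9, 1]}),
]
rows, skipped = scan(row_groups, wanted=["user"], filter_column="ts", lower_bound=5)
print(rows, skipped)
```

In this toy data the first row group is skipped entirely because its largest `ts` is below the bound, and the `bytes` column is never read.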
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression and fast, low-overhead encodings
- Enables vectorized execution engines
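The advantages above can be made concrete with a small sketch, assuming a toy table of ad-impression rows (the field names and values are invented for illustration). It pivots rows into columns, run-length encodes one column, and aggregates another without touching the rest. Run-length encoding is used only as one example of a cheap encoding that benefits from values of the same column sitting next to each other.

```python
from itertools import groupby

# Hypothetical row-oriented records.
rows = [
    {"country": "FR", "device": "mobile", "clicks": 3},
    {"country": "FR", "device": "mobile", "clicks": 1},
    {"country": "FR", "device": "desktop", "clicks": 4},
    {"country": "US", "device": "desktop", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    names = records[0].keys()
    return {name: [record[name] for record in records] for name in names}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Similar values are adjacent in a column, so simple encodings work well.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column reads that column only.
total_clicks = sum(columns["clicks"])

print(encoded_country, total_clicks)
```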
Parquet Model
Example Parquet Schema
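Definition and Repetition Levels
Definition level: records how many of the optional or repeated fields along a column's path are actually defined, which identifies the level at which a value is null.
Repetition level: records at which repeated field in the path a new list starts, so that list boundaries can be rebuilt from the flat column values.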
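A minimal sketch of both levels, assuming two deliberately simple cases: a single top-level repeated field (maximum repetition and definition level of 1) and a path made only of optional fields. The function names are hypothetical and this is not the full Dremel striping algorithm, which handles arbitrary nesting.

```python
def stripe_repeated(records):
    """Flatten a top-level repeated field into (value, repetition, definition) triples."""
    triples = []
    for values in records:
        if not values:
            # Empty list: one placeholder entry, nothing defined on the path.
            triples.append((None, 0, 0))
            continue
        for index, value in enumerate(values):
            # Repetition 0 starts a new record, 1 continues the current list.
            repetition = 0 if index == 0 else 1
            triples.append((value, repetition, 1))
    return triples


def assemble_repeated(triples):
    """Rebuild the original lists from the triples."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


def definition_level(record, path):
    """Count how many optional fields along the path are present (not None)."""
    level, node = 0, record
    for name in path:
        node = node.get(name) if isinstance(node, dict) else None
        if node is None:
            break
        level += 1
    return level


records = [["a", "b"], [], ["c"]]
triples = stripe_repeated(records)
print(triples)
print(assemble_repeated(triples) == records)
print(definition_level({"a": {"b": None}}, ["a", "b", "c"]))
```

The levels are what let a flat column of values round-trip back into nested records: the repetition level marks where each list begins, and the definition level separates a missing value from a present one.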
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES