Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
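To make predicate and projection pushdown concrete, here is a minimal pure-Python sketch. It is a toy model, not Parquet's reader API: the `RowGroup` class, the `scan` function, and the column names `ts`, `user`, and `bytes` are all hypothetical. It assumes each chunk of rows keeps min/max statistics per column, so a reader can skip chunks that cannot match a filter and touch only the requested columns.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A horizontal slice of a table, stored column by column (toy model)."""
    columns: dict  # column name -> list of values

    def stats(self, name):
        # Min/max statistics for one column of this row group.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, projection, filter_column, low, high):
    """Return projected rows where low <= filter_column <= high.

    Also returns how many row groups were skipped thanks to statistics.
    """
    rows, skipped = [], 0
    for group in row_groups:
        group_min, group_max = group.stats(filter_column)
        # Predicate pushdown: skip the whole group if its range cannot match.
        if group_max < low or group_min > high:
            skipped += 1
            continue
        # Projection pushdown: only the requested columns are read.
        selected = [group.columns[name] for name in projection]
        for i, value in enumerate(group.columns[filter_column]):
            if low <= value <= high:
                rows.append(tuple(column[i] for column in selected))
    return rows, skipped


row_groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "bytes": [10, 20, 30]}),
    RowGroup({"ts": [4, 5, 6], "user": ["d", "e", "f"], "bytes": [40, 50, 60]}),
    RowGroup({"ts": [7, 8, 9], "user": ["g", "h", "i"], "bytes": [70, 80, 90]}),
]

rows, skipped = scan(row_groups, projection=["user"], filter_column="ts", low=5, high=6)
print(rows, skipped)
```

In this sketch the first and last row groups are never scanned, and the `bytes` column is never read at all.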
Columnar Storage 101
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
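The advantages above can be illustrated with a small pure-Python sketch. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration. Run-length encoding stands in here for the kind of lightweight encoding that works well on a column of similar values; it is not a description of Parquet's actual encoders.

```python
from itertools import groupby

# Row-oriented data: each record carries every field (hypothetical sample).
rows = [
    {"country": "FR", "device": "mobile", "clicks": 3},
    {"country": "FR", "device": "mobile", "clicks": 1},
    {"country": "FR", "device": "desktop", "clicks": 4},
    {"country": "US", "device": "desktop", "clicks": 1},
    {"country": "US", "device": "desktop", "clicks": 5},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# A query that only needs `clicks` reads a single column, not whole records.
total_clicks = sum(columns["clicks"])

# Values within one column are similar, so they encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total_clicks, encoded_country)
```

Storing values of the same type contiguously is also what lets an engine process them in batches rather than one record at a time.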
Parquet Model
Example Parquet Schema
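Definition and Repetition Levels
Definition level: records how many optional or repeated fields along the column's path are actually defined, which tells the reader at which level a value is null.
Repetition level: records, for each value in the column, at which repeated field in the path a new list starts.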
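As a worked illustration of the two levels, the sketch below shreds one nested column into (repetition level, definition level, value) triples. The schema is an assumption chosen for the example: a repeated group `contacts` containing an optional string `phone`, so the maximum definition level is 2 and the maximum repetition level is 1. The function name `shred_contacts_phone` and the sample records are hypothetical, and the code handles only this one column shape, not the general algorithm.

```python
def shred_contacts_phone(records):
    """Shred the column contacts.phone into (rep, def, value) triples.

    Assumed schema: contacts is a repeated group, phone is an optional string.
      def 0 -> contacts is empty for this record
      def 1 -> a contact exists but its phone is missing
      def 2 -> the phone value is present
      rep 0 -> first entry of a new record
      rep 1 -> another entry in the same contacts list
    """
    triples = []
    for record in records:
        contacts = record.get("contacts", [])
        if not contacts:
            # Nothing on the path is defined; emit a placeholder entry.
            triples.append((0, 0, None))
            continue
        for index, contact in enumerate(contacts):
            rep = 0 if index == 0 else 1
            phone = contact.get("phone")
            if phone is None:
                triples.append((rep, 1, None))
            else:
                triples.append((rep, 2, phone))
    return triples


records = [
    {"contacts": [{"phone": "555-1"}, {}]},
    {"contacts": []},
    {"contacts": [{"phone": "555-2"}]},
]

triples = shred_contacts_phone(records)
for triple in triples:
    print(triple)
```

A repetition level of 0 marks the start of each record, so the three records can be rebuilt from the flat list of four triples.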
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES