Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing-framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
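The predicate and projection pushdown mentioned above can be illustrated with a small pure-Python sketch. It is not Parquet's reader: `make_group`, `scan`, and the `ts`/`clicks`/`url` columns are hypothetical names invented for the illustration. The only Parquet facts it relies on are that a file is split into row groups and that each column chunk can carry min/max statistics.

```python
def make_group(start, stop):
    """Build a toy row group: column-wise data plus min/max stats for `ts`."""
    ts = list(range(start, stop))
    return {
        "columns": {
            "ts": ts,
            "clicks": [t % 3 for t in ts],
            "url": [f"/page/{t}" for t in ts],
        },
        "stats": {"ts": (min(ts), max(ts))},
    }


ROW_GROUPS = [make_group(0, 100), make_group(100, 200), make_group(200, 300)]


def scan(row_groups, columns, filter_column, lower):
    """Return the requested columns for rows where filter_column >= lower."""
    rows = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group["stats"][filter_column]
        if col_max < lower:
            continue  # predicate pushdown: the stats rule out the whole group
        groups_read += 1
        # Projection pushdown: only the requested columns are touched.
        projected = [group["columns"][name] for name in columns]
        for i, value in enumerate(group["columns"][filter_column]):
            if value >= lower:
                rows.append(tuple(col[i] for col in projected))
    return rows, groups_read


rows, groups_read = scan(ROW_GROUPS, ["clicks"], "ts", 250)
print(len(rows), groups_read)
```

Two of the three groups are skipped on their statistics alone, and the `url` column is never read.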
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data that is needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
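A minimal sketch of why a columnar layout helps, in plain Python. The sample `ROWS` and the helper names are made up for the illustration, and run-length encoding stands in for the family of lightweight encodings a columnar format can apply. Once values of the same column sit next to each other, repeated values collapse into short runs, and an aggregate over one column never touches the others.

```python
from itertools import groupby

# Hypothetical log rows: (country, event, duration_ms)
ROWS = [
    ("FR", "view", 10),
    ("FR", "view", 12),
    ("FR", "click", 7),
    ("US", "view", 3),
    ("US", "view", 5),
]


def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


country, event, duration = to_columns(ROWS)

encoded_country = run_length_encode(country)
encoded_event = run_length_encode(event)

# An aggregate over one column only needs that column's values.
total_duration = sum(duration)

print(encoded_country, encoded_event, total_duration)
```

In the row-oriented layout the same country values are interleaved with the other fields, so no such runs exist to exploit.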
Parquet Model
Example Parquet Schema
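Definition and Repetition Levels
Definition level: records how many of the optional or repeated fields on a column's path are actually defined, which identifies the level at which a value is null.
Repetition level: records at which repeated field on the path a value starts a new list.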
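To make the two levels concrete, here is a small sketch for one assumed schema, a list of lists of strings (`repeated group level1 { repeated string level2; }`). The schema, the sample data, and the `shred` and `assemble` helpers are illustrative assumptions, not taken from the slides or from the Parquet code base. For this schema the maximum definition level is 2, and the repetition level is 0 at the start of a record, 1 at the start of a new inner list, and 2 inside an inner list.

```python
def shred(records):
    """Flatten records into (repetition_level, definition_level, value) triples."""
    triples = []
    for record in records:
        if not record:
            triples.append((0, 0, None))  # level1 itself is empty
            continue
        for i, inner in enumerate(record):
            r_outer = 0 if i == 0 else 1  # 0: new record, 1: new inner list
            if not inner:
                triples.append((r_outer, 1, None))  # level1 present, level2 empty
                continue
            for j, value in enumerate(inner):
                r = r_outer if j == 0 else 2  # 2: same inner list continues
                triples.append((r, 2, value))
    return triples


def assemble(triples):
    """Rebuild the nested records from the triples produced by shred()."""
    records = []
    for r, d, value in triples:
        if r == 0:
            records.append([])      # repetition level 0 starts a new record
        if d == 0:
            continue                # nothing below the record is defined
        if r <= 1:
            records[-1].append([])  # levels 0 and 1 both open a new inner list
        if d == 2:
            records[-1][-1].append(value)
    return records


DATA = [
    [["a", "b", "c"], ["d", "e", "f", "g"]],
    [["h"], ["i", "j"]],
    [],
    [[], ["k"]],
]

TRIPLES = shred(DATA)
for triple in TRIPLES:
    print(triple)
```

Only the levels and the non-null values need to be stored for the column. `assemble(TRIPLES)` rebuilds the original nesting, including the empty lists.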
Numbers Example: AppNexus
2 MM logs of ad impressions, 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES