Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
280
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
170
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
52
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
220
mesos.devoxx.2014
samklr
2
250
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
280
Other Decks in Technology
See All in Technology
Devin(Deep) Wiki/Searchの活用で変わる開発の世界観/devin-wiki-search-impact
tomoki10
0
300
研究開発部メンバーの働き⽅ / Sansan R&D Profile
sansan33
PRO
3
17k
IIWレポートからみるID業界で話題のMCP
fujie
0
100
Amazon Q Developer for GitHubとAmplify Hosting でサクッとデジタル名刺を作ってみた
kmiya84377
0
3.4k
Nonaka Sensei
kawaguti
PRO
3
640
Introduction to Sansan, inc / Sansan Global Development Center, Inc.
sansan33
PRO
0
2.6k
技術職じゃない私がVibe Codingで感じた、AGIが身近になる未来
blueb
0
120
What's new in OpenShift 4.19
redhatlivestreaming
1
220
生成AIをテストプロセスに活用し"よう"としている話 #jasstnano
makky_tyuyan
0
150
AWS全冠したので振りかえってみる
tajimon
0
140
Amplifyとゼロからはじめた AIコーディング 成果と展望
mkdev10
1
180
Cloud Native Scalability for Internal Developer Platforms
hhiroshell
2
440
Featured
See All Featured
Producing Creativity
orderedlist
PRO
346
40k
It's Worth the Effort
3n
184
28k
Visualization
eitanlees
146
16k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
Git: the NoSQL Database
bkeepers
PRO
430
65k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.3k
Statistics for Hackers
jakevdp
799
220k
Building a Scalable Design System with Sketch
lauravandoore
462
33k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Why You Should Never Use an ORM
jnunemaker
PRO
56
9.4k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Being A Developer After 40
akosma
90
590k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None