Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
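To make predicate and projection pushdown concrete, here is a minimal pure-Python sketch. It is a toy model, not Parquet's actual API or file layout: the row-group dictionaries and the `make_row_group` and `scan` helpers are hypothetical names introduced for illustration. The idea it shows is that per-column min/max statistics let a reader skip whole row groups, and that only the requested columns are materialized.

```python
# Toy model of a columnar file: a list of row groups, each holding one
# value list per column plus min/max statistics. All names are hypothetical.

def make_row_group(rows, columns):
    """Pivot a batch of row dicts into column chunks with min/max stats."""
    chunks = {c: [r[c] for r in rows] for c in columns}
    stats = {c: (min(v), max(v)) for c, v in chunks.items()}
    return {"chunks": chunks, "stats": stats, "num_rows": len(rows)}

def scan(row_groups, projection, column, lower_bound):
    """Return the projected columns of rows where row[column] >= lower_bound."""
    out = []
    groups_read = 0
    for rg in row_groups:
        _, col_max = rg["stats"][column]
        if col_max < lower_bound:
            # Predicate pushdown: the statistics prove no row can match,
            # so the whole row group is skipped without reading its values.
            continue
        groups_read += 1
        values = rg["chunks"][column]
        for i in range(rg["num_rows"]):
            if values[i] >= lower_bound:
                # Projection pushdown: only the requested columns are touched.
                out.append({c: rg["chunks"][c][i] for c in projection})
    return out, groups_read

columns = ["id", "country", "clicks"]
rows = [
    {"id": i, "country": "FR" if i % 2 else "US", "clicks": i * 10}
    for i in range(8)
]
# Two row groups of four rows each.
row_groups = [make_row_group(rows[i:i + 4], columns) for i in range(0, 8, 4)]

result, groups_read = scan(row_groups, ["id"], "clicks", 60)
print(result, groups_read)
```

In this sketch the first row group has a maximum `clicks` of 30, so the filter `clicks >= 60` never reads it, and the `country` column is never touched at all.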
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression and fast, low-overhead encodings
- Enables vectorized execution engines
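The advantages above can be illustrated with a small pure-Python sketch. The rows, column names, and the `to_columns` and `run_length_encode` helpers are assumptions made up for this example, and the run-length encoding shown is a simplified stand-in rather than Parquet's actual on-disk encoding. It shows that an aggregate over one column reads only that column, and that similar values stored next to each other compress well.

```python
from itertools import groupby

# Hypothetical ad-impression rows: (day, country, clicks).
rows = [
    ("2015-06-01", "FR", 3),
    ("2015-06-01", "FR", 5),
    ("2015-06-01", "FR", 1),
    ("2015-06-01", "US", 4),
    ("2015-06-02", "US", 2),
    ("2015-06-02", "US", 7),
]
names = ("day", "country", "clicks")

def to_columns(rows, names):
    """Pivot row tuples into one contiguous list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows, names)

# Limited I/O: summing clicks reads a single column, not every row.
total_clicks = sum(columns["clicks"])

# Compression: repeated values within a column collapse into short runs.
encoded_country = run_length_encode(columns["country"])
encoded_day = run_length_encode(columns["day"])

print(total_clicks, encoded_country, encoded_day)
```

Storing each column contiguously is also what makes vectorized processing possible, since an engine can run a tight loop over one homogeneous list of values.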
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores how many optional or repeated fields in a column's path are defined, which marks the level at which a value is null.
- Repetition level: stores, for each value in a column, the level at which a new list starts.
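As a worked illustration of these two levels, the sketch below shreds records into (value, repetition level, definition level) triples in the style of the Dremel paper. It assumes a hypothetical two-level schema, a repeated group `outer` containing a repeated int field `inner`, so the column `outer.inner` has a maximum level of 2 for both. The `shred` function is an illustrative helper, not part of any Parquet library.

```python
# Assumed schema (hypothetical, for illustration only):
#   message Record {
#     repeated group outer {
#       repeated int32 inner;
#     }
#   }
# A record is modelled as a list of lists of ints.

def shred(record):
    """Return (value, repetition_level, definition_level) triples
    for the column outer.inner of one record."""
    if not record:
        # 'outer' itself is empty: nothing on the path is defined.
        return [(None, 0, 0)]
    triples = []
    for i, inner in enumerate(record):
        # First 'outer' item starts a new record (0); later ones repeat at level 1.
        outer_rep = 0 if i == 0 else 1
        if not inner:
            # 'outer' is defined but 'inner' is empty: definition level 1.
            triples.append((None, outer_rep, 1))
            continue
        for j, value in enumerate(inner):
            # Later values in the same inner list repeat at level 2.
            rep = outer_rep if j == 0 else 2
            triples.append((value, rep, 2))
    return triples

full = shred([[1, 2], [3]])
empty_record = shred([])
empty_inner = shred([[]])
mixed = shred([[4], []])

print(full, empty_record, empty_inner, mixed)
```

A repetition level of 0 always marks the start of a new record, and a definition level below the maximum stands in for a null, so no separate null markers or list delimiters need to be stored with the values.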
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES