Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection push down
- Awesome performance
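The predicate push down mentioned above can be sketched in a few lines. This is a minimal illustration, not Parquet's reader: it assumes each row group keeps min/max statistics for a column, and the names `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group: one column's values plus min/max statistics."""
    values: list
    min_value: int
    max_value: int


def make_row_group(values):
    """Build a row group and record its statistics at write time."""
    return RowGroup(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(row_groups, threshold):
    """Return the values above threshold and the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            # The statistics prove that nothing in this group can match, so skip it.
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([1, 4, 7]),
    make_row_group([10, 12, 15]),
    make_row_group([20, 22, 30]),
]
matches, groups_read = scan_greater_than(row_groups, 11)
```

Here the first row group is never scanned, because its maximum (7) already rules out a match. The filter is applied at the storage layer instead of after the data has been loaded.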
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
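To make these advantages concrete, here is a small sketch in plain Python, assuming three made-up log records. The helper names `to_columns` and `run_length_encode` are hypothetical, and the encoding is a simplified run-length scheme, not Parquet's actual encoder.

```python
# Three hypothetical log records, stored row by row.
rows = [
    {"user": "alice", "country": "FR", "clicks": 3},
    {"user": "bob", "country": "FR", "clicks": 5},
    {"user": "carol", "country": "US", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


columns = to_columns(rows)

# Projection: the aggregate touches only the "clicks" column.
total_clicks = sum(columns["clicks"])

# Values of one column sit next to each other, so repeated values form runs.
country_runs = run_length_encode(columns["country"])
```

With the row layout, summing `clicks` would mean reading every field of every record. With the column layout, only one list is read, and the `country` column shrinks to two runs.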
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the nesting level at which the field is null.
- Repetition level: stores the level at which a new list starts in the column values.
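A worked sketch may help here. It assumes a hypothetical schema with two nested repeated fields, `repeated group outer { repeated string inner }`, so the column `outer.inner` has a maximum repetition level of 2 and a maximum definition level of 2. The function `shred` is an illustration written for this one schema, not Parquet's record shredder.

```python
def shred(records):
    """Flatten records into (value, repetition_level, definition_level) triples.

    Each record is a list (outer) of lists (inner) of strings.
    """
    triples = []
    for record in records:
        if not record:
            # No outer element at all: nothing on the path is defined.
            triples.append((None, 0, 0))
            continue
        for i, inner in enumerate(record):
            # 0 starts a new record, 1 starts a new outer element.
            outer_rep = 0 if i == 0 else 1
            if not inner:
                # outer is defined but inner is empty: defined up to level 1.
                triples.append((None, outer_rep, 1))
                continue
            for j, value in enumerate(inner):
                # 2 continues the current inner list.
                rep = outer_rep if j == 0 else 2
                triples.append((value, rep, 2))
    return triples


records = [
    [["a", "b"], ["c"]],  # two inner lists
    [],                   # empty record
    [[], ["d"]],          # an empty inner list, then a value
]
triples = shred(records)
```

For the first record, `a` gets repetition level 0 (new record), `b` gets 2 (same inner list), and `c` gets 1 (new inner list in the same record). The two `None` entries have definition levels 0 and 1, which record how far down the path the data was defined. That is enough to rebuild the nesting from a flat column.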
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES