Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
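To make predicate and projection pushdown concrete, here is a minimal pure-Python sketch. It is a toy model, not Parquet's actual reader or API: the names `make_row_group` and `scan`, the dictionary layout, and the sample columns (`ts`, `user`, `bytes`) are all assumptions made for illustration. The idea it shows is that per-row-group min/max statistics let a reader skip whole groups, and that only the requested columns are ever touched.

```python
# Toy model (hypothetical, not the Parquet API): a "file" is a list of row
# groups; each row group keeps one list per column plus min/max statistics.

def make_row_group(rows, columns):
    """Store rows column by column and record (min, max) for each column."""
    data = {c: [r[c] for r in rows] for c in columns}
    stats = {c: (min(v), max(v)) for c, v in data.items()}
    return {"data": data, "stats": stats}


def scan(row_groups, wanted, column, lo, hi):
    """Return tuples of the `wanted` columns for rows with lo <= column <= hi.

    Also returns how many row groups were actually read.
    """
    out = []
    groups_read = 0
    for rg in row_groups:
        mn, mx = rg["stats"][column]
        if mx < lo or mn > hi:
            continue  # predicate pushdown: statistics rule out the whole group
        groups_read += 1
        filter_values = rg["data"][column]
        # Projection pushdown: only the requested columns are materialized.
        cols = [rg["data"][c] for c in wanted]
        for i, v in enumerate(filter_values):
            if lo <= v <= hi:
                out.append(tuple(col[i] for col in cols))
    return out, groups_read


columns = ["ts", "user", "bytes"]
rows = [{"ts": t, "user": f"u{t % 3}", "bytes": t * 10} for t in range(12)]
# Three row groups of four rows each: ts 0-3, 4-7, 8-11.
row_groups = [make_row_group(rows[i:i + 4], columns) for i in range(0, 12, 4)]

result, groups_read = scan(row_groups, ["user", "bytes"], "ts", 5, 6)
print(result, groups_read)
```

In this sketch only the middle row group overlaps the range 5 to 6, so the other two are skipped without looking at their values.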
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression and faster, low-overhead encodings
- Enables vectorized execution engines
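The advantages above can be illustrated with a small pure-Python sketch. It pivots a few rows into columns, run-length encodes one column, and aggregates another without touching the rest. The sample data and the helper names (`to_columns`, `rle_encode`, `rle_decode`) are made up for illustration; run-length encoding is one of several encodings a columnar format can apply, and this is not Parquet's on-disk layout.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of same-shaped dict rows into {column: [values]}."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in runs for _ in range(n)]


# Hypothetical log rows, for illustration only.
rows = [
    {"country": "FR", "status": 200, "latency_ms": 12},
    {"country": "FR", "status": 200, "latency_ms": 15},
    {"country": "FR", "status": 200, "latency_ms": 11},
    {"country": "US", "status": 200, "latency_ms": 40},
    {"country": "US", "status": 500, "latency_ms": 95},
]

columns = to_columns(rows)

# Values of one column sit next to each other, so repeats compress well.
encoded_country = rle_encode(columns["country"])

# An aggregate over one column reads that column only.
latencies = columns["latency_ms"]
avg_latency = sum(latencies) / len(latencies)

print(encoded_country, avg_latency)
```

Because each column holds values of a single type, often with long runs of repeats, simple encodings like this one tend to be both compact and cheap to decode.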
Parquet Model
Example Parquet Schema
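Definition and Repetition Levels
Definition level: records how many optional or repeated fields along the column's path are actually defined, so it identifies the level at which a value is null.
Repetition level: records at which repeated field along the path a new list starts in the column values.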
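As a rough sketch of how these two levels work together, the code below stripes a doubly nested list column into (value, repetition level, definition level) triples and then rebuilds the records from them. The schema is an assumption made for this example: a repeated group `a` containing a repeated field `b`, so both levels range from 0 to 2. The functions `stripe` and `assemble` are hypothetical names, not part of any Parquet library.

```python
# Assumed schema (illustration only): repeated group `a` { repeated `b` }.
# A record is modelled as a list of lists: record[i][j] is a[i].b[j].
MAX_DEF = 2  # both `a` and `b` are defined


def stripe(records):
    """Flatten records into (value, rep_level, def_level) triples."""
    out = []
    for record in records:
        if not record:
            out.append((None, 0, 0))  # new record, `a` is missing entirely
            continue
        for i, inner in enumerate(record):
            new_a = 0 if i == 0 else 1  # 0: new record, 1: new `a` in same record
            if not inner:
                out.append((None, new_a, 1))  # `a` defined, but no `b` inside
                continue
            for j, value in enumerate(inner):
                rep = new_a if j == 0 else 2  # 2: another `b` in the same `a`
                out.append((value, rep, MAX_DEF))
    return out


def assemble(triples):
    """Rebuild the nested records from the triples produced by stripe()."""
    records = []
    for value, rep, definition in triples:
        if rep == 0:
            records.append([])      # repetition level 0 starts a new record
        if definition >= 1 and rep <= 1:
            records[-1].append([])  # a new `a` starts here
        if definition == MAX_DEF:
            records[-1][-1].append(value)
    return records


records = [
    [["a", "b", "c"], ["d", "e", "f", "g"]],
    [["h"], ["i", "j"]],
    [],    # record with no `a`
    [[]],  # record with one `a` that has no `b`
]

triples = stripe(records)
rep_levels = [r for _, r, _ in triples]
def_levels = [d for _, _, d in triples]
print(rep_levels)
print(def_levels)
```

The round trip `assemble(stripe(records))` returns the original records, which is the point of storing the levels next to the values: the flat column stays lossless for nested and missing data.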
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES