Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
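Predicate and projection pushdown can be illustrated with a small sketch. This is not Parquet's API or file layout: `ColumnChunk`, `make_row_group`, and `scan` are hypothetical names, and the data is made up. It only assumes that each row group keeps min/max statistics per column, so a reader can skip whole row groups (predicate pushdown) and touch only the columns a query asks for (projection pushdown).

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column inside one row group, plus min/max statistics."""
    values: list
    min_value: object
    max_value: object


def make_row_group(rows, columns):
    """Pivot a list of row dicts into one ColumnChunk per column."""
    group = {}
    for name in columns:
        vals = [row[name] for row in rows]
        group[name] = ColumnChunk(vals, min(vals), max(vals))
    return group


def scan(row_groups, projection, filter_column, lower_bound):
    """Return projected rows where filter_column >= lower_bound.

    Row groups whose max statistic is below lower_bound are skipped
    without reading their values (predicate pushdown). Only the filter
    column and the projected columns are read (projection pushdown).
    """
    result, groups_read = [], 0
    for group in row_groups:
        stats = group[filter_column]
        if stats.max_value < lower_bound:
            continue  # whole row group pruned using statistics only
        groups_read += 1
        keep = [i for i, v in enumerate(stats.values) if v >= lower_bound]
        for i in keep:
            result.append({name: group[name].values[i] for name in projection})
    return result, groups_read


# Made-up sample data, split into two row groups of two rows each.
rows = [
    {"user": "a", "country": "FR", "clicks": 1},
    {"user": "b", "country": "US", "clicks": 3},
    {"user": "c", "country": "FR", "clicks": 12},
    {"user": "d", "country": "DE", "clicks": 15},
]
columns = ["user", "country", "clicks"]
row_groups = [make_row_group(rows[:2], columns), make_row_group(rows[2:], columns)]

# Equivalent in spirit to: SELECT user WHERE clicks >= 10
selected, groups_read = scan(row_groups, ["user"], "clicks", 10)
print(selected, groups_read)
```

In this toy run the first row group is never scanned, because its maximum `clicks` value is below the bound, and the `country` column is never read at all.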
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression and faster, low-overhead encodings
- Enables vectorized execution engines
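The advantages above can be sketched in a few lines of plain Python. The records and the helper names (`to_columns`, `run_length_encode`, `run_length_decode`) are assumptions made for illustration. Run-length encoding is used here as one simple example of an encoding that benefits from a columnar layout, not as a description of what any particular file format does internally.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Made-up sample records in row-oriented form.
rows = [
    {"country": "FR", "device": "mobile", "clicks": 3},
    {"country": "FR", "device": "mobile", "clicks": 7},
    {"country": "FR", "device": "desktop", "clicks": 1},
    {"country": "US", "device": "desktop", "clicks": 4},
    {"country": "US", "device": "desktop", "clicks": 9},
]

columns = to_columns(rows)

# Similar values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column reads that column only, not whole rows.
total_clicks = sum(columns["clicks"])

print(encoded_country, total_clicks)
```

Summing `clicks` never touches `country` or `device`, which is the I/O saving; the five `country` values shrink to two pairs, which is the compression saving.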
Parquet Model
Example Parquet Schema
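Definition and Repetition Levels
- Definition level: records the nesting level at which a field is null, i.e. how many optional or repeated fields along the column's path are actually defined.
- Repetition level: records the level at which a new list starts in the column's values, i.e. at which repeated field along the path the value repeats.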
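A minimal sketch of how the two levels work for a single column, following the Dremel-style encoding. The schema, the records, and the function names `shred` and `assemble` are hypothetical and chosen only for illustration. The assumed schema is `optional group contacts { repeated string phone }`, so the column `contacts.phone` has a maximum definition level of 2 and a maximum repetition level of 1.

```python
def shred(records):
    """Flatten contacts.phone into (repetition, definition, value) triples.

    Assumed schema (hypothetical):
        optional group contacts { repeated string phone }
    """
    triples = []
    for record in records:
        contacts = record.get("contacts")
        if contacts is None:
            triples.append((0, 0, None))   # contacts itself is missing
            continue
        phones = contacts.get("phone") or []
        if not phones:
            triples.append((0, 1, None))   # contacts present, phone list empty
            continue
        for index, phone in enumerate(phones):
            # r=0 starts a new record, r=1 continues the current phone list
            triples.append((0 if index == 0 else 1, 2, phone))
    return triples


def assemble(triples):
    """Rebuild the records from the triples produced by shred."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:                # a new record starts here
            if definition == 0:
                records.append({"contacts": None})
            else:
                records.append({"contacts": {"phone": []}})
        if definition == 2:                # the value is actually defined
            records[-1]["contacts"]["phone"].append(value)
    return records


# Made-up sample records covering each case.
records = [
    {"contacts": {"phone": ["555-1", "555-2"]}},
    {"contacts": {"phone": []}},
    {"contacts": None},
    {"contacts": {"phone": ["555-3"]}},
]

triples = shred(records)
for triple in triples:
    print(triple)
```

The levels alone are enough to tell an empty list from a missing parent and to find record boundaries, so `assemble(triples)` rebuilds the original nested records from the flat column.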
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES