Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
310
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
360
High Performance RPC with Finagle
samklr
1
210
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
820
Datageeks_27-05.pdf
samklr
0
68
Big data and Machine learning APIs
samklr
4
280
Scalable Machine Learning
samklr
2
260
mesos.devoxx.2014
samklr
2
280
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
3k
Algebra for analytics
samklr
1
300
Other Decks in Technology
See All in Technology
GitLab Duo Agent Platform + Local LLMサービングで幸せになりたい
jyoshise
0
100
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
7
7.1k
大規模サービスにおける レガシーコードからReactへの移行
magicpod
1
130
Serverless Agent Architecture on Azure / serverless-agent-on-azure
miyake
1
150
組織のSREを推進するためのPlatform EngineeringとEKS / Platform Engineering and EKS to drive SRE in your organization
chmikata
0
180
管理者向けGitHub Enterpriseの運用Tips紹介: 人にもAIにも優しいプラットフォームづくり
yuriemori
0
110
「データとの対話」の現在地と未来
kobakou
0
1.3k
Devinを導入したら予想外の人たちに好評だった
tomuro
0
880
JAWS DAYS 2026 CDP道場 事前説明会 / JAWS DAYS 2026 CDP Dojo briefing document
naospon
0
140
製造業ドメインにおける LLMプロダクト構築: 複雑な文脈へのアプローチ
caddi_eng
1
450
Databricksアシスタントが自分で考えて動く時代に! エージェントモード体験もくもく会
taka_aki
0
320
LINEヤフーにおけるAI駆動開発組織のプロデュース施策
lycorptech_jp
PRO
0
400
Featured
See All Featured
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
1.9k
Chasing Engaging Ingredients in Design
codingconduct
0
130
Designing for Timeless Needs
cassininazir
0
150
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.4k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
54k
Large-scale JavaScript Application Architecture
addyosmani
515
110k
Unlocking the hidden potential of vector embeddings in international SEO
frankvandijk
0
190
We Are The Robots
honzajavorek
0
190
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
10k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.4k
Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint/Restore in Edge–Cloud Continuum
chikuwait
0
380
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None