A simple, serverless data lake on AWS

Riccardo Magliocchetti - @rmistaken / @[email protected] - Data Beers Torino

Agenda for today
- The context and the architecture
- The data lake
- Data formats
- Vendor lock-in

The architecture

The context
- The problem: pushing business metrics out of a web application
- Small team, no data engineer
- No real-time requirements, data produced daily
- Small quantity of data

The team
- Nicola Martino
- Bartolo Albanese

The data lake

AWS S3
- Cloud object storage: a persistent cache that looks like a filesystem
- Cheap unless you move a lot of data outside AWS
- De facto standard
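
Since the data is produced daily, one common way to make object storage "look like a filesystem" is a date-partitioned key scheme. The following is a minimal sketch: the `metrics` prefix, the `data.csv` filename, and the Hive-style `year=/month=/day=` partition names are illustrative assumptions, not details from the talk.

```python
from datetime import date


def daily_object_key(day: date, prefix: str = "metrics", filename: str = "data.csv") -> str:
    """Build an S3 object key for one day's export.

    The prefix, filename, and Hive-style partition names are
    illustrative assumptions, not taken from the talk.
    """
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


print(daily_object_key(date(2024, 1, 15)))
```

A layout like this keeps each day's file under its own pseudo-directory, which also makes it easy to inspect by hand.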

AWS Glue
- Serverless data integration service
- Pay per use
- Can do data transformations, but you can also do it yourself with S3 event notifications plus Lambda functions
- Data catalog functionality for data in S3
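
The do-it-yourself route (event notifications plus lambdas) could look roughly like the sketch below: a Lambda handler that reads the bucket and key out of an S3 event notification. The function names are hypothetical and the transformation itself is left as a placeholder, since the talk does not describe it.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point; the actual transformation is a placeholder."""
    found = extract_s3_objects(event)
    for bucket, key in found:
        print(f"would transform s3://{bucket}/{key}")
    return {"processed": len(found)}
```

Keeping the event parsing in its own function makes it possible to test the logic locally with a hand-written event dictionary, without deploying anything.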

AWS Athena
- Serverless query service: query data cataloged in Glue using SQL
- Pay per use: $5 per TB of data scanned
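
To get a feel for what pay-per-use means at small scale, here is a back-of-the-envelope sketch based only on the $5 per TB figure from the slide. It assumes a decimal terabyte and ignores any per-query minimum or rounding that the service applies; the 200 MB query size is a made-up example.

```python
PRICE_PER_TB_USD = 5.0   # figure quoted on the slide
BYTES_PER_TB = 10**12    # assumption: decimal terabyte


def estimate_query_cost(bytes_scanned: int) -> float:
    """Rough query cost in USD, ignoring minimums and rounding rules."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical daily report that scans 200 MB of data
cost = estimate_query_cost(200 * 10**6)
print(f"${cost:.4f} per query")
```

Under these assumptions such a query costs about a tenth of a cent, which is why columnar, compressed formats (less data scanned) matter more for speed than for the bill at this scale.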

Data formats

Data formats
- Started simple with CSV: text, easiest to debug
- Optimized size with ORC: binary, columnar, compressed
- Optimized developer experience (DX) with Parquet: same properties but a bit less space-efficient, with a bigger ecosystem
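
The "text, easiest to debug" point about CSV can be illustrated with the standard library alone. This is a minimal sketch with hypothetical column names and sample rows; writing ORC or Parquet would need a third-party library and is not shown here.

```python
import csv
import io


def metrics_to_csv(rows: list) -> str:
    """Serialize metric rows (dicts) to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["day", "metric", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data
rows = [
    {"day": "2024-01-15", "metric": "signups", "value": 42},
    {"day": "2024-01-15", "metric": "orders", "value": 7},
]
text = metrics_to_csv(rows)
print(text)

# Reading it back: every value is a string, since CSV carries no type information.
parsed = list(csv.DictReader(io.StringIO(text)))
```

The output can be read with any text editor, but the round trip loses types, which is one of the trade-offs that binary formats with an embedded schema avoid.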

Vendor lock-in

Vendor lock-in
- Can be replicated using open source software:
  - MinIO as object storage
  - Apache Hive (or similar) to ingest data and query it in SQL
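
On the storage side, swapping S3 for an S3-compatible server such as MinIO usually comes down to pointing the client at a different endpoint. The sketch below assumes `boto3` as the client library and uses a hypothetical `S3_ENDPOINT_URL` environment variable; neither is mentioned in the talk.

```python
import os


def s3_client_kwargs(env=None) -> dict:
    """Build keyword arguments for boto3.client("s3", ...).

    S3_ENDPOINT_URL is a hypothetical variable name chosen for this sketch.
    When it is unset, the client targets AWS S3 as usual.
    """
    env = os.environ if env is None else env
    kwargs = {}
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


def make_s3_client(env=None):
    """Create the client; boto3 is third-party and imported lazily."""
    import boto3

    return boto3.client("s3", **s3_client_kwargs(env))
```

With something like `S3_ENDPOINT_URL=http://localhost:9000` (an example address for a local MinIO instance), the same application code would talk to the self-hosted store instead of AWS.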

Conclusions

Conclusions
- You can build simple data pipelines cheaply without much operational overhead
- Interesting for small teams without data-specific roles or knowledge
- Serverless makes sense at small scale

References
- The AWS idea of a data architecture: https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/

Thanks! Riccardo Magliocchetti @rmistaken / @[email protected]
