Slide 1

Slide 1 text

A simple, serverless data lake on AWS
Riccardo Magliocchetti - @rmistaken / @rmistaken@hachyderm.io
Data Beers Torino

Slide 2

Slide 2 text

Agenda for today
- The context and the architecture
- The data lake
- Data formats
- Vendor lock-in

Slide 3

Slide 3 text

The architecture

Slide 4

Slide 4 text

The context
- The problem: getting business metrics out of a web application
- Small team, no data engineer
- No real-time requirements: data is produced daily
- Small amount of data

Slide 5

Slide 5 text

The team
- Nicola Martino
- Bartolo Albanese

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

The data lake

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

AWS S3
- Cloud object storage: a persistent cache that looks like a filesystem
- Cheap unless you move a lot of data outside AWS
- De facto standard

Slide 10

Slide 10 text

AWS Glue
- Serverless data integration service
- Pay per use
- Can run data transformations, but you can also do it yourself with S3 event notifications plus Lambda functions
- Data catalog functionality for data in S3
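
The do-it-yourself route on this slide (S3 event notifications plus Lambda) could look roughly like the sketch below. It only plans where each uploaded file should go and leaves out the actual read, transform and write steps. The `metrics/day=.../` layout and the function names are assumptions made for illustration, not the setup described in the talk.

```python
from urllib.parse import unquote_plus


def target_key(source_key, day):
    """Build a day-partitioned key (hypothetical layout, not from the talk)."""
    filename = source_key.rsplit("/", 1)[-1]
    return f"metrics/day={day}/{filename}"


def handler(event, context):
    """Lambda entry point for S3 "object created" notifications.

    Returns the planned moves; a real function would also read the source
    object, transform it and write the result (for example with boto3).
    """
    planned = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # eventTime is an ISO 8601 timestamp; its first 10 characters are the date.
        day = record["eventTime"][:10]
        planned.append(
            {"bucket": bucket, "source": key, "target": target_key(key, day)}
        )
    return planned
```

Keeping the key mapping in a plain function makes it easy to test locally by passing a hand-written event dictionary to `handler`, without deploying anything.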

Slide 11

Slide 11 text

AWS Athena
- Serverless query service: query data cataloged in Glue with SQL
- Pay per use: $5 per TB of data scanned
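
To get a feeling for what the 5 USD per TB price on this slide means at small scale, here is a back-of-the-envelope estimate. The workload (one query per day scanning 200 MB) is made up, and the sketch assumes 1 TB = 10**12 bytes and ignores any per-query minimum or rounding in the real billing.

```python
def athena_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost of one query from the bytes it scans.

    Assumes 1 TB = 10**12 bytes and ignores per-query minimums and rounding.
    """
    return bytes_scanned / 10**12 * usd_per_tb


# Hypothetical workload: one query per day, each scanning 200 MB.
yearly = 365 * athena_cost_usd(200 * 10**6)
print(f"{yearly:.3f} USD per year")
```

Under these assumptions the yearly bill stays well below one dollar, which is why scanning less data (for example with compressed, columnar files) matters more than anything else for cost.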

Slide 12

Slide 12 text

Data formats

Slide 13

Slide 13 text

Data formats
- Started simple with CSV: plain text, easiest to debug
- Optimized size with ORC: binary, columnar, compressed
- Optimized developer experience (DX) with Parquet: same properties but a bit less space efficient, bigger ecosystem
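
The toy example below illustrates why a binary, columnar, compressed layout tends to take less space than CSV. It is not ORC or Parquet: it simply stores each column as its own zlib-compressed blob, using only the Python standard library. The data and the function names are invented for the illustration.

```python
import csv
import io
import zlib

# Made-up daily metrics: 30 days x 3 metrics.
rows = [
    (f"2024-01-{day:02d}", metric, str(100 * day + offset))
    for day in range(1, 31)
    for offset, metric in enumerate(["signups", "orders", "revenue"])
]


def to_csv_bytes(rows):
    """Row-oriented, uncompressed text: easy to read and debug."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


def to_columnar(rows):
    """Store each column as its own compressed blob (toy layout)."""
    columns = zip(*rows)
    return [zlib.compress("\n".join(column).encode("utf-8")) for column in columns]


def read_column(blobs, index):
    """Decompress a single column without touching the others."""
    return zlib.decompress(blobs[index]).decode("utf-8").split("\n")


csv_size = len(to_csv_bytes(rows))
blobs = to_columnar(rows)
columnar_size = sum(len(blob) for blob in blobs)
print(csv_size, columnar_size)
```

Values within one column resemble each other, so they compress well, and a query that needs a single column (here `read_column(blobs, 2)`) can skip the rest. Real formats such as ORC and Parquet add typed encodings, statistics and metadata on top of this basic idea.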

Slide 14

Slide 14 text

Vendor lock-in

Slide 15

Slide 15 text

Vendor lock-in
- The setup can be replicated using open source software:
  - MinIO as object storage
  - Apache Hive (or similar) to ingest data and query it in SQL
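
One way to keep application code portable between S3 and an S3-compatible store such as MinIO is to make the endpoint configurable. The sketch below builds the keyword arguments for `boto3.client()`; the environment variable name `OBJECT_STORE_ENDPOINT` and the helper name are hypothetical, and the sketch assumes the alternative store speaks the S3 API.

```python
import os


def s3_client_kwargs(environ=os.environ):
    """Build keyword arguments for boto3.client().

    OBJECT_STORE_ENDPOINT is a hypothetical variable name: when unset the
    client talks to AWS S3, when set (e.g. "http://localhost:9000") it talks
    to an S3-compatible server such as MinIO.
    """
    kwargs = {"service_name": "s3"}
    endpoint = environ.get("OBJECT_STORE_ENDPOINT")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client(**s3_client_kwargs())
```

With this approach the rest of the code (uploads, downloads, listings) stays the same regardless of which object store is behind the endpoint.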

Slide 16

Slide 16 text

Conclusions

Slide 17

Slide 17 text

Conclusions
- You can build simple data pipelines cheaply and without much operational overhead
- Interesting for small teams without data-specific roles or knowledge
- Serverless makes sense at small scale

Slide 18

Slide 18 text

References
- The AWS idea of a data architecture: https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/

Slide 19

Slide 19 text

Thanks!
Riccardo Magliocchetti - @rmistaken / @rmistaken@hachyderm.io