A simple, serverless data lake on AWS
Riccardo Magliocchetti - @rmistaken / @rmistaken@hachyderm.io
Data Beers Torino
Agenda for today
- The context and the architecture
- The data lake
- Data formats
- Vendor lock-in
The architecture
The context
- The problem: pushing business metrics out of a web application
- Small team, no data engineer
- No real-time requirements: data is produced daily
- Small quantity of data
The team
- Nicola Martino
- Bartolo Albanese
The data lake
AWS S3
- Cloud object storage: a persistent cache that looks like a filesystem
- Cheap unless you move a lot of data outside AWS
- De facto standard
AWS Glue
- Serverless data integration service
- Pay per use
- Can do data transformations, but you can also do it yourself: S3 event
notifications plus Lambda functions
- Data catalog functionality for data in S3
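The DIY route (event notifications plus lambdas) could look roughly like the sketch below. It is not the talk's actual code: the `raw/` and `clean/` prefixes and the `transform` rule are hypothetical, and it assumes the S3 notification is filtered on the `raw/` prefix so the function does not trigger itself when it writes its output.

```python
import csv
import io
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def destination_key(key, src_prefix="raw/", dst_prefix="clean/"):
    """Map raw/<path> to clean/<path>; return None for keys outside src_prefix."""
    if not key.startswith(src_prefix):
        return None
    return dst_prefix + key[len(src_prefix):]


def transform(csv_text):
    """Hypothetical transformation: drop rows whose 'value' column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None:  # empty input, nothing to write
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row.get("value"):
            writer.writerow(row)
    return out.getvalue()


def handler(event, context):
    """Lambda entry point: transform each new raw/ object and store it under clean/."""
    import boto3  # imported lazily so the pure functions above work without boto3

    s3 = boto3.client("s3")
    for bucket, key in objects_from_event(event):
        target = destination_key(key)
        if target is None:
            continue  # ignore objects outside the raw/ prefix
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(Bucket=bucket, Key=target, Body=transform(body).encode("utf-8"))
```

Keeping the parsing and transformation in plain functions means they can be tested locally, with only `handler` touching AWS.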
AWS Athena
- Serverless query service: query data cataloged in Glue with SQL
- Pay per use: $5 per TB of data scanned
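To get a feel for what pay per use means at small scale, here is a back-of-the-envelope sketch. It assumes the price applies to the data each query scans and that a TB is 10^12 bytes; the 50 MB figure is a made-up example, not a number from the project.

```python
PRICE_PER_TB_USD = 5.0   # price quoted on the slide
BYTES_PER_TB = 10**12    # assumption: decimal terabyte


def athena_query_cost(bytes_scanned):
    """Estimated cost in USD of a query scanning the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical workload: one query per day, each scanning 50 MB.
daily = athena_query_cost(50 * 10**6)
print(f"per query: ${daily:.5f}, per year: ${daily * 365:.2f}")
```

With data volumes like these the query cost stays well under a dollar a year, which is why scanning less data (see the data formats) matters more as volumes grow.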
Data formats
- Started simple with CSV: plain text, easiest to debug
- Optimized size with ORC: binary, columnar, compressed
- Optimized DX (developer experience) with Parquet: also binary, columnar and
compressed; a bit less space efficient, but with a bigger ecosystem
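Why columnar helps can be shown with a toy layout. This is not the real ORC or Parquet on-disk format, only a sketch of the idea: each column is compressed and stored on its own, with an index of offsets, so a reader can fetch one column without touching the others. All names here are hypothetical.

```python
import json
import struct
import zlib


def write_columnar(rows, columns):
    """Serialize rows as: 4-byte header length, JSON index, compressed column chunks."""
    chunks, index, offset = [], {}, 0
    for name in columns:
        values = [row[name] for row in rows]
        chunk = zlib.compress(json.dumps(values).encode("utf-8"))
        index[name] = [offset, len(chunk)]  # where this column lives in the body
        chunks.append(chunk)
        offset += len(chunk)
    header = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(header)) + header + b"".join(chunks)


def read_column(blob, name):
    """Decode a single column, reading only the header and that column's chunk."""
    (header_len,) = struct.unpack(">I", blob[:4])
    index = json.loads(blob[4:4 + header_len])
    start, length = index[name]
    body_start = 4 + header_len
    chunk = blob[body_start + start:body_start + start + length]
    return json.loads(zlib.decompress(chunk))


rows = [
    {"day": "2024-01-01", "signups": 3},
    {"day": "2024-01-02", "signups": 5},
]
blob = write_columnar(rows, ["day", "signups"])
print(read_column(blob, "signups"))
```

With CSV, answering the same question means parsing every row in full; a query engine billed on bytes scanned benefits directly from skipping unused columns.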
Vendor lock-in
- Can be replicated using open source software:
  - MinIO as object storage
  - Apache Hive (or similar) to ingest data and query it in SQL
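On the storage side, the swap can be small because MinIO exposes an S3-compatible API: the same boto3 code can target it by overriding the endpoint. A minimal sketch, assuming a hypothetical `OBJECT_STORE_ENDPOINT` environment variable and a local MinIO on its default port 9000:

```python
import os


def s3_client_kwargs(endpoint_url=None):
    """Build boto3 client arguments; without an endpoint they target AWS S3."""
    kwargs = {"service_name": "s3"}
    if endpoint_url:
        # Point the client at an S3-compatible server such as MinIO.
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def make_client():
    """Create the S3 client, honoring the hypothetical OBJECT_STORE_ENDPOINT variable."""
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.client(**s3_client_kwargs(os.environ.get("OBJECT_STORE_ENDPOINT")))


print(s3_client_kwargs("http://localhost:9000"))
```

Credentials are left to the usual boto3 configuration; the Glue and Athena parts have no such drop-in switch and would need to be rebuilt on Hive or a similar engine.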
Conclusions
- You can build simple data pipelines cheaply and without much operational
overhead
- Interesting for small teams without data-specific roles or knowledge
- Serverless makes sense at small scale
References
- The AWS idea of a data architecture
https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/