A simple, serverless data lake on AWS

In this short talk we'll see an architecture for a data lake, the place where you store and process your data, built with serverless services on AWS.

Riccardo Magliocchetti

October 18, 2023

Transcript

  1. A simple, serverless data lake on AWS
    - Riccardo Magliocchetti
    - @rmistaken / @rmistaken@hachyderm.io
    - Data Beers Torino
  2. Agenda for today
    - The context and the architecture
    - The data lake
    - Data formats
    - Vendor lock-in
  3. The context
    - The problem: pushing business metrics out of a web application
    - Small team, no data engineer
    - No real-time requirements, data produced daily
    - Small quantity of data
  4. AWS S3
    - Cloud object storage: a persistent cache that looks like a filesystem
    - Cheap unless you move a lot of data outside AWS
    - De facto standard
  5. AWS Glue
    - Serverless data integration service
    - Pay per use
    - Can do data transformations, but you can also DIY with event notifications plus Lambdas
    - Data catalog functionality for data in S3
  6. Data formats
    - Started simple with CSV: text, best for debugging
    - Optimized size with ORC: binary, columnar, compressed
    - Optimized DX with Parquet: same properties as ORC but a bit less space efficient, with a bigger ecosystem
  7. Vendor lock-in
    - Can be replicated using open source software:
    - MinIO as object storage
    - Apache Hive (or similar) to ingest data and query it in SQL
  8. Conclusions
    - You can build simple data pipelines cheaply without much operations overhead
    - Interesting for small teams without data-specific roles or knowledge
    - Serverless makes sense for small scale
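
The DIY route mentioned on the AWS Glue slide (event notifications plus Lambdas) and the row-versus-columnar distinction from the data formats slide could look roughly like the sketch below. It is not the code from the talk. It assumes the standard S3 event notification payload shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`), and the names `extract_objects`, `csv_to_columns` and `handler` are hypothetical. The download step is left out so the example runs without AWS credentials or third-party libraries.

```python
import csv
import io
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def csv_to_columns(text):
    """Pivot row-oriented CSV text into a dict of column name -> list of values.

    This only illustrates the columnar idea behind ORC and Parquet; real
    columnar files also carry types, compression and metadata.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


def handler(event, context):
    """Hypothetical Lambda entry point: list the uploaded objects to process."""
    # A real handler would download each object here (for example with
    # boto3's get_object) and pass the body to csv_to_columns. That step is
    # omitted so the sketch stays runnable offline.
    return [{"bucket": b, "key": k} for b, k in extract_objects(event)]
```

With a daily CSV export, `csv_to_columns` groups all values of one metric together, which is the layout that lets columnar formats compress well and read only the columns a query needs.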