A simple, serverless data lake on AWS

Video: https://video.linux.it/w/6HoHhhyrnfqYnzZiM9sRx5?start=28m&stop=40m40s

Building and maintaining data pipelines when it is not your job is particularly tedious, so it is better to keep things simple, ideally by letting someone else administer the systems.

This talk presents an architecture for a data lake, that is, a place to store and process the data we want to analyze. It is serverless, meaning we do not have to administer servers and systems ourselves, and it runs on AWS infrastructure.


Riccardo Magliocchietti — Freelance software tinkerer

Python Torino

October 18, 2023

Transcript

  1. Agenda for today
     - The context and the architecture
     - The data lake
     - Data formats
     - Vendor lock-in
  2. The context
     - The problem: pushing business metrics out of a web application
     - Small team, no data engineer
     - No real-time requirements, data produced daily
     - Small quantity of data
  3. AWS S3
     - Cloud object storage: a persistent cache that looks like a filesystem
     - Cheap unless you move a lot of data outside AWS
     - De facto standard
  4. AWS Glue
     - Serverless data integration service
     - Pay per use
     - Can do data transformations, but you can also do it yourself with S3 event notifications plus Lambda functions
     - Data catalog functionality for data in S3
  5. Data formats
     - Started simple with CSV: text, easiest to debug
     - Optimized size with ORC: binary, columnar, compressed
     - Optimized DX with Parquet: same properties as ORC but a bit less space efficient, with a bigger ecosystem
  6. Vendor lock-in
     - Can be replicated using open source software:
       - MinIO as object storage
       - Apache Hive (or similar) to ingest data and query it in SQL
  7. Conclusions
     - You can build simple data pipelines cheaply without much operations overhead
     - Interesting for small teams without data-specific roles or knowledge
     - Serverless makes sense for small scale
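
The do-it-yourself route from the AWS Glue slide (S3 event notifications plus Lambda functions) could look roughly like the sketch below. It is only an illustration, not the code from the talk. The function names, the `parquet` prefix and the choice to mirror the source path are assumptions made for the example. The handler only parses the notification and plans a CSV to Parquet conversion. The conversion itself is left out because the slides do not say which library was used.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) pairs for every object in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys are URL-encoded in S3 notifications (a space arrives as "+").
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def parquet_key(csv_key, prefix="parquet"):
    """Map a CSV key to a destination key (hypothetical layout: same path under a prefix)."""
    return str(PurePosixPath(prefix) / PurePosixPath(csv_key).with_suffix(".parquet"))


def handler(event, context):
    """Lambda entry point: plan one conversion per uploaded CSV file."""
    plan = []
    for bucket, key in parse_s3_event(event):
        if not key.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV upload
        plan.append({"bucket": bucket, "source": key, "target": parquet_key(key)})
    # Assumption: the real function would now read each source object,
    # write it out as Parquet and upload it to the target key.
    return plan
```

Because the handler is a plain function, it can be tried locally by passing it a dictionary shaped like an S3 notification, with no AWS account involved. That fits the small-team, low-operations setting described in the context slide.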