
A simple serverless data pipeline

Building and maintaining data pipelines when it’s not your full-time job is a pain! So it’s better to keep things simple and avoid having to manage the system yourself. In this talk I’ll show a data pipeline architecture built on cloud offerings from AWS and Preset.

In this talk we’ll discuss:
- an overview of the architecture
- the data lake: AWS S3, AWS Glue and AWS Athena
- the exploration and visualization platform: Apache Superset
- data formats and Python implementations
- vendor lock-in

Presented at PyCon Portugal 2023 on September 7th, 2023.

Riccardo Magliocchetti

September 08, 2023

Transcript

  1. Agenda for today
     - An overview of a simple serverless data pipeline architecture
     - The data lake: AWS S3, AWS Glue, AWS Athena
     - Apache Superset, the exploration and visualization platform
     - Data formats
     - Vendor lock-in
  2. The context
     - Small team
     - No data engineer
     - End users are business people
     - No real-time requirements, data produced daily
     - Small quantity of data
  3. AWS S3
     - Cloud object storage: a persistent cache that looks like a filesystem
     - Cheap unless you move a lot of data outside AWS
     - De facto standard
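For the ingestion side, here is a minimal sketch of dropping a daily extract into S3 with boto3; the bucket name, prefix and file name are made up for illustration, and credentials are resolved through boto3's usual chain (environment, config files or an IAM role):

```python
from datetime import date

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Partitioning daily extracts by date keeps the layout friendly for
# Glue crawlers and Athena partition pruning later on.
today = date.today().isoformat()
s3.upload_file(
    Filename="daily_extract.parquet",
    Bucket="my-data-lake",  # hypothetical bucket name
    Key=f"raw/daily_extract/dt={today}/daily_extract.parquet",
)
```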
  4. AWS Glue
     - Serverless data integration service
     - Pay per use
     - Can do data transformations, but you can DIY: event notifications plus Lambdas (see the sketch below)
     - Data catalog functionality for data in S3
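A sketch of the DIY route mentioned above: an AWS Lambda handler hooked to S3 event notifications that re-runs a Glue crawler so new files show up in the data catalog. The crawler name and the event wiring are assumptions for illustration, not something from the talk:

```python
import boto3

glue = boto3.client("glue")


def handler(event, context):
    """Invoked by an S3 event notification when a new object lands in the lake."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")

    # Refresh the Glue data catalog; the crawler name is hypothetical.
    glue.start_crawler(Name="daily-extract-crawler")
```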
  5. Apache Superset
     - Data exploration and visualization platform
     - Written in Python (Flask, SQLAlchemy, Pandas) and JavaScript
     - Plenty of databases supported
     - SQL Editor included for exploration
     - Hosted version from Preset.io
  6. Apache Superset: Databases and Datasets
     - You connect to databases
     - You choose which database table to expose as a Dataset
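Connecting Superset (or Preset) to Athena means adding a database with a SQLAlchemy URI for the PyAthena driver. As a rough sketch, the same URI shape can be exercised from plain Python; the region, schema, staging bucket and table below are placeholders, and with empty credentials PyAthena is assumed to fall back to the usual boto3 credential chain:

```python
from urllib.parse import quote_plus

from sqlalchemy import create_engine, text

# Roughly the URI shape Superset's "Add database" dialog expects for Athena.
# Requires the pyathena package; every name here is a placeholder.
staging_dir = quote_plus("s3://my-athena-query-results/")
engine = create_engine(
    "awsathena+rest://:@athena.eu-west-1.amazonaws.com:443/default"
    f"?s3_staging_dir={staging_dir}"
)

with engine.connect() as conn:
    for row in conn.execute(text("SELECT * FROM daily_extract LIMIT 10")):
        print(row)
```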
  7. Data formats
     - Started simple with CSV: text, best for debugging
     - Optimized size with ORC: binary, columnar, compressed
     - Optimized DX with Parquet: same, but a bit less space efficient; bigger ecosystem
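A quick sketch of writing the same made-up daily extract in the three formats with pandas; ORC and Parquet both go through pyarrow, and DataFrame.to_orc needs a reasonably recent pandas (1.5+):

```python
import pandas as pd

# Toy daily extract; in the real pipeline this comes from the upstream system.
df = pd.DataFrame(
    {
        "day": ["2023-09-07", "2023-09-07"],
        "country": ["PT", "IT"],
        "orders": [12, 34],
    }
)

df.to_csv("daily_extract.csv", index=False)          # text, easiest to debug
df.to_orc("daily_extract.orc", index=False)          # binary, columnar, compressed
df.to_parquet("daily_extract.parquet", index=False)  # similar, bigger ecosystem
```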
  8. Vendor lock-in
     - This solution is based on AWS because it’s already in use
     - Every cloud provider has a different take on data engineering, e.g. on Google Cloud Platform it would be BigQuery-based
     - Can be replicated using open source software:
       - MinIO as cloud storage
       - Apache Hive (or similar) to ingest the data and be queried by Apache Superset
       - Apache Superset, self-hosted
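Since MinIO speaks the S3 API, the same boto3 code from the S3 sketch above can point at a self-hosted instance by overriding the endpoint; the endpoint and credentials below are the common local-development defaults, used here purely as an illustration:

```python
import boto3

# Point the S3 client at a self-hosted MinIO instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # hypothetical MinIO endpoint
    aws_access_key_id="minioadmin",        # dev defaults, never use in production
    aws_secret_access_key="minioadmin",
)

s3.upload_file("daily_extract.parquet", "my-data-lake", "raw/daily_extract.parquet")
```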
  9. Conclusions
     - You can build simple data pipelines cheaply without much operations overhead
     - Interesting for small teams without data-specific roles or knowledge
     - Serverless makes sense for small scale
  10. References
      - The AWS idea of a data architecture: https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
      - An example of connecting Apache Superset to AWS Athena: https://preset.io/blog/2021-5-25-data-lake-athena/
      - Apache Superset documentation: https://superset.apache.org/docs/intro