
A simple serverless data pipeline

Building and maintaining data pipelines when it’s not your full-time job is a pain! So it’s better to keep things simple and avoid having to manage the system yourself. In this talk I’ll show a data pipeline architecture built on managed cloud offerings from AWS and Preset.

In this talk we’ll discuss:
- an overview of the architecture
- the data lake: AWS S3, AWS Athena
- the exploration and visualization platform: Apache Superset
- data formats and Python implementations
- vendor lock-in

Riccardo Magliocchetti

May 26, 2023

Transcript

  1. Agenda for today
     - An overview of a simple serverless data pipeline architecture
     - The data lake: AWS S3, AWS Glue, AWS Athena
     - Apache Superset, the exploration and visualization platform
     - Data formats
     - Vendor lock-in
  2. The context
     - Small team
     - No data engineer
     - End users are business people
     - No real-time requirements, data produced daily
     - Small quantity of data
  3. AWS S3
     - Cloud object storage: a persistent cache that looks like a filesystem
     - Cheap, unless you move a lot of data outside AWS
     - De facto standard
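
As a concrete illustration (not from the slides), this is roughly what the daily drop into the lake can look like with boto3; the bucket name and key layout below are hypothetical:

```python
# Sketch: upload a daily extract to S3 with boto3 (bucket and key names are hypothetical).
from datetime import date

import boto3

s3 = boto3.client("s3")

today = date.today().isoformat()
local_file = f"/tmp/events-{today}.csv"

# A dt=YYYY-MM-DD key layout keeps the data easy to partition later in Glue/Athena.
s3.upload_file(
    Filename=local_file,
    Bucket="my-data-lake",
    Key=f"raw/events/dt={today}/events.csv",
)
```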
  4. AWS Glue
     - Serverless data integration service
     - Pay per use
     - Can do data transformations, but you can DIY: event notifications plus Lambdas
     - Data catalog functionality for data in S3
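
The DIY route mentioned above (S3 event notifications plus Lambdas) could look roughly like the hypothetical sketch below; names and key layout are made up, and it assumes pandas and pyarrow are packaged with the function (e.g. via a Lambda layer):

```python
# Sketch: a Lambda triggered by S3 event notifications that rewrites a new CSV as Parquet.
# Bucket layout and key names are hypothetical.
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw CSV from S3...
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        df = pd.read_csv(body)

        # ...and write it back as Parquet under a "curated" prefix.
        tmp_path = "/tmp/out.parquet"
        df.to_parquet(tmp_path, index=False)
        out_key = key.replace("raw/", "curated/").replace(".csv", ".parquet")
        s3.upload_file(Filename=tmp_path, Bucket=bucket, Key=out_key)
```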
  5. Apache Superset
     - Data exploration and visualization platform
     - Written in Python (Flask, SQLAlchemy, Pandas) and JavaScript
     - Plenty of databases supported
     - SQL Editor included for exploration
     - Hosted version from Preset.io
  6. Apache Superset: Databases and Datasets
     - You connect to databases
     - You choose which database table to expose as a Dataset
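
Superset reaches a database through a SQLAlchemy URI; for Athena that goes through the PyAthena driver (the Preset blog post in the references walks through this). A minimal sketch, with region, schema and staging bucket as placeholders, assuming `pyathena[sqlalchemy]` is installed:

```python
# Sketch: the kind of SQLAlchemy URI Superset needs for Athena via PyAthena.
# Region, schema and S3 staging bucket are placeholders, not values from the talk.
from urllib.parse import quote_plus

from sqlalchemy import create_engine, text

staging_dir = quote_plus("s3://my-athena-query-results/")
uri = (
    "awsathena+rest://@athena.eu-west-1.amazonaws.com:443/my_database"
    f"?s3_staging_dir={staging_dir}"
)

# The same URI is what you paste into Superset's database connection form;
# here we just verify it from plain Python using the default AWS credentials.
engine = create_engine(uri)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```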
  7. Data formats
     - Started simple with CSV: text, easiest to debug
     - Optimized size with ORC: binary, columnar, compressed
     - Optimized DX with Parquet: same, but a bit less space efficient and with a bigger ecosystem
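
The abstract mentions Python implementations; here is a minimal sketch of writing the same small table in all three formats with pandas and pyarrow (file names and sample data are made up):

```python
# Sketch: the same table written as CSV, Parquet and ORC (file names are made up).
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

df = pd.DataFrame(
    {"day": ["2023-05-26", "2023-05-26"], "page": ["/home", "/pricing"], "visits": [120, 34]}
)

# CSV: plain text, trivial to inspect while debugging.
df.to_csv("events.csv", index=False)

# Parquet: columnar and compressed; pandas delegates to pyarrow here.
df.to_parquet("events.parquet", index=False)

# ORC: also columnar and compressed, written via pyarrow directly.
orc.write_table(pa.Table.from_pandas(df, preserve_index=False), "events.orc")
```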
  8. Vendor lock-in
     - This solution is based on AWS because it’s already in use
     - Every cloud provider has a different take on data engineering, e.g. on Google Cloud Platform it would be BigQuery based
     - Can be replicated using open source software:
       - MinIO as object storage
       - Apache Hive (or similar) to ingest the data and be queried by Apache Superset
       - Apache Superset, self-hosted
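
Because the S3 API is a de facto standard, the storage half of the open source variant is mostly a matter of pointing the same client code at a different endpoint. A sketch against a local MinIO, with placeholder endpoint and credentials:

```python
# Sketch: the earlier boto3 upload, pointed at a self-hosted MinIO instead of AWS.
# Endpoint and credentials are placeholders for a local MinIO instance.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.upload_file(
    Filename="/tmp/events.csv",
    Bucket="my-data-lake",
    Key="raw/events/dt=2023-05-26/events.csv",
)
```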
  9. Conclusions
     - You can build simple data pipelines cheaply and without much operations overhead
     - Interesting for small teams without data-specific roles or knowledge
     - Serverless makes sense at small scale
  10. References
     - The AWS idea of a modern data architecture: https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
     - An example of connecting Apache Superset to AWS Athena: https://preset.io/blog/2021-5-25-data-lake-athena/
     - Apache Superset documentation: https://superset.apache.org/docs/intro