A simple serverless data pipeline

Building and maintaining data pipelines when it’s not your full-time job is a pain, so it pays to keep things simple and avoid managing the system yourself. In this talk I’ll show a data pipeline architecture built on cloud offerings from AWS and Preset.

In this talk we’ll discuss:
- an overview of the architecture
- the data lake: AWS S3, AWS Glue and AWS Athena
- the exploration and visualization platform: Apache Superset
- data formats and Python implementations
- vendor lock-in

Presented at PyCon Portugal 2023 on September 7th, 2023.

Riccardo Magliocchetti

September 08, 2023

Transcript

  1. A simple serverless
    data pipeline
    Riccardo Magliocchetti - @rmistaken / @[email protected]
    PyCon Portugal 2023

  2. I am not a data engineer

  3. Agenda for today
    - An overview of a simple serverless data pipeline architecture
    - The data lake: AWS S3, AWS Glue, AWS Athena
    - Apache Superset, the exploration and visualization platform
    - Data formats
    - Vendor lock-in

  4. The architecture

  5. (image-only slide, no text in the transcript)

  6. The context
    - Small team
    - No data engineer
    - End users are business people
    - No real time requirements, data produced daily
    - Small quantity of data

  7. The rest of the team
    - Nicola Martino
    - Bartolo Albanese

  8. The data lake

  9. (image-only slide, no text in the transcript)

  10. AWS S3
    - Cloud object storage: a persistent cache that looks like a filesystem
    - Cheap unless you move a lot of data outside AWS
    - De facto standard
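
Because S3 "looks like a filesystem" and the data here is produced daily, one common way to lay out objects is a date-partitioned key. The sketch below is only an illustration: the Hive-style `year=/month=/day=` convention, the `daily_object_key` helper and the `lake` / `orders` names are assumptions, not something shown in the talk.

```python
from datetime import date


def daily_object_key(prefix: str, table: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style date partitions (assumed layout)."""
    return (
        f"{prefix.strip('/')}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical names: "lake" prefix, "orders" table.
key = daily_object_key("lake", "orders", date(2023, 9, 7), "orders.csv")
print(key)
```

A layout like this keeps each day's output under its own "directory", which is convenient when the data is later cataloged and queried by date.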

  11. AWS Glue
    - Serverless data integration service
    - Pay per use
    - Can run data transformations, but you can also do it yourself with S3
    event notifications plus Lambda functions
    - Data catalog functionality for data in S3
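
The DIY route mentioned above (event notifications plus lambdas) might look roughly like the following. This is a minimal sketch, not the code used in the talk: the `handler` and `objects_from_s3_event` names, the bucket name and the sample event are made up, and the actual transformation is left as a stub. It only assumes the usual shape of an S3 event notification, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys are URL-encoded in the notification payload.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: collect the new objects; the transformation is a stub."""
    processed = []
    for bucket, key in objects_from_s3_event(event):
        # Hypothetical step: read the object, transform it, write the result elsewhere.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}


# Hand-written sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "raw-data"},
                "object": {"key": "orders/2023-09-07/daily+report.csv"},
            }
        }
    ]
}
print(handler(sample_event, None))
```

Keeping the event parsing in its own function makes it easy to test locally without deploying anything.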

  12. AWS Athena
    - Serverless query service: query data cataloged in Glue using SQL
    - Pay per use: $5 per TB of data scanned
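
To get a feel for what pay-per-use means at small scale, here is a rough cost estimate. The $5 per TB figure is from the slide; the rounding to whole megabytes, the 10 MB per-query minimum and the use of binary terabytes are assumptions to check against the current AWS pricing page, and `query_cost_usd` is a hypothetical helper.

```python
import math

PRICE_PER_TB_USD = 5.0      # figure from the slide
BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabytes
MIN_BILLED_MB = 10          # assumption: per-query minimum


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    # Assumption: scanned bytes are rounded up to whole megabytes.
    billed_mb = max(MIN_BILLED_MB, math.ceil(bytes_scanned / BYTES_PER_MB))
    return billed_mb * BYTES_PER_MB / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical workload: 50 dashboard queries a day, 200 MB scanned each.
daily_cost = 50 * query_cost_usd(200 * BYTES_PER_MB)
print(f"{daily_cost:.4f} USD per day")
```

Under these assumptions, scanning less data per query (for example with a compressed columnar format) directly lowers the bill.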

  13. Apache Superset

  14. Apache Superset
    - Data exploration and visualization platform
    - Written in Python (Flask, SQLAlchemy, Pandas) and JavaScript
    - Plenty of databases supported
    - SQL Editor included for exploration
    - Hosted version from Preset.io

  15. Apache Superset: Databases and Datasets
    - You connect to databases
    - You choose which database table to expose as a Dataset
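
Superset connects to databases through SQLAlchemy URIs. For Athena, the URI might be assembled as sketched below. The `awsathena+rest://` form is the PyAthena-style URI as I understand it from the Superset documentation; the `athena_sqlalchemy_uri` helper, the credentials, region, schema and bucket are all placeholders, so verify the exact format against the docs for your Superset version.

```python
from urllib.parse import quote_plus


def athena_sqlalchemy_uri(
    access_key: str,
    secret_key: str,
    region: str,
    schema: str,
    s3_staging_dir: str,
) -> str:
    """Build a PyAthena-style SQLAlchemy URI (assumed format, see lead-in)."""
    # Credentials and the staging directory must be URL-encoded.
    return (
        f"awsathena+rest://{quote_plus(access_key)}:{quote_plus(secret_key)}"
        f"@athena.{region}.amazonaws.com/{schema}"
        f"?s3_staging_dir={quote_plus(s3_staging_dir)}"
    )


# Placeholder values only; never hard-code real credentials.
uri = athena_sqlalchemy_uri(
    "AKIAEXAMPLE",
    "abc/def+ghi",
    "eu-west-1",
    "default",
    "s3://my-bucket/athena-results/",
)
print(uri)
```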

  16. Apache Superset: Chart

  17. Apache Superset: Dashboard

  18. Apache Superset: SQL Editor

  19. Data formats

  20. Data formats
    - Started simple with CSV: plain text, easiest to debug
    - Optimized size with ORC: binary, columnar, compressed
    - Optimized developer experience (DX) with Parquet: also binary, columnar
    and compressed, slightly less space efficient, but with a bigger ecosystem
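
To illustrate what "columnar" means in the formats above, the sketch below regroups a small CSV by column using only the standard library. It does not produce ORC or Parquet files; it only shows the layout idea that lets a query engine read a single column without touching the others. The sample data and the `rows_to_columns` helper are made up for illustration.

```python
import csv
import io

# Made-up sample data in row-oriented CSV form.
CSV_TEXT = """day,country,amount
2023-09-05,PT,10
2023-09-06,IT,25
2023-09-07,PT,7
"""


def rows_to_columns(csv_text: str) -> dict[str, list[str]]:
    """Regroup row-oriented CSV records into one list of values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: dict[str, list[str]] = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = rows_to_columns(CSV_TEXT)
# An aggregate over "amount" only needs that one column.
total = sum(int(value) for value in columns["amount"])
print(total)
```

Storing similar values next to each other is also what makes columnar files compress well.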

  21. Vendor lock-in

  22. Vendor lock-in
    - This solution is based on AWS because we were already using it
    - Every cloud provider has a different take on data engineering, e.g. in Google
    Cloud Platform it would be BigQuery based
    - Can be replicated using open source software:
      - MinIO as cloud storage
      - Apache Hive (or similar) to ingest the data and be queried by Apache Superset
      - Apache Superset, self-hosted

  23. Conclusions

  24. Conclusions
    - You can build simple data pipelines cheaply and without much operational
    overhead
    - Interesting for small teams without data-specific roles or knowledge
    - Serverless makes sense at small scale

  25. References
    - The AWS idea of a data architecture:
      https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
    - An example of connecting Apache Superset to AWS Athena:
      https://preset.io/blog/2021-5-25-data-lake-athena/
    - Apache Superset documentation: https://superset.apache.org/docs/intro

  26. Thanks!
    Riccardo Magliocchetti
    @rmistaken / @[email protected]
