A simple serverless data pipeline
Riccardo Magliocchetti - @rmistaken / @[email protected]
Pycon(IT) 23
Slide 2
I am not a data engineer
Slide 3
Agenda for today
- An overview of a simple serverless data pipeline architecture
- The data lake: AWS S3, AWS Glue, AWS Athena
- Apache Superset, the exploration and visualization platform
- Data formats
- Vendor lock-in
Slide 4
The architecture
Slide 6
The context
- Small team
- No data engineer
- End users are business people
- No real-time requirements: data is produced daily
- Small quantity of data
Slide 7
The rest of the team
- Nicola Martino
- Bartolo Albanese
Slide 8
The data lake
Slide 10
AWS S3
- Cloud object storage: a persistent cache that looks like a filesystem
- Cheap unless you move a lot of data outside AWS
- De facto standard
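A minimal sketch of how daily files could be laid out in a bucket. It assumes Hive-style `key=value` prefixes, a common convention that Glue and Athena can treat as partitions; the talk does not say which layout was actually used. The dataset name `orders` and the file name `data.csv` are hypothetical.

```python
from datetime import date


def daily_object_key(dataset: str, day: date, filename: str = "data.csv") -> str:
    """Build an S3 object key for one day's file.

    S3 has no real directories: the slashes are simply part of the key,
    which is why a bucket only looks like a filesystem.
    """
    return (
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical dataset name, one file per day.
key = daily_object_key("orders", date(2023, 5, 25))
```

With daily data, a layout like this lets a query restricted to one day read only that day's prefix.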
Slide 11
AWS Glue
- Serverless data integration service
- Pay per use
- Can do data transformations, but you can also do it yourself with S3 event notifications plus Lambda functions
- Data catalog functionality for data in S3
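The DIY route (event notifications plus lambdas) could look roughly like the sketch below: a Lambda handler that receives an S3 event notification and passes each new object to a transformation step. The `Records` / `s3` / `bucket` / `object` layout follows the S3 event notification format; the `transform` function is a hypothetical placeholder, not the code used in the talk.

```python
from urllib.parse import unquote_plus


def transform(bucket: str, key: str) -> str:
    """Hypothetical placeholder for the real transformation step."""
    return f"s3://{bucket}/{key}"


def handler(event, context):
    """Entry point that AWS Lambda calls with an S3 event notification."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        processed.append(transform(bucket, key))
    return {"processed": processed}
```

Since the handler is a plain function, it can be exercised locally with a hand-written event dictionary before deploying it.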
Slide 12
AWS Athena
- Serverless query service: query the data cataloged in Glue with SQL
- Pay per use: $5 per TB of data scanned
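A back-of-the-envelope sketch of what pay per use means for a small dataset. It uses the $5 per TB figure from the slide and assumes decimal terabytes and no minimum charge per query; the query volume in the example is made up, and actual billing rules may differ.

```python
PRICE_PER_TB_USD = 5.0  # figure quoted in the slide
BYTES_PER_TB = 10**12   # assumption: decimal terabytes


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD of a single query, from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


def monthly_cost(bytes_per_query: int, queries_per_day: int, days: int = 30) -> float:
    """Estimated monthly cost in USD for a steady daily query load."""
    return estimate_query_cost(bytes_per_query) * queries_per_day * days


# Hypothetical workload: 100 queries a day, each scanning 1 GB.
estimate = monthly_cost(10**9, 100)
```

Because the bill depends on bytes scanned, anything that reduces the data read per query (compression, columnar formats, partitioning) lowers the cost directly.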
Slide 13
Apache Superset
Slide 14
Apache Superset
- Data exploration and visualization platform
- Written in Python (Flask, SQLAlchemy, Pandas) and JavaScript
- Plenty of databases supported
- SQL Editor included for exploration
- Hosted version from Preset.io
Slide 15
Apache Superset: Databases and Datasets
- You connect to databases
- You choose which database table to expose as a Dataset
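Superset connects to databases through SQLAlchemy URIs. As a sketch, this is how a URI for Athena could be assembled, assuming the PyAthena driver and its `awsathena+rest` dialect as described in the Superset documentation. The region, schema and staging bucket are hypothetical placeholders.

```python
from urllib.parse import quote_plus


def athena_sqlalchemy_uri(region: str, schema: str, s3_staging_dir: str) -> str:
    """Build a SQLAlchemy URI for Athena (PyAthena REST dialect)."""
    # Credentials are left empty here on the assumption that the driver
    # then picks them up from the environment or an IAM role.
    return (
        f"awsathena+rest://:@athena.{region}.amazonaws.com/{schema}"
        f"?s3_staging_dir={quote_plus(s3_staging_dir)}"
    )


# Hypothetical values: Athena writes query results to the staging location.
uri = athena_sqlalchemy_uri("eu-west-1", "default", "s3://my-athena-results/staging/")
```

The resulting string is what goes in the SQLAlchemy URI field when adding a database in Superset; the tables of that schema can then be exposed as Datasets.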
Slide 16
Apache Superset: Chart
Slide 17
Apache Superset: Dashboard
Slide 18
Apache Superset: SQL Editor
Slide 19
Data formats
Slide 20
Data formats
- Started simple with CSV: plain text, easiest to debug
- Optimized size with ORC: binary, columnar, compressed
- Optimized developer experience with Parquet: also binary, columnar and compressed, a bit less space efficient but with a bigger ecosystem
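To make "columnar" concrete, here is a toy illustration using only the standard library. It is not ORC or Parquet, just the same idea in miniature: the rows of a CSV are regrouped so that each column is stored contiguously, and an aggregate then touches a single column. The sample data is made up.

```python
import csv
import io

# Made-up sample data in row-oriented form.
CSV_TEXT = """day,country,orders
2023-05-25,IT,12
2023-05-25,FR,7
2023-05-26,IT,15
"""


def rows_to_columns(csv_text: str) -> dict:
    """Regroup row-oriented CSV text into one list of values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = rows_to_columns(CSV_TEXT)
# An aggregate over one column does not need the other columns at all.
total_orders = sum(int(value) for value in columns["orders"])
```

Similar values sitting next to each other (all the countries, all the dates) is also what lets columnar formats compress well.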
Slide 21
Vendor lock-in
Slide 22
Vendor lock-in
- This solution is based on AWS because we were already using it
- Every cloud provider has a different take on data engineering, e.g. on Google Cloud Platform it would be based on BigQuery
- Can be replicated using open source software:
  - MinIO as cloud storage
  - Apache Hive (or similar) to ingest data and be queried by Apache Superset
  - Apache Superset, self-hosted
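One reason the storage layer is easy to replace: MinIO exposes an S3-compatible API, so code written for S3 mostly needs a different endpoint. A sketch, assuming boto3 as the client library (its `client` function accepts an `endpoint_url` argument) and a hypothetical local MinIO endpoint:

```python
from typing import Optional


def s3_client_kwargs(endpoint_url: Optional[str] = None) -> dict:
    """Keyword arguments intended for boto3.client("s3", **kwargs).

    Without an endpoint boto3 talks to AWS S3; with one it talks to
    whatever S3-compatible service answers at that URL.
    """
    kwargs = {}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


aws_kwargs = s3_client_kwargs()
# Hypothetical self-hosted MinIO endpoint.
minio_kwargs = s3_client_kwargs("http://localhost:9000")
# With boto3 installed: client = boto3.client("s3", **minio_kwargs)
```

The rest of the pipeline (catalog and query engine) has no such drop-in equivalent, which is where most of the migration effort would go.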
Slide 23
Conclusions
Slide 24
Conclusions
- You can build simple data pipelines cheaply and without much operational overhead
- Interesting for small teams without data-specific roles or knowledge
- Serverless makes sense for small scale
Slide 25
References
- The AWS idea of a data architecture: https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
- An example of connecting Apache Superset to AWS Athena: https://preset.io/blog/2021-5-25-data-lake-athena/
- Apache Superset documentation: https://superset.apache.org/docs/intro