
Cost-effective data engineering with Python

Video: https://video.linux.it/w/gBD5pzc6Y26KRtofGzVLZ2?start=4m58&stop=22m9

How to cut cloud resource costs by 95% by rewriting an Azure Data Factory pipeline in Python.

Dario Ruben Scanferlato — an industrial engineer turned programmer, he works as a freelancer on projects related to data and the supply chain.

Python Torino

December 18, 2024

Transcript

  1. Dario Ruben Scanferlato — Master of Science in Engineering and Management at PoliTo. Freelance IT consultant specializing in data & supply chain analytics. Data and beer enthusiast. dario-scanferlato
  2. Agenda: present a data-engineering case study related to the processing of drug-shipment information; discuss the costs of running data pipelines in the cloud; provide an overview of the Python ecosystem for data engineering.
  3. Problem at hand: some US hospitals are required to report to the government information about purchased drugs, and a report is submitted to a government agency that verifies compliance. We are tasked to extract drug-shipment information from XML files, combine it with other sources, and generate the report. The pipeline receives supply-chain data events in XML format, saves the original files in an archive, loads them into a NoSQL document database, and feeds a data warehouse.
  4. Pipeline steps: receive supply-chain data events in XML format; save the original files in an archive; parse each document and convert it to JSON; save it in the NoSQL document database; extract drug-shipment data with a NoSQL query; combine it with other data sources; load the result into the data warehouse; submit the report to the government agency that verifies compliance.
  5. Positive aspects of Azure Data Factory: a no-code tool that lets you quickly set up and orchestrate data pipelines to copy or sync data between an arbitrary source and destination. It can connect to hundreds of different data sources. It is scalable, with consumption-based billing (you pay only for what you use).
  6. Negative aspects of Azure Data Factory: implementing custom logic, transformations, and validations is frustrating or impossible. There is a large, unavoidable overhead for simple activities. It has its own syntax for defining parameters and pipeline templates. Errors are sometimes unclear. Costs are high and hard to predict.
  7. How can we make data pipelines cheaper? Simplify the architecture. Process data in batches. Use open-source, license-free tools. Run your pipeline on infrastructure that is not cloud-provider specific (e.g. Docker) to prevent vendor lock-in. Remember development and maintenance costs: pick a tool that your team is comfortable with.
  8. The case for Python: the 2nd most popular programming language overall, according to the 2024 GitHub developer survey. It has a wide ecosystem of tools for dealing with any file format and for interacting with any cloud resource. It can handle arbitrarily complex logic to process our data, can be executed locally, and can be unit tested. It is open source, and therefore inherently less expensive than a managed platform. A Python application is relatively easy to package and deploy to the cloud using Docker or a serverless architecture.
  9. Read XML files from an SFTP server: Data Factory activities (i.e. our pipeline "building blocks") can be replicated with a few lines of code.
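The slide does not show the code itself, but the "read XML files from SFTP" activity can be sketched in a few lines of Python. This is a minimal sketch assuming the third-party paramiko library and hypothetical host, credentials, and directory names:

```python
# Minimal sketch of the "read XML files from SFTP" activity in plain Python.
# host/username/password/remote_dir are hypothetical placeholders.
import posixpath


def is_xml(filename: str) -> bool:
    """Keep only the .xml documents published on the server."""
    return filename.lower().endswith(".xml")


def fetch_xml_files(host, username, password, remote_dir="/events"):
    """Download every XML file in remote_dir, returning {name: bytes}."""
    import paramiko  # third-party: pip install paramiko

    transport = paramiko.Transport((host, 22))
    transport.connect(username=username, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        files = {}
        for name in sftp.listdir(remote_dir):
            if is_xml(name):
                with sftp.open(posixpath.join(remote_dir, name)) as f:
                    files[name] = f.read()
        return files
    finally:
        sftp.close()
        transport.close()
```

In a pipeline run, `fetch_xml_files` would replace the Data Factory "Copy data" activity for this source.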
  10. Another example: read XML files from the SFTP server, convert each document to JSON, and save it in the NoSQL database.
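The conversion step can be done with the standard library alone. A minimal sketch (the `shipment`/`drug`/`quantity` element names are hypothetical; the real shipment XML will differ):

```python
# Sketch of the "convert XML document to JSON" step, standard library only.
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an XML element into a plain dict."""
    node = dict(elem.attrib)
    children = list(elem)
    if children:
        for child in children:
            node.setdefault(child.tag, []).append(element_to_dict(child))
    elif elem.text and elem.text.strip():
        node["text"] = elem.text.strip()
    return node


xml_doc = """<shipment id="S-1">
  <drug ndc="0002-1433"><quantity>12</quantity></drug>
</shipment>"""

doc = element_to_dict(ET.fromstring(xml_doc))
print(json.dumps(doc))
# The resulting dict can then be saved in the NoSQL database,
# e.g. with pymongo: collection.insert_one(doc)
```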
  11. The full pipeline in Python: read XML files from the SFTP server; save the original files in an archive; convert each document to JSON; save it in the NoSQL database; extract drug-shipment data; combine it with other data sources; load the result into the data warehouse; submit the report to the government agency that verifies compliance.
  12. Cheap options for running Python code in the cloud: Docker containers and serverless functions. Docker containers are fully configurable, sandboxed processes that can run isolated from other processes on the host machine; every major cloud provider has a service for running Docker containers. Serverless functions (such as Azure Functions or AWS Lambda) allow you to run code on demand without having to personally provision dedicated infrastructure; functions can be triggered upon receiving a file, or they can be run on a schedule.
  13. In case we need a tool to manage our pipelines, Airflow is an open-source orchestration tool. Pipelines can be defined as Python scripts, and they can be managed and monitored through a user interface.
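An Airflow pipeline definition is essentially declarative configuration written in Python. A sketch using the TaskFlow API (the DAG name and task bodies are hypothetical; running it requires apache-airflow installed):

```python
# Hypothetical Airflow DAG for the shipment pipeline (TaskFlow API, Airflow 2.x).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def drug_shipment_pipeline():
    @task
    def fetch_xml_files():
        # e.g. download the day's XML documents from the SFTP server
        return ["shipment_001.xml"]

    @task
    def convert_to_json(filenames):
        # e.g. parse each file and insert the result into the NoSQL database
        print(f"converting {len(filenames)} documents")

    # Task dependencies follow from the data flow between calls.
    convert_to_json(fetch_xml_files())


drug_shipment_pipeline()
```

Airflow's web UI then shows each run of `drug_shipment_pipeline`, with per-task logs and retry controls.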
  14. Total cost: less than 10 USD/day. By using Python, we were also able to: implement additional custom data-validation and cleaning steps; log and collect custom information, such as metadata, statistics on imported records, and performance metrics; automatically run unit tests on our code, using the pytest module.
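As an illustration of the unit-testing point, a pipeline step can be exercised with pytest's plain-assert style. `parse_quantity` is a hypothetical validation helper, not the speaker's actual code:

```python
# Hypothetical validation helper for a field coming from the XML feed,
# plus pytest-style unit tests (pytest discovers and runs test_* functions).


def parse_quantity(raw: str) -> int:
    """Validate and clean a quantity field: strip whitespace, reject negatives."""
    value = int(raw.strip())
    if value < 0:
        raise ValueError(f"negative quantity: {value}")
    return value


def test_parse_quantity_strips_whitespace():
    assert parse_quantity(" 12 ") == 12


def test_parse_quantity_rejects_negatives():
    import pytest  # third-party: pip install pytest

    with pytest.raises(ValueError):
        parse_quantity("-3")
```

Running `pytest` in the project directory executes both tests on every change, something Data Factory pipelines cannot easily offer.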
  15. Recap: given its wide array of data-focused tools, Python can be used to joyfully tackle a variety of challenges in the data-engineering space. Switching to open-source, license-free tools will likely decrease cloud infrastructure costs. Lightweight, provider-agnostic platforms such as Docker allow you to avoid vendor lock-in and easily switch to the cheapest cloud provider on the market; prices for hosting such infrastructure are lower because cloud providers are forced to compete with each other.