
Cost-effective data engineering with Python

Video: https://video.linux.it/w/gBD5pzc6Y26KRtofGzVLZ2?start=4m58&stop=22m9

How to cut cloud resource costs by 95% by rewriting an Azure Data Factory pipeline in Python.

Dario Ruben Scanferlato — an industrial engineer turned programmer, he works as a freelancer on projects related to data and the supply chain.

Python Torino

December 18, 2024

Transcript

  1. Dario Ruben Scanferlato — Master of Science in Engineering and Management at PoliTo. Freelance IT consultant specializing in data & supply chain analytics. Data and beer enthusiast. dario-scanferlato
  2. Agenda: present a data-engineering case study related to the processing of drug-shipment information; discuss the costs of running data pipelines in the cloud; provide an overview of the Python ecosystem for data engineering.
  3. Problem at hand: some US hospitals are required to report to the government information about purchased drugs, and a report is submitted to a government agency that verifies compliance. We are tasked to extract drug-shipment information from XML files, combine it with other sources, and generate the report. The pipeline receives supply-chain data events in XML format, saves the original files in an archive, loads them into a NoSQL document database, and feeds a data warehouse.
  4. Pipeline steps: receive supply-chain data events in XML format; save the original files in an archive; parse each document and convert it to JSON; save it in the NoSQL document database; extract drug-shipment data with a NoSQL query; combine it with other data sources; load the result into the data warehouse; submit the report to the government agency that verifies compliance.
  5. Positive aspects of Azure Data Factory: a no-code tool that lets you quickly set up and orchestrate data pipelines to copy or sync data between an arbitrary source and destination. It can connect to hundreds of different data sources. It is scalable, with consumption-based billing (you pay only for what you use).
  6. Negative aspects of Azure Data Factory: implementing custom logic, transformations, and validations is frustrating or impossible. There is a large, unavoidable overhead for simple activities. It has its own syntax for defining parameters and pipeline templates. Errors are sometimes unclear. Costs are high and hard to predict.
  7. How can we make data pipelines cheaper? Simplify the architecture. Process data in batches. Use open-source, license-free tools. Run your pipeline on infrastructure that is not cloud-provider specific (e.g. Docker) to prevent vendor lock-in. Remember development and maintenance costs: pick a tool that your team is comfortable with.
  8. The case for Python: the 2nd most popular programming language overall, according to the 2024 GitHub developer survey. It has a wide ecosystem of tools for dealing with any file format and for interacting with any cloud resource. It can handle arbitrarily complex logic to process our data, can be executed locally, and can be unit tested. It is open source, and therefore inherently less expensive than a managed platform. A Python application is relatively easy to package and deploy to the cloud using Docker or a serverless architecture.
  9. Read XML files from an SFTP server: Data Factory activities (i.e. our pipeline "building blocks") can be replicated with a few lines of code.
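The slide does not show the code itself, but the "read XML files from SFTP" activity can be sketched in a few lines of Python. This is a minimal sketch assuming the third-party paramiko library and hypothetical host, credentials, and directory names:

```python
# Minimal sketch of the "read XML files from SFTP" activity in plain Python.
# host/username/password/remote_dir are hypothetical placeholders.
import posixpath


def is_xml(filename: str) -> bool:
    """Keep only the .xml documents published on the server."""
    return filename.lower().endswith(".xml")


def fetch_xml_files(host, username, password, remote_dir="/events"):
    """Download every XML file in remote_dir, returning {name: bytes}."""
    import paramiko  # third-party: pip install paramiko

    transport = paramiko.Transport((host, 22))
    transport.connect(username=username, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        files = {}
        for name in sftp.listdir(remote_dir):
            if is_xml(name):
                with sftp.open(posixpath.join(remote_dir, name)) as f:
                    files[name] = f.read()
        return files
    finally:
        sftp.close()
        transport.close()
```

In a pipeline run, `fetch_xml_files` would replace the Data Factory "Copy data" activity for this source.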
  10. Another example: read XML files from the SFTP server, convert each document to JSON, and save it in the NoSQL database.
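The conversion step can be done with the standard library alone. A minimal sketch (the `shipment`/`drug`/`quantity` element names are hypothetical; the real shipment XML will differ):

```python
# Sketch of the "convert XML document to JSON" step, standard library only.
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an XML element into a plain dict."""
    node = dict(elem.attrib)
    children = list(elem)
    if children:
        for child in children:
            node.setdefault(child.tag, []).append(element_to_dict(child))
    elif elem.text and elem.text.strip():
        node["text"] = elem.text.strip()
    return node


xml_doc = """<shipment id="S-1">
  <drug ndc="0002-1433"><quantity>12</quantity></drug>
</shipment>"""

doc = element_to_dict(ET.fromstring(xml_doc))
print(json.dumps(doc))
# The resulting dict can then be saved in the NoSQL database,
# e.g. with pymongo: collection.insert_one(doc)
```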
  11. The full pipeline in Python: read XML files from the SFTP server; save the original files in an archive; convert each document to JSON; save it in the NoSQL database; extract drug-shipment data; combine it with other data sources; load the result into the data warehouse; submit the report to the government agency that verifies compliance.
  12. Cheap options for running Python code in the cloud: Docker containers and serverless functions. Docker containers are fully configurable, sandboxed processes that can run isolated from other processes on the host machine; every major cloud provider has a service for running Docker containers. Serverless functions (such as Azure Functions or AWS Lambda) allow you to run code on demand without having to personally provision dedicated infrastructure; functions can be triggered upon receiving a file, or they can be run on a schedule.
  13. In case we need a tool to manage our pipelines, Airflow is an open-source orchestration tool. Pipelines can be defined as Python scripts, and they can be managed and monitored through a user interface.
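An Airflow pipeline definition is essentially declarative configuration written in Python. A sketch using the TaskFlow API (the DAG name and task bodies are hypothetical; running it requires apache-airflow installed):

```python
# Hypothetical Airflow DAG for the shipment pipeline (TaskFlow API, Airflow 2.x).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def drug_shipment_pipeline():
    @task
    def fetch_xml_files():
        # e.g. download the day's XML documents from the SFTP server
        return ["shipment_001.xml"]

    @task
    def convert_to_json(filenames):
        # e.g. parse each file and insert the result into the NoSQL database
        print(f"converting {len(filenames)} documents")

    # Task dependencies follow from the data flow between calls.
    convert_to_json(fetch_xml_files())


drug_shipment_pipeline()
```

Airflow's web UI then shows each run of `drug_shipment_pipeline`, with per-task logs and retry controls.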
  14. Total cost: less than 10 USD/day. By using Python, we were also able to: implement additional custom data-validation and cleaning steps; log and collect custom information, such as metadata, statistics on imported records, and performance metrics; automatically run unit tests on our code, using the pytest module.
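As an illustration of the unit-testing point, a pipeline step can be exercised with pytest's plain-assert style. `parse_quantity` is a hypothetical validation helper, not the speaker's actual code:

```python
# Hypothetical validation helper for a field coming from the XML feed,
# plus pytest-style unit tests (pytest discovers and runs test_* functions).


def parse_quantity(raw: str) -> int:
    """Validate and clean a quantity field: strip whitespace, reject negatives."""
    value = int(raw.strip())
    if value < 0:
        raise ValueError(f"negative quantity: {value}")
    return value


def test_parse_quantity_strips_whitespace():
    assert parse_quantity(" 12 ") == 12


def test_parse_quantity_rejects_negatives():
    import pytest  # third-party: pip install pytest

    with pytest.raises(ValueError):
        parse_quantity("-3")
```

Running `pytest` in the project directory executes both tests on every change, something Data Factory pipelines cannot easily offer.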
  15. Recap: given its wide array of data-focused tools, Python can be used to joyfully tackle a variety of challenges in the data-engineering space. Switching to open-source, license-free tools will likely decrease cloud infrastructure costs. Lightweight, provider-agnostic platforms such as Docker allow you to avoid vendor lock-in and easily switch to the cheapest cloud provider on the market; prices for hosting such infrastructure are lower because cloud providers are forced to compete with each other.