Dynamic workflow orchestration with Apache Airflow and CrateDB

This talk will illustrate how easy it is to automate orchestration workflows with Apache Airflow and CrateDB.

cratedb

June 23, 2023

Transcript

  1. Outline (Crate.io)
    01 Dynamic data orchestration
    02 Intro to CrateDB and Airflow
    03 Connect Airflow and CrateDB
    04 Database workflow: use case
    05 Dynamic task mapping
    06 Demo
    07 Summary and next steps
  2. What is dynamic data orchestration?
    Diagram: Data collector, Storage, Analytics, Orchestration and management
    • Models dependencies between different data tasks
    • Heterogeneous environments
    • Integrations with data lakes, data warehouses and cloud-based tools
    • Handles dynamics in data sources, sizes and frequencies
  3. About CrateDB
    • A distributed, horizontally scaling database
    • Open Source under Apache License 2.0
    • PostgreSQL compatibility
    • Perfect choice for:
      ◦ Non-transactional data
      ◦ Mixed structured/unstructured data
      ◦ Fast analytical queries
      ◦ Highly scalable deployments
  4. Getting started with CrateDB
    • Run on Docker:
      docker run --publish=4200:4200 --publish=5432:5432 crate
    • Access the Admin UI via: http://localhost:4200
    • CrateDB Cloud Free Trial: https://crate.io/lp-free-trial
  5. What is Apache Airflow
    • Open-source workflow management platform
    • Workflow is modelled as a Directed Acyclic Graph (DAG)
    • DAG is defined programmatically (i.e., in Python)
    • Task: basic unit of execution
  6. Getting started with Airflow
    • Easy start with Astro CLI [1]
    • Create a new project directory
    • Initialize the project: astro dev init
    • Start the project: astro dev start
    • Airflow UI: http://localhost:8080
    [1] https://github.com/astronomer/astro-cli
  7. Use CrateDB connection
    • PostgresOperator: task involving interaction with a PostgreSQL database
    • task_id: name of your task
    • postgres_conn_id: name of the CrateDB connection
    • sql: SQL statement
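
The slide does not say how the connection named by postgres_conn_id is defined. One option Airflow supports is an environment variable called AIRFLOW_CONN_<CONN_ID> that holds a connection URI. The sketch below only assembles that name and value in plain Python. The connection id cratedb_connection and the helper airflow_conn_env are hypothetical, and the defaults assume the local Docker setup from the earlier slide (user crate without a password, PostgreSQL port 5432, schema doc).

```python
from urllib.parse import quote


def airflow_conn_env(conn_id: str, user: str, password: str = "",
                     host: str = "localhost", port: int = 5432,
                     schema: str = "doc") -> tuple[str, str]:
    """Return the (variable name, URI value) pair for an Airflow connection.

    Hypothetical helper for illustration; it is not part of Airflow.
    """
    # Airflow looks up connections in variables named AIRFLOW_CONN_<CONN_ID>.
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    # Percent-encode credentials so characters such as '@' stay unambiguous.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return name, f"postgresql://{auth}@{host}:{port}/{schema}"


# Assumed values: a local CrateDB container and a made-up connection id.
env_name, env_value = airflow_conn_env("cratedb_connection", user="crate")
print(f"{env_name}={env_value}")
```

With such a variable set in the Airflow environment, a task would pass postgres_conn_id="cratedb_connection" to PostgresOperator together with task_id and sql.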
  8. Import data files from S3
    • Idea: import files from AWS S3 to CrateDB and check if imported values are in a certain range
    Sample file:
      timestamp, value
      1451624400, 0.2
      1451624402, 0.4
      1451624404, 0.1
      ...
    Target table:
      CREATE TABLE my_table ( "timestamp" TIMESTAMP, "value" REAL )
    Check:
      value BETWEEN 0 AND 1
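
To make the range condition concrete, here is a minimal plain-Python sketch that applies the same rule (value between 0 and 1, inclusive) to the sample rows from the slide. It runs outside the database and is only an illustration of the rule; in the talk the check is performed against CrateDB. The function name out_of_range is hypothetical.

```python
import csv
import io

# Sample rows copied from the slide; a real file would contain many more.
SAMPLE = """timestamp, value
1451624400, 0.2
1451624402, 0.4
1451624404, 0.1
"""


def out_of_range(text: str, low: float = 0.0, high: float = 1.0) -> list:
    """Return the rows whose value is outside [low, high]."""
    # skipinitialspace drops the blank after each comma, including in the header.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [row for row in reader if not low <= float(row["value"]) <= high]


bad_rows = out_of_range(SAMPLE)
print(f"{len(bad_rows)} row(s) outside the expected range")
```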
  9. Dynamic task mapping
    Diagram: S3 → Fetch files → ? ? …
    • Unknown number of files
    • Task for each file
    • Dynamic task mapping since Airflow 2.3
    • Supported with the expand() method
  10. Dynamic task mapping: example
    • Call expand() on a task and pass it a list or a dictionary
    • It is possible to set constant arguments using the method partial()
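
The fan-out idea behind partial() and expand() can be mimicked in plain Python. The sketch below is an analogy only and does not use the Airflow API: functools.partial stands in for Airflow's partial(), and a list comprehension stands in for expand(). The function import_file and the file names are hypothetical.

```python
from functools import partial


def import_file(table: str, key: str) -> str:
    """Hypothetical task body: describe the import of one file."""
    return f"import {key} into {table}"


# Analogue of partial(): fix the argument that is the same for every file.
import_into_my_table = partial(import_file, table="my_table")

# Analogue of expand(): one call per element of a list whose length is only
# known at run time (hard-coded here for illustration).
keys = ["data/a.csv", "data/b.csv", "data/c.csv"]
results = [import_into_my_table(key=k) for k in keys]

for line in results:
    print(line)
```

In Airflow itself the scheduler creates one mapped task instance per list element, so each file gets its own task state and retries, which a plain loop inside a single task would not provide.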
  11. COPY FROM
    • Imports the content of a CSV or JSON file from a URI to a table
    • CrateDB supports two URI schemes: file and s3
    Statement:
      COPY my_table FROM 's3://[{access_key}:{secret_key}@]<bucket_name>/<path>'
    Annotations: my_table is the table name, {access_key}:{secret_key} are the AWS credentials, <bucket_name>/<path> is the S3 bucket + path
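
A minimal sketch of filling in this template, one statement per file. The helper build_copy_statement, the bucket my-bucket and the file names are hypothetical. Percent-encoding the credentials is an assumption made here so that characters such as / or + in a secret key do not break the URI. The table name and path are interpolated as-is, so they must come from trusted input.

```python
from urllib.parse import quote


def build_copy_statement(table, bucket, path, access_key=None, secret_key=None):
    """Build a COPY FROM statement for one S3 object (hypothetical helper)."""
    credentials = ""
    if access_key and secret_key:
        # Assumption: special characters in credentials are percent-encoded.
        credentials = f"{quote(access_key, safe='')}:{quote(secret_key, safe='')}@"
    return f"COPY {table} FROM 's3://{credentials}{bucket}/{path}'"


# One statement per file; the list of keys would come from the S3 listing.
keys = ["data/a.csv", "data/b.csv"]
statements = [build_copy_statement("my_table", "my-bucket", key) for key in keys]

for statement in statements:
    print(statement)
```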
  12. Use case: Airflow tasks
    Task                                          Requirement         Operator
    1. Fetch files in S3                          AWS connection      S3Hook
    2. Create COPY FROM statements for each file  AWS credentials     PythonOperator
    3. Import data to CrateDB                     CrateDB connection  PostgresOperator
    4. Check on value column                      CrateDB connection  SQLColumnCheckOperator
  13. Demo recap
    • DAG specification: name, description, start date, schedule_interval
    • Task dependencies: chain, bitwise operator
    • S3Hook for accessing S3 bucket
    • COPY FROM statement for importing data to CrateDB
    • Data quality checks: SQLColumnCheckOperator
  14. Wrap up
    • Apache Airflow and CrateDB for database orchestration tasks
    • CrateDB offers easy integration due to PostgreSQL compatibility
    • Open source and easy to scale
    Resources and tutorials:
    • Dynamic orchestration with Airflow and CrateDB
    • Dynamic tasks in Airflow
    • CrateDB documentation
    • Airflow community
    • CrateDB community
    Repositories:
    • https://github.com/crate/crate
    • https://github.com/apache/airflow
    • https://github.com/crate/crate-airflow-tutorial