Analytics orchestration and management
• Models dependencies between different data tasks
• Heterogeneous environments
• Integration with data lakes, data warehouses, and cloud-based tools
• Handles dynamics in data sources, sizes, and frequencies
Astro CLI¹
• Create a new project directory
• Initialize the project: astro dev init
• Start the project: astro dev start
• Airflow UI: http://localhost:8080

¹ https://github.com/astronomer/astro-cli
Use case: import files from AWS S3 to CrateDB and check if imported values are in a certain range

Sample data:
timestamp, value
1451624400, 0.2
1451624402, 0.4
1451624404, 0.1
...

Target table:
CREATE TABLE my_table (
  "timestamp" TIMESTAMP,
  "value" REAL
);

Quality check: value BETWEEN 0 AND 1
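The range check can be sketched in plain Python (a hypothetical helper for illustration, not part of the pipeline): every entry in the value column must satisfy the SQL condition value BETWEEN 0 AND 1.

```python
import csv
import io

# Hypothetical sample mirroring the CSV layout shown above.
SAMPLE_CSV = """timestamp,value
1451624400,0.2
1451624402,0.4
1451624404,0.1
"""

def values_in_range(csv_text, lo=0.0, hi=1.0):
    """Return True if every 'value' entry lies within [lo, hi],
    i.e. the SQL condition `value BETWEEN 0 AND 1`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return all(lo <= float(row["value"]) <= hi for row in reader)
```

In the actual pipeline this check is expressed declaratively in SQL rather than in Python.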
COPY FROM: copies a CSV or JSON file from a URI to a table
• CrateDB supports two URI schemes: file and s3

COPY my_table FROM 's3://[{access_key}:{secret_key}@]<bucket_name>/<path>'

• my_table: table name
• {access_key}:{secret_key}: AWS credentials (optional)
• <bucket_name>/<path>: S3 bucket + path
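A small helper (hypothetical, for illustration) shows how such a COPY FROM statement can be assembled per file; the credential part of the URI is only included when both keys are supplied.

```python
def copy_from_stmt(table, bucket, path, access_key=None, secret_key=None):
    """Build a CrateDB COPY FROM statement for an object in S3.
    Credentials are embedded in the URI only when both keys are given."""
    creds = f"{access_key}:{secret_key}@" if access_key and secret_key else ""
    return f"COPY {table} FROM 's3://{creds}{bucket}/{path}'"
```

For example, copy_from_stmt("my_table", "my-bucket", "data/file1.csv") produces COPY my_table FROM 's3://my-bucket/data/file1.csv'.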
1. Fetch files in S3 (S3Hook, AWS connection)
2. Create COPY FROM statements for each file (PythonOperator, AWS credentials)
3. Import data to CrateDB (PostgresOperator, CrateDB connection)
4. Check on value column (SQLColumnCheckOperator, CrateDB connection)
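The four steps can be sketched as plain Python task bodies. These are hypothetical stand-ins for illustration: in the real DAG, step 1 would use S3Hook, step 3 would run through a PostgresOperator against the CrateDB connection, and step 4 is handled declaratively by SQLColumnCheckOperator. The bucket and table names below are assumptions.

```python
def fetch_s3_keys(all_keys):
    """Step 1: list the CSV objects in the bucket (real code: S3Hook)."""
    return [k for k in all_keys if k.endswith(".csv")]

def build_copy_statements(keys, bucket="my-bucket", table="my_table"):
    """Step 2: one COPY FROM statement per file (run by a PythonOperator)."""
    return [f"COPY {table} FROM 's3://{bucket}/{k}'" for k in keys]

def run_statements(statements, execute):
    """Step 3: import data to CrateDB; `execute` stands in for the
    operator executing SQL against the CrateDB connection."""
    for stmt in statements:
        execute(stmt)
```

Step 4, the check on the value column, needs no hand-written SQL: SQLColumnCheckOperator takes a column-to-check mapping and fails the task when a check is violated.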
• DAG scheduling: start_date, schedule_interval
• Task dependencies: chain, bitwise operator (>>)
• S3Hook for accessing the S3 bucket
• COPY FROM statement for importing data into CrateDB
• Data quality checks: SQLColumnCheckOperator
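The two dependency styles read as follows. The tiny Task class below is only a mock illustrating the pattern of overloading the bitwise operator, not Airflow's real BaseOperator, and chain is a simplified analogue of Airflow's chain helper.

```python
class Task:
    """Mock task showing how the >> operator can express dependencies."""
    def __init__(self, name):
        self.name = name
        self.downstream = []
    def __rshift__(self, other):
        # t1 >> t2 means: t2 runs after t1
        self.downstream.append(other)
        return other

def chain(*tasks):
    """Simplified analogue of Airflow's chain(): link tasks in sequence."""
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream

fetch, load, check = Task("fetch"), Task("load"), Task("check")
chain(fetch, load, check)   # equivalent to: fetch >> load >> check
```

chain is handy when the task list is built dynamically, e.g. one import task per S3 file.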
• Airflow is a good fit for orchestration tasks
• CrateDB offers easy integration due to PostgreSQL compatibility
• Open source and easy to scale

Resources and tutorials:
• Dynamic orchestration with Airflow and CrateDB
• Dynamic tasks in Airflow
• CrateDB documentation
• Airflow community
• CrateDB community

• https://github.com/crate/crate
• https://github.com/apache/airflow
• https://github.com/crate/crate-airflow-tutorial