Slide 1

Slide 1 text

Developing a data pipeline on cloud
For analytics / ML projects
Fon Liwprasert, ML Engineer at Sertis

Slide 2

Slide 2 text

Kamolphan Liwprasert (Fon)
Machine Learning Engineer, Sertis
I’m an ML engineering enthusiast with a data engineering background. I love to develop solutions on the cloud. 6x GCP certified.

Slide 3

Slide 3 text

Every data analytics or ML project needs a good data pipeline as a foundation.

Slide 4

Slide 4 text

But the data is not always ready to use at the beginning.

Slide 5

Slide 5 text

We also need to think about automation and scalability as the project goes on.

Slide 6

Slide 6 text

Developing a data pipeline on cloud
Specifically on Google Cloud Platform

Slide 7

Slide 7 text

Agenda
● Why do we need a data pipeline?
● What are the options on cloud? Storage / Compute / Pipelines
● Introducing Apache Airflow
● Reference architecture + demo code

Slide 8

Slide 8 text

Why do we need a data pipeline? 🛠

Slide 9

Slide 9 text

Purposes of a data pipeline
● Ingest data from multiple sources
● Transform or clean the data to ensure data quality
● Automate the process

Slide 10

Slide 10 text

What are options on cloud? ☁

Slide 11

Slide 11 text

Choosing Storage Options

Slide 12

Slide 12 text

Choosing Storage Options: zoom-in
For analytics workloads: BigQuery, Cloud Bigtable, Cloud Storage

Slide 13

Slide 13 text

Choosing Compute Options
● Google Compute Engine (GCE): IaaS, Infrastructure as a Service (virtual machines)
● Google Kubernetes Engine (GKE): CaaS, Container as a Service (managed Kubernetes cluster)
● Cloud Run: serverless containers
● Google App Engine (GAE): PaaS, Platform as a Service (serverless application platform)
● Google Cloud Functions (GCF): FaaS, Function as a Service (serverless functions)

Slide 14

Slide 14 text

Choosing Data Processing Options
Processing data:
1. Cloud Functions or Cloud Run (serverless options)
2. Cloud Dataproc (Spark or Hadoop data processing) or Cloud Dataflow (unified pipeline with Apache Beam)
3. Cloud Composer
Workflow and scheduler: Cloud Scheduler, Cloud Workflows (optional)

Slide 15

Slide 15 text

Option 1: Low-cost & serverless option
Processing data (light workload): Cloud Functions or Cloud Run
Workflow and scheduler: Cloud Scheduler via REST API (or Pub/Sub for Cloud Functions)
✓ Serverless: easy & fast
✓ Low-cost solution
✓ Suitable for light workloads
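A minimal sketch of what a light-workload handler in this option might look like as an HTTP-triggered Cloud Function; the function name and the toy transformation are illustrative assumptions, with Cloud Scheduler calling the function's REST endpoint on a cron schedule:

```python
def process_data(request):
    """Entry point for an HTTP-triggered Cloud Function (hypothetical example).

    In a real pipeline this would fetch data from a source system; here a
    small in-memory list stands in for the fetched rows.
    """
    rows = [{"value": 1}, {"value": 2}]        # stand-in for ingested data
    total = sum(row["value"] for row in rows)  # a light transformation step
    return {"row_count": len(rows), "total": total}
```

Because the function is stateless and billed per invocation, this pattern stays cheap as long as each run finishes within the function's timeout.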

Slide 16

Slide 16 text

Option 2: Big data solution
Processing data (big data workload): Cloud Dataproc (Spark or Hadoop data processing) or Cloud Dataflow (unified data pipeline with Apache Beam)
Workflow and scheduler: Cloud Scheduler via REST API
✓ Big data frameworks: Spark, Apache Beam, Flink
✓ Scalability and reliability
✓ Open-source solutions

Slide 17

Slide 17 text

Option 3: Cloud Composer (Airflow)
Cloud Composer: managed service (Kubernetes Engine + Cloud SQL)
✓ Easier to maintain
✓ Scalability and reliability
✓ Suitable for a large number of jobs that require workers

Slide 18

Slide 18 text

Introducing Apache Airflow

Slide 19

Slide 19 text

Why Apache Airflow?
● Popular open-source project for ETL and data pipeline orchestration.
● All code is written in Python. Easy to learn and use.
● Can also be run locally for development environments.

Slide 20

Slide 20 text

Airflow is rich in 3rd-party integrations
https://github.com/apache/airflow/tree/main/airflow/providers
https://registry.astronomer.io/providers/

Slide 21

Slide 21 text

Apache Airflow UI: DAGs (job list), Calendar View, Tree View

Slide 22

Slide 22 text

Apache Airflow basic components
● Sensor: waits on an event, e.g. poking for a file
● Operator: runs an action, e.g. PythonOperator
● Hook: interface to external services or systems

Slide 23

Slide 23 text

Reference Architecture and Demo Code 📐📄

Slide 24

Slide 24 text

Reference Architecture: Batch Ingestion
SFTP server → Cloud Storage → BigQuery → Analytics workload / BI dashboards
Orchestrated by Cloud Composer using SFTPSensor, SFTPToGCSOperator, and GCSToBigQueryOperator.

Slide 25

Slide 25 text

DAG overview
1. SFTPSensor: check if a file is available
2. SFTPToGCSOperator: upload that file to GCS
3. GCSToBigQueryOperator: load the file from GCS to BigQuery

Slide 26

Slide 26 text

Demo Time! bit.ly/airflow_gcp_demo

Slide 27

Slide 27 text

Simple data pipeline using Airflow (1)
Import and initialize the DAG
← DAG name
← Schedule
← Import necessary components
bit.ly/airflow_gcp_demo

Slide 28

Slide 28 text

Simple data pipeline using Airflow (2)
SFTPSensor: waits for a file to be available
← SFTP connection
← Path to file
Task: wait_for_file / check-for-file (SFTPSensor)
bit.ly/airflow_gcp_demo

Slide 29

Slide 29 text

Simple data pipeline using Airflow (3)
SFTPToGCSOperator: uploads file(s) from SFTP to GCS
← GCP connection
← Path to file & GCS
Task: upload_file_to_gcs / upload-file-from-sftp (SFTPToGCSOperator)
bit.ly/airflow_gcp_demo

Slide 30

Slide 30 text

Simple data pipeline using Airflow (4)
GCSToBigQueryOperator: loads data into BigQuery from GCS source file(s)
← source GCS
← JSON schema (dictionary)
← format: CSV, Avro, Parquet
Task: load_to_bigquery / load-to-bigquery (GCSToBigQueryOperator)
bit.ly/airflow_gcp_demo

Slide 31

Slide 31 text

Simple data pipeline using Airflow (5)
Put everything together!
wait_for_file / check-for-file (SFTPSensor) → upload_file_to_gcs / upload-file-from-sftp (SFTPToGCSOperator) → load_to_bigquery / load-to-bigquery (GCSToBigQueryOperator)
← Create the DAG!
bit.ly/airflow_gcp_demo

Slide 32

Slide 32 text

DAG graph view (from the previous code)

Slide 33

Slide 33 text

Key Takeaways 🔑
Choosing the right solution for a data pipeline depends on your requirements and workload.

Slide 34

Slide 34 text

Thank you 😃 Let’s connect!
Fon Liwprasert, ML Engineer at Sertis
linkedin.com/in/fonylew
fonylew