Efficient Energy Analytics with Airflow, Spark, and MLflow

Hank Ehly

April 06, 2023

  1. Introduction

    Confidential

    • Energy tech is a fun, fulfilling area to work in
      ◦ Exciting technical projects for software developers
      ◦ Fulfilling career choice due to the ubiquitous nature of energy
    • I want to show you:
      ◦ some examples of software problems that we encounter at ENECHANGE
      ◦ the technologies that we use to solve these problems

    Example areas (slide images): IoT, Electric Vehicles (EV), Electricity Market Price Prediction, Energy Usage Data Analysis
    Image credits: Photo by Michael Marais; Photo by Dan LeFebvre; Image: Renewable Energy Institute; Image: Synergy
  2. Introduction

    • Energy consumption data from smart meters
    • 30-minute intervals (48 values per day)

    Smart meter energy consumption data (sample):

    Meter ID   Date/time          Usage (kWh)
    1          2023/04/03 12:00   1.4
    1          2023/04/03 12:30   1.0
    1          2023/04/03 13:00   1.1

    Data volume grows with the number of meters and the number of days (per meter): lots of data!
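
To make the "lots of data" point concrete, here is a back-of-the-envelope sketch. The 30-minute interval comes from the slide; the meter count and history length are made-up illustrative numbers, not ENECHANGE figures.

```python
# Back-of-the-envelope row count for smart meter readings.
MINUTES_PER_DAY = 24 * 60
INTERVAL_MINUTES = 30  # from the slide: one reading every 30 minutes


def readings_per_day(interval_minutes: int = INTERVAL_MINUTES) -> int:
    """Number of readings one meter produces per day."""
    return MINUTES_PER_DAY // interval_minutes


def total_readings(num_meters: int, num_days: int) -> int:
    """Total rows for num_meters meters, each with num_days of history."""
    return num_meters * num_days * readings_per_day()


# Hypothetical fleet: 100,000 meters with one year of history each.
print(readings_per_day())            # 48
print(total_readings(100_000, 365))  # 1752000000
```

Even a modest fleet reaches billions of rows within a year, which is why a columnar format and batch tooling come up in the examples that follow.
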
  3. Example #1 - Bulk data downloads with Apache Airflow

    Diagram labels: energy users, energy company, usage data, ENECHANGE datastore
    Sample notification: "Save electricity between 5-6 PM tomorrow to earn points!"

    Meter ID   Date/time          Prediction (kWh)   Actual (kWh)   Savings (kWh)
    1          2023/04/03 12:00   1.4                1.1            0.3
    1          2023/04/03 12:30   1.0                1.1            0.1
    1          2023/04/03 12:30   1.1                1.1            0.0

    Have:
    • Big blob of data
    • Columnar (Parquet) format

    Want:
    • Subset of data
    • Multiple zipped CSV files (~100 MB each)
    • E-mail pre-signed links
    • Slack notifications
    • Automatic retries (for idempotent & known to fail tasks)
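
The "multiple zipped CSV files" requirement can be sketched with the Python standard library alone. This is a minimal illustration, not ENECHANGE's pipeline: reading the Parquet source is left out, the function names (`csv_chunks`, `zip_chunks`) are hypothetical, and the size limit is assumed to apply to the uncompressed CSV text. In practice `max_bytes` would be around 100 MB; a tiny value is used here so the split is visible.

```python
import csv
import io
import zipfile
from typing import Iterable, Iterator, List, Sequence, Tuple

Row = Sequence[object]


def _to_csv_line(row: Row) -> str:
    """Serialize one row to a single CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


def csv_chunks(header: Row, rows: Iterable[Row], max_bytes: int) -> Iterator[str]:
    """Yield CSV documents (header + rows), starting a new document
    whenever the next row would push the current one past max_bytes."""
    header_line = _to_csv_line(header)
    header_size = len(header_line.encode("utf-8"))
    lines: List[str] = [header_line]
    size = header_size
    for row in rows:
        line = _to_csv_line(row)
        line_size = len(line.encode("utf-8"))
        # Never emit a header-only chunk: only split once a row is present.
        if len(lines) > 1 and size + line_size > max_bytes:
            yield "".join(lines)
            lines, size = [header_line], header_size
        lines.append(line)
        size += line_size
    if len(lines) > 1:
        yield "".join(lines)


def zip_chunks(chunks: Iterable[str], prefix: str = "usage") -> List[Tuple[str, bytes]]:
    """Compress each CSV chunk into its own in-memory zip archive."""
    archives = []
    for i, text in enumerate(chunks, start=1):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(f"{prefix}_{i:04d}.csv", text)
        archives.append((f"{prefix}_{i:04d}.zip", buf.getvalue()))
    return archives


# Sample rows taken from the smart meter table shown earlier in the deck.
header = ("meter_id", "datetime", "usage_kwh")
rows = [
    (1, "2023/04/03 12:00", 1.4),
    (1, "2023/04/03 12:30", 1.0),
    (1, "2023/04/03 13:00", 1.1),
]
archives = zip_chunks(csv_chunks(header, rows, max_bytes=80))
print([name for name, _ in archives])
```

Each archive could then be uploaded to object storage and shared through a pre-signed link; that step depends on the storage provider and is not shown.
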
  4. Example #1 - Bulk data downloads with Apache Airflow

    • Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows
    • You define the workflow steps in Python code and Airflow turns them into an executable program.
    • Run on a schedule or trigger via API request

    (redacted)

        from datetime import datetime

        from airflow import DAG
        from airflow.decorators import task
        from airflow.operators.bash import BashOperator

        # "…" marks arguments elided on the slide
        with DAG(dag_id="demo", schedule="0 0 * * *", …):
            # A task that runs a shell command
            task1 = BashOperator(
                task_id="hello",
                bash_command="echo hello"
            )

            # A task defined as a plain Python function
            @task()
            def task2():
                print("world")

            # task1 must finish before task2 starts
            task1 >> task2()
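
The "automatic retries" requirement from the previous slide maps onto Airflow's per-task `retries` and `retry_delay` arguments. The plain-Python sketch below only illustrates the behavior those settings describe; it is not how Airflow implements retries, and `run_with_retries` and `flaky_download` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], retries: int = 3, retry_delay: float = 0.0) -> T:
    """Run task(); if it raises, retry up to `retries` more times.

    Only safe for idempotent tasks, since the task may run several times.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky_download() -> str:
    """Hypothetical task that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "done"


result = run_with_retries(flaky_download, retries=3)
print(result, calls["count"])  # done 3
```

Restricting retries to idempotent tasks matters because a retried task that is not idempotent could, for example, send the same e-mail twice.
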
  5. Example #2 - Organized Machine Learning with MLflow

    • Predict how much electricity buildings A, B, C will use tomorrow
    • Machine Learning
    • Challenges:
      ◦ Growing number of models
      ◦ Different models require different inputs
      ◦ Tweak each model individually
      ◦ Track performance
      ◦ Record what training data we used
      ◦ Save model files (and any associated data visualizations)

    Diagram label: smart meter
  6. Example #2 - Organized Machine Learning with MLflow

    • MLflow is an open-source Python framework for creating and managing machine learning models.
    • Deploy it as a web application. Interact via Python API (see code example) and Web UI.
    • Keeps track of model executable files, parameters, model versions, performance metrics and data visualization files.

    Training and saving a new model to MLflow:

        import mlflow
        import mlflow.sklearn
        from sklearn.linear_model import ElasticNet

        # get_train_data() and eval_metrics() are helpers not shown on the slide

        # Get training data
        train_x, train_y, test_x, test_y = get_train_data()

        # Hyperparameters, named so they can be logged below
        alpha, l1_ratio = 0.5, 0.5

        with mlflow.start_run():
            # Train a model
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
            model.fit(train_x, train_y)

            # Evaluate performance
            y_pred = model.predict(test_x)
            (rmse, mae, r2) = eval_metrics(test_y, y_pred)

            # Record model parameters
            mlflow.log_param("alpha", alpha)
            mlflow.log_param("l1_ratio", l1_ratio)

            # Record performance metrics
            mlflow.log_metric("rmse", rmse)
            mlflow.log_metric("r2", r2)
            mlflow.log_metric("mae", mae)

            # Save model executable file
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name="MyFirstModel"
            )

    Comparing model performance via the UI (Image: Databricks)
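
The training example calls `eval_metrics`, which the slide does not define. A minimal sketch of what such a helper might compute is shown below, using only the standard library; the actual helper used in the talk may differ (for instance, it may rely on scikit-learn's metric functions). The sketch assumes the actual values are not all identical, since R² is undefined in that case.

```python
import math
from typing import Sequence, Tuple


def eval_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Tuple[float, float, float]:
    """Return (rmse, mae, r2) for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Root mean squared error and mean absolute error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - residual variance / total variance
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return rmse, mae, r2


# Illustrative values only, not real meter data.
rmse, mae, r2 = eval_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(round(rmse, 4), round(mae, 4), r2)
```

Lower RMSE and MAE and a higher R² indicate a better fit, which is what makes these values useful to log per run and compare in the MLflow UI.
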
  7. Example #2 - Organized Machine Learning with MLflow

    Diagram components: Object Storage, RDB, Jupyter Notebook (our team)

    Scheduled prediction job, every day at 10 AM:

        0 10 * * * ./run_prediction_job.py

    1. Download models from MLflow (select * where stage=Production)
    2. Make predictions
    3. Export to storage
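
The schedule on this slide is a five-field cron expression (minute, hour, day of month, month, day of week). A job meant to run once a day at 10 AM needs `0` in the minute field, because `*` there matches every minute of the 10 o'clock hour. The small matcher below makes the difference visible; `cron_matches` is a hypothetical helper written for this illustration, and it supports only `*` and plain integers (no ranges, lists, or steps).

```python
from datetime import datetime


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a five-field cron expression against a datetime.

    Supports only `*` and plain integers in each field.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    # cron weekdays: 0 = Sunday ... 6 = Saturday.
    # Python's isoweekday(): Monday = 1 ... Sunday = 7, so take it modulo 7.
    actual = (when.minute, when.hour, when.day, when.month, when.isoweekday() % 7)
    return all(f == "*" or int(f) == v for f, v in zip(fields, actual))


ten_sharp = datetime(2023, 4, 6, 10, 0)
ten_thirty = datetime(2023, 4, 6, 10, 30)

print(cron_matches("0 10 * * *", ten_sharp))   # True
print(cron_matches("0 10 * * *", ten_thirty))  # False
print(cron_matches("* 10 * * *", ten_thirty))  # True
```
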
  8. Summary

    • Energy tech is a fun, fulfilling area to work in.
    • ENECHANGE uses modern technology to solve problems in the energy industry.
    • Apache Airflow is a platform for creating and managing complex batch workflows.
    • MLflow is a framework for keeping machine learning operations organized.

    Thank you for listening!