Slide 1

Slide 1 text

Full Steam Ahead: Engineering a Modern Data Platform at Rhaetian Railway
5 June 2025, Lukas Heusser & Simon Schwab
Swisscom Data & AI Consulting

Slide 2

Slide 2 text

Agenda
1. Introduction
2. Requirements and Challenges
3. New Platform
4. Benefits
5. Questions

Slide 3

Slide 3 text

Introduction
Who we are and why we’re here

Slide 4

Slide 4 text

Drive Transformation
Swisscom Data & AI Consulting
Frame, Explore, Realize, Scale
Our mission is to help our customers fully exploit the potential of their data. To this end, we design and implement data-based, analytical systems that sustainably improve their core business.

Slide 5

Slide 5 text

About us
Lukas Heusser, Senior Data & AI Consultant
[email protected], +41 79 549 78 72
«Implementing data and analytics solutions, with a focus on Databricks, Snowflake, and Azure»
• Swisscom AG, B2B Data & AI Consulting Unit
• BSc Business Information Technology
• Certified in Databricks, Snowflake & Azure

Slide 6

Slide 6 text

About us
Simon Schwab, Senior Data & AI Consultant
[email protected], +41 79 840 35 69
«Swiss data and cloud professional with a passion for designing and implementing modern data platforms»
• Swisscom AG, B2B Data & AI Consulting Unit
• MSc Business Information Technology
• Certified in Azure, AWS, Databricks & Project Management

Slide 7

Slide 7 text

Rhaetian Railway (customer)
Modernizing RhB’s data landscape
Coming from an Azure-based data platform, RhB faced architectural and operational challenges that limited scalability and transparency. After evaluating the existing setup, Swisscom Data & AI Consulting proposed a greenfield approach with Azure Databricks to modernize the data landscape and address core design issues.
• Operating the existing Azure data platform
• Improving user interaction with the data and the data architecture to enable use case development
• Developing and running a data platform for analytics and reporting

Slide 8

Slide 8 text

Rhaetian Railway at a glance
• Running the Bernina and Glacier Express
• 102 train stations and stops
• 385 km of track length with > 1000 vehicles
• > 15 million passengers and 4 million commuters

Slide 9

Slide 9 text

Project Setup (diagram): since 2024, involvement across Data Architecture, Data Sources, BI, IT Infrastructure, Data Platform, Data Engineering, Data Science and Operations, partly in an advisory role.

Slide 10

Slide 10 text

Requirements and Challenges
Or why we decided to build the platform from the ground up

Slide 11

Slide 11 text

Existing Data Platform Architecture

Slide 12

Slide 12 text

Challenges with the existing solution
• No separation between development and production environments
• Long and tedious implementation time for new use cases, primarily caused by a lack of structure
• Overall data architecture was not in focus
• Long data loading times, sometimes taking more than 5 hours to complete
• Transformations already applied during data ingestion
• Many assumptions (e.g. about data types), which led to frequent, unnecessary errors
• Capacity problems with the SQL Server, which were worked around by temporarily upscaling

Slide 13

Slide 13 text

Sources of the challenges
• Requirements and communication were unclear
• Testing and test definitions were neglected (under pressure)
• Implementation requires domain knowledge
• (Too) fast reverse engineering of data sources due to missing or insufficient documentation
• Time pressure that demanded compromises

Slide 14

Slide 14 text

New Platform
Our approach to building a unified data platform on Azure Databricks

Slide 15

Slide 15 text

Simplified Architecture

Slide 16

Slide 16 text

Infrastructure as Code
RhB IT: providing a basic landing zone with network connectivity and managing the Entra ID
1. Setup of the resources needed for Terraform
2. Deployment of the platform itself with all services and permissions
3. CI/CD pipelines to automate the validation and deployment of the IaC
4. Development of use cases: building the data lakehouse and realizing use cases on the data
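
As an illustration of step 3, here is a minimal Azure DevOps pipeline sketch for validating and applying the Terraform code. The trigger, the folder name and the branch condition are assumptions, and Azure authentication as well as the Terraform backend configuration are deliberately omitted; this is not the project's actual pipeline.

```yaml
# Sketch only: validate, plan and apply the Terraform code from an assumed
# "infrastructure" folder. Authentication (e.g. service principal environment
# variables) and backend configuration are left out.
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest   # Microsoft-hosted agents ship with Terraform preinstalled

steps:
  - script: |
      terraform init
      terraform validate
      terraform plan -out=tfplan
    displayName: Validate and plan
    workingDirectory: infrastructure

  - script: terraform apply -auto-approve tfplan
    displayName: Apply
    workingDirectory: infrastructure
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
```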

Slide 17

Slide 17 text

Ingestion/Landing as a layer
• Azure Data Factory as the central ingestion tool
• Storing data in the landing layer
• Parquet files are stored in a Storage Account
• Orchestration through Databricks Workflows
• After ingestion, data is processed with Databricks into the bronze layer
• Data is moved to an archive after 30 days
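
A minimal sketch of the landing-to-bronze step as it could look in a Databricks notebook; the storage path, the metadata column and the target table name are assumptions, not the project's actual code.

```python
# Sketch: move newly landed Parquet files into the bronze layer as a Delta table.
# Runs in a Databricks notebook, where `spark` is the ambient SparkSession.
from pyspark.sql import functions as F

# Hypothetical landing path written by Azure Data Factory
landing_path = "abfss://landing@rhbdatalake.dfs.core.windows.net/sap/person/"

df = (
    spark.read.parquet(landing_path)                    # Parquet files from the landing layer
    .withColumn("_ingested_at", F.current_timestamp())  # technical load timestamp
)

# Append into the bronze layer; catalog/schema/table names are assumptions
df.write.mode("append").saveAsTable("bronze.sap.person")
```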

Slide 18

Slide 18 text

Our development setup
• Visual Studio Code with the Databricks extension for local development of Databricks assets (e.g. pipelines, jobs, custom Python packages etc.)
• Azure DevOps for version control of delivery objects and CI/CD between the dev, testing and production environments
• Databricks, including Asset Bundles, for the code-based definition of Databricks assets and resources and to execute dev and testing resources

Slide 19

Slide 19 text

Databricks Asset Bundles
What it is
• Assets (jobs, compute, notebooks)
• Definition in .yml
• Tool is built on Terraform (similar workflow)
Why we use it
• Helps with automation (CI/CD pipelines)
• Consistent deployment
• Better collaboration

Slide 20

Slide 20 text

Databricks Asset Bundles: Example (screenshots of the Azure DevOps pipelines .yml, the Databricks job definition, and the dev and prod deployments of the Databricks job)
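
Since the original slide consisted of screenshots, a minimal databricks.yml sketch stands in here; the bundle name, workspace hosts, job and notebook names are assumptions, not the actual RhB configuration. Such a bundle is deployed with databricks bundle deploy -t dev or -t prod.

```yaml
# Minimal databricks.yml sketch; all names and hosts below are assumptions.
bundle:
  name: rhb-data-platform

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net

resources:
  jobs:
    silver_person:
      name: silver-person-load
      tasks:
        - task_key: load_person
          notebook_task:
            notebook_path: ./notebooks/main_notebook.py
          # cluster configuration omitted for brevity
```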

Slide 21

Slide 21 text

Deployment Pipelines
• feature branch: Asset Bundle (databricks.yml, notebooks, pipelines, other resources)
• release branch: Asset Bundle (databricks.yml, notebooks, pipelines, other resources)
• main branch: Asset Bundle (databricks.yml, notebooks, pipelines, other resources)
A deployment pipeline runs on each promotion between the branches (feature → release → main).
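
A sketch of what such a deployment pipeline could execute, assuming the release branch deploys to the dev target and main to prod; the branch-to-target mapping is an assumption, the Databricks CLI is expected to be available on the agent, and authentication is omitted.

```yaml
# Sketch of the bundle deployment steps; branch names, target mapping and the
# surrounding pipeline are assumptions, not the project's actual setup.
steps:
  - script: databricks bundle validate
    displayName: Validate asset bundle

  - script: databricks bundle deploy -t dev
    displayName: Deploy bundle to dev
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/release'))

  - script: databricks bundle deploy -t prod
    displayName: Deploy bundle to prod
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
```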

Slide 22

Slide 22 text

End-to-End Data Orchestration with Databricks Workflows, across the Landing, Bronze, Silver and Gold layers:
Trigger Azure Data Factory → load SAP and other source systems → load operational data → clean & transform data → load use-case-specific models → data quality checks
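
To illustrate the clean & transform and data quality check steps, a minimal PySpark sketch; the table names, the deduplication key and the concrete check are assumptions, not the project's actual logic.

```python
# Sketch of a silver-layer clean & transform step with a simple data quality check.
from pyspark.sql import Window, functions as F

bronze = spark.table("bronze.sap.person")  # hypothetical bronze table

# Keep only the latest record per business key (deduplication)
latest = Window.partitionBy("PERNR").orderBy(F.col("_ingested_at").desc())
silver = (
    bronze
    .withColumn("_rn", F.row_number().over(latest))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

# Simple data quality check: the business key must never be null
null_keys = silver.filter(F.col("PERNR").isNull()).count()
if null_keys > 0:
    raise ValueError(f"Data quality check failed: {null_keys} rows without PERNR")

silver.write.mode("overwrite").saveAsTable("silver.person")
```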

Slide 23

Slide 23 text

Metadata-driven Framework
Entity configuration (entities.json):

```json
{
  "entity": "Person",
  "target_table": "person",
  "source_system": "sap",
  "source_tables": ["ZBI_I_PA0002"],
  "scd2_key_columns": ["PersonNummerId"],
  "scd2_table": "silver.person"
}
```

Source-to-target mapping (CSV):

```csv
column_name_old,column_name_new,data_type
PERNR,PersonNummerId,INTEGER
SUBTY,Subtyp,STRING
OBJPS,ObjektIdentifikation,STRING
```

Parametrized pipeline, main_notebook.py (snippet of the parameter initialization):

```python
...
catalog = dbutils.widgets.get("catalog_silver")
schema = dbutils.widgets.get("schema_silver")
domain = dbutils.widgets.get("domain_silver")
entity = dbutils.widgets.get("entity_silver")

# load entities config per domain from the config json
path_entities_config = f"file:{path}/{domain}/entities.json"
entities_config = spark.read.option("multiline", "true").json(path_entities_config)

# check if the current entity is defined in the config
entity_config = (
    entities_config.filter(entities_config.entity == entity)
    .collect()
)
```

Custom transformations are defined per entity.
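
To show how such a source-to-target mapping could be applied, here is a hypothetical helper; the function, the paths and the table names are illustrative and not part of the actual framework.

```python
# Hypothetical helper: apply the source-to-target mapping CSV to a bronze
# DataFrame by casting data types and renaming columns row by row.
from pyspark.sql import DataFrame, functions as F

def apply_mapping(df: DataFrame, mapping_csv_path: str) -> DataFrame:
    mapping = spark.read.option("header", "true").csv(mapping_csv_path).collect()
    for row in mapping:
        if row.column_name_old in df.columns:
            df = (
                df.withColumn(
                    row.column_name_old,
                    F.col(row.column_name_old).cast(row.data_type),  # e.g. INTEGER, STRING
                )
                .withColumnRenamed(row.column_name_old, row.column_name_new)
            )
    return df

# Hypothetical usage for the Person entity
person_bronze = spark.table("bronze.sap.zbi_i_pa0002")
person_silver = apply_mapping(
    person_bronze, "file:/Workspace/config/sap/person_mapping.csv"
)
```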

Slide 24

Slide 24 text

Benefits and Lessons Learned
What we are proud of and what we could have done differently

Slide 25

Slide 25 text

Benefits for Rhaetian Railway
• Faster use case development (analytics & reporting): from months to days
• Enabling the RhB data team to realize use cases on their own (analytics & data science)
• Understanding what actually happens with their data in the platform
Lessons learned
• Overlap between Terraform and Databricks Asset Bundles
• 2 instead of 3 environments does not really reduce overhead
• Efficiency through automation and parameterization: applied wisely, not blindly
• Evaluation of Azure Data Factory as the ingestion tool

Slide 26

Slide 26 text

Questions

Slide 27

Slide 27 text

Thank you for listening

Slide 28

Slide 28 text

No content