Data Mesh Architecture

@arafkarsh arafkarsh
Data Mesh
CLOUD NATIVE ARCHITECTURE SERIES
o Introduction to Data Mesh and Key Principles
o Problems Data Mesh Solves
o Real-World Use Cases: Banking, Retail, Health Care, Manufacturing, Oil Refineries, Food Delivery
o Building a Data Mesh
o Challenges & Benefits

ARAF KARSH HAMID, Co-Founder / CTO, Ozazo AI, Kochi, Kerala, India
Architecting & Building Apps using a tech presentorial (a combination of presentation & tutorial)
AI / ML, Generative AI, LLMs, RAG; 6+ Years Microservices, Blockchain; 8 Years Cloud Computing; 8 Years Network & Security; 8 Years Distributed Computing

Agenda
o Setting up the context
1. Understanding Data Mesh
2. Real World Use Cases
3. Data Mesh Architecture
4. Popular Data Mesh Stacks
5. Challenges & Benefits
6. Technology Stack

0 Setting up the Context
o Moving from Centralized Data Teams
o Comparison Data Lake / Warehouse / Mart

Moving from Centralized Data Teams
o For many enterprises, the journey toward becoming data-driven started with a simple promise: collect all data in one place, clean it, structure it, and make it available for analytics.
o That promise gave us data warehouses, data lakes, data marts, lakehouses, and modern analytics platforms.
o These systems solved many problems, but they also created a new one: as organizations scaled, data ownership often became centralized, disconnected from the teams that actually understood the business meaning of the data.
o This is where Data Mesh becomes important.


Comparison Data Lake / Warehouse / Mart

Storage for Raw Data
o Data Lake: Stores raw, unprocessed data in its native format, including structured data, semi-structured data (like logs or XML), and unstructured data (such as emails and documents).
o Data Warehouse: Stores data that has been processed and structured into a defined schema.
o Data Mart: Like a data warehouse, does not store raw data. It stores processed and refined data specific to a particular business function.

Scalability
o Data Lake: Typically built on scalable cloud platforms or Hadoop, so it can handle massive volumes of data.
o Data Warehouse: Moderately scalable. Traditionally limited by hardware when on-premises, but modern cloud-based solutions offer considerable scalability.
o Data Mart: Least scalable due to its focused and limited scope, typically designed to serve specific departmental needs.

Performance
o Data Lake: Performance can vary. While it is excellent for big data processing and machine learning tasks, it might not perform as well as structured systems for quick, ad-hoc queries.
o Data Warehouse: Optimized for high performance in query processing, especially for complex queries across large datasets. Designed for speed and efficiency in retrieval operations.
o Data Mart: Generally offers high performance for its limited scope and targeted queries, enabling faster response times for the specific business area it serves.

Source: https://www.youtube.com/watch?v=-bSkREem8dM

Comparison Data Lake / Warehouse / Mart (continued)

Flexibility
o Data Lake: Extremely flexible in terms of the types of data it can store and how data can be used. It allows for the exploration and manipulation of data in various formats.
o Data Warehouse: Less flexible, as it requires data to fit into a predefined schema, which might limit the types of data that can be easily integrated and queried.
o Data Mart: Also has limited flexibility, tailored to specific business functions with data structured for particular uses.

Purpose
o Data Lake: Ideal for data discovery, data science, and machine learning, where access to large and diverse data sets is necessary.
o Data Warehouse: Designed for business intelligence, analytics, and reporting, where fast, reliable, and consistent data retrieval is crucial.
o Data Mart: Serves specific departmental needs by providing data that is relevant and quickly accessible to business users within a department.

Data Integrity & Consistency
o Data Lake: Data integrity and consistency can be a challenge due to the variety and volume of raw and unprocessed data.
o Data Warehouse: High integrity and consistency. Data is processed, cleansed, and conformed to ensure reliability and accuracy, which is critical for decision-making processes.
o Data Mart: Similar to data warehouses, data marts ensure high data integrity and consistency within their focused scope, as the data often originates from a data warehouse.

1 Understanding Data Mesh
o Understanding the Great Divide
o What is Data Mesh?
o 4 Principles of Data Mesh
o Problems Data Mesh Solves

Source: Zhamak Dehghani, https://martinfowler.com/articles/data-mesh-principles.html


4 Principles of Data Mesh
1. Domain-Oriented Decentralized Data Ownership and Architecture: Data is managed by domain-specific teams that treat their data as a product. These teams are responsible for their own data pipelines and outputs.
2. Data as a Product: Data is treated as a product with a focus on the consumers' needs. This includes clear documentation, SLAs, and a user-friendly interface for accessing the data.
3. Self-Serve Data Infrastructure as a Platform: This principle aims to empower domain teams by providing them with a self-serve data infrastructure, which helps them handle their data products with minimal central oversight.
4. Federated Computational Governance: Governance is applied across domains through a federated model, ensuring that data quality, security, and access controls are maintained without stifling innovation.
Source: Dehghani, Zhamak. Data Mesh (p. 56). O'Reilly Media.

2 Real World Use Cases
1. Financial Services – Banking
2. Retail / E-Commerce
3. Health Care
4. Manufacturing
5. Oil Refineries
6. Food Delivery Aggregator

Real-World Use Cases
1. Financial Services: In a banking scenario, different domains such as loans, credit cards, and customer service can independently manage their data, enabling faster innovation and personalized customer experiences while maintaining compliance through federated governance.
2. E-commerce: Large e-commerce platforms manage diverse data from inventory, sales, customer feedback, and logistics. Each domain can optimize its data management and analytics, improving service delivery and operational efficiency.
3. Healthcare: Different departments such as clinical data, patient records, and insurance processing can manage their data as discrete products, enhancing data privacy, compliance, and patient outcomes through more tailored data usage.
4. Manufacturing: Domains like production, supply chain, and maintenance in a manufacturing enterprise can leverage Data Mesh to optimize their operations independently, using real-time data streaming via Kafka for immediate responsiveness and decision-making.


3 Data Mesh Architecture
o How?
o Data Mesh Architecture
o Building Data Mesh

Building Data Mesh: 1 of 4
1. Define Requirements and Assess Current Capabilities
   o Assess Current Data Usage and Needs: Analyze current data flows, storage needs, and processing requirements. Identify pain points in your existing infrastructure.
   o Forecast Future Needs: Estimate future data growth based on business goals. Consider not only the volume but also the complexity and diversity of data that will need to be managed.
   o Compliance and Security Needs: Ensure that your infrastructure will comply with applicable data protection regulations (like GDPR, HIPAA, PCI) and security standards.
2. Choose the Right Data Storage Solutions
   o Diverse Data Storage Options: Use a combination of storage solutions (SQL databases, NoSQL databases, data warehouses, and data lakes) to cater to different types of data and access patterns.
   o Elastic Scalability: Opt for cloud-based solutions such as Amazon S3, Google Cloud Storage, or Azure Blob Storage for elastic scalability and durability.

Building Data Mesh: 2 of 4
3. Implement Data Processing Frameworks
   o Batch Processing: Implement batch processing systems for large-scale analytics and reporting. Apache Hadoop and Spark are popular choices for handling massive amounts of data with fault tolerance.
   o Stream Processing: For real-time data processing needs, use tools like Apache Kafka, Apache Flink, and Apache Storm. These tools can handle high-throughput, low-latency processing.
   o Hybrid Processing: Consider hybrid models that combine batch and stream processing for more flexibility.
4. Ensure Data Integration and Orchestration
   o Data Integration Tools: Use robust ETL (Extract, Transform, Load) tools or more modern ELT approaches to integrate data from various sources. Tools like Apache Kafka Connect, Talend, Apache NiFi, or Stitch can automate these processes.
   o Workflow Orchestration: Use workflow orchestration tools like Apache Airflow or Dagster to manage dependencies and scheduling of data processing jobs across multiple platforms.
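The dependency management that orchestration tools provide can be illustrated with a minimal sketch using only Python's standard library. The job names below are hypothetical, and this is not how Apache Airflow or Dagster work internally; it only shows the core idea of running jobs in an order that respects their dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
    "publish_report": {"load_warehouse"},
}


def execution_order(pipeline):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run(pipeline, runner=print):
    """Call `runner` with each job name, in dependency order, and return that order."""
    order = execution_order(pipeline)
    for job in order:
        runner(job)
    return order
```

A real orchestrator adds scheduling, retries, and distributed execution on top of this ordering step. A cyclic dependency makes `static_order()` raise `graphlib.CycleError`, which is a reasonable point at which to reject an invalid pipeline.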

Building Data Mesh: 3 of 4
5. Adopt a Microservices Architecture
   o Decoupled Services: Implement microservices to break down your data infrastructure into smaller, manageable, and independently scalable services.
   o Containerization: Use Docker containers to encapsulate microservices, making them portable and easier to manage.
   o Orchestration Platforms: Utilize Kubernetes or Docker Swarm for managing containerized services, ensuring they scale properly with demand.
6. Use Data Management and Monitoring Tools
   o Data Cataloging: Implement data catalogue tools to manage metadata and ensure data is findable and accessible. Tools like Apache Atlas or Collibra can be useful.
   o Monitoring and Logging: Use monitoring tools to track the performance and health of your data systems. Service mesh telemetry, Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana) are effective for monitoring and visualizing metrics.

Building Data Mesh: 4 of 4
7. Ensure Scalability and Reliability
   o Load Balancing: Use load balancers to distribute workloads evenly across servers, so that no single server becomes a bottleneck or a single point of failure. (Kubernetes/Kafka/Flink)
   o Data Redundancy and Backup: Implement data replication and backup strategies to ensure data durability and recoverability.
   o Scalable Architecture Design: Design your infrastructure to scale out (adding more machines) or scale up (adding more power to existing machines) based on demand. (Kubernetes/Kafka/Flink)
8. Foster a Culture of Continuous Improvement
   o Regular Audits and Updates: Regularly review and upgrade your infrastructure to incorporate new technologies and improvements.
   o Training and Development: Keep your team updated with the latest data technologies and best practices through continuous training and development.

Building Data Mesh – Platform vs Domain Team
Responsibility | Platform Team | Domain Team
Infrastructure | Provides paved roads, templates, CI/CD, storage, catalog, monitoring | Uses platform capabilities
Data Product | Provides standards and tooling | Owns schema, quality, documentation, SLA, access rules
Governance | Automates policy enforcement | Applies policies to domain data
Operations | Provides observability foundation | Operates and improves data products
Consumers | Provides discovery and access mechanisms | Supports data-product consumers

Data Contract Example – Oil Refinery
Field | Example
Product | Asset Health Data Product
Owner | Inspection / Reliability domain
Freshness | Updated every 15 minutes
Schema | equipment_id, corrosion_rate, last_inspection_date, risk_score
Quality rule | equipment_id must not be null
Access | Inspection, Reliability, Maintenance, Safety
SLA | 99% availability during operating hours
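One way to make such a contract machine-checkable is to express it as a small data structure with validation helpers. The sketch below is illustrative only: the class, function, and attribute names are assumptions, the values come from the contract fields listed above, and the availability SLA is left out because it needs operational monitoring rather than a per-record check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataContract:
    product: str
    owner: str
    schema: tuple                 # required field names
    not_null: tuple               # fields that must not be null
    freshness: timedelta          # maximum allowed age of the latest update
    allowed_consumers: frozenset  # domains permitted to read the product


# Values taken from the contract above; the structure itself is hypothetical.
ASSET_HEALTH = DataContract(
    product="Asset Health Data Product",
    owner="Inspection / Reliability domain",
    schema=("equipment_id", "corrosion_rate", "last_inspection_date", "risk_score"),
    not_null=("equipment_id",),
    freshness=timedelta(minutes=15),
    allowed_consumers=frozenset({"Inspection", "Reliability", "Maintenance", "Safety"}),
)


def validate_record(contract: DataContract, record: dict) -> list:
    """Return a list of contract violations for one record (empty if valid)."""
    errors = [f"missing field: {name}" for name in contract.schema if name not in record]
    errors += [
        f"null value: {name}"
        for name in contract.not_null
        if name in record and record[name] is None
    ]
    return errors


def is_fresh(contract: DataContract, last_updated: datetime, now: datetime) -> bool:
    """True if the product was updated within the contract's freshness window."""
    return now - last_updated <= contract.freshness


def may_access(contract: DataContract, consumer_domain: str) -> bool:
    """True if the consuming domain is listed in the contract's access rule."""
    return consumer_domain in contract.allowed_consumers
```

Keeping the contract as data rather than prose means the same definition can drive quality checks in the pipeline and access checks at the serving layer.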

4 Popular Data Mesh Stacks
o Google Cloud BigQuery
o AWS S3 and Athena
o Azure
o dbt and Snowflake
o SAP Data Mesh
o Kafka Data Mesh

Popular Data Mesh Tech Stacks
o Google Cloud BigQuery
o AWS S3 and Athena
o Azure Synapse Analytics
o dbt and Snowflake
o Databricks (How To Build a Data Product with Databricks)
o MinIO and Trino
o SAP
o Kafka and RisingWave
Data mesh is primarily an organizational approach, and that's why you can't buy a data mesh from a vendor.
Source: https://www.datamesh-architecture.com/

Google Data Mesh Stack
o BigQuery is the central component for storing analytical data.
o BigQuery is a columnar data store and can perform efficient JOIN operations on large data sets.
o BigQuery supports both batch ingestion and streaming ingestion.
o When the operational system architecture relies on Apache Kafka, streaming through the Kafka Connect Google BigQuery Sink Connector is recommended.
Source: Google Cloud BigQuery

Amazon Data Mesh Stack
o AWS S3 is the central component for storing analytical data.
o S3 is a file-based object store, and data can be stored in many formats, such as CSV, JSON, Avro, or Parquet.
o S3 buckets are used for all stages: raw files, aggregated data, and even data products. Every domain team typically has its own AWS S3 buckets to store its own data products.
o Analytical queries are executed through AWS Athena, which queries data stored in many locations, including files on S3, with standard SQL and performs cross-dataset join operations.
o Athena uses Presto, a distributed query engine (newer Athena engine versions are based on Trino, a fork of Presto).
Source: AWS S3 and Athena

Azure Data Mesh Stack
Microsoft offers Azure Synapse Analytics, along with both Data Lake Storage Gen2 and SQL database, as the central components for implementing a data mesh architecture.
Source: Azure Synapse Analytics

dbt and Snowflake Data Mesh Stack
o dbt is a framework to transform, clean, and aggregate data within your data warehouse.
o Transformations are written as plain SQL statements and result in models that are SQL views, materialized views, or tables, without the need to define their structure using DDL upfront.
o dbt embraces tests to verify data when running any transformation, both for sources and results.
o Snowflake stores data in tables that are logically organized in databases and schemas.
Source: dbt and Snowflake

SAP Data Mesh Stack
o SAP Datasphere integrates closely with SAP applications, allowing teams to reuse the rich business semantics and data entity models for building data products.
o SAP Datasphere integrates out of the box with SAP S/4HANA tables and supports replication as well as federation.
o The integration is based on the VDM (virtual data model), which forms the basis for data access in SAP S/4HANA.
o SAP HANA Cloud and SAP HANA Cloud Data Lake can be fully leveraged for data stored within SAP Datasphere.
Source: SAP

Kafka Streaming Data Mesh Stack
Open-Source Stack
o Kafka already has its place in many "classical" implementations of data mesh, namely as an ingestion layer for streaming data into data products.
o This tech stack extends the scope of Kafka far beyond merely serving as an ingestion layer. Here, data products are not just ingested from Kafka; they live on Kafka, for bi-directional interactions between the operational systems and data products.
o "Classical" data mesh implementations firmly put their data products on the analytical plane, either in data warehouses (such as Snowflake or BigQuery), data lakes (S3, MinIO), or data lakehouses (Databricks).
Source: Kafka and RisingWave

5 Challenges & Benefits
o Challenges of Implementing Data Mesh
o Benefits of Data Mesh
o Data Mesh Summary

Challenges of Implementing Data Mesh
o Cultural Shift: Adopting Data Mesh requires significant changes in organizational culture and mindset, particularly the shift towards viewing data as a product.
o Technical Heterogeneity: Implementing a self-serve data infrastructure that can accommodate diverse technologies and systems across domains can be challenging.
o Governance Complexity: Balancing autonomy with oversight requires sophisticated governance mechanisms that can be complex to implement and maintain.

Anti-Patterns
# | Anti-Pattern | Why It Fails
1 | Calling every table a data product | A table without ownership, quality, documentation, and SLA is not a product.
2 | Buying a catalog and calling it Data Mesh | A catalog helps discovery but does not solve ownership or governance.
3 | Decentralizing without standards | Leads to inconsistent schemas, duplication, and low trust.
4 | Central team still owns everything | Domain ownership never becomes real.
5 | Governance by committee only | Slows delivery and does not scale.
6 | No platform engineering | Domain teams struggle with infrastructure complexity.
7 | No product owner | No one is accountable for consumer experience.

Data Governance
# | Governance Concern | Computational Implementation
1 | PII / PHI detection | Automated classification during ingestion
2 | Access control | Policy-as-code using IAM, ABAC, RBAC, or OPA
3 | Data quality | Automated tests using dbt, Great Expectations, Soda, or custom rules
4 | Lineage | OpenLineage, Apache Atlas, DataHub, Collibra, or cloud-native lineage
5 | Schema compatibility | Contract validation in CI/CD
6 | Auditability | Immutable logs and access audit trails
7 | Data retention | Automated lifecycle policies
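As an illustration of the "contract validation in CI/CD" item, the sketch below compares a published schema with a proposed one and reports breaking changes. The compatibility rule (removing a field or changing its type is breaking, adding a field is not) is an assumed convention, and the schemas and field types are hypothetical; real schema registries offer several compatibility modes.

```python
def breaking_changes(old_schema, new_schema):
    """Compare two {field: type} mappings and list changes that would break consumers.

    Assumed rule: removing a field or changing its type is breaking;
    adding a new field is not.
    """
    problems = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            problems.append(f"type changed: {field} {old_type} -> {new_schema[field]}")
    return problems


def ci_gate(old_schema, new_schema):
    """Return a process-style exit code: 0 if compatible, 1 if the change is breaking."""
    return 1 if breaking_changes(old_schema, new_schema) else 0


# Hypothetical schema versions; the field types are assumptions for illustration.
V1 = {"equipment_id": "string", "corrosion_rate": "float", "risk_score": "int"}
V2_COMPATIBLE = {**V1, "inspector": "string"}
V2_BREAKING = {"equipment_id": "string", "corrosion_rate": "string"}
```

In a pipeline, the value returned by `ci_gate` could be passed to `sys.exit` so that a breaking schema change fails the build before it reaches consumers.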

Access Controls
o IAM (Identity and Access Management): IAM manages who the user or system is and what permissions they have. In Data Mesh, IAM ensures only approved users, services, or agents can access specific data products.
o RBAC (Role-Based Access Control): RBAC grants access based on a person’s role, such as Data Analyst, Risk Manager, Compliance Officer, or Maintenance Engineer. It is simple and useful when permissions can be grouped by job function.
o ABAC (Attribute-Based Access Control): ABAC grants access based on attributes such as department, location, data classification, purpose, user clearance, or time of access. It is more flexible than RBAC and works well for fine-grained enterprise data governance.
o OPA (Open Policy Agent): OPA is a policy-as-code engine used to enforce access, compliance, and governance rules automatically. In Data Mesh, OPA can evaluate policies like “only compliance users can access PII data” before allowing access to a data product.
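The difference between RBAC and ABAC can be sketched in a few lines of plain Python. This is not OPA (OPA policies are written in its own language, Rego); the role names, product names, attributes, and the policy itself are hypothetical and only mirror the "only compliance users can access PII data" example.

```python
# Hypothetical role-to-product grants for RBAC; names are illustrative only.
ROLE_GRANTS = {
    "Data Analyst": {"sales_summary"},
    "Compliance Officer": {"sales_summary", "customer_pii"},
}


def rbac_allows(role, product):
    """RBAC: the decision depends only on the caller's role."""
    return product in ROLE_GRANTS.get(role, set())


def abac_allows(user, resource):
    """ABAC: the decision depends on attributes of the user and the resource.

    Assumed policy: PII data requires the compliance department; other data
    requires the user's clearance to be at least the resource's classification.
    """
    if resource.get("contains_pii", False):
        return user.get("department") == "compliance"
    return user.get("clearance", 0) >= resource.get("classification", 0)
```

With RBAC, a new rule usually means a new role or grant; with ABAC, the same policy function covers new users and data products as long as their attributes are set correctly.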

Business Perspective
Area | Traditional Central Data Model | Data Mesh Model
Ownership | Central data team | Business/domain teams
Accountability | Often unclear | Product owner per data product
Speed | Pipeline-request driven | Domain-enabled self-service
Quality | Detected late | Built into product contract
Governance | Central approval-heavy | Federated + automated
AI readiness | Data may lack context | Context-rich, governed data products

Benefits of Data Mesh
o Scalability: By decentralizing data ownership and management, Data Mesh can scale more effectively as organizations grow.
o Agility: Domains can quickly adapt and respond to changes and needs within their specific areas, leading to faster innovation.
o Enhanced Collaboration: Data Mesh fosters a collaborative environment by encouraging domains to share their data products across the organization, enhancing cross-functional projects and innovation.
o Improved Data Quality and Accessibility: With domain experts managing their own data, the overall quality and relevance of data improve, making it more accessible and useful to end users.

6 Technology Stack
o Apache Kafka – Connect, Streams
o Apache Flink
o Apache Pinot
o Apache Iceberg

Thank you
DREAM EMPOWER AUTOMATE MOTIVATE
India: +91.999.545.8627
https://arafkarsh.medium.com/
https://speakerdeck.com/arafkarsh
https://www.linkedin.com/in/arafkarsh/
https://www.youtube.com/user/arafkarsh/playlists
http://www.slideshare.net/arafkarsh
http://www.arafkarsh.com/

Slides: https://speakerdeck.com/arafkarsh
Blogs: https://arafkarsh.medium.com/
Web: https://arafkarsh.com/
Source: https://github.com/arafkarsh
