Slide 1

Slide 1 text

NAVIGATING THE DATA LANDSCAPE: From Fundamentals to the Future

Slide 2

Slide 2 text

CONTENTS
- WHAT IS DATA and its importance
- DATA MODELS and query languages
- DATA STORAGE and retrieval of data
- DATA REPLICATION and data sharding

Slide 3

Slide 3 text

CONTENTS
- DATA PLATFORM and its use cases
- DATA MESH and distributed systems
- DATA AS A PRODUCT and future of data
- DATA PIPELINE and data sharding

Slide 4

Slide 4 text

HI EVERYONE

Slide 5

Slide 5 text

ABOUT ME
My name is Mann and I am not a Data Scientist
- Steve Mann
- Delhi JUG Leader
- Senior Software Engineer at Fynd

Slide 6

Slide 6 text

WHAT THIS TALK IS NOT
- around building data applications
- a course on big data
- around understanding forms of data
- an overview of data tools

Slide 7

Slide 7 text

WHAT IS DATA

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

WHAT IS DATA
Data is a magical unicorn made up of ones and zeroes, roaming through cyberspace, collecting cat videos and conspiracy theories along the way.

Slide 10

Slide 10 text

WHY IS DATA IMPORTANT
- Save Time
- Make Informed Decisions
- Personalisation and Customer Understanding
- Risk Management
- Foster Innovation
- Transparency and Accountability

Slide 11

Slide 11 text

INTRODUCING Data. Fitness. You

Slide 12

Slide 12 text

BUILDING DATA-INTENSIVE APPLICATIONS
- Reliable
- Scalable
- Maintainable

Slide 13

Slide 13 text

DATA MODELS
- RELATIONAL
- DOCUMENT
- GRAPH
- TIME SERIES

Slide 14

Slide 14 text

RELATIONAL MODEL
User Table: user_id, name, email, phone
Workout Table: workout_id, exercise_id, date, duration, user_id
Exercise Table: exercise_id, name, difficulty_level
Fitness Plan Table: plan_id, name, exercises, price
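
A minimal sketch of the relational model, assuming SQLite via Python's standard `sqlite3` module. The column names come from the slide; the column types, the plural table names, the sample user "Asha", and the minutes unit for `duration` are assumptions made for illustration. The Fitness Plan table is left out because its `exercises` column would need a join table in a normalised schema.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT,
    phone   TEXT
);
CREATE TABLE exercises (
    exercise_id      INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    difficulty_level TEXT
);
CREATE TABLE workouts (
    workout_id  INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    exercise_id INTEGER NOT NULL REFERENCES exercises(exercise_id),
    date        TEXT,    -- ISO-8601 date string (assumed)
    duration    INTEGER  -- minutes (assumed unit)
);
""")

# Sample rows; the user name and email are made up.
conn.execute("INSERT INTO users VALUES (101, 'Asha', 'asha@example.com', NULL)")
conn.execute("INSERT INTO exercises VALUES (201, 'Push-ups', 'Intermediate')")
conn.execute("INSERT INTO workouts VALUES (1, 101, 201, '2023-07-01', 45)")

# Relationships are resolved at query time with joins on the shared ids.
rows = conn.execute("""
    SELECT u.name, e.name, w.date, w.duration
    FROM workouts AS w
    JOIN users AS u ON u.user_id = w.user_id
    JOIN exercises AS e ON e.exercise_id = w.exercise_id
""").fetchall()
```

Each fact is stored once, and the workout row only carries the ids that point at the user and the exercise.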

Slide 15

Slide 15 text

DOCUMENT MODEL
- User Document
- Workout Document
- Fitness Plan Document
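
The bodies of the three documents did not survive on this slide, so the shape below is an assumption: a user document that embeds its workouts, reusing field names from the relational slide. It is a sketch of the idea (related data nested in one self-contained record), not the speaker's actual documents.

```python
import json

# Hypothetical user document: workouts are embedded instead of joined.
user_document = {
    "user_id": 101,
    "name": "Asha",  # made-up sample value
    "workouts": [
        {
            "workout_id": 1,
            "date": "2023-07-01",
            "duration_minutes": 45,
            "exercise": {
                "exercise_id": 201,
                "name": "Push-ups",
                "difficulty_level": "Intermediate",
            },
        }
    ],
}

# Documents are usually stored and exchanged in an encoded form such as JSON.
encoded = json.dumps(user_document)
decoded = json.loads(encoded)

# One read returns the user together with the nested workout and exercise.
first_exercise_name = decoded["workouts"][0]["exercise"]["name"]
```

The trade-off is that the exercise details are copied into every workout that uses them, so updating an exercise means touching many documents.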

Slide 16

Slide 16 text

GRAPH MODEL
Nodes: User 1, User 2, User 3, User 4, Workout Post 1, Workout Post 2
Edges: follows, creates, comments on
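
A small sketch of a graph as a list of labelled edges. The node names and edge labels are taken from the slide, but which node connects to which was lost in extraction, so the specific edges below are hypothetical.

```python
from collections import defaultdict

# (source, label, target) triples; the exact wiring is an assumption.
edges = [
    ("User 1", "follows", "User 2"),
    ("User 1", "creates", "Workout Post 1"),
    ("User 2", "creates", "Workout Post 2"),
    ("User 3", "comments on", "Workout Post 1"),
    ("User 4", "comments on", "Workout Post 1"),
    ("User 4", "comments on", "Workout Post 2"),
]

# Index outgoing edges per node for cheap traversal.
outgoing = defaultdict(list)
for source, label, target in edges:
    outgoing[source].append((label, target))


def neighbours(node, label):
    """Nodes reachable from `node` over edges with the given label."""
    return [target for edge_label, target in outgoing[node] if edge_label == label]


def commenters_on_posts_by(author):
    """Two-hop question: who commented on anything this author created?"""
    posts = set(neighbours(author, "creates"))
    return sorted(
        source
        for source, label, target in edges
        if label == "comments on" and target in posts
    )


commenters = commenters_on_posts_by("User 1")
```

Questions that hop across relationships, like the one above, are where a graph model tends to be more natural than joins.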

Slide 17

Slide 17 text

TIME SERIES MODEL
Workout Measurement
tags: workout_id=1, user_id=101, exercise_id=201, fields: date=2023-07-01, duration=45 minutes
tags: workout_id=2, user_id=102, exercise_id=202, fields: date=2023-07-02, duration=60 minutes
Exercise Measurement
tags: exercise_id=201, fields: name=Push-ups, difficulty_level=Intermediate
tags: exercise_id=202, fields: name=Squats, difficulty_level=Beginner
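
A sketch of the tags-and-fields idea in plain Python, using the workout points from the slide. The `Point` class and the `select` helper are hypothetical names for illustration, not the API of any particular time-series database.

```python
from dataclasses import dataclass


@dataclass
class Point:
    measurement: str
    tags: dict    # indexed metadata used for filtering
    fields: dict  # the measured values


points = [
    Point("workout", {"workout_id": 1, "user_id": 101, "exercise_id": 201},
          {"date": "2023-07-01", "duration_minutes": 45}),
    Point("workout", {"workout_id": 2, "user_id": 102, "exercise_id": 202},
          {"date": "2023-07-02", "duration_minutes": 60}),
]


def select(all_points, measurement, **tag_filter):
    """Return the points of one measurement whose tags match every filter."""
    return [
        p for p in all_points
        if p.measurement == measurement
        and all(p.tags.get(key) == value for key, value in tag_filter.items())
    ]


# Filter on tags, aggregate over fields.
user_102 = select(points, "workout", user_id=102)
total_minutes = sum(p.fields["duration_minutes"] for p in select(points, "workout"))
```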

Slide 18

Slide 18 text

HOW TO QUERY YOUR DATA?
SELECT * FROM query_languages WHERE type = 'SQL';

Slide 19

Slide 19 text

QUERY STYLES
- IMPERATIVE STYLE: YYDT
- DECLARATIVE STYLE: SQL
- MAP REDUCE: imperative or declarative?
- AGGREGATION PIPELINE: the JavaScript SQL

Slide 20

Slide 20 text

IMPERATIVE STYLE
You Yourself Do It
YYDT - Pronounced as YUKK 🤢

Slide 21

Slide 21 text

DECLARATIVE STYLE
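
To make the contrast between the two styles concrete, here is a sketch that answers the same question (total workout minutes per user) both ways. The sample rows are made up; the declarative half assumes SQLite via the standard `sqlite3` module.

```python
import sqlite3

# (workout_id, user_id, duration in minutes); sample data is made up.
workouts = [(1, 101, 45), (2, 102, 60), (3, 101, 30)]

# Imperative: spell out HOW to compute the answer, step by step.
totals_imperative = {}
for _workout_id, user_id, duration in workouts:
    totals_imperative[user_id] = totals_imperative.get(user_id, 0) + duration

# Declarative: state WHAT is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workouts (workout_id INTEGER, user_id INTEGER, duration INTEGER)"
)
conn.executemany("INSERT INTO workouts VALUES (?, ?, ?)", workouts)
totals_declarative = dict(
    conn.execute("SELECT user_id, SUM(duration) FROM workouts GROUP BY user_id")
)
```

Both produce the same totals, but only the declarative version leaves the database free to pick indexes, ordering, and parallelism on its own.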

Slide 22

Slide 22 text

MAP REDUCE QUERYING
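
A single-process sketch of the map-reduce pattern, assuming the same made-up workout data as a running example. The `map_reduce` driver is a hypothetical helper; real systems run the map and reduce steps across many machines.

```python
from collections import defaultdict

workouts = [
    {"user_id": 101, "duration": 45},
    {"user_id": 102, "duration": 60},
    {"user_id": 101, "duration": 30},
]


def map_workout(workout):
    """Map step: emit (key, value) pairs for one input record."""
    yield workout["user_id"], workout["duration"]


def reduce_durations(_user_id, durations):
    """Reduce step: combine every value emitted for one key."""
    return sum(durations)


def map_reduce(records, mapper, reducer):
    # Shuffle phase: group mapped values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


minutes_per_user = map_reduce(workouts, map_workout, reduce_durations)
```

The mapper and reducer are small imperative functions, while the grouping and scheduling are left to the framework, which is why the style sits between imperative and declarative.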

Slide 23

Slide 23 text

AGGREGATION PIPELINE
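
A pure-Python imitation of an aggregation pipeline: documents flow through an ordered list of stages, each stage consuming the output of the previous one. The stage helpers `match`, `group_sum`, and `sort_by` are hypothetical names written for this sketch, not the API of any database driver, and the sample documents are made up.

```python
def match(predicate):
    """Stage that keeps only the documents satisfying the predicate."""
    def stage(docs):
        return (doc for doc in docs if predicate(doc))
    return stage


def group_sum(key, field):
    """Stage that groups by `key` and sums `field` within each group."""
    def stage(docs):
        totals = {}
        for doc in docs:
            totals[doc[key]] = totals.get(doc[key], 0) + doc[field]
        return ({"_id": k, "total": v} for k, v in totals.items())
    return stage


def sort_by(field, descending=False):
    """Stage that orders documents by one field."""
    def stage(docs):
        return iter(sorted(docs, key=lambda doc: doc[field], reverse=descending))
    return stage


def run_pipeline(docs, stages):
    for stage in stages:
        docs = stage(docs)
    return list(docs)


workouts = [
    {"user_id": 101, "duration": 45},
    {"user_id": 102, "duration": 60},
    {"user_id": 101, "duration": 30},
    {"user_id": 102, "duration": 50},
]

# Total minutes per user, counting only workouts of 40 minutes or more.
result = run_pipeline(workouts, [
    match(lambda doc: doc["duration"] >= 40),
    group_sum("user_id", "duration"),
    sort_by("total", descending=True),
])
```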

Slide 24

Slide 24 text

WHERE TO STORE DATA?
- DATABASES
- IN-MEMORY
- DATA WAREHOUSES
- DATA LAKES
- OBJECT STORAGE

Slide 25

Slide 25 text

DATA ENCODING
- JSON
- XML
- AVRO
- PARQUET
- CSV
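
A sketch comparing two of the listed encodings using only the Python standard library. It encodes one made-up workout record as JSON and as CSV and reads it back, to show that the choice of encoding affects what survives a round trip. Avro and Parquet need third-party libraries and are not shown.

```python
import csv
import io
import json

record = {"workout_id": 1, "user_id": 101, "duration": 45}

# JSON keeps the distinction between numbers and strings.
json_text = json.dumps(record)
json_back = json.loads(json_text)

# CSV is plain text: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()
csv_back = dict(next(csv.DictReader(io.StringIO(csv_text))))
```

Here `json_back` equals the original record, while `csv_back` holds `"45"` rather than `45`, so the reader has to know the schema to restore the types.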

Slide 26

Slide 26 text

PASSING DATA BETWEEN SERVICES?
- VIA DATABASE
- VIA REST (OR SOAP)
- VIA ASYNC MESSAGING

Slide 27

Slide 27 text

VIA DATABASE
- Simple to implement
- Needs to be backward compatible
- Needs to be forward compatible

Slide 28

Slide 28 text

DATA OUTLIVES CODE
- Data written by new code
- Old version of code (missing description)
- Description is lost
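
A sketch of the failure this slide describes, under the assumption that newer code has added a `description` field that older code does not know about. The function names and the sample record are hypothetical.

```python
# Fields the OLD version of the code knows about.
KNOWN_FIELDS = ("user_id", "name")


def lossy_update(stored, new_name):
    """Old code that rebuilds the record from its own model.

    Anything it does not recognise is silently dropped on write-back.
    """
    model = {field: stored[field] for field in KNOWN_FIELDS}
    model["name"] = new_name
    return model


def preserving_update(stored, new_name):
    """Old code that copies the whole record and changes only what it owns."""
    updated = dict(stored)
    updated["name"] = new_name
    return updated


# Record written by the NEW code, which added "description".
stored = {"user_id": 101, "name": "Asha", "description": "Runs every morning"}

after_lossy = lossy_update(stored, "Asha K")
after_preserving = preserving_update(stored, "Asha K")
```

Carrying unknown fields through untouched is one common way for old code to stay forward compatible with data written by newer code.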

Slide 29

Slide 29 text

VIA REST
- Easier to develop
- Flexible (loose coupling)
- Popular (community support)
MICRO-SERVICES!

Slide 30

Slide 30 text

Who needs a nap when you have REST to keep you RESTed
Diagram: Client -> RESTful API over HTTP -> Server -> Database

Slide 31

Slide 31 text

VIA ASYNC MESSAGING
- Event Driven Architecture
- Fault Tolerant
- Asynchronous Flow
MICRO-SERVICES!

Slide 32

Slide 32 text

Huge shoutout to the clients
Diagram labels: Server, Database, Message Broker, Client, Client, Message
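
A sketch of asynchronous messaging inside a single process, where a thread-safe `queue.Queue` stands in for the message broker. The event shape (`workout_logged`) is made up; a real deployment would use a dedicated broker, which is not shown here.

```python
import queue
import threading

broker = queue.Queue()  # stands in for a message broker
received = []
STOP = object()         # sentinel telling the consumer to shut down


def consumer():
    """Pull messages off the broker until the stop sentinel arrives."""
    while True:
        message = broker.get()
        if message is STOP:
            break
        received.append(message)


worker = threading.Thread(target=consumer)
worker.start()

# The producer publishes and moves on; it never waits for the consumer.
for workout_id in (1, 2):
    broker.put({"event": "workout_logged", "workout_id": workout_id})

broker.put(STOP)
worker.join()
```

The producer and consumer only share the queue, so either side can be slow, restarted, or replaced without the other needing to know.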

Slide 33

Slide 33 text

WE CAN HANDLE DATA!

Slide 34

Slide 34 text

BUT...

Slide 35

Slide 35 text

SHARDING
- Split a single dataset into partitions or shards
- All shards run on separate nodes
- Leverage horizontal scaling
- Increased throughput

Slide 36

Slide 36 text

SHARDING
User Table
user_id  name    age
1        Venkat  31
2        Josh    28
3        Ivar    30
4        Mala    23

Sharding (first shard)
user_id  name    age
1        Venkat  31
2        Josh    28

Sharding (second shard)
user_id  name    age
3        Ivar    30
4        Mala    23
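
The split on this slide looks like range-based sharding on `user_id`. A minimal sketch of that routing, assuming a single boundary at `user_id` 2 (the boundary value and the `shard_for` helper are assumptions for illustration):

```python
from bisect import bisect_left

USERS = [
    {"user_id": 1, "name": "Venkat", "age": 31},
    {"user_id": 2, "name": "Josh", "age": 28},
    {"user_id": 3, "name": "Ivar", "age": 30},
    {"user_id": 4, "name": "Mala", "age": 23},
]

# Upper bound (inclusive) of every shard except the last one.
BOUNDARIES = [2]


def shard_for(user_id, boundaries=BOUNDARIES):
    """Return the index of the shard that owns this user_id."""
    return bisect_left(boundaries, user_id)


# Route every row to its shard; each shard would live on its own node.
shards = {}
for user in USERS:
    shards.setdefault(shard_for(user["user_id"]), []).append(user)

shard_names = {index: [u["name"] for u in rows] for index, rows in shards.items()}
```

Range sharding keeps neighbouring keys together, which helps range scans but can create hot shards; hashing the key is the usual alternative when even spread matters more.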

Slide 37

Slide 37 text

DATA REPLICATION
- Increased throughput
- Scalability
- Fault tolerance

Slide 38

Slide 38 text

LEADERS AND FOLLOWERS
Diagram: User -> Read-Write Query -> Leader -> Replication Streams (Data Change) -> Follower, Follower
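
A toy sketch of leader-based replication. All writes go to the leader, which records each change and forwards it to its followers. The class and method names are hypothetical, and the replication here is synchronous for simplicity; real systems often replicate asynchronously, so followers can briefly lag behind.

```python
class Follower:
    """Read-only replica that applies changes streamed from the leader."""

    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Accepts writes and ships every change to each follower."""

    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered record of changes (the replication stream)
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        change = (key, value)
        self.log.append(change)
        for follower in self.followers:
            follower.apply(change)

    def read(self, key):
        return self.data.get(key)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:101:name", "Asha")  # made-up key and value

# Reads can now be served by any replica.
follower_reads = [f.read("user:101:name") for f in followers]
```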

Slide 39

Slide 39 text

TIME TO DIFFERENTIATE BETWEEN OLTP AND OLAP

Slide 40

Slide 40 text

DATA PIPELINES
- Flexibility
- Scalability
- Separation of Concerns

Slide 41

Slide 41 text

EXTRACT-TRANSFORM-LOAD (ETL)
OLTP Systems: Workout DB, Nutrition DB, User DB, Wearable Sales DB
Flow (one per source): extract -> Transform -> load
OLAP System: Data Warehouse
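
A minimal ETL sketch, assuming two in-memory SQLite databases stand in for one OLTP source (the Workout DB) and the data warehouse. The table names, the `fact_workout` target, and the minutes-to-hours transform are made up for illustration.

```python
import sqlite3

# Source: an OLTP database with raw workout rows.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE workouts (workout_id INTEGER, user_id INTEGER, date TEXT, duration INTEGER)"
)
oltp.executemany(
    "INSERT INTO workouts VALUES (?, ?, ?, ?)",
    [(1, 101, "2023-07-01", 45), (2, 102, "2023-07-02", 60)],
)

# Target: the warehouse table, already shaped for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_workout (user_id INTEGER, date TEXT, duration_hours REAL)"
)


def extract(conn):
    return conn.execute("SELECT user_id, date, duration FROM workouts").fetchall()


def transform(rows):
    # Transform happens OUTSIDE the warehouse, before loading.
    return [(user_id, date, minutes / 60) for user_id, date, minutes in rows]


def load(conn, rows):
    conn.executemany("INSERT INTO fact_workout VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(oltp)))
loaded = warehouse.execute(
    "SELECT user_id, duration_hours FROM fact_workout ORDER BY user_id"
).fetchall()
```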

Slide 42

Slide 42 text

EXTRACT-LOAD-TRANSFORM (ELT)
OLTP Systems: Workout DB, Nutrition DB, User DB, Wearable Sales DB
Flow: E & L (extract and load) -> Data Warehouse / Lake -> Transform
OLAP System: Data Warehouse / Lake
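
A minimal ELT sketch for contrast, assuming an in-memory SQLite database plays the warehouse. The raw rows are loaded unchanged first, and the transform then runs inside the warehouse as SQL. Table names and sample rows are made up.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract and load: raw source rows land in the warehouse as-is.
source_rows = [(101, "2023-07-01", 45), (102, "2023-07-02", 60)]
warehouse.execute(
    "CREATE TABLE raw_workouts (user_id INTEGER, date TEXT, duration INTEGER)"
)
warehouse.executemany("INSERT INTO raw_workouts VALUES (?, ?, ?)", source_rows)

# Transform: done later, in place, using the warehouse's own engine.
warehouse.execute("""
    CREATE TABLE workout_hours AS
    SELECT user_id, date, duration / 60.0 AS duration_hours
    FROM raw_workouts
""")

transformed = warehouse.execute(
    "SELECT user_id, duration_hours FROM workout_hours ORDER BY user_id"
).fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_workouts").fetchone()[0]
```

Because the raw table is kept, new transforms can be added later without going back to the source systems.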

Slide 43

Slide 43 text

DATA PLATFORM
- Application Layer: Web Apps, Microservices, Enterprise Applications
- Security Layer: Authentication, Authorisation, Logging, Alerting
- Storage Layer: OLTP Systems and Databases
- Ingestion Layer: APIs, ETL, ELT, Pub Sub, Data Streams
- Analytics Layer: OLAP Systems, Data Warehouses, Data Lakes
- Data Governance

Slide 44

Slide 44 text

DATA GOVERNANCE
- Security
- Access and Availability
- Quality
- Compliance

Slide 45

Slide 45 text

ISSUES WITH DISTRIBUTED SYSTEMS
- System Faults and Partial Failures
- Unreliable Networks
- Unreliable System Clocks
- Questionable Reality

Slide 46

Slide 46 text

FUTURE TRENDS

Slide 47

Slide 47 text

REVERSE ETL
OLAP System: Data Warehouse / Lake
Flow (one per target): extract -> Transform -> load
OLTP Systems: Workout DB, Nutrition DB, User DB, Wearable Sales DB

Slide 48

Slide 48 text

DATA MESH
Data Infrastructure as a Platform
Domains: Domain 1, Domain 2, Domain 3, Domain 4

Slide 49

Slide 49 text

MESH OR NOT TO MESH
- More about process than architecture
- Domain Centric Segregation
- Federated Governance
- Data as a product

Slide 50

Slide 50 text

DATA AS A PRODUCT
- DaaP is a mindset
- Treats data as a valuable asset
- Data Ownership
- Introduces data interfaces

Slide 51

Slide 51 text

FEEL FREE TO REACH OUT

Slide 52

Slide 52 text

THANK YOU!