Slide 1

Slide 1 text

@arafkarsh arafkarsh Architecting & Building Apps a tech presentorial Combination of presentation & tutorial ARAF KARSH HAMID Co-Founder / CTO MetaMagic Global Inc., NJ, USA @arafkarsh arafkarsh AI / ML Generative AI LLMs, RAG 6+ Years Microservices Blockchain 8 Years Cloud Computing 8 Years Network & Security 8 Years Distributed Computing 1 Distributed Cache: Hazelcast, Redis, EHCache NoSQL Vs. SQL: Redis / MongoDB / DynamoDB Scalability: Shards and Partitions Multi-Tenancy: DB, Schema, Table Compliance and Data Security Data Lake, Warehouse, Mart, Data Mesh Microservices Architecture Series Part 5 of 15 To Build Cloud Native Apps Using Composable Enterprise Architecture

Slide 2

Slide 2 text

@arafkarsh arafkarsh 2 Source: https://arafkarsh.medium.com/embracing-cloud-native-a-roadmap-to-innovation-a6b06fe3a9fb Cloud-Native Architecture

Slide 3

Slide 3 text

@arafkarsh arafkarsh 3 Source: https://arafkarsh.medium.com/embracing-cloud-native-a-roadmap-to-innovation-a6b06fe3a9fb

Slide 4

Slide 4 text

@arafkarsh arafkarsh 4 Slides are color coded based on the topic colors. Distributed Cache EHCache, Hazelcast, Redis, Coherence 1 NoSQL Vs. SQL Redis, MongoDB DynamoDB, Neo4J Data Mesh / Lake 2 Scalability Sharding & Partitions 3 Multi-Tenancy Compliance Data Security 4

Slide 5

Slide 5 text

@arafkarsh arafkarsh Agile Scrum (4-6 Weeks) Developer Journey Monolithic Domain Driven Design Event Sourcing and CQRS Waterfall Optional Design Patterns Continuous Integration (CI) 6/12 Months Enterprise Service Bus Relational Database [SQL] / NoSQL Development QA / QC Ops 5 Microservices Domain Driven Design Event Sourcing and CQRS Scrum / Kanban (1-5 Days) Mandatory Design Patterns Infrastructure Design Patterns CI DevOps Event Streaming / Replicated Logs SQL NoSQL CD Container Orchestrator Service Mesh

Slide 6

Slide 6 text

@arafkarsh arafkarsh Application Modernization – 3 Transformations 6 Monolithic SOA Microservice Physical Server Virtual Machine Cloud Waterfall Agile DevOps Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY Architecture Infrastructure Delivery Modernization 1 2 3

Slide 7

Slide 7 text

@arafkarsh arafkarsh Distributed Caching • EHCache • Hazelcast • Oracle Coherence • Redis 7 1

Slide 8

Slide 8 text

@arafkarsh arafkarsh Distributed Cache Feature Set 8
1. Language Support: Refers to the programming languages for which the distributed caching solution provides APIs or client libraries.
2. Partitioning & Replication: The ability to partition data across multiple nodes and maintain replicas for fault tolerance and availability.
3. Eviction Policies: Strategies to remove data from the cache when it reaches capacity. Standard policies include Least Recently Used (LRU) and Least Frequently Used (LFU).
4. Persistence: Storing cached data on disk allows cache recovery in case of node failure or restart.
5. Querying: Support for querying cached data using a query language or API.
6. Transactions: The ability to perform atomic operations and maintain data consistency across cache operations.
7. High Availability & Fault Tolerance: Support for redundancy and automatic failover to ensure the cache remains operational in case of node failures.
8. Performance: A measure of the cache's ability to handle read and write operations with low latency and high throughput.
9. Data Structures: The types of data structures supported by the caching solution.
10. Open Source: Whether the caching solution is open source and freely available for use and modification.

Slide 9

Slide 9 text

@arafkarsh arafkarsh Distributed Cache Comparison 9
Feature | EHCache | Hazelcast | Coherence | Redis
Language Support | Java | Java, .NET, C++, Python, Node.js, etc. | Java | Java, Python, .NET, C++, etc.
Partitioning & Replication | Terracotta integration (limited) | Native support | Native support | Native support (Redis Cluster)
Eviction Policies | LRU, FIFO, custom | LRU, LFU, custom | LRU, custom | LRU, LFU, volatile, custom
Persistence | Disk-based persistence | Disk-based persistence | Disk-based persistence | Disk-based and in-memory persistence
Querying | Limited support | SQL-like querying (Predicate API) | SQL-like querying (Filter API) | Limited querying support
Transactions | Limited support | Native support | Native support | Native support
High Availability & Fault Tolerance | Limited (with Terracotta) | Native support | Native support | Native support (via replication and clustering)
Performance | Moderate | High | High | High
Data Structures | Key-value pairs | Key-value pairs, queues, topics, etc. | Key-value pairs, caches, and services | Strings, lists, sets, hashes, etc.
Open Source | Yes | Yes | No (proprietary) | Yes

Slide 10

Slide 10 text

@arafkarsh arafkarsh Operational In-Memory Computing 10
Cache Topology
• Standalone: This setup consists of a single node containing all the cached data. It's equivalent to a single-node cluster and does not collaborate with other running instances.
• Distributed: Data is spread across multiple nodes in a cache such that only a single node is responsible for fetching a particular entry. This is achieved by distributing/partitioning the cluster in a balanced manner (i.e., all the nodes have the same number of entries and are hence load balanced). Failover is handled via configurable backups on each node.
• Replicated: Data is spread across multiple nodes in a cache such that each node holds the complete cache data; since each cluster node contains all the data, failover is not a concern.
Caching Strategies
• Read-Through: A process by which a missing cache entry is fetched from the integrated backend store.
• Write-Through: A process by which changes to a cache entry (create, update, delete) are pushed into the backend data store.
It is important to note that the business logic for Read-Through and Write-Through operations for a specific cache is confined within the caching layer itself. Hence, your application remains insulated from the specifics of the cache and its backing system-of-record.
Caching Mode
• Embedded: When the cache and the application co-exist within the same JVM, the cache is operating in embedded mode. The cache lives and dies with the application JVM. Use this strategy when tight coupling between your application and the cache is not a concern and the application host has enough capacity (memory) to accommodate the demands of the cache.
• Client / Server: In this setup, the application acts as a client to a standalone (remote) caching layer. Use this when the caching infrastructure and the application need to evolve independently, or when multiple applications use a unified caching layer which can be scaled up without affecting client applications.
Java Cache API: JSR 107 [ Distributed Caching / Distributed Computing / Distributed Messaging ]
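Since the slide points to the Java Cache API (JSR 107), here is a minimal, provider-neutral sketch of that API (the cache name and expiry value are illustrative assumptions; it assumes a JCache provider such as EHCache, Hazelcast, or Coherence is on the classpath):

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

public class JCacheExample {
    public static void main(String[] args) {
        // Obtain the default CachingProvider found on the classpath
        CacheManager cacheManager = Caching.getCachingProvider().getCacheManager();

        // Configure a simple String -> String cache with a 10-minute time-to-live
        MutableConfiguration<String, String> config = new MutableConfiguration<String, String>()
                .setTypes(String.class, String.class)
                .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.TEN_MINUTES));

        Cache<String, String> cache = cacheManager.createCache("healthCareCache", config);
        cache.put("greeting", "Hello, JCache!");
        System.out.println(cache.get("greeting"));
    }
}

The same code runs unchanged against any JCache-compliant provider; only the provider dependency on the classpath changes.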

Slide 11

Slide 11 text

@arafkarsh arafkarsh Cache Deployment Models 11
[Diagram] Four deployment models are shown:
• Standalone embedded cache – the application and the cache live in the same JVM.
• Embedded distributed/replicated cache – each application node (Node 1, Node 2) embeds a cache member that forms a cluster.
• Standalone client/server cache – the application uses a client API to talk to a single remote cache JVM.
• Distributed/replicated client/server cache – the application uses a client API to talk to a remote cache cluster (Node 1, Node 2, ...).
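To make the embedded vs. client/server distinction concrete, here is a minimal Hazelcast sketch (assuming Hazelcast 4.x/5.x; the cluster name and member addresses are illustrative, not from the slides): the first method joins the cluster as an embedded member, the second connects as a remote client.

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class CacheModes {

    // Embedded mode: this JVM becomes a cluster member and holds cache data
    static HazelcastInstance embedded() {
        Config config = new Config();
        config.setClusterName("demo-cluster");   // illustrative cluster name
        return Hazelcast.newHazelcastInstance(config);
    }

    // Client/server mode: this JVM is only a client of a remote cache cluster
    static HazelcastInstance clientServer() {
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.setClusterName("demo-cluster");
        clientConfig.getNetworkConfig().addAddress("machine1:5701", "machine2:5701");
        return HazelcastClient.newHazelcastClient(clientConfig);
    }
}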

Slide 12

Slide 12 text

@arafkarsh arafkarsh Spring Cache Example • Service definition with Cacheable Annotation • With Complex Object • With Custom Key Generator 12

Slide 13

Slide 13 text

@arafkarsh arafkarsh Cache Simple Example 13
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class MyCacheService {

    @Cacheable(value = "healthCareCache", key = "#name")
    public String getGreeting(String name) {
        // Simulating an expensive operation
        try {
            Thread.sleep(7000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return "Hello, " + name + "! How are you today?";
    }
}

The @Cacheable annotation is part of the Spring Cache abstraction. This annotation indicates that the result of a method invocation should be cached so that subsequent invocations with the same arguments can return the result from the cache.
1. Before the method execution, Spring generates a cache key based on the method arguments and the specified cache name.
2. Spring checks if the cache contains a value associated with the generated cache key.
3. If a cached value is found, it is returned directly, and the method is not executed.
4. If no cached value is found, the method is executed and the result is stored in the cache with the generated key.
5. The result of the method is returned to the caller.

Slide 14

Slide 14 text

@arafkarsh arafkarsh Cache Annotations 14
@Service
@CacheConfig(cacheNames = "healthCareCache")
public class PatientService {

    @Cacheable(key = "#id")
    public Patient findPatientById(String id) {
        // Code to fetch data
    }

    @CachePut(key = "#patient.id")
    public Patient updatePatient(Patient patient) {
        // Code to update data
        // Cache is also updated
    }

    @CacheEvict(key = "#id")
    public void deletePatient(String id) {
        // Code to delete data
        // Cache entry is evicted
    }
}

@CacheEvict: The @CacheEvict annotation is used to remove one or more entries from the cache. You can specify the cache entry key to evict or use the allEntries attribute to remove all entries from the specified cache.
@CachePut: The @CachePut annotation is used to update the cache with the result of the method execution. Unlike @Cacheable, which only executes the method if the result is not present in the cache, @CachePut always executes the method and then updates the cache with the returned value. This annotation is useful when you want to update the cache after modifying data.

Slide 15

Slide 15 text

@arafkarsh arafkarsh Cache Complex Example 15
public class Patient {

    private String firstName, lastName, dateOfBirth, gender, maritalStatus;
    private String phone, email, address;

    public Patient(String firstName, String lastName, String dateOfBirth,
                   String gender, String maritalStatus, String phone,
                   String email, String address) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.dateOfBirth = dateOfBirth;
        this.gender = gender;
        this.maritalStatus = maritalStatus;
        this.phone = phone;
        this.email = email;
        this.address = address;
    }

    // Getters omitted for brevity
}

Slide 16

Slide 16 text

@arafkarsh arafkarsh Cache Complex Example 16
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class PatientService {

    @Cacheable(value = "patient1Cache", key = "#patient.firstName")
    public Patient findPatientByFirstName(Patient patient) {
        // Simulate a time-consuming operation
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return patient;
    }

    @Cacheable(value = "patient2Cache", key = "#firstName + '_' + #lastName")
    public Patient findPatientByFirstAndLastName(String firstName, String lastName) {
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Patient(firstName, lastName, "01-01-1995", "Female", "Single",
                "98765-12345", "jane.doe@example.com", "123 Main St");
    }
}

Slide 17

Slide 17 text

@arafkarsh arafkarsh Cache Example: Custom Key Generator 17
import org.springframework.cache.interceptor.KeyGenerator;
import org.springframework.web.context.request.RequestContextHolder;
import org.springframework.web.context.request.ServletRequestAttributes;
import javax.servlet.http.HttpServletRequest;
import java.lang.reflect.Method;

public class CustomKeyGenerator implements KeyGenerator {

    @Override
    public Object generate(Object target, Method method, Object... params) {
        HttpServletRequest request = ((ServletRequestAttributes)
                RequestContextHolder.currentRequestAttributes()).getRequest();
        String customHeaderValue = request.getHeader("X-Custom-Header");

        StringBuilder keyBuilder = new StringBuilder();
        keyBuilder.append(method.getName()).append("-");
        if (customHeaderValue != null) {
            keyBuilder.append(customHeaderValue).append("-");
        }
        for (Object param : params) {
            keyBuilder.append(param.toString()).append("-");
        }
        return keyBuilder.toString();
    }
}

Slide 18

Slide 18 text

@arafkarsh arafkarsh Cache Example: Key Generator Bean 18
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean("customKeyGenerator")
    public CustomKeyGenerator customKeyGenerator() {
        return new CustomKeyGenerator();
    }
}

Slide 19

Slide 19 text

@arafkarsh arafkarsh Cache Complex Example 19
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.interceptor.KeyGenerator;
import org.springframework.stereotype.Service;

@Service
public class PatientService {

    @Autowired
    @Qualifier("customKeyGenerator")
    private KeyGenerator customKeyGenerator;

    @Cacheable(value = "patientCache", keyGenerator = "customKeyGenerator")
    public Patient findPatientByFirstAndLastName(String firstName, String lastName) {
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new Patient(firstName, lastName, "01-01-1995", "Female", "Single",
                "98765-12345", "jane.doe@example.com", "123 Main St");
    }
}

Slide 20

Slide 20 text

@arafkarsh arafkarsh EHCache Setup • POM File • Configuration File • Spring Boot Configuration 20

Slide 21

Slide 21 text

@arafkarsh arafkarsh EHCache Setup 21
POM File
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-cache</artifactId>
</dependency>
<dependency>
    <groupId>org.ehcache</groupId>
    <artifactId>ehcache</artifactId>
</dependency>
Configuration – ehcache.xml

Slide 22

Slide 22 text

@arafkarsh arafkarsh EHCache Properties 22
• name: The unique name of the cache.
• maxElementsInMemory: The maximum number of elements that can be stored in memory. Once this limit is reached, elements can be evicted or overflow to disk, depending on the configuration. The value is a positive integer.
• eternal: A boolean value that indicates whether the elements in the cache should never expire. If set to true, the timeToIdleSeconds and timeToLiveSeconds attributes are ignored. The value is either true or false.
• timeToIdleSeconds: The maximum number of seconds an element can be idle (not accessed) before it expires. A value of 0 means there's no limit on the idle time. The value is a non-negative integer.
• timeToLiveSeconds: The maximum number of seconds an element can exist in the cache, regardless of idle time. A value of 0 means there's no limit on the element's lifespan. The value is a non-negative integer.
• overflowToDisk: A boolean value that indicates whether elements can be moved from memory to disk when the maxElementsInMemory limit is reached. This attribute is deprecated in EHCache 2.10 and removed in EHCache 3.x. Instead, you should use the diskPersistent attribute or configure a disk store element. The value is either true or false.
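Since the deck references ehcache.xml but its contents are not shown, here is a minimal sketch of a cache element using the attributes described above (assuming the Ehcache 2.x XML schema that these attributes belong to; the cache name and values are illustrative assumptions, not from the slides):

<ehcache>
    <!-- Illustrative cache definition; tune the values for your workload -->
    <cache name="healthCareCache"
           maxElementsInMemory="10000"
           eternal="false"
           timeToIdleSeconds="300"
           timeToLiveSeconds="600"/>
</ehcache>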

Slide 23

Slide 23 text

@arafkarsh arafkarsh EHCache Setup for Distributed 23
POM File
<dependency>
    <groupId>org.ehcache.modules</groupId>
    <artifactId>ehcache-clustered</artifactId>
    <version>get-latest-version</version>
</dependency>

Slide 24

Slide 24 text

@arafkarsh arafkarsh EHCache Spring Boot Config 24
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.ehcache.EhCacheCacheManager;
import org.springframework.cache.ehcache.EhCacheManagerFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;

@Configuration
@EnableCaching
public class CacheConfiguration {

    @Bean
    public EhCacheCacheManager cacheManager() {
        return new EhCacheCacheManager(ehCacheCacheManager().getObject());
    }

    @Bean
    public EhCacheManagerFactoryBean ehCacheCacheManager() {
        EhCacheManagerFactoryBean factory = new EhCacheManagerFactoryBean();
        factory.setConfigLocation(new ClassPathResource("ehcache.xml"));
        factory.setShared(true);
        return factory;
    }
}

Slide 25

Slide 25 text

@arafkarsh arafkarsh Hazelcast Setup • POM File • Configuration File • Spring Boot Configuration 25

Slide 26

Slide 26 text

@arafkarsh arafkarsh Hazelcast Setup 26
POM File
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-cache</artifactId>
</dependency>
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
</dependency>
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast-spring</artifactId>
</dependency>
Configuration – hazelcast.xml

Slide 27

Slide 27 text

@arafkarsh arafkarsh Hazelcast Properties 27
1. cache name: The unique name of the cache.
2. eviction:
   • size: The maximum number of elements in the cache before eviction occurs. The value is a positive integer.
   • max-size-policy: The cache size policy. The possible values are ENTRY_COUNT, USED_HEAP_SIZE, and USED_HEAP_PERCENTAGE.
   • eviction-policy: The eviction policy for the cache. The possible values are LRU (Least Recently Used), LFU (Least Frequently Used), RANDOM, and NONE.
3. expiry-policy-factory: Configures the cache's expiration policy.
   • timed-expiry-policy-factory: A factory for creating a timed expiry policy.
   • expiry-policy-type: The type of expiry policy. Possible values are CREATED, ACCESSED, MODIFIED, and TOUCHED. CREATED expires based on the creation time, ACCESSED expires based on the last access time, MODIFIED expires based on the last modification time, and TOUCHED expires based on the last access or modification time.
   • duration-amount: The duration amount for the expiry policy. The value is a positive integer.
   • time-unit: The time unit for the duration amount. Possible values are NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, MINUTES, HOURS, and DAYS.
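A minimal hazelcast.xml cache sketch using the elements described above (the cache name and values are illustrative assumptions, not from the slides):

<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <cache name="healthCareCache">
        <eviction size="10000" max-size-policy="ENTRY_COUNT" eviction-policy="LRU"/>
        <expiry-policy-factory>
            <timed-expiry-policy-factory expiry-policy-type="CREATED"
                                         duration-amount="120"
                                         time-unit="SECONDS"/>
        </expiry-policy-factory>
    </cache>
</hazelcast>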

Slide 28

Slide 28 text

@arafkarsh arafkarsh Hazelcast Setup for Distributed 28
Configuration – hazelcast.xml [recoverable values from the original listing: cluster members machine1:5701 and machine2:5701, and the value 120]
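A minimal sketch of the TCP/IP cluster-join section of hazelcast.xml, assuming the two members listed above (the element the value 120 belonged to is not recoverable, so it is omitted here):

<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <network>
        <join>
            <!-- Disable multicast and list the cluster members explicitly -->
            <multicast enabled="false"/>
            <tcp-ip enabled="true">
                <member>machine1:5701</member>
                <member>machine2:5701</member>
            </tcp-ip>
        </join>
    </network>
</hazelcast>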

Slide 29

Slide 29 text

@arafkarsh arafkarsh Hazelcast Spring Boot Config 29
import com.hazelcast.config.ClasspathXmlConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.spring.cache.HazelcastCacheManager;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class CacheConfiguration {

    @Bean
    public CacheManager cacheManager() {
        return new HazelcastCacheManager(hazelcastInstance());
    }

    @Bean
    public HazelcastInstance hazelcastInstance() {
        // Load hazelcast.xml from the classpath
        Config config = new ClasspathXmlConfig("hazelcast.xml");
        return Hazelcast.newHazelcastInstance(config);
    }
}

Slide 30

Slide 30 text

@arafkarsh arafkarsh Oracle Coherence Setup • POM File • Configuration File • Spring Boot Configuration 30

Slide 31

Slide 31 text

@arafkarsh arafkarsh Oracle Coherence Setup 31
POM File
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-cache</artifactId>
</dependency>
<dependency>
    <groupId>com.oracle.coherence.spring</groupId>
    <artifactId>coherence-spring-boot-starter</artifactId>
    <version>3.3.2</version>
</dependency>
Configuration – coherence.xml
<cache-config>
    <caching-scheme-mapping>
        <cache-mapping>
            <cache-name>healthCareCache</cache-name>
            <scheme-name>example-distributed</scheme-name>
        </cache-mapping>
    </caching-scheme-mapping>
    <caching-schemes>
        <distributed-scheme>
            <scheme-name>example-distributed</scheme-name>
            <backing-map-scheme>
                <local-scheme/>
            </backing-map-scheme>
            <autostart>true</autostart>
        </distributed-scheme>
    </caching-schemes>
</cache-config>

Slide 32

Slide 32 text

@arafkarsh arafkarsh Coherence Properties 32
1. cache-config: The root element for the Coherence cache configuration.
2. caching-scheme-mapping: Contains the mapping of cache names to caching schemes.
   1. cache-mapping: Defines the mapping between a cache name and a caching scheme.
      1. cache-name: The unique name of the cache.
      2. scheme-name: The name of the caching scheme that this cache should use.
3. caching-schemes: Contains the caching scheme definitions.
   1. distributed-scheme: The distributed caching scheme. This scheme provides a distributed cache, partitioned across the cluster members.
      1. scheme-name: The name of the distributed caching scheme.
      2. backing-map-scheme: The backing map scheme that defines the storage strategy for the distributed cache.
         1. local-scheme: A local backing map scheme that stores the cache data in the local member's memory.
      3. autostart: A boolean value that indicates whether the cache should start automatically when the cache service starts. The value is either true or false.

Slide 33

Slide 33 text

@arafkarsh arafkarsh Oracle Coherence Spring Boot Config 33
import com.tangosol.net.CacheFactory;
import com.tangosol.net.ConfigurableCacheFactory;
import com.tangosol.net.NamedCache;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.support.SimpleCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;
import java.util.Collections;

@Configuration
@EnableCaching
public class CacheConfiguration {

    @Bean
    public CacheManager cacheManager() {
        NamedCache cache = getCache("healthCareCache");
        SimpleCacheManager cacheManager = new SimpleCacheManager();
        cacheManager.setCaches(Collections.singletonList(cache));
        return cacheManager;
    }

    @Bean
    public ConfigurableCacheFactory configurableCacheFactory() {
        return CacheFactory.getCacheFactoryBuilder()
                .getConfigurableCacheFactory(new ClassPathResource("coherence.xml").getFile());
    }

    private NamedCache getCache(String cacheName) {
        return configurableCacheFactory().ensureCache(cacheName, null);
    }
}

Slide 34

Slide 34 text

@arafkarsh arafkarsh Redis Cache Setup • POM File • Configuration File • Spring Boot Configuration 34

Slide 35

Slide 35 text

@arafkarsh arafkarsh Redis Setup Standalone & Distributed 35
POM File
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-cache</artifactId>
</dependency>
Standalone Configuration – application.properties
spring.redis.host=127.0.0.1
spring.redis.port=6379
Distributed Configuration – application.properties
spring.redis.cluster.nodes=node1:6379,node2:6380
redis.conf (update this for each Redis instance; nodes.conf is automatically created by Redis when running in cluster mode)
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000

Slide 36

Slide 36 text

@arafkarsh arafkarsh Redis Spring Boot Config 36
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;
import java.time.Duration;

@Configuration
@EnableCaching
public class CacheConfiguration {

    @Bean
    public CacheManager cacheManager(RedisConnectionFactory redisConnectionFactory) {
        RedisCacheConfiguration configuration = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofSeconds(120)); // Set the time-to-live for the cache entries
        return RedisCacheManager.builder(redisConnectionFactory)
                .cacheDefaults(configuration)
                .build();
    }

    @Bean
    public RedisTemplate<String, Object> redisTemplate(RedisConnectionFactory redisConnectionFactory) {
        RedisTemplate<String, Object> redisTemplate = new RedisTemplate<>();
        redisTemplate.setConnectionFactory(redisConnectionFactory);
        return redisTemplate;
    }
}

Slide 37

Slide 37 text

@arafkarsh arafkarsh 2 NoSQL Databases o CAP Theorem o Sharding / Partitioning o Geo Partitioning o Oracle Sharding and Geo Partitioning 37

Slide 38

Slide 38 text

@arafkarsh arafkarsh ACID Vs. BASE 38
# | Property | ACID | BASE
1 | Acronym | Atomicity, Consistency, Isolation, Durability | Basically Available, Soft state, Eventual consistency
2 | Focus | Strong consistency, data integrity, transaction reliability | High availability, partition tolerance, high scalability
3 | Applicability | Traditional relational databases (RDBMS) | Distributed NoSQL databases
4 | Transactions | Ensures all-or-nothing transactions | Allows partial transactions, more flexible
5 | Consistency | Guarantees strong consistency | Supports eventual consistency
6 | Isolation | Ensures transactions are isolated from each other | Transactions may not be fully isolated
7 | Durability | Guarantees data is permanently stored once committed | Data durability may be delayed, relying on eventual consistency
8 | Latency | Higher latency due to stricter consistency constraints | Lower latency due to relaxed consistency constraints
9 | Use Cases | Financial systems, inventory management, etc. | Social networks, recommendation systems, search engines, etc.

Slide 39

Slide 39 text

@arafkarsh arafkarsh CAP Theorem by Eric Allen Brewer 39 Pick Any 2!! Say NO to 2 Phase Commit ☺ Source: https://en.wikipedia.org/wiki/CAP_theorem | http://en.wikipedia.org/wiki/Eric_Brewer_(scientist) CAP 12 years later: How the “Rules have changed” “In a network subject to communication failures, it is impossible for any web service to implement an atomic read / write shared memory that guarantees a response to every request.” Partition Tolerance (Key in Cloud) The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. Consistency Every read receives the most recent write or an error. Availability Every request receives a (non-error) response – without guarantee that it contains the most recent write. Old Single Node RDBMS

Slide 40

Slide 40 text

@arafkarsh arafkarsh Databases that Support CA 40
Aster (Teradata Aster Database): A parallel, distributed, columnar database designed to perform advanced analytics and manage large-scale data. It provides high performance and availability but does not explicitly focus on partition tolerance.
Greenplum: An open-source MPP database based on PostgreSQL, designed for handling large-scale analytical workloads with high performance and availability. Greenplum is also designed for fault tolerance and can recover from failures; however, it does not explicitly focus on partition tolerance.
Vertica: A columnar MPP (Massively Parallel Processing) database designed for high-performance analytics and large-scale data management. Vertica offers high availability through data replication and automated failover, ensuring the system's resilience in case of node failures. However, Vertica does not explicitly focus on partition tolerance.
Traditional RDBMS (Single-Node Implementation): 1. DB2 2. MS SQL 3. MySQL 4. Oracle 5. PostgreSQL

Slide 41

Slide 41 text

@arafkarsh arafkarsh Databases that support both AP / CP 41 1. MongoDB 2. Cassandra 3. Amazon DynamoDB 4. Couchbase 5. Riak 6. ScyllaDB • Network partitions are considered inevitable in modern distributed systems, and most databases and systems now prioritize partition tolerance by default. • The challenge is to find the right balance between consistency and availability in the presence of partitions.

Slide 42

Slide 42 text

@arafkarsh arafkarsh MongoDB: Consistency / Partition Tolerance 42
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class MongoDBExample {
    public static void main(String[] args) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://localhost:27017"))
                .readConcern(ReadConcern.MAJORITY)     // Read only majority-acknowledged data
                .writeConcern(WriteConcern.MAJORITY)   // Wait for a majority of replicas on writes
                .build();

        MongoClient mongoClient = MongoClients.create(settings);
        MongoDatabase exampleDb = mongoClient.getDatabase("healthcare_db");
    }
}

Slide 43

Slide 43 text

@arafkarsh arafkarsh MongoDB: Availability / Partition Tolerance 43
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class MongoDBExample {
    public static void main(String[] args) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://localhost:27017"))
                .readConcern(ReadConcern.LOCAL)   // Read local data without waiting for majority acknowledgement
                .writeConcern(WriteConcern.W1)    // Acknowledge writes after a single node
                .build();

        MongoClient mongoClient = MongoClients.create(settings);
        MongoDatabase exampleDb = mongoClient.getDatabase("healthcare_db");
    }
}

Slide 44

Slide 44 text

@arafkarsh arafkarsh Cassandra: Consistency / Partition Tolerance 44
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class CassandraCPExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("healthcare_keyspace");

        // QUORUM reads and writes favour consistency over availability
        Statement writeStatement = new SimpleStatement(
                "INSERT INTO diagnosis_t (id, value) VALUES (1, 'test')");
        writeStatement.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(writeStatement);

        Statement readStatement = new SimpleStatement(
                "SELECT * FROM diagnosis_t WHERE id = 1");
        readStatement.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(readStatement);
    }
}

Slide 45

Slide 45 text

@arafkarsh arafkarsh Cassandra: Availability / Partition Tolerance 45
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class CassandraAPExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("healthcare_keyspace");

        // ConsistencyLevel.ONE favours availability and low latency over consistency
        Statement writeStatement = new SimpleStatement(
                "INSERT INTO diagnosis_t (id, value) VALUES (1, 'test')");
        writeStatement.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(writeStatement);

        Statement readStatement = new SimpleStatement(
                "SELECT * FROM diagnosis_t WHERE id = 1");
        readStatement.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(readStatement);
    }
}

Slide 46

Slide 46 text

@arafkarsh arafkarsh AWS DynamoDB: Consistency / Partition Tolerance 46
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.spec.GetItemSpec;
import com.amazonaws.services.dynamodbv2.document.spec.PutItemSpec;

public class DynamoDBCPExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        DynamoDB dynamoDB = new DynamoDB(client);

        // Write an item
        PutItemSpec putItemSpec = new PutItemSpec()
                .withItem(new Item().withPrimaryKey("id", 1)
                        .withString("value", "test"));
        dynamoDB.getTable("diagnosis_t").putItem(putItemSpec);

        // Read with a strongly consistent read (CP-like behavior)
        GetItemSpec getItemSpecStronglyConsistent = new GetItemSpec()
                .withPrimaryKey("id", 1)
                .withConsistentRead(true);
        Item itemCP = dynamoDB.getTable("diagnosis_t")
                .getItem(getItemSpecStronglyConsistent);
        System.out.println("CP: " + itemCP);
    }
}

Slide 47

Slide 47 text

@arafkarsh arafkarsh AWS DynamoDB: Availability / Partition Tolerance 47
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.spec.GetItemSpec;
import com.amazonaws.services.dynamodbv2.document.spec.PutItemSpec;

public class DynamoDBAPExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        DynamoDB dynamoDB = new DynamoDB(client);

        // Write an item
        PutItemSpec putItemSpec = new PutItemSpec()
                .withItem(new Item().withPrimaryKey("id", 1)
                        .withString("value", "test"));
        dynamoDB.getTable("diagnosis_t").putItem(putItemSpec);

        // Read with an eventually consistent read (AP-like behavior)
        GetItemSpec getItemSpecEventuallyConsistent = new GetItemSpec()
                .withPrimaryKey("id", 1)
                .withConsistentRead(false);
        Item itemAP = dynamoDB.getTable("diagnosis_t")
                .getItem(getItemSpecEventuallyConsistent);
        System.out.println("AP: " + itemAP);
    }
}

Slide 48

Slide 48 text

@arafkarsh arafkarsh NoSQL Databases 48 2

Slide 49

Slide 49 text

@arafkarsh arafkarsh NoSQL Databases 49
Database | Type | ACID | Query | Use Case
Couchbase | Doc Based, Key Value (Open Source) | Yes | N1QL | Financial Services, Inventory, IoT
Cassandra | Wide Column (Open Source) | No | CQL | Social Analytics, Retail, Messaging
Neo4J | Graph (Open Source / Commercial) | Yes | Cypher | AI, Master Data Mgmt, Fraud Protection
Redis | Key Value (Open Source) | Yes | Many languages | Caching, Queuing
MongoDB | Doc Based (Open Source / Commercial) | Yes | JS | IoT, Real-Time Analytics, Inventory
Amazon DynamoDB | Key Value, Doc Based (Vendor) | Yes | DQL | Gaming, Retail, Financial Services
Source: https://searchdatamanagement.techtarget.com/infographic/NoSQL-database-comparison-to-help-you-choose-the-right-store.

Slide 50

Slide 50 text

@arafkarsh arafkarsh SQL Vs NoSQL 50
Aspect | SQL | NoSQL
Database Type | Relational | Non-Relational
Schema | Pre-Defined | Dynamic Schema
Database Category | Table Based | 1. Documents 2. Key-Value Stores 3. Graph Stores 4. Wide Column Stores
Queries | Complex Queries (Standard SQL for all Relational Databases) | Need to apply a special query language for each type of NoSQL DB
Hierarchical Storage | Not a Good Fit | Perfect
Scalability | Scales well for traditional Applications | Scales well for Modern, heavily data-oriented Applications
Query Language | SQL – Standard Language across all the Databases | Non-standard query language, as each NoSQL DB is different
ACID Support | Yes | For some of the Databases (e.g., MongoDB)
Data Size | Good for traditional Applications | Handles massive amounts of Data for Modern App requirements

Slide 51

Slide 51 text

@arafkarsh arafkarsh SQL Vs NoSQL (MongoDB) 51
1. In MongoDB, transactional properties are scoped at the document level.
2. One or more fields can be atomically written in a single operation.
3. This includes updates to multiple sub-documents, including nested arrays.
4. Any error causes the entire operation to roll back.
5. This is on par with the data integrity guarantees provided by traditional databases.
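To illustrate the document-level atomicity described above, here is a minimal sketch using the MongoDB Java driver (collection, field, and filter names are illustrative assumptions, not from the slides): a top-level field and a nested array element are written in one atomic updateOne call.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.combine;
import static com.mongodb.client.model.Updates.set;

public class SingleDocAtomicUpdate {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> orders = client.getDatabase("healthcare_db")
                .getCollection("orders");

        // These field updates are applied atomically to a single document:
        // either every field is written, or none is.
        orders.updateOne(
                eq("orderId", "xy2123adbcd"),
                combine(
                        set("status", "SHIPPED"),
                        set("items.0.qty", 2)   // update an element inside a nested array
                ));
    }
}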

Slide 52

Slide 52 text

@arafkarsh arafkarsh Multi Table / Doc ACID Transactions 52 Examples – Systems of Record or Line of Business (LoB) Applications 1. Finance 1. Moving funds between Bank Accounts, 2. Payment Processing Systems 3. Trading Platforms 2. Supply Chain • Transferring ownership of Goods & Services through Supply Chains and Booking Systems – Ex. Adding Order and Reducing inventory. 3. Billing System 1. Adding a Call Detail Record and then updating Monthly Plan. Source: ACID Transactions in MongoDB
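Since multi-document ACID transactions are the subject here, a minimal MongoDB Java driver sketch follows (database, collection, and field names are illustrative assumptions; multi-document transactions require a replica set): the order insert and the inventory decrement either both commit or both roll back.

import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.inc;

public class MultiDocTransaction {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("retail_db");

        try (ClientSession session = client.startSession()) {
            session.startTransaction();
            try {
                // Add the order and reduce inventory as one atomic unit
                db.getCollection("orders").insertOne(session,
                        new Document("orderId", "o-1001").append("item", "book1").append("qty", 1));
                db.getCollection("inventory").updateOne(session,
                        eq("item", "book1"), inc("stock", -1));
                session.commitTransaction();
            } catch (RuntimeException e) {
                session.abortTransaction();   // any error rolls back both writes
                throw e;
            }
        }
    }
}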

Slide 53

Slide 53 text

@arafkarsh arafkarsh Redis • Data Structures • Design Patterns 53
In-Memory Databases (ranking)
2020 | 2019 | NoSQL Database | Model
1 | 1 | Redis | Key-Value, Multi Model
2 | 2 | Amazon DynamoDB | Multi Model
3 | 3 | Microsoft Cosmos | Multi Model
4 | 4 | Memcached | Key-Value

Slide 54

Slide 54 text

@arafkarsh arafkarsh Why do you need In-Memory Databases 54
1. Users: 1 Million+
2. Data Volume: Terabytes to Petabytes
3. Locality: Global
4. Performance: Microsecond Latency
5. Request Rate: Millions Per Second
6. Access: Mobile, IoT, Devices
7. Economics: Pay as you go
8. Developer Access: Open API
Source: AWS re:Invent 2020: https://www.youtube.com/watch?v=2WkJeofqIJg

Slide 55

Slide 55 text

@arafkarsh arafkarsh Tables / Docs (JSON) – Why Redis is different 55 • Redis is a multi-data-model key store • Commands operate on keys • The data type of a key can change over time Source: https://www.youtube.com/watch?v=ELk_W9BBTDU

Slide 56

Slide 56 text

@arafkarsh arafkarsh Keys, Values & Data Types 56 movie:StarWars "Sold Out" Key Name Value String Hash List Set Sorted Set Basic Data Types Key Properties • Unique • Binary Safe (Case Sensitive) • Max Size = 512 MB Expiration / TTL • By Default – Keys are retained • Time in Seconds, Milliseconds, Unix Epoch • Added / Removed from Key ➢ SET movie:StarWars "Sold Out" EX 5000 (Expires in 5000 seconds) ➢ PEXPIRE movie:StarWars 5 (Expires in 5 milliseconds) https://redis.io/commands/set

Slide 57

Slide 57 text

@arafkarsh arafkarsh Redis – Remote Dictionary Server 57 Distributed In-Memory Data Store String Standard String data Hash { A: "John Doe", B: "New York", C: "USA" } List [ A -> B -> C -> D -> E ] Set { A, B, C, D, E } Sorted Set { A:10, B:12, C:14, D:20, E:32 } Stream … msg1, msg2, msg3 Pub / Sub … msg1, msg2, msg3 https://redis.io/topics/data-types

Slide 58

Slide 58 text

@arafkarsh arafkarsh Data Type: Hash 58
Key Name: movie:The-Force-Awakens
Fields / Values: Director = "J. J. Abrams", Writer = "L. Kasdan, J. J. Abrams, M. Arndt", Cinematography = "Dan Mindel"
➢ HGET movie:The-Force-Awakens Director
"J. J. Abrams"
• Field & Value Pairs • Single Level • Add and Remove Fields • Set Operations • Intersect • Union
Use Cases • Session Cache • Rate Limiting
https://redis.io/topics/data-types https://redis.io/commands#hash

Slide 59

Slide 59 text

@arafkarsh arafkarsh Data Type: List 59 movies Key Name “Force Awakens, The” “Last Jedi, The” “Rise of Skywalker, The” ➢ LPOP movies “Force Awakens, The” ➢ LPOP movies “Last Jedi, The” ➢ RPOP movies “Rise of Skywalker, The” ➢ RPOP movies “Last Jedi, The” • Ordered List (FIFO or LIFO) • Duplicates Allowed • Elements added from Left or Right or By Position • Max 4 Billion elements per List Type of Lists • Queues • Stacks • Capped List https://redis.io/topics/data-types https://redis.io/commands#list Use Cases • Communication • Activity List

Slide 60

Slide 60 text

@arafkarsh arafkarsh Data Type: Set 60 movies Member / Element “Force Awakens, The” “Last Jedi, The” “Rise of Skywalker, The” ➢ SMEMBERS movies “Force Awakens, The” “Last Jedi, The” “Rise of Skywalker, The” • Un-Ordered List of Unique Elements • Set Operations • Difference • Intersect • Union https://redis.io/topics/data-types https://redis.io/commands#set Key Name Use Cases • Unique Visitors

Slide 61

Slide 61 text

@arafkarsh arafkarsh Data Type: Sorted Set 61 movies Value “Force Awakens, The” “Last Jedi, The” “Rise of Skywalker, The” ➢ ZRANGE movies 0 1 “Last Jedi, The” “Rise of Skywalker, The” • Ordered List of Unique Elements • Set Operations • Intersect • Union https://redis.io/topics/data-types https://redis.io/commands#set Key Name 3 1 2 Score Use Cases • Leaderboard • Priority Queues

Slide 62

Slide 62 text

@arafkarsh arafkarsh Redis: Transactions 62 • Transactions are • Atomic • Isolated • Redis commands are queued • All the queued commands are executed sequentially as an atomic unit ➢ MULTI ➢ SET movie:The-Force-Awakens:Review Good ➢ INCR movie:The-Force-Awakens:Rating ➢ EXEC
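The same MULTI/EXEC flow from Java, as a minimal sketch using the Jedis client (host, port, and key names are illustrative assumptions; other clients such as Lettuce work similarly):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class RedisTransactionExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // MULTI: start queuing commands
            Transaction tx = jedis.multi();
            tx.set("movie:The-Force-Awakens:Review", "Good");
            tx.incr("movie:The-Force-Awakens:Rating");
            // EXEC: run all queued commands sequentially as one atomic unit
            tx.exec();
        }
    }
}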

Slide 63

Slide 63 text

@arafkarsh arafkarsh Redis In-Memory Data Store Use cases 63 Machine Learning Message Queues Gaming Leaderboards Geospatial Session Store Media Streaming Real-time Analytics Caching

Slide 64

Slide 64 text

@arafkarsh arafkarsh Use Case: Sorted Set – Leader Board 64 • Collection of Sorted Distinct Entities • Set Operations and Range Queries based on Score value: John score: 610 value: Jane score: 987 value: Sarah score: 1597 value: Maya score: 144 value: Fred score: 233 value: Ann score: 377 Game Scores ➢ ZADD game:1 987 Jane 1597 Sarah 144 Maya 610 John 377 Ann 233 Fred ➢ ZREVRANGE game:1 0 3 WITHSCORES (Get top 4 scores) • Sarah 1597 • Jane 987 • John 610 • Ann 377 Source: AWS re:Invent 2020: https://www.youtube.com/watch?v=2WkJeofqIJg https://redis.io/commands/zadd

Slide 65

Slide 65 text

@arafkarsh arafkarsh Use Case: Geospatial 65 • Compute distance between members • Find all members within a radius Source: AWS re:Invent 2020: https://www.youtube.com/watch?v=2WkJeofqIJg ➢ GEOADD cities -87.6298 41.8781 Chicago ➢ GEOADD cities -122.3321 47.6062 Seattle ➢ ZRANGE cities 0 -1 • "Chicago" • "Seattle" ➢ GEODIST cities Chicago Seattle mi • "1733.4089" ➢ GEORADIUS cities -122.4194 37.7749 1000 mi WITHDIST • "Seattle" • "679.4848" o m for meters o km for kilometres o mi for miles o ft for feet https://redis.io/commands/geodist

Slide 66

Slide 66 text

@arafkarsh arafkarsh Use Case: Streams 66
• Ordered collection of Data
• Efficient for consuming from the tail
• Multiple Consumers are supported, similar to Kafka: independent consumers and consumer groups read entries, e.g. { "order": "xy2123adbcd", "item": "book1", "qty": 1 }, from the START to the END of the stream
Producer
➢ XADD orderStream * orderId1:item1:qty1
➢ XADD orderStream * orderId2:item1:qty2
(* auto-generates the unique entry ID)
Consumer
➢ XREAD BLOCK 20 STREAMS orderStream $
• orderId2 • item1 • qty2
https://redis.io/commands/xadd

Slide 67

Slide 67 text

@arafkarsh arafkarsh MongoDB: Design Patterns 1. Prefer Embedding 2. Embrace Duplication 3. Know when Not to Embed 4. Relationships and Join 67

Slide 68

Slide 68 text

@arafkarsh arafkarsh MongoDB Docs – Prefer Embedding 68 Use structure to group related data within a document. Include bounded arrays to hold multiple records.

Slide 69

Slide 69 text

@arafkarsh arafkarsh MongoDB Docs – Embrace Duplication 69 Field Info Duplicated from Customer Profile Address Duplicated from Customer Profile

Slide 70

Slide 70 text

@arafkarsh arafkarsh Know When Not to Embed 70 As Item is used outside of Order, you don't need to embed the whole object here; instead, store the Item reference ID (know when not to embed). The item name is duplicated to decouple the order from the Item (Product) service (embrace duplication).

Slide 71

Slide 71 text

@arafkarsh arafkarsh Relationships and Joins 71 Reviews are joined to the Product collection using the Item UUID. Bi-directional joins are also supported.

Slide 72

Slide 72 text

@arafkarsh arafkarsh MongoDB – Tips & Best Practices 72
1. MongoDB will abort any multi-document transaction that runs for more than 60 seconds.
2. No more than 1,000 documents should be modified within a transaction.
3. Developers need to implement retry logic in case a transaction is aborted due to a network error.
4. Transactions that affect multiple shards incur a greater performance cost, as operations are coordinated across multiple participating nodes over the network.
5. Performance will be impacted if a transaction runs against a collection that is subject to rebalancing.

Slide 73

Slide 73 text

@arafkarsh arafkarsh Amazon DynamoDB • DynamoDB Concepts • DynamoDB Design Patterns • Performance

Slide 74

Slide 74 text

@arafkarsh arafkarsh Amazon DynamoDB Concept
[Diagram] A single table holding Customer, Catalogue, Cart, and Order items, each with its own attributes (e.g., Customer: ID, Name, Category, State; Product: ID, Name, Value, Description, Image; Cart: Item ID, Quantity, Value, Currency, User ID + Item ID), identified by a Primary Key.
1. A single table holds multiple entities (Customer, Catalogue, Cart, Order, etc.), aka Items.
2. An Item contains a collection of Attributes.
3. The Primary Key plays a key role in performance, scalability, and avoiding joins (in the typical RDBMS way).
4. The Primary Key contains a Partition Key and an optional Sort Key.
5. The Item data model is JSON, and an Attribute can be a field or a custom object.

Slide 75

Slide 75 text

@arafkarsh arafkarsh DynamoDB – Under the Hood
[Diagram] One single table holding multiple entities with multiple documents (records in RDBMS style): 1 Org record and 2 Employee records.
1. The DynamoDB structure is JSON (document model) – however, it has no resemblance to MongoDB in terms of DB implementation or schema design patterns.
2. Multiple entities are part of the single table, and this helps avoid expensive joins. For example, PK = ORG#Magna retrieves all 3 records: 1 record from the Org entity and 2 records from the Employee entity (see the query sketch below).
3. The Partition Key helps in sharding and horizontal scalability.
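A minimal sketch of that partition-key lookup with the AWS SDK for Java Document API (the table name "app_table" and key attribute "PK" are illustrative assumptions; the ORG#Magna value follows the slide's example):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class SingleTableQueryExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        Table table = new DynamoDB(client).getTable("app_table");   // hypothetical table name

        // One query on the partition key returns the Org record and both Employee records
        QuerySpec spec = new QuerySpec()
                .withKeyConditionExpression("PK = :pk")
                .withValueMap(new ValueMap().withString(":pk", "ORG#Magna"));

        for (Item item : table.query(spec)) {
            System.out.println(item.toJSONPretty());
        }
    }
}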

Slide 76

Slide 76 text

@arafkarsh arafkarsh Neo4J Graph Database 76

Slide 77

Slide 77 text

@arafkarsh arafkarsh Features 77 In a Graph Database, data is represented as nodes (also called vertices) and edges (also called relationships or connections). • Nodes represent entities or objects, while • Edges represent the relationships or connections between those entities. • Both nodes and edges can have properties (key-value pairs) that store additional information about the entities or their relationships. The main components of a Graph Database are: 1. Nodes: The fundamental units representing entities, such as people, products, or locations. 2. Edges: The connections between nodes, representing relationships, such as "friends with," "purchased," or "lives in." 3. Properties: Key-value pairs that store additional information about nodes or edges, such as names, ages, or timestamps.

Slide 78

Slide 78 text

@arafkarsh arafkarsh Advantages 78 1. Flexibility: Graph Databases can quickly adapt to changes in the data model and accommodate the addition or removal of nodes, edges, or properties without significant disruption. 2. Performance: Graph Databases are optimized for querying connected data, allowing for faster traversal of relationships compared to traditional relational databases. 3. Intuitive Representation: Graph Databases represent data in a way that closely mirrors real-world entities and their relationships, making it easier to understand and work with the data.

Slide 79

Slide 79 text

@arafkarsh arafkarsh Example: Healthcare App 79 Nodes: 1. Patients 2. Doctors 3. Hospitals 4. Diagnoses 5. Treatments 6. Medications 7. Insurances Edges: The edges would represent relationships between these entities, such as: 1. Patient 'visited' Doctor 2. Doctor 'works at a' Hospital 3. Patient 'diagnosed with' Diagnosis 4. Diagnosis 'treated with' Treatment 5. Treatment 'involves' Medication 6. Patient 'covered by’ Insurance Properties: Nodes and edges could have properties to store additional information about the entities or their relationships, such as: 1. Patient: name, age, gender, medical history 2. Doctor: name, specialty, experience, ratings 3. Hospital: name, location, facilities, ratings 4. Diagnosis: name, description, prevalence, risk factors 5. Treatment: name, type, duration, success rate 6. Medication: name, dosage, side effects, interactions 7. Insurance: company, coverage, premium, limitations 8. Visited: Date & Time, Doctors Name, Hospital

Slide 80

Slide 80 text

@arafkarsh arafkarsh How the data is represented in Graph DB 80
[Diagram] Nodes with properties and edges with properties:
• Node P1, P2 (Patient): Name, DOB, Phone
• Node D1, D2 (Doctor): Name, Phone, Specialty
• Node L1 (Lab): Name, Phone, Specialty
• Node M1 (Pharmacy): Phone, Type (Internal)
• Edges connect the nodes and carry properties, e.g. p2.d2.v1: Date, Clinic

Slide 81

Slide 81 text

@arafkarsh arafkarsh Sample Code 81
// Create the nodes first
CREATE (p:Patient {patient_id: "P1", name: "John Doe"})
CREATE (d:Doctor {doctor_id: "D1", name: "Dr. Smith"})
CREATE (diag:Diagnosis {diagnosis_id: "Dg1", name: "Flu"})
CREATE (l:Lab {lab_id: "L1", name: "X-Ray Lab"})

// Then create the relationships between them
MATCH (p:Patient {patient_id: "P1"}), (d:Doctor {doctor_id: "D1"})
CREATE (p)-[:VISITED {date: "2023-05-01"}]->(d)

MATCH (p:Patient {patient_id: "P1"}), (diag:Diagnosis {diagnosis_id: "Dg1"})
CREATE (p)-[:DIAGNOSED {date: "2023-05-02"}]->(diag)

MATCH (p:Patient {patient_id: "P1"}), (l:Lab {lab_id: "L1"})
CREATE (p)-[:HAD_XRAY {date: "2023-05-03"}]->(l)
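To tie this back to the Java examples elsewhere in the deck, here is a minimal sketch that runs two of the Cypher statements above through the official Neo4j Java driver (the bolt URI and credentials are illustrative assumptions):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class Neo4jExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Create a Patient node using query parameters
            session.run("CREATE (p:Patient {patient_id: $id, name: $name})",
                    parameters("id", "P1", "name", "John Doe"));

            // Create a VISITED relationship between existing nodes
            session.run("MATCH (p:Patient {patient_id: $id}), (d:Doctor {doctor_id: $docId}) " +
                        "CREATE (p)-[:VISITED {date: $date}]->(d)",
                    parameters("id", "P1", "docId", "D1", "date", "2023-05-01"));
        }
    }
}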

Slide 82

Slide 82 text

@arafkarsh arafkarsh Neo4J – Graph Data Science (GDS) Library 82 1. Graph traversal algorithms: 1. Depth-First Search (DFS) 2. Breadth-First Search (BFS) 2. Shortest path algorithms: 1. Dijkstra's algorithm 2. A* (A-Star) algorithm 3. All Pairs Shortest Path (APSP) 3. Centrality algorithms: 1. Degree Centrality: Computes a node's incoming and outgoing number of relationships. 2. Closeness Centrality: Measures how central a node is to its neighbors. 3. Betweenness Centrality: Measures the importance of a node based on the number of shortest paths passing through it. 4. PageRank: A popular centrality algorithm initially designed to rank web pages based on the idea that more important nodes will likely receive more connections from other nodes.

Slide 83

Slide 83 text

@arafkarsh arafkarsh Neo4J – Graph Data Science (GDS) Library 83 4. Community detection algorithms: 1. Label Propagation: A fast algorithm for detecting communities within a graph based on propagating labels to form clusters. 2. Louvain Modularity: An algorithm for detecting communities by optimizing a modularity score. 3. Weakly Connected Components: Identifies groups of nodes where each node is reachable from any other node within the same group, disregarding the direction of relationships. 5.Similarity algorithms: 1. Jaccard Similarity: Measures the similarity between two sets by comparing their intersection and union. 2. Cosine Similarity: Measures the similarity between two vectors based on the cosine of the angle between them. 3. Pearson Similarity: Measures the similarity between two vectors based on their Pearson correlation coefficient. 6.Pathfinding algorithms: 1. Minimum Weight Spanning Tree: Computes the minimum weight spanning tree for a connected graph, using algorithms like Kruskal's or Prim's.
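As a concrete illustration of calling one of these GDS algorithms, here is a hedged sketch run through the same Java driver (it assumes the GDS plugin is installed and uses GDS 2.x procedure names; the projected graph name, connection details, and the use of a name property are illustrative assumptions):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class PageRankExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Project the whole graph into GDS's in-memory catalog under an illustrative name
            session.run("CALL gds.graph.project('careGraph', '*', '*')");

            // Stream PageRank scores for the projected graph, highest first
            Result result = session.run(
                "CALL gds.pageRank.stream('careGraph') " +
                "YIELD nodeId, score " +
                "RETURN gds.util.asNode(nodeId).name AS name, score " +
                "ORDER BY score DESC");
            result.forEachRemaining(r ->
                System.out.println(r.get("name").asString() + " -> " + r.get("score").asDouble()));
        }
    }
}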

Slide 84

Slide 84 text

@arafkarsh arafkarsh Data Mesh o Introduction to Data Mesh and Key Principles o Problems Data Mesh Solves o Real-World Use Cases for Data Mesh o Case Study: Banking, Retail o Building a Data Mesh 84

Slide 85

Slide 85 text

@arafkarsh arafkarsh Comparison Data Lake / Warehouse / Mart 85
Storage for Raw Data
• Data Lake: Stores raw, unprocessed data in its native format, including structured data, semi-structured data (like logs or XML), and unstructured data (such as emails and documents).
• Warehouse: Stores data that has been processed and structured into a defined schema.
• Data Mart: Also does not store raw data, similar to data warehouses; it stores processed and refined data specific to a particular business function.
Scalability
• Data Lake: Typically built on scalable cloud platforms or Hadoop; data lakes can handle massive volumes of data.
• Warehouse: Moderately scalable; traditionally limited by hardware when on-premises, but modern cloud-based solutions offer considerable scalability.
• Data Mart: Least scalable due to its focused and limited scope, typically designed to serve specific departmental needs.
Performance
• Data Lake: Performance can vary. While it's excellent for big data processing and machine learning tasks, it might not perform as well for quick, ad-hoc query scenarios compared to structured systems.
• Warehouse: Optimized for high performance in query processing, especially for complex queries across large datasets. Designed for speed and efficiency in retrieval operations.
• Data Mart: Generally offers high performance for its limited scope and targeted queries, enabling faster response times for the specific business area it serves.

Slide 86

Slide 86 text

@arafkarsh arafkarsh Comparison Data Lake / Warehouse / Mart 86
Flexibility
• Data Lake: Extremely flexible in terms of the types of data it can store and how data can be used. It allows for the exploration and manipulation of data in various formats.
• Warehouse: Less flexible, as it requires data to fit into a predefined schema, which might limit the types of data that can be easily integrated and queried.
• Data Mart: Also has limited flexibility, tailored to specific business functions with data structured for particular uses.
Purpose
• Data Lake: Ideal for data discovery, data science, and machine learning where access to large and diverse data sets is necessary.
• Warehouse: Designed for business intelligence, analytics, and reporting, where fast, reliable, and consistent data retrieval is crucial.
• Data Mart: Serves specific departmental needs by providing data that is relevant and quickly accessible to business users within a department.
Data Integrity & Consistency
• Data Lake: Data integrity and consistency can be a challenge due to the variety and volume of raw and unprocessed data.
• Warehouse: High integrity and consistency. Data is processed, cleansed, and conformed to ensure reliability and accuracy, which is critical for decision-making processes.
• Data Mart: Similar to data warehouses, data marts ensure high data integrity and consistency within their focused scope, as the data often originates from a data warehouse.

Slide 87

Slide 87 text

@arafkarsh arafkarsh Understanding The Great Divide 87 Source: https://martinfowler.com/articles/data-mesh-principles.html Zhamak Dehghani

Slide 88

Slide 88 text

@arafkarsh arafkarsh Data Mesh in a Nutshell 88 Data mesh is a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments— within or across organizations. Source: Dehghani, Zhamak. Data Mesh (p. 46). O'Reilly Media. Zhamak Dehghani

Slide 89

Slide 89 text

@arafkarsh arafkarsh 4 Principles of Data Mesh 89 1. Domain-Oriented Decentralized Data Ownership and Architecture: Data is managed by domain-specific teams that treat their data as a product. These teams are responsible for their own data pipelines and outputs. 2. Data as a Product: Data is treated as a product with a focus on the consumers' needs. This includes clear documentation, SLAs, and a user-friendly interface for accessing the data. 3. Self-Serve Data Infrastructure as a Platform: This principle aims to empower domain teams by providing them with a self-serve data infrastructure, which helps them handle their data products with minimal central oversight. 4. Federated Computational Governance: Governance is applied across domains through a federated model, ensuring that data quality, security, and access controls are maintained without stifling innovation. Source: Dehghani, Zhamak. Data Mesh (p. 56). O'Reilly Media.

Slide 90

Slide 90 text

@arafkarsh arafkarsh Data Mesh 90 Source: https://www.datamesh-architecture.com/

Slide 91

Slide 91 text

@arafkarsh arafkarsh Problems Data Mesh Solves 91 1. Elimination of Silos: By empowering domain teams to manage their own data, Data Mesh breaks down silos and encourages a more collaborative approach to data management. 2. Scalability Issues: Traditional data platforms often struggle to scale effectively as data volume and complexity grow. Data Mesh's decentralized approach allows more scalable solutions by distributing the data workload. 3. Over-centralization of Data Teams: Centralized data teams can become bottlenecks. Data Mesh decentralizes this by making domain teams responsible for their data, thus distributing workload and responsibility. 4. Adaptability and Agility: Data Mesh allows organizations to be more adaptable and agile in their data strategy, as changes can be made more swiftly and efficiently at the domain level.

Slide 92

Slide 92 text

@arafkarsh arafkarsh Real-World Use Cases 92 1. Financial Services: In a banking scenario, different domains such as loans, credit cards, and customer service can independently manage their data, enabling faster innovation and personalized customer experiences while maintaining compliance through federated governance. 2. E-commerce: Large e-commerce platforms manage diverse data from inventory, sales, customer feedback, and logistics. Each domain can optimize its data management and analytics, improving service delivery and operational efficiency. 3. Healthcare: Different departments such as clinical data, patient records, and insurance processing can manage their data as discrete products, enhancing data privacy, compliance, and patient outcomes through more tailored data usage. 4. Manufacturing: Domains like production, supply chain, and maintenance in a manufacturing enterprise can leverage Data Mesh to optimize their operations independently, using real-time data streaming via Kafka for immediate responsiveness and decision-making.

Slide 93

Slide 93 text

@arafkarsh arafkarsh Banking Data Products 93 1. Customer Segmentation Data Product o Purpose: To segment customers based on various parameters like income, spending habits, life stages, and financial goals to offer personalized banking services. o Contents: Demographics, transaction history, account types, customer interactions, and feedback. o Usage: Marketing campaigns, personalized product offerings, customer retention strategies. 2. Risk Assessment Data Product o Purpose: To assess and predict the risk associated with lending to individuals or businesses. o Contents: Credit scores, repayment history, current financial obligations, economic conditions, and employment status. o Usage: Credit scoring, loan approvals, setting interest rates, provisioning for bad debts. 3. Fraud Detection Data Product o Purpose: To detect and prevent fraudulent transactions in real-time. o Contents: Transaction patterns, flagged transactions, account holder's historical data, IP addresses, and device information. o Usage: Real-time fraud monitoring, alerting systems, and improving security measures.

Slide 94

Slide 94 text

@arafkarsh arafkarsh Banking Data Products 94 4. Regulatory Compliance Data Product o Purpose: To ensure all banking operations comply with local and international regulatory requirements. o Contents: Transaction records, customer data, audit trails, compliance check results. o Usage: Reporting to regulatory bodies, internal audits, compliance monitoring. 5. Investment Insights Data Product o Purpose: To provide customers and bank advisors with insights into investment opportunities. o Contents: Market data, economic indicators, historical investment performance, news feeds, and predictive analytics. o Usage: Investment advisory services, customer dashboards, portfolio management.

Slide 95

Slide 95 text

@arafkarsh arafkarsh Retail Data Products 95 1. Customer Behavior Data Product o Purpose: To understand customer preferences, buying patterns, and engagement across channels. o Contents: Purchase history, online browsing logs, loyalty card data, customer feedback. o Usage: Personalized marketing, product recommendations, customer experience enhancement. 2. Inventory Optimization Data Product o Purpose: To manage stock levels efficiently across all retail outlets and warehouses. o Contents: Stock levels, sales velocity, supplier lead times, seasonal trends. o Usage: Stock replenishment, markdown management, warehouse space optimization. 3. Sales Performance Data Product o Purpose: To track and analyze sales performance across various dimensions such as geography, product line, and time period. o Contents: Sales data, promotional campaigns effectiveness, customer demographics, product returns data.

Slide 96

Slide 96 text

@arafkarsh arafkarsh Retail Data Products 96 4. Supplier Performance Data Product o Purpose: To evaluate and manage supplier relationships based on performance metrics. o Contents: Delivery times, quality metrics, cost analysis, compliance data. o Usage: Supplier negotiations, procurement strategy, risk management. 5. Market Trend Analysis Data Product o Purpose: To capture and analyze market trends and consumer sentiment. o Contents: Social media data, market research reports, competitor analysis, economic indicators. o Usage: New product development, competitive strategy, pricing strategies.

Slide 97

Slide 97 text

@arafkarsh arafkarsh How? 97 From a centralized data team and centralized data to a distributed, decentralized model. Source: https://www.datamesh-architecture.com/

Slide 98

Slide 98 text

@arafkarsh arafkarsh Data Mesh Architecture 98 Source: https://www.datamesh-architecture.com/ A data mesh architecture is a decentralized approach that enables domain teams to perform cross-domain data analysis on their own.

Slide 99

Slide 99 text

@arafkarsh arafkarsh Building Data Mesh: 1 of 4 99 1. Define Requirements and Assess Current Capabilities o Assess Current Data Usage and Needs: Analyze current data flows, storage needs, and processing requirements. Identify pain points in your existing infrastructure. o Forecast Future Needs: Estimate future data growth based on business goals. Consider not only the volume but also the complexity and diversity of data that will need to be managed. o Compliance and Security Needs: Ensure that your infrastructure will comply with applicable data protection regulations (like GDPR, HIPAA, PCI) and security standards. 2. Choose the Right Data Storage Solutions o Diverse Data Storage Options: Use a combination of storage solutions (SQL databases, NoSQL databases, data warehouses, and data lakes) to cater to different types of data and access patterns. o Elastic Scalability: Opt for cloud-based solutions such as Amazon S3, Google Cloud Storage, or Azure Blob Storage for elastic scalability and durability.

Slide 100

Slide 100 text

@arafkarsh arafkarsh Building Data Mesh: 2 of 4 100 3. Implement Data Processing Frameworks o Batch Processing: Implement batch processing systems for large-scale analytics and reporting. Apache Hadoop and Spark are popular choices for handling massive amounts of data with fault tolerance. o Stream Processing: For real-time data processing needs, use tools like Apache Kafka, Apache Flink, and Apache Storm. These tools can handle high throughput and low- latency processing. o Hybrid Processing: Consider hybrid models that combine batch and stream processing for more flexibility. 4. Ensure Data Integration and Orchestration o Data Integration Tools: Use robust ETL (Extract, Transform, Load) tools or more modern ELT approaches to integrate data from various sources. Tools like Apache Kafka Connect, Talend, Apache Nifi, or Stitch can automate these processes. o Workflow Orchestration: Use workflow orchestration tools like Apache Airflow or Dagster to manage dependencies and scheduling of data processing jobs across multiple platforms.
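To make the stream-processing option in step 3 above concrete, here is a minimal Java sketch of publishing domain events into Kafka as the ingestion feed for a data product. The broker address, topic name, and order-event payload are illustrative assumptions, not part of any specific stack named above.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OrderEventIngestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with the domain team's Kafka cluster
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // favor durability over latency for a data product feed

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic owned by the "orders" domain team
            String topic = "orders.order-events.v1";
            String key = "order-4711";
            String value = "{\"orderId\":\"4711\",\"status\":\"CREATED\",\"amount\":99.50}";
            producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Published to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}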

Slide 101

Slide 101 text

@arafkarsh arafkarsh Building Data Mesh: 3 of 4 101 5. Adopt a Microservices Architecture o Decoupled Services: Implement microservices to break down your data infrastructure into smaller, manageable, and independently scalable services. o Containerization: Use Docker containers to encapsulate microservices, making them portable and easier to manage. o Orchestration Platforms: Utilize Kubernetes or Docker Swarm for managing containerized services, ensuring they scale properly with demand. 6. Use Data Management and Monitoring Tools o Data Cataloging: Implement data catalogue tools to manage metadata and ensure data is findable and accessible. Tools like Apache Atlas or Collibra can be useful. o Monitoring and Logging: Use monitoring tools to track the performance and health of your data systems. Service Mesh, Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana) stacks are effective for monitoring and visualizing metrics.

Slide 102

Slide 102 text

@arafkarsh arafkarsh Building Data Mesh: 4 of 4 102 7. Ensure Scalability and Reliability o Load Balancing: Use load balancers to distribute workloads evenly across servers, preventing any single point of failure. (Kubernetes/Kafka/Flink) o Data Redundancy and Backup: Implement data replication and backup strategies to ensure data durability and recoverability. o Scalable Architecture Design: Design your infrastructure to scale out (adding more machines) or scale up (adding more power to existing machines) based on demand. (Kubernetes/Kafka/Flink) 8. Foster a Culture of Continuous Improvement o Regular Audits and Updates: Regularly review and upgrade your infrastructure to incorporate new technologies and improvements. o Training and Development: Keep your team updated with the latest data technologies and best practices through continuous training and development.

Slide 103

Slide 103 text

@arafkarsh arafkarsh Popular Data Mesh Tech Stacks 103 o Google Cloud BigQuery o AWS S3 and Athena o Azure Synapse Analytics o dbt and Snowflake o Databricks (How To Build a Data Product with Databricks) o MinIO and Trino o SAP o Kafka and RisingWave Source: https://www.datamesh-architecture.com/ Data Mesh User Stories Data mesh is primarily an organizational approach, and that's why you can't buy a data mesh from a vendor.

Slide 104

Slide 104 text

@arafkarsh arafkarsh Google Data Mesh Stack 104 BigQuery is the central component for storing analytical data. BigQuery is a columnar data store and can perform efficient JOIN operations on large data sets. BigQuery supports both batch ingestion and streaming ingestion. When the operational system architecture relies on Apache Kafka, streaming through the Kafka Connect Google BigQuery Sink Connector is recommended. Source: Google Cloud BigQuery

Slide 105

Slide 105 text

@arafkarsh arafkarsh Amazon Data Mesh Stack 105 AWS S3 is the central component for storing analytical data. S3 is a file based object store and data can be stored in many formats, such as CSV, JSON, Avro, or Parquet. S3 buckets are used for all stages: raw files, aggregated data, and even data products. Every domain team typically has their own AWS S3 buckets to store their own data products. Analytical queries are executed through AWS Athena that queries data stored in many locations, including files on S3, with standard SQL and performs cross-dataset join operations. Athena uses Presto, a distributed query engine. Source: AWS S3 and Athena

Slide 106

Slide 106 text

@arafkarsh arafkarsh Azure Data Mesh Stack 106 Source: Azure Synapse Analytics Microsoft offers Azure Synapse Analytics, along with both Data Lake Storage Gen2 and SQL database, as the central components for implementing a data mesh architecture.

Slide 107

Slide 107 text

@arafkarsh arafkarsh DBT Snowflake Data Mesh Stack 107 Source: dbt and Snowflake dbt is a framework to transform, clean, and aggregate data within your data warehouse. Transformations are written as plain SQL statements and result in models that are SQL views, materialized views, or tables, without the need to define their structure using DDL upfront. dbt embraces tests to verify data when running any transformation, both for sources and results. Snowflake stores data in tables that are logically organized into databases and schemas.

Slide 108

Slide 108 text

@arafkarsh arafkarsh SAP Data Mesh Stack 108 Source: SAP SAP Datasphere comes with exceptional integration into SAP applications, allowing re-use of the rich business semantics and data entity models for building data products. SAP Datasphere integrates out of the box with SAP S/4HANA tables and supports replication as well as federation. The integration is based on the VDM (virtual data model), which forms the basis for data access in SAP S/4HANA. SAP HANA Cloud and SAP HANA Cloud Data Lake can be fully leveraged for data stored within SAP Datasphere.

Slide 109

Slide 109 text

@arafkarsh arafkarsh Kafka Streaming Data Mesh Stack 109 Source: Kafka and RisingWave Kafka already has its place in many "classical" implementations of data mesh, namely as an ingestion layer for streaming data into data products. This tech stack extends the scope of Kafka far beyond merely serving as an ingestion layer. Here, data products are not just ingested from Kafka; data products live on Kafka, enabling bi-directional interactions between the operational systems and data products. Open-Source Stack: "Classical" data mesh implementations firmly put their data products on the analytical plane, either in data warehouses (such as Snowflake or BigQuery), data lakes (S3, MinIO) or data lakehouses (Databricks).

Slide 110

Slide 110 text

@arafkarsh arafkarsh Challenges of Implementing Data Mesh 110 o Cultural Shift: Adopting Data Mesh requires significant changes in organizational culture and mindset, particularly the shift towards viewing data as a product. o Technical Heterogeneity: Implementing a self-serve data infrastructure that can accommodate diverse technologies and systems across domains can be challenging. o Governance Complexity: Balancing autonomy with oversight requires sophisticated governance mechanisms that can be complex to implement and maintain.

Slide 111

Slide 111 text

@arafkarsh arafkarsh Benefits of Data Mesh 111 o Scalability: By decentralizing data ownership and management, Data Mesh can scale more effectively as organizations grow. o Agility: Domains can quickly adapt and respond to changes and needs within their specific areas, leading to faster innovation. o Enhanced Collaboration: Data Mesh fosters a collaborative environment by encouraging domains to share their data products across the organization, enhancing cross-functional projects and innovation. o Improved Data Quality and Accessibility: With domain experts managing their own data, the overall quality and relevance of data improve, making it more accessible and useful to end users.

Slide 112

Slide 112 text

@arafkarsh arafkarsh Data Mesh Summary 112 o Data Lakes are best for flexible, scalable storage of raw data and are ideal for exploratory work and big data applications. o Data Warehouses excel in performance, data integrity, and consistency, making them suitable for structured business intelligence tasks. o Data Marts provide targeted performance and data consistency, optimized for department-specific analytic needs. Each system has its strengths and ideal use cases, and organizations often benefit from employing a combination of these structures to meet different needs across various aspects of their operations. Data Mesh is an innovative and strategic framework for managing and accessing data across large and complex organizations. It shifts from a centralized model of data management to a decentralized one, treating data as a product and emphasizing domain-specific ownership and accountability. The four principles: Domain-Oriented Decentralized Data, Data as a Product, Self-Serve Data Infrastructure, Federated Computational Governance.

Slide 113

Slide 113 text

@arafkarsh arafkarsh Scalability: Sharding / Partitions • Scale Cube • eBay Case Study • Sharding and Partitions 113 3

Slide 114

Slide 114 text

@arafkarsh arafkarsh Scalability • Scale Cube • eBay Case Study 114

Slide 115

Slide 115 text

@arafkarsh arafkarsh App Scalability based on microservices architecture Source: The NewStack. Based on The Art of Scalability by Martin Abbott & Michael Fisher 115

Slide 116

Slide 116 text

@arafkarsh arafkarsh Scale Cube and Micro Services 116 1. Functional Decomposition 2. Avoid locks by Database Sharding 3. Cloning Services

Slide 117

Slide 117 text

@arafkarsh arafkarsh Scalability Best Practices: Lessons from eBay – Best Practices Highlights #1 Partition By Function • Decouple the unrelated functionalities. • Selling functionality is served by one set of applications, bidding by another, search by yet another. • 16,000 App Servers in 220 different pools • 1,000 logical databases, 400 physical hosts #2 Split Horizontally • Break the workload into manageable units. • eBay’s interactions are stateless by design • All App Servers are treated equal and none retains any transactional state • Data partitioning based on specific requirements #3 Avoid Distributed Transactions • 2-Phase Commit is a pessimistic approach that comes with a big cost • CAP Theorem (Consistency, Availability, Partition Tolerance): you can have at most two at any point in time. • At eBay: no distributed transactions of any kind and NO 2-Phase Commit. #4 Decouple Functions Asynchronously • If Component A calls Component B synchronously, they are tightly coupled. For such systems, to scale A you also need to scale B. • If asynchronous, A can move forward irrespective of the state of B • SEDA (Staged Event-Driven Architecture) #5 Move Processing to Asynchronous Flow • Move as much processing as possible to the asynchronous side • Anything that can wait should wait #6 Virtualize at All Levels • Virtualize everything. eBay created their own O/R layer for abstraction #7 Cache Appropriately • Cache slow-changing, read-mostly data: metadata, configuration, and static data. 117 Source: http://www.infoq.com/articles/ebay-scalability-best-practices

Slide 118

Slide 118 text

@arafkarsh arafkarsh Sharding & Partitions • Horizontal Sharding • Vertical Sharding • Partitioning (Vertical) • Geo Partitioning 118

Slide 119

Slide 119 text

@arafkarsh arafkarsh Sharding / Partitioning 119 Method Scalability Table Sharding Horizontal Rows Same Schema with Uniq Rows Sharding Vertical Columns Different Schema Partition Vertical Rows Same Schema with Uniq Rows 1. Optimize the Database 2. Separate Rows or Columns into multiple smaller tables 3. Each table has either Same Schema with Unique Rows 4. Or has a Schema that is subset of the Original Customer ID Customer Name DOB City 1 ABC Bengaluru 2 DEF Tokyo 3 JHI Kochi 4 KLM Pune Original Table Customer ID Customer Name DOB City 1 ABC Bengaluru 2 DEF Tokyo Customer ID Customer Name DOB City 3 JHI Kochi 4 KLM Pune Horizontal Sharding - 1 Horizontal Sharding - 2 Customer ID Customer Name DOB 1 ABC 2 DEF 3 JHI 4 KLM Customer ID City 1 Bengaluru 2 Tokyo 3 Kochi 4 Pune Vertical Sharding - 1 Vertical Sharding - 2
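As a companion to the table above, a minimal Java sketch of horizontal sharding: rows that share the same schema are routed to one of several smaller databases by hashing the customer ID. The shard count and JDBC URLs are illustrative assumptions, and which shard a given row lands on depends on the hash rather than the fixed row ranges shown in the slide.

import java.util.List;

public class ShardRouter {
    // Hypothetical JDBC URLs, one per horizontal shard holding the same schema
    private final List<String> shardUrls = List.of(
        "jdbc:postgresql://shard1:5432/customers",
        "jdbc:postgresql://shard2:5432/customers");

    /** Route a customer ID to a shard index using a stable hash. */
    public int shardIndexFor(long customerId) {
        return Math.floorMod(Long.hashCode(customerId), shardUrls.size());
    }

    public String shardUrlFor(long customerId) {
        return shardUrls.get(shardIndexFor(customerId));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter();
        for (long id = 1; id <= 4; id++) {
            System.out.println("Customer " + id + " -> " + router.shardUrlFor(id));
        }
    }
}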

Slide 120

Slide 120 text

@arafkarsh arafkarsh Sharding Scenarios 120 1. Horizontal Scaling: Single Server is unable to handle the load even after partitioning the data sets. 2. Data can be partitioned in such a way that specific server(s) can serve the search query based on the partition. For Ex. In an e-Commerce Application Searching the data based on 1. Product Type 2. Product Brand 3. Sellers Region (for Local Shipping) 4. Orders based on Year / Months

Slide 121

Slide 121 text

@arafkarsh arafkarsh Geo Partitioning 121 • Geo-partitioning is the ability to control the location of data at the row level. • CockroachDB lets you control which tables are replicated to which nodes. But with geo-partitioning, you can control which nodes house data with row-level granularity. • This allows you to keep customer data close to the user, which reduces the distance it needs to travel, thereby reducing latency and improving user experience. Source: https://www.cockroachlabs.com/blog/geo-partition-data-reduce-latency/
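A minimal Java sketch of the geo-partitioning idea: keep each customer's rows in the region closest to them by routing on a region value. The region-to-endpoint mapping below is a hypothetical application-level stand-in for what a geo-partitioned database such as CockroachDB does natively at the row level.

import java.util.Map;

public class GeoRouter {
    // Hypothetical regional endpoints; a geo-partitioned database applies this per row internally
    private static final Map<String, String> REGION_TO_ENDPOINT = Map.of(
        "EU",   "jdbc:postgresql://eu-west.db.example.com:5432/app",
        "US",   "jdbc:postgresql://us-east.db.example.com:5432/app",
        "APAC", "jdbc:postgresql://ap-south.db.example.com:5432/app");

    /** Return the endpoint that keeps the customer's data close to the customer. */
    public static String endpointFor(String customerRegion) {
        return REGION_TO_ENDPOINT.getOrDefault(customerRegion, REGION_TO_ENDPOINT.get("US"));
    }

    public static void main(String[] args) {
        System.out.println(endpointFor("EU"));   // EU rows stay in the EU cluster, reducing latency
        System.out.println(endpointFor("APAC"));
    }
}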

Slide 122

Slide 122 text

@arafkarsh arafkarsh Oracle Database – Geo Partitioning 122 Source: https://www.oracle.com/a/tech/docs/sharding-wp-12c.pdf

Slide 123

Slide 123 text

@arafkarsh arafkarsh Oracle Sharding and Geo 123 CREATE SHARDED TABLE customers ( cust_id NUMBER NOT NULL , name VARCHAR2(50) , address VARCHAR2(250) , geo VARCHAR2(20) , class VARCHAR2(3) , signup_date DATE , CONSTRAINT cust_pk PRIMARY KEY(geo, cust_id) ) PARTITIONSET BY LIST (geo) PARTITION BY CONSISTENT HASH (cust_id) PARTITIONS AUTO ( PARTITIONSET AMERICA VALUES ('AMERICA') TABLESPACE SET tbs1, PARTITIONSET ASIA VALUES ('ASIA') TABLESPACE SET tbs2 );
Linear Scalability
Primary Shards | Standby Shards | Read/Write Tx/Second | Read-Only Tx/Second
25 | 25 | 1.18 Million | 1.62 Million
50 | 50 | 2.11 Million | 3.26 Million
75 | 75 | 3.57 Million | 5.05 Million
100 | 100 | 4.38 Million | 6.82 Million
Source: https://www.oracle.com/a/tech/docs/sharding-wp-12c.pdf

Slide 124

Slide 124 text

@arafkarsh arafkarsh Oracle Sharding Compared with Cassandra and MongoDB 124

Slide 125

Slide 125 text

@arafkarsh arafkarsh MongoDB: Cluster 1. Replication 2. Automatic Failover 3. Sharding 125

Slide 126

Slide 126 text

@arafkarsh arafkarsh MongoDB Replication 126 Application (Client App Driver) Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Replication Replication Heartbeat Source: MongoDB Replication https://docs.mongodb.com/manual/replication/ ✓ Provides redundancy and High Availability. ✓ It provides Fault Tolerance, as multiple copies of data on different database servers ensure that the loss of a single database server will not affect the Application. What does the Primary do? 1. Receives all write operations. What do the Secondaries do? 1. Replicate the primary's oplog and 2. Apply the operations to their data sets such that the secondaries' data sets reflect the primary's data set. 3. Secondaries apply the operations to their data sets asynchronously. Replica Set Connection Configuration: mongodb://mongodb0.example.com:27017,mongodb1.example.com:27017,mongodb2.example.com:27017/?replicaSet=myRepl Use Secure Connection: mongodb://myDBReader:D1fficultP%40ssw0rd@mongodb0.example.com:27017
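A minimal Java-driver sketch of connecting to a replica set like the one above and writing through the primary. The connection string mirrors the slide; the database and collection names are illustrative assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ReplicaSetWrite {
    public static void main(String[] args) {
        // All writes go to the primary; the driver discovers it from the replica set members
        String uri = "mongodb://mongodb0.example.com:27017,"
                   + "mongodb1.example.com:27017,"
                   + "mongodb2.example.com:27017/?replicaSet=myRepl";
        try (MongoClient client = MongoClients.create(uri)) {
            MongoCollection<Document> orders = client
                .getDatabase("shop")          // assumed database name
                .getCollection("orders");     // assumed collection name
            orders.insertOne(new Document("orderId", 4711).append("status", "CREATED"));
            System.out.println("Documents: " + orders.countDocuments());
        }
    }
}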

Slide 127

Slide 127 text

@arafkarsh arafkarsh MongoDB Replication: Automatic Failover 127 Source: MongoDB Replication https://docs.mongodb.com/manual/replication/ ✓ If the Primary is NOT reachable more than the configured electionTimeoutMillis (default 10 seconds) then ✓ One of the Secondary will become the Primary after an election process. ✓ Most updated Secondary will become the next Primary. ✓ Election should not take more than 12 seconds to elect a Primary. Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Heartbeat Election for new Primary Replica Set1 (mongos) Primary (mongos) RS 3 (mongos) Secondary Servers Primary Server Heartbeat Election for new Primary Replication ✓ The write Operations will be blocked until the new Primary is selected. ✓ The Secondary Replica Set can serve the Read Operations while the election is in progress provided its configured for that. ✓ MongoDB 4.2+ compatible drivers enable retryable writes by default ✓ MongoDB 4.0 and 3.6-compatible drivers must explicitly enable retryable writes by including retryWrites=true in the connection string.

Slide 128

Slide 128 text

@arafkarsh arafkarsh MongoDB Replication: Arbiter 128 Application (Client App Driver) Replica Set1 (mongos) RS 2 (mongos) Arbiter (mongos) Secondary Servers Primary Server Replication ✓ An Arbiter can be used to save the cost of adding an additional Secondary Server. ✓ Arbiter will handle only the election process to select a Primary. Source: MongoDB Replication https://docs.mongodb.com/manual/replication/

Slide 129

Slide 129 text

@arafkarsh arafkarsh MongoDB Replication: Secondary Reads 129 Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Replication Replication Heartbeat Source: MongoDB Replication https://docs.mongodb.com/manual/core/read-preference/ ✓ Asynchronous replication to secondaries means that reads from secondaries may return data that does not reflect the state of the data on the primary. ✓ Multi-document transactions that contain read operations must use read preference primary. All operations in a given transaction must route to the same member. Write to Primary and Read from Secondary Application (Client App Driver) Read from the Secondary Write mongo ‘mongodb://mongodb0,mongodb1,mongodb2/?replicaSet=rsOmega&readPreference=secondary’ $ >
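A short sketch, assuming the same Java driver, of directing reads to a secondary while writes continue to go to the primary. Because replication is asynchronous, such reads may be slightly stale; whether that is acceptable is a per-query decision. The database and collection names are illustrative.

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SecondaryReads {
    public static void main(String[] args) {
        String uri = "mongodb://mongodb0,mongodb1,mongodb2/?replicaSet=rsOmega";
        try (MongoClient client = MongoClients.create(uri)) {
            MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");

            // Reads routed to a secondary; may lag the primary because replication is asynchronous
            Document latest = orders
                .withReadPreference(ReadPreference.secondary())
                .find()
                .first();
            System.out.println(latest);
        }
    }
}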

Slide 130

Slide 130 text

@arafkarsh arafkarsh MongoDB – Deploy Replica Set 130 mongod --replSet "rsOmega" --bind_ip localhost, $ > replication: replSetName: "rsOmega" net: bindIp: localhost, Config File mongod --config $ > Use Config file to set the Replica Config to each Mongo Instance Use Command Line to set Replica details to each Mongo Instance 1 Source: MongoDB Replication https://docs.mongodb.com/manual/tutorial/deploy-replica-set/

Slide 131

Slide 131 text

@arafkarsh arafkarsh MongoDB – Deploy Replica Set 131 mongo $ > Initiate the Replica Set Connect to Mongo DB 2 > rs.initiate( { _id : "rsOmega", members: [ { _id: 0, host: "mongodb0.host.com:27017" }, { _id: 1, host: "mongodb1.host.com:27017" }, { _id: 2, host: "mongodb2.host.com:27017" } ] }) 3 Run rs.initiate() on just one and only one mongod instance for the replica set. Source: MongoDB Replication https://docs.mongodb.com/manual/tutorial/deploy-replica-set/

Slide 132

Slide 132 text

@arafkarsh arafkarsh MongoDB – Deploy Replica Set 132 mongo ‘mongodb://mongodb0,mongodb1,mongodb2/?replicaSet=rsOmega’ $ > > rs.conf() Show Config Show the Replica Config 4 Source: MongoDB Replication https://docs.mongodb.com/manual/tutorial/deploy-replica-set/ > rs.status() 5 Ensure that the replica set has a primary mongo $ > 6 Connect to the Replica Set

Slide 133

Slide 133 text

@arafkarsh arafkarsh MongoDB Sharding 133 Application (Client App Driver) Config Server (mongos) Config (mongos) Config (mongos) Secondary Servers Primary Server Router (mongos) Router (mongos) Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Shard 1 Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Shard 2 Replica Set1 (mongos) RS 2 (mongos) RS 3 (mongos) Secondary Servers Primary Server Shard 3
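A minimal sketch of enabling sharding through the mongos router from Java. The enableSharding and shardCollection admin commands are standard MongoDB commands; the router host, database, collection, and hashed shard key below are assumptions for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class EnableSharding {
    public static void main(String[] args) {
        // Connect to a mongos router, not directly to a shard
        try (MongoClient client = MongoClients.create("mongodb://mongos0.example.com:27017")) {
            MongoDatabase admin = client.getDatabase("admin");

            // Enable sharding for the (assumed) database
            admin.runCommand(new Document("enableSharding", "shop"));

            // Shard the orders collection on a hashed customer ID for even data distribution
            admin.runCommand(new Document("shardCollection", "shop.orders")
                .append("key", new Document("customerId", "hashed")));
        }
    }
}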

Slide 134

Slide 134 text

@arafkarsh arafkarsh Scalability Best Practices: Lessons from eBay – Best Practices Highlights #1 Partition By Function • Decouple the unrelated functionalities. • Selling functionality is served by one set of applications, bidding by another, search by yet another. • 16,000 App Servers in 220 different pools • 1,000 logical databases, 400 physical hosts #2 Split Horizontally • Break the workload into manageable units. • eBay’s interactions are stateless by design • All App Servers are treated equal and none retains any transactional state • Data partitioning based on specific requirements #3 Avoid Distributed Transactions • 2-Phase Commit is a pessimistic approach that comes with a big cost • CAP Theorem (Consistency, Availability, Partition Tolerance): you can have at most two at any point in time. • At eBay: no distributed transactions of any kind and NO 2-Phase Commit. #4 Decouple Functions Asynchronously • If Component A calls Component B synchronously, they are tightly coupled. For such systems, to scale A you also need to scale B. • If asynchronous, A can move forward irrespective of the state of B • SEDA (Staged Event-Driven Architecture) #5 Move Processing to Asynchronous Flow • Move as much processing as possible to the asynchronous side • Anything that can wait should wait #6 Virtualize at All Levels • Virtualize everything. eBay created their own O/R layer for abstraction #7 Cache Appropriately • Cache slow-changing, read-mostly data: metadata, configuration, and static data. Source: http://www.infoq.com/articles/ebay-scalability-best-practices

Slide 135

Slide 135 text

@arafkarsh arafkarsh 4 Multi-Tenancy o Multi-Tenancy o Compliance o Data Security 135

Slide 136

Slide 136 text

@arafkarsh arafkarsh Multi-Tenancy • Separate Database • Shared Database Separate Schema • Shared Database Shared Schema 136

Slide 137

Slide 137 text

@arafkarsh arafkarsh Type of Multi-Tenancy 137 Tenant B Tenant A App Tenant A Tenant B 1. Separate Database Tenant B Tenant A App Shared DB Tenant A Schema Tenant B Schema 2. Shared Database, Separate Schema Tenant B Tenant A App Shared DB 3. Shared Database, Shared Entity Entity • Tenant A • Tenant B

Slide 138

Slide 138 text

@arafkarsh arafkarsh Hibernate: 1. Separate DB 138 import org.hibernate.SessionFactory; import org.hibernate.boot.registry.StandardServiceRegistryBuilder; import org.hibernate.cfg.Configuration; public class HibernateUtil { public static SessionFactory getSessionFactory(String tenantId) { Configuration configuration = new Configuration(); configuration.configure("hibernate.cfg.xml"); configuration.setProperty("hibernate.connection.url", "jdbc:postgresql://localhost:5432/" + tenantId); configuration.setProperty("hibernate.default_schema", tenantId); StandardServiceRegistryBuilder builder = new StandardServiceRegistryBuilder() .applySettings(configuration.getProperties()); return configuration.buildSessionFactory(builder.build()); } }

Slide 139

Slide 139 text

@arafkarsh arafkarsh Hibernate: 2. Shared DB Separate Schema 139 import org.hibernate.boot.model.naming.Identifier; import org.hibernate.boot.model.naming.PhysicalNamingStrategyStandardImpl; import org.hibernate.engine.jdbc.env.spi.JdbcEnvironment; import org.hibernate.SessionFactory; import org.hibernate.boot.registry.StandardServiceRegistryBuilder; import org.hibernate.cfg.Configuration; public class TenantAwareNamingStrategy extends PhysicalNamingStrategyStandardImpl { private String tenantId; public TenantAwareNamingStrategy(String tenantId) { this.tenantId = tenantId; } @Override public Identifier toPhysicalTableName(Identifier name, JdbcEnvironment context) { String tableName = tenantId + "_" + name.getText(); return new Identifier(tableName, name.isQuoted()); } } public class HibernateUtil { public static SessionFactory getSessionFactory(String tenantId) { Configuration configuration = new Configuration(); configuration.configure("hibernate.cfg.xml"); configuration.setPhysicalNamingStrategy(new TenantAwareNamingStrategy(tenantId)); StandardServiceRegistryBuilder builder = new StandardServiceRegistryBuilder() .applySettings(configuration.getProperties()); return configuration.buildSessionFactory(builder.build()); } }

Slide 140

Slide 140 text

@arafkarsh arafkarsh Hibernate: 3. Shared DB Shared Entity 140 import javax.persistence.Column; import javax.persistence.Entity; import javax.persistence.Table; import org.hibernate.annotations.Filter; import org.hibernate.annotations.FilterDef; import org.hibernate.annotations.ParamDef; import org.hibernate.SessionFactory; import org.hibernate.boot.registry.StandardServiceRegistryBuilder; import org.hibernate.cfg.Configuration; @Entity @Table(name = "employees") @FilterDef(name = "tenantFilter", parameters = {@ParamDef(name = "tenantId", type = "string")}) @Filter(name = "tenantFilter", condition = "tenant_id = :tenantId") public class Employee { @Column(name = "tenant_id", nullable = false) private String tenantId; // other columns, getters and setters } public class HibernateUtil { public static SessionFactory getSessionFactory() { Configuration configuration = new Configuration(); configuration.configure("hibernate.cfg.xml"); StandardServiceRegistryBuilder builder = new StandardServiceRegistryBuilder() .applySettings(configuration.getProperties()); return configuration.buildSessionFactory(builder.build()); } }
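A short usage sketch for the shared-schema approach above: the tenant filter defined on the entity only takes effect once it is enabled on the Hibernate session, so every request has to enable it with the caller's tenant ID. The repository class and query below are illustrative, and a recent Hibernate version (5.2+) is assumed so the Session can be used in try-with-resources.

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class EmployeeRepository {
    public static List<Employee> findAllForTenant(String tenantId) {
        SessionFactory factory = HibernateUtil.getSessionFactory();
        try (Session session = factory.openSession()) {
            // Without enabling the filter, this query would return every tenant's rows
            session.enableFilter("tenantFilter").setParameter("tenantId", tenantId);
            return session.createQuery("from Employee", Employee.class).getResultList();
        }
    }
}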

Slide 141

Slide 141 text

@arafkarsh arafkarsh Multi-Tenancy in MongoDB, Redis, Cassandra 141
1. Separate databases / instances per tenant
o MongoDB: Separate databases per tenant (strong data isolation, individual tenant backup and restoration).
o Redis: Separate instances per tenant (complete data isolation, more resources and management overhead).
o Cassandra: Separate keyspaces per tenant (strong data isolation, simplifies tenant-specific backup and restoration).
2. Shared database / instance with separate collections / tables / namespaces per tenant
o MongoDB: Shared database with separate collections per tenant (balances data isolation and resource utilization).
o Redis: Shared instance with separate namespaces per tenant (some level of data isolation, better resource utilization).
o Cassandra: Shared keyspace with separate tables per tenant (balances data isolation and resource utilization).
3. Shared database / instance and shared collections / tables
o MongoDB: Shared database and shared collections (optimizes resource utilization, requires careful implementation).
o Redis: Not applicable.
o Cassandra: Shared keyspace and shared tables (optimizes resource utilization, requires careful implementation to avoid data leaks).
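A minimal sketch of the first MongoDB option above (separate databases per tenant) using the Java driver; the tenant_<id> naming convention and local connection string are assumptions for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class TenantDatabases {
    private final MongoClient client = MongoClients.create("mongodb://localhost:27017");

    /** Each tenant gets its own database, giving strong data isolation and per-tenant backups. */
    public MongoDatabase databaseFor(String tenantId) {
        return client.getDatabase("tenant_" + tenantId); // assumed naming convention
    }
}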

Slide 142

Slide 142 text

@arafkarsh arafkarsh Compliance • Best Practices • Case study: Health Care App • Case study: eCommerce App 142

Slide 143

Slide 143 text

@arafkarsh arafkarsh Best Practices: Health Care App 143 1. Understand the regulations: Familiarize yourself with HIPAA rules, including the Privacy Rule, Security Rule, and Breach Notification Rule. These rules outline the standards and requirements for protecting sensitive patient data, known as Protected Health Information (PHI). 2. Implement Access Controls: Ensure only authorized users can access PHI by implementing robust authentication mechanisms like multi-factor authentication (MFA), role-based access controls (RBAC), and proper password policies. 3. Encrypt Data: Use encryption for data at rest and in transit. Implement encryption technologies like SSL/TLS for data in transit and AES-256 for data at rest. Store encryption keys securely and separately from the data they protect. 4. Regular Audits and Monitoring: Regularly audit and monitor your systems for security vulnerabilities and potential breaches. Implement logging mechanisms to track access to PHI and use intrusion detection systems to monitor for unauthorized access. 5. Data Backups and Disaster Recovery: Implement a robust data backup and disaster recovery plan to ensure the availability and integrity of PHI in case of data loss or system failures.

Slide 144

Slide 144 text

@arafkarsh arafkarsh Best Practices: Health Care App 144 6. Regular Risk Assessments: Conduct regular risk assessments to identify potential risks to PHI's confidentiality, integrity, and availability. Develop a risk management plan to address these risks and ensure continuous improvement of your security posture. 7. Implement a Privacy Policy: Develop and maintain a privacy policy that clearly outlines how your organization handles and protects PHI. This policy should be easily accessible to users and should be updated regularly to reflect changes in your organization’s practices or regulations. 8. Employee Training: Train employees on HIPAA regulations, your organization’s privacy policy, and security best practices. Regularly update and reinforce this training to ensure continued compliance. 9. Business Associate Agreements (BAAs): Ensure that any third-party vendors, contractors, or partners who have access to PHI sign a Business Associate Agreement (BAA) that outlines their responsibilities for maintaining the privacy and security of PHI. 10. Incident Response Plan: Develop an incident response plan to handle potential data breaches or security incidents. This plan should include procedures for identifying, containing, and mitigating breaches and notifying affected individuals and relevant authorities.

Slide 145

Slide 145 text

@arafkarsh arafkarsh Best Practices: eCommerce App 145 1. Build and Maintain a Secure Network: 1. Install and maintain a firewall configuration to protect cardholder data. 2. Change vendor-supplied defaults for system passwords and other security parameters. 2. Protect Cardholder Data: 1. Encrypt transmission of cardholder data across open, public networks using SSL/TLS or other encryption methods. 2. Avoid storing cardholder data whenever possible. If you must store data, use strong encryption and access controls, and securely delete data when no longer needed. 3. Maintain a Vulnerability Management Program: 1. Regularly update and patch all systems, including operating systems, applications, and security software, to protect against known vulnerabilities. 2. Use and regularly update anti-virus software or programs. 4. Implement Strong Access Control Measures: 1. Restrict access to cardholder data by business need-to-know by implementing role-based access controls (RBAC). 2. Implement robust authentication methods, such as multi-factor authentication (MFA), for all users with access to cardholder data. 3. If applicable, restrict physical access to cardholder data by implementing physical security measures like secure storage, access controls, and surveillance.

Slide 146

Slide 146 text

@arafkarsh arafkarsh Best Practices: eCommerce App 146 5. Regularly Monitor and Test Networks: 1. Track and monitor all access to network resources and cardholder data using logging mechanisms and monitoring tools. 2. Regularly test security systems and processes, including vulnerability scans, penetration tests, and file integrity monitoring. 6. Maintain an Information Security Policy: 1. Develop, maintain, and disseminate a comprehensive information security policy that addresses all PCI-DSS requirements and is reviewed and updated at least annually. 7. Use Tokenization or Third-Party Payment Processors: 1. Consider using tokenization or outsourcing payment processing to a PCI-DSS-compliant third-party provider. This can reduce the scope of compliance and protect cardholder data by minimizing the exposure of sensitive information within your systems. 8. Educate and Train Employees: 1. Train employees on PCI-DSS requirements, your company's security policies, and best practices for handling cardholder data securely. Regularly reinforce and update this training. 9. Regularly Assess and Update Security Measures: 1. Conduct regular risk assessments and security audits to identify potential risks and vulnerabilities. Update your security measures accordingly to maintain compliance and ensure the continued protection of cardholder data.

Slide 147

Slide 147 text

@arafkarsh arafkarsh Data Security • Oracle • PostgreSQL • MySQL 147

Slide 148

Slide 148 text

@arafkarsh arafkarsh Oracle Data Security 148 • Row-level security: Oracle's Virtual Private Database (VPD) feature enables row-level security by adding a dynamic WHERE clause to SQL statements. You can use this feature to restrict access to specific rows based on user roles or attributes. • Column-level security: Oracle provides column-level security through the use of column masking with Data Redaction. This feature allows you to mask sensitive data in specific columns for unauthorized users, ensuring only authorized users can view the data. • Encryption: Oracle's Transparent Data Encryption (TDE) feature automatically encrypts data at rest without requiring changes to the application code. TDE encrypts data within the database files and automatically decrypts it when accessed by authorized users.

Slide 149

Slide 149 text

@arafkarsh arafkarsh PostgreSQL Data Security 149 • Row-level security: PostgreSQL supports row-level security using Row Security Policies. These policies allow you to define access rules for specific rows based on user roles or attributes. • Column-level security: PostgreSQL provides column-level security using column privileges. You can grant or revoke specific privileges (e.g., SELECT, INSERT, UPDATE) on individual columns to control access to sensitive data. • Encryption: PostgreSQL does not have built-in transparent data encryption like Oracle. However, you can use third-party solutions, such as pgcrypto, to encrypt data within the database. You can use file-system level encryption or full-disk encryption for data at rest.
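A minimal sketch of the Row Security Policies mentioned above, executed over JDBC. ENABLE ROW LEVEL SECURITY, CREATE POLICY, and current_setting() are standard PostgreSQL; the table, policy name, credentials, and the app.current_tenant session setting are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PostgresRowLevelSecurity {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app_user", "secret");
             Statement st = conn.createStatement()) {

            // One-time setup: restrict each row to the tenant named in a session setting.
            // Note: policies do not apply to the table owner or superusers unless forced.
            st.execute("ALTER TABLE orders ENABLE ROW LEVEL SECURITY");
            st.execute("CREATE POLICY tenant_isolation ON orders "
                     + "USING (tenant_id = current_setting('app.current_tenant', true))");

            // Per request: identify the tenant, then all queries are filtered automatically
            st.execute("SET app.current_tenant = 'acme'");
            ResultSet rs = st.executeQuery("SELECT count(*) FROM orders");
            if (rs.next()) {
                System.out.println("Visible orders for this tenant: " + rs.getLong(1));
            }
        }
    }
}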

Slide 150

Slide 150 text

@arafkarsh arafkarsh MySQL Data Security 150 • Row-level security: MySQL does not have built-in row-level security features like Oracle and PostgreSQL. However, you can implement row-level security by adding appropriate WHERE clauses to SQL statements in your application code, restricting access to specific rows based on user roles or attributes. • Column-level security: MySQL supports column-level security through column privileges. Like PostgreSQL, you can grant or revoke specific privileges on individual columns to control access to sensitive data. • Encryption: MySQL Enterprise Edition provides Transparent Data Encryption (TDE) to encrypt data at rest automatically. For the Community Edition, you can use file-system level encryption or full-disk encryption to protect data at rest. Data in transit can be encrypted using SSL/TLS.
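Since MySQL has no built-in row-level security, a common pattern is to push the filter into every query from the application, as sketched below with JDBC; the table, column names, and credentials are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlTenantFilter {
    public static void printOrders(String tenantId) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/app", "app_user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 // The WHERE clause enforces row-level access in application code
                 "SELECT order_id, status FROM orders WHERE tenant_id = ?")) {
            ps.setString(1, tenantId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("order_id") + " " + rs.getString("status"));
                }
            }
        }
    }
}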

Slide 151

Slide 151 text

@arafkarsh arafkarsh 151 100s Microservices 1,000s Releases / Day 10,000s Virtual Machines 100K+ User actions / Second 81 M Customers Globally 1 B Time series Metrics 10 B Hours of video streaming every quarter Source: NetFlix: https://www.youtube.com/watch?v=UTKIT6STSVM 10s OPs Engineers 0 NOC 0 Data Centers So what does NetFlix think about DevOps? No DevOps Don't do a lot of Process / Procedures Freedom for Developers & be Accountable Trust people you Hire No Controls / Silos / Walls / Fences Ownership – You Build it, You Run it.

Slide 152

Slide 152 text

@arafkarsh arafkarsh 152 50M Paid Subscribers 100M Active Users 60 Countries Cross Functional Team Full, End to End ownership of features Autonomous 1000+ Microservices Source: https://microcph.dk/media/1024/conference-microcph-2017.pdf 1000+ Tech Employees 120+ Teams

Slide 153

Slide 153 text

@arafkarsh arafkarsh 153 Design Patterns are solutions to general problems that software developers faced during software development. Design Patterns

Slide 154

Slide 154 text

@arafkarsh arafkarsh 154 Thank you DREAM EMPOWER AUTOMATE MOTIVATE India: +91.999.545.8627 https://arafkarsh.medium.com/ https://speakerdeck.com/arafkarsh https://www.linkedin.com/in/arafkarsh/ https://www.youtube.com/user/arafkarsh/playlists http://www.slideshare.net/arafkarsh http://www.arafkarsh.com/ @arafkarsh arafkarsh LinkedIn arafkarsh.com Medium.com Speakerdeck.com

Slide 155

Slide 155 text

@arafkarsh arafkarsh 155 Slides: https://speakerdeck.com/arafkarsh Blogs https://arafkarsh.medium.com/ Web: https://arafkarsh.com/ Source: https://github.com/arafkarsh

Slide 156

Slide 156 text

@arafkarsh arafkarsh 156 Slides: https://speakerdeck.com/arafkarsh

Slide 157

Slide 157 text

@arafkarsh arafkarsh References 157 1. July 15, 2015 – Agile is Dead : GoTo 2015 By Dave Thomas 2. Apr 7, 2016 - Agile Project Management with Kanban | Eric Brechner | Talks at Google 3. Sep 27, 2017 - Scrum vs Kanban - Two Agile Teams Go Head-to-Head 4. Feb 17, 2019 - Lean vs Agile vs Design Thinking 5. Dec 17, 2020 - Scrum vs Kanban | Differences & Similarities Between Scrum & Kanban 6. Feb 24, 2021 - Agile Methodology Tutorial for Beginners | Jira Tutorial | Agile Methodology Explained. Agile Methodologies

Slide 158

Slide 158 text

@arafkarsh arafkarsh References 158 1. Vmware: What is Cloud Architecture? 2. Redhat: What is Cloud Architecture? 3. Cloud Computing Architecture 4. Cloud Adoption Essentials: 5. Google: Hybrid and Multi Cloud 6. IBM: Hybrid Cloud Architecture Intro 7. IBM: Hybrid Cloud Architecture: Part 1 8. IBM: Hybrid Cloud Architecture: Part 2 9. Cloud Computing Basics: IaaS, PaaS, SaaS 1. IBM: IaaS Explained 2. IBM: PaaS Explained 3. IBM: SaaS Explained 4. IBM: FaaS Explained 5. IBM: What is Hypervisor? Cloud Architecture

Slide 159

Slide 159 text

@arafkarsh arafkarsh References 159 Microservices 1. Microservices Definition by Martin Fowler 2. When to use Microservices By Martin Fowler 3. GoTo: Sep 3, 2020: When to use Microservices By Martin Fowler 4. GoTo: Feb 26, 2020: Monolith Decomposition Pattern 5. Thought Works: Microservices in a Nutshell 6. Microservices Prerequisites 7. What do you mean by Event Driven? 8. Understanding Event Driven Design Patterns for Microservices

Slide 160

Slide 160 text

@arafkarsh arafkarsh References – Microservices – Videos 160 1. Martin Fowler – Micro Services : https://www.youtube.com/watch?v=2yko4TbC8cI&feature=youtu.be&t=15m53s 2. GOTO 2016 – Microservices at NetFlix Scale: Principles, Tradeoffs & Lessons Learned. By R Meshenberg 3. Mastering Chaos – A NetFlix Guide to Microservices. By Josh Evans 4. GOTO 2015 – Challenges Implementing Micro Services By Fred George 5. GOTO 2016 – From Monolith to Microservices at Zalando. By Rodrigue Scaefer 6. GOTO 2015 – Microservices @ Spotify. By Kevin Goldsmith 7. Modelling Microservices @ Spotify : https://www.youtube.com/watch?v=7XDA044tl8k 8. GOTO 2015 – DDD & Microservices: At last, Some Boundaries By Eric Evans 9. GOTO 2016 – What I wish I had known before Scaling Uber to 1000 Services. By Matt Ranney 10. DDD Europe – Tackling Complexity in the Heart of Software By Eric Evans, April 11, 2016 11. AWS re:Invent 2016 – From Monolithic to Microservices: Evolving Architecture Patterns. By Emerson L, Gilt D. Chiles 12. AWS 2017 – An overview of designing Microservices based Applications on AWS. By Peter Dalbhanjan 13. GOTO Jun, 2017 – Effective Microservices in a Data Centric World. By Randy Shoup. 14. GOTO July, 2017 – The Seven (more) Deadly Sins of Microservices. By Daniel Bryant 15. Sept, 2017 – Airbnb, From Monolith to Microservices: How to scale your Architecture. By Melanie Cubula 16. GOTO Sept, 2017 – Rethinking Microservices with Stateful Streams. By Ben Stopford. 17. GOTO 2017 – Microservices without Servers. By Glynn Bird.

Slide 161

Slide 161 text

@arafkarsh arafkarsh References 161 Domain Driven Design 1. Oct 27, 2012 What I have learned about DDD Since the book. By Eric Evans 2. Mar 19, 2013 Domain Driven Design By Eric Evans 3. Jun 02, 2015 Applied DDD in Java EE 7 and Open Source World 4. Aug 23, 2016 Domain Driven Design the Good Parts By Jimmy Bogard 5. Sep 22, 2016 GOTO 2015 – DDD & REST Domain Driven API’s for the Web. By Oliver Gierke 6. Jan 24, 2017 Spring Developer – Developing Micro Services with Aggregates. By Chris Richardson 7. May 17. 2017 DEVOXX – The Art of Discovering Bounded Contexts. By Nick Tune 8. Dec 21, 2019 What is DDD - Eric Evans - DDD Europe 2019. By Eric Evans 9. Oct 2, 2020 - Bounded Contexts - Eric Evans - DDD Europe 2020. By. Eric Evans 10. Oct 2, 2020 - DDD By Example - Paul Rayner - DDD Europe 2020. By Paul Rayner

Slide 162

Slide 162 text

@arafkarsh arafkarsh References 162 Event Sourcing and CQRS 1. IBM: Event Driven Architecture – Mar 21, 2021 2. Martin Fowler: Event Driven Architecture – GOTO 2017 3. Greg Young: A Decade of DDD, Event Sourcing & CQRS – April 11, 2016 4. Nov 13, 2014 GOTO 2014 – Event Sourcing. By Greg Young 5. Mar 22, 2016 Building Micro Services with Event Sourcing and CQRS 6. Apr 15, 2016 YOW! Nights – Event Sourcing. By Martin Fowler 7. May 08, 2017 When Micro Services Meet Event Sourcing. By Vinicius Gomes

Slide 163

Slide 163 text

@arafkarsh arafkarsh References 163 Kafka 1. Understanding Kafka 2. Understanding RabbitMQ 3. IBM: Apache Kafka – Sept 18, 2020 4. Confluent: Apache Kafka Fundamentals – April 25, 2020 5. Confluent: How Kafka Works – Aug 25, 2020 6. Confluent: How to integrate Kafka into your environment – Aug 25, 2020 7. Kafka Streams – Sept 4, 2021 8. Kafka: Processing Streaming Data with KSQL – Jul 16, 2018 9. Kafka: Processing Streaming Data with KSQL – Nov 28, 2019

Slide 164

Slide 164 text

@arafkarsh arafkarsh References 164 Databases: Big Data / Cloud Databases 1. Google: How to Choose the right database? 2. AWS: Choosing the right Database 3. IBM: NoSQL Vs. SQL 4. A Guide to NoSQL Databases 5. How does NoSQL Databases Work? 6. What is Better? SQL or NoSQL? 7. What is DBaaS? 8. NoSQL Concepts 9. Key Value Databases 10. Document Databases 11. Jun 29, 2012 – Google I/O 2012 - SQL vs NoSQL: Battle of the Backends 12. Feb 19, 2013 - Introduction to NoSQL • Martin Fowler • GOTO 2012 13. Jul 25, 2018 - SQL vs NoSQL or MySQL vs MongoDB 14. Oct 30, 2020 - Column vs Row Oriented Databases Explained 15. Dec 9, 2020 - How do NoSQL databases work? Simply Explained! 1. Graph Databases 2. Column Databases 3. Row Vs. Column Oriented Databases 4. Database Indexing Explained 5. MongoDB Indexing 6. AWS: DynamoDB Global Indexing 7. AWS: DynamoDB Local Indexing 8. Google Cloud Spanner 9. AWS: DynamoDB Design Patterns 10. Cloud Provider Database Comparisons 11. CockroachDB: When to use a Cloud DB?

Slide 165

Slide 165 text

@arafkarsh arafkarsh References 165 Docker / Kubernetes / Istio 1. IBM: Virtual Machines and Containers 2. IBM: What is a Hypervisor? 3. IBM: Docker Vs. Kubernetes 4. IBM: Containerization Explained 5. IBM: Kubernetes Explained 6. IBM: Kubernetes Ingress in 5 Minutes 7. Microsoft: How Service Mesh works in Kubernetes 8. IBM: Istio Service Mesh Explained 9. IBM: Kubernetes and OpenShift 10. IBM: Kubernetes Operators 11. 10 Consideration for Kubernetes Deployments Istio – Metrics 1. Istio – Metrics 2. Monitoring Istio Mesh with Grafana 3. Visualize your Istio Service Mesh 4. Security and Monitoring with Istio 5. Observing Services using Prometheus, Grafana, Kiali 6. Istio Cookbook: Kiali Recipe 7. Kubernetes: Open Telemetry 8. Open Telemetry 9. How Prometheus works 10. IBM: Observability vs. Monitoring

Slide 166

Slide 166 text

@arafkarsh arafkarsh References 166 1. Feb 6, 2020 – An introduction to TDD 2. Aug 14, 2019 – Component Software Testing 3. May 30, 2020 – What is Component Testing? 4. Apr 23, 2013 – Component Test By Martin Fowler 5. Jan 12, 2011 – Contract Testing By Martin Fowler 6. Jan 16, 2018 – Integration Testing By Martin Fowler 7. Testing Strategies in Microservices Architecture 8. Practical Test Pyramid By Ham Vocke Testing – TDD / BDD

Slide 167

Slide 167 text

@arafkarsh arafkarsh 167 1. Simoorg : LinkedIn’s own failure inducer framework. It was designed to be easy to extend and most of the important components are pluggable. 2. Pumba : A chaos testing and network emulation tool for Docker. 3. Chaos Lemur : Self-hostable application to randomly destroy virtual machines in a BOSH-managed environment, as an aid to resilience testing of high-availability systems. 4. Chaos Lambda : Randomly terminate AWS ASG instances during business hours. 5. Blockade : Docker-based utility for testing network failures and partitions in distributed applications. 6. Chaos-http-proxy : Introduces failures into HTTP requests via a proxy server. 7. Monkey-ops : Monkey-Ops is a simple service implemented in Go, which is deployed into an OpenShift V3.X and generates some chaos within it. Monkey-Ops seeks some OpenShift components like Pods or Deployment Configs and randomly terminates them. 8. Chaos Dingo : Chaos Dingo currently supports performing operations on Azure VMs and VMSS deployed to an Azure Resource Manager-based resource group. 9. Tugbot : Testing in Production (TiP) framework for Docker. Testing tools

Slide 168

Slide 168 text

@arafkarsh arafkarsh References 168 CI / CD 1. What is Continuous Integration? 2. What is Continuous Delivery? 3. CI / CD Pipeline 4. What is CI / CD Pipeline? 5. CI / CD Explained 6. CI / CD Pipeline using Java Example Part 1 7. CI / CD Pipeline using Ansible Part 2 8. Declarative Pipeline vs Scripted Pipeline 9. Complete Jenkins Pipeline Tutorial 10. Common Pipeline Mistakes 11. CI / CD for a Docker Application

Slide 169

Slide 169 text

@arafkarsh arafkarsh References 169 DevOps 1. IBM: What is DevOps? 2. IBM: Cloud Native DevOps Explained 3. IBM: Application Transformation 4. IBM: Virtualization Explained 5. What is DevOps? Easy Way 6. DevOps?! How to become a DevOps Engineer??? 7. Amazon: https://www.youtube.com/watch?v=mBU3AJ3j1rg 8. NetFlix: https://www.youtube.com/watch?v=UTKIT6STSVM 9. DevOps and SRE: https://www.youtube.com/watch?v=uTEL8Ff1Zvk 10. SLI, SLO, SLA : https://www.youtube.com/watch?v=tEylFyxbDLE 11. DevOps and SRE : Risks and Budgets : https://www.youtube.com/watch?v=y2ILKr8kCJU 12. SRE @ Google: https://www.youtube.com/watch?v=d2wn_E1jxn4

Slide 170

Slide 170 text

@arafkarsh arafkarsh References 170 1. Lewis, James, and Martin Fowler. “Microservices: A Definition of This New Architectural Term”, March 25, 2014. 2. Miller, Matt. “Innovate or Die: The Rise of Microservices”. The Wall Street Journal, October 5, 2015. 3. Newman, Sam. Building Microservices. O’Reilly Media, 2015. 4. Alagarasan, Vijay. “Seven Microservices Anti-patterns”, August 24, 2015. 5. Cockcroft, Adrian. “State of the Art in Microservices”, December 4, 2014. 6. Fowler, Martin. “Microservice Prerequisites”, August 28, 2014. 7. Fowler, Martin. “Microservice Tradeoffs”, July 1, 2015. 8. Humble, Jez. “Four Principles of Low-Risk Software Release”, February 16, 2012. 9. Zuul Edge Server, Ketan Gote, May 22, 2017 10. Ribbon, Hysterix using Spring Feign, Ketan Gote, May 22, 2017 11. Eureka Server with Spring Cloud, Ketan Gote, May 22, 2017 12. Apache Kafka, A Distributed Streaming Platform, Ketan Gote, May 20, 2017 13. Functional Reactive Programming, Araf Karsh Hamid, August 7, 2016 14. Enterprise Software Architectures, Araf Karsh Hamid, July 30, 2016 15. Docker and Linux Containers, Araf Karsh Hamid, April 28, 2015

Slide 171

Slide 171 text

@arafkarsh arafkarsh References 171 16. MSDN – Microsoft https://msdn.microsoft.com/en-us/library/dn568103.aspx 17. Martin Fowler : CQRS – http://martinfowler.com/bliki/CQRS.html 18. Udi Dahan : CQRS – http://www.udidahan.com/2009/12/09/clarified-cqrs/ 19. Greg Young : CQRS - https://www.youtube.com/watch?v=JHGkaShoyNs 20. Bertrand Meyer – CQS - http://en.wikipedia.org/wiki/Bertrand_Meyer 21. CQS : http://en.wikipedia.org/wiki/Command–query_separation 22. CAP Theorem : http://en.wikipedia.org/wiki/CAP_theorem 23. CAP Theorem : http://www.julianbrowne.com/article/viewer/brewers-cap-theorem 24. CAP 12 years how the rules have changed 25. EBay Scalability Best Practices : http://www.infoq.com/articles/ebay-scalability-best-practices 26. Pat Helland (Amazon) : Life beyond distributed transactions 27. Stanford University: Rx https://www.youtube.com/watch?v=y9xudo3C1Cw 28. Princeton University: SAGAS (1987) Hector Garcia Molina / Kenneth Salem 29. Rx Observable : https://dzone.com/articles/using-rx-java-observable