Suuchi - FifthElephant, Bengaluru 2017

Suuchi: a toolkit to build distributed systems

Sriram Ramachandrasekaran, Principal Engineer, Indix (https://github.com/brewkode)

950M+ Products, 50K+ Brands, 2B+ Offers, 7.5K+ Categories

Data Pipeline @ Indix
Crawl → Parse → Dedup → Classify → Extract → Match → Index

Desirable Properties
• Handle scale - order of TBs
• Fault tolerant
• Ease of operations - fewer moving parts

Traditionally...
• Tiered architecture
• Scale individual tiers
  ◦ Web Tier
  ◦ Service Tier
• Until...

Essentially, we are looking to Scale data systems

Rise of KV stores
• BigTable, 2006
• Dynamo, 2007
• Cassandra, 2008
• Voldemort, 2009
Distributed, replicated, fault-tolerant, sorted*

[Diagram: three Service boxes on top of one shared Distributed Data Store, annotated "Latency"]

Distributed Service
• Data locality kills latency
• Increases application complexity

Just having a distributed store isn’t enough! We need something
more...

Boils down to...
Distributed Data Store + CoProcessors (Bigtable / HBase)
…run arbitrary code “next” to each shard

Distributed Data Store + CoProcessors (Bigtable / HBase)
- Business logic upgrade is painful
- CoProcessors are not services, more an afterthought
- Failure semantics are not well established
- More applications means multiple coprocs or a single bloated coproc
- Noisy neighbours / impedance due to a shared datastore

Applications need to OWN Scaling

In-house vs Off-the-shelf

                In-house                     Off-the-shelf
Features        Subset                       Superset
Moving parts    Fully controllable           Community controlled
Ownership       Implicit                     Acquired / Cultural
Upfront cost    High                         Low
Expertise       Hired / Retained / Nurtured  Community

[Diagram: building blocks of a distributed system, shown with requests for key=”foo”, key=”bar”, key=”baz”]
• Communication
• Request Routing
• Replication (Sync / Async)
• Data Sharding
• Cluster Membership

Introducing Suuchi
DIY kit for building distributed systems
github.com/ashwanthkumar/suuchi

Suuchi provides support for...
- underlying communication channel
- routing queries to the appropriate member
- detecting your cluster members
- replicating your data based on your strategy
- local state via an embedded KV store per node (optionally)

Communication
+ HandleOrForward
+ Scatter Gather
Uses HTTP/2 with streaming
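
The two patterns named on this slide can be illustrated with a minimal Python sketch. This is not Suuchi's actual API: the `Node` class, its `owner_of` placement rule and the in-process "forwarding" are all assumptions made for illustration, and a real system would forward over the network.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


class Node:
    """Hypothetical cluster member holding a local key-value shard."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # shared dict: node name -> Node
        self.store = {}

    def owner_of(self, key):
        # Assumption: naive modulo placement, a stand-in for a real routing strategy.
        names = sorted(self.cluster)
        return names[zlib.crc32(key.encode()) % len(names)]

    def put(self, key, value):
        owner = self.owner_of(key)
        if owner == self.name:
            self.store[key] = value  # handle locally
            return self.name
        return self.cluster[owner].put(key, value)  # forward to the owner

    def get(self, key):
        owner = self.owner_of(key)
        if owner == self.name:
            return self.store.get(key)
        return self.cluster[owner].get(key)

    def local_count(self):
        return len(self.store)

    def total_count(self):
        # Scatter the query to every member in parallel, then gather and merge.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda n: n.local_count(), self.cluster.values()))
        return sum(partials)


cluster = {}
for name in ("node1", "node2", "node3"):
    cluster[name] = Node(name, cluster)

entry = cluster["node1"]  # any node can act as the entry point
for key in ("foo", "bar", "baz"):
    entry.put(key, key.upper())
```

Any node can answer `get("foo")` because a request is either handled locally or forwarded to the owner, while `total_count()` needs every shard and therefore fans out to all members.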

Sharding / Routing
+ Consistent Hash Ring
- Your own sharding technique?
[Diagram: hash ring with node 1, node 2, node 3, node 4]
Reference: "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web"
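
A consistent hash ring can be sketched in a few lines of Python. This is an illustrative sketch, not Suuchi's implementation: the class name, the use of MD5, and the choice of 64 virtual nodes per member are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes; each node owns several points (virtual nodes) on the ring."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # First 8 bytes of MD5 as an integer position on the ring.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def find(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node1", "node2", "node3", "node4"])
owner = ring.find("foo")  # one of the four nodes
```

The useful property is that removing a node only remaps the keys that node owned; keys on the other nodes stay where they are.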

Membership
• static: scaling up/down needs downtime of the system
• dynamic: fault tolerance in case of node/process failure
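
The contrast between the two membership modes can be sketched as follows. The deck does not say how members are detected, so the heartbeat-with-timeout scheme and both class names here are assumptions chosen for illustration.

```python
class StaticMembership:
    """Member list fixed at startup; changing it means restarting the system."""

    def __init__(self, nodes):
        self._nodes = tuple(nodes)

    def members(self, now=None):
        return list(self._nodes)


class HeartbeatMembership:
    """Dynamic membership: a node is alive while its heartbeats keep arriving."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}  # node -> time of the last heartbeat

    def heartbeat(self, node, now):
        self._last_seen[node] = now

    def members(self, now):
        # Nodes whose last heartbeat is older than the timeout are dropped.
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen <= self.timeout
        )


membership = HeartbeatMembership(timeout=5)
membership.heartbeat("node1", now=0)
membership.heartbeat("node2", now=0)
membership.heartbeat("node1", now=4)  # node2 stops sending heartbeats
```

Time is passed in explicitly so the behaviour is deterministic; a real implementation would read a clock and exchange heartbeats over the network.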

Replication
• sync: every request is successful only if all the replicas succeeded
• async: provides very high availability for write systems at the cost of eventual consistency
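
A minimal sketch of the two replication modes, assuming in-process `Replica` objects instead of remote nodes. All names here are hypothetical and are not taken from Suuchi.

```python
import queue
import threading


class Replica:
    """Hypothetical replica; healthy=False simulates a failed node."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.data = {}

    def write(self, key, value):
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def sync_put(replicas, key, value):
    # Synchronous: attempt every replica; succeed only if all of them acknowledge.
    results = [replica.write(key, value) for replica in replicas]
    return all(results)


class AsyncReplicator:
    """Asynchronous: acknowledge after the primary write, replicate in the background."""

    def __init__(self, primary, followers):
        self.primary = primary
        self.followers = followers
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        ok = self.primary.write(key, value)
        if ok:
            self._pending.put((key, value))  # followers catch up later
        return ok

    def _drain(self):
        while True:
            key, value = self._pending.get()
            for follower in self.followers:
                follower.write(key, value)
            self._pending.task_done()

    def flush(self):
        # Block until every queued write has been applied to the followers.
        self._pending.join()
```

With `sync_put`, one unhealthy replica fails the whole request. With `AsyncReplicator`, the write is acknowledged immediately and followers may briefly lag, which is the eventual-consistency trade-off.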

Storage
+ KeyValue
+ RocksDB: embedded KV store from FB for server workloads
- Your own abstraction?
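
To make the "your own abstraction?" idea concrete, here is a sketch of a small key-value interface with an in-memory implementation standing in for an embedded store. The interface (`get`, `put`, `remove`, `scan`) is hypothetical and is not Suuchi's or RocksDB's API; keys are kept sorted so that prefix scans are possible.

```python
from abc import ABC, abstractmethod
import bisect


class KeyValueStore(ABC):
    """Hypothetical per-node storage interface over byte keys and values."""

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def remove(self, key): ...

    @abstractmethod
    def scan(self, prefix): ...


class InMemoryStore(KeyValueStore):
    """Stand-in for an embedded store; keeps keys in sorted order."""

    def __init__(self):
        self._keys = []  # sorted list of keys
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def remove(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, prefix):
        # Jump to the first key >= prefix, then walk while the prefix matches.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._data[key]


store = InMemoryStore()
store.put(b"url:b", b"<html>b</html>")
store.put(b"url:a", b"<html>a</html>")
store.put(b"ts:1", b"2017")
```

Application code written against `KeyValueStore` would not need to change if the in-memory class were swapped for a wrapper around an embedded store.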

Getting started
• gRPC Service using Protobuf2
• Generate stubs & implement them
• Connect using Suuchi “Server” abstraction

Server Abstraction
• Pluggable membership mechanism
• Pluggable routing strategy
• Pluggable replication method
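
The idea of three plug points can be sketched as a server that is handed a membership function, a routing function and a replication function. Everything below is a hypothetical illustration in Python, not Suuchi's "Server" abstraction; the `stores` dict stands in for remote nodes.

```python
import zlib


class Server:
    """Hypothetical server assembled from three pluggable strategies."""

    def __init__(self, membership, routing, replication):
        self.membership = membership    # () -> list of node names
        self.routing = routing          # (key, nodes) -> target node names
        self.replication = replication  # (targets, key, value) -> bool

    def put(self, key, value):
        nodes = self.membership()
        targets = self.routing(key, nodes)
        return self.replication(targets, key, value)


def static_membership(nodes):
    return lambda: list(nodes)


def hash_routing(replicas):
    # Pick `replicas` consecutive nodes, starting at a hash-derived offset.
    def route(key, nodes):
        start = zlib.crc32(key.encode()) % len(nodes)
        count = min(replicas, len(nodes))
        return [nodes[(start + i) % len(nodes)] for i in range(count)]
    return route


def write_all(stores):
    # Synchronous replication: write to every target, report overall success.
    def replicate(targets, key, value):
        for node in targets:
            stores[node][key] = value
        return True
    return replicate


nodes = ["node1", "node2", "node3", "node4"]
stores = {node: {} for node in nodes}  # stand-in for per-node storage
server = Server(static_membership(nodes), hash_routing(replicas=2), write_all(stores))
server.put("foo", "1")
```

Swapping any one function (for example a different routing function) changes that behaviour without touching the others, which is the point of keeping the three concerns pluggable.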

Suuchi @ Indix
• HTML Archive
  ◦ Handles 1000+ tps - write heavy system
  ◦ Stores 120 TB of url & timestamp indexed HTML pages
• Stats Aggregation System
  ◦ Approximate real-time aggregates
  ◦ Timeline & windowed queries
• Real time scheduler for our Crawlers
  ◦ Prioritising which next batch of urls to crawl
  ◦ Helps crawl 20+ million urls per day

Thank you
