Why we built a distributed system - DSConf, Pune 2018
In the first edition of DSConf, we spoke about why we built an in-house distributed system and how Suuchi, a toolkit for building such systems, evolved.
Distributed Data Store + CoProcessors (Bigtable / HBase)
- Business logic upgrades are painful
- CoProcessors are not services; they are more of an afterthought
- Failure semantics are not well established
- More applications means multiple coprocessors, or a single bloated coprocessor
- Noisy neighbours / impedance due to a shared datastore
In-house vs Off-the-shelf

|              | In-house                    | Off-the-shelf        |
|--------------|-----------------------------|----------------------|
| Features     | Subset                      | Superset             |
| Moving parts | Fully controllable          | Community controlled |
| Ownership    | Implicit                    | Acquired / Cultural  |
| Upfront cost | High                        | Low                  |
| Expertise    | Hired / Retained / Nurtured | Community            |
Primitives in a Distributed System
- Communication
- Request Routing
- Data Sharding
- Replication (Sync / Async)
- Cluster Membership
(figure: keys "foo", "bar", "baz" being routed to cluster members)
Suuchi provides support for ...
- the underlying communication channel
- routing queries to the appropriate member
- detecting your cluster members
- replicating your data based on your strategy
- local state via an embedded KV store per node (optionally)
github.com/ashwanthkumar/suuchi
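To make the list above concrete, here is a minimal Python sketch of a cluster that routes each key to one member and keeps per-node local state. This is purely illustrative — the `Node` / `Cluster` names and the naive modulo placement are assumptions for the sketch, not Suuchi's actual API (Suuchi is a Scala library built on gRPC).

```python
import hashlib

class Node:
    """A toy cluster member with its own local key-value store.

    This mirrors the idea of "local state via an embedded KV store
    per node": each node stores only the keys routed to it. A dict
    stands in for an embedded store such as RocksDB.
    """
    def __init__(self, name):
        self.name = name
        self.store = {}

class Cluster:
    """Routes every request to exactly one owner node by hashing the key."""
    def __init__(self, nodes):
        self.nodes = nodes

    def _owner(self, key):
        # Naive modulo placement for brevity; a real system would use
        # a consistent hash ring so membership changes move few keys.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def get(self, key):
        return self._owner(key).store.get(key)

cluster = Cluster([Node("node1"), Node("node2"), Node("node3")])
cluster.put("foo", "1")
cluster.put("bar", "2")
assert cluster.get("foo") == "1"  # reads route to the same owner as writes
```

Because reads and writes hash the key the same way, they always land on the same node's local store — which is what makes "routing + embedded KV store" add up to a distributed store.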
Sharding / Routing
- Consistent Hash Ring
- Your own sharding technique?
(figure: keys placed on a ring of node 1 .. node 4)
Reference: "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web"
Replication
Provides high availability for write-heavy systems at the cost of consistency.
- Sync: every request is successful only if all the replicas succeeded
- Async*
Suuchi @ Indix
- HTML Archive
  - Handles 1000+ tps - a write-heavy system
  - Stores 120 TB of URL- and timestamp-indexed HTML pages
- Stats (as Monoids) Aggregation System
  - Approximate real-time aggregates
  - Timeline & windowed queries
- Real-time scheduler for our crawlers
  - Prioritises the next batch of URLs to crawl
  - Helps crawl 20+ million URLs per day