Scaling AfricasTalking - DevCraft Nairobi

+ SCALING THE BACKEND: The Africa’s Talking Experience

+ Who we are

+ In numbers…
▪ 10+ Products across SMS, USSD, Voice and Airtime
▪ 7 Markets, 35 employees
▪ 16 Telcos, 50+ connections
▪ 1000+ Active Developer Accounts
▪ 5 Million+ Daily Client API Calls
▪ 20 Million+ Daily Telco API Calls

+ In the beginning…
▪ Summer 2012
▪ Build the damn APIs now!!!!
▪ 2-Way SMS APIs
▪ ~0 active clients
▪ ~0 API Calls Daily
▪ No product-market fit validation
▪ 2 Developers and 1 intern
▪ No venture capital cash

+ The product

+ Keep it simple
▪ Single Server in the cloud (we might need another one…)
▪ Rackspace since they are in the UK
▪ 1GB Instance
▪ Zend Application (who doesn’t know PHP?)
▪ POST to /version1/messaging controller for outgoing messages
▪ GET from /version1/messaging controller for incoming messages
▪ /gateway/safaricom/incoming-sms action for incoming messages
▪ /gateway/safaricom/dlr action for delivery reports
▪ /index controller for home, documentation and dashboards
▪ Mesh of wires in the office (to Yu and Safaricom)

+ One instance to rule them all…

+ Alas, product-market fit
▪ Profitable by end of 2013
▪ 400+ active accounts
▪ 500,000+ Daily API Calls
▪ 4 Telcos
▪ 3 in-house Developers (2 backend, 1 UI/UX)

+ Meanwhile, good problems to have and a first stab…
▪ Memory pressure from Apache/MySQL
▪ Increased to 4GB/4 VCPUs
▪ Separate Web Server, DB and Website
▪ Separate concerns
  ▪ Client Requests
  ▪ Telco Requests (Delivery reports, Incoming SMS)
▪ Abstract out telco-specific logic into separate instances

+ Let’s create some separation…

+ Apache/Zend growing pains
▪ 40 MB RAM per request
▪ Max ~100 concurrent calls (4 GB RAM ÷ 40 MB per request)
▪ Only option is to scale up given “threading support” in PHP
▪ Long-running Requests
  ▪ Introduced enqueue parameter for clients
  ▪ Not enforced, just highly encouraged. Largest consumers were internal
▪ Long-running Response Callbacks
  ▪ Client servers could be slow or unavailable
  ▪ Enqueue all telco requests and run cron jobs to clean up
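The 100-call ceiling on this slide follows from dividing the server's RAM by the per-request footprint. A back-of-the-envelope check in Python, assuming (as a simplification not stated in the talk) that all 4 GB is available to Apache workers; in practice the OS and MySQL take a share, so the real ceiling would be lower:

```python
# Back-of-the-envelope capacity check for a prefork Apache/PHP server.
# The two figures come from the slide; treating all RAM as available
# to request workers is a simplifying assumption.

RAM_MB = 4 * 1024      # 4 GB instance
PER_REQUEST_MB = 40    # approximate RAM per Apache/Zend request


def max_concurrent_requests(ram_mb: int, per_request_mb: int) -> int:
    """Upper bound on simultaneous requests before the server starts swapping."""
    return ram_mb // per_request_mb


print(max_concurrent_requests(RAM_MB, PER_REQUEST_MB))  # prints 102
```

Because each extra concurrent request costs another 40 MB, the only lever in this model is adding RAM, which is the "scale up" constraint the slide describes.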

+ MySQL growing pains
▪ Memory intensive connections (slow writes)
▪ Ever-growing DB tables
  ▪ Clean up old records using cron jobs
  ▪ Profiler to figure out smart indexing and querying
  ▪ Scaling up RAM
  ▪ No analytics
▪ Lock contention
  ▪ Billing & queue status updates
  ▪ Reduced throughput
▪ No TTL for records
  ▪ Cron jobs to clean up old/unnecessary records
  ▪ Not effective for caching

+ PHP Growing Pains
▪ Poor threading support
  ▪ Moved queue processors to Python (Twisted framework)
  ▪ Increase number of workers on Apache
▪ Testing burden with increased complexity
  ▪ No compiler
  ▪ Duck typing is NOT your friend any more
  ▪ Enforced standards amongst the 2 developers ☺
▪ No shared state without involving a database
  ▪ Limited application state

+ The end is nigh
▪ By mid-2014, everything was grinding to a halt
  ▪ SMS Table had over 300 Million entries
  ▪ Cleanup jobs took a week
  ▪ Updates would take minutes
  ▪ Dashboard queries were hanging
▪ Deluge of client and telco requests
  ▪ Lots of downtime and dropped requests
  ▪ All servers were using swap space, all the time
▪ Fund raising was not going so well…

+ Hard Reset
▪ Raise money for vertical scale or rewrite the entire application and scale horizontally
▪ Re-evaluate every technical decision we had made up to that point
  ▪ Programming language
  ▪ Storage layer
  ▪ Web server
  ▪ Queuing
  ▪ Analytics
  ▪ Monitoring

+ Here comes the bride…

+ Scala as the glue
▪ First-class concurrency (Actor model)
▪ Type safety
▪ OOP with functional goodness
▪ Testable to your heart’s content
▪ Java interoperability
▪ Growing developer community
▪ First-class concurrency (Actor model)!!!

+ Akka Actor Framework
▪ Framework for concurrent, fault-tolerant, scalable applications
▪ Actors communicate by passing sync or async messages to each other
▪ Each actor has a mailbox for received messages
▪ Each actor maintains isolated state
Credits: https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors
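The mailbox and isolated-state ideas above can be illustrated without Akka. The following is a toy sketch in plain Python (the talk's stack is Scala/Akka; `CounterActor`, `tell` and `run_demo` are hypothetical names made up for this example): one thread drains a queue, so the actor's state is never touched concurrently and needs no lock.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, and one thread draining it."""

    _STOP = object()  # sentinel message that shuts the actor down

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._count = 0  # isolated state, touched only by the actor thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message) -> None:
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def _run(self) -> None:
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._count += message

    def stop(self) -> int:
        """Let the actor finish its mailbox, then return the final count."""
        self._mailbox.put(self._STOP)
        self._thread.join()
        return self._count


def run_demo() -> int:
    """Four sender threads each send 1000 increments to a single actor."""
    actor = CounterActor()
    senders = [
        threading.Thread(target=lambda: [actor.tell(1) for _ in range(1000)])
        for _ in range(4)
    ]
    for sender in senders:
        sender.start()
    for sender in senders:
        sender.join()
    return actor.stop()


print(run_demo())  # prints 4000
```

Many threads send concurrently, yet no increment is lost, because only the actor's own thread ever mutates the counter.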

+ Actor Systems
▪ Actor system contains a thread pool (could be 400+)
▪ Evenly distributed load across multiple processors
▪ Each actor processes messages sequentially
▪ No error-prone locking and state management
▪ Even though each actor is essentially single-threaded, a system of actors is highly concurrent and scalable
▪ Important: ensure no blocking calls in the critical path
  ▪ Small pool of actors for core business logic
  ▪ Army of workers for blocking network/IO calls
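The "army of workers for blocking calls" advice can be sketched with a thread pool. This Python example is only an illustration of the idea, not the Akka dispatcher setup used in the talk; `fake_telco_send` is a hypothetical stand-in for a slow network call, and the phone numbers are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_telco_send(number: str) -> str:
    """Hypothetical stand-in for a blocking network call to a telco."""
    time.sleep(0.05)  # simulate network latency
    return f"sent:{number}"


def send_bulk(numbers, io_workers: int = 20):
    """Fan blocking sends out to a large worker pool so they overlap."""
    with ThreadPoolExecutor(max_workers=io_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(fake_telco_send, numbers))


numbers = [f"+2547000000{i:02d}" for i in range(20)]
start = time.perf_counter()
results = send_bulk(numbers)
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

Run one after another, the 20 simulated sends would take about a second. With 20 workers they overlap, so the slow calls stop dictating throughput and the code that decides what to send stays small and quick.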

+ True concurrency

+ Bulk SMS with Actors

+ BulkSmsService code…

+ Persistence Layer

+ Redis for transient data
▪ Open source, in-memory data structure store
▪ Fast read/write operations
▪ Supports data structures such as strings, hashes, lists, sets
▪ Supports fast atomic operations
  ▪ Incr/decr
  ▪ Push/pop
▪ Uses:
  ▪ User balance operations
  ▪ Queuing
  ▪ Unconfirmed requests
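To show why atomic incr/decr and push/pop suit balances and queues, here is a sketch in Python that needs no Redis server. `InMemoryStore` is a hypothetical in-process stand-in whose method names mirror the Redis commands INCRBY, DECRBY, LPUSH and RPOP; the key names and the debit-then-refund pattern are assumptions for illustration, not the talk's actual implementation.

```python
import threading
from collections import defaultdict, deque


class InMemoryStore:
    """Hypothetical in-process stand-in for a few Redis commands.

    A lock imitates the fact that Redis executes each command atomically.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ints = defaultdict(int)
        self._lists = defaultdict(deque)

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._ints[key] += amount
            return self._ints[key]

    def decrby(self, key: str, amount: int) -> int:
        return self.incrby(key, -amount)

    def lpush(self, key: str, value: str) -> int:
        with self._lock:
            self._lists[key].appendleft(value)
            return len(self._lists[key])

    def rpop(self, key: str):
        with self._lock:
            return self._lists[key].pop() if self._lists[key] else None


def charge_and_enqueue(store, user: str, cost: int, message: str) -> bool:
    """Debit first; refund and reject if the balance went negative."""
    if store.decrby(f"balance:{user}", cost) < 0:
        store.incrby(f"balance:{user}", cost)
        return False
    store.lpush("queue:sms", message)
    return True


store = InMemoryStore()
store.incrby("balance:alice", 10)
accepted = [charge_and_enqueue(store, "alice", 4, f"msg-{i}") for i in range(3)]
print(accepted)                 # [True, True, False]
print(store.rpop("queue:sms"))  # msg-0 (LPUSH plus RPOP gives FIFO order)
```

Because the decrement returns the new balance in one step, two concurrent requests cannot both observe the same balance and overspend it, which is the property a read-then-write against a SQL row needs locking to achieve.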

+ Queues with actors and Redis

+ Cassandra for transaction logs
▪ Manage massive amounts of data fast, without losing sleep
  ▪ Scalability
  ▪ High availability
  ▪ High performance
▪ Automatic sharding through partition keys
▪ Automatic row lifetime management
▪ Uses
  ▪ SMS/USSD/Voice/Airtime transaction logs
  ▪ Delivery reports
  ▪ Billing transaction logs
  ▪ Analytics views

+ Bulk SMS Cassandra Table Design
▪ Pick good partition keys to distribute data with respect to access patterns
▪ Lookups should ideally only access one node
▪ Reduce lookups when writing in favor of inserts
▪ Write-heavy loads
▪ Bulk SMS access:
  ▪ By userId
  ▪ Ordered by insertion time (latest first)
  ▪ Lookup by date
  ▪ Lookup by recipient and senderId
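The design rules above (one partition per lookup, newest rows first, inserts instead of read-before-write) can be modelled in a few lines of Python. This is a toy in-memory model rather than Cassandra code, and it is not the schema from the talk: the column names, the (user, day) partition key, and the second table keyed by (recipient, sender) are all assumptions chosen to match the listed access patterns.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key, kept newest-first inside each partition."""

    def __init__(self, partition_key):
        self._partition_key = partition_key   # function: row -> hashable key
        self._partitions = defaultdict(list)  # key -> sorted list of entries

    def insert(self, row: dict) -> None:
        # Blind insert: no read-before-write, which suits write-heavy loads.
        partition = self._partitions[self._partition_key(row)]
        # Negated timestamp sorts newest first; message_id breaks ties.
        bisect.insort(partition, (-row["sent_at"], row["message_id"], row))

    def read(self, key, limit: int = 10) -> list:
        # A read touches exactly one partition.
        return [row for _, _, row in self._partitions.get(key, [])[:limit]]


# One table per access pattern (hypothetical layouts).
by_user_day = ToyTable(lambda r: (r["user_id"], r["day"]))
by_recipient = ToyTable(lambda r: (r["recipient"], r["sender_id"]))


def record_sms(row: dict) -> None:
    """Denormalise: write the row once per access pattern."""
    by_user_day.insert(row)
    by_recipient.insert(row)


for message_id, day, sent_at, recipient in [
    ("m1", "2015-06-01", 100, "+254700000001"),
    ("m2", "2015-06-01", 200, "+254700000002"),
    ("m3", "2015-06-02", 300, "+254700000001"),
]:
    record_sms({
        "message_id": message_id, "user_id": "u1", "day": day,
        "sent_at": sent_at, "recipient": recipient, "sender_id": "ACME",
    })

latest = by_user_day.read(("u1", "2015-06-01"))
print([r["message_id"] for r in latest])   # ['m2', 'm1']
history = by_recipient.read(("+254700000001", "ACME"))
print([r["message_id"] for r in history])  # ['m3', 'm1']
```

The cost of this approach is storing each message twice; the benefit is that both queries are answered from a single, already-sorted partition.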

+ Bulk SMS Cassandra Schema

+ Analytics with Redis and Cassandra
▪ Analytics-first design
  ▪ Manage growth
  ▪ Answer client requests
  ▪ Talk to investors
▪ Event-based real-time analytics framework across all products
  ▪ Actors generate events (sendBulkSms, failedUssdHop)
  ▪ Events converted into views, bucketed by hour and day in Redis
  ▪ Views continuously stored in Cassandra
  ▪ API layer generated JSON views for display to clients
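Bucketing events by hour and day amounts to incrementing two counters per event. A minimal sketch in Python, using a `Counter` in place of Redis; the key format and the sample timestamps are assumptions made up for this example, while the event names come from the slide.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_keys(event: str, ts: datetime) -> tuple:
    """Return the hourly and daily counter keys for one event."""
    return (
        f"{event}:hour:{ts:%Y-%m-%dT%H}",
        f"{event}:day:{ts:%Y-%m-%d}",
    )


def record(counters: Counter, event: str, ts: datetime) -> None:
    """Increment both buckets; against Redis this could be two INCR calls."""
    for key in bucket_keys(event, ts):
        counters[key] += 1


counters = Counter()
events = [
    ("sendBulkSms", datetime(2015, 6, 1, 9, 5, tzinfo=timezone.utc)),
    ("sendBulkSms", datetime(2015, 6, 1, 9, 40, tzinfo=timezone.utc)),
    ("sendBulkSms", datetime(2015, 6, 1, 14, 0, tzinfo=timezone.utc)),
    ("failedUssdHop", datetime(2015, 6, 1, 9, 10, tzinfo=timezone.utc)),
]
for name, ts in events:
    record(counters, name, ts)

print(counters["sendBulkSms:hour:2015-06-01T09"])  # 2
print(counters["sendBulkSms:day:2015-06-01"])      # 3
```

Since the counters are pre-aggregated at write time, a dashboard query becomes a handful of key reads instead of a scan over raw transaction rows.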

+ Real-time Analytics

+ Still walking the path
▪ Monitoring
  ▪ ~100 hosts across 6 countries
  ▪ Hybrid solution to monitor at server, network and application layers
  ▪ Instant notifications over SMS/Email when things break
  ▪ Integration with client apps is wanting
▪ Configuration management
  ▪ Load-balanced servers
  ▪ Sandbox/production environment
  ▪ Currently evaluating solutions
▪ Security
  ▪ Enough said
▪ New products with different characteristics
  ▪ Voice/Video: real-time, network heavy, local hosting
  ▪ Payments: ACID transactions, robust error handling

+ Lessons learnt
▪ Organic is best
  ▪ Only scale if you HAVE TO (product-market fit)
  ▪ Use solutions that match your budget
▪ Control as much of the stack as you can
  ▪ Buys you speed and flexibility as you grow
▪ Build on the shoulders of (free) giants
  ▪ Open source where possible
▪ Small, agile teams rule
  ▪ Find good talent and throw hard problems at it
▪ Apply the same thinking into growing the business
  ▪ Nothing beats a tech background on the hot seat

+ Questions?


Ian Juma
