Scaling AfricasTalking - DevCraft nairobi

Ian Juma
September 23, 2016

Sam Gikandi's AfricasTalking talk at DevCraft Nairobi on scaling the AfricasTalking APIs

Transcript

  1. In numbers…
     ▪ 10+ Products across SMS, USSD, Voice and Airtime
     ▪ 7 Markets, 35 employees
     ▪ 16 Telcos, 50+ connections
     ▪ 1000+ Active Developer Accounts
     ▪ 5 Million+ Daily Client API Calls
     ▪ 20 Million+ Daily Telco API Calls
  2. In the beginning…
     ▪ 2012 Summer
     ▪ Build the damn APIs now!!!!
     ▪ 2-Way SMS APIs
     ▪ ~0 active clients
     ▪ ~0 API Calls Daily
     ▪ No product-market fit validation
     ▪ 2 Developers and 1 intern
     ▪ No venture capital cash
  3. Keep it simple
     ▪ Single Server in the cloud (we might need another one…)
     ▪ Rackspace since they are in the UK
     ▪ 1GB Instance
     ▪ Zend Application (who doesn’t know PHP?)
     ▪ POST to /version1/messaging controller for outgoing messages
     ▪ GET from /version1/messaging controller for incoming messages
     ▪ /gateway/safaricom/incoming-sms action for incoming messages
     ▪ /gateway/safaricom/dlr action for delivery reports
     ▪ /index controller for home, documentation and dashboards
     ▪ Mesh of wires in the office (to Yu and Safaricom)
  4. Alas, product-market fit
     ▪ Profitable by end of 2013
     ▪ 400+ active accounts
     ▪ 500,000+ Daily API Calls
     ▪ 4 Telcos
     ▪ 3 in-house Developers (2 backend, 1 UI/UX)
  5. Meanwhile, good problems to have and a first stab..
     ▪ Memory pressure from Apache/MySQL
     ▪ Increased to 4GB/4 VCPUs
     ▪ Separate Web Server, DB and Website
     ▪ Separate concerns
     ▪ Client Requests
     ▪ Telco Requests (Delivery reports, Incoming SMS)
     ▪ Abstract out telco-specific logic into separate instances
  6. Apache/Zend growing pains
     ▪ 40 MB RAM per request
     ▪ 100 max number of concurrent calls based on 4GB Rule
     ▪ Only option is to scale up given “threading support” in PHP
     ▪ Long-running Requests
     ▪ Introduced enqueue parameter for clients
     ▪ Not enforced, just highly encouraged. Largest consumers were internal
     ▪ Long-running Response Callbacks
     ▪ Client servers could be slow or unavailable
     ▪ Enqueue all telco requests and run cron jobs to clean up
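
The concurrency ceiling on this slide follows from a single division. A quick check, assuming the 4GB figure is the instance's RAM and that all of it is available to request workers (in practice the OS and MySQL take a share, so the real ceiling would be lower):

```python
# Back-of-the-envelope check of the Apache/Zend concurrency ceiling.
# Assumption: the whole 4 GB instance is available to request workers.
RAM_MB = 4 * 1024        # 4 GB instance, in MB
MB_PER_REQUEST = 40      # per-request footprint quoted on the slide

max_concurrent = RAM_MB // MB_PER_REQUEST
print(max_concurrent)    # 102, in line with the "100 max" on the slide
```

With a fixed per-request cost like this, the only way to raise the ceiling is more RAM, which is the "scale up" option the slide describes.
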
  7. MySQL growing pains
     ▪ Memory intensive connections (slow writes)
     ▪ Ever-growing DB tables
     ▪ Clean up old records using cron jobs
     ▪ Profiler to figure out smart indexing and querying
     ▪ Scaling up RAM
     ▪ No analytics
     ▪ Lock contention
     ▪ Billing & queue status updates
     ▪ Reduced throughput
     ▪ No TTL for records
     ▪ Cron jobs to clean up old/unnecessary records
     ▪ Not effective for caching
  8. PHP Growing Pains
     ▪ Poor threading support
     ▪ Moved queue processors to Python (Twisted framework)
     ▪ Increase number of workers on Apache
     ▪ Testing burden with increased complexity
     ▪ No compiler
     ▪ Duck typing is NOT your friend any more
     ▪ Enforced standards amongst the 2 developers ☺
     ▪ No shared state without involving a database
     ▪ Limited application state
  9. The end is nigh
     ▪ By mid-2014, everything was grinding to a halt
     ▪ SMS Table had over 300 Million entries
     ▪ Cleanup jobs took a week
     ▪ Updates would take minutes
     ▪ Dashboard queries were hanging
     ▪ Deluge of client and telco requests
     ▪ Lots of downtime and dropped requests
     ▪ All servers were using swap space, all the time
     ▪ Fund raising was not going so well….
  10. Hard Reset
      ▪ Raise money for vertical scale or rewrite the entire application and scale horizontally
      ▪ Re-evaluate every technical decision we had made up to that point
        ▪ Programming language
        ▪ Storage layer
        ▪ Web server
        ▪ Queuing
        ▪ Analytics
        ▪ Monitoring
  11. Scala as the glue
      ▪ First class concurrency (Actor model)
      ▪ Type safety
      ▪ OOP with functional goodness
      ▪ Testable to your heart’s content
      ▪ Java interoperability
      ▪ Growing developer community
      ▪ First class concurrency (Actor model)!!!
  12. Akka Actor Framework
      ▪ Framework for concurrent, fault-tolerant scalable applications
      ▪ Actors communicate by passing sync or async messages to each other
      ▪ Each actor has a mailbox for received messages
      ▪ Each actor maintains isolated state
      Credits: https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors
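
The mailbox idea can be illustrated without Akka. The following is a minimal sketch in plain Python, not Akka's API and not the code from the talk: `CounterActor`, `tell` and `ask` are hypothetical names chosen for the illustration, with a thread and a `queue.Queue` standing in for the actor and its mailbox.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, and one thread draining it."""

    def __init__(self):
        self._mailbox = queue.Queue()   # messages wait here until processed
        self._count = 0                 # isolated state, touched only by _run
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue and return immediately."""
        self._mailbox.put((message, None))

    def ask(self, message):
        """Synchronous send: enqueue, then block until the actor replies."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply.get()

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message, reply = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            if reply is not None:
                reply.put(self._count)


actor = CounterActor()
for _ in range(1000):
    actor.tell("increment")
total = actor.ask("get")   # queued behind the 1000 increments
actor.tell("stop")
print(total)
```

Because only the actor's own thread ever touches `_count`, no lock is needed around the counter; callers interact with it purely through messages.
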
  13. Actor Systems
      ▪ Actor system contains a thread pool (could be 400+)
      ▪ Evenly distributed load across multiple processors
      ▪ Each actor processes messages sequentially
      ▪ No error prone locking and state management
      ▪ Even though each actor is essentially single-threaded, a system of actors is highly concurrent and scalable
      ▪ Important: ensure no blocking calls in the critical path
      ▪ Small pool of actors for core business logic
      ▪ Army of workers for blocking network/io calls
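
The split between a small core and an "army of workers" can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the talk's Akka code: `slow_telco_call` is a hypothetical blocking call simulated with a sleep, the main loop plays the role of a single core actor, and a `ThreadPoolExecutor` stands in for the worker pool.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def slow_telco_call(msg_id):
    """Hypothetical blocking network call, simulated with a sleep."""
    time.sleep(0.05)
    return msg_id, "delivered"


mailbox = queue.Queue()
workers = ThreadPoolExecutor(max_workers=20)   # the "army" for blocking IO
statuses = {}                                   # state owned by the core loop


def on_done(future):
    # Called when the call finishes (normally on a worker thread).
    # It only posts a message back; it never touches `statuses` directly.
    mailbox.put(("result", future.result()))


N = 20
for i in range(N):
    mailbox.put(("send", i))

pending = N
while pending:
    kind, payload = mailbox.get()
    if kind == "send":
        # Hand the blocking call to a worker and move on to the next message.
        workers.submit(slow_telco_call, payload).add_done_callback(on_done)
    else:
        msg_id, status = payload
        statuses[msg_id] = status
        pending -= 1

workers.shutdown()
print(len(statuses))
```

The core loop never waits on the network itself, so a slow telco connection ties up a worker rather than the business logic.
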
  14. Redis for transient data
      ▪ Open source, in-memory data structure store
      ▪ Fast read/write operations
      ▪ Supports data structures such as strings, hashes, lists, sets
      ▪ Supports fast atomic operations
        ▪ Incr/decr
        ▪ Push/pop
      ▪ Uses:
        ▪ User balance operations
        ▪ Queuing
        ▪ Unconfirmed requests
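
To show how atomic incr/decr and push/pop could support balance operations and queuing, here is a sketch that runs without a Redis server. `FakeRedis` is a hypothetical in-memory stand-in whose method names mirror the Redis commands INCRBY, DECRBY, LPUSH and RPOP; the key names and the deduct-then-refund approach in `charge_and_enqueue` are assumptions for illustration, not the scheme described in the talk.

```python
import threading
from collections import defaultdict, deque


class FakeRedis:
    """In-memory stand-in exposing a few Redis-style atomic operations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ints = defaultdict(int)
        self._lists = defaultdict(deque)

    def incrby(self, key, amount):
        with self._lock:
            self._ints[key] += amount
            return self._ints[key]      # like INCRBY, returns the new value

    def decrby(self, key, amount):
        return self.incrby(key, -amount)

    def lpush(self, key, value):
        with self._lock:
            self._lists[key].appendleft(value)
            return len(self._lists[key])

    def rpop(self, key):
        with self._lock:
            return self._lists[key].pop() if self._lists[key] else None


def charge_and_enqueue(store, user, cost, sms):
    """Deduct the cost first; refund if the balance went negative."""
    if store.decrby(f"balance:{user}", cost) < 0:
        store.incrby(f"balance:{user}", cost)   # undo the deduction
        return False
    store.lpush("queue:outgoing", sms)
    return True


store = FakeRedis()
store.incrby("balance:alice", 10)
results = [charge_and_enqueue(store, "alice", 4, f"sms-{i}") for i in range(3)]
first = store.rpop("queue:outgoing")
print(results, first)   # [True, True, False] sms-0
```

Pushing on the left and popping on the right gives first-in, first-out order, so the oldest queued message is sent first.
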
  15. Cassandra for transaction logs
      ▪ Manage massive amounts of data fast, without losing sleep
      ▪ Scalability
      ▪ High availability
      ▪ High performance
      ▪ Automatic sharding through partition keys
      ▪ Automatic row lifetime management
      ▪ Uses
        ▪ SMS/USSD/Voice/Airtime transaction logs
        ▪ Delivery reports
        ▪ Billing transaction logs
        ▪ Analytics views
  16. Bulk SMS Cassandra Table Design
      ▪ Pick good partition keys to distribute data with respect to access patterns
      ▪ Lookups should ideally only access one node
      ▪ Reduce lookups when writing in favor of inserts
      ▪ Write heavy loads
      ▪ Bulk SMS access:
        ▪ By userId
        ▪ Ordered by insertion time (latest first)
        ▪ Lookup by date
        ▪ Lookup by recipient and senderId
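
The table-design points can be modelled in a few lines of plain Python. This is a toy model, not Cassandra and not the schema from the talk: the `(user_id, day)` partition key, the three node names, the sample rows and the hash-modulo placement are all assumptions made for illustration (Cassandra itself places partitions on a token ring).

```python
import bisect
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]   # hypothetical 3-node cluster


def node_for(partition_key):
    """Hash the partition key to pick the node that owns it (simplified)."""
    digest = hashlib.md5(repr(partition_key).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


class BulkSmsTable:
    """Rows grouped by (user_id, day); newest first inside a partition."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, user_id, day, sent_at, recipient, text):
        # Pure insert: no read-before-write, which suits write-heavy loads.
        rows = self._partitions[(user_id, day)]
        bisect.insort(rows, (-sent_at, recipient, text))

    def latest(self, user_id, day, limit):
        # Reads exactly one partition, hence one node in this model.
        rows = self._partitions[(user_id, day)][:limit]
        return [(-neg_ts, recipient, text) for neg_ts, recipient, text in rows]


table = BulkSmsTable()
table.insert("user-1", "2016-09-23", 100, "+254700000001", "hello")
table.insert("user-1", "2016-09-23", 300, "+254700000002", "promo")
table.insert("user-1", "2016-09-23", 200, "+254700000003", "reminder")
table.insert("user-2", "2016-09-23", 150, "+254700000004", "other")

newest = table.latest("user-1", "2016-09-23", 2)
owner = node_for(("user-1", "2016-09-23"))
print(newest, owner)
```

Keeping rows sorted newest-first inside each partition means the "latest messages for this user on this date" query is a prefix read rather than a scan.
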
  17. Analytics with Redis and Cassandra
      ▪ Analytics-first design
      ▪ Manage growth
      ▪ Answer client requests
      ▪ Talk to investors
      ▪ Event-based Real-time analytics framework across all products
      ▪ Actors generate events (sendBulkSms, failedUssdHop)
      ▪ Events converted into views, bucketed by hour and day in Redis
      ▪ Views continuously stored in Cassandra
      ▪ API layer generated JSON views for display to clients
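
Bucketing events by hour and day can be sketched with two counters. This is a minimal illustration using in-process `Counter` objects where the talk uses Redis; the bucket key format and the `record` and `at` helpers are assumptions, and only the event names come from the slide.

```python
from collections import Counter
from datetime import datetime, timezone

hourly = Counter()   # (event, "YYYY-MM-DDTHH") -> count
daily = Counter()    # (event, "YYYY-MM-DD")    -> count


def record(event, ts):
    """Count one event in its hour bucket and its day bucket."""
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    hourly[(event, when.strftime("%Y-%m-%dT%H"))] += 1
    daily[(event, when.strftime("%Y-%m-%d"))] += 1


def at(hour, minute):
    """Helper: a UTC timestamp on 2016-09-23 for the sample events."""
    return datetime(2016, 9, 23, hour, minute, tzinfo=timezone.utc).timestamp()


record("sendBulkSms", at(10, 15))
record("sendBulkSms", at(10, 45))
record("sendBulkSms", at(11, 5))
record("failedUssdHop", at(11, 30))

print(hourly[("sendBulkSms", "2016-09-23T10")])   # 2
print(daily[("sendBulkSms", "2016-09-23")])       # 3
```

Since each bucket is just a counter, a view for a dashboard is a lookup over a handful of keys rather than a query over raw transaction logs.
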
  18. Still walking the path
      ▪ Monitoring
        ▪ ~100 hosts across 6 countries
        ▪ Hybrid solution to monitor at server, network and application layers
        ▪ Instant notifications over SMS/Email when things break
        ▪ Integration with client apps is wanting
      ▪ Configuration management
        ▪ Load-balanced servers
        ▪ Sandbox/production environment
        ▪ Currently evaluating solutions
      ▪ Security
        ▪ Enough said
      ▪ New products with different characteristics
        ▪ Voice/Video: real-time, network heavy, local hosting
        ▪ Payments: ACID transactions, robust error handling
  19. Lessons learnt
      ▪ Organic is best
      ▪ Only scale if you HAVE TO (product-market fit)
      ▪ Use solutions that match your budget
      ▪ Control as much of the stack as you can
      ▪ Buys you speed and flexibility as you grow
      ▪ Build on the shoulders of (free) giants
      ▪ Open source where possible
      ▪ Small, agile teams rule
      ▪ Find good talent and throw hard problems at it
      ▪ Apply the same thinking into growing the business
      ▪ Nothing beats a tech background on the hot seat