Scaling AfricasTalking - DevCraft Nairobi

Ian Juma
September 23, 2016

Sam Gikandi's talk at DevCraft Nairobi on scaling AfricasTalking's APIs

Transcript

  1. In numbers…
    ▪ 10+ Products across SMS, USSD, Voice and Airtime
    ▪ 7 Markets, 35 employees
    ▪ 16 Telcos, 50+ connections
    ▪ 1000+ Active Developer Accounts
    ▪ 5 Million+ Daily Client API Calls
    ▪ 20 Million+ Daily Telco API Calls
  2. In the beginning…
    ▪ 2012 Summer
    ▪ Build the damn APIs now!!!!
    ▪ 2-Way SMS APIs
    ▪ ~0 active clients
    ▪ ~0 API Calls Daily
    ▪ No product-market fit validation
    ▪ 2 Developers and 1 intern
    ▪ No venture capital cash
  3. Keep it simple
    ▪ Single Server in the cloud (we might need another one…)
    ▪ Rackspace since they are in the UK
    ▪ 1GB Instance
    ▪ Zend Application (who doesn’t know PHP?)
    ▪ POST to /version1/messaging controller for outgoing messages
    ▪ GET from /version1/messaging controller for incoming messages
    ▪ /gateway/safaricom/incoming-sms action for incoming messages
    ▪ /gateway/safaricom/dlr action for delivery reports
    ▪ /index controller for home, documentation and dashboards
    ▪ Mesh of wires in the office (to Yu and Safaricom)
  4. Alas, product-market fit
    ▪ Profitable by end of 2013
    ▪ 400+ active accounts
    ▪ 500,000+ Daily API Calls
    ▪ 4 Telcos
    ▪ 3 in-house Developers (2 backend, 1 UI/UX)
  5. Meanwhile, good problems to have and a first stab…
    ▪ Memory pressure from Apache/MySQL
    ▪ Increased to 4GB/4 VCPUs
    ▪ Separate Web Server, DB and Website
    ▪ Separate concerns
    ▪ Client Requests
    ▪ Telco Requests (Delivery reports, Incoming SMS)
    ▪ Abstract out telco-specific logic into separate instances
  6. Apache/Zend growing pains
    ▪ 40 MB RAM per request
    ▪ Maximum of about 100 concurrent calls, based on the 4GB limit
    ▪ Only option is to scale up given “threading support” in PHP
    ▪ Long-running Requests
    ▪ Introduced enqueue parameter for clients
    ▪ Not enforced, just highly encouraged. Largest consumers were internal
    ▪ Long-running Response Callbacks
    ▪ Client servers could be slow or unavailable
    ▪ Enqueue all telco requests and run cron jobs to clean up
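
The concurrency ceiling on this slide is simple arithmetic: available RAM divided by RAM per request. A minimal sketch of that calculation follows. The function name and the `reserved_mb` parameter (memory set aside for the OS and other processes) are illustrative assumptions, not something the talk specifies.

```python
# Back-of-the-envelope capacity for a server that holds ~40 MB per in-flight request.
RAM_PER_REQUEST_MB = 40    # figure quoted on the slide
TOTAL_RAM_MB = 4 * 1024    # the 4GB instance


def max_concurrent_requests(total_ram_mb: int, per_request_mb: int,
                            reserved_mb: int = 0) -> int:
    """Upper bound on simultaneous requests that fit in RAM.

    reserved_mb is a hypothetical allowance for everything that is not a
    request handler (OS, monitoring agents, and so on).
    """
    return (total_ram_mb - reserved_mb) // per_request_mb


ceiling = max_concurrent_requests(TOTAL_RAM_MB, RAM_PER_REQUEST_MB)
print(ceiling)
```

With no memory reserved the bound comes out just above 100, which is consistent with the round figure on the slide. Any long-running request or slow callback holds its 40 MB for longer, which is why the enqueue parameter mattered.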
  7. MySQL growing pains
    ▪ Memory intensive connections (slow writes)
    ▪ Ever-growing DB tables
    ▪ Clean up old records using cron jobs
    ▪ Profiler to figure out smart indexing and querying
    ▪ Scaling up RAM
    ▪ No analytics
    ▪ Lock contention
    ▪ Billing & queue status updates
    ▪ Reduced throughput
    ▪ No TTL for records
    ▪ Cron jobs to clean up old/unnecessary records
    ▪ Not effective for caching
  8. PHP Growing Pains
    ▪ Poor threading support
    ▪ Moved queue processors to Python (Twisted framework)
    ▪ Increase number of workers on Apache
    ▪ Testing burden with increased complexity
    ▪ No compiler
    ▪ Duck typing is NOT your friend any more
    ▪ Enforced standards amongst the 2 developers ☺
    ▪ No shared state without involving a database
    ▪ Limited application state
  9. The end is nigh
    ▪ By mid-2014, everything was grinding to a halt
    ▪ SMS Table had over 300 Million entries
    ▪ Cleanup jobs took a week
    ▪ Updates would take minutes
    ▪ Dashboard queries were hanging
    ▪ Deluge of client and telco requests
    ▪ Lots of downtime and dropped requests
    ▪ All servers were using swap space, all the time
    ▪ Fundraising was not going so well…
  10. Hard Reset
    ▪ Raise money for vertical scale or rewrite the entire application and scale horizontally
    ▪ Re-evaluate every technical decision we had made up to that point
    ▪ Programming language
    ▪ Storage layer
    ▪ Web server
    ▪ Queuing
    ▪ Analytics
    ▪ Monitoring
  11. Scala as the glue
    ▪ First class concurrency (Actor model)
    ▪ Type safety
    ▪ OOP with functional goodness
    ▪ Testable to your heart’s content
    ▪ Java interoperability
    ▪ Growing developer community
    ▪ First class concurrency (Actor model)!!!
  12. Akka Actor Framework
    ▪ Framework for concurrent, fault-tolerant scalable applications
    ▪ Actors communicate by passing sync or async messages to each other
    ▪ Each actor has a mailbox for received messages
    ▪ Each actor maintains isolated state
    Credits: https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors
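
The mailbox and isolated-state ideas on this slide can be sketched without Akka. The toy below is plain Python built on a thread and a queue, not Akka's API. The class name `CounterActor`, the message strings and the `tell`/`ask` method names are illustrative assumptions that only borrow the vocabulary.

```python
import queue
import threading


class CounterActor:
    """A toy actor: private state, a mailbox, and one thread draining it."""

    _STOP = object()  # sentinel message used to shut the actor down

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._count = 0  # isolated state, touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message) -> None:
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put((message, None))

    def ask(self, message, timeout: float = 1.0):
        """Synchronous send: enqueue the message, then wait for the reply."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply.get(timeout=timeout)

    def stop(self) -> None:
        self._mailbox.put((self._STOP, None))
        self._thread.join()

    def _run(self) -> None:
        # Messages are handled one at a time, in arrival order, so the
        # counter needs no lock even with many concurrent senders.
        while True:
            message, reply = self._mailbox.get()
            if message is self._STOP:
                return
            if message == "increment":
                self._count += 1
            if reply is not None:
                reply.put(self._count)
```

Several threads can call `tell("increment")` at once. A later `ask("get")` returns the total because the mailbox serializes every update.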
  13. Actor Systems
    ▪ Actor system contains a thread pool (could be 400+)
    ▪ Evenly distributed load across multiple processors
    ▪ Each actor processes messages sequentially
    ▪ No error-prone locking and state management
    ▪ Even though each actor is essentially single-threaded, a system of actors is highly concurrent and scalable
    ▪ Important: ensure no blocking calls in the critical path
    ▪ Small pool of actors for core business logic
    ▪ Army of workers for blocking network/io calls
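
The last three points on this slide describe a split between a small core and a large pool of workers that absorb blocking calls. A rough illustration of that split follows, using Python's standard thread pool as a stand-in for an actor system's worker pool. `slow_telco_call`, its sleep duration and the pool size are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_telco_call(msg_id: int) -> str:
    """Stand-in for a blocking network call to a telco (hypothetical)."""
    time.sleep(0.05)  # simulated network latency
    return f"delivered:{msg_id}"


def dispatch_all(message_ids, workers: int = 20):
    """Hand every blocking call to the worker pool; return results in input order.

    The caller plays the role of the core business logic: it decides what to
    send, and the workers spend their time waiting on the network.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_telco_call, message_ids))


results = dispatch_all(range(20))
print(results[:3])
```

Because the waiting happens in the workers, the total time tracks the slowest call rather than the sum of all calls, as long as the pool has enough threads.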
  14. Redis for transient data
    ▪ Open source, in-memory data structure store
    ▪ Fast read/write operations
    ▪ Supports data structures such as strings, hashes, lists, sets
    ▪ Supports fast atomic operations
    ▪ Incr/decr
    ▪ Push/pop
    ▪ Uses:
    ▪ User balance operations
    ▪ Queuing
    ▪ Unconfirmed requests
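
To show why atomic incr/decr and push/pop suit balances and queues, here is a small in-memory stand-in that mimics the semantics of Redis's INCRBY, DECRBY, LPUSH and RPOP commands. It is not a Redis client. The key names (`balance:<id>`, `queue:outgoing`) and the `charge_and_enqueue` flow are assumptions made for illustration, not AfricasTalking's actual scheme.

```python
import threading
from collections import deque


class TinyStore:
    """In-memory stand-in for a few Redis commands (not a Redis client)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # makes each operation atomic
        self._ints = {}
        self._lists = {}

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._ints[key] = self._ints.get(key, 0) + amount
            return self._ints[key]

    def decrby(self, key: str, amount: int) -> int:
        return self.incrby(key, -amount)

    def lpush(self, key: str, value) -> int:
        with self._lock:
            items = self._lists.setdefault(key, deque())
            items.appendleft(value)
            return len(items)

    def rpop(self, key: str):
        with self._lock:
            items = self._lists.get(key)
            return items.pop() if items else None


def charge_and_enqueue(store: TinyStore, user_id: int, cost: int, message: str) -> bool:
    """Debit first; if the balance went negative, refund and reject."""
    remaining = store.decrby(f"balance:{user_id}", cost)
    if remaining < 0:
        store.incrby(f"balance:{user_id}", cost)  # undo the debit
        return False
    store.lpush("queue:outgoing", message)  # producers push on the left
    return True
```

Debiting before checking avoids a separate read-then-write step, so two concurrent requests cannot both spend the same credit. Consumers that call `rpop` take messages from the right, which gives first-in, first-out order.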
  15. Cassandra for transaction logs
    ▪ Manage massive amounts of data fast, without losing sleep
    ▪ Scalability
    ▪ High availability
    ▪ High performance
    ▪ Automatic sharding through partition keys
    ▪ Automatic row lifetime management
    ▪ Uses:
    ▪ SMS/USSD/Voice/Airtime transaction logs
    ▪ Delivery reports
    ▪ Billing transaction logs
    ▪ Analytics views
  16. Bulk SMS Cassandra Table Design
    ▪ Pick good partition keys to distribute data with respect to access patterns
    ▪ Lookups should ideally only access one node
    ▪ Reduce lookups when writing in favor of inserts
    ▪ Write heavy loads
    ▪ Bulk SMS access:
    ▪ By userId
    ▪ Ordered by insertion time (latest first)
    ▪ Lookup by date
    ▪ Lookup by recipient and senderId
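
The access patterns on this slide suggest a partition key that combines the user with a date bucket, and rows ordered newest first inside each partition. The talk does not show the actual schema, so the sketch below is a guess. It models the idea with a plain dictionary rather than Cassandra itself, and the `(user_id, day)` key, the field names and the class name are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


class BulkSmsLog:
    """Dictionary model of a table partitioned by (user_id, day)."""

    def __init__(self) -> None:
        self._partitions = defaultdict(list)

    @staticmethod
    def partition_key(user_id: str, sent_at: datetime):
        # One partition per user per day: a lookup by user and date
        # touches exactly one partition.
        return (user_id, sent_at.strftime("%Y-%m-%d"))

    def insert(self, user_id: str, sent_at: datetime, recipient: str, text: str) -> None:
        # Blind append: nothing is read before writing.
        key = self.partition_key(user_id, sent_at)
        self._partitions[key].append((sent_at, recipient, text))

    def latest(self, user_id: str, day: datetime, limit: int = 10):
        # Newest first, mirroring a descending clustering order.
        rows = self._partitions.get(self.partition_key(user_id, day), [])
        return sorted(rows, key=lambda row: row[0], reverse=True)[:limit]


log = BulkSmsLog()
morning = datetime(2016, 9, 23, 8, 0, tzinfo=timezone.utc)
later = datetime(2016, 9, 23, 9, 0, tzinfo=timezone.utc)
log.insert("user-1", morning, "+254700000001", "hello")
log.insert("user-1", later, "+254700000002", "hello again")
print([row[1] for row in log.latest("user-1", morning)])
```

Lookups by recipient and senderId would need their own key in this model, which matches the usual Cassandra practice of one table per query pattern.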
  17. Analytics with Redis and Cassandra
    ▪ Analytics-first design
    ▪ Manage growth
    ▪ Answer client requests
    ▪ Talk to investors
    ▪ Event-based Real-time analytics framework across all products
    ▪ Actors generate events (sendBulkSms, failedUssdHop)
    ▪ Events converted into views, bucketed by hour and day in Redis
    ▪ Views continuously stored in Cassandra
    ▪ API layer generated JSON views for display to clients
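
Bucketing events by hour and day amounts to incrementing one counter per bucket for every event. A minimal sketch follows, with an in-process `Counter` standing in for the Redis counters. The key layout (`<event>:hour:<timestamp>`) is an assumed naming scheme, and only the event name `sendBulkSms` comes from the slide.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_keys(event: str, at: datetime):
    """Return the hourly and daily counter keys for one event occurrence."""
    return (
        f"{event}:hour:{at:%Y-%m-%dT%H}",
        f"{event}:day:{at:%Y-%m-%d}",
    )


class EventViews:
    """Counts events per hour and per day (stand-in for counters in Redis)."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, event: str, at: datetime) -> None:
        for key in bucket_keys(event, at):
            self.counts[key] += 1


views = EventViews()
for hour, minute in [(10, 5), (10, 40), (11, 10)]:
    views.record("sendBulkSms", datetime(2016, 9, 23, hour, minute, tzinfo=timezone.utc))
print(dict(views.counts))
```

Each event costs two increments at write time, and a dashboard query becomes a direct key read instead of a scan over raw transaction logs.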
  18. Still walking the path
    ▪ Monitoring
    ▪ ~100 hosts across 6 countries
    ▪ Hybrid solution to monitor at server, network and application layers
    ▪ Instant notifications over SMS/Email when things break
    ▪ Integration with client apps is wanting
    ▪ Configuration management
    ▪ Load-balanced servers
    ▪ Sandbox/production environment
    ▪ Currently evaluating solutions
    ▪ Security
    ▪ Enough said
    ▪ New products with different characteristics
    ▪ Voice/Video: real-time, network heavy, local hosting
    ▪ Payments: ACID transactions, robust error handling
  19. Lessons learnt
    ▪ Organic is best
    ▪ Only scale if you HAVE TO (product-market fit)
    ▪ Use solutions that match your budget
    ▪ Control as much of the stack as you can
    ▪ Buys you speed and flexibility as you grow
    ▪ Build on the shoulders of (free) giants
    ▪ Open source where possible
    ▪ Small, agile teams rule
    ▪ Find good talent and throw hard problems at it
    ▪ Apply the same thinking into growing the business
    ▪ Nothing beats a tech background on the hot seat