Processing 15 Billion events a day without breaking the bank - ReversimX ILTechTalks

AppsFlyer
January 10, 2017


A talk from the ILTechTalks ReversimX meetup.
Adi Belan discusses AppsFlyer's approach to efficiency and scale in its real-time back-end services.
The video (in Hebrew) is here:
https://youtu.be/TPQKBwDky60


Transcript

  1. Adi Belan - R&D Team Lead: Processing 15 Billion events per day - In Real Time! (While not breaking the bank)

  2. What is AppsFlyer? Mobile Attribution Measurement and Analytics

  3. (image slide, no text)
  4. Tech Stack - Tools of the Trade

  5. Microservices Architecture - Real-Time Attribution

  6. What does it really feel like?

  7. So, 15 Billion Events you say?
     • SDK installed in 10 Billion mobile devices
     • Every time an App is launched += 1 event
     • Every time a User engages with an Ad += 1 event

  8. So, Why Real Time?
     • Deferred Deeplinking - Opening the App in the correct context
     • Ad Networks optimize bidding in real time

  9. System Structure

  10. Some Numbers
      • 7 Different ELBs
      • ~200-400 EC2 Instances running
      • 5-8 million HTTP requests per minute
      • 20+ Billion Kafka messages per day
      • Multiple Aerospike & Couchbase clusters with Billions of keys

  11. Strong. Light. Cheap - Choose 2
      • Not all traffic has the same business value
      • Choose resilience vs. speed vs. cost according to business value

  12. Saving Mission Critical!
      • 1st App Open is the Install event = mission critical
      • Dual RabbitMQ clusters = 32 r3.2xlarge ~= $500/day

  13. Active-Active RabbitMQ

  14. Saving Money where possible
      • Use Spots - ⅕ of the $$$
      • Auto Scale - according to Applicative Metrics to get the best utilization
      • Replication Factor - where you can afford to lose data
        ◦ Kafka Cluster per business value
        ◦ DB replication factor

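The "Kafka Cluster per business value" idea can be sketched as a small event router. This is a hypothetical illustration, not AppsFlyer's code; the cluster names, tier table, and event types are all assumptions:

```python
# Route each event to a Kafka cluster whose replication factor matches
# the event's business value: mission-critical installs get the most
# replicas, low-value impressions get the cheapest (lossy) cluster.
# All names and numbers below are illustrative.
TIERS = {
    "install":    {"cluster": "kafka-critical", "replication_factor": 3},
    "launch":     {"cluster": "kafka-standard", "replication_factor": 2},
    "impression": {"cluster": "kafka-lossy",    "replication_factor": 1},
}

def route_event(event_type: str) -> dict:
    # Unknown event types fall back to the cheapest tier: losing one
    # is acceptable, so we don't pay for extra replicas.
    return TIERS.get(event_type, TIERS["impression"])
```

The point of the sketch is that replication (and therefore cost) is a per-stream decision, not a global constant.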
  15. AWS Spots - The Stock Market
      • Bidding Mechanism
      • Different instance types and AZs
      • Be ready to replace with On-Demand

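The "be ready to replace with On-Demand" rule can be sketched as a capacity-fill loop over several spot pools (instance type × AZ), topping up with on-demand for whatever the spot market can't cover at an acceptable price. A minimal sketch under assumed prices and pool names, not AppsFlyer's actual tooling:

```python
# Fill a capacity target from the cheapest spot pools first; anything
# the spot market can't cover below max_spot_price is bought on-demand.
def plan_capacity(needed, spot_pools, max_spot_price):
    """spot_pools: list of (pool_name, current_price, available_count)."""
    plan = []
    for name, price, available in sorted(spot_pools, key=lambda p: p[1]):
        if needed <= 0 or price > max_spot_price:
            break
        take = min(needed, available)
        plan.append((name, "spot", take))
        needed -= take
    if needed > 0:
        # Spot capacity ran out or got too expensive: fall back to on-demand.
        plan.append(("on-demand", "on-demand", needed))
    return plan
```

Spreading bids over different instance types and AZs, as the slide suggests, reduces the chance that one price spike evicts the whole fleet at once.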
  16. Auto-Scale
      • Kafka Consumer Lag (in Seconds!!)
      • Keep SLA - don't worry about Load / CPU
      • Scale spots before On-Demand

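Scaling on an applicative metric (consumer lag in seconds) rather than CPU can be sketched as a simple controller. The thresholds, bounds, and function name here are assumptions for illustration, not values from the talk:

```python
# Decide a target instance count from Kafka consumer lag measured in
# seconds of unprocessed work - the SLA metric, not CPU or load.
def desired_instances(lag_seconds, sla_seconds, current, min_n=1, max_n=100):
    if lag_seconds > sla_seconds:
        # Behind SLA: scale up, more aggressively the further behind we are.
        target = current + 1 + int(lag_seconds // sla_seconds)
    elif lag_seconds < sla_seconds * 0.1:
        # Nearly caught up: release capacity gradually.
        target = current - 1
    else:
        target = current
    return max(min_n, min(max_n, target))
```

Per the slide, the added instances would come from spot pools first and on-demand only as a last resort.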
  17. Auto-Scale - Example (chart labels: scale up - spots; scale down - spots; scale up - on demand)

  18. Replication Factor
      • 80% of conversions come from clicks in the last 12 hours
      • In-Memory tier: RF 2, ~24-hour retention
      • SSD tier: RF 1, 30-day retention
      • (diagram: Attribution Service, DB Writes, DB Reads, XDR)
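The two storage tiers on this slide can be sketched as an age-based lookup: since 80% of conversions match clicks from the last 12 hours, recent clicks also live in a fast in-memory tier (RF 2, ~24-hour TTL), while everything is retained cheaply on SSD (RF 1, 30 days). This is my reading of the slide; the tier names and routing function are hypothetical:

```python
# Return the storage tiers a click of the given age should live in.
# Every click stays on cheap SSD (RF 1, 30-day TTL); clicks younger
# than ~24h are additionally kept in memory (RF 2) for fast attribution.
HOUR = 3600

def storage_tiers(click_age_seconds):
    tiers = [("ssd", {"replication_factor": 1, "ttl_seconds": 30 * 24 * HOUR})]
    if click_age_seconds < 24 * HOUR:
        tiers.append(("in_memory", {"replication_factor": 2, "ttl_seconds": 24 * HOUR}))
    return tiers
```

The trade-off is the slide's theme in miniature: pay for replication and RAM only for the window that serves most reads.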
  19. We’re hiring!