
Building high performance push notification server in Go


Tatsuhiko Kubo

August 04, 2017


Transcript

  1. • How to send push notifications to iOS or Android

    devices • Past and present push infrastructure @ Mercari • Why did we develop a push notification server? • Gaurun ~General push notification server in Go~ • Features • Architecture and internals Agenda
  2. • Notification is not pushed to smartphone directly • Notification

    payload is sent to push notification service such as • APNs, GCM/FCM, Amazon SNS, etc… • Only APNs and GCM/FCM are targets in this talk Push notification to iOS or Android device
  3. • APNs • Via HTTPS • TLS certificate and key

    or JWT is required • GCM/FCM • Via HTTPS • Server key is required Communicating with APNs and GCM/FCM
  4. • APNs Binary Provider API (Legacy) • Binary protocol on

    TLS • APNs Provider API • HTTP/2 • Payload is JSON APNs (Apple Push Notification Service)
  5. • GCM (Google Cloud Messaging) • FCM (Firebase Cloud Messaging)

    • Google says in https://developers.google.com/cloud-messaging/ GCM / FCM Firebase Cloud Messaging (FCM) is the new version of GCM. It inherits the reliable and scalable GCM infrastructure, plus new features! See the FAQ to learn more. If you are integrating messaging in a new app, start with FCM. GCM users are strongly recommended to upgrade to FCM, in order to benefit from new FCM features today and in the future.
  6. • High network latency • APNs and GCM/FCM endpoints are

    far away • It takes between tens and hundreds of milliseconds to push • Connection handling • Keep connections alive as long as possible • Frequent connect / close is bad Communicating with APNs and GCM/FCM
  7. • Push notifications kicked asynchronously by in-app events • comment, purchase,

    like, etc… • Push notifications to many customers within 1~2 hours for some campaigns and events • Target number is over tens of millions Push infrastructure requirements @ Mercari High concurrency and low latency are required!
  8. • Push notifications kicked by in-app events such as comment, purchase,

    like, etc… • All logic was implemented in Mercari API • Mercari API is written in PHP (mod_php) • Push notifications to many customers in large-scale campaigns and events • PHP / Ruby script & Amazon SNS Past push infrastructure @ Mercari
  9. • Slow API response • Push was not kicked asynchronously

    by in-app events • High network latency • PHP processes frequently opened and closed connections to APNs and GCM • Low throughput • It took a very long time to push notifications to many users (more than a few hours) Problems of past push infrastructure @ Mercari
  10. • Push runs synchronously! • API response had high network

    latency when in-app event is kicked • Response was returned to client after pushing notification to APNs or GCM 3 years ago
  11. • Push runs asynchronously! • Job queue and worker were

    introduced • Q4M and php-parallel-prefork • Latency in API response was significantly reduced when in-app event is kicked • Throughput was significantly improved 2 years ago
  12. • Job queue: Q4M • Message queue for MySQL •

    https://q4m.github.io/ • Job worker: php-parallel-prefork • Simple prefork server framework in PHP • https://github.com/travail/php-Parallel-Prefork Job queue and worker @ Mercari
  13. • Throughput is not enough • Preforking worker system based

    on PHP is not fast and scalable enough for push notification • PHP is not good at concurrent processing • Push notification processing requires high concurrency to achieve low network latency • APNs and GCM/FCM endpoints are far away Problem still here
  14. • Low latency • Push notifications kicked asynchronously by in-app events

    • comment, purchase, like, etc… • Push notifications to many customers within 1~2 hours in some campaigns and events • Target number is over tens of millions Push infrastructure requirements @ Mercari High concurrency and low latency are required!
  15. • Push infrastructure @ Mercari is built with • nginx: HTTP

    load balancer • Gaurun: HTTP/2 proxy for APNs and GCM/FCM Now
  16. • Push notification server for APNs and GCM/FCM written in

    Go • https://github.com/mercari/gaurun • JSON based API via HTTP • Queueing & pushing notifications to APNs and GCM/FCM asynchronously • Monitoring Gaurun
  17. Send push notification to iPhone by Gaurun $ gaurun -c

    /etc/gaurun/gaurun.toml -p 1056 & $ curl \ -X POST \ -H "Content-Type: application/json" \ "http://127.0.0.1:1056/push" \ -d '{"notifications": [ {"token":["token-string"],"platform": 1,"message":"Hello, iOS"} ] }'
  18. • POST /push • Proxy push-notification requests to APNs and

    GCM/FCM • Respond to the client immediately and push notifications asynchronously • GET /stat/app • Return operational stats as JSON • e.g. channel usage, push success/error counts • GET /stat/go • Return Go stats as JSON • e.g. number of goroutines, memory usage in Go runtime • PUT /config/pushers • Configure push throughput dynamically Gaurun HTTP API
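
The POST /push call described above can be sketched as a small Go client. This is only an illustration: the type names `Notification` and `PushRequest` and the helper `buildPushRequest` are hypothetical, the JSON field names are taken from the payloads shown in these slides, and the address assumes Gaurun is listening on 127.0.0.1:1056 as in the curl example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Notification mirrors one entry of the "notifications" array in the slides.
type Notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"` // 1 = iOS, 2 = Android in the slide examples
	Message  string   `json:"message"`
}

// PushRequest is the request body for POST /push.
type PushRequest struct {
	Notifications []Notification `json:"notifications"`
}

// buildPushRequest (hypothetical helper) prepares a POST /push request.
func buildPushRequest(baseURL string, ns []Notification) (*http.Request, error) {
	body, err := json.Marshal(PushRequest{Notifications: ns})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/push", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildPushRequest("http://127.0.0.1:1056", []Notification{
		{Token: []string{"token-string"}, Platform: 1, Message: "Hello, iOS"},
	})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("push failed (is Gaurun running?):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Because Gaurun responds before the notification is actually delivered, a successful HTTP status here only means the request was accepted for queueing.
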
  19. Why is Gaurun written in Go • High performance HTTP

    server • Go provides net/http package. • High concurrency • Go can handle a very large number of goroutines simultaneously
  20. Go provides net/http package. This alone gives enough performance

    for building a proxy server in Go. package main import ( "fmt" "net/http" ) func handler(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "text/plain") fmt.Fprintf(w, "Hello, World!\n") } func main() { http.HandleFunc("/", handler) http.ListenAndServe(":8080", nil) }
  21. $ ab \ -k \ -c 100 \ -n 100000

    \ "http://127.0.0.1:8080/" 2>&1 | \ grep "Requests per second:" Requests per second: 56127.40 [#/sec] (mean) Simple benchmark on my MacBook Pro
  22. Gaurun internals • Gaurun has 3 components • HTTP API

    server • Proxy for APNs and GCM/FCM • Message queue and multiple workers • Based on goroutines and channels
  23. Job queue and workers by channel and goroutine ・channel is

    available as in-memory queue // channel based queue QueueNotification chan RequestGaurunNotification ・Start workers and initialize queue func StartPushWorkers(workerNum, queueNum int64) { QueueNotification = make(chan RequestGaurunNotification, queueNum) for i := int64(0); i < workerNum; i++ { go pushNotificationWorker() } }
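
The channel-as-queue idea above can be tried in isolation with the following sketch. `Job` is a hypothetical stand-in for `RequestGaurunNotification`, and `runWorkers` is an illustrative helper, not Gaurun's actual code; unlike a long-running server, it closes the queue so that the workers terminate.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical stand-in for RequestGaurunNotification.
type Job struct {
	Token   string
	Message string
}

// runWorkers starts workerNum goroutines that drain a buffered channel of
// size queueNum, and returns how many jobs were handled in total.
func runWorkers(workerNum, queueNum int, jobs []Job) int {
	queue := make(chan Job, queueNum) // the channel acts as the in-memory queue
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // loop ends once the channel is closed and drained
				mu.Lock()
				handled++ // a real worker would push the notification here
				mu.Unlock()
			}
		}()
	}

	for _, j := range jobs {
		queue <- j // blocks while the queue is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	jobs := []Job{{"a", "hello"}, {"b", "hello"}, {"c", "hello"}}
	fmt.Println("handled:", runWorkers(2, 4, jobs))
}
```

The buffer size plays the role of `queueNum`: once it is exhausted, producers block until a worker dequeues a job.
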
  24. Each worker has a pusher pool • worker and pusher are

    goroutines • worker • Dequeue push-job from channel • Run pusher function as a goroutine • pusher • Push notification to APNs and GCM/FCM
  25. Pusher management by atomic package atomic.AddInt64(&pusherCount, 1) if atomic.LoadInt64(&pusherCount) <

    pusherMax { go PusherFunc() } else { … func PusherFunc() { err := push() // error handling atomic.AddInt64(&pusherCount, -1) } Each worker knows only the number of its active pushers.
  26. Connection handling for APNs and GCM/FCM • Gaurun uses http.Client

    in net/http package • http.Client reuses connection transparently • Behavior is configurable by http.Transport • MaxIdleConns • MaxIdleConnsPerHost • IdleConnTimeout • Gaurun provides parameters for configuring them
  27. HTTP proxy server has various timeouts • connection timeout •

    read request timeout • write response timeout • keepalive timeout • A proxy is also a client • proxy connection timeout • proxy write request timeout • proxy read response timeout • proxy keepalive timeout • etc…
  28. Timeouts in net and net/http • net.Dialer • Timeout •

    http.Transport • TLSHandshakeTimeout, IdleConnTimeout, ResponseHeaderTimeout, ExpectContinueTimeout • http.Client • Timeout • http.Server • ReadTimeout, ReadHeaderTimeout, WriteTimeout, IdleTimeout • Gophers should read this article • https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
  29. Monitoring Gaurun status ・GET /stat/app { "queue_max": 12000000, "queue_usage": 515497,

    "pusher_max": 1152, "pusher_count": 640, "ios": { "push_success": 2465897, "push_error": 5704 }, "android": { "push_success": 1416295, "push_error": 4118 } }
  30. Monitoring Go status ・GET /stat/go $ curl -s http://127.0.0.1:1056/stat/go |

    jq '.goroutine_num' 2326 $ curl -s http://127.0.0.1:1056/stat/go | jq '.heap_objects' 27428 $ curl -s http://127.0.0.1:1056/stat/go | jq '.gc_num' 44695 $ …
  31. • Gaurun's behavior is configurable by TOML • Parameter tuning

    is required for high performance • The default configuration is very conservative and not tuned for high performance High Performance Gaurun
  32. Configuration in TOML [core] port = "1056" workers = 8

    queues = 4192 pusher_max = 32 [android] apikey = "…" enabled = true keepalive_conns = 32 [ios] pem_cert_path = "/path/to/cert.pem" pem_key_path = "/path/to/key.pem" enabled = true sandbox = false topic = "…" keepalive_conns = 32
  33. • core.workers • number of goroutines that dequeue push notifications from

    the channel-based queue • core.queues • size of the channel-based queue for push notifications • core.pusher_max • number of goroutines per worker that push notifications to APNs and GCM/FCM Parameter tuning
  34. • (ios|android).timeout • timeout for pushing notification to APNs or

    GCM/FCM • (ios|android).keepalive_conns • number of idle connections kept to APNs or GCM/FCM • (ios|android).keepalive_timeout • how long an idle keep-alive connection to APNs or GCM/FCM is kept • (ios|android).retry_max • maximum retry count for pushing notification to APNs or GCM/FCM Parameter tuning
  35. • Increase the number of simultaneous push notifications • core.workers x

    core.pusher_max • Increase core.queues • If the channel is full, the number of goroutines grows and Gaurun slows down. • Increase (ios|android).keepalive_conns Performance tuning But too large a number is not good!
  36. • POST /push can accept multiple push notifications in a single

    request • Limited by core.notification_max • default value is 100 • Example request payload -> Bulk enqueue { "notifications" : [ { "token" : ["xxx"], "platform" : 1, "message" : "Hello, iOS!" }, { "token" : ["yyy"], "platform" : 2, "message" : "Hello, Android!" } ] }
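
When a sender has more notifications than `core.notification_max` allows per request, it has to split them into batches. A minimal sketch of such client-side batching follows; `Notification` and `chunkNotifications` are hypothetical names, and the limit of 100 is the default quoted above.

```go
package main

import "fmt"

// Notification mirrors one entry of the "notifications" array in the slides.
type Notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

// chunkNotifications splits ns into consecutive batches of at most max items.
// It returns nil when max is not positive.
func chunkNotifications(ns []Notification, max int) [][]Notification {
	if max <= 0 {
		return nil
	}
	var batches [][]Notification
	for len(ns) > 0 {
		n := max
		if len(ns) < n {
			n = len(ns)
		}
		batches = append(batches, ns[:n]) // batches share the backing array of ns
		ns = ns[n:]
	}
	return batches
}

func main() {
	ns := make([]Notification, 250)
	for i, b := range chunkNotifications(ns, 100) {
		fmt.Printf("batch %d: %d notifications\n", i, len(b))
	}
}
```

Each batch would then be sent as one POST /push request body.
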
  37. • Device tokens are sometimes invalidated • Let’s remove invalidated

    tokens from the database periodically • With fewer invalidated device tokens, pushing notifications takes less time • We can tell whether a device token is invalidated from the response from APNs and GCM/FCM Device token screening
  38. Daily device token screening @ Mercari [Diagram: Gaurun outputs a JSON

    log and uploads it to S3; a batch job downloads and parses the JSON log, then issues DELETE to MySQL]
  39. • Push notification has high network latency • High concurrency

    is required • Go is a good choice for a push notification server because • Go provides the useful net/http package • Go can handle a very large number of goroutines simultaneously Conclusion
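  40. • Gaurun • https://github.com/mercari/gaurun • nginxとGoでつくるメルカリのプッシュ通知システム (Mercari's push notification system built with nginx and Go) • http://tech.mercari.com/entry/2015/08/11/172206 • ハイパフォーマンスGaurun

    〜メルカリの大規模プッシュ配信を支えるミドルウェア〜 (High-performance Gaurun: the middleware behind Mercari's large-scale push delivery) • http://tech.mercari.com/entry/2016/11/08/170343 References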