Building high performance push notification server in Go

Tatsuhiko Kubo

August 04, 2017

Transcript

  1. Building high performance push notification server in Go
    Tatsuhiko Kubo (@cubicdaiya), builderscon tokyo, 2017/08/04
  2. @cubicdaiya / Tatsuhiko Kubo Principal Engineer, SRE @ Mercari, Inc.

  3. None
  4. Agenda
    • How to send push notifications to iOS or Android devices
    • Past and present push infrastructure @ Mercari
    • Why did we develop a push notification server?
    • Gaurun: a general push notification server in Go
      • Features
      • Architecture and internals
  5. Push notification to iPhone

  6. Push notification to iOS or Android device
    • A notification is not pushed to the smartphone directly
    • The notification payload is sent to a push notification service such as
      • APNs, GCM/FCM, Amazon SNS, etc.
    • Only APNs and GCM/FCM are covered in this talk
  7. Communicating with APNs and GCM/FCM
    • APNs
      • Via HTTPS
      • A TLS certificate and key, or a JWT, is required
    • GCM/FCM
      • Via HTTPS
      • A server key is required
  8. APNs (Apple Push Notification Service)
    • APNs Binary Provider API (Legacy)
      • Binary protocol on TLS
    • APNs Provider API
      • HTTP/2
      • Payload is JSON
  9. GCM / FCM
    • GCM (Google Cloud Messaging)
    • FCM (Firebase Cloud Messaging)
    • Google says in https://developers.google.com/cloud-messaging/
      "Firebase Cloud Messaging (FCM) is the new version of GCM. It inherits the reliable and scalable GCM infrastructure, plus new features! See the FAQ to learn more. If you are integrating messaging in a new app, start with FCM. GCM users are strongly recommended to upgrade to FCM, in order to benefit from new FCM features today and in the future."
  10. Communicating with APNs and GCM/FCM
    • High network latency
      • APNs and GCM/FCM endpoints are far away
      • A push takes between tens and hundreds of milliseconds
    • Connection handling
      • Keep connections alive as long as possible
      • Frequent connect / close is bad
  11. Push infrastructure @ Mercari

  12. Overview
    [Diagram: app, app, and batch → nginx → Gaurun over HTTP/1.1; Gaurun → APNs and GCM over HTTP/2]
  13. Push infrastructure requirements @ Mercari
    • Push notifications triggered asynchronously by in-app events
      • comment, purchase, like, etc.
    • Push notifications to many customers within 1~2 hours for some campaigns and events
      • The number of targets exceeds tens of millions
    High concurrency and low latency are required!
  14. Past push infrastructure @ Mercari
    • Push notifications triggered by in-app events such as comment, purchase, like, etc.
      • All logic was implemented in the Mercari API
      • The Mercari API is written in PHP (mod_php)
    • Push notifications to many customers in large-scale campaigns and events
      • PHP / Ruby script & Amazon SNS
  15. Problems of past push infrastructure @ Mercari
    • Slow API response
      • Push was not triggered asynchronously by in-app events
    • High network latency
      • PHP processes frequently opened and closed connections to APNs and GCM
    • Low throughput
      • It took a very long time (more than a few hours) to push notifications to many users
  16. 3 years ago

  17. 3 years ago
    • Push ran synchronously!
    • API responses suffered high network latency whenever an in-app event was triggered
      • The response was returned to the client only after the notification had been pushed to APNs or GCM
  18. 2 years ago

  19. 2 years ago
    • Push runs asynchronously!
    • A job queue and workers were introduced
      • Q4M and php-parallel-prefork
    • API response latency was significantly reduced when an in-app event is triggered
    • Throughput was significantly improved
  20. Job queue and worker @ Mercari
    • Job queue: Q4M
      • Message queue for MySQL
      • https://q4m.github.io/
    • Job worker: php-parallel-prefork
      • Simple prefork server framework in PHP
      • https://github.com/travail/php-Parallel-Prefork
  21. Problems still remain
    • Throughput is not enough
    • A preforking worker system based on PHP is not fast or scalable enough for push notification
      • PHP is not good at concurrent processing
      • Push notification processing requires high concurrency to hide high network latency
      • APNs and GCM/FCM endpoints are far away
  22. Push infrastructure requirements @ Mercari
    • Low latency
    • Push notifications triggered asynchronously by in-app events
      • comment, purchase, like, etc.
    • Push notifications to many customers within 1~2 hours for some campaigns and events
      • The number of targets exceeds tens of millions
    High concurrency and low latency are required!
  23. Now
    [Diagram: app, app, and batch → nginx → Gaurun over HTTP/1.1; Gaurun → APNs and GCM over HTTP/2]
  24. Now
    • Push infrastructure @ Mercari is built with
      • nginx: HTTP load balancer
      • Gaurun: HTTP/2 proxy for APNs and GCM/FCM
  25. Gaurun

  26. Gaurun
    • Push notification server for APNs and GCM/FCM written in Go
      • https://github.com/mercari/gaurun
    • JSON-based API via HTTP
    • Queues notifications and pushes them to APNs and GCM/FCM asynchronously
    • Monitoring
  27. Send push notification to iPhone by Gaurun
        $ gaurun -c /etc/gaurun/gaurun.toml -p 1056 &
        $ curl \
            -X POST \
            -H "Content-Type: application/json" \
            "http://127.0.0.1:1056/push" \
            -d '{"notifications": [ {"token":["token-string"],"platform": 1,"message":"Hello, iOS"} ] }'
  28. Gaurun HTTP API
    • POST /push
      • Proxies push notification requests to APNs and GCM/FCM
      • Responds to the client immediately and pushes the notification asynchronously
    • GET /stat/app
      • Returns operational stats as JSON
      • e.g. channel usage, number of push successes/errors
    • GET /stat/go
      • Returns Go stats as JSON
      • e.g. number of goroutines, memory usage in the Go runtime
    • PUT /config/pushers
      • Configures push throughput dynamically
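The "respond immediately, push asynchronously" behaviour of POST /push can be sketched with net/http and a channel. This is a minimal illustration, not Gaurun's actual handler: the type names, the `simulatePush` helper, the status codes for bad requests, and the `{"message":"ok"}` response body are all assumptions. Only the request fields (`token`, `platform`, `message`) come from the curl example in the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// notification mirrors the request fields shown in the talk's curl example.
type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// pushHandler decodes the request, hands the notifications to a goroutine
// that enqueues them, and responds without waiting for delivery.
func pushHandler(queue chan<- notification) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var req pushRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid JSON", http.StatusBadRequest)
			return
		}
		// Enqueue in the background so a full queue never delays the response.
		go func() {
			for _, n := range req.Notifications {
				queue <- n
			}
		}()
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprintln(w, `{"message":"ok"}`) // placeholder body (assumption)
	}
}

// simulatePush (hypothetical helper) calls the handler in-process and then
// receives `want` notifications from the queue.
func simulatePush(method, body string, want int) (int, []notification) {
	queue := make(chan notification, 16)
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, "/push", strings.NewReader(body))
	pushHandler(queue)(rec, req)

	var queued []notification
	for i := 0; i < want; i++ {
		queued = append(queued, <-queue)
	}
	return rec.Code, queued
}

func main() {
	body := `{"notifications":[{"token":["token-string"],"platform":1,"message":"Hello, iOS"}]}`
	status, queued := simulatePush(http.MethodPost, body, 1)
	fmt.Println(status, len(queued), queued[0].Message)
}
```

Enqueueing from a separate goroutine keeps the response fast even when the queue is full, at the cost of goroutines piling up behind the channel, which is why the queue size matters for tuning.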
  29. Why is Gaurun written in Go?
    • High performance HTTP server
      • Go provides the net/http package
    • High concurrency
      • Go can handle a very large number of goroutines simultaneously
  30. Go provides the net/http package. It alone gives enough performance for building a proxy server in Go.
        package main

        import (
            "fmt"
            "log"
            "net/http"
        )

        // handler answers every request with a plain-text greeting.
        func handler(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/plain")
            fmt.Fprintf(w, "Hello, World!\n")
        }

        func main() {
            http.HandleFunc("/", handler)
            // ListenAndServe only returns on error, so report it.
            log.Fatal(http.ListenAndServe(":8080", nil))
        }
  31. Simple benchmark on my MacBook Pro
        $ ab \
            -k \
            -c 100 \
            -n 100000 \
            "http://127.0.0.1:8080/" 2>&1 | \
            grep "Requests per second:"
        Requests per second: 56127.40 [#/sec] (mean)
  32. Gaurun Internals

  33. Gaurun internals
    • Gaurun has 3 components
      • HTTP API server
      • Proxy for APNs and GCM/FCM
      • Message queue and multiple workers
        • Based on goroutines and channels
  34. Push process in Gaurun

  35. Job queue and workers by channel and goroutine
    ・A channel is available as an in-memory queue
        // channel-based queue
        var QueueNotification chan RequestGaurunNotification
    ・Start workers and initialize the queue
        func StartPushWorkers(workerNum, queueNum int64) {
            // the buffered channel holds up to queueNum pending notifications
            QueueNotification = make(chan RequestGaurunNotification, queueNum)
            for i := int64(0); i < workerNum; i++ {
                go pushNotificationWorker()
            }
        }
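The channel-as-queue pattern above can be tried in isolation. The following is a self-contained sketch of the same idea rather than Gaurun's code: `Notification`, `startWorkers`, and `processAll` are hypothetical names, and the workers here stop when the channel is closed so that the program can exit, which a long-running server would not normally do.

```go
package main

import (
	"fmt"
	"sync"
)

// Notification is a hypothetical stand-in for RequestGaurunNotification.
type Notification struct {
	Token   string
	Message string
}

// startWorkers creates a buffered channel of size queueNum and starts
// workerNum goroutines that drain it, calling handle for each item.
// The WaitGroup is released once the channel is closed and fully drained.
func startWorkers(workerNum, queueNum int, handle func(Notification)) (chan<- Notification, *sync.WaitGroup) {
	queue := make(chan Notification, queueNum)
	var wg sync.WaitGroup
	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range queue {
				handle(n)
			}
		}()
	}
	return queue, &wg
}

// processAll enqueues every notification, waits for the workers to finish,
// and returns how many notifications were handled.
func processAll(workerNum, queueNum int, ns []Notification) int {
	var mu sync.Mutex
	handled := 0
	queue, wg := startWorkers(workerNum, queueNum, func(Notification) {
		mu.Lock()
		handled++
		mu.Unlock()
	})
	for _, n := range ns {
		queue <- n // blocks while the queue is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	ns := []Notification{{"a", "Hello"}, {"b", "Hello"}, {"c", "Hello"}}
	fmt.Println("handled:", processAll(2, 8, ns))
}
```

The buffer size plays the role of `core.queues` and the worker count the role of `core.workers`: a sender only blocks once the buffer is full.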
  36. Each worker has a pusher pool
    • Workers and pushers are goroutines
    • worker
      • Dequeues a push job from the channel
      • Runs the pusher function in a goroutine
    • pusher
      • Pushes the notification to APNs and GCM/FCM
  37. Pusher management by atomic package
        // worker
        atomic.AddInt64(&pusherCount, 1)
        if atomic.LoadInt64(&pusherCount) < pusherMax {
            go PusherFunc()
        } else {
            …

        // pusher
        func PusherFunc() {
            err := push()
            // error handling
            atomic.AddInt64(&pusherCount, -1)
        }
    Each worker knows only the number of its own active pushers.
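A runnable sketch of this per-worker limit follows. It is an illustration under assumptions, not the code from the slide: it checks the counter before incrementing it, and when the limit is reached it runs the push inline in the worker, since the slide elides that branch. `runWorker` and `countPushes` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorker processes jobs in order. While fewer than pusherMax pushers are
// active, each job gets its own pusher goroutine; otherwise the worker runs
// the push itself (an assumed fallback). It returns how many jobs took each path.
func runWorker(jobs []string, pusherMax int64, push func(string)) (async, inline int) {
	var (
		pusherCount int64 // active pushers of this worker only
		wg          sync.WaitGroup
	)
	for _, job := range jobs {
		if atomic.LoadInt64(&pusherCount) < pusherMax {
			atomic.AddInt64(&pusherCount, 1)
			async++
			wg.Add(1)
			go func(j string) {
				defer wg.Done()
				// Release the slot when the push finishes.
				defer atomic.AddInt64(&pusherCount, -1)
				push(j)
			}(job)
		} else {
			inline++
			push(job)
		}
	}
	wg.Wait()
	return async, inline
}

// countPushes runs one worker over n dummy jobs and reports the total number
// of pushes together with the async/inline split.
func countPushes(n int, pusherMax int64) (pushed int64, async, inline int) {
	jobs := make([]string, n)
	async, inline = runWorker(jobs, pusherMax, func(string) {
		atomic.AddInt64(&pushed, 1)
	})
	return pushed, async, inline
}

func main() {
	pushed, async, inline := countPushes(1000, 32)
	fmt.Println("pushed:", pushed, "async:", async, "inline:", inline)
}
```

Because only the worker goroutine increments the counter, the number of pusher goroutines it owns never exceeds `pusherMax`, and no lock is needed.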
  38. Push process in Gaurun

  39. Connection handling for APNs and GCM/FCM
    • Gaurun uses http.Client in the net/http package
      • http.Client reuses connections transparently
      • Behavior is configurable via http.Transport
        • MaxIdleConns
        • MaxIdleConnsPerHost
        • IdleConnTimeout
    • Gaurun provides parameters for configuring them
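As a sketch of how these http.Transport fields fit together, the function below builds a client whose idle-connection pool is sized explicitly. The function names and the idea of feeding one "keepalive connections" value into both MaxIdleConns and MaxIdleConnsPerHost are assumptions for illustration, not a description of how Gaurun maps its own parameters.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newPushTransport keeps up to keepaliveConns idle connections per host
// open for keepaliveTimeout so that later requests can reuse them.
func newPushTransport(keepaliveConns int, keepaliveTimeout time.Duration) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        keepaliveConns, // limit across all hosts
		MaxIdleConnsPerHost: keepaliveConns, // net/http defaults to 2 per host
		IdleConnTimeout:     keepaliveTimeout,
	}
}

// newPushClient wraps the transport in a client with an overall request timeout.
func newPushClient(keepaliveConns int, keepaliveTimeout, timeout time.Duration) *http.Client {
	return &http.Client{
		Transport: newPushTransport(keepaliveConns, keepaliveTimeout),
		Timeout:   timeout,
	}
}

func main() {
	c := newPushClient(32, 90*time.Second, 5*time.Second)
	tr := c.Transport.(*http.Transport)
	fmt.Println(tr.MaxIdleConnsPerHost, tr.IdleConnTimeout, c.Timeout)
}
```

The per-host limit is the one that matters for a proxy that talks to only one or two upstream hosts: with the default of 2, most connections would be closed after use instead of being kept alive.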
  40. An HTTP proxy server has various timeouts
    • connection timeout
    • read request timeout
    • write response timeout
    • keepalive timeout
    • A proxy is also a client
      • proxy connection timeout
      • proxy write request timeout
      • proxy read response timeout
      • proxy keepalive timeout
    • etc.
  41. Timeouts in net and net/http
    • net.Dialer
      • Timeout
    • http.Transport
      • TLSHandshakeTimeout, IdleConnTimeout, ResponseHeaderTimeout, ExpectContinueTimeout
    • http.Client
      • Timeout
    • http.Server
      • ReadTimeout, ReadHeaderTimeout, WriteTimeout, IdleTimeout
    • Gophers should read this article
      • https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
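To show where each of these fields lives, here is a minimal sketch that sets every timeout listed above on both the server side and the client side of a proxy. The durations are arbitrary placeholders chosen for illustration, not recommended or Gaurun-specific values, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newServer sets the server-side timeouts: how long a client may take to
// send its request and how long the server may take to answer.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // request headers
		ReadTimeout:       10 * time.Second, // whole request including body
		WriteTimeout:      10 * time.Second, // writing the response
		IdleTimeout:       60 * time.Second, // keep-alive between requests
	}
}

// newTransport sets the client-side timeouts for connecting upstream.
func newTransport() *http.Transport {
	dialer := &net.Dialer{Timeout: 3 * time.Second} // TCP connect
	return &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   3 * time.Second,
		ResponseHeaderTimeout: 5 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
		IdleConnTimeout:       90 * time.Second,
	}
}

// newClient adds an overall deadline that covers the whole exchange.
func newClient() *http.Client {
	return &http.Client{Transport: newTransport(), Timeout: 10 * time.Second}
}

func main() {
	srv := newServer("127.0.0.1:8080", http.NotFoundHandler())
	c := newClient()
	fmt.Println(srv.ReadHeaderTimeout, srv.IdleTimeout, c.Timeout)
}
```

Left at their zero values, most of these fields mean "no timeout", so a proxy that sets none of them can accumulate stuck connections on either side.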
  42. Monitoring Gaurun

  43. Monitoring Gaurun status
    ・GET /stat/app
        {
            "queue_max": 12000000,
            "queue_usage": 515497,
            "pusher_max": 1152,
            "pusher_count": 640,
            "ios": {
                "push_success": 2465897,
                "push_error": 5704
            },
            "android": {
                "push_success": 1416295,
                "push_error": 4118
            }
        }
  44. Monitoring Go status
    ・GET /stat/go
        $ curl -s http://127.0.0.1:1056/stat/go | jq '.goroutine_num'
        2326
        $ curl -s http://127.0.0.1:1056/stat/go | jq '.heap_objects'
        27428
        $ curl -s http://127.0.0.1:1056/stat/go | jq '.gc_num'
        44695
        $ …
  45. ・number of goroutine ・memory allocation

  46. push success rate
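A push success rate like the one graphed here can be derived from the GET /stat/app counters. The sketch below decodes the JSON sample from the talk and computes the rate per platform. The struct and function names are hypothetical, and treating the rate as success / (success + error) is an assumption about how the graph is defined.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type platformStat struct {
	PushSuccess int64 `json:"push_success"`
	PushError   int64 `json:"push_error"`
}

type appStat struct {
	QueueMax    int64        `json:"queue_max"`
	QueueUsage  int64        `json:"queue_usage"`
	PusherMax   int64        `json:"pusher_max"`
	PusherCount int64        `json:"pusher_count"`
	IOS         platformStat `json:"ios"`
	Android     platformStat `json:"android"`
}

// successRate returns success / (success + error), or 0 when nothing was pushed.
func (p platformStat) successRate() float64 {
	total := p.PushSuccess + p.PushError
	if total == 0 {
		return 0
	}
	return float64(p.PushSuccess) / float64(total)
}

// parseAppStat decodes a /stat/app response body.
func parseAppStat(body []byte) (appStat, error) {
	var s appStat
	err := json.Unmarshal(body, &s)
	return s, err
}

// sample is the /stat/app output shown in the talk.
const sample = `{
  "queue_max": 12000000, "queue_usage": 515497,
  "pusher_max": 1152, "pusher_count": 640,
  "ios": {"push_success": 2465897, "push_error": 5704},
  "android": {"push_success": 1416295, "push_error": 4118}
}`

func main() {
	s, err := parseAppStat([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("ios success rate:     %.4f\n", s.IOS.successRate())
	fmt.Printf("android success rate: %.4f\n", s.Android.successRate())
	fmt.Printf("queue usage:          %.1f%%\n", 100*float64(s.QueueUsage)/float64(s.QueueMax))
}
```

In practice the body would come from an HTTP GET against the running server. Because the counters are cumulative, a monitoring agent would normally compare two successive samples rather than use the lifetime totals.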

  47. Threshold alerting with Mackerel

  48. High Performance Gaurun

  49. High Performance Gaurun
    • Gaurun's behavior is configurable via TOML
    • Parameter tuning is required for high performance
      • The default configuration is very conservative and does not deliver high performance
  50. Configuration in TOML
        [core]
        port = "1056"
        workers = 8
        queues = 4192
        pusher_max = 32

        [android]
        apikey = "…"
        enabled = true
        keepalive_conns = 32

        [ios]
        pem_cert_path = "/path/to/cert.pem"
        pem_key_path = "/path/to/key.pem"
        enabled = true
        sandbox = false
        topic = "…"
        keepalive_conns = 32
  51. Parameter tuning
    • core.workers
      • number of goroutines that dequeue push notifications from the channel-based queue
    • core.queues
      • size of the channel-based queue for push notifications
    • core.pusher_max
      • number of goroutines per worker that push notifications to APNs and GCM/FCM
  52. Parameter tuning
    • (ios|android).timeout
      • timeout for pushing a notification to APNs or GCM/FCM
    • (ios|android).keepalive_conns
      • number of idle connections kept open to APNs or GCM/FCM
    • (ios|android).keepalive_timeout
      • how long a keep-alive connection to APNs or GCM/FCM is kept open
    • (ios|android).retry_max
      • maximum retry count for pushing a notification to APNs or GCM/FCM
  53. Performance tuning
    • Increase the number of simultaneous push notifications
      • core.workers x core.pusher_max
    • Increase core.queues
      • If the channel is full, the number of goroutines grows and Gaurun slows down
    • Increase (ios|android).keepalive_conns
    But too large a number is not good!
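The effect of core.workers x core.pusher_max can be estimated with simple arithmetic. The sketch below multiplies the two settings and applies a rough steady-state estimate, throughput ≈ concurrency / latency. That estimate and the 100 ms latency figure are assumptions for illustration (the talk only says a push takes tens to hundreds of milliseconds), and real throughput also depends on the upstream services and the network.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// maxConcurrentPushes is the upper bound on simultaneous pusher goroutines:
// core.workers x core.pusher_max.
func maxConcurrentPushes(workers, pusherMax int) int {
	return workers * pusherMax
}

// estimatedThroughput returns pushes per second if every slot is always busy
// and each push takes `latency` (a rough steady-state estimate).
func estimatedThroughput(concurrency int, latency time.Duration) float64 {
	return float64(concurrency) / latency.Seconds()
}

// requiredConcurrency returns how many pushes must be in flight at once to
// deliver `targets` notifications within `window` at the given latency.
func requiredConcurrency(targets int, window, latency time.Duration) int {
	perSecond := float64(targets) / window.Seconds()
	return int(math.Ceil(perSecond * latency.Seconds()))
}

func main() {
	// workers = 8 and pusher_max = 32 are the values from the TOML slide.
	c := maxConcurrentPushes(8, 32)
	latency := 100 * time.Millisecond // assumed round-trip time

	fmt.Println("max concurrent pushes:", c)
	fmt.Printf("estimated pushes/sec: %.0f\n", estimatedThroughput(c, latency))
	fmt.Println("concurrency for 20M pushes in 1h:",
		requiredConcurrency(20000000, time.Hour, latency))
}
```

Under these assumptions the sample configuration falls short of a 20-million-per-hour campaign, which illustrates why the defaults need tuning. Raising the numbers far past the requirement mostly adds idle goroutines and connections.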
  54. Configure push throughput dynamically
        $ curl \
            -XPUT \
            "http://127.0.0.1:1056/config/pushers?max=32"
    Configures core.pusher_max via HTTP dynamically.
    Do not set too large a number!
  55. Bulk enqueue
    • POST /push can accept multiple push notifications in a single request
      • Limited by core.notification_max
      • The default value is 100
    • Example request payload
        {
            "notifications" : [
                {
                    "token" : ["xxx"],
                    "platform" : 1,
                    "message" : "Hello, iOS!"
                },
                {
                    "token" : ["yyy"],
                    "platform" : 2,
                    "message" : "Hello, Android!"
                }
            ]
        }
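A client that sends many notifications has to respect the per-request limit. The sketch below splits a slice of notifications into request payloads of at most `max` entries each and encodes one as JSON. The type and function names are hypothetical; the field names and platform values (1 for iOS, 2 for Android) follow the example payload above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// batches splits ns into requests holding at most max notifications each,
// preserving order. It returns nil when max is not positive.
func batches(ns []notification, max int) []pushRequest {
	if max <= 0 {
		return nil
	}
	var reqs []pushRequest
	for start := 0; start < len(ns); start += max {
		end := start + max
		if end > len(ns) {
			end = len(ns)
		}
		reqs = append(reqs, pushRequest{Notifications: ns[start:end]})
	}
	return reqs
}

// encode renders one request as the JSON body for POST /push.
func encode(req pushRequest) (string, error) {
	b, err := json.Marshal(req)
	return string(b), err
}

func main() {
	ns := []notification{
		{Token: []string{"xxx"}, Platform: 1, Message: "Hello, iOS!"},
		{Token: []string{"yyy"}, Platform: 2, Message: "Hello, Android!"},
	}
	for _, req := range batches(ns, 100) {
		body, err := encode(req)
		if err != nil {
			fmt.Println("encode error:", err)
			return
		}
		fmt.Println(body)
	}
}
```

Each encoded body can then be sent the same way as the curl example for POST /push.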
  56. Device token screening
    • Device tokens are sometimes invalidated
    • Let's remove invalidated tokens from the database periodically
      • With fewer invalidated device tokens, pushing notifications takes less time
    • The responses from APNs and GCM/FCM tell us whether a device token has been invalidated
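The screening step can be sketched as a small log filter. The log layout assumed here (one JSON object per line with `type`, `token`, and `message` fields, and a `failed-push` type value) is a hypothetical stand-in rather than Gaurun's documented format. The reason strings are ones that APNs (`Unregistered`, `BadDeviceToken`) and GCM/FCM (`NotRegistered`, `InvalidRegistration`) return for tokens that should no longer be used.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logEntry holds the fields this sketch reads from each log line
// (hypothetical layout).
type logEntry struct {
	Type    string `json:"type"`
	Token   string `json:"token"`
	Message string `json:"message"`
}

// invalidReasons are upstream error reasons that indicate a dead token.
var invalidReasons = []string{
	"Unregistered", "BadDeviceToken", // APNs
	"NotRegistered", "InvalidRegistration", // GCM/FCM
}

// invalidTokens scans a JSON-lines log and returns each token, once, whose
// failed push carries one of the invalidReasons. Malformed lines are skipped.
func invalidTokens(r io.Reader) ([]string, error) {
	seen := make(map[string]bool)
	var tokens []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Type != "failed-push" || e.Token == "" || seen[e.Token] {
			continue
		}
		for _, reason := range invalidReasons {
			if strings.Contains(e.Message, reason) {
				seen[e.Token] = true
				tokens = append(tokens, e.Token)
				break
			}
		}
	}
	return tokens, sc.Err()
}

// screen is a convenience wrapper for in-memory logs.
func screen(log string) ([]string, error) {
	return invalidTokens(strings.NewReader(log))
}

func main() {
	log := `{"type":"failed-push","token":"aaa","message":"Unregistered"}
{"type":"succeeded-push","token":"bbb","message":""}
{"type":"failed-push","token":"ccc","message":"timeout"}`
	tokens, err := screen(log)
	fmt.Println(tokens, err)
}
```

The resulting token list is what a daily batch would pass to a DELETE against the token table. Failures such as timeouts are deliberately left out because they say nothing about the token itself.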
  57. Daily device token screening @ Mercari
    [Diagram: Gaurun outputs a JSON log and uploads it to S3; a batch downloads and parses the JSON log, then issues DELETE to MySQL]
  58. Conclusion
    • Push notification has high network latency
      • High concurrency is required
    • Go is a good choice for a push notification server, because
      • Go provides the useful net/http package
      • Go can handle a very large number of goroutines simultaneously
  59. References
    • Gaurun
      • https://github.com/mercari/gaurun
    • Mercari's push notification system built with nginx and Go (in Japanese)
      • http://tech.mercari.com/entry/2015/08/11/172206
    • High Performance Gaurun: the middleware behind Mercari's large-scale push delivery (in Japanese)
      • http://tech.mercari.com/entry/2016/11/08/170343