Building high performance push notification server in Go

Tatsuhiko Kubo

August 04, 2017

Transcript

  1. Tatsuhiko Kubo @cubicdaiya
    buildercon tokyo 2017/08/04
    Building high performance
    push notification server in Go

  2. @cubicdaiya / Tatsuhiko Kubo
    Principal Engineer, SRE @ Mercari, Inc.

  4. Agenda
    • How to send push notifications to iOS or Android devices
    • Past and present push infrastructure @ Mercari
    • Why did we develop a push notification server?
    • Gaurun ~General push notification server in Go~
    • Features
    • Architecture and internals

  5. Push notification to iPhone

  6. Push notification to iOS or Android device
    • Notifications are not pushed to the smartphone directly
    • The notification payload is sent to a push notification
    service such as
    • APNs, GCM/FCM, Amazon SNS, etc.
    • Only APNs and GCM/FCM are covered in this talk

  7. Communicating with APNs and GCM/FCM
    • APNs
    • Via HTTPS
    • TLS certificate and key or JWT is required
    • GCM/FCM
    • Via HTTPS
    • Server key is required

  8. APNs (Apple Push Notification Service)
    • APNs Binary Provider API (Legacy)
    • Binary protocol on TLS
    • APNs Provider API
    • HTTP/2
    • Payload is JSON

  9. GCM / FCM
    • GCM (Google Cloud Messaging)
    • FCM (Firebase Cloud Messaging)
    • Google says in https://developers.google.com/cloud-messaging/
    "Firebase Cloud Messaging (FCM) is the new version of GCM.
    It inherits the reliable and scalable GCM infrastructure, plus
    new features! See the FAQ to learn more. If you are integrating
    messaging in a new app, start with FCM. GCM users are
    strongly recommended to upgrade to FCM, in order to benefit
    from new FCM features today and in the future."

  10. Communicating with APNs and GCM/FCM
    • High network latency
    • APNs and GCM/FCM endpoints are far away
    • A push takes between tens and hundreds of milliseconds
    • Connection handling
    • Keep connections alive as long as possible
    • Frequent connect / close is bad

  11. Push infrastructure @ Mercari

  12. Overview
    [Diagram: batch and app servers → HTTP/1.1 → nginx → HTTP/1.1 → Gaurun → HTTP/2 → APNs and GCM]

  13. Push infrastructure requirements @ Mercari
    • Push notifications are triggered asynchronously by in-app events
    • comment, purchase, like, etc.
    • Push notifications are sent to many customers within 1~2
    hours for some campaigns and events
    • The number of targets is over tens of millions
    High concurrency and low latency are required!

  14. Past push infrastructure @ Mercari
    • Push notifications triggered by in-app events such as
    comment, purchase, like, etc.
    • All logic was implemented in the Mercari API
    • The Mercari API is written in PHP (mod_php)
    • Push notifications to many customers in large-scale
    campaigns and events
    • PHP / Ruby script & Amazon SNS

  15. Problems of the past push infrastructure @ Mercari
    • Slow API response
    • Push was not triggered asynchronously by in-app events
    • High network latency
    • PHP processes frequently opened and closed connections to APNs
    and GCM
    • Low throughput
    • It took a very long time (more than a few hours) to push
    notifications to many users

  16. 3 years ago

  17. 3 years ago
    • Push ran synchronously!
    • API responses had high latency
    when an in-app event was triggered
    • The response was returned to the client only after
    the notification had been pushed to APNs or GCM

  18. 2 years ago

  19. 2 years ago
    • Push ran asynchronously!
    • A job queue and workers were introduced
    • Q4M and php-parallel-prefork
    • API response latency was significantly
    reduced when an in-app event was triggered
    • Throughput was significantly improved

  20. Job queue and worker @ Mercari
    • Job queue: Q4M
    • Message queue for MySQL
    • https://q4m.github.io/
    • Job worker: php-parallel-prefork
    • Simple prefork server framework in PHP
    • https://github.com/travail/php-Parallel-Prefork

  21. Problems still remained
    • Throughput was not enough
    • A PHP-based preforking worker system is not
    fast or scalable enough for push notification
    • PHP is not good at concurrent processing
    • Push notification processing requires high
    concurrency to achieve low latency
    • APNs and GCM/FCM endpoints are far away

  22. Push infrastructure requirements @ Mercari
    • Low latency
    • Push notifications are triggered asynchronously by in-app events
    • comment, purchase, like, etc.
    • Push notifications are sent to many customers within 1~2
    hours for some campaigns and events
    • The number of targets is over tens of millions
    High concurrency and low latency are required!

  23. Now
    [Diagram: batch and app servers → HTTP/1.1 → nginx → HTTP/1.1 → Gaurun → HTTP/2 → APNs and GCM]

  24. Now
    • Push infrastructure @ Mercari is built with
    • nginx: HTTP load balancer
    • Gaurun: HTTP/2 proxy for APNs and GCM/FCM

  25. Gaurun

  26. Gaurun
    • Push notification server for APNs and GCM/FCM
    written in Go
    • https://github.com/mercari/gaurun
    • JSON-based API via HTTP
    • Queues notifications and pushes them to APNs
    and GCM/FCM asynchronously
    • Monitoring

  27. Send push notification to iPhone by Gaurun
    $ gaurun -c /etc/gaurun/gaurun.toml -p 1056 &
    $ curl \
    -X POST \
    -H "Content-Type: application/json" \
    "http://127.0.0.1:1056/push" \
    -d '{"notifications": [ {"token":["token-string"],"platform":1,"message":"Hello, iOS"} ] }'

  28. Gaurun HTTP API
    • POST /push
    • Proxies push notification requests to APNs and GCM/FCM
    • Responds to the client immediately and pushes notifications asynchronously
    • GET /stat/app
    • Returns operational stats as JSON
    • e.g. channel usage, number of push successes/errors
    • GET /stat/go
    • Returns Go runtime stats as JSON
    • e.g. number of goroutines, memory usage in Go runtime
    • PUT /config/pushers
    • Configures push throughput dynamically
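
The "respond immediately, push asynchronously" behavior of POST /push can be sketched roughly as follows. This is a minimal illustration and not Gaurun's actual source: the `notification` and `pushRequest` types, the `queue` variable, its capacity, the `pushHandler` and `postPush` names, and the 429 reply on a full queue are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// notification mirrors the JSON fields shown in the slides (hypothetical type).
type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

// pushRequest is the request body of POST /push (hypothetical type).
type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// queue is an in-memory job queue; the capacity is an arbitrary assumption.
var queue = make(chan notification, 1024)

// pushHandler enqueues every notification and replies without waiting
// for delivery to APNs or GCM/FCM.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req pushRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	for _, n := range req.Notifications {
		select {
		case queue <- n: // accepted; a worker would push it later
		default: // queue full: reject instead of blocking the client (assumption)
			http.Error(w, "queue is full", http.StatusTooManyRequests)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, `{"message":"ok"}`)
}

// postPush sends body to pushHandler in-process and returns the status code.
func postPush(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/push", strings.NewReader(body))
	rec := httptest.NewRecorder()
	pushHandler(rec, req)
	return rec.Code
}

func main() {
	body := `{"notifications":[{"token":["token-string"],"platform":1,"message":"Hello, iOS"}]}`
	fmt.Println(postPush(body), len(queue))
}
```

The handler never talks to APNs or GCM/FCM itself, so its response time does not depend on the latency of those services.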

  29. Why is Gaurun written in Go?
    • High performance HTTP server
    • Go provides the net/http package
    • High concurrency
    • Go can handle a very large number of goroutines
    simultaneously

  30. Go provides the net/http package. It alone gives enough
    performance for building a proxy server in Go.
    package main
    import (
        "fmt"
        "net/http"
    )
    // handler replies with a plain-text greeting.
    func handler(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/plain")
        fmt.Fprintf(w, "Hello, World!\n")
    }
    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }

  31. Simple benchmark on my MacBook Pro
    $ ab \
    -k \
    -c 100 \
    -n 100000 \
    "http://127.0.0.1:8080/" 2>&1 | \
    grep "Requests per second:"
    Requests per second: 56127.40 [#/sec] (mean)

  32. Gaurun Internals

  33. Gaurun internals
    • Gaurun has 3 components
    • HTTP API server
    • Proxy for APNs and GCM/FCM
    • Message queue and multiple workers
    • Based on goroutines and channels

  34. Push process in Gaurun

  35. Job queue and workers by channel and goroutine
    ・A channel can be used as an in-memory queue
    // channel based queue
    var QueueNotification chan RequestGaurunNotification
    ・Start workers and initialize the queue
    func StartPushWorkers(workerNum, queueNum int64) {
        QueueNotification = make(chan RequestGaurunNotification, queueNum)
        for i := int64(0); i < workerNum; i++ {
            go pushNotificationWorker()
        }
    }
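
The channel-as-queue pattern above can be tried out in isolation with a sketch like the following. It assumes a simplified `job` type in place of RequestGaurunNotification and a hypothetical `processAll` helper; the "push" is only a counter increment.

```go
package main

import (
	"fmt"
	"sync"
)

// job stands in for RequestGaurunNotification (hypothetical, simplified).
type job struct {
	Token   string
	Message string
}

// processAll starts workerNum workers reading from a buffered channel of
// size queueNum, feeds all jobs in, and returns how many were handled.
func processAll(jobs []job, workerNum, queueNum int) int {
	queue := make(chan job, queueNum)
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		handled int
	)
	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // blocks until a job arrives or the queue is closed
				mu.Lock()
				handled++ // a real worker would push to APNs or GCM/FCM here
				mu.Unlock()
			}
		}()
	}
	for _, j := range jobs {
		queue <- j // blocks while the queue is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	jobs := make([]job, 1000)
	fmt.Println(processAll(jobs, 8, 64)) // prints 1000
}
```

A buffered channel gives back-pressure for free: producers block once `queueNum` jobs are waiting, which is why the queue size is worth tuning.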

  36. Each worker has a pusher pool
    • Workers and pushers are goroutines
    • worker
    • Dequeues a push job from the channel
    • Runs the pusher function in a goroutine
    • pusher
    • Pushes the notification to APNs or GCM/FCM

  37. Pusher management by atomic package
    worker:
    atomic.AddInt64(&pusherCount, 1)
    if atomic.LoadInt64(&pusherCount) < pusherMax {
        go PusherFunc()
    } else {
    // (truncated on the slide)
    pusher:
    func PusherFunc() {
        err := push()
        // error handling
        atomic.AddInt64(&pusherCount, -1)
    }
    Each worker knows only the number of its own active pushers.
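
A self-contained sketch of limiting pusher goroutines with an atomic counter might look like this. The `runWorker` name and the synchronous fallback in the else branch are assumptions (the slide does not show that branch), and the counter is incremented only after the limit check so that it never exceeds `pusherMax`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorker handles n jobs. Up to pusherMax jobs run in their own pusher
// goroutine at once; when the limit is reached the worker runs the job
// itself. It returns the highest pusher count observed by the worker.
func runWorker(n int, pusherMax int64) int64 {
	var (
		pusherCount int64 // active pusher goroutines of this worker
		peak        int64 // written only by the worker goroutine
		wg          sync.WaitGroup
	)
	push := func() {} // placeholder for the real push to APNs or GCM/FCM

	for i := 0; i < n; i++ {
		if atomic.LoadInt64(&pusherCount) < pusherMax {
			// Only the worker increments, pushers only decrement,
			// so the count cannot exceed pusherMax.
			cur := atomic.AddInt64(&pusherCount, 1)
			if cur > peak {
				peak = cur
			}
			wg.Add(1)
			go func() {
				defer wg.Done()
				push()
				atomic.AddInt64(&pusherCount, -1)
			}()
		} else {
			push() // synchronous fallback (assumption)
		}
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runWorker(1000, 4)
	fmt.Println(peak >= 1 && peak <= 4) // prints true
}
```

Because each worker keeps its own counter, no lock is shared between workers; the total upper bound is simply the number of workers times `pusherMax`.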

  38. Push process in Gaurun

  39. Connection handling for APNs and GCM/FCM
    • Gaurun uses http.Client in the net/http package
    • http.Client reuses connections transparently
    • Behavior is configurable via http.Transport
    • MaxIdleConns
    • MaxIdleConnsPerHost
    • IdleConnTimeout
    • Gaurun provides parameters for configuring them
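
As a rough sketch of the three http.Transport fields listed above: the `newPushTransport` helper and the way a keepalive_conns-style value is mapped onto both MaxIdleConns and MaxIdleConnsPerHost are assumptions for illustration, not Gaurun's confirmed implementation. Note that net/http keeps only 2 idle connections per host by default (http.DefaultMaxIdleConnsPerHost), which is why raising this value matters for a proxy that talks to very few hosts.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newPushTransport returns a Transport that keeps up to keepaliveConns idle
// connections open for reuse and closes them after keepaliveTimeoutSec
// seconds without traffic. (Hypothetical helper.)
func newPushTransport(keepaliveConns, keepaliveTimeoutSec int) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        keepaliveConns,
		MaxIdleConnsPerHost: keepaliveConns,
		IdleConnTimeout:     time.Duration(keepaliveTimeoutSec) * time.Second,
	}
}

func main() {
	tr := newPushTransport(32, 90)
	// Share one client across all pushers so connections are actually reused.
	client := &http.Client{Transport: tr}
	_ = client
	fmt.Println(tr.MaxIdleConnsPerHost, tr.IdleConnTimeout) // prints 32 1m30s
}
```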

  40. HTTP proxy server has various timeouts
    • connection timeout
    • read request timeout
    • write response timeout
    • keepalive timeout
    • A proxy is also a client
    • proxy connection timeout
    • proxy write request timeout
    • proxy read response timeout
    • proxy keepalive timeout
    • etc…

  41. Timeouts in net and net/http
    • net.Dialer
    • Timeout
    • http.Transport
    • TLSHandshakeTimeout, IdleConnTimeout, ResponseHeaderTimeout,
    ExpectContinueTimeout
    • http.Client
    • Timeout
    • http.Server
    • ReadTimeout, ReadHeaderTimeout, WriteTimeout, IdleTimeout
    • Gophers should read this article
    • https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
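
The timeout fields listed above could be set as in the sketch below. The `newServer` and `newClient` helpers and every duration are arbitrary assumptions chosen for illustration, not recommended values or Gaurun's settings.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newServer sets the server-side timeouts (the proxy as a server).
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // time allowed to read request headers
		ReadTimeout:       10 * time.Second, // time allowed to read the whole request
		WriteTimeout:      10 * time.Second, // time allowed to write the response
		IdleTimeout:       60 * time.Second, // keep-alive wait for the next request
	}
}

// newClient sets the client-side timeouts (the proxy as a client).
func newClient() *http.Client {
	dialer := &net.Dialer{Timeout: 3 * time.Second} // TCP connect timeout
	return &http.Client{
		Timeout: 10 * time.Second, // whole request, including reading the body
		Transport: &http.Transport{
			DialContext:           dialer.DialContext,
			TLSHandshakeTimeout:   3 * time.Second,
			ResponseHeaderTimeout: 5 * time.Second,
			ExpectContinueTimeout: 1 * time.Second,
			IdleConnTimeout:       90 * time.Second,
			// A custom DialContext disables automatic HTTP/2 unless this is set.
			ForceAttemptHTTP2: true,
		},
	}
}

func main() {
	s := newServer(":8080", nil)
	c := newClient()
	fmt.Println(s.ReadTimeout, c.Timeout) // prints 10s 10s
}
```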

  42. Monitoring Gaurun

  43. Monitoring Gaurun status
    ・GET /stat/app
    {
    "queue_max": 12000000,
    "queue_usage": 515497,
    "pusher_max": 1152,
    "pusher_count": 640,
    "ios": {
    "push_success": 2465897,
    "push_error": 5704
    },
    "android": {
    "push_success": 1416295,
    "push_error": 4118
    }
    }
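
A monitoring script could decode this JSON and derive ratios such as queue usage or push success rate. The sketch below assumes only the field names shown in the sample response; the `appStat` and `platformStat` types and the `parseAppStat` and `successRate` helpers are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// platformStat holds the per-platform counters from the sample response.
type platformStat struct {
	PushSuccess int64 `json:"push_success"`
	PushError   int64 `json:"push_error"`
}

// appStat mirrors the fields of the sample /stat/app response.
type appStat struct {
	QueueMax    int64        `json:"queue_max"`
	QueueUsage  int64        `json:"queue_usage"`
	PusherMax   int64        `json:"pusher_max"`
	PusherCount int64        `json:"pusher_count"`
	IOS         platformStat `json:"ios"`
	Android     platformStat `json:"android"`
}

// parseAppStat decodes a /stat/app response body.
func parseAppStat(data []byte) (appStat, error) {
	var st appStat
	err := json.Unmarshal(data, &st)
	return st, err
}

// successRate returns the share of successful pushes in percent.
func successRate(p platformStat) float64 {
	total := p.PushSuccess + p.PushError
	if total == 0 {
		return 0
	}
	return 100 * float64(p.PushSuccess) / float64(total)
}

func main() {
	sample := []byte(`{
		"queue_max": 12000000, "queue_usage": 515497,
		"pusher_max": 1152, "pusher_count": 640,
		"ios": {"push_success": 2465897, "push_error": 5704},
		"android": {"push_success": 1416295, "push_error": 4118}
	}`)
	st, err := parseAppStat(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("queue usage: %.1f%%\n", 100*float64(st.QueueUsage)/float64(st.QueueMax))
	fmt.Printf("ios success: %.2f%%\n", successRate(st.IOS))
	fmt.Printf("android success: %.2f%%\n", successRate(st.Android))
}
```

Ratios like these are convenient inputs for graphs and threshold alerts.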

  44. Monitoring Go status
    ・GET /stat/go
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.goroutine_num'
    2326
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.heap_objects'
    27428
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.gc_num'
    44695
    $ …

  45. ・number of goroutines
    ・memory allocation

  46. push success rate

  47. Threshold alerting with Mackerel

  48. High Performance Gaurun

  49. High Performance Gaurun
    • Gaurun's behavior is configurable via TOML
    • Parameter tuning is required for high
    performance
    • The default configuration is very
    conservative and not tuned for high performance

  50. Configuration in TOML
    [core]
    port = "1056"
    workers = 8
    queues = 4192
    pusher_max = 32
    [android]
    apikey = "…"
    enabled = true
    keepalive_conns = 32
    [ios]
    pem_cert_path = "/path/to/cert.pem"
    pem_key_path = "/path/to/key.pem"
    enabled = true
    sandbox = false
    topic = "…"
    keepalive_conns = 32

  51. Parameter tuning
    • core.workers
    • number of goroutines that dequeue push notifications
    from the channel-based queue
    • core.queues
    • size of the channel-based queue for push notifications
    • core.pusher_max
    • number of goroutines per worker that push notifications
    to APNs and GCM/FCM

  52. Parameter tuning
    • (ios|android).timeout
    • timeout for pushing a notification to APNs or GCM/FCM
    • (ios|android).keepalive_conns
    • number of idle connections kept open to APNs or GCM/FCM
    • (ios|android).keepalive_timeout
    • how long an idle keep-alive connection to APNs or GCM/FCM is kept open
    • (ios|android).retry_max
    • maximum retry count for pushing a notification to APNs or GCM/FCM

  53. Performance tuning
    • Increase the number of simultaneous push
    notifications
    • core.workers x core.pusher_max
    • Increase core.queues
    • If the channel is full, the number of goroutines
    grows and Gaurun slows down.
    • Increase (ios|android).keepalive_conns
    But too large a number is not good!
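
To get a feel for the numbers, the upper bound core.workers x core.pusher_max can be compared with a back-of-the-envelope estimate of the concurrency a campaign needs. The sketch below uses Little's law (concurrency = rate x latency); the `requiredConcurrency` helper, the 10 million targets, the 1-hour window, and the 100 ms latency are assumptions for illustration, and retries and multi-token requests are ignored.

```go
package main

import "fmt"

// maxConcurrentPushes is the upper bound on in-flight pushes:
// every worker may run up to pusherMax pusher goroutines.
func maxConcurrentPushes(workers, pusherMax int64) int64 {
	return workers * pusherMax
}

// requiredConcurrency estimates how many pushes must be in flight to send
// targets notifications within windowSec seconds when a single push takes
// latencyMs milliseconds. The result is rounded up.
func requiredConcurrency(targets, windowSec, latencyMs int64) int64 {
	num := targets * latencyMs
	den := windowSec * 1000
	return (num + den - 1) / den
}

func main() {
	// Values from the sample TOML: workers = 8, pusher_max = 32.
	fmt.Println(maxConcurrentPushes(8, 32)) // prints 256
	// Assumed campaign: 10 million targets, 1 hour, 100 ms per push.
	fmt.Println(requiredConcurrency(10000000, 3600, 100)) // prints 278
}
```

Under these assumptions the sample configuration falls slightly short of the target, which illustrates why the defaults need tuning for large campaigns.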

  54. Configure push-throughput dynamically
    curl \
    -XPUT \
    "http://127.0.0.1:1056/config/pushers?max=32"
    Configures core.pusher_max dynamically via HTTP
    Do not give too large a number!
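
The same PUT request can be issued from Go, for example from an operations tool. In this sketch `setPusherMax` is a hypothetical helper, and `fakeGaurun` is a stand-in test server whose validation rules are assumptions; it is not Gaurun's real handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// setPusherMax sends PUT /config/pushers?max=N and returns the status code.
func setPusherMax(client *http.Client, baseURL string, pusherMax int) (int, error) {
	q := url.Values{"max": {strconv.Itoa(pusherMax)}}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/config/pushers?"+q.Encode(), nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeGaurun is a local stand-in server for experiments (hypothetical);
// it accepts only PUT requests with a positive integer max.
func fakeGaurun() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/config/pushers", func(w http.ResponseWriter, r *http.Request) {
		n, err := strconv.Atoi(r.URL.Query().Get("max"))
		if r.Method != http.MethodPut || err != nil || n <= 0 {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeGaurun()
	defer srv.Close()
	code, err := setPusherMax(srv.Client(), srv.URL, 32)
	fmt.Println(code, err) // prints 200 <nil>
}
```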

  55. Bulk enqueue
    • POST /push can accept multiple push
    notifications in a single request
    • Limited by core.notification_max
    • default value is 100
    • Example request payload:
    {
        "notifications" : [
            {
                "token" : ["xxx"],
                "platform" : 1,
                "message" : "Hello, iOS!"
            },
            {
                "token" : ["yyy"],
                "platform" : 2,
                "message" : "Hello, Android!"
            }
        ]
    }
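
A client that sends many notifications has to split them into requests of at most core.notification_max entries each. The following sketch shows one way to do that; the `notification` and `pushRequest` types and the `buildRequests` helper are hypothetical and only mirror the payload shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// notification mirrors one entry of the example payload (hypothetical type).
type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

// pushRequest is the body of POST /push (hypothetical type).
type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// buildRequests splits ns into JSON request bodies that each contain
// at most notificationMax notifications.
func buildRequests(ns []notification, notificationMax int) ([][]byte, error) {
	if notificationMax <= 0 {
		return nil, fmt.Errorf("notificationMax must be positive, got %d", notificationMax)
	}
	var bodies [][]byte
	for start := 0; start < len(ns); start += notificationMax {
		end := start + notificationMax
		if end > len(ns) {
			end = len(ns)
		}
		body, err := json.Marshal(pushRequest{Notifications: ns[start:end]})
		if err != nil {
			return nil, err
		}
		bodies = append(bodies, body)
	}
	return bodies, nil
}

func main() {
	ns := make([]notification, 250)
	for i := range ns {
		ns[i] = notification{
			Token:    []string{fmt.Sprintf("token-%d", i)},
			Platform: 1,
			Message:  "Hello, iOS!",
		}
	}
	bodies, err := buildRequests(ns, 100)
	fmt.Println(len(bodies), err) // prints 3 <nil>
}
```

Sending fewer, larger requests reduces HTTP overhead between the application servers and Gaurun.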

  56. Device token screening
    • Device tokens are sometimes invalidated
    • Let's remove invalidated tokens from the database
    periodically
    • With fewer invalidated device tokens,
    pushing notifications takes less time
    • We can tell whether a device token is invalidated
    from the response from APNs and GCM/FCM

  57. Daily device token screening @ Mercari
    [Diagram: Gaurun outputs JSON log → upload to S3 → Batch downloads and parses JSON log → issues DELETE to MySQL]

  58. Conclusion
    • Push notification has high network latency
    • High concurrency is required
    • Go is a good choice for a push notification
    server because
    • Go provides the useful net/http package
    • Go can handle a very large number of goroutines
    simultaneously

  59. References
    • Gaurun
    • https://github.com/mercari/gaurun
    • Mercari's push notification system built with nginx and Go
    • http://tech.mercari.com/entry/2015/08/11/172206
    • High Performance Gaurun: the middleware that supports
    Mercari's large-scale push delivery
    • http://tech.mercari.com/entry/2016/11/08/170343