Building high performance push notification server in Go

Tatsuhiko Kubo (@cubicdaiya), builderscon tokyo, 2017/08/04

@cubicdaiya / Tatsuhiko Kubo Principal Engineer, SRE @ Mercari, Inc.

Agenda
• How to send push notifications to an iOS or Android device
• Past and present push infrastructure @ Mercari
• Why did we develop a push notification server?
• Gaurun: a general push notification server in Go
  • Features
  • Architecture and internals

Push notification to iPhone

Push notification to iOS or Android device
• A notification is not pushed to the smartphone directly
• The notification payload is sent to a push notification service such as
  • APNs, GCM/FCM, Amazon SNS, etc.
• Only APNs and GCM/FCM are covered in this talk

Communicating with APNs and GCM/FCM
• APNs
  • Via HTTPS
  • TLS certificate and key, or JWT, is required
• GCM/FCM
  • Via HTTPS
  • Server key is required

APNs (Apple Push Notification Service)
• APNs Binary Provider API (Legacy)
  • Binary protocol over TLS
• APNs Provider API
  • HTTP/2
  • Payload is JSON
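As a rough illustration of the HTTP/2 Provider API, the sketch below builds (but does not send) one push request. The function name newAPNsRequest, the bundle ID com.example.app, and the token value are placeholders invented for this example; the endpoint path /3/device/<token> and the apns-topic header come from Apple's Provider API. Authentication (certificate or JWT) is left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// apnsPayload is a minimal JSON body for an alert notification.
type apnsPayload struct {
	APS struct {
		Alert string `json:"alert"`
	} `json:"aps"`
}

// newAPNsRequest builds a push request for a single device token.
// It is a hypothetical helper; nothing is sent over the network.
func newAPNsRequest(host, token, topic, message string) (*http.Request, error) {
	var p apnsPayload
	p.APS.Alert = message
	body, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("https://%s/3/device/%s", host, token)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("apns-topic", topic) // usually the app's bundle ID
	return req, nil
}

func main() {
	req, err := newAPNsRequest("api.push.apple.com", "token-string", "com.example.app", "Hello, iOS")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

A real sender would pass this request to an http.Client configured with the provider certificate or a JWT bearer token.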

GCM / FCM
• GCM (Google Cloud Messaging)
• FCM (Firebase Cloud Messaging)
• Google says at https://developers.google.com/cloud-messaging/:
  "Firebase Cloud Messaging (FCM) is the new version of GCM. It inherits the reliable and scalable GCM infrastructure, plus new features! See the FAQ to learn more. If you are integrating messaging in a new app, start with FCM. GCM users are strongly recommended to upgrade to FCM, in order to benefit from new FCM features today and in the future."

Communicating with APNs and GCM/FCM
• High network latency
  • APNs and GCM/FCM endpoints are far away
  • A push takes between tens and hundreds of milliseconds
• Connection handling
  • Keep connections alive as long as possible
  • Frequent connect / close is bad

Push infrastructure @ Mercari

Overview
[Diagram: batch and app servers send requests over HTTP/1.1 through nginx to Gaurun, which pushes to APNs and GCM over HTTP/2]

Push infrastructure requirements @ Mercari
• Push notifications are kicked asynchronously by in-app events
  • comment, purchase, like, etc.
• Push notifications go out to many customers within 1~2 hours for some campaigns and events
  • The number of targets is over tens of millions
High concurrency and low latency are required!

Past push infrastructure @ Mercari
• Push notifications kicked by in-app events such as comment, purchase, like, etc.
  • All logic was implemented in Mercari API
  • Mercari API is written in PHP (mod_php)
• Push notifications to many customers in large-scale campaigns and events
  • PHP / Ruby script & Amazon SNS

Problems of past push infrastructure @ Mercari
• Slow API response
  • Push was not kicked asynchronously by in-app events
• High network latency
  • PHP processes frequently opened and closed connections to APNs and GCM
• Low throughput
  • It took a very long time (more than a few hours) to push notifications to many users

3 years ago

3 years ago
• Push runs synchronously!
• API responses had high latency when an in-app event was kicked
  • The response was returned to the client only after the notification had been pushed to APNs or GCM

2 years ago

2 years ago
• Push runs asynchronously!
• Job queue and worker were introduced
  • Q4M and php-parallel-prefork
• Latency in API responses was significantly reduced when an in-app event is kicked
• Throughput was significantly improved

Job queue and worker @ Mercari
• Job queue: Q4M
  • Message queue for MySQL
  • https://q4m.github.io/
• Job worker: php-parallel-prefork
  • Simple prefork server framework in PHP
  • https://github.com/travail/php-Parallel-Prefork

Problem still here
• Throughput is not enough
• A preforking worker system based on PHP is not fast or scalable enough for push notification
  • PHP is not good at concurrent processing
  • Push notification processing requires high concurrency to achieve low latency
  • APNs and GCM/FCM endpoints are far away

Push infrastructure requirements @ Mercari
• Low latency
  • Push notifications are kicked asynchronously by in-app events
  • comment, purchase, like, etc.
• Push notifications go out to many customers within 1~2 hours for some campaigns and events
  • The number of targets is over tens of millions
High concurrency and low latency are required!

Now
[Diagram: batch and app servers send requests over HTTP/1.1 through nginx to Gaurun, which pushes to APNs and GCM over HTTP/2]

Now
• Push infrastructure @ Mercari is built with
  • nginx: HTTP load balancer
  • Gaurun: HTTP/2 proxy for APNs and GCM/FCM

Gaurun

Gaurun
• Push notification server for APNs and GCM/FCM written in Go
  • https://github.com/mercari/gaurun
• JSON-based API via HTTP
• Queueing & pushing notifications to APNs and GCM/FCM asynchronously
• Monitoring

Send push notification to iPhone by Gaurun
    $ gaurun -c /etc/gaurun/gaurun.toml -p 1056 &
    $ curl \
        -X POST \
        -H "Content-Type: application/json" \
        "http://127.0.0.1:1056/push" \
        -d '{"notifications": [{"token": ["token-string"], "platform": 1, "message": "Hello, iOS"}]}'

Gaurun HTTP API
• POST /push
  • Proxies push notification requests to APNs and GCM/FCM
  • Responds to the client immediately and pushes the notification asynchronously
• GET /stat/app
  • Returns operational stats as JSON
  • e.g. channel usage, number of push successes/errors
• GET /stat/go
  • Returns Go stats as JSON
  • e.g. number of goroutines, memory usage in the Go runtime
• PUT /config/pushers
  • Configures push throughput dynamically
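The "respond immediately, push asynchronously" behavior of POST /push can be sketched roughly as follows. This is not Gaurun's actual handler: the type pushServer, its accept method, and the choice to answer 503 when the queue is full are assumptions made for this example. The JSON field names match the request shown in the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// pushServer is a hypothetical API server holding an in-memory queue.
type pushServer struct {
	queue chan notification
}

func newPushServer(queueSize int) *pushServer {
	return &pushServer{queue: make(chan notification, queueSize)}
}

// accept decodes a request body, enqueues every notification without
// blocking, and returns the HTTP status to send. It never waits for delivery.
func (s *pushServer) accept(body []byte) int {
	var req pushRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return http.StatusBadRequest
	}
	for _, n := range req.Notifications {
		select {
		case s.queue <- n:
		default:
			// Assumption for this sketch: reject when the queue is full.
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusOK
}

func (s *pushServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	w.WriteHeader(s.accept(body))
}

func main() {
	s := newPushServer(16)
	payload := `{"notifications":[{"token":["token-string"],"platform":1,"message":"Hello, iOS"}]}`
	req := httptest.NewRequest(http.MethodPost, "/push", strings.NewReader(payload))
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, req)
	fmt.Println(rec.Code, len(s.queue)) // status code and number of queued jobs
}
```

The client sees the status as soon as the job is queued; workers draining the queue would do the slow network call to APNs or GCM/FCM later.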

Why is Gaurun written in Go?
• High-performance HTTP server
  • Go provides the net/http package.
• High concurrency
  • Go can handle a very large number of goroutines simultaneously

Go provides the net/http package. This alone gives enough performance for building a proxy server in Go.
    package main

    import (
        "fmt"
        "net/http"
    )

    // handler replies with a fixed plain-text body.
    func handler(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/plain")
        fmt.Fprintf(w, "Hello, World!\n")
    }

    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }

Simple benchmark on my MacBook Pro
    $ ab \
        -k \
        -c 100 \
        -n 100000 \
        "http://127.0.0.1:8080/" 2>&1 | \
        grep "Requests per second:"
    Requests per second:    56127.40 [#/sec] (mean)

Gaurun Internals

Gaurun internals
• Gaurun has 3 components
  • HTTP API server
  • Proxy for APNs and GCM/FCM
  • Message queue and multiple workers
• Based on goroutines and channels

Push process in Gaurun

Job queue and workers by channel and goroutine
• A channel is available as an in-memory queue
    // channel-based queue
    var QueueNotification chan RequestGaurunNotification
• Start workers and initialize the queue
    func StartPushWorkers(workerNum, queueNum int64) {
        QueueNotification = make(chan RequestGaurunNotification, queueNum)
        for i := int64(0); i < workerNum; i++ {
            go pushNotificationWorker()
        }
    }
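To see the channel-as-queue pattern run end to end, here is a self-contained sketch with the same shape: a buffered channel and a fixed number of worker goroutines draining it. The names job and runWorkers are made up for this example, and the "work" is just counting, so it only demonstrates the queueing mechanics rather than Gaurun's real worker.

```go
package main

import (
	"fmt"
	"sync"
)

// job stands in for a push request; the real type would carry tokens etc.
type job struct {
	Message string
}

// runWorkers starts workerNum goroutines that drain a buffered channel of
// size queueNum, enqueues all jobs, waits for the workers to finish, and
// returns how many jobs were handled.
func runWorkers(workerNum, queueNum int, jobs []job) int {
	queue := make(chan job, queueNum)
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The loop ends once the channel is closed and drained.
			for range queue {
				mu.Lock()
				handled++
				mu.Unlock()
			}
		}()
	}

	for _, j := range jobs {
		queue <- j // blocks only while the buffer is full
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	jobs := make([]job, 100)
	fmt.Println(runWorkers(8, 16, jobs)) // prints 100
}
```

The buffer size plays the role of core.queues and the goroutine count the role of core.workers.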

Each worker has a pusher pool
• Workers and pushers are goroutines
• worker
  • Dequeues a push job from the channel
  • Runs the pusher function in a goroutine
• pusher
  • Pushes the notification to APNs and GCM/FCM

Pusher management by atomic package
    // count this pusher, then decide whether a new goroutine may start
    atomic.AddInt64(&pusherCount, 1)
    if atomic.LoadInt64(&pusherCount) < pusherMax {
        go PusherFunc()
    } else {
        …
    }

    func PusherFunc() {
        err := push()
        // error handling
        atomic.AddInt64(&pusherCount, -1)
    }
Each worker knows only the number of its own active pushers.
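A runnable variation on the counter idea is sketched below. It is not Gaurun's code: the type pusherPool and its methods are invented for this example, and running the push synchronously when the cap is reached is an assumption, since the slide elides that branch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pusherPool caps the number of concurrently running pusher goroutines.
type pusherPool struct {
	count int64 // active pushers, accessed atomically
	max   int64
	wg    sync.WaitGroup
}

// tryAcquire reserves a pusher slot, undoing the increment if the cap is hit.
func (p *pusherPool) tryAcquire() bool {
	if atomic.AddInt64(&p.count, 1) > p.max {
		atomic.AddInt64(&p.count, -1)
		return false
	}
	return true
}

func (p *pusherPool) release() { atomic.AddInt64(&p.count, -1) }

func (p *pusherPool) active() int64 { return atomic.LoadInt64(&p.count) }

// run executes push in a new goroutine when a slot is free and reports true.
// Otherwise it runs push in the caller (an assumption of this sketch).
func (p *pusherPool) run(push func()) bool {
	if p.tryAcquire() {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			defer p.release() // runs before wg.Done
			push()
		}()
		return true
	}
	push()
	return false
}

// wait blocks until all asynchronous pushers have finished.
func (p *pusherPool) wait() { p.wg.Wait() }

func main() {
	pool := &pusherPool{max: 2}
	var pushed int64
	for i := 0; i < 5; i++ {
		pool.run(func() { atomic.AddInt64(&pushed, 1) })
	}
	pool.wait()
	fmt.Println(atomic.LoadInt64(&pushed), pool.active()) // prints 5 0
}
```

Because only an integer is shared, the worker needs no lock to decide whether another pusher may start.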

Push process in Gaurun

Connection handling for APNs and GCM/FCM
• Gaurun uses http.Client in the net/http package
  • http.Client reuses connections transparently
• Behavior is configurable through http.Transport
  • MaxIdleConns
  • MaxIdleConnsPerHost
  • IdleConnTimeout
• Gaurun provides parameters for configuring them
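The three http.Transport fields listed above can be set as in the following sketch. The helper names newPushTransport and newPushClient are hypothetical, and mapping keepalive_conns to both MaxIdleConns and MaxIdleConnsPerHost is an assumption of this example, not a statement about how Gaurun wires its configuration.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newPushTransport returns a transport that keeps up to keepaliveConns idle
// connections open for reuse and closes them after keepaliveTimeout of idling.
func newPushTransport(keepaliveConns int, keepaliveTimeout time.Duration) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        keepaliveConns,
		MaxIdleConnsPerHost: keepaliveConns, // net/http defaults to 2 per host
		IdleConnTimeout:     keepaliveTimeout,
	}
}

// newPushClient wraps the transport in a client with an overall request timeout.
func newPushClient(tr *http.Transport, timeout time.Duration) *http.Client {
	return &http.Client{Transport: tr, Timeout: timeout}
}

func main() {
	tr := newPushTransport(32, 90*time.Second)
	client := newPushClient(tr, 5*time.Second)
	fmt.Println(tr.MaxIdleConnsPerHost, tr.IdleConnTimeout, client.Timeout)
}
```

Since all pushes go to one or two hosts, the small per-host default in net/http is the setting most likely to cause frequent reconnects if left untouched.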

An HTTP proxy server has various timeouts
• connection timeout
• read request timeout
• write response timeout
• keepalive timeout
• A proxy is also a client
  • proxy connection timeout
  • proxy write request timeout
  • proxy read response timeout
  • proxy keepalive timeout
  • etc.

Timeouts in net and net/http
• net.Dialer
  • Timeout
• http.Transport
  • TLSHandshakeTimeout, IdleConnTimeout, ResponseHeaderTimeout, ExpectContinueTimeout
• http.Client
  • Timeout
• http.Server
  • ReadTimeout, ReadHeaderTimeout, WriteTimeout, IdleTimeout
• Gophers should read this article
  • https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
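The sketch below shows where each of the listed timeouts is set for a proxy that is both a server and a client. The helper names and every duration are arbitrary placeholders chosen for illustration, not recommended or Gaurun-specific values.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newAPIServer sets the server-side (downstream) timeouts.
func newAPIServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // time to read request headers
		ReadTimeout:       10 * time.Second, // time to read the whole request
		WriteTimeout:      10 * time.Second, // time to write the response
		IdleTimeout:       60 * time.Second, // keep-alive wait between requests
	}
}

// newUpstreamClient sets the client-side (upstream) timeouts.
func newUpstreamClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second} // TCP connect timeout
	transport := &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   5 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
		IdleConnTimeout:       90 * time.Second,
	}
	// Client.Timeout bounds the entire exchange, including reading the body.
	return &http.Client{Transport: transport, Timeout: 15 * time.Second}
}

func main() {
	srv := newAPIServer("127.0.0.1:1056", http.NotFoundHandler())
	client := newUpstreamClient()
	fmt.Println(srv.ReadHeaderTimeout, srv.IdleTimeout, client.Timeout)
}
```

Leaving any of these at zero means "no timeout" in net/http, which is why a proxy should set them explicitly.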

Monitoring Gaurun

Monitoring Gaurun status
• GET /stat/app
    {
      "queue_max": 12000000,
      "queue_usage": 515497,
      "pusher_max": 1152,
      "pusher_count": 640,
      "ios": {
        "push_success": 2465897,
        "push_error": 5704
      },
      "android": {
        "push_success": 1416295,
        "push_error": 4118
      }
    }
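A monitoring script could turn this response into a success rate and a queue fill ratio, roughly as in the sketch below. The struct and function names are invented for this example; only the JSON field names are taken from the response shown in the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type platformStat struct {
	PushSuccess int64 `json:"push_success"`
	PushError   int64 `json:"push_error"`
}

type appStat struct {
	QueueMax    int64        `json:"queue_max"`
	QueueUsage  int64        `json:"queue_usage"`
	PusherMax   int64        `json:"pusher_max"`
	PusherCount int64        `json:"pusher_count"`
	IOS         platformStat `json:"ios"`
	Android     platformStat `json:"android"`
}

// parseAppStat decodes the body of GET /stat/app.
func parseAppStat(data []byte) (appStat, error) {
	var s appStat
	err := json.Unmarshal(data, &s)
	return s, err
}

// successRate returns successes / (successes + errors), or 0 with no pushes.
func successRate(p platformStat) float64 {
	total := p.PushSuccess + p.PushError
	if total == 0 {
		return 0
	}
	return float64(p.PushSuccess) / float64(total)
}

// queueFill returns the fraction of the queue currently in use.
func queueFill(s appStat) float64 {
	if s.QueueMax == 0 {
		return 0
	}
	return float64(s.QueueUsage) / float64(s.QueueMax)
}

func main() {
	body := []byte(`{"queue_max": 12000000, "queue_usage": 515497,
		"pusher_max": 1152, "pusher_count": 640,
		"ios": {"push_success": 2465897, "push_error": 5704},
		"android": {"push_success": 1416295, "push_error": 4118}}`)
	s, err := parseAppStat(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("ios %.2f%% android %.2f%% queue %.2f%%\n",
		100*successRate(s.IOS), 100*successRate(s.Android), 100*queueFill(s))
}
```

Values like these are the kind of metric that can be graphed or used for threshold alerts.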

Monitoring Go status
• GET /stat/go
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.goroutine_num'
    2326
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.heap_objects'
    27428
    $ curl -s http://127.0.0.1:1056/stat/go | jq '.gc_num'
    44695
    $ …
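Numbers like these are available from the standard runtime package. The sketch below collects three of them; the struct goStat and the function collectGoStat are hypothetical, and this is not necessarily how Gaurun itself produces /stat/go. Only the JSON key names are borrowed from the output above.

```go
package main

import (
	"encoding/json"
	"os"
	"runtime"
)

// goStat holds a small subset of Go runtime statistics.
type goStat struct {
	GoroutineNum int    `json:"goroutine_num"`
	HeapObjects  uint64 `json:"heap_objects"`
	GCNum        uint32 `json:"gc_num"`
}

// collectGoStat reads the current values from the Go runtime.
func collectGoStat() goStat {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return goStat{
		GoroutineNum: runtime.NumGoroutine(),
		HeapObjects:  m.HeapObjects, // number of allocated heap objects
		GCNum:        m.NumGC,       // completed GC cycles
	}
}

func main() {
	// Write the stats as one JSON object, as an HTTP handler might.
	if err := json.NewEncoder(os.Stdout).Encode(collectGoStat()); err != nil {
		panic(err)
	}
}
```

A steadily growing goroutine_num is a useful early sign that the queue is full and enqueuing is backing up.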

[Graphs: number of goroutines; memory allocation]

[Graph: push success rate]

Threshold alerting with Mackerel

High Performance Gaurun

High Performance Gaurun
• Gaurun's behavior is configurable via TOML
• Parameter tuning is required for high performance
  • The default configuration is very conservative and not tuned for high performance

Configuration in TOML
    [core]
    port = "1056"
    workers = 8
    queues = 4192
    pusher_max = 32

    [android]
    apikey = "…"
    enabled = true
    keepalive_conns = 32

    [ios]
    pem_cert_path = "/path/to/cert.pem"
    pem_key_path = "/path/to/key.pem"
    enabled = true
    sandbox = false
    topic = "…"
    keepalive_conns = 32

Parameter tuning
• core.workers
  • number of goroutines that dequeue push notifications from the channel-based queue
• core.queues
  • size of the channel-based queue for push notifications
• core.pusher_max
  • number of goroutines per worker that push notifications to APNs and GCM/FCM

Parameter tuning
• (ios|android).timeout
  • timeout for pushing a notification to APNs or GCM/FCM
• (ios|android).keepalive_conns
  • number of idle connections kept open to APNs or GCM/FCM
• (ios|android).keepalive_timeout
  • how long a keep-alive connection to APNs or GCM/FCM is kept open
• (ios|android).retry_max
  • maximum retry count for pushing a notification to APNs or GCM/FCM

Performance tuning
• Increase the number of simultaneous push notifications
  • core.workers x core.pusher_max
• Increase core.queues
  • If the channel is full, the number of goroutines grows and Gaurun slows down.
• Increase (ios|android).keepalive_conns
But too large a number is not good!
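As a back-of-the-envelope aid for choosing these values, the sketch below multiplies core.workers by core.pusher_max and divides by an assumed per-push latency. The throughput formula is a simplification introduced here (each pusher is assumed to be busy for exactly one round trip), and the 100 ms latency is an assumed figure inside the "tens to hundreds of milliseconds" range mentioned earlier, not a measurement.

```go
package main

import (
	"fmt"
	"time"
)

// maxConcurrentPushes is the upper bound on simultaneous pushes:
// core.workers x core.pusher_max.
func maxConcurrentPushes(workers, pusherMax int) int {
	return workers * pusherMax
}

// estimatedThroughput gives a rough pushes-per-second figure, assuming every
// pusher is always busy and each push takes exactly the given latency.
func estimatedThroughput(concurrency int, latency time.Duration) float64 {
	return float64(concurrency) / latency.Seconds()
}

func main() {
	// workers = 8 and pusher_max = 32 are the values from the TOML example.
	c := maxConcurrentPushes(8, 32)
	perSec := estimatedThroughput(c, 100*time.Millisecond)
	// Time needed for ten million notifications at that rate.
	total := time.Duration(10_000_000 / perSec * float64(time.Second))
	fmt.Println(c, perSec, total.Round(time.Minute))
}
```

An estimate like this only gives a starting point; the stats endpoints show whether the real system keeps up.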

Configure push throughput dynamically
    curl \
        -X PUT \
        "http://127.0.0.1:1056/config/pushers?max=32"
Configures core.pusher_max via HTTP dynamically
Do not give too large a number!

Bulk enqueue
• POST /push can accept multiple push notifications in a single request
  • Limited by core.notification_max
  • default value is 100
• Example request payload
    {
      "notifications": [
        {
          "token": ["xxx"],
          "platform": 1,
          "message": "Hello, iOS!"
        },
        {
          "token": ["yyy"],
          "platform": 2,
          "message": "Hello, Android!"
        }
      ]
    }
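A client that has more notifications than core.notification_max allows could split them into several request bodies, as in this sketch. The types and the function chunkRequests are made up for this example; the JSON field names and the default limit of 100 are the ones given above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"` // 1 and 2 as in the example payload
	Message  string   `json:"message"`
}

type pushRequest struct {
	Notifications []notification `json:"notifications"`
}

// chunkRequests splits notifications into request bodies holding at most
// max notifications each, preserving order.
func chunkRequests(ns []notification, max int) []pushRequest {
	if max <= 0 {
		return nil
	}
	var reqs []pushRequest
	for start := 0; start < len(ns); start += max {
		end := start + max
		if end > len(ns) {
			end = len(ns)
		}
		reqs = append(reqs, pushRequest{Notifications: ns[start:end]})
	}
	return reqs
}

func main() {
	ns := make([]notification, 0, 250)
	for i := 0; i < 250; i++ {
		ns = append(ns, notification{
			Token:    []string{fmt.Sprintf("token-%d", i)},
			Platform: 1,
			Message:  "Hello, iOS!",
		})
	}
	reqs := chunkRequests(ns, 100)
	fmt.Println(len(reqs), len(reqs[len(reqs)-1].Notifications)) // prints 3 50

	body, err := json.Marshal(reqs[0])
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body) > 0) // body is ready to POST to /push
}
```

Each resulting body can then be sent as its own POST /push request.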

Device token screening
• Device tokens are sometimes invalidated
• Let's remove invalidated tokens from the database periodically
  • If the number of invalidated device tokens is reduced, the time it takes to push notifications is shortened
• We can tell whether a device token is invalidated from the response from APNs and GCM/FCM
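The screening step can be sketched as a filter over log lines. Everything about the log format here is hypothetical (one JSON object per line with token and reason fields); Gaurun's real log layout may differ. The reason strings are the ones APNs ("BadDeviceToken", "Unregistered") and GCM/FCM ("NotRegistered", "InvalidRegistration") use to report tokens that are no longer valid.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// pushLogEntry is a hypothetical log record, one JSON object per line.
type pushLogEntry struct {
	Token  string `json:"token"`
	Reason string `json:"reason"`
}

// invalidTokenReasons lists provider error strings meaning the token is dead.
var invalidTokenReasons = map[string]bool{
	"BadDeviceToken":      true, // APNs
	"Unregistered":        true, // APNs
	"NotRegistered":       true, // GCM/FCM
	"InvalidRegistration": true, // GCM/FCM
}

// collectInvalidTokens scans the log text and returns the tokens whose
// failure reason marks them as invalid. Unparseable lines are skipped.
func collectInvalidTokens(log string) []string {
	var tokens []string
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		var e pushLogEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue
		}
		if invalidTokenReasons[e.Reason] {
			tokens = append(tokens, e.Token)
		}
	}
	return tokens
}

func main() {
	log := `{"token":"aaa","reason":"Unregistered"}
{"token":"bbb","reason":""}
{"token":"ccc","reason":"NotRegistered"}`
	fmt.Println(collectInvalidTokens(log)) // prints [aaa ccc]
}
```

The returned tokens would then feed whatever deletes the corresponding rows from the database.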

Daily device token screening @ Mercari
[Diagram: Gaurun outputs a JSON log, which is uploaded to S3; a batch job downloads it, parses the JSON log, and issues DELETE statements to MySQL]

Conclusion
• Push notification has high network latency
  • High concurrency is required
• Go is a good choice for a push notification server, because
  • Go provides the useful net/http package
  • Go can handle a very large number of goroutines simultaneously

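References
• Gaurun
  • https://github.com/mercari/gaurun
• Mercari's push notification system built with nginx and Go (nginxとGoでつくるメルカリのプッシュ通知システム)
  • http://tech.mercari.com/entry/2015/08/11/172206
• High Performance Gaurun: the middleware behind Mercari's large-scale push delivery (ハイパフォーマンスGaurun 〜メルカリの大規模プッシュ配信を支えるミドルウェア〜)
  • http://tech.mercari.com/entry/2016/11/08/170343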
