Slide 1

Slide 1 text

Building high performance push notification server in Go
Tatsuhiko Kubo (@cubicdaiya)
builderscon tokyo 2017/08/04

Slide 2

Slide 2 text

@cubicdaiya / Tatsuhiko Kubo
Principal Engineer, SRE @ Mercari, Inc.

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Agenda
• How to send push notifications to an iOS or Android device
• Past and present push infrastructure @ Mercari
  • Why did we develop a push notification server?
• Gaurun ~General push notification server in Go~
  • Features
  • Architecture and internals

Slide 5

Slide 5 text

Push notification to iPhone

Slide 6

Slide 6 text

Push notification to iOS or Android device
• Notifications are not pushed to the smartphone directly
• The notification payload is sent to a push notification service such as
  • APNs, GCM/FCM, Amazon SNS, etc.
• This talk covers only APNs and GCM/FCM

Slide 7

Slide 7 text

Communicating with APNs and GCM/FCM
• APNs
  • Via HTTPS
  • A TLS certificate and key, or a JWT, is required
• GCM/FCM
  • Via HTTPS
  • A server key is required

Slide 8

Slide 8 text

APNs (Apple Push Notification Service)
• APNs Binary Provider API (Legacy)
  • Binary protocol on TLS
• APNs Provider API
  • HTTP/2
  • Payload is JSON

Slide 9

Slide 9 text

GCM / FCM
• GCM (Google Cloud Messaging)
• FCM (Firebase Cloud Messaging)
• Google says in https://developers.google.com/cloud-messaging/
  "Firebase Cloud Messaging (FCM) is the new version of GCM. It inherits the reliable and scalable GCM infrastructure, plus new features! See the FAQ to learn more. If you are integrating messaging in a new app, start with FCM. GCM users are strongly recommended to upgrade to FCM, in order to benefit from new FCM features today and in the future."

Slide 10

Slide 10 text

Communicating with APNs and GCM/FCM
• High network latency
  • APNs and GCM/FCM endpoints are far away
  • A push takes between tens and hundreds of milliseconds
• Connection handling
  • Keep connections alive as long as possible
  • Frequent connect / close is bad

Slide 11

Slide 11 text

Push infrastructure @ Mercari

Slide 12

Slide 12 text

Overview (architecture diagram): batch and app servers send requests over HTTP/1.1 to nginx, which forwards them over HTTP/1.1 to Gaurun; Gaurun pushes to APNs and GCM over HTTP/2.

Slide 13

Slide 13 text

Push infrastructure requirements @ Mercari
• Push notifications triggered asynchronously by in-app events
  • comment, purchase, like, etc.
• Push notifications to many customers within 1~2 hours for some campaigns and events
  • The number of targets is over tens of millions
High concurrency and low latency are required!

Slide 14

Slide 14 text

Past push infrastructure @ Mercari
• Push notifications triggered by in-app events such as comment, purchase, like, etc.
  • All logic was implemented in the Mercari API
  • The Mercari API is written in PHP (mod_php)
• Push notifications to many customers in large-scale campaigns and events
  • PHP / Ruby script & Amazon SNS

Slide 15

Slide 15 text

Problems of the past push infrastructure @ Mercari
• Slow API response
  • Pushes were not triggered asynchronously by in-app events
• High network latency
  • PHP processes frequently connected to and disconnected from APNs and GCM
• Low throughput
  • It took a very long time (more than a few hours) to push notifications to many users

Slide 16

Slide 16 text

3 years ago

Slide 17

Slide 17 text

3 years ago
• Push runs synchronously!
• API responses had high latency when an in-app event was triggered
  • The response was returned to the client only after the notification had been pushed to APNs or GCM

Slide 18

Slide 18 text

2 years ago

Slide 19

Slide 19 text

2 years ago
• Push runs asynchronously!
• A job queue and workers were introduced
  • Q4M and php-parallel-prefork
• Latency in API responses was significantly reduced when an in-app event was triggered
• Throughput was significantly improved

Slide 20

Slide 20 text

Job queue and worker @ Mercari
• Job queue: Q4M
  • Message queue for MySQL
  • https://q4m.github.io/
• Job worker: php-parallel-prefork
  • Simple prefork server framework in PHP
  • https://github.com/travail/php-Parallel-Prefork

Slide 21

Slide 21 text

Problem still here
• Throughput is not enough
• A preforking worker system based on PHP is not fast or scalable enough for push notification
  • PHP is not good at concurrent processing
  • Push notification processing requires high concurrency to achieve low latency
  • APNs and GCM/FCM endpoints are far away

Slide 22

Slide 22 text

Push infrastructure requirements @ Mercari
• Low latency
  • Push notifications triggered asynchronously by in-app events
    • comment, purchase, like, etc.
• Push notifications to many customers within 1~2 hours for some campaigns and events
  • The number of targets is over tens of millions
High concurrency and low latency are required!

Slide 23

Slide 23 text

Now (architecture diagram): batch and app servers send requests over HTTP/1.1 to nginx, which forwards them over HTTP/1.1 to Gaurun; Gaurun pushes to APNs and GCM over HTTP/2.

Slide 24

Slide 24 text

Now
• Push infrastructure @ Mercari is built with
  • nginx: HTTP load balancer
  • Gaurun: HTTP/2 proxy for APNs and GCM/FCM

Slide 25

Slide 25 text

Gaurun

Slide 26

Slide 26 text

Gaurun
• Push notification server for APNs and GCM/FCM written in Go
  • https://github.com/mercari/gaurun
• JSON-based API via HTTP
• Queueing and pushing notifications to APNs and GCM/FCM asynchronously
• Monitoring

Slide 27

Slide 27 text

Send push notification to iPhone by Gaurun

$ gaurun -c /etc/gaurun/gaurun.toml -p 1056 &
$ curl \
    -X POST \
    -H "Content-Type: application/json" \
    "http://127.0.0.1:1056/push" \
    -d '{"notifications": [ {"token":["token-string"],"platform": 1,"message":"Hello, iOS"} ] }'

Slide 28

Slide 28 text

Gaurun HTTP API
• POST /push
  • Proxies push-notification requests to APNs and GCM/FCM
  • Responds to the client immediately and pushes the notification asynchronously
• GET /stat/app
  • Returns operational stats as JSON
  • e.g. channel usage, number of push successes/errors
• GET /stat/go
  • Returns Go stats as JSON
  • e.g. number of goroutines, memory usage in the Go runtime
• PUT /config/pushers
  • Configures push throughput dynamically
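
The POST /push call can also be made from Go. The following is a minimal sketch, not Gaurun's own client code: the type names Notification and PushRequest and the helpers sendPush and demo are hypothetical, and the field names are taken from the JSON in the curl example. So that it runs without a Gaurun process, demo talks to a local stand-in server that only checks the method, path, and body shape.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Notification mirrors one entry of the "notifications" array in the curl example.
// The Go type itself is hypothetical.
type Notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"` // 1 = iOS, 2 = Android, as in the slides
	Message  string   `json:"message"`
}

// PushRequest is the JSON body of POST /push.
type PushRequest struct {
	Notifications []Notification `json:"notifications"`
}

// sendPush posts req to baseURL+"/push" and returns the HTTP status code.
func sendPush(client *http.Client, baseURL string, req PushRequest) (int, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return 0, err
	}
	resp, err := client.Post(baseURL+"/push", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
	return resp.StatusCode, nil
}

// demo sends one notification to a local stand-in server (NOT Gaurun itself).
func demo() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req PushRequest
		if r.Method != http.MethodPost || r.URL.Path != "/push" ||
			json.NewDecoder(r.Body).Decode(&req) != nil {
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	return sendPush(srv.Client(), srv.URL, PushRequest{
		Notifications: []Notification{
			{Token: []string{"token-string"}, Platform: 1, Message: "Hello, iOS"},
		},
	})
}

func main() {
	status, err := demo()
	fmt.Println(status, err)
}
```

Against a real Gaurun instance, baseURL would be something like http://127.0.0.1:1056, as in the curl example.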

Slide 29

Slide 29 text

Why is Gaurun written in Go
• High performance HTTP server
  • Go provides the net/http package.
• High concurrency
  • Go can handle a very large number of goroutines simultaneously

Slide 30

Slide 30 text

Go provides the net/http package. It alone gives enough performance to build a proxy server in Go.

package main

import (
	"fmt"
	"net/http"
)

// handler answers every request with a plain-text greeting.
func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain")
	fmt.Fprintf(w, "Hello, World!\n")
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

Slide 31

Slide 31 text

Simple benchmark on my MacBook Pro

$ ab \
    -k \
    -c 100 \
    -n 100000 \
    "http://127.0.0.1:8080/" 2>&1 | \
    grep "Requests per second:"
Requests per second: 56127.40 [#/sec] (mean)

Slide 32

Slide 32 text

Gaurun Internals

Slide 33

Slide 33 text

Gaurun internals
• Gaurun has 3 components
  • HTTP API server
  • Proxy for APNs and GCM/FCM
  • Message queue and multiple workers
    • Based on goroutines and channels

Slide 34

Slide 34 text

Push process in Gaurun

Slide 35

Slide 35 text

Job queue and workers by channel and goroutine
・A channel is available as an in-memory queue

// channel based queue
var QueueNotification chan RequestGaurunNotification

・Start workers and initialize queue

func StartPushWorkers(workerNum, queueNum int64) {
	// a buffered channel of capacity queueNum acts as the job queue
	QueueNotification = make(chan RequestGaurunNotification, queueNum)
	// each worker is a goroutine that dequeues from the channel
	for i := int64(0); i < workerNum; i++ {
		go pushNotificationWorker()
	}
}
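
The channel-as-queue pattern on this slide can be tried in isolation. The following is a self-contained sketch under stated assumptions, not Gaurun's code: the job type stands in for RequestGaurunNotification, and push and runWorkers are hypothetical names. Unlike a long-running server, it closes the channel and waits for the workers so that the program terminates.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job stands in for RequestGaurunNotification (hypothetical, simplified).
type job struct {
	Token   string
	Message string
}

// push is a stand-in for the real network call to APNs or GCM/FCM.
func push(j job) {}

// runWorkers starts workerNum goroutines that drain a buffered channel of
// capacity queueNum, enqueues all jobs, and returns how many were processed.
func runWorkers(workerNum, queueNum int, jobs []job) int64 {
	queue := make(chan job, queueNum) // in-memory queue
	var processed int64
	var wg sync.WaitGroup

	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Dequeue until the channel is closed and empty.
			for j := range queue {
				push(j)
				atomic.AddInt64(&processed, 1)
			}
		}()
	}

	// Enqueue blocks when the buffer is full, which applies back-pressure.
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
	return processed
}

func main() {
	jobs := []job{
		{Token: "xxx", Message: "Hello, iOS!"},
		{Token: "yyy", Message: "Hello, Android!"},
	}
	fmt.Println("processed:", runWorkers(4, 8, jobs))
}
```

The buffer size plays the role of core.queues and the number of goroutines the role of core.workers.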

Slide 36

Slide 36 text

Each worker has a pusher pool
• worker and pusher are goroutines
• worker
  • Dequeues a push job from the channel
  • Runs the pusher function in a goroutine
• pusher
  • Pushes the notification to APNs or GCM/FCM

Slide 37

Slide 37 text

Pusher management by atomic package

// worker
atomic.AddInt64(&pusherCount, 1)
if atomic.LoadInt64(&pusherCount) < pusherMax {
	go PusherFunc()
} else {
	…
}

// pusher
func PusherFunc() {
	err := push()
	// error handling
	atomic.AddInt64(&pusherCount, -1)
}

Each worker knows only the number of its active pushers.
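
The idea of capping concurrent pushers with an atomic counter can be sketched as a runnable program. This is an illustration under assumptions, not Gaurun's implementation: dispatch is a hypothetical name, the network call is replaced by a short sleep, and the choice to push synchronously in the worker once the limit is reached is an assumption, since the slide elides the else branch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// dispatch simulates one worker handing n jobs to pushers. At most pusherMax
// pushers run as goroutines at once. It returns the highest number of
// concurrently running pusher goroutines that the worker observed.
func dispatch(n int, pusherMax int64) int64 {
	var pusherCount int64 // number of active pusher goroutines
	var peak int64        // written only by the worker loop below
	var wg sync.WaitGroup

	push := func() { time.Sleep(time.Millisecond) } // stand-in for the network call

	for i := 0; i < n; i++ {
		if atomic.LoadInt64(&pusherCount) < pusherMax {
			// Only this worker increments the counter, so the
			// check-then-add sequence cannot exceed pusherMax.
			cur := atomic.AddInt64(&pusherCount, 1)
			if cur > peak {
				peak = cur
			}
			wg.Add(1)
			go func() {
				defer wg.Done()
				defer atomic.AddInt64(&pusherCount, -1)
				push()
			}()
		} else {
			// Assumption: when the limit is reached, push in the worker itself.
			push()
		}
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak pushers:", dispatch(100, 8))
}
```

Because pushers only ever decrement the counter, a stale read in the worker can make it more conservative but never lets it exceed the limit.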

Slide 38

Slide 38 text

Push process in Gaurun

Slide 39

Slide 39 text

Connection handling for APNs and GCM/FCM
• Gaurun uses http.Client in the net/http package
• http.Client reuses connections transparently
• Behavior is configurable by http.Transport
  • MaxIdleConns
  • MaxIdleConnsPerHost
  • IdleConnTimeout
• Gaurun provides parameters for configuring them
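
As a rough illustration of the three http.Transport fields named above, the following sketch builds a client whose idle-connection pool is sized explicitly. The function name newPushTransport and the idea of feeding it two plain integers are assumptions for this example; how Gaurun maps its own parameters onto these fields is not shown in the slides.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newPushTransport returns a Transport that keeps up to keepaliveConns idle
// connections (in total and per host) for keepaliveTimeoutSec seconds.
func newPushTransport(keepaliveConns, keepaliveTimeoutSec int) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        keepaliveConns,
		MaxIdleConnsPerHost: keepaliveConns, // the default is only 2 per host
		IdleConnTimeout:     time.Duration(keepaliveTimeoutSec) * time.Second,
	}
}

func main() {
	tr := newPushTransport(32, 90)
	// One shared client reuses pooled connections across all requests.
	client := &http.Client{Transport: tr}
	fmt.Println(client.Transport != nil, tr.MaxIdleConnsPerHost, tr.IdleConnTimeout)
}
```

Raising MaxIdleConnsPerHost matters here because every push goes to the same one or two upstream hosts; with the default, most connections would be closed after use instead of returning to the pool.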

Slide 40

Slide 40 text

HTTP proxy server has various timeouts
• connection timeout
• read request timeout
• write response timeout
• keepalive timeout
• A proxy is also a client
  • proxy connection timeout
  • proxy write request timeout
  • proxy read response timeout
  • proxy keepalive timeout
• etc.

Slide 41

Slide 41 text

Timeouts in net and net/http
• net.Dialer
  • Timeout
• http.Transport
  • TLSHandshakeTimeout, IdleConnTimeout, ResponseHeaderTimeout, ExpectContinueTimeout
• http.Client
  • Timeout
• http.Server
  • ReadTimeout, ReadHeaderTimeout, WriteTimeout, IdleTimeout
• Gophers should read this article
  • https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
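
To show where each of these timeouts is set, here is a small sketch that configures both sides of a proxy: the client that talks to the upstream and the server that accepts requests. The function names and all duration values are illustrative assumptions, not Gaurun's settings.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newUpstreamClient sets the client-side timeouts (proxy -> upstream).
func newUpstreamClient() *http.Client {
	dialer := &net.Dialer{Timeout: 3 * time.Second} // TCP connect timeout
	transport := &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   3 * time.Second,
		ResponseHeaderTimeout: 5 * time.Second, // wait for response headers
		ExpectContinueTimeout: 1 * time.Second,
		IdleConnTimeout:       90 * time.Second, // keep-alive lifetime of idle connections
	}
	// Client.Timeout bounds the whole exchange, including reading the body.
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

// newAPIServer sets the server-side timeouts (client -> proxy).
func newAPIServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       5 * time.Second,
		WriteTimeout:      10 * time.Second,
		IdleTimeout:       60 * time.Second, // keep-alive timeout
	}
}

func main() {
	c := newUpstreamClient()
	s := newAPIServer(":8080", http.NotFoundHandler())
	fmt.Println(c.Timeout, s.ReadTimeout, s.IdleTimeout)
}
```

The zero value of every field here means "no timeout", so a proxy that leaves them unset can hang indefinitely on a slow peer.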

Slide 42

Slide 42 text

Monitoring Gaurun

Slide 43

Slide 43 text

Monitoring Gaurun status
・GET /stat/app

{
  "queue_max": 12000000,
  "queue_usage": 515497,
  "pusher_max": 1152,
  "pusher_count": 640,
  "ios": {
    "push_success": 2465897,
    "push_error": 5704
  },
  "android": {
    "push_success": 1416295,
    "push_error": 4118
  }
}
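
A monitoring script can derive ratios such as queue usage and push success rate from this response. The sketch below decodes the sample JSON from the slide; the struct and function names are hypothetical, and only the fields shown on the slide are assumed to exist.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// platformStat holds the per-platform counters from /stat/app.
type platformStat struct {
	PushSuccess int64 `json:"push_success"`
	PushError   int64 `json:"push_error"`
}

// successRate returns successes / (successes + errors), or 0 when there is no data.
func (p platformStat) successRate() float64 {
	total := p.PushSuccess + p.PushError
	if total == 0 {
		return 0
	}
	return float64(p.PushSuccess) / float64(total)
}

// appStat mirrors the fields shown on the slide.
type appStat struct {
	QueueMax    int64        `json:"queue_max"`
	QueueUsage  int64        `json:"queue_usage"`
	PusherMax   int64        `json:"pusher_max"`
	PusherCount int64        `json:"pusher_count"`
	IOS         platformStat `json:"ios"`
	Android     platformStat `json:"android"`
}

// parseStat decodes a /stat/app response body.
func parseStat(data []byte) (appStat, error) {
	var s appStat
	err := json.Unmarshal(data, &s)
	return s, err
}

// sample is the response body copied from the slide.
const sample = `{
  "queue_max": 12000000, "queue_usage": 515497,
  "pusher_max": 1152, "pusher_count": 640,
  "ios": {"push_success": 2465897, "push_error": 5704},
  "android": {"push_success": 1416295, "push_error": 4118}
}`

func main() {
	s, err := parseStat([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("queue usage: %.2f%%\n", 100*float64(s.QueueUsage)/float64(s.QueueMax))
	fmt.Printf("iOS success rate: %.2f%%\n", 100*s.IOS.successRate())
	fmt.Printf("Android success rate: %.2f%%\n", 100*s.Android.successRate())
}
```

Values like these are the kind of metric that can be graphed or used for threshold alerting.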

Slide 44

Slide 44 text

Monitoring Go status
・GET /stat/go

$ curl -s http://127.0.0.1:1056/stat/go | jq '.goroutine_num'
2326
$ curl -s http://127.0.0.1:1056/stat/go | jq '.heap_objects'
27428
$ curl -s http://127.0.0.1:1056/stat/go | jq '.gc_num'
44695
$ …

Slide 45

Slide 45 text

Graphs: number of goroutines, memory allocation

Slide 46

Slide 46 text

push success rate

Slide 47

Slide 47 text

Threshold alerting with Mackerel

Slide 48

Slide 48 text

High Performance Gaurun

Slide 49

Slide 49 text

High Performance Gaurun
• Gaurun's behavior is configurable by TOML
• Parameter tuning is required for high performance
  • The default configuration is very conservative and not tuned for high performance

Slide 50

Slide 50 text

Configuration in TOML

[core]
port = "1056"
workers = 8
queues = 4192
pusher_max = 32

[android]
apikey = "…"
enabled = true
keepalive_conns = 32

[ios]
pem_cert_path = "/path/to/cert.pem"
pem_key_path = "/path/to/key.pem"
enabled = true
sandbox = false
topic = "…"
keepalive_conns = 32

Slide 51

Slide 51 text

Parameter tuning
• core.workers
  • number of goroutines that dequeue push notifications from the channel-based queue
• core.queues
  • size of the channel-based queue for push notifications
• core.pusher_max
  • number of goroutines per worker that push notifications to APNs and GCM/FCM

Slide 52

Slide 52 text

Parameter tuning
• (ios|android).timeout
  • timeout for pushing a notification to APNs or GCM/FCM
• (ios|android).keepalive_conns
  • number of idle connections kept open to APNs or GCM/FCM
• (ios|android).keepalive_timeout
  • how long a keep-alive connection to APNs or GCM/FCM is kept open
• (ios|android).retry_max
  • maximum retry count for pushing a notification to APNs or GCM/FCM

Slide 53

Slide 53 text

Performance tuning
• Increase the number of simultaneous push notifications
  • core.workers x core.pusher_max
• Increase core.queues
  • If the channel is full, the number of goroutines grows and Gaurun slows down.
• Increase (ios|android).keepalive_conns
But too large a number is not good!

Slide 54

Slide 54 text

Configure push-throughput dynamically

curl \
    -XPUT \
    "http://127.0.0.1:1056/config/pushers?max=32"

Configure core.pusher_max via HTTP dynamically
Do not set too large a number!
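
The same PUT request can be issued from Go, for example from an operations tool. This is a sketch only: setPusherMax and demo are hypothetical helpers, and demo runs against a local stand-in server that merely validates the method, path, and max query parameter, so the behavior of the real endpoint is not reproduced.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// setPusherMax sends PUT {baseURL}/config/pushers?max=N and returns the status code.
func setPusherMax(client *http.Client, baseURL string, max int) (int, error) {
	q := url.Values{"max": {strconv.Itoa(max)}}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/config/pushers?"+q.Encode(), nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo exercises setPusherMax against a stand-in server (NOT Gaurun itself).
func demo() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n, err := strconv.Atoi(r.URL.Query().Get("max"))
		if r.Method != http.MethodPut || r.URL.Path != "/config/pushers" || err != nil || n <= 0 {
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	return setPusherMax(srv.Client(), srv.URL, 32)
}

func main() {
	status, err := demo()
	fmt.Println(status, err)
}
```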

Slide 55

Slide 55 text

Bulk enqueue
• POST /push can accept multiple push notifications in a single request
  • Limited by core.notification_max
  • The default value is 100
• Example request payload

{
  "notifications" : [
    {
      "token" : ["xxx"],
      "platform" : 1,
      "message" : "Hello, iOS!"
    },
    {
      "token" : ["yyy"],
      "platform" : 2,
      "message" : "Hello, Android!"
    }
  ]
}
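
A sender with more notifications than the per-request limit has to split them into several requests. The following sketch shows one way to do that on the client side; Notification and splitBatches are hypothetical names, and the limit of 100 is the default value of core.notification_max mentioned above.

```go
package main

import "fmt"

// Notification mirrors one entry of the "notifications" array (hypothetical Go type).
type Notification struct {
	Token    []string `json:"token"`
	Platform int      `json:"platform"`
	Message  string   `json:"message"`
}

// splitBatches cuts ns into consecutive slices of at most max elements,
// one slice per POST /push request. It returns nil when max is not positive.
func splitBatches(ns []Notification, max int) [][]Notification {
	if max <= 0 {
		return nil
	}
	var batches [][]Notification
	for len(ns) > 0 {
		n := max
		if len(ns) < n {
			n = len(ns)
		}
		batches = append(batches, ns[:n])
		ns = ns[n:]
	}
	return batches
}

func main() {
	const notificationMax = 100 // default of core.notification_max per the slide

	all := make([]Notification, 250)
	for i := range all {
		all[i] = Notification{Token: []string{fmt.Sprintf("token-%d", i)}, Platform: 1, Message: "Hello"}
	}
	for i, b := range splitBatches(all, notificationMax) {
		fmt.Printf("request %d carries %d notifications\n", i+1, len(b))
	}
}
```

Each batch would then be marshalled as the "notifications" array of one request body.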

Slide 56

Slide 56 text

Device token screening
• Device tokens are sometimes invalidated
• Let's remove invalidated tokens from the database periodically
  • If the number of invalidated device tokens is reduced, pushing notifications takes less time
• We can tell whether a device token is invalid from the response from APNs and GCM/FCM

Slide 57

Slide 57 text

Daily device token screening @ Mercari (flow diagram): Gaurun outputs a JSON log, which is uploaded to S3; a batch job downloads it, parses the JSON log, and issues DELETE statements to MySQL.

Slide 58

Slide 58 text

Conclusion
• Push notification has high network latency
  • High concurrency is required
• Go is a good choice for a push notification server, because
  • Go provides the useful net/http package
  • Go can handle a very large number of goroutines simultaneously

Slide 59

Slide 59 text

References
• Gaurun
  • https://github.com/mercari/gaurun
• Mercari's push notification system built with nginx and Go (in Japanese)
  • http://tech.mercari.com/entry/2015/08/11/172206
• High Performance Gaurun: the middleware behind Mercari's large-scale push delivery (in Japanese)
  • http://tech.mercari.com/entry/2016/11/08/170343