Slide 1

Slide 1 text

Lessons learned: Hosting large-scale backends like the “Eurovision Song Contest” on Microsoft Azure
Dr. Christian Geuer-Pollmann @chgeuer http://blog.geuer-pollmann.de/

Slide 2

Slide 2 text

Agenda
• Architecture Overview
• Operations
• Security
• Load Testing
• Performance
• Connectivity

Slide 3

Slide 3 text

• Kampf der Orchester (SRF)
• Eurovision Song Contest 2015 (EBU / ORF)
• Quizduell im Ersten (Das Erste / NDR)
• Spiel für Dein Land (Das Erste / SRF / ORF)

Slide 4

Slide 4 text

Projects

Slide 5

Slide 5 text

im Ersten

Slide 6

Slide 6 text

Technology Partner

Slide 7

Slide 7 text

https://www.youtube.com/watch?v=UR-sXNWp9H4

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Next broadcast: Sat, 12 Dec 2015 | 8:15 pm

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Solution Overview
• Support 2+ million concurrent connections
• Sub-second in-app notifications
• Voting and fast aggregation
• WebSockets for bi-directional communication
• Built on Azure "Cloud Services" ("PaaS v1"): ASP.NET, SignalR

Slide 12

Slide 12 text

General Architecture

Slide 13

Slide 13 text

Patterns

Slide 14

Slide 14 text

Paranoia – Trust no one
• KISS!!!
• Cloud Services: affinity, network, CPU, memory only.
• Reduce moving pieces. If you can eliminate third-party services, do so.
• Be asynchronous towards potentially blocking or failing components.
• Retries against the data store shouldn't block the critical path.
https://github.com/chgeuer/RedisCloudService
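One way to keep data-store retries off the critical path is to hand writes to a background worker. The following is a minimal sketch under stated assumptions, not the implementation used in the talk: `BackgroundWriter`, the `store` callable, and all limits are hypothetical names and values chosen for illustration.

```python
import queue
import threading
import time


class BackgroundWriter:
    """Accepts writes on the request path and persists them on a worker thread."""

    def __init__(self, store, max_pending=10_000, max_attempts=3, backoff_s=0.01):
        self._store = store  # hypothetical callable: store(item), may raise
        self._pending = queue.Queue(maxsize=max_pending)
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self.dropped = 0  # items rejected because the queue was full
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        """Called from the request path; never waits on the data store."""
        try:
            self._pending.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            item = self._pending.get()
            for attempt in range(1, self._max_attempts + 1):
                try:
                    self._store(item)
                    break
                except Exception:
                    # Retry with a growing pause, away from the request path.
                    time.sleep(self._backoff_s * attempt)
            # After max_attempts failures the item is given up on (sketch only).
            self._pending.task_done()

    def drain(self):
        """Block until every queued item has been processed."""
        self._pending.join()
```

The request handler only pays for an in-memory enqueue; a slow or failing store shows up as a growing queue or a rising `dropped` count instead of as blocked requests.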

Slide 15

Slide 15 text

No external dependencies

Slide 16

Slide 16 text

Paranoia – Don't trust your own solution
Multi-paradigm fallbacks
• Realtime updates via WebSockets
• Fallback to CDN
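The fallback idea can be sketched as a client that prefers the realtime channel and falls back to a published snapshot. This is an illustrative sketch only; `latest_status` and both fetch callables are hypothetical and stand in for a WebSocket read and a CDN download.

```python
def latest_status(fetch_realtime, fetch_from_cdn):
    """Return (status, source), preferring the realtime channel."""
    try:
        return fetch_realtime(), "realtime"
    except (ConnectionError, TimeoutError):
        # Realtime path is unavailable: serve the last published snapshot instead.
        return fetch_from_cdn(), "cdn"
```

Returning the source alongside the status lets the caller decide whether to show a "data may be delayed" hint.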

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Quorum and Fault Domains
• Quorum in PaaS v1 Cloud Services is "difficult": on paper, Compute v1 has only 2 fault domains (FDs).
• The new Compute v2 (ARM, Service Fabric) provides 3 FDs. Unfortunately, v2 was not available at the end of CY14.
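Why two fault domains make quorum difficult can be shown with a small calculation. The sketch below assumes a simple strict-majority quorum and the loss of one whole fault domain at a time; the function name is hypothetical.

```python
def survives_fault_domain_loss(nodes_per_fd):
    """True if a strict majority of nodes remains after losing any single fault domain."""
    total = sum(nodes_per_fd)
    majority = total // 2 + 1
    return all(total - lost >= majority for lost in nodes_per_fd)


# Two fault domains: losing the larger (or equal) half always breaks the majority.
two_fds_even = survives_fault_domain_loss([2, 2])
two_fds_uneven = survives_fault_domain_loss([3, 2])

# Three fault domains with an even spread keep a majority after any single loss.
three_fds = survives_fault_domain_loss([1, 1, 1])
```

With only two domains, whichever domain holds at least half of the nodes takes the majority down with it, regardless of how the nodes are split.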

Slide 19

Slide 19 text

Reduce Load on Backend
• Don't let all web roles hammer the backend. Reduce traffic to the central DB.
• Aggregate in the role
• Constant load on the backend
• Shared Access Signatures for profile pictures
http://blog.smarx.com/posts/architecting-scalable-counters-with-windows-azure
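"Aggregate in role" and "constant load on backend" can be sketched as a per-instance counter that is flushed on a fixed timer. This is a minimal sketch under assumptions: `LocalVoteAggregator` and the `flush_to_backend` callable are hypothetical, and the timer that calls `flush()` is left out.

```python
import threading
from collections import Counter


class LocalVoteAggregator:
    """Counts votes in memory on one web role instance."""

    def __init__(self, flush_to_backend):
        self._flush_to_backend = flush_to_backend  # hypothetical callable taking a dict
        self._counts = Counter()
        self._lock = threading.Lock()

    def record(self, option):
        """Called once per incoming vote; touches memory only."""
        with self._lock:
            self._counts[option] += 1

    def flush(self):
        """Send accumulated totals as one backend write; call on a fixed timer."""
        with self._lock:
            snapshot, self._counts = self._counts, Counter()
        if snapshot:
            self._flush_to_backend(dict(snapshot))
        return len(snapshot)
```

The backend then sees one write per instance per interval, so its load depends on the number of instances and the flush interval, not on the number of voters.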

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

HTTP Polling versus WebSockets
• Establishing TCP connections is expensive → strain on the TCP/IP stack
• Closed TCP connections are expensive (TIME_WAIT)
• UX: minimize realtime delay and latency
• WebSockets have no poll interval
• Polling requires authenticating each request
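A back-of-the-envelope calculation shows why polling scales poorly. The sketch assumes every client polls at a fixed interval and that events occur at a uniformly random point within that interval; the 5-second interval is an assumed value, and the client count is the target from the solution overview.

```python
def polling_request_rate(clients, poll_interval_s):
    """Requests per second the backend must absorb if every client polls."""
    return clients / poll_interval_s


def mean_polling_delay_s(poll_interval_s):
    """Average wait until the next poll picks up a new event."""
    return poll_interval_s / 2


rate = polling_request_rate(2_000_000, 5)  # 5 s interval is an assumption
delay = mean_polling_delay_s(5)
```

Shortening the interval improves the delay but raises the request rate proportionally, which is the trade-off a push channel avoids.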

Slide 22

Slide 22 text

Don't use plain HTTP polling
• Votes via POST
• Status via GET

Slide 23

Slide 23 text

Automate everything!

Slide 24

Slide 24 text

Automate everything!

Slide 25

Slide 25 text

Network Tweaking
• HTTP
  • http.sys max connections
  • Concurrent requests per CPU
  • Request queue limit
• TCP
  • TIME_WAIT
  • Max. TCP retransmissions
• Windows OS
https://github.com/chgeuer/Quizzer/blob/master/Quizzer.Web/SetupScripts/install2.ps1

Slide 26

Slide 26 text

Network Tweaking – Receive Side Scaling
• “Receive side scaling (RSS) is a network driver technology that enables the efficient distribution of network receive processing across multiple CPUs in multiprocessor systems.” [1]
• https://github.com/chgeuer/Quizzer/blob/master/Quizzer.Web/SetupScripts/install2.ps1#L190-L216
[1] https://msdn.microsoft.com/en-us/library/windows/hardware/ff556942(v=vs.85).aspx

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Traffic Volume Optimization
• Egress data volume for clients is high: questions and answers can have image attachments.
• Individually encrypted questions, zipped JSON in the CDN
• Change distribution time, path and costs
• Goal: separate bulk data and realtime traffic
• 500k people * 100 kB == 50 GB.
• 500k people * (question ID + key only) == 1 MB.
https://github.com/chgeuer/SelectiveFieldConfidentiality
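The volume figures can be reproduced with a one-line calculation. The sketch uses decimal units (1 kB = 1,000 bytes); the 32-byte realtime payload is an assumed size for a question ID plus key, chosen only to show the order of magnitude.

```python
def total_egress_bytes(clients, bytes_per_client):
    """Total bytes sent if every client receives the same payload."""
    return clients * bytes_per_client


GB = 10**9

# Bulk path: 500k clients each download about 100 kB ahead of the show.
bulk = total_egress_bytes(500_000, 100_000)

# Realtime path: only an ID and a key per client (32 bytes is an assumption).
realtime = total_egress_bytes(500_000, 32)

ratio = bulk // realtime
```

Whatever the exact key and ID size, the realtime path carries several orders of magnitude less data than the bulk download, which can be served from the CDN before the show starts.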

Slide 29

Slide 29 text

SignalR (and other) Performance Guidance
• There's no sizing info, as patterns vary heavily
• Load testing is the (only) answer!
https://github.com/SignalR/SignalR/wiki/Performance

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Load Testing Challenges
• There is never enough time for testing.
• High number of concurrent users
• Long-lived connections
• Each public IP can establish a theoretical maximum of 64k connections to http://target:80/
• Custom protocol on top of SignalR
Developed our own load test framework (“bot net”)
https://github.com/chgeuer/AzureDistributedRunner
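The 64k limit follows from TCP addressing: with a fixed destination IP and port and a single source IP, only the 16-bit source port distinguishes connections. The sketch below estimates how many distinct source IPs a load test needs; the function name and the practical per-IP budget in the second call are assumptions.

```python
import math


def source_ips_needed(target_connections, ports_per_ip=64_000):
    """Minimum number of distinct source IPs to open target_connections to one endpoint."""
    return math.ceil(target_connections / ports_per_ip)


# Theoretical lower bound for 2 million concurrent connections.
theoretical = source_ips_needed(2_000_000)

# With an assumed, more conservative budget of 30k usable ports per IP.
conservative = source_ips_needed(2_000_000, ports_per_ip=30_000)
```

In practice the usable ephemeral port range and the memory of each load generator push the node count above the theoretical bound.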

Slide 32

Slide 32 text

Load Test Setup
https://github.com/chgeuer/AzureDistributedRunner

Slide 33

Slide 33 text

Spin up 60 individual nodes (unique source IPs)

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Security Rule #1 – Know your threat model
• Quizduell prize draw
• QD “Hall of Fame”
• QD double-voting
• ESC votes via SMS

Slide 36

Slide 36 text

Performance vs. Security
Caution: Don't generalize this specific decision!
• Used TLS for registration and login only
• TLS is a burden on the CPU; we did custom authN (HMAC only).
• Different APIs might have different security requirements and protocols (possible due to the closed nature of the system)
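HMAC-only request authentication can be sketched with Python's standard library. This is not the protocol used in the talk: the message layout, function names, and the 30-second clock-skew window are all assumptions made for illustration, and the shared secret is presumed to have been issued over TLS at login.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes, timestamp: int) -> str:
    """Compute a hex HMAC-SHA256 tag over the request fields (layout is hypothetical)."""
    message = b"\n".join([method.encode(), path.encode(), str(timestamp).encode(), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, body, timestamp, signature, now, max_skew_s=30) -> bool:
    """Reject stale timestamps, then compare tags in constant time."""
    if abs(now - timestamp) > max_skew_s:
        return False
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

An HMAC tag proves who sent the request and that it was not altered, but it does not hide the content, which is why the caution above about not generalizing the decision applies.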

Slide 37

Slide 37 text

Security reviews you didn't ask for…
Your client implementation is never private

Slide 38

Slide 38 text

3 days before “go-live” http://quizduellforum.de/index.php?topic=478.0

Slide 39

Slide 39 text

Live API provides status, and links to bulk data in CDN
A manifest
// http://qd-prod.appsfactory.de/api/info
{
  "AgbChange": "2015-10-23T18:55:00",
  "DsChange": "2015-10-23T18:55:00",
  "TbChange": "2015-10-23T18:55:00",
  "Live": false,
  "IdxVersion": 2,
  "RankingBlobUrl": "https://az692393.vo.msecnd.net/rankings/top50.zip",
  "RankingTimestamp": 1446231308743,
  "Capped": false,
  "PlayAlong": false,
  "Apps": [
    { "OS": "iOS", "Version": "1.6", "Force": false },
    { "OS": "Android", "Version": "1.2.7", "Force": false },
    { "OS": "Windows", "Version": "1.8.0.0", "Force": false },
    { "OS": "WindowsTablet", "Version": "1.8.0.0", "Force": false }
  ],
  "Duells": [
    "https://az692393.vo.msecnd.net/duells/1179.zip",
    "https://az692393.vo.msecnd.net/duells/1253.zip",
    "https://az692393.vo.msecnd.net/duells/2274.zip",
    "https://az692393.vo.msecnd.net/duells/2275.zip"
  ]
}
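A client could use such a manifest to decide which bulk files to fetch from the CDN. The sketch below relies only on the `RankingBlobUrl` and `Duells` fields visible in the manifest; the function name and the caching behaviour are assumptions, not the app's actual logic.

```python
import json


def bulk_downloads(manifest_text, already_cached=()):
    """List the CDN URLs from the manifest that are not yet in the local cache."""
    manifest = json.loads(manifest_text)
    urls = [manifest["RankingBlobUrl"], *manifest.get("Duells", [])]
    cached = set(already_cached)
    return [url for url in urls if url not in cached]
```

The live API response stays small because it carries only links; the heavy payloads are fetched separately and can be cached between polls.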

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Thanks for the voluntary analysis
“For all further access to the web API, however, we have to supply a so-called user token so that the server responds to us at all. We only receive this user token after authenticating with our Google account via Google's OAuth 2 service.”
http://quizduellforum.de/index.php?topic=478.0

Slide 42

Slide 42 text

Thanks for the voluntary analysis (2)
“After downloading [...] we discover [...] a catalogue of all questions, but the questions are encrypted. [...] The key is delivered to the players at the start of each question round [...]. Until then the questions remain under lock and key, because the symmetric AES encryption scheme used is uncrackable. [...] So cheating is not possible, unless you count knowing the second and third questions at the start of a round, while others only see the first question, as cheating.”
http://quizduellforum.de/index.php?topic=478.0

Slide 43

Slide 43 text

Thanks for the voluntary analysis (3)
“Thanks to the clever advance downloads of the questions and team photos, only cryptographic keys and some metadata have to be exchanged during a live duel. This is certainly a significant reduction in the volume of data transferred. In addition, so-called WebSockets are used, which offer many performance advantages over the old app for live synchronization of the game.”
http://quizduellforum.de/index.php?topic=478.0

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Heating up Production
Pre-heat your app

Slide 46

Slide 46 text

Autoscaling

Slide 47

Slide 47 text

How we handled auto-scaling
We did not auto-scale!

Slide 48

Slide 48 text

Logging and Instrumentation
• Instrument your infrastructure. Know what the load is on your nodes.
• Using Microsoft standard logging (Performance Counters) helps.
• Monitor everything (VMs, CDN)
• Realtime logging for startup tasks: chgeuer/UnorthodoxAzureLogging

Slide 49

Slide 49 text

Schedule

Slide 50

Slide 50 text

Dr. Christian Geuer-Pollmann @chgeuer http://blog.geuer-pollmann.de/