Slide 1

Slide 1 text

Inside Cloudbleed Marek Majkowski @majek04

Slide 2

Slide 2 text

2 Cloudflare locations

Slide 3

Slide 3 text

Reverse proxy 3 [Diagram: Eyeball → Reverse proxy → Origin server] • Caching • Security • DDoS protection • Optimizations

Slide 4

Slide 4 text

4

Slide 5

Slide 5 text

5

Slide 6

Slide 6 text

Friday, 23 Feb 6 “Our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.”

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

8
                              Heartbleed                     Cloudbleed
When?                         1 Apr 2014                     17 Feb 2017
Uninitialized memory during:  TLS Heartbeat Requests         Malformed HTML passing Cloudflare
Affected:                     All OpenSSL servers globally   All Cloudflare customers
Worst problem:                SSL keys leaked                Cached by search engines

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

10 16:11 Friday, Feb 17 PT (T-00:21)

Slide 11

Slide 11 text

11

Slide 12

Slide 12 text

12

Slide 13

Slide 13 text

16:32 Friday, Feb 17 PT (T+00:00) 13 • Received bug details from Tavis Ormandy

Slide 14

Slide 14 text

16:40 Friday, February 17 PT (T+00:08) • Team assembled in San Francisco • Initial impact assessed 14

Slide 15

Slide 15 text

17:19 Friday, Feb 17 PT (T+00:47) • "Email Obfuscation" feature disabled globally • Project Zero confirms they no longer see leaking data 15

Slide 16

Slide 16 text

17:22 Friday, Feb 17 PT (T+00:50) 16

Slide 17

Slide 17 text

...the data was still leaking... 17

Slide 18

Slide 18 text

20:24 Friday, Feb 17 PT (T+03:52) • Automatic HTTPS Rewrites disabled worldwide 18

Slide 19

Slide 19 text

23:22 Friday, Feb 17 PT (T+06:50) • Implemented and deployed kill switch for Server-Side Excludes 19

Slide 20

Slide 20 text

13:59 Monday, Feb 20 PT (T+3d) • SAFE_CHAR fix deployed globally 20
2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client: 127.0.0.1, server: localhost, request: "GET /malformed-test.html HTTP/1.1"

Slide 21

Slide 21 text

10:03 Tuesday, Feb 21 PT (T+4d) • Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide 21

Slide 22

Slide 22 text

15:00 Thursday, Feb 23 PT (T+6d) 22

Slide 23

Slide 23 text

The technical cause 23

Slide 24

Slide 24 text

Unclosed HTML attribute at end of page • Domain had to have one of: • Email Obfuscation • Server-Side Excludes + other feature • Automatic HTTPS Rewrites + other feature • Page had to end with something like an unterminated attribute (see the buffer diagrams a few slides later) 24

Slide 25

Slide 25 text

25

Slide 26

Slide 26 text

Buffer overrun 26
/* generated C code */
if ( ++p == pe )        /* p: current position, pe: end of buffer */
    goto _test_eof;
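
The generated check is an equality test, not a bound check: it only fires if p lands exactly on pe. Here is a minimal C sketch of why that matters (my own illustration, not Cloudflare's code; the page contents and the "SECRET" bytes are made up): once something advances p past pe, the test can never come true again and the loop keeps emitting whatever sits in memory after the page.

    #include <stdio.h>

    int main(void) {
        /* A page that ends mid-attribute, as in Cloudbleed. Everything after
         * the opening quote stands in for whatever happens to sit in memory
         * right after the page buffer (the "SECRET" bytes are made up). */
        char mem[32] = "<script type=\"SECRET-NEXT-DATA";
        char *p  = mem;
        char *pe = mem + 14;                  /* one past the last byte of the page */

        while (p < mem + sizeof(mem) - 1) {   /* hard stop so the sketch stays in bounds */
            if (*p == '"')
                p++;                          /* stand-in for the missing fhold: one extra advance */
            if (++p == pe)                    /* equality only: a p that already stepped over pe never matches */
                break;
            printf("consumed: %c\n", *p);
        }
        return 0;
    }

Run it and the output walks straight past the end of the page into the "SECRET" bytes, which is exactly the failure mode the slide describes.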

Slide 27

Slide 27 text

27 [Buffer diagram: bytes < s c r i p t   t y p e = " at the end of the page, followed by memory past the buffer; p marks the parser position, pe the end of the buffer]

Slide 28

Slide 28 text

28 [Buffer diagram, next animation frame: same page ending, with the p and pe markers]

Slide 29

Slide 29 text

29 [Buffer diagram, next animation frame: same page ending, with the p and pe markers]

Slide 30

Slide 30 text

30
script_consume_attr := \
    ((unquoted_attr_char)* :>> (space|'/'|'>'))
    @{ fhold; fgoto script_tag_parse; }         Ok!
    $lerr{ fgoto script_consume_attr; };        Error! Missing fhold...

fhold --> p--
fgoto --> p++
(Ragel parser generator)
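
In Ragel, fhold keeps the machine on the current byte (the generated code decrements p) and fgoto re-enters another machine, whose loop then does ++p. Below is a small C model of just that bookkeeping (my simplification, based on the fhold/fgoto mapping shown on the slide, not the actual generated file): with fhold the byte that triggered the error is re-read; without it, each trip through the $lerr action nets a one-byte step forward, which is how p eventually walks over pe.

    #include <stdio.h>

    /* fhold ~ "p--", fgoto ~ re-enter the machine, whose loop does "++p".
     * This models only that bookkeeping, nothing else from the parser. */
    static const char *lerr_action(const char *p, int with_fhold) {
        if (with_fhold)
            p--;        /* fhold: step back so the next ++p re-reads this byte */
        return p + 1;   /* the machine we fgoto into advances p again */
    }

    int main(void) {
        const char buf[] = "<script type=\"";
        const char *last = buf + sizeof(buf) - 2;   /* the unterminated quote */

        printf("with fhold:    p stays at offset %ld\n", (long)(lerr_action(last, 1) - buf));
        printf("without fhold: p moves to offset %ld (past the page)\n", (long)(lerr_action(last, 0) - buf));
        return 0;
    }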

Slide 31

Slide 31 text

31 The bug was dormant for months...

Slide 32

Slide 32 text

32 ...until we started replacing the old parser.

Slide 33

Slide 33 text

33 [Diagram: features split between the old parser and the new parser: email obfuscation, server-side exclude, other features]

Slide 34

Slide 34 text

34 [Diagram: features split between the old parser and the new parser: email obfuscation, server-side exclude, https rewrites, other features]

Slide 35

Slide 35 text

35 [Diagram: features split between the old parser and the new parser: email obfuscation, server-side excludes, https rewrites, other features]

Slide 36

Slide 36 text

36 [Diagram: features split between the old parser and the new parser: https rewrites, server-side excludes, other features, with email obfuscation appearing on both parsers]

Slide 37

Slide 37 text

Irony: we caused this because we were migrating away from the buggy parser. 37

Slide 38

Slide 38 text

Triggering conditions • Buffer smaller than 4KB • Page ends with a malformed script or img tag • Enabled features used both the new and the old parser 38

Slide 39

Slide 39 text

39 Replacing old buggy code might expose unknown bugs.

Slide 40

Slide 40 text

Good, bad, ugly 40

Slide 41

Slide 41 text

41 Good • Bug mitigated in 47 minutes

Slide 42

Slide 42 text

Good • We were using Ragel to generate the parser code • (though we were using it incorrectly) 42

Slide 43

Slide 43 text

Bad: leaking sensitive data • Customer data: HTTP headers, including cookies; POST data (passwords, potentially credit card numbers, SSNs); URI parameters; JSON blobs for API calls; API authentication secrets; OAuth keys • Cloudflare-private data: keys, authentication secrets 43

Slide 44

Slide 44 text

Bad: leaking sensitive data 44

Slide 45

Slide 45 text

45 Bad: leaking sensitive data

Slide 46

Slide 46 text

Bad: leaking sensitive data 46

Slide 47

Slide 47 text

Ugly: cached by search engines 47 • Google, Bing, Yahoo, Yandex, Baidu, DuckDuckGo...

Slide 48

Slide 48 text

48 Ugly: cached by search engines • "CF-Host-Origin-IP:", "authorization:" • Total: 770 URLs, 161 domains

Slide 49

Slide 49 text

49 Remove sensitive data from search engine caches before going public.

Slide 50

Slide 50 text

Going public 50

Slide 51

Slide 51 text

Project Zero disclosure timelines • 90 days • 7 days for critical vulnerabilities under active exploitation 51

Slide 52

Slide 52 text

52

Slide 53

Slide 53 text

53 15:00 Thursday, Feb 23 PT

Slide 54

Slide 54 text

Friday, Feb 24 54

Slide 55

Slide 55 text

55 Fully transparent, technical post-mortem is great.

Slide 56

Slide 56 text

For 11 days we didn't know how much data had leaked 56

Slide 57

Slide 57 text

57

Slide 58

Slide 58 text

58 Estimate the impact early on.

Slide 59

Slide 59 text

59 Wed, 1 Mar 2017

Slide 60

Slide 60 text

We are confident the bug wasn't deliberately exploited. 60

Slide 61

Slide 61 text

61 Purged 80,000 pages from search engine caches.

Slide 62

Slide 62 text

Impact statistics • SAFE_CHAR logs allowed us to estimate the impact. • September 22, 2016 -> February 13, 2017: 605,307 • February 13, 2017 -> February 18, 2017: 637,034 62
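
Taking the slide's figures at face value, the two windows sum to 605,307 + 637,034 = 1,242,341, with slightly more than half of that concentrated in the final five days, after Email Obfuscation was partially moved to the new parser (see the timeline later in the deck).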

Slide 63

Slide 63 text

Each leaked page contained, on average (based on search engine caches): • 67.54 internal Cloudflare HTTP headers • 0.44 cookies • 0.04 authorization headers / tokens • No passwords, credit cards, or SSNs found in the wild 63

Slide 64

Slide 64 text

Estimated customer impact 64

Requests per Month    Estimated Leaks
------------------    ---------------
200B – 300B           22,356 – 33,534
100B – 200B           11,427 – 22,356
50B – 100B            5,962 – 11,427
10B – 50B             1,118 – 5,926
1B – 10B              112 – 1,118
500M – 1B             56 – 112
250M – 500M           25 – 56
100M – 250M           11 – 25
50M – 100M            6 – 11
10M – 50M             1 – 6
<10M                  < 1
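Read as a rate, the table is close to linear in traffic: the top row, 22,356 leaks for 200 billion requests a month, works out to roughly one leak per nine million requests (22,356 / 2×10¹¹ ≈ 1.1×10⁻⁷), and the other rows imply a similar ratio.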


Slide 65

Slide 65 text

The really ugly truth 65

Slide 66

Slide 66 text

It's been going on for months
• September 22, 2016: Automatic HTTPS Rewrites enabled new parser
• January 30, 2017: Server-Side Excludes migrated to new parser
• February 13, 2017: Email Obfuscation partially migrated to new parser
• February 18, 2017: Google reports problem to Cloudflare and leak is stopped
66 (brace annotations on the slide: 180 sites, 6500 sites)

Slide 67

Slide 67 text

It's been going on for months 67 [chart, annotated "Cloudbleed reported"]

Slide 68

Slide 68 text

68 Monitor your crashes.

Slide 69

Slide 69 text

From cloudbleed to zero crashes 69

Slide 70

Slide 70 text

70

Slide 71

Slide 71 text

All crashes will be investigated 71
From: SRE
To: nginx-dev
On 2017-02-22 between 3:00am and 3:30am UTC we noticed 17 core dumps in SJC, ORD, IAD, and one in DOG.

Slide 72

Slide 72 text

Most of the core dumps were obscure bugs in our code 72

Slide 73

Slide 73 text

73

Slide 74

Slide 74 text

But some we couldn't explain! 74

Slide 75

Slide 75 text

75

Slide 76

Slide 76 text

Mystery crash • Not: “I don’t understand why the program did that” • Rather: “I believe it is impossible for the program to reach this state, if executed correctly” • At a low level, computers are deterministic! 76

Slide 77

Slide 77 text

Mystery crashes • On average, ~1 mystery core dump a day • Scattered over all servers, all datacenters • Per server, 1 in 10 years • Can’t reproduce • Hard to try any potential fix 77
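
For scale (my own back-of-the-envelope; the fleet size is an assumption, not a number from the talk): one mystery crash per day spread across a few thousand servers is consistent with the per-server figure, e.g. ~3,650 servers each crashing once per 3,650 days (about 10 years) averages out to one crash per day fleet-wide.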

Slide 78

Slide 78 text

Mystery crashes • Cosmic rays? • Memory error? (we use ECC) • Faulty CPU? (mostly Intel) • Core dumps get corrupted somewhere? • OS bug generating core dumps? • OS virtual memory bug? TLB? 78

Slide 79

Slide 79 text

Broadwell trail 79
From: SRE
To: nginx-dev
During further investigation, Josh suggested checking whether the hardware generation could be relevant. Surprisingly, all 18 nginx SIGSEGV crashes happened on Intel Broadwell servers. Given that Broadwell machines are only about a third of our fleet, we suspect the crashes might be related to hardware.
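
A quick sanity check on that suspicion (my back-of-the-envelope, assuming crashes strike servers independently and uniformly): if Broadwell machines are a third of the fleet, the probability that all 18 crashes land on Broadwell by chance is (1/3)^18 ≈ 2.6×10⁻⁹, so the correlation is very unlikely to be a coincidence.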

Slide 80

Slide 80 text

80

Slide 81

Slide 81 text

81

Slide 82

Slide 82 text

82 BDF-76

Slide 83

Slide 83 text

Microcode update • Firmware that controls the lowest-level operation of the processor
 • Can be updated by the BIOS (from system vendor) or the OS
 • Microcode updates can change the behaviour of the processor to some extent, e.g. to fix errata 83
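
One way to see which microcode revision a Linux machine is actually running is the "microcode" field in /proc/cpuinfo; here is a small sketch (assumes a Linux host whose CPU driver exposes that field):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) {
            perror("fopen /proc/cpuinfo");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f)) {
            /* print the microcode revision reported for each logical CPU */
            if (strncmp(line, "microcode", 9) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }

The kernel also logs the revision it loads at boot, so the boot log is another place to confirm an update actually took effect.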

Slide 84

Slide 84 text

84 Microcode update

Slide 85

Slide 85 text

85 Keep your microcode updated

Slide 86

Slide 86 text

Thank you 86
marek@cloudflare.com @majek04
• Fully transparent post-mortem helps
• Estimate incident impact early on

Slide 87

Slide 87 text

87