Inside Cloudbleed

majek04

December 07, 2017

Transcript

  1. Inside Cloudbleed Marek Majkowski @majek04

  2. 2 Cloudflare locations

  3. Reverse proxy 3 Eyeball → Reverse proxy → Origin server • Caching

    • Security • DDoS protection • Optimizations
  4. 4

  5. 5

  6. Friday, 23 Feb 6 “Our edge servers were running past

    the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.”
  7. 7

  8. 8

                   When?         Uninitialized memory during:        Affected:                       Worst problem
     Heartbleed    1 Apr 2014    TLS Heartbeat Requests              All OpenSSL servers globally    SSL keys leaked
     Cloudbleed    17 Feb 2017   Malformed HTML passing Cloudflare   All Cloudflare customers        Cached by search engines
  9. 9

  10. 10 16:11 Friday, Feb 17 PT (T-00:21)

  11. 11

  12. 12

  13. 16:32 Friday, Feb 17 PT (T+00:00) 13 • Received bug

    details from Tavis Ormandy
  14. 16:40 Friday, February 17 PT (T+00:08) • Team assembled in

    San Francisco • Initial impact assessed 14
  15. 17:19 Friday, Feb 17 PT (T+00:47) • "Email Obfuscation" feature

    disabled globally • Project Zero confirm they no longer see leaking data 15
  16. 17:22 Friday, Feb 17 PT (T+00:50) 16

  17. ...the data was still leaking... 17

  18. 20:24 Friday, Feb 17 PT (T+03:52) • Automatic HTTPS Rewrites

    disabled worldwide 18
  19. 23:22 Friday, Feb 17 PT (T+06:50) • Implemented and deployed

    kill switch for Server-Side Excludes 19
  20. 13:59 Monday, 20 Feb PT (T+3d) • SAFE_CHAR fix deployed

    globally 20 2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client: 127.0.0.1, server: localhost, request: "GET /malformed-test.html HTTP/1.1"
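
    The [crit] line above is what the SAFE_CHAR mitigation logs when the parser tries to read past the end of the input buffer. A minimal sketch of that kind of bounds guard, assuming nothing about Cloudflare's actual implementation (the macro shape and log text here are illustrative):

        /* Illustrative bounds guard: return the byte only if it is inside
         * the buffer, otherwise log a [crit]-style message and return 0. */
        #include <stdio.h>

        #define SAFE_CHAR(p, pe)                                                  \
            ((p) < (pe) ? *(p)                                                    \
                        : (fprintf(stderr,                                        \
                               "[crit] filter tried to access char past EOF\n"),  \
                           (char)0))

        int main(void) {
            const char buf[] = "<script type=\"";
            const char *p  = buf;
            const char *pe = buf + sizeof(buf) - 1;

            printf("in bounds: %c\n", SAFE_CHAR(p, pe));   /* prints '<' */
            SAFE_CHAR(pe, pe);                             /* past EOF: logs instead of reading */
            return 0;
        }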
  21. 10:03 Tuesday, Feb 21 PT (T+4d) • Automatic HTTPS Rewrites,

    Server-Side Excludes and Email Obfuscation re-enabled worldwide 21
  22. 15:00 Thursday, Feb 23 PT (T+6d) 22

  23. The technical cause 23

  24. Unclosed HTML attribute at end of page • Domain had

    to have one of • Email Obfuscation • Server-Side Excludes + other feature • Automatic HTTPS Rewrites + other feature • Page had to end with something like 24 <script type=" <img height="50px" width="200px" src="
  25. 25

  26. Buffer overrun 26

    /* generated C code */
    if ( ++p == pe )        /* p: current position, pe: end of buffer */
        goto _test_eof;
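
    The generated test only checks for equality, so once p has been pushed past pe it never fires and the scan keeps reading. A minimal sketch of just that comparison (not the real generated code):

        /* Why "++p == pe" is fragile: once p overshoots pe, equality is
         * never true again, while ">= pe" would still stop the scan. */
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            const char page[] = "<script type=\"";   /* page ends mid-attribute */
            const char *pe = page + strlen(page);    /* end of buffer */
            const char *p  = pe + 1;                 /* pretend an action overshot pe */

            if (p == pe)
                puts("== test: end of buffer detected");
            else
                puts("== test: missed, the scan would keep reading past the buffer");

            if (p >= pe)
                puts(">= test: end of buffer detected");
            return 0;
        }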
  27. 27 [Diagram: buffer holding <script type=" at the end of the page, followed by empty cells; p marks the parser's current position, pe the end of the buffer]

  28. 28 [Diagram: p advances through the unterminated attribute towards pe]

  29. 29 [Diagram: p steps past pe and keeps reading memory beyond the end of the buffer]
  30. 30 Ragel parser generator

    script_consume_attr :=
        ((unquoted_attr_char)* :>> (space|'/'|'>'))
        @{ fhold; fgoto script_tag_parse; }        Ok!
        $lerr{ fgoto script_consume_attr; };       Error! Missing fhold...

    fhold --> p--    fgoto --> p++
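
    The annotations above note that fhold compiles to p-- and that the error action is missing it. A toy loop (illustrative only, not Ragel output) showing how the missing p-- combines with the equality test from slide 26; compile with -DFIXED=1 to restore the hold:

        /* Without the fhold (p--) in the error path, p ends up past pe and
         * the generated "++p == pe" test never fires. */
        #include <stdio.h>
        #include <string.h>

        #ifndef FIXED
        #define FIXED 0
        #endif

        int main(void) {
            const char page[] = "<script type=\"";   /* attribute value never closes */
            const char *p  = page;
            const char *pe = page + strlen(page);
            int past_end = 0;

            while (1) {
                if (*p == '"') {   /* buffer ends inside the attribute value */
                    p++;           /* the machine has consumed the quote */
        #if FIXED
                    p--;           /* fhold: hand the byte back before re-entry */
        #endif
                }
                if (++p == pe)     /* the generated end-of-buffer test */
                    break;
                if (p > pe) {      /* guard for this demo only */
                    past_end = 1;
                    break;
                }
            }
            puts(past_end ? "ran past the end of the buffer"
                          : "stopped exactly at the end of the buffer");
            return 0;
        }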
  31. 31 The bug was dormant for months...

  32. 32 ...until we started replacing the old parser.

  33. 33 [Diagram: email obfuscation, server-side excludes and other features all run on the old parser; the new parser is not yet used]

  34. 34 [Diagram: Automatic HTTPS Rewrites is enabled on the new parser; email obfuscation, server-side excludes and other features stay on the old parser]

  35. 35 [Diagram: server-side excludes move to the new parser alongside HTTPS rewrites]

  36. 36 [Diagram: email obfuscation also moves (partially) to the new parser, joining HTTPS rewrites and server-side excludes]
  37. Irony: we caused this because we were migrating away from

    the buggy parser. 37
  38. Triggering conditions • Buffer smaller than 4KB • End with

    malformed script or img tag • Enabled features using both new and old parser 38
  39. 39 Replacing old buggy code might expose unknown bugs.

  40. Good, bad, ugly 40

  41. 41 Good • Bug mitigated in 47 minutes

  42. Good • We were using Ragel to generate code •

    (incorrectly though) 42
  43. Bad: leaking sensitive data • Customer: • HTTP headers, including

    cookies; POST data (passwords, potentially credit card numbers, SSNs); URI parameters; JSON blobs for API calls; API authentication secrets; OAuth keys • Private Cloudflare: • Keys, authentication secrets 43
  44. Bad: leaking sensitive data 44

  45. 45 Bad: leaking sensitive data

  46. Bad: leaking sensitive data 46

  47. Ugly: cached by search engines 47 • Google, Bing, Yahoo,

    Yandex, Baidu, DuckDuckGo...
  48. 48 Ugly: cached by search engines • "CF-Host-Origin-IP:" "authorization:" •

    Total: 770 URLs, 161 domains
  49. 49 Remove sensitive data from search engine caches before going

    public.
  50. Going public 50

  51. Project Zero timelines • 90 days • 7 days -

    critical vulnerabilities under active exploitation 51
  52. 52

  53. 53 15:00 Thursday, Feb 23 PT

  54. Friday, Feb 24 54

  55. 55 Fully transparent, technical post-mortem is great.

  56. For 11 days we didn't know how much was leaked

    56
  57. 57

  58. 58 Estimate the impact early on.

  59. 59 Wed, 1 Mar 2017

  60. We are confident the bug wasn't deliberately exploited. 60

  61. 61 Purged 80,000 pages from search engine caches.

  62. Impact statistics • SAFE_CHAR logs allowed us to estimate the

    impact. • September 22, 2016 -> February 13, 2017 605,307 • February 13, 2017 -> February 18, 2017 637,034 62
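
    A rough reading of those two counts, assuming they cover the full periods shown: 605,307 over roughly 144 days is about 4,200 leaks per day, while 637,034 over 5 days is about 127,000 per day, so the rate jumped roughly 30x once Email Obfuscation started using the new parser on February 13.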
  63. Each leaked page contained • (based on search engine caches)

    • 67.54 Internal Cloudflare HTTP Headers • 0.44 Cookies • 0.04 Authorization Headers / Tokens • No passwords, credit cards, SSNs found in the wild 63
  64. Estimated customer impact 64

     Requests per Month   Estimated Leaks
     ------------------   ---------------
     200B – 300B          22,356 – 33,534
     100B – 200B          11,427 – 22,356
     50B – 100B           5,962 – 11,427
     10B – 50B            1,118 – 5,926
     1B – 10B             112 – 1,118
     500M – 1B            56 – 112
     250M – 500M          25 – 56
     100M – 250M          11 – 25
     50M – 100M           6 – 11
     10M – 50M            1 – 6
     <10M                 < 1
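
    Read as a ratio, the rows above are roughly linear: 200B / 22,356 ≈ 300B / 33,534 ≈ one estimated leak per ~9 million requests per month, which appears to be how the per-tier ranges were derived.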

  65. The really ugly truth 65

  66. It's been going on for months • September 22, 2016:

    • Automatic HTTPS Rewrites enabled new parser • January 30, 2017 • Server-Side Excludes migrated to new parser • February 13, 2017 • Email Obfuscation partially migrated to new parser • February 18, 2017 • Google reports problem to Cloudflare and leak is stopped 66 ⟯180 sites ⟯6500 sites
  67. It's been going on for months 67 Cloudbleed reported

  68. 68 Monitor your crashes.

  69. From Cloudbleed to zero crashes 69

  70. 70

  71. All crashes will be investigated 71 From: SRE To: nginx-dev

    On 2017-02-22 between 3:00am and 3:30am UTC we notice 17 core dumps in SJC, ORD, IAD, and one in DOG.
  72. Most of the core dumps were obscure bugs in our

    code 72
  73. 73

  74. But some we couldn't explain! 74

  75. 75

  76. Mystery crash • Not: “I don’t understand why the program

    did that” • Rather: “I believe it is impossible for the program to reach this state, if executed correctly” • At a low level, computers are deterministic! 76
  77. Mystery crashes • On average, ~1 mystery core dump a

    day • Scattered over all servers, all datacenters • Per server, 1 in 10 years • Can’t reproduce • Hard to try any potential fix 77
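
    The two rates on this slide are consistent with simple division, assuming the crashes were spread uniformly: one mystery core dump per day fleet-wide and one per server every ~10 years (≈ 3,650 days) together imply a fleet of a few thousand servers, and explain why no single machine ever crashed often enough to reproduce or bisect the problem.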
  78. Mystery crashes • Cosmic rays? • Memory error? (we use

    ECC) • Faulty CPU? (mostly Intel) • Core dumps get corrupted somewhere? • OS bug generating core dumps? • OS virtual memory bug? TLB? 78
  79. Broadwell trail 79 From: SRE To: nginx-dev During further investigation,

    Josh suggested to check if the generations of hardware could be relevant. Surprisingly all 18 nginx SIGSEGV crashes happened on Intel Broadwell servers. Given that broadwell are on a 1/3rd of our fleet, we suspect the crashes might be related to hardware.
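
    A quick sanity check on that suspicion, assuming crashes were landing on servers independently of hardware generation: with Broadwell making up about a third of the fleet, the chance that all 18 SIGSEGV crashes hit Broadwell machines by coincidence is (1/3)^18 ≈ 2.6 × 10⁻⁹, so the correlation is essentially impossible to explain by chance.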
  80. 80

  81. 81

  82. 82 BDF-76

  83. Microcode update • Firmware that controls the lowest-level operation of

    the processor
 • Can be updated by the BIOS (from system vendor) or the OS
 • Microcode updates can change the behaviour of the processor to some extent, e.g. to fix errata 83
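
    On Linux, the currently loaded microcode revision is visible in the "microcode" field of /proc/cpuinfo; a small sketch that prints it (the field exists on x86 Linux, the parsing here is illustrative):

        /* Print the microcode revision the kernel reports for the first CPU. */
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            FILE *f = fopen("/proc/cpuinfo", "r");
            if (!f) {
                perror("/proc/cpuinfo");
                return 1;
            }
            char line[256];
            while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "microcode", 9) == 0) {
                    fputs(line, stdout);   /* e.g. "microcode : 0x..." */
                    break;                 /* one entry per core; the first is enough here */
                }
            }
            fclose(f);
            return 0;
        }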
  84. 84 Microcode update

  85. 85 Keep your microcode updated

  86. Thank you 86 marek@cloudflare.com @majek04 • Fully transparent post-mortem helps

    • Estimate incident impact early on
  87. 87