
Inside Cloudbleed

majek04
December 07, 2017

Transcript

  1. Inside Cloudbleed
    Marek Majkowski @majek04


  2. 2
    Cloudflare locations


  3. Reverse proxy
    3
    Eyeball → Reverse proxy → Origin server
    • Caching
    • Security
    • DDoS protection
    • Optimizations


  4. 4


  5. 5


  6. Friday, 23 Feb
    6
    “Our edge servers were running past the end of a buffer and
    returning memory that contained private information such
    as HTTP cookies, authentication tokens, HTTP POST bodies,
    and other sensitive data.
    And some of that data had been cached by search engines.”


  7. 7


  8. 8
                        Heartbleed              Cloudbleed
    When?               1 Apr 2014              17 Feb 2017
    Uninitialized
    memory during:      TLS Heartbeat           Malformed HTML
                        requests                passing Cloudflare
    Affected:           All OpenSSL servers     All Cloudflare
                        globally                customers
    Worst problem:      SSL keys leaked         Cached by search
                                                engines


  9. 9


  10. 10
    16:11 Friday, Feb 17 PT (T-00:21)


  11. 11


  12. 12


  13. 16:32 Friday, Feb 17 PT (T+00:00)
    13
    • Received bug details from Tavis Ormandy


  14. 16:40 Friday, February 17 PT (T+00:08)
    • Team assembled in San Francisco
    • Initial impact assessed
    14


  15. 17:19 Friday, Feb 17 PT (T+00:47)
    • "Email Obfuscation" feature disabled globally
    • Project Zero confirm they no longer see leaking data
    15


  16. 17:22 Friday, Feb 17 PT (T+00:50)
    16


  17. ...the data was still leaking...
    17


  18. 20:24 Friday, Feb 17 PT (T+03:52)
    • Automatic HTTPS Rewrites disabled worldwide
    18


  19. 23:22 Friday, Feb 17 PT (T+06:50)
    • Implemented and deployed kill switch for Server-Side
    Excludes
    19


  20. 13:59 Monday, 20 Feb PT (T+3d)
    • SAFE_CHAR fix deployed globally
    20
    2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access
    char past EOF while sending response to client, client: 127.0.0.1,
    server: localhost, request: "GET /malformed-test.html HTTP/1.1"
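
    A minimal sketch of what a SAFE_CHAR-style guard can look like
    (hypothetical code, not Cloudflare's actual patch; the function name,
    the space fallback and the demo buffer are assumptions): every read of
    the parser cursor is bounds-checked, logged at crit level when it falls
    outside the buffer, and replaced by a harmless byte.

    /* Hypothetical SAFE_CHAR-style guard; sketch only, not Cloudflare's code. */
    #include <stdio.h>

    static char safe_char(const char *p, const char *buf, const char *buf_end)
    {
        if (p < buf || p >= buf_end) {
            /* Complain loudly, in the spirit of the log line above... */
            fprintf(stderr,
                    "[crit] email filter tried to access char past EOF\n");
            return ' ';        /* ...and hand the parser a harmless byte. */
        }
        return *p;             /* in range: return the real character */
    }

    int main(void)
    {
        const char buf[] = "<script type=\"";       /* unterminated attribute */
        const char *end  = buf + sizeof(buf) - 1;   /* end of buffer */

        printf("in range: '%c'\n", safe_char(buf + 1, buf, end));   /* 's' */
        printf("past EOF: '%c'\n", safe_char(end, buf, end));       /* ' ' */
        return 0;
    }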


  21. 10:03 Tuesday, Feb 21 PT (T+4d)
    • Automatic HTTPS Rewrites, Server-Side Excludes and
    Email Obfuscation re-enabled worldwide
    21


  22. 15:00 Thursday, Feb 23 PT (T+6d)
    22


  23. The technical cause
    23


  24. Unclosed HTML attribute at end of page
    • Domain had to have one of
    • Email Obfuscation
    • Server-Side Excludes + other feature
    • Automatic HTTPS Rewrites + other feature
    • Page had to end with something like an unclosed <script type="
    24


  25. 25


  26. Buffer overrun
    26
    /* generated C code */
    if ( ++p == pe )        /* p: current position, pe: end of buffer */
        goto _test_eof;
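
    To see why the equality test is fragile, here is a contrived,
    self-contained illustration (not the real generated parser; the
    overshoot is forced by hand): once the cursor has stepped past pe, an
    == check against the end of the buffer never fires again, while a >=
    check, or holding the cursor back as fhold does, still catches the
    overrun.

    /* Contrived illustration, not Cloudflare's generated parser. */
    #include <stdio.h>

    int main(void)
    {
        const char buf[] = "<script type=\"";   /* page ends mid-attribute */
        size_t p  = 0;                           /* current position */
        size_t pe = sizeof(buf) - 1;             /* end of buffer */

        p = pe + 1;        /* a transition without fhold pushes p past pe */

        if (++p == pe)                  /* the generated `++p == pe` test */
            puts("end of buffer reached");
        else
            puts("== check missed: p is already past pe, reads continue");

        if (p >= pe)                    /* a >= test would still catch it */
            puts(">= check (or holding p back) still stops the overrun");
        return 0;
    }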


  27. 27
    < s c r i p t t y p e = " ☐
    - - - - - - - - - - - - - - -
    p
    pe


  28. 28
    < s c r i p t t y p e = " ☐
    - - - - - - - - - - - - - - -
    p
    pe


  29. 29
    < s c r i p t t y p e = " ☐
    - - - - - - - - - - - - - - -
    p
    pe


  30. 30
    Ragel parser generator:

    script_consume_attr := \
      ((unquoted_attr_char)* :>> (space|'/'|'>'))
      @{ fhold; fgoto script_tag_parse; }      # Ok!
      $lerr{ fgoto script_consume_attr; };     # Error! Missing fhold....

    fhold --> p--
    fgoto --> p++


  31. 31
    The bug was dormant for months...


  32. 32
    ...until we started replacing
    the old parser.


  33. 33
    [Diagram: email obfuscation and server-side excludes still run on the
    old parser; other features use the new parser.]


  34. 34
    [Diagram: Automatic HTTPS Rewrites added, running on the new parser;
    email obfuscation and server-side excludes remain on the old parser.]


  35. 35
    [Diagram: server-side excludes migrated to the new parser; email
    obfuscation remains on the old parser.]


  36. 36
    [Diagram: email obfuscation partially migrated, now running on both the
    old and the new parser.]


  37. Irony: we caused this because we were
    migrating away from the buggy parser.
    37


  38. Triggering conditions
    • Buffer smaller than 4KB
    • End with malformed script or img tag
    • Enabled features using both new and old parser
    38


  39. 39
    Replacing old buggy code
    might expose unknown bugs.


  40. Good, bad, ugly
    40


  41. 41
    Good
    • Bug mitigated in 47 minutes


  42. Good
    • We were using Ragel to generate code
    • (incorrectly though)
    42


  43. Bad: leaking sensitive data
    • Customer:
    • HTTP headers, including cookies; POST data (passwords, potentially credit
    card numbers, SSNs); URI parameters; JSON blobs for API calls; API
    authentication secrets; OAuth keys
    • Cloudflare-internal:
    • Keys, authentication secrets
    43


  44. Bad: leaking sensitive data
    44


  45. 45
    Bad: leaking sensitive data


  46. Bad: leaking sensitive data
    46


  47. Ugly: cached by search engines
    47
    • Google, Bing, Yahoo, Yandex, Baidu, DuckDuckGo...


  48. 48
    Ugly: cached by search engines
    • "CF-Host-Origin-IP:" "authorization:"
    • Total: 770 URLs, 161 domains


  49. 49
    Remove sensitive data from
    search engine caches before
    going public.


  50. Going public
    50


  51. Project Zero timelines
    • 90 days
    • 7 days - critical vulnerabilities under active exploitation
    51


  52. 52


  53. 53
    15:00 Thursday, Feb 23 PT


  54. Friday, Feb 24
    54


  55. 55
    Fully transparent, technical
    post-mortem is great.


  56. For 11 days we didn't know
    how much was leaked
    56


  57. 57


  58. 58
    Estimate the impact early on.


  59. 59
    Wed, 1 Mar 2017


  60. We are confident the bug
    wasn't deliberately exploited.
    60


  61. 61
    Purged 80,000 pages from
    search engine caches.


  62. Impact statistics
    • SAFE_CHAR logs allowed us to estimate the impact.
    • September 22, 2016 -> February 13, 2017: 605,307
    • February 13, 2017 -> February 18, 2017: 637,034
    62
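
    As a toy version of that estimate (a sketch, not Cloudflare's actual
    tooling; the log file name is a placeholder), counting impact from the
    SAFE_CHAR logs boils down to counting lines that carry the "past EOF"
    message shown on slide 20:

    /* Toy impact counter: count SAFE_CHAR "past EOF" log lines. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("error.log", "r");    /* placeholder log file name */
        if (!f) { perror("error.log"); return 1; }

        char line[4096];
        long hits = 0;
        while (fgets(line, sizeof line, f))
            if (strstr(line, "tried to access char past EOF"))
                hits++;

        printf("potential leaks logged: %ld\n", hits);
        fclose(f);
        return 0;
    }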


  63. Each leaked page contained (on average)
    • (based on search engine caches)
    • 67.54 Internal Cloudflare HTTP Headers
    • 0.44 Cookies
    • 0.04 Authorization Headers / Tokens
    • No passwords, credit cards, SSNs found in the wild
    63


  64. Estimated customer impact
    64
    Requests per Month Estimated Leaks
    ------------------ -----------------
    200B – 300B 22,356 – 33,534
    100B – 200B 11,427 – 22,356
    50B – 100B 5,962 – 11,427
    10B – 50B 1,118 – 5,926
    1B – 10B 112 – 1,118
    500M – 1B 56 – 112
    250M – 500M 25 – 56
    100M – 250M 11 – 25
    50M – 100M 6 – 11
    10M – 50M 1 – 6
    <10M < 1 
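
    Reading the table as a rate, the estimates work out to roughly one
    potential leak per nine million requests per month (for example
    1B / 112 ≈ 8.9M and 200B / 22,356 ≈ 8.9M).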



  65. The really ugly truth
    65


  66. It's been going on for months
    • September 22, 2016:
    • Automatic HTTPS Rewrites enabled new parser
    • January 30, 2017
    • Server-Side Excludes migrated to new parser
    • February 13, 2017
    • Email Obfuscation partially migrated to new parser
    • February 18, 2017
    • Google reports problem to Cloudflare and leak is stopped
    66
    } 180 sites
    } 6500 sites


  67. It's been going on for months
    67
    Cloudbleed
    reported


  68. 68
    Monitor your crashes.


  69. From Cloudbleed
    to zero crashes
    69


  70. 70


  71. All crashes will be investigated
    71
    From: SRE
    To: nginx-dev
    On 2017-02-22 between 3:00am and 3:30am UTC we
    noticed 17 core dumps in SJC, ORD, IAD, and one
    in DOG.


  72. Most of the core dumps were
    obscure bugs in our code
    72


  73. 73


  74. But some we couldn't explain!
    74


  75. 75


  76. Mystery crash
    • Not: “I don’t understand why the program did that”
    • Rather: “I believe it is impossible for the program to
    reach this state, if executed correctly”
    • At a low level, computers are deterministic!
    76


  77. Mystery crashes
    • On average, ~1 mystery core dump a day
    • Scattered over all servers, all datacenters
    • Per server, 1 in 10 years
    • Can’t reproduce
    • Hard to try any potential fix
    77


  78. Mystery crashes
    • Cosmic rays?
    • Memory error? (we use ECC)
    • Faulty CPU? (mostly Intel)
    • Core dumps get corrupted somewhere?
    • OS bug generating core dumps?
    • OS virtual memory bug? TLB?
    78


  79. Broadwell trail
    79
    From: SRE
    To: nginx-dev
    During further investigation, Josh suggested to
    check if the generations of hardware could be
    relevant. Surprisingly all 18 nginx SIGSEGV
    crashes happened on Intel Broadwell servers.
    Given that Broadwells are about a 1/3rd of our
    fleet, we suspect the crashes might be related
    to hardware.


  80. 80


  81. 81


  82. 82
    BDF-76


  83. Microcode update
    • Firmware that controls the lowest-level operation of
    the processor

    • Can be updated by the BIOS (from system vendor) or
    the OS

    • Microcode updates can change the behaviour of the
    processor to some extent, e.g. to fix errata
    83
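
    A quick way to check which microcode revision a machine is actually
    running (a sketch; it assumes Linux, where x86 CPUs expose a
    "microcode" field in /proc/cpuinfo):

    /* Print the microcode revision reported by Linux for the first CPU. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("/proc/cpuinfo"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "microcode", 9) == 0) {
                fputs(line, stdout);   /* e.g. "microcode : 0x..." */
                break;                 /* one CPU is enough for a quick check */
            }
        fclose(f);
        return 0;
    }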


  84. 84
    Microcode
    update


  85. 85
    Keep your microcode updated


  86. Thank you
    86
    marek@cloudflare.com @majek04
    Fully transparent
    post-mortem helps
    Estimate incident
    impact early on


  87. 87
