
Life After EC2

A journey from slow recovery to realized potential.

Elasticsearch Inc

October 10, 2013

Transcript

  1. Life After EC2
    A Love Story
    @drewr

  2.

  3. EC2
    40 (data) nodes
    1 index
    500 shards
    12.5T (primaries)
    1 replica
    1.6B docs (Jul 2013)

  4. Carpathia
    8 (data) nodes
    1 index
    128 shards
    1 replica
    14 x 600G SSD
    32 cores, 64G RAM

  5. We are upgrading our new search cluster from
    0.90.1 to 0.90.3. The shard sizes are ~100GB on
    average, and it is taking an obscenely long time to
    recover shards on the nodes we have restarted. The
    restart took place roughly 45 minutes ago, and not a
    single shard has fully recovered yet. The load on the
    machines is minimal as is disk IO and network IO.
    We've bumped the node_concurrent_recoveries to
    6. But how long should this take?
    #1004 Tim Pease, 8 Aug 2013

  6. Jeez! It has been five hours now and only 5 of the
    128 shards have recovered. At this rate it will take a
    full week to get the cluster into a green state.
    ...
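    For scale, the arithmetic in that comment holds up: 5 shards in 5 hours is roughly 1 shard per hour, so 128 shards would need on the order of 128 hours, i.e. 5-6 days.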

  7. First things first
    Any anomalies in the dashboards?
    GitHub has *excellent* monitoring...

  8. GET /_nodes/hot_threads
    Really nice for inspecting where ES might be bound
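    From the shell this is just the following (port 9200 assumed; the cluster in this story answered on 9202):
    # sample the busiest threads on every node to see where ES is spending its time
    curl -s 'http://localhost:9200/_nodes/hot_threads'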

  9. dd if=/dev/zero of=/tmp/file...
    scp /tmp/file host2:/tmp

  10. dd if=/dev/zero of=/tmp/file...
    scp /tmp/file host2:/tmp
    ....66M/s
    Check the network... Hm, no way 10gigE is that slow
    No rush, let’s sleep on it
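    The dd flags are elided on the slide; a plausible version of the test (the file size here is an assumption) would be:
    # write a 1GB file of zeros, then copy it to the other node to eyeball throughput
    dd if=/dev/zero of=/tmp/file bs=1M count=1024
    scp /tmp/file host2:/tmp
    Note that this exercises the disks and ssh on both ends, not just the wire, which turns out to matter later.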

  11. curl -s http://git.io/KlTPxw | sh

  12. curl -s http://git.io/KlTPxw | sh
    --- /tmp/1 2013-08-08 21:34:59.352499371 -0700
    +++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
    @@ -66,13 +66,13 @@
    -code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
    +code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7

  13. curl -s http://git.io/KlTPxw | sh
    --- /tmp/1 2013-08-08 21:34:59.352499371 -0700
    +++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
    @@ -66,13 +66,13 @@
    -code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
    +code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7
    ...20M/s
    OK, I think I have enough evidence here...
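    The numbers in that diff tell the whole story: the replica grows by about 630 MB over the ~30 seconds between the two snapshots, which is roughly 20 MiB/s:
    echo $(( (217412218715 - 216782024539) / 30 ))   # 21006472 bytes/s, i.e. ~20 MiB/s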

  14. Allocation

  15.

  16. P P

  17. P P
    R R

  18. P P
    R R
    cluster.routing.allocation.node_concurrent_recoveries
    Per node!
    Why didn’t this help?
    Probably not blocked on deciding where shards go
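    For reference, this is the knob the GitHub issue above had already bumped to 6; setting it dynamically looks roughly like this (port 9200 assumed):
    curl -XPUT localhost:9200/_cluster/settings -d'
    {
      "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 6
      }
    }
    '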

  19. Recovery

  20. P P

  21. P R
    P

  22. P R R
    P

  23. P R

  24. P R

  25. P R
    indices.recovery.max_bytes_per_sec

  26. P R
    indices.recovery.max_bytes_per_sec
    20M/s
    Chunks (default 512k) are read & written, throttled to max_bytes per second
    Setting which controls that...
    Anyone know the default?
    Incidentally...

  27. org.apache.lucene.store.RateLimiter$
    SimpleRateLimiter.pause(RateLimiter.java:112)
    hot_threads was right :)

  28. curl -XPUT localhost:9202/_cluster/settings -d'
    {
      "transient": {
        "indices.recovery.concurrent_streams": 12,
        "indices.recovery.max_bytes_per_sec": "500mb"
      }
    }
    '
    Let’s see if we can move the needle
    Also bump up concurrent_streams to handle interleaving

  29.
    OK! Progress...

  30. curl -XPUT localhost:9202/_cluster/settings -d'
    {
      "transient": {
        "indices.recovery.concurrent_streams": 24,
        "indices.recovery.max_bytes_per_sec": "2gb"
      }
    }
    '
    Turn it up to eleven

  31.
    Only one thread active, writes very erratic
    “Nodes basically bored”
    Nothing else throttled in ES; what’s it doing?

  32. GET /_nodes/hot_threads

  33. GET /_nodes/hot_threads
    sun.nio.ch.IOUtil.read()

  34.

  35. 66M/s
    Where did we see that before?
    The file copy from our lame network test!
    We weren’t testing just the network!

  36. n1 n2

  37. Disk Disk
    n1 n2

  38. Disk Disk
    n1 n2
    Network

  39. Disk Disk
    Kernel Kernel
    n1 n2
    Network

  40. Disk Disk
    Kernel Kernel
    n1 n2
    Network
    eth0 eth0

  41. Disk Disk
    Kernel Kernel
    n1 n2
    Network
    eth0 eth0
    scp

  42. Disk Disk
    Kernel Kernel
    n1 n2
    Network
    eth0 eth0
    iperf
    scp
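    iperf takes the disks and ssh out of the picture and measures the TCP path alone; a minimal run (hostnames assumed) looks like:
    # on n2: start an iperf server
    iperf -s
    # on n1: push TCP traffic to n2 for 10 seconds and report throughput
    iperf -c n2 -t 10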

  43.

  44. C F Q

  45. Completely F Q

  46. Completely Fair Q

  47. Completely Fair Queuing
    Reorders access by sector ID
    Designed to most efficiently use rotational media and for multi-user systems, unlike a db server
    * Why is this useless here? (SSD (plus RAID!))

  48. N

  49. Noop
    Removes all reordering, gets the kernel out of the IO game
    Also deadline, which reorders based on time, didn’t make a difference

  50. echo noop | sudo tee /sys/block/sdb/queue/scheduler
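    To verify the change took, read the same sysfs file back; the kernel brackets the active scheduler (example output, the exact list varies by kernel):
    cat /sys/block/sdb/queue/scheduler
    # before: noop deadline [cfq]
    # after:  [noop] deadline cfq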

  51.
    Turned on one node

  52.

  53.
    Trickling the setting through the nodes...

  54. Conclusions

  55. Defaults
    ES has awesome defaults, but they’re tuned for EC2
    Improving this with more extensive documentation
    ...big part of having a company behind ES

  56.

  57. scheduler

  58. indices.recovery.max_bytes_per_sec
    scheduler

  59. indices.recovery.max_bytes_per_sec
    indices.recovery.concurrent_streams
    scheduler
    with raid or ssd: noop, otherwise experiment
    indices.* <- still node-level here!
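    To confirm which transient overrides are currently in effect on a cluster (port assumed), the settings endpoint can be read back:
    curl -s 'localhost:9200/_cluster/settings?pretty'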

  60. Monitoring
    Doesn’t have to be perfect. Do it tonight.
    You cannot make engineering decisions without it.
    Translates “hrm, this is taking forever” to *action*
    We’re working on helping you here.

  61. Thanks
    Tim Pease
    Grant Rodgers
    Mark Imbriaco