Life After EC2

A journey from slow recovery to realized potential.


Elasticsearch Inc

October 10, 2013

Transcript

  1. Life After EC2: A Love Story. @drewr. Friday, October 11, 2013
  2. (image only)

  3. EC2: 40 (data) nodes, 1 index, 500 shards, 12.5T (primaries), 1 replica, 1.6B docs (Jul 2013)
  4. Carpathia: 8 (data) nodes, 1 index, 128 shards, 1 replica, 14 x 600G SSD, 32 cores, 64G RAM
  5. "We are upgrading our new search cluster from 0.90.1 to 0.90.3. The shard sizes are ~100GB on average, and it is taking an obscenely long time to recover shards on the nodes we have restarted. The restart took place roughly 45 minutes ago, and not a single shard has fully recovered yet. The load on the machines is minimal, as is disk IO and network IO. We've bumped node_concurrent_recoveries to 6. But how long should this take?" (Tim Pease, issue #1004, 8 Aug 2013)
  6. "Jeez! It has been five hours now and only 5 of the 128 shards have recovered. At this rate it will take a full week to get the cluster into a green state. ..."
  7. First things first. Any anomalies in the dashboards? GitHub has *excellent* monitoring...
  8. GET /_nodes/hot_threads. Really nice for inspecting where ES might be bound.
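For reference, the same call over curl; the host and port are placeholders for wherever your cluster listens, and `threads=3` just trims the per-node output:

```shell
# Ask every node to sample and report its hottest threads.
curl -s 'localhost:9200/_nodes/hot_threads?threads=3'
```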
  9. dd if=/dev/zero of=/tmp/file... scp /tmp/file host2:/tmp. Check the network... Hm, no way 10gigE is that slow. No rush, let's sleep on it.
  10. dd if=/dev/zero of=/tmp/file... scp /tmp/file host2:/tmp: ...66M/s
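A single-host sketch of that test (sizes and paths here are illustrative, and GNU dd syntax is assumed); note it only measures the local write path, whereas the slide's scp also dragged in the kernel and network, which turns out to matter later:

```shell
# Write a file with dd and derive a rough MB/s figure from wall-clock time.
SIZE_MB=64
start=$(date +%s)
dd if=/dev/zero of=/tmp/ddtest bs=1M count=$SIZE_MB 2>/dev/null
end=$(date +%s)
elapsed=$((end - start))
[ "$elapsed" -eq 0 ] && elapsed=1      # guard against sub-second runs
rate=$((SIZE_MB / elapsed))
echo "${rate} MB/s"
rm -f /tmp/ddtest
```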
  11. curl -s http://git.io/KlTPxw | sh. OK, I think I have enough evidence here...
  12. curl -s http://git.io/KlTPxw | sh

    --- /tmp/1 2013-08-08 21:34:59.352499371 -0700
    +++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
    @@ -66,13 +66,13 @@
    -code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
    +code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7
  13. (same output, with the computed rate revealed: ...20M/s)
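The diff is two snapshots of a shard's recovered byte count taken ~30 seconds apart (the timestamps come from the diff header), so the recovery rate falls straight out:

```shell
# Byte delta between the two samples, divided by the elapsed time.
awk 'BEGIN {
  delta = 217412218715 - 216782024539   # bytes recovered between samples
  secs  = 30.05                         # 21:34:59.35 -> 21:35:29.40
  printf "%.1f MB/s\n", delta / secs / (1024 * 1024)
}'
```

which is where the ...20M/s figure on the slide comes from.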
  14. Allocation

  15. Per node! Why didn't this help? Probably not blocked on deciding where shards go.
  16. P P
  17. P P R R
  18. P P R R cluster.routing.allocation.node_concurrent_recoveries
  19. Recovery

  20. P P

  21. P R P

  22. P R R P

  23. P R. Chunks (default 512k) are read & written, throttled to a max bytes per second. The setting which controls that... Anyone know the default? Incidentally...
  24. P R
  25. P R indices.recovery.max_bytes_per_sec
  26. P R indices.recovery.max_bytes_per_sec 20M/s
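Plugging the earlier numbers into that 20M/s default makes slide 6's pain unsurprising. A back-of-envelope check, assuming ~100GB shards and that the 128 shards recover roughly serially:

```shell
# Seconds per shard at the default throttle, and total days for the index.
awk 'BEGIN {
  per_shard = 100 * 1024 / 20          # seconds per ~100GB shard at 20 MB/s
  days      = per_shard * 128 / 86400  # if the 128 shards recover serially
  printf "%d s/shard, ~%.1f days total\n", per_shard, days
}'
```

which lands right around the "full week" lamented on slide 6.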
  27. org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:112). hot_threads was right :)

  28. curl -XPUT localhost:9202/_cluster/settings -d '{"transient": {"indices.recovery.concurrent_streams": 12, "indices.recovery.max_bytes_per_sec": "500mb"}}'

    Let's see if we can move the needle. Also bump up concurrent_streams to handle interleaving.
  29. OK! Progress...

  30. curl -XPUT localhost:9202/_cluster/settings -d '{"transient": {"indices.recovery.concurrent_streams": 24, "indices.recovery.max_bytes_per_sec": "2gb"}}'

    Turn it up to eleven.
  31. Only one thread active, writes very erratic. "Nodes basically bored." Nothing else throttled in ES; what's it doing?
  32. GET /_nodes/hot_threads

  33. GET /_nodes/hot_threads: sun.nio.ch.IOUtil.read()

  34. Where did we see that before? The file copy from our lame network test! We weren't testing just the network!
  35. 66M/s
  36. n1 n2

  37. Disk Disk n1 n2

  38. Disk Disk n1 n2 Network

  39. Disk Disk Kernel Kernel n1 n2 Network
  40. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0
  41. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0 scp
  42. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0 iperf scp
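The diagram's point: scp crosses disk, kernel, and network on both ends, while iperf touches only the NIC/kernel path, so it isolates the network. A sketch of the usual two-host invocation (hostnames are placeholders):

```shell
# On the receiving host:
#   iperf -s
# On the sending host, a 10-second TCP throughput test:
iperf -c host2 -t 10
```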
  43. (image only)

  44. C F Q. Reorders access by sector ID. Designed to make the most efficient use of rotational media, and for multi-user systems, unlike a db server. Why is this useless here? (SSD, plus RAID!)
  45. Completely F Q
  46. Completely Fair Q
  47. Completely Fair Queuing
  48. N. Removes all reordering, gets the kernel out of the IO game. Also tried deadline, which reorders based on time; it didn't make a difference.
  49. Noop
  50. echo noop | sudo tee /sys/block/sdb/queue/scheduler
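A read-only companion sketch (Linux-only, since it walks sysfs): check what every device is using before flipping anything. The kernel shows the active scheduler in brackets, e.g. "sdb: noop deadline [cfq]".

```shell
# Print each block device and its scheduler line.
for f in /sys/block/*/queue/scheduler; do
  dev=${f#/sys/block/}; dev=${dev%%/*}   # /sys/block/sdb/queue/scheduler -> sdb
  printf '%s: %s\n' "$dev" "$(cat "$f")"
done
```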

  51. Turned on one node

  52. Turned on one node

  53. Trickling the setting through the nodes...

  54. Conclusions

  55. Defaults. ES has awesome defaults, but they're tuned for EC2. Improving this with more extensive documentation... a big part of having a company behind ES.
  56. With RAID or SSD: noop; otherwise experiment. (The indices.* settings are still node-level here!)
  57. scheduler
  58. indices.recovery.max_bytes_per_sec scheduler
  59. indices.recovery.max_bytes_per_sec indices.recovery.concurrent_streams scheduler
  60. Monitoring. Doesn't have to be perfect. Do it tonight. You cannot make engineering decisions without it. Translates "hrm, this is taking forever" into *action*. We're working on helping you here.
  61. Thanks: Tim Pease, Grant Rodgers, Mark Imbriaco