Slide 1

Life After EC2
A Love Story
@drewr

Slide 2


Slide 3

EC2
40 (data) nodes
1 index, 500 shards
12.5T (primaries), 1 replica
1.6B docs (Jul 2013)

Slide 4

Carpathia
8 (data) nodes
1 index, 128 shards
1 replica
14 x 600G SSD
32 cores, 64G RAM

Slide 5

We are upgrading our new search cluster from 0.90.1 to 0.90.3. The shard sizes are ~100GB on average, and it is taking an obscenely long time to recover shards on the nodes we have restarted. The restart took place roughly 45 minutes ago, and not a single shard has fully recovered yet. The load on the machines is minimal, as is disk IO and network IO. We've bumped the node_concurrent_recoveries to 6. But how long should this take?
#1004, Tim Pease, 8 Aug 2013

Slide 6

Jeez! It has been five hours now and only 5 of the 128 shards have recovered. At this rate it will take a full week to get the cluster into a green state. ...

Slide 7

First things first
Notes: Any anomalies in the dashboards? GitHub has *excellent* monitoring...

Slide 8

GET /_nodes/hot_threads
Notes: Really nice for inspecting where ES might be bound.
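A minimal way to call it from the shell (assuming ES is listening on localhost:9200; adjust host and port for your cluster):

curl -s 'localhost:9200/_nodes/hot_threads'   # per-node hottest threads, with stack traces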

Slide 9

dd if=/dev/zero of=/tmp/file...
scp /tmp/file host2:/tmp

Slide 10

dd if=/dev/zero of=/tmp/file...
scp /tmp/file host2:/tmp
...66M/s
Notes: Check the network... Hm, no way 10gigE is that slow. No rush, let’s sleep on it.
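The commands on the slide are elided; a rough sketch of that kind of copy test (the file size and the host2 name are illustrative, not from the talk):

# make a ~4GB file of zeros, then copy it and watch scp's reported throughput
dd if=/dev/zero of=/tmp/file bs=1M count=4096
scp /tmp/file host2:/tmp

As the later slides point out, this exercises disk and kernel on both ends, not just the network.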

Slide 11

curl -s http://git.io/KlTPxw | sh

Slide 12

curl -s http://git.io/KlTPxw | sh
--- /tmp/1 2013-08-08 21:34:59.352499371 -0700
+++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
@@ -66,13 +66,13 @@
-code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
+code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7

Slide 13

curl -s http://git.io/KlTPxw | sh
--- /tmp/1 2013-08-08 21:34:59.352499371 -0700
+++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
@@ -66,13 +66,13 @@
-code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
+code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7
...20M/s
Notes: OK, I think I have enough evidence here...
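The two snapshots in the diff are about 30 seconds apart, and the replica shard on codesearch-storage7 grows by 217412218715 - 216782024539 = 630194176 bytes in that window, which is where the ~20M/s figure comes from:

# back-of-the-envelope recovery rate from the diff above
echo $(( (217412218715 - 216782024539) / 30 / 1048576 )) MB/s   # prints "20 MB/s"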

Slide 14

Allocation

Slide 15


Slide 16

[diagram: two primary shards, P P]

Slide 17

[diagram: primaries P, P and replicas R, R]

Slide 18

[diagram: primaries P, P and replicas R, R]
cluster.routing.allocation.concurrent_recoveries
Notes: Per node! Why didn’t this help? Probably not blocked on deciding where shards go.
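For reference, a sketch of how that kind of per-node recovery cap is set dynamically; the issue quoted earlier had bumped node_concurrent_recoveries to 6 (host, port, and value here are illustrative):

curl -XPUT localhost:9200/_cluster/settings -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 6
  }
}'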

Slide 19

Recovery

Slide 20

[diagram: two primary shards, P P]

Slide 21

[diagram: primaries P, P with one replica R starting to recover]

Slide 22

[diagram: primaries P, P with replicas R, R recovering]

Slide 23

[diagram: a primary P streaming data to a replica R]

Slide 24

[diagram: a primary P streaming data to a replica R]

Slide 25

[diagram: a primary P streaming data to a replica R]
indices.recovery.max_bytes_per_sec

Slide 26

[diagram: a primary P streaming data to a replica R]
indices.recovery.max_bytes_per_sec
20M/s
Notes: Chunks (default 512k) are read and written, throttled to max_bytes (the rate limiter computes pauses in ns). Setting which controls that... Anyone know the default? Incidentally...
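That default explains the pace: at 20 MB/s, a single ~100 GB shard (the average size quoted in the issue) takes on the order of 100000 MB / 20 MB/s = 5000 s, roughly 83 minutes, before any other overhead or contention.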

Slide 27

org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:112)
Notes: hot_threads was right :)
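The pause is exactly how the throttle shows up in a profile: SimpleRateLimiter sleeps whenever the copy gets ahead of the configured rate, and at 20 MB/s each 512 KB chunk corresponds to roughly 512 KB / 20 MB/s, on the order of 25 ms, most of which the recovery threads spend parked in pause().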

Slide 28

curl -XPUT localhost:9202/_cluster/settings -d'
{
  "transient": {
    "indices.recovery.concurrent_streams": 12,
    "indices.recovery.max_bytes_per_sec": "500mb"
  }
}'
Notes: Let’s see if we can move the needle. Also bump up concurrent_streams to handle interleaving.
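To confirm the transient settings actually took, they can be read back (a sketch, same host and port as above):

curl -s localhost:9202/_cluster/settings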

Slide 29

Notes: OK! Progress...

Slide 30

curl -XPUT localhost:9202/_cluster/settings -d'
{
  "transient": {
    "indices.recovery.concurrent_streams": 24,
    "indices.recovery.max_bytes_per_sec": "2gb"
  }
}'
Notes: Turn it up to eleven.

Slide 31

Notes: Only one thread active, writes very erratic. “Nodes basically bored.” Nothing else throttled in ES; what’s it doing?

Slide 32

GET /_nodes/hot_threads

Slide 33

GET /_nodes/hot_threads
sun.nio.ch.IOUtil.read()

Slide 34


Slide 35

66M/s
Notes: Where did we see that before? The file copy from our lame network test! We weren’t testing just the network!

Slide 36

[diagram: nodes n1 and n2]

Slide 37

[diagram: nodes n1 and n2, each with a Disk]

Slide 38

[diagram: nodes n1 and n2, each with a Disk, connected by the Network]

Slide 39

[diagram: nodes n1 and n2, each with Disk and Kernel, connected by the Network]

Slide 40

[diagram: nodes n1 and n2, each with Disk, Kernel, and eth0, connected by the Network]

Slide 41

[diagram: the scp test path crosses Disk, Kernel, and eth0 on both n1 and n2, plus the Network]

Slide 42

[diagram: scp crosses Disk, Kernel, eth0, and the Network; iperf exercises only the Network path between the eth0s]
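A network-only baseline looks something like this (a sketch; hostnames are illustrative):

# on n2
iperf -s
# on n1
iperf -c n2

If iperf reports something near line rate while scp tops out around 66 MB/s, the wire is not the bottleneck.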

Slide 43


Slide 44

CFQ

Slide 45

CFQ: Completely

Slide 46

CFQ: Completely Fair

Slide 47

CFQ: Completely Fair Queuing
Notes: Reorders access by sector ID. Designed to use rotational media efficiently and for multi-user systems, unlike a DB server. Why is this useless here? (SSD, plus RAID!)

Slide 48


Slide 49

Noop
Notes: Removes all reordering, gets the kernel out of the IO game. Also tried deadline, which reorders based on time; it didn’t make a difference.

Slide 50

echo noop | sudo tee /sys/block/sdb/queue/scheduler
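The bracketed entry in that sysfs file shows the active scheduler, and with several data disks the echo is repeated per device (sdb above and the glob below are illustrative; match them to your own data disks):

cat /sys/block/sdb/queue/scheduler          # e.g. "noop deadline [cfq]"
for d in /sys/block/sd?/queue/scheduler; do
  echo noop | sudo tee "$d"
done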

Slide 51

Notes: Turned on one node.

Slide 52


Slide 53

Notes: Trickling the setting through the nodes...

Slide 54

Conclusions

Slide 55

Defaults
Notes: ES has awesome defaults, but they’re tuned for EC2. Improving this with more extensive documentation... a big part of having a company behind ES.

Slide 56


Slide 57

scheduler

Slide 58

indices.recovery.max_bytes_per_sec
scheduler

Slide 59

indices.recovery.max_bytes_per_sec
indices.recovery.concurrent_streams
scheduler
Notes: With RAID or SSD: noop, otherwise experiment. indices.* <- still node-level here!
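A sketch of the equivalent static config in elasticsearch.yml (the values are the ones tried in this talk, not recommendations; tune them for your own hardware):

indices.recovery.max_bytes_per_sec: 500mb
indices.recovery.concurrent_streams: 12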

Slide 60

Monitoring
Notes: Doesn’t have to be perfect. Do it tonight. You cannot make engineering decisions without it. Translates “hrm, this is taking forever” to *action*. We’re working on helping you here.

Slide 61

Thanks
Tim Pease
Grant Rodgers
Mark Imbriaco