Life After EC2

A journey from slow recovery to realized potential.


Elasticsearch Inc

October 10, 2013

Transcript

  1. Life After EC2: A Love Story. @drewr. Friday, October 11, 2013
  2. (image only)

  3. EC2: 40 (data) nodes, 1 index, 500 shards, 12.5T (primaries), 1 replica, 1.6B docs (Jul 2013)
  4. Carpathia: 8 (data) nodes, 1 index, 128 shards, 1 replica, 14 x 600G SSD, 32 cores, 64G RAM
  5. "We are upgrading our new search cluster from 0.90.1 to 0.90.3. The shard sizes are ~100GB on average, and it is taking an obscenely long time to recover shards on the nodes we have restarted. The restart took place roughly 45 minutes ago, and not a single shard has fully recovered yet. The load on the machines is minimal, as is disk IO and network IO. We've bumped node_concurrent_recoveries to 6. But how long should this take?" (Tim Pease, issue #1004, 8 Aug 2013)
  6. "Jeez! It has been five hours now and only 5 of the 128 shards have recovered. At this rate it will take a full week to get the cluster into a green state. ..."
  7. First things first. Any anomalies in the dashboards? GitHub has *excellent* monitoring...
  8. GET /_nodes/hot_threads. Really nice for inspecting where ES might be bound.
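For reference, the same call over curl; the host and port are placeholders for wherever your cluster listens, and `threads=3` just trims the per-node output:

```shell
# Ask every node to sample and report its hottest threads.
curl -s 'localhost:9200/_nodes/hot_threads?threads=3'
```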
  9. dd if=/dev/zero of=/tmp/file... scp /tmp/file host2:/tmp. Check the network... Hm, no way 10gigE is that slow. No rush, let's sleep on it.
  10. dd if=/dev/zero of=/tmp/file... scp /tmp/file host2:/tmp: ...66M/s
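A single-host sketch of that test (sizes and paths here are illustrative, and GNU dd syntax is assumed); note it only measures the local write path, whereas the slide's scp also dragged in the kernel and network, which turns out to matter later:

```shell
# Write a file with dd and derive a rough MB/s figure from wall-clock time.
SIZE_MB=64
start=$(date +%s)
dd if=/dev/zero of=/tmp/ddtest bs=1M count=$SIZE_MB 2>/dev/null
end=$(date +%s)
elapsed=$((end - start))
[ "$elapsed" -eq 0 ] && elapsed=1      # guard against sub-second runs
rate=$((SIZE_MB / elapsed))
echo "${rate} MB/s"
rm -f /tmp/ddtest
```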
  11. curl -s http://git.io/KlTPxw | sh. OK, I think I have enough evidence here...
  12. curl -s http://git.io/KlTPxw | sh

    --- /tmp/1 2013-08-08 21:34:59.352499371 -0700
    +++ /tmp/2 2013-08-08 21:35:29.404911659 -0700
    @@ -66,13 +66,13 @@
    -code-search-1 46 r 216782024539 172.16.12.13 codesearch-storage7
    +code-search-1 46 r 217412218715 172.16.12.13 codesearch-storage7
  13. (same output, with the computed rate revealed: ...20M/s)
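The diff is two snapshots of a shard's recovered byte count taken ~30 seconds apart (the timestamps come from the diff header), so the recovery rate falls straight out:

```shell
# Byte delta between the two samples, divided by the elapsed time.
awk 'BEGIN {
  delta = 217412218715 - 216782024539   # bytes recovered between samples
  secs  = 30.05                         # 21:34:59.35 -> 21:35:29.40
  printf "%.1f MB/s\n", delta / secs / (1024 * 1024)
}'
```

which is where the ...20M/s figure on the slide comes from.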
  14. Allocation

  15. Per node! Why didn't this help? Probably not blocked on deciding where shards go.
  16. P P
  17. P P R R
  18. P P R R cluster.routing.allocation.node_concurrent_recoveries
  19. Recovery

  20. P P

  21. P R P

  22. P R R P

  23. P R. Chunks (default 512k) are read & written, throttled to a max bytes per second. The setting which controls that... Anyone know the default? Incidentally...
  24. P R
  25. P R indices.recovery.max_bytes_per_sec
  26. P R indices.recovery.max_bytes_per_sec 20M/s
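Plugging the earlier numbers into that 20M/s default makes slide 6's pain unsurprising. A back-of-envelope check, assuming ~100GB shards and that the 128 shards recover roughly serially:

```shell
# Seconds per shard at the default throttle, and total days for the index.
awk 'BEGIN {
  per_shard = 100 * 1024 / 20          # seconds per ~100GB shard at 20 MB/s
  days      = per_shard * 128 / 86400  # if the 128 shards recover serially
  printf "%d s/shard, ~%.1f days total\n", per_shard, days
}'
```

which lands right around the "full week" lamented on slide 6.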
  27. org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:112). hot_threads was right :)

  28. curl -XPUT localhost:9202/_cluster/settings -d '{"transient": {"indices.recovery.concurrent_streams": 12, "indices.recovery.max_bytes_per_sec": "500mb"}}'

    Let's see if we can move the needle. Also bump up concurrent_streams to handle interleaving.
  29. OK! Progress...

  30. curl -XPUT localhost:9202/_cluster/settings -d '{"transient": {"indices.recovery.concurrent_streams": 24, "indices.recovery.max_bytes_per_sec": "2gb"}}'

    Turn it up to eleven.
  31. Only one thread active, writes very erratic. "Nodes basically bored." Nothing else throttled in ES; what's it doing?
  32. GET /_nodes/hot_threads

  33. GET /_nodes/hot_threads: sun.nio.ch.IOUtil.read()

  34. Where did we see that before? The file copy from our lame network test! We weren't testing just the network!
  35. 66M/s
  36. n1 n2

  37. Disk Disk n1 n2

  38. Disk Disk n1 n2 Network

  39. Disk Disk Kernel Kernel n1 n2 Network
  40. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0
  41. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0 scp
  42. Disk Disk Kernel Kernel n1 n2 Network eth0 eth0 iperf scp
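The diagram's point: scp crosses disk, kernel, and network on both ends, while iperf touches only the NIC/kernel path, so it isolates the network. A sketch of the usual two-host invocation (hostnames are placeholders):

```shell
# On the receiving host:
#   iperf -s
# On the sending host, a 10-second TCP throughput test:
iperf -c host2 -t 10
```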
  43. (image only)

  44. C F Q. Reorders access by sector ID. Designed to make the most efficient use of rotational media, and for multi-user systems, unlike a db server. Why is this useless here? (SSD, plus RAID!)
  45. Completely F Q
  46. Completely Fair Q
  47. Completely Fair Queuing
  48. N. Removes all reordering, gets the kernel out of the IO game. Also tried deadline, which reorders based on time; it didn't make a difference.
  49. Noop
  50. echo noop | sudo tee /sys/block/sdb/queue/scheduler
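A read-only companion sketch (Linux-only, since it walks sysfs): check what every device is using before flipping anything. The kernel shows the active scheduler in brackets, e.g. "sdb: noop deadline [cfq]".

```shell
# Print each block device and its scheduler line.
for f in /sys/block/*/queue/scheduler; do
  dev=${f#/sys/block/}; dev=${dev%%/*}   # /sys/block/sdb/queue/scheduler -> sdb
  printf '%s: %s\n' "$dev" "$(cat "$f")"
done
```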

  51. Turned on one node

  52. Turned on one node

  53. Trickling the setting through the nodes...

  54. Conclusions

  55. Defaults. ES has awesome defaults, but they're tuned for EC2. Improving this with more extensive documentation... a big part of having a company behind ES.
  56. With RAID or SSD: noop; otherwise experiment. (The indices.* settings are still node-level here!)
  57. scheduler
  58. indices.recovery.max_bytes_per_sec scheduler
  59. indices.recovery.max_bytes_per_sec indices.recovery.concurrent_streams scheduler
  60. Monitoring. Doesn't have to be perfect. Do it tonight. You cannot make engineering decisions without it. Translates "hrm, this is taking forever" into *action*. We're working on helping you here.
  61. Thanks: Tim Pease, Grant Rodgers, Mark Imbriaco