Slide 1

Slide 1 text

A Story of Getting Tripped Up by GCP Networking
DAICHI HIRATA @daichild
CyberAgent, Inc., Ad Tech Division, CA Reward
2016/6/27, 16th elasticsearch study meetup

Slide 2

Slide 2 text

Self-introduction
DAICHI HIRATA
▸ @daichild, daichirata
▸ CyberAgent, Inc., Ad Tech Division, CA Reward
▸ Golang, Ruby
▸ ✂ Secateurs (ES IndexTemplate DSL in Ruby)
▸ Style: dual-wielding two HHKB2 keyboards

Slide 3

Slide 3 text

An unstable cluster
▸ Pings to the master node fail at intervals of roughly 2 hours
▸ OS: CentOS 7.2
▸ Elasticsearch: 2.3.1

[INFO ][discovery.gce ] [elasticsearch-1] master_left [{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[WARN ][discovery.gce ] [elasticsearch-1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{elasticsearch-3}{JtcxuuucRXiClrl6q7qL8A}{10.2.101.5}{10.2.101.5:9300},{elasticsearch-1}{RQvtZKAJTfGmbmWETYY0fw}{10.2.101.4}{elasticsearch-1.c.cyberagent-013.internal/10.2.101.4:9300},}
[INFO ][cluster.service ] [elasticsearch-1] removed {{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300},}, reason: zen-disco-master_failed ({elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300})

Slide 4

Slide 4 text

An unstable cluster

[DEBUG][action.admin.cluster.health] [elasticsearch-1] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-2][10.2.101.5:9300][cluster:monitor/health] disconnected]
[INFO][discovery.gce ] [elasticsearch-1] master_left [{elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.5}{10.2.101.5:9300}], reason [transport disconnected]
[WARN][discovery.gce ] [elasticsearch-1] master left (reason = transport disconnected), current nodes: {{elasticsearch-1}{fjLqVUoxRB6RRNCecJSAaw}{10.2.101.4}{10.2.101.4:9300},}
[INFO][cluster.service] [elasticsearch-1] removed {{elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.5}{10.2.101.5:9300},}, reason: zen-disco-master_failed ({elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.16}{10.2.101.16:9300})

Slide 5

Slide 5 text

Attempt 1
As a first step, try raising the PING timeout

Slide 6

Slide 6 text

ZEN DISCOVERY
▸ discovery.zen.fd.ping_timeout: 60s
▸ discovery.zen.fd.ping_retries: 6

Slide 7

Slide 7 text

ZEN DISCOVERY
▸ discovery.zen.fd.ping_timeout: 60s
▸ discovery.zen.fd.ping_retries: 6
▸ No change

Slide 8

Slide 8 text

Attempt 2
Turn on logging for the TRANSPORT MODULE (NETTY)

Slide 9

Slide 9 text

TRANSPORT MODULE
▸ Raise logging around the transport layer to TRACE level
▸ curl -XPUT localhost:9200/_cluster/settings -d '
  {
    "transient" : {
      "logger.transport" : "TRACE",
      "logger.org.elasticsearch.transport" : "TRACE"
    }
  }'

Slide 10

Slide 10 text

TRANSPORT MODULE

[2016-04-27 16:07:43,207][TRACE][transport.netty ] [elasticsearch-1] close connection exception caught on transport layer [[id: 0xa2b52d5c, /10.2.101.4:40290 => /10.2.101.5:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Slide 11

Slide 11 text

TRANSPORT MODULE
▸ The logs suggest that the connection (retry) is failing at the network layer in the first place
▸ As a first check, monitor network reachability between the nodes with ping

Slide 12

Slide 12 text

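TRANSPORT MODULE
▸ The logs suggest that the connection (retry) is failing at the network layer in the first place
▸ As a first check, monitor network reachability between the nodes with ping
▸ No particular problems found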

Slide 13

Slide 13 text

Attempt 3
Check the state of the TCP connections with NETSTAT

Slide 14

Slide 14 text

NETSTAT

$ netstat --tcp -t -o -n | grep 9300 | sort -k5
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37638 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37637 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37636 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37635 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37634 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37633 ESTABLISHED keepalive (5221.58/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37632 ESTABLISHED keepalive (5172.43/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37631 ESTABLISHED keepalive (5172.43/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37630 ESTABLISHED keepalive (5188.81/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37629 ESTABLISHED keepalive (5188.82/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37628 ESTABLISHED keepalive (5221.58/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37627 ESTABLISHED keepalive (4205.77/0/0)
tcp6 0 0 10.2.101.4:9300 10.2.101.5:37626 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42254 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42253 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42252 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42251 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42250 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42249 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
tcp6 0 0 10.2.101.4:42248 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42247 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42246 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42245 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42244 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42243 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
tcp6 0 0 10.2.101.4:42242 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)

Annotations on the slide: connections from node A to B; connections from node B to A

Slide 15

Slide 15 text

NETSTAT
▸ Elasticsearch nodes open 13 connections to each other
▸ On some of those connections, the exchange of TCP Keepalive probe packets appears to be failing
▸ That causes the connections between the nodes to be closed
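
The observation above can be checked mechanically. The following is a minimal sketch, not part of the original slides: `count_unanswered_probes` is a hypothetical helper name, and it assumes the net-tools `netstat -o` layout shown on the previous slide, where the timer column reads `keepalive (remaining/retransmits/probes)` and the third number is commonly read as the count of keepalive probes sent without an answer.

```shell
# count_unanswered_probes (hypothetical helper): reads `netstat -t -o -n`
# lines on stdin and prints how many connections have a keepalive timer
# whose third counter (unanswered probes) is greater than zero.
count_unanswered_probes() {
  awk '$7 == "keepalive" {
         timer = $8                 # e.g. "(4107.47/0/1)"
         gsub(/[()]/, "", timer)    # strip the parentheses
         split(timer, t, "/")       # t[1]=remaining, t[2]=retransmits, t[3]=probes
         if (t[3] + 0 > 0) n++
       }
       END { print n + 0 }'
}
```

On a node it could be fed from the same pipeline used on the slide, for example `netstat --tcp -o -n | grep 9300 | count_unanswered_probes`.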

Slide 16

Slide 16 text

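TCP KEEPALIVE
▸ A feature in which both ends exchange probe packets at fixed intervals while a connection is idle, so that each side can signal and confirm that the TCP connection is still alive
▸ Enabled in Elasticsearch's default settings
▸ net.ipv4.tcp_keepalive_time=7200 (2 hours)
  net.ipv4.tcp_keepalive_intvl=75
  net.ipv4.tcp_keepalive_probes=9
▸ In the first place, TCP Keepalive probe packets should only be sent when a connection has been idle
▸ It is unclear why the exchange fails on only some of the connections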
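
As a rough worked example of how these three values combine (a sketch under the usual reading of the Linux settings, not something stated on the slides): the first probe is sent after `tcp_keepalive_time` seconds of silence, and the connection is given up after `tcp_keepalive_probes` probes spaced `tcp_keepalive_intvl` seconds apart go unanswered. The function name `keepalive_detect_seconds` is hypothetical.

```shell
# keepalive_detect_seconds TIME INTVL PROBES (hypothetical helper)
# Prints the approximate idle time, in seconds, after which a peer that
# never answers is declared dead: TIME + INTVL * PROBES.
keepalive_detect_seconds() {
  echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 7200 75 9   # values from the slide; prints 7875
```

With the values on this slide that is 7875 seconds, a little over 2 hours 11 minutes, which may be why the failures showed up at roughly two-hour intervals.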

Slide 17

Slide 17 text

TCP KEEPALIVE
▸ To pin down the cause, change the TCP Keepalive settings
▸ Waiting 2 hours is painful, so for now send the probe packet after 60 seconds
▸ $ sysctl -w net.ipv4.tcp_keepalive_time=60

Slide 18

Slide 18 text

TCP KEEPALIVE
▸ The Elasticsearch cluster now runs stably!
▸ A hunch that the connections are being cut as part of how GCP networking is specified...

Slide 19

Slide 19 text

TCP KEEPALIVE
▸ The Elasticsearch cluster now runs stably!
▸ A hunch that the connections are being cut as part of how GCP networking is specified...
▸ It was written plainly in the documentation

Slide 20

Slide 20 text

GCP NETWORKS
▸ Even traffic between instances does not go over L2; it always goes over L3 via the gateway
▸ The INBOUND traffic allowed to each instance is managed by a firewall
▸ This firewall drops inactive TCP connections after 10 minutes

Slide 21

Slide 21 text

GCP NETWORKS
▸ To keep connections alive, the following settings are recommended
▸ sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
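
A small sanity check for a node could look like the sketch below. It is an illustration only: `keepalive_beats_firewall` is a hypothetical helper, and the 600-second default is just the 10-minute idle timeout quoted in these slides, so adjust it if your environment differs.

```shell
# keepalive_beats_firewall KEEPALIVE_TIME [IDLE_TIMEOUT] (hypothetical helper)
# Succeeds (exit 0) when the first keepalive probe would be sent before the
# firewall's idle timeout. IDLE_TIMEOUT defaults to 600 seconds.
keepalive_beats_firewall() {
  [ "$1" -lt "${2:-600}" ]
}

# Check the running kernel's value when the Linux procfs entry is readable.
if [ -r /proc/sys/net/ipv4/tcp_keepalive_time ]; then
  current=$(cat /proc/sys/net/ipv4/tcp_keepalive_time)
  if keepalive_beats_firewall "$current"; then
    echo "ok: first probe after ${current}s of idle time"
  else
    echo "warning: ${current}s of idle time exceeds the firewall timeout"
  fi
fi
```

Note that values set with `sysctl -w` do not survive a reboot; on most Linux distributions they are usually made persistent with a file under `/etc/sysctl.d/`.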

Slide 22

Slide 22 text

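Summary
▸ The GCP firewall drops inactive connections after 10 minutes, so when building an Elasticsearch cluster you need to change the net.ipv4.tcp_keepalive_time setting
▸ Elasticsearch opens 13 connections between nodes and then pools them, so perhaps some of those connections go unused?
▸ Such a connection is dropped by the firewall, and at the moment TCP Keepalive detects this, Elasticsearch concludes that the connection between the nodes has been lost and the node is disconnected from the cluster.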

Slide 23

Slide 23 text

The End
Thank you very much.