Getting Stuck on GCP Networking

16th elasticsearch study meetup (elasticsearch勉強会) https://elasticsearch.doorkeeper.jp/events/46539

Daichi Hirata

June 27, 2016
Transcript

  1. GETTING STUCK ON GCP NETWORKING ▸ DAICHI HIRATA @daichild ▸ CyberAgent, Inc., Ad Tech Division, CA Reward ▸ 2016/6/27, 16th elasticsearch study meetup

  2. ABOUT ME ▸ DAICHI HIRATA ▸ @daichild, daichirata ▸ CyberAgent, Inc., Ad Tech Division, CA Reward ▸ Golang, Ruby ▸ ✂Secateurs (ES IndexTemplate DSL in Ruby) ▸ Style: hhkb2, dual-wielding
  3. UNSTABLE CLUSTER ▸ Pings to the master node fail roughly every 2 hours ▸ OS: CentOS 7.2 ▸ Elasticsearch: 2.3.1

    [INFO ][discovery.gce ] [elasticsearch-1] master_left [{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
    [WARN ][discovery.gce ] [elasticsearch-1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{elasticsearch-3}{JtcxuuucRXiClrl6q7qL8A}{10.2.101.5}{10.2.101.5:9300},{elasticsearch-1}{RQvtZKAJTfGmbmWETYY0fw}{10.2.101.4}{elasticsearch-1.c.cyberagent-013.internal/10.2.101.4:9300},}
    [INFO ][cluster.service ] [elasticsearch-1] removed {{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300},}, reason: zen-disco-master_failed ({elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300})
  4. UNSTABLE CLUSTER

    [DEBUG][action.admin.cluster.health] [elasticsearch-1] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{elasticsearch-2}{4TPArCtHQMKgWaLod3ZMjA}{10.2.101.5}{10.2.101.5:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-2][10.2.101.5:9300][cluster:monitor/health] disconnected]
    [INFO][discovery.gce ] [elasticsearch-1] master_left [{elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.5}{10.2.101.5:9300}], reason [transport disconnected]
    [WARN][discovery.gce ] [elasticsearch-1] master left (reason = transport disconnected), current nodes: {{elasticsearch-1}{fjLqVUoxRB6RRNCecJSAaw}{10.2.101.4}{10.2.101.4:9300},}
    [INFO][cluster.service] [elasticsearch-1] removed {{elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.5}{10.2.101.5:9300},}, reason: zen-disco-master_failed ({elasticsearch-2}{Xa2Cq98mQie1WcaXFfHraQ}{10.2.101.16}{10.2.101.16:9300})
  5. ATTEMPT 1 ▸ For now, try extending the ping timeout

  6. ZEN DISCOVERY ▸ discovery.zen.fd.ping_timeout: 60s ▸ discovery.zen.fd.ping_retries: 6

  7. ZEN DISCOVERY ▸ discovery.zen.fd.ping_timeout: 60s ▸ discovery.zen.fd.ping_retries: 6 ▸ No change

  8. ATTEMPT 2 ▸ Try turning on logging for the TRANSPORT MODULE (NETTY)

  9. TRANSPORT MODULE ▸ Raise transport-related logging to the TRACE level ▸ curl -XPUT localhost:9200/_cluster/settings -d '
    {
      "transient" : {
        "logger.transport" : "TRACE",
        "logger.org.elasticsearch.transport" : "TRACE"
      }
    }'
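
The same settings change can be sketched in Python for use from a script. This is only an illustration: the function name `build_trace_logging_request` and the default base URL are assumptions, and the request is built but not sent so that it can be inspected first.

```python
import json
import urllib.request


def build_trace_logging_request(base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that raises transport logging to TRACE.

    The two logger names come from the slide; base_url is an assumed default.
    """
    body = {
        "transient": {
            "logger.transport": "TRACE",
            "logger.org.elasticsearch.transport": "TRACE",
        }
    }
    return urllib.request.Request(
        url=base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_trace_logging_request()
    print(req.get_method(), req.full_url)
    # To apply the setting against a running node, send the request:
    # urllib.request.urlopen(req)
```

Because the settings are `transient`, they are expected to be lost on a full cluster restart, which suits a temporary debugging session.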
  10. TRANSPORT MODULE

    [2016-04-27 16:07:43,207][TRACE][transport.netty ] [elasticsearch-1] close connection exception caught on transport layer [[id: 0xa2b52d5c, /10.2.101.4:40290 => /10.2.101.5:9300]], disconnecting from relevant node
    java.io.IOException: Connection timed out
      at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
      at sun.nio.ch.IOUtil.read(IOUtil.java:192)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
      at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
      at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
      at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
      at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
      at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
      at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
      at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  11. TRANSPORT MODULE ▸ The logs suggest that the connection (retry) is failing at the network layer in the first place ▸ For now, monitor network reachability between the nodes with ping

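  12. TRANSPORT MODULE ▸ The logs suggest that the connection (retry) is failing at the network layer in the first place ▸ For now, monitor network reachability between the nodes with ping ▸ No particular problems found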

  13. ATTEMPT 3 ▸ Check the state of the TCP connections with NETSTAT

  14. NETSTAT ▸ $ netstat --tcp -t -o -n | grep 9300 | sort -k5

    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37638 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37637 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37636 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37635 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37634 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37633 ESTABLISHED keepalive (5221.58/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37632 ESTABLISHED keepalive (5172.43/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37631 ESTABLISHED keepalive (5172.43/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37630 ESTABLISHED keepalive (5188.81/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37629 ESTABLISHED keepalive (5188.82/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37628 ESTABLISHED keepalive (5221.58/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37627 ESTABLISHED keepalive (4205.77/0/0)
    tcp6 0 0 10.2.101.4:9300 10.2.101.5:37626 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42254 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42253 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42252 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42251 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42250 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42249 10.2.101.5:9300 ESTABLISHED keepalive (4107.47/0/1)
    tcp6 0 0 10.2.101.4:42248 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42247 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42246 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42245 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42244 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42243 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)
    tcp6 0 0 10.2.101.4:42242 10.2.101.5:9300 ESTABLISHED keepalive (5319.89/0/0)

    Slide labels for the two groups: connections from node A to B; connections from node B to A
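  15. NETSTAT ▸ Elasticsearch opens 13 connections in each direction between nodes ▸ On some of these connections the exchange of TCP keepalive probe packets appears to be failing ▸ That is what causes the connections between nodes to be closed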
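
The timer column in the netstat output can be picked apart with a short script. The sketch below assumes the `keepalive (a/b/c)` triple means remaining seconds, retransmissions, and keepalive probes already sent, which is the common reading of `netstat -o` output rather than something stated on the slides; the function names are hypothetical.

```python
import re

# Matches the timer column printed by `netstat -o`, e.g. "keepalive (4107.47/0/1)".
TIMER_RE = re.compile(
    r"keepalive \((?P<remaining>[\d.]+)/(?P<retrans>\d+)/(?P<probes>\d+)\)"
)


def parse_keepalive_timer(line):
    """Return (remaining_seconds, retransmissions, probes_sent), or None if absent."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    return (
        float(match.group("remaining")),
        int(match.group("retrans")),
        int(match.group("probes")),
    )


def connections_with_pending_probes(lines):
    """Keep only the lines whose keepalive timer shows at least one probe sent."""
    pending = []
    for line in lines:
        timer = parse_keepalive_timer(line)
        if timer is not None and timer[2] > 0:
            pending.append(line)
    return pending


if __name__ == "__main__":
    sample = [
        "tcp6 0 0 10.2.101.4:9300 10.2.101.5:37638 ESTABLISHED keepalive (4107.47/0/1)",
        "tcp6 0 0 10.2.101.4:9300 10.2.101.5:37633 ESTABLISHED keepalive (5221.58/0/0)",
    ]
    for line in connections_with_pending_probes(sample):
        print(line)
```

Under that reading, the lines ending in `/0/1` are the connections where a probe has gone out without being answered yet.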

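  16. TCP KEEPALIVE ▸ A mechanism in which, while a connection is idle, both ends exchange probe packets at a fixed interval to tell each other, and confirm, that the TCP connection is still alive ▸ Enabled by default in Elasticsearch ▸ net.ipv4.tcp_keepalive_time=7200 (2 hours)
    net.ipv4.tcp_keepalive_intvl=75
    net.ipv4.tcp_keepalive_probes=9 ▸ TCP keepalive probe packets should only be sent when there has been no traffic ▸ Why the exchange fails on only some connections is unclear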
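
To make the timing concrete, the sketch below works out when the first probe is sent and when an unresponsive peer would be declared dead under these kernel settings. It assumes the usual Linux behaviour (first probe after `tcp_keepalive_time` seconds of idleness, then one probe every `tcp_keepalive_intvl` seconds, giving up after `tcp_keepalive_probes` unanswered probes); the function names are hypothetical.

```python
def first_probe_after(keepalive_time):
    """Seconds of idleness before the kernel sends the first keepalive probe."""
    return keepalive_time


def declared_dead_after(keepalive_time, keepalive_intvl, keepalive_probes):
    """Seconds of idleness before an unresponsive peer is declared dead.

    Assumes every probe goes unanswered.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


if __name__ == "__main__":
    # Kernel defaults quoted on the slide.
    total = declared_dead_after(7200, 75, 9)
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    print(f"first probe after {first_probe_after(7200)} s")
    print(f"declared dead after {total} s ({hours}h {minutes}m {seconds}s)")
```

With the defaults this comes to 7875 seconds, a little over two hours, which is in the same range as the roughly two-hour failure interval observed on the cluster.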
  17. TCP KEEPALIVE ▸ Changed the TCP keepalive settings to pin down the cause ▸ Waiting 2 hours is painful, so for now send probe packets after 60 seconds ▸ $ sysctl -w net.ipv4.tcp_keepalive_time=60
  18. TCP KEEPALIVE ▸ The Elasticsearch cluster now runs stably! ▸ A hunch that the connections are being cut as part of how GCP networking is designed...

  19. TCP KEEPALIVE ▸ The Elasticsearch cluster now runs stably! ▸ A hunch that the connections are being cut as part of how GCP networking is designed... ▸ It was plainly written in the documentation

  20. GCP NETWORKS ▸ Even traffic between instances never goes over L2; it always goes over L3 through a gateway ▸ The INBOUND traffic allowed to each instance is managed by a firewall ▸ This firewall drops inactive TCP connections after 10 minutes
  21. GCP NETWORKS ▸ The following settings are recommended if you want to keep connections alive ▸
    sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
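
For an application that opens its own sockets, the same three values can also be set per connection instead of system-wide. The following is a minimal sketch for a Python client, not how Elasticsearch itself is configured; `enable_keepalive` is a hypothetical helper, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are assumed to be available, which is the case on Linux but not on every platform.

```python
import socket


def enable_keepalive(sock, idle=60, interval=60, probes=5):
    """Turn on TCP keepalive for one socket, mirroring the sysctl values above."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning knobs are platform specific, so guard them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Either way the goal is the same: send the first probe well inside the firewall's 10-minute idle window so that the connection keeps counting as active.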
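  22. SUMMARY ▸ The GCP firewall drops inactive connections after 10 minutes, so when building an Elasticsearch cluster you need to change the net.ipv4.tcp_keepalive_time setting ▸ Elasticsearch seems to create 13 connections between nodes and then pool them, so some connections may sit unused? ▸ Those connections are cut by the firewall, and at the moment TCP keepalive detects this, Elasticsearch concludes that the connection between the nodes has been lost and the node is dropped from the cluster.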
  23. THE END ▸ Thank you very much.