Kubernetes Chaos Engineering: Lessons Learned in Networking

When you deploy an application in Kubernetes, your code ends up running on one or more worker nodes. A node may be a physical machine or a VM, such as an AWS EC2 or Google Compute Engine instance, and having several of them means you can run and scale your application across instances efficiently. When a request comes in, the cluster routes the traffic to one of the nodes using a network proxy. But what happens when the network proxy crashes? Does the cluster still work? Can Kubernetes recover from the failure?
In this talk, you'll learn how the traffic is distributed within a Kubernetes cluster and what happens when the network proxy is misbehaving.

Daniele Polencic

March 28, 2019

Transcript

  1. [Diagram] Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4
  2. [Diagram] Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4
  3. [Diagram] Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4
  4. [Diagram] Node 1, Node 2, Node 3; APP running in Pod 1, Pod 5, Pod 3, Pod 4
  5. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4
  6. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4
  7. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2, Pod 3, Pod 4, Pod 5
  8. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2
  9. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2
  10. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2
  11. [Diagram] Incoming traffic, Load balancer; Node 1, Node 2, Node 3; APP running in Pod 1, Pod 2; ?
  12. Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
  13. Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
  14. Pod name  Status   Node
      Pod 1     RUNNING  worker1
      Pod 2     RUNNING  worker2
      Pod 3     RUNNING  worker3
  15. Pod name  Status   Node
      Pod 1     RUNNING  worker1
      Pod 2     RUNNING  worker2
      Pod 3     RUNNING  worker3
  16. Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2

      Service name  IP          Endpoints
      Service 1     172.17.0.1  10.0.1.1:3000, 10.0.1.2:3000, 10.0.2.1:3000
      Service 2     172.17.0.2  10.0.2.2:8080
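Slide 16 pairs each Service IP with a list of pod endpoints. As a rough illustration of how a single request could be mapped to one of those endpoints, the sketch below picks an entry at random. The function name `pick_endpoint` is hypothetical, and the random choice is an assumption about the proxy's behaviour (kube-proxy in iptables mode is documented as choosing a backend at random). The real selection happens in kernel packet-filter rules, not in a shell script.

```shell
# Endpoints of "Service 1" (172.17.0.1), copied from slide 16.
SERVICE_1_ENDPOINTS="10.0.1.1:3000 10.0.1.2:3000 10.0.2.1:3000"

# pick_endpoint (hypothetical helper): print one of its arguments,
# chosen at random. Returns 1 if no endpoints are given.
pick_endpoint() {
  count=$#
  [ "$count" -gt 0 ] || return 1
  # Read two random bytes as an unsigned integer (0..65535).
  rand=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  # Drop a random number of leading arguments, then print the first one left.
  shift $(( rand % count ))
  printf '%s\n' "$1"
}

# Word splitting of the endpoint list is intentional here.
pick_endpoint $SERVICE_1_ENDPOINTS
```

Calling `pick_endpoint` repeatedly should spread the results across the three endpoints, which is the behaviour the load balancer diagrams rely on.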
  17. ? RED RED
      Pod name  IP        Node
      Pod 1     10.0.1.1  worker2
      Pod 2     10.0.2.1  worker3

      Service name  Endpoints
      Service 1     Pod1, Pod2
  18. RED RED
      Pod name  IP        Node
      Pod 1     10.0.1.1  worker2
      Pod 2     10.0.2.1  worker3

      Service name  Endpoints
      Service 1     Pod1, Pod2
  19. !

  20. RED

  21. RED

  22. RED

  23. monitor the app

    ~$ while sleep 1; do date +%X; curl -sS http://<balancer_ip>/; done
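The later slides show curl giving up after roughly ten seconds, so the monitoring loop presumably ran with a timeout that the slide does not show. A minimal sketch of such a probe follows. The `probe` function name, the `--max-time 10` value, and printing `# nothing...` on failure are all assumptions made for illustration, not the speaker's actual command.

```shell
# probe (hypothetical helper): print the current time, then the response
# body, or a marker line if the request fails or times out.
# Assumption: a 10-second cap, chosen to match the
# "timed out after 10003 milliseconds" lines in the transcript.
probe() {
  date +%X
  curl -sS --max-time 10 "$1" || echo "# nothing..."
}

# Usage (runs until interrupted; replace the placeholder first):
#   while sleep 1; do probe "http://<balancer_ip>/"; done
```

Capping each request keeps the loop moving during an outage, so the gaps between timestamps reflect the failure itself rather than a request that never returns.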
  24. 14:39:41 Hello world!
      14:39:42 Hello world!
      14:39:43 Hello world!
      14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
  25. 14:39:43 Hello world!
      14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
      14:39:47 Hello world!
      # nothing...
  26. 14:39:45 Hello world!
      14:39:46 Hello world!
      14:39:47 Hello world!
      # nothing...
      # nothing...
      # nothing...
      14:40:14 Hello world!
  27. 14:39:46 Hello world!
      14:39:47 Hello world!
      # nothing...
      # nothing...
      # nothing...
      14:40:14 Hello world!
      14:40:15 Hello world!
  28. RED

  29. monitor the app

    ~$ while sleep 1; do date +%X; curl -sS http://<node_ip>/; done
  30. 14:39:41 Hello world!
      14:39:42 Hello world!
      14:39:43 Hello world!
      14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
  31. 14:39:42 Hello world!
      14:39:43 Hello world!
      14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
      # nothing...
  32. 14:39:43 Hello world!
      14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
      # nothing...
      curl: (28) Connection timed out after 10003 milliseconds
  33. 14:39:44 Hello world!
      14:39:45 Hello world!
      14:39:46 Hello world!
      # nothing...
      curl: (28) Connection timed out after 10003 milliseconds
      curl: (28) Connection timed out after 10004 milliseconds
  34. 14:39:45 Hello world!
      14:39:46 Hello world!
      # nothing...
      curl: (28) Connection timed out after 10003 milliseconds
      curl: (28) Connection timed out after 10004 milliseconds
      14:40:15 Hello world!
  35. 14:39:46 Hello world!
      # nothing...
      curl: (28) Connection timed out after 10003 milliseconds
      curl: (28) Connection timed out after 10004 milliseconds
      14:40:15 Hello world!
      14:40:16 Hello world!
  36. 1. curl times out at 10s
      2. load balancer must be timing out > 10s
      3. something fixed the routing table
  37. 1. curl times out at 10s
      2. load balancer must be timing out > 10s
      3. something fixed the routing table
      4. why 30 seconds?
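The "30 seconds" in the question can be checked against the timestamps in the two monitoring runs. The sketch below converts `HH:MM:SS` values to seconds and subtracts them. The helper names `to_seconds` and `gap_seconds` are hypothetical, and the input format assumes `date +%X` prints 24-hour time as it does in the transcript.

```shell
# to_seconds (hypothetical helper): convert HH:MM:SS to seconds since midnight.
to_seconds() {
  IFS=: read -r h m s <<EOF
$1
EOF
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# gap_seconds (hypothetical helper): seconds elapsed between two timestamps.
gap_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

gap_seconds 14:39:47 14:40:14   # load balancer run, prints 27
gap_seconds 14:39:46 14:40:15   # node run, prints 29
```

Both gaps land just under 30 seconds, which is consistent with something repairing the routing rules on a fixed schedule. One possible candidate, not confirmed by the transcript, is kube-proxy's periodic rule resync, whose `--iptables-sync-period` flag defaults to 30s.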
  38. ah!

  39. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table
  40. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table
  41. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table
  42. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table !
  43. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table !
  44. RED
      Pod name  Status   Node     Endpoint
      Pod 1     RUNNING  worker1  10.0.1.1
      Pod 2     RUNNING  worker1  10.0.1.2
      Pod 3     RUNNING  worker2  10.0.2.1
      Pod 4     RUNNING  worker2  10.0.2.2
      Routing table
  45. ?
      Pod name  IP        Node
      Pod 1     10.0.1.1  worker2
      Pod 2     10.0.2.1  worker3

      Service name  Endpoints
      Service 1     Pod1, Pod2

      Packet: FROM: Anywhere, TO: Service (172.17.0.1)
  46. ?
      Pod name  IP        Node
      Pod 1     10.0.1.1  worker2
      Pod 2     10.0.2.1  worker3

      Service name  Endpoints
      Service 1     Pod1, Pod2

      Packet: FROM: Anywhere, TO: Pod1 (10.0.1.1); Service (172.17.0.1)
  47. 192.168.0.1
      Destination  Next hop
      10.0.0.0/24  192.168.0.2
      10.0.1.0/24  192.168.0.3
      10.0.2.0/24  192.168.0.4
      10.0.3.0/24  192.168.0.5

      [Diagram] 192.168.0.2, 192.168.0.3, ? 10.0.1.1

      Packet: FROM: Anywhere, TO: Pod1 (10.0.1.1); Service (172.17.0.1)
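The last slide asks where a packet addressed to Pod1 (10.0.1.1) should go next. The lookup can be sketched as below, under two stated assumptions: the destination and next-hop columns pair up in the order they appear on the flattened slide, and every route is a /24, so comparing the first three octets is enough. The `next_hop` function and `ROUTES` variable are hypothetical names used only for this example.

```shell
# Routing table as read from slide 47: pod subnet (/24) = next-hop node IP.
# Assumption: destinations and next hops pair up in slide order.
ROUTES="10.0.0.0/24=192.168.0.2
10.0.1.0/24=192.168.0.3
10.0.2.0/24=192.168.0.4
10.0.3.0/24=192.168.0.5"

# next_hop (hypothetical helper): print the next hop for an IPv4 address.
# Only valid for /24 routes, since it compares the first three octets.
next_hop() {
  prefix=${1%.*}            # 10.0.1.1    -> 10.0.1
  for route in $ROUTES; do
    subnet=${route%%=*}     # 10.0.1.0/24
    hop=${route#*=}         # 192.168.0.3
    if [ "${subnet%.*}" = "$prefix" ]; then   # 10.0.1.0/24 -> 10.0.1
      printf '%s\n' "$hop"
      return 0
    fi
  done
  return 1                  # no matching route
}

next_hop 10.0.1.1   # prints 192.168.0.3
```

A real routing table uses longest-prefix matching over arbitrary mask lengths; the three-octet comparison is a shortcut that only works because this table contains nothing but /24 routes.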