Pod Preemption That Even a Cat Can Understand #k8sjp / Kubernetes Meetup Tokyo 10th

y_taka_23
March 08, 2018

Slides used at Kubernetes Meetup Tokyo #10. They give a brief overview of Priority and Preemption, alpha features introduced in v1.8.0: their benefits, how they work, and the points to watch out for in operation.

Event overview: https://k8sjp.connpass.com/event/76816/
Blog post: http://ccvanishing.hateblo.jp/entry/2018/05/17/180426

Transcript

  1. Pod Preemption That Even a Cat Can Understand. チェシャ猫 (@y_taka_23), Kubernetes Meetup #10 (2018/03/08) #k8sjp

  2. Contents
    • Motivation
      ◦ What is Preemption? What do we gain from it?
    • Overview of the algorithm
      ◦ How does it work internally?
    • Limits of the preemption decision
      ◦ What happens when several conditions are combined?
  3. Scheduler’s main loop
    Node 1: Capacity = 10/30 (Pod A: Req. = 10)
    Node 2: Capacity = 5/30 (Pod B: Req. = 5)
    Pod Queue: Pod Y: Req. = 20; Pod X: Req. = 15; Pod Z: Req. = 10
  4. Scheduler’s main loop
    Node 1: Capacity = 25/30 (Pod X: Req. = 15; Pod A: Req. = 10)
    Node 2: Capacity = 5/30 (Pod B: Req. = 5)
    Pod Queue: Pod Z: Req. = 10; Pod Y: Req. = 20
  5. Scheduler’s main loop
    Node 1: Capacity = 25/30 (Pod X: Req. = 15; Pod A: Req. = 10)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20; Pod B: Req. = 5)
    Pod Queue: Pod Z: Req. = 10
  6. Scheduler’s main loop
    Node 1: Capacity = 25/30 (Pod X: Req. = 15; Pod A: Req. = 10)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20; Pod B: Req. = 5)
    Pod Queue: Pod Z: Req. = 10
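
The progression on slides 3 to 6 can be replayed with a small simulation. This is only a sketch: the first-fit placement rule and the `Pod`, `Node`, and `schedule` names are assumptions made for illustration (the real scheduler scores Nodes rather than taking the first one that fits), and the queue order X, Y, Z is read off the order in which the Pods get placed on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    request: int


@dataclass
class Node:
    name: str
    capacity: int
    pods: list = field(default_factory=list)

    def used(self) -> int:
        return sum(p.request for p in self.pods)

    def fits(self, pod: Pod) -> bool:
        return self.used() + pod.request <= self.capacity


def schedule(queue, nodes):
    """Place each queued Pod on the first Node with room; return the Pods left pending."""
    pending = []
    for pod in queue:
        target = next((n for n in nodes if n.fits(pod)), None)
        if target is None:
            pending.append(pod)  # no Node has enough free capacity
        else:
            target.pods.append(pod)
    return pending


# Initial state from slide 3.
nodes = [
    Node("Node 1", 30, [Pod("A", 10)]),
    Node("Node 2", 30, [Pod("B", 5)]),
]
pending = schedule([Pod("X", 15), Pod("Y", 20), Pod("Z", 10)], nodes)

print([(n.name, n.used()) for n in nodes])  # [('Node 1', 25), ('Node 2', 25)]
print([p.name for p in pending])            # ['Z']
```

Both Nodes end at 25/30, so each has only 5 units free and Pod Z (Req. = 10) stays in the queue, which is the situation shown on slide 6.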
  7. Adding Priority

  8. Scheduler’s main loop
    Node 1: Capacity = 25/30 (Pod X: Req. = 15, Pri. = 10; Pod A: Req. = 10, Pri. = 20)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20, Pri. = 20; Pod B: Req. = 5, Pri. = 30)
    Pod Queue: Pod Z: Req. = 10, Pri. = 20
  9. Scheduler’s main loop
    Node 1: Capacity = 25/30 (Pod X: Req. = 15, Pri. = 10; Pod A: Req. = 10, Pri. = 20)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20, Pri. = 20; Pod B: Req. = 5, Pri. = 30)
    Pod Queue: Pod Z: Req. = 10, Pri. = 20
  10. Scheduler’s main loop
    Node 1: Capacity = 10/30 (Pod A: Req. = 10, Pri. = 20)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20, Pri. = 20; Pod B: Req. = 5, Pri. = 30)
    Pod Queue: Pod Z: Req. = 10, Pri. = 20
  11. Scheduler’s main loop
    Node 1: Capacity = 20/30 (Pod Z: Req. = 10, Pri. = 20; Pod A: Req. = 10, Pri. = 20)
    Node 2: Capacity = 25/30 (Pod Y: Req. = 20, Pri. = 20; Pod B: Req. = 5, Pri. = 30)
    Pod Queue: (empty)
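
Slides 8 to 11 can be sketched the same way. The `try_preempt` function below is hypothetical and deliberately naive: it evicts every lower-priority Pod on the first Node where that makes room, and it assumes normal scheduling has already failed for the Pod. The slides that follow describe the more careful two-step selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int


def try_preempt(pod, nodes, capacity=30):
    """Return (node_name, victims) for the first Node where evicting all
    lower-priority Pods makes room for `pod`, or None if no Node qualifies."""
    for name, pods in nodes.items():
        keep = [p for p in pods if p.priority >= pod.priority]
        if sum(p.request for p in keep) + pod.request <= capacity:
            victims = [p for p in pods if p.priority < pod.priority]
            nodes[name] = keep + [pod]  # evict the victims, place the Pod
            return name, victims
    return None


# State from slide 8: both Nodes are at 25/30 and Pod Z is waiting.
nodes = {
    "Node 1": [Pod("X", 15, 10), Pod("A", 10, 20)],
    "Node 2": [Pod("Y", 20, 20), Pod("B", 5, 30)],
}
result = try_preempt(Pod("Z", 10, 20), nodes)

print(result[0], [v.name for v in result[1]])     # Node 1 ['X']
print(sum(p.request for p in nodes["Node 1"]))    # 20
```

Pod X (Pri. = 10) is the only Pod with lower priority than Pod Z (Pri. = 20), so it is evicted and Node 1 ends at 20/30, as on slide 11.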
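  12. Why Priority / Preemption matters
    • Important Pods are not blocked
      ◦ Previously there was no way to actively evict Pods
      ◦ Especially welcome in environments where Nodes cannot scale
      ◦ Even where they can scale, Nodes are slow to start up
    • Cost becomes predictable
      ◦ Without Priority you would need unlimited resources
      ◦ Sudden spikes can be handled as well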
  13. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 1: Capacity = 30/30 (Pod B: Req. = 10, Pri. = 10; Pod C: Req. = 5, Pri. = 5; Pod A: Req. = 15, Pri. = 10)
    Node 2: Capacity = 30/30 (Pod E: Req. = 15, Pri. = 20; Pod F: Req. = 5, Pri. = 10; Pod D: Req. = 10, Pri. = 30)
    Node 3: Capacity = 30/30 (Pod H: Req. = 20, Pri. = 10; Pod G: Req. = 10, Pri. = 30)
  14. Node selection happens in two steps

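  15. Step 1. Filtering Nodes
    • Select the Nodes where preemption might be possible
      ◦ Assume for the moment that every Pod with lower priority than the Pod to be placed has been evicted
      ◦ Can the target Pod be placed in that state?
    • More than resource amounts is considered
      ◦ NodeSelector, Taints, and similar settings
      ◦ Node Affinity / Inter-Pod Affinity / Anti-Affinity settings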
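
A minimal sketch of this filtering step, applied to the cluster from slide 13. It models resource requests only; NodeSelector, Taints, and Affinity are left out, and the helper names `usage_after_eviction` and `candidate_nodes` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int


CAPACITY = 30

# Cluster state and preemptor from slide 13.
cluster = {
    "Node 1": [Pod("A", 15, 10), Pod("B", 10, 10), Pod("C", 5, 5)],
    "Node 2": [Pod("D", 10, 30), Pod("E", 15, 20), Pod("F", 5, 10)],
    "Node 3": [Pod("G", 10, 30), Pod("H", 20, 10)],
}
preemptor = Pod("X", 10, 20)


def usage_after_eviction(pods, preemptor):
    """Usage once every Pod with lower priority than the preemptor is removed."""
    return sum(p.request for p in pods if p.priority >= preemptor.priority)


def candidate_nodes(cluster, preemptor):
    """Nodes where the preemptor would fit after the hypothetical eviction."""
    return [
        name
        for name, pods in cluster.items()
        if usage_after_eviction(pods, preemptor) + preemptor.request <= CAPACITY
    ]


for name, pods in cluster.items():
    print(name, usage_after_eviction(pods, preemptor))
# Node 1 0
# Node 2 25
# Node 3 10
print(candidate_nodes(cluster, preemptor))  # ['Node 1', 'Node 3']
```

Node 2 drops out because Pods D and E are not lower in priority than Pod X and together leave only 5 units free. The remaining usage figures match the 0/30, 25/30, and 10/30 shown on slides 17 to 19.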
  16. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 1: Capacity = 30/30 (Pod B: Req. = 10, Pri. = 10; Pod C: Req. = 5, Pri. = 5; Pod A: Req. = 15, Pri. = 10)
    Node 2: Capacity = 30/30 (Pod E: Req. = 15, Pri. = 20; Pod F: Req. = 5, Pri. = 10; Pod D: Req. = 10, Pri. = 30)
    Node 3: Capacity = 30/30 (Pod H: Req. = 20, Pri. = 10; Pod G: Req. = 10, Pri. = 30)
  17. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 1: Capacity = 0/30
  18. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 2: Capacity = 25/30 (Pod E: Req. = 15, Pri. = 20; Pod D: Req. = 10, Pri. = 30)
  19. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 3: Capacity = 10/30 (Pod G: Req. = 10, Pri. = 30)
  20. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 1: Capacity = 30/30 (Pod B: Req. = 10, Pri. = 10; Pod C: Req. = 5, Pri. = 5; Pod A: Req. = 15, Pri. = 10)
    Node 2: Capacity = 30/30 (Pod E: Req. = 15, Pri. = 20; Pod F: Req. = 5, Pri. = 10; Pod D: Req. = 10, Pri. = 30)
    Node 3: Capacity = 30/30 (Pod H: Req. = 20, Pri. = 10; Pod G: Req. = 10, Pri. = 30)
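  21. Step 2. Ranking Nodes
    • Find the minimum set of Pods that must be evicted
      ◦ Try putting Pods back as far as possible
    • Choose the Node with the smallest penalty
      ◦ Number of Pod Disruption Budget (PDB) violations
      ◦ Highest Priority among the Pods that would violate a PDB
      ◦ Sum of the Priorities of the evicted Pods
      ◦ Number of evicted Pods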
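
The two parts of this step can be sketched as follows. Several things here are assumptions rather than facts from the slides: Pods are put back in descending priority order, the four criteria are compared lexicographically in the order they are listed, and `violates_pdb` is a hypothetical flag standing in for a real PDB check. The ranking demo uses invented nodes `node-a` and `node-b`, because the slide text does not show which Pods are evicted on each Node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int
    violates_pdb: bool = False  # hypothetical: evicting this Pod would break a PDB


def minimal_victims(pods, preemptor, capacity=30):
    """Start from 'evict every lower-priority Pod', then put Pods back
    (highest priority first) for as long as the preemptor still fits."""
    kept = [p for p in pods if p.priority >= preemptor.priority]
    used = sum(p.request for p in kept) + preemptor.request
    victims = []
    lower = sorted((p for p in pods if p.priority < preemptor.priority),
                   key=lambda p: -p.priority)
    for pod in lower:
        if used + pod.request <= capacity:
            used += pod.request  # fits next to the preemptor, so it is spared
        else:
            victims.append(pod)
    return victims


def penalty(victims):
    """Tuple compared lexicographically; smaller is better."""
    pdb = [v for v in victims if v.violates_pdb]
    return (
        len(pdb),
        max((v.priority for v in pdb), default=float("-inf")),
        sum(v.priority for v in victims),
        len(victims),
    )


# Node 3 from slide 13: only Pod H has lower priority than Pod X.
preemptor = Pod("X", 10, 20)
node3 = [Pod("G", 10, 30), Pod("H", 20, 10)]
print([v.name for v in minimal_victims(node3, preemptor)])  # ['H']

# Hypothetical victim sets for two candidate Nodes.
candidates = {
    "node-a": [Pod("p", 10, 10), Pod("q", 5, 5)],
    "node-b": [Pod("r", 20, 10)],
}
best = min(candidates, key=lambda name: penalty(candidates[name]))
print(best)  # node-b
```

With no PDB violations on either side, `node-b` wins on the third criterion: its victims have a priority sum of 10 against 15 for `node-a`.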
  22. Preemptor: Pod X: Req. = 10, Pri. = 20
    Node 1: Capacity = 15/30 (Pod B: Req. = 10, Pri. = 10; Pod C: Req. = 5, Pri. = 5; Pod A: Req. = 15, Pri. = 10)
    Node 2: Capacity = 30/30 (Pod E: Req. = 15, Pri. = 20; Pod F: Req. = 5, Pri. = 10; Pod D: Req. = 10, Pri. = 30)
    Node 3: Capacity = 10/30 (Pod H: Req. = 20, Pri. = 10; Pod G: Req. = 10, Pri. = 30)
  23. Preemptor: (empty)
    Node 1: Capacity = 30/30 (Pod B: Req. = 10, Pri. = 10; Pod C: Req. = 5, Pri. = 5; Pod A: Req. = 15, Pri. = 10)
    Node 2: Capacity = 30/30 (Pod E: Req. = 15, Pri. = 20; Pod F: Req. = 5, Pri. = 10; Pod D: Req. = 10, Pri. = 30)
    Node 3: Capacity = 20/30 (Pod X: Req. = 10, Pri. = 20; Pod G: Req. = 10, Pri. = 30)
  24. Is this algorithm perfect?

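  25. Affinity + Preemption
    • Pod Affinity
      ◦ Requires that the specified other Pods be present
      ◦ Example: place Redis on the same Node as the web app
    • Affinity from a high-priority Pod to a low-priority Pod
      ◦ A low-priority Pod is subject to the "evict everything for now" assumption
      ◦ The Node therefore drops out of the candidates at the filtering stage
      ◦ Low priority → high priority is fine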
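
The failure mode on this slide can be reproduced with a toy version of the filter. The `label` and `needs_label` fields are hypothetical stand-ins for Pod labels and a required Pod Affinity term; the Pod names, sizes, and priorities are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int
    label: str = ""
    needs_label: str = ""  # hypothetical stand-in for a required Pod Affinity term


def passes_filter(pods, preemptor, capacity=30):
    """Filtering step: drop every lower-priority Pod, then check resources and affinity."""
    remaining = [p for p in pods if p.priority >= preemptor.priority]
    fits = sum(p.request for p in remaining) + preemptor.request <= capacity
    affinity_ok = (not preemptor.needs_label
                   or any(p.label == preemptor.needs_label for p in remaining))
    return fits and affinity_ok


batch = Pod("batch", 20, 5)
redis = Pod("redis", 10, 20, needs_label="web")  # must share a Node with a "web" Pod

low_web = Pod("web", 10, 10, label="web")    # lower priority than redis
high_web = Pod("web", 10, 30, label="web")   # higher priority than redis

print(passes_filter([low_web, batch], redis))   # False
print(passes_filter([high_web, batch], redis))  # True
```

In the first case, evicting only `batch` would leave room for `redis` next to `web`. The filter still rejects the Node, because the low-priority `web` Pod is removed by the same hypothetical eviction and the affinity can no longer be satisfied.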
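  26. Anti-Affinity + Preemption
    • Pod Anti-Affinity
      ◦ Requires that the specified other Pods be absent
      ◦ Example: spread a DB cluster across AZs
    • Anti-Affinity that spans Nodes
      ◦ "Evict everything for now" applies only to the Node being examined
      ◦ Pods on other Nodes in the same AZ get in the way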
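
A toy model of the zone-wide case, under stated assumptions: the two Nodes, the zone name `az-a`, and the Pods are all invented, and `passes_filter` is a simplified illustration rather than the scheduler's actual predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int
    label: str = ""


# node name -> (zone, Pods); both Nodes sit in the same hypothetical zone.
cluster = {
    "node-1": ("az-a", [Pod("batch", 30, 5)]),
    "node-2": ("az-a", [Pod("db-old", 10, 10, label="db")]),
}
preemptor = Pod("db-new", 10, 20, label="db")


def passes_filter(node, cluster, preemptor, avoid_label, capacity=30):
    """Evict lower-priority Pods on `node` only, then check resources
    and zone-wide anti-affinity against `avoid_label`."""
    zone, pods = cluster[node]
    remaining = [p for p in pods if p.priority >= preemptor.priority]
    if sum(p.request for p in remaining) + preemptor.request > capacity:
        return False
    for other, (other_zone, other_pods) in cluster.items():
        if other_zone != zone:
            continue
        # Only the Node under examination is treated as emptied.
        visible = remaining if other == node else other_pods
        if any(p.label == avoid_label for p in visible):
            return False
    return True


print(passes_filter("node-1", cluster, preemptor, "db"))  # False
print(passes_filter("node-2", cluster, preemptor, "db"))  # True
```

`node-1` is rejected even though `db-old` has lower priority than the preemptor, because `db-old` lives on `node-2` and is never part of the hypothetical eviction for `node-1`. In this model `node-2` passes only because the conflicting Pod happens to be on the Node being examined.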
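  27. Summary
    • Resources are secured preferentially for important Pods
      ◦ Cost is controllable and does not grow without bound
    • Node filtering + ranking
      ◦ First try evicting everything, then examine the result more closely
    • Take care when combining with Affinity
      ◦ There are cases where preemption should be possible but does not trigger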
  28. Preempt the Preemption! Presented by チェシャ猫 (@y_taka_23) #k8sjp