Pod Preemption Even a Cat Can Understand #k8sjp / Kubernetes Meetup Tokyo 10th

Slides used at Kubernetes Meetup Tokyo #10. A brief explanation of Priority and Preemption, the alpha features introduced in v1.8.0: their benefits, how they work, and points to watch out for in operation.

Event overview: https://k8sjp.connpass.com/event/76816/
Blog post: http://ccvanishing.hateblo.jp/entry/2018/05/17/180426

y_taka_23

March 08, 2018

Transcript

  1. Pod Preemption Even a Cat Can Understand
    チェシャ猫 (@y_taka_23)
    Kubernetes Meetup #10 (2018/03/08)
    #k8sjp

  2. Contents
    ● Motivation
    ○ What is Preemption? What do we gain from it?
    ● Algorithm overview
    ○ How does it work internally?
    ● Limits of the preemption decision
    ○ What happens when multiple conditions are combined?

  3. Scheduler’s main loop
    Node 1: Capacity = 10/30
      Pod A: Req. = 10
    Node 2: Capacity = 5/30
      Pod B: Req. = 5
    Pod Queue
      Pod X: Req. = 15
      Pod Y: Req. = 20
      Pod Z: Req. = 10

  4. Scheduler’s main loop
    Node 1: Capacity = 25/30
      Pod A: Req. = 10
      Pod X: Req. = 15
    Node 2: Capacity = 5/30
      Pod B: Req. = 5
    Pod Queue
      Pod Y: Req. = 20
      Pod Z: Req. = 10

  5. Scheduler’s main loop
    Node 1: Capacity = 25/30
      Pod A: Req. = 10
      Pod X: Req. = 15
    Node 2: Capacity = 25/30
      Pod B: Req. = 5
      Pod Y: Req. = 20
    Pod Queue
      Pod Z: Req. = 10

  6. Scheduler’s main loop
    Node 1: Capacity = 25/30
      Pod A: Req. = 10
      Pod X: Req. = 15
    Node 2: Capacity = 25/30
      Pod B: Req. = 5
      Pod Y: Req. = 20
    Pod Queue
      Pod Z: Req. = 10
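
The sequence in slides 3 to 6 can be replayed with a small simulation. This is only a sketch under simplifying assumptions (a single resource dimension, first-fit placement, node names `node1`/`node2` chosen here for illustration); it is not the actual kube-scheduler logic. It shows Pod Z left pending because neither Node has 10 units free and nothing can be evicted.

```python
from collections import deque

CAPACITY = 30  # every node in the example has capacity 30


def schedule(nodes, queue):
    """Place each queued pod on the first node with enough free capacity."""
    pending = []
    while queue:
        name, request = queue.popleft()
        for pods in nodes.values():
            used = sum(pods.values())
            if used + request <= CAPACITY:
                pods[name] = request
                break
        else:
            # No node fits and there is no way to evict anything:
            # the pod simply stays pending.
            pending.append(name)
    return pending


# State from slide 3: pod name -> requested amount
nodes = {"node1": {"A": 10}, "node2": {"B": 5}}
queue = deque([("X", 15), ("Y", 20), ("Z", 10)])

pending = schedule(nodes, queue)
print(nodes)    # X lands on node1, Y on node2
print(pending)  # ['Z']
```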

  7. Adding Priority

  8. Scheduler’s main loop
    Node 1: Capacity = 25/30
      Pod A: Req. = 10, Pri. = 20
      Pod X: Req. = 15, Pri. = 10
    Node 2: Capacity = 25/30
      Pod B: Req. = 5, Pri. = 30
      Pod Y: Req. = 20, Pri. = 20
    Pod Queue
      Pod Z: Req. = 10, Pri. = 20

  9. Scheduler’s main loop
    Node 1: Capacity = 25/30
      Pod A: Req. = 10, Pri. = 20
      Pod X: Req. = 15, Pri. = 10
    Node 2: Capacity = 25/30
      Pod B: Req. = 5, Pri. = 30
      Pod Y: Req. = 20, Pri. = 20
    Pod Queue
      Pod Z: Req. = 10, Pri. = 20

  10. Scheduler’s main loop
    Node 1: Capacity = 10/30
      Pod A: Req. = 10, Pri. = 20
    Node 2: Capacity = 25/30
      Pod B: Req. = 5, Pri. = 30
      Pod Y: Req. = 20, Pri. = 20
    Pod Queue
      Pod Z: Req. = 10, Pri. = 20

  11. Scheduler’s main loop
    Node 1: Capacity = 20/30
      Pod A: Req. = 10, Pri. = 20
      Pod Z: Req. = 10, Pri. = 20
    Node 2: Capacity = 25/30
      Pod B: Req. = 5, Pri. = 30
      Pod Y: Req. = 20, Pri. = 20
    Pod Queue
      (empty)
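
Slides 8 to 11 can be sketched in the same style, now with priorities. The function below is a deliberately naive illustration, assuming a single resource and evicting the lowest-priority Pods first on the first workable node; the names `preempt`, `node1`, and `node2` are made up here, and the real selection procedure is the two-stage algorithm described later in the deck.

```python
CAPACITY = 30


def preempt(nodes, pod):
    """Place `pod`, evicting strictly lower-priority pods from one node if needed."""
    name, request, priority = pod
    for node, pods in nodes.items():
        # Eviction candidates: strictly lower priority, lowest priority first.
        lower = sorted(
            (item for item in pods.items() if item[1][1] < priority),
            key=lambda item: item[1][1],
        )
        used = sum(req for req, _ in pods.values())
        victims = []
        for victim, (req, _) in lower:
            if used + request <= CAPACITY:
                break
            victims.append(victim)
            used -= req
        if used + request <= CAPACITY:
            for victim in victims:
                del pods[victim]
            pods[name] = (request, priority)
            return node, victims
    return None, []


# State from slide 8: pod name -> (request, priority)
nodes = {
    "node1": {"A": (10, 20), "X": (15, 10)},
    "node2": {"B": (5, 30), "Y": (20, 20)},
}
result = preempt(nodes, ("Z", 10, 20))
print(result)  # ('node1', ['X'])
```

Pod X (Pri. = 10) is evicted to make room for Pod Z (Pri. = 20), leaving Node 1 at 20/30 as on slide 11.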

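  12. Why Priority / Preemption matters
    ● Important Pods are not blocked
    ○ Previously there was no way to actively evict Pods
    ○ Especially welcome in environments where Nodes cannot scale
    ○ Even where they can scale, Nodes are slow to start
    ● Costs become predictable
    ○ Without Priority, unbounded resources would be needed
    ○ Sudden spikes can also be handled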

  13. Node 1: Capacity = 30/30
      Pod A: Req. = 15, Pri. = 10
      Pod B: Req. = 10, Pri. = 10
      Pod C: Req. = 5, Pri. = 5
    Node 2: Capacity = 30/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
      Pod F: Req. = 5, Pri. = 10
    Node 3: Capacity = 30/30
      Pod G: Req. = 10, Pri. = 30
      Pod H: Req. = 20, Pri. = 10
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  14. Node selection is a two-stage process

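  15. Step 1. Filtering Nodes
    ● Select Nodes where preemption might be possible
    ○ Assume that every Pod with a lower priority than the Pod to be placed has been evicted for now
    ○ Can the target Pod be placed in that state?
    ● More than resource amounts is considered
    ○ NodeSelector, Taints, and similar settings
    ○ Node Affinity / Inter-Pod Affinity / Anti-Affinity settings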
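
The filtering step can be sketched as follows, using the three-node example from slide 13. This is a minimal illustration that checks resource amounts only; selectors, taints, and affinity are left out, and the names `feasible_nodes` and `node1` to `node3` are assumptions made for this sketch.

```python
CAPACITY = 30

# Pod name -> (request, priority), as in the three-node example
NODES = {
    "node1": {"A": (15, 10), "B": (10, 10), "C": (5, 5)},
    "node2": {"D": (10, 30), "E": (15, 20), "F": (5, 10)},
    "node3": {"G": (10, 30), "H": (20, 10)},
}


def feasible_nodes(nodes, request, priority):
    """Keep nodes where the pod fits once all lower-priority pods are assumed gone."""
    result = []
    for node, pods in nodes.items():
        # Only pods with equal or higher priority are assumed to remain.
        remaining = sum(req for req, pri in pods.values() if pri >= priority)
        if remaining + request <= CAPACITY:
            result.append(node)
    return result


# Preemptor Pod X: Req. = 10, Pri. = 20
candidates = feasible_nodes(NODES, request=10, priority=20)
print(candidates)  # ['node1', 'node3']
```

Node 2 drops out because Pods D and E (25 in total) are not lower in priority than Pod X, so only 5 units could ever be freed.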

  16. Node 1: Capacity = 30/30
      Pod A: Req. = 15, Pri. = 10
      Pod B: Req. = 10, Pri. = 10
      Pod C: Req. = 5, Pri. = 5
    Node 2: Capacity = 30/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
      Pod F: Req. = 5, Pri. = 10
    Node 3: Capacity = 30/30
      Pod G: Req. = 10, Pri. = 30
      Pod H: Req. = 20, Pri. = 10
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  17. Node 1: Capacity = 0/30
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  18. Node 2: Capacity = 25/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  19. Node 3: Capacity = 10/30
      Pod G: Req. = 10, Pri. = 30
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  20. Node 1: Capacity = 30/30
      Pod A: Req. = 15, Pri. = 10
      Pod B: Req. = 10, Pri. = 10
      Pod C: Req. = 5, Pri. = 5
    Node 2: Capacity = 30/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
      Pod F: Req. = 5, Pri. = 10
    Node 3: Capacity = 30/30
      Pod G: Req. = 10, Pri. = 30
      Pod H: Req. = 20, Pri. = 10
    Preemptor
      Pod X: Req. = 10, Pri. = 20

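  21. Step 2. Ranking Nodes
    ● Find the minimum set of Pods that must be evicted
    ○ Put Pods back wherever possible
    ● Choose the Node with the smallest penalty
    ○ Number of Pod Disruption Budget (PDB) violations
    ○ Highest Priority among the Pods that violate a PDB
    ○ Sum of the Priorities of the evicted Pods
    ○ Number of evicted Pods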
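
The ranking step can be sketched as a lexicographic comparison of the four penalties above. The data below is hypothetical (nodes `node-a`/`node-b` and Pods P, Q, R, S are not from the slides), all nodes are assumed to have already passed the filtering step, and the order in which Pods are put back (higher priority first) is an assumption of this sketch rather than something the slides state.

```python
CAPACITY = 30


def minimal_victims(pods, request, priority):
    """Start from 'all lower-priority pods evicted' and put back whatever still fits."""
    used = request + sum(req for req, pri in pods.values() if pri >= priority)
    # Assumption: try to put back higher-priority pods first.
    lower = sorted(
        (item for item in pods.items() if item[1][1] < priority),
        key=lambda item: -item[1][1],
    )
    victims = []
    for name, (req, pri) in lower:
        if used + req <= CAPACITY:
            used += req  # this pod can stay
        else:
            victims.append((name, pri))
    return victims


def penalty(victims, pdb_protected):
    """Tuple compared left to right, mirroring the four criteria on the slide."""
    violating = [pri for name, pri in victims if name in pdb_protected]
    return (
        len(violating),
        max(violating, default=0),
        sum(pri for _, pri in victims),
        len(victims),
    )


def pick_node(nodes, request, priority, pdb_protected=frozenset()):
    scores = {
        node: penalty(minimal_victims(pods, request, priority), pdb_protected)
        for node, pods in nodes.items()
    }
    return min(scores, key=scores.get)


# Hypothetical cluster: pod name -> (request, priority)
nodes = {
    "node-a": {"P": (10, 30), "Q": (20, 10)},
    "node-b": {"R": (20, 10), "S": (10, 5)},
}
print(pick_node(nodes, request=10, priority=20))                       # node-b
print(pick_node(nodes, request=10, priority=20, pdb_protected={"S"}))  # node-a
```

Without any PDB, `node-b` wins because its only victim (S, priority 5) has a lower priority sum than Q on `node-a`. Once evicting S would violate a PDB, that criterion is compared first and `node-a` is preferred.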

  22. Node 1: Capacity = 15/30
      Pod A: Req. = 15, Pri. = 10
      Pod B: Req. = 10, Pri. = 10
      Pod C: Req. = 5, Pri. = 5
    Node 2: Capacity = 30/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
      Pod F: Req. = 5, Pri. = 10
    Node 3: Capacity = 10/30
      Pod G: Req. = 10, Pri. = 30
      Pod H: Req. = 20, Pri. = 10
    Preemptor
      Pod X: Req. = 10, Pri. = 20

  23. Node 1: Capacity = 30/30
      Pod A: Req. = 15, Pri. = 10
      Pod B: Req. = 10, Pri. = 10
      Pod C: Req. = 5, Pri. = 5
    Node 2: Capacity = 30/30
      Pod D: Req. = 10, Pri. = 30
      Pod E: Req. = 15, Pri. = 20
      Pod F: Req. = 5, Pri. = 10
    Node 3: Capacity = 20/30
      Pod G: Req. = 10, Pri. = 30
      Pod X: Req. = 10, Pri. = 20
    Preemptor
      (placed on Node 3)

  24. Is this algorithm perfect?

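  25. Affinity + Preemption
    ● Pod Affinity
    ○ Requires that a specified other Pod is present
    ○ Example: place Redis on the same Node as the web app
    ● High-priority → low-priority Affinity
    ○ A low-priority Pod is subject to the "evict everything for now" assumption
    ○ The Node drops out of the candidates at the filtering stage
    ○ Low priority → high priority is fine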
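
The Affinity pitfall can be reproduced with a small model of the filtering step. Everything here is an illustrative assumption: the `passes_filter` helper, the single `label` per Pod, and the `web`/`batch`/Redis workloads are invented for this sketch and do not correspond to real scheduler code.

```python
CAPACITY = 30


def passes_filter(pods, preemptor):
    """Assume all lower-priority pods are gone, then check fit and required affinity."""
    survivors = {
        name: pod for name, pod in pods.items()
        if pod["priority"] >= preemptor["priority"]
    }
    used = sum(pod["request"] for pod in survivors.values())
    fits = used + preemptor["request"] <= CAPACITY
    # Required affinity: some pod still on the node must carry the wanted label.
    wanted = preemptor.get("affinity")
    affinity_ok = wanted is None or any(
        pod["label"] == wanted for pod in survivors.values()
    )
    return fits and affinity_ok


# High-priority Redis that must share a node with the web app
redis = {"request": 10, "priority": 20, "affinity": "web"}

# The web app has a LOWER priority than Redis
node_low = {
    "web": {"request": 10, "priority": 10, "label": "web"},
    "batch": {"request": 20, "priority": 10, "label": "batch"},
}
# Same node, but the web app has a HIGHER priority than Redis
node_high = {
    "web": {"request": 10, "priority": 30, "label": "web"},
    "batch": {"request": 20, "priority": 10, "label": "batch"},
}

print(passes_filter(node_low, redis))   # False
print(passes_filter(node_high, redis))  # True
```

In the first case, evicting only `batch` would leave room for Redis next to `web`, but the filter hypothetically removes `web` as well, so the Node is rejected before the finer-grained check ever runs.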

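  26. Anti-Affinity + Preemption
    ● Pod Anti-Affinity
    ○ Requires that a specified other Pod is absent
    ○ Example: spread a DB cluster across AZs
    ● Anti-Affinity that spans Nodes
    ○ "Evict everything for now" applies only to the Node in question
    ○ Pods on other Nodes in the same AZ get in the way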
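
The Anti-Affinity case can be modelled the same way. The sketch below assumes a zone-wide anti-affinity rule expressed as a single label; the cluster layout, the Pod names, and the `passes_filter` helper are hypothetical and only serve to show why a lower-priority Pod on a different Node still blocks the preemptor.

```python
CAPACITY = 30


def passes_filter(cluster, node, preemptor):
    """Hypothetically evict lower-priority pods on `node` only, then check the zone."""
    zone = cluster[node]["zone"]
    for other, info in cluster.items():
        if info["zone"] != zone:
            continue
        for pod in info["pods"]:
            # Only pods on the candidate node itself are assumed to be evicted.
            evicted = other == node and pod["priority"] < preemptor["priority"]
            if not evicted and pod["label"] == preemptor["anti_affinity"]:
                return False  # a conflicting pod is still present in the zone
    used = sum(
        pod["request"] for pod in cluster[node]["pods"]
        if pod["priority"] >= preemptor["priority"]
    )
    return used + preemptor["request"] <= CAPACITY


cluster = {
    "node1": {"zone": "a", "pods": [
        {"name": "batch", "request": 30, "priority": 10, "label": "batch"},
    ]},
    "node2": {"zone": "a", "pods": [
        {"name": "db-old", "request": 5, "priority": 10, "label": "db"},
        {"name": "critical", "request": 25, "priority": 30, "label": "critical"},
    ]},
}
# High-priority DB replica that must not share a zone with another "db" pod
db = {"request": 10, "priority": 20, "label": "db", "anti_affinity": "db"}

print(passes_filter(cluster, "node1", db))  # False: db-old on node2 is in the way
print(passes_filter(cluster, "node2", db))  # False: not enough room next to critical
```

Evicting `batch` from `node1` and `db-old` from `node2` together would make room, but because each Node is evaluated in isolation, neither passes the filter.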

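  27. Summary
    ● Secure resources for important Pods first
    ○ Costs are controllable and do not grow without bound
    ● Node filtering + ranking
    ○ First try evicting everything, then examine more closely
    ● Be careful when combining with Affinity
    ○ Cases where preemption should be possible but does not trigger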

  28. Preempt the Preemption!
    Presented by チェシャ猫 (@y_taka_23)