y_taka_23
March 08, 2018

# Pod Preemption Even a Cat Can Understand #k8sjp / Kubernetes Meetup Tokyo 10th

Slides used at Kubernetes Meetup Tokyo #10. They briefly explain Priority and Preemption, the alpha feature introduced in v1.8.0: its benefits, how it works, and what to watch out for in operation.

## Transcript

2. ### Table of contents

• Motivation
  ◦ What is Preemption, and what does it buy you?
• Algorithm overview
  ◦ How does it work internally?
• Limits of the preemption decision
  ◦ What happens when several conditions are combined?
3. ### Diagram: Scheduler's main loop

• Node 1 (Capacity = 10/30): Pod A (Req. = 10)
• Node 2 (Capacity = 5/30): Pod B (Req. = 5)
• Pod Queue: Pod Y (Req. = 20), Pod X (Req. = 15), Pod Z (Req. = 10)

4. ### Diagram: Scheduler's main loop

• Node 1 (Capacity = 25/30): Pod X (Req. = 15), Pod A (Req. = 10)
• Node 2 (Capacity = 5/30): Pod B (Req. = 5)
• Pod Queue: Pod Z (Req. = 10), Pod Y (Req. = 20)

5. ### Diagram: Scheduler's main loop

• Node 1 (Capacity = 25/30): Pod X (Req. = 15), Pod A (Req. = 10)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20), Pod B (Req. = 5)
• Pod Queue: Pod Z (Req. = 10)

6. ### Diagram: Scheduler's main loop

• Node 1 (Capacity = 25/30): Pod X (Req. = 15), Pod A (Req. = 10)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20), Pod B (Req. = 5)
• Pod Queue: Pod Z (Req. = 10)
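The situation in slides 3 to 6 can be reproduced with a small simulation. This is only a sketch: it assumes a first-fit placement (the real scheduler also scores Nodes), a uniform capacity of 30 as in the diagrams, and a queue order of Pod X, Pod Y, Pod Z inferred from the order in which the Pods get placed. The function name `schedule_without_priority` is made up for illustration.

```python
CAPACITY = 30  # every Node in the slides has a capacity of 30


def schedule_without_priority(nodes, queue):
    """Place each queued Pod on the first Node with enough free capacity.

    nodes: dict mapping Node name -> list of (pod_name, request)
    queue: list of (pod_name, request), processed front to back
    Returns the names of Pods that could not be placed anywhere.
    """
    pending = []
    for name, req in queue:
        for pods in nodes.values():
            used = sum(r for _, r in pods)
            if used + req <= CAPACITY:
                pods.append((name, req))
                break
        else:
            # No Node had room, and nothing can be evicted to make room.
            pending.append(name)
    return pending


nodes = {"Node 1": [("Pod A", 10)], "Node 2": [("Pod B", 5)]}
queue = [("Pod X", 15), ("Pod Y", 20), ("Pod Z", 10)]  # assumed order
pending = schedule_without_priority(nodes, queue)
print(pending)  # ['Pod Z']
```

Both Nodes end up at 25/30, so Pod Z (Req. = 10) stays in the queue even though 10 units are free in total across the cluster.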

8. ### Diagram: Scheduler's main loop (with Priority)

• Node 1 (Capacity = 25/30): Pod X (Req. = 15, Pri. = 10), Pod A (Req. = 10, Pri. = 20)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20, Pri. = 20), Pod B (Req. = 5, Pri. = 30)
• Pod Queue: Pod Z (Req. = 10, Pri. = 20)

9. ### Diagram: Scheduler's main loop (with Priority)

• Node 1 (Capacity = 25/30): Pod X (Req. = 15, Pri. = 10), Pod A (Req. = 10, Pri. = 20)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20, Pri. = 20), Pod B (Req. = 5, Pri. = 30)
• Pod Queue: Pod Z (Req. = 10, Pri. = 20)

10. ### Diagram: Scheduler's main loop (with Priority)

• Node 1 (Capacity = 10/30): Pod A (Req. = 10, Pri. = 20)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20, Pri. = 20), Pod B (Req. = 5, Pri. = 30)
• Pod Queue: Pod Z (Req. = 10, Pri. = 20)

11. ### Diagram: Scheduler's main loop (with Priority)

• Node 1 (Capacity = 20/30): Pod Z (Req. = 10, Pri. = 20), Pod A (Req. = 10, Pri. = 20)
• Node 2 (Capacity = 25/30): Pod Y (Req. = 20, Pri. = 20), Pod B (Req. = 5, Pri. = 30)
• Pod Queue: (empty)
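The sequence in slides 8 to 11 can be sketched as "try a normal placement, and if that fails, evict lower-priority Pods". The code below is a deliberately naive illustration, not the scheduler's actual algorithm: it assumes first-fit placement, takes the first Node where eviction frees enough room, and evicts the lowest-priority Pods first. `place_or_preempt` and `used` are hypothetical helper names.

```python
CAPACITY = 30  # every Node in the slides has a capacity of 30


def used(pods):
    """Total requested resources of the Pods on one Node."""
    return sum(p["req"] for p in pods)


def place_or_preempt(nodes, pod):
    """Return (node_name, evicted_pod_names); node_name is None on failure."""
    # 1. Ordinary placement: first Node with enough free capacity.
    for name, pods in nodes.items():
        if used(pods) + pod["req"] <= CAPACITY:
            pods.append(pod)
            return name, []
    # 2. Preemption: evict strictly lower-priority Pods, lowest priority first.
    for name, pods in nodes.items():
        lower = sorted((p for p in pods if p["pri"] < pod["pri"]),
                       key=lambda p: p["pri"])
        free = CAPACITY - used(pods)
        victims = []
        for v in lower:
            if free >= pod["req"]:
                break
            victims.append(v)
            free += v["req"]
        if free >= pod["req"]:
            for v in victims:
                pods.remove(v)
            pods.append(pod)
            return name, [v["name"] for v in victims]
    return None, []


nodes = {
    "Node 1": [{"name": "Pod X", "req": 15, "pri": 10},
               {"name": "Pod A", "req": 10, "pri": 20}],
    "Node 2": [{"name": "Pod Y", "req": 20, "pri": 20},
               {"name": "Pod B", "req": 5, "pri": 30}],
}
pod_z = {"name": "Pod Z", "req": 10, "pri": 20}
node, evicted = place_or_preempt(nodes, pod_z)
print(node, evicted, used(nodes["Node 1"]))  # Node 1 ['Pod X'] 20
```

Pod X (Pri. = 10) is the only Pod with lower priority than Pod Z, so it is the one evicted, and Node 1 ends at 20/30 as in slide 11.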
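12. ### Why Priority / Preemption matters

• Important Pods are not blocked
  ◦ Previously there was no way to actively evict Pods
  ◦ Especially welcome in environments where Nodes cannot scale
  ◦ Even where they can scale, Nodes are slow to start up
• Cost becomes predictable
  ◦ Without Priority, you would need unlimited resources
  ◦ Sudden spikes can also be handled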
13. ### Diagram: Preemptor and Nodes 1 to 3

• Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
• Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
• Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

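15. ### Step 1. Filtering Nodes

• Select the Nodes where preemption might be possible
  ◦ Assume that every Pod with lower priority than the Pod to be placed has been evicted for now
  ◦ Can the target Pod be placed in that state?
• More than resource amounts is considered
  ◦ Settings such as NodeSelector and Taints
  ◦ Node Affinity / Inter-Pod Affinity / Anti-Affinity settings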
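The filtering step can be sketched as follows, using the Pods and Nodes from the three-Node diagram. This is a minimal illustration that checks resource requests only; NodeSelector, Taints, and Affinity are left out, and `filter_candidates` is a hypothetical name.

```python
CAPACITY = 30  # every Node in the slides has a capacity of 30


def filter_candidates(nodes, preemptor):
    """Return {node_name: usage after evicting all lower-priority Pods}
    for the Nodes where the preemptor would then fit.

    Pods are (name, request, priority) tuples.
    """
    _, req, pri = preemptor
    candidates = {}
    for node, pods in nodes.items():
        # Pretend every Pod with lower priority than the preemptor is gone.
        remaining = sum(r for _, r, p in pods if p >= pri)
        if remaining + req <= CAPACITY:
            candidates[node] = remaining
    return candidates


nodes = {
    "Node 1": [("Pod A", 15, 10), ("Pod B", 10, 10), ("Pod C", 5, 5)],
    "Node 2": [("Pod D", 10, 30), ("Pod E", 15, 20), ("Pod F", 5, 10)],
    "Node 3": [("Pod G", 10, 30), ("Pod H", 20, 10)],
}
candidates = filter_candidates(nodes, ("Pod X", 10, 20))
print(candidates)  # {'Node 1': 0, 'Node 3': 10}
```

Node 2 drops out because Pod D and Pod E are not lower priority than Pod X and still occupy 25 of 30, leaving less than the 10 that Pod X requests.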
16. ### Diagram: Preemptor and Nodes 1 to 3

• Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
• Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
• Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

17. ### Diagram: Node 1

• Node 1 (Capacity = 0/30): (no Pods)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

18. ### Diagram: Node 2

• Node 2 (Capacity = 25/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

19. ### Diagram: Node 3

• Node 3 (Capacity = 10/30): Pod G (Req. = 10, Pri. = 30)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

20. ### Diagram: Preemptor and Nodes 1 to 3

• Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
• Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
• Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
• Preemptor: Pod X (Req. = 10, Pri. = 20)
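21. ### Step 2. Ranking Nodes

• Find the minimum set of Pods that must be evicted
  ◦ Try putting Pods back as far as possible
• Choose the Node with the smallest penalty
  ◦ Number of Pod Disruption Budget (PDB) violations
  ◦ Highest Priority among the Pods that violate a PDB
  ◦ Sum of the Priorities of the evicted Pods
  ◦ Number of evicted Pods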
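The two parts of this step, putting Pods back and computing a penalty, can be sketched as below. Several things here are assumptions rather than facts from the slides: Pods are put back in descending priority order, the four criteria are compared in the listed order as successive tie-breakers (a Python tuple comparison), and PDBs are reduced to a plain set of protected Pod names. `minimal_victims` and `penalty` are hypothetical names. The example uses Node 3 from the diagrams.

```python
CAPACITY = 30  # every Node in the slides has a capacity of 30


def minimal_victims(pods, preemptor):
    """Start with all lower-priority Pods evicted, then put back those that fit."""
    keep = [p for p in pods if p["pri"] >= preemptor["pri"]]
    lower = sorted((p for p in pods if p["pri"] < preemptor["pri"]),
                   key=lambda p: -p["pri"])  # assumed order: highest priority first
    used = sum(p["req"] for p in keep) + preemptor["req"]
    victims = []
    for p in lower:
        if used + p["req"] <= CAPACITY:
            used += p["req"]   # this Pod can stay after all
        else:
            victims.append(p)  # this Pod really has to go
    return victims


def penalty(victims, pdb_protected=frozenset()):
    """Lower is better; tuples compare element by element, left to right."""
    violating = [v for v in victims if v["name"] in pdb_protected]
    return (
        len(violating),                                 # PDB violations
        max((v["pri"] for v in violating), default=0),  # highest violating Priority
        sum(v["pri"] for v in victims),                 # sum of victim Priorities
        len(victims),                                   # number of victims
    )


node3 = [{"name": "Pod H", "req": 20, "pri": 10},
         {"name": "Pod G", "req": 10, "pri": 30}]
pod_x = {"name": "Pod X", "req": 10, "pri": 20}

victims = minimal_victims(node3, pod_x)
print([v["name"] for v in victims])                    # ['Pod H']
print(penalty(victims))                                # (0, 0, 10, 1)
print(penalty(victims, {"Pod H"}))                     # (1, 10, 10, 1)
print(penalty(victims) < penalty(victims, {"Pod H"}))  # True
```

With a penalty tuple per candidate Node, picking the target reduces to taking the minimum, for example `min(candidates, key=lambda n: penalty(victims_by_node[n]))`, where `victims_by_node` is again a hypothetical mapping.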
22. ### Diagram: Preemptor and Nodes 1 to 3

• Node 1 (Capacity = 15/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
• Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
• Node 3 (Capacity = 10/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
• Preemptor: Pod X (Req. = 10, Pri. = 20)

23. ### Diagram: Preemptor and Nodes 1 to 3

• Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
• Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
• Node 3 (Capacity = 20/30): Pod X (Req. = 10, Pri. = 20), Pod G (Req. = 10, Pri. = 30)
• Preemptor: (empty)

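25. ### Affinity + Preemption

• Pod Affinity
  ◦ Requires that a specified other Pod be present
  ◦ Example: place Redis on the same Node as the web app
• Affinity from a high-priority Pod to a low-priority Pod
  ◦ Being low priority, the target Pod falls under "evict everything for now"
  ◦ The Node drops out of the candidates at the filtering stage
  ◦ Low priority → high priority is fine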
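The pitfall can be shown by adding an affinity check to the filtering step. This is a sketch under simplifying assumptions: affinity is modelled as "a Pod with this `app` value must be on the same Node", capacity is 30 as in the earlier diagrams, and all Pod names, the `app` and `affinity` keys, and `can_preempt_on` are made up for illustration.

```python
CAPACITY = 30  # same capacity as the Nodes in the earlier diagrams


def can_preempt_on(node_pods, preemptor):
    """Filtering check: evict all lower-priority Pods, then test fit and affinity."""
    remaining = [p for p in node_pods if p["pri"] >= preemptor["pri"]]
    fits = sum(p["req"] for p in remaining) + preemptor["req"] <= CAPACITY
    wanted = preemptor.get("affinity")
    # The affinity target must still be there after the hypothetical eviction.
    affinity_ok = wanted is None or any(p["app"] == wanted for p in remaining)
    return fits and affinity_ok


redis = {"name": "redis", "app": "redis", "req": 10, "pri": 20, "affinity": "web"}

# High priority -> low priority: the web Pod is "evicted" along with everything else.
node_low_web = [
    {"name": "web", "app": "web", "req": 20, "pri": 10},
    {"name": "batch", "app": "batch", "req": 10, "pri": 10},
]
print(can_preempt_on(node_low_web, redis))   # False

# Low priority -> high priority: the web Pod survives the hypothetical eviction.
node_high_web = [
    {"name": "web", "app": "web", "req": 20, "pri": 30},
    {"name": "batch", "app": "batch", "req": 10, "pri": 10},
]
print(can_preempt_on(node_high_web, redis))  # True
```

In the first case, evicting only the `batch` Pod would have left room for `redis` next to `web`, but the Node is rejected before that finer-grained analysis ever happens.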
26. ### Anti-Affinity + Preemption

• Pod Anti-Affinity
  ◦ Requires that a specified other Pod be absent
  ◦ Example: spread a DB cluster across AZs
• Anti-Affinity that spans Nodes
  ◦ "Evict everything for now" applies only to the Node in question
  ◦ Pods on other Nodes in the same AZ get in the way
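A zone-wide anti-affinity can be added to the same kind of filtering check. Again this is only a sketch: anti-affinity is modelled as "no Pod with this `app` value anywhere in the same zone", and the Node names, zone labels, Pod names, and `can_preempt_on` are all hypothetical.

```python
CAPACITY = 30  # same capacity as the Nodes in the earlier diagrams


def can_preempt_on(target, nodes, preemptor):
    """nodes: dict of name -> {"zone": str, "pods": [pod dicts]}."""
    zone = nodes[target]["zone"]
    # Lower-priority Pods are hypothetically evicted on the target Node only.
    remaining_here = [p for p in nodes[target]["pods"]
                      if p["pri"] >= preemptor["pri"]]
    fits = sum(p["req"] for p in remaining_here) + preemptor["req"] <= CAPACITY
    # Pods on other Nodes in the same zone are untouched by that eviction.
    neighbours = [p for name, n in nodes.items()
                  if name != target and n["zone"] == zone
                  for p in n["pods"]]
    avoided = preemptor.get("anti_affinity")
    conflict = any(p["app"] == avoided for p in remaining_here + neighbours)
    return fits and not conflict


nodes = {
    "node-1": {"zone": "az-a",
               "pods": [{"name": "batch", "app": "batch", "req": 30, "pri": 10}]},
    "node-2": {"zone": "az-a",
               "pods": [{"name": "db-0", "app": "db", "req": 10, "pri": 10}]},
}
db_1 = {"name": "db-1", "app": "db", "req": 10, "pri": 20, "anti_affinity": "db"}

print(can_preempt_on("node-1", nodes, db_1))  # False
print(can_preempt_on("node-2", nodes, db_1))  # True
```

On `node-1` the lower-priority `db-0` on the neighbouring Node still counts against the anti-affinity, so the Node is filtered out; only on `node-2`, where `db-0` itself is among the hypothetically evicted Pods, does the check pass.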
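27. ### Summary

• Resources are secured for important Pods first
  ◦ Cost is controllable and does not grow without bound
• Node filtering + ranking
  ◦ First try evicting everything, then re-examine in detail
• Combining with Affinity needs care
  ◦ There are cases where preemption should be possible but never triggers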