y_taka_23
March 08, 2018

# Pod Preemption Even a Cat Can Understand #k8sjp / Kubernetes Meetup Tokyo 10th

Slides used at Kubernetes Meetup Tokyo #10. They briefly explain Priority and Preemption, alpha features introduced in v1.8.0: their benefits, how they work, and what to watch out for in operation.

## Transcript

1. Even a Cat Can Understand
Pod Preemption
チェシャ猫 (@y_taka_23)
Kubernetes Meetup #10 (2018/03/08)
#k8sjp

2. Contents
● Motivation
○ What is preemption, and what do we gain from it?
● Algorithm overview
○ How does it work internally?
● Limits of the preemption decision
○ What happens when multiple conditions are combined?

3. Pod B: Req. = 5
Node 2: Capacity = 5/30
Pod A: Req. = 10
Node 1: Capacity = 10/30
Pod Y: Req. = 20
Pod X: Req. = 15
Pod Z: Req. = 10
Scheduler’s main loop
Pod Queue

4. Pod B: Req. = 5
Node 2: Capacity = 5/30
Pod X: Req. = 15
Pod A: Req. = 10
Node 1: Capacity = 25/30
Pod Z: Req. = 10
Pod Y: Req. = 20
Scheduler’s main loop
Pod Queue

5. Pod Y: Req. = 20
Pod B: Req. = 5
Node 2: Capacity = 25/30
Pod X: Req. = 15
Pod A: Req. = 10
Node 1: Capacity = 25/30
Pod Z: Req. = 10
Scheduler’s main loop
Pod Queue

6. Pod Y: Req. = 20
Pod B: Req. = 5
Node 2: Capacity = 25/30
Pod X: Req. = 15
Pod A: Req. = 10
Node 1: Capacity = 25/30
Pod Z: Req. = 10
Scheduler’s main loop
Pod Queue

7. Adding Priority

8. Pod Y: Req. = 20, Pri. = 20
Pod B: Req. = 5, Pri. = 30
Node 2: Capacity = 25/30
Pod X: Req. = 15, Pri. = 10
Pod A: Req. = 10, Pri. = 20
Node 1: Capacity = 25/30
Pod Z: Req. = 10, Pri. = 20
Scheduler’s main loop
Pod Queue

9. Pod Y: Req. = 20, Pri. = 20
Pod B: Req. = 5, Pri. = 30
Node 2: Capacity = 25/30
Pod X: Req. = 15, Pri. = 10
Pod A: Req. = 10, Pri. = 20
Node 1: Capacity = 25/30
Pod Z: Req. = 10, Pri. = 20
Scheduler’s main loop
Pod Queue

10. Pod Y: Req. = 20, Pri. = 20
Pod B: Req. = 5, Pri. = 30
Node 2: Capacity = 25/30
Pod A: Req. = 10, Pri. = 20
Node 1: Capacity = 10/30
Pod Z: Req. = 10, Pri. = 20
Scheduler’s main loop
Pod Queue

11. Pod Y: Req. = 20, Pri. = 20
Pod B: Req. = 5, Pri. = 30
Node 2: Capacity = 25/30
Pod Z: Req. = 10, Pri. = 20
Pod A: Req. = 10, Pri. = 20
Node 1: Capacity = 20/30
Scheduler’s main loop
Pod Queue
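
The sequence in slides 8 to 11 can be reproduced with a small sketch. It is a minimal illustration, not the scheduler's actual code: the `place` helper, the dictionary layout, and the rule of evicting the lowest-priority Pods first are assumptions made for this example. Only the Pod and Node numbers come from the slides.

```python
CAPACITY = 30  # every Node in the slides has capacity 30

# Node -> {Pod name: (request, priority)}, as on slide 8
nodes = {
    "Node 1": {"X": (15, 10), "A": (10, 20)},
    "Node 2": {"Y": (20, 20), "B": (5, 30)},
}


def used(pods):
    """Total resources requested by the Pods on one Node."""
    return sum(request for request, _ in pods.values())


def place(nodes, name, request, priority):
    """Place a Pod, evicting lower-priority Pods only if nothing fits as-is.

    Returns (node name, evicted Pod names), or (None, []) if placement fails.
    """
    # Ordinary scheduling: take the first Node with enough free capacity.
    for node, pods in nodes.items():
        if used(pods) + request <= CAPACITY:
            pods[name] = (request, priority)
            return node, []
    # Preemption: evict strictly lower-priority Pods, lowest priority first.
    for node, pods in nodes.items():
        lower = sorted(
            (n for n, (_, pri) in pods.items() if pri < priority),
            key=lambda n: pods[n][1],
        )
        remaining = dict(pods)
        evicted = []
        for victim in lower:
            if used(remaining) + request <= CAPACITY:
                break  # enough room already, stop evicting
            del remaining[victim]
            evicted.append(victim)
        if used(remaining) + request <= CAPACITY:
            remaining[name] = (request, priority)
            nodes[node] = remaining
            return node, evicted
    return None, []


node, evicted = place(nodes, "Z", request=10, priority=20)
print(node, evicted, used(nodes[node]))  # Node 1 ['X'] 20
```

Pod Z fits on neither Node as-is (each has only 5 free), so Pod X (Pri. = 10) is evicted from Node 1 and the Node ends up at 20/30, as on slide 11.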

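12. Why Priority / Preemption matters
● Important Pods are not blocked
○ Previously there was no way to actively evict Pods
○ Especially welcome in environments where Nodes cannot scale
○ Even where scaling is possible, Nodes are slow to start up
● Costs become predictable
○ Without Priority you would need unlimited resources
○ Sudden spikes can also be handled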

13. Pod B: Req. = 10, Pri. = 10
Pod C: Req. = 5, Pri. = 5
Pod A: Req. = 15, Pri. = 10
Node 1: Capacity = 30/30
Pod H: Req. = 20, Pri. = 10
Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 30/30
Pod E: Req. = 15, Pri. = 20
Pod F: Req. = 5, Pri. = 10
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 30/30
Pod X: Req. = 10, Pri. = 20
Preemptor

14. Node selection happens in two stages

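15. Step 1. Filtering Nodes
● Select the Nodes where preemption is possible
○ Assume that every Pod with lower priority than the Pod to be placed has been evicted for the moment
○ Can the target Pod be placed in that state?
● More than resource amounts is considered
○ Settings such as NodeSelector and Taints
○ Node Affinity / Inter-Pod Affinity / Anti-Affinity settings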
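
The resource part of this filter can be sketched as follows, using the cluster from slide 13. This is a simplified illustration under stated assumptions: the `Pod` class and `candidate_nodes` function are names invented for this example, and NodeSelector, Taints, and Affinity checks are left out.

```python
from dataclasses import dataclass

NODE_CAPACITY = 30  # every Node in the slides has capacity 30


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int


# Cluster state from slide 13
cluster = {
    "Node 1": [Pod("A", 15, 10), Pod("B", 10, 10), Pod("C", 5, 5)],
    "Node 2": [Pod("D", 10, 30), Pod("E", 15, 20), Pod("F", 5, 10)],
    "Node 3": [Pod("G", 10, 30), Pod("H", 20, 10)],
}


def candidate_nodes(cluster, preemptor, capacity=NODE_CAPACITY):
    """Return the Nodes on which the preemptor fits once all
    strictly lower-priority Pods are hypothetically removed."""
    candidates = []
    for node, pods in cluster.items():
        survivors = [p for p in pods if p.priority >= preemptor.priority]
        used = sum(p.request for p in survivors)
        if used + preemptor.request <= capacity:
            candidates.append(node)
    return candidates


pod_x = Pod("X", 10, 20)
print(candidate_nodes(cluster, pod_x))  # ['Node 1', 'Node 3']
```

Node 2 drops out because Pods D and E have priority at least as high as Pod X and together leave only 5 units free.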

16. Pod B: Req. = 10, Pri. = 10
Pod C: Req. = 5, Pri. = 5
Pod A: Req. = 15, Pri. = 10
Node 1: Capacity = 30/30
Pod H: Req. = 20, Pri. = 10
Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 30/30
Pod E: Req. = 15, Pri. = 20
Pod F: Req. = 5, Pri. = 10
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 30/30
Pod X: Req. = 10, Pri. = 20
Preemptor

17. Node 1: Capacity = 0/30
Pod X: Req. = 10, Pri. = 20
Preemptor

18. Pod E: Req. = 15, Pri. = 20
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 25/30
Pod X: Req. = 10, Pri. = 20
Preemptor

19. Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 10/30
Pod X: Req. = 10, Pri. = 20
Preemptor

20. Pod B: Req. = 10, Pri. = 10
Pod C: Req. = 5, Pri. = 5
Pod A: Req. = 15, Pri. = 10
Node 1: Capacity = 30/30
Pod H: Req. = 20, Pri. = 10
Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 30/30
Pod E: Req. = 15, Pri. = 20
Pod F: Req. = 5, Pri. = 10
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 30/30
Pod X: Req. = 10, Pri. = 20
Preemptor

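21. Step 2. Ranking Nodes
● Find the minimum set of Pods that must be evicted
○ Try putting Pods back as far as possible
● Choose the Node with the smallest penalty
○ Number of Pod Disruption Budget (PDB) violations
○ Highest Priority among the Pods that violate a PDB
○ Sum of the Priorities of the evicted Pods
○ Number of evicted Pods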
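
A rough sketch of this ranking step follows. Several things here are assumptions rather than anything stated on the slide: Pods are put back greedily in descending priority order, the four criteria are compared in the listed order as successive tie-breakers, the `pdb_violation` flag stands in for a real PDB check, and the two candidate Nodes are hypothetical numbers rather than the cluster drawn in the slides.

```python
from dataclasses import dataclass

CAPACITY = 30


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int
    pdb_violation: bool = False  # hypothetical flag: evicting this Pod breaks a PDB


def minimal_victims(pods, preemptor):
    """Start with all lower-priority Pods evicted, then put back what still fits."""
    kept = [p for p in pods if p.priority >= preemptor.priority]
    lower = sorted(
        (p for p in pods if p.priority < preemptor.priority),
        key=lambda p: p.priority,
        reverse=True,
    )
    victims = []
    for pod in lower:
        total = sum(p.request for p in kept) + pod.request + preemptor.request
        if total <= CAPACITY:
            kept.append(pod)  # the preemptor still fits with this Pod restored
        else:
            victims.append(pod)
    return victims


def penalty(victims):
    """Criteria from the slide, compared in the listed order (lower is better)."""
    violating = [p for p in victims if p.pdb_violation]
    return (
        len(violating),
        max((p.priority for p in violating), default=0),
        sum(p.priority for p in victims),
        len(victims),
    )


# Hypothetical candidate Nodes that already passed Step 1.
candidates = {
    "node-a": [Pod("a1", 20, 10), Pod("a2", 10, 5)],
    "node-b": [Pod("b1", 10, 30), Pod("b2", 10, 15), Pod("b3", 10, 15)],
}
preemptor = Pod("X", 10, 20)

victims = {n: minimal_victims(pods, preemptor) for n, pods in candidates.items()}
best = min(victims, key=lambda n: penalty(victims[n]))
print(best, [p.name for p in victims[best]])  # node-a ['a2']
```

Both Nodes need exactly one eviction and neither violates a PDB, so the sum of victim priorities decides: 5 on node-a against 15 on node-b.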

22. Pod B: Req. = 10, Pri. = 10
Pod C: Req. = 5, Pri. = 5
Pod A: Req. = 15, Pri. = 10
Node 1: Capacity = 15/30
Pod H: Req. = 20, Pri. = 10
Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 10/30
Pod E: Req. = 15, Pri. = 20
Pod F: Req. = 5, Pri. = 10
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 30/30
Pod X: Req. = 10, Pri. = 20
Preemptor

23. Pod B: Req. = 10, Pri. = 10
Pod C: Req. = 5, Pri. = 5
Pod A: Req. = 15, Pri. = 10
Node 1: Capacity = 30/30
Pod X: Req. = 10, Pri. = 20
Pod G: Req. = 10, Pri. = 30
Node 3: Capacity = 20/30
Pod E: Req. = 15, Pri. = 20
Pod F: Req. = 5, Pri. = 10
Pod D: Req. = 10, Pri. = 30
Node 2: Capacity = 30/30
Preemptor

24. Is this algorithm perfect?

25. Affinity + Preemption
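● Pod Affinity
○ Requires that a specified other Pod is present
○ Example: place Redis on the same Node as the web app
● Affinity from high priority → low priority
○ A low-priority Pod becomes a target of the "evict everything for now" step
○ The Node drops out of the candidates at the filtering stage
○ Low priority → high priority is fine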
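
The failure mode can be illustrated with a toy version of the filtering step. This is a sketch under assumptions, not the scheduler's real logic: affinity is reduced to "a Pod with this name must be on the same Node", and the `affinity_to` field and `passes_filter` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

CAPACITY = 30


@dataclass(frozen=True)
class Pod:
    name: str
    request: int
    priority: int
    affinity_to: Optional[str] = None  # hypothetical: Pod required on the same Node


def passes_filter(node_pods, preemptor):
    """Step 1 check: remove every lower-priority Pod, then test fit and affinity."""
    survivors = [p for p in node_pods if p.priority >= preemptor.priority]
    if sum(p.request for p in survivors) + preemptor.request > CAPACITY:
        return False
    if preemptor.affinity_to is None:
        return True
    # The affinity target must still be present after the hypothetical eviction.
    return any(p.name == preemptor.affinity_to for p in survivors)


redis = Pod("redis", 10, priority=20, affinity_to="web")
batch = Pod("batch", 20, priority=5)

low_web = [Pod("web", 10, priority=10), batch]   # web has lower priority than redis
high_web = [Pod("web", 10, priority=30), batch]  # web has higher priority than redis

print(passes_filter(low_web, redis))   # False
print(passes_filter(high_web, redis))  # True
```

In the first case evicting only `batch` would make room for `redis` next to `web`, but the filter removes `web` as well, so the Node is rejected before that option is ever considered.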

26. Anti-Affinity + Preemption
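● Pod Anti-Affinity
○ Requires that a specified other Pod is absent
○ Example: spread a DB cluster across AZs
● Anti-Affinity that spans Nodes
○ Only the Node in question gets the "evict everything for now" treatment
○ Pods on other Nodes in the same AZ get in the way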
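
A toy model of this case, again as a sketch under assumptions: anti-affinity is reduced to "no Pod with this label anywhere in the same AZ", resource checks are ignored, and the node names, zone names, and `passes_filter` function are hypothetical.

```python
def passes_filter(target, nodes, preemptor_priority, avoid_label):
    """Hypothetical Step 1 check for zone-wide anti-affinity (resources ignored)."""
    zone = nodes[target]["zone"]
    for name, node in nodes.items():
        if node["zone"] != zone:
            continue
        for label, priority in node["pods"]:
            # Only lower-priority Pods on the target Node are assumed evicted.
            evicted = name == target and priority < preemptor_priority
            if label == avoid_label and not evicted:
                return False
    return True


nodes = {
    "node-1": {"zone": "az-1", "pods": [("batch", 5)]},
    "node-2": {"zone": "az-1", "pods": [("db", 5)]},
    "node-3": {"zone": "az-2", "pods": [("db", 5)]},
}

# A new db replica with priority 20 that must not share an AZ with another db Pod.
print(passes_filter("node-1", nodes, 20, "db"))  # False
print(passes_filter("node-2", nodes, 20, "db"))  # True
```

The low-priority `db` Pod on `node-2` could in principle be preempted, but when `node-1` is the candidate it stays in place and blocks the placement.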

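27. Summary
● Resources are secured for important Pods first
○ Costs are controllable and do not grow without bound
● Node filtering + ranking
○ First try evicting everything, then examine more closely
● Take care when combining with Affinity
○ There are cases where preemption should be possible but is not triggered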

28. Preempt the Preemption!
Presented by チェシャ猫 (@y_taka_23)