SIG-Node 2020-05-12, experiences of advanced resource management in Kubernetes

Screencast used in demo:
https://asciinema.org/a/327044

Alexander D. Kanevskiy

May 12, 2020

Transcript

  1. Alexander Kanevskiy, Krisztian Litkey, Ismo Puustinen
    2020-05-12

  2. Agenda
    § Why?
    § Demo
    § What do we know about
    – Hardware in general
    – CPUs
    – Memory
    – UX

  3. The stock Kubernetes + stock containerd, with some addons

  6. System devices topology
    [Diagram labels: Node 0 and Node 1; Socket 0 and Socket 1, each with Core 0 to Core 11, a Memory Controller, PCIe, and UPI]

  7. System topology in real world
    [Diagram labels: Node 0 to Node 5; Package 0 and Package 1, each with Core 0 to Core 9; Memory Controllers; PCIe; UPI; DMI; Chipset; QAT x16 (three instances); I/O Hub; 4x10G NIC]

  8. CPU
    Things to keep in mind for CPUs
    § CPU cores vs. threads
    § CPU core frequencies: base, turbo, throttling
    § CPU usage: Shared, Exclusive, Isolated
    § Additional CPU resources: Cache, Memory Bandwidth
    § Interconnect is an expensive resource
    § “C” in “NUMA” stands for “CPU”
    – Vendors provide BIOS-configurable settings that redefine what a NUMA node means (SNC, NPS, …)
    § Workload migration cost: low to very low
    § What counts as a CPU for the Kubelet might not mean the same thing inside VM-based runtimes
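
The cores-versus-threads point above can be made concrete with a small sketch. It assumes the Linux sysfs file `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, which the slide does not mention, and it works on hard-coded sample values rather than reading a real machine. The helper names are hypothetical.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_threads_by_core(siblings_by_cpu):
    """Group logical CPUs (hardware threads) that share a physical core.

    siblings_by_cpu maps a logical CPU id to the cpulist string found in its
    thread_siblings_list file. CPUs reporting the same sibling set are
    threads of the same core.
    """
    cores = {}
    for siblings in siblings_by_cpu.values():
        key = tuple(parse_cpulist(siblings))
        cores[key] = list(key)
    return sorted(cores.values())


# Assumed sample: 4 logical CPUs, 2 physical cores with 2 threads each.
sample = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}
cores = group_threads_by_core(sample)
print(len(sample), "logical CPUs on", len(cores), "physical cores:", cores)
```

A request for two "CPUs" on this sample could land on one core (CPUs 0 and 2) or on two cores (CPUs 0 and 1), which is why the distinction matters for placement.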

  9. Topology-Aware CPU policy
    [Diagram: tree with System root at the top; Socket 0 and Socket 1 below it; Die 0 and Die 1 under each socket; two CPUs under each die (CPU 0 to CPU 7); further groups elided (“Group …”). Labels: physical_package_id, die_id, core_id]
    For each leaf node
    § Groups of CPU+Memory
    § Dynamic pools
    – Shared
    – Exclusive
    – Isolated
    – “Throttled”
    – …
    Parent nodes
    § Sum of subtree resources
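
A minimal sketch of the tree described on this slide, assuming each logical CPU is tagged with the `(physical_package_id, die_id, core_id)` triple named in the diagram. The class and function names are hypothetical, and the "deepest node that fits" search is an illustrative assumption, not the policy's actual allocation algorithm.

```python
class PoolNode:
    """One node of the topology tree; only leaves hold CPUs directly."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.cpus = []

    def child(self, name):
        if name not in self.children:
            self.children[name] = PoolNode(name)
        return self.children[name]

    def all_cpus(self):
        """A parent's resources are the sum of its subtree's resources."""
        total = list(self.cpus)
        for node in self.children.values():
            total.extend(node.all_cpus())
        return sorted(total)


def build_tree(topology):
    """topology: {cpu_id: (physical_package_id, die_id, core_id)}."""
    root = PoolNode("root")
    for cpu, (pkg, die, _core) in sorted(topology.items()):
        root.child(f"socket{pkg}").child(f"die{die}").cpus.append(cpu)
    return root


def smallest_fitting_node(node, wanted):
    """Return the deepest node whose subtree has at least `wanted` CPUs."""
    if len(node.all_cpus()) < wanted:
        return None
    for child in node.children.values():
        found = smallest_fitting_node(child, wanted)
        if found is not None:
            return found
    return node


# The layout drawn on the slide: 2 sockets x 2 dies x 2 CPUs.
topology = {cpu: (cpu // 4, (cpu // 2) % 2, cpu % 2) for cpu in range(8)}
root = build_tree(topology)
print(root.children["socket0"].all_cpus())   # [0, 1, 2, 3]
print(smallest_fitting_node(root, 2).name)   # die0
print(smallest_fitting_node(root, 3).name)   # socket0
```

Keeping a request inside the smallest subtree that can hold it is one way to avoid crossing die or socket boundaries unless the request is too large to fit below them.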

  10. Memory
    Things to keep in mind for Memory
    § Memory types
    – DRAM
    – “Persistent”, in “volatile, system RAM mode” (PMEM)
    – High Bandwidth (HBM)
    § Kernel’s “Normal” vs. “Movable”
    § NUMA
    – Distances
    – Have CPU
    – Have “normal” memory
    § Workload migration costs: medium to HIGH

  11. MemTier Topology-Aware policy
    [Diagram labels: System root; Socket 0, Socket …; Die 0, Die …; IO and Memory; CPUs; Core; Thread; DRAM, PMEM, HBM; NUMA A, NUMA B, NUMA C]
    Each Node
    § CPU
    – CPU-less NUMA nodes are linked to nodes with CPUs
    § All memory types
    – DRAM
    – PMEM
    – HBM
    § Placement Cost calculated based on
    – Requested memory type(s) and amount of available memory
    – Later: BW, WSS
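
One way to picture the linking of CPU-less NUMA nodes is sketched below. The slide does not say how the link is chosen; this sketch assumes the closest CPU-bearing node by NUMA distance (the kind of matrix Linux exposes under `/sys/devices/system/node/nodeN/distance`). The distance values, memory sizes, and function names are made up for illustration.

```python
def link_cpuless_nodes(distances, nodes_with_cpus):
    """Link each CPU-less NUMA node to its closest node that has CPUs.

    distances[i][j] is the NUMA distance from node i to node j.
    Ties are broken by the lower node id.
    """
    links = {}
    for node in range(len(distances)):
        if node in nodes_with_cpus:
            continue
        links[node] = min(nodes_with_cpus,
                          key=lambda cpu_node: (distances[node][cpu_node], cpu_node))
    return links


def can_place(free_by_type, requested_types, amount):
    """True if the requested memory types together have `amount` free."""
    return sum(free_by_type.get(kind, 0) for kind in requested_types) >= amount


# Assumed layout: nodes 0 and 1 have CPUs and DRAM, nodes 2 and 3 are PMEM only.
distances = [
    [10, 21, 17, 28],
    [21, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
links = link_cpuless_nodes(distances, nodes_with_cpus={0, 1})
print(links)  # {2: 0, 3: 1}

# Free memory (GiB) of the group formed by node 0 and its linked node 2.
group_free = {"DRAM": 8, "PMEM": 64}
print(can_place(group_free, ["DRAM"], 16))          # False
print(can_place(group_free, ["DRAM", "PMEM"], 16))  # True
```

The second check shows why the requested memory type(s) matter: the same 16 GiB request fails on DRAM alone but fits once PMEM is allowed.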

  12. Linux OS Memory Tiering
    Principle of operation
    § Memory pages promoted from PMEM to DRAM when capacity available
    § Cold pages in DRAM demoted to PMEM
    [Diagram labels: Page Promotions, Page Demotions, DDR]
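
The promote/demote principle can be illustrated with a toy model. This is not how the kernel implements tiering; the access counters, the hotness threshold, and the page-count capacity below are simplifying assumptions chosen only to show the two directions of movement.

```python
def rebalance(dram, pmem, dram_capacity, hot_threshold):
    """Toy tiering pass over {page: access_count} maps for DRAM and PMEM.

    1. Demote: DRAM pages accessed fewer than hot_threshold times go to PMEM.
    2. Promote: the hottest PMEM pages go to DRAM while DRAM has free slots.
    Returns new (dram, pmem) dicts; the inputs are left untouched.
    """
    dram, pmem = dict(dram), dict(pmem)

    # Demotion of cold pages frees DRAM capacity.
    for page in [p for p, hits in dram.items() if hits < hot_threshold]:
        pmem[page] = dram.pop(page)

    # Promotion of hot pages, hottest first, while capacity is available.
    hot = sorted((p for p, hits in pmem.items() if hits >= hot_threshold),
                 key=lambda p: (-pmem[p], p))
    for page in hot:
        if len(dram) >= dram_capacity:
            break
        dram[page] = pmem.pop(page)
    return dram, pmem


new_dram, new_pmem = rebalance(
    dram={"a": 9, "b": 0},
    pmem={"c": 7, "d": 5, "e": 1},
    dram_capacity=2,
    hot_threshold=3,
)
print(sorted(new_dram))  # ['a', 'c']
print(sorted(new_pmem))  # ['b', 'd', 'e']
```

Page `d` is hot but stays in PMEM because DRAM is full after `c` is promoted, which mirrors the "when capacity available" condition on the slide.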

  13. User-friendly resource controls
    § Good UX that spans public cloud, VMs, and bare metal is hard
    – Especially for non-trivial resources (block I/O, caches, …)
    § Placing a workload can run into situations where no placement is possible
    – Reject?
    – Rebalance?
    § Rebalancing running workloads can also be hard
    – Assigned devices
    – Memory migrations
    – Priorities
    The story of jar, rocks, pebbles and sand…
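
The jar analogy can be sketched as a first-fit placement that either follows arrival order or places the largest workloads first. The workload names, sizes, and node capacities are invented for the example, and a real policy would weigh far more than CPU counts.

```python
def place(workloads, capacity, largest_first=True):
    """First-fit placement of (name, cpus) workloads onto nodes.

    capacity maps node name to free CPUs. Returns (placement, rejected),
    where rejected lists workloads that fit on no node.
    """
    free = dict(capacity)
    if largest_first:
        # sorted() is stable, so equal sizes keep their arrival order.
        order = sorted(workloads, key=lambda w: -w[1])
    else:
        order = list(workloads)

    placement, rejected = {}, []
    for name, need in order:
        target = next((n for n in sorted(free) if free[n] >= need), None)
        if target is None:
            rejected.append(name)
        else:
            free[target] -= need
            placement[name] = target
    return placement, rejected


arrival = [("sand-a", 1), ("sand-b", 1), ("rock-a", 3), ("rock-b", 3)]
capacity = {"node0": 4, "node1": 4}

print(place(arrival, capacity, largest_first=False)[1])  # ['rock-b']
print(place(arrival, capacity, largest_first=True)[1])   # []
```

In arrival order the two small workloads fragment `node0` and the last large one is rejected, even though the total demand (8 CPUs) equals the total capacity. Fixing that after the fact means rebalancing workloads that are already running, which is the hard case the slide points at.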

  14. What next?
    UX is the key in our opinion
    § Users should expect “it just works great” by default
    § Advanced users should be able to apply good patterns to resource groups
    – Affinity/anti-affinity pattern
    – Device pipelines
    § Solutions we build now should align with where hardware is evolving
    § Maybe not in Kubelet…?

  16. Description of the functionality
    CRI Resource Manager
    https://github.com/intel/cri-resource-manager
    § is a Container Runtime Interface proxy
    § sits between CRI Clients and the CRI Runtime
    § applies (hardware) resource policies to containers
    – CPU, Memory, Cache, Memory Bandwidth, Block I/O, …
    § policies are applied by
    – modifying proxied container requests, or
    – generating container update requests, or
    – triggering extra policy-specific actions during request processing
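
The first mechanism, modifying a proxied container request, can be sketched as follows. This is not the CRI Resource Manager code or the real CRI message format: requests are plain dicts, and `cpu_count`, `cpuset_cpus`, `pin_cpus_policy`, and `Proxy` are hypothetical names used only to show where a policy sits between the client and the runtime.

```python
import copy


def pin_cpus_policy(request, free_cpus):
    """Hypothetical policy: assign exclusive CPUs by rewriting the request."""
    wanted = request["resources"]["cpu_count"]
    if wanted > len(free_cpus):
        raise RuntimeError("not enough free CPUs for exclusive allocation")
    chosen = sorted(free_cpus)[:wanted]
    modified = copy.deepcopy(request)  # never mutate the client's request
    modified["resources"]["cpuset_cpus"] = ",".join(str(c) for c in chosen)
    return modified, set(free_cpus) - set(chosen)


class Proxy:
    """Sits between a client and a runtime, applying the policy in between."""

    def __init__(self, runtime, free_cpus):
        self.runtime = runtime          # any callable that accepts a request
        self.free_cpus = set(free_cpus)

    def create_container(self, request):
        modified, self.free_cpus = pin_cpus_policy(request, self.free_cpus)
        return self.runtime(modified)


# A fake runtime that just records what it receives.
received = []
proxy = Proxy(runtime=received.append, free_cpus={0, 1, 2, 3})
original = {"name": "demo", "resources": {"cpu_count": 2}}
proxy.create_container(original)

print(received[0]["resources"]["cpuset_cpus"])  # 0,1
print(sorted(proxy.free_cpus))                  # [2, 3]
```

Because the change happens in the proxy, neither the client nor the runtime in this sketch needs to know that a policy was applied.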

  17. Legal notices and disclaimers
    § Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation.
    § Performance varies depending on system configuration.
    § No computer system can be absolutely secure.
    § Check with your system manufacturer or retailer or learn more at www.intel.com.
    § Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
    § *Other names and brands may be claimed as the property of others.
    § © Intel Corporation
