
IBM Dev Day: Running AI Workloads on Nomad

In this presentation, I dive into how to think about AI workloads and why HashiCorp Nomad is the right foundation for an AI platform.

This version of the talk was given at IBM Dev Day: AI Demystified in January 2026.

Kerim Satirli

January 29, 2026


Transcript

  1. Running AI Workloads on Nomad
     IBM Dev Day: AI Demystified
     January 29, 2026
     Kerim Satirli, Senior Developer Advocate II
  2. Infrastructure Provisioning
     • build images with Packer and Ansible
     • provision infrastructure with Terraform
     • orchestrate workloads with Nomad
     • secure access and data with Vault
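     Not shown in the deck, but a minimal Terraform sketch of the provisioning step, assuming an AWS
     target and a Packer-built image named nomad-client-* (all resource names are hypothetical):

     # provision a GPU-capable Nomad client from a Packer-built image (hypothetical names)
     data "aws_ami" "nomad_client" {
       most_recent = true
       owners      = ["self"]

       filter {
         name   = "name"
         values = ["nomad-client-*"]   # image baked with Packer and Ansible
       }
     }

     resource "aws_instance" "nomad_client" {
       ami           = data.aws_ami.nomad_client.id
       instance_type = "g5.xlarge"     # GPU-accelerated instance class

       tags = {
         Name = "nomad-client-gpu"
       }
     }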
  3. Infrastructure Landscape
     Local / Edge:
     • NVIDIA Jetson (100 TOPS)
     • NVIDIA Jetson (100 TOPS)
     • Windows Server (1300 TOPS)
     • other connected clients (not GPU accelerated)
     Cloud:
     • IBM Cloud Compute
     • AWS EC2
     • Azure VMs
     • other supporting infrastructure (file and data storage, etc.)
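     One way to make this edge/cloud split visible to the Nomad scheduler is per-client
     configuration; a sketch with hypothetical values (the deck does not show this file):

     # client.hcl on an edge device (hypothetical values)
     client {
       enabled    = true
       node_pool  = "edge"       # keep edge devices out of the default pool
       node_class = "jetson"

       meta {
         accelerator = "jetson-orin"   # free-form metadata, usable in constraints
       }
     }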
  4. Job Specification Overview
     Job: defines basic job properties
     • datacenter and region
     • type of job
     • update strategy
     Group: defines how to co-locate tasks
     • network config
     • volume config
     • service discovery
     Task: defines atomic units of work
     • driver selection
     • task environment
     • resource requirements
  5. Job Specification Overview: Affinity and Constraints
     Job: defines basic job properties
     • datacenter and region
     • type of job
     • update strategy
     Group: defines how to co-locate tasks
     • network config
     • volume config
     • service discovery
     Task: defines atomic units of work
     • driver selection
     • task environment
     • resource requirements
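     Affinity and constraint blocks can be placed at any of the three levels; a minimal sketch
     (not one of the deck's examples):

     job "example" {
       constraint {                      # job level: applies to every group and task
         attribute = "${attr.kernel.name}"
         value     = "linux"
       }

       group "app" {
         affinity {                      # group level: a preference, not a hard rule
           attribute = "${node.class}"
           value     = "gpu"
           weight    = 50
         }

         task "run" {
           driver = "docker"

           config {
             image   = "busybox:1.36"
             command = "env"
           }
         }
       }
     }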
  6. Basic Job Spec (raw-exec.nomad.hcl)
     job "raw_exec" {
       datacenters = ["dc1"]
       type        = "batch"

       group "user" {
         task "whoami" {
           driver = "raw_exec"
           user   = "nomad"

           config {
             command = "/usr/bin/whoami"
             args    = []
           }

           resources {
             cpu    = 100
             memory = 100
           }
         }
       }
     }
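     To try the spec with the standard Nomad CLI (commands are not part of the slide):

     nomad job run raw-exec.nomad.hcl      # submit the batch job
     nomad job status raw_exec             # see where the allocation was placed
     nomad alloc logs <alloc-id> whoami    # should print "nomad"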
  9. OS-constrained Job Spec (run-on-windows.nomad.hcl)
     job "run_on_windows" {
       datacenters = ["dc1"]

       group "windows" {
         constraint {
           attribute = "${attr.kernel.name}"
           operator  = "="
           value     = "windows"
         }

         service {
           name = "windows-iis"
           tags = ["windows", "iis"]
           port = "www"

           check {
             name     = "alive"
             type     = "tcp"
             interval = "10s"
             timeout  = "2s"
           }
         }
  10. Inject Secure Data into Job Spec (certs-from-vault.nomad.hcl)
     job "certs_from_vault" {
       datacenters = ["dc1"]

       group "group" {
         task "sleepy" {
           driver = "exec"

           vault {
             policies      = ["nomad-client"]
             change_mode   = "signal"
             change_signal = "SIGUSR1"
           }

           template {
             destination = "${NOMAD_SECRETS_DIR}/certificate.crt"
             change_mode = "restart"
             data        = <<EOH
     {{ with pkiCert "pki/issue/svcs-dev" "common_name=svcs.dev" "ttl=24h" "ip_sans=127.0.0.1" }}
     {{- .Data.certificate -}}
     {{ end }}
     EOH
           }
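     For the pkiCert template to render, the Vault policy referenced by the job needs permission to
     issue certificates on that path; a minimal sketch of what nomad-client might contain (the policy
     itself is not shown in the deck):

     # nomad-client.policy.hcl (hypothetical contents)
     path "pki/issue/svcs-dev" {
       capabilities = ["create", "update"]
     }

     vault policy write nomad-client nomad-client.policy.hcl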
  12. GPU Workloads with Nomad (nomad-gpu-config.hcl)
     plugin "nomad-device-nvidia" {
       config {
         enabled = true

         # find GPU IDs by running `nvidia-smi -L`
         ignored_gpu_ids = [
           "GPU-4f707ad8-2c9a-3b1f-9a58-1c2e9f0b7c3d",
         ]

         fingerprint_period = "1m"
       }
     }
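     Once the plugin fingerprints the hardware, the devices show up in the client's node status;
     for example:

     nomad node status -verbose <node-id>   # lists fingerprinted nvidia/gpu devices and their attributes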
  13. GPU Workloads with Nomad (nvidia-smi.nomad.hcl)
     job "gpu_workload" {
       datacenters = ["dc1"]
       type        = "batch"

       group "smi" {
         task "smi" {
           driver = "docker"

           config {
             # see https://hub.docker.com/r/nvidia/cuda/tags
             image   = "nvidia/cuda:13.1.1-base-ubuntu24.04"
             command = "nvidia-smi"
           }

           resources {
             device "nvidia/gpu" {
               count = 1
             }
           }
         }
       }
     }
  15. GPU Workloads with Nomad (nvidia-smi.nomad.hcl)
     job "gpu_workload" {
       # <other config hidden>

       group "smi" {
         task "smi" {
           # <other config hidden>

           resources {
             device "nvidia/gpu" {
               count = 1
             }
           }
         }
       }
  16. GPU Workloads with Nomad (nvidia-smi.nomad.hcl)
     job "gpu_workload" {
       # <other config hidden>

       group "smi" {
         task "smi" {
           # <other config hidden>

           resources {
             device "nvidia/gpu" {
               count = 1

               affinity {
                 attribute = "${device.model}"
                 # covers both "H100 PCIe" and "H100 SXM" types of GPUs
                 operator  = "regexp"
                 value     = "H100"
                 weight    = 50
               }
             }
           }
         }
       }
     }
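     An affinity is only a scheduling preference; to require an H100 instead, the device block also
     accepts a constraint (a sketch, not shown in the deck):

     device "nvidia/gpu" {
       count = 1

       constraint {
         attribute = "${device.model}"
         operator  = "regexp"
         value     = "H100"
       }
     }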
  17. GPU Workloads with Nomad (slides 17–19)
     [diagram: a single GPU partitioned into MIG slices, each providing 5 GB of memory and 1 compute unit, allocated to workloads slice by slice]
  20. GPU Workloads with Nomad (nvidia-smi.nomad.hcl)
     job "gpu_workload" {
       datacenters = ["dc1"]
       node_pool   = "gpu_instances"
       # <other config hidden>

       group "smi" {
         task "smi" {
           # <other config hidden>

           resources {
             device "nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb" {
               count = 1
             }
           }
         }
       }
     }
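     Scheduling against a MIG profile assumes the partitions already exist on the host. One way to
     create them with NVIDIA's tooling, assuming GPU index 0 (not covered in the deck):

     nvidia-smi -i 0 -mig 1              # enable MIG mode on GPU 0 (may require a GPU reset)
     nvidia-smi mig -i 0 -cgi 1g.5gb -C  # create a 1g.5gb GPU instance and its compute instance
     nvidia-smi -L                       # MIG devices now appear with their own UUIDs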