DevOpsPorto Meetup 36: Computing and Operations at CERN: From Physical HW to Virtualization and Containers by Ricardo Rocha

DevOpsPorto

February 11, 2020

Transcript

  1. Infrastructure at CERN Scale. Ricardo Rocha - CERN Cloud Team, @ahcorporto, ricardo.rocha@cern.ch
  2. Founded in 1954. Fundamental science: What is 96% of the universe made of? Why isn’t there anti-matter in the universe? What was the state of matter just after the Big Bang?
  3.–11. (image-only slides)
  12. ~70 PB/year, 700 000 cores, ~400 000 jobs, ~30 GiB/s, 200+ sites
  13.–16. Computing at CERN: increased numbers, increased automation (1970 vs 2007; sequence of image slides)
  17. CERN IT Today: 200+ people. Storage, Computing and Monitoring, Databases, Network, DC, … Batch systems and core physics services, but also campus services: hotel, bike service, wifi, … Common for teams to work in 2-week sprints, even for operations. Rota system per team. ServiceNow for end-user support tickets.
  18. Physical Infrastructure: Provisioning in Days or Weeks; Deployment in Minutes or Hours; Update in Minutes or Hours; Utilization Poor; Maintenance Highly Intrusive.
  19. Physical Infrastructure: Provisioning in Days or Weeks; Deployment in Minutes or Hours; Update in Minutes or Hours; Utilization Poor; Maintenance Highly Intrusive.
      Cloud API / Virtualization: Provisioning in Minutes; Deployment in Minutes or Hours; Update in Minutes or Hours; Utilization Good; Maintenance Potentially Less Intrusive.
  20. OpenStack Private Cloud. 3 separate regions (Main, Batch, Point 8) for scalability and rolling upgrades. Regions split into multiple cells, often matching hardware deliveries, with different configurations and capabilities. Single hypervisor type (KVM; used to have Hyper-V as well). (Diagram: the Main region holds Cells 1..N of Compute and GPU Compute nodes running Nova, networked via Neutron.) A sketch of creating a VM through the cloud API follows below.

  21.–22. (repeats of the previous slide's content)
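
As a concrete illustration of driving the cloud through its API, here is a minimal sketch using openstacksdk; the cloud name, image, flavor, network and availability zone are placeholders, not CERN's actual values.

```python
# Minimal sketch (not CERN's actual tooling): create a VM through the
# OpenStack API with openstacksdk. All names below are made up.
import openstack

conn = openstack.connect(cloud="example-cloud")   # reads clouds.yaml

image = conn.image.find_image("cc7-base")          # hypothetical image name
flavor = conn.compute.find_flavor("m2.medium")     # hypothetical flavor
network = conn.network.find_network("internal")    # hypothetical network

server = conn.compute.create_server(
    name="demo-vm-001",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    # The scheduler places the VM in a cell matching the requested
    # availability zone / capabilities (e.g. a GPU cell).
    availability_zone="example-az",
)
server = conn.compute.wait_for_server(server)
print(server.status)
```
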
  23. Networking: flat, segmented network (broadcast domains). Hierarchy of Primary (hypervisor) and Secondary (VM) services. (Diagram: hypervisors on a primary service subnet, S513-V-IP123 137.1XX.43.0/24, and virtual machines on a secondary service subnet, S513-V-VM908 188.1XX.191.0/24.)
  24. OpenStack Private Cloud: automate everything! Puppet-based deployment of all components, including the control plane running on VMs; the same is true for most CERN services. Workflows for all sorts of tasks: onboarding new users, project creation, quota updates, special capabilities (overcommit, pre-emptible instances, backfilling workloads). A sketch of such a workflow step follows below.
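
A minimal sketch of what such a workflow step might look like with openstacksdk's cloud layer; the project name, domain and quota numbers are illustrative, not CERN defaults.

```python
# Sketch of a project-onboarding step, assuming openstacksdk's cloud layer.
import openstack

conn = openstack.connect(cloud="example-cloud")

project = conn.create_project(
    name="physics-analysis-demo",
    description="Example project created by an onboarding workflow",
    domain_id="default",
)

# Apply an initial quota; a later "quota update" request would re-run
# this step with new numbers.
conn.set_compute_quotas(project.id, instances=20, cores=80, ram=160 * 1024)
```
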
  25. Physical Infrastructure: Provisioning in Days or Weeks; Deployment in Minutes or Hours; Update in Minutes or Hours; Utilization Poor; Maintenance Highly Intrusive.
      Cloud API / Virtualization: Provisioning in Minutes; Deployment in Minutes or Hours; Update in Minutes or Hours; Utilization Good; Maintenance Potentially Less Intrusive.
      Containers: Provisioning in Seconds; Deployment in Seconds; Update in Seconds; Utilization Very Good; Maintenance Less Intrusive.
  26. Kubernetes: lingua franca of the cloud. Managed services offered by all major public clouds; multiple options for on-premise or self-managed deployments. Common declarative API for basic infrastructure: compute, storage, networking. Healthy ecosystem of tools offering extended functionality. A minimal example of the declarative API follows below.

  27. (repeat of the previous slide)
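
To make the "common declarative API" point concrete, a minimal example using the official Kubernetes Python client; the namespace and image are placeholders, and the same manifest works unchanged on a managed or self-managed cluster.

```python
# Declare a small Deployment and submit it to whichever cluster the local
# kubeconfig points at; nothing here is cloud-specific.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-web"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "demo-web"}},
        "template": {
            "metadata": {"labels": {"app": "demo-web"}},
            "spec": {"containers": [{"name": "web", "image": "nginx:1.25"}]},
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```
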
  28. GitOps for Automation. We were already doing similar things with Puppet: git as the source of truth for configuration data. Allows multiple deployment models: 1 ⇢ 1, currently the most popular (one application, one cluster); 1 ⇢ * (one application, multiple clusters: HA, blast radius, rolling upgrades); * ⇢ * (separation of roles, improved resource usage). (Diagram: a meta chart is pushed to git, FluxCD pulls it and the Helm Operator applies HelmRelease CRDs.) A toy reconciliation loop follows below.
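
The following toy loop illustrates the GitOps idea (git as the single source of truth that is continuously applied to the cluster); FluxCD and the Helm Operator do this properly with CRDs and Helm releases, and the repository URL and paths here are hypothetical.

```python
# Toy reconciliation loop: pull the desired state from git and apply it.
# This is only an illustration of the pattern, not FluxCD itself.
import subprocess
import time

REPO = "https://gitlab.example.ch/infra/cluster-config.git"  # hypothetical
CHECKOUT = "/tmp/cluster-config"

subprocess.run(["git", "clone", REPO, CHECKOUT], check=True)

while True:
    # Fetch the latest desired state committed via `git push`.
    subprocess.run(["git", "-C", CHECKOUT, "pull", "--ff-only"], check=True)
    # Apply whatever is declared in the repository to the cluster.
    subprocess.run(["kubectl", "apply", "-R", "-f", f"{CHECKOUT}/manifests"], check=True)
    time.sleep(300)  # reconcile every 5 minutes
```
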
  29. Kubernetes: more than just infrastructure management; potential to ease scaling out data analysis on demand. Challenge: re-process the Higgs analysis in under 10 minutes, over a dataset of ~70 TB split into ~25 000 files. Timeline: cluster creation 5 min, image pre-pull 4 min, data stage-in 4 min, processing 90 sec. A sketch of the image pre-pull step follows below.
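
One way the image pre-pull step can be implemented is a DaemonSet that starts the analysis image on every node, so the actual jobs do not pay the image pull time. The registry and image name are placeholders, and this is a sketch rather than the team's actual implementation.

```python
# Pre-pull sketch: run the (placeholder) analysis image on every node so it
# is already in the local cache when the real jobs start.
from kubernetes import client, config

config.load_kube_config()

prepull = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "prepull-analysis"},
    "spec": {
        "selector": {"matchLabels": {"app": "prepull-analysis"}},
        "template": {
            "metadata": {"labels": {"app": "prepull-analysis"}},
            "spec": {
                "containers": [{
                    "name": "prepull",
                    "image": "registry.example.ch/higgs-analysis:latest",  # placeholder
                    "command": ["sleep", "infinity"],  # just keep the image cached
                }]
            },
        },
    },
}

client.AppsV1Api().create_namespaced_daemon_set(namespace="default", body=prepull)
```
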
  30. Setup with OpenStack Magnum: the 70 TB dataset is processed by 25 000 Kubernetes jobs, with results aggregated and visualized interactively.

  31. Cluster on GKE: max 25 000 cores, single region, 3 zones. Same workflow: 70 TB dataset, 25 000 Kubernetes jobs, aggregation, interactive visualization of the results. A sketch of the job fan-out follows below.
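
A sketch of the job fan-out, assuming one Kubernetes Job per group of input files; the image, chunk naming and command-line arguments are hypothetical.

```python
# Fan the processing out as Kubernetes Jobs, one per (illustrative) chunk
# of the input dataset.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

input_chunks = [f"chunk-{i:05d}" for i in range(100)]  # stand-in for ~25 000 file groups

for chunk in input_chunks:
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"higgs-{chunk}"},
        "spec": {
            "backoffLimit": 2,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "analysis",
                        "image": "registry.example.ch/higgs-analysis:latest",  # placeholder
                        "args": ["--input", chunk],  # hypothetical CLI
                    }],
                }
            },
        },
    }
    batch.create_namespaced_job(namespace="default", body=job)
```
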
  32. (image-only slide)
  33. Monitoring: from ~40k machines, more than 3 TB/day compressed. Modular architecture: decoupled producers / consumers, built-in stream processing, multiple backends with different SLAs. A producer-side sketch follows below. Credit: Diogo Lima Nicolau
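
A producer-side sketch assuming a Kafka-style message broker sits between producers and consumers; the broker address, topic and metric format are made up, since the slide does not specify the transport.

```python
# Decoupled producer: publish a metric to a (hypothetical) broker topic;
# consumers with different SLAs read it independently.
import json
import socket
import time

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["monit-broker.example.ch:9092"],  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

metric = {
    "host": socket.gethostname(),
    "timestamp": int(time.time()),
    "metric": "cpu_load_1m",
    "value": 0.42,
}
producer.send("metrics-raw", metric)  # placeholder topic name
producer.flush()
```
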
  34. Credit: Diogo Lima Nicolau

  35. Service Availability Historical View: availability per service, outages integration. Credit: Diogo Lima Nicolau
  36. Alarming. Local (on the machine): simple thresholds / actuators. On dashboards: Grafana alert engine. External: alarm sources integrated with the ticketing system (ServiceNow). A toy local threshold check follows below. Credit: Diogo Lima Nicolau
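
A toy version of the local threshold/actuator idea: a check run on the machine itself that attempts a local remediation before anything escalates. The threshold and service name are illustrative.

```python
# Local alarm sketch: check a threshold and trigger a local actuator.
import os
import subprocess

LOAD_THRESHOLD = 20.0          # illustrative threshold
SERVICE = "example-daemon"     # placeholder service name

load_1m, _, _ = os.getloadavg()

if load_1m > LOAD_THRESHOLD:
    # Actuator: try a local remediation before raising an external alarm.
    subprocess.run(["systemctl", "restart", SERVICE], check=False)
    print(f"load {load_1m:.1f} above {LOAD_THRESHOLD}, restarted {SERVICE}")
```
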
  37. Challenges: do more with similar resources. High Luminosity Large Hadron Collider: x7 collisions per second, x10 data and computing. Machine learning considered for fast simulation, detector triggers, anomaly detection, … Accommodate accelerators and scale this new type of workload: GPUs, TPUs, IPUs, FPGAs, ... A sketch of requesting a GPU on Kubernetes follows below.
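
A sketch of how an accelerator workload can be scheduled on Kubernetes by requesting a GPU in the pod spec; the image and namespace are placeholders.

```python
# Request one GPU for a (placeholder) training pod via the standard
# nvidia.com/gpu resource name.
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-gpu-demo"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "train",
            "image": "registry.example.ch/ml-train:latest",  # placeholder
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```
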
  38. Questions? @ahcorporto ricardo.rocha@cern.ch http://visits.cern/

  39. ML workflow: 1. build and validate the model in a user notebook; 2. train at scale on distributed compute; 3. persistent storage for feedback; 4. serving.