DevOpsPorto Meetup 36: Computing and Operations at CERN: From Physical HW to Virtualization and Containers by Ricardo Rocha

DevOpsPorto
February 11, 2020

Transcript

  1. Infrastructure at CERN Scale
    Ricardo Rocha - CERN Cloud Team
    @ahcorporto
    [email protected]

  2. Founded in 1954
    Fundamental Science:
    What is 96% of the universe made of?
    Why isn’t there antimatter in the universe?
    What was the state of matter just after the Big Bang?

  3.–11. (image-only slides)

  12. ~70 PB/year
    700 000 Cores
    ~400 000 Jobs
    ~30 GiB/s
    200+ Sites

  13.–16. Computing at CERN
    Increased numbers, increased automation
    1970s → 2007

  17. CERN IT Today
    200+ people
    Storage, Computing and Monitoring, Databases, Network, DC, …
    Batch Systems and Core Physics Services
    But also campus services: hotel, bike service, wifi, …
    Common for teams to work in 2-week sprints, even for operations
    Rota system per team
    ServiceNow for end user support tickets


  18.–19.                      Provisioning   Deployment        Update            Utilization  Maintenance
    Physical Infrastructure    Days or Weeks  Minutes or Hours  Minutes or Hours  Poor         Highly Intrusive
    Cloud API Virtualization   Minutes        Minutes or Hours  Minutes or Hours  Good         Potentially Less Intrusive

  20.–22. OpenStack Private Cloud
    3 separate regions (Main, Batch, Point 8)
    Scalability, rolling upgrades
    Regions split into multiple Cells
    Often matching hardware deliveries
    Different configurations and capabilities
    Single hypervisor type (KVM; used to have Hyper-V as well)
    (Diagram: MAIN region with CELL 1 … CELL N; Compute, GPU and Compute cells; Nova Network and Neutron networking)
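
    A minimal sketch of how a user might target a specific region and availability zone (how cells typically surface to users) with openstacksdk; the cloud name, region, image, flavor and zone below are assumptions, not CERN's actual values:

    ```python
    import openstack

    # Connect to a named cloud from clouds.yaml, picking a region explicitly.
    conn = openstack.connect(cloud="cern", region_name="main")

    # Boot a VM into a chosen availability zone (hypothetical names throughout).
    server = conn.create_server(
        name="demo-vm",
        image="cc7-base",            # hypothetical image
        flavor="m2.medium",          # hypothetical flavor
        availability_zone="cell-2",  # hypothetical zone mapping to a cell
        wait=True,
    )
    print(server.status)
    ```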

  23. Networking
    Flat, segmented network (broadcast domains)
    Hierarchy of Primary (hypervisors) and Secondary (VMs) services
    (Diagram: nodes in a cell, each running several VMs; Primary Service S513-V-IP123, 137.1XX.43.0/24, for hypervisors; Secondary Service S513-V-VM908, 188.1XX.191.0/24, for virtual machines)

  24. OpenStack Private Cloud
    Automate everything!
    Puppet-based deployment of all components
    Including the control plane running on VMs
    The same is true for most CERN services
    Workflows for all sorts of tasks:
    Onboarding new users, project creation, quota updates, special capabilities
    Overcommit, pre-emptible instances, backfilling workloads
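
    As an illustration of the onboarding workflows mentioned above (project creation, quota updates), a minimal sketch using the openstacksdk cloud layer; the project name, domain and quota numbers are invented:

    ```python
    import openstack

    conn = openstack.connect(cloud="cern")  # hypothetical cloud name

    # Create a project for a new user/team (names are hypothetical).
    project = conn.create_project(
        name="atlas-analysis",
        domain_id="default",
        description="onboarding example",
    )

    # Give the project an initial compute quota.
    conn.set_compute_quotas(project.id, cores=64, ram=131072, instances=32)
    ```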

  25.                          Provisioning   Deployment        Update            Utilization  Maintenance
    Physical Infrastructure    Days or Weeks  Minutes or Hours  Minutes or Hours  Poor         Highly Intrusive
    Cloud API Virtualization   Minutes        Minutes or Hours  Minutes or Hours  Good         Potentially Less Intrusive
    Containers                 Seconds        Seconds           Seconds           Very Good    Less Intrusive

  26.–27. Kubernetes
    Lingua franca of the cloud
    Managed services offered by all major public clouds
    Multiple options for on-premise or self-managed deployments
    Common declarative API for basic infrastructure: compute, storage, networking
    Healthy ecosystem of tools offering extended functionality
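
    To make the "common declarative API" concrete, a minimal sketch creating a Deployment with the official Kubernetes Python client; the names and image are placeholders:

    ```python
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    # Declare the desired state: 3 replicas of a simple web container.
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="demo"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "demo"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "demo"}),
                spec=client.V1PodSpec(
                    containers=[client.V1Container(name="web", image="nginx:1.17")]
                ),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
    ```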

  28. GitOps for Automation
    We were already doing similar things with Puppet
    Git as the source of truth for configuration data
    Allows multiple choices of deployment model:
    1 ⇢ 1: currently the most popular: one application, one cluster
    1 ⇢ *: one application, multiple clusters (HA, blast radius, rolling upgrades)
    * ⇢ *: separation of roles, improved resource usage
    (Flow: meta chart → git push → FluxCD git pull → HelmRelease CRD → Helm Operator)
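
    A sketch of the HelmRelease object at the end of that flow. In the GitOps model the manifest lives in Git and FluxCD applies it, but for illustration the same helm.fluxcd.io/v1 object (the API used by the Helm Operator of that era) can be created directly; the chart repository, name and values are invented:

    ```python
    from kubernetes import client, config

    config.load_kube_config()

    # A HelmRelease tells the Helm Operator which chart to install and how.
    helm_release = {
        "apiVersion": "helm.fluxcd.io/v1",
        "kind": "HelmRelease",
        "metadata": {"name": "demo-app", "namespace": "default"},
        "spec": {
            "releaseName": "demo-app",
            "chart": {
                "repository": "https://example.org/charts",  # hypothetical repo
                "name": "demo-app",
                "version": "1.2.3",
            },
            "values": {"replicaCount": 2},  # hypothetical chart values
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="helm.fluxcd.io", version="v1", namespace="default",
        plural="helmreleases", body=helm_release,
    )
    ```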

  29. Kubernetes
    More than just infrastructure management
    Potential to ease scaling out data analysis on demand
    Challenge: re-processing the Higgs analysis in under 10 min
    Processing a dataset of ~70 TB split into ~25000 files

    Cluster Creation  Image Pre-Pull  Data Stage-In  Process
    5 min             4 min           4 min          90 sec
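
    A rough sketch of the fan-out pattern behind the 25000-job challenge: split the file list into chunks and submit one Kubernetes Job per chunk. The file URLs, image, namespace and chunk size are placeholders, not the actual analysis setup:

    ```python
    from kubernetes import client, config

    config.load_kube_config()
    batch = client.BatchV1Api()

    # Placeholder input list; the real dataset is ~25000 files on EOS.
    files = [f"root://eos.example/higgs/file-{i}.root" for i in range(25000)]
    chunk_size = 100  # files per job, hypothetical

    for n, start in enumerate(range(0, len(files), chunk_size)):
        chunk = files[start:start + chunk_size]
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"higgs-{n}"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(
                            name="analysis",
                            image="example/higgs-analysis:latest",  # hypothetical image
                            args=chunk,  # each job processes its own file chunk
                        )],
                    )
                )
            ),
        )
        batch.create_namespaced_job(namespace="higgs", body=job)  # namespace assumed to exist
    ```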

  30. OpenStack Magnum
    (Flow: 70 TB dataset → 25000 Kubernetes jobs → job results → aggregation → interactive visualization)
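
    Creating the cluster itself through Magnum is a single CLI call; a sketch wrapping it in Python, with a hypothetical template name and node count:

    ```python
    import subprocess

    # Create a Kubernetes cluster from an existing Magnum cluster template.
    subprocess.run(
        [
            "openstack", "coe", "cluster", "create",
            "--cluster-template", "kubernetes-1.15",  # hypothetical template
            "--node-count", "100",                    # hypothetical size
            "higgs-challenge",
        ],
        check=True,
    )
    ```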

  31. Cluster on GKE
    Max 25000 cores; single region, 3 zones
    (Flow: 70 TB dataset → 25000 Kubernetes jobs → job results → aggregation → interactive visualization)

  32. (image-only slide)

  33. Monitoring
    From ~40k machines
    More than 3 TB/day compressed
    Modular architecture
    Decoupled producers / consumers
    Built-in stream processing
    Multiple backends with different SLAs
    Credit: Diogo Lima Nicolau
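
    The transport is not named on this slide; assuming a Kafka-style bus between the decoupled producers and consumers, a minimal producer sketch with kafka-python (brokers, topic and payload are invented):

    ```python
    import json
    from kafka import KafkaProducer

    # Publish a metric sample to the monitoring bus as JSON.
    producer = KafkaProducer(
        bootstrap_servers=["monit-kafka.example:9092"],  # hypothetical brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    producer.send("metrics", {"host": "node-042", "metric": "load1", "value": 3.2})
    producer.flush()
    ```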

  34. (diagram) Credit: Diogo Lima Nicolau

  35. Service Availability
    Historical view
    Availability per service
    Outages integration
    Credit: Diogo Lima Nicolau

  36. Alarming
    Local (on the machine): simple thresholds / actuators
    On dashboards: Grafana alert engine
    External: alarm sources integrated with the ticketing system (ServiceNow)
    Credit: Diogo Lima Nicolau
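
    A toy version of the local "simple threshold / actuator" idea: sample a metric on the machine and fire an action when it crosses a threshold. The metric, threshold and actuator below are placeholders:

    ```python
    import shutil
    import subprocess

    THRESHOLD = 0.90  # alarm above 90% disk usage (hypothetical)

    # Metric: fraction of the root filesystem in use.
    usage = shutil.disk_usage("/")
    fraction = usage.used / usage.total

    if fraction > THRESHOLD:
        # Actuator: here just a journal cleanup; a real setup might open
        # a ServiceNow ticket or restart a service instead.
        subprocess.run(["journalctl", "--vacuum-size=500M"], check=False)
        print(f"disk usage {fraction:.0%} above threshold, actuator fired")
    ```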

  37. Challenges
    Do more with similar resources
    High Luminosity Large Hadron Collider (HL-LHC)
    7x collisions per second, 10x data and computing
    Machine Learning
    Considered for fast simulation, detector triggers, anomaly detection, …
    Accommodate accelerators and scale this new type of workload
    GPUs, TPUs, IPUs, FPGAs, ...

  38. Questions?
    @ahcorporto
    [email protected]
    http://visits.cern/

  39. (Diagram: ML workflow)
    1. Build and validate the model in a user notebook
    2. Train at scale on distributed compute
    3. Persistent storage for feedback
    4. Serving
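
    A sketch of step 2 (train at scale), requesting a GPU through the standard nvidia.com/gpu resource with the Kubernetes Python client; the image, command and namespace are hypothetical:

    ```python
    from kubernetes import client, config

    config.load_kube_config()

    # A single training pod that asks the scheduler for one GPU.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="train-job"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="example/trainer:latest",   # hypothetical image
                command=["python", "train.py"],   # hypothetical entrypoint
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # request one GPU
                ),
            )],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)
    ```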