[2020.02 Meetup] [Talk #2] Diogo Nicolau - Monitoring CERN Data Centre and WLCG Experiments

DevOps Lisbon
February 10, 2020

This talk discusses the monitoring architecture and the challenges encountered in operating and scaling a pipeline that handles billions of events, and presents how users benefit from a central monitoring service for processing and analysing monitoring data.

Diogo is a Software Engineer with experience in designing and operating complex data pipelines. He earned his Master's in Computer Science at Instituto Superior Técnico, with a focus on intelligent decision systems. While still studying, he had the chance to work with several startups, where he helped build everything from recommender systems to marketing newsletters. He joined the CERN IT Department to work on the evolution of the monitoring infrastructure towards modern and scalable pipelines for data ingestion and stream processing. In his spare time, as a true millennial, he enjoys travelling and finding new places to eat as much as watching Netflix.

Transcript

  1. Monitoring CERN DC and WLCG: About myself
     • Software Engineer
     • Joined CERN not so long ago
     • Designing and operating complex data pipelines
  2. How did I end up Monitoring
     “The crazy adventures of someone who just wanted to properly deploy a Recommendations System”
     Image: © 2015 – Susan Rossell
  3. How did I end up Monitoring
     • Me building ML models
     • Me getting monitoring data
  4. The IT Department
     • Over 300 people
     • Enable the laboratory to fulfill its mission
     • Data Center and more: IT Services, Experiments Services, Engineering, Infrastructure
     • Batch, Storage, Network, Web Servers, SW builds, Chip design, Hotel, Bikes
  5. CERN Data Centre: Primary Copy of LHC Data
     • 70k disks
     • 13k servers
     • More than 300 PB on tapes
     • More than 300 000 cores
     • Private OpenStack Cloud
  6. WLCG: LHC Computing Grid
     Tier-0 (CERN)
     • Data distribution
     • Data recording & archiving
     • 20-40 Gbit/s connections to Tier-1s
     Tier-1s (13 centres)
     • Initial data reconstruction
     • Permanent storage
     • Re-processing
     Tier-2s (>150 centres)
     • Simulation
     • End-user analysis
     In numbers: 170 sites worldwide, > 10k users, 250k jobs concurrently, > 600k cores, 15% of CERN resources, > 700 PB storage
     Image credit: CERN
  7. Monitoring Mission
     • Provide Monitoring as a Service for CERN Data Center (DC), IT Services and the WLCG collaboration (e.g. Dashboards, Alarms, Search, Archive)
     • Collect, transport, store and process metrics and logs for applications and infrastructure
  8. Challenges: Rate and Volume
     • From ~40k machines
     • More than 3 TB/day (compressed)
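As a rough back-of-the-envelope reading of the figures on this slide, the sketch below converts 3 TB/day from about 40k machines into average rates. It assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly across machines and across the day; the slide states neither, and the function name `average_rates` is purely illustrative.

```python
# Back-of-the-envelope rates derived from the slide's figures.
# Assumptions (not from the talk): decimal units and a perfectly uniform load.
TB = 10**12  # bytes in a decimal terabyte
MB = 10**6   # bytes in a decimal megabyte


def average_rates(total_bytes_per_day: float, machines: int) -> dict:
    """Return the average daily volume per machine and the aggregate rate."""
    seconds_per_day = 24 * 60 * 60
    return {
        "mb_per_machine_per_day": total_bytes_per_day / machines / MB,
        "aggregate_mb_per_second": total_bytes_per_day / seconds_per_day / MB,
    }


rates = average_rates(3 * TB, 40_000)
print(rates)
```

Under these assumptions each machine contributes about 75 MB of compressed data per day, and the pipeline sustains roughly 35 MB/s on average. Real traffic is bursty, so peak rates would be higher.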
  9. Key Concepts
     1. Modular architecture / built on open source tools
     2. Easy data integration / multiple ingestion endpoints
     3. Decoupled producers & consumers
     4. Built-in stream processing
     5. Support multiple backends with different SLAs
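To illustrate the idea of decoupled producers and consumers with a processing step in between, here is a minimal sketch. It is not the pipeline described in the talk: an in-process `queue.Queue` stands in for whatever transport layer the real system uses, and the event fields (`host`, `value`) and the `value_pct` enrichment are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the event stream


def producer(buffer: queue.Queue, events: list) -> None:
    """Publish events without knowing who will consume them."""
    for event in events:
        buffer.put(event)
    buffer.put(SENTINEL)


def consumer(buffer: queue.Queue, sink: list) -> None:
    """Read events, apply a trivial processing step, write to a backend."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        # Hypothetical enrichment: convert a 0..1 fraction to a percentage.
        sink.append({**event, "value_pct": event["value"] * 100})


def run_pipeline(events: list) -> list:
    buffer = queue.Queue(maxsize=100)  # bounded buffer applies back-pressure
    sink: list = []                    # stands in for a storage backend
    threads = [
        threading.Thread(target=producer, args=(buffer, events)),
        threading.Thread(target=consumer, args=(buffer, sink)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink


events = [{"host": "node1", "value": 0.5}, {"host": "node2", "value": 0.25}]
processed = run_pipeline(events)
print(processed)
```

Because the producer and consumer share only the buffer, either side can be replaced or scaled without changing the other, which is the property the slide highlights.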
  10. Data Center
      Base monitoring on all nodes
      • 40k nodes running Collectd
      • OS and HW metrics and alarms
      Service-specific monitoring
      • Custom or upstream plugins
      • Monit metrics endpoint
  11. Alarming
      Local (on the machine)
      • Simple Threshold / Actuators
      On dashboards
      • Grafana alert engine
      External
      • Alarm source
      Integrated with ticketing system
      • ServiceNow
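The "simple threshold / actuator" pattern for local alarms can be sketched as follows. This is an illustrative assumption of how such a check might look, not the mechanism used at CERN: the class `ThresholdAlarm`, the metric name `disk_used_pct` and the 90% threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ThresholdAlarm:
    """Fire when a metric exceeds a fixed threshold (hypothetical helper)."""
    metric: str
    threshold: float
    # Optional actuator: a corrective action to run when the alarm fires.
    actuator: Optional[Callable[[str, float], None]] = None

    def check(self, value: float) -> bool:
        firing = value > self.threshold
        if firing and self.actuator is not None:
            self.actuator(self.metric, value)
        return firing


actions = []  # records what the actuator was asked to do
alarm = ThresholdAlarm(
    metric="disk_used_pct",
    threshold=90.0,
    actuator=lambda metric, value: actions.append((metric, value)),
)

fired = alarm.check(95.0)  # above the threshold: alarm fires, actuator runs
quiet = alarm.check(42.0)  # below the threshold: nothing happens
print(fired, quiet, actions)
```

Keeping the check and the actuator on the machine itself means a simple problem can be handled locally, without a round trip through the central pipeline.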
  12. The team
      2 Portuguese + 2 Spaniards + 1 Italian + 1 Bulgarian
      We work like any Dev team:
      • 2-week sprints
      • 15-minute dailies
      • 1 member on “rota”
      Teams reach us by:
      • Mattermost
      • Mail
      • SnowTicket
      We maintain our code in git:
      • Puppet
      • Repo per Spark job
      • GitLab CI works like a charm
  13. Monitoring by Numbers
      • Stable production infrastructure
      • Integrated many different monitoring use cases
      • Can scale to its continuous growth
      In numbers:
      • ~900 Active Users
      • > 1000 Dashboards
      • ~1 000 000 Queries/day
      • > 30 Grafana Orgs tapping monit datasources