Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Monitoring CERN Data Center and WLCG Experiments Monitoring CERN DC and WLCG 2 Diogo Nicolau – CERN/IT

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

About myself
• Software Engineer
• Joined CERN not so long ago
• Designing and operating complex data pipelines

Slide 5

Slide 5 text

How did I end up Monitoring
“The crazy adventures of someone who just wanted to properly deploy a Recommendations System”
Image: © 2015 – Susan Rossell

Slide 6

Slide 6 text

How did I end up Monitoring
Me building ML models
Me getting monitoring data

Slide 7

Slide 7 text

The IT Department
• Over 300 people
• Enable the laboratory to fulfill its mission
• Data Center and more: IT Services, Experiments Services, Engineering, Infrastructure
Batch, Storage, Network, Web Servers, SW builds, Chip design, Hotel, Bikes

Slide 8

Slide 8 text

CERN Data Centre: Primary Copy of LHC Data
• 70k disks
• 13k servers
• More than 300 PB on tapes
• More than 300 000 cores
• Private OpenStack Cloud

Slide 9

Slide 9 text

WLCG: LHC Computing Grid
Tier-0 (CERN)
• Data distribution
• Data recording & archiving
• 20-40 Gbit/s connections to Tier-1s
Tier-1s (13 centres)
• Initial data reconstruction
• Permanent storage
• Re-processing
Tier-2s (>150 centres)
• Simulation
• End-user analysis
In numbers: 170 sites worldwide, > 10k users, 250k jobs concurrently, > 600k cores, 15% of CERN resources, > 700 PB storage
Image credit: CERN

Slide 10

Slide 10 text

Monitoring Mission
• Provide Monitoring as a Service for CERN Data Center (DC), IT Services and the WLCG collaboration
• e.g. Dashboards, Alarms, Search, Archive
• Collect, transport, store and process metrics and logs for applications and infrastructure

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Challenges: Rate and Volume
• From ~40k machines
• More than 3 TB/day (compressed)
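As a rough back-of-the-envelope check, the figures on this slide can be turned into average rates. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly across machines and across the day; the slide states neither, and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(bytes_per_day: float, machines: int) -> dict:
    """Return average throughput figures, assuming a perfectly even load."""
    return {
        # Aggregate stream, in megabytes per second.
        "mb_per_second": bytes_per_day / SECONDS_PER_DAY / 1e6,
        # Share of the daily volume attributable to one machine, in megabytes.
        "mb_per_machine_per_day": bytes_per_day / machines / 1e6,
    }


# 3 TB/day (compressed) from about 40k machines, as quoted on the slide.
rates = average_rates(bytes_per_day=3e12, machines=40_000)
print(f"{rates['mb_per_second']:.1f} MB/s on average")
print(f"{rates['mb_per_machine_per_day']:.0f} MB per machine per day")
```

Under these assumptions the compressed stream averages roughly 35 MB/s, or about 75 MB per machine per day. Both are lower bounds, since the slide says "more than" 3 TB/day, and real traffic is unlikely to be this even.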

Slide 13

Slide 13 text

Challenges: Variety and Reliability
• More than 150 producers

Slide 14

Slide 14 text

Key Concepts
1. modular architecture / built on open source tools
2. easy data integration / multiple ingestion endpoints
3. decoupled producers & consumers
4. built-in stream processing
5. support multiple backends with different SLAs
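The idea of decoupled producers and consumers can be sketched with a bounded in-memory buffer. This is only an illustration: the slide does not say which transport sits between the two sides, so Python's standard `queue.Queue` stands in for it here, and the producer names and document fields are hypothetical.

```python
import queue
import threading


def producer(name: str, buffer: queue.Queue, count: int) -> None:
    """Emit `count` metric documents without knowing who will read them."""
    for i in range(count):
        buffer.put({"producer": name, "value": i})


def consumer(buffer: queue.Queue, sink: list) -> None:
    """Drain the buffer into `sink` until a None sentinel arrives."""
    while True:
        doc = buffer.get()
        if doc is None:
            break
        sink.append(doc)


buffer = queue.Queue(maxsize=100)  # bounded buffer absorbs short bursts
stored = []                        # stand-in for a storage backend

reader = threading.Thread(target=consumer, args=(buffer, stored))
reader.start()

writers = [
    threading.Thread(target=producer, args=(f"svc-{n}", buffer, 5))
    for n in range(3)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

buffer.put(None)  # sentinel: all producers are done
reader.join()
print(len(stored))  # 15 documents, 5 from each of the 3 producers
```

Neither side holds a reference to the other; they only share the buffer. That lets producers be added or removed without touching consumers, and the reverse.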

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Observability Pipeline? @lucamag @smithclay

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Data Center
Base monitoring (all nodes)
• 40k nodes running collectd
• OS and HW metrics and alarms
Service-specific monitoring
• Custom or upstream plugins
• Monit metrics endpoint

Slide 19

Slide 19 text

Service Availability
Historical View
• Availability per service
• Outages integration
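One common way to derive a per-service availability figure from recorded outages is sketched below. The slide does not describe how availability is actually computed, so the formula (share of the window not covered by outages) and the function name are assumptions made for illustration.

```python
from datetime import datetime


def availability(window_start, window_end, outages):
    """Percentage of the window not covered by outages.

    `outages` is a list of (start, end) datetimes. Overlapping outages
    are counted once and intervals are clipped to the window.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window must have positive length")
    down = 0.0
    cursor = window_start  # end of the downtime counted so far
    for start, end in sorted(outages):
        start = max(start, cursor)
        end = min(end, window_end)
        if end > start:
            down += (end - start).total_seconds()
            cursor = end
    return 100.0 * (1 - down / total)


# Hypothetical day with two overlapping outages: 01:00-02:00 and 01:30-03:00.
day_start = datetime(2024, 1, 1)
day_end = datetime(2024, 1, 2)
outages = [
    (datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 1, 1, 30), datetime(2024, 1, 1, 3, 0)),
]
result = availability(day_start, day_end, outages)
print(f"{result:.2f}%")  # 2 hours down out of 24 -> 91.67%
```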

Slide 20

Slide 20 text

WLCG Experiments
Transfers classified by location, country, site…
“Shifters” who keep tight control on the dashboard

Slide 21

Slide 21 text

Job Monitoring
Jobs classified by state, source, resource…

Slide 22

Slide 22 text

Alarming
Local (on the machine)
• Simple Threshold / Actuators
On dashboards
• Grafana alert engine
External
• Alarm source
Integrated with ticketing system
• ServiceNow
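A local threshold alarm with an optional actuator could look roughly like the sketch below. The slide does not show how thresholds or actuators are implemented on the machines, so the function, the metric name and the limit are all hypothetical.

```python
from typing import Callable, Optional


def check_threshold(
    metric: str,
    value: float,
    limit: float,
    actuator: Optional[Callable[[str, float], None]] = None,
) -> bool:
    """Return True (alarm) when `value` exceeds `limit`.

    If an actuator is given, it runs once for each alarm raised.
    """
    if value <= limit:
        return False
    if actuator is not None:
        actuator(metric, value)
    return True


actions = []


def log_cleanup(metric: str, value: float) -> None:
    # Hypothetical actuator: only records the action it would take.
    actions.append(f"cleanup triggered: {metric}={value}")


alarm = check_threshold("disk_used_percent", 93.0, limit=90.0, actuator=log_cleanup)
ok = check_threshold("disk_used_percent", 42.0, limit=90.0, actuator=log_cleanup)
print(alarm, ok, actions)
```

Running the check and the corrective action on the node means a simple fix does not depend on the central pipeline being reachable.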

Slide 23

Slide 23 text

Service Alarms
• Service plugins create the required metrics
• Include the Puppet module and override configurations

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

The team
2 Portuguese + 2 Spaniards + 1 Italian + 1 Bulgarian
We work as any Dev team:
• 2-week sprints
• 15-min dailies
• 1 member on “rota”
Teams reach us by:
• Mattermost
• Mail
• SNOW ticket
We maintain our code in git:
• Puppet
• Repo per Spark job
• GitLab CI works like a charm

Slide 26

Slide 26 text

Monitoring by Numbers:
• ~900 Active Users
• > 1000 Dashboards
• ~1 000 000 Queries/day
• > 30 Grafana Orgs tapping monit datasources
Stable production infrastructure
• integrated many different monitoring use cases
• can scale to its continuous growth

Slide 27

Slide 27 text

No content