Grid Monitoring at CERN with the Elastic Stack

Elastic Co
February 18, 2016

Learn how CERN uses the Elastic Stack for five different use cases for the Worldwide LHC Computing Grid: Messaging, Job Monitoring, Data Monitoring, Infrastructure Monitoring, and Cloud Benchmarking.

Transcript

  1. Table of contents (Pablo Saiz, ElasticON 2016, 2/25/16)

    • CERN and Worldwide LHC Computing Grid (WLCG)
    • Elastic for CERN IT
    • Data center monitoring
    • Job Monitoring
    • Service Monitoring
    • Data transfers
    • CERN IT Elastic Service
    Many slides taken from François Briard, International Relations
  2. CERN was founded 1954: 12 European States, “Science for Peace”

    Today: 21 Member States
    • Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the United Kingdom
    • Candidate for Accession: Romania
    • Associate Member in Pre-Stage to Membership: Serbia
    • Applicant States for Membership or Associate Membership: Brazil, Cyprus, Pakistan, Russia, Slovenia, Turkey, Ukraine
    • Observers to Council: India, Japan, Russia, Turkey, United States of America; European Commission and UNESCO
    ~2300 staff, ~1600 other paid personnel, ~11000 users
    Budget ~1000 MCHF
    Well known for:
    • Physics
    • WWW
    • Media: Angels & Demons, FlashForward, The Daily Show
  3. Biggest scientific Grid project in the world

    • ~170 computer centers (sites)
      • 1 Tier 0 (distributed in two locations)
      • 14 bigger centers (Tier 1)
      • ~160 Tier 2
    • 42 countries
    • 10,000 users
    • Running since Oct 2008
    • 3 million jobs per day
    • ~600,000 cores
    • 300 PB data
    • Do you want to contribute?
      • http://lhcathome.web.cern.ch/
  4. Table of contents

    • CERN and Worldwide LHC Computing Grid (WLCG)
    • Elastic for CERN IT
    • Data center monitoring
    • Job Monitoring
    • Service Monitoring
    • Data transfers
    • CERN IT Elastic Service
  5. CERN IT Monitoring

    • Monitoring for Data Center and WLCG
    • Many different tools:
      • Data center, job, data transfers, services
    • Team of ~10 people
    • Different audience:
      • Site/service/experiment managers, end users, general public
    • Multiple use cases:
      • Hosts/Job/Services/Transfers
    • Some use cases with sensitive data:
      • Cloud benchmarking/Syslog
  6. Elasticsearch clusters

    • 3 production clusters
    • Physical machines for data nodes, some with SSD
    • Dedicated master/search nodes on virtual machines
  7. Use case 1: Data center monitoring

    • 2 locations: Geneva, Budapest
    • ~170,000 cores
    • ~90,000 disk drives
    • ~190 PB disk
    • ~160 PB tape
    • ~400,000 completed jobs/day
    • 3.5 MW
    • 450 kW for critical services
    Collect
    • Host/service metrics
    • Alarms
    • Archive data
  8. Data center monitoring architecture

    [Architecture diagram. Labels: Monitoring (Lemon); Batch (LSF); Storage (EOS); …; Alarms; Transport; Archives (batch layer); Displays/Streaming (speed layer); Alerts; Analytics (serving layer)]
  9. Use case 2: WLCG Job Monitoring

    • Goal:
      • Follow up the status of job processing
    • ~3M daily jobs (170M updates) over ~170 sites
    • Multiple applications: interactive view, summaries, accounting…
    • Landing page aggregates last day
      • Aggregation of 60M entries (100 GB)
    • In production (with RDBMS) since 2008
    • Elasticsearch alternative in place
      • Interactive view for one experiment
  10. Architecture

    • Python collectors
    • Apache web server
      • With mod_python
    • JavaScript
      • DataTables, Highcharts…
    • Currently replacing the storage with ES
  11. Performance (bar chart values)

    • Opening the landing page: RDBMS (no cache) 11.8, Elasticsearch 0.6
    • Maximum updates per second: Python + RDBMS 223, Python + Elasticsearch 584
    • One month query: RDBMS (no cache) 1084.3, Elasticsearch 14.1
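
The chart values above can be turned into ratios to make the comparison explicit. This sketch assumes that the landing page and one month query values are durations (lower is better), which the transcript does not state; the variable names are illustrative only.

```python
# Values as they appear on the "Performance" slide.
timings = {
    "Opening the landing page": {"RDBMS (no cache)": 11.8, "Elasticsearch": 0.6},
    "One month query": {"RDBMS (no cache)": 1084.3, "Elasticsearch": 14.1},
}
max_updates_per_second = {"Python + RDBMS": 223, "Python + Elasticsearch": 584}

# ASSUMPTION: timings are durations, so the speedup is old / new.
speedups = {
    name: values["RDBMS (no cache)"] / values["Elasticsearch"]
    for name, values in timings.items()
}

# Throughput is "higher is better", so the gain is new / old.
throughput_gain = (
    max_updates_per_second["Python + Elasticsearch"]
    / max_updates_per_second["Python + RDBMS"]
)

for name, factor in speedups.items():
    print(f"{name}: about {factor:.1f}x faster with Elasticsearch")
print(f"Maximum updates per second: about {throughput_gain:.1f}x higher")
```

Under that assumption the landing page opens roughly 20 times faster, the one month query runs roughly 77 times faster, and the update rate is about 2.6 times higher.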
  12. Use case 3: Infrastructure Monitoring

    • Goal:
      • Follow up status of sites/services using user-defined metrics
    • Metrics:
      • Numerical or string
      • Non-linear in time (independent sources, future values, changing old entries)
      • Keeping only status changes
      • Combining metrics (AND/OR/ANY/ALL/FILTER/OVERWRITE)
    • 2M entries per day (1M raw metrics, 1M combined metrics)
    • Creating ‘Views’ as lists of metrics
    • In production (RDBMS) since 2010
    • Elastic version being developed
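
"Keeping only status changes" can be illustrated with a small filter that drops consecutive samples carrying the same status. This is a sketch under stated assumptions: samples are `(timestamp, status)` pairs, and they are sorted first because the slide notes that entries do not arrive in time order. A real system would also have to re-evaluate stored changes when an old entry is modified, which this example does not attempt.

```python
def status_changes(samples):
    """Return only the samples whose status differs from the previous kept one.

    `samples` is an iterable of (timestamp, status) pairs in any order.
    """
    kept = []
    previous = None
    # Entries may arrive out of order, so sort by timestamp before comparing.
    for timestamp, status in sorted(samples):
        if not kept or status != previous:
            kept.append((timestamp, status))
            previous = status
    return kept


samples = [(3, "OK"), (1, "OK"), (2, "OK"),
           (4, "CRITICAL"), (5, "CRITICAL"), (6, "OK")]
print(status_changes(samples))
```

Six raw samples collapse to three stored entries here: the initial status, the change to CRITICAL, and the recovery.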
  13. Combining metrics

    Operations:
    • OR, AND, AND IF DATA: each with its own status priority
    • <m> OVERWRITE <n>: IF <N> != … THEN <N> ELSE <M>
    • ANY <m>, <t>: OR of all instances in <m> that have the same value in <t>
    • ALL <m>, <t>: AND IF DATA of all instances in <m> that have the same value in <t>
    • FILTER <m>, <t>=‘v’: take only the instances of <m> that have a value of ‘v’ in metric <t>
    Statuses: OK, WARNING, CRITICAL, DOWNTIME, UNKNOWN, NO DATA
    * Will be converted to … if there is at least one more metric
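
The combination operators can be sketched as small functions over status strings. The priority order of the statuses did not survive in this transcript, so the ranking below is purely an assumption made for illustration, as are all function names; only the ANY and ALL definitions (OR, respectively AND IF DATA, over instances sharing a value in a second metric) follow the slide text.

```python
from collections import defaultdict

# ASSUMPTION: illustrative ranking from "best" to "worst"; the slide's real
# priority order is not recoverable from the transcript.
RANK = {"OK": 0, "WARNING": 1, "UNKNOWN": 2, "DOWNTIME": 3, "CRITICAL": 4, "NO DATA": 5}


def combine_or(statuses):
    """OR: the best status wins (assumed semantics)."""
    return min(statuses, key=RANK.__getitem__)


def combine_and(statuses):
    """AND: the worst status wins (assumed semantics)."""
    return max(statuses, key=RANK.__getitem__)


def combine_and_if_data(statuses):
    """AND IF DATA: like AND, but instances reporting NO DATA are ignored."""
    with_data = [s for s in statuses if s != "NO DATA"]
    return combine_and(with_data) if with_data else "NO DATA"


def _group(metric, tag):
    """Group the statuses in `metric` by each instance's value in `tag`."""
    groups = defaultdict(list)
    for instance, status in metric.items():
        groups[tag[instance]].append(status)
    return groups


def any_by(metric, tag):
    """ANY <m>, <t>: OR of all instances in m with the same value in t."""
    return {value: combine_or(s) for value, s in _group(metric, tag).items()}


def all_by(metric, tag):
    """ALL <m>, <t>: AND IF DATA of all instances in m with the same value in t."""
    return {value: combine_and_if_data(s) for value, s in _group(metric, tag).items()}


# Hypothetical data: per-host status (m) and the site each host belongs to (t).
m = {"h1": "OK", "h2": "CRITICAL", "h3": "NO DATA", "h4": "WARNING"}
t = {"h1": "siteA", "h2": "siteA", "h3": "siteB", "h4": "siteB"}
print(any_by(m, t))
print(all_by(m, t))
```

With this assumed ranking, a site counts as OK under ANY as soon as one of its hosts is OK, while ALL reports the worst status among the hosts that delivered data.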
  14. Use case 4: Data Monitoring

    • Goal:
      • Provide transfer/success rate, volume of data movements
    • 25M daily transfers, avg speed 10 GB/s
    • RDBMS version in production since 2010
    • Elastic alternative
      • 1 index: 160M documents, 80 GB
    • Current status: ES index created and being populated; still working on the UI
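
A few back-of-the-envelope figures follow from the numbers on this slide. The sketch assumes decimal units (1 GB = 10^9 bytes) and that the average speed of 10 GB/s holds over a full day; both are assumptions, and the derived values are rough orders of magnitude rather than measurements.

```python
GB = 10**9  # ASSUMPTION: decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
daily_transfers = 25_000_000
avg_speed_gb_per_s = 10
index_documents = 160_000_000
index_size_bytes = 80 * GB

# Average size of one document in the Elasticsearch index.
bytes_per_document = index_size_bytes // index_documents

# Data moved per day if the average speed is sustained for 24 hours.
daily_volume_gb = avg_speed_gb_per_s * SECONDS_PER_DAY

# Average payload per transfer, in megabytes.
mb_per_transfer = daily_volume_gb * 1000 / daily_transfers

print(f"Average document size: {bytes_per_document} bytes")
print(f"Daily volume: {daily_volume_gb / 1000:.0f} TB")
print(f"Average transfer size: {mb_per_transfer:.2f} MB")
```

Under these assumptions each indexed document averages 500 bytes, the grid moves about 864 TB per day, and a typical transfer carries about 35 MB.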
  15. Table of contents

    • CERN and Worldwide LHC Computing Grid (WLCG)
    • Elastic for CERN IT
    • Data center monitoring
    • Job Monitoring
    • Service Monitoring
    • Data transfers
    • CERN IT Elastic Service
  16. Situation before Christmas

    • ~20 different Elasticsearch clusters at CERN
    • Maintained by different teams
    • Multiple versions of ES: 1.4 → 2.X
    • Different security setups
    • Duplication of data
    • Difficult to correlate
    • Need to consolidate!
    • Create a small team to support Elasticsearch at CERN
  17. CERN IT Elasticsearch Service

    • Currently, small team (4 people, part time)
      • And should become even smaller
    • Prototype Elasticsearch 2.2 cluster
    • Couple of clients invited as guinea pigs
    • Goal: production-ready service by Q4 2016
  18. Challenges

    • Number of clusters:
      • Common or dedicated per client?
    • Physical/virtual nodes
      • IO performance or ease of management?
    • Security, ACL
      • Ensure privacy and multitenancy
    • Accounting
      • Only space? Include usage?
    • Migrate data from current clusters
    • Automate (PaaS)
      • Found? Elastic Cloud?
  19. Elastic meetup at CERN, 8 Feb 2016

    • More than 110 people
      • ~70 external, ~40 internal
    • Thanks to the organizers, speakers, guides, volunteers!
  20. Summary

    • CERN IT Monitoring has been using Elastic in production for a couple of years
    • Multiple use cases:
      • Data Center, WLCG Job/Service/Data,…
    • Currently setting up a centralized Elasticsearch Service
      • Plenty of interest from many different clients
      • Goal to provide a production service by Q4 2016
  21. « Magic is not happening at CERN, magic is being explained at CERN. » Tom Hanks