Grid Monitoring at CERN with Elastic

Elastic Co
November 10, 2015

In this talk, Pablo Saiz presents five use cases in which Elastic is used for WLCG monitoring at CERN:
- Messaging
- Job Monitoring
- Data Monitoring
- Infrastructure Monitoring
- Cloud Benchmarking
Pablo will also highlight the future goals with Elastic and the team's plan to expand its Elastic usage at CERN.

Pablo Saiz | Elastic{ON}Tour Munich | November 10, 2015

Transcript

  1. IT-SDC Elastic{ON}Tour in Munich, Pablo Saiz, CERN (05/11/15)

    Table of contents
    ▪ CERN
    ▪ Worldwide LHC Computing Grid (WLCG)
    ▪ Elastic for WLCG Monitoring
    ▪ Messaging
    ▪ Job Monitoring
    ▪ Data Monitoring
    ▪ Infrastructure monitoring
    ▪ Cloud benchmarking
  2. CERN was founded in 1954: 12 European States, “Science for Peace”

    Today: 21 Member States
    Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the United Kingdom
    Candidate for Accession: Romania
    Associate Member in Pre-Stage to Membership: Serbia
    Applicant States for Membership or Associate Membership: Brazil, Cyprus, Pakistan, Russia, Slovenia, Turkey, Ukraine
    Observers to Council: India, Japan, Russia, Turkey, United States of America; European Commission and UNESCO
    ~2300 staff, ~1600 other paid personnel, ~10500 users
    Budget (2014): ~1000 MCHF
    Well known for:
    • Physics
    • WWW
    • Media: Angels & Demons, Flashforward, Daily Show
  3. WLCG

    ▪ Biggest scientific Grid project in the world
    ▪ ~170 computer centers (sites)
      ▪ 1 Tier 0 (distributed in two locations)
      ▪ 12 bigger centers (Tier 1)
      ▪ ~160 Tier 2
    ▪ 42 countries
    ▪ 10,000 users
    ▪ Running since Oct 2008
    ▪ 2 million jobs per day
    ▪ ~600,000 cores
    ▪ 300 PB data
    http://cern.ch/wlcg
  4. WLCG

    Including Max-Planck-Institut für Physik, Munich
  5. WLCG Monitoring

    ▪ Many different tools:
      ▪ Job, data transfers, services
    ▪ Usual architecture:
      ▪ Python, RDBMS, Apache, JavaScript
    ▪ Running since 2007
    ▪ Team of ~10 people
    ▪ Different audiences:
      ▪ Site/service/experiment managers, end users, general public
    http://dashboard.cern.ch
  6. ElasticSearch configuration

    ▪ 2 clusters
      ▪ Production: 8 data, 2 search, 3 master
      ▪ Development: 2 data, 1 search, 1 master
    ▪ Data nodes: physical, 32GB, 32 core
    ▪ Search, master: virtual 8GB
    ▪ ~10 different use cases
  7. Use case 1: Messaging Team (MIG)

    ▪ Goal:
      ▪ Check the status of the machines and services used for Messaging
      ▪ 25 machines, 15 clusters, ~130 applications, 260 M msg/day
    ▪ Using ElasticSearch and Kibana 3 in production since Aug 2014
    ▪ 3 daily indexes, ~3M documents (850MB) per day
    ▪ Documents contain sensitive information
      ▪ Ensure only authorized people can access them
    [Diagram: Messaging infrastructure, Transport layer (Messaging), ESPER, ElasticSearch, Graphite; flows labeled Log, Check, Metric, Status]
    http://cern.ch/messaging
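
The slide mentions daily indexes without saying how they are named. As a minimal sketch, assuming the common one-index-per-day convention of a prefix followed by a date suffix, a name could be derived as below. The function name and the prefix `example-logs` are hypothetical and do not come from the talk.

```python
from datetime import date


def daily_index(prefix: str, day: date) -> str:
    """Build a per-day index name such as 'prefix-YYYY.MM.DD'.

    The naming pattern is an assumption for illustration only.
    """
    return f"{prefix}-{day:%Y.%m.%d}"


# Hypothetical prefix; the talk does not give the real index names.
name = daily_index("example-logs", date(2015, 11, 10))
print(name)
```

With one index per day, old data can be dropped by deleting whole indexes rather than individual documents.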
  8. Use case 2: Job Monitoring

    ▪ Goal: follow up the status of job processing
    ▪ ~2M daily jobs (170M updates) over ~170 sites
    ▪ Multiple applications: interactive view, summaries, accounting…
    ▪ Landing page query with 2 million records aggregation
    ▪ Aggregation 60M entries (100GB)
    ▪ In production (with RDBMS) since 2008
    ▪ Elasticsearch alternative in place
    ▪ Currently checking data consistency
  9. Architecture

    ▪ Python collectors
    ▪ Apache web server
      ▪ With mod_python
    ▪ JavaScript
      ▪ DataTables, Highcharts…
    ▪ Currently replacing the storage with ES
  10. ElasticSearch result

    ▪ Good speed-up in job parsing:
      ▪ From 220 jobs per second to 580 (2.6x)
    ▪ Better speed-up in applications:
      ▪ From 12 seconds to 0.70 s in the most used query
    ▪ More query possibilities:
      ▪ More than one month of aggregation
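
The two figures on this slide measure different things, so their gains are computed in opposite directions: for throughput the new value is divided by the old one, for latency the old value is divided by the new one. The short sketch below reproduces that arithmetic from the slide's numbers; the function names are illustrative only.

```python
def throughput_gain(old_rate: float, new_rate: float) -> float:
    """Higher is better, so the gain is new / old."""
    return new_rate / old_rate


def latency_gain(old_seconds: float, new_seconds: float) -> float:
    """Lower is better, so the gain is old / new."""
    return old_seconds / new_seconds


# Numbers taken from the slide.
parsing = throughput_gain(220, 580)   # jobs parsed per second
query = latency_gain(12.0, 0.70)      # most used query, in seconds

print(f"job parsing: {parsing:.1f}x")
print(f"most used query: {query:.1f}x")
```

The job parsing gain matches the 2.6x quoted on the slide, and the most used query comes out at roughly 17x.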
  11. Performance

    [Three bar charts; some legend labels were truncated in extraction and values are listed in extracted order]
    ▪ Opening the landing page (legend: ElasticSearch, “…S (no cache)”): 11.8, 0.6
    ▪ Maximum updates per second (legend: Python + ElasticSearch, Python + RDBMS): 223, 584
    ▪ One Month Query (legend: ElasticSearch, “…S (no cache)”): 1084.3, 14.1
  12. Use case 3: Infrastructure Monitoring

    ▪ Goal:
      ▪ Follow up status of sites/services using user-defined metrics
      ▪ 2M entries per day (1M raw metrics, 1M combined metrics)
    ▪ Metrics:
      ▪ Numerical or string
      ▪ Non-linear in time (independent sources, future values, changing old entries)
      ▪ Keeping only status changes
      ▪ Combining metrics (AND/OR/ANY/ALL/FILTER)
      ▪ Creating ‘Views’ as lists of metrics
    ▪ In production (RDBMS) since 2010
    ▪ ES alternative ready for data insertion and combining metrics
      ▪ Moving from Python to Java
      ▪ Using JEST interface
      ▪ Using ESPER
      ▪ ESPER/ES significantly reduce the latency of metric combination
      ▪ UI needs to be modified
    http://wlcg-mon.cern.ch/
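
"Keeping only status changes" can be illustrated with a small sketch: given timestamped status samples, store a sample only when its status differs from the previously stored one. This is an illustration under simple assumptions (all samples available at once, integer timestamps), not the production implementation, and it does not cover later corrections to old entries.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, str]  # (timestamp, status); this shape is an assumption


def keep_status_changes(samples: Iterable[Sample]) -> List[Sample]:
    """Return only the samples where the status differs from the last kept one."""
    changes: List[Sample] = []
    # Sort by timestamp first, since independent sources may report out of order.
    for timestamp, status in sorted(samples):
        if not changes or changes[-1][1] != status:
            changes.append((timestamp, status))
    return changes


raw = [(10, "OK"), (0, "OK"), (20, "FAIL"), (30, "FAIL"), (40, "OK")]
compact = keep_status_changes(raw)
print(compact)
```

Here five raw samples collapse to three stored entries: the initial OK, the switch to FAIL, and the recovery.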
  13. How to combine metrics

    Example: Any CE AND (Storage OR Transfer)
    Where CE = Submit AND Efficiency > 80; Storage = Read AND Write; Transfer = CanTransfer

    Raw metrics:
    ▪ Submit: ce1, ce2: OK; ce3: FAIL
    ▪ Efficiency: ce1: 85; ce2: 60; ce3: 90
    ▪ Read: St1: OK; St2: OK
    ▪ Write: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    Topology: Site1: ce1, ce2, St1, Tr1; Site2: ce3, St2
    Downtime: ce3: maint

    HOST level:
    ▪ CE: ce1: OK; ce2, ce3: FAIL
    ▪ Storage: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    HOST level after downtimes:
    ▪ CE: ce1: OK; ce2: FAIL; ce3: MAINT
    ▪ Storage: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    SITE level after topology:
    ▪ CE: Site1: OK; Site2: MAINT
    ▪ Storage: Site1: OK; Site2: FAIL
    ▪ Transfer: Site1: OK
    ▪ Storage + Transfer: Site1: OK; Site2: FAIL
    ▪ Status: Site1: OK; Site2: FAIL
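
The example on this slide can be traced in a few lines of Python. This is a sketch, not the ESPER/ES implementation described in the talk. It relies on several assumptions that the slide does not state: statuses are ranked FAIL < MAINT < OK, AND takes the worst status, OR/ANY takes the best, missing metrics are ignored, a downtime overrides the host's CE status, and Storage and Transfer are aggregated per site with ANY. All function and variable names are illustrative.

```python
from typing import Dict, Iterable, List, Optional

# Assumed ranking, not given on the slide: FAIL < MAINT < OK.
RANK = {"FAIL": 0, "MAINT": 1, "OK": 2}


def all_of(statuses: Iterable[Optional[str]]) -> Optional[str]:
    """AND/ALL: worst known status wins; None (no data) is ignored."""
    known = [s for s in statuses if s is not None]
    return min(known, key=RANK.__getitem__) if known else None


def any_of(statuses: Iterable[Optional[str]]) -> Optional[str]:
    """OR/ANY: best known status wins; None (no data) is ignored."""
    known = [s for s in statuses if s is not None]
    return max(known, key=RANK.__getitem__) if known else None


# Raw metrics, topology and downtime as listed on the slide.
submit = {"ce1": "OK", "ce2": "OK", "ce3": "FAIL"}
efficiency = {"ce1": 85, "ce2": 60, "ce3": 90}
read = {"St1": "OK", "St2": "OK"}
write = {"St1": "OK", "St2": "FAIL"}
transfer = {"Tr1": "OK"}  # Transfer = CanTransfer
topology: Dict[str, List[str]] = {
    "Site1": ["ce1", "ce2", "St1", "Tr1"],
    "Site2": ["ce3", "St2"],
}
downtime = {"ce3": "MAINT"}

# Host level: CE = Submit AND Efficiency > 80; Storage = Read AND Write.
ce = {h: all_of([submit[h], "OK" if efficiency[h] > 80 else "FAIL"]) for h in submit}
ce.update(downtime)  # assumption: a downtime overrides the computed status
storage = {h: all_of([read[h], write[h]]) for h in read}


def site_status(hosts: List[str]) -> Optional[str]:
    """Site level: Any CE AND (Storage OR Transfer)."""
    site_ce = any_of(ce.get(h) for h in hosts)
    site_storage = any_of(storage.get(h) for h in hosts)
    site_transfer = any_of(transfer.get(h) for h in hosts)
    return all_of([site_ce, any_of([site_storage, site_transfer])])


status = {site: site_status(hosts) for site, hosts in topology.items()}
print(status)
```

Under these assumptions the sketch gives Site1 OK and Site2 FAIL, which agrees with the final Status line of the slide.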
  14. Use case 4: Data Monitoring

    ▪ Goal:
      ▪ Provide speed, success rate, volume of data movements
      ▪ 25M daily transfers, avg speed 10 GB/s
    ▪ RDBMS version in production since 2010
    ▪ Elastic alternative
      ▪ 1 index: 160M documents, 80 GB
    ▪ Current status: ES index created and being populated. UI work still to be done
  15. Use case 5: Benchmarking Commercial Cloud Resources

    ▪ Goal: Compare different benchmark tools on different cloud providers
    ▪ Use 6 different benchmark tools and X cloud providers
    ▪ 1 index: 270K documents (~250 MB)
    ▪ Detailed analysis performed with IPython analytics
    http://indico.cern.ch/event/384358/session/12/contribution/25
  16. Benchmark results at a glance

    ▪ Metric ~ [s]; each point ⬄ 10 min average; Colour ⬄ Cloud
  17. Qualitative look at data

    [Plot regions: 1 VM, 16 VMs, 30 VMs, 16 VMs, 25 VMs, 20 VMs]
    ▪ Identifiable transition of CPU performance when load changes
    ▪ Seen in all benchmark measurements. Performance recovers when scaling down
  18. Qualitative look at data

    ▪ Larger dispersion in KV and FastBmk values in the highest-load region
    [Plot regions: 1 VM, 16 VMs, 30 VMs, 16 VMs, 25 VMs, 20 VMs]
  19. Analysis done outside Elastic

    ▪ Correlation between different fields in documents
    ▪ Histograms with floating numbers
    [Plots: Evolution of a single VM in the parameter space FastBmk vs KV; Ratio mean(30 VMs)/mean(16 VMs); Projection-X, aggregated by hypervisor; Projection-Y, aggregated by hypervisor; Profile-X; 2D plot of KV vs FastBmk, annotated “A single Hyperv.!!”]
  20. Next goals

    ▪ Use ES for Job Monitoring in production
    ▪ Migrate to ES 2
      ▪ Check indexes/shards
    ▪ Ensure fine-grained security
      ▪ Currently, based on Apache configuration (allow only REST interface!)
      ▪ Need to move to another alternative
    ▪ Move to Kibana 4 (once security in place)
  21. Summary

    ▪ WLCG Grid Monitoring at CERN has been using Elastic in production for over a year
    ▪ Currently evaluating ElasticSearch for more use cases
    ▪ Very promising results so far!
    ▪ Looking forward to expanding our Elastic usage!