Grid Monitoring at CERN with Elastic

Elastic Co
November 10, 2015

In this talk Pablo Saiz presents the five different use cases where Elastic is used for WLCG monitoring at CERN:
- Messaging
- Job Monitoring
- Data Monitoring
- Infrastructure Monitoring
- Cloud Benchmarking
Pablo also outlines the team's future goals with Elastic and its plan to expand Elastic usage at CERN.

Pablo Saiz | Elastic{ON}Tour Munich | November 10, 2015

Transcript

  1. Pablo Saiz, 6 November 2015
    Grid Monitoring at CERN with Elastic

  2. IT-SDC
    IT-SDC Elastic{ON}Tour in Munich, Pablo Saiz, CERN
    Table of contents
    ▪ CERN
    ▪ Worldwide LHC Computing Grid (WLCG)
    ▪ Elastic for WLCG Monitoring
    ▪ Messaging
    ▪ Job Monitoring
    ▪ Data Monitoring
    ▪ Infrastructure monitoring
    ▪ Cloud benchmarking
    05/11/15 2

  3. CERN was founded in 1954 by 12 European States
    “Science for Peace”
    Today: 21 Member States
    Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece,
    Hungary, Israel, Italy, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the
    United Kingdom
    Candidate for Accession: Romania
    Associate Member in Pre-Stage to Membership: Serbia
    Applicant States for Membership or Associate Membership: Brazil, Cyprus, Pakistan, Russia, Slovenia, Turkey, Ukraine
    Observers to Council: India, Japan, Russia, Turkey, United States of America; European Commission and UNESCO
    ~2300 staff
    ~1600 other paid personnel
    ~10500 users
    Budget (2014): ~1000 MCHF
    Well known for:
    • Physics
    • WWW
    • Media: Angels & Demons, FlashForward, The Daily Show

  4. CERN’s Accelerator Complex
    http://cernland.net

  5. Colliding particles (event)
    Black Hole
    Higgs

  6. WLCG
    ▪ Biggest scientific Grid project in the world
    ▪ ~170 computer centers (sites)
      ▪ 1 Tier 0 (distributed in two locations)
      ▪ 12 bigger centers (Tier 1)
      ▪ ~160 Tier 2
    ▪ 42 countries
    ▪ 10,000 users
    ▪ Running since Oct 2008
    ▪ 2 million jobs per day
    ▪ ~600,000 cores
    ▪ 300 PB data
    http://cern.ch/wlcg

  7. WLCG
    Including Max-Planck-Institut für Physik, Munich

  8. WLCG Monitoring
    ▪ Many different tools:
      ▪ Jobs, data transfers, services
    ▪ Usual architecture:
      ▪ Python, RDBMS, Apache, JavaScript
    ▪ Running since 2007
    ▪ Team of ~10 people
    ▪ Different audiences:
      ▪ Site/service/experiment managers, end users, general public
    http://dashboard.cern.ch

  9. Elasticsearch configuration
    ▪ 2 clusters
      ▪ Production: 8 data, 2 search, 3 master nodes
      ▪ Development: 2 data, 1 search, 1 master node
    ▪ Data nodes: physical, 32 GB, 32 cores
    ▪ Search and master nodes: virtual, 8 GB
    ▪ ~10 different use cases

  10. Use case 1: Messaging Team (MIG)
    ▪ Goal:
      ▪ Check the status of the machines and services used for Messaging
    ▪ 25 machines, 15 clusters, ~130 applications, 260M messages/day
    ▪ Using Elasticsearch and Kibana 3 in production since Aug 2014
    ▪ 3 daily indexes, ~3M documents (850 MB) per day
    ▪ Documents contain sensitive information
      ▪ Ensure only authorized people can access them
    [Diagram labels: Messaging infrastructure; Transport layer (Messaging); ESPER; Elasticsearch; Graphite; Log, Check, Metric, Status]
    http://cern.ch/messaging
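
A minimal sketch of how date-suffixed daily index names could be generated, for example to query a range of days. The slide only states that there are three daily indexes; the prefixes below and the `YYYY.MM.DD` suffix (the convention popularized by Logstash) are assumptions, not the team's actual names.

```python
from datetime import date, timedelta

# Hypothetical prefixes: the slide does not name the three daily indexes.
PREFIXES = ("mig-log", "mig-metric", "mig-status")


def daily_index(prefix, day):
    """Return a date-suffixed index name such as 'mig-log-2015.11.05'."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indexes_for_range(prefix, start, end):
    """List the daily index names covering start..end, both inclusive."""
    days = (end - start).days
    return [daily_index(prefix, start + timedelta(days=i)) for i in range(days + 1)]


if __name__ == "__main__":
    for prefix in PREFIXES:
        print(indexes_for_range(prefix, date(2015, 10, 31), date(2015, 11, 1)))
```

With one index per day, old data can be dropped by deleting whole indexes rather than individual documents.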

  11. MIG Dashboard

  12. Use case 2: Job Monitoring
    ▪ Goal: follow up the status of job processing
    ▪ ~2M daily jobs (170M updates) over ~170 sites
    ▪ Multiple applications: interactive view, summaries, accounting…
    ▪ Landing page query aggregates 2 million records
    ▪ Aggregation over 60M entries (100 GB)
    ▪ In production (with RDBMS) since 2008
    ▪ Elasticsearch alternative in place
      ▪ Currently checking data consistency
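
As an illustration of the kind of aggregation a landing page might run, the sketch below builds an Elasticsearch request body that counts jobs per site and per status over a time window. The field names (`timestamp`, `site`, `status`) are assumptions; the slides do not show the actual document schema or queries.

```python
import json


def jobs_by_site_and_status(start, end, max_sites=200):
    """Build a request body: job counts per site, split by status."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {"timestamp": {"gte": start, "lt": end}}},
        "aggs": {
            "by_site": {
                "terms": {"field": "site", "size": max_sites},
                "aggs": {"by_status": {"terms": {"field": "status"}}},
            }
        },
    }


if __name__ == "__main__":
    body = jobs_by_site_and_status("2015-11-04T00:00:00", "2015-11-05T00:00:00")
    print(json.dumps(body, indent=2))
```

The body can be sent to the `_search` endpoint of the relevant index with any HTTP client; `max_sites` is set above the ~170 sites mentioned on the slide so that no site bucket is cut off.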

  13. Accessing raw data / summary

  14. Architecture
    ▪ Python collectors
    ▪ Apache web server
      ▪ With mod_python
    ▪ JavaScript
      ▪ DataTables, Highcharts…
    ▪ Currently replacing the storage with ES

  15. Elasticsearch results
    ▪ Good speed-up in job parsing:
      ▪ From 220 jobs per second to 580 (2.6x)
    ▪ Better speed-up in applications:
      ▪ From 12 seconds to 0.70 s in the most used query
    ▪ More query possibilities:
      ▪ Aggregations over more than one month
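
The speed-up factors follow directly from the figures on this slide. A small worked check (the helper name is only for illustration):

```python
def speedup(before, after, higher_is_better):
    """Return how many times better 'after' is compared with 'before'."""
    return after / before if higher_is_better else before / after


# Job parsing throughput in jobs per second (higher is better).
parsing = speedup(220, 580, higher_is_better=True)

# Most used query, response time in seconds (lower is better).
query = speedup(12.0, 0.70, higher_is_better=False)

print(f"parsing: {parsing:.1f}x, most used query: {query:.1f}x")
```

This prints `parsing: 2.6x, most used query: 17.1x`, so the query-side gain is considerably larger than the ingestion-side gain.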

  16. Performance
    [Bar charts; one series is Elasticsearch, the other series label is truncated to “S (no cache)”]
    ▪ Opening the landing page (axis 0 to 12): 11.8 and 0.6
    ▪ Maximum updates per second (axis 0 to 600): Python + RDBMS 223, Python + Elasticsearch 584
    ▪ One Month Query (axis 0 to 1200): 1084.3 and 14.1

  17. Use case 3: Infrastructure Monitoring
    ▪ Goal:
      ▪ Follow up status of sites/services using user-defined metrics
    ▪ 2M entries per day (1M raw metrics, 1M combined metrics)
    ▪ Metrics:
      ▪ Numerical or string
      ▪ Non-linear in time (independent sources, future values, changing old entries)
      ▪ Keeping only status changes
    ▪ Combining metrics (AND/OR/ANY/ALL/FILTER)
    ▪ Creating ‘Views’ as lists of metrics
    ▪ In production (RDBMS) since 2010
    ▪ ES alternative ready for data insertion and combining metrics
      ▪ Moving from Python to Java
      ▪ Using the Jest interface
      ▪ Using ESPER
    ▪ ESPER/ES significantly reduce the latency of metric combination
    ▪ UI needs to be modified
    http://wlcg-mon.cern.ch/
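
A minimal sketch of "keeping only status changes": samples are ordered by timestamp first, since the slide notes that entries can arrive out of order, and a sample is kept only when its status differs from the previous kept one. This illustrates the idea only; the production logic (Python/RDBMS, later Java/ESPER) is not shown in the slides.

```python
def status_changes(samples):
    """Yield (timestamp, status) pairs only where the status changes.

    'samples' is an iterable of (timestamp, status) tuples in any order.
    """
    last = object()  # sentinel that never equals a real status
    for timestamp, status in sorted(samples):
        if status != last:
            yield timestamp, status
            last = status


if __name__ == "__main__":
    raw = [(3, "OK"), (1, "OK"), (2, "FAIL"), (4, "OK")]
    print(list(status_changes(raw)))
```

For the sample input this keeps `(1, "OK")`, `(2, "FAIL")` and `(3, "OK")` and drops the repeated `OK` at timestamp 4.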

  18. Site Status Board

  19. How to combine metrics
    Example: Any CE AND (Storage OR Transfer)
    Where CE = Submit AND Efficiency > 80; Storage = Read AND Write; Transfer = CanTransfer;
    plus Downtimes and Topology
    Input metrics:
    ▪ Submit: ce1, ce2: OK; ce3: FAIL
    ▪ Efficiency: ce1: 85; ce2: 60; ce3: 90
    ▪ Read: St1: OK; St2: OK
    ▪ Write: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    ▪ Topology: Site1: ce1, ce2, St1, Tr1; Site2: ce3, St2
    ▪ Downtime: ce3: MAINT
    HOST level, combined metrics:
    ▪ CE: ce1: OK; ce2, ce3: FAIL
    ▪ Storage: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    HOST level, with downtimes applied:
    ▪ CE: ce1: OK; ce2: FAIL; ce3: MAINT
    ▪ Storage: St1: OK; St2: FAIL
    ▪ Transfer: Tr1: OK
    SITE level:
    ▪ CE: Site1: OK; Site2: MAINT
    ▪ Storage: Site1: OK; Site2: FAIL
    ▪ Transfer: Site1: OK
    ▪ Storage + Transfer: Site1: OK; Site2: FAIL
    ▪ Status: Site1: OK; Site2: FAIL
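
The example on this slide can be reproduced with a short sketch. The input values are taken from the slide; the severity ordering (OK better than MAINT, MAINT better than FAIL), the rule that a downtime overrides the computed host status, and all function names are assumptions chosen so that the result matches the slide, not a description of the actual implementation.

```python
# Severity order is an assumption chosen so that the sketch reproduces
# the slide's results; the real tool may rank statuses differently.
SEVERITY = {"OK": 0, "MAINT": 1, "FAIL": 2}


def all_of(*statuses):
    """AND / ALL: the worst status wins."""
    return max(statuses, key=SEVERITY.__getitem__)


def any_of(*statuses):
    """OR / ANY: the best status wins."""
    return min(statuses, key=SEVERITY.__getitem__)


def ok_if(condition):
    return "OK" if condition else "FAIL"


# Raw metrics, topology and downtimes as shown on the slide.
submit = {"ce1": "OK", "ce2": "OK", "ce3": "FAIL"}
efficiency = {"ce1": 85, "ce2": 60, "ce3": 90}
read = {"St1": "OK", "St2": "OK"}
write = {"St1": "OK", "St2": "FAIL"}
transfer = {"Tr1": "OK"}
downtime = {"ce3": "MAINT"}
topology = {
    "Site1": {"ce": ["ce1", "ce2"], "storage": ["St1"], "transfer": ["Tr1"]},
    "Site2": {"ce": ["ce3"], "storage": ["St2"], "transfer": []},
}

# Host level: CE = Submit AND Efficiency > 80; Storage = Read AND Write.
ce = {h: all_of(submit[h], ok_if(efficiency[h] > 80)) for h in submit}
ce.update(downtime)  # assumption: a declared downtime overrides the status
storage = {h: all_of(read[h], write[h]) for h in read}


def site_status(site):
    """Any CE AND (Storage OR Transfer) for one site."""
    hosts = topology[site]
    site_ce = any_of(*(ce[h] for h in hosts["ce"]))
    data = [storage[h] for h in hosts["storage"]]
    data += [transfer[h] for h in hosts["transfer"]]
    return all_of(site_ce, any_of(*data))


for site in topology:
    print(site, site_status(site))
```

Running it prints `Site1 OK` and `Site2 FAIL`, matching the final Status box on the slide.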

  20. Use case 4: Data Monitoring
    ▪ Goal:
      ▪ Provide speed, success rate and volume of data movements
    ▪ 25M daily transfers, avg speed 10 GB/s
    ▪ RDBMS version in production since 2010
    ▪ Elastic alternative
      ▪ 1 index: 160M documents, 80 GB
    ▪ Current status: ES index created and being populated; UI work still to do
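
A minimal sketch of the three quantities named in the goal (speed, success rate, volume), computed from a list of transfer records. The record fields (`bytes`, `ok`) and the choice to count only successful transfers towards volume and throughput are assumptions; the slides do not describe the transfer documents.

```python
def summarize(transfers, window_seconds):
    """Summarize transfers observed during a window of 'window_seconds'."""
    succeeded = [t for t in transfers if t["ok"]]
    volume = sum(t["bytes"] for t in succeeded)
    return {
        # Fraction of transfers that succeeded (0.0 when there are none).
        "success_rate": len(succeeded) / len(transfers) if transfers else 0.0,
        # Bytes moved by successful transfers.
        "volume_bytes": volume,
        # Aggregate throughput over the whole window.
        "bytes_per_second": volume / window_seconds,
    }


if __name__ == "__main__":
    sample = [
        {"bytes": 100, "ok": True},
        {"bytes": 300, "ok": True},
        {"bytes": 50, "ok": False},
        {"bytes": 0, "ok": False},
    ]
    print(summarize(sample, window_seconds=10))
```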

  21. WLCG Transfers dashboard

  22. Use case 5: Benchmarking Commercial Cloud Resources
    ▪ Goal: Compare different benchmark tools on different cloud providers
    ▪ Use 6 different benchmark tools and X cloud providers
    ▪ 1 index: 270K documents (~250 MB)
    ▪ Detailed analysis performed with IPython analytics
    http://indico.cern.ch/event/384358/session/12/contribution/25

  23. Benchmark results at a glance
    ▪ Metric ~ [s]; each point ⬄ 10 min average; colour ⬄ cloud

  24. Qualitative look at data
    [Plot regions labelled by number of VMs: 1 VM, 16 VMs, 30 VMs, 25 VMs, 20 VMs, 16 VMs]
    ▪ Identifiable transition of CPU performance when load changes
    ▪ Seen in all benchmark measurements. Performance recovers when scaling down

  25. Qualitative look at data
    ▪ Larger dispersion in KV and FastBmk values in the highest-load region
    [Plot regions labelled by number of VMs: 1 VM, 16 VMs, 30 VMs, 25 VMs, 20 VMs, 16 VMs]

  26. Analysis done outside Elastic
    ▪ Correlation between different fields in documents
    ▪ Histograms with floating numbers
    [Plots: evolution of a single VM in the parameter space FastBmk vs KV; ratio mean(30 VMs)/mean(16 VMs); 2D plot of KV vs FastBmk with Projection-X and Projection-Y (aggregated by hypervisor) and Profile-X; annotation: “A single hypervisor!!”]
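
The two analyses listed on this slide can be sketched in plain Python once the documents have been exported. KV and FastBmk are the benchmark names from the slides; the document field names and the numbers below are invented for illustration and are not measurements from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def histogram(values, low, high, bins):
    """Count floating-point values in 'bins' equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            # The upper edge 'high' falls into the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts


if __name__ == "__main__":
    # Illustrative documents only; field names are hypothetical.
    docs = [
        {"kv": 1.0, "fastbmk": 2.0},
        {"kv": 2.0, "fastbmk": 4.0},
        {"kv": 3.0, "fastbmk": 6.0},
    ]
    kv = [d["kv"] for d in docs]
    fastbmk = [d["fastbmk"] for d in docs]
    print(pearson(kv, fastbmk))
    print(histogram(kv, low=0.0, high=4.0, bins=4))
```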

  27. Next goals
    ▪ Use ES for Job Monitoring in production
    ▪ Migrate to ES 2
      ▪ Check indexes/shards
    ▪ Ensure fine-grained security
      ▪ Currently based on Apache configuration (only the REST interface is allowed!)
      ▪ Need to move to an alternative
    ▪ Move to Kibana 4 (once security is in place)

  28. Summary
    ▪ WLCG Grid Monitoring at CERN has been using Elastic in production for over a year
    ▪ Currently evaluating Elasticsearch for more use cases
    ▪ Very promising results so far!
    ▪ Looking forward to expanding our Elastic usage!

  29. www.elastic.co