PDX 2017 - Pedro Andrade

Monitorama

May 23, 2017

Transcript

  2. Monitoring @ CERN
    [email protected]

  3. 23 May 2017
    Founded in 1954 by 12 European states
    Associate members
    • India
    • Pakistan
    • Turkey
    • Ukraine
    Staff members
    • 2300
    Other personnel
    • 1400
    Scientific users
    • 12500

  7. 40M per sec → 100k per sec → 1k per sec
     (HW trigger → SW trigger → recorded)
     From ~1 PB per sec to ~4 GB per sec

    View Slide

  8. CERN Data Centres
    • Main site
    • Built in the 70s
    • Geneva, Switzerland
    • Extension site
    • Budapest, Hungary
    • 3 × 100 Gbit/s links
    • Commodity HW

  10. Worldwide LHC Computing Grid
    • WLCG provides global computing resources to
    store, distribute and analyse the LHC data
    • The CERN data centre (Tier-0) distributes LHC
    data to other WLCG sites (Tier-1, Tier-2, Tier-3)
    • Global collaboration of more than 170 data
    centres around the world from 42 countries

  11. About WLCG:
    • A community of 10,000 physicists
    • ~250,000 jobs running concurrently
    • 600,000 processing cores
    • 15% of the resources are at CERN
    • 700 PB storage available worldwide
    • 20-40 Gbit/s connect CERN to Tier1s
    Tier-0 (CERN)
    • Initial data reconstruction
    • Data distribution
    • Data recording & archiving
    Tier-1s (13 centres)
    • Initial data reconstruction
    • Permanent storage
    • Re-processing
    • Analysis
    Tier-2s (>150 centres)
    • Simulation
    • End-user analysis

  12. Monitoring Team
    Provide a common infrastructure to measure,
    collect, transport, visualize, process and
    raise alarms on monitoring data from the CERN
    Data Centres and the WLCG collaboration

  13. Monitoring Data
    • Wide variety of data, as metrics or logs
    • Data centre HW and OS
    • Data centre services
    • WLCG site/services monitoring
    • WLCG job monitoring and data management
    • Spiky workload, with an average of 500 GB/day
    • When things go bad we get more data and more users

  14. Monitoring Architecture
    • Common solution / Open source tools
    • Scalable infrastructure / Empower users
    (Architecture diagram showing the pipeline layers: Sources, Transport, Processing, Access, Storage)

  15. Sources
    • Internal data sources
    • Data centre OS and HW
    • Based on Collectd: 30k nodes, 1 min samples
    • External data sources
    • Data centre services and WLCG
    • A mix between push and pull models
    • Different technologies and protocols

  17. Transport
    (Transport diagram: sources (DC metrics and logs, DB, HTTP, AMQ) feed protocol-specific
    Flume agents (JMS, JDBC, HTTP, logs, metrics), which publish into Kafka; additional
    Flume agents consume from Kafka downstream)

  18. Transport
    • 10 different types of Flume agents
    • E.g. JMS/AVRO, AVRO/KAFKA, KAFKA/ESSINK
    • File-based channels (small capacity)
    • Several interceptors and morphlines
    • For validation and transformation
    • Check mandatory fields, apply common schema (see the sketch below)
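
    As a rough illustration of the kind of mandatory-field check these interceptors perform,
    a minimal sketch in plain Scala (not the actual Flume interceptor code; the field names
    are assumptions):

      // Sketch only: validate that an event carries the mandatory fields of the
      // common schema before it is forwarded. Field names ("producer", "type",
      // "timestamp", "data") are assumed for illustration.
      object EventValidation {
        val mandatoryFields: Set[String] = Set("producer", "type", "timestamp", "data")

        def isValid(event: Map[String, String]): Boolean =
          mandatoryFields.subsetOf(event.keySet)

        def main(args: Array[String]): Unit = {
          val ok  = Map("producer" -> "fts", "type" -> "metric",
                        "timestamp" -> "1495526400", "data" -> "{}")
          val bad = Map("producer" -> "fts")
          println(isValid(ok))   // true
          println(isValid(bad))  // false
        }
      }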

  19. Transport
    • Kafka is the rock-solid core of the transport layer
    • Data buffered for 12h (target is 72h)
    • Each data source in a separate topic (see the sketch below)
    • Each topic divided into 20 partitions
    • Two replicas per partition
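
    A minimal sketch of how such a topic could be declared with the Kafka AdminClient; the
    broker address, topic name and exact retention value are illustrative, not the
    production settings:

      import java.util.Properties
      import scala.jdk.CollectionConverters._
      import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

      object CreateSourceTopic {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092") // hypothetical broker
          val admin = AdminClient.create(props)

          // One topic per data source: 20 partitions, 2 replicas, ~12h retention.
          val topic = new NewTopic("monit-fts-metrics", 20, 2.toShort)          // illustrative name
            .configs(Map("retention.ms" -> (12L * 3600 * 1000).toString).asJava)

          admin.createTopics(List(topic).asJava).all().get()
          admin.close()
        }
      }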

  21. Processing
    (Processing diagram: Flume feeds Kafka, which buffers the data for enrichment and
    aggregation jobs; results flow back out through Flume)

  22. Processing
    • Stream processing jobs (Scala)
    • Enrichment jobs: e.g. topology metadata (see the sketch below)
    • Aggregation jobs: mostly over time
    • Correlation jobs: e.g. CPU load vs. service activity
    • Batch processing jobs (Scala)
    • Compaction and reprocessing jobs
    • Monthly or yearly reports for management
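
    A condensed sketch of what such a streaming enrichment job might look like with Spark
    Structured Streaming; the topic names, JSON schema and topology path are assumptions,
    not the actual CERN jobs:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
      import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

      object EnrichMetrics {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("enrich-metrics").getOrCreate()

          // Assumed schema of the incoming JSON payload.
          val schema = new StructType()
            .add("host", StringType)
            .add("metric", StringType)
            .add("value", DoubleType)

          // Static topology metadata (host -> service), assumed to be exported to HDFS.
          val topology = spark.read.json("hdfs:///project/monitoring/topology/")

          val raw = spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka01:9092")   // hypothetical brokers
            .option("subscribe", "monit-metrics-raw")            // illustrative topic name
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("m"))
            .select("m.*")

          // Enrich each metric with its topology metadata.
          val enriched = raw.join(topology, Seq("host"), "left_outer")

          enriched
            .select(to_json(struct(enriched.columns.map(col): _*)).alias("value"))
            .writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka01:9092")
            .option("topic", "monit-metrics-enriched")
            .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/enrich")
            .start()
            .awaitTermination()
        }
      }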

  23. Processing
    • Jobs built and packaged as Docker images
    • Automated builds at every push with GitLab CI
    • Jobs orchestrated as processes on Mesos
    • Marathon for long-lived processes (e.g. streaming)
    • Chronos for recurrent execution (e.g. batch)
    • Jobs executed in Mesos or YARN clusters

  24. Storage
      (Storage diagram: Kafka, holding raw or processed data, feeds Flume agents that write
      to the storage backends)

  25. Storage
    • HDFS for long-term archive and data recovery
    • Data kept forever (limited only by resources)
    • ES for searches and data discovery
    • Data kept for 1 month
    • InfluxDB for plots and dashboards
    • Raw data kept for 7 days
    • 5-minute bins kept for 1 month, 1-hour bins kept for 5 years

  26. Storage
    • All data written to HDFS by default
    • /project/monitoring/archive/fts/metrics/2017/05/22/
    • Data aggregated once per day into ~1 GB files (see the sketch below)
    • Selected data sets stored in ES and/or InfluxDB
    • ES: two generic instances (metrics and logs)
    • ES: each data producer in a separate index
    • InfluxDB: one instance per data producer
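
    A minimal sketch of the daily aggregation as a Spark batch job that rewrites one day of
    small raw files into a few large ones; the target file count and staging path are
    assumptions:

      import org.apache.spark.sql.SparkSession

      object CompactDay {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("compact-day").getOrCreate()
          // Example path from the slide: one directory per producer, type and day.
          val day = "/project/monitoring/archive/fts/metrics/2017/05/22"

          spark.read.json(day)          // many small files written by Flume during the day
            .coalesce(4)                // rewrite as a handful of ~1 GB files (count is an assumption)
            .write
            .mode("overwrite")
            .json(day + "_compacted")   // hypothetical staging path, swapped in afterwards
          spark.stop()
        }
      }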

  27. Access
    (Access diagram: data is accessed from HDFS, Elasticsearch and InfluxDB)

  31. Alarms
    • Based on an existing in-house toolset
    • Old implementation, more than 5 years old
    • Local alarms from metric thresholds
    • Moving towards a multi-scope strategy
    • Local alarms from Collectd, to enable local actuators
    • Base alarms from Grafana, easy integration
    • Advanced alarms from Spark jobs (user specific, see the sketch below)
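
    As an illustration of the kind of user-specific rule such a Spark job could evaluate,
    a minimal sketch in plain Scala; the metric name, threshold and three-sample rule are
    assumptions:

      case class Sample(host: String, metric: String, value: Double)
      case class Alarm(host: String, metric: String, message: String)

      object ThresholdRule {
        // Raise an alarm when the last three samples of a metric are all above the threshold.
        def evaluate(samples: Seq[Sample], threshold: Double): Seq[Alarm] =
          samples
            .groupBy(s => (s.host, s.metric))
            .collect { case ((host, metric), xs)
                if xs.size >= 3 && xs.takeRight(3).forall(_.value > threshold) =>
              Alarm(host, metric, s"$metric above $threshold for the last 3 samples")
            }
            .toSeq

        def main(args: Array[String]): Unit = {
          val window = Seq(
            Sample("node01", "cpu_load", 0.95),
            Sample("node01", "cpu_load", 0.97),
            Sample("node01", "cpu_load", 0.99))
          evaluate(window, 0.9).foreach(println)   // one alarm for node01/cpu_load
        }
      }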

  32. Monitoring Infrastructure
    • OpenStack VMs
    • 60 Flume, 20 Kafka, 10 Spark, 50 Elasticsearch
    • Some nodes with Ceph volumes attached
    • Physical nodes
    • Only used for InfluxDB and HDFS
    • All configuration done via Puppet

  33. Lessons Learned
    • Protocol-based Flume agents
    • Extremely handy, very easy to add new sources
    • Kafka is the core of the “business”
    • Careful resource planning (topic/partition split)
    • Controlling consumer groups and offsets is key (see the sketch below)
    • Securing the infrastructure takes time
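
    A minimal sketch of the kind of explicit consumer-group and offset handling this refers
    to, using the plain Kafka consumer API; group and topic names are illustrative:

      import java.time.Duration
      import java.util.Properties
      import scala.jdk.CollectionConverters._
      import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

      object ControlledConsumer {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092") // hypothetical broker
          props.put(ConsumerConfig.GROUP_ID_CONFIG, "monit-es-sink")         // explicit, named group
          props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")       // commit offsets ourselves
          props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer")
          props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer")

          val consumer = new KafkaConsumer[String, String](props)
          consumer.subscribe(List("monit-fts-metrics").asJava)               // illustrative topic
          while (true) {
            val records = consumer.poll(Duration.ofSeconds(1))
            records.asScala.foreach(r => println(s"${r.offset} ${r.value}"))
            consumer.commitSync() // offsets advance only after the records are handled
          }
        }
      }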

  34. Lessons Learned
    • Scala was the right choice for Spark
    • Keep batch and streaming code as close as possible
    • DataFrame to decouple from JSON (un)marshalling
    • Store checkpoints on HDFS
    • Very positive experience with Marathon
    • InfluxDB and Elasticsearch
    • Tried to make them as complementary as possible

  35. Thank You !
    https://cern.ch/monitdocs
