
PDX 2017 - Pedro Andrade

Monitorama

May 23, 2017
Transcript

  1. CERN • Founded in 1954 by 12 European states • Associate members: India,
     Pakistan, Turkey, Ukraine • Staff members: 2300 • Other personnel: 1400 •
     Scientific users: 12500
  2. LHC data flow: 40M per sec → HW trigger → ~100k per sec → SW trigger →
     ~1k per sec recorded • From ~1 PB per sec to ~4 GB per sec
  3. CERN Data Centres • Main site: built in the 70s, Geneva, Switzerland •
     Extension site: Budapest, Hungary, 3x100Gb links • Commodity HW
  4. Worldwide LHC Computing Grid • WLCG provides global computing resources to
     store, distribute and analyse the LHC data • The CERN data centre (Tier-0)
     distributes LHC data to other WLCG sites (Tier-1, Tier-2, Tier-3) • Global
     collaboration of more than 170 data centres around the world, from 42 countries
  5. About WLCG • A community of 10,000 physicists • ~250,000 jobs running
     concurrently • 600,000 processing cores • 15% of the resources are at CERN •
     700 PB storage available worldwide • 20-40 Gbit/s links connect CERN to the
     Tier-1s • Tier-0 (CERN): initial data reconstruction, data distribution, data
     recording & archiving • Tier-1s (13 centres): initial data reconstruction,
     permanent storage, re-processing, analysis • Tier-2s (>150 centres): simulation,
     end-user analysis
  6. Monitoring Team • Provides a common infrastructure to measure, collect,
     transport, visualize, process and alarm on monitoring data from the CERN Data
     Centres and the WLCG collaboration
  7. Monitoring Data • Wide variety of data as metrics or logs • Data centre HW and
     OS • Data centre services • WLCG site/service monitoring • WLCG job monitoring
     and data management • Spiky workload with an average of 500 GB/day • When
     things go bad we have more data and more users
  8. Monitoring Architecture • Common solution / open source tools • Scalable
     infrastructure / empower users • Layers: Sources, Transport, Processing,
     Access, Storage
  9. Sources • Internal data sources: data centre OS and HW, based on Collectd
     (30k nodes, 1 min samples) • External data sources: data centre services and
     WLCG, a mix of push and pull models, different technologies and protocols
  10. Transport • [Diagram: metrics and logs from the data centres, databases,
      HTTP endpoints and AMQ brokers enter through protocol-specific Flume agents
      (JMS, JDBC, HTTP, logs, metrics) that feed Kafka; Flume agents also consume
      from Kafka downstream]
  11. Transport • 10 different types of Flume agents, e.g. JMS/AVRO, AVRO/KAFKA,
      KAFKA/ESSINK • File-based channels (small capacity) • Several interceptors and
      morphlines for validation and transformation: check mandatory fields, apply a
      common schema (sketch below)
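
The deck does not show any interceptor code; as a rough illustration of the "check mandatory fields" step, here is a minimal sketch of a custom Flume interceptor written in Scala. The class names, header fields and configuration key are hypothetical, and the real pipeline may implement this with morphlines instead:

```scala
import java.util
import scala.collection.JavaConverters._
import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

// Hypothetical interceptor that drops events missing mandatory header fields.
class MandatoryFieldsInterceptor(required: Set[String]) extends Interceptor {
  override def initialize(): Unit = ()

  // Keep the event only if every mandatory field is present in its headers;
  // returning null tells Flume to drop the event.
  override def intercept(event: Event): Event = {
    val headers = event.getHeaders.asScala
    if (required.forall(headers.contains)) event else null
  }

  override def intercept(events: util.List[Event]): util.List[Event] =
    events.asScala.map(e => intercept(e)).filter(_ != null).asJava

  override def close(): Unit = ()
}

// Flume instantiates interceptors through a Builder configured from the agent properties.
class MandatoryFieldsInterceptorBuilder extends Interceptor.Builder {
  private var required: Set[String] = Set.empty

  override def configure(context: Context): Unit = {
    // e.g. agent.sources.s1.interceptors.i1.mandatoryFields = producer,type,timestamp
    required = context.getString("mandatoryFields", "")
      .split(",").map(_.trim).filter(_.nonEmpty).toSet
  }

  override def build(): Interceptor = new MandatoryFieldsInterceptor(required)
}
```
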
  12. Transport • Kafka is the rock-solid core of the transport layer • Data
      buffered for 12h (target is 72h) • Each data source in a separate topic • Each
      topic divided into 20 partitions • Two replicas per partition (sketch below)
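
The deck does not say how topics are provisioned; one way to express the layout above (one topic per data source, 20 partitions, two replicas, ~12h retention) is with Kafka's AdminClient, sketched here in Scala. The broker address and topic name are placeholders:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateMonitoringTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092") // placeholder broker

    val admin = AdminClient.create(props)
    try {
      // One topic per data source: 20 partitions, replication factor 2,
      // with retention roughly matching the 12h buffering mentioned above.
      val topic = new NewTopic("monit-metrics-collectd", 20, 2.toShort)
        .configs(Collections.singletonMap("retention.ms", (12L * 60 * 60 * 1000).toString))
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally {
      admin.close()
    }
  }
}
```
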
  13. Processing • [Diagram: Flume → Kafka (buffering) → processing jobs
      (enrichment, aggregation) → Flume]
  14. Processing • Stream processing jobs (Scala) • Enrichment jobs: e.g. topology
      metadata (see the sketch below) • Aggregation jobs: mostly over time •
      Correlation jobs: CPU load vs service activity • Batch processing jobs (Scala)
      • Compaction and reprocessing jobs • Monthly or yearly reports for management
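
The jobs themselves are not shown in the deck; below is a minimal Spark Structured Streaming sketch of the enrichment pattern described above (attach topology metadata to incoming metrics). Topic names, the record schema and the HDFS paths are assumptions, and the actual jobs may use a different Spark API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EnrichmentJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("metrics-enrichment").getOrCreate()
    import spark.implicits._

    // Assumed shape of an incoming metric record.
    val metricSchema = new StructType()
      .add("host", StringType)
      .add("metric", StringType)
      .add("value", DoubleType)
      .add("timestamp", TimestampType)

    // Static dimension table: host -> service/cluster mapping (topology metadata).
    val topology = spark.read.json("hdfs:///project/monitoring/topology/")

    // Read raw JSON metrics from Kafka and parse them into columns.
    val metrics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("subscribe", "monit-metrics-collectd")
      .load()
      .select(from_json($"value".cast("string"), metricSchema).as("m"))
      .select("m.*")

    // Stream-static join: every metric gets its topology metadata attached.
    val enriched = metrics.join(topology, Seq("host"), "left_outer")

    // Write the enriched records back to a Kafka topic for downstream consumers.
    enriched.selectExpr("to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("topic", "monit-metrics-enriched")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/enrichment")
      .start()
      .awaitTermination()
  }
}
```
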
  15. Processing • Jobs built and packaged as Docker images • Automated at every
      push with GitLab CI • Jobs orchestrated as processes on Mesos • Marathon for
      long-running processes (e.g. streaming) • Chronos for recurrent execution
      (e.g. batch) • Jobs executed on Mesos or YARN clusters
  16. Storage • HDFS for long-term archive and data recovery: data kept forever
      (limited by resources) • Elasticsearch for searches and data discovery: data
      kept for 1 month • InfluxDB for plots and dashboards: raw data kept for 7
      days, 5-minute bins kept for 1 month, 1-hour bins kept for 5 years
  17. Storage • All data written to HDFS by default, e.g.
      /project/monitoring/archive/fts/metrics/2017/05/22/ • Data aggregated once per
      day into ~1 GB files (sketch below) • Selected data sets stored in
      Elasticsearch and/or InfluxDB • Elasticsearch: two generic instances (metrics
      and logs), each data producer in a separate index • InfluxDB: one instance per
      data producer
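
As an example of how the once-per-day aggregation into ~1 GB files could be done (this corresponds to the compaction jobs mentioned on slide 14), here is a small Spark batch sketch in Scala. The input path, output partition count and file format are assumptions; only the archive path layout comes from the slide:

```scala
import org.apache.spark.sql.SparkSession

object DailyCompaction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fts-metrics-compaction").getOrCreate()

    val day = "2017/05/22"
    // Hypothetical staging area holding the many small files written during the day.
    val raw = spark.read.json(s"hdfs:///project/monitoring/raw/fts/metrics/$day/")

    // Rewrite the day's data as a handful of large files under the dated archive
    // directory shown on the slide; 8 partitions is a guess aimed at ~1 GB per file.
    raw.coalesce(8)
      .write
      .mode("overwrite")
      .json(s"hdfs:///project/monitoring/archive/fts/metrics/$day/")
  }
}
```
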
  18. Alarms • Based on an existing in-house toolset • Old implementation, more than
      5 years old • Local alarms from metric thresholds • Moving towards a
      multi-scope strategy • Local alarms from Collectd, to enable local actuators •
      Basic alarms from Grafana, easy integration • Advanced alarms from Spark jobs
      (user-specific, sketch below)
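
As an illustration of the "advanced alarms from Spark jobs" idea, here is a hedged sketch of a user-specific streaming job that flags hosts whose average CPU load stays above a threshold over a 5-minute window. The topic, field names and threshold are made up, and the real alarm jobs may look quite different:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object HighLoadAlarm {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("high-load-alarm").getOrCreate()
    import spark.implicits._

    // Assumed shape of an enriched metric record.
    val schema = new StructType()
      .add("host", StringType)
      .add("metric", StringType)
      .add("value", DoubleType)
      .add("timestamp", TimestampType)

    val metrics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("subscribe", "monit-metrics-enriched")
      .load()
      .select(from_json($"value".cast("string"), schema).as("m"))
      .select("m.*")

    // Average CPU load per host over 5-minute event-time windows and keep only
    // the windows that breach the (hypothetical) threshold.
    val alarms = metrics
      .filter($"metric" === "cpu_load")
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"host")
      .agg(avg($"value").as("avg_load"))
      .filter($"avg_load" > 10.0)

    // Console sink just for the sketch; a real job would feed the alarm toolset.
    alarms.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/high-load-alarm")
      .start()
      .awaitTermination()
  }
}
```
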
  19. Monitoring Infrastructure • OpenStack VMs: 60 Flume, 20 Kafka, 10 Spark, 50
      Elasticsearch • Some nodes with Ceph volumes attached • Physical nodes: only
      used for InfluxDB and HDFS • All configuration done via Puppet
  20. Lessons Learned • Protocol-based Flume agents: extremely handy, very easy to
      add new sources • Kafka is the core of the “business”: careful resource
      planning (topic/partition split) is needed, and controlling consumer groups
      and offsets is key • Securing the infrastructure takes time
  21. Lessons Learned • Scala was the right choice for Spark • Keep batch and
      streaming code as close as possible (sketch below) • DataFrames to decouple
      from JSON (un)marshalling • Store checkpoints on HDFS • Very positive
      experience with Marathon • InfluxDB and Elasticsearch: tried to make them as
      complementary as possible
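
To make the "keep batch and streaming code as close as possible" point concrete, here is a small sketch in the spirit of the deck: the transformation is written once as a DataFrame => DataFrame function and reused by both a batch reprocessing run and a streaming run, with checkpoints stored on HDFS. All paths and column names are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SharedTransform {
  // Same logic whether `df` is a static or a streaming DataFrame.
  def normalise(df: DataFrame): DataFrame =
    df.filter(col("host").isNotNull)
      .withColumn("value", col("value").cast("double"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("shared-transform").getOrCreate()
    val archive = "hdfs:///project/monitoring/archive/fts/metrics/2017/05/22/"

    // Batch reprocessing over the archived JSON.
    val batch = normalise(spark.read.json(archive))
    batch.write.mode("overwrite").parquet("hdfs:///tmp/fts-metrics-reprocessed/")

    // Streaming path: the same function applied to a streaming read of the same
    // directory; checkpoints go to HDFS as recommended on the slide.
    val stream = normalise(spark.readStream.schema(batch.schema).json(archive))
    stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/fts-metrics-stream/")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/fts-normalise")
      .start()
      .awaitTermination()
  }
}
```
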