
PDX 2017 - Pedro Andrade

Monitorama

May 23, 2017
Transcript

  1. CERN • Founded in 1954 by 12 European states • Associate members: India,
     Pakistan, Turkey, Ukraine • Staff members: 2300 • Other personnel: 1400 •
     Scientific users: 12500
  2. LHC data flow: 40M per sec → HW trigger → ~100k per sec → SW trigger →
     ~1k per sec recorded • From ~1 PB per sec to ~4 GB per sec
  3. CERN Data Centres • Main site: built in the 70s, Geneva, Switzerland •
     Extension site: Budapest, Hungary, 3x100Gb links • Commodity HW
  4. Worldwide LHC Computing Grid • WLCG provides global computing resources to
     store, distribute and analyse the LHC data • The CERN data centre (Tier-0)
     distributes LHC data to other WLCG sites (Tier-1, Tier-2, Tier-3) • Global
     collaboration of more than 170 data centres around the world, from 42 countries
  5. About WLCG • A community of 10,000 physicists • ~250,000 jobs running
     concurrently • 600,000 processing cores • 15% of the resources are at CERN •
     700 PB storage available worldwide • 20-40 Gbit/s links connect CERN to the
     Tier-1s • Tier-0 (CERN): initial data reconstruction, data distribution, data
     recording & archiving • Tier-1s (13 centres): initial data reconstruction,
     permanent storage, re-processing, analysis • Tier-2s (>150 centres): simulation,
     end-user analysis
  6. Monitoring Team • Provides a common infrastructure to measure, collect,
     transport, visualize, process and alarm on monitoring data from the CERN Data
     Centres and the WLCG collaboration
  7. Monitoring Data • Wide variety of data as metrics or logs • Data centre HW and
     OS • Data centre services • WLCG site/service monitoring • WLCG job monitoring
     and data management • Spiky workload with an average of 500 GB/day • When
     things go bad we have more data and more users
  8. Monitoring Architecture • Common solution / open source tools • Scalable
     infrastructure / empower users • Layers: Sources, Transport, Processing,
     Access, Storage
  9. Sources • Internal data sources: data centre OS and HW, based on Collectd
     (30k nodes, 1 min samples) • External data sources: data centre services and
     WLCG, a mix of push and pull models, different technologies and protocols
  10. Transport • [Diagram: metrics and logs from the data centres, databases,
      HTTP endpoints and AMQ brokers enter through protocol-specific Flume agents
      (JMS, JDBC, HTTP, logs, metrics) that feed Kafka; Flume agents also consume
      from Kafka downstream]
  11. Transport • 10 different types of Flume agents, e.g. JMS/AVRO, AVRO/KAFKA,
      KAFKA/ESSINK • File-based channels (small capacity) • Several interceptors and
      morphlines for validation and transformation: check mandatory fields, apply a
      common schema (sketch below)
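
The deck does not show any interceptor code; as a rough illustration of the "check mandatory fields" step, here is a minimal sketch of a custom Flume interceptor written in Scala. The class names, header fields and configuration key are hypothetical, and the real pipeline may implement this with morphlines instead:

```scala
import java.util
import scala.collection.JavaConverters._
import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

// Hypothetical interceptor that drops events missing mandatory header fields.
class MandatoryFieldsInterceptor(required: Set[String]) extends Interceptor {
  override def initialize(): Unit = ()

  // Keep the event only if every mandatory field is present in its headers;
  // returning null tells Flume to drop the event.
  override def intercept(event: Event): Event = {
    val headers = event.getHeaders.asScala
    if (required.forall(headers.contains)) event else null
  }

  override def intercept(events: util.List[Event]): util.List[Event] =
    events.asScala.map(e => intercept(e)).filter(_ != null).asJava

  override def close(): Unit = ()
}

// Flume instantiates interceptors through a Builder configured from the agent properties.
class MandatoryFieldsInterceptorBuilder extends Interceptor.Builder {
  private var required: Set[String] = Set.empty

  override def configure(context: Context): Unit = {
    // e.g. agent.sources.s1.interceptors.i1.mandatoryFields = producer,type,timestamp
    required = context.getString("mandatoryFields", "")
      .split(",").map(_.trim).filter(_.nonEmpty).toSet
  }

  override def build(): Interceptor = new MandatoryFieldsInterceptor(required)
}
```
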
  12. Transport • Kafka is the rock-solid core of the transport layer • Data
      buffered for 12h (target is 72h) • Each data source in a separate topic • Each
      topic divided into 20 partitions • Two replicas per partition (sketch below)
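
The deck does not say how topics are provisioned; one way to express the layout above (one topic per data source, 20 partitions, two replicas, ~12h retention) is with Kafka's AdminClient, sketched here in Scala. The broker address and topic name are placeholders:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateMonitoringTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092") // placeholder broker

    val admin = AdminClient.create(props)
    try {
      // One topic per data source: 20 partitions, replication factor 2,
      // with retention roughly matching the 12h buffering mentioned above.
      val topic = new NewTopic("monit-metrics-collectd", 20, 2.toShort)
        .configs(Collections.singletonMap("retention.ms", (12L * 60 * 60 * 1000).toString))
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally {
      admin.close()
    }
  }
}
```
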
  13. Processing • [Diagram: Flume → Kafka (buffering) → processing jobs
      (enrichment, aggregation) → Flume]
  14. Processing • Stream processing jobs (Scala) • Enrichment jobs: e.g. topology
      metadata (see the sketch below) • Aggregation jobs: mostly over time •
      Correlation jobs: CPU load vs service activity • Batch processing jobs (Scala)
      • Compaction and reprocessing jobs • Monthly or yearly reports for management
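
The jobs themselves are not shown in the deck; below is a minimal Spark Structured Streaming sketch of the enrichment pattern described above (attach topology metadata to incoming metrics). Topic names, the record schema and the HDFS paths are assumptions, and the actual jobs may use a different Spark API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EnrichmentJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("metrics-enrichment").getOrCreate()
    import spark.implicits._

    // Assumed shape of an incoming metric record.
    val metricSchema = new StructType()
      .add("host", StringType)
      .add("metric", StringType)
      .add("value", DoubleType)
      .add("timestamp", TimestampType)

    // Static dimension table: host -> service/cluster mapping (topology metadata).
    val topology = spark.read.json("hdfs:///project/monitoring/topology/")

    // Read raw JSON metrics from Kafka and parse them into columns.
    val metrics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("subscribe", "monit-metrics-collectd")
      .load()
      .select(from_json($"value".cast("string"), metricSchema).as("m"))
      .select("m.*")

    // Stream-static join: every metric gets its topology metadata attached.
    val enriched = metrics.join(topology, Seq("host"), "left_outer")

    // Write the enriched records back to a Kafka topic for downstream consumers.
    enriched.selectExpr("to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("topic", "monit-metrics-enriched")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/enrichment")
      .start()
      .awaitTermination()
  }
}
```
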
  15. Processing • Jobs built and packaged as Docker images • Automated at every
      push with GitLab CI • Jobs orchestrated as processes on Mesos • Marathon for
      long-running processes (e.g. streaming) • Chronos for recurrent execution
      (e.g. batch) • Jobs executed on Mesos or YARN clusters
  16. Storage • HDFS for long-term archive and data recovery: data kept forever
      (limited by resources) • Elasticsearch for searches and data discovery: data
      kept for 1 month • InfluxDB for plots and dashboards: raw data kept for 7
      days, 5-minute bins kept for 1 month, 1-hour bins kept for 5 years
  17. Storage • All data written to HDFS by default, e.g.
      /project/monitoring/archive/fts/metrics/2017/05/22/ • Data aggregated once per
      day into ~1 GB files (sketch below) • Selected data sets stored in
      Elasticsearch and/or InfluxDB • Elasticsearch: two generic instances (metrics
      and logs), each data producer in a separate index • InfluxDB: one instance per
      data producer
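
As an example of how the once-per-day aggregation into ~1 GB files could be done (this corresponds to the compaction jobs mentioned on slide 14), here is a small Spark batch sketch in Scala. The input path, output partition count and file format are assumptions; only the archive path layout comes from the slide:

```scala
import org.apache.spark.sql.SparkSession

object DailyCompaction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fts-metrics-compaction").getOrCreate()

    val day = "2017/05/22"
    // Hypothetical staging area holding the many small files written during the day.
    val raw = spark.read.json(s"hdfs:///project/monitoring/raw/fts/metrics/$day/")

    // Rewrite the day's data as a handful of large files under the dated archive
    // directory shown on the slide; 8 partitions is a guess aimed at ~1 GB per file.
    raw.coalesce(8)
      .write
      .mode("overwrite")
      .json(s"hdfs:///project/monitoring/archive/fts/metrics/$day/")
  }
}
```
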
  18. Alarms • Based on an existing in-house toolset • Old implementation, more than
      5 years old • Local alarms from metric thresholds • Moving towards a
      multi-scope strategy • Local alarms from Collectd, to enable local actuators •
      Basic alarms from Grafana, easy integration • Advanced alarms from Spark jobs
      (user-specific, sketch below)
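
As an illustration of the "advanced alarms from Spark jobs" idea, here is a hedged sketch of a user-specific streaming job that flags hosts whose average CPU load stays above a threshold over a 5-minute window. The topic, field names and threshold are made up, and the real alarm jobs may look quite different:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object HighLoadAlarm {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("high-load-alarm").getOrCreate()
    import spark.implicits._

    // Assumed shape of an enriched metric record.
    val schema = new StructType()
      .add("host", StringType)
      .add("metric", StringType)
      .add("value", DoubleType)
      .add("timestamp", TimestampType)

    val metrics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("subscribe", "monit-metrics-enriched")
      .load()
      .select(from_json($"value".cast("string"), schema).as("m"))
      .select("m.*")

    // Average CPU load per host over 5-minute event-time windows and keep only
    // the windows that breach the (hypothetical) threshold.
    val alarms = metrics
      .filter($"metric" === "cpu_load")
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"host")
      .agg(avg($"value").as("avg_load"))
      .filter($"avg_load" > 10.0)

    // Console sink just for the sketch; a real job would feed the alarm toolset.
    alarms.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/high-load-alarm")
      .start()
      .awaitTermination()
  }
}
```
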
  19. Monitoring Infrastructure • OpenStack VMs: 60 Flume, 20 Kafka, 10 Spark, 50
      Elasticsearch • Some nodes with Ceph volumes attached • Physical nodes: only
      used for InfluxDB and HDFS • All configuration done via Puppet
  20. Lessons Learned • Protocol-based Flume agents: extremely handy, very easy to
      add new sources • Kafka is the core of the “business”: careful resource
      planning (topic/partition split) is needed, and controlling consumer groups
      and offsets is key • Securing the infrastructure takes time
  21. Lessons Learned • Scala was the right choice for Spark • Keep batch and
      streaming code as close as possible (sketch below) • DataFrames to decouple
      from JSON (un)marshalling • Store checkpoints on HDFS • Very positive
      experience with Marathon • InfluxDB and Elasticsearch: tried to make them as
      complementary as possible
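
To make the "keep batch and streaming code as close as possible" point concrete, here is a small sketch in the spirit of the deck: the transformation is written once as a DataFrame => DataFrame function and reused by both a batch reprocessing run and a streaming run, with checkpoints stored on HDFS. All paths and column names are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SharedTransform {
  // Same logic whether `df` is a static or a streaming DataFrame.
  def normalise(df: DataFrame): DataFrame =
    df.filter(col("host").isNotNull)
      .withColumn("value", col("value").cast("double"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("shared-transform").getOrCreate()
    val archive = "hdfs:///project/monitoring/archive/fts/metrics/2017/05/22/"

    // Batch reprocessing over the archived JSON.
    val batch = normalise(spark.read.json(archive))
    batch.write.mode("overwrite").parquet("hdfs:///tmp/fts-metrics-reprocessed/")

    // Streaming path: the same function applied to a streaming read of the same
    // directory; checkpoints go to HDFS as recommended on the slide.
    val stream = normalise(spark.readStream.schema(batch.schema).json(archive))
    stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/fts-metrics-stream/")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/fts-normalise")
      .start()
      .awaitTermination()
  }
}
```
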