Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop - Piotr Turek

Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop
There and back again / Piotr Turek @rekurencja

Where does the story begin?

How did it all come to be?
How will it end?
What is the fundamental structure of space and time?

"Somewhere, something incredible is waiting to be known"
― Carl Sagan

Does size matter?
Not always ;)
Particle Physics is born

1954
A lot of...

2009

Can you see him now?
4 stories high
14,000 tons
+100 m underground

Let's smash some hadrons!
Proton, Proton

One small bottle for a man...
Mind boggling fact: one bottle can last for many months
0.2 nanogram / day ~ 2 red blood cells / day

Accelerating Science
0.999999991 c

Mind boggling facts continued...
10 km/h slower than light
-271.3°C (1.9K) ~coldest place in the Universe
Total kinetic energy of a train
Beam 1, Beam 2

Eventually...
The trains collide and data starts to pour in
lots of data ;)
~600 million times a second

How much data is too much?
1MB * 1,000,000,000 events / s * 3600s * 15 (hours) * 100 days =
1 petabyte / s * 54,000 s * 100 days = ...
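The product above can be carried through as a quick sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes, 1 ZB = 10^21 bytes); the variable names are illustrative and not from the talk.

```python
# Back-of-the-envelope data volume, using the figures quoted on the slide.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 PB = 10**15 bytes).
MB = 10**6
PB = 10**15
ZB = 10**21

event_size_bytes = 1 * MB
events_per_second = 1_000_000_000
seconds_per_day = 3600 * 15          # 15 hours of beam per day = 54,000 s
days = 100

rate_bytes_per_s = event_size_bytes * events_per_second
total_bytes = rate_bytes_per_s * seconds_per_day * days

print(f"rate:  {rate_bytes_per_s // PB} PB/s")
print(f"total: {total_bytes / ZB} ZB")
```

Under these assumptions the unfiltered total comes to about 5.4 zettabytes, which is why the filtering described next is unavoidable.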

Noise.
Your data is probably full of it
If in doubt, filter it out

The LHC Trigger System
Custom-built hardware (FPGAs)
1GHz (interaction rate)
40MHz resolution
<100kHz

The LHC Trigger System (2)
<100kHz
Software filtering and reconstruction
~200 events / s
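To put the two trigger stages in perspective, here is a small sketch of the rejection factors implied by the quoted rates. Because the hardware output is given as an upper bound (<100 kHz), the first and last factors are lower bounds and the software factor is an upper bound; the variable names are made up for illustration.

```python
# Rates quoted on the trigger slides, in events per second.
interaction_rate_hz = 1_000_000_000   # ~1 GHz interaction rate
hardware_output_hz = 100_000          # hardware trigger passes fewer than 100 kHz
software_output_hz = 200              # ~200 events/s survive software filtering

hardware_rejection = interaction_rate_hz // hardware_output_hz
software_rejection = hardware_output_hz // software_output_hz
overall_rejection = interaction_rate_hz // software_output_hz

print(f"hardware stage keeps at most 1 in {hardware_rejection:,}")
print(f"software stage keeps roughly 1 in {software_rejection:,}")
print(f"overall: roughly 1 in {overall_rejection:,}")
```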

petabytes / year
Reconstructed Event

Let's do something useful with data
a.k.a. offline analysis

Let's assume...
I'm a physicist @ TOTEM experiment
What is my typical use-case working with the data?

Typical use-case
Out of all data from run X, give me only specific events that fulfill certain criteria.
I will analyse this sample manually
Rinse and Repeat

Remember this slide?

The old way
1. Write custom scripts
2. Submit a job to lxbatch
3. Select only events that satisfy criteria
Files: hundreds of MB to many GB each
Filtered sample
Up to a couple TB involved

Problem #1: latency variability
Camel Distribution
warning: xkcd graphs ;)

Problem #1: latency variability
Two-tier storage
1. Disk-based hot store
2. Tape-based cold store

Problem #1: latency variability
Job 1
1. CASTOR, give me files 1..100
2. Downloading file 1 from disk
3. ...
4. Downloading file 99 from disk
5. Bad luck, file 100 is on a tape
6. File 100 loaded onto a disk
7. Downloading file 100
Sometime later...
Job 2
1. CASTOR, give me files 1..100
2. ...
3. ...
4. ...
5. Bad luck, file 100 is on a tape, again

Problem #2: work distribution
CASTOR: 20 files (different sizes)
40 workers available
sub-optimal performance
poor resource utilization
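The imbalance can be illustrated with a toy scheduling sketch. It assumes processing time is proportional to file size and uses made-up file sizes (the talk gives no distribution); `makespan` and `split_into_chunks` are hypothetical helpers, not part of any CERN tool.

```python
import heapq


def makespan(task_sizes, workers):
    """Greedy schedule: give each task to the least-loaded worker, return the finish time."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for size in sorted(task_sizes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + size)
    return max(loads)


def split_into_chunks(size, chunk):
    """Split one file into pieces of at most `chunk` units."""
    full, rest = divmod(size, chunk)
    return [chunk] * full + ([rest] if rest else [])


# Hypothetical sizes (in GB) for 20 files of different sizes.
file_sizes = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 8, 8, 10, 10, 12, 12, 16, 20]
workers = 40

# One task per file: at most 20 of the 40 workers ever get work.
per_file = makespan(file_sizes, workers)

# One task per 1 GB chunk: work spreads over all workers.
chunks = [c for size in file_sizes for c in split_into_chunks(size, 1)]
per_chunk = makespan(chunks, workers)

total = sum(file_sizes)
print(f"per file:  finish at {per_file}, utilization {total / (workers * per_file):.0%}")
print(f"per chunk: finish at {per_chunk}, utilization {total / (workers * per_chunk):.0%}")
```

With whole files as the unit of work, the largest file dictates the finish time and half the workers sit idle; small, evenly sized chunks remove both effects.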

Problem #3: failure (in)tolerance
Batch jobs like to fail
and when they do ...
... it's completely up to you

Problem #4: the funny one
ROOT Data Analysis Framework
"A cornerstone of High Energy Physics software"

Problem #4: the funny one
15 years of development
1,762,865 lines of code
46,308 commits
Object-oriented libraries for: data analysis, statistics, visualization, simulation, reconstruction, event display, DAQ
C++ Interpreter Suite
CINT - the interpreter: close enough to standard C++; extensions: rich RTTI, some syntactic sugar
ACLiC - automatic compiler

Why it's the best idea ever
the command language, the scripting language, the programming language are all C++
Feature rich
Extremely performant
Specialized storage formats

Why it's the worst idea ever
the command language, the scripting language, the programming language are all C++

"C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows away your whole leg."
― Bjarne Stroustrup
"Especially, when you use it as an interpreted language with reflection."
― Captain Obvious

Key assumptions of happy analysis
1. Load once, analyze many times
2. Optimal granularity of jobs
3. Scalable
4. Little network and I/O overhead
5. Failure tolerant
6. Takes care of 2-5 automatically
7. Requires me to write less code

The new way
Create a cluster of machines (single click)
Request files from CASTOR to be loaded onto the analysis cluster
System automatically loads and distributes the files
[Diagram: CASTOR, EOS; 20 files (different sizes); 20 workers; evenly sized chunks (small)]

The new way
Request files stored on the cluster to be processed
Declare selection logic
System automatically processes the files
Rinse and Repeat
[Diagram: paths to files on CASTOR]

Overview of architecture Overview of architecture

Building blocks: Hadoop 2
"Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware."
fault tolerant
scalable
designed for data locality
HDFS: A distributed file system that provides high-throughput access to application data.
MapReduce: A YARN-based system for parallel processing of large data sets.

Building blocks: OpenStack
"OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard"

OpenStack vs AWS
Amazon EC2 vs Nova
Amazon S3 vs Swift
Elastic Block Storage vs Cinder
Amazon VPC vs Neutron
Amazon CloudWatch vs Ceilometer
Elastic MapReduce vs Sahara
AWS Console vs Horizon

Building blocks: Sahara
"Sahara is an OpenStack data processing plugin, which provides a simple means to provision a Hadoop cluster on top of OpenStack."
template, launch, manage Hadoop clusters with a single click (or a command)
add / remove nodes
submit, execute and track Hadoop jobs

Building blocks: Sahara

Are we done with the infrastructure then?
NO

Challenge: CERN's OpenStack has no Sahara
How do you use Sahara on OpenStack that ...
... does not support Sahara?

Solution: You need to go deeper
Separate Horizon on your host
Sahara with changes to authentication

Lesson learned: OpenStack is flexible
has nice Python code base <3
clean APIs
easy to jump into

A riddle: What is it?
It's a hipster

Challenge: Hadoop on exotic distro

Solution: Prepare your own images
Used CERN image builders
OZ tool is cool -> Upload to Glance
Your OZ customization file may look like this:
    <template>
      <name>SLC6 Sahara Icehouse CERN Server - x86_64</name>
      <description>SLC6 Server with Cern-additions: AFS, Kerberos, user accounts, ... and (...)</description>
      <os>
        <name>SLC-6</name>
        <version>5</version>
        <arch>x86_64</arch>
        <install type='iso'>
          <iso>http://linuxsoft.cern.ch/cern/slc65/iso/SLC_6.5_x86_64_dvd.iso</iso>
        </install>
      </os>
      <packages>
        <package name='virt-what'/>
        (...)
      </packages>
      <files>
        <file name='/etc/init.d/firstboot_diskresize' type='raw'>
          #!/bin/sh
          (...)
        </file>
        (...)
      </files>
      <commands>
        <command name='time-sync'>
          # set up cron job to synchronize time
          (...)
        </command>
        (...)
      </commands>
    </template>

Lesson learned: Debugging VM images is difficult

Challenge: Cluster provisioning fails often
You: Sahara, give me 20 machines
Sahara: Nova, launch machine no1
Sahara: Nova, launch machine no2
...
Sahara: I'm waiting for all to be Active, before configuring them
Sahara: 6 failed, rolling back all!
You: Oh, for God's sake!
Or even worse!
...
Sahara: I'm waiting for all to be Active, before configuring them
... waits forever

Solution: First try
Modified Direct Engine:
timeout for launching machines
simple retries for failed machines
removes completely failed machines
...
Sahara: Cluster provisioned. Machines requested: 20. Machines succeeded: 5
You: What the...

Solution: Exponential Backoff
Sleeping delay is a randomized, exponential function of retry count
...
Sahara: Cluster provisioned. Machines requested: 20. Machines succeeded: 18
You: Thanks!
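A minimal sketch of a randomized exponential backoff follows. It is not Sahara's actual code: `backoff_delay`, `launch_with_retries`, the base of 1 s, the 60 s cap and the "full jitter" choice (uniform between 0 and the exponential bound) are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(retry, base=1.0, cap=60.0, rng=random):
    """Sleep time in seconds: uniform random in [0, min(cap, base * 2**retry)]."""
    return rng.uniform(0, min(cap, base * 2 ** retry))


def launch_with_retries(launch, max_retries=5, sleep=time.sleep, rng=random):
    """Call launch() until it succeeds; return its result, or None if retries run out."""
    for retry in range(max_retries + 1):
        try:
            return launch()
        except RuntimeError:
            if retry == max_retries:
                return None
            sleep(backoff_delay(retry, rng=rng))


# Demo with a fake launcher that fails twice (it stands in for a machine boot
# request); delays are recorded in a list instead of actually being slept.
attempts = []


def flaky_launch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("machine entered ERROR state")
    return "vm-1"


delays = []
result = launch_with_retries(flaky_launch, sleep=delays.append, rng=random.Random(0))
print(result, len(attempts), delays)
```

The randomization matters as much as the exponent: if 20 launches fail together and all retry after the same fixed delay, they hit the overloaded service together again.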

Lesson learned: Be nice to systems you depend on ...
... They will thank you with a 200

How to load the data using Hadoop
[Diagram: CASTOR, EOS; 20 files (different sizes); Map tasks; 20 workers; evenly sized chunks (small); HDFS]
We need a map-only job

How to load the data using Hadoop
[Diagram: CASTOR, EOS; path 1, path 2, path 3, ...; Map task 1 (path 1), Map task 2 (path 2), Map task 3 (path 3); C++; File 1; size = HDFS block]
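One way to picture "size = HDFS block" is to plan event ranges whose estimated size fits a single block. This is a sketch under stated assumptions, not the talk's C++ loader: `plan_chunks` is a hypothetical helper, and the 1 MiB average event size and 128 MiB block size are example values.

```python
def plan_chunks(n_events, avg_event_bytes, block_bytes):
    """Return [start, stop) event ranges whose estimated size fits in one block."""
    events_per_chunk = max(1, block_bytes // avg_event_bytes)
    return [
        (start, min(start + events_per_chunk, n_events))
        for start in range(0, n_events, events_per_chunk)
    ]


MiB = 2**20
# Assumed values: 1000 events of ~1 MiB each, 128 MiB HDFS blocks.
chunks = plan_chunks(n_events=1000, avg_event_bytes=1 * MiB, block_bytes=128 * MiB)

print(len(chunks), chunks[0], chunks[-1])
```

Each range would then be written out as its own file, so that one file occupies roughly one HDFS block and later map tasks read only local data.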

TTree: Apache Parquet of HEP
Events: row per event
Memory layouts: Row oriented, Column oriented
Compression unit per column
Read only the data you need
Much harder to partition evenly ;)
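The difference between the two memory layouts can be shown with plain Python containers. The field names below are invented for the example; real TTree branches and Parquet schemas differ.

```python
# Three toy events with hypothetical fields.
events = [
    {"run": 1, "energy": 10.5, "tracks": 3},
    {"run": 1, "energy": 7.25, "tracks": 1},
    {"run": 2, "energy": 12.0, "tracks": 4},
]

# Row-oriented: one record per event, all fields stored together.
rows = [(e["run"], e["energy"], e["tracks"]) for e in events]

# Column-oriented: one array per field, each one compressible on its own.
columns = {name: [e[name] for e in events] for name in events[0]}

# Reading one field touches a single column instead of every record.
mean_energy = sum(columns["energy"]) / len(columns["energy"])

print(rows)
print(columns)
print(mean_energy)
```

The same property explains the last bullet: a column holds values from every event, so cutting the file into evenly sized, self-contained pieces is harder than slicing a list of rows.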

Lesson learned: Columnar storage formats are great ...
... give Apache Parquet a try

How to filter the data using Hadoop
[Diagram: paths to files on CASTOR; Map tasks]
We need a map-only job*

How to filter the data using Hadoop
[Diagram: path 1, path 2, path 3, ...; C++; Map task 1, Map task 2, Map task N (...)]

Challenge: It works too fast
SELECT two columns out of 100 ... WHERE "complex criteria"
SELECT * ... WHERE "1=1"
Map task takes ~6s
Map task takes ~80s
Execution time depends on:
amount of data read
amount of data produced
cpu-heaviness of selection criteria
Increase the HDFS block size?

Solution: Optimize each query
Split the job in two:
Learning phase
1. Select a small sample of input
2. Run the job
3. Calculate avg time of map-task
r_heaviness = t_requested / t_avg
Mature phase
Use CombineWholeFileInputFormat
maxInputSplitSize = r_heaviness * blockSize
Result: Filtering up to 100 times faster than loading
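The two formulas above can be sketched as follows. The function names, the 60 s target task time, the sample timings and the 128 MiB block size are assumptions for illustration; only the formulas themselves come from the slide.

```python
def heaviness_ratio(t_requested_s, sample_task_times_s):
    """r_heaviness = t_requested / t_avg, with t_avg measured in the learning phase."""
    t_avg = sum(sample_task_times_s) / len(sample_task_times_s)
    return t_requested_s / t_avg


def max_input_split_size(t_requested_s, sample_task_times_s, block_size_bytes):
    """maxInputSplitSize = r_heaviness * blockSize, rounded down to whole bytes."""
    ratio = heaviness_ratio(t_requested_s, sample_task_times_s)
    return int(ratio * block_size_bytes)


block = 128 * 1024 * 1024  # assumed HDFS block size

# Light query: sample map tasks took ~6 s, so pack about 10 blocks into one split.
light = max_input_split_size(60, [5, 6, 7], block)

# Heavy query: sample map tasks took ~80 s, so a split shrinks below one block.
heavy = max_input_split_size(60, [75, 80, 85], block)

print(light // block, heavy / block)
```

The learning phase costs one small extra run per query, but it lets the mature phase size its splits so that each map task runs for about the requested time.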

Lesson learned: Hadoop is not a low latency framework
... make your tasks heavier than 30s

Did it make any sense in the end?
YES
Much more performant
Much more scalable
Little to no code required
, but
Some parts missing
change comes slowly
resources as well

What is the moral of this story?

There are stories to tell, go create them.

Thank You
turu-on-things.com
@rekurencja
