
Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop - Piotr Turek

Data KRK #10
18/06/2015



Transcript

1. Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop. There and back again / Piotr Turek @rekurencja
2. How did it all come to be? How will it end? What is the fundamental structure of space and time?
  3. "Somewhere, something incredible is "Somewhere, something incredible is waiting to

    be known" waiting to be known" ― Carl Sagan ― Carl Sagan
4. Does size matter? Not always ;) Particle Physics is born.
5. Can you see him now? 4 stories high, 14,000 tons, +100 m underground.
6. One small bottle for a man... Mind-boggling fact: one bottle can last for many months. 0.2 nanogram / day ~ 2 red blood cells / day.
7. Mind-boggling facts continued... 10 km/h slower than light. -271.3°C (1.9 K): ~the coldest place in the Universe. The total kinetic energy of a train (Beam 1, Beam 2).
8. Eventually... the trains collide and data starts to pour in. Lots of data ;) ~600 million times a second.
9. How much data is too much? 1 MB × 1,000,000,000 events/s × 3600 s × 15 (hours) × 100 days = 1 petabyte/s × 54,000 s × 100 days = ... (worked out below)
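The slide leaves the total as a teaser; a minimal back-of-the-envelope in Python, using only the sizes and duty cycle quoted on the slide, works it out:

    MB = 10**6                   # bytes
    event_size = 1 * MB          # ~1 MB per event
    rate = 10**9                 # 1,000,000,000 events per second
    seconds_per_day = 3600 * 15  # 15 hours of beam per day = 54,000 s
    days = 100

    per_second = event_size * rate             # 10**15 B/s = 1 petabyte/s
    total = per_second * seconds_per_day * days
    print(total / 10.0**21, "zettabytes")      # -> ~5.4 ZB, hopelessly unstorable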
10. Noise. Your data is probably full of it. If in doubt, filter it out.
11. The LHC Trigger System: custom-built hardware (FPGAs). 1 GHz interaction rate, 40 MHz resolution, <100 kHz output.
12. The LHC Trigger System (2): software filtering and reconstruction takes the <100 kHz input down to ~200 events/s (see the arithmetic below).
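Not on the slides, but the quoted rates imply a staggering rejection factor; a quick sanity check in Python:

    interaction_rate = 1e9   # ~1 GHz of interactions
    l1_output = 1e5          # <100 kHz after the hardware (FPGA) trigger
    hlt_output = 200.0       # ~200 events/s after software filtering

    print(interaction_rate / l1_output)   # hardware stage: ~10,000x reduction
    print(l1_output / hlt_output)         # software stage: ~500x reduction
    print(interaction_rate / hlt_output)  # overall: ~5,000,000x fewer events kept
    # At ~1 MB per event, ~200 events/s is a manageable ~200 MB/s to storage.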
13. Let's do something useful with data, a.k.a. offline analysis.
14. Let's assume... I'm a physicist @ the TOTEM experiment. What is my typical use-case when working with the data?
15. Typical use-case: Out of all data from run X, give me only specific events that fulfill certain criteria. I will analyse this sample manually. Rinse and repeat.
16. The old way: 1. Write custom scripts 2. Submit a job to lxbatch 3. Select only events that satisfy the criteria → filtered sample. Files are hundreds of MB to many GB each; up to a couple of TB involved.
17. Problem #1: latency variability. Two-tier storage: 1. disk-based hot store 2. tape-based cold store.
18. Problem #1: latency variability.
Job 1: 1. CASTOR, give me files 1..100 2. Downloading file 1 from disk 3. ... 4. Downloading file 99 from disk 5. Bad luck, file 100 is on a tape 6. File 100 loaded onto a disk 7. Downloading file 100
Sometime later...
Job 2: 1. CASTOR, give me files 1..100 2. ... 3. ... 4. ... 5. Bad luck, file 100 is on a tape, again
19. Problem #2: work distribution. CASTOR serves 20 files (of different sizes) to 40 available workers → sub-optimal performance, poor resource utilization.
20. Problem #3: failure intolerance. Batch jobs like to fail, and when they do... it's completely up to you.
21. Problem #4: the funny one. ROOT Data Analysis Framework: "A cornerstone of High Energy Physics software"
22. Problem #4: the funny one. 15 years of development, 1,762,865 lines of code, 46,308 commits. Object-oriented libraries for: data analysis, statistics, visualization, simulation, reconstruction, event display, DAQ. C++ Interpreter Suite: CINT, the interpreter (close enough to standard C++, with extensions, rich RTTI, some syntactic sugar); ACLiC, the automatic compiler.
23. Why it's the best idea ever: the command language, the scripting language, and the programming language are all C++. Feature-rich, extremely performant, specialized storage formats.
24. Why it's the worst idea ever: the command language, the scripting language, and the programming language are all C++.
  25. "C makes it easy to shoot yourself in the foot;

    "C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it C++ makes it harder, but when you do, it blows away your whole leg." blows away your whole leg." ― Bjarne Stroustrup ― Bjarne Stroustrup "Especially, when you use it as an interpreted "Especially, when you use it as an interpreted language with reflection." language with reflection." ― Captain Obvious ― Captain Obvious
26. Key assumptions of happy analysis: 1. Load once, analyze many times 2. Optimal granularity of jobs 3. Scalable 4. Little network and I/O overhead 5. Failure tolerant 6. Takes care of 2-5 automatically 7. Requires me to write less code
27. The new way: Create a cluster of machines (single click). Request files from CASTOR to be loaded onto the analysis cluster. The system automatically loads and distributes the files. Diagram: CASTOR, 20 files (different sizes) → EOS, evenly sized (small) chunks across 20 workers.
28. The new way: Request files stored on the cluster to be processed (paths to files on CASTOR). Declare selection logic. The system automatically processes the files. Rinse and repeat.
29. Building blocks: Hadoop 2. "Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware." Fault tolerant, scalable, designed for data locality. HDFS: a distributed file system that provides high-throughput access to application data. MapReduce: a YARN-based system for parallel processing of large data sets.
30. Building blocks: OpenStack. "OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard."
31. OpenStack vs AWS: Amazon EC2 vs Nova; Amazon S3 vs Swift; Elastic Block Storage vs Cinder; Amazon VPC vs Neutron; Amazon CloudWatch vs Ceilometer; Elastic MapReduce vs Sahara; AWS Console vs Horizon.
32. Building blocks: Sahara. "Sahara is an OpenStack data processing plugin, which provides a simple means to provision a Hadoop cluster on top of OpenStack." Template, launch, and manage Hadoop clusters with a single click (or a command); add/remove nodes; submit, execute, and track Hadoop jobs.
33. Are we done with the infrastructure then? NO.
34. Challenge: CERN's OpenStack has no Sahara. How do you use Sahara on an OpenStack that does not support Sahara?
35. Solution: You need to go deeper. Run a separate Horizon on your host, plus Sahara with changes to authentication.
36. Lesson learned: OpenStack is flexible: it has a nice Python code base <3, clean APIs, and is easy to jump into.
37. A riddle: what is it? It's a hipster.
38. Solution: Prepare your own images. We used the CERN image builders; the Oz tool is cool → upload the result to Glance. Your Oz customization file may look like this:
<template>
  <name>SLC6 Sahara Icehouse CERN Server - x86_64</name>
  <description>SLC6 Server with Cern-additions: AFS, Kerberos, user accounts, ...</description>
  <os>
    <name>SLC-6</name>
    <version>5</version>
    <arch>x86_64</arch>
    <install type='iso'>
      <iso>http://linuxsoft.cern.ch/cern/slc65/iso/SLC_6.5_x86_64_dvd.iso</iso>
    </install>
  </os>
  <packages>
    <package name='virt-what'/>
    (...)
  </packages>
  <files>
    <file name='/etc/init.d/firstboot_diskresize' type='raw'>
      #!/bin/sh
      (...)
    </file>
    (...)
  </files>
  <commands>
    <command name='time-sync'>
      # set up cron job to synchronize time
      (...)
    </command>
    (...)
  </commands>
</template>
39. Challenge: Cluster provisioning fails often.
You: Sahara, give me 20 machines
Sahara: Nova, launch machine no. 1
Sahara: Nova, launch machine no. 2
...
Sahara: I'm waiting for all to be Active before configuring them
Sahara: 6 failed, rolling back all!
You: Oh, for God's sake!
Or even worse:
...
Sahara: I'm waiting for all to be Active before configuring them
... waits forever
40. Solution: First try. A modified Direct Engine: a timeout for launching machines, simple retries for failed machines, removal of completely failed machines.
...
Sahara: Cluster provisioned. Machines requested: 20. Machines succeeded: 5
You: What the...
41. Solution: Exponential backoff. The sleeping delay is a randomized, exponential function of the retry count (see the sketch below).
...
Sahara: Cluster provisioned. Machines requested: 20. Machines succeeded: 18
You: Thanks!
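A minimal sketch of that retry policy, full-jitter exponential backoff; the function names and parameters are illustrative, not Sahara's actual code:

    import random
    import time

    def launch_with_backoff(launch, max_retries=5, base=1.0, cap=60.0):
        # launch: a callable that asks Nova for a machine and raises on failure
        for attempt in range(max_retries + 1):
            try:
                return launch()
            except Exception:
                if attempt == max_retries:
                    raise
                # sleep a random delay in [0, base * 2^attempt], capped at `cap`
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Randomizing the delay spreads the retries of many workers apart, so a struggling Nova is not hammered in lockstep.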
42. Lesson learned: Be nice to systems you depend on... they will thank you with a 200.
43. How to load the data using Hadoop: we need a map-only job. Map tasks pull the 20 files (of different sizes) from CASTOR/EOS and write evenly sized (small) chunks to HDFS across the 20 workers.
44. How to load the data using Hadoop (2): the input is a list of paths (path 1, path 2, path 3, ...) on CASTOR/EOS; each map task (C++) takes a path and writes out files sized to the HDFS block (see the sketch below).
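The shape of that load step, as a toy Hadoop Streaming mapper in Python (the real map tasks ran C++ code; xrdcp and hdfs dfs are real tools, but the paths and job wiring here are illustrative assumptions):

    #!/usr/bin/env python
    # Input records are CASTOR/EOS file paths; the job runs with zero reducers
    # (-D mapreduce.job.reduces=0), so each mapper just copies its files into HDFS.
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        local = "/tmp/" + path.rsplit("/", 1)[-1]
        subprocess.check_call(["xrdcp", path, local])            # fetch from CASTOR/EOS
        subprocess.check_call(["hdfs", "dfs", "-put", local, "/data/"])
        print("%s\tloaded" % path)                               # (path, status) record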
45. TTree: the Apache Parquet of HEP. Events: one row per event. Memory layouts: row-oriented vs column-oriented. Compression unit per column. Read only the data you need. Much harder to partition evenly ;)
46. Lesson learned: columnar storage formats are great... give Apache Parquet a try (example below).
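A small illustration of the columnar win with Parquet via pyarrow; the file and column names are made up for the example:

    import pyarrow.parquet as pq

    # A columnar reader touches only the bytes of the requested columns,
    # instead of decoding every full row the way a row store must.
    table = pq.read_table("events.parquet", columns=["pt", "eta"])
    print(table.num_rows, table.column_names)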
47. How to filter the data using Hadoop: again, a map-only job*, fed with paths to files on CASTOR and fanned out over map tasks.
48. How to filter the data using Hadoop (2): path 1, path 2, path 3, ... are handed to Map task 1, Map task 2, ..., Map task N, each running the C++ selection (...) (toy version below).
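A toy Streaming mapper showing the filtering shape (the real selection logic was C++/ROOT on TTrees; the CSV layout and the cut below are invented for the example):

    #!/usr/bin/env python
    # Map-only filter: emit only the events that fulfill the declared criteria.
    # Assume each input line is a CSV event: run,pt,eta,...
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        run, pt = fields[0], float(fields[1])
        if run == "4711" and pt > 20.0:    # "declare selection logic"
            print(line.rstrip("\n"))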
49. Challenge: It works too fast. SELECT two columns out of 100 ... WHERE "complex criteria": the map task takes ~6s. SELECT * ... WHERE "1=1": the map task takes ~80s. Execution time depends on: the amount of data read, the amount of data produced, and the CPU-heaviness of the selection criteria. Increase the HDFS block size?
50. Solution: Optimize each query by splitting the job in two (sketch below).
Learning phase: 1. select a small sample of the input 2. run the job 3. calculate the average time of a map task; r_heaviness = t_requested / t_avg.
Mature phase: use CombineWholeFileInputFormat with maxInputSplitSize = r_heaviness * blockSize.
Result: filtering up to 100 times faster than loading.
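The arithmetic of that tuning step in Python; the ~60 s target echoes the "heavier than 30s" rule on the next slide, and the concrete numbers are illustrative:

    BLOCK_SIZE = 128 * 1024 * 1024   # assume a 128 MB HDFS block

    def max_input_split_size(t_requested, t_avg, block_size=BLOCK_SIZE):
        # t_avg: measured average map-task time on the small learning sample
        r_heaviness = t_requested / float(t_avg)
        return int(r_heaviness * block_size)

    # A light query whose tasks finish in ~6 s per block gets splits ten
    # blocks wide, so each task runs for the requested ~60 s instead:
    print(max_input_split_size(60, 6))   # -> 10 x the block size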
51. Lesson learned: Hadoop is not a low-latency framework... make your tasks heavier than 30s.
52. Did it make any sense in the end? YES: much more performant, much more scalable, little to no code required. But: some parts are missing, change comes slowly, and so do resources.