Slide 1

Ruby and the World Record Pi Calculation
2024-05-17
Emma Haruka Iwao, Software Engineer

Slide 2

https://cloud.google.com/blog/products/compute/calculating-31-4-trillion-digits-of-archimedes-constant-on-google-cloud
https://cloud.google.com/blog/products/compute/calculating-100-trillion-digits-of-pi-on-google-cloud

Slide 3

Why Calculate Pi?

Slide 4

"Human progress in calculation has traditionally been measured by the number of decimal digits of π..." The Art of Computer Programming, Volume 2, Third Edition, Donald E. Knuth, 1997

Slide 5

No content

Slide 6

Pi is a popular PC benchmark
● Super PI (1995), up to 16.7 million digits
● PiFast (1997), up to 16 billion digits
● y-cruncher (2009), up to 108 quadrillion (10^15) digits

Chart: Time to calculate 1 billion digits
https://hwbot.org/benchmark/y-cruncher_-_pi-1b/

Slide 7

Maximizing within Constraints

Slide 8

Using y-cruncher
● Developed by Alexander Yee, started when he was in high school
● Fastest program to calculate pi on a single-node computer
● Written in C++ with hand optimizations for modern CPUs
● You need a fast computer with a lot of memory and storage
  ○ 468 TiB for 100 trillion digits
  ○ Too big to fit in DRAM

Slide 9

Storage is the bottleneck
● Storage is orders of magnitude slower than the CPU.
● CPU speed isn't very important.
● The 100 trillion digit calculation took 157 days, moving 62.8 PiB of data.
● The average CPU utilization was 35%.
● With an infinitely fast CPU, it'd still take more than 100 days.
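A quick sanity check on those figures (a rough sketch; it assumes 1 PiB = 2^50 bytes and takes the 62.8 PiB and 157-day wall time from the slide):

```ruby
# Back-of-the-envelope: average I/O throughput implied by the slide's
# numbers (62.8 PiB moved over 157 days of wall time).
bytes_moved = 62.8 * 2**50          # 1 PiB = 2**50 bytes
seconds     = 157 * 24 * 60 * 60    # 157 days

avg_gib_per_s = bytes_moved / seconds / 2**30
puts format('sustained I/O: %.1f GiB/s', avg_gib_per_s)
```

That works out to roughly 4.9 GiB/s sustained around the clock for five months, which is why the tuning effort focused on storage rather than the CPU.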

Slide 10

Solving the puzzle
● We're using Google Compute Engine (GCE)
● Maximum Persistent Disk per GCE VM: 257 TiB
● Storage we need: 500 TiB
● Need network storage: iSCSI
  ○ Block device over TCP/IP, provided by the OS
● Network throughput limit: 100 Gbps

Slide 11

No content

Slide 12

Architecture diagram:
Compute Node VM (Linux): y-cruncher → I/O Scheduler → iSCSI Initiator → TCP/IP → Virtual NIC
    ↕ Cloud Network
Storage Node VM (Linux): Virtual NIC → TCP/IP → iSCSI Target → Filesystem → I/O Scheduler → Storage

Slide 13

Configurable parameters
● Filesystem: ext4, xfs, …
● I/O scheduler: mq-deadline, none
● TCP/IP parameters: buffer size, congestion algorithm
● iSCSI parameters: queue depth, outstanding requests
● Simultaneous multithreading
● y-cruncher: bytes / seek
● Cloud specific: instance type, Persistent Disk type

Slide 14

Every tuning goes a long way
If something is 1% faster, it could save a day of computation time.
y-cruncher has a benchmark mode. Each run takes 30-60 minutes.
Can we automate it?

Slide 15

Automating Most of the Things

Slide 16

y-cruncher's config file

FarMemoryConfig : {
    Framework : "disk-raid0"
    InterleaveWidth : 262144
    BufferPerLane : 134217728
    Checksums : "true"
    RawIO : "true"
    Lanes : [
        {   // Lane 0
            Path : "/mnt/disk0"
            BufferAllocator : {
                Allocator : "interleave-libnuma"
                LockedPages : "attempt"
                Nodes : [1]
            }
            WorkerThreadCores : [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
            WorkerThreadPriority : 2
        }
        {   // Lane 1
            Path : "/mnt/disk1"
            BufferAllocator : {
                Allocator : "interleave-libnuma"
                LockedPages : "attempt"
                Nodes : [1]
            }
            WorkerThreadCores : [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
            WorkerThreadPriority : 2
        }
        {   // Lane 2
            Path : "/mnt/disk2"
            BufferAllocator : {
                Allocator : "interleave-libnuma"

Slide 17

ERB to the rescue!
● ERB is a template engine.
● Text inside <% %> runs as Ruby code.
● <%= %> replaces the block with the code output.

Example:
Template: <%= 'Hello, World' %> こんにちは、世界!
Output:   Hello, World こんにちは、世界!
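The slide's example as a runnable snippet (nothing here is y-cruncher-specific):

```ruby
require 'erb'

# <%= expr %> is replaced by the value of expr; everything else,
# including the Japanese text, is copied through verbatim.
template = ERB.new("<%= 'Hello, World' %> こんにちは、世界!")
puts template.result   # prints: Hello, World こんにちは、世界!
```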

Slide 18

https://rubykaigi.org/2024/presentations/m_seki.html#day3

Slide 19

Two ways to use ERB
1. Command line: the erb command
2. The ERB class from code

<%# hello.txt.erb %>
Hello, <%= location %>!

> erb location=Okinawa hello.txt.erb
Hello, Okinawa!

require 'erb'
location = 'Okinawa'
file = ERB.new(File.read('hello.txt.erb'))
puts file.result(binding)
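ERB templates can receive variables either through a Binding or, since Ruby 2.6, as an explicit hash via ERB#result_with_hash. A self-contained sketch of both (the template string mirrors hello.txt.erb from the slide):

```ruby
require 'erb'

template = ERB.new('Hello, <%= location %>!')

# Option 1: evaluate against the caller's local variables via its binding.
location = 'Okinawa'
puts template.result(binding)                        # Hello, Okinawa!

# Option 2 (Ruby 2.6+): hand the variables over explicitly as a hash.
puts template.result_with_hash(location: 'Okinawa')  # Hello, Okinawa!
```

`result_with_hash` avoids any dependence on surrounding local variables, which can make scripts like the benchmark driver easier to refactor.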

Slide 20

require 'erb'

CONFIG_FILE = 'y-bench.cfg'
RESULTS_DIR = './bench-results'

def test(count:)
  bytes_per_seek = 256 * 1024 * count
  template = ERB.new(File.read('bench-templ.cfg.erb'))
  cfg_file = "#{RESULTS_DIR}/bench-#{count}.cfg"
  File.write(cfg_file, template.result(binding))
  system("cd y-cruncher && ./y-cruncher config ../#{cfg_file} | tee ../#{RESULTS_DIR}/result-#{count}.txt")
end

(32..72).step(2) do |n|
  test(count: n)
end

Runs y-cruncher with 32, 34, 36, …, 72 disks automatically

Slide 21

FarMemory : {
    Framework : "disk-raid0"
    InterleaveWidth : 262144
    BufferPerLane : 134217728
    Checksums : "true"
    RawIO : "true"
    Lanes : [
<% count.times do |i| %>
        {   // Lane <%= i %>
            Path : "/mnt/disk<%= i %>"
            BufferAllocator : {
                Allocator : "interleave-libnuma"
                LockedPages : "attempt"
                Nodes : [1]
            }
            WorkerThreadCores : [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
            WorkerThreadPriority : 2
        }
<% end %>
    ]
}
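To see the loop expand, here is a cut-down, self-contained version of the lane template (the lane body is simplified to just Path; trim_mode: '-' and the -%> tags are additions that suppress the loop lines' own newlines):

```ruby
require 'erb'

# Simplified lane template: one { ... } block per disk, driven by `count`.
LANE_TEMPLATE = <<~TEMPLATE
  Lanes : [
  <% count.times do |i| -%>
    { Path : "/mnt/disk<%= i %>" }  // Lane <%= i %>
  <% end -%>
  ]
TEMPLATE

count = 3
config = ERB.new(LANE_TEMPLATE, trim_mode: '-').result(binding)
puts config
```

Changing `count` and re-rendering is all the benchmark driver needs to sweep from 32 to 72 disks.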

Slide 22

Why ERB and not something else?
● Because I'm a Rubyist!
● Any tool would've worked.
● I picked something I was already familiar with.
● The goal was to finish benchmarking as quickly as possible.
  ○ Less about learning a new tool.

Slide 23

y-cruncher's benchmark results
Need these numbers

Slide 24

Looks pretty, doesn't it?

> less result.txt
"result.txt" may be a binary file. See it anyway?

Slide 25

What it actually looks like

Slide 26

How to extract the numbers

find . -name 'result-*.txt' -exec sh -c "
    grep -E '(Far Memory)|(Sequential)|(Threshold)|(Computation)|(Disk I/O)' {} |
    grep -Eo '[0-9]+\.[0-9]+ GiB/s' |
    sed -n '2p;4p;6p;8p;9p;10p' |
    grep -Eo '[0-9]+\.[0-9]+' |
    paste -s -d, -
" \;
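Slide 37 explains why the author stayed on the command line, but purely for comparison, the same extraction can be sketched in Ruby. The keyword and number patterns mirror the shell pipeline; the sample input is made up and much shorter than a real result file, so the sed-style line-selection step is omitted:

```ruby
# Made-up sample of a y-cruncher result file (real files are far longer).
sample = <<~TXT
  Sequential Read: 3.48 GiB/s
  Sequential Write: 6.79 GiB/s
  Some unrelated line
TXT

csv = sample.lines
            .grep(/Far Memory|Sequential|Threshold|Computation|Disk I\/O/)
            .flat_map { |line| line.scan(/[0-9]+\.[0-9]+(?= GiB\/s)/) }
            .join(',')

puts csv   # prints: 3.48,6.79
```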

Slide 27

First, get the lines we need
A simple regular expression filters the lines.

Slide 28

Filter the lines
● Running grep with 'GiB/s' as a marker
● Some lines appear twice because y-cruncher writes the final result after a line break.

Slide 29

Sed to get specific lines
● -n: suppress automatic printing of the pattern space
● number: match that line number
● p: print the current pattern space

Slide 30

Now, we just need the numbers
We left "GiB/s" as a marker, but we don't need it anymore.
Grep again!

Slide 31

Convert to CSV with paste
● The paste command reads lines and joins them with a delimiter of choice.
● -s: merge the lines of each file serially
● -d: use the specified character as the delimiter

Slide 32

Now we want a lot of them
find them!

Slide 33

Now we have CSV output

Slide 34

No content

Slide 35

No content

Slide 36

Before and after performance tuning

                           First Config   Best Config   Diff
Sequential Read (GiB/s)    3.48           11.2          322%
Sequential Write (GiB/s)   6.79           8.65          127%
VST Computing (GiB/s)      14.7           25.6          174%
VST I/O (GiB/s)            3.50           7.52          215%

Actual run time: 157 days
It could've taken more than 300 days without any optimizations.

Slide 37

Why not Ruby?
● I'm more familiar with Linux command-line tools for text processing.
● Trial and error is faster on the command line.
● It's one-off. No maintenance needed.
● Google Sheets is easy to share and collaborate with.
● I didn't want to spend more time on benchmarking tooling than on actual improvements.

Slide 38

A few months later…

Verifying Decimal Output:     Time: 33956.481 seconds ( 9.432 hours )
Verifying Hexadecimal Output: Time: 33311.682 seconds ( 9.253 hours )

Start Time: Thu Oct 14 04:45:44 2021
End Time:   Mon Mar 21 04:16:52 2022

Total Computation Time: 11303429.462 seconds ( 130.827 days )
Start-to-End Wall Time: 13649467.651 seconds ( 157.980 days )

CPU Utilization:       2185.38 % + 17.43 % kernel overhead
Multi-core Efficiency: 34.15 % + 0.27 % kernel overhead

Last Decimal Digits: Pi
4658718895 1242883556 4671544483 9873493812 1206904813 :  99,999,999,999,950
2656719174 5255431487 2142102057 7077336434 3095295560 : 100,000,000,000,000

Spot Check: Good through 50,000,000,000,000
Version: 0.7.8.9507 (Linux/18-CNL ~ Shinoa)

Processor(s): Intel(R) Xeon(R) CPU @ 2.60GHz
Topology: 64 threads / 64 cores / 2 sockets / 2 NUMA nodes
Usable Memory: 913,099,632,640 ( 850 GiB)
CPU Base Frequency: 2,599,987,648 Hz

Validation File: /mnt/y-cruncher/results/Pi - 20220321-041655.txt

Slide 39

A Rubyist’s Road to World Records

Slide 40

My first RubyKaigi
● My first RubyKaigi was RubyKaigi 2009, when I was in university.
● My senpai (a senior colleague) learned there that I was interested in Ruby.
● He then suggested I attend RailsGirls Tokyo 2nd in 2013.

Slide 41

RailsGirls to RubyKaigi speaker
After attending RailsGirls, I started contributing to RailsGirls as a coach.
A few years later, I spoke about RailsGirls at RubyKaigi 2014, my first time as a speaker.
https://rubykaigi.org/2014/presentation/S-HarukaIwao/

Slide 42

Ruby gave me opportunities
● I got my second job through a referral from someone I met at RubyKaigi.
● When I interviewed for Google DevRel in 2017, I submitted the video of my RubyKaigi 2014 talk as my English conference-talk example.
  ○ The hiring manager also valued my RailsGirls contributions.

Slide 43

Me, pi, and DevRel
● I always wanted to calculate pi, but didn't have the resources.
● Google's Cloud DevRel had a Pi Day tradition of showcasing Cloud technologies with pi calculations.
● I suddenly had the right idea at the right place.
● My manager and director supported the pi calculation project.

Slide 44

https://pi.delivery/

Slide 45

Ruby made the pi world records possible
Without Ruby,
● I would've had a different career path.
● I might not have broken the pi world record twice.
● I wouldn't be here today.

Slide 46

Thank you
Emma Haruka Iwao
Software Engineer / SRE at Google
@yuryu