PelemayFp: An Efficient parallelization library for Elixir based on skeletons for data parallelism

Slide 1

Slide 1 text

PelemayFp: An Efficient parallelization library for Elixir based on skeletons for data parallelism
Susumu Yamazaki, Univ. of Kitakyushu
© 2021 Susumu Yamazaki
This research is supported by the Adaptable and Seamless Technology transfer Program through Target-driven R&D (A-STEP) from the Japan Science and Technology Agency (JST), Grant Number JPMJTM20H1.

Slide 2

Slide 2 text

Elixir, Enum and Flow
• Elixir is a dynamic, functional programming language for building scalable and maintainable applications. It is built on Erlang and runs on the Erlang VM.
• Elixir has the Enum module, which provides map, filter and reduce functions in the style of the Bird-Meertens formalism (BMF) [2].
• Flow is an optional Elixir library that provides Enum-like functions that execute in parallel but do not sort the results, so their order is not guaranteed. It is an algorithmic skeleton [3], specifically for data parallelism, based on BMF [8].
[2] Bird, R. S.: An Introduction to the Theory of Lists, Logic of Programming and Calculi of Discrete Design (Broy, M., ed.), NATO ASI Series, Vol. 36, Springer, Berlin, Heidelberg (1987).
[3] Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press (1989).
[8] Skillicorn, D. B.: Foundations of Parallel Programming, Cambridge International Series on Parallel Computation, Cambridge University Press (1994).
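As a small illustration of the Enum functions named on this slide, the following sketch applies map, filter and reduce from Elixir's standard library to short ranges. The variable names are chosen here for illustration only.

```elixir
# map: apply a function to every element and collect the results.
squares = Enum.map(1..5, fn x -> x * x end)

# filter: keep only the elements for which the predicate is true.
evens = Enum.filter(1..10, fn x -> rem(x, 2) == 0 end)

# reduce: fold the elements into a single accumulated value.
total = Enum.reduce(1..5, 0, fn x, acc -> x + acc end)

IO.inspect({squares, evens, total})
```

All three run sequentially in the calling process, which is the baseline that Flow and PelemayFp try to improve on.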

Slide 3

Slide 3 text

Hastega and Pelemay
• Flow is effective for I/O-bound workloads, but not for CPU-bound ones.
• Thus, we started researching native compilers with a Flow-like interface:
  • Hastega, a prototype native compiler for Elixir, written in Elixir and Rust, which generates OpenCL code or x86_64 SIMD instructions for multicore execution;
  • Pelemay Super-Parallelism (Pelemay for short), a native compiler for Elixir, written in Elixir and in C generated by Elixir, which generates x86_64 or ARMv8 SIMD instructions for a single core, or OpenCL code.
• In the performance evaluation on x86_64, integer calculation, float calculation and string replacement were 2.25, 4.48 and 3.86 times faster than Enum, and 21.0, 247 and 7.26 × 1000 times faster than Flow with sorting, respectively.

Slide 4

Slide 4 text

PelemayFp
• Pelemay, however, can execute only on a single core.
• Thus, we propose Pelemay Fast Parallel map (PelemayFp), written only in Elixir. It provides a fast parallel map function similar to the one in the Enum module, but executes the computation in parallel by spawning processes.
✓ A previous attempt to modify Pelemay so that it executes in parallel using only native code did not succeed because of quality and maintainability issues.

Slide 5

Slide 5 text

Combination of PelemayFp and Pelemay
• We also propose using PelemayFp together with Pelemay, in order to execute native code in parallel.
✴ We expected that the combination of PelemayFp and Pelemay would be faster than PelemayFp alone.

Slide 6

Slide 6 text

Contribution
1. PelemayFp is carefully designed in the Distributor / Worker / Collector style for maximum parallelism when parallelizing a sequential data flow over a list. Even so, the best execution time is obtained with only 12 worker processes, and spawning more worker processes gives no further benefit.
2. We expected that the combination of PelemayFp and Pelemay would be faster than PelemayFp alone, but the results were disappointing. An analysis using Amdahl’s law shows that the experimental results can be explained by assuming that, when native code is called from Elixir through NIFs, the part that is not executed in parallel increases by about 30-40 percent. If this is true, it will be necessary to modify the Erlang VM drastically, to explore another FFI, or to create a new VM.
We present the evidence in the following slides.

Slide 7

Slide 7 text

Sample code of PelemayFp
• The right figure shows sample code for PelemayFp.
• 1..1_000_000: the range from 1 to 1,000,000
• |>: the pipeline operator of Elixir
• & &1 * &1: the square function
• This is equivalent to PelemayFp.map(1..1_000_000, & &1 * &1).
• It returns a list in which each element is the result of invoking the square function on the corresponding element of the range from 1 to 1,000,000. The computation runs in parallel, with each worker process handling up to threshold elements. The threshold argument is omitted here, so its default value of 12,000 is used.
• As a result, it returns the list of the squares of the integers from 1 to 1,000,000.
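To make the behavior described on this slide concrete, here is a minimal sketch of an order-preserving, chunked parallel map written with Elixir's standard library only. It is not PelemayFp's implementation: the module name ChunkedMapSketch is hypothetical, and Task.async_stream is swapped in for PelemayFp's own process handling. Only the default threshold of 12,000 is taken from the slide.

```elixir
defmodule ChunkedMapSketch do
  # Hypothetical helper for illustration; not part of PelemayFp.
  # Splits the input into chunks of at most `threshold` elements,
  # maps each chunk in its own task, and concatenates the results
  # in the original order.
  def map(enumerable, fun, threshold \\ 12_000) do
    enumerable
    |> Enum.chunk_every(threshold)
    |> Task.async_stream(fn chunk -> Enum.map(chunk, fun) end,
      ordered: true,
      timeout: :infinity
    )
    |> Enum.flat_map(fn {:ok, mapped} -> mapped end)
  end
end

# Same call shape as the sample on the slide, using the sketch module.
result = 1..1_000_000 |> ChunkedMapSketch.map(& &1 * &1)
IO.inspect(Enum.take(result, 5))
```

With `ordered: true`, Task.async_stream yields chunk results in input order, so no explicit sorting step is needed in this sketch. By default it runs at most as many tasks at once as there are online schedulers.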

Slide 8

Slide 8 text

Design of PelemayFp
• PelemayFp is implemented using only Elixir, like Flow.
• In Flow, the order of the list after computation is not guaranteed, whereas in PelemayFp the order is guaranteed because the results are sorted while they are collected and merged.
• PelemayFp consists of the following modules:
  • PelemayFp, the facade module, which provides the map and map_chunk functions;
  • PelemayFp.ParallelSplitter, which provides two functions. The split function divides a given enumerable into a list of chunks whose sizes are less than or equal to the given threshold, and spawns a process for each chunk. The range function returns the range from the number of chunks minus 1 down to 0. The range is descending because PelemayFp.ParallelBinaryMerger receives fragments of the given list at the best execution speed if they arrive in reverse order, since a list in Elixir is immutable;
  • PelemayFp.Sub, which is internal and undocumented, and which wraps the difference between map and map_chunk;
  • PelemayFp.ParallelBinaryMerger, which receives either a consecutive list of tuples of a Range, a count and a list, or an exit or dying message from a monitored process, merges what it receives into a result, and sends the result;
  • PelemayFp.BinaryMerger, which inserts a given consecutive list of tuples of a Range, a count and a list into another one;
  • PelemayFp.Merger, which merges two consecutive lists of tuples of a Range, a count and a list.
• Splitting the given list, spawning each worker process, computing in each worker, collecting the workers' results, and sorting the results are all executed in parallel.
• In contrast, Flow with sorting executes them partly in parallel and partly sequentially.
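The split, work, collect and merge steps described on this slide can be sketched with plain processes and messages. This is a simplified illustration under stated assumptions, not the code of the PelemayFp modules: the module name SplitMergeSketch and the :chunk_done message are hypothetical, spawn_link replaces the monitoring and exit handling that the slide mentions, and a map keyed by chunk index stands in for the tuples of a Range, a count and a list. The stepped range syntax requires Elixir 1.12 or later.

```elixir
defmodule SplitMergeSketch do
  # Hypothetical module for illustration; not PelemayFp's implementation.
  def map(enumerable, fun, threshold) do
    collector = self()

    # Distributor: split into chunks and tag each chunk with its index.
    chunks = enumerable |> Enum.chunk_every(threshold) |> Enum.with_index()

    # Workers: one process per chunk, each sending its result back.
    Enum.each(chunks, fn {chunk, index} ->
      spawn_link(fn ->
        send(collector, {:chunk_done, index, Enum.map(chunk, fun)})
      end)
    end)

    # Collector: accept results in arrival order, keyed by chunk index.
    results =
      Enum.reduce(chunks, %{}, fn _chunk, acc ->
        receive do
          {:chunk_done, index, mapped} -> Map.put(acc, index, mapped)
        end
      end)

    # Merge from the last chunk down to the first. Each step copies only
    # the chunk being prepended, never the already merged tail.
    Enum.reduce((length(chunks) - 1)..0//-1, [], fn index, acc ->
      Map.fetch!(results, index) ++ acc
    end)
  end
end

IO.inspect(SplitMergeSketch.map(1..10, &(&1 * &1), 3))
```

The descending merge order mirrors the reason given on the slide for the descending range: because lists are immutable, `left ++ right` copies only `left`, so building the result from the tail end keeps each merge step proportional to one chunk.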

Slide 9

Slide 9 text

The results of the experiments
• The right figure plots the results of the experiments in Table 3.
• The best execution time is obtained with 12 worker processes, where the speedup ratio is 2.11.
• Spawning more worker processes gives no further benefit.

Slide 10

Slide 10 text

The results of the experiments
• We expected that the combination of PelemayFp and Pelemay would be faster than PelemayFp alone, but the results were disappointing: the speedup ratio of the combination is much lower than that of PelemayFp alone.

Slide 11

Slide 11 text

The analysis by Amdahl’s law
• The right figure shows the graph of the analysis by Amdahl’s law in Table 4.
• It shows that the fraction of the entire code that executes in parallel is 48-66 percent for PelemayFp, but only 6-46 percent for the combination of PelemayFp and Pelemay.

Slide 12

Slide 12 text

The analysis by Amdahl’s law
• Furthermore, we assume that the combination of PelemayFp and Pelemay has more non-parallelized sections in the entire code than PelemayFp alone, because Pelemay uses NIFs, and calling NIFs may add overhead in the form of non-parallelized sections.
• When N is 2, the analysis is formulated as in the right figure, where 1.49 and 1.12 are the speedup ratios relative to 1 core of PelemayFp and of the combination of PelemayFp and Pelemay, respectively.
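The figure is not reproduced in this transcript. One formulation that is consistent with the numbers on the slides is sketched here, under the assumption that p is the parallel fraction of PelemayFp alone and q is the additional non-parallel fraction introduced by calling NIFs. Both symbols are defined only in the figure, so this reading is an assumption.

```latex
% Amdahl's law: speedup with parallel fraction p and parallelism N
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}

% PelemayFp alone, N = 2, measured speedup 1.49
1.49 = \frac{1}{(1 - p) + \dfrac{p}{2}}
\quad\Longrightarrow\quad
p = 2\left(1 - \frac{1}{1.49}\right) \approx 0.66

% PelemayFp with Pelemay, N = 2, measured speedup 1.12,
% assuming the parallel fraction shrinks from p to p - q
1.12 = \frac{1}{(1 - p + q) + \dfrac{p - q}{2}}
\quad\Longrightarrow\quad
p - q = 2\left(1 - \frac{1}{1.12}\right) \approx 0.21,
\qquad q \approx 0.44
```

Under this assumption the two measured speedup ratios give p ≈ 0.66 and q ≈ 0.44 for N = 2.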

Slide 13

Slide 13 text

The analysis by Amdahl’s law
• N = 2, p = 0.66 → q = 0.44
• N = 3, p = 0.70 → q = 0.38
• N = 6, p = 0.64 → q = 0.30
• N = 12, p = 0.65 → q = 0.37
➡ The analysis using Amdahl’s law shows that the experimental results can be explained by assuming that, when native code is called from Elixir through NIFs, the part that is not executed in parallel increases by about 30-40 percent.
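The N = 2 row can be checked numerically from the two speedup ratios quoted for that case (1.49 for PelemayFp and 1.12 for the combination). The sketch below assumes that p is the parallel fraction in Amdahl's law and that q is the drop in that fraction when Pelemay is added. The module and function names are hypothetical. Speedup ratios for the other values of N are not given in the slide text, so only N = 2 is computed.

```elixir
defmodule AmdahlSketch do
  # Parallel fraction implied by an observed speedup `s` at parallelism `n`,
  # obtained by solving Amdahl's law s = 1 / ((1 - p) + p / n) for p.
  def parallel_fraction(s, n) when n > 1 do
    (1 - 1 / s) / (1 - 1 / n)
  end
end

# Speedup ratios for N = 2 as quoted on the previous slide.
p = AmdahlSketch.parallel_fraction(1.49, 2)
p_combined = AmdahlSketch.parallel_fraction(1.12, 2)

# Assumed meaning of q: the increase in the non-parallel fraction.
q = p - p_combined

IO.puts("p = #{Float.round(p, 2)}, q = #{Float.round(q, 2)}")
```

Rounded to two decimals, this gives p = 0.66 and q = 0.44, in agreement with the first row above.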

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Future Works
• Therefore, when native code including SIMD instructions is generated and the parallelization is done in Elixir for speedup, it will not be appropriate to adopt NIFs.
• It may be appropriate to incorporate a code optimization mechanism using SIMD instructions into the JIT, which will be released in the next major version of Erlang, or to use another FFI method, Port, instead of NIFs.
• Thus, as future work, we will evaluate these approaches.
• Further future work includes researching and developing applications to satellite image processing systems, and ZEAM, a new VM for Elixir, which includes solving the above-mentioned issues.