PelemayFp: An Efficient parallelization library for Elixir based on skeletons for data parallelism

In this presentation, we propose Pelemay Fast Parallel map (PelemayFp), a library that efficiently parallelizes Elixir code based on skeletons for data parallelism. Like Flow, a library from previous work, PelemayFp is implemented using only Elixir. Flow does not guarantee the order of the list after computation, whereas PelemayFp does, because it sorts while collecting and merging. In contrast, Pelemay Super Parallelism (Pelemay), which we also proposed, generates native code using SIMD instructions and calls it through NIFs, one of the FFIs that Erlang provides; it guarantees the order of the list but does not perform multi-core parallelism. We evaluated integer arithmetic performance on a logistic map benchmark for PelemayFp alone, Pelemay alone, the combination of PelemayFp and Pelemay, Flow, and Enum, which is part of Elixir's standard library. On an Intel Xeon W-2191B CPU with 18 cores and 36 threads, PelemayFp alone is up to 2.1 times faster than Enum. It is also faster than Flow without sorting. The combination of PelemayFp and Pelemay, however, is only up to 1.1 times faster than Enum. We also estimated, based on Amdahl's law, the percentage of the entire code that executes in parallel: 48-66 percent for PelemayFp, but only 6-46 percent for the combination of PelemayFp and Pelemay. Further analysis revealed that these experimental results can be explained by assuming that calling native code from Elixir through NIFs increases the part that is not executed in parallel by about 30-40 percent. Therefore, NIFs are not an appropriate choice when generating native code that includes SIMD instructions and parallelizing with Elixir for speed.

Susumu Yamazaki (ZACKY)

March 17, 2021

Transcript

  1. PelemayFp: An Efficient parallelization library for Elixir based on skeletons for data parallelism

    Susumu Yamazaki, Univ. of Kitakyushu
    This research is supported by the Adaptable and Seamless Technology transfer Program through Target-driven R&D (A-STEP) from Japan Science and Technology Agency (JST), Grant Number JPMJTM20H1.
    © 2021 Susumu Yamazaki
  2. Elixir, Enum and Flow

    • Elixir is a dynamic, functional programming language for building scalable and maintainable applications. It is based on Erlang.
    • Elixir has the Enum module, which provides map, filter and reduce functions in the BMF manner [2].
    • Elixir also has Flow, an optional module that provides Enum-like functions that execute in parallel but without sorting. It is an algorithmic skeleton [3], specifically one for data parallelism, based on BMF [8].

    [2] Bird, R. S.: An Introduction to the Theory of Lists, Logic of Programming and Calculi of Discrete Design (Broy, M., ed.), NATO ASI Series, Vol. 36, Springer, Berlin, Heidelberg (1987).
    [3] Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press (1989).
    [8] Skillicorn, D. B.: Foundation of Parallel Programming, Cambridge International Series on Parallel Computation, Cambridge University Press (1994).
  3. Hastega and Pelemay

    • Flow is effective for I/O-bound cases, but not for CPU-bound cases.
    • Thus, we started researching native compilers with a Flow-like interface. The first is Hastega, a prototype native compiler for Elixir, written in Elixir and Rust, which generates OpenCL code or multicore SIMD instructions for x86_64.
    • The second is Pelemay Super-Parallelism (Pelemay for short), a native compiler for Elixir, written in Elixir and C (the C code is generated by Elixir), which generates single-core SIMD instructions for x86_64 or ARMv8, or OpenCL code.
    • In a performance evaluation on x86_64, integer calculation, float calculation and string replacement were 2.25, 4.48 and 3.86 times faster than Enum, and 21.0, 247 and 7.26 × 1000 times faster than Flow with sorting, respectively.
  4. PelemayFp

    • Pelemay, however, can execute only on a single core.
    • Thus, we propose Pelemay Fast Parallel map (PelemayFp), which provides a fast parallel map function similar to the one in the Enum module. The computation is executed in parallel by spawning processes, and the library is written in Elixir only.
    ✓ A previous attempt to modify Pelemay so that it executes in parallel using only native code did not succeed, due to quality and maintainability issues.
  5. Combination of PelemayFp and Pelemay

    • We also propose using PelemayFp together with Pelemay, in order to execute native code in parallel.
    ✴ We expected the combination of PelemayFp and Pelemay to be faster than PelemayFp alone.
  6. Contribution

    1. PelemayFp is carefully designed in the Distributor / Worker / Collector style for maximum parallelism when parallelizing a sequential data flow over a list. Even so, the best execution time is achieved with only 12 worker processes, and spawning more worker processes brings no benefit.
    2. We expected the combination of PelemayFp and Pelemay to be faster than PelemayFp alone, but the results were disappointing. The analysis using Amdahl's law reveals that these experimental results can be explained by assuming that calling native code from Elixir through NIFs increases the part that is not executed in parallel by about 30-40 percent. If this is true, it will be necessary to modify the Erlang VM drastically, to explore another FFI, or to create a new VM.

    We present the evidence in the following slides.
  7. Sample code of PelemayFp

    • The figure on the right shows sample code for PelemayFp.
        • 1..1_000_000: the range from 1 to 1,000,000
        • |>: the pipeline operator of Elixir
        • & &1 * &1: the square function
    • This is equivalent to PelemayFp.map(1..1_000_000, & &1 * &1).
    • It returns a list in which each element is the result of invoking the square function on the corresponding element of the list generated from the range 1 to 1,000,000. The work is processed in parallel, with each process handling up to threshold elements; the threshold argument is omitted here and defaults to 12,000.
    • As a result, it returns the list of the squares of the integers from 1 to 1,000,000.
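The slide's figure is not part of this transcript. From the bullets, the sample is the pipeline form of `PelemayFp.map(1..1_000_000, & &1 * &1)`. The following is a minimal sketch of the same call shape using only the standard library; `ChunkedMapSketch` is a hypothetical stand-in and not PelemayFp's implementation. It assumes chunks of at most `threshold` elements (12,000 by default, as on the slide), one process per chunk, and results concatenated in chunk order.

```elixir
defmodule ChunkedMapSketch do
  # Hypothetical stand-in for illustration; not PelemayFp's code.
  # Splits the input into chunks of at most `threshold` elements,
  # maps each chunk in its own process, and joins the results in chunk order.
  def map(enumerable, fun, threshold \\ 12_000) do
    enumerable
    |> Enum.chunk_every(threshold)
    |> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, fun) end) end)
    |> Enum.flat_map(&Task.await/1)
  end
end

# Same shape as the slide's sample: range, pipeline operator, square function.
squares = 1..1_000_000 |> ChunkedMapSketch.map(& &1 * &1)
```

Because the tasks are awaited in the order they were started, the output order matches `Enum.map/2` regardless of which chunk finishes first.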
  8. Design of PelemayFp

    • PelemayFp is implemented using only Elixir, like Flow.
    • Flow does not guarantee the order of the list after computation, whereas PelemayFp does, because it sorts while collecting and merging.
    • PelemayFp consists of the following modules:
        • PelemayFp, the facade module, which provides the map and map_chunk functions;
        • PelemayFp.ParallelSplitter, which provides two functions. The split function divides a given enumerable into a list of chunks whose size is at most the given threshold, spawning one process per chunk. The range function returns the range from the number of chunks minus 1 down to 0. The range is descending because PelemayFp.ParallelBinaryMerger receives fragments of the given list at the best execution speed when they arrive in reverse order, since a list in Elixir is immutable;
        • PelemayFp.Sub, which is internal and undocumented, and which wraps the difference between map and map_chunk;
        • PelemayFp.ParallelBinaryMerger, which receives either a consecutive list of tuples of a Range, a count and a list, or an exit or dying message from a monitored process, merges it into a result, and sends the result;
        • PelemayFp.BinaryMerger, which inserts a given consecutive list of tuples of a Range, a count and a list into another one;
        • PelemayFp.Merger, which merges two consecutive lists of tuples of a Range, a count and a list.
    • Splitting the given list, spawning each worker process, calculating in each worker, collecting the results of the workers, and sorting the results are all executed in parallel.
    • In contrast, Flow with sorting executes them partly in parallel and partly sequentially.
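The Distributor / Worker / Collector flow described above can be sketched with plain `spawn_link` and `receive`. This is a simplified illustration under stated assumptions, not PelemayFp's code: all names are hypothetical, fragments are tagged with a chunk index instead of the `{Range, count, list}` tuples the real mergers use, workers are linked rather than monitored, and fragments are ordered by plain sorted insertion rather than binary merging.

```elixir
defmodule CollectorSketch do
  # Hypothetical sketch; the module and function names are not from PelemayFp.
  def map(enumerable, fun, threshold \\ 12_000) do
    collector = self()
    ref = make_ref()
    chunks = Enum.chunk_every(enumerable, threshold)

    # Distributor: one worker process per chunk, tagged with the chunk position.
    chunks
    |> Enum.with_index()
    |> Enum.each(fn {chunk, index} ->
      spawn_link(fn -> send(collector, {ref, index, Enum.map(chunk, fun)}) end)
    end)

    # Collector: take fragments in arrival order and keep them sorted by index.
    length(chunks)
    |> collect(ref, [])
    |> Enum.flat_map(fn {_index, list} -> list end)
  end

  defp collect(0, _ref, acc), do: acc

  defp collect(pending, ref, acc) do
    receive do
      {^ref, index, list} -> collect(pending - 1, ref, insert(acc, {index, list}))
    end
  end

  # Ordered insertion into a list that is already sorted by chunk index.
  defp insert([{i, _} = head | tail], {index, _} = fragment) when i < index,
    do: [head | insert(tail, fragment)]

  defp insert(rest, fragment), do: [fragment | rest]
end
```

Sorting during collection means no separate sort pass is needed once the last worker reports, which is the property the slide attributes to PelemayFp's collecting and merging step.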
  9. The results of the experiments

    • The figure on the right plots the experimental results listed in Table 3.
    • The best execution time is achieved with 12 worker processes, at a speedup ratio of 2.11.
    • Spawning more worker processes brings no benefit.
  10. The results of the experiments

    • We expected the combination of PelemayFp and Pelemay to be faster than PelemayFp alone, but the results were disappointing: the speedup ratio of the combination is much lower than that of PelemayFp alone.
  11. The analysis by Amdahl’s law

    • The figure on the right plots the analysis by Amdahl’s law listed in Table 4.
    • It shows that the percentage of the entire code that executes in parallel is 48-66 percent for PelemayFp, but only 6-46 percent for the combination of PelemayFp and Pelemay.
  12. The analysis by Amdahl’s law

    • Furthermore, we assume that the combination of PelemayFp and Pelemay has more non-parallelized sections in the entire code than PelemayFp alone, because Pelemay uses NIFs, and calling NIFs may add overhead in the form of non-parallelized sections.
    • When N is 2, the analysis is formulated as in the figure on the right, where 1.49 and 1.12 are the speedup ratios, relative to one core, of PelemayFp and of the combination of PelemayFp and Pelemay, respectively.
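The figure with the formulation is not part of this transcript. The following is a hedged reconstruction from Amdahl's law and the numbers on the slides, not a copy of the slide. It assumes that p is the parallel fraction of PelemayFp alone, and that with NIFs a fraction q of the total moves from the parallel part to the non-parallel part.

```latex
% Amdahl's law: speedup on N cores with parallel fraction p
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}

% N = 2, PelemayFp alone (speedup 1.49)
1.49 = \frac{1}{(1 - p) + \dfrac{p}{2}}
\quad\Longrightarrow\quad p \approx 0.66

% N = 2, PelemayFp combined with Pelemay (speedup 1.12):
% the non-parallel part grows by q, the parallel part shrinks by q
1.12 = \frac{1}{(1 - p + q) + \dfrac{p - q}{2}}
\quad\Longrightarrow\quad q \approx 0.44
```

Under this assumption, solving the two equations gives the N = 2 values of p and q reported on the next slide.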
  13. The analysis by Amdahl’s law

    • N = 2, p = 0.66 → q = 0.44
    • N = 3, p = 0.70 → q = 0.38
    • N = 6, p = 0.64 → q = 0.30
    • N = 12, p = 0.65 → q = 0.37
    ➡ The analysis using Amdahl’s law reveals that these experimental results can be explained by assuming that calling native code from Elixir through NIFs increases the part that is not executed in parallel by about 30-40 percent.
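As a quick arithmetic check of the N = 2 row, the sketch below inverts Amdahl's law for the two speedup ratios quoted on slide 12 (1.49 and 1.12). `AmdahlSketch` and its function names are hypothetical, and the definition of q as the drop in parallel fraction is an assumption consistent with the slide's numbers rather than a formula taken from the deck. Speedups for the other values of N are not given in this transcript, so only N = 2 is computed.

```elixir
defmodule AmdahlSketch do
  # Amdahl's law solved for the parallel fraction p, given speedup s on n cores:
  #   s = 1 / ((1 - p) + p / n)   =>   p = (1 - 1 / s) / (1 - 1 / n)
  def parallel_fraction(speedup, n) when n > 1 do
    (1 - 1 / speedup) / (1 - 1 / n)
  end

  # Assumed meaning of q: how much smaller the parallel fraction becomes
  # when the native code is called through NIFs.
  def serial_increase(speedup_alone, speedup_combined, n) do
    parallel_fraction(speedup_alone, n) - parallel_fraction(speedup_combined, n)
  end
end

p = AmdahlSketch.parallel_fraction(1.49, 2)
q = AmdahlSketch.serial_increase(1.49, 1.12, 2)
IO.puts("p = #{Float.round(p, 2)}, q = #{Float.round(q, 2)}")
# prints "p = 0.66, q = 0.44"
```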
  14. Contribution

    1. PelemayFp is carefully designed in the Distributor / Worker / Collector style for maximum parallelism when parallelizing a sequential data flow over a list. Even so, the best execution time is achieved with only 12 worker processes, and spawning more worker processes brings no benefit.
    2. We expected the combination of PelemayFp and Pelemay to be faster than PelemayFp alone, but the results were disappointing. The analysis using Amdahl’s law reveals that these experimental results can be explained by assuming that calling native code from Elixir through NIFs increases the part that is not executed in parallel by about 30-40 percent. If this is true, it will be necessary to modify the Erlang VM drastically, to explore another FFI, or to create a new VM.
  15. Future Works

    • Therefore, NIFs are not an appropriate choice when generating native code that includes SIMD instructions and parallelizing with Elixir for speed.
    • It may be appropriate to incorporate a code optimization mechanism using SIMD instructions into the JIT, which will be released in the next major version of Erlang, or to use another FFI method, Port, instead of NIFs.
    • Thus, as future work, we will evaluate these options.
    • Further future work includes researching and developing applications to satellite image processing systems, and ZEAM, a new VM for Elixir, which includes solving the issues mentioned above.