Hastega: Challenge for GPGPU on Elixir @ Lonestar ElixirConf 2019

We have implemented a demonstration program in which Elixir code directly invokes a GPGPU benchmark through Rustler. We propose Hastega (Hyper Accelerator of Spreading Tasks for Elixir with GPU Activation), a method that converts Elixir code written with Enum/Flow into executable code for GPUs or for multi-core CPUs with SIMD.
We evaluated the performance of an experimental GPGPU implementation of the Hastega method using a logistic map benchmark, with the following results:
- Hastega is 4-8 times faster than pure Elixir running on the CPU only
- Hastega is up to 3 times faster than CuPy/Python running on the GPU
- Hastega is only 1.5 times slower than native code running on the GPU
We are now implementing linear regression and a neural network in Elixir, and will accelerate them with Hastega. Our main future work is a compiler from Elixir code using Enum/Flow to native code for GPUs and/or multi-core CPUs with SIMD.

# Bio

Susumu Yamazaki (ZACKY) is currently an Associate Professor at the University of Kitakyushu. His current research projects focus on programming language processors, software engineering, programming education and social implementation of software systems.

https://lonestarelixir.com/2019/speakers/21

Susumu Yamazaki (ZACKY)

March 02, 2019

Transcript

  1. 1.

    Hastega: Challenge for GPGPU on Elixir
    Susumu Yamazaki (ZACKY), Associate Professor at Univ. of Kitakyushu, Adviser at fukuoka.ex
  2. 2.

    Susumu Yamazaki (@ZACKY1972), Associate Professor at Univ. of Kitakyushu, Adviser at fukuoka.ex
    I traveled for a whole day from Japan to give this presentation on Hastega
    I have only 1 year of experience with Elixir!
  3. 7.

    Data Explosion has come!
    55.1 exabytes in 2013, and it's growing exponentially
    [Chart: amount of data by year, as reported by Cisco, 2013]
  4. 8.

    We need more power!
    But the growth of CPU clock rates came to an end more than 15 years ago (around 2003)
    [Chart: clock rate by year, from Computer Architecture: A Quantitative Approach]
  5. 9.

    We need more power!
    [Slide compares two Intel processors, a Core Extreme and a Core i-series XE, by clock rate (GHz) and number of cores]
    • CPU clock rates haven't grown
    • The number of cores is growing rapidly
  6. 10.

    We need more power!
    • The number of cores is growing, but CPU clock rates haven't grown
    • So we need parallel computing
  7. 11.

    We need more power!
    • But we have no effective parallel programming languages
    • Multi-threaded programming is still too immature to be used correctly
  8. 15.

    The reason for the dystopia
    • Suppose some data is shared among several cores
    [Diagram: Core #1 and Core #2 both reference shared data holding 3.14]
  9. 16.

    The reason for the dystopia
    • If a core updates the shared data,
    • then it notifies the other cores, and they stop processing
    • This causes a slowdown
    [Diagram: Core #1 updates the shared data from 3.14 to 1.5 and notifies Core #2, which stops processing]
  10. 17.

    The reason for the dystopia
    • If there are many cores, the waiting time grows exponentially
    [Diagram: Core #1 updates the shared data from 3.14 to 1.5 and notifies Core #2, which stops processing]
  11. 19.

    Elixir is a drastic solution
    • Data in Elixir is immutable
    • That is, Elixir forbids all updates of shared data
    • Thus, the other cores don't need to stop processing
    [Diagram: Core #1 and Core #2 share data holding 3.14; no updates, so no need to stop processing]
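The immutability point on this slide can be illustrated with a small sketch. The module name `ImmutableSketch` and the map contents are made up for illustration; the slide itself shows only a diagram. A second process reads the shared value after the first process has "updated" it, and still sees the original.

```elixir
defmodule ImmutableSketch do
  # Hypothetical demo: returns {original value, updated value, value seen by the reader}.
  def demo do
    shared = %{pi: 3.14}
    parent = self()

    # The reader process captures `shared`; nothing the parent does can change it.
    reader =
      spawn(fn ->
        receive do
          :read -> send(parent, {:seen, shared.pi})
        end
      end)

    # "Updating" builds a new map; the original term is left untouched.
    updated = %{shared | pi: 1.5}

    send(reader, :read)

    seen =
      receive do
        {:seen, value} -> value
      end

    {shared.pi, updated.pi, seen}
  end
end
```

Because the update produces a new term instead of changing the old one, the reader never has to be notified or paused.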
  12. 20.

    Elixir and Phoenix are the most promising solution for the Data Explosion problem
    Fedrecheski, G., Costa, L. C. P. and Zuffo, M. K.: Elixir programming language evaluation for IoT, 2016 IEEE International Symposium on Consumer Electronics (ISCE), pp. 105–106 (online), DOI: 10.1109/ISCE.2016.7797392 (2016).
    • Java breaks down at request rates above 1,200 requests/sec
    • Elixir withstands request rates up to 1,800 requests/sec
    • Server: a quad-core computer with 6GB RAM
    • Client: an eight-core computer with 12GB RAM
    [Chart: lower is faster]
  13. 22.

    José creates Flow!!!
    • It's not magic! (by José)
    • But I think it's quite marvelous and fantastic magic!
    • CPU- and/or IO-bound work will be parallelized and accelerated by multi-core CPUs
    • Elixir got a stylish and powerful parallel computing technology

    Single-process code with Enum:
      1..1_000_000
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))

    Multi-process code with Flow:
      1..1_000_000
      |> Flow.from_enumerable
      |> Flow.map(&foo(&1))
      |> Flow.map(&bar(&1))
      |> Enum.to_list
  14. 24.

    More pessimistic prediction of Data Explosion
    • 40 zettabytes (= 40,000 exabytes) in 2020
    • 180 zettabytes (= 180,000 exabytes) in 2025
    ©2014 IDC
  15. 28.

    Hastega
    • It's magic!!
    • The most highly evolved spell for accelerating our party in Final Fantasy! (Stronger than Haste)
    • It will be the most highly evolved technology for accelerating our machines in the Elixir ecosystem!
    • It's inspired by Flow
  16. 30.

    4-8x faster than Elixir using Flow!
    And also faster than P-map (parallel map)
    [Chart: execution time including Flow, lower is faster]
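The talk does not show how the P-map it compares against is implemented. As a rough idea of what a parallel map looks like in plain Elixir, here is a sketch built on the standard library's `Task.async_stream/3`; the module name `PmapSketch` is hypothetical and this is not necessarily the variant that was benchmarked.

```elixir
defmodule PmapSketch do
  # Apply `fun` to every element concurrently, preserving input order.
  def pmap(enumerable, fun) do
    enumerable
    |> Task.async_stream(fun, ordered: true)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Usage: double each element in parallel.
PmapSketch.pmap(1..3, &(&1 * 2))
```

Running one task per element adds scheduling and message-passing overhead, which is one plausible reason a naive parallel map loses to approaches that batch the work.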
  17. 36.

    Why is Enum.map Zen?
    [Photo © Stephane D'Alu]
    • Zen is essential beauty
    • The essence of programming is data transformation
    • Enum.map describes only data transformation

      list = 1..1_000_000 |> Enum.to_list
      list
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))
  18. 37.

    Comparison
    • Code A, B and C are equivalent
    • A: the loop style, in Java
    • B: the recursive call style
    • C: using Enum.map

    (A)
      int i;
      int[] array = new int[1000000];
      for(i = 0; i < 1000000; i++)
        array[i] = i + 1;
      for(i = 0; i < 1000000; i++)
        array[i] = foo(array[i]);
      for(i = 0; i < 1000000; i++)
        array[i] = bar(array[i]);

    (B)
      1..1_000_000
      |> Enum.to_list
      |> func()

      def func([]), do: []
      def func([head | tail]) do
        [head |> foo() |> bar() | func(tail)]
      end

    (C)
      1..1_000_000
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))
  19. 38.

    Why are they not Zen?
    [Photo © Stephane D'Alu]
    • A loop describes the flow of processing, a loop counter and destructive updates

      int i;
      int[] array = new int[1000000];
      for(i = 0; i < 1000000; i++)
        array[i] = i + 1;
      for(i = 0; i < 1000000; i++)
        array[i] = foo(array[i]);
      for(i = 0; i < 1000000; i++)
        array[i] = bar(array[i]);
  20. 39.

    Why are they not Zen?
    [Photo © Stephane D'Alu]
    • A recursive call describes not data transformation but the flow of processing

      1..1_000_000
      |> Enum.to_list
      |> func()

      def func([]), do: []
      def func([head | tail]) do
        [head |> foo() |> bar() | func(tail)]
      end
  21. 40.

    The Elixir Zen style
    • We propose calling code written with Enum.map the Elixir Zen style
    • It is a good programming practice
    • Because it's more readable and maintainable

      1..1_000_000
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))
  22. 42.

    Performance Evaluation
    In Elixir on the Erlang VM, the Elixir Zen style is 20 percent slower than the recursive call

    Elixir Zen style:
      list = 1..1_000_000 |> Enum.to_list
      list
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))

    Recursive call:
      list
      |> func()

      def func([]), do: []
      def func([head | tail]) do
        [head |> foo() |> bar() | func(tail)]
      end

    [Chart: execution time with a 6 msec label, lower is faster]
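A comparison like this one can be reproduced roughly with `:timer.tc/1` from the Erlang standard library. In the sketch below, the module name `ZenBench` and the bodies of `foo` and `bar` are stand-ins, since the talk does not show what the real functions compute, and a single timing run is far noisier than a proper benchmark.

```elixir
defmodule ZenBench do
  # Hypothetical stand-ins for the talk's foo and bar.
  def foo(x), do: x * 2
  def bar(x), do: x + 1

  # Elixir Zen style: two Enum.map passes.
  def zen(list), do: list |> Enum.map(&foo(&1)) |> Enum.map(&bar(&1))

  # Recursive call style: one pass, applying foo then bar per element.
  def recursive([]), do: []
  def recursive([head | tail]), do: [head |> foo() |> bar() | recursive(tail)]

  # Time both styles once (in microseconds) and check they agree.
  def compare(n) do
    list = Enum.to_list(1..n)
    {zen_us, zen_result} = :timer.tc(fn -> zen(list) end)
    {rec_us, rec_result} = :timer.tc(fn -> recursive(list) end)
    %{zen_us: zen_us, recursive_us: rec_us, same_result: zen_result == rec_result}
  end
end

# Usage: ZenBench.compare(1_000_000)
```

The timings depend heavily on the machine, the Elixir/OTP version and what `foo` and `bar` do, so the 20 percent figure should not be expected to reproduce exactly.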
  23. 44.

    Hastega will…
    • make Elixir Zen style code faster
    • by casting its spell on the Samurai to make it go berserk
    • that is, by transforming the code into the fastest native code, using all computing resources,
    • not only multi-core CPUs (with SIMD instructions) but also GPUs
    We feel it is Wabi-Sabi
  24. 45.

    Inspiration from Enum.map
    • This code has potential parallelism of 1,000,000:
    • Each element is transformed by the composition of foo and bar
    • There are no dependencies between elements

      1..1_000_000
      |> Enum.map(&foo(&1))
      |> Enum.map(&bar(&1))
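The observation that each element is transformed independently by the composition of foo and bar can be checked directly: two successive maps give the same result as a single map over the composed function. The module name `FusionSketch` and the bodies of `foo` and `bar` below are placeholders for illustration.

```elixir
defmodule FusionSketch do
  # Hypothetical stand-ins for the talk's foo and bar.
  def foo(x), do: x * 2
  def bar(x), do: x + 1

  # Two passes over the data, as written in the Elixir Zen style.
  def two_pass(enumerable) do
    enumerable |> Enum.map(&foo(&1)) |> Enum.map(&bar(&1))
  end

  # One pass applying the composition; each element depends only on itself.
  def fused(enumerable) do
    Enum.map(enumerable, fn x -> x |> foo() |> bar() end)
  end
end

# Usage: both forms produce the same list.
FusionSketch.two_pass(1..5) == FusionSketch.fused(1..5)
```

Since the fused function needs only its own input element, every element could in principle be handed to a separate SIMD lane or GPU work item.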
  25. 47.

    Principle of Hastega
    • Zen style code can easily be transformed into SIMD native code such as OpenCL, which drives multi-core CPUs and GPUs
    • Writing Hastega code is simple
    • All you have to do is write defhastega with a do block that contains the def blocks you want to optimize

      defhastega do
        def func do
          1..1_000_000
          |> Enum.map(&foo(&1))
          |> Enum.map(&bar(&1))
        end
      end

      __kernel void calc(
          __global long* input,
          __global long* output) {
        size_t i = get_global_id(0);
        long temp = input[i];
        temp = foo(temp);
        temp = bar(temp);
        output[i] = temp;
      }
  26. 48.

    Performance of Hastega from Zen is much better than that of the recursive call
    [Charts: execution time including Flow, lower is faster]
  27. 50.

    Demo
    I'm sorry that I cannot explain the details well in English. But I believe our common language is Elixir! Please feel our passion through the Elixir code…
  28. 52.

    Inside of Hastega
    • Hastega has two subsystems that we are developing:
    • SumMag: a meta-programming library
    • Magicite: an Elixir-LLVM binding via NIFs in Rustler
    • Both code names come from FF
  29. 53.

    SumMag: a meta-programming library
    • It extracts each code block consisting of a series of Enum.map calls into a new function
    • without our having to write a parser from scratch
    • Thank you, José, for providing Elixir's meta-programming infrastructure
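The talk does not show SumMag's code. As an illustration of why no hand-written parser is needed, the sketch below uses only Elixir's built-in `quote` and `Macro.prewalk/3` to find the functions passed to `Enum.map` in a pipeline; the module name `AstSketch` and the function `mapped_funs/1` are hypothetical and are not part of SumMag.

```elixir
defmodule AstSketch do
  # Walk a quoted expression and collect the last argument of every
  # `Enum.map(...)` call, in source order. Returns a list of AST fragments.
  def mapped_funs(ast) do
    {_ast, acc} =
      Macro.prewalk(ast, [], fn
        # Matches the AST shape of a remote call to Enum.map.
        {{:., _, [{:__aliases__, _, [:Enum]}, :map]}, _, args} = node, acc ->
          {node, [List.last(args) | acc]}

        node, acc ->
          {node, acc}
      end)

    Enum.reverse(acc)
  end
end

# Usage: quote the Zen style pipeline and list the mapped functions.
ast =
  quote do
    1..1_000_000
    |> Enum.map(&foo(&1))
    |> Enum.map(&bar(&1))
  end

ast |> AstSketch.mapped_funs() |> Enum.map(&Macro.to_string/1)
```

A real extractor would also have to rewrite the matched calls into a new function, which this sketch does not attempt.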
  30. 54.

    Magicite: an Elixir-LLVM binding via NIFs in Rustler
    • uses the state-of-the-art compiler infrastructure, LLVM,
    • to generate native code
    • driven by Rust via Rustler
    • invoked from Elixir
    • You'll write only Elixir code, not Rust code
  31. 56.

    Roadmap to Implement
    • First, Hastega will support x86_64 CPUs,
    • using SIMD instructions (but only on a single core)
    • Next, it will support GPUs, including AMD and NVIDIA,
    • that support OpenCL,
    • implemented by sending messages to a process that has exclusive control of communication with a GPU
  32. 57.

    Roadmap to Implement
    • Supporting multi-core processing in Hastega
    • may be a little difficult
    • to implement on the current Erlang VM
    • because we observed that
    • our prototypes are inefficient
    • at starting and synchronizing new processes
  33. 58.

    Roadmap to Implement
    • I'm also interested
    • in supporting Metal and CUDA
    • to achieve high efficiency,
    • and in load balancing between CPUs and GPUs
    • to make programming with Hastega easier
  34. 59.

    Roadmap to Implement
    • In the future, I want to support
    • not only server-side computing
    • for database manipulation and machine learning on servers
    • but also edge computing and web clients
    • through JS, WebAssembly, WebGL and WebGPU
    • generated from Elixir
    • for UI, computer vision and machine learning on the edge and/or web clients
  35. 61.

    Our mission is to establish technologies, including Elixir, that keep us out of dystopia, for the happiness of all people!!!
  36. 63.

    I'm sorry, but the first practical version of Hastega will be released before summer 2019 m(_ _)m
  37. 65.

    Conclusion
    • We should use Hastega
    • to transform Elixir Zen style code
    • into native code
    • optimized for CPUs with SIMD instructions and for GPUs
    • like a berserk Samurai
  38. 70.

    Other Samurai have released more Elixir products!!!
    • Materia: a collection of powerful web authentication APIs, with management of accounts, mail, errors and multi-transactions. https://github.com/karabiner-inc/materia
    • Esuna: a data science platform built on Phoenix that lets you convert and aggregate data through a GUI. Its data manipulation is similar to Python's pandas. What will happen if Esuna meets Hastega…!!! https://qiita.com/piacere_ex/items/ab0b32c521293d4ab38e
  39. 71.

    Materia and Esuna will be released by Karabiner.inc
    We develop systems with Elixir / Phoenix and other technologies.
    https://www.karabiner.tech/