An Introduction to Writing Applications for the Parallella Board

Parallella is a credit card-sized computer with a many-core accelerator that allows it to achieve high floating-point performance while consuming only a few watts. In this talk we will take a look at the Epiphany architecture and how to use the eSDK to write highly parallel applications for it, using hardware and software features to benchmark code and optimise performance.

Simon Cook

August 16, 2014

Transcript

  1. Introduction to Parallella

    Copyright © 2014 Embecosm. Freely available under a Creative Commons license.

    • Dual core A9
    • 16-core Epiphany coprocessor
    • Xilinx FPGA (Zynq 7010/7020)
    • 1GB RAM
    • 24/48 GPIO
    • Board design, all software open source
  2. Epiphany Architecture

    • Superscalar: 2 ALU ops and 64-bit memory load each cycle
    • 64 registers
    • 32KB local memory
    • Access to shared memory with other cores/host
  3. Multicore Framework

    • Each core has routing processor forming three meshes:
      – cMesh for on-chip write
      – rMesh for on-chip read
      – xMesh for off-chip write
    • Global address space
    • Upper 12 bits mark node address
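The "upper 12 bits mark node address" rule from the Multicore Framework slide can be sketched in plain C. This is an illustration under stated assumptions, not eSDK code: the helper names are hypothetical, the 6-bit row / 6-bit column split is inferred from the `* 64` in the host-side Hello World, and the 20-bit local offset is simply what remains of a 32-bit address.

```c
#include <stdint.h>

/* Assumed layout of a 32-bit global address:
 *   [31:26] mesh row   [25:20] mesh column   [19:0] offset in that core's memory
 * The 6/6 split is an assumption inferred from the "* 64" in the host example. */
#define COL_BITS    6u
#define LOCAL_BITS  20u

/* Build the 12-bit node ID from mesh coordinates (hypothetical helper). */
static inline uint32_t node_id(uint32_t row, uint32_t col)
{
    return ((row & 0x3Fu) << COL_BITS) | (col & 0x3Fu);
}

/* Combine mesh coordinates and a local offset into a global address. */
static inline uint32_t global_address(uint32_t row, uint32_t col, uint32_t local_offset)
{
    return (node_id(row, col) << LOCAL_BITS) | (local_offset & 0xFFFFFu);
}

/* Recover the node ID from a global address. */
static inline uint32_t address_node(uint32_t addr)
{
    return addr >> LOCAL_BITS;
}
```

With these helpers, `node_id(32, 8)` gives `0x808` and `global_address(32, 8, 0x2000)` gives `0x80802000`, so any core can name a location in another core's local memory with an ordinary 32-bit pointer value.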
  4. Building Software

    • Standard GNU Tools
      – GCC/GDB/etc.
    • e-lib target library
    • e-hal multicore library
    • OpenCL support achieved via COPRTHR SDK
    • Epiphany shows as Compute Device of type CL_DEVICE_TYPE_ACCELERATOR
  5. Hello World – Epiphany

        #include <stdio.h>
        #include <stdlib.h>
        #include "e_lib.h"

        /* Output buffer placed in shared DRAM, outside the core's local memory. */
        char outbuf[128] SECTION("shared_dram");

        int main (void)
        {
          e_coreid_t coreid;

          coreid = e_get_coreid ();
          sprintf (outbuf, "Hello World from core 0x%03x!", coreid);

          return EXIT_SUCCESS;
        }
  6. Hello World – Host Code

        #include <stdlib.h>
        #include "e-hal.h"

        /* _BufOffset, _BufSize and _SeqLen are assumed to be defined elsewhere. */

        int main (void)
        {
          unsigned i, row, col, coreid;
          e_platform_t platform;
          e_epiphany_t dev;
          e_mem_t emem;

          e_init (NULL);
          e_reset_system ();
          e_get_platform_info (&platform);

          /* Map the shared buffer that the Epiphany program writes to. */
          e_alloc (&emem, _BufOffset, _BufSize);

          for (i = 0; i < _SeqLen; i++)
            {
              /* Pick a random core and compute its ID. */
              row = rand () % platform.rows;
              col = rand () % platform.cols;
              coreid = (row + platform.row) * 64 + col + platform.col;

              /* Open a 1x1 workgroup on that core, reset it, then load and start the program. */
              e_open (&dev, row, col, 1, 1);
              e_reset_core (&dev, 0, 0);
              e_load ("e_hello_world.srec", &dev, 0, 0, E_TRUE);
            }

          return EXIT_SUCCESS;
        }
  7. Matrix Multiply – Epiphany

        int main (void)
        {
          init ();
          e_barrier_init (barriers, tgt_bars);

          if (me.corenum == 0)
            {
              /* Wait for "go" from the host. */
              while (Mailbox.pCore->go == 0)
                ;
              Mailbox.pCore->ready = 0;
            }

          /* Sync with all other cores, calculate, then sync again. */
          e_barrier (barriers, tgt_bars);
          bigmatmul ();
          e_barrier (barriers, tgt_bars);

          if (me.corenum == 0)
            {
              /* Signal end of calculation to the host. */
              Mailbox.pCore->go = 0;
              Mailbox.pCore->ready = 1;
            }
        }
  8. Matrix Multiply – Host

        int main (void)
        {
          /* <...> */

          e_open (pEpiphany, 0, 0, e_platform.chip[0].rows, e_platform.chip[0].cols);

          /* Clear the ready flag in shared DRAM before starting the cores. */
          Mailbox.core.ready = 0;
          e_write (pDRAM, 0, 0, addr, &Mailbox.core.ready, sizeof (Mailbox.core.ready));

          e_load_group (ar.srecFile, pEpiphany, 0, 0, pEpiphany->rows, pEpiphany->cols, ar.run_target);

          matrix_init (seed);

          return EXIT_SUCCESS;
        }
  9. Benchmarking Code

    • Each core has two timers which can be used to examine the performance of your code.
    • Timers can count instructions, pipeline stalls, etc.

    Example:

        e_ctimer_stop (E_CTIMER_0);                       // stop timer
        e_ctimer_set (E_CTIMER_0, E_CTIMER_MAX);          // timers count down, so load the maximum
        e_ctimer_start (E_CTIMER_0, E_CTIMER_CLK);        // measure clk cycles
        foo ();                                           // my function
        e_ctimer_stop (E_CTIMER_0);                       // stop timer
        time = E_CTIMER_MAX - e_ctimer_get (E_CTIMER_0);  // elapsed cycles
  10. Placing Code in Fast Memory

    internal.ldf
    • Everything stored in internal SRAM
    • Best if everything fits within 32KB

    fast.ldf
    • User code/data and stack in internal SRAM
    • Standard libraries stored in external DRAM
    • Best if using few large library functions

    legacy.ldf
    • Everything stored in external DRAM
    • Will be much slower than internal and fast
    • 1MB storage for the whole program

    [Diagram: arrows labelled Speed and Available Space alongside the three options]

    Example Usage:

        e-gcc -T ${ESDK}/bsps/current/fast.ldf foo.c -o foo.o -le-lib
  11. Writing for Faster Communication

    • Both read and write meshes take one cycle to pass data to a neighbour.
    • Reading data from a core n hops away takes n cycles to send the address and n more to get the data.
    • Writing data takes n cycles to send the data.
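The latency rules on the communication slide can be turned into a small cost model. This is a minimal sketch, not a measurement: the function names are hypothetical, and it assumes that n is the Manhattan distance between the two cores on the mesh and that there is no contention.

```c
#include <stdlib.h>

/* Hops between two cores on a 2D mesh.
 * Assumption: traffic is routed along rows and columns, so the hop count
 * is the Manhattan distance between the (row, col) coordinates. */
static unsigned mesh_hops(unsigned r0, unsigned c0, unsigned r1, unsigned c1)
{
    return (unsigned)(abs((int)r0 - (int)r1) + abs((int)c0 - (int)c1));
}

/* A write only has to carry the data to the destination: n cycles. */
static unsigned write_cycles(unsigned hops)
{
    return hops;
}

/* A read sends the address out (n cycles) and waits for the data (n more). */
static unsigned read_cycles(unsigned hops)
{
    return 2u * hops;
}
```

Under this model a remote read always costs twice as much as the matching write, which is why it tends to be cheaper for the producing core to write results into the consumer's memory than for the consumer to read them.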
  12. DMA Engines

    • Each core has two DMA engines for moving data between cores and (optionally) off chip.
    • Can move double-word per cycle, so at 1GHz maximum throughput is 8GB/s.
    • Can be configured to stride data and run in either blocking or nonblocking modes.
    • e_dma_copy alternative to memcpy for a simple configuration.
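The 8GB/s figure on the DMA slide is one 8-byte double word per cycle multiplied by the clock rate. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the cycle count is a lower bound that ignores DMA setup and mesh contention.

```c
#include <stdint.h>

/* One double word (64 bits) moved per clock cycle, as stated on the slide. */
#define DMA_BYTES_PER_CYCLE 8u

/* Peak DMA throughput in bytes per second for a given core clock. */
static uint64_t dma_peak_bytes_per_sec(uint64_t clock_hz)
{
    return clock_hz * DMA_BYTES_PER_CYCLE;
}

/* Lower bound on the cycles needed to move `bytes`, rounding up to whole
 * double words. Setup time and contention are not modelled. */
static uint64_t dma_min_cycles(uint64_t bytes)
{
    return (bytes + DMA_BYTES_PER_CYCLE - 1u) / DMA_BYTES_PER_CYCLE;
}
```

For example, `dma_peak_bytes_per_sec(1000000000)` returns 8000000000, and copying a full 32KB local memory (`dma_min_cycles(32768)`) needs at least 4096 cycles.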
  13. Coming Soon... Software Cache

    • As part of the next SDK release, the tools will support software caching.
    • Functions will be copied into internal RAM at runtime as they are needed, striking a balance between execution speed and internal storage.

    [Diagram: functions foo, bar and baz placed in internal and external RAM]
  14. ... and better Multicore Debugging

    • Improved e-server for better multicore debugging
    • Debug all cores as threads in one gdb connection
    • Will enable use of Eclipse Multicore Visualizer
    • Source available in jeremy-update branch on GitHub
      – Feedback welcome
  15. Workshop