An Introduction to Writing Applications for the Parallella Board

Parallella Simon Cook

Copyright © 2014 Embecosm. Freely available under a Creative Commons license

Introduction to Parallella

• Dual-core ARM Cortex-A9
• 16-core Epiphany coprocessor
• Xilinx FPGA (Zynq 7010/7020)
• 1GB RAM
• 24/48 GPIO
• Board design and all software are open source

Epiphany Architecture

• Superscalar: 2 ALU ops and 64-bit memory load each cycle
• 64 registers
• 32KB local memory
• Access to shared memory with other cores/host

Multicore Framework

• Each core has a routing processor forming three meshes:
  – cMesh for on-chip write
  – rMesh for on-chip read
  – xMesh for off-chip write
• Global address space
• Upper 12 bits mark node address
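The "upper 12 bits mark node address" rule can be sketched in plain C. This is a minimal sketch, assuming 32-bit addresses with a 20-bit per-core offset and a node ID built as row * 64 + col (the same formula the Hello World host code in this deck uses). The function names `core_id` and `global_address` are hypothetical and are not part of e-lib or e-hal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 12-bit node ID, assuming 6 bits of row and 6 bits of column. */
uint32_t core_id(unsigned row, unsigned col)
{
    assert(row < 64 && col < 64);
    return (uint32_t)(row * 64u + col);
}

/* Hypothetical helper: place the node ID in the upper 12 bits of a 32-bit
   address and the core-local offset in the lower 20 bits (an assumption). */
uint32_t global_address(unsigned row, unsigned col, uint32_t local_offset)
{
    assert(local_offset < (1u << 20));
    return (core_id(row, col) << 20) | local_offset;
}
```

Under these assumptions, offset 0x2000 on the core at row 32, column 8 maps to the global address 0x80802000.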

Building Software

• Standard GNU Tools – GCC/GDB/etc.
• e-lib target library
• e-hal multicore library
• OpenCL support achieved via the COPRTHR SDK
• Epiphany shows as Compute Device of type CL_DEVICE_TYPE_ACCELERATOR

Hello World – Epiphany

    #include <stdio.h>
    #include <stdlib.h>
    #include "e_lib.h"

    /* Output buffer placed in DRAM shared with the host. */
    char outbuf[128] SECTION("shared_dram");

    int main (void)
    {
      e_coreid_t coreid;

      coreid = e_get_coreid ();
      sprintf (outbuf, "Hello World from core 0x%03x!", coreid);

      return EXIT_SUCCESS;
    }

Hello World – Host Code

    #include <stdlib.h>
    #include "e-hal.h"

    /* _BufOffset, _BufSize and _SeqLen are defined elsewhere in the example. */
    int main (void)
    {
      e_platform_t platform;
      e_epiphany_t dev;
      e_mem_t emem;
      unsigned i, row, col, coreid;

      e_init (NULL);
      e_reset_system ();
      e_get_platform_info (&platform);
      e_alloc (&emem, _BufOffset, _BufSize);

      for (i = 0; i < _SeqLen; i++)
        {
          /* Pick a random core in the workgroup. */
          row = rand () % platform.rows;
          col = rand () % platform.cols;
          coreid = (row + platform.row) * 64 + col + platform.col;

          /* Open a 1x1 workgroup on that core, then reset it and load the program. */
          e_open (&dev, row, col, 1, 1);
          e_reset_core (&dev, 0, 0);
          e_load ("e_hello_world.srec", &dev, 0, 0, E_TRUE);
        }

      return EXIT_SUCCESS;
    }

Matrix Multiply – Epiphany

    int main (void)
    {
      init ();
      e_barrier_init (barriers, tgt_bars);

      if (me.corenum == 0)
        {
          /* Core 0 waits for the host to signal "go". */
          while (Mailbox.pCore->go == 0)
            ;
          Mailbox.pCore->ready = 0;
        }

      e_barrier (barriers, tgt_bars);   /* All cores start together. */
      bigmatmul ();
      e_barrier (barriers, tgt_bars);   /* Wait for every core to finish. */

      if (me.corenum == 0)
        {
          Mailbox.pCore->go = 0;
          Mailbox.pCore->ready = 0;
        }
    }

Matrix Multiply – Host

    int main (void)
    {
      <...>
      /* Open a workgroup covering the whole chip. */
      e_open (pEpiphany, 0, 0, e_platform.chip[0].rows,
              e_platform.chip[0].cols);

      /* Clear the ready flag in shared DRAM before loading the cores. */
      Mailbox.core.ready = 0;
      e_write (pDRAM, 0, 0, addr, &Mailbox.core.ready,
               sizeof (Mailbox.core.ready));

      e_load_group (ar.srecFile, pEpiphany, 0, 0, pEpiphany->rows,
                    pEpiphany->cols, ar.run_target);
      matrix_init (seed);

      return EXIT_SUCCESS;
    }

Measuring and Optimising Performance

Benchmarking Code

• Each core has two timers which can be used to examine the performance of your code.
• Timers can count instructions, pipeline stalls, etc.

Example:

    e_ctimer_stop (E_CTIMER_0);                  // stop timer
    e_ctimer_set (E_CTIMER_0, 0);                // zero timer
    e_ctimer_start (E_CTIMER_0, E_CTIMER_CLK);   // measure clk cycles
    foo ();                                      // my function
    e_ctimer_stop (E_CTIMER_0);                  // stop timer
    time = e_ctimer_get (E_CTIMER_0);            // get time

Placing Code in Fast Memory

internal.ldf
• Everything stored in internal SRAM
• Best if everything fits within 32KB

fast.ldf
• User code/data and stack in internal SRAM
• Standard libraries stored in external DRAM
• Best if using few large library functions

legacy.ldf
• Everything stored in external DRAM
• Will be much slower than internal and fast
• 1MB of storage for the whole program

(Diagram: speed decreases and available space increases from internal.ldf to legacy.ldf.)

Example Usage:

    e-gcc -T ${ESDK}/bsps/current/fast.ldf foo.c -o foo.o -le-lib

Writing is Faster for Communication

• Both read and write meshes take one cycle to pass data to a neighbour.
• Reading data from a core n hops away takes n cycles to send the address and n more to get the data.
• Writing data to a core n hops away takes n cycles to send the data.
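The cycle counts above can be turned into a small cost model. This is a rough sketch, assuming n is the number of mesh hops and that hops are counted as the Manhattan distance between two cores on the 2D grid. That distance metric and the function names are illustrative assumptions, and the model ignores contention and off-chip traffic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: hop count between two cores, assuming
   Manhattan-distance routing on the 2D mesh. */
unsigned mesh_hops(int src_row, int src_col, int dst_row, int dst_col)
{
    assert(src_row >= 0 && src_col >= 0 && dst_row >= 0 && dst_col >= 0);
    return (unsigned)(abs(dst_row - src_row) + abs(dst_col - src_col));
}

/* A write sends the data once: one cycle per hop. */
unsigned write_cycles(unsigned hops)
{
    return hops;
}

/* A read sends the address out and waits for the data to come back. */
unsigned read_cycles(unsigned hops)
{
    return 2u * hops;
}
```

Under this model, passing data from core (0,0) to core (3,3) costs 6 cycles if the producer writes it, but 12 cycles if the consumer reads it, which is why having the producer push results to its consumer tends to be the cheaper pattern.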

DMA Engines

• Each core has two DMA engines for moving data between cores and (optionally) off chip.
• Can move a double-word (8 bytes) per cycle, so at 1GHz maximum throughput is 8GB/s.
• Can be configured to stride through data and run in either blocking or nonblocking modes.
• e_dma_copy is an alternative to memcpy for a simple configuration.
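The throughput figure follows from simple arithmetic, sketched below in plain C. The function names are hypothetical, and the model assumes the ideal case of one 8-byte double-word per cycle with no setup cost or mesh contention.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved per cycle when the DMA engine transfers one double-word. */
#define DMA_BYTES_PER_CYCLE 8u

/* Hypothetical helper: peak throughput in bytes per second for a given clock. */
uint64_t dma_peak_bytes_per_second(uint64_t clock_hz)
{
    return clock_hz * DMA_BYTES_PER_CYCLE;
}

/* Hypothetical helper: lower bound on cycles needed to move nbytes,
   rounding up to whole double-words. */
uint64_t dma_min_cycles(uint64_t nbytes)
{
    assert(nbytes <= UINT64_MAX - (DMA_BYTES_PER_CYCLE - 1u));
    return (nbytes + DMA_BYTES_PER_CYCLE - 1u) / DMA_BYTES_PER_CYCLE;
}
```

With a 1GHz clock this gives 8,000,000,000 bytes per second, and filling a core's entire 32KB local memory would take at least 4096 cycles.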

Coming Soon... Software Cache

• As part of the next SDK release, the tools will support software caching.
• Functions will be copied into internal RAM at runtime as they are needed, striking a balance between execution speed and internal storage.

(Diagram: functions foo, bar and baz placed in internal and external RAM.)

... and better Multicore Debugging

• Improved e-server for better multicore debugging
• Debug all cores as threads in one gdb connection
• Will enable use of Eclipse Multicore Visualizer
• Source available in jeremy-update branch on GitHub
  – Feedback welcome

Workshop
