Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Domain-specific languages: the key to computational efficiency and programmer productivity

Be1c8a24b76f8b2b23f53eb22d401810?s=47 Imperial ACM
February 07, 2014

Domain-specific languages: the key to computational efficiency and programmer productivity

A rule of thumb in computing says that generality and efficiency are conflicting goals when designing a system. In a similar way, a high-level system is generally regarded as less efficient than a low-level system. In this talk I will show that this is not necessarily the case and how domain-specific languages can be used as a building block in efficient yet high-level computational frameworks. Not only do they provide the key to computational efficiency, but also to programmer productivity and more maintainable codes.

Be1c8a24b76f8b2b23f53eb22d401810?s=128

Imperial ACM

February 07, 2014
Tweet

Transcript

  1. Domain-specific languages: the key to computational efficiency and programmer productivity

    Imperial College London ACM Student Chapter seminar series Florian Rathgeber, Department of Computing, Imperial College London February 7 2014 Slides: http://kynan.github.io/ACM2014 1 / 34
  2. Domain- Specific Languages (DSLs) Languages specialised to a particular application

    domain In contrast to general-purpose (programming) languages Markup languages XML dialects: HTML, SVG Graphs, graphics: GraphViz, GrGen, PGF Text: Markdown, rST, Org, MediaWiki remark used for this slide deck Modelling languages Linear algebra: MATLAB, Mathematica, Maple Symbolic maths: Mathematica, Maxima Hardware description: Verilog, VHDL Database queries: SQL, SparQL Spreadsheets: Excel / LibreOffice formulas Embedded DSLs Domain-specific language embedded in a host language (C++, Python, Ruby) DSL user can mix and match and extend DSL code with the host language DSL writer can use lexing, parsing, syntax tools of the host language Boundaries are not sharp: LaTeX and PostScript are turing- complete! 2 / 34
  3. Motivating example: markup languages 3 / 34

  4. This slide deck is written in remark A simple, in-

    browser, markdown- driven slideshow tool. or A domain- specific language for writing browser-based slide decks remark source for this slide deck: n a m e : i n v e r s e l a y o u t : t r u e c l a s s : c e n t e r , m i d d l e , i n v e r s e - - - # D o m a i n - s p e c i f i c l a n g u a g e s : t h e k e y t o c o m p u t a t i o n a l e f f i c i e n c y a n d p r o g # # I m p e r i a l C o l l e g e L o n d o n A C M S t u d e n t C h a p t e r s e m i n a r s e r i e s # # # * * F l o r i a n R a t h g e b e r * * , D e p a r t m e n t o f C o m p u t i n g , I m p e r i a l C o l l e g e L o n d # # # F e b r u a r y 7 2 0 1 4 S l i d e s : h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 - - - l a y o u t : f a l s e . l e f t - c o l u m n [ # # D o m a i n - S p e c i f i c L a n g u a g e s ( D S L s ) L a n g u a g e s s p e c i a l i s e d t o a p a r t i c u l a r a p p l i c a t i o n d o m a i n I n c o n t r a s t t o g e n e r a l - p u r p o s e ( p r o g r a m m i n g ) l a n g u a g e s ] . r i g h t - c o l u m n [ . . . ] - - - 4 / 34
  5. remark generated html for the title slide < d i

    v c l a s s = " r e m a r k - s l i d e - c o n t a i n e r r e m a r k - v i s i b l e " > < d i v c l a s s = " r e m a r k - s l i d e - s c a l e r " s t y l e = " w i d t h : 9 0 8 p x ; h e i g h t : 6 8 1 p x ; - w e b k i t - t r a n s f o r m : s c a l e ( 0 . 7 9 7 3 5 6 8 2 8 1 9 3 8 3 2 6 ) ; l e f t : 1 1 7 p x ; t o p : 0 p x ; " > < d i v c l a s s = " r e m a r k - s l i d e " > < d i v c l a s s = " r e m a r k - s l i d e - c o n t e n t c e n t e r m i d d l e i n v e r s e h l j s - d e f a u l t " > < h 1 > D o m a i n - s p e c i f i c l a n g u a g e s : t h e k e y t o c o m p u t a t i o n a l e f f i c i e n c y a n d p r o g r a m m e r p r o d u c t i v i t y < / h 1 > < h 2 > I m p e r i a l C o l l e g e L o n d o n A C M S t u d e n t C h a p t e r s e m i n a r s e r i e s < / h 2 > < h 3 > < s t r o n g > F l o r i a n R a t h g e b e r < / s t r o n g > , D e p a r t m e n t o f C o m p u t i n g , I m p e r i a l C o l l e g e L o n d o n < / h 3 > < h 3 > F e b r u a r y 7 2 0 1 4 < / h 3 > < p > S l i d e s : < a h r e f = " h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 " > h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 < / a > < / p > < d i v c l a s s = " r e m a r k - s l i d e - n u m b e r " > 1 / 2 5 < / d i v > < / d i v > < / d i v > < / d i v > < / d i v > 5 / 34
  6. Domain-specific languages in computational science 6 / 34

  7. Image processing 7 / 34

  8. Halide (1/2) A domain-specific language for image processing and computational

    photography embedded in C++: http://halide-lang.org Write high-performance image processing code on modern machines Compiler targets: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL Schedule for a 3x3 box filter defined as a series of two 3x1 passes: F u n c b l u r _ 3 x 3 ( F u n c i n p u t ) { F u n c b l u r _ x , b l u r _ y ; V a r x , y , x i , y i ; / / T h e a l g o r i t h m - n o s t o r a g e o r o r d e r b l u r _ x ( x , y ) = ( i n p u t ( x - 1 , y ) + i n p u t ( x , y ) + i n p u t ( x + 1 , y ) ) / 3 ; b l u r _ y ( x , y ) = ( b l u r _ x ( x , y - 1 ) + b l u r _ x ( x , y ) + b l u r _ x ( x , y + 1 ) ) / 3 ; / / T h e s c h e d u l e - d e f i n e s o r d e r , l o c a l i t y ; i m p l i e s s t o r a g e b l u r _ y . t i l e ( x , y , x i , y i , 2 5 6 , 3 2 ) . v e c t o r i z e ( x i , 8 ) . p a r a l l e l ( y ) ; b l u r _ x . c o m p u t e _ a t ( b l u r _ y , x ) . v e c t o r i z e ( x , 8 ) ; r e t u r n b l u r _ y ; } A F u n c object represents a pipeline stage, a "computed image": a pure function defining the value of each pixel. 8 / 34
  9. Halide (2/2) Single-stage imaging pipeline that outputs a grayscale diagonal

    gradient: # i n c l u d e < H a l i d e . h > # i n c l u d e < s t d i o . h > i n t m a i n ( i n t a r g c , c h a r * * a r g v ) { H a l i d e : : F u n c g r a d i e n t ; H a l i d e : : V a r x , y ; H a l i d e : : E x p r e = x + y ; g r a d i e n t ( x , y ) = e ; H a l i d e : : I m a g e < i n t 3 2 _ t > o u t p u t = g r a d i e n t . r e a l i z e ( 8 0 0 , 6 0 0 ) ; f o r ( i n t j = 0 ; j < o u t p u t . h e i g h t ( ) ; j + + ) { f o r ( i n t i = 0 ; i < o u t p u t . w i d t h ( ) ; i + + ) { i f ( o u t p u t ( i , j ) ! = i + j ) { p r i n t f ( " S o m e t h i n g w e n t w r o n g ! \ n " " P i x e l % d , % d w a s s u p p o s e d t o b e % d , b u t i n s t e a d i t ' s % d \ n " , i , j , i + j , o u t p u t ( i , j ) ) ; r e t u r n - 1 ; } } } r e t u r n 0 ; } 9 / 34
  10. Stencil languages 10 / 34

  11. Pochoir Compiler and runtime system for implementing stencil computations on

    multicore processors Domain-specific language embedded in C++ Stencils Define the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t. A stencil computation computes the stencil for each grid point over many time steps 19-point stencil for the Lattice-Boltzman Method (LBM) 11 / 34
  12. 2D heat equation coded in Pochoir # d e f

    i n e m o d ( r , m ) ( ( r ) % ( m ) + ( ( r ) < 0 ) ? ( m ) : 0 ) P o c h o i r _ B o u n d a r y _ 2 D ( h e a t _ b v , a , t , x , y ) r e t u r n a . g e t ( t , m o d ( x , a . s i z e ( 1 ) ) , m o d ( y , a . s i z e ( 0 ) ) ) ; P o c h o i r _ B o u n d a r y _ E n d i n t m a i n ( v o i d ) { P o c h o i r _ S h a p e _ 2 D 2 D _ f i v e _ p t [ ] = { { 1 , 0 , 0 } , { 0 , 1 , 0 } , { 0 , - 1 , 0 } , { 0 , - 1 , - 1 } , { 0 , 0 , - 1 } , { 0 , 0 , 1 } } ; P o c h o i r _ 2 D h e a t ( 2 D _ f i v e _ p t ) ; P o c h o i r _ A r r a y _ 2 D ( d o u b l e ) u ( X , Y ) ; u . R e g i s t e r _ B o u n d a r y ( h e a t _ b v ) ; h e a t . R e g i s t e r _ A r r a y ( u ) ; P o c h o i r _ K e r n e l _ 2 D ( h e a t _ f n , t , x , y ) u ( t + 1 , x , y ) = C X * ( u ( t , x + 1 , y ) - 2 * u ( t , x , y ) + u ( t , x - 1 , y ) ) + C Y * ( u ( t , x , y + 1 ) - 2 * u ( t , x , y ) + u ( t , x , y - 1 ) ) + u ( t , x , y ) ; P o c h o i r _ K e r n e l _ E n d f o r ( i n t x = 0 ; x < X ; + + x ) f o r ( i n t y = 0 ; y < Y ; + + y ) u ( 0 , x , y ) = r a n d ( ) ; h e a t . R u n ( T , h e a t _ f n ) ; f o r ( i n t x = 0 ; x < X ; + + x ) f o r ( i n t y = 0 ; y < Y ; + + y ) c o u t < < u ( T , x , y ) ; r e t u r n 0 ; } 12 / 34
  13. Unstructured meshes / graphs 13 / 34

  14. Structure of scientific computations on unstructured meshes Independent local operations

    for each element of the mesh described as a kernel. Reductions aggregating results from local operations to produce the final result. PyOP2 A domain-specific language embedded in Python for parallel computations on unstructured meshes or graphs. Unstructured meshes Meshes are described by sets and maps between them and can have data associated with them: 1 0 2 3 6 5 4 7 8 0 1 3 2 4 5 8 7 6 OP2 Sets: nodes = [0, 1, 2, 3, 4, 5, 6, 7, 8] elements = [0, 1, 2, 3, 4, 5, 6, 7, 8] OP2 Map elements-nodes: elem_nodes = [[0, 1, 2], [1, 3, 2], ...] OP2 Dat on nodes: coords = [..., [.5,.5], [.5,-.25], [1,.25], ...] 14 / 34
  15. PyOP2 Data Model Mesh topology Sets – cells, vertices, etc

    Maps – connectivity between entities in different sets Data Dats – Defined on sets (hold pressure, temperature, etc) Kernels Executed in parallel on a set through a parallel loop Read / write / increment data accessed via maps Linear algebra Matrix sparsities defined by mappings Matrix data on sparsities Kernels compute a local matrix – PyOP2 handles global assembly 15 / 34
  16. PyOP2 Kernels and Parallel Loops Performance portability for any unstructured

    mesh computations Parallel loop syntax o p 2 . p a r _ l o o p ( k e r n e l , i t e r a t i o n _ s e t , k e r n e l _ a r g 1 ( a c c e s s _ m o d e , m a p p i n g [ i n d e x ] ) , . . . , k e r n e l _ a r g N ( a c c e s s _ m o d e , m a p p i n g [ i n d e x ] ) ) PyOP2 programme for computing the midpoint of a triangle f r o m p y o p 2 i m p o r t o p 2 o p 2 . i n i t ( ) v e r t i c e s = o p 2 . S e t ( n u m _ v e r t i c e s ) c e l l s = o p 2 . S e t ( n u m _ c e l l s ) c e l l 2 v e r t e x = o p 2 . M a p ( c e l l s , v e r t i c e s , 3 , [ . . . ] ) c o o r d i n a t e s = o p 2 . D a t ( v e r t i c e s * * 2 , [ . . . ] , d t y p e = f l o a t ) m i d p o i n t s = o p 2 . D a t ( c e l l s * * 2 , d t y p e = f l o a t ) m i d p o i n t = o p 2 . K e r n e l ( " " " v o i d m i d p o i n t ( d o u b l e p [ 2 ] , d o u b l e * c o o r d s [ 2 ] ) { p [ 0 ] = ( c o o r d s [ 0 ] [ 0 ] + c o o r d s [ 1 ] [ 0 ] + c o o r d s [ 2 ] [ 0 ] ) / 3 . 0 ; p [ 1 ] = ( c o o r d s [ 0 ] [ 1 ] + c o o r d s [ 1 ] [ 1 ] + c o o r d s [ 2 ] [ 1 ] ) / 3 . 0 ; } " " " , " m i d p o i n t " ) o p 2 . p a r _ l o o p ( m i d p o i n t , c e l l s , m i d p o i n t s ( o p 2 . W R I T E ) , c o o r d i n a t e s ( o p 2 . R E A D , c e l l 2 v e r t e x ) ) Future work: define kernels as Python functions 16 / 34
  17. PyOP2 Architecture Parallel scheduling: partitioning, staging and coloring Runtime code

    generation and JIT compilation OpenCL CUDA Instant PyOpenCL PyCUDA CPU OpenMP CPU seq. MPI Runtime Core partitioning, parallel scheduling Lin. algebra PETSc/Cusp Kernels Data Access Descriptors Application code Backends Code gen PyOP2 core User code 17 / 34
  18. Generated sequential code calling the midpoint kernel v o i

    d w r a p _ m i d p o i n t _ _ ( P y O b j e c t * _ s t a r t , P y O b j e c t * _ e n d , P y O b j e c t * _ a r g 0 _ 0 , P y O b j e c t * _ a r g 1 _ 0 , P y O b j e c t * _ a r g 1 _ 0 _ m a p 0 _ 0 ) { i n t s t a r t = ( i n t ) P y I n t _ A s L o n g ( _ s t a r t ) ; i n t e n d = ( i n t ) P y I n t _ A s L o n g ( _ e n d ) ; d o u b l e * a r g 0 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 ) - > d a t a ) ; i n t * a r g 1 _ 0 _ m a p 0 _ 0 = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 _ m a p 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 _ v e c [ 3 ] ; f o r ( i n t n = s t a r t ; n < e n d ; n + + ) { a r g 1 _ 0 _ v e c [ 0 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 0 ] * 2 ; a r g 1 _ 0 _ v e c [ 1 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 1 ] * 2 ; a r g 1 _ 0 _ v e c [ 2 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 2 ] * 2 ; m i d p o i n t ( a r g 0 _ 0 + n * 2 , a r g 1 _ 0 _ v e c ) ; } } Generated OpenMP code calling the midpoint kernel next slide 18 / 34
  19. v o i d w r a p _ m

    i d p o i n t _ _ ( P y O b j e c t * _ b o f f s e t , P y O b j e c t * _ n b l o c k s , P y O b j e c t * _ b l k m a p , P y O b j e c t * _ o f f s e t , P y O b j e c t * _ n e l e m s , P y O b j e c t * _ a r g 0 _ 0 , P y O b j e c t * _ a r g 1 _ 0 , P y O b j e c t * _ a r g 1 _ 0 _ m a p 0 _ 0 ) { i n t b o f f s e t = ( i n t ) P y I n t _ A s L o n g ( _ b o f f s e t ) ; i n t n b l o c k s = ( i n t ) P y I n t _ A s L o n g ( _ n b l o c k s ) ; i n t * b l k m a p = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ b l k m a p ) - > d a t a ) ; i n t * o f f s e t = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ o f f s e t ) - > d a t a ) ; i n t * n e l e m s = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ n e l e m s ) - > d a t a ) ; d o u b l e * a r g 0 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 ) - > d a t a ) ; i n t * a r g 1 _ 0 _ m a p 0 _ 0 = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 _ m a p 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 _ v e c [ 3 2 ] [ 3 ] ; # i f d e f _ O P E N M P i n t n t h r e a d = o m p _ g e t _ m a x _ t h r e a d s ( ) ; # e l s e i n t n t h r e a d = 1 ; # e n d i f # p r a g m a o m p p a r a l l e l s h a r e d ( b o f f s e t , n b l o c k s , n e l e m s , b l k m a p ) { i n t t i d = o m p _ g e t _ t h r e a d _ n u m ( ) ; # p r a g m a o m p f o r s c h e d u l e ( s t a t i c ) f o r ( i n t _ _ b = b o f f s e t ; _ _ b < b o f f s e t + n b l o c k s ; _ _ b + + ) { i n t b i d = b l k m a p [ _ _ b ] ; i n t n e l e m = n e l e m s [ b i d ] ; i n t e f i r s t = o f f s e t [ b i d ] ; f o r ( i n t n = e f i r s t ; n < e f i r s t + n e l e m ; n + + ) { a r g 1 _ 0 _ v e c [ t i d ] [ 0 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 0 ] * 2 ; a r g 1 _ 0 _ v e c [ t i d ] [ 1 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 1 ] * 2 ; a r g 1 _ 0 _ v e c [ t i d ] [ 2 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 2 ] * 2 ; m i d p o i n t ( a r g 0 _ 0 + n * 2 , a r g 1 _ 0 _ v e c [ t i d ] ) ; } } } } 19 / 34
  20. PyOP2 Partitioning, Staging & Coloring Key optimizations performed by PyOP2

    runtime core Partitioning for on-chip memory (shared memory / cache) Coloring to avoid data races on updates to the same memory location Example Parallel computation executing a kernel over the edges of the mesh: # S e t s o f n o d e s a n d e d g e s n o d e s = o p 2 . S e t ( N ) # N = n u m b e r o f n o d e s e d g e s = o p 2 . S e t ( M ) # M = n u m b e r o f e d g e s # M a p p i n g f r o m e d g e s t o n o d e s e d g e _ t o _ n o d e _ m a p = o p 2 . M a p ( e d g e s , n o d e s , 2 , . . . ) # D a t a d e f i n e d o n n o d e s u = o p 2 . D a t ( n o d e s , . . . ) # K e r n e l e x e c u t i n g o v e r s e t o f e d g e s , c o m p u t i n g o n n o d a l d a t a o p 2 . p a r _ l o o p ( k e r n e l , e d g e s , u ( e g d e _ t o _ n o d e _ m a p [ : ] , o p 2 . I N C ) ) 20 / 34
  21. 21 / 34

  22. edges shared / staging memory vertices 22 / 34

  23. edges shared / staging memory vertices 23 / 34

  24. Finite-element computations 24 / 34

  25. i i j j k k i i i i

    j j j j k k k k i j k Ax = b The weak form of the Helmholtz equation: Finite-element assembly 25 / 34
  26. UFL: High- level definition of finite- element forms UFL is

    the Unified Form Language from the FEniCS project. The weak form of the Helmholtz equation And its (almost) literal translation to Python with UFL UFL: embedded domain-specific language (eDSL) for weak forms of partial differential equations (PDEs) e = F i n i t e E l e m e n t ( ' C G ' , ' t r i a n g l e ' , 1 ) v = T e s t F u n c t i o n ( e ) u = T r i a l F u n c t i o n ( e ) f = C o e f f i c i e n t ( e ) l m b d a = 1 a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x L = v * f * d x ∇v ⋅ ∇u − λvu dV = vf dV ∫ Ω ∫ Ω 26 / 34
  27. Helmholtz local assembly kernel generated by FFC The FEniCS Form

    Compiler FFC compiles UFL forms to low- level code. Helmholtz equation UFL expression a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x Generated C code / / A - l o c a l t e n s o r t o a s s e m b l e / / x - l o c a l c o o r d i n a t e s / / j , k - 2 D i n d i c e s i n t o t h e l o c a l a s s e m b l y m a t r i x v o i d k e r n e l ( d o u b l e A [ 1 ] [ 1 ] , d o u b l e * x [ 2 ] , i n t j , i n t k ) { / / F E 0 - S h a p e f u n c t i o n s / / D i j - S h a p e f u n c t i o n d e r i v a t i v e s / / K i j - J a c o b i a n i n v e r s e / d e t e r m i n a n t / / W 3 - Q u a d r a t u r e w e i g h t s / / d e t - J a c o b i a n d e t e r m i n a n t f o r ( u n s i g n e d i n t i p = 0 ; i p < 3 ; i p + + ) { A [ 0 ] [ 0 ] + = ( F E 0 [ i p ] [ j ] * F E 0 [ i p ] [ k ] * ( - 1 . 0 ) + ( ( ( K 0 0 * D 1 0 [ i p ] [ j ] + K 1 0 * D 0 1 [ i p ] [ j ] ) ) * ( ( K 0 0 * D 1 0 [ i p ] [ k ] + K 1 0 * D 0 1 [ i p ] [ k ] ) ) + ( ( K 0 1 * D 1 0 [ i p ] [ j ] + K 1 1 * D 0 1 [ i p ] [ j ] ) ) * ( ( K 0 1 * D 1 0 [ i p ] [ k ] + K 1 1 * D 0 1 [ i p ] [ k ] ) ) ) ) * W 3 [ i p ] * d e t ; } } ∇v ⋅ ∇u − λvu dV = vf dV ∫ Ω ∫ Ω 27 / 34
  28. The Firedrake/PyOP2 tool chain PyOP2 Parallel unstructured mesh computation framework

    Firedrake A performance-portable Finite-element computation framework Unified Form Language (UFL) PyOP2 Interface FFC Form Compiler Parallel scheduling, code generation CPU (+MPI + OpenMP/ OpenCL) GPU (CUDA / OpenCL) Future arch. Problem definition in FEM weak form Local assembly kernels, data dependencies Parallel loops over kernels with access descriptors Explicitly parallel hardware- specific implemen- tation Geometry, fields and meshes Fluidity 28 / 34
  29. Two-layered abstraction: Separation of concerns Unified Form Language (UFL) PyOP2

    Interface FFC Form Compiler Parallel scheduling, code generation CPU (+MPI + OpenMP/ OpenCL) GPU (CUDA / OpenCL) Future arch. Problem definition in FEM weak form Local assembly kernels, data dependencies Parallel loops over kernels with access descriptors Explicitly parallel hardware- specific implemen- tation Geometry, fields and meshes Fluidity Domain specialist: mathematical model using FEM Numerical analyst: generation of FEM kernels Domain specialist: mathematical model on un- structured grid Parallel programming expert: hardware architectures, optimization Expert for each layer 29 / 34
  30. Driving Finite-element Computations in Firedrake Solving the Helmholtz equation in

    Python using Firedrake: f r o m f i r e d r a k e i m p o r t * # R e a d a m e s h a n d d e f i n e a f u n c t i o n s p a c e m e s h = M e s h ( ' f i l e n a m e ' ) V = F u n c t i o n S p a c e ( m e s h , " L a g r a n g e " , 1 ) # D e f i n e f o r c i n g f u n c t i o n f o r r i g h t - h a n d s i d e d e f f o r c e ( X ) : l m b d a = 1 n = 8 r e t u r n - ( l m b d a + 2 * ( n * * 2 ) * p i * * 2 ) * s i n ( X [ 0 ] * p i * n ) * s i n ( X [ 1 ] * p i * n ) f = E x p r e s s i o n ( f o r c e ) # S e t u p t h e F i n i t e - e l e m e n t w e a k f o r m s u = T r i a l F u n c t i o n ( V ) v = T e s t F u n c t i o n ( V ) l m b d a = 1 a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x L = v * f * d x # S o l v e t h e r e s u l t i n g f i n i t e - e l e m e n t e q u a t i o n p = F u n c t i o n ( V ) s o l v e ( a = = L , p ) 30 / 34
  31. Finite element assembly and solve in Firedrake What goes on

    behind the scenes of the s o l v e call (simplified example!): f r o m p y o p 2 i m p o r t o p 2 , f f c _ i n t e r f a c e d e f s o l v e ( e q u a t i o n , x ) : # I n v o k e F F C t o g e n e r a t e k e r n e l s f o r m a t r i x a n d r h s a s s e m b l y l h s = f f c _ i n t e r f a c e . c o m p i l e _ f o r m ( e q u a t i o n . l h s , " l h s " ) [ 0 ] r h s = f f c _ i n t e r f a c e . c o m p i l e _ f o r m ( e q u a t i o n . r h s , " r h s " ) [ 0 ] # O m i t t e d : e x t r a c t c o o r d i n a t e s ( c o o r d s ) , c o n n e c t i v i t y ( e l e m _ n o d e ) # a n d c o e f f i c i e n t s ( f i e l d f ) # C o n s t r u c t O P 2 m a t r i x t o a s s e m b l e i n t o s p a r s i t y = o p 2 . S p a r s i t y ( ( e l e m _ n o d e , e l e m _ n o d e ) , s p a r s i t y _ d i m ) m a t = o p 2 . M a t ( s p a r s i t y , n u m p y . f l o a t 6 4 ) b = o p 2 . D a t ( n o d e s , n p . z e r o s ( n o d e s . s i z e ) ) # A s s e m b l e l h s , r h s a n d s o l v e l i n e a r s y s t e m o p 2 . p a r _ l o o p ( l h s , e l e m e n t s , m a t ( o p 2 . I N C , ( e l e m _ n o d e [ o p 2 . i [ 0 ] ] , e l e m _ n o d e [ o p 2 . i [ 1 ] ] ) ) , c o o r d s ( o p 2 . R E A D , e l e m _ n o d e ) ) o p 2 . p a r _ l o o p ( r h s , e l e m e n t s , b ( o p 2 . I N C , e l e m _ n o d e [ o p 2 . i [ 0 ] ] ) , c o o r d s ( o p 2 . R E A D , e l e m _ n o d e ) , f ( o p 2 . R E A D , e l e m _ n o d e ) ) # S o l v e t h e a s s e m b l e d s p a r s e l i n e a r s y s t e m o f e q u a t i o n s o p 2 . s o l v e ( m a t , x , b ) 31 / 34
  32. Recap 32 / 34

  33. Computational Science is hard Unless you break it down with

    the right abstractions Many-core hardware has brought a paradigm shift to CSE, scientific software needs to keep up The Solution High-level structure Goal: producing high level interfaces to numerical computing PyOP2: a high-level interface to unstructured mesh based methods Efficiently execute kernels over an unstructured grid in parallel Firedrake: a performance-portable finite-element computation framework Drive FE computations from a high-level problem specification Low-level operations Separating the low-level implementation from the high-level problem specification Generate platform-specific implementations from a common source instead of hand-coding them Runtime code generation and JIT compilation open space for compiler-driven optimizations and performance portability 33 / 34
  34. Thank you! Contact: Florian Rathgeber, @frathgeber, f.rathgeber@imperial.ac.uk Resources Halide http://halide-lang.org

    Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. PLDI 2013 Pochoir http://groups.csail.mit.edu/sct/wiki/index.php?title=The_Pochoir_Project The Pochoir Stencil Compiler Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. SPAA ’11, 2011 PyOP2 https://github.com/OP2/PyOP2 PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicholas Loriant, David A. Ham, Carlo Bertolli, Paul H.J. Kelly WOLFHPC 2012 Firedrake https://github.com/firedrakeproject/firedrake Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, Paul H. J. Kelly ISC 2013 UFL https://bitbucket.org/mapdes/ufl FFC https://bitbucket.org/mapdes/ffc This talk is available at http://kynan.github.io/ACM2014 Slides created with remark 34 / 34