Domain-specific languages: the key to computational efficiency and programmer productivity

Domain-specific languages: the key to computational efficiency and programmer productivity
Imperial College London ACM Student Chapter seminar series Florian Rathgeber, Department of Computing, Imperial College London February 7 2014 Slides: http://kynan.github.io/ACM2014 1 / 34

Domain- Specific Languages (DSLs) Languages specialised to a particular application
domain In contrast to general-purpose (programming) languages Markup languages XML dialects: HTML, SVG Graphs, graphics: GraphViz, GrGen, PGF Text: Markdown, rST, Org, MediaWiki remark used for this slide deck Modelling languages Linear algebra: MATLAB, Mathematica, Maple Symbolic maths: Mathematica, Maxima Hardware description: Verilog, VHDL Database queries: SQL, SparQL Spreadsheets: Excel / LibreOffice formulas Embedded DSLs Domain-specific language embedded in a host language (C++, Python, Ruby) DSL user can mix and match and extend DSL code with the host language DSL writer can use lexing, parsing, syntax tools of the host language Boundaries are not sharp: LaTeX and PostScript are turing- complete! 2 / 34

Motivating example: markup languages 3 / 34

This slide deck is written in remark A simple, in-
browser, markdown- driven slideshow tool. or A domain- specific language for writing browser-based slide decks remark source for this slide deck: n a m e : i n v e r s e l a y o u t : t r u e c l a s s : c e n t e r , m i d d l e , i n v e r s e - - - # D o m a i n - s p e c i f i c l a n g u a g e s : t h e k e y t o c o m p u t a t i o n a l e f f i c i e n c y a n d p r o g # # I m p e r i a l C o l l e g e L o n d o n A C M S t u d e n t C h a p t e r s e m i n a r s e r i e s # # # * * F l o r i a n R a t h g e b e r * * , D e p a r t m e n t o f C o m p u t i n g , I m p e r i a l C o l l e g e L o n d # # # F e b r u a r y 7 2 0 1 4 S l i d e s : h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 - - - l a y o u t : f a l s e . l e f t - c o l u m n [ # # D o m a i n - S p e c i f i c L a n g u a g e s ( D S L s ) L a n g u a g e s s p e c i a l i s e d t o a p a r t i c u l a r a p p l i c a t i o n d o m a i n I n c o n t r a s t t o g e n e r a l - p u r p o s e ( p r o g r a m m i n g ) l a n g u a g e s ] . r i g h t - c o l u m n [ . . . ] - - - 4 / 34

remark generated html for the title slide < d i
v c l a s s = " r e m a r k - s l i d e - c o n t a i n e r r e m a r k - v i s i b l e " > < d i v c l a s s = " r e m a r k - s l i d e - s c a l e r " s t y l e = " w i d t h : 9 0 8 p x ; h e i g h t : 6 8 1 p x ; - w e b k i t - t r a n s f o r m : s c a l e ( 0 . 7 9 7 3 5 6 8 2 8 1 9 3 8 3 2 6 ) ; l e f t : 1 1 7 p x ; t o p : 0 p x ; " > < d i v c l a s s = " r e m a r k - s l i d e " > < d i v c l a s s = " r e m a r k - s l i d e - c o n t e n t c e n t e r m i d d l e i n v e r s e h l j s - d e f a u l t " > < h 1 > D o m a i n - s p e c i f i c l a n g u a g e s : t h e k e y t o c o m p u t a t i o n a l e f f i c i e n c y a n d p r o g r a m m e r p r o d u c t i v i t y < / h 1 > < h 2 > I m p e r i a l C o l l e g e L o n d o n A C M S t u d e n t C h a p t e r s e m i n a r s e r i e s < / h 2 > < h 3 > < s t r o n g > F l o r i a n R a t h g e b e r < / s t r o n g > , D e p a r t m e n t o f C o m p u t i n g , I m p e r i a l C o l l e g e L o n d o n < / h 3 > < h 3 > F e b r u a r y 7 2 0 1 4 < / h 3 > < p > S l i d e s : < a h r e f = " h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 " > h t t p : / / k y n a n . g i t h u b . i o / A C M 2 0 1 4 < / a > < / p > < d i v c l a s s = " r e m a r k - s l i d e - n u m b e r " > 1 / 2 5 < / d i v > < / d i v > < / d i v > < / d i v > < / d i v > 5 / 34

Domain-specific languages in computational science 6 / 34

Image processing 7 / 34

Halide (1/2) A domain-specific language for image processing and computational
photography embedded in C++: http://halide-lang.org Write high-performance image processing code on modern machines Compiler targets: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL Schedule for a 3x3 box filter defined as a series of two 3x1 passes: F u n c b l u r _ 3 x 3 ( F u n c i n p u t ) { F u n c b l u r _ x , b l u r _ y ; V a r x , y , x i , y i ; / / T h e a l g o r i t h m - n o s t o r a g e o r o r d e r b l u r _ x ( x , y ) = ( i n p u t ( x - 1 , y ) + i n p u t ( x , y ) + i n p u t ( x + 1 , y ) ) / 3 ; b l u r _ y ( x , y ) = ( b l u r _ x ( x , y - 1 ) + b l u r _ x ( x , y ) + b l u r _ x ( x , y + 1 ) ) / 3 ; / / T h e s c h e d u l e - d e f i n e s o r d e r , l o c a l i t y ; i m p l i e s s t o r a g e b l u r _ y . t i l e ( x , y , x i , y i , 2 5 6 , 3 2 ) . v e c t o r i z e ( x i , 8 ) . p a r a l l e l ( y ) ; b l u r _ x . c o m p u t e _ a t ( b l u r _ y , x ) . v e c t o r i z e ( x , 8 ) ; r e t u r n b l u r _ y ; } A F u n c object represents a pipeline stage, a "computed image": a pure function defining the value of each pixel. 8 / 34

Halide (2/2) Single-stage imaging pipeline that outputs a grayscale diagonal
gradient: # i n c l u d e < H a l i d e . h > # i n c l u d e < s t d i o . h > i n t m a i n ( i n t a r g c , c h a r * * a r g v ) { H a l i d e : : F u n c g r a d i e n t ; H a l i d e : : V a r x , y ; H a l i d e : : E x p r e = x + y ; g r a d i e n t ( x , y ) = e ; H a l i d e : : I m a g e < i n t 3 2 _ t > o u t p u t = g r a d i e n t . r e a l i z e ( 8 0 0 , 6 0 0 ) ; f o r ( i n t j = 0 ; j < o u t p u t . h e i g h t ( ) ; j + + ) { f o r ( i n t i = 0 ; i < o u t p u t . w i d t h ( ) ; i + + ) { i f ( o u t p u t ( i , j ) ! = i + j ) { p r i n t f ( " S o m e t h i n g w e n t w r o n g ! \ n " " P i x e l % d , % d w a s s u p p o s e d t o b e % d , b u t i n s t e a d i t ' s % d \ n " , i , j , i + j , o u t p u t ( i , j ) ) ; r e t u r n - 1 ; } } } r e t u r n 0 ; } 9 / 34

Stencil languages 10 / 34

Pochoir Compiler and runtime system for implementing stencil computations on
multicore processors Domain-specific language embedded in C++ Stencils Define the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t. A stencil computation computes the stencil for each grid point over many time steps 19-point stencil for the Lattice-Boltzman Method (LBM) 11 / 34

2D heat equation coded in Pochoir # d e f
i n e m o d ( r , m ) ( ( r ) % ( m ) + ( ( r ) < 0 ) ? ( m ) : 0 ) P o c h o i r _ B o u n d a r y _ 2 D ( h e a t _ b v , a , t , x , y ) r e t u r n a . g e t ( t , m o d ( x , a . s i z e ( 1 ) ) , m o d ( y , a . s i z e ( 0 ) ) ) ; P o c h o i r _ B o u n d a r y _ E n d i n t m a i n ( v o i d ) { P o c h o i r _ S h a p e _ 2 D 2 D _ f i v e _ p t [ ] = { { 1 , 0 , 0 } , { 0 , 1 , 0 } , { 0 , - 1 , 0 } , { 0 , - 1 , - 1 } , { 0 , 0 , - 1 } , { 0 , 0 , 1 } } ; P o c h o i r _ 2 D h e a t ( 2 D _ f i v e _ p t ) ; P o c h o i r _ A r r a y _ 2 D ( d o u b l e ) u ( X , Y ) ; u . R e g i s t e r _ B o u n d a r y ( h e a t _ b v ) ; h e a t . R e g i s t e r _ A r r a y ( u ) ; P o c h o i r _ K e r n e l _ 2 D ( h e a t _ f n , t , x , y ) u ( t + 1 , x , y ) = C X * ( u ( t , x + 1 , y ) - 2 * u ( t , x , y ) + u ( t , x - 1 , y ) ) + C Y * ( u ( t , x , y + 1 ) - 2 * u ( t , x , y ) + u ( t , x , y - 1 ) ) + u ( t , x , y ) ; P o c h o i r _ K e r n e l _ E n d f o r ( i n t x = 0 ; x < X ; + + x ) f o r ( i n t y = 0 ; y < Y ; + + y ) u ( 0 , x , y ) = r a n d ( ) ; h e a t . R u n ( T , h e a t _ f n ) ; f o r ( i n t x = 0 ; x < X ; + + x ) f o r ( i n t y = 0 ; y < Y ; + + y ) c o u t < < u ( T , x , y ) ; r e t u r n 0 ; } 12 / 34

Unstructured meshes / graphs 13 / 34

Structure of scientific computations on unstructured meshes Independent local operations
for each element of the mesh described as a kernel. Reductions aggregating results from local operations to produce the final result. PyOP2 A domain-specific language embedded in Python for parallel computations on unstructured meshes or graphs. Unstructured meshes Meshes are described by sets and maps between them and can have data associated with them: 1 0 2 3 6 5 4 7 8 0 1 3 2 4 5 8 7 6 OP2 Sets: nodes = [0, 1, 2, 3, 4, 5, 6, 7, 8] elements = [0, 1, 2, 3, 4, 5, 6, 7, 8] OP2 Map elements-nodes: elem_nodes = [[0, 1, 2], [1, 3, 2], ...] OP2 Dat on nodes: coords = [..., [.5,.5], [.5,-.25], [1,.25], ...] 14 / 34

PyOP2 Data Model Mesh topology Sets – cells, vertices, etc
Maps – connectivity between entities in different sets Data Dats – Defined on sets (hold pressure, temperature, etc) Kernels Executed in parallel on a set through a parallel loop Read / write / increment data accessed via maps Linear algebra Matrix sparsities defined by mappings Matrix data on sparsities Kernels compute a local matrix – PyOP2 handles global assembly 15 / 34

PyOP2 Kernels and Parallel Loops Performance portability for any unstructured
mesh computations Parallel loop syntax o p 2 . p a r _ l o o p ( k e r n e l , i t e r a t i o n _ s e t , k e r n e l _ a r g 1 ( a c c e s s _ m o d e , m a p p i n g [ i n d e x ] ) , . . . , k e r n e l _ a r g N ( a c c e s s _ m o d e , m a p p i n g [ i n d e x ] ) ) PyOP2 programme for computing the midpoint of a triangle f r o m p y o p 2 i m p o r t o p 2 o p 2 . i n i t ( ) v e r t i c e s = o p 2 . S e t ( n u m _ v e r t i c e s ) c e l l s = o p 2 . S e t ( n u m _ c e l l s ) c e l l 2 v e r t e x = o p 2 . M a p ( c e l l s , v e r t i c e s , 3 , [ . . . ] ) c o o r d i n a t e s = o p 2 . D a t ( v e r t i c e s * * 2 , [ . . . ] , d t y p e = f l o a t ) m i d p o i n t s = o p 2 . D a t ( c e l l s * * 2 , d t y p e = f l o a t ) m i d p o i n t = o p 2 . K e r n e l ( " " " v o i d m i d p o i n t ( d o u b l e p [ 2 ] , d o u b l e * c o o r d s [ 2 ] ) { p [ 0 ] = ( c o o r d s [ 0 ] [ 0 ] + c o o r d s [ 1 ] [ 0 ] + c o o r d s [ 2 ] [ 0 ] ) / 3 . 0 ; p [ 1 ] = ( c o o r d s [ 0 ] [ 1 ] + c o o r d s [ 1 ] [ 1 ] + c o o r d s [ 2 ] [ 1 ] ) / 3 . 0 ; } " " " , " m i d p o i n t " ) o p 2 . p a r _ l o o p ( m i d p o i n t , c e l l s , m i d p o i n t s ( o p 2 . W R I T E ) , c o o r d i n a t e s ( o p 2 . R E A D , c e l l 2 v e r t e x ) ) Future work: define kernels as Python functions 16 / 34

PyOP2 Architecture Parallel scheduling: partitioning, staging and coloring Runtime code
generation and JIT compilation OpenCL CUDA Instant PyOpenCL PyCUDA CPU OpenMP CPU seq. MPI Runtime Core partitioning, parallel scheduling Lin. algebra PETSc/Cusp Kernels Data Access Descriptors Application code Backends Code gen PyOP2 core User code 17 / 34

Generated sequential code calling the midpoint kernel v o i
d w r a p _ m i d p o i n t _ _ ( P y O b j e c t * _ s t a r t , P y O b j e c t * _ e n d , P y O b j e c t * _ a r g 0 _ 0 , P y O b j e c t * _ a r g 1 _ 0 , P y O b j e c t * _ a r g 1 _ 0 _ m a p 0 _ 0 ) { i n t s t a r t = ( i n t ) P y I n t _ A s L o n g ( _ s t a r t ) ; i n t e n d = ( i n t ) P y I n t _ A s L o n g ( _ e n d ) ; d o u b l e * a r g 0 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 ) - > d a t a ) ; i n t * a r g 1 _ 0 _ m a p 0 _ 0 = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 _ m a p 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 _ v e c [ 3 ] ; f o r ( i n t n = s t a r t ; n < e n d ; n + + ) { a r g 1 _ 0 _ v e c [ 0 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 0 ] * 2 ; a r g 1 _ 0 _ v e c [ 1 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 1 ] * 2 ; a r g 1 _ 0 _ v e c [ 2 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 2 ] * 2 ; m i d p o i n t ( a r g 0 _ 0 + n * 2 , a r g 1 _ 0 _ v e c ) ; } } Generated OpenMP code calling the midpoint kernel next slide 18 / 34

v o i d w r a p _ m
i d p o i n t _ _ ( P y O b j e c t * _ b o f f s e t , P y O b j e c t * _ n b l o c k s , P y O b j e c t * _ b l k m a p , P y O b j e c t * _ o f f s e t , P y O b j e c t * _ n e l e m s , P y O b j e c t * _ a r g 0 _ 0 , P y O b j e c t * _ a r g 1 _ 0 , P y O b j e c t * _ a r g 1 _ 0 _ m a p 0 _ 0 ) { i n t b o f f s e t = ( i n t ) P y I n t _ A s L o n g ( _ b o f f s e t ) ; i n t n b l o c k s = ( i n t ) P y I n t _ A s L o n g ( _ n b l o c k s ) ; i n t * b l k m a p = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ b l k m a p ) - > d a t a ) ; i n t * o f f s e t = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ o f f s e t ) - > d a t a ) ; i n t * n e l e m s = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ n e l e m s ) - > d a t a ) ; d o u b l e * a r g 0 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 = ( d o u b l e * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 ) - > d a t a ) ; i n t * a r g 1 _ 0 _ m a p 0 _ 0 = ( i n t * ) ( ( ( P y A r r a y O b j e c t * ) _ a r g 1 _ 0 _ m a p 0 _ 0 ) - > d a t a ) ; d o u b l e * a r g 1 _ 0 _ v e c [ 3 2 ] [ 3 ] ; # i f d e f _ O P E N M P i n t n t h r e a d = o m p _ g e t _ m a x _ t h r e a d s ( ) ; # e l s e i n t n t h r e a d = 1 ; # e n d i f # p r a g m a o m p p a r a l l e l s h a r e d ( b o f f s e t , n b l o c k s , n e l e m s , b l k m a p ) { i n t t i d = o m p _ g e t _ t h r e a d _ n u m ( ) ; # p r a g m a o m p f o r s c h e d u l e ( s t a t i c ) f o r ( i n t _ _ b = b o f f s e t ; _ _ b < b o f f s e t + n b l o c k s ; _ _ b + + ) { i n t b i d = b l k m a p [ _ _ b ] ; i n t n e l e m = n e l e m s [ b i d ] ; i n t e f i r s t = o f f s e t [ b i d ] ; f o r ( i n t n = e f i r s t ; n < e f i r s t + n e l e m ; n + + ) { a r g 1 _ 0 _ v e c [ t i d ] [ 0 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 0 ] * 2 ; a r g 1 _ 0 _ v e c [ t i d ] [ 1 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 1 ] * 2 ; a r g 1 _ 0 _ v e c [ t i d ] [ 2 ] = a r g 1 _ 0 + a r g 1 _ 0 _ m a p 0 _ 0 [ n * 3 + 2 ] * 2 ; m i d p o i n t ( a r g 0 _ 0 + n * 2 , a r g 1 _ 0 _ v e c [ t i d ] ) ; } } } } 19 / 34

PyOP2 Partitioning, Staging & Coloring Key optimizations performed by PyOP2
runtime core Partitioning for on-chip memory (shared memory / cache) Coloring to avoid data races on updates to the same memory location Example Parallel computation executing a kernel over the edges of the mesh: # S e t s o f n o d e s a n d e d g e s n o d e s = o p 2 . S e t ( N ) # N = n u m b e r o f n o d e s e d g e s = o p 2 . S e t ( M ) # M = n u m b e r o f e d g e s # M a p p i n g f r o m e d g e s t o n o d e s e d g e _ t o _ n o d e _ m a p = o p 2 . M a p ( e d g e s , n o d e s , 2 , . . . ) # D a t a d e f i n e d o n n o d e s u = o p 2 . D a t ( n o d e s , . . . ) # K e r n e l e x e c u t i n g o v e r s e t o f e d g e s , c o m p u t i n g o n n o d a l d a t a o p 2 . p a r _ l o o p ( k e r n e l , e d g e s , u ( e g d e _ t o _ n o d e _ m a p [ : ] , o p 2 . I N C ) ) 20 / 34

21 / 34

edges shared / staging memory vertices 22 / 34

edges shared / staging memory vertices 23 / 34

Finite-element computations 24 / 34

i i j j k k i i i i
j j j j k k k k i j k Ax = b The weak form of the Helmholtz equation: Finite-element assembly 25 / 34

UFL: High- level definition of finite- element forms UFL is
the Unified Form Language from the FEniCS project. The weak form of the Helmholtz equation And its (almost) literal translation to Python with UFL UFL: embedded domain-specific language (eDSL) for weak forms of partial differential equations (PDEs) e = F i n i t e E l e m e n t ( ' C G ' , ' t r i a n g l e ' , 1 ) v = T e s t F u n c t i o n ( e ) u = T r i a l F u n c t i o n ( e ) f = C o e f f i c i e n t ( e ) l m b d a = 1 a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x L = v * f * d x ∇v ⋅ ∇u − λvu dV = vf dV ∫ Ω ∫ Ω 26 / 34

Helmholtz local assembly kernel generated by FFC The FEniCS Form
Compiler FFC compiles UFL forms to low- level code. Helmholtz equation UFL expression a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x Generated C code / / A - l o c a l t e n s o r t o a s s e m b l e / / x - l o c a l c o o r d i n a t e s / / j , k - 2 D i n d i c e s i n t o t h e l o c a l a s s e m b l y m a t r i x v o i d k e r n e l ( d o u b l e A [ 1 ] [ 1 ] , d o u b l e * x [ 2 ] , i n t j , i n t k ) { / / F E 0 - S h a p e f u n c t i o n s / / D i j - S h a p e f u n c t i o n d e r i v a t i v e s / / K i j - J a c o b i a n i n v e r s e / d e t e r m i n a n t / / W 3 - Q u a d r a t u r e w e i g h t s / / d e t - J a c o b i a n d e t e r m i n a n t f o r ( u n s i g n e d i n t i p = 0 ; i p < 3 ; i p + + ) { A [ 0 ] [ 0 ] + = ( F E 0 [ i p ] [ j ] * F E 0 [ i p ] [ k ] * ( - 1 . 0 ) + ( ( ( K 0 0 * D 1 0 [ i p ] [ j ] + K 1 0 * D 0 1 [ i p ] [ j ] ) ) * ( ( K 0 0 * D 1 0 [ i p ] [ k ] + K 1 0 * D 0 1 [ i p ] [ k ] ) ) + ( ( K 0 1 * D 1 0 [ i p ] [ j ] + K 1 1 * D 0 1 [ i p ] [ j ] ) ) * ( ( K 0 1 * D 1 0 [ i p ] [ k ] + K 1 1 * D 0 1 [ i p ] [ k ] ) ) ) ) * W 3 [ i p ] * d e t ; } } ∇v ⋅ ∇u − λvu dV = vf dV ∫ Ω ∫ Ω 27 / 34

The Firedrake/PyOP2 tool chain PyOP2 Parallel unstructured mesh computation framework
Firedrake A performance-portable Finite-element computation framework Unified Form Language (UFL) PyOP2 Interface FFC Form Compiler Parallel scheduling, code generation CPU (+MPI + OpenMP/ OpenCL) GPU (CUDA / OpenCL) Future arch. Problem definition in FEM weak form Local assembly kernels, data dependencies Parallel loops over kernels with access descriptors Explicitly parallel hardware- specific implementation Geometry, fields and meshes Fluidity 28 / 34

Two-layered abstraction: Separation of concerns Unified Form Language (UFL) PyOP2
Interface FFC Form Compiler Parallel scheduling, code generation CPU (+MPI + OpenMP/ OpenCL) GPU (CUDA / OpenCL) Future arch. Problem definition in FEM weak form Local assembly kernels, data dependencies Parallel loops over kernels with access descriptors Explicitly parallel hardware- specific implementation Geometry, fields and meshes Fluidity Domain specialist: mathematical model using FEM Numerical analyst: generation of FEM kernels Domain specialist: mathematical model on unstructured grid Parallel programming expert: hardware architectures, optimization Expert for each layer 29 / 34

Driving Finite-element Computations in Firedrake Solving the Helmholtz equation in
Python using Firedrake: f r o m f i r e d r a k e i m p o r t * # R e a d a m e s h a n d d e f i n e a f u n c t i o n s p a c e m e s h = M e s h ( ' f i l e n a m e ' ) V = F u n c t i o n S p a c e ( m e s h , " L a g r a n g e " , 1 ) # D e f i n e f o r c i n g f u n c t i o n f o r r i g h t - h a n d s i d e d e f f o r c e ( X ) : l m b d a = 1 n = 8 r e t u r n - ( l m b d a + 2 * ( n * * 2 ) * p i * * 2 ) * s i n ( X [ 0 ] * p i * n ) * s i n ( X [ 1 ] * p i * n ) f = E x p r e s s i o n ( f o r c e ) # S e t u p t h e F i n i t e - e l e m e n t w e a k f o r m s u = T r i a l F u n c t i o n ( V ) v = T e s t F u n c t i o n ( V ) l m b d a = 1 a = ( d o t ( g r a d ( v ) , g r a d ( u ) ) - l m b d a * v * u ) * d x L = v * f * d x # S o l v e t h e r e s u l t i n g f i n i t e - e l e m e n t e q u a t i o n p = F u n c t i o n ( V ) s o l v e ( a = = L , p ) 30 / 34

Finite element assembly and solve in Firedrake What goes on
behind the scenes of the s o l v e call (simplified example!): f r o m p y o p 2 i m p o r t o p 2 , f f c _ i n t e r f a c e d e f s o l v e ( e q u a t i o n , x ) : # I n v o k e F F C t o g e n e r a t e k e r n e l s f o r m a t r i x a n d r h s a s s e m b l y l h s = f f c _ i n t e r f a c e . c o m p i l e _ f o r m ( e q u a t i o n . l h s , " l h s " ) [ 0 ] r h s = f f c _ i n t e r f a c e . c o m p i l e _ f o r m ( e q u a t i o n . r h s , " r h s " ) [ 0 ] # O m i t t e d : e x t r a c t c o o r d i n a t e s ( c o o r d s ) , c o n n e c t i v i t y ( e l e m _ n o d e ) # a n d c o e f f i c i e n t s ( f i e l d f ) # C o n s t r u c t O P 2 m a t r i x t o a s s e m b l e i n t o s p a r s i t y = o p 2 . S p a r s i t y ( ( e l e m _ n o d e , e l e m _ n o d e ) , s p a r s i t y _ d i m ) m a t = o p 2 . M a t ( s p a r s i t y , n u m p y . f l o a t 6 4 ) b = o p 2 . D a t ( n o d e s , n p . z e r o s ( n o d e s . s i z e ) ) # A s s e m b l e l h s , r h s a n d s o l v e l i n e a r s y s t e m o p 2 . p a r _ l o o p ( l h s , e l e m e n t s , m a t ( o p 2 . I N C , ( e l e m _ n o d e [ o p 2 . i [ 0 ] ] , e l e m _ n o d e [ o p 2 . i [ 1 ] ] ) ) , c o o r d s ( o p 2 . R E A D , e l e m _ n o d e ) ) o p 2 . p a r _ l o o p ( r h s , e l e m e n t s , b ( o p 2 . I N C , e l e m _ n o d e [ o p 2 . i [ 0 ] ] ) , c o o r d s ( o p 2 . R E A D , e l e m _ n o d e ) , f ( o p 2 . R E A D , e l e m _ n o d e ) ) # S o l v e t h e a s s e m b l e d s p a r s e l i n e a r s y s t e m o f e q u a t i o n s o p 2 . s o l v e ( m a t , x , b ) 31 / 34

Recap 32 / 34

Computational Science is hard Unless you break it down with
the right abstractions Many-core hardware has brought a paradigm shift to CSE, scientific software needs to keep up The Solution High-level structure Goal: producing high level interfaces to numerical computing PyOP2: a high-level interface to unstructured mesh based methods Efficiently execute kernels over an unstructured grid in parallel Firedrake: a performance-portable finite-element computation framework Drive FE computations from a high-level problem specification Low-level operations Separating the low-level implementation from the high-level problem specification Generate platform-specific implementations from a common source instead of hand-coding them Runtime code generation and JIT compilation open space for compiler-driven optimizations and performance portability 33 / 34

Thank you! Contact: Florian Rathgeber, @frathgeber, [email protected] Resources Halide http://halide-lang.org
Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. PLDI 2013 Pochoir http://groups.csail.mit.edu/sct/wiki/index.php?title=The_Pochoir_Project The Pochoir Stencil Compiler Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. SPAA ’11, 2011 PyOP2 https://github.com/OP2/PyOP2 PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicholas Loriant, David A. Ham, Carlo Bertolli, Paul H.J. Kelly WOLFHPC 2012 Firedrake https://github.com/firedrakeproject/firedrake Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, Paul H. J. Kelly ISC 2013 UFL https://bitbucket.org/mapdes/ufl FFC https://bitbucket.org/mapdes/ffc This talk is available at http://kynan.github.io/ACM2014 Slides created with remark 34 / 34

Domain-specific languages: the key to computati...

Domain-specific languages: the key to computational efficiency and programmer productivity

More Decks by Imperial ACM

Other Decks in Research

Featured

Transcript