Slide 1


Domain-specific languages: the key to computational efficiency and programmer productivity
Imperial College London ACM Student Chapter seminar series
Florian Rathgeber, Department of Computing, Imperial College London
February 7 2014
Slides: http://kynan.github.io/ACM2014

Slide 2


Domain-Specific Languages (DSLs)

Languages specialised to a particular application domain, in contrast to general-purpose (programming) languages.

Markup languages
- XML dialects: HTML, SVG
- Graphs, graphics: GraphViz, GrGen, PGF
- Text: Markdown, rST, Org, MediaWiki; remark (used for this slide deck)

Modelling languages
- Linear algebra: MATLAB, Mathematica, Maple
- Symbolic maths: Mathematica, Maxima
- Hardware description: Verilog, VHDL
- Database queries: SQL, SparQL
- Spreadsheets: Excel / LibreOffice formulas

Embedded DSLs (see the sketch below)
- Domain-specific language embedded in a host language (C++, Python, Ruby)
- DSL user can mix, match and extend DSL code with the host language
- DSL writer can use the lexing, parsing and syntax tools of the host language

Boundaries are not sharp: LaTeX and PostScript are Turing-complete!
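
To make the embedding idea concrete, here is a minimal sketch (not from the talk; the Expr/Var/Const/BinOp names are illustrative only) of how a host language's operator overloading lets an embedded DSL capture domain expressions as objects instead of evaluating them immediately, which is exactly what lets DSLs like UFL or Halide inspect and compile expressions later:

    # Minimal embedded-DSL sketch: Python operator overloading builds an
    # expression tree instead of computing a value immediately.
    class Expr:
        def __add__(self, other):
            return BinOp('+', self, to_expr(other))

        def __mul__(self, other):
            return BinOp('*', self, to_expr(other))

    class Var(Expr):
        def __init__(self, name):
            self.name = name

        def __repr__(self):
            return self.name

    class Const(Expr):
        def __init__(self, value):
            self.value = value

        def __repr__(self):
            return repr(self.value)

    class BinOp(Expr):
        def __init__(self, op, lhs, rhs):
            self.op, self.lhs, self.rhs = op, lhs, rhs

        def __repr__(self):
            return '(%r %s %r)' % (self.lhs, self.op, self.rhs)

    def to_expr(value):
        # Wrap plain host-language values so they can appear in DSL expressions
        return value if isinstance(value, Expr) else Const(value)

    x, y = Var('x'), Var('y')
    print(x + y * 2)  # prints the captured tree: (x + (y * 2))

A DSL compiler would then walk such a tree to generate optimised code, rather than printing it.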

Slide 3


Motivating example: markup languages

Slide 4


This slide deck is written in remark, "a simple, in-browser, markdown-driven slideshow tool", or: a domain-specific language for writing browser-based slide decks.

remark source for this slide deck:

    name: inverse
    layout: true
    class: center, middle, inverse
    ---
    # Domain-specific languages: the key to computational efficiency and programmer productivity
    ## Imperial College London ACM Student Chapter seminar series
    ### **Florian Rathgeber**, Department of Computing, Imperial College London
    ### February 7 2014
    Slides: http://kynan.github.io/ACM2014
    ---
    layout: false
    .left-column[
    ## Domain-Specific Languages (DSLs)
    Languages specialised to a particular application domain
    In contrast to general-purpose (programming) languages
    ]
    .right-column[
    ...
    ]
    ---

Slide 5


remark generated html for the title slide:

    <div class="remark-slide-container remark-visible">
      <div class="remark-slide-scaler" style="width: 908px; height: 681px;
          -webkit-transform: scale(0.7973568281938326); left: 117px; top: 0px;">
        <div class="remark-slide">
          <div class="remark-slide-content center middle inverse hljs-default">
            <h1>Domain-specific languages: the key to computational efficiency and programmer productivity</h1>
            <h2>Imperial College London ACM Student Chapter seminar series</h2>
            <h3><strong>Florian Rathgeber</strong>, Department of Computing, Imperial College London</h3>
            <h3>February 7 2014</h3>
            <p>Slides: <a href="http://kynan.github.io/ACM2014">http://kynan.github.io/ACM2014</a></p>
            <div class="remark-slide-number">1 / 25</div>
          </div>
        </div>
      </div>
    </div>

Slide 6


Domain-specific languages in computational science

Slide 7


Image processing

Slide 8


Halide (1/2)

A domain-specific language for image processing and computational photography, embedded in C++: http://halide-lang.org
- Write high-performance image processing code on modern machines
- Compiler targets: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL

Schedule for a 3x3 box filter defined as a series of two 3x1 passes:

    Func blur_3x3(Func input) {
        Func blur_x, blur_y;
        Var x, y, xi, yi;

        // The algorithm - no storage or order
        blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1)) / 3;

        // The schedule - defines order, locality; implies storage
        blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
        blur_x.compute_at(blur_y, x).vectorize(x, 8);

        return blur_y;
    }

A Func object represents a pipeline stage, a "computed image": a pure function defining the value of each pixel.

Slide 9


Halide (2/2)

Single-stage imaging pipeline that outputs a grayscale diagonal gradient:

    #include <Halide.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        Halide::Func gradient;
        Halide::Var x, y;
        Halide::Expr e = x + y;
        gradient(x, y) = e;
        Halide::Image<int32_t> output = gradient.realize(800, 600);

        for (int j = 0; j < output.height(); j++) {
            for (int i = 0; i < output.width(); i++) {
                if (output(i, j) != i + j) {
                    printf("Something went wrong!\n"
                           "Pixel %d, %d was supposed to be %d, but instead it's %d\n",
                           i, j, i + j, output(i, j));
                    return -1;
                }
            }
        }
        return 0;
    }

Slide 10


Stencil languages

Slide 11


Pochoir

Compiler and runtime system for implementing stencil computations on multicore processors; a domain-specific language embedded in C++.

Stencils define the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t. A stencil computation computes the stencil for each grid point over many time steps, as sketched below.

[Figure: 19-point stencil for the Lattice-Boltzmann Method (LBM)]
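
To see what a stencil computation looks like without the DSL machinery, here is a plain NumPy sketch (my own, not Pochoir code) of a 2D five-point heat stencil with periodic boundaries, the same computation the Pochoir program on the next slide expresses:

    import numpy as np

    def heat_step(u, cx, cy):
        """One time step of the 2D five-point heat stencil with periodic
        boundaries: each value is updated from its four spatial neighbors
        at the previous time step."""
        return (u
                + cx * (np.roll(u, -1, axis=0) - 2 * u + np.roll(u, 1, axis=0))
                + cy * (np.roll(u, -1, axis=1) - 2 * u + np.roll(u, 1, axis=1)))

    u = np.random.rand(64, 64)
    for t in range(100):          # a stencil computation: many time steps
        u = heat_step(u, cx=0.1, cy=0.1)

np.roll implements the periodic wrap-around that the boundary function handles in the Pochoir version; Pochoir's contribution is to execute such computations with cache-efficient, parallel schedules.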

Slide 12


2D heat equation coded in Pochoir:

    // Positive remainder: wraps negative indices for periodic boundaries
    #define mod(r, m) (((r) % (m) + (m)) % (m))

    Pochoir_Boundary_2D(heat_bv, a, t, x, y)
        return a.get(t, mod(x, a.size(1)), mod(y, a.size(0)));
    Pochoir_Boundary_End

    int main(void) {
        Pochoir_Shape_2D shape_2D_five_pt[] = {{1, 0, 0}, {0, 1, 0}, {0, -1, 0},
                                               {0, -1, -1}, {0, 0, -1}, {0, 0, 1}};
        Pochoir_2D heat(shape_2D_five_pt);
        Pochoir_Array_2D(double) u(X, Y);
        u.Register_Boundary(heat_bv);
        heat.Register_Array(u);

        Pochoir_Kernel_2D(heat_fn, t, x, y)
            u(t+1, x, y) = CX * (u(t, x+1, y) - 2 * u(t, x, y) + u(t, x-1, y))
                         + CY * (u(t, x, y+1) - 2 * u(t, x, y) + u(t, x, y-1))
                         + u(t, x, y);
        Pochoir_Kernel_End

        for (int x = 0; x < X; ++x)
            for (int y = 0; y < Y; ++y)
                u(0, x, y) = rand();

        heat.Run(T, heat_fn);

        for (int x = 0; x < X; ++x)
            for (int y = 0; y < Y; ++y)
                cout << u(T, x, y);

        return 0;
    }

Slide 13


Unstructured meshes / graphs

Slide 14


Structure of scientific computations on unstructured meshes:
- Independent local operations for each element of the mesh, described as a kernel.
- Reductions aggregating results from local operations to produce the final result.

PyOP2: a domain-specific language embedded in Python for parallel computations on unstructured meshes or graphs.

Unstructured meshes are described by sets and maps between them and can have data associated with them (a sketch of the map indirection follows below):

[Figure: a triangular mesh with nine numbered nodes and nine numbered elements]

OP2 Sets:
    nodes = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    elements = [0, 1, 2, 3, 4, 5, 6, 7, 8]

OP2 Map elements-nodes:
    elem_nodes = [[0, 1, 2], [1, 3, 2], ...]

OP2 Dat on nodes:
    coords = [..., [.5, .5], [.5, -.25], [1, .25], ...]
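
As an illustration of how the map indirection works (a sketch, not PyOP2 internals; the coordinates for nodes not shown on the slide are made up), a computation over elements gathers node data through the elem_nodes map and produces one result per element:

    import numpy as np

    # Coordinates on nodes (node 0 is invented for illustration) and
    # element-to-node connectivity, as on the slide
    coords = np.array([[0.0, 0.0], [0.5, 0.5], [0.5, -0.25], [1.0, 0.25]])
    elem_nodes = np.array([[0, 1, 2], [1, 3, 2]])

    # For each element, gather the coordinates of its three nodes via the
    # map and reduce them to a per-element result (here: the centroid).
    for e, nodes in enumerate(elem_nodes):
        centroid = coords[nodes].mean(axis=0)
        print(e, centroid)

The local operation (the centroid computation) is independent per element, which is what makes such loops trivially parallel.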

Slide 15


PyOP2 Data Model

Mesh topology
- Sets: cells, vertices, etc.
- Maps: connectivity between entities in different sets

Data
- Dats: defined on sets (hold pressure, temperature, etc.)

Kernels
- Executed in parallel on a set through a parallel loop
- Read / write / increment data accessed via maps

Linear algebra
- Matrix sparsities defined by mappings
- Matrix data on sparsities
- Kernels compute a local matrix; PyOP2 handles global assembly

Slide 16


PyOP2 Kernels and Parallel Loops

Performance portability for any unstructured mesh computation.

Parallel loop syntax:

    op2.par_loop(kernel, iteration_set,
                 kernel_arg1(access_mode, mapping[index]),
                 ...,
                 kernel_argN(access_mode, mapping[index]))

PyOP2 program for computing the midpoint of a triangle:

    from pyop2 import op2
    op2.init()

    vertices = op2.Set(num_vertices)
    cells = op2.Set(num_cells)

    cell2vertex = op2.Map(cells, vertices, 3, [...])

    coordinates = op2.Dat(vertices ** 2, [...], dtype=float)
    midpoints = op2.Dat(cells ** 2, dtype=float)

    midpoint = op2.Kernel("""
    void midpoint(double p[2], double *coords[2]) {
        p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0;
        p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0;
    }""", "midpoint")

    op2.par_loop(midpoint, cells,
                 midpoints(op2.WRITE),
                 coordinates(op2.READ, cell2vertex))

Future work: define kernels as Python functions, as sketched below.
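
That future work might look something like the following (a purely hypothetical API sketch, not existing PyOP2): the kernel body is an ordinary Python function, and PyOP2 would translate it to low-level code at runtime.

    # Hypothetical: the midpoint kernel written as a Python function
    # rather than a C string; the par_loop call stays unchanged.
    def midpoint(p, coords):
        p[0] = (coords[0][0] + coords[1][0] + coords[2][0]) / 3.0
        p[1] = (coords[0][1] + coords[1][1] + coords[2][1]) / 3.0

    op2.par_loop(midpoint, cells,
                 midpoints(op2.WRITE),
                 coordinates(op2.READ, cell2vertex))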

Slide 17


PyOP2 Architecture

- Parallel scheduling: partitioning, staging and coloring
- Runtime code generation and JIT compilation

[Figure: PyOP2 architecture. User application code supplies kernels, data and access descriptors to the PyOP2 runtime core (partitioning, parallel scheduling); runtime code generation targets the backends: sequential CPU and OpenMP CPU via Instant (with MPI), CUDA via PyCUDA, OpenCL via PyOpenCL; linear algebra is delegated to PETSc/Cusp.]

Slide 18


Generated sequential code calling the midpoint kernel:

    void wrap_midpoint__(PyObject *_start, PyObject *_end,
                         PyObject *_arg0_0, PyObject *_arg1_0,
                         PyObject *_arg1_0_map0_0) {
        int start = (int)PyInt_AsLong(_start);
        int end = (int)PyInt_AsLong(_end);
        double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
        double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
        int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
        double *arg1_0_vec[3];
        for (int n = start; n < end; n++) {
            arg1_0_vec[0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
            arg1_0_vec[1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
            arg1_0_vec[2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
            midpoint(arg0_0 + n * 2, arg1_0_vec);
        }
    }

Generated OpenMP code calling the midpoint kernel: see the next slide.

Slide 19


    void wrap_midpoint__(PyObject *_boffset, PyObject *_nblocks,
                         PyObject *_blkmap, PyObject *_offset, PyObject *_nelems,
                         PyObject *_arg0_0, PyObject *_arg1_0,
                         PyObject *_arg1_0_map0_0) {
        int boffset = (int)PyInt_AsLong(_boffset);
        int nblocks = (int)PyInt_AsLong(_nblocks);
        int *blkmap = (int *)(((PyArrayObject *)_blkmap)->data);
        int *offset = (int *)(((PyArrayObject *)_offset)->data);
        int *nelems = (int *)(((PyArrayObject *)_nelems)->data);
        double *arg0_0 = (double *)(((PyArrayObject *)_arg0_0)->data);
        double *arg1_0 = (double *)(((PyArrayObject *)_arg1_0)->data);
        int *arg1_0_map0_0 = (int *)(((PyArrayObject *)_arg1_0_map0_0)->data);
        double *arg1_0_vec[32][3];
    #ifdef _OPENMP
        int nthread = omp_get_max_threads();
    #else
        int nthread = 1;
    #endif
    #pragma omp parallel shared(boffset, nblocks, nelems, blkmap)
        {
            int tid = omp_get_thread_num();
    #pragma omp for schedule(static)
            for (int __b = boffset; __b < boffset + nblocks; __b++) {
                int bid = blkmap[__b];
                int nelem = nelems[bid];
                int efirst = offset[bid];
                for (int n = efirst; n < efirst + nelem; n++) {
                    arg1_0_vec[tid][0] = arg1_0 + arg1_0_map0_0[n * 3 + 0] * 2;
                    arg1_0_vec[tid][1] = arg1_0 + arg1_0_map0_0[n * 3 + 1] * 2;
                    arg1_0_vec[tid][2] = arg1_0 + arg1_0_map0_0[n * 3 + 2] * 2;
                    midpoint(arg0_0 + n * 2, arg1_0_vec[tid]);
                }
            }
        }
    }

Slide 20


PyOP2 Partitioning, Staging & Coloring

Key optimizations performed by the PyOP2 runtime core:
- Partitioning for on-chip memory (shared memory / cache)
- Coloring to avoid data races on updates to the same memory location (see the sketch after the example)

Example: a parallel computation executing a kernel over the edges of the mesh:

    # Sets of nodes and edges
    nodes = op2.Set(N)  # N = number of nodes
    edges = op2.Set(M)  # M = number of edges

    # Mapping from edges to nodes
    edge_to_node_map = op2.Map(edges, nodes, 2, ...)

    # Data defined on nodes
    u = op2.Dat(nodes, ...)

    # Kernel executing over the set of edges, computing on nodal data
    op2.par_loop(kernel, edges, u(edge_to_node_map[:], op2.INC))
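
The coloring step can be pictured with a small sketch (mine, not PyOP2's implementation): two edges sharing a node must receive different colors, so all edges of one color can increment nodal data in parallel without races. A minimal greedy version, assuming an edge-to-node map like the one above:

    def greedy_edge_coloring(edge_to_node):
        """Assign each edge the smallest color not already used by another
        edge touching one of its nodes; equally-colored edges are race-free."""
        node_colors = {}          # node -> set of colors already incident
        colors = []
        for nodes in edge_to_node:
            used = set()
            for n in nodes:
                used |= node_colors.setdefault(n, set())
            c = 0
            while c in used:
                c += 1
            colors.append(c)
            for n in nodes:
                node_colors[n].add(c)
        return colors

    # Edges sharing a node get distinct colors; disjoint edges may share one.
    print(greedy_edge_coloring([(0, 1), (1, 2), (2, 0), (3, 4)]))  # [0, 1, 2, 0]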

Slide 21


[Figure: animation frame illustrating partitioning, staging and coloring]

Slide 22


[Figure: animation frame; edge and vertex data staged into shared / staging memory]

Slide 23


[Figure: animation frame; edge and vertex data staged into shared / staging memory]

Slide 24


Finite-element computations

Slide 25


Finite-element assembly

[Figure: local element matrices and vectors (entries indexed by i, j, k) are assembled into the global matrix and vector of the linear system Ax = b arising from the weak form of the Helmholtz equation. A sketch of this scatter operation follows below.]
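
The assembly the figure depicts can be sketched in a few lines of Python (an illustration, not Firedrake's implementation; the function and argument names are mine): each element contributes a small dense matrix, scattered into the global sparse matrix through the element-to-node map.

    import numpy as np
    import scipy.sparse as sp

    def assemble(num_nodes, elem_nodes, element_matrices):
        """Scatter local element matrices A_e (indexed i, j within one
        element) into the global matrix A of the linear system Ax = b."""
        A = sp.lil_matrix((num_nodes, num_nodes))
        for nodes, A_e in zip(elem_nodes, element_matrices):
            for i_local, i in enumerate(nodes):
                for j_local, j in enumerate(nodes):
                    A[i, j] += A_e[i_local, j_local]
        return A.tocsr()

    # Two triangles sharing an edge; shared nodes accumulate contributions
    A = assemble(4, [[0, 1, 2], [1, 3, 2]], [np.eye(3), np.eye(3)])
    print(A.toarray())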

Slide 26


UFL: High-level definition of finite-element forms

UFL is the Unified Form Language from the FEniCS project: an embedded domain-specific language (eDSL) for weak forms of partial differential equations (PDEs).

The weak form of the Helmholtz equation:

    ∫_Ω (∇v ⋅ ∇u − λvu) dV = ∫_Ω vf dV

And its (almost) literal translation to Python with UFL:

    e = FiniteElement('CG', 'triangle', 1)

    v = TestFunction(e)
    u = TrialFunction(e)
    f = Coefficient(e)

    lmbda = 1
    a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
    L = v * f * dx

Slide 27


Helmholtz local assembly kernel generated by FFC

The FEniCS Form Compiler FFC compiles UFL forms to low-level code.

Helmholtz equation:

    ∫_Ω (∇v ⋅ ∇u − λvu) dV = ∫_Ω vf dV

UFL expression:

    a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx

Generated C code:

    // A - local tensor to assemble
    // x - local coordinates
    // j, k - 2D indices into the local assembly matrix
    void kernel(double A[1][1], double *x[2], int j, int k) {
        // FE0 - shape functions
        // Dij - shape function derivatives
        // Kij - Jacobian inverse / determinant
        // W3  - quadrature weights
        // det - Jacobian determinant
        for (unsigned int ip = 0; ip < 3; ip++) {
            A[0][0] += (FE0[ip][j] * FE0[ip][k] * (-1.0)
                + (((K00 * D10[ip][j] + K10 * D01[ip][j]))
                 * ((K00 * D10[ip][k] + K10 * D01[ip][k]))
                +  ((K01 * D10[ip][j] + K11 * D01[ip][j]))
                 * ((K01 * D10[ip][k] + K11 * D01[ip][k])))) * W3[ip] * det;
        }
    }

Slide 28


The Firedrake/PyOP2 tool chain

- PyOP2: a parallel unstructured mesh computation framework
- Firedrake: a performance-portable finite-element computation framework

[Figure: tool chain. The problem definition in FEM weak form, written in the Unified Form Language (UFL) and supplied with geometry, fields and meshes (e.g. from Fluidity), is compiled by the FFC form compiler into local assembly kernels with data dependencies; these are expressed through the PyOP2 interface as parallel loops over kernels with access descriptors; parallel scheduling and code generation then produce explicitly parallel hardware-specific implementations for CPU (+MPI + OpenMP/OpenCL), GPU (CUDA/OpenCL) and future architectures.]

Slide 29


Two-layered abstraction: separation of concerns

[Figure: the same tool chain annotated with the expert responsible for each layer. Domain specialist: mathematical model using FEM (UFL) or mathematical model on unstructured grid (PyOP2 interface). Numerical analyst: generation of FEM kernels (FFC). Parallel programming expert: hardware architectures and optimization (parallel scheduling, code generation, backends).]

Slide 30


Driving Finite-element Computations in Firedrake

Solving the Helmholtz equation in Python using Firedrake:

    from firedrake import *

    # Read a mesh and define a function space
    mesh = Mesh('filename')
    V = FunctionSpace(mesh, "Lagrange", 1)

    # Define forcing function for right-hand side
    def force(X):
        lmbda = 1
        n = 8
        return -(lmbda + 2 * (n**2) * pi**2) * sin(X[0] * pi * n) * sin(X[1] * pi * n)
    f = Expression(force)

    # Set up the finite-element weak forms
    u = TrialFunction(V)
    v = TestFunction(V)

    lmbda = 1
    a = (dot(grad(v), grad(u)) - lmbda * v * u) * dx
    L = v * f * dx

    # Solve the resulting finite-element equation
    p = Function(V)
    solve(a == L, p)

Slide 31


Finite element assembly and solve in Firedrake

What goes on behind the scenes of the solve call (simplified example!):

    from pyop2 import op2, ffc_interface

    def solve(equation, x):
        # Invoke FFC to generate kernels for matrix and rhs assembly
        lhs = ffc_interface.compile_form(equation.lhs, "lhs")[0]
        rhs = ffc_interface.compile_form(equation.rhs, "rhs")[0]

        # Omitted: extract coordinates (coords), connectivity (elem_node)
        # and coefficients (field f)

        # Construct OP2 matrix to assemble into
        sparsity = op2.Sparsity((elem_node, elem_node), sparsity_dim)
        mat = op2.Mat(sparsity, numpy.float64)
        b = op2.Dat(nodes, np.zeros(nodes.size))

        # Assemble lhs, rhs and solve linear system
        op2.par_loop(lhs, elements,
                     mat(op2.INC, (elem_node[op2.i[0]], elem_node[op2.i[1]])),
                     coords(op2.READ, elem_node))

        op2.par_loop(rhs, elements,
                     b(op2.INC, elem_node[op2.i[0]]),
                     coords(op2.READ, elem_node),
                     f(op2.READ, elem_node))

        # Solve the assembled sparse linear system of equations
        op2.solve(mat, x, b)

Slide 32


Recap

Slide 33


Computational science is hard, unless you break it down with the right abstractions. Many-core hardware has brought a paradigm shift to CSE; scientific software needs to keep up.

The solution

High-level structure
- Goal: produce high-level interfaces to numerical computing
- PyOP2: a high-level interface to unstructured-mesh-based methods that efficiently executes kernels over an unstructured grid in parallel
- Firedrake: a performance-portable finite-element computation framework that drives FE computations from a high-level problem specification

Low-level operations
- Separate the low-level implementation from the high-level problem specification
- Generate platform-specific implementations from a common source instead of hand-coding them
- Runtime code generation and JIT compilation open space for compiler-driven optimizations and performance portability (a toy sketch follows)
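
As a toy illustration of that last point (a sketch of mine, nothing like the real PyOP2 pipeline): generating specialised source code at runtime lets the compiler see values that are only known at run time, such as a loop trip count that can be fully unrolled.

    def specialise_axpy(n):
        """Generate and JIT an axpy routine (here via Python's own
        compile/exec) with the trip count n baked in as a constant."""
        src = "def axpy(a, x, y):\n"
        src += "".join("    y[%d] += a * x[%d]\n" % (i, i) for i in range(n))
        namespace = {}
        exec(compile(src, "<generated>", "exec"), namespace)
        return namespace["axpy"]

    axpy4 = specialise_axpy(4)   # fully unrolled for n = 4
    y = [0.0] * 4
    axpy4(2.0, [1.0, 2.0, 3.0, 4.0], y)
    print(y)  # [2.0, 4.0, 6.0, 8.0]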

Slide 34


Thank you!

Contact: Florian Rathgeber, @frathgeber, [email protected]

Resources
- Halide: http://halide-lang.org
  "Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines". Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. PLDI 2013.
- Pochoir: http://groups.csail.mit.edu/sct/wiki/index.php?title=The_Pochoir_Project
  "The Pochoir Stencil Compiler". Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, Charles E. Leiserson. SPAA 2011.
- PyOP2: https://github.com/OP2/PyOP2
  "PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes". Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicolas Loriant, David A. Ham, Carlo Bertolli, Paul H. J. Kelly. WOLFHPC 2012.
- Firedrake: https://github.com/firedrakeproject/firedrake
  "Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS". Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, Paul H. J. Kelly. ISC 2013.
- UFL: https://bitbucket.org/mapdes/ufl
- FFC: https://bitbucket.org/mapdes/ffc

This talk is available at http://kynan.github.io/ACM2014

Slides created with remark.