Slide 1

Distributed Analysis of the BMC Kind: Making It Fit the Tornado Supercomputer

Azat Abdullin, Daniil Stepanov, and Marat Akhin
JetBrains Research
March 3, 2017

Slide 2

Static analysis

Static program analysis is the analysis of computer software that is performed without actually executing the program

Slide 3

Performance problem

Most static analyses suffer from serious performance problems
Our bounded model checking tool Borealis is no exception
We decided to try scaling Borealis to multiple cores

Slide 4

Bounded model checking algorithm

Slide 5

Verification example

Program:

    int x;
    int y = 8, z = 0, w = 0;
    if (x) z = y - 1;
    else   w = y + 1;
    assert(z == 7 || w == 9);

Constraints: y = 8, z = x ? y - 1 : 0, w = x ? 0 : y + 1, z != 7, w != 9

UNSAT. The assertion always holds

Slide 6

Verification example

Program:

    int x;
    int y = 8, z = 0, w = 0;
    if (x) z = y - 1;
    else   w = y + 1;
    assert(z == 5 || w == 9);

Constraints: y = 8, z = x ? y - 1 : 0, w = x ? 0 : y + 1, z != 5, w != 9

SAT. The program contains a bug
Counterexample: y = 8, x = 1, w = 0, z = 7
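
For illustration, constraints like these can be handed to an SMT solver directly. Below is a minimal sketch using the Z3 C++ API; the encoding mirrors this slide, but it is not Borealis' actual solver interface:

    #include <iostream>
    #include <z3++.h>

    int main() {
        z3::context c;
        z3::expr x = c.int_const("x"), y = c.int_const("y");
        z3::expr z = c.int_const("z"), w = c.int_const("w");

        z3::solver s(c);
        s.add(y == 8);
        // Both branches are folded into if-then-else over the condition x != 0.
        s.add(z == z3::ite(x != 0, y - 1, c.int_val(0)));
        s.add(w == z3::ite(x != 0, c.int_val(0), y + 1));
        // The assertion is negated: any model is a counterexample.
        s.add(z != 5);
        s.add(w != 9);

        if (s.check() == z3::sat)
            std::cout << "SAT, counterexample:\n" << s.get_model() << "\n";
        else
            std::cout << "UNSAT, the assertion always holds\n";
    }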

Slide 7

Borealis

Slide 8

Program representation

Slide 9

Problem

BMC involves a huge number of SMT queries
We try to scale Borealis to multiple cores of our RSC Tornado supercomputer

Slide 10

RSC Tornado supercomputer

712 dual-processor nodes with 1424 Intel Xeon E5-2697 CPUs
64 GB of DDR4 RAM and local 120 GB SSD storage per node
1 PB Lustre storage
InfiniBand FDR, 56 Gb/s

Slide 11

Lustre storage

Parallel distributed file system
Highly scalable
Terabytes per second of I/O throughput
Inefficient when working with small files

Slide 12

Borealis compilation scheme

Slide 13

Distributed compilation

There are several ways to distribute compilation:
Compilation on the Lustre storage
Distribution of the intermediate build tree to the processing nodes
Distribution of copies of the analyzed project

Slide 14

Compilation on the Lustre storage

Each node accesses Lustre for the necessary files
Lustre is slow when dealing with multiple small files

Slide 15

Distribution of the intermediate build tree

Reduces the CPU time
The build may contain several related compilation/linking phases

Slide 16

Distribution of copies of the analyzed project

Compilation is done using standard build tools
We repeat the computations on every node
Doesn't increase the wall-clock time

Slide 17

Distributed linking

We distribute different SMT queries to different nodes/cores
Borealis performs the analysis on an LLVM IR module

Slide 18

Distributed linking

Module level
• Same as parallel make
• Not really efficient
Instruction level
• Need to track dependencies between SMT calls
• Too complex
Function level
• Medium efficiency
• Simple implementation

Slide 19

Distributed linking

There are two ways to distribute functions between several processes:
Dynamic distribution
Static distribution

Slide 20

Dynamic distribution

A master process distributes functions between several worker processes
Based on a single producer/multiple consumers scheme (a sketch follows)
If a process receives N functions, it also has to run the auxiliary LLVM passes N times
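
A minimal sketch of the single producer/multiple consumers scheme over MPI. Functions are identified by index; the tags, the -1 stop signal, and the commented-out helpers are illustrative assumptions, not Borealis' actual code:

    #include <mpi.h>

    const int TAG_REQUEST = 1, TAG_TASK = 2;

    // Rank 0 hands out one function index per request; -1 tells a worker to stop.
    void master(int numFunctions, int worldSize) {
        int next = 0;
        for (int i = 0; i < numFunctions + worldSize - 1; ++i) {
            MPI_Status st;
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            int task = next < numFunctions ? next++ : -1;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
        }
    }

    // A worker requests a function, processes it, and repeats until stopped.
    void worker() {
        int dummy = 0, task;
        do {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // if (task >= 0) { runAuxiliaryPasses(task); analyze(task); }
        } while (task >= 0);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) master(/*numFunctions=*/100, size);
        else worker();
        MPI_Finalize();
    }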

Slide 21

Static distribution

Each process determines its set of functions based on its rank
We use the following two rank kinds:
• global rank
• local rank
After some experiments we decided to use the static method (see the sketch below)
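
A minimal sketch of how a process could compute the two rank kinds with MPI-3 communicators; the helper name is illustrative, not Borealis code:

    #include <mpi.h>

    void getRanks(int &globalRank, int &localRank) {
        // Global rank: the process's position in the whole MPI world.
        MPI_Comm_rank(MPI_COMM_WORLD, &globalRank);
        // Local rank: its position among the processes of the same node,
        // obtained by splitting the world communicator by shared memory.
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &localRank);
        MPI_Comm_free(&node);
    }

With a global rank r among n processes, the simplest static assignment is for each process to take every n-th function starting from the r-th one.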

Slide 22

Improving static distribution

Need to balance the workload
We reinforce the method with function complexity estimation, sketched below
Our estimation is based on the following properties:
• Function size
• Number of memory access instructions
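
A sketch of what such an estimation and the resulting assignment could look like. The weight formula (instruction count plus an extra factor for memory instructions) and the names are assumptions for illustration, not Borealis' actual heuristic:

    #include <llvm/IR/Function.h>
    #include <llvm/IR/Instructions.h>
    #include <algorithm>
    #include <vector>

    // Estimated complexity: instruction count plus extra weight for memory
    // instructions, which tend to produce the heaviest SMT queries.
    unsigned estimateWeight(const llvm::Function &F) {
        unsigned size = 0, mem = 0;
        for (const llvm::BasicBlock &BB : F)
            for (const llvm::Instruction &I : BB) {
                ++size;
                if (llvm::isa<llvm::LoadInst>(I) || llvm::isa<llvm::StoreInst>(I))
                    ++mem;
            }
        return size + 10 * mem;  // the 10x memory factor is an arbitrary choice
    }

    // assignment[i] = rank that should analyze function i: heaviest functions
    // first, each one going to the currently least-loaded process.
    std::vector<int> assign(const std::vector<unsigned> &weights, int numRanks) {
        std::vector<size_t> order(weights.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return weights[a] > weights[b]; });

        std::vector<unsigned long> load(numRanks, 0);
        std::vector<int> assignment(weights.size());
        for (size_t i : order) {
            int r = static_cast<int>(
                std::min_element(load.begin(), load.end()) - load.begin());
            assignment[i] = r;
            load[r] += weights[i];
        }
        return assignment;
    }

Sorting by descending weight and always filling the least-loaded process is the classic longest-processing-time heuristic for load balancing.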

Slide 23

PDD

Borealis records the analysis results, so already processed functions are not re-analyzed
Persistent Defect Data (PDD) is used for recording the results
PDD contains:
• Defect location
• Defect type
• SMT result

    {
      "location": {
        "loc": { "col": 2, "line": 383 },
        "filename": "rarpd.c"
      },
      "type": "INI-03"
    }

Slide 24

PDD synchronization problem

Transferring the full PDD takes a long time
Instead we synchronize a reduced PDD (rPDD)
The rPDD is simply a list of already analyzed functions

Slide 25

rPDD synchronization

We use a two-stage approach, sketched below:
Synchronize the rPDD between the processes on a single node
Synchronize the rPDD between the nodes
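
A sketch of the two-stage scheme with MPI communicators, modeling the rPDD as a plain list of function ids; deduplication is omitted and all names are illustrative, not the actual implementation:

    #include <mpi.h>
    #include <vector>

    // Gather variable-length int lists from every process in comm into one list.
    std::vector<int> allgatherInts(const std::vector<int> &v, MPI_Comm comm) {
        int size;
        MPI_Comm_size(comm, &size);
        int n = v.size();
        std::vector<int> counts(size), displs(size, 0);
        MPI_Allgather(&n, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);
        for (int i = 1; i < size; ++i) displs[i] = displs[i - 1] + counts[i - 1];
        std::vector<int> out(displs[size - 1] + counts[size - 1]);
        MPI_Allgatherv(v.data(), n, MPI_INT, out.data(), counts.data(),
                       displs.data(), MPI_INT, comm);
        return out;
    }

    std::vector<int> syncRPDD(std::vector<int> merged) {
        // Stage 1: merge the lists of the processes sharing this node.
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        merged = allgatherInts(merged, node);

        // Stage 2: one representative per node (local rank 0) merges across
        // nodes, then shares the result with the rest of its node.
        int localRank;
        MPI_Comm_rank(node, &localRank);
        MPI_Comm roots;
        MPI_Comm_split(MPI_COMM_WORLD, localRank == 0 ? 0 : MPI_UNDEFINED,
                       0, &roots);
        if (roots != MPI_COMM_NULL) {
            merged = allgatherInts(merged, roots);
            MPI_Comm_free(&roots);
        }
        int count = merged.size();
        MPI_Bcast(&count, 1, MPI_INT, 0, node);
        merged.resize(count);
        MPI_Bcast(merged.data(), count, MPI_INT, 0, node);
        MPI_Comm_free(&node);
        return merged;  // deduplication omitted for brevity
    }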

Slide 26

Implementation

The HPC implementation of Borealis is based on OpenMPI
We implemented an API to work with the library
HPC Borealis is implemented in the form of 3 LLVM passes (a rough skeleton follows)
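
A rough skeleton of what one such pass could look like with the (2017-era) legacy pass manager; the pass name and body are illustrative, not the actual Borealis passes:

    #include <llvm/Pass.h>
    #include <llvm/IR/Module.h>

    namespace {
    struct DistributedAnalysisPass : public llvm::ModulePass {
        static char ID;
        DistributedAnalysisPass() : llvm::ModulePass(ID) {}

        bool runOnModule(llvm::Module &M) override {
            for (llvm::Function &F : M) {
                if (F.isDeclaration()) continue;
                // Analyze only the functions assigned to this process
                // (see the rank-based distribution above).
            }
            return false;  // analysis only, the IR is not modified
        }
    };
    }

    char DistributedAnalysisPass::ID = 0;
    static llvm::RegisterPass<DistributedAnalysisPass>
        X("distributed-analysis", "Distributed BMC analysis (sketch)");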

Slide 27

Evaluation

We tested the prototype in the following configurations:
One process on a local machine*
Eight processes on a local machine
On RSC Tornado using 1, 2, 4, 8, 16 and 32 nodes

* a machine with an Intel Core i7-4790 3.6 GHz processor, 32 GB of RAM and Intel 535 SSD storage

Slide 28

Evaluation projects

Name         SLOC    Modules   Description
git          340k    49        distributed revision control system
longs        209k    1         URL shortener
beanstalkd   7.5k    1         simple, fast work queue
zstd         42k     3         fast lossless compression algorithm library
reptyr       3.5k    1         utility for reattaching programs to new terminals

Slide 29

Evaluation results

                    zstd      git       longs     beanstalkd   reptyr
SCC 1 process       -         -         678:23    2:05         1:30
SCC 1 node          2433:05   113:59    58:53     2:50         1:53
SCC 2 nodes         2421:35   101:22    59:00     2:12         1:32
SCC 4 nodes         2419:23   96:53     61:09     2:19         1:19
SCC 8 nodes         2510:34   96:51     63:09     2:10         1:43
SCC 16 nodes        2434:05   97:26     63:06     2:37         1:34
SCC 32 nodes        2346:39   107:14    63:02     2:34         1:52
Local 1 process     2450:02   281:11    205:05    0:36         0:08
Local 8 processes   2848:55   103:21    93:14     0:30         0:06

Slide 30

Conclusion

Our main takeaways are as follows:
Several big functions can bottleneck the whole analysis
LLVM is not optimized for distributed scenarios
Single-core optimizations can create difficulties for HPC
Adding nodes can increase the analysis time

Slide 31

Contact information

{abdullin, stepanov, akhin}@kspt.icc.spbstu.ru
Borealis repository: https://bitbucket.org/vorpal-research/borealis