TMPA-2017: Distributed Analysis of the BMC Kind: Making It Fit the Tornado Supercomputer

TMPA-2017: Tools and Methods of Program Analysis
3-4 March, 2017, Hotel Holiday Inn Moscow Vinogradovo, Moscow

Distributed Analysis of the BMC Kind: Making It Fit the Tornado Supercomputer
Azat Abdullin, Daniil Stepanov, St. Petersburg Polytechnic University
Marat Akhin, JetBrains Research
For video follow the link: https://youtu.be/CPlPpwFtN7k

Would you like to know more?
Visit our website:
www.tmpaconf.org
www.exactprosystems.com/events/tmpa


Transcript

  1. Distributed Analysis of the BMC Kind: Making It Fit the Tornado Supercomputer
     Azat Abdullin, Daniil Stepanov, and Marat Akhin
     JetBrains Research
     March 3, 2017
  2. Static analysis
     Static program analysis is the analysis of computer software that is
     performed without actually executing the programs.
  3. Performance problem
     Most static analyses have big problems with performance.
     Our bounded model checking tool Borealis is no exception.
     We decided to try scaling Borealis to multiple cores.
  4. Verification example
     Program:
        int x;
        int y = 8, z = 0, w = 0;
        if (x) z = y - 1;
        else   w = y + 1;
        assert(z == 7 || w == 9);
     Constraints:
        y = 8, z = x ? y - 1 : 0, w = x ? 0 : y + 1, z != 7, w != 9
     Result: UNSAT. The assert always holds.
  5. Verification example
     Program:
        int x;
        int y = 8, z = 0, w = 0;
        if (x) z = y - 1;
        else   w = y + 1;
        assert(z == 5 || w == 9);
     Constraints:
        y = 8, z = x ? y - 1 : 0, w = x ? 0 : y + 1, z != 5, w != 9
     Result: SAT. The program contains a bug.
     Counterexample: y = 8, x = 1, w = 0, z = 7
     (See the SMT sketch below.)
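     These constraints map directly onto an SMT query. Below is a minimal
     sketch of such a check, assuming the Z3 C++ API (z3++.h); it illustrates
     the BMC idea, not the actual Borealis encoding:

        #include <z3++.h>
        #include <iostream>

        int main() {
            z3::context c;
            z3::expr x = c.int_const("x");
            z3::expr y = c.int_const("y");
            z3::expr z = c.int_const("z");
            z3::expr w = c.int_const("w");

            z3::solver s(c);
            s.add(y == 8);                                     // int y = 8
            s.add(z == z3::ite(x != 0, y - 1, c.int_val(0)));  // then-branch
            s.add(w == z3::ite(x != 0, c.int_val(0), y + 1));  // else-branch
            s.add(z != 5 && w != 9);                           // negated assert

            if (s.check() == z3::sat)    // satisfiable: the assert can fail
                std::cout << "bug found, model:\n" << s.get_model() << "\n";
            else                         // unsatisfiable: the assert always holds
                std::cout << "assert holds\n";
            return 0;
        }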
  6. Problem
     A huge number of SMT queries is involved in BMC.
     We try to scale Borealis to multiple cores on our RSC Tornado.
  7. RSC Tornado supercomputer
     • 712 dual-processor nodes with 1424 Intel Xeon E5-2697 CPUs
     • 64 GB of DDR4 RAM and local 120 GB SSD storage per node
     • 1 PB Lustre storage
     • InfiniBand FDR, 56 Gb/s
  8. Lustre storage
     • Parallel distributed file system
     • Highly scalable
     • Terabytes per second of I/O throughput
     • Inefficient work with small files
  9. Distributed compilation
     There are several ways to distribute compilation:
     • Compilation on the Lustre storage
     • Distribution of the intermediate build tree to the processing nodes
     • Distribution of copies of the analyzed project
  10. Compilation on the Lustre storage
      • Each node accesses Lustre for the necessary files
      • Lustre is slow when dealing with multiple small files
  11. Distribution of the intermediate build tree
      • Reduces the CPU time
      • A build may contain several related compilation/linking phases
  12. Distribution of copies of the analyzed project
      • Compilation is done using standard build tools
      • We repeat the computations on every node
      • Does not increase the wall-clock time
  13. Distributed linking
      We distribute different SMT queries to different nodes/cores.
      Borealis performs its analysis on an LLVM IR module.
  14. Distributed linking
      Module level
      • Same as parallel make
      • Not really efficient
      Instruction level
      • Need to track dependencies between SMT calls
      • Too complex
      Function level
      • Medium efficiency
      • Simple implementation
  15. Distributed linking
      There are two ways to distribute functions between several processes:
      • Dynamic distribution
      • Static distribution
  16. Dynamic distribution
      • The master process distributes functions between several processes
      • Based on a single producer / multiple consumers scheme
      • If a process receives N functions, it also has to run the auxiliary
        LLVM passes N times
      (See the MPI sketch below.)
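      A rough sketch of that producer/consumer scheme with plain MPI calls;
      the task encoding and the analyzeFunction stub are assumptions made for
      illustration, not the actual Borealis implementation:

         #include <mpi.h>

         enum { TAG_TASK = 1, TAG_READY = 2 };

         // Stand-in for running the auxiliary LLVM passes and the SMT
         // queries for one function.
         void analyzeFunction(int functionId) { /* ... */ }

         int main(int argc, char** argv) {
             MPI_Init(&argc, &argv);
             int rank, size;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);

             const int numFunctions = 1000;  // assumed workload size
             if (rank == 0) {
                 // Master: hand out one function index per worker request.
                 int next = 0, stopped = 0;
                 while (stopped < size - 1) {
                     int dummy;
                     MPI_Status st;
                     MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY,
                              MPI_COMM_WORLD, &st);
                     int task = (next < numFunctions) ? next++ : -1;  // -1 = stop
                     if (task == -1) ++stopped;
                     MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                              MPI_COMM_WORLD);
                 }
             } else {
                 // Worker: request work, analyze, repeat until told to stop.
                 int task = 0;
                 while (true) {
                     MPI_Send(&task, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
                     MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                              MPI_STATUS_IGNORE);
                     if (task == -1) break;
                     analyzeFunction(task);
                 }
             }
             MPI_Finalize();
             return 0;
         }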
  17. Static distribution
      Each process determines its set of functions based on its rank.
      We use the following two rank kinds:
      • global rank
      • local rank
      After some experiments we decided to use the static method
      (a minimal sketch follows below).
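      A minimal sketch of rank-based static distribution, assuming a simple
      round-robin assignment over the global rank; the actual Borealis scheme
      also uses the local (per-node) rank and is more elaborate:

         #include <mpi.h>
         #include <string>
         #include <vector>

         // Pick the functions this process will analyze: every size-th
         // function starting at the process's own global rank, so no
         // coordination is needed at analysis time.
         std::vector<std::string> pickFunctions(
                 const std::vector<std::string>& allFunctions) {
             int rank, size;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);

             std::vector<std::string> mine;
             for (std::size_t i = 0; i < allFunctions.size(); ++i)
                 if (static_cast<int>(i % size) == rank)
                     mine.push_back(allFunctions[i]);
             return mine;
         }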
  18. Improving static distribution
      • Need to balance the workload
      • We reinforce the method with a function complexity estimation
      Our estimation is based on the following properties:
      • Function size
      • Number of instructions that work with memory
      (A sketch of such an estimate follows below.)
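      A sketch of such an estimate over LLVM IR; the relative weights are
      made up for illustration, since the slide does not give the exact metric:

         #include "llvm/IR/Function.h"
         #include "llvm/IR/Instructions.h"

         // Estimate how expensive a function will be to analyze: count every
         // instruction (function size) and give extra weight to loads and
         // stores, since memory accesses blow up the SMT queries.
         unsigned estimateComplexity(const llvm::Function& F) {
             unsigned score = 0;
             for (const llvm::BasicBlock& BB : F)
                 for (const llvm::Instruction& I : BB) {
                     score += 1;
                     if (llvm::isa<llvm::LoadInst>(I) ||
                         llvm::isa<llvm::StoreInst>(I))
                         score += 4;  // assumed weight for memory work
                 }
             return score;
         }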
  19. PDD
      Borealis records the analysis results, so we don't re-analyze already
      processed functions.
      Persistent Defect Data (PDD) is used for recording the results.
      PDD contains:
      • Defect location
      • Defect type
      • SMT result
      Example entry:
         {
           "location": {
             "loc": { "col": 2, "line": 383 },
             "filename": "rarpd.c"
           },
           "type": "INI-03"
         }
  20. PDD synchronization problem
      • Transferring a full PDD takes a long time
      • We synchronize a reduced PDD (rPDD) instead
      • The rPDD is simply a list of already analyzed functions
  21. rPDD synchronization
      To perform the synchronization we use a two-stage approach:
      • Synchronize the rPDD between the processes on a single node
      • Synchronize the rPDD between the nodes
      (A rough MPI sketch of the two stages follows below.)
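      A rough MPI sketch of the two stages, under the simplifying assumption
      that the rPDD is a byte mask over function indices; the real rPDD is a
      list of analyzed functions and its merge logic is more involved:

         #include <mpi.h>

         void syncRPDD(unsigned char* analyzed, int numFunctions) {
             // Stage 1: merge within a node (processes sharing memory).
             MPI_Comm node;
             MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                 MPI_INFO_NULL, &node);
             MPI_Allreduce(MPI_IN_PLACE, analyzed, numFunctions,
                           MPI_UNSIGNED_CHAR, MPI_BOR, node);

             // Stage 2: merge across nodes via one leader process per node.
             int nodeRank;
             MPI_Comm_rank(node, &nodeRank);
             MPI_Comm leaders;
             MPI_Comm_split(MPI_COMM_WORLD,
                            nodeRank == 0 ? 0 : MPI_UNDEFINED, 0, &leaders);
             if (leaders != MPI_COMM_NULL) {
                 MPI_Allreduce(MPI_IN_PLACE, analyzed, numFunctions,
                               MPI_UNSIGNED_CHAR, MPI_BOR, leaders);
                 MPI_Comm_free(&leaders);
             }

             // Propagate the merged result back to every process on the node.
             MPI_Bcast(analyzed, numFunctions, MPI_UNSIGNED_CHAR, 0, node);
             MPI_Comm_free(&node);
         }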
  22. Implementation
      • The Borealis HPC implementation is based on OpenMPI
      • We implemented an API to work with the library
      • HPC Borealis is implemented in the form of 3 LLVM passes
      (A hypothetical pass skeleton follows below.)
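      A hypothetical skeleton of one such pass in the legacy pass-manager
      style; the pass name, the attribute string, and the round-robin
      assignment are illustrative, not the actual Borealis passes:

         #include "llvm/Pass.h"
         #include "llvm/IR/Module.h"
         #include <mpi.h>

         namespace {
         // Marks the functions that do NOT belong to this MPI rank so that
         // the later analysis passes skip them.
         struct FunctionDistributor : public llvm::ModulePass {
             static char ID;
             FunctionDistributor() : llvm::ModulePass(ID) {}

             bool runOnModule(llvm::Module& M) override {
                 int rank = 0, size = 1;
                 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                 MPI_Comm_size(MPI_COMM_WORLD, &size);

                 unsigned idx = 0;
                 for (llvm::Function& F : M) {
                     if (F.isDeclaration()) continue;
                     if (static_cast<int>(idx++ % size) != rank)
                         F.addFnAttr("borealis.skip");  // made-up marker
                 }
                 return true;  // the module was modified
             }
         };
         } // namespace
         char FunctionDistributor::ID = 0;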
  23. Evaluation
      We tested the prototype in the following configurations:
      • One process on a local [1] machine
      • Eight processes on a local machine
      • On RSC Tornado using 1, 2, 4, 8, 16 and 32 nodes
      [1] a machine with an Intel Core i7-4790 3.6 GHz processor, 32 GB of RAM
          and Intel 535 SSD storage
  24. Evaluation projects
      Name        SLOC   Modules  Description
      git         340k   49       distributed revision control system
      longs       209k   1        URL shortener
      beanstalkd  7.5k   1        simple, fast work queue
      zstd        42k    3        fast lossless compression algorithm library
      reptyr      3.5k   1        utility for reattaching programs to new terminals
  25. Evaluation results
      Configuration       zstd     git     longs   beanstalkd  reptyr
      SCC 1 process                678:23          2:05        1:30
      SCC 1 node          2433:05  113:59  58:53   2:50        1:53
      SCC 2 nodes         2421:35  101:22  59:00   2:12        1:32
      SCC 4 nodes         2419:23  96:53   61:09   2:19        1:19
      SCC 8 nodes         2510:34  96:51   63:09   2:10        1:43
      SCC 16 nodes        2434:05  97:26   63:06   2:37        1:34
      SCC 32 nodes        2346:39  107:14  63:02   2:34        1:52
      Local 1 process     2450:02  281:11  205:05  0:36        0:08
      Local 8 processes   2848:55  103:21  93:14   0:30        0:06
  26. Conclusion
      Our main takeaways are as follows:
      • Several big functions can bottleneck the analysis
      • LLVM is not optimized for distributed scenarios
      • Single-core optimizations can create difficulties for HPC
      • Adding nodes can increase the time of analysis