"Automatic Patch Generation" by Claire Le Goues

Slide 1

Slide 1 text

Automatic Patch Generation
Claire Le Goues
PWLConf; September 15, 2016

Slide 2

Slide 2 text

ONCE UPON A TIME…

Slide 3

Slide 3 text

Young Claire (developer)

Slide 4

Slide 4 text

Bug report from a customer…

Slide 5

Slide 5 text

Slide 6

Slide 6 text

??! 32-64 bit transition for Unicode encoding of Ukrainian.

Slide 7

Slide 7 text

Problem: source-level defect repair
[Figure label: bug-fixing patch]

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

[Figure label: printf transformer]

Slide 11

Slide 11 text

[Figure: printf transformer. Input: program pieces numbered 1 to 12, shaded by fault probability.]
Legend: Likely faulty. Maybe faulty. Not faulty.
Spectrum-based fault localization automatically ranks potentially buggy program pieces based on test case behavior.
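The slide does not name a ranking formula. As a minimal sketch, assuming the Tarantula suspiciousness score (one common spectrum-based metric, not necessarily the one used in the talk), a program piece is scored from how many failing and passing tests execute it. The function and parameter names are hypothetical.

```c
/* Tarantula suspiciousness for one program piece (assumed metric).
 * failed_cov / passed_cov: failing / passing tests that execute the piece.
 * total_failed / total_passed: failing / passing tests in the whole suite.
 * Returns a score in [0, 1]; higher means more likely faulty. */
double tarantula(int failed_cov, int total_failed,
                 int passed_cov, int total_passed) {
    double fail_ratio = total_failed ? (double)failed_cov / total_failed : 0.0;
    double pass_ratio = total_passed ? (double)passed_cov / total_passed : 0.0;
    if (fail_ratio + pass_ratio == 0.0)
        return 0.0; /* never executed by any test */
    return fail_ratio / (fail_ratio + pass_ratio);
}
```

A piece executed only by failing tests scores 1.0, one executed only by passing tests scores 0.0, and one executed by every test scores 0.5.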

Slide 12

Slide 12 text

Bug fixing: the 30000-foot view
1. Localize the bug.
   – And possibly analyze it a little bit…
2. Create/combine fix possibilities into 1+ possible patches.
3. Validate candidate patch.

1. Heuristic: including meta-heuristic, “guess and check.”
2. Semantic: symbolic execution + SMT solvers, synthesis.

Slide 13

Slide 13 text

GenProg: automatic program repair using genetic programming.
Biased, random search for AST-level edits to a program that fix a given bug without breaking any previously passing tests.
Image: https://upload.wikimedia.org/wikipedia/commons/a/a4/13-02-27-spielbank-wiesbaden-by-RalfR-093.jpg

Slide 14

Slide 14 text

Genetic programming: the application of evolutionary or genetic algorithms to program source code.

Slide 15

Slide 15 text

[Flowchart labels: INPUT, EVALUATE FITNESS, DISCARD, ACCEPT, MUTATE, OUTPUT]
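The loop in the flowchart can be sketched on a toy problem. This is a minimal illustration, not GenProg: the "program" is a hypothetical bit string, fitness counts positions that match a target, and a single candidate is mutated and then accepted or discarded. All names and constants are assumptions made for the example.

```c
#include <stddef.h>

#define GENOME_LEN 16

/* Hypothetical fitness: number of positions that match the target. */
static int fitness(const int *genome, const int *target) {
    int score = 0;
    for (size_t i = 0; i < GENOME_LEN; i++)
        if (genome[i] == target[i])
            score++;
    return score;
}

/* Small linear congruential generator so that runs are reproducible. */
static unsigned next_random(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Mutate, evaluate, then accept or discard. Returns the final fitness. */
int evolve(int *genome, const int *target, int iterations, unsigned seed) {
    int best = fitness(genome, target);
    for (int i = 0; i < iterations && best < GENOME_LEN; i++) {
        size_t pos = next_random(&seed) % GENOME_LEN;
        genome[pos] ^= 1;                    /* MUTATE: flip one bit */
        int score = fitness(genome, target); /* EVALUATE FITNESS */
        if (score >= best)
            best = score;                    /* ACCEPT */
        else
            genome[pos] ^= 1;                /* DISCARD: undo the mutation */
    }
    return best;
}
```

Because a mutation is kept only when it does not lower the score, the returned fitness is never below the starting fitness. A real genetic-programming search keeps a population and also recombines candidates, which this sketch leaves out.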

Slide 16

Slide 16 text

GenProg: meta-heuristic search.
1. Localize the bug.
   – And possibly analyze it a little bit…
2. Create/combine fix possibilities into 1+ possible patches.
3. Validate candidate patch.

Localize to C statements. Use genetic programming to search for statement-level patches, reusing code from the existing program.

Slide 17

Slide 17 text

    void gcd(int a, int b) {
        if (a == 0) {
            printf("%d", b);
        }
        while (b > 0) {
            if (a > b)
                a = a - b;
            else
                b = b - a;
        }
        printf("%d", a);
        return;
    }

Slide 18

Slide 18 text

    void gcd(int a, int b) {
        if (a == 0) {
            printf("%d", b);
        }
        while (b > 0) {
            if (a > b)
                a = a - b;
            else
                b = b - a;
        }
        printf("%d", a);
        return;
    }

    > gcd(4,2)
    > 2
    >
    > gcd(1071,1029)
    > 21
    >
    > gcd(0,55)
    > 55 (looping forever) !

Slide 19

Slide 19 text

GenProg: meta-heuristic search.
1. Localize the bug.
   – And possibly analyze it a little bit…
2. Create/combine fix possibilities into 1+ possible patches.
3. Validate candidate patch.

Localize to C statements. Use genetic programming to search for statement-level patches, reusing code from the existing program.

Slide 20

Slide 20 text

Input: [Figure: AST of gcd. Nodes: if(a==0), printf(b), while (b>0), if(a>b), a = a - b, b = b - a, printf(a), return, grouped under {block} nodes.]

Slide 21

Slide 21 text

Input: [Figure: AST of gcd. Nodes: if(a==0), printf(b), while (b>0), if(a>b), a = a - b, b = b - a, printf(a), return, grouped under {block} nodes.]
Legend: High change probability. Low change probability. Not changed.

Slide 22

Slide 22 text

An individual is a candidate patch/set of changes to the input program.
• A patch is a series of statement-level edits:
  – delete X
  – replace X with Y
  – insert Y after X.
• Replace/insert: pick Y from somewhere else in the program.
• To mutate an individual, add new random edits to a given (possibly empty) patch.
  – (Where? Right: fault localization!)
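One way to picture this representation is a small record per edit. The sketch below is an assumption for illustration, not GenProg's actual data structure: statements are plain strings, both indices refer to the original program, and only a single edit is applied. A patch would then be an array of such `Edit` values.

```c
#include <stddef.h>

typedef enum { EDIT_DELETE, EDIT_REPLACE, EDIT_INSERT_AFTER } EditKind;

/* One statement-level edit. `target` and `source` index statements of the
 * original program; `source` is ignored for deletions. */
typedef struct {
    EditKind kind;
    size_t target; /* X: the statement being edited */
    size_t source; /* Y: a statement copied from elsewhere in the program */
} Edit;

/* Apply one edit to `stmts` (n statements), writing the result to `out`,
 * which must have room for n + 1 entries. Returns the new statement count. */
size_t apply_edit(const char *const *stmts, size_t n, Edit e,
                  const char **out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == e.target && e.kind == EDIT_DELETE)
            continue;                    /* delete X */
        if (i == e.target && e.kind == EDIT_REPLACE) {
            out[k++] = stmts[e.source];  /* replace X with Y */
            continue;
        }
        out[k++] = stmts[i];
        if (i == e.target && e.kind == EDIT_INSERT_AFTER)
            out[k++] = stmts[e.source];  /* insert Y after X */
    }
    return k;
}
```

Because Y is always an index into the existing program, the search never invents new code; it only reuses statements that are already there.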

Slide 23

Slide 23 text

Input: [Figure: AST of gcd. Nodes: if(a==0), printf(b), while (b>0), if(a>b), a = a - b, b = b - a, printf(a), return, grouped under {block} nodes.]
An edit is:
• Insert statement X after statement Y
• Replace statement X with statement Y
• Delete statement X

Slide 24

Slide 24 text

Input: [Figure: AST of gcd. Nodes: if(a==0), printf(b), while (b>0), if(a>b), a = a - b, b = b - a, printf(a), return, grouped under {block} nodes.]
An edit is:
• Insert statement X after statement Y
• Replace statement X with statement Y
• Delete statement X

Slide 25

Slide 25 text

Input: [Figure: AST of gcd. Nodes: if(a==0), printf(b), while (b>0), if(a>b), a = a - b, b = b - a, printf(a), return, grouped under {block} nodes.]
An edit is:
• Insert statement X after statement Y
• Replace statement X with statement Y
• Delete statement X

Slide 26

Slide 26 text

Input: [Figure: AST of gcd, with the nodes "return" and "printf(b)" called out separately from the rest of the tree.]
An edit is:
• Insert statement X after statement Y
• Replace statement X with statement Y
• Delete statement X
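Reading the figure as "insert a copy of `return` after `printf(b)`", the repaired function can be sketched as follows. Two things here are assumptions made for the example: the name `gcd_patched`, and the fact that it returns the value it prints, so that the behavior can be checked. Inputs are assumed to be non-negative.

```c
#include <stdio.h>

/* gcd from the slides with a one-statement repair: a return inserted after
 * printf(b). Returns the printed value (an adaptation for checking). */
int gcd_patched(int a, int b) {
    if (a == 0) {
        printf("%d", b);
        return b; /* inserted: the original fell through into the loop */
    }
    while (b > 0) {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    printf("%d", a);
    return a;
}
```

With the inserted statement, `gcd_patched(0, 55)` prints 55 and stops instead of entering a loop in which `b = b - a` never changes `b`.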

Slide 27

Slide 27 text

What about Angelix?
1. Localize the bug.
   – And possibly analyze it a little bit…
2. Create/combine fix possibilities into 1+ possible patches.
3. Validate candidate patch.

Same idea, but localizing to expressions: RHS of assignments, conditionals.

Slide 28

Slide 28 text

    1  int is_upward(int inhibit, int up_sep, int down_sep) {
    2    int bias;
    3    if (inhibit)
    4      bias = down_sep;  // bias = up_sep + 100
    5    else bias = up_sep;
    6    if (bias > down_sep)
    7      return 1;
    8    else return 0;
    9  }

Tremendous gratitude to Abhik Roychoudhury for sharing slides with me as starting material for this talk.

Slide 29

Slide 29 text

    1  int is_upward(int inhibit, int up_sep, int down_sep) {
    2    int bias;
    3    if (inhibit)
    4      bias = down_sep;  // bias = up_sep + 100
    5    else bias = up_sep;
    6    if (bias > down_sep)
    7      return 1;
    8    else return 0;
    9  }

    inhibit  up_sep  down_sep  Observed output  Expected output  Result
    1        0       100       0                0                pass
    1        11      110       0                1                fail
    0        100     50        1                1                pass
    1        -20     60        0                1                fail
    0        0       10        0                0                pass

Slide 30

Slide 30 text

What about Angelix?
1. Localize the bug.
   – And possibly analyze it a little bit…
2. Create/combine fix possibilities into 1+ possible patches.
3. Validate candidate patch.

Concolic execution to find expression values that would make the test pass.
Program synthesis to construct replacement code that produces those values.

Slide 31

Slide 31 text

An expression’s angelic value is the value that would make a given test case pass.
• This value is set “arbitrarily”, by which we mean symbolically.
• You can solve for this value if you have:
  – the test case’s expected input/output.
  – the path condition controlling its execution.
• Path condition: the set of conditions that controlled a particular execution.
  – Start executing the test concretely, and then switch to symbolic execution when the angelic value starts to matter.

Slide 32

Slide 32 text

    1  int is_upward(int inhibit, int up_sep, int down_sep) {
    2    int bias;
    3    if (inhibit)
    4      bias = down_sep;  // bias = up_sep + 100
    5    else bias = up_sep;
    6    if (bias > down_sep)
    7      return 1;
    8    else return 0;
    9  }

    inhibit  up_sep  down_sep  Observed output  Expected output  Result
    1        0       100       0                0                pass
    1        11      110       0                1                fail
    0        100     50        1                1                pass
    1        -20     60        0                1                fail
    0        0       10        0                0                pass

Slide 33

Slide 33 text

    1  int is_upward(int inhibit, int up_sep, int down_sep) {
    2    int bias;
    3    if (inhibit)
    4      bias = α;  // bias = up_sep + 100
    5    else bias = up_sep;
    6    if (bias > down_sep)
    7      return 1;
    8    else return 0;
    9  }

    inhibit  up_sep  down_sep  Observed output  Expected output  Result
    1        11      110       0                1                fail

Symbolic states, with α standing for the angelic value of the expression on line 4:
    Line 4: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = true
    Line 7: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = α > 110
    Line 8: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = α ≤ 110
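To make the idea of an angelic value concrete, the sketch below replaces the right-hand side on line 4 with a caller-supplied value and searches for one that makes a test pass. Brute-force enumeration stands in for symbolic execution here, which is an assumption made to keep the example small; Angelix solves for the value with a constraint solver instead. The function names and the search range are hypothetical.

```c
/* is_upward with the suspicious right-hand side on line 4 replaced by a
 * caller-supplied angelic value. */
int is_upward_angelic(int inhibit, int up_sep, int down_sep, int angelic) {
    int bias;
    if (inhibit)
        bias = angelic;
    else
        bias = up_sep;
    return bias > down_sep ? 1 : 0;
}

/* Enumerate candidates in [lo, hi] and find the smallest one that makes the
 * test produce `expected`. Returns 1 and stores it in *angelic on success,
 * 0 if no candidate in the range works. */
int find_angelic_value(int inhibit, int up_sep, int down_sep, int expected,
                       int lo, int hi, int *angelic) {
    for (int v = lo; v <= hi; v++) {
        if (is_upward_angelic(inhibit, up_sep, down_sep, v) == expected) {
            *angelic = v;
            return 1;
        }
    }
    return 0;
}
```

For the failing test (1, 11, 110) with expected output 1, a search over [-200, 200] stops at 111, the smallest integer that satisfies the path condition for reaching line 7 (a value greater than 110).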

Slide 34

Slide 34 text

What should it have been?

    1  int is_upward(int inhibit, int up_sep, int down_sep) {
    2    int bias;
    3    if (inhibit)
    4      α = f(inhibit, up_sep, down_sep)
    5    else bias = up_sep;
    6    if (bias > down_sep)
    7      return 1;
    8    else return 0;
    9  }

Here α stands for the angelic value of the expression on line 4.
Inputs: inhibit == 1, up_sep == 11, down_sep == 110
Symbolic execution yields: f(1,11,110) > 110

Slide 35

Slide 35 text

Collect all of the constraints!
• Accumulated constraints over all test cases:
    f(1,11,110) > 110 ∧ f(1,0,100) ≤ 100 ∧ f(1,-20,60) > 60
• Use oracle-guided component-based program synthesis to construct a satisfying f:
  – Fix a set of operators (component-based).
  – Synthesize code that only uses those operators and satisfies the constraints (oracle guided).
• Generated fix
  – f(inhibit,up_sep,down_sep) = up_sep + 100
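As a rough illustration of solving these constraints, the sketch below fixes a single template, f = up_sep + c, and enumerates the constant c. This naive enumeration is an assumption standing in for the SMT-based component synthesis the slide refers to, and the template, names, and search range are all hypothetical. The test data is the five-row table from the earlier slide.

```c
#include <stddef.h>

typedef struct { int inhibit, up_sep, down_sep, expected; } Test;

/* The five tests from the is_upward slide. */
static const Test TESTS[] = {
    {1, 0, 100, 0}, {1, 11, 110, 1}, {0, 100, 50, 1},
    {1, -20, 60, 1}, {0, 0, 10, 0},
};

/* is_upward with line 4 replaced by the candidate f = up_sep + c. */
static int is_upward_with(int c, const Test *t) {
    int bias = t->inhibit ? t->up_sep + c : t->up_sep;
    return bias > t->down_sep;
}

/* Try every constant in [lo, hi]; store the first one that passes all
 * tests in *c_out and return 1, or return 0 if none does. */
int synthesize_constant(int lo, int hi, int *c_out) {
    size_t n = sizeof TESTS / sizeof TESTS[0];
    for (int c = lo; c <= hi; c++) {
        size_t passed = 0;
        for (size_t i = 0; i < n; i++)
            if (is_upward_with(c, &TESTS[i]) == TESTS[i].expected)
                passed++;
        if (passed == n) {
            *c_out = c;
            return 1;
        }
    }
    return 0;
}
```

Under this template the constraints pin the constant down exactly: 11 + c > 110 requires c ≥ 100 and 0 + c ≤ 100 requires c ≤ 100, so the only solution is c = 100, matching the generated fix on the slide.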

Slide 36

Slide 36 text

(Legitimately interesting encoding of synthesis problem elided for (dubious) brevity.)

Slide 37

Slide 37 text

So why all that attention paid to “forests”?
Image: https://commons.wikimedia.org/wiki/File:Michael_Spiller_-_twisty_forest_paths_(by-sa).jpg

Slide 38

Slide 38 text

Angelic Forest
[Figure: Program with expressions E1, E2, E3; Angelic Paths.]

Slide 39

Slide 39 text

Angelic Forest
[Figure: Program with expressions E1, E2, E3; Angelic Paths. SAT: angelic1, angelic2, angelic3.]

Slide 40

Slide 40 text

Angelic Forest
[Figure: Program with expressions E1, E2, E3; Angelic Paths. UNSAT: angelic1, angelic2, angelic3.]

Slide 41

Slide 41 text

Angelic Forest
[Figure: Program with expressions E1, E2, E3; Angelic Paths. SAT. Path labels: angelic1, angelic2, angelic3; angelic1, angelic3.]

Slide 42

Slide 42 text

Angelic Forest
[Figure: Program with expressions E1, E2, E3; Angelic Paths. UNSAT. Path labels: angelic1, angelic2, angelic3; angelic1, angelic3.]

Slide 43

Slide 43 text

Tradeoffs and Challenges
• Scalability
• Expressive power
• Output quality
Images:
https://www.flickr.com/photos/86530412@N02/7935377706
https://pixabay.com/en/approved-control-quality-stamp-147677/
https://www.flickr.com/photos/cimmyt/5219256862

Slide 44

Slide 44 text

    Program  Description  LOC  Bug Type  Time (s)

Slide 45

Slide 45 text

    Program      Description                  LOC     Bug Type                        Time (s)
    gcd          example                      22      infinite loop                   153
    nullhttpd    webserver                    5575    heap buffer overflow (code)     578
    zune         example                      28      infinite loop                   42
    uniq         text processing              1146    segmentation fault              34
    look-u       dictionary lookup            1169    segmentation fault              45
    look-s       dictionary lookup            1363    infinite loop                   55
    units        metric conversion            1504    segmentation fault              109
    deroff       document processing          2236    segmentation fault              131
    indent       code processing              9906    infinite loop                   546
    flex         lexical analyzer generator   18774   segmentation fault              230
    openldap     directory protocol           292598  non-overflow denial of service  665
    ccrypt       encryption utility           7515    segmentation fault              330
    lighttpd     webserver                    51895   heap buffer overflow (vars)     394
    atris        graphical game               21553   local stack buffer exploit      80
    php          scripting language           764489  integer overflow                56
    wu-ftpd      FTP server                   67029   format string vulnerability     2256
    leukocyte    computational biology        6718    segmentation fault              360
    tiff         image processing             84067   segmentation fault              108
    imagemagick  image processing             450516  wrong output                    2160

Slide 46

Slide 46 text

Slide 47

Slide 47 text

Slide 48

Slide 48 text

Slide 49

Slide 49 text

“IF I GAVE YOU THE LAST 100 BUGS FROM , HOW MANY COULD GENPROG ACTUALLY FIX?” – MANY PEOPLE

Slide 50

Slide 50 text

Challenge: Indicative Bug Set
• Goal: a large set of important, reproducible bugs in non-trivial programs.
• Approach: use historical data (source control!) to approximate discovery and repair of bugs in the wild.

Slide 51

Slide 51 text

Success/Cost

                                  Cost per non-repair   Cost per repair
    Program    Defects Repaired   Hours    US$          Hours    US$
    fbc        1/3                8.52     5.56         6.52     4.08
    gmp        1/2                9.93     6.61         1.60     0.44
    gzip       1/5                5.11     3.04         1.41     0.30
    libtiff    17/24              7.81     5.04         1.05     0.04
    lighttpd   5/9                10.79    7.25         1.34     0.25
    php        28/44              13.00    8.80         1.84     0.62
    python     1/11               13.00    8.80         1.22     0.16
    wireshark  1/7                13.00    8.80         1.23     0.17
    Total      55/105             11.22h                1.60h

• $403 for all 105 trials, leading to 55 repairs; $7.32 per bug repaired.

Slide 52

Slide 52 text

Comparison (Repair-ability)

    Program    Angelix  SPR   GenProg
    WIRESHARK  57%      57%   14%
    PHP        23%      41%   11%
    GZIP       40%      40%   20%
    GMP        100%     100%  50%
    LIBTIFF    42%      21%   13%

Slide 53

Slide 53 text

Heartbleed patch

Generated patch:
    if (hbtype == TLS1_HB_REQUEST
        && (payload + 18) < s->s3->rrec.length) {
        …
    } else if (hbtype == TLS1_HB_RESPONSE) {
        …
    }
    return 0;

Developer patch:
    if (1 + 2 + payload + 16 > s->s3->rrec.length)
        return 0;
    …
    if (hbtype == TLS1_HB_REQUEST) {
        …
    } else if (hbtype == TLS1_HB_RESPONSE) {
        …
    }
    return 0;
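The two guards look different but can be compared directly. The sketch below is a small check, under the assumption that `payload` and the record length are unsigned integers small enough that the additions do not wrap, that both guards admit exactly the same request messages. The function names are hypothetical, and only the bounds check is modeled, not the rest of the heartbeat handler.

```c
/* Guard from the generated patch: a request is processed only if this holds. */
int generated_guard(unsigned int payload, unsigned int record_length) {
    return (payload + 18) < record_length;
}

/* Guard from the developer patch: 1 when the message is processed,
 * 0 when the early `return 0` would fire. */
int developer_guard(unsigned int payload, unsigned int record_length) {
    if (1 + 2 + payload + 16 > record_length)
        return 0;
    return 1;
}

/* Exhaustively compare the guards for all values up to max_value. */
int guards_agree(unsigned int max_value) {
    for (unsigned int payload = 0; payload <= max_value; payload++)
        for (unsigned int len = 0; len <= max_value; len++)
            if (generated_guard(payload, len) != developer_guard(payload, len))
                return 0;
    return 1;
}
```

Over the integers, `payload + 18 < length` is the same condition as `payload + 19 <= length`, which is the negation of the developer's `1 + 2 + payload + 16 > length`. The visible difference is placement: the developer patch applies the check before the message type is examined, while the generated patch applies it only to the request branch.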

Slide 54

Slide 54 text

Tradeoffs and Challenges
• Scalability
• Output quality
• Expressive power

Slide 55

Slide 55 text

Flashback to 2008…
nullhttpd: a webserver with basic GET + POST functionality.
Version 0.5.0: remote-exploitable heap-based buffer overflow in handling of POST.
Failing test case: run exploit, see if webserver is still running.
Easy passing test cases:
1. “GET index.html”
2. “GET image.jpg”
3. “GET notfound.html”
→ “delete handling of POST requests”
(Image: CC0 Public Domain)

Slide 56

Slide 56 text

When we added a non-crashing test case for POST, proto-GenProg found a much better patch.
• When the test suite is your objective function, test suite quality matters.
  – …how much is a trickier issue.
• But we’re begging the question...
Image: https://en.wikipedia.org/wiki/Basket-hilted_sword#/media/File:Schiavona-Morges.jpg

Slide 57

Slide 57 text

Image: https://en.wikipedia.org/wiki/Basket-hilted_sword#/media/File:Schiavona-Morges.jpg

Slide 58

Slide 58 text

When we added a non-crashing test case for POST, proto-GenProg found a much better patch.
• When the test suite is your objective function, test suite quality matters.
  – …how much is a trickier issue.
• But we’re begging the question...
Image: https://en.wikipedia.org/wiki/Basket-hilted_sword#/media/File:Schiavona-Morges.jpg

Slide 59

Slide 59 text

What is a high quality patch, anyway?
• Understandable?
  – Well, I had no problem understanding the POST-deleting patch…
  – (non-functional properties are important and being studied by others!)
• Doesn’t delete?
  – But what about goto fail?
• Does the same thing the human did/would do?
  – But humans are often wrong! And how close does it have to be?
• Doesn’t introduce new bugs?
  – How to tell?
• Addresses the cause, not the symptom…

Slide 60

Slide 60 text

Proposal: measure quality based on degree to which results generalize.
• In machine learning, techniques are trained and evaluated on disjoint datasets to assess overfitting.
• In program repair:
  – Tests used to build a repair are training tests
  – Tests used to assess correctness are evaluation tests
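A small sketch of this measurement, reusing the is_upward tests from earlier in the talk. The split into training and held-out tests, the overfit patch `bias = down_sep + 1`, and all names below are hypothetical and chosen only to show how a patch can pass every training test and still fail to generalize.

```c
typedef struct { int inhibit, up_sep, down_sep, expected; } Test;
typedef int (*Patch)(int inhibit, int up_sep, int down_sep);

/* Hypothetical split of the five is_upward tests. */
static const Test TRAINING[] = {
    {1, 11, 110, 1}, {0, 100, 50, 1}, {1, -20, 60, 1}, {0, 0, 10, 0},
};
static const Test HELD_OUT[] = { {1, 0, 100, 0} };

/* Overfit candidate: always answers 1 when inhibit is set. */
int overfit_patch(int inhibit, int up_sep, int down_sep) {
    int bias = inhibit ? down_sep + 1 : up_sep;
    return bias > down_sep;
}

/* The fix shown in the slides. */
int general_patch(int inhibit, int up_sep, int down_sep) {
    int bias = inhibit ? up_sep + 100 : up_sep;
    return bias > down_sep;
}

/* Fraction of the given tests that the patched function passes. */
double pass_rate(Patch patch, const Test *tests, int n) {
    int passed = 0;
    for (int i = 0; i < n; i++)
        if (patch(tests[i].inhibit, tests[i].up_sep, tests[i].down_sep)
            == tests[i].expected)
            passed++;
    return n ? (double)passed / n : 0.0;
}
```

Both candidates pass all four training tests, but only `general_patch` passes the held-out test, which is the kind of gap the evaluation tests are meant to expose.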

Slide 61

Slide 61 text

PROBLEM: THE DESIRED STUDY IS IMPOSSIBLE.

Slide 62

Slide 62 text

[Dataset + Tools]
• Student homework submissions from six UC Davis Introduction to Programming assignments
• Two full-coverage test suites:
  – White-box suite generated by Klee from reference implementation.
  – Black-box suite written by course instructor.
  – Feature: Assess patch quality as distinct from test suite quality.
• Goal: Compare GenProg and TrpAutoRepair/RSRepair, G&V techniques with different search strategies.
Full dataset available at repairbenchmarks.cs.umass.edu

Slide 63

Slide 63 text

Both tools produced patches that overfit to the training set.

Slide 64

Slide 64 text

But: the tools do as well as the students!

Slide 65

Slide 65 text

Overfitting is not unique to heuristic techniques.
• Angelix: 120/233 of patches produced on a subset of IntroClass overfit.
• ~40% of SPR patches studied in the Angelix paper delete functionality by generating tautological if conditions.

Slide 66

Slide 66 text

Slide 67

Slide 67 text

Quality Comparison with SPR
FUNCTIONALITY-DELETING REPAIRS

    Program    Angelix  SPR
    WIRESHARK  25%      25%
    PHP        30%      39%
    GZIP       0%       50%
    GMP        0%       0%
    LIBTIFF    20%      80%

Slide 68

Slide 68 text

Slide 69

Slide 69 text

OPTION 1: UNDERSTAND AND REASON ABOUT THE CIRCUMSTANCES UNDER WHICH PERFECTION IS NOT REQUIRED.
Context matters!

Slide 70

Slide 70 text

2012 flashback…
• Scenario: Long-running servers + IDS + generate repairs for detected anomalies.
• Workloads: a day of unfiltered requests to the UVA CS webserver.
THIS PATCH WAS BAD

Slide 71

Slide 71 text

Even a functionality-reducing repair had little practical impact.

                                          Fuzz Tests Failed
    Program    Post-patch requests lost   General        Exploit
    nullhttpd  0.00% ± 0.25%              0 → 0          10 → 0
    lighttpd   0.03% ± 1.53%              1410 → 1410    9 → 0
    php-BAD    0.02% ± 0.02%              3 → 3          5 → 0

Slide 72

Slide 72 text

OPTION 2: DEVELOP TECHNIQUES THAT ARE MORE LIKELY TO GENERALIZE.
How?

Slide 73

Slide 73 text

Challenge your assumptions!
EXAMPLE ASSUMPTION: bug-fixing patches are like kittens: smaller is better!
Image: "Retouched Kitty" by Ozan Kilic, CC2.0 http://www.freestockphotos.biz/stockphoto/9343

Slide 74

Slide 74 text

What if…
• Instead of trying to make small changes, we replaced buggy regions with code that correctly captures the overall desired logic?
• Principle: using human-written code to fix code at a higher granularity level leads to better quality repairs.

Slide 75

Slide 75 text

SEARCHREPAIR: HIGH-QUALITY AUTOMATED BUG REPAIR USING SEMANTIC SEARCH

Slide 76

Slide 76 text

Semantic code search looks for code based on what it should do.
• Keyword: “C median three numbers”
• Semantic:

    Input   Expected
    2,6,8   6
    2,8,6   6
    6,2,8   6
    6,8,2   6
    8,6,2   6
    9,9,9   9

…Generate and validate + Semantic reasoning!
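The input/output table can serve directly as a search query. The sketch below is an assumption-heavy simplification: it runs each candidate on the examples, whereas SearchRepair reasons about candidate code with constraints rather than executing it. The three candidate functions and every name here are hypothetical.

```c
#include <stddef.h>

typedef int (*Candidate)(int a, int b, int c);
typedef struct { int a, b, c, expected; } Example;

/* The input/output examples from the slide. */
static const Example EXAMPLES[] = {
    {2, 6, 8, 6}, {2, 8, 6, 6}, {6, 2, 8, 6},
    {6, 8, 2, 6}, {8, 6, 2, 6}, {9, 9, 9, 9},
};

/* Hypothetical candidate snippets from a code database. */
static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}
static int second_arg(int a, int b, int c) { (void)a; (void)c; return b; }
static int median3(int a, int b, int c) {
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

static const Candidate CANDIDATES[] = { max3, second_arg, median3 };

/* Index of the first of the first n candidates that matches every example,
 * or -1 if none does. */
int semantic_search(const Candidate *candidates, size_t n) {
    size_t n_examples = sizeof EXAMPLES / sizeof EXAMPLES[0];
    for (size_t i = 0; i < n; i++) {
        size_t matched = 0;
        for (size_t j = 0; j < n_examples; j++)
            if (candidates[i](EXAMPLES[j].a, EXAMPLES[j].b, EXAMPLES[j].c)
                == EXAMPLES[j].expected)
                matched++;
        if (matched == n_examples)
            return (int)i;
    }
    return -1;
}
```

`max3` is rejected by the first example and `second_arg` by the second, so only `median3` survives, even though a keyword search would have no way to tell the three apart by behavior.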

Slide 77

Slide 77 text

SearchRepair patches were of much higher quality than those produced by previous techniques.

    Technique     Held out tests passed
    SearchRepair  97.2%
    GenProg       68.7%
    TRPAP         72.1%
    AE            64.2%

Slide 78

Slide 78 text

Slide 79

Slide 79 text

The Three Major Challenges
• Scalability
• Output quality
• Expressive power

Slide 80

Slide 80 text

Symbolic states, with α standing for the angelic value of the expression on line 4:
    Line 4: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = true
    Line 7: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = α > 110
    Line 8: inhibit = 1, up_sep = 11, down_sep = 110, bias = α, PC = α ≤ 110

Slide 81

Slide 81 text

COLLABORATORS MAKE THE WORLD GO ROUND

Slide 82

Slide 82 text

[Venn diagram of repairs by technique.]
Labeled regions: AE: 1; SearchRepair: 20; GenProg: 32; RSRepair: 2.
Other region values, in extraction order: 52, 0, 0, 68, 10, 90, 0, 0, 0.
Totals: GenProg: 287; AE: 159; RSRepair: 247; SearchRepair: 150.