Document Recovery

Imagine that you had been working on a document using your favourite application, but one day something happened to the file and you cannot open it anymore: every time you try to open the document, the application crashes or does not load the file. You would like to have your document back because it has some important data in it and you would not like to sacrifice much of the document contents in order to make it work again.

In this seminar I will talk about the reasons why documents can cause software to operate incorrectly, the symbolic execution techniques that can be used to tackle this problem, and my research project: a document recovery approach that is independent of the input format.

Imperial ACM

March 07, 2014

Transcript

  1. Tomasz Kuchta, Imperial College London
     Imperial College ACM Student Chapter Seminar, 7th March 2014
  2. Name: Tomasz Kuchta (http://www.doc.ic.ac.uk/~tk2512/)
     From: Kraków (Cracow), Poland
     Before: MSc in Computer Science (Cracow University of Technology); work as a software engineer (telecommunications)
     Interests: Music (https://soundcloud.com/gitaronek), Photography (http://www.flickr.com/photos/_tomek_/)
  3. Problem overview
     Symbolic execution: overview, basic definitions, concolic execution
     Document Recovery: proposed solution, challenges
  4. None
  5. The user is unable to read / edit broken documents since they cause abnormal application termination or do not load
     Documents can get corrupted, be malformed or malicious
     Such bugs are highly user-visible
     Bad input accounts for a large number of security vulnerabilities
     Example: Pine, a text mode e-mail client
     A message with a special "From:" field crashes the program:
         From: "\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\""@host.fubar
  6. Program:
         int array[4] = {100, 200, 300, 400};
         int tmp[4] = {10, 20, 30, 40};
         int offset = <VALUE FROM THE DOCUMENT>;
         array[offset] = 0;
         for (int i = 0; i < 4; ++i) {
             array[i] = array[i] / tmp[i];
         }
     Document: supplies the value of offset
  7. Program:
         int array[4] = {100, 200, 300, 400};
         int tmp[4] = {10, 20, 30, 40};
         int offset = <VALUE FROM THE DOCUMENT>;
         array[offset] = 0;
         for (int i = 0; i < 4; ++i) {
             array[i] = array[i] / tmp[i];
         }
     Document: offset = 4
     Memory before the write: array = 100 200 300 400, tmp = 10 20 30 40
     Memory after the write:  array = 100 200 300 400, tmp = 0 20 30 40
  8. Program:
         int array[4] = {100, 200, 300, 400};
         int tmp[4] = {10, 20, 30, 40};
         int offset = <VALUE FROM THE DOCUMENT>;
         array[offset] = 0;                  // possible buffer overflow
         for (int i = 0; i < 4; ++i) {
             array[i] = array[i] / tmp[i];   // possible division by zero
         }
     Document: offset = 4
     Memory before the write: array = 100 200 300 400, tmp = 10 20 30 40
     Memory after the write:  array = 100 200 300 400, tmp = 0 20 30 40
  9. Truncate the file
         Possible loss of user data
     Test the file against a specification
         Need to create a specification for each format
         What if the "buggy" file is correct?
     Try to guess the right value
         Might be hard for highly structured formats
     Or …
  10. Is it possible to fix a malformed document, without assuming any input format, in a way that preserves the original content as much as possible?
  11. None
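  12. x, y, z -> symbolic
          if (x > 5) {
              if (y == 10) { }
          }
          if (z != 20) { }
      Execution tree: the first branch is x > 5 or x ≤ 5; under x > 5 the next branch is y = 10 or y ≠ 10; each of the three resulting paths then branches on z = 20 or z ≠ 20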
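  13. x, y, z -> symbolic
          if (x > 5) {
              if (y == 10) { }
          }
          if (z != 20) { }
      Execution tree: the first branch is x > 5 or x ≤ 5; under x > 5 the next branch is y = 10 or y ≠ 10; each of the three resulting paths then branches on z = 20 or z ≠ 20
      Path condition: (x > 5) ∧ (y = 10) ∧ (z = 20)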
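  14. x, y, z -> symbolic
          if (x > 5) {
              if (y == 10) { }
          }
          if (z != 20) { }
      Execution tree: the first branch is x > 5 or x ≤ 5; under x > 5 the next branch is y = 10 or y ≠ 10; each of the three resulting paths then branches on z = 20 or z ≠ 20
      Path condition: (x ≤ 5) ∧ (z ≠ 20)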
  15. Path condition (PC): a conjunction of constraints on symbolic variables encountered on a given execution path
      SMT solver: an extension of a SAT solver to constraints over richer theories
          A software tool
          Answers the question of satisfiability
          Returns a counterexample: a set of values that satisfy the constraints

      Path condition                    Possible counterexample for {x, y, z}
      (x > 5) ∧ (y = 10) ∧ (z = 20)     x = 7, y = 10, z = 20
      (x ≤ 5) ∧ (z ≠ 20)                x = 0, y = 0, z = 0
  16. Concolic execution / testing is a mix of concrete (standard) execution and symbolic execution
      Use concrete values on decision points
      Gather symbolic constraints
      Tackles the problem of state explosion and reaching deep states
  17. x, y, z -> symbolic
          if (x > 5) {
              if (y == 10) { }
          }
          if (z != 20) { }
      Execution tree: the first branch is x > 5 or x ≤ 5; under x > 5 the next branch is y = 10 or y ≠ 10; each of the three resulting paths then branches on z = 20 or z ≠ 20
      Path condition: (x > 5) ∧ (y = 10) ∧ (z = 20)
      Concrete values: x = 7, y = 10, z = 20
  18. This work is supported by Microsoft Research through its PhD Scholarship Programme
      A joint project with Dr Cristian Cadar, Dr Miguel Castro and Dr Manuel Costa
  19. None

  20. Collect alternative execution paths
      Explore alternative paths in concolic manner
      Original input (document): x = 5, y = 5, z = 5
      Crash path's Path Condition: (x ≥ 5) ∧ (y ≥ 5) ∧ (z ≥ 5)
      P3 path's Path Condition: (x ≥ 5) ∧ (y ≥ 5) ∧ (z < 5)
      New input (recovery candidate): x = 5, y = 5, z = 0
  21. Collect alternative execution paths
      Explore alternative paths in concolic manner
  22. Optimisations
      Using concolic execution
      Postponing SMT solver queries
      Collecting only the last N alternative paths
      Optimising creation of recovery candidates
      Partial symbolic execution: taint tracking to select the bytes to treat as symbolic
  23. Taint tracking
          int x = Document[1];
          int y = Document[2];
          int a = x;            // {1}
          int b = y;            // {2}
          if (a > 5)            // {1}
          {
              int c = b;        // {2} or {1,2}
              int d = a + b;    // {1,2}
          }
      Figure labels: Document; Data flow; Data & Control flow
  24. Tested benchmarks:
      pr: pagination utility for text files
      pine: text mode e-mail client
      dwarfdump: displays debug information of binary files (tested on executable files)
      readelf: similar to dwarfdump (tested on object files)
  25. %PDF-1.7
      ...
      3 0 obj
      << /Type /Page /Parent 2 0 R /Resources << /Font << /F2 11 0 R >> >> /Contents 4 0 R >>
      endobj
      4 0 obj % page content
      << /Length 44 >>
      stream
      BT 70 50 TD /F2 12 Tf (Hello, world!) Tj ET
      endstream
      endobj
      9 0 obj
      << /Type /ObjStm ... >>
      stream
      11 0 << /Type /Font ... >>
      endstream
      endobj
      12 0 obj
      << /Type /XRef ... >>
      stream
      00 0000 FFFF 01 000a 0000 ...
      endstream
      endobj
      startxref
      570
      %%EOF
  26. %PDF-1.7
      ...
      3 0 obj
      << /Type /Page /Parent 2 0 R /Resources << /Font << /F2 11 0 R >> >> /Contents 4 0 R >>
      endobj
      4 0 obj % page content
      << /Length 44 >>
      stream
      BT 70 50 TD /F2 12 Tf (Hello, world!) Tj ET
      endstream
      endobj
      9 0 obj
      << /Type /ObjStm ... >>
      stream
      -1 0 << /Type /Font ... >>
      endstream
      endobj
      12 0 obj
      << /Type /XRef ... >>
      stream
      00 0000 FFFF 01 000a 0000 ...
      endstream
      endobj
      startxref
      570
      %%EOF
  27. [1] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI’08, Dec. 2008.
      [2] P. D. Marinescu and C. Cadar. make test-zesti: A symbolic execution solution for improving regression testing. In ICSE’12, June 2012.
      [3] F. Long, V. Ganesh, M. Carbin, S. Sidiroglou, and M. Rinard. Automatic input rectification. In ICSE’12, June 2012.
  28. Software faults and “broken” documents
      Symbolic execution technique
          Symbolic variables
          Path Condition and SMT solver
          Concolic execution / testing
      Document recovery solution
          Approach based on concolic execution
          Various performance optimisations
          Partial symbolic execution of the bytes selected by taint tracking