JSNaughty: Recovering Clear, Natural Identifiers from Obfuscated JavaScript Names

Bogdan Vasilescu
September 07, 2017

Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) identifier names are frequently minified and overloaded, both for efficiency and to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available de-obfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when JSNice does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice.

JSNaughty is available online at http://jsnaughty.org.

Transcript

  1. Recovering Clear, Natural Identifiers from Obfuscated (JavaScript) Names

    Bogdan Vasilescu (CMU, ISR), Prem Devanbu (UC Davis), Casey Casalnuovo (UC Davis). @b_vasilescu @devanbu
  2. Today (@b_vasilescu)

        var geom2d = function() {
          var t = numeric.sum;
          function r(n, r) {
            this.x = n;
            this.y = r;
          }
          u(r, {
            P: function e(n) {
              return t([ this.x * n.x, this.y * n.y ]);
            }
          });
          function u(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
          }
          return { V: r };
        }();
  4. Today: data-driven method + tool (@b_vasilescu)

    Minified input:

        var geom2d = function() {
          var t = numeric.sum;
          function r(n, r) {
            this.x = n;
            this.y = r;
          }
          u(r, {
            P: function e(n) {
              return t([ this.x * n.x, this.y * n.y ]);
            }
          });
          function u(n, r) {
            for (var t in r) n[t] = r[t];
            return n;
          }
          return { V: r };
        }();

    Recovered output:

        var geom2d = function() {
          var sum = numeric.sum;
          function Vector2d(x, y) {
            this.x = x;
            this.y = y;
          }
          mix(Vector2d, {
            P: function dotProduct(vector) {
              return sum([ this.x * vector.x, this.y * vector.y ]);
            }
          });
          function mix(dest, src) {
            for (var k in src) dest[k] = src[k];
            return dest;
          }
          return { V: Vector2d };
        }();
  5. Why? (@b_vasilescu)

    • Programs are (also) written to be read

    “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” [Don Knuth]
  6. Why? (@b_vasilescu)

    • Programs are (also) written to be read
    • Well-chosen variable names are critical to source code readability, reusability, maintainability
    • Example tasks:
      • reverse engineering binaries
      • reverse engineering obfuscated JavaScript
      • consistent styling in large, distributed teams
  8. Why? (@b_vasilescu)

    • Programs are (also) written to be read
    • Well-chosen variable names are critical to source code readability, reusability, maintainability [many]
    • Example tasks:
      • reverse engineering binaries
      • reverse engineering obfuscated JavaScript
      • consistent styling in large, distributed teams

    Martin Vechev, “Probabilistic Learning From Big Code”. Keynote at ISSTA 2016
  9. Natural languages are complex

    Tiger, Tiger, burning bright
    In the forests of the night
    What immortal hand or eye,
    Could frame thy fearful symmetry?
  10. English, தமிழ், German

    Can be Rich, Powerful, Expressive ... but “in nature” is mostly Simple, Repetitive, Boring
  11. English, தமிழ், German

    Can be Rich, Powerful, Expressive ... but “in nature” is mostly Simple, Repetitive, Boring

    Statistical Models
  13. The “naturalness of software” thesis

    Programming languages are complex ... but natural programs are simple and repetitive, and this, too, CAN BE EXPLOITED! [Hindle et al, 2011]
  14. Variable Name Guesser (AUTONYM)

    Minified source code:

        function u(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }
  15. Variable Name Guesser (AUTONYM)

    Minified source code:

        function u(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }

    Un-minified source code:

        function mix(dest, src) {
          for (var k in src) dest[k] = src[k];
          return dest;
        }
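  16. Noisy channel translation model

    Goal: recover e from the distorted message f, using Bayes' theorem:

    $$p(e \mid f) = \frac{p(f \mid e)\, p(e)}{p(f)}$$

    Here p(f | e) is the channel model, p(e) is the language model, and p(f) is the probability of the distorted message (constant for a given f).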
  17. Noisy channel translation model

    Goal: recover e from the distorted message f:

    $$p(e \mid f) = \frac{p(f \mid e)\, p(e)}{p(f)}$$

    • p(e): language model
    • p(f | e): translation (channel distortion) model
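
The decision rule on the noisy-channel slides can be sketched in a few lines. This is a toy illustration, not Autonym's or Moses's implementation: the candidate names, the probability tables, and the function name `bestTranslation` are all invented for the example. Because p(f) is constant for a given minified input f, it is enough to maximize p(f | e) · p(e).

```javascript
// Toy noisy-channel decoder: pick the clear name e that maximizes
// p(f | e) * p(e) for an observed minified name f.
// All probabilities below are invented for illustration.
const channelModel = {
  // p(f | e): how likely the clear name e is to be minified to f
  dest: { n: 0.4, r: 0.1 },
  src: { n: 0.1, r: 0.5 },
  node: { n: 0.5, r: 0.05 },
};
const languageModel = { dest: 0.3, src: 0.3, node: 0.1 }; // p(e)

function bestTranslation(f) {
  let best = null;
  let bestScore = -Infinity;
  for (const e of Object.keys(languageModel)) {
    const pFgivenE = (channelModel[e] && channelModel[e][f]) || 0;
    // Sum of logs instead of a product, to avoid underflow on long inputs.
    const score = Math.log(pFgivenE) + Math.log(languageModel[e]);
    if (score > bestScore) {
      bestScore = score;
      best = e;
    }
  }
  return best; // null when no candidate can produce f
}

console.log(bestTranslation("n")); // dest
console.log(bestTranslation("r")); // src
```

Note that `node` has the highest channel probability for `n`, but its low language-model probability lets `dest` win: the two models are weighed against each other.
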
  18. Translating minified (f) to clear JS (e)

    • Clear code corpus: trains the language model
    • Aligned clear-minified code corpus: trains the translation model
  19. Translating minified (f) to clear JS (e)

    • Clear code corpus: trains the language model
    • Aligned clear-minified code corpus: trains the translation model
    • Corpus source: GitHub + minifier
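
A minimal sketch of how an aligned clear-minified corpus could be produced from clear code plus a minifier. The function `toyMinify` and its hand-written rename map are hypothetical stand-ins for a real minifier such as UglifyJS, which chooses the short names itself; `buildParallelCorpus` is likewise an invented name.

```javascript
// Stand-in for a real minifier: renames identifiers using a fixed,
// hand-written map (an assumption made to keep the example self-contained).
function toyMinify(clearLine, renameMap) {
  return clearLine.replace(/[A-Za-z_$][\w$]*/g, (id) =>
    Object.prototype.hasOwnProperty.call(renameMap, id) ? renameMap[id] : id
  );
}

// Pair every clear line with its minified counterpart.
function buildParallelCorpus(clearLines, renameMap) {
  return clearLines.map((clear) => ({
    minified: toyMinify(clear, renameMap),
    clear,
  }));
}

const corpus = buildParallelCorpus(
  [
    "function mix(dest, src) {",
    "for (var k in src) dest[k] = src[k];",
    "return dest;",
    "}",
  ],
  { mix: "u", dest: "n", src: "r", k: "t" }
);

console.log(corpus[0]);
// { minified: 'function u(n, r) {', clear: 'function mix(dest, src) {' }
```

The clear side alone would feed the language model, while the pairs would feed the translation model.
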
  20. Alignment

    EN: I know what you named your identifiers!
    NL: Ik weet wat je je ID's genoemd!

    Natural language: non-trivial alignment
    • Reordering
    • Different length
    • Dropped words
  22. Alignment

    EN: I know what you named your identifiers!
    NL: Ik weet wat je je ID's genoemd!

    Natural language: non-trivial alignment
    • Reordering
    • Different length
    • Dropped words

        function u(n, r) {
        function mix(dest, src){
  23. Alignment

    EN: I know what you named your identifiers!
    NL: Ik weet wat je je ID's genoemd!

    Natural language: non-trivial alignment
    • Reordering
    • Different length
    • Dropped words

        function u(n, r) {
        function mix(dest, src){

    Minification: straightforward alignment
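
Why the alignment is straightforward for minified code: minification only renames identifiers, so the two lines have the same tokens in the same order. A minimal sketch, assuming a deliberately simple regex tokenizer (a real tool would use a proper JavaScript lexer); `tokenize` and `alignNames` are illustrative names.

```javascript
// Minimal tokenizer: identifiers, numbers, or single punctuation characters.
function tokenize(line) {
  return line.match(/[A-Za-z_$][\w$]*|\d+|\S/g) || [];
}

const IDENTIFIER = /^[A-Za-z_$][\w$]*$/;

// Pair tokens position by position and keep the identifier pairs that
// differ: those are the renamed names.
function alignNames(minifiedLine, clearLine) {
  const f = tokenize(minifiedLine);
  const e = tokenize(clearLine);
  if (f.length !== e.length) {
    throw new Error("token counts differ; lines are not a pure renaming");
  }
  const pairs = [];
  f.forEach((tok, i) => {
    if (tok !== e[i] && IDENTIFIER.test(tok) && IDENTIFIER.test(e[i])) {
      pairs.push([tok, e[i]]);
    }
  });
  return pairs;
}

const pairs = alignNames("function u(n, r) {", "function mix(dest, src) {");
console.log(pairs); // [ [ 'u', 'mix' ], [ 'n', 'dest' ], [ 'r', 'src' ] ]
```

No reordering, length change, or dropped words can occur, which is exactly what the natural-language example above has to cope with.
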
  24. Complications

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }
  26. Complications (Autonym)

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }
  27. Complications (Autonym)

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }

    (1) Overloading

        function mix(dest, src) { }
  28. Complications (Autonym)

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }

    (1) Overloading: handled with scope analysis

        function mix(dest, src) { }
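
A minimal sketch of how scope analysis can separate the two meanings of `r` in `function r(n, r)`. The scope tree is written by hand here (a real tool would derive it from a parser), and the `name@scopeId` key format and the helper names `resolve` and `bindingKey` are assumptions for illustration, not Autonym's actual representation.

```javascript
// Hand-built scope tree for:
//   function r(n, r) { for (var t in r) n[t] = r[t]; return n; }
const globalScope = { id: 0, parent: null, declares: new Set(["r"]) };
const functionScope = {
  id: 1,
  parent: globalScope,
  declares: new Set(["n", "r", "t"]),
};

// Walk outward until a scope that declares `name` is found.
function resolve(name, scope) {
  for (let s = scope; s !== null; s = s.parent) {
    if (s.declares.has(name)) return s;
  }
  return null; // not declared in any known scope
}

// Scope-qualified key, so overloaded names become distinct symbols.
function bindingKey(name, scope) {
  const owner = resolve(name, scope);
  return owner === null ? name : `${name}@${owner.id}`;
}

console.log(bindingKey("r", globalScope)); // r@0 (the function)
console.log(bindingKey("r", functionScope)); // r@1 (the parameter)
```

With distinct keys, the function `r` and the parameter `r` can receive different recovered names such as `mix` and `src`.
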
  29. Complications (Autonym)

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }

    (2) Consistency (sentence-by-sentence translation)

        function mix(dest, src) {
          for (var k in list) dest[k] = list[k];
          return dest;
        }
  30. Complications (Autonym)

        function r(n, r) {
          for (var t in r) n[t] = r[t];
          return n;
        }

    (2) Consistency (sentence-by-sentence translation): handled with language model scoring

        function mix(dest, src) {
          for (var k in list) dest[k] = list[k];
          return dest;
        }

    Idea: try all candidates, and let the language model decide which is more natural, on average, across ALL lines.
  31. Evaluation

    • Held-out test set: 2,149 files
    • Comparison to JSNice [Raychev et al, 2015]
    • Metric: % names recovered
  32. Evaluation

    • Held-out test set: 2,149 files
    • Comparison to JSNice [Raychev et al, 2015]
    • Metric: % names recovered
    • Global vs. local names (globals don’t change)

    Minified:

        var geom2d = function() {
          var t = numeric.sum;
          function r(n, r) {
            this.x = n;
            this.y = r;
          }
          ...

    Original:

        var geom2d = function() {
          var sum = numeric.sum;
          function Vector2d(x, y) {
            this.x = x;
            this.y = y;
          }
          ...
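
A minimal sketch of the "% names recovered" metric for local names, assuming exact string match counts as recovered (the slides do not spell out the matching rule). The `name@scope` keys, the `predicted` values, and the function name `fractionRecovered` are invented for illustration. Globals such as `geom2d` and `numeric` are left out because the minifier does not rename them.

```javascript
// Fraction of bindings whose predicted name exactly matches the original.
function fractionRecovered(originalNames, predictedNames) {
  const bindings = Object.keys(originalNames);
  if (bindings.length === 0) return 0;
  const hits = bindings.filter(
    (b) => predictedNames[b] === originalNames[b]
  );
  return hits.length / bindings.length;
}

// Local names from the geom2d example, keyed by minified name and scope.
const original = { "t@1": "sum", "r@1": "Vector2d", "n@2": "x", "r@2": "y" };
// Hypothetical tool output: three of the four names are recovered.
const predicted = { "t@1": "sum", "r@1": "Vector2d", "n@2": "x", "r@2": "b" };

const localScore = fractionRecovered(original, predicted);
console.log(localScore); // 0.75
```
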
  33. [Figure: % names recovered (2,149 test files) by renaming technique, for Autonym and JSNice, shown separately for local names and all names.]
  34. Joining forces

    [Figure: Autonym file accuracy plotted against JSNice file accuracy, both on a 0 to 1 scale, with a frequency legend.]
  35. [Figure: % names recovered (2,149 test files) by renaming technique, for Autonym, JSNice, and JSNaughty, shown separately for local names and all names.]
  36. Examples

        module.exports = http.createServer(function(e, r) {
          var t;
          var i = new stream.Stream();
          ...
          var n = "";
          csv().fromStream(e).on("data", function(e, r) {
            if (!t) { ... }
            var a = {};
            (.zip(t, e)).each(function(e) { ... });
            i.emit("data", n + JSON.stringify(a));
            n = ",";
          }).on("end", function(e) {
            i.emit("data", "]}");
            i.emit("end");
          }).on("error", function(e) {
            i.emit("error", e);
            console.log("csv error", e.message);
          });
        });

    Original      AUTONYM    JSNICE     JSNAUGHTY
    error         err        err        err
    tuple         tuple      key        tuple
    headers       headers    headers    headers
    jsonStream    i          s          s
    req           req        q          req
    res           res        r          res
    separator     data       sep        sep
  37. JSNaughty: https://github.com/bvasiles/jsNaughty

    [Diagram components: input program (minified); pre-processor; Moses SMT (Autonym); optional JSNice; post-processor; output program (un-minified). Model training: clear-text corpus for the language model; aligned clear-text/minified corpus for the translation model.]

    • Identifier renaming using SMT, e.g., minified JS, decompiled C
    • Generic, mature off-the-shelf technology (Moses)
    • Language dependence restricted to tokenization and scope analysis
      • dependency parse in JSNice
    • Promising results: ~50% better than JSNice on local names, on average

    This material is based upon work supported by the National Science Foundation under Grant No. 1414172.
  38. Machine translation for code

    • Oda et al. (ASE ’15): code to pseudocode

        # Python
        if n % 3 == 0:

        Pseudo-code: if n is divisible by 3

    • Karaivanov et al. (Onward! ’14): porting C# to Java

        // C#
        Console.WriteLine("Hello World!");

        // Java
        System.out.println("Hello World!");
  39. Machine translation for code

    • Oda et al. (ASE ’15): code to pseudocode

        # Python
        if n % 3 == 0:

        Pseudo-code: if n is divisible by 3

    • Karaivanov et al. (Onward! ’14): porting C# to Java

        // C#
        Console.WriteLine("Hello World!");

        // Java
        System.out.println("Hello World!");

    • Nguyen et al. (FSE ’13, ASE ’15): porting Java to C#

        // Java
        public void findResultEdges() {
          for (Iterator it = dirEdgeList.iterator(); it.hasNext();) {
            DirectedEdge de = (DirectedEdge) it.next();…
          }
        }

        // C#
        public void FindResultEdges() {
          foreach (DirectedEdge de in _dirEdgeList){…}
        }