Slide 1

Slide 1 text

When BigData hits "BigCode" by Julian Viereck 
 JSZurich, 25. Feb 2015

Slide 2

Slide 2 text

Hi • My name is Julian Viereck • JavaScript developer since 2008 • Contribute to OpenSource (e.g. Firefox) • Master CS student at ETH Zurich • Machine Learning, Software Analysis & more

Slide 3

Slide 3 text

Today • Present research work from ETH Zurich • ‘Predicting Program Properties from “Big Code”’ by Veselin Raychev, Martin Vechev, Andreas Krause,
 POPL’15 • http://www.srl.inf.ethz.ch/papers/jsnice15.pdf • Bridge Software Analysis and Machine Learning

Slide 4

Slide 4 text

http://jsnice.org/

Slide 5

Slide 5 text

http://jsnice.org/ ← Types ← Names Predicts program properties by learning from existing code.

Slide 6

Slide 6 text

Massive Code Available • Maybe we can learn from existing programs? • Amount of available code is growing Graphs from: http://githut.info/

Slide 7

Slide 7 text

The DARPA “big code” initiative, […] , seeks to leverage software analysis and big data analytics to improve the way software is built, debugged and verified. Big Code Initiative http://www.datanami.com/2014/05/05/darpa-launches-big-code-initiative/

Slide 8

Slide 8 text

How does this work?

Slide 9

Slide 9 text

Aliens want to learn JS • Assume you are an alien observing earth • You want to learn about the top programming language on earth • Of course that’s JavaScript! • Your task: How is ‘writeFileSync’ used? • Context, Argument names, Argument types • Talking to humans complicated, but <3 analysing data!

Slide 10

Slide 10 text

How to learn? Question Observe Pattern? Apply
 Predict

Slide 11

Slide 11 text

Observed Patterns • Context: fs.writeFileSync • Types: writeFileSync(…): Function, writeFileSync(Str, Str) • Names: writeFileSync(file, data)
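The observation step can be sketched in a few lines of JavaScript. This is only an illustration, not how JSNice does it: the observedCalls list and both helper functions are hypothetical, and the sample data is made up to mirror the writeFileSync(file, data) pattern on the slide.

```javascript
// Hypothetical observations collected from existing code.
// Each entry records the argument names seen at one call site.
const observedCalls = [
  { callee: "writeFileSync", argNames: ["file", "data"] },
  { callee: "writeFileSync", argNames: ["file", "data"] },
  { callee: "writeFileSync", argNames: ["path", "content"] },
  { callee: "readFileSync", argNames: ["file"] },
];

// Count how often each argument-name pattern occurs for one callee.
function countNamePatterns(calls, callee) {
  const counts = new Map();
  for (const call of calls) {
    if (call.callee !== callee) continue;
    const key = call.argNames.join(", ");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Return the most frequent pattern, or null if nothing was observed.
function mostFrequentPattern(counts) {
  let best = null;
  let bestCount = 0;
  for (const [pattern, count] of counts) {
    if (count > bestCount) {
      best = pattern;
      bestCount = count;
    }
  }
  return best;
}

const nameCounts = countNamePatterns(observedCalls, "writeFileSync");
const predictedNames = mostFrequentPattern(nameCounts);
console.log(predictedNames);
```

Plain counting like this ignores context. The later slides replace it with scores over relations between program elements.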

Slide 12

Slide 12 text

How to automate process? Information Extract Apply “New” Code Machine Learning Existing Code “BigCode”

Slide 13

Slide 13 text

Use Machine Learning • Need to formalise problem precise (using math) • Pattern recognition art in Machine Learning • How to represent program “elements” • Idea: • Model program as dependency graph • Find most likely assignments

Slide 14

Slide 14 text

Dependency Graph • What is known, what is unknown? Known Properties: 0 [] length … Unknown Properties: e t n r i
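A minimal sketch of such a dependency graph as a data structure, assuming a minified loop like `for (i = 0; i < t.length; i++)` as input. The node ids, relation labels, and helper names are hypothetical and chosen for illustration; they are not the representation used by JSNice.

```javascript
// Nodes are program elements. Known nodes keep their value,
// unknown nodes (minified names) are the ones to predict.
const dependencyGraph = {
  nodes: [
    { id: "i", known: false },
    { id: "t", known: false },
    { id: "0", known: true },
    { id: "length", known: true },
  ],
  // Edges relate two elements. The relation labels are made up.
  edges: [
    { a: "i", b: "0", relation: "L=R" },      // i = 0
    { a: "t", b: "length", relation: "L.R" }, // t.length
    { a: "i", b: "t", relation: "L<R.length" }, // i < t.length
  ],
};

// Ids of all nodes whose property still has to be predicted.
function unknownIds(graph) {
  return graph.nodes.filter((node) => !node.known).map((node) => node.id);
}

// Ids of all nodes that share an edge with the given node.
function neighborsOf(graph, id) {
  const result = [];
  for (const edge of graph.edges) {
    if (edge.a === id) result.push(edge.b);
    else if (edge.b === id) result.push(edge.a);
  }
  return result;
}

console.log(unknownIds(dependencyGraph));
console.log(neighborsOf(dependencyGraph, "i"));
```

The known neighbours of an unknown node are what give the prediction its context.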

Slide 15

Slide 15 text

L=_.R L+=R L

Slide 16

Slide 16 text

Find Best Assignment • Given scores for each relation, find global optimum • Scores learned from existing code • Figure (relation scores 1, 3, 2): Not local optima but yields better global score
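To make the difference between local and global scores concrete, here is a small brute-force sketch. All weights, candidate names, and relation labels are invented for the example. In the real system the scores are learned from existing code, and exhaustive enumeration like this would not scale.

```javascript
// Hypothetical learned scores, keyed by "relation|labelA|labelB".
const relationWeights = new Map([
  ["L=R|index|0", 3],
  ["L=R|i|0", 2],
  ["L<R.length|i|array", 4],
  ["L<R.length|index|array", 1],
  ["L<R.length|i|list", 1],
  ["L<R.length|index|list", 1],
]);

// Candidate names for the two unknown elements e and t.
const nameCandidates = { e: ["i", "index"], t: ["array", "list"] };

// Relations between elements; "0" is a known element.
const scoredEdges = [
  { a: "e", b: "0", relation: "L=R" },
  { a: "e", b: "t", relation: "L<R.length" },
];

// Known elements are their own label, unknown ones use the assignment.
function labelOf(assignment, id) {
  return Object.prototype.hasOwnProperty.call(assignment, id) ? assignment[id] : id;
}

// Sum the scores of all relations under one complete assignment.
function totalScore(assignment) {
  let sum = 0;
  for (const { a, b, relation } of scoredEdges) {
    const key = `${relation}|${labelOf(assignment, a)}|${labelOf(assignment, b)}`;
    sum += relationWeights.get(key) || 0;
  }
  return sum;
}

// Enumerate every combination of candidates and keep the best one.
function bestAssignment() {
  const ids = Object.keys(nameCandidates);
  let best = null;
  let bestScore = -Infinity;
  function search(position, partial) {
    if (position === ids.length) {
      const score = totalScore(partial);
      if (score > bestScore) {
        bestScore = score;
        best = { ...partial };
      }
      return;
    }
    for (const label of nameCandidates[ids[position]]) {
      partial[ids[position]] = label;
      search(position + 1, partial);
    }
    delete partial[ids[position]];
  }
  search(0, {});
  return { assignment: best, score: bestScore };
}

const globalBest = bestAssignment();
console.log(globalBest);
```

Looking at the first relation alone, "index" scores higher (3) than "i" (2). The best overall assignment still picks e = "i" and t = "array", because the second relation more than makes up the difference (2 + 4 = 6).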

Slide 17

Slide 17 text

Practical Issues • Large set of feature combinations: • # indexed JS files: 324’501 • ~ 7’000’000 features for names • ~ 70’000 features for types ➡ 10 h / 1 h to learn on 32 core Xeon machine • But have to do this only once, reuse results

Slide 18

Slide 18 text

Practical Issues • Finding the global optimum takes too long • Search greedily for local optima • Only look at features that have a high score • With these adjustments • Prediction works quite fast
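The greedy idea can be sketched as follows: start from some assignment and repeatedly change one element at a time whenever that raises the total score. This is a generic coordinate-wise search written for illustration, with invented weights and names. It is not the inference procedure from the paper, and it leaves out the pruning of low-scoring features.

```javascript
// Hypothetical learned scores, keyed by "relation|labelA|labelB".
const pairScores = new Map([
  ["L=R|index|0", 3],
  ["L=R|i|0", 2],
  ["L<R.length|i|array", 4],
  ["L<R.length|index|array", 1],
  ["L<R.length|i|list", 1],
  ["L<R.length|index|list", 1],
]);

// Candidate names per unknown element; the first one is the start value.
const options = { e: ["index", "i"], t: ["array", "list"] };

// Relations between elements; "0" is a known element.
const relations = [
  { a: "e", b: "0", relation: "L=R" },
  { a: "e", b: "t", relation: "L<R.length" },
];

// Total score of a complete assignment over all relations.
function scoreOf(assignment) {
  let sum = 0;
  for (const { a, b, relation } of relations) {
    const labelA = a in options ? assignment[a] : a;
    const labelB = b in options ? assignment[b] : b;
    sum += pairScores.get(`${relation}|${labelA}|${labelB}`) || 0;
  }
  return sum;
}

// Change one element at a time, keeping a change only if it improves the score.
function greedySearch(maxPasses = 10) {
  const current = {};
  for (const id of Object.keys(options)) current[id] = options[id][0];
  let currentScore = scoreOf(current);
  for (let pass = 0; pass < maxPasses; pass++) {
    let improved = false;
    for (const id of Object.keys(options)) {
      for (const label of options[id]) {
        const previous = current[id];
        current[id] = label;
        const score = scoreOf(current);
        if (score > currentScore) {
          currentScore = score;
          improved = true;
        } else {
          current[id] = previous; // no gain, undo the change
        }
      }
    }
    if (!improved) break; // reached a local optimum
  }
  return { assignment: current, score: currentScore };
}

const greedyResult = greedySearch();
console.log(greedyResult);
```

Each pass only evaluates one change at a time instead of every combination, which is what makes it fast. The price is that, depending on the starting assignment, the search can stop in a local optimum that is worse than the global one.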

Slide 19

Slide 19 text

Evaluation TYPO!

Slide 20

Slide 20 text

Evaluation TYPO! Nice example: takes full context into account Types missing :(

Slide 21

Slide 21 text

Where to use this? • Code Editor: • Code completion • Predict types based on names • Type Checking: • Provide hints for type inference • Optimal for cloud based service

Slide 22

Slide 22 text

Related Projects

Slide 23

Slide 23 text

Natural Language Concepts Applied To Programs http://www.codota.com/

Slide 24

Slide 24 text

Nice2Predict http://nice2predict.org/

Slide 25

Slide 25 text

More Background • About JSNice: • http://www.srl.inf.ethz.ch/jsnice.php • Programming Tools based on Big Data: • http://www.srl.inf.ethz.ch/spas.php • https://www.youtube.com/watch?v=-_CvQeXbVGg

Slide 26

Slide 26 text

Julian Viereck
 @jviereck
 +JulianViereck Thanks! Any questions?