How To Quickly Understand Millions of Lines of Code

How to Quickly Understand Millions of Lines of Code? (a
case for static analysis) http://talks.robotlolita.me

Hi, I’m Quil. I like designing programming languages and tools.
(aka @robotlolita)

And I want to work on tools to help people
understand their programs.

Systems begin small and cute. meow~

And then, they grow. A lot. Hello, tiny human.

Reading the source to understand it isn’t feasible anymore. Just
9000 books to go...

This is the perfect case for software analysis tools! We
already rely on Dialyzer, Cover, Xref, etc.

And so I came up with an idea for a
project that I called Glass. It means “ice cream” in Swedish.

It’s also a backronym: Generic Language Analysis & Search System

One of the primary use cases is “semantic code search”
Where’s “lists:map/2” used?

“I want code search that is fast enough to be used in an IDE, easy to use,
and scalable for codebases of any size!” Let me have my cake and eat it, too!

Glass’s approach is based on pattern matching & relational logic.
influenced by miniKanren

Let’s say you’re looking for where ct:pal/2 is called: “I’m
looking for code like this”

But what if you want very specific calls? “I’m looking
for code like this” “But only if this holds”

To evaluate queries, we unify patterns against every node in
our AST. progress: 1 of 9 million
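
To make the idea concrete, here is a minimal sketch of matching a pattern against every node of a toy AST. It is not Glass's implementation: the nested-tuple AST shape, the `Var` class, and the `unify`, `walk`, and `search` names are all assumptions made for illustration, and `unify` is simplified one-way matching rather than full relational unification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A pattern variable that can bind to any subtree (hypothetical helper)."""
    name: str


def unify(pattern, node, bindings):
    """Return extended bindings if pattern matches node, otherwise None."""
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            # A variable used twice must bind to the same subtree both times.
            return bindings if bindings[pattern.name] == node else None
        return {**bindings, pattern.name: node}
    if isinstance(pattern, tuple) and isinstance(node, tuple):
        if len(pattern) != len(node):
            return None
        for sub_pattern, sub_node in zip(pattern, node):
            bindings = unify(sub_pattern, sub_node, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == node else None


def walk(node):
    """Yield every node in the tree, depth first."""
    yield node
    if isinstance(node, tuple):
        for child in node:
            yield from walk(child)


def search(pattern, ast, where=lambda bindings: True):
    """Yield the bindings of each node that matches pattern and satisfies where."""
    for node in walk(ast):
        bindings = unify(pattern, node, {})
        if bindings is not None and where(bindings):
            yield bindings


# A toy AST: ("call", Module, Function, Arg1, Arg2)
ast = ("module", "demo",
       ("call", "ct", "pal", ("str", "starting"), ("list",)),
       ("call", "lists", "map", ("var", "F"), ("var", "Xs")),
       ("call", "ct", "pal", ("str", "done"), ("list",)))

# "I'm looking for code like this"
pattern = ("call", "ct", "pal", Var("Format"), Var("Args"))
matches = list(search(pattern, ast))

# "But only if this holds"
only_done = list(search(pattern, ast,
                        where=lambda b: b["Format"] == ("str", "done")))
```

Note that `search` visits every node, so its cost grows with the size of the tree.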

+ Ignore this match

This approach is generally efficient, and queries are easier to
write.

Compare to the Datalog approach:

“If only I had time to work on this cool
idea…”

A colleague and I built a prototype during an OSS
hackathon at Klarna.

We ran queries against Glass’ own codebase… They completed in
less than 1 second.

But we want Glass to work on BIG codebases, too...

So we ran it on our biggest codebase:
• > 1 million lines of code
• > 6000 Erlang modules
• > 250 Erlang applications
• Maintained by several teams over many years

...and things didn’t go so smoothly. Some queries didn’t even
finish!

Can we make it go faster?

ONE BIG PROBLEM: Search time is proportional to the size
of the codebase.

I know! Let’s use indexes!

SECOND BIG PROBLEM: Code changes frequently, and indexing is expensive.

Visiting every node is bad!!! Oh no, the code changed
AGAIN! Indexing!!! Invalidation :(

Let’s take a step back. How do we want people
to use Glass?

We expect users to search from their IDE, while editing
code.

“I’m looking for code that looks like this, but only
in some contexts.”

“Where do we still have debug logs?” “fully qualified function
references!”

“Are we still accessing record_a directly?” “record names look good
for indexes?”

“Have I seen this pattern before…?” “fully qualified function references
will likely be here.”

“Did I forget type specs anywhere?” “function definitions and other
static ops might not be so common...”

Queries have plenty of static information we can infer indexes
from!
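
As a rough sketch of what "inferring an index from a query" could look like: if the pattern names a fully qualified function, that (module, function) pair can serve as a lookup key, and only the matching nodes need to be visited. The `Var` class, the tuple-shaped nodes, and the `index_key`, `build_index`, and `candidates` functions are assumptions for illustration, not Glass's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A pattern variable (hypothetical helper)."""
    name: str


def index_key(term):
    """Return a (module, function) key if term is a fully qualified call.

    Works for both AST nodes and query patterns. Returns None when the
    module or function position is not a literal name (for example a Var).
    """
    if (isinstance(term, tuple) and len(term) >= 3 and term[0] == "call"
            and isinstance(term[1], str) and isinstance(term[2], str)):
        return (term[1], term[2])
    return None


def build_index(nodes):
    """Map each (module, function) key to the nodes that call it."""
    index = defaultdict(list)
    for node in nodes:
        key = index_key(node)
        if key is not None:
            index[key].append(node)
    return index


def candidates(pattern, index, all_nodes):
    """Use the index when the query has static information, else scan everything."""
    key = index_key(pattern)
    return index.get(key, []) if key is not None else all_nodes


nodes = [
    ("call", "ct", "pal", ("str", "starting")),
    ("call", "lists", "map", ("var", "F")),
    ("call", "ct", "pal", ("str", "done")),
]
index = build_index(nodes)

# Static module and function: only two nodes need to be unified.
narrowed = candidates(("call", "ct", "pal", Var("Format")), index, nodes)

# Module is a variable: nothing to infer, so we fall back to a full scan.
full_scan = candidates(("call", Var("M"), "pal", Var("Format")), index, nodes)
```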

Caitlin Sadowski et al. made similar observations at Google.

Sadowski et al.’s work also adds: Queries are generally bounded
to “known” locations, rather than spanning all code.

New knowledge!
• Static information in queries --- infer indexes from it!
• Queries are generally localised --- try to use location for ranking results! --- stream results to improve perceived performance!
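
One way to picture location-based ranking plus streaming is a generator that searches files nearest to the one being edited first and yields each match as soon as it is found. This is only a sketch under assumptions: the path-distance heuristic, the file paths, and the `distance` and `stream_results` names are made up for illustration.

```python
from pathlib import PurePosixPath


def distance(a, b):
    """Count the path components that differ between two file paths."""
    parts_a, parts_b = PurePosixPath(a).parts, PurePosixPath(b).parts
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return (len(parts_a) - common) + (len(parts_b) - common)


def stream_results(current_file, files, matches_in):
    """Yield (path, match) pairs lazily, closest files first."""
    ordered = sorted(files, key=lambda path: (distance(current_file, path), path))
    for path in ordered:
        for match in matches_in(path):
            yield path, match


files = ["apps/ship/src/c.erl", "apps/pay/src/b.erl", "apps/pay/src/a.erl"]
current = "apps/pay/src/a.erl"

# matches_in stands in for running the query on one file.
results = stream_results(current, files, lambda path: ["hit"])
first = next(results)          # available before the other files are searched
rest = [path for path, _ in results]
```

Because `stream_results` is a generator, an editor could show `first` while the remaining files are still being searched.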

But what about code changes?!

Changes are frequent, but localised.

Make our indexes incremental!

INCREMENTAL COMPUTATION A technique for efficiently updating a program’s output
when part of its inputs change.

A common approach is to maintain a dependency graph and
a cache.

Running with a cold cache will compute everything.

Running with a hot cache will only compute what has
changed.

For Erlang, we have module-level dependencies (because of erlc).

So our dependency graph looks like this: each module depends on
its header files and parser transforms.

Header files and parser transforms don’t change as often! Amortised
cost is small!
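
The dependency-graph-and-cache idea can be sketched in a few lines. This is a simplified illustration, not how Glass does it: the `IncrementalIndexer` class, the content-hash fingerprint, and the file names are assumptions, and parser transforms are left out (they would be treated like headers here).

```python
import hashlib


class IncrementalIndexer:
    """Recompute a module's result only when it or one of its dependencies changed."""

    def __init__(self, analyse):
        self.analyse = analyse   # function: list of input texts -> result
        self.cache = {}          # module name -> (fingerprint, result)
        self.recomputed = []     # modules recomputed by the most recent run

    @staticmethod
    def fingerprint(texts):
        """Hash the module source together with everything it depends on."""
        digest = hashlib.sha256()
        for text in texts:
            digest.update(text.encode("utf-8"))
            digest.update(b"\0")
        return digest.hexdigest()

    def run(self, sources, deps):
        """sources: file name -> text; deps: module name -> list of header names."""
        self.recomputed = []
        results = {}
        for module, headers in deps.items():
            inputs = [sources[module]] + [sources[h] for h in headers]
            fp = self.fingerprint(inputs)
            cached = self.cache.get(module)
            if cached is None or cached[0] != fp:
                cached = (fp, self.analyse(inputs))
                self.cache[module] = cached
                self.recomputed.append(module)
            results[module] = cached[1]
        return results


# A stand-in "index": how many times ct:pal appears in a module's inputs.
def count_debug_logs(texts):
    return sum(text.count("ct:pal") for text in texts)


sources = {
    "a.erl": "ct:pal(\"hi\", []).",
    "b.erl": "ok.",
    "common.hrl": "-define(X, 1).",
}
deps = {"a.erl": ["common.hrl"], "b.erl": ["common.hrl"]}

indexer = IncrementalIndexer(count_debug_logs)
indexer.run(sources, deps)
cold = list(indexer.recomputed)     # cold cache: everything is computed

indexer.run(sources, deps)
hot = list(indexer.recomputed)      # hot cache, nothing changed: nothing is computed
```

Editing one module invalidates only that module; editing a shared header invalidates every module that includes it, which is rarer.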

Incremental Pattern indexes + queries = correct and fast searches,
even in an IDE.

One thing I’ve learned: Having a clear UX vision and
design constraints was very helpful. So was experience, of course.

Glass’ design constraints were:
• The time to think of a query matters --- make it tangible;
• The perceived query performance matters --- show useful results quickly;
• The user effort to get fast queries matters --- make queries easy to optimise.

Glass is still a work in progress! @klarna-incubator/glass

PAPER RECOMMENDATIONS
• µKanren: A Minimal Functional Core for Relational Programming. Jason Hemann and Daniel P. Friedman (2013)
• How Developers Search for Code: A Case Study. Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum (2015)
• Adapton: Composable, Demand-Driven Incremental Computation. Matthew A. Hammer, Khoo Yit Phang, Michael Hicks, and Jeffrey S. Foster (2014)

THAT’S ALL, FOLKS http://talks.robotlolita.me
