Slide 1

Slide 1 text

Billions of lines of code in a single repository, SRSLY?
Guillaume Laforge, Developer Advocate, Google Cloud Platform
@glaforge

Slide 2

Slide 2 text

References
● ACM paper: Why Google Stores Billions of Lines of Code in a Single Repository (by Rachel Potvin & Josh Levenberg) http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
● Wired: Google Is 2 Billion Lines of Code, and It’s All in One Place https://www.wired.com/2015/09/google-2-billion-lines-codeand-one-place
● Rachel Potvin’s presentation at @scale https://www.youtube.com/watch?v=W71BTkUbdqE

Slide 3

Slide 3 text

Google repository scale

Slide 4

Slide 4 text

First, some figures… (as of 2015)
● 1 billion files
● 9 million source files
● 2 billion lines of code
● 35 million commits
● 86 terabytes of content
● 45 thousand commits / day

Slide 5

Slide 5 text

Exponential growth

Slide 6

Slide 6 text

Google repository usage
● Billions of file read requests (800 thousand QPS at peak)
● 30 thousand commits by automated systems per workday
● 15 thousand commits by humans per workday
● 25 thousand Googlers in dozens of offices around the world
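The usage figures can be cross-checked against the headline numbers with a little arithmetic. The sketch below uses only the values quoted on the slides. The derived averages are illustrative and assume, for simplicity, that all 25 thousand Googlers commit code.

```python
# Figures quoted on the slides (as of 2015).
AUTOMATED_COMMITS_PER_WORKDAY = 30_000
HUMAN_COMMITS_PER_WORKDAY = 15_000
GOOGLERS = 25_000
LINES_OF_CODE = 2_000_000_000
SOURCE_FILES = 9_000_000

# Automated plus human commits give the "45 thousand commits / day" figure.
total_commits_per_workday = AUTOMATED_COMMITS_PER_WORKDAY + HUMAN_COMMITS_PER_WORKDAY

# Share of commits made by automated systems.
automated_share = AUTOMATED_COMMITS_PER_WORKDAY / total_commits_per_workday

# Average human commits per person per workday (assumes everyone commits).
human_commits_per_person = HUMAN_COMMITS_PER_WORKDAY / GOOGLERS

# Average size of a source file, in lines.
avg_lines_per_source_file = LINES_OF_CODE / SOURCE_FILES

print(f"commits per workday: {total_commits_per_workday}")
print(f"automated share: {automated_share:.0%}")
print(f"human commits per person per workday: {human_commits_per_person:.1f}")
print(f"average lines per source file: {avg_lines_per_source_file:.0f}")
```

Two thirds of all commits come from automated systems, and the average source file is a little over two hundred lines long.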

Slide 7

Slide 7 text

Some perspective…
Linux Kernel
● 15 million lines of code in 40 thousand files
Google Repository
● 15 million lines of code in 250 thousand files changed per week
● 2 billion lines of code in 9 million source files

Slide 8

Slide 8 text

Google systems & workflows

Slide 9

Slide 9 text

Google workflow
Sync user workspace to repo → Write code → Code review → Commit
● All code is reviewed before commit, by humans and by automated checks
● Directories have owners who must approve proposed changes
● Tests and automatic checks run before and after commit
● A commit can be rolled back automatically if it causes widespread breakage
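The rule that directory owners must approve changes can be pictured with a minimal sketch. Everything here is hypothetical: the `OWNERS` table, the paths, and the names are invented for illustration. The sketch also assumes that owners of a parent directory may approve changes in its subdirectories, which the slides do not state.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: directory -> people allowed to approve changes.
OWNERS = {
    "search": {"alice"},
    "search/ranking": {"bob"},
    "ads": {"carol"},
}


def owners_for(path):
    """Collect the owners of every directory enclosing the file (assumption)."""
    owners = set()
    for parent in PurePosixPath(path).parents:
        owners |= OWNERS.get(str(parent), set())
    return owners


def can_commit(changed_files, approvers):
    """A change is committable only if each file has at least one approving owner."""
    approvers = set(approvers)
    return all(bool(owners_for(f) & approvers) for f in changed_files)


change = ["search/ranking/score.cc", "ads/bid.cc"]
print(can_commit(change, {"bob"}))           # ads/bid.cc has no approving owner
print(can_commit(change, {"bob", "carol"}))  # every file is covered
```

A change that spans several directories needs an approval covering each of them, so cross-team changes go through the teams that own the affected code.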

Slide 10

Slide 10 text

Google source systems
Piper
● Stores a single, large repository
● Implemented on top of standard Google infrastructure
● Replicated across 10 data centers worldwide
CitC (Clients in the Cloud)
● Cloud-based storage backend and a local file system view
● Users see local changes overlaid on top of the full Piper repository
● Users can navigate and edit files across the entire codebase
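The idea of local changes overlaid on the full repository can be illustrated with a small sketch. This is not how CitC is implemented. It only models the overlay with Python's standard `collections.ChainMap`, and the file paths and contents are made up.

```python
from collections import ChainMap

# Hypothetical snapshot of the repository at head: path -> file content.
piper_head = {
    "search/query.cc": "v1 of query",
    "ads/bid.cc": "v1 of bid",
}

# The workspace stores only what the user has modified.
local_changes = {}

# Lookups try local_changes first, then fall back to piper_head.
# Writes always go to the first mapping, so the base snapshot is never touched.
workspace = ChainMap(local_changes, piper_head)

workspace["search/query.cc"] = "locally edited query"

print(workspace["search/query.cc"])   # the local edit wins
print(workspace["ads/bid.cc"])        # untouched file comes from the repository
print(sorted(local_changes))          # only the modified file is stored locally
```

The whole codebase stays browsable, while the workspace holds only the files the user actually changed.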

Slide 11

Slide 11 text

Additional tooling
● Critique: code review
● CodeSearch: code browsing, exploration, understanding, and archeology
● Tricorder: static analysis of code, surfaced in Critique and CodeSearch
● Presubmits: customizable checks and testing, can block commits
● TAP: comprehensive testing before & after commits, auto-rollback
● Rosie: large-scale refactorings, change distribution and management

Slide 12

Slide 12 text

Trunk-based development
● Piper users work “at head”, a consistent view of the codebase
● All changes are made to the repository in a single, serial ordering
● There is no significant use of branching for development
● Release branches are cut from a specific revision of the repository
(Diagram: trunk, release branch, cherry-pick)
Combined with a centralized repository, this defines the monolithic model
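The trunk, release branch, and cherry-pick relationship can be modelled in a few lines. This is a toy sketch. The `commit` function, the `ReleaseBranch` class, and the commit messages are hypothetical, and real tooling tracks far more than a list of descriptions.

```python
# The trunk is a single, serial list of changes; the revision number is the position.
trunk = []


def commit(change):
    """Append a change to the trunk and return its 1-based revision number."""
    trunk.append(change)
    return len(trunk)


class ReleaseBranch:
    """A branch cut from a specific trunk revision (hypothetical model)."""

    def __init__(self, revision):
        self.base_revision = revision
        self.changes = list(trunk[:revision])

    def cherry_pick(self, revision):
        """Copy one later trunk change onto the release branch."""
        self.changes.append(trunk[revision - 1])


commit("add feature A")
commit("add feature B")
release = ReleaseBranch(2)          # cut the release at revision 2
commit("start feature C")           # development continues on the trunk
fix = commit("fix bug in feature B")
release.cherry_pick(fix)            # only the fix reaches the release

print(release.changes)
```

The fix is written once on the trunk and then copied to the release branch, so the branch never becomes a place where development happens.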

Slide 13

Slide 13 text

Advantages of a monolithic repository

Slide 14

Slide 14 text

Advantages of a monolithic repository
● Unified versioning, one source of truth
● Extensive code sharing and reuse
● Simplified dependency management
● Atomic changes
● Large-scale refactoring, codebase modernization
● Collaboration across teams
● Flexible team boundaries and code ownership
● Code visibility and clear tree structure providing implicit team namespacing

Slide 15

Slide 15 text

Single source of truth
● No confusion about which is the authoritative version of a file
● No forking of shared libraries
● No painful cross-repository merging of copied sources
● No artificial boundaries between teams and projects
● Supports gradual refactoring and reorganization of the codebase
Changes to base libraries are instantly propagated through the dependency chain, greatly simplifying dependency management

Slide 16

Slide 16 text

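Diamond dependency problem
(Diagrams: A depends on B and C, which both depend on D; then B depends on D1 and C depends on D2, two versions of D)
● It may become impossible to build A
● It may become difficult to release the base library D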
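The diamond problem arises when libraries are versioned independently. A minimal sketch detects it by walking the dependency graph and reporting any library required at more than one version. The `REQUIREMENTS` table and the version strings are hypothetical.

```python
from collections import defaultdict

# Hypothetical multi-repository setup: each package pins the versions it needs.
REQUIREMENTS = {
    "A": {"B": "1.0", "C": "1.0"},
    "B": {"D": "1.0"},   # B was built against D1
    "C": {"D": "2.0"},   # C was built against D2
    "D": {},
}


def find_conflicts(root, requirements):
    """Return {package: versions} for packages required at more than one version."""
    wanted = defaultdict(set)
    seen = set()
    stack = [root]
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        for dependency, version in requirements[package].items():
            wanted[dependency].add(version)
            stack.append(dependency)
    return {dep: sorted(versions) for dep, versions in wanted.items() if len(versions) > 1}


print(find_conflicts("A", REQUIREMENTS))
```

In a monolithic repository there is only one version of D at head, so this conflict cannot occur: whoever changes D updates B and C in the same change.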

Slide 17

Slide 17 text

Atomic changes
● Make large, backward-incompatible changes easily
● Change hundreds / thousands of files in a single consistent operation
● Rename a file or function in a single commit, with no broken builds or tests
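Renaming a function in a single commit can be sketched as a transformation from one repository snapshot to the next. The snapshot below is a hypothetical in-memory dictionary and the rename is a plain whole-word substitution. Real refactoring tools work on parsed code, not on text.

```python
import re


def rename_everywhere(snapshot, old, new):
    """Return a new snapshot with every whole-word occurrence of `old` renamed."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, text) for path, text in snapshot.items()}


# Hypothetical repository state at head: path -> file content.
head = {
    "lib/strings.py": "def to_upper(s):\n    return s.upper()\n",
    "app/main.py": "from lib.strings import to_upper\nprint(to_upper('hi'))\n",
    "app/other.py": "def to_upper_case(s):\n    return s\n",
}

# The definition and all callers change together, in one step.
new_head = rename_everywhere(head, "to_upper", "uppercase")

for path in sorted(new_head):
    print(path, "changed" if new_head[path] != head[path] else "unchanged")
```

Because definition and callers move from one consistent state to another in a single commit, no intermediate revision has a caller pointing at a name that no longer exists.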

Slide 18

Slide 18 text

Codebase modernization
● A single view of the codebase facilitates clean-ups and modernization efforts
○ Can be centrally managed by dedicated specialists
○ E.g. updating the codebase to make use of C++11 features
● A monolithic codebase captures all dependency information
○ Old APIs can be removed with confidence
Software errors or design mistakes can be found and fixed across the entire codebase, and the fixes coupled with new compiler warnings or presubmit checks

Slide 19

Slide 19 text

Costs associated with this model

Slide 20

Slide 20 text

Costs associated with this model
● Tooling investment is valuable but costly
○ Development to scale tools
○ Cost of executing computationally intensive tools (e.g. builds)
● Codebase complexity is a risk to productivity
● Code health must be a priority
Note: a monolithic codebase in no way implies monolithic software design!

Slide 21

Slide 21 text

Codebase complexity
This model encourages code sharing and reuse, however...
● It’s very (too?) easy to add dependencies
● Unnecessary dependencies increase:
○ Exposure to build breakage
○ Binary sizes
○ Costs for building, testing, and maintenance
This model reduces the incentive for stable, well-thought-out APIs

Slide 22

Slide 22 text

Investment in code health
● Tools have been built to:
○ Find and remove unused / underused dependencies and dead code
○ Support large-scale clean-ups and refactorings
● Google introduced API visibility, with the default set to “private”
○ APIs must be explicitly marked as appropriate for use
○ APIs can be marked deprecated
Lesson learned: put these mechanisms in place early to encourage hygienic dependency structures before the codebase is too large
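A private-by-default visibility rule can be sketched as a check run when a dependency is added. The `TARGETS` table, the target names, and the rule that "private" means "same top-level directory" are all assumptions made for illustration. They do not describe Google's actual build system.

```python
PRIVATE = "private"
PUBLIC = "public"

# Hypothetical build targets. A target with no "visibility" entry is private.
TARGETS = {
    "base/strings": {"visibility": PUBLIC},
    "base/old_strings": {"visibility": PUBLIC, "deprecated": True},
    "search/internal/index": {},
}


def package_of(target):
    """Top-level directory of a target, used here as the visibility boundary."""
    return target.split("/")[0]


def check_dependency(user, dependency):
    """Return a list of problems with `user` depending on `dependency`."""
    info = TARGETS[dependency]
    problems = []
    visibility = info.get("visibility", PRIVATE)   # private unless stated otherwise
    if visibility == PRIVATE and package_of(user) != package_of(dependency):
        problems.append(f"{dependency} is not visible to {user}")
    if info.get("deprecated", False):
        problems.append(f"{dependency} is deprecated")
    return problems


print(check_dependency("ads/bid", "base/strings"))
print(check_dependency("ads/bid", "search/internal/index"))
print(check_dependency("search/frontend", "search/internal/index"))
print(check_dependency("ads/bid", "base/old_strings"))
```

With such a check in a presubmit, a new dependency on another team's internals is rejected before it lands, which is much cheaper than untangling it later.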

Slide 23

Slide 23 text

Conclusion

Slide 24

Slide 24 text

Conclusion
● The monolithic model of source management works well when coupled with an engineering culture of collaboration and transparency
● Google has invested heavily in scalability and productivity tooling to support this model, due to the significant advantages it provides
Google has shown that this model can scale to repositories with 1 billion files, 35 million commits, and thousands of users around the globe
This may or may not be the right approach for your company!

Slide 25

Slide 25 text

Thanks for your attention! @glaforge

Slide 26

Slide 26 text

Picture credits (1/3)
● Numbers on paper: http://www.publicdomainpictures.net/pictures/10000/velka/ciphering-numbers-2961283828065RtvT.jpg
● Rocket launch: https://pixabay.com/fr/lancement-de-la-fus%C3%A9e-nuit-693236/
● Space shuttle launch: https://pixabay.com/fr/lancement-de-la-fus%C3%A9e-fus%C3%A9e-67646/
● New York skyscrapers: https://pixabay.com/fr/new-york-gratte-ciel-immeuble-927138/
● Pied piper: http://4.bp.blogspot.com/-xA_n_cW9p-k/Ue2bPv8TJTI/AAAAAAAALZY/r-j1GUaZUAE/s1600/The-Pied-Piper-Rats.gif

Slide 27

Slide 27 text

Picture credits (2/3)
● Drill: https://pixabay.com/fr/t%C3%AAte-de-forage-foret-perceuse-m%C3%A9tal-444504/
● Monolith: https://pixabay.com/fr/stone-henge-monument-pierre-1573342/
● Agenda: http://www.publicdomainpictures.net/pictures/200000/velka/lined-paper-1474013993szE.jpg
● Health: http://www.publicdomainpictures.net/pictures/10000/velka/2473-1270809666Fpnm.jpg
● Stonehenge: https://pixabay.com/fr/stone-henge-monument-pierre-1573342/

Slide 28

Slide 28 text

Picture credits (3/3)
● Trunk: https://pixabay.com/fr/automne-arbre-nature-des-for%C3%AAts-1589644/
● Doors: https://pixabay.com/fr/portes-choix-choisir-ouverte-1587329/
● Atom: https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Atom_editor_logo.svg/2000px-Atom_editor_logo.svg.png
● Diamond: https://pixabay.com/fr/pr%C3%A9cieux-diamant-bijoux-cher-1199183/
● Colored lines: https://pixabay.com/fr/lignes-rayons-arri%C3%A8re-plan-593191/