
Billions of lines of code in a single repository, SRSLY?

Google stores all its source code in a single monolithic repository! Imagine 25,000 software developers working simultaneously on 86 TB of data, including two billion lines of code in 9 million unique source files. Each week, as many lines of code are changed as there are lines in the full Linux kernel repository. How does Google’s source code work at this scale? What are the advantages and drawbacks of such an approach? Come and learn what it means to work on such a mammoth repository.

Guillaume Laforge

November 23, 2016

Transcript

  1. Billions of lines of code in a single repository, SRSLY?
    Guillaume Laforge, Developer Advocate, Google Cloud Platform (@glaforge)
  2. References
    • ACM paper: Why Google Stores Billions of Lines of Code in a Single Repository (by Rachel Potvin & Josh Levenberg) http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
    • Wired: Google is 2 billion lines of code, and it’s all in one place https://www.wired.com/2015/09/google-2-billion-lines-codeand-one-place
    • Rachel Potvin’s presentation at @Scale https://www.youtube.com/watch?v=W71BTkUbdqE
  3. First, some figures… (as of 2015)
    • 1 billion files
    • 9 million source files
    • 2 billion lines of code
    • 35 million commits
    • 86 terabytes of content
    • 45 thousand commits per day
  4. Google repository usage
    • Billions of file read requests (800 thousand QPS at peak)
    • 30 thousand commits by automated systems per workday
    • 15 thousand commits by humans per workday
    • 25 thousand Googlers in dozens of offices around the world
  5. Some perspective…
    Linux kernel
    • 15 million lines of code in 40 thousand files
    Google repository
    • 15 million lines of code changed in 250 thousand files per week
    • 2 billion lines of code in 9 million source files
  6. Google workflow
    Steps: sync user workspace to repo, write code, code review, commit
    • All code is reviewed before commit, by humans and by automated checks
    • Directories have owners who must approve proposed changes
    • Tests and automatic checks run before and after commit
    • A commit can be rolled back automatically if it causes widespread breakage
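The rule that directory owners must approve changes can be illustrated with a small sketch. The ownership table, the function names, and the choice to let owners of any enclosing directory approve are assumptions made for illustration; the slides do not describe how Google actually implements this.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: directory -> owners declared there.
OWNERS = {
    "search": {"alice"},
    "search/indexing": {"bob"},
    "ads": {"carol"},
}


def owners_for(path):
    """Collect the owners of every directory that encloses `path`.

    Assumption: an owner of a parent directory may also approve changes
    in its subdirectories.
    """
    owners = set()
    for parent in PurePosixPath(path).parents:
        owners |= OWNERS.get(str(parent), set())
    return owners


def is_approved(changed_files, approvers):
    """A change passes when every touched file has at least one approving owner."""
    approvers = set(approvers)
    return all(owners_for(f) & approvers for f in changed_files)
```

With this table, a change touching `search/indexing/crawler.py` and `ads/bid.py` would need approval from an owner on each side, for instance `bob` and `carol`.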
  7. Google source systems
    Piper
    • Stores a single, large repository
    • Implemented on top of standard Google infrastructure
    • Replicated across 10 data centers worldwide
    CitC (Clients in the Cloud)
    • Cloud-based storage backend and a local file system view
    • Users see local changes overlaid on top of the full Piper repository
    • Users can navigate and edit files across the entire codebase
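The idea of local changes overlaid on the full repository can be sketched with a layered mapping. This is only an analogy: the dictionaries and file names below are hypothetical, and the real CitC is a cloud-backed file system, not an in-memory structure.

```python
from collections import ChainMap

# Hypothetical snapshot of the repository at some revision: path -> content.
repo_snapshot = {
    "base/strings.h": "// v1",
    "search/query.cc": "// original",
}

# Only the files the user has modified are stored in the workspace layer.
local_changes = {"search/query.cc": "// edited locally"}

# Lookups consult local_changes first, then fall back to the snapshot.
workspace = ChainMap(local_changes, repo_snapshot)

# Writes land in the first mapping only, so the snapshot is never mutated.
workspace["base/new_file.h"] = "// new"
```

Reading `workspace["search/query.cc"]` returns the local edit, while `workspace["base/strings.h"]` falls through to the snapshot, so the user sees the whole codebase without copying it.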
  8. Additional tooling
    • Critique: code review
    • CodeSearch: code browsing, exploration, understanding, and archeology
    • Tricorder: static analysis of code, surfaced in Critique and CodeSearch
    • Presubmits: customizable checks and testing that can block commits
    • TAP: comprehensive testing before and after commits, auto-rollback
    • Rosie: large-scale refactorings, change distribution and management
  9. Trunk-based development
    • Piper users work “at head”, a consistent view of the codebase
    • All changes are made to a repository in a single, serial ordering
    • There is no significant use of branching for development
    • Release branches are cut from a specific revision of the repository
    (Diagram: trunk, release branch, cherry-pick)
    Combined with a centralized repository, this defines the monolithic model
  10. Advantages of a monolithic repository
    • Unified versioning, one source of truth
    • Extensive code sharing and reuse
    • Simplified dependency management
    • Atomic changes
    • Large-scale refactoring, codebase modernization
    • Collaboration across teams
    • Flexible team boundaries and code ownership
    • Code visibility and clear tree structure providing implicit team namespacing
  11. Single source of truth
    • No confusion about which is the authoritative version of a file
    • No forking of shared libraries
    • No painful cross-repository merging of copied sources
    • No artificial boundaries between teams and projects
    • Supports gradual refactoring and reorganization of the codebase
    Changes to base libraries are instantly propagated through the dependency chain, greatly simplifying dependency management
  12. Diamond dependency problem
    • It may become difficult to release the base library D
    • It may become impossible to build A
    (Diagrams: A depends on B and C, which both depend on D; then B and C depend on different versions, D1 and D2)
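The diamond on this slide can be reproduced with a toy dependency graph. The manifest format, version strings, and function name below are hypothetical; the sketch only shows why A becomes unbuildable once B and C pin different versions of D.

```python
from collections import defaultdict

# Hypothetical manifests: package -> {dependency: required version}.
REQUIREMENTS = {
    "A": {"B": "1", "C": "1"},
    "B": {"D": "1"},
    "C": {"D": "2"},
}


def version_conflicts(root, requirements):
    """Return dependencies reachable from `root` that are required at several versions."""
    wanted = defaultdict(set)  # dependency -> set of versions requested
    seen, stack = set(), [root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        for dep, version in requirements.get(pkg, {}).items():
            wanted[dep].add(version)
            stack.append(dep)
    return {dep: sorted(versions) for dep, versions in wanted.items() if len(versions) > 1}
```

Here `version_conflicts("A", REQUIREMENTS)` reports D as required at both versions. In a repository where everything builds at head there is only one version of D, so the same check finds nothing.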
  13. Atomic changes
    • Make large, backward-incompatible changes easily
    • Change hundreds or thousands of files in a single consistent operation
    • Rename a file or function in a single commit, with no broken builds or tests
  14. Codebase modernization
    • A single view of the codebase facilitates clean-ups and modernization efforts
      ◦ Can be centrally managed by dedicated specialists
      ◦ Ex. updating the codebase to make use of C++11 features
    • A monolithic codebase captures all dependency information
      ◦ Old APIs can be removed with confidence
    Software errors or design mistakes can be found and fixed across the entire codebase, and the fix can be coupled with new compiler warnings or presubmit checks
  15. Costs associated with this model
    • Tooling investment can be valuable but also costly
      ◦ Development to scale tools
      ◦ Cost of executing computationally intensive tools (e.g. builds)
    • Codebase complexity is a risk to productivity
    • Code health must be a priority
    Note: a monolithic codebase in no way implies a monolithic software design!
  16. Codebase complexity
    This model encourages code sharing and reuse, however...
    • It’s very (too?) easy to add dependencies
    • Unnecessary dependencies increase:
      ◦ Exposure to build breakage
      ◦ Binary sizes
      ◦ Costs for building, testing, and maintenance
    This model reduces the incentive for stable, well-thought-out APIs
  17. Investment in code health
    • Tools have been built to:
      ◦ Find and remove unused or underused dependencies and dead code
      ◦ Support large-scale clean-ups and refactorings
    • Google introduced API visibility, with the default set to “private”
      ◦ APIs must be explicitly set as appropriate for use
      ◦ APIs can be marked deprecated
    Lesson learned: put these mechanisms in place early to encourage hygienic dependency structures before the codebase is too large
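A private-by-default visibility rule can be sketched as a simple check run before a new dependency is accepted. The target names, the `visibility` field, and the matching rules here are assumptions for illustration, not a description of Google's build system.

```python
# Hypothetical build targets: name -> metadata. A target without a
# "visibility" entry is treated as private to its own package.
TARGETS = {
    "base/strings": {"visibility": ["public"]},
    "search/internal/ranker": {"visibility": ["search"]},
    "ads/secret": {},
}


def package_of(target):
    """Return the directory part of a target name."""
    return target.rsplit("/", 1)[0]


def may_depend(client, target):
    """Decide whether `client` is allowed to depend on `target`."""
    visibility = TARGETS[target].get("visibility", [])
    if "public" in visibility:
        return True
    if package_of(client) == package_of(target):
        return True  # code in the same package can always use the target
    # Otherwise the client must sit under one of the listed path prefixes.
    return any(client == p or client.startswith(p + "/") for p in visibility)
```

With these definitions, anything may use `base/strings`, only code under `search` may use `search/internal/ranker`, and `ads/secret` stays private to the `ads` package because it declares nothing.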
  18. Conclusion
    • The monolithic model of source management works well when coupled with an engineering culture of collaboration and transparency
    • Google has invested heavily in scalability and productivity tooling to support this model, due to the significant advantages it provides
    Google has shown that this model can scale to repositories with 1 billion files, 35 million commits, and thousands of users around the globe
    This may or may not be the right approach for your company!
  19. Picture credits (1/3)
    • Numbers on paper http://www.publicdomainpictures.net/pictures/10000/velka/ciphering-numbers-2961283828065RtvT.jpg
    • Rocket launch https://pixabay.com/fr/lancement-de-la-fus%C3%A9e-nuit-693236/
    • Space shuttle launch https://pixabay.com/fr/lancement-de-la-fus%C3%A9e-fus%C3%A9e-67646/
    • New York skyscrapers https://pixabay.com/fr/new-york-gratte-ciel-immeuble-927138/
    • Pied piper http://4.bp.blogspot.com/-xA_n_cW9p-k/Ue2bPv8TJTI/AAAAAAAALZY/r-j1GUaZUAE/s1600/The-Pied-Piper-Rats.gif