Maintaining Python at Google Scale by Thomas Wouters

PyCon 2014
April 10, 2014

Sponsor Workshop session by Google at PyCon 2014

Transcript

  1. Google Confidential and Proprietary
     Maintaining Python at Google Scale
     Thomas Wouters <[email protected]> <[email protected]>
     Yhg1s on Freenode IRC

     Agenda
     • Google Scale
     • Python at Google
     • … in 2006
     • Solving the problems
     • Building Python
     • Embedding Python
     • Unsolved issues
     • Questions
  2. Google Scale
     • Lots of machines
       ◦ servers and workstations
     • Lots of code
       ◦ over 100 million lines of code
     • Single shared codebase (mostly)
       ◦ lots of re-use, lots of moving targets
     • Lots of third-party software use
       ◦ including open-source software
       ◦ Python, GCC, LLVM, countless libraries
       ◦ strict adherence to licenses
     • Lots of exceptions

     Build ideals
     • Hermetic programs
       ◦ completely self-contained binaries
       ◦ run anywhere
       ◦ get the same result everywhere
     • Reproducible builds
       ◦ build at the same revision, get bit-identical binaries
       ◦ easier to cherry-pick changes
       ◦ easier to detect unintended changes
     • No shared libraries
       ◦ static linking everywhere
       ◦ no ABI concerns
     • Build everything from head
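The "reproducible builds" ideal (build at the same revision, get bit-identical binaries) can be checked mechanically by comparing digests of two build outputs. The following is a minimal sketch, not Google's tooling: the helper names `file_digest` and `bit_identical` and the temporary files standing in for build outputs are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def bit_identical(path_a, path_b):
    """True if both files have exactly the same contents."""
    return file_digest(path_a) == file_digest(path_b)


# Hypothetical stand-ins for the outputs of two builds at the same revision.
workdir = Path(tempfile.mkdtemp())
build_1 = workdir / "build_1.bin"
build_2 = workdir / "build_2.bin"
build_3 = workdir / "build_3.bin"
build_1.write_bytes(b"same revision, same bytes")
build_2.write_bytes(b"same revision, same bytes")
build_3.write_bytes(b"same revision, but a timestamp leaked in")

same = bit_identical(build_1, build_2)
different = bit_identical(build_1, build_3)
print(same, different)
```

A mismatch at the same revision is the "unintended change" signal the slide mentions: something outside the source tree (a timestamp, a host library, a path) influenced the output.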
  3. Python at Google
     • Third-largest language
       ◦ about 1/3 as many lines of code as C++
       ◦ still many millions of lines of code
       ◦ lots of use of C++ libraries (SWIG)
     • One giant Python package
     • Blaze (build tool)
       ◦ builds everything (in the cloud)
       ◦ creates “entrypoint” scripts
       ◦ makes interactive use of Python harder
     • PAR
       ◦ “JAR for Python”
       ◦ executable, distributable format
       ◦ hermetic except for Python
     • Lots of third-party use
       ◦ sys.path trickery to not change module names

     Python at Google in 2007
     • Python programs controlled by shebang line
       ◦ mostly Python 2.2
       ◦ 2.4 slowly growing
       ◦ some 2.3, never officially supported
     • Different versions of 2.4 on different machines
       ◦ workstations, RedHat-based, using 2.4.1, 32-bit
       ◦ servers using 2.4.3 with a different libc version, 32-bit
       ◦ new workstations, Ubuntu-based, using 2.4.5… but 64-bit
     • Not hermetic at all
       ◦ system-installed third-party dependencies available
     • No way to use 64-bit Python
       ◦ C++ was mostly 64-bit, except when building Python programs
     • All extension modules built for Python 2.2
       ◦ and then used in 2.4
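The slide mentions "sys.path trickery to not change module names" without saying how it is done. One plausible reading, sketched below purely as an assumption, is that a third-party package stored deep inside the single source tree is made importable under its original top-level name by putting its parent directory on `sys.path`. The directory layout and the package name `examplelib` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout inside one big source tree:
#   <root>/third_party/examplelib_pkg/examplelib/__init__.py
root = Path(tempfile.mkdtemp())
vendored_dir = root / "third_party" / "examplelib_pkg"
(vendored_dir / "examplelib").mkdir(parents=True)
(vendored_dir / "examplelib" / "__init__.py").write_text("VERSION = '1.0'\n")

# Putting the vendored directory on sys.path lets callers keep writing
# "import examplelib" instead of a tree-qualified name.
sys.path.insert(0, str(vendored_dir))
importlib.invalidate_caches()

examplelib = importlib.import_module("examplelib")
print(examplelib.__name__, examplelib.VERSION)
```

The benefit is that upstream code, which imports itself by its original name, does not need to be patched; the cost is that import behavior now depends on `sys.path` ordering.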
  4. Unifying Environments
     • Google Runtime Environment
       ◦ runtime libraries (glibc, Python) independent from the system
       ◦ version controlled by configuration in source tree
     • Python as part of GRTE
       ◦ Python 2.4 only, but 32-bit and 64-bit
       ◦ ignore shebang lines of .py files
       ◦ build tool selects right Python to use
         ▪ for extension modules as well
       ◦ still not quite hermetic
     • Difficult to update once in use
       ◦ with millions of lines of code, many bugs seem like features
       ◦ long release cycle
     • Flag-day to flip Python version
       ◦ along with glibc and gcc
       ◦ preceded by lots of testing

     GRTE and Python versions
     • GRTEv1: Python 2.4 (2008)
       ◦ to save space, symlinks identical files between 32-bit and 64-bit
       ◦ can’t symlink .pyc files
     • GRTEv2: Python 2.6 (2010)
       ◦ first major upgrade of Python in many years
       ◦ disables writing .pyc/.pyo files by default (-B option)
       ◦ to save space, uses the same stdlib for both 32-bit and 64-bit
         ▪ (causes confusing tracebacks)
     • GRTEv3: Python 2.7 (2012)
       ◦ turns on hash randomization by default
         ▪ flushed out a surprising number of bugs
       ◦ to save space, puts the stdlib in a ZIP file
         ▪ flushed out a surprising number of bugs
       ◦ builds with PGO: +20% performance
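Hash randomization "flushed out a surprising number of bugs" because code can silently depend on the iteration order of sets and dicts keyed by strings. The sketch below illustrates that class of bug under stated assumptions: it runs on a modern Python 3 interpreter (where randomization is already the default) and uses the standard `PYTHONHASHSEED` environment variable to force different seeds. The helper `run_with_seed` and the sample words are made up for the example.

```python
import os
import subprocess
import sys

WORDS = "{'alpha', 'beta', 'gamma', 'delta', 'epsilon'}"

# Fragile: output order follows set iteration order, which depends on the hash seed.
FRAGILE = f"print(','.join({WORDS}))"
# Robust: sorting removes the dependence on hash order.
ROBUST = f"print(','.join(sorted({WORDS})))"


def run_with_seed(code, seed):
    """Run a snippet in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


fragile_outputs = {seed: run_with_seed(FRAGILE, seed) for seed in (1, 2, 3)}
robust_outputs = {seed: run_with_seed(ROBUST, seed) for seed in (1, 2, 3)}

for seed in (1, 2, 3):
    print(seed, fragile_outputs[seed], "|", robust_outputs[seed])
```

The fragile variant may print the words in a different order for each seed, while the sorted variant is stable. Tests that compare against a hard-coded ordering are the typical casualty when randomization is switched on.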
  5. Building Python is Hard
     • Two-step process
       ◦ build python
       ◦ run setup.py with built python
     • setup.py is messy code
       ◦ searches filesystem
         ▪ third-party packages on host affect build output
       ◦ can’t static link dependencies
         ▪ need static linking for openssl, readline, etc.
     • Pre-distutils way: Modules/Setup
       ◦ still used for built-in (static linked) extension modules
       ◦ can also produce shared extension modules
       ◦ can control exact compiler arguments (up to a point)
     • PGO: profile-guided optimization
       ◦ make profile-opt
       ◦ who knew?

     Embedding Python
     • Not as easy as it looks
     • Even with static linking, Python still needs standard library
       ◦ including extension modules
     • Standard library is searched for relative to executable
       ◦ run program from inside /usr, find system Python stdlib
       ◦ run program elsewhere, find GRTE Python stdlib
       ◦ same is true for ‘exec -a process_name python ...’
         ▪ used by PAR
       ◦ see Modules/getpath.c:search_for_prefix in CPython source
     • Solution: embed all the things
       ◦ static link extension modules
       ◦ embed Python stdlib in ZIP file in executable
       ◦ modified zipimport loads stdlib from executable
     • Solved by volunteer from another team
       ◦ 20% time
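The "embed Python stdlib in ZIP file in executable" idea relies on the fact that a ZIP archive can be appended to another file and still be read, since the ZIP directory lives at the end of the file. The sketch below shows only that mechanism with the stock `zipfile` and `zipimport` machinery; it is not the modified zipimport from the talk. The file `fake_program.bin` is a stand-in made of placeholder bytes, and the module `embedded_demo` is hypothetical.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Stand-in for a statically linked binary: placeholder bytes, not a real executable.
fake_binary = workdir / "fake_program.bin"
fake_binary.write_bytes(b"\x7fELF" + b"\0" * 64)

# Mode "a" on a file that is not yet a ZIP appends a new archive to its end.
with zipfile.ZipFile(fake_binary, "a") as zf:
    zf.writestr("embedded_demo.py", "GREETING = 'loaded from inside the binary'\n")

# The standard zipimport hook accepts the combined file as a sys.path entry.
sys.path.insert(0, str(fake_binary))
importlib.invalidate_caches()

embedded_demo = importlib.import_module("embedded_demo")
print(embedded_demo.GREETING)
print(embedded_demo.__file__)
```

Because the modules travel inside the program file itself, the lookup no longer depends on where the executable happens to be run from, which is the failure mode the `search_for_prefix` bullets describe.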
  6. Unsolved issues
     • Actual hermetic builds
       ◦ include Python in PAR file
         ▪ like py2exe and pyinstaller
         ▪ allows for gradual evolution of Python in Google
     • Extension modules are problematic for PAR files
       ◦ change glibc to accommodate Python
     • Python 3
       ◦ treated as parallel Python version
       ◦ talk to Greg
     • Windows / MacOS / non-Google machines
       ◦ ignoring for now
     • Finding unused/dead code
       ◦ some is flushed out during GRTE upgrades
     • Keeping track of all uses of our Python
       ◦ interesting new uses sneak in

     Questions
     • Google Engineering Tools blog
       ◦ all about the build system and the tools
       ◦ http://google-engtools.blogspot.com/
     • Questions (if there is time)
     • Come talk to us
       ◦ Thomas Wouters <[email protected]>
       ◦ Gregory P. Smith <[email protected]>