
2017 - Bringing Python 3 to LinkedIn

PyBay
August 21, 2017


Organizations keep finding excuses to stay on Python 2, especially large companies with a lot of legacy code. Developers in such organizations either require or could benefit from Python 3 features, such as asyncio or type checking, but are constrained by their environment. It does not have to be that way: with careful planning and agile, incremental execution, you can move to Python 3.

We present a comprehensive case study of enabling Python 3 development in a large company. The talk covers all the stages of the process. We start with the initial motivation, goals, and possible solutions. Then, we go into the specifics of the design for the infrastructure changes necessary to make the migration possible. Next, we discuss the execution and the decision-making process for resolving challenges and trade-offs. We talk about multi-version testing, conditional dependencies, build, continuous integration, and automation. We describe possible ways to resolve platform, dependency, and code base issues. Finally, we look at porting specific code patterns and at future development.

Zvezdan currently works at LinkedIn on the Python Foundation Team. He taught Python at University of Mary Washington and then worked for Zope Corporation, where he led a team that developed one of its most important products. He also worked on and maintained dozens of packages for the open-source Zope web framework. These days, he contributes to PyGradle, the open-source Python build tool developed by his team at LinkedIn. Zvezdan has presented at international conferences and internal company tech talks on various topics, from microprocessor design to distributed file systems to troubleshooting Python applications.






  1. Bringing Python 3 to LinkedIn, Zvezdan Petković, LinkedIn

  2. Overview

    • Introduction: Motivation, Goals, Comparisons, Justification
    • Design: Python interpreters, Migration, Multi-version testing, Dependencies
    • Challenges: Platform, Dependencies, Legacy code base, Build system
    • Porting: Order, Compatibility, Code patterns, Deployable vs. library
  3. “I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.” (Woody Allen)

    Caveat emptor

  5. Motivation

    • Use of system Python was hard-coded into our build tooling
    • The vendor’s site-packages were heavily tainted
    • Some of our acquisitions already used Python 2.7

    Flexibility, Independence, Transparency, Progress

    • We used Python 2.6 because it was vended with RHEL 6
    • RHEL 7 ships with Python 2.7 by default
    • Our apps and build tooling would be blocked from update
    • Internal command-line tools had to be rolled back sometimes because:
        • An external script (ab)used addsitedir() or activate_this.py
        • Depended on the tainted package from system site-packages
    • Python 2.6 retired with the release of Python 2.6.9 on October 29, 2013
    • Python 2.7 end-of-life is in 2020
    • Our developers want to use the newest features, such as asyncio or type hints (PEP 484)
  6. Goals

    • Clean Python interpreters on all platforms we actively support
    • Code that works with multiple versions of Python
    • Support for multiple versions of Python

    Provide, Enable, Do

    • Incremental migration
    • Python 2.7: completed in December 2015
    • Python 3: completed in September 2016
    • Break the dependency on the single vendor supplied version of Python
    • Migrate base libraries: completed Fall 2016
    • Pave the way for the full migration to Python 3
  7. Existing solutions

    Multiple Python versions
    • The support for multiple Python interpreter versions can be accomplished through:
        • System vended Python
        • Clean Python

    Code base for multiple versions
    • There are existing tools that can migrate the code to work with multiple versions of Python
  8. System vended Python

    Advantages
    • Convenience
    • Vendors offer multiple versions of Python
    • One of the versions is the default
    • Red Hat offers Software Collections in RHEL 7 … but not for Python 2.6
    • Apple supports library frameworks

    Disadvantages
    • The dependency on vendor for updates
    • Incompatible backport changes
    • Tainted libraries
    • Interactions with 3rd party packages installed in site-packages
    • Different paths to Python interpreter on different platforms or for different versions of Python
  9. Clean Python

    Advantages
    • Isolation from changes caused by the vended Python
    • Reproducibility of issues, without strange interactions
    • Control over updates

    Disadvantages
    • Must maintain the custom build
    • Must follow the security updates closely
  10. Code support for multiple versions

    • The package six abstracts the differences from the users
    • Automated rewriting with python-modernize
    • Uses six when it can
    • Supports Python 2.6 to 3.x
    • Straddling code
    • All Python developers faced the need to support Python 2 and 3
    • The best tools and practices developed over the last decade
    • They maintain the compatibility with Python 2.6
    • Extend the compatibility to the latest Python 3 version

    Botched backport fixes
    • Strange interactions with vendor's packages
    • Problems caused by a package installed into system's site-packages
    • Modernization reveals bad code patterns and improves our code base

    Prevents, Enables

    • Migration of products to new versions of Python at developer’s own pace
    • Mission-critical teams can use Python 3 features they need
    • Be ready for 2020 instead of scrambling in the last moment
    • Be leaders in Python community
  12. Design and Implementation INFRASTRUCTURE CHANGES

  13. Requirements

    • Clean Python interpreters
    • Code migration
    • Multi-version testing
    • Conditional dependencies
  14. Clean Python Interpreters

    • A cleanpythonXY product for each X.Y version of Python
    • Optimal compilation options for each platform (Linux/Darwin)
    • Python source archive as an external dependency, C/C++ Gradle plugin, multi-variants
    • Patches as needed for specific platform
    • Support for native packaging formats: RPM and pkg
    • Distribution into a specific path: /…/python/X.Y/…
    • Breaks the dependency on vendor’s Python and avoids tainted environment
    • Allows incremental migration (OS or language version)
  15. Code Migration

    • Keep compatibility with multiple versions of Python for libraries
    • Analyze the dependency graph of a web application
    • Modernize libraries with:
        • External dependencies only
        • One internal dependency
        • Products that depend on the previous groups [rinse and repeat]
    • Modernize the deployable application at the top of the graph
    • Spread the effort to other libraries used by CLI tools by engaging owners
  16. Multi-version testing

    • Add to the build system a feature that allows declaring the Python version used
    • Add a feature that allows declaring the supported versions for libraries
    • Needs to build with all declared supported versions
    • Needs to run tests with all declared supported versions
    • Ensures continuous compatibility
    • Should we use an existing Python tool (tox?) or build an extension in Gradle?
    • Investigation + prototype → Gradle extension was better for us
  17. Conditional dependencies

    • Dependencies vary between Python 2 and 3 for the same library
    • Python has native solutions for conditional dependencies
    • The setuptools has extras_require
    • An empty key in extras_require means the default requirement
    • The wheel package has support too
    • Environment markers: PEP 508
    • Another extension to enable markers in PyGradle with shortcut markers for ease-of-use
    • Additions to our custom distutils class helped to tie loose ends
  18. Environment markers in build.gradle

        dependencies {
            python(spec.external.avro) {
                environmentMarkers(it, 'py26')
                environmentMarkers(it, 'py27')
            }
            python(spec.external.'avro-python3') {
                environmentMarkers(it, 'py3')
            }
        }
  19. Environment markers in pinned.txt

    PyGradle converts the shortcut markers into expanded markers:

        avro-json-serializer==0.4.1
        ...
        lipy-utils==4.0.17
        avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'
        avro-python3==1.8.1 ; python_version>='3.4'
        configparser==3.5.0 ; python_version=='2.6' or python_version=='2.7'
        enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'
  20. Environment markers to extras_require

    Then, distgradle.GradleDistribution converts them into extras_require:

        install_requires=['avro-json-serializer', … , 'lipy-utils']
        extras_require={
            ":python_version>='3.4'": ['avro-python3'],
            ":python_version=='2.6' or python_version=='2.7'": [
                'avro', 'configparser', 'enum34'
            ]
        }
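The conversion from expanded markers to `extras_require` can be pictured with a small sketch. This is not the code of `distgradle.GradleDistribution`; the function name `split_requirements` and the parsing rules are assumptions made for illustration, and only the input lines and the shape of the output come from the slides.

```python
def split_requirements(pinned_lines):
    """Split 'name==version ; marker' lines into setuptools-style arguments.

    Hypothetical helper: unconditional requirements go to install_requires,
    conditional ones go to extras_require under a ':<marker>' key.
    """
    install_requires = []
    extras_require = {}
    for raw in pinned_lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        # Split off the marker first, so '==' inside the marker is left alone.
        requirement, _, marker = line.partition(';')
        name = requirement.split('==')[0].strip()
        marker = marker.strip()
        if marker:
            # An empty extra name before ':' makes this a conditional default.
            extras_require.setdefault(':' + marker, []).append(name)
        else:
            install_requires.append(name)
    return install_requires, extras_require


pinned = [
    "avro-json-serializer==0.4.1",
    "lipy-utils==4.0.17",
    "avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'",
    "avro-python3==1.8.1 ; python_version>='3.4'",
    "configparser==3.5.0 ; python_version=='2.6' or python_version=='2.7'",
    "enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'",
]
install_requires, extras_require = split_requirements(pinned)
```

The sketch drops the pinned versions because the slide shows bare names in the package metadata; a real build tool may well keep version constraints.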
  21. Python package metadata from markers

        ...
        [:python_version=='2.6' or python_version=='2.7']
        avro
        configparser
        enum34
        [:python_version>='3.4']
        avro-python3
  22. Wheel metadata from markers

        ...
        Requires-Dist: lipy-utils
        Requires-Dist: avro; python_version=='2.6' or python_version=='2.7'
        Requires-Dist: configparser; python_version=='2.6' or python_version=='2.7'
        Requires-Dist: enum34; python_version=='2.6' or python_version=='2.7'
        Requires-Dist: avro-python3; python_version>='3.4'
  23. Challenges AND TRADE-OFFS

  24. Platform issues SLOWER ADOPTION

    • Different versions of C libraries across
        • major OS versions: libffi
        • minor releases of the operating system: OpenSSL (RHEL)
    • Vendor stops shipping header files for a C library: OpenSSL (Apple)
    • How to resolve these issues?
        • Multi-variant builds for major OS versions
        • Custom OpenSSL product, which fulfills other needs too
        • The cleanpython products are linked against this OpenSSL

    format was made for Java and does not have some features
    • The subprocess32 package will break a build with Python 3
    • The functools32 package will work only with 2.7, but not 2.6 or 3
    • Wait, didn’t we solve that with environment markers?
    • Yes, but during the migration old packages don’t have correct metadata
    • They bring transitively “dependencies” that cannot be processed
    • In PyGradle, the environment markers extension keeps track of such dependencies
    • It excludes transitive dependencies based on the language version

    • Doing this with the old monolithic repo would be nearly impossible
    • Large products or dependency clusters of products are more difficult to migrate
    • Experience with 2.7 modernization in 2015: apollo cluster
    • Once that cluster was simplified it was easier for Python 3 migration
    • Lessons learned:
        • Loosely coupled code base enables more agile development
        • Open-sourcing PyGradle while building Python 3 support
        • Stepping on each other’s toes all the time :)
  27. Build system requirements

    • Pluggable: easy to add another build plugin for different type of artifact
    • Flexible: easy to add custom changes for one specific product
    • Extensible: easy to add support features, such as our markers and versions
    • Fit to the existing ecosystem:
        • Polyglot builds: Java, Scala, Python, JavaScript; all build together
        • Multiple products: a library and a deployable backend and frontend, for example
    • Python only builds with our older build system were OK; polyglot builds impossible
    • Had to upgrade setuptools, wheel, pip, virtualenv, …
  28. Porting CODE PATTERNS

  29. Bottom-to-top porting of libraries

    • Follow the dependency graph from the leaves up as described
    • Modernize the code
    • Upgrade PyGradle version and configuration files
    • Provide the correct environment markers
    • Bump the major version
    • Produce the updated package metadata for conditional dependencies
    • Proceed up the graph until all the libraries have the correct metadata
  30. Ensure the continued compatibility

    • Assert through multi-version testing
    • How to use?

        python {
            details.pythonVersion = '3.5'
            versions.supportedVersions = ['2.6', '2.7', '3.5']
        }

    • Using Python 3 as the default fails early if Python 3 compatibility is broken
  31. Recognize suspicious code patterns

    • I presented a talk internally at LinkedIn: Craftsmanship with Python: Code Patterns
    • It pointed out code patterns that cause issues during migration
    • Some of them are a natural result of the changes between Python 2/3
    • Some were caused by the original code
    • We documented them so that developers can port the products easier
    • The names in the code have been changed for space and privacy reasons
    • The irrelevant content was replaced by ellipses
  32. Porting deployable products

    • Deployable products are at the root of the dependency tree
    • Can be directly migrated to use only one version of Python
    • Use Python 3 unless you depend on a package that works with 2 only
    • They do not need straddling code
    • Consolidated language features
    • Simpler library APIs; see subprocess.run()
    • Ported an example Python web app and a widely used CLI tool generate-skeleton
  33. La ménagerie des folies

    • Relying on dictionary order is bad
    • Language idiom and performance: if key in d.keys(); list() can help
    • Ordering operators; total ordering; __eq__ and __ne__ both return True
    • unicode/str (ASCII) vs. str/bytes
    • Mind your encode/decode calls
    • Namespaces: how to handle and how to fix?
    • urllib, urllib2, and urlparse calls
  34. Misguided Assumptions

    • Some of the code and tests written in Python 2 worked without noticeable defects
    • Python 3 revealed places where the assumptions were incorrect, misguided, or misinterpreted
  35. “October. This is one of the particularly dangerous months to speculate in stocks. Others are November, December, January, February, March, April, May, June, July, August, and September.” (Mark Twain)
  36. Reliance on dictionary order

    • Observed in the test code most frequently
        • Makes tests flaky and unreliable
        • Fails randomly
    • It’s more pernicious in runtime code
        • Random failures
        • Hard to reproduce
    • Needs to be rooted out and replaced with stable code and test patterns
  37. Reliance on dictionary order in code

    First, the dictionary:

        valid_operators = {
            '<=': operator.le,
            '>=': operator.ge,
            '>': operator.gt,
            '<': operator.lt,
            '==': operator.eq,
            '!=': operator.ne,
        }
  38. Reliance on dictionary order in code …

    Original code (replaced the class name with C):

        for key, op in C.valid_operators.iteritems():
            # peek using key; consume if present; fetch operand

    Fixed code (the real fix would be to rewrite it as a solid parser):

        for key in sorted(C.valid_operators, key=len, reverse=True):
            op = C.valid_operators[key]
            # peek ..., consume ..., fetch ...
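Why the longest-first ordering matters is easier to see in a runnable form. The sketch below is an illustration only: the function `split_operator` and the way the operand is extracted are assumptions, since the slide elides the peek, consume, and fetch steps.

```python
import operator

VALID_OPERATORS = {
    '<=': operator.le,
    '>=': operator.ge,
    '>': operator.gt,
    '<': operator.lt,
    '==': operator.eq,
    '!=': operator.ne,
}


def split_operator(text):
    """Return (operator_function, operand_text) for a leading operator token."""
    # Try longer tokens first so that '<=' is never mistaken for '<'.
    # Iterating the dict directly would make the result depend on its order.
    for token in sorted(VALID_OPERATORS, key=len, reverse=True):
        if text.startswith(token):
            return VALID_OPERATORS[token], text[len(token):].strip()
    raise ValueError('no comparison operator at start of %r' % (text,))


op, operand = split_operator('<= 3.5')
```

With an unordered iteration, `'<'` could match the input `'<= 3.5'` first and leave `'= 3.5'` as the operand, which is the kind of intermittent failure the slide describes.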
  39. Reliance on dictionary order in tests

        <application xmlns="..." name="xyz" version="1.2.3">
          <!-- ... -->
          <set>
            <value>2010</value>
            <value>2012</value>
            <value>2014</value>
          </set>
          <!-- ... -->
        </application>

        <application xmlns="..." version="1.2.3" name="xyz">
          <!-- ... -->
          <set>
            <value>2014</value>
            <value>2010</value>
            <value>2012</value>
          </set>
          <!-- ... -->
        </application>
  40. Reliance on dictionary order in tests …

    Original test code:

        assert filecmp.cmp(produced_file, expected_file)

    Fixed test code (replaced the class name with A):

        produced = A.parse_from_file(produced_file)
        expected = A.parse_from_file(expected_file)
        assert expected == produced
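The class `A` and its `parse_from_file` method are internal and not shown, so the following is only a stand-in built on the standard library. It assumes a simplified document without the namespace, and a hypothetical helper `summarize` that reduces a document to order-independent data before comparison.

```python
import xml.etree.ElementTree as ET

PRODUCED = (
    '<application name="xyz" version="1.2.3">'
    '<set><value>2010</value><value>2012</value><value>2014</value></set>'
    '</application>'
)
EXPECTED = (
    '<application version="1.2.3" name="xyz">'
    '<set><value>2014</value><value>2010</value><value>2012</value></set>'
    '</application>'
)


def summarize(xml_text):
    """Reduce a document to data that ignores attribute and set-member order."""
    root = ET.fromstring(xml_text)
    values = frozenset(node.text for node in root.iter('value'))
    return dict(root.attrib), values


# The raw text differs, but the parsed content is the same.
same_text = PRODUCED == EXPECTED
same_content = summarize(PRODUCED) == summarize(EXPECTED)
```

Comparing parsed structures tests what the program means to produce, while a byte-for-byte file comparison also tests incidental ordering.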
  41. Be careful with testing repr output

    Due to changes between Python 2 and 3 the repr output can change:

        assert repr(A.m(mock_S, 'a', 'd')) == "<C '...', 'd', set([])>"

    The changed test is then:

        obj = A.m(mock_S, 'a', 'd')
        expected = "<C '...', 'd', {0}>".format(
            'set()' if six.PY3 else 'set([])')
        assert repr(obj) == expected
  42. Reconsider assumptions

    • Mappings are not guaranteed to keep the order of the keys
    • JSON objects are mappings
        • Do not rely on JSON order of entries
    • Order of XML attributes is not significant
        • Do not rely on order of XML attributes
    • Pass encoding parameter to JSON and XML parsers/processors if needed
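For JSON, the same advice can be followed by comparing decoded objects rather than serialized text. This is a minimal sketch with made-up sample data; the helper name `same_json` is an assumption, and the bytes are decoded explicitly so the encoding is never left to a default.

```python
import json

produced = '{"name": "xyz", "version": "1.2.3"}'
expected = '{"version": "1.2.3", "name": "xyz"}'


def same_json(left, right, encoding='utf-8'):
    """Compare two JSON documents by content; accepts text or encoded bytes."""
    def decode(doc):
        text = doc.decode(encoding) if isinstance(doc, bytes) else doc
        return json.loads(text)
    return decode(left) == decode(right)


text_equal = produced == expected
content_equal = same_json(produced, expected.encode('utf-8'))
```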
  43. Idiomatic Use

    • Spoken languages have idioms specific to each language
    • Mastering the idiomatic use of language is equally important with programming languages
  44. “The determined Real Programmer can write FORTRAN programs in any language.” (Ed Post, Real Programmers Don't Use Pascal, 1982)
  45. Performance concerns

    • A code pattern: if key in d.keys():
    • Found in the original code base
    • Python 3 modernizer turns it into: if key in list(d.keys()):
    • This is required to preserve the semantics
    • It also reveals what is wrong with this pattern
    • A simple, idiomatic, Pythonic, if key in d: is an O(1) lookup
    • The d.keys() creates a list in Python 2; then in is an O(n) lookup in the list
    • This code pattern turns an O(1) mapping lookup into an O(n) list lookup
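The two spellings give the same answer and differ only in cost, which a short sketch makes concrete. The function names are hypothetical; the second one mirrors the modernized form quoted on the slide.

```python
def has_key_idiomatic(mapping, key):
    # Hash lookup on the mapping itself: O(1) on average.
    return key in mapping


def has_key_modernized(mapping, key):
    # Builds a list of all keys, then scans it: O(n) time and O(n) extra memory.
    return key in list(mapping.keys())


d = {'alpha': 1, 'beta': 2}
results = [
    (has_key_idiomatic(d, k), has_key_modernized(d, k))
    for k in ('alpha', 'gamma')
]
```

In Python 3, `d.keys()` returns a view, so `key in d.keys()` is also a hash lookup there; the plain `key in d` form stays cheap on both Python 2 and 3.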
  46. O(1) traded for O(n)!

    • Exactly my thought when I saw this code.
  47. Namespaces

    • Some products had code modules in the linkedin namespace
    • To fix we had to:
        • Move src/linkedin/xxx.py to src/linkedin/xxx/xxx.py
        • Re-export for backward compatibility from src/linkedin/xxx/__init__.py

            from linkedin.xxx.xxx import *
            from linkedin.xxx.xxx import __doc__

        • Add :imported-members: to automodule directive for API documentation
        • Apply any custom handling of linkedin.xxx to linkedin.xxx.xxx
  48. Heed the warnings of the static analyzer

        assert (parse_lines(line) is
                '---',
                'Integration | Component | ...',
                'blah blah: something something')

        assert parse_lines(line) == (
            '---',
            'Integration | Component | ...',
            'blah blah: something something',
        )
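The first assertion on this slide can never fail, because the parentheses build a tuple and a non-empty tuple is always truthy. The sketch below shows the effect with a hypothetical stand-in for `parse_lines`; it uses `==` instead of `is` and avoids a literal `assert (…)` so that the example runs without syntax warnings.

```python
def parse_lines(line):
    """Hypothetical stand-in for the parser used on the slide."""
    return tuple(part.strip() for part in line.split(';'))


result = parse_lines('a; b; c')

# Same shape as the broken assertion: a comparison followed by two strings.
# The comparison is False, yet the 3-tuple as a whole is truthy.
broken_check = (result == 'wrong', 'b', 'c')
always_true = bool(broken_check)

# The intended check compares the whole result with the whole expected tuple.
intended_check = result == ('a', 'b', 'c')
```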
  49. “It's easier to ask forgiveness than it is to get permission.” (Grace Hopper)
  50. “Look before you leap.” (Proverb)

  51. EAFP vs. LBYL

    • LBYL is widely used in C programming
    • EAFP is preferable for readability, clarity, DRY, completeness, race conditions
  52. EAFP vs. LBYL in the code

        if isinstance(filename, file):
            filename = filename.name
        elif not isinstance(filename, six.string_types):
            filename = None

        if not isinstance(filename, six.string_types):
            # assume it's a file-like object
            try:
                filename = filename.name
            except AttributeError:
                filename = None
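A self-contained version of the EAFP variant might look like the sketch below. It targets Python 3 only, so it checks against `str` rather than `six.string_types`; the function name `filename_of` and the `Named` class are hypothetical.

```python
import io


def filename_of(source):
    """Return a file name for a path string or file-like object, else None."""
    if isinstance(source, str):
        return source
    # EAFP: ask for the attribute and handle its absence,
    # instead of checking for one specific file type up front.
    try:
        return source.name
    except AttributeError:
        return None


class Named(object):
    """Any object with a name attribute works, not only real files."""
    name = 'events.log'


from_path = filename_of('data.txt')
from_object = filename_of(Named())
from_buffer = filename_of(io.StringIO('no name here'))
```

The LBYL version depended on the `file` type, which no longer exists in Python 3, while this form accepts anything that behaves like a named file.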
  53. EAFP in the tests

        expected = [expected_counter, expected_gauge, expected_gauge2]
        expected_reordered = [
            expected_counter, expected_gauge2, expected_gauge]
        try:
            fake_post.assert_called_with(
                topic='...', host='...', app='...', metrics=expected)
        except AssertionError:
            fake_post.assert_called_with(
                topic='...', host='...', app='...',
                metrics=expected_reordered
            )
  54. Python Changes

    • What to look for?
    • Consolidation of the language
  55. Comparisons

    • No more cmp() in Python 3
    • Use instead __eq__ and __lt__ combined with functools total ordering

        @total_ordering
        class Comparable(object):
            def __eq__(self, other):
                if isinstance(other, self.__class__):
                    return self.x == other.x
                return NotImplemented

            def __lt__(self, other):
                # similar approach
  56. Total ordering in Python 2

    • Beware __ne__ with Python 2 total ordering:

        if six.PY2:
            def __ne__(self, other):
                equal = self.__eq__(other)
                return equal if equal is NotImplemented else not equal

    • Beware that __hash__ returns id() by default; define it instead

        def __hash__(self):
            return hash(self._comparison_keys())
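Putting the pieces from the two comparison slides together gives a class like the one below. The `Version` class and its fields are hypothetical; `_comparison_keys` follows the name used on the slide. The `__ne__` method is defined unconditionally here to keep the sketch free of `six`, which is harmless on Python 3.

```python
from functools import total_ordering


@total_ordering
class Version(object):
    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _comparison_keys(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() == other._comparison_keys()
        return NotImplemented

    def __lt__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() < other._comparison_keys()
        return NotImplemented

    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__, so spell it out.
        equal = self.__eq__(other)
        return equal if equal is NotImplemented else not equal

    def __hash__(self):
        # Equal objects must hash equally; Python 3 removes the default
        # __hash__ when __eq__ is defined, so it has to be provided.
        return hash(self._comparison_keys())


ordered = sorted([Version(3, 5), Version(2, 6), Version(2, 7)])
unique = {Version(2, 7), Version(2, 7)}
```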
  57. Strings vs. bytes

    • Python 2 code

        for line in packet.splitlines():

      becomes

        packet_as_str = (
            packet if not (six.PY3 and isinstance(packet, bytes))
            else packet.decode('utf-8')
        )
        for line in packet_as_str.splitlines():

    • Do not forget to add a test that explicitly passes in b'...' instead of a string
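The decode-at-the-boundary idea, together with the test the slide asks for, can be sketched as follows. This is a Python 3 only simplification without `six`; the function name `packet_lines` and the sample payloads are assumptions.

```python
def packet_lines(packet, encoding='utf-8'):
    """Return the text lines of a packet given as either bytes or str."""
    text = packet.decode(encoding) if isinstance(packet, bytes) else packet
    return text.splitlines()


from_text = packet_lines('caf\u00e9\nbar')
# Explicitly pass bytes as well, as the slide recommends.
from_bytes = packet_lines('caf\u00e9\nbar'.encode('utf-8'))
```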
  58. Know the binary APIs

    Python 2:

        base64.b64encode(hmac.new(secret, body, hashlib.sha1).digest())

    Converted to:

        def ensure_bytes(str_or_bytes):
            return (str_or_bytes if isinstance(str_or_bytes, bytes)
                    else str_or_bytes.encode('utf-8'))

        base64.b64encode(
            hmac.new(ensure_bytes(secret), ensure_bytes(body),
                     hashlib.sha1).digest())
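A runnable version of the converted snippet needs only the imports and some input. The wrapper function `sign` and the sample secret and body are hypothetical; `ensure_bytes` is taken from the slide.

```python
import base64
import hashlib
import hmac


def ensure_bytes(str_or_bytes):
    return (str_or_bytes if isinstance(str_or_bytes, bytes)
            else str_or_bytes.encode('utf-8'))


def sign(secret, body):
    """Return the base64-encoded HMAC-SHA1 signature as bytes."""
    digest = hmac.new(
        ensure_bytes(secret), ensure_bytes(body), hashlib.sha1).digest()
    return base64.b64encode(digest)


signature_from_text = sign('secret', 'body')
signature_from_bytes = sign(b'secret', b'body')
```

Note that `b64encode` returns bytes on Python 3, so a caller that needs text has to decode the result, for example with `.decode('ascii')`.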
  59. Learn the API changes

    String functions and methods:

        try:
            # Python 3 has maketrans() as a static method of str type.
            maketrans = str.maketrans
        except AttributeError:
            # Python 2 has it in string module.
            from string import maketrans
  60. The API consolidation

        from urllib import url2pathname, quote_plus, unquote_plus

        from six.moves.urllib.request import url2pathname
        from six.moves.urllib.parse import quote_plus, unquote_plus

        httplib.responses → six.moves.http_client.responses
        log.warn → log.warning
        sys.maxint → sys.maxsize
  61. API removals

    Watch for completely removed functions/methods:

        tmp = os.tempnam(os.path.dirname(dest), 'ts_')

        with tempfile.NamedTemporaryFile(
                dir=os.path.dirname(dest), prefix='ts_') as f:
            tmp = f.name

        from __future__ import absolute_import
        from __future__ import print_function

        print >> sys.stderr, repr(e) → print(repr(e), file=sys.stderr)
  62. Type changes

    Watch for type changes and provide the code that can handle them:

        ssl.PROTOCOL_SSLv23 → ssl.PROTOCOL_SSLv23.value

        if six.PY3:
            long = int

    Instead of the __metaclass__ attribute, the syntax changed to

        class Custom(metaclass=Meta):

    The six package provides compatibility:

        six.with_metaclass(Meta, Base)
  63. Exceptions changed

    The message attribute is deprecated
    The exceptions are not sequences any more, use args instead:

        str(err[-1]) → str(err.args[-1])
        except SomeException, variable → except SomeException as variable
        raise exc_class, value, traceback → six.reraise(exc_class, value, traceback)

    Beware: raise e vs. raise; do not hide traceback
    Exception chaining possible with Python 3: raise EXCEPTION from CAUSE
    Exception hierarchy changed; see Python documentation
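Several of these points fit into one short Python 3 sketch: the `as` form of `except`, reading the message through `args`, and chaining with `raise ... from`. The exception class `ConfigError` and the function `load_port` are hypothetical.

```python
class ConfigError(Exception):
    """Hypothetical application-level error."""


def load_port(raw):
    try:
        return int(raw)
    except ValueError as err:
        # Chain the low-level error so its traceback is not lost.
        raise ConfigError('invalid port: %r' % (raw,)) from err


try:
    load_port('eighty')
except ConfigError as err:
    # The name bound by 'as' is cleared when the block ends; keep a reference.
    caught = err

message = str(caught.args[-1])   # exceptions are not sequences; use args
cause = caught.__cause__         # the original ValueError
```

The `raise ... from` syntax does not parse on Python 2, so straddling code cannot use it directly; single-version deployable products can.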
  64. Standard library vs. backports

    • Replace the 3rd party backport packages with the standard library:
        • simplejson vs. json
        • mock vs. unittest.mock
        • subprocess32 vs. subprocess (especially in Python 3.5)
        • trollius vs. asyncio
    • Stay up-to-date with the changes in the language:
        • izip and similar functions are not needed any more
        • etree API can take encoding
        • subprocess can take universal_newlines=True
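As an illustration of the last bullet and of the simpler `subprocess.run()` API available from Python 3.5, the sketch below runs a child interpreter and reads its output as text. The command being run is made up for the example.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, '-c', 'print("hello")'],
    stdout=subprocess.PIPE,
    universal_newlines=True,  # decode output to str and normalize newlines
    check=True,               # raise CalledProcessError on a non-zero exit
)
output = result.stdout
```

Without `universal_newlines=True`, `result.stdout` would be bytes on Python 3 and the caller would have to decode it.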
  65. Status PYTHON 3

    29%, August 2017

    • Percentage of Python products building with Python 3
    • Early August 2017
    • LinkedIn
  66. Future

    • Python 2.6 will not be supported at all starting August 2017
    • We will make Python 3 the default
        • Automated migration to the new major version of PyGradle
        • Users can still set 2.7 as their default build interpreter
    • Pre-commit or build checks to warn about non-compatible code
    • Type checking
        • Requires an effort to annotate our code with type information
    • Product dashboard checks
  67. Thank you
