2017 - Bringing Python 3 to LinkedIn

Bringing Python 3 to LinkedIn Zvezdan Petković LinkedIn

Overview
• Introduction: Motivation, Goals, Comparisons, Justification
• Design: Python interpreters, Migration, Multi-version testing, Dependencies
• Challenges: Platform, Dependencies, Legacy code base, Build system
• Porting: Order, Compatibility, Code patterns, Deployable vs. library

Caveat emptor
“I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.” (Woody Allen)

Introduction MOTIVATION, GOALS, AND SOLUTIONS

Motivation: flexibility, independence, transparency, progress
• Use of system Python was hard-coded into our build tooling
• The vendor’s site-packages were heavily tainted
• Some of our acquisitions already used Python 2.7
• We used Python 2.6 because it was vended with RHEL 6
• RHEL 7 ships with Python 2.7 by default
• Our apps and build tooling would be blocked from update
• Internal command-line tools sometimes had to be rolled back because:
  • An external script (ab)used addsitedir() or activate_this.py
  • They depended on a tainted package from the system site-packages
• Python 2.6 was retired with the release of Python 2.6.9 on October 29, 2013
• Python 2.7 end-of-life is in 2020
• Our developers want to use the newest features, such as asyncio or type hints (PEP 484)

Goals: provide, enable, do
• Clean Python interpreters on all platforms we actively support
• Code that works with multiple versions of Python
• Support for multiple versions of Python
• Incremental migration
• Python 2.7: completed in December 2015
• Python 3: completed in September 2016
• Break the dependency on the single vendor-supplied version of Python
• Migrate base libraries: completed Fall 2016
• Pave the way for the full migration to Python 3

Existing solutions
Multiple Python versions
• The support for multiple Python interpreter versions can be accomplished through:
  • System vended Python
  • Clean Python
Code base for multiple versions
• There are existing tools that can migrate the code to work with multiple versions of Python

System vended Python
Advantages
• Convenience
• Vendors offer multiple versions of Python
• One of the versions is the default
• Red Hat offers Software Collections in RHEL 7 … but not for Python 2.6
• Apple supports library frameworks
Disadvantages
• The dependency on vendor for updates
• Incompatible backport changes
• Tainted libraries
• Interactions with 3rd party packages installed in site-packages
• Different paths to Python interpreter on different platforms or for different versions of Python

Clean Python
Advantages
• Isolation from changes caused by the vended Python
• Reproducibility of issues, without strange interactions
• Control over updates
Disadvantages
• Must maintain the custom build
• Must follow the security updates closely

Code support for multiple versions
• The package six abstracts the differences from the users
• Automated rewriting with python-modernize
• Uses six when it can
• Supports Python 2.6 to 3.x
• Straddling code
• All Python developers faced the need to support Python 2 and 3
• The best tools and practices developed over the last decade
• They maintain the compatibility with Python 2.6
• Extend the compatibility to the latest Python 3 version

Justification: what is the return on investment for such a migration?
Prevents
• Botched backport fixes
• Strange interactions with vendor's packages
• Problems caused by a package installed into system's site-packages
Enables
• Modernization reveals bad code patterns and improves our code base
• Migration of products to new versions of Python at developer’s own pace
• Mission-critical teams can use Python 3 features they need
• Be ready for 2020 instead of scrambling at the last moment
• Be leaders in the Python community

Design and Implementation INFRASTRUCTURE CHANGES

Requirements
• Clean Python interpreters
• Code migration
• Multi-version testing
• Conditional dependencies

Clean Python Interpreters
• A cleanpythonXY product for each X.Y version of Python
• Optimal compilation options for each platform (Linux/Darwin)
• Python source archive as an external dependency, C/C++ Gradle plugin, multi-variants
• Patches as needed for specific platform
• Support for native packaging formats: RPM and pkg
• Distribution into a specific path: /…/python/X.Y/…
• Breaks the dependency on vendor’s Python and avoids tainted environment
• Allows incremental migration by OS or language version

Code Migration
• Keep compatibility with multiple versions of Python for libraries
• Analyze the dependency graph of a web application
• Modernize libraries with:
  • External dependencies only
  • One internal dependency
  • Products that depend on the previous groups [rinse and repeat]
• Modernize the deployable application at the top of the graph
• Spread the effort to other libraries used by CLI tools by engaging owners
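The leaves-first order described on this slide can be sketched as a topological sort of the internal dependency graph. This is only an illustration: the product names and the `porting_order` helper are hypothetical, and the talk does not say which tool was used to compute the order.

```python
from collections import deque


def porting_order(dependencies):
    """Order products so each one comes after its internal dependencies.

    `dependencies` maps a product name to the set of internal products it
    depends on (hypothetical input format).
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    # Products that appear only as dependencies are leaves of the graph.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())
    dependents = {name: set() for name in remaining}
    for name, deps in remaining.items():
        for dep in deps:
            dependents[dep].add(name)
    # Start with libraries that have external dependencies only.
    ready = deque(sorted(n for n, deps in remaining.items() if not deps))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for dependent in sorted(dependents[name]):
            remaining[dependent].discard(name)
            if not remaining[dependent]:
                ready.append(dependent)
    if len(order) != len(remaining):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical product names, for illustration only.
graph = {
    "webapp": {"lib-api", "lib-utils"},
    "lib-api": {"lib-utils"},
    "lib-metrics": {"lib-utils"},
    "lib-utils": set(),
}
print(porting_order(graph))
```

The deployable application ends up last, which matches the "modernize the deployable application at the top of the graph" step.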

Multi-version testing
• Add to the build system a feature that allows declaring the Python version used
• Add a feature that allows declaring the supported versions for libraries
• Needs to build with all declared supported versions
• Needs to run tests with all declared supported versions
• Ensures continuous compatibility
• Should we use an existing Python tool (tox) or build an extension in Gradle?
• Investigation + prototype → a Gradle extension was better for us

Conditional dependencies
• Dependencies vary between Python 2 and 3 for the same library
• Python has native solutions for conditional dependencies
• The setuptools has extras_require
• An empty key in extras_require means the default requirement
• The wheel package has support too
• Environment markers: PEP 508
• Another extension to enable markers in PyGradle with shortcut markers for ease-of-use
• Additions to our custom distutils class helped to tie loose ends

Environment markers in build.gradle

    dependencies {
        python(spec.external.avro) {
            environmentMarkers(it, 'py26')
            environmentMarkers(it, 'py27')
        }
        python(spec.external.'avro-python3') {
            environmentMarkers(it, 'py3')
        }
    }

Environment markers in pinned.txt
PyGradle converts the shortcut markers into expanded markers:

    avro-json-serializer==0.4.1
    ...
    lipy-utils==4.0.17
    avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'
    avro-python3==1.8.1 ; python_version>='3.4'
    configparser==3.5.0 ; python_version=='2.6' or python_version=='2.7'
    enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'

Environment markers to extras_require
Then, distgradle.GradleDistribution converts them into extras_require:

    install_requires=['avro-json-serializer', … , 'lipy-utils']
    extras_require={
        ":python_version>='3.4'": ['avro-python3'],
        ":python_version=='2.6' or python_version=='2.7'": [
            'avro', 'configparser', 'enum34'
        ]
    }
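The conversion from pinned requirement lines to install_requires and extras_require could look roughly like the following. This is a minimal sketch, not the actual distgradle.GradleDistribution code; the function name `split_requirements` and its parsing rules are assumptions made for illustration.

```python
def split_requirements(pinned_lines):
    """Group pinned requirement lines by their environment marker.

    Hypothetical helper: lines without a marker go to install_requires,
    lines with a marker go to extras_require under a ':<marker>' key.
    """
    install_requires = []
    extras_require = {}
    for line in pinned_lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        requirement, _, marker = line.partition(';')
        # Drop the pinned version; only the package name is kept here.
        name = requirement.split('==')[0].strip()
        marker = marker.strip()
        if marker:
            extras_require.setdefault(':' + marker, []).append(name)
        else:
            install_requires.append(name)
    return install_requires, extras_require


pinned = [
    "avro-json-serializer==0.4.1",
    "lipy-utils==4.0.17",
    "avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'",
    "avro-python3==1.8.1 ; python_version>='3.4'",
    "configparser==3.5.0 ; python_version=='2.6' or python_version=='2.7'",
    "enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'",
]
install_requires, extras_require = split_requirements(pinned)
```

The empty extra name before the colon is what makes setuptools treat each group as a conditional default requirement rather than an optional extra.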

Python package metadata from markers

    ...
    [:python_version=='2.6' or python_version=='2.7']
    avro
    configparser
    enum34
    [:python_version>='3.4']
    avro-python3

Wheel metadata from markers

    ...
    Requires-Dist: lipy-utils
    Requires-Dist: avro; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: configparser; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: enum34; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: avro-python3; python_version>='3.4'

Challenges AND TRADE-OFFS

Platform issues: slower adoption
• Different versions of C libraries across
  • major OS versions: libffi
  • minor releases of the operating system: OpenSSL (RHEL)
• Vendor stops shipping header files for a C library: OpenSSL (Apple)
• How to resolve these issues?
  • Multi-variant builds for major OS versions
  • Custom OpenSSL product, which fulfills other needs too
  • The cleanpython products are linked against this OpenSSL

Dependency issues: need for improvements of build tooling
• Ivy format was made for Java and does not have some features
• The subprocess32 package will break a build with Python 3
• The functools32 package will work only with 2.7, but not 2.6 or 3
• Wait, didn’t we solve that with environment markers?
• Yes, but during the migration old packages don’t have correct metadata
• They bring transitively “dependencies” that cannot be processed
• In PyGradle, the environment markers extension keeps track of such dependencies
• It excludes transitive dependencies based on the language version

Code base issues: tight coupling makes for a harder migration
• Doing this with the old monolithic repo would be nearly impossible
• Large products or dependency clusters of products are more difficult to migrate
• Experience with 2.7 modernization in 2015: the apollo cluster
• Once that cluster was simplified it was easier for Python 3 migration
• Lessons learned:
• Loosely coupled code base enables more agile development
• Open-sourcing PyGradle while building Python 3 support
• Stepping on each other’s toes all the time :-)

Build system requirements
• Pluggable: easy to add another build plugin for a different type of artifact
• Flexible: easy to add custom changes for one specific product
• Extensible: easy to add support features, such as our markers and versions
• Fit to the existing ecosystem:
  • Polyglot builds: Java, Scala, Python, JavaScript; all build together
  • Multiple products: a library and a deployable backend and frontend, for example
• Python-only builds with our older build system were OK; polyglot builds were impossible
• Had to upgrade setuptools, wheel, pip, virtualenv, …

Porting CODE PATTERNS

Bottom-to-top porting of libraries
• Follow the dependency graph from the leaves up as described
• Modernize the code
• Upgrade PyGradle version and configuration files
• Provide the correct environment markers
• Bump the major version
• Produce the updated package metadata for conditional dependencies
• Proceed up the graph until all the libraries have the correct metadata

Ensure the continued compatibility
• Assert through multi-version testing
• How to use?

    python {
        details.pythonVersion = '3.5'
        versions.supportedVersions = ['2.6', '2.7', '3.5']
    }

• Using Python 3 as the default fails early if Python 3 compatibility is broken

Recognize suspicious code patterns
• I presented a talk internally at LinkedIn: Craftsmanship with Python: Code Patterns
• It pointed out code patterns that cause issues during migration
• Some of them are a natural result of the changes between Python 2 and 3
• Some were caused by the original code
• We documented them so that developers can port the products more easily
• The names in the code have been changed for space and privacy reasons
• The irrelevant content was replaced by ellipses

Porting deployable products
• Deployable products are at the root of the dependency tree
• Can be directly migrated to use only one version of Python
• Use Python 3 unless you depend on a package that works with 2 only
• They do not need straddling code
• Consolidated language features
• Simpler library APIs: see subprocess.run()
• Ported an example Python web app and a widely used CLI tool, generate-skeleton

La ménagerie des folies
• Relying on dictionary order is bad
• Language idiom and performance: if key in d.keys(); list() can help
• Ordering operators; total ordering; __eq__ and __ne__ both return True
• unicode/str (ASCII) vs. str/bytes
• Mind your encode/decode calls
• Namespaces: how to handle and how to fix?
• urllib, urllib2, and urlparse calls

Misguided Assumptions
• Some of the code and tests written in Python 2 worked without noticeable defects
• Python 3 revealed places where the assumptions were incorrect, misguided, or misinterpreted

“October. This is one of the particularly dangerous months to speculate in stocks. Others are November, December, January, February, March, April, May, June, July, August, and September.” (Mark Twain)

Reliance on dictionary order
• Observed in the test code most frequently
• Makes tests flaky and unreliable
• Fails randomly
• It’s more pernicious in runtime code
• Random failures
• Hard to reproduce
• Needs to be rooted out and replaced with stable code and test patterns

Reliance on dictionary order in code
First, the dictionary:

    valid_operators = {
        '<=': operator.le,
        '>=': operator.ge,
        '>': operator.gt,
        '<': operator.lt,
        '==': operator.eq,
        '!=': operator.ne,
    }

Reliance on dictionary order in code …
Original code (replaced the class name with C):

    for key, op in C.valid_operators.iteritems():
        # peek using key; consume if present; fetch operand

Fixed code (the real fix would be to rewrite it as a solid parser):

    for key in sorted(C.valid_operators, key=len, reverse=True):
        op = C.valid_operators[key]
        # peek ..., consume ..., fetch ...
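Why the longest-first ordering matters can be shown with a small sketch. The `match_operator` function below is a hypothetical stand-in for the elided "peek, consume, fetch" logic; it only demonstrates that trying '<=' before '<' gives a stable result regardless of dictionary order.

```python
import operator

VALID_OPERATORS = {
    '<=': operator.le,
    '>=': operator.ge,
    '>': operator.gt,
    '<': operator.lt,
    '==': operator.eq,
    '!=': operator.ne,
}


def match_operator(text):
    """Return (symbol, function, rest) for the operator that starts `text`.

    Hypothetical helper: symbols are tried longest first, so '<=' is never
    mistaken for '<' followed by '='.
    """
    for symbol in sorted(VALID_OPERATORS, key=len, reverse=True):
        if text.startswith(symbol):
            return symbol, VALID_OPERATORS[symbol], text[len(symbol):]
    raise ValueError('no operator at the start of {0!r}'.format(text))


symbol, op, rest = match_operator('<=3')
print(symbol, op(2, int(rest)))
```

If the symbols were tried in arbitrary dictionary order, '<' could match the input '<=3' first and leave '=3' as a malformed operand.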

Reliance on dictionary order in tests
Two equivalent documents that differ only in the order of attributes and values:

    <application xmlns="..." name="xyz" version="1.2.3">
      <set> <value>2010</value> <value>2012</value> <value>2014</value> </set>
    </application>

    <application xmlns="..." version="1.2.3" name="xyz">
      <set> <value>2014</value> <value>2010</value> <value>2012</value> </set>
    </application>

Reliance on dictionary order in tests …
Original test code:

    assert filecmp.cmp(produced_file, expected_file)

Fixed test code (replaced the class name with A):

    produced = A.parse_from_file(produced_file)
    expected = A.parse_from_file(expected_file)
    assert expected == produced
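The idea of comparing parsed content instead of file bytes can be sketched with the standard library. The slide's A.parse_from_file is not shown, so the `summarize` helper below is a hypothetical replacement, and the sample documents omit the xmlns attribute to keep the tag lookup simple.

```python
import xml.etree.ElementTree as ET


def summarize(xml_text):
    """Reduce a document to order-insensitive data (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Attributes become a dict and the values a set, so order is ignored.
    values = frozenset(node.text for node in root.iter('value'))
    return root.tag, dict(root.attrib), values


PRODUCED = (
    '<application name="xyz" version="1.2.3">'
    '<set><value>2010</value><value>2012</value><value>2014</value></set>'
    '</application>'
)
EXPECTED = (
    '<application version="1.2.3" name="xyz">'
    '<set><value>2014</value><value>2010</value><value>2012</value></set>'
    '</application>'
)

# The raw text differs, but the parsed content is the same.
assert PRODUCED != EXPECTED
assert summarize(PRODUCED) == summarize(EXPECTED)
```

Treating the values as a set is an assumption that fits a `<set>` element; a list would be the right choice where element order is significant.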

Be careful with testing repr output
Due to changes between Python 2 and 3 the repr output can change:

    assert repr(A.m(mock_S, 'a', 'd')) == "<C '...', 'd', set([])>"

The changed test is then:

    obj = A.m(mock_S, 'a', 'd')
    expected = "<C '...', 'd', {0}>".format(
        'set()' if six.PY3 else 'set([])')
    assert repr(obj) == expected

Reconsider assumptions
• Mappings are not guaranteed to keep the order of the keys
• JSON objects are mappings
• Do not rely on JSON order of entries
• Order of XML attributes is not significant
• Do not rely on order of XML attributes
• Pass encoding parameter to JSON and XML parsers/processors if needed
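For JSON, one way to avoid depending on entry order is to compare parsed values, or to serialize with sorted keys when a stable text form is needed. The helper names below are illustrative only.

```python
import json


def same_json(text_a, text_b):
    """Compare two JSON documents by value, ignoring object key order."""
    return json.loads(text_a) == json.loads(text_b)


def canonical_json(text):
    """Serialize with sorted keys so equal objects produce equal text."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(',', ':'))


produced = '{"name": "xyz", "version": "1.2.3"}'
expected = '{"version": "1.2.3", "name": "xyz"}'

assert produced != expected            # text comparison is order-sensitive
assert same_json(produced, expected)   # value comparison is not
assert canonical_json(produced) == canonical_json(expected)
```

JSON arrays stay order-sensitive under both helpers, which is the intended behavior because array order is significant.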

Idiomatic Use
• Spoken languages have idioms specific to each language
• Mastering the idiomatic use of a language is equally important for programming languages

“The determined Real Programmer can write FORTRAN programs in any language.” (Ed Post, Real Programmers Don't Use Pascal, 1982)

Performance concerns
• A code pattern found in the original code base: if key in d.keys():
• The Python 3 modernizer turns it into: if key in list(d.keys()):
• This is required to preserve the semantics
• It also reveals what is wrong with this pattern
• A simple, idiomatic, Pythonic if key in d: is an O(1) lookup
• In Python 2, d.keys() creates a list; then in is an O(n) lookup in that list
• This code pattern turns an O(1) mapping lookup into an O(n) list lookup
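The two forms give the same answer and differ only in cost, which a rough timing sketch can make visible. The function names and the dictionary size are arbitrary choices for illustration, and the printed timings will vary by machine.

```python
import timeit


def has_key_list(d, key):
    # What the modernizer emits: build a list of keys, then scan it (O(n)).
    return key in list(d.keys())


def has_key_direct(d, key):
    # Idiomatic form: a hash lookup on the mapping itself (O(1) on average).
    return key in d


if __name__ == "__main__":
    d = dict.fromkeys(range(100000))
    for fn in (has_key_list, has_key_direct):
        seconds = timeit.timeit(lambda: fn(d, 99999), number=100)
        print(fn.__name__, seconds)
```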

O(1) traded for O(n)! • Exactly my thought when I
saw this code.

Namespaces
• Some products had code modules in the linkedin namespace
• To fix we had to:
  • Move src/linkedin/xxx.py to src/linkedin/xxx/xxx.py
  • Re-export for backward compatibility from src/linkedin/xxx/__init__.py:

        from linkedin.xxx.xxx import *
        from linkedin.xxx.xxx import __doc__

  • Add :imported-members: to automodule directive for API documentation
  • Apply any custom handling of linkedin.xxx to linkedin.xxx.xxx

Heed the warnings of the static analyzer
Original test (the parentheses make the assert operand a non-empty tuple, so it always passes):

    assert (parse_lines(line) is '---',
            'Integration | Component | ...',
            'blah blah: something something')

Fixed test:

    assert parse_lines(line) == (
        '---',
        'Integration | Component | ...',
        'blah blah: something something',
    )
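The reason the first form can never fail is that a non-empty tuple is always truthy. The sketch below reproduces the shape of both assertions with a hypothetical stand-in for parse_lines, since the real function is not shown in the slides.

```python
def parse_lines(text):
    # Hypothetical stand-in for the function under test.
    return tuple(part.strip() for part in text.split(';'))


result = parse_lines('--- ; Integration')
wrong = ('not', 'matching')

# Shape of the broken assertion: the parentheses build a tuple, and a
# non-empty tuple is truthy even when its first element is False.
broken_operand = (result is wrong, 'second', 'third')
assert broken_operand              # passes, although the comparison failed
assert broken_operand[0] is False

# Shape of the fixed assertion: compare the value against the whole tuple.
assert result == ('---', 'Integration')
assert not (result == wrong)
```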

“It's easier to ask forgiveness than it is to get permission.” (Grace Hopper)

“Look before you leap.” (Proverb)

EAFP vs. LBYL
• LBYL is widely used in C programming
• EAFP is preferable for readability, clarity, DRY, completeness, race conditions

EAFP vs. LBYL in the code
LBYL (relies on the file type, which exists only in Python 2):

    if isinstance(filename, file):
        filename = filename.name
    elif not isinstance(filename, six.string_types):
        filename = None

EAFP:

    if not isinstance(filename, six.string_types):
        # assume it's a file-like object
        try:
            filename = filename.name
        except AttributeError:
            filename = None

EAFP in the tests

    expected = [expected_counter, expected_gauge, expected_gauge2]
    expected_reordered = [
        expected_counter, expected_gauge2, expected_gauge]
    try:
        fake_post.assert_called_with(
            topic='...', host='...', app='...', metrics=expected)
    except AssertionError:
        fake_post.assert_called_with(
            topic='...', host='...', app='...', metrics=expected_reordered
        )

Python Changes • What to look for? • Consolidation of
the language

Comparisons
• No more cmp() in Python 3
• Use instead __eq__ and __lt__ combined with functools.total_ordering

    import functools

    @functools.total_ordering
    class Comparable(object):
        def __eq__(self, other):
            if isinstance(other, self.__class__):
                return self.x == other.x
            return NotImplemented

        def __lt__(self, other):
            # similar approach
            if isinstance(other, self.__class__):
                return self.x < other.x
            return NotImplemented

Total ordering in Python 2
• Beware __ne__ with Python 2 total ordering:

    if six.PY2:
        def __ne__(self, other):
            equal = self.__eq__(other)
            return equal if equal is NotImplemented else not equal

• Beware that __hash__ returns id() by default; define it instead

    def __hash__(self):
        return hash(self._comparison_keys())
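Putting the comparison pieces together, a complete class might look like the sketch below. The `Version` class and its fields are hypothetical; `_comparison_keys` follows the name used on the slide, and `__ne__` is defined unconditionally here instead of under six.PY2 so the example does not depend on six.

```python
import functools


@functools.total_ordering
class Version(object):
    """Hypothetical value type ordered by (major, minor)."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _comparison_keys(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() == other._comparison_keys()
        return NotImplemented

    def __ne__(self, other):
        # Python 3 derives this from __eq__; Python 2 needs it spelled out.
        equal = self.__eq__(other)
        return equal if equal is NotImplemented else not equal

    def __lt__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() < other._comparison_keys()
        return NotImplemented

    def __hash__(self):
        # Equal objects must hash equally, so hash the comparison keys.
        return hash(self._comparison_keys())


versions = sorted([Version(3, 5), Version(2, 6), Version(2, 7)])
print([v._comparison_keys() for v in versions])
```

functools.total_ordering fills in __le__, __gt__, and __ge__ from __eq__ and __lt__, so only those two need to be written by hand.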

Strings vs. bytes
• Python 2 code

    for line in packet.splitlines():

  becomes

    packet_as_str = (
        packet if not (six.PY3 and isinstance(packet, bytes))
        else packet.decode('utf-8')
    )
    for line in packet_as_str.splitlines():

• Do not forget to add a test that explicitly passes in b'...' instead of a string

Know the binary APIs
Python 2:

    base64.b64encode(hmac.new(secret, body, hashlib.sha1).digest())

Converted to:

    def ensure_bytes(str_or_bytes):
        return (str_or_bytes if isinstance(str_or_bytes, bytes)
                else str_or_bytes.encode('utf-8'))

    base64.b64encode(
        hmac.new(ensure_bytes(secret), ensure_bytes(body),
                 hashlib.sha1).digest())
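A runnable sketch of the same conversion, showing why it is needed on Python 3: hmac.new() rejects a text key there. The `sign` and `sign_unconverted` wrappers are hypothetical names added for the demonstration, and the example assumes Python 3.

```python
import base64
import hashlib
import hmac


def ensure_bytes(str_or_bytes, encoding='utf-8'):
    """Return bytes unchanged and encode text, as on the slide."""
    return (str_or_bytes if isinstance(str_or_bytes, bytes)
            else str_or_bytes.encode(encoding))


def sign(secret, body):
    """Hypothetical wrapper: accepts text or bytes for both arguments."""
    digest = hmac.new(
        ensure_bytes(secret), ensure_bytes(body), hashlib.sha1).digest()
    return base64.b64encode(digest)


def sign_unconverted(secret, body):
    """The Python 2 form, kept only to show the Python 3 failure."""
    return base64.b64encode(hmac.new(secret, body, hashlib.sha1).digest())


assert sign('secret', 'body') == sign(b'secret', b'body')
try:
    sign_unconverted('secret', 'body')
except TypeError as err:
    print('Python 3 rejects a text key:', err)
```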

Learn the API changes
String functions and methods:

    try:
        # Python 3 has maketrans() as a static method of str type.
        maketrans = str.maketrans
    except AttributeError:
        # Python 2 has it in string module.
        from string import maketrans

The API consolidation

    from urllib import url2pathname, quote_plus, unquote_plus

becomes

    from six.moves.urllib.request import url2pathname
    from six.moves.urllib.parse import quote_plus, unquote_plus

Other renames:

    httplib.responses → six.moves.http_client.responses
    log.warn → log.warning
    sys.maxint → sys.maxsize

API removals
Watch for completely removed functions/methods:

    tmp = os.tempnam(os.path.dirname(dest), 'ts_')

is replaced by

    with tempfile.NamedTemporaryFile(
            dir=os.path.dirname(dest), prefix='ts_') as f:
        tmp = f.name

The print statement is gone as well:

    from __future__ import absolute_import
    from __future__ import print_function

    print >> sys.stderr, repr(e) → print(repr(e), file=sys.stderr)

Type changes
Watch for type changes and provide the code that can handle them:

    ssl.PROTOCOL_SSLv23 → ssl.PROTOCOL_SSLv23.value

    if six.PY3:
        long = int

Instead of the __metaclass__ attribute, the syntax changed to

    class Custom(metaclass=Meta):

The six package provides compatibility:

    six.with_metaclass(Meta, Base)

Exceptions changed
• The message attribute is deprecated
• The exceptions are not sequences any more, use args instead: str(err[-1]) → str(err.args[-1])
• except SomeException, variable → except SomeException as variable
• raise exc_class, value, traceback → six.reraise(exc_class, value, traceback)
• Beware: raise e vs. raise (do not hide the traceback)
• Exception chaining possible with Python 3: raise EXCEPTION from CAUSE
• Exception hierarchy changed; see Python documentation
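Several of these points fit in one short sketch: the `as` syntax, reading err.args instead of indexing the exception, and chaining with `raise ... from`. The ConfigError class and load_port function are hypothetical, and the chaining syntax runs on Python 3 only.

```python
class ConfigError(Exception):
    """Hypothetical application-level error."""


def load_port(settings):
    """Read a port number, chaining the low-level cause on failure."""
    try:
        return int(settings['port'])
    except (KeyError, ValueError) as err:
        # Exceptions are not sequences in Python 3: use err.args[-1].
        message = 'invalid port: {0}'.format(err.args[-1])
        # Chaining keeps the original exception and its traceback.
        raise ConfigError(message) from err


try:
    load_port({})
except ConfigError as exc:
    print(exc, '| caused by', type(exc.__cause__).__name__)
```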

Standard library vs. backports
• Replace the 3rd party backport packages with the standard library:
  • simplejson vs. json
  • mock vs. unittest.mock
  • subprocess32 vs. subprocess (especially in Python 3.5)
  • trollius vs. asyncio
• Stay up-to-date with the changes in the language:
  • izip and similar functions are not needed any more
  • etree API can take encoding
  • subprocess can take universal_newlines=True
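A small sketch of preferring the standard library while still tolerating the backport, together with subprocess.run() and universal_newlines=True. It assumes Python 3.5 or newer for subprocess.run(); the helper name `interpreter_major_version` is made up for the example.

```python
import subprocess
import sys

try:
    # Standard library location in Python 3.
    from unittest import mock
except ImportError:
    # Third-party backport, needed only on Python 2.
    import mock


def interpreter_major_version():
    """Ask a child interpreter for its major version, returned as text."""
    completed = subprocess.run(
        [sys.executable, '-c', 'import sys; print(sys.version_info[0])'],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
        check=True,               # raise CalledProcessError on failure
    )
    return completed.stdout.strip()


fake = mock.Mock(return_value='3')
print(interpreter_major_version(), fake())
```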

Status
• Python 3: 29% (August 2017)
• Percentage of Python products building with Python 3
• Early August 2017, LinkedIn

Future
• Python 2.6 will not be supported at all starting August 2017
• We will make Python 3 the default
• Automated migration to the new major version of PyGradle
• Users can still set 2.7 as their default build interpreter
• Pre-commit or build checks to warn about non-compatible code
• Type checking
• Requires an effort to annotate our code with type information
• Product dashboard checks

Thank you
