Slide 1

Slide 1 text

Bringing Python 3 to LinkedIn
Zvezdan Petković, LinkedIn

Slide 2

Slide 2 text

Overview
• Introduction: Motivation, Goals, Comparisons, Justification
• Design: Python interpreters, Migration, Multi-version testing, Dependencies
• Challenges: Platform, Dependencies, Legacy code base, Build system
• Porting: Order, Compatibility, Code patterns, Deployable vs. library

Slide 3

Slide 3 text

“I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.” (Woody Allen)
Caveat emptor

Slide 4

Slide 4 text

Introduction MOTIVATION, GOALS, AND SOLUTIONS

Slide 5

Slide 5 text

Motivation
• Use of system Python was hard-coded into our build tooling
• The vendor’s site-packages were heavily tainted
• Some of our acquisitions already used Python 2.7
Flexibility, Independence, Transparency, Progress
• We used Python 2.6 because it was vended with RHEL 6
• RHEL 7 ships with Python 2.7 by default
• Our apps and build tooling would be blocked from update
• Internal command-line tools had to be rolled back sometimes because:
  • An external script (ab)used addsitedir() or activate_this.py
  • Depended on the tainted package from system site-packages
• Python 2.6 retired with the release of Python 2.6.9 on October 29, 2013
• Python 2.7 end-of-life is in 2020
• Our developers want to use the newest features, such as asyncio or type hints (PEP 484)

Slide 6

Slide 6 text

Goals • Clean Python interpreters on all platforms we actively support • Code that works with multiple versions of Python • Support for multiple versions of Python Provide Enable Do • Incremental migration • Python 2.7 — completed in December 2015 • Python 3 — completed in September 2016 • Break the dependency on the single vendor supplied version of Python • Migrate base libraries — completed Fall 2016 • Pave the way for the full migration to Python 3

Slide 7

Slide 7 text

Existing solutions
Multiple Python versions
• The support for multiple Python interpreter versions can be accomplished through:
  • System vended Python
  • Clean Python
Code base for multiple versions
• There are existing tools that can migrate the code to work with multiple versions of Python

Slide 8

Slide 8 text

System vended Python
Advantages
• Convenience
• Vendors offer multiple versions of Python
• One of the versions is the default
• Red Hat offers Software Collections in RHEL 7 … but not for Python 2.6
• Apple supports library frameworks
Disadvantages
• The dependency on vendor for updates
• Incompatible backport changes
• Tainted libraries
• Interactions with 3rd party packages installed in site-packages
• Different paths to Python interpreter on different platforms or for different versions of Python

Slide 9

Slide 9 text

Clean Python
Advantages
• Isolation from changes caused by the vended Python
• Reproducibility of issues, without strange interactions
• Control over updates
Disadvantages
• Must maintain the custom build
• Must follow the security updates closely

Slide 10

Slide 10 text

Code support for multiple versions • The package six abstracts the differences from the users • Automated rewriting with python-modernize • Uses six when it can • Supports Python 2.6 to 3.x • Straddling code • All Python developers faced the need to support Python 2 and 3 • The best tools and practices developed over the last decade • They maintain the compatibility with Python 2.6 • Extend the compatibility to the latest Python 3 version

Slide 11

Slide 11 text

Justification ​WHAT IS A RETURN-ON-INVESTMENT FOR SUCH A MIGRATION? • Botched backport fixes • Strange interactions with vendor's packages • Problems caused by a package installed into system's site-packages • Modernization reveals bad code patterns and improves our code base Prevents Enables • Migration of products to new versions of Python at developer’s own pace • Mission-critical teams can use Python 3 features they need • Be ready for 2020 instead of scrambling in the last moment • Be leaders in Python community

Slide 12

Slide 12 text

Design and Implementation INFRASTRUCTURE CHANGES

Slide 13

Slide 13 text

Requirements • Clean Python interpreters • Code migration • Multi-version testing • Conditional dependencies

Slide 14

Slide 14 text

Clean Python Interpreters • A cleanpythonXY product for each X.Y version of Python • Optimal compilation options for each platform (Linux/Darwin) • Python source archive as an external dependency, C/C++ Gradle plugin, multi-variants • Patches as needed for specific platform • Support for native packaging formats — RPM and pkg • Distribution into a specific path: /…/python/X.Y/… • Breaks the dependency on vendor’s Python and avoids tainted environment • Allows incremental migration — OS or language version

Slide 15

Slide 15 text

Code Migration • Keep compatibility with multiple versions of Python for libraries • Analyze the dependency graph of a web application • Modernize libraries with: • External dependencies only • One internal dependency • Products that depend on the previous groups [rinse and repeat] • Modernize the deployable application at the top of the graph • Spread the effort to other libraries used by CLI tools by engaging owners

Slide 16

Slide 16 text

Multi-version testing • Add to the build system a feature that allows declaring the Python version used • Add a feature that allows declaring the supported versions for libraries • Needs to build with all declared supported versions • Needs to run tests with all declared supported versions • Ensures continuous compatibility • Should we use an existing Python tool (tox) or build an extension in Gradle? • Investigation + prototype → Gradle extension was better for us

Slide 17

Slide 17 text

Conditional dependencies • Dependencies vary between Python 2 and 3 for the same library • Python has native solutions for conditional dependencies • The setuptools has extras_require • An empty key in extras_require means the default requirement • The wheel package has support too • Environment markers: PEP 508 • Another extension to enable markers in PyGradle with shortcut markers for ease-of-use • Additions to our custom distutils class helped to tie loose ends

Slide 18

Slide 18 text

Environment markers in build.gradle
    dependencies {
        python(spec.external.avro) {
            environmentMarkers(it, 'py26')
            environmentMarkers(it, 'py27')
        }
        python(spec.external.'avro-python3') {
            environmentMarkers(it, 'py3')
        }
    }

Slide 19

Slide 19 text

Environment markers in pinned.txt
PyGradle converts the shortcut markers into expanded markers:
    avro-json-serializer==0.4.1
    ...
    lipy-utils==4.0.17
    avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'
    avro-python3==1.8.1 ; python_version>='3.4'
    configparser==3.5.0 ; python_version=='2.6' or python_version=='2.7'
    enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'

Slide 20

Slide 20 text

Environment markers to extras_require
Then, distgradle.GradleDistribution converts them into extras_require:
    install_requires=['avro-json-serializer', … , 'lipy-utils']
    extras_require={
        ":python_version>='3.4'": ['avro-python3'],
        ":python_version=='2.6' or python_version=='2.7'": [
            'avro', 'configparser', 'enum34'
        ]
    }
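The conversion from pinned lines to extras_require can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in distgradle.GradleDistribution; the function name split_requirements and the sample input are assumptions made for the example.

```python
def split_requirements(pinned_lines):
    """Split pinned requirement lines into (install_requires, extras_require).

    Hypothetical helper for illustration; not part of PyGradle or distgradle.
    """
    install_requires = []
    extras_require = {}
    for line in pinned_lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        requirement, _, marker = line.partition(';')
        # Drop the pinned version and keep the bare project name.
        name = requirement.split('==')[0].strip()
        marker = marker.strip()
        if marker:
            # An empty extra name before ':' makes the requirement depend
            # on the environment marker rather than on a named extra.
            extras_require.setdefault(':' + marker, []).append(name)
        else:
            install_requires.append(name)
    return install_requires, extras_require


pinned = [
    "lipy-utils==4.0.17",
    "avro==1.3.3 ; python_version=='2.6' or python_version=='2.7'",
    "avro-python3==1.8.1 ; python_version>='3.4'",
    "enum34==1.1.2 ; python_version=='2.6' or python_version=='2.7'",
]
install_requires, extras_require = split_requirements(pinned)
```

The two results have the shape that setuptools accepts for the install_requires and extras_require arguments of setup().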

Slide 21

Slide 21 text

Python package metadata from markers
    ...
    [:python_version=='2.6' or python_version=='2.7']
    avro
    configparser
    enum34
    [:python_version>='3.4']
    avro-python3

Slide 22

Slide 22 text

Wheel metadata from markers
    ...
    Requires-Dist: lipy-utils
    Requires-Dist: avro; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: configparser; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: enum34; python_version=='2.6' or python_version=='2.7'
    Requires-Dist: avro-python3; python_version>='3.4'

Slide 23

Slide 23 text

Challenges AND TRADE-OFFS

Slide 24

Slide 24 text

Platform issues ​SLOWER ADOPTION • Different versions of C libraries across • major OS versions — libffi • minor releases of the operating system — OpenSSL (RHEL) • Vendor stops shipping header files for a C library — OpenSSL (Apple) • How to resolve these issues? • Multi-variant builds for major OS versions • Custom OpenSSL product — fulfills other needs too • The cleanpython products are linked against this OpenSSL

Slide 25

Slide 25 text

Dependency issues ​NEED FOR IMPROVEMENTS OF BUILD TOOLING • Ivy format was made for Java and does not have some features • The subprocess32 package will break a build with Python 3 • The functools32 package will work only with 2.7, but not 2.6 or 3 • Wait, didn’t we solve that with environment markers? • Yes, but during the migration old packages don’t have correct metadata • They bring transitively “dependencies” that cannot be processed • In PyGradle, the environment markers extension keeps track of such dependencies • It excludes transitive dependencies based on the language version

Slide 26

Slide 26 text

Code base issues TIGHT COUPLING MAKES FOR A HARDER MIGRATION • Doing this with the old monolithic repo would be nearly impossible • Large products or dependency clusters of products are more difficult to migrate • Experience with 2.7 modernization in 2015 (apollo cluster) • Once that cluster was simplified it was easier for Python 3 migration • Lessons learned: • Loosely coupled code base enables more agile development • Open-sourcing PyGradle while building Python 3 support • Stepping on each other’s toes all the time :-)

Slide 27

Slide 27 text

Build system requirements • Pluggable — easy to add another build plugin for different type of artifact • Flexible — easy to add custom changes for one specific product • Extensible — easy to add support features, such as our markers and versions • Fit to the existing ecosystem: • Polyglot builds — Java, Scala, Python, JavaScript; all build together • Multiple products — a library and a deployable backend and frontend, for example • Python only builds with our older build system were OK — polyglot builds impossible • Had to upgrade setuptools, wheel, pip, virtualenv, …

Slide 28

Slide 28 text

Porting CODE PATTERNS

Slide 29

Slide 29 text

Bottom-to-top porting of libraries • Follow the dependency graph from the leaves up as described • Modernize the code • Upgrade PyGradle version and configuration files • Provide the correct environment markers • Bump the major version • Produce the updated package metadata for conditional dependencies • Proceed up the graph until all the libraries have the correct metadata
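The leaves-first order described above is a topological ordering of the dependency graph. A minimal sketch follows, assuming the graph is available as a plain dictionary; the function name porting_order and the product names are hypothetical and only serve the example.

```python
def porting_order(dependencies):
    """Return products ordered so that each one comes after its dependencies.

    `dependencies` maps a product name to the internal products it uses.
    Hypothetical helper for illustration only.
    """
    order = []
    visiting = set()
    done = set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError('dependency cycle at %s' % name)
        visiting.add(name)
        for dep in sorted(dependencies.get(name, ())):
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    for name in sorted(dependencies):
        visit(name)
    return order


graph = {
    'web-app': ['lib-b', 'lib-c'],   # deployable at the top of the graph
    'lib-c': ['lib-a'],
    'lib-b': ['lib-a'],
    'lib-a': [],                     # external dependencies only
}
order = porting_order(graph)
```

Libraries with only external dependencies come out first, and the deployable application comes out last.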

Slide 30

Slide 30 text

Ensure the continued compatibility
• Assert through multi-version testing
• How to use?
    python {
        details.pythonVersion = '3.5'
        versions.supportedVersions = ['2.6', '2.7', '3.5']
    }
• Using Python 3 as the default fails early if Python 3 compatibility is broken

Slide 31

Slide 31 text

Recognize suspicious code patterns • I presented a talk internally at LinkedIn — Craftsmanship with Python: Code Patterns • It pointed out code patterns that cause issues during migration • Some of them are a natural result of the changes between Python 2/3 • Some were caused by the original code • We documented them so that developers can port the products easier • The names in the code have been changed for space and privacy reasons • The irrelevant content was replaced by ellipses

Slide 32

Slide 32 text

Porting deployable products • Deployable products are at the root of the dependency tree • Can be directly migrated to use only one version of Python • Use Python 3 unless you depend on a package that works with 2 only • They do not need straddling code • Consolidated language features • Simpler library APIs — see subprocess.run() • Ported an example Python web app and a widely used CLI tool generate-skeleton

Slide 33

Slide 33 text

La ménagerie des folies • Relying on dictionary order is bad • Language idiom and performance — if key in d.keys() — list() can help • Ordering operators; total ordering; __eq__ and __ne__ both return True • unicode/str (ASCII) vs. str/bytes • Mind your encode/decode calls • Namespaces: how to handle and how to fix? • urllib, urllib2, and urlparse calls

Slide 34

Slide 34 text

Misguided Assumptions • Some of the code and tests written in Python 2 worked without noticeable defects • Python 3 revealed places where the assumptions were incorrect, misguided, or misinterpreted

Slide 35

Slide 35 text

“October. This is one of the particularly dangerous months to speculate in stocks. Others are November, December, January, February, March, April, May, June, July, August, and September.” (Mark Twain)

Slide 36

Slide 36 text

Reliance on dictionary order • Observed in the test code most frequently • Makes tests flaky and unreliable • Fails randomly • It’s more pernicious in runtime code • Random failures • Hard to reproduce • Needs to be rooted out and replaced with stable code and test patterns

Slide 37

Slide 37 text

Reliance on dictionary order in code
First, the dictionary:
    valid_operators = {
        '<=': operator.le,
        '>=': operator.ge,
        '>': operator.gt,
        '<': operator.lt,
        '==': operator.eq,
        '!=': operator.ne,
    }

Slide 38

Slide 38 text

Reliance on dictionary order in code …
Original code (replaced the class name with C):
    for key, op in C.valid_operators.iteritems():
        # peek using key; consume if present; fetch operand
Fixed code (the real fix would be to rewrite it as a solid parser):
    for key in sorted(C.valid_operators, key=len, reverse=True):
        op = C.valid_operators[key]
        # peek ..., consume ..., fetch ...
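Why the iteration order matters here: if '<' happens to be tried before '<=', the input '<= 3' is consumed as '<' followed by '= 3'. The sketch below shows the longest-first matching from the fix in a runnable form. The function split_comparison is a hypothetical stand-in for the elided peek/consume/fetch logic.

```python
import operator

VALID_OPERATORS = {
    '<=': operator.le,
    '>=': operator.ge,
    '>': operator.gt,
    '<': operator.lt,
    '==': operator.eq,
    '!=': operator.ne,
}


def split_comparison(text):
    """Return (operator function, operand text) for input such as '<= 3'."""
    # Try two-character operators first so '<=' is never mistaken for '<'.
    for key in sorted(VALID_OPERATORS, key=len, reverse=True):
        if text.startswith(key):
            return VALID_OPERATORS[key], text[len(key):].strip()
    raise ValueError('no comparison operator in %r' % (text,))


op, operand = split_comparison('<= 3')
```

Sorting by length makes the result independent of the dictionary's iteration order.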

Slide 39

Slide 39 text

Reliance on dictionary order in tests
[Figure: the same keys in two different orders: 2010, 2012, 2014 and 2014, 2010, 2012]

Slide 40

Slide 40 text

Reliance on dictionary order in tests …
Original test code:
    assert filecmp.cmp(produced_file, expected_file)
Fixed test code (replaced the class name with A):
    produced = A.parse_from_file(produced_file)
    expected = A.parse_from_file(expected_file)
    assert expected == produced

Slide 41

Slide 41 text

Be careful with testing repr output Due to changes between Python 2 and 3 the repr output can change: assert repr(A.m(mock_S, 'a', 'd')) == "" The changed test is then: obj = A.m(mock_S, 'a', 'd') expected = "".format( 'set()' if six.PY3 else 'set([])') assert repr(obj) == expected

Slide 42

Slide 42 text

Reconsider assumptions • Mappings are not guaranteed to keep the order of the keys • JSON objects are mappings • Do not rely on JSON order of entries • Order of XML attributes is not significant • Do not rely on order of XML attributes • Pass encoding parameter to JSON and XML parsers/processors if needed
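A small illustration of the JSON point, using only the standard library: two documents that differ only in key order are different as text but equal as parsed data. The sample strings are made up for the example.

```python
import json

produced = '{"b": 2, "a": 1}'
expected = '{"a": 1, "b": 2}'

# Comparing the serialized text depends on key order.
same_text = produced == expected

# Comparing the parsed mappings does not.
same_data = json.loads(produced) == json.loads(expected)
```

Tests that compare parsed objects stay stable regardless of the order in which a serializer emits the keys.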

Slide 43

Slide 43 text

Idiomatic Use • Spoken languages have idioms specific to each language • Mastering the idiomatic use of language is equally important with programming languages

Slide 44

Slide 44 text

“The determined Real Programmer can write FORTRAN programs in any language.” (Ed Post, Real Programmers Don't Use Pascal, 1982)

Slide 45

Slide 45 text

Performance concerns • A code pattern: if key in d.keys(): • Found in the original code base • Python 3 modernizer turns it into: if key in list(d.keys()): • This is required to preserve the semantics • It also reveals what is wrong with this pattern • A simple, idiomatic, Pythonic, if key in d: is an O(1) lookup • The d.keys() creates a list in Python 2; then in is an O(n) lookup in the list • This code pattern turns an O(1) mapping lookup into an O(n) list lookup
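The two spellings side by side, as a minimal sketch with a made-up dictionary. Both give the same answer; the difference is only in the work done to get it.

```python
d = {'alpha': 1, 'beta': 2}

# What the modernizer emits to preserve Python 2 semantics:
# build a list of all keys, then scan it linearly.
found_via_list = 'alpha' in list(d.keys())

# The idiomatic form: a single hash lookup in the mapping itself.
found_via_dict = 'alpha' in d
```

The idiomatic form behaves the same on Python 2 and Python 3, so it needs no straddling code at all.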

Slide 46

Slide 46 text

O(1) traded for O(n)! • Exactly my thought when I saw this code.

Slide 47

Slide 47 text

Namespaces
• Some products had code modules in the linkedin namespace
• To fix we had to:
  • Move src/linkedin/xxx.py to src/linkedin/xxx/xxx.py
  • Re-export for backward compatibility from src/linkedin/xxx/__init__.py
        from linkedin.xxx.xxx import *
        from linkedin.xxx.xxx import __doc__
  • Add :imported-members: to automodule directive for API documentation
  • Apply any custom handling of linkedin.xxx to linkedin.xxx.xxx

Slide 48

Slide 48 text

Heed the warnings of the static analyzer
Original (the parenthesized tuple is always truthy, so the assertion can never fail):
    assert (parse_lines(line) is '---',
            'Integration | Component | ...',
            'blah blah: something something')
Fixed:
    assert parse_lines(line) == (
        '---',
        'Integration | Component | ...',
        'blah blah: something something',
    )
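The reason the first form is dangerous can be checked directly: a non-empty tuple is truthy no matter what its elements are. In this sketch parse_lines is a hypothetical stub that deliberately returns the wrong value.

```python
def parse_lines(line):
    """Hypothetical stub that returns something other than the expected value."""
    return ('unexpected',)


# The condition of the buggy assert is a tuple, not a comparison.
buggy_condition = (parse_lines('x') == '---', 'Integration | Component | ...')

# The first element is False, yet the tuple as a whole is truthy,
# so `assert buggy_condition` would pass silently.
always_true = bool(buggy_condition)
```

Static analyzers flag this pattern, and recent Python 3 versions also emit a SyntaxWarning for an assert on a parenthesized tuple.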

Slide 49

Slide 49 text

“It's easier to ask forgiveness than it is to get permission.” (Grace Hopper)

Slide 50

Slide 50 text

“Look before you leap.” (Proverb)

Slide 51

Slide 51 text

EAFP vs. LBYL • LBYL is widely used in C programming • EAFP is preferable for readability, clarity, DRY, completeness, race conditions

Slide 52

Slide 52 text

EAFP vs. LBYL in the code
LBYL (relies on the file built-in, which Python 3 no longer has):
    if isinstance(filename, file):
        filename = filename.name
    elif not isinstance(filename, six.string_types):
        filename = None
EAFP:
    if not isinstance(filename, six.string_types):
        # assume it's a file-like object
        try:
            filename = filename.name
        except AttributeError:
            filename = None
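The EAFP variant wrapped in a function so it can be exercised, as a sketch for Python 3 only: it uses str where the slide uses six.string_types, and the names filename_of and Named are hypothetical.

```python
def filename_of(source):
    """Return a file name for a path string or a file-like object, else None."""
    if isinstance(source, str):
        return source
    # Assume a file-like object and ask for its name (EAFP).
    try:
        return source.name
    except AttributeError:
        return None


class Named(object):
    """Minimal stand-in for a file-like object that carries a name."""
    name = 'app.log'


from_path = filename_of('data.txt')
from_object = filename_of(Named())
from_other = filename_of(object())
```

Any object with a name attribute is accepted, which covers file-like types that an isinstance check against one concrete class would reject.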

Slide 53

Slide 53 text

EAFP in the tests
    expected = [expected_counter, expected_gauge, expected_gauge2]
    expected_reordered = [
        expected_counter, expected_gauge2, expected_gauge]
    try:
        fake_post.assert_called_with(
            topic='...', host='...', app='...', metrics=expected)
    except AssertionError:
        fake_post.assert_called_with(
            topic='...', host='...', app='...', metrics=expected_reordered
        )

Slide 54

Slide 54 text

Python Changes • What to look for? • Consolidation of the language

Slide 55

Slide 55 text

Comparisons
• No more cmp() in Python 3
• Use __eq__ and __lt__ instead, combined with functools.total_ordering
    @functools.total_ordering
    class Comparable(object):
        def __eq__(self, other):
            if isinstance(other, self.__class__):
                return self.x == other.x
            return NotImplemented
        def __lt__(self, other):
            # similar approach

Slide 56

Slide 56 text

Total ordering in Python 2
• Beware __ne__ with Python 2 total ordering:
    if six.PY2:
        def __ne__(self, other):
            equal = self.__eq__(other)
            return equal if equal is NotImplemented else not equal
• Beware that __hash__ returns id() by default; define it instead
    def __hash__(self):
        return hash(self._comparison_keys())
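Putting the pieces from the two comparison slides together gives a class like the following. This is a sketch for Python 3, so the six.PY2 branch for __ne__ is left out; the class name Version and its fields are made up for the example.

```python
import functools


@functools.total_ordering
class Version(object):
    """Illustrative comparable value with a consistent hash."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _comparison_keys(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() == other._comparison_keys()
        return NotImplemented

    def __lt__(self, other):
        if isinstance(other, self.__class__):
            return self._comparison_keys() < other._comparison_keys()
        return NotImplemented

    def __hash__(self):
        # Equal objects must hash equally, so hash the comparison keys.
        return hash(self._comparison_keys())


older, newer = Version(2, 7), Version(3, 5)
```

functools.total_ordering derives the remaining ordering methods from __eq__ and __lt__, and the explicit __hash__ keeps instances usable in sets and as dictionary keys.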

Slide 57

Slide 57 text

Strings vs. bytes
• Python 2 code
    for line in packet.splitlines():
  becomes
    packet_as_str = (
        packet if not (six.PY3 and isinstance(packet, bytes))
        else packet.decode('utf-8')
    )
    for line in packet_as_str.splitlines():
• Do not forget to add a test that explicitly passes in b'...' instead of a string
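A sketch of the kind of test the last bullet asks for, written for Python 3 and without six. The helper lines_of is hypothetical and stands in for the code that iterates over the packet.

```python
def lines_of(packet):
    """Return the lines of a packet given as text or as UTF-8 encoded bytes."""
    text = packet.decode('utf-8') if isinstance(packet, bytes) else packet
    return text.splitlines()


# Exercise both input types explicitly.
from_text = lines_of('first\nsecond')
from_bytes = lines_of(b'first\nsecond')
```

Passing the bytes literal explicitly ensures that the decode branch is covered on every interpreter the tests run with.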

Slide 58

Slide 58 text

Know the binary APIs
Python 2:
    base64.b64encode(hmac.new(secret, body, hashlib.sha1).digest())
Converted to:
    def ensure_bytes(str_or_bytes):
        return (str_or_bytes if isinstance(str_or_bytes, bytes)
                else str_or_bytes.encode('utf-8'))
    base64.b64encode(
        hmac.new(ensure_bytes(secret), ensure_bytes(body),
                 hashlib.sha1).digest())

Slide 59

Slide 59 text

Learn the API changes
String functions and methods:
    try:
        # Python 3 has maketrans() as a static method of str type.
        maketrans = str.maketrans
    except AttributeError:
        # Python 2 has it in string module.
        from string import maketrans

Slide 60

Slide 60 text

The API consolidation
Python 2:
    from urllib import url2pathname, quote_plus, unquote_plus
With six:
    from six.moves.urllib.request import url2pathname
    from six.moves.urllib.parse import quote_plus, unquote_plus
Renames:
    httplib.responses → six.moves.http_client.responses
    log.warn → log.warning
    sys.maxint → sys.maxsize

Slide 61

Slide 61 text

API removals
Watch for completely removed functions/methods:
    tmp = os.tempnam(os.path.dirname(dest), 'ts_')
becomes
    with tempfile.NamedTemporaryFile(
            dir=os.path.dirname(dest), prefix='ts_') as f:
        tmp = f.name
Also:
    from __future__ import absolute_import
    from __future__ import print_function
    print >> sys.stderr, repr(e) → print(repr(e), file=sys.stderr)

Slide 62

Slide 62 text

Type changes
Watch for type changes and provide the code that can handle them:
    ssl.PROTOCOL_SSLv23 → ssl.PROTOCOL_SSLv23.value
    if six.PY3:
        long = int
Instead of the __metaclass__ attribute, the syntax changed to
    class Custom(metaclass=Meta):
The six package provides compatibility:
    six.with_metaclass(Meta, Base)

Slide 63

Slide 63 text

Exceptions changed
The message attribute is deprecated
The exceptions are not sequences any more, use args instead:
    str(err[-1]) → str(err.args[-1])
    except SomeException, variable → except SomeException as variable
    raise exc_class, value, traceback → six.reraise(exc_class, value, traceback)
Beware: raise e vs. raise (do not hide the traceback)
Exception chaining possible with Python 3: raise EXCEPTION from CAUSE
Exception hierarchy changed (see Python documentation)
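A short Python 3 sketch that combines three of the points above: the as syntax, access through args, and exception chaining. ConfigError and load_port are hypothetical names used only for the example.

```python
class ConfigError(Exception):
    """Illustrative application-level error."""


def load_port(raw):
    """Convert a raw setting to an integer port, chaining the original error."""
    try:
        return int(raw)
    except ValueError as err:
        # `from err` records the original exception as __cause__,
        # so its traceback is preserved rather than hidden.
        raise ConfigError('invalid port: %r' % (raw,)) from err


try:
    load_port('http')
except ConfigError as err:
    # Exceptions are not sequences any more; use args.
    message = str(err.args[-1])
    cause = err.__cause__
```

The values are copied into message and cause inside the handler because Python 3 deletes the name bound by except ... as when the handler ends.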

Slide 64

Slide 64 text

Standard library vs. backports • Replace the 3rd party backport packages with the standard library: • simplejson vs. json • mock vs. unittest.mock • subprocess32 vs. subprocess (especially in Python 3.5) • trollius vs. asyncio • Stay up-to-date with the changes in the language: • izip and similar functions are not needed any more • etree API can take encoding • subprocess can take universal_newlines=True
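As an example of the standard library replacing a backport, the sketch below uses subprocess.run() (available from Python 3.5) with universal_newlines=True so that the output arrives as text. The child command is made up for the example and simply prints a word.

```python
import subprocess
import sys

# Run a child Python process and capture its standard output as text.
result = subprocess.run(
    [sys.executable, '-c', 'print("hello")'],
    stdout=subprocess.PIPE,
    universal_newlines=True,   # decode output to str instead of bytes
    check=True,                # raise CalledProcessError on non-zero exit
)
output = result.stdout
```

With universal_newlines=True there is no need for a manual decode call on the captured output.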

Slide 65

Slide 65 text

Status: Python 3
• 29% of Python products at LinkedIn were building with Python 3 in early August 2017

Slide 66

Slide 66 text

Future • Python 2.6 will not be supported at all starting August 2017 • We will make Python 3 the default • Automated migration to the new major version of PyGradle • Users can still set 2.7 as their default build interpreter • Pre-commit or build checks to warn about non-compatible code • Type checking • Requires an effort to annotate our code with type information • Product dashboard checks

Slide 67

Slide 67 text

Thank you
