Slide 1

Slide 1 text

Thoth: How to recommend the best possible packages for your application
Fridolín Pokorný
2019-May-4

Slide 2

Slide 2 text

Thoth Station
$ whoami
● Fridolín “fridex” Pokorný
● Senior Software Engineer at Red Hat
● Distributed systems, AI/ML and (of course) Python fan
● Projects:
  ○ Reverse engineering tool RetDec (AVG)
  ○ Linux kernel TLS/DTLS module AF_KTLS
  ○ Selinon - distributed task flow scheduler on top of Celery
  ○ Project Thoth
https://fridex.github.io

Slide 3

Slide 3 text

$ whoarewe # project Thoth
● Project Thoth - https://github.com/thoth-station
● Red Hat - Office of the CTO
  ○ Emerging technologies
  ○ AI team - https://github.com/aicoe
● Initially 2 engineers, now growing
  ○ Christoph Görn
  ○ Francesco Murdaca
  ○ Fridolín Pokorný
  ○ Harshad Reddy Nalla
  ○ Marek Cermak
  ○ Subin Modeel
https://thoth-station.ninja/

Slide 4

Slide 4 text

What is Thoth? Why Thoth?

Slide 5

Slide 5 text

Why Thoth?
● PyPI - Python Package Index
  ○ https://pypi.org/
  ○ 178,016 projects
  ○ 1,303,926 releases (approx. 7 releases per project)

Slide 6

Slide 6 text

Why Thoth?
    import tensorflow as tf
    from flask import Flask
    application = Flask(__name__)

Slide 8

Slide 8 text

    $ pip3 install --user tensorflow
    $ pip3 install --user flask
    $ python3 ./app.py
    Error: tensorflow 1.10.1 has requirement numpy<=1.14.5,>=1.13.3, but you'll have numpy 1.15.1 which is incompatible.
    $
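The conflict reported above can be reproduced with a small range check. This is only a rough sketch: the helper names `parse` and `satisfies` are hypothetical, and the comparison handles plain numeric versions only, whereas real installers implement full PEP 440 version handling.

```python
def parse(version: str) -> tuple:
    """Turn a purely numeric version such as '1.15.1' into (1, 15, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version <= upper for simple numeric versions."""
    return parse(lower) <= parse(version) <= parse(upper)


# tensorflow 1.10.1 requires numpy>=1.13.3,<=1.14.5 (taken from the error message).
print(satisfies("1.15.1", "1.13.3", "1.14.5"))  # the installed numpy is out of range
print(satisfies("1.14.5", "1.13.3", "1.14.5"))  # a numpy release inside the range
```

Tuples of integers are compared element by element, which avoids the string-comparison trap where "1.9" would sort after "1.15".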

Slide 9

Slide 9 text

Why Thoth?
    import tensorflow as tf         # 59 releases
    from flask import Flask         # 28 releases
    application = Flask(__name__)

Slide 10

Slide 10 text

Why Thoth?
    import tensorflow as tf         # 59 releases
    from flask import Flask         # 28 releases
    application = Flask(__name__)
All combinations in which the directly used libraries can be installed: 59 * 28 = 1,652
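The count of direct-dependency combinations can be checked by enumerating the pairs. The sketch below uses placeholder release indexes instead of real version strings, since only the release counts appear on the slide.

```python
from itertools import product

tensorflow_releases = 59  # release count from the slide
flask_releases = 28       # release count from the slide

# Each pair stands for one TensorFlow release installed next to one Flask release.
pairs = list(product(range(tensorflow_releases), range(flask_releases)))
print(len(pairs))  # 1652
```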

Slide 11

Slide 11 text

Transitive dependencies
● Flask
  ○ click, itsdangerous, jinja2, markupsafe, werkzeug
Estimated number of combinations: 54,395,000

Slide 12

Slide 12 text

Transitive dependencies
● TensorFlow
  ○ absl-py, astor, backports-weakref, bleach, enum34, gast, google-pasta, grpcio, h5py, html5lib, keras, keras-applications, keras-preprocessing, markdown, mock, numpy, pbr, protobuf, pyyaml, scipy, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-tensorboard, termcolor, tf-estimator-nightly, werkzeug, wheel
Estimated number of combinations: 139,740,802,927,165,440,000 (approx. 1.39 * 10^20)

Slide 13

Slide 13 text

Why Thoth?
    import tensorflow as tf         # 1.39 * 10^20 combinations
    from flask import Flask         # 54,395,000 combinations
    application = Flask(__name__)
All combinations in which the application stack of directly and indirectly used libraries can be installed (estimate): 1.39 * 10^20 * 54,395,000 = 7.6 * 10^27
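The overall estimate is the product of the two per-library estimates quoted on the earlier slides. A quick check with Python's arbitrary-precision integers:

```python
tensorflow_stack_combinations = 139_740_802_927_165_440_000  # approx. 1.39 * 10^20
flask_stack_combinations = 54_395_000

total = tensorflow_stack_combinations * flask_stack_combinations
print(f"{total:.2e}")  # 7.60e+27
```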

Slide 14

Slide 14 text

Why Thoth?
    import pandas as pd
    import tensorflow as tf
    from flask import Flask
    application = Flask(__name__)
Operating System: Fedora 30, Fedora 29, ..., CentOS 7.6, CentOS 7.5, …
Python interpreter

Slide 15

Slide 15 text

Why Thoth?
    import pandas as pd
    import tensorflow as tf
    from flask import Flask
    application = Flask(__name__)
Underlying layers: Operating System, Python interpreter, glibc, cuda

Slide 16

Slide 16 text

Why Thoth?
    import pandas as pd
    import tensorflow as tf
    from flask import Flask
    application = Flask(__name__)
Underlying layers: Operating System, Python interpreter, glibc, cuda
Hardware: GPU, CPU

Slide 17

Slide 17 text

Why Thoth? Python application

Slide 18

Slide 18 text

Why Thoth?
Layers around the Python application:
● Hardware
● Operating System
● Python interpreter
● Native dependencies
● Kernel modules
● Direct Python dependencies
● Transitive Python dependencies

Slide 19

Slide 19 text

Why Thoth?
● Create a knowledge base
  ○ Which packages, in which versions, should I use?
    ■ Application builds correctly
    ■ Application runs correctly
    ■ Application behaves and performs well
● Create an advanced Python resolver that uses the knowledge base to resolve software stacks
The latest versions are not always the greatest choices.

Slide 20

Slide 20 text

Building Thoth’s knowledge base

Slide 21

Slide 21 text

Gathering data for Thoth’s knowledge base
● Resolving software stacks
  ○ own resolution algorithm
● Analyses of container images
  ○ JupyterHub images
  ○ Thoth’s container images
● Amun, Dependency Monkey
  ○ running CI and performance-related tests
  ○ performance-related analyses
● ...

Slide 22

Slide 22 text

Optimized TensorFlow builds by Thoth team
● Automated tests of libraries
● Tests targeting performance
● Optimized TensorFlow builds
https://tensorflow.pypi.thoth-station.ninja/

Slide 23

Slide 23 text

Recommendations

Slide 24

Slide 24 text

How good is my software stack? simplelib anotherlib

Slide 25

Slide 25 text

Thoth Station

Slide 26

Slide 26 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1, v2)]

Slide 27

Slide 27 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1, v2)]
pip/Pipenv (always latest):
    simplelib==v2
    anotherlib==v2
    dependency2==v2

Slide 29

Slide 29 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1, v2), with an annotation:]
Causes errors based on Thoth’s knowledge base.

Slide 30

Slide 30 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1 only)]

Slide 31

Slide 31 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1 only)]
simplelib in version v1 performs better together with dependency1 in version v1, based on Thoth’s knowledge base.

Slide 33

Slide 33 text

[Diagram: dependency graph with simplelib (v1, v2), anotherlib (v1, v2), dependency1 (v1, v2) and dependency2 (v1 only)]
Thoth (always greatest):
    simplelib==v1
    anotherlib==v2
    dependency1==v1
    dependency2==v1
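The difference between "always latest" and "always greatest" can be sketched as a scoring problem over all candidate stacks. Everything below is a toy model, not Thoth's resolver: the `known_errors` and `perf_bonus` entries and the numeric weights are assumptions chosen to mirror the slides (taking dependency2 v2 as the version that causes errors), and the sketch ignores the edges of the dependency graph.

```python
from itertools import product

# Versions available for each package, as in the diagram.
versions = {
    "simplelib": ["v1", "v2"],
    "anotherlib": ["v1", "v2"],
    "dependency1": ["v1", "v2"],
    "dependency2": ["v1", "v2"],
}

# Hypothetical knowledge base entries (assumptions for illustration).
known_errors = {("dependency2", "v2")}
perf_bonus = {(("simplelib", "v1"), ("dependency1", "v1")): 1.0}


def score(stack: dict) -> float:
    """Return -inf for stacks with a known error, else recency plus bonuses."""
    if any(item in known_errors for item in stack.items()):
        return float("-inf")
    # Mild preference for newer releases: 0.1 per package pinned to v2.
    total = 0.1 * sum(version == "v2" for version in stack.values())
    for (first, second), bonus in perf_bonus.items():
        if stack[first[0]] == first[1] and stack[second[0]] == second[1]:
            total += bonus
    return total


names = list(versions)
stacks = [dict(zip(names, combo)) for combo in product(*versions.values())]
best = max(stacks, key=score)
print(best)
```

With these assumed weights the best-scoring stack matches the pins listed for Thoth above, while a resolver that only maximizes recency would pick v2 everywhere.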

Slide 34

Slide 34 text

Stack generation pipeline
Inputs: Runtime environment, Requirements, Analysis of application
Steps:
● Remove pre-releases
● Construct dependency graph
● Remove install errors
● Remove run errors
● Adjust based on performance
● Adjust based on security
● Sort based on semver
● Resolved stacks stream
● Performance based scoring
● Security based scoring
● Lockfile generation
● Final score gating
Outputs: Lock file, Justification
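A pipeline like this can be modelled as a list of functions applied in order to a list of candidates. The sketch below covers only three of the steps and is not Thoth's implementation: `examplelib`, the `INSTALL_ERRORS` entry, and all function names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (package name, version string)
Step = Callable[[List[Candidate]], List[Candidate]]

# Hypothetical knowledge base entry: a release known to fail on install.
INSTALL_ERRORS = {("examplelib", "1.1.0")}


def remove_pre_releases(candidates: List[Candidate]) -> List[Candidate]:
    """Drop versions ending in an alpha/beta/rc marker, e.g. '2.0.0rc1'."""
    return [c for c in candidates if not re.search(r"(a|b|rc)\d*$", c[1])]


def remove_install_errors(candidates: List[Candidate]) -> List[Candidate]:
    """Drop candidates recorded as install failures in the knowledge base."""
    return [c for c in candidates if c not in INSTALL_ERRORS]


def sort_newest_first(candidates: List[Candidate]) -> List[Candidate]:
    """Order by numeric version components; expects pre-releases already removed."""
    return sorted(candidates, key=lambda c: tuple(int(p) for p in c[1].split(".")), reverse=True)


def run_pipeline(candidates: List[Candidate], steps: List[Step]) -> List[Candidate]:
    """Feed the output of each step into the next one."""
    for step in steps:
        candidates = step(candidates)
    return candidates


candidates = [
    ("examplelib", "1.0.0"),
    ("examplelib", "1.1.0"),
    ("examplelib", "2.0.0rc1"),
    ("examplelib", "0.9.0"),
]
result = run_pipeline(candidates, [remove_pre_releases, remove_install_errors, sort_newest_first])
print(result)  # [('examplelib', '1.0.0'), ('examplelib', '0.9.0')]
```

Keeping each step as a plain function makes it easy to add, drop, or reorder steps, which suits a pipeline whose filters depend on what the knowledge base currently contains.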

Slide 35

Slide 35 text

Extending information about Python packages

Slide 36

Slide 36 text

Python package metadata
● License
● Classifiers
  ○ Programming Language :: Python :: 3.6
  ○ Programming Language :: Python :: Implementation :: CPython
● Package purpose
  ○ machine learning library
  ○ plugin
  ○ …
● Is the given package affecting performance?

Slide 37

Slide 37 text

project2vec
● A vector space model
● Each vector in the vector space corresponds to a project
● Each item in a vector represents a feature
● Allows feature-based queries and search for similar projects
F = {python, machine-learning, web, django-framework, webassembly, sql, spark, gpu-support, Java}
F_tf = {1, 1, 0, 0, 0, 0, 0, 1, 0}
F - feature vector
F_tf - feature vector for project TensorFlow
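A minimal sketch of how such feature vectors support queries and similarity search. The slide does not say which similarity measure project2vec uses, so cosine similarity is an assumption here, and `ml_project` and `web_project` are made-up vectors for illustration.

```python
import math

FEATURES = ["python", "machine-learning", "web", "django-framework",
            "webassembly", "sql", "spark", "gpu-support", "Java"]

tensorflow = [1, 1, 0, 0, 0, 0, 0, 1, 0]  # F_tf from the slide

# Hypothetical vectors for two made-up projects.
ml_project = [1, 1, 0, 0, 0, 0, 0, 0, 0]
web_project = [1, 0, 1, 1, 0, 1, 0, 0, 0]


def has_features(vector, wanted):
    """Feature-based query: does the project have every wanted feature?"""
    return all(vector[FEATURES.index(name)] == 1 for name in wanted)


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(has_features(tensorflow, ["python", "gpu-support"]))
print(cosine_similarity(tensorflow, ml_project))
print(cosine_similarity(tensorflow, web_project))
```

In this toy data the machine-learning project scores closer to TensorFlow than the web project does, because it shares two of TensorFlow's three features and has no features TensorFlow lacks.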

Slide 38

Slide 38 text

project2vec [image]

Slide 39

Slide 39 text

Information about Thoth
● Website
  ○ https://thoth-station.ninja/
● Twitter
  ○ https://twitter.com/thothstation
    ■ Follow for updates on public availability
● GitHub
  ○ https://github.com/thoth-station

Slide 40

Slide 40 text

Thoth Station

Slide 41

Slide 41 text

It’s all on you
● Community at https://github.com/thoth-station/
● Bot Kebechet
  ○ https://thoth-station.ninja/kebechet/
● Twitter
  ○ https://twitter.com/thothstation
    $ pip3 install thamos
    $ cd ~/repositories/my-repo/
    $ thamos config
    $ thamos advise

Slide 42

Slide 42 text

THANK YOU plus.google.com/+RedHat linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHat