Conda Internals - Speaker Deck

Slide 1

Slide 1 text

Conda Internals How does conda work? Aaron Meurer

Slide 2

Slide 2 text

We all know how to use conda

Slide 3

Slide 3 text

But how does it work?

Slide 4

Slide 4 text

Specifically, what happens when you type conda create -n myenv pandas numpy=1.8

Slide 5

Slide 5 text

Warning: This is not really a talk It’s more like notes in a presentation form

Slide 6

Slide 6 text

Basic outline of what happens • argparse parses the arguments. • The index is fetched. • The package solution is found. • The installation plan is generated. • The plan is executed.

Slide 7

Slide 7 text

argparse parses the arguments. The index is fetched. Package solution is found. The installation plan is generated.

Slide 8

Slide 8 text

The installation plan is generated. The plan is executed.

Slide 9

Slide 9 text

Terminology • package A single specific instance of what you might think of as a “package”. For example, numpy-1.8.2-py34_0.tar.bz2. • dist The filename of a package without the .tar.bz2. For example, numpy-1.8.2-py34_0. The name, version, and build string of a dist are dist.rsplit('-', 2). • prefix The base path where packages are installed. This is roughly what you think of as an “environment”. • index All the metadata for all the packages from all the channels. It is generated by accumulating repodata.json files from the channels.
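The dist naming convention above can be sketched in a few lines of Python. The helper name split_dist is hypothetical (it is not a conda function); it only illustrates why the split is done from the right.

```python
def split_dist(filename):
    """Split a package filename or dist into (name, version, build string).

    Hypothetical helper for illustration; not part of conda.
    """
    suffix = ".tar.bz2"
    # A dist is the package filename without the .tar.bz2 extension.
    dist = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    # Split from the right so package names that contain '-' stay intact.
    name, version, build = dist.rsplit("-", 2)
    return name, version, build


print(split_dist("numpy-1.8.2-py34_0.tar.bz2"))
```

Splitting from the right matters for names such as scikit-learn, where a left-to-right split would cut the name in two.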

Slide 10

Slide 10 text

Terminology • spec A specifier that refers to a collection of packages. For example, scipy, numpy >1.8, or python 3.4.2 0. This is similar to, but not exactly the same as, what you specify at the command line. • link Roughly, a synonym for “install”. Files from a package are “linked” from the package cache to the install prefix (usually using hard-links, but could be using soft-links or copying) • channel A directory structure of platform directories containing packages and repodata.json.bz2 files. Channels are accessed by conda through URLs.

Slide 11

Slide 11 text

Terminology • I’ll define more terminology in bold as I go.

Slide 12

Slide 12 text

Fetching the index Fetching package metadata: ....

Slide 13

Slide 13 text

Fetching the index • The conda.fetch module. • Channel urls like asmeurer and defaults are converted to https://conda.anaconda.org/asmeurer and https://repo.continuum.io, and the current platform string is appended to the end (like https://conda.anaconda.org/asmeurer/osx-64/). • The repodata.json.bz2 is then downloaded from that URL.

Slide 14

Slide 14 text

Fetching the index • The repodata.json is a giant dictionary mapping packages to metadata.

Slide 15

Slide 15 text

repodata.json Example

{
  "info": {
    "arch": "x86_64",
    "platform": "osx"
  },
  "packages": {
    …
    "pandas-0.14.1-np18py34_0.tar.bz2": {
      "build": "np18py34_0",
      "build_number": 0,
      "depends": [
        "dateutil",
        "numpy 1.8*",
        "python 3.4*",
        "pytz",
        "scipy",
        "setuptools"
      ],
      "license": "BSD",
      "md5": "2666cd5f884f793df5e29e924315abb6",
      "name": "pandas",
      "requires": [],
      "size": 4424913,
      "version": "0.14.1"
    },
    …
  }
}

Slide 16

Slide 16 text

Fetching the index • The channels are fetched in reverse order (concurrently in Python 3 and in Python 2 if you have futures installed), and the index is generated. • (Pseudocode) index = {} for repodata in reversed(channels): index.update(repodata) • Hence, if two channels have the exact same package filename, the one earlier in the channel list is preferred. • If they have different filenames (like pandas-0.14.1-np18py34_0.tar.bz2 and pandas-0.16.1-np19py34_0.tar.bz2), they both are included (the conda.resolve module will sort which one should be used later).
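The pseudocode above can be made runnable as a small sketch. The function name build_index and the "channel" metadata key are assumptions for illustration only; the point is that updating in reverse order lets earlier channels win for identical filenames.

```python
def build_index(channel_packages):
    """Merge per-channel package metadata into a single index.

    channel_packages: list of dicts (filename -> metadata), ordered from
    highest-priority channel to lowest. Hypothetical helper, not conda's API.
    """
    index = {}
    # Reverse order: later updates come from earlier (higher-priority)
    # channels, so they overwrite entries with the same filename.
    for packages in reversed(channel_packages):
        index.update(packages)
    return index


# The "channel" key is made up so we can see which entry survived.
asmeurer = {"pandas-0.14.1-np18py34_0.tar.bz2": {"channel": "asmeurer"}}
defaults = {
    "pandas-0.14.1-np18py34_0.tar.bz2": {"channel": "defaults"},
    "pandas-0.16.1-np19py34_0.tar.bz2": {"channel": "defaults"},
}
index = build_index([asmeurer, defaults])
for filename in sorted(index):
    print(filename, index[filename]["channel"])
```

Both pandas filenames end up in the index, and the duplicated filename keeps the metadata from the channel listed first.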

Slide 17

Slide 17 text

Finding the package solution Solving package specifications: ...............

Slide 18

Slide 18 text

Finding the package solution • The conda.resolve module (and also conda.logic). • This is the most complicated part (and also my favorite :-). • You’ve probably heard something like, “conda uses a SAT solver to solve dependencies”.

Slide 19

Slide 19 text

Interlude (the SAT problem)

Slide 20

Slide 20 text

Interlude (the SAT problem) • A propositional formula is a Boolean formula with variables (x1, x2, …) and boolean operations (¬ (NOT), ∨ (OR), ∧ (AND), → (IMPLIES), …). • Example: (¬x1 ∨ x2) → x3 • Variables can be assigned either TRUE or FALSE.

Slide 21

Slide 21 text

Interlude (the SAT problem) • Given a variable assignment, a formula itself is either TRUE or FALSE. • Example: under the assignment x1 = TRUE, x2 = FALSE, x3 = FALSE, (¬x1 ∨ x2) → x3 is TRUE. • A formula with n variables has 2^n possible assignments.

Slide 22

Slide 22 text

Interlude (the SAT problem) • A formula is called satisfiable if there is some variable assignment that makes it TRUE. • Example: (¬x1 ∨ x2) → x3 is satisfiable because the assignment x1 = TRUE, x2 = FALSE, x3 = FALSE, makes it TRUE. • A formula is called unsatisfiable if it is not satisfiable, i.e., every variable assignment makes it FALSE. • Example: x1 ∧ ¬x1 is unsatisfiable because all possible assignments (x1 = TRUE and x1 = FALSE) make it FALSE.

Slide 23

Slide 23 text

Interlude (the SAT problem) • Naively, to check if a formula is satisfiable, plug in every possible assignment until we find one that makes it TRUE. • (pseudocode) for assignment in all_possible_assignments(variables): if formula(assignment) == True: return True return False • The problem is there are 2^n possible assignments for n variables, and we want to be able to deal with formulas with many variables.
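The naive check above, written out as a runnable sketch. The function names are made up for illustration; the formula is the running example (¬x1 ∨ x2) → x3.

```python
from itertools import product


def implies(a, b):
    # a → b is equivalent to (¬a ∨ b).
    return (not a) or b


def formula(x1, x2, x3):
    # The running example: (¬x1 ∨ x2) → x3
    return implies((not x1) or x2, x3)


def find_assignment(f, n):
    """Try all 2**n assignments; return a satisfying one, or None."""
    for assignment in product([True, False], repeat=n):
        if f(*assignment):
            return assignment
    return None


print(formula(True, False, False))
print(find_assignment(formula, 3) is not None)
print(find_assignment(lambda x1: x1 and not x1, 1))
```

The loop body runs up to 2**n times, which is exactly the blow-up the slide warns about.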

Slide 24

Slide 24 text

Interlude (the SAT problem) • Fortunately, there are much more efficient algorithms. • Also fortunately, they are implemented in very fast, black box libraries called SAT solvers. conda uses one called picosat (the Python bindings are called pycosat). • Yes, SAT is NP-complete, but for our purposes, that won’t matter. • As a general rule, if you convert a problem that isn’t NP-complete to SAT, a SAT solver will be able to solve it efficiently.

Slide 25

Slide 25 text

Interlude (the SAT problem) • A boolean formula is in conjunctive normal form (CNF) if it is an AND of ORs of atoms (variables or negations of variables). • Example: (x1 ∨ x2 ∨ ¬x3) ∧ (x1 ∨ ¬x4) ∧ ¬x1 • FACT: Every propositional formula can be rewritten in CNF.

Slide 26

Slide 26 text

Interlude (the SAT problem) • The ORs in a CNF formula are called clauses. • Think of clauses as constraints. The ANDs require that each constraint be satisfied. • (x1 ∨ x2 ∨ ¬x3) AND (x1 ∨ ¬x4) AND (¬x1) • A formula is satisfiable if there is a solution to the “constraints”.

Slide 27

Slide 27 text

Interlude (the SAT problem) • Interpretation of common clauses: • x1 : x1 must be TRUE. • ¬x1 : x1 must be FALSE. • x1 ∨ x2 ∨ x3 : one of x1, x2 , x3 , must be TRUE. • ¬x1 ∨ x2 : The formula x1 → x2 must be TRUE (if x1 is TRUE, x2 must also be TRUE). • Variables typically have some meaning associated with them. For instance, in conda, a variable might represent a package, where TRUE means the package should be installed.

Slide 28

Slide 28 text

Interlude (the SAT problem) • A SAT solver will take a formula in CNF and return a solution if the formula is SAT, or “UNSAT”. The solution is a mapping of variables that makes the formula TRUE (also called a model). • Say we have a formula and get a solution from a SAT solver. We want to know if there are any other solutions. • Say the model is {x1=FALSE, x2=TRUE, x3=TRUE, x4=FALSE}. • We want a solution that isn’t (x1=FALSE AND x2=TRUE AND x3=TRUE AND x4=FALSE)

Slide 29

Slide 29 text

Interlude (the SAT problem) • We want a solution that isn’t (x1=FALSE AND x2=TRUE AND x3=TRUE AND x4=FALSE). • We want a solution that isn’t (¬x1 ∧ x2 ∧ x3 ∧ ¬x4). • We want a solution that is ¬(¬x1 ∧ x2 ∧ x3 ∧ ¬x4). • We want a solution that satisfies (x1 ∨ ¬x2 ∨ ¬x3 ∨ x4) (De Morgan’s law).

Slide 30

Slide 30 text

Interlude (the SAT problem) • Add (x1 ∨ ¬x2 ∨ ¬x3 ∨ x4) to the list of clauses and repeat. • We can repeat this until we get “UNSAT”, meaning there are no more solutions. • Note that finding all solutions in this way is almost always infeasibly inefficient, even if finding one is fast (there can be up to 2^n solutions to a formula with n variables). • Unless you structure your clauses so that there should be very few solutions.
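A sketch of the "add the negated model and repeat" loop. To stay self-contained it uses a brute-force stand-in (naive_solve, a made-up name) instead of a real SAT solver; clauses are tuples of signed integers, where 1 means x1 and -1 means ¬x1.

```python
from itertools import product


def naive_solve(clauses, n):
    """Brute-force stand-in for a SAT solver: return a model or None."""
    for bits in product([False, True], repeat=n):
        model = [i + 1 if bit else -(i + 1) for i, bit in enumerate(bits)]
        true_literals = set(model)
        # A clause is satisfied if at least one of its literals is true.
        if all(any(lit in true_literals for lit in clause) for clause in clauses):
            return model
    return None


def all_solutions(clauses, n):
    clauses = list(clauses)
    solutions = []
    while True:
        model = naive_solve(clauses, n)
        if model is None:  # "UNSAT": no more solutions
            return solutions
        solutions.append(model)
        # De Morgan: NOT (l1 AND l2 AND ...) is (NOT l1 OR NOT l2 OR ...),
        # so negating every literal of the model gives the blocking clause.
        clauses.append(tuple(-lit for lit in model))


solutions = all_solutions([(1, 2, -3), (1, -4), (-1,)], 4)
print(len(solutions))
```

Each pass adds exactly one clause and rules out exactly one model, so the number of solver calls equals the number of solutions plus one.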

Slide 31

Slide 31 text

Deep breath

Slide 32

Slide 32 text

Back to computers

Slide 33

Slide 33 text

pycosat • Variables are represented by integers. 1 means x1. -1 means ¬x1. • A clause is a tuple of integers ((1, 2, -3)). • A formula is a set of clauses ({(1, 2, 3), (1, -4), (-1,)}).

Slide 34

Slide 34 text

pycosat • pycosat has two functions, solve() and itersolve(). • solve() returns a single solution, or “UNSAT”, itersolve() returns an iterator over all solutions. >>> pycosat.solve({(1, 2, -3), (1, -4), (-1,)}) [-1, -2, -3, -4] >>> list(pycosat.itersolve({(1, 2, -3), (1, -4), (-1,)})) [[-1, -2, -3, -4], [-1, 2, 3, -4], [-1, 2, -3, -4]] • [-1, 2, 3, -4] means {x1 =FALSE, x2 =TRUE, x3 =TRUE, x4 =FALSE}.

Slide 35

Slide 35 text

Finding the package solution • Back to conda create -n myenv pandas numpy=1.8 • conda recursively gathers all packages and dependencies related to pandas and numpy 1.8 (basically, anything that might be installed). • Each dist is assigned to a variable: {"pandas-0.16.1-np19py34_0": 1, "pandas-0.14.1-np18py34_0": 2, …, "python-3.4.3-0": 100, "python-3.4.2-0": 101, …} • A variable will be TRUE if the package should be installed.

Slide 36

Slide 36 text

Finding the package solution • Generate clauses corresponding to package “rules”: • Only one version of a given package name is installed: (¬numpy1.8 ∨ ¬numpy1.9), (¬numpy1.7 ∨ ¬numpy1.8), (¬numpy1.7 ∨ ¬numpy1.9) (NOT (numpy-1.8 AND numpy-1.9), …) • If a package is installed, so should its dependencies be: (pandas1.4 → (python3.4.3 ∨ python3.4.2 ∨ …)), (pandas1.4 → (numpy1.8.2 ∨ numpy1.8.1 ∨ …)), … (as clauses, (NOT pandas-1.4 OR python-3.4.3 OR python-3.4.2 OR …))

Slide 37

Slide 37 text

Finding the package solution • Generate clauses corresponding to package “rules”: • A package matching each given spec should be installed: (numpy1.8.2 ∨ numpy1.8.1 ∨ …), (pandas0.16.1 ∨ pandas0.16.0 ∨ …) (for pandas and numpy 1.8). • Some stuff surrounding “features” that I won’t get into.
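The three kinds of rules can be sketched on a toy index. Everything here is an illustration under stated assumptions: build_clauses is a made-up helper, the input shapes are invented, the dist numpy-1.9.2-py34_0 is hypothetical, and "features" are ignored. Clauses use the signed-integer convention from the pycosat slides.

```python
from itertools import combinations


def build_clauses(var, depends, spec_matches):
    """Build clauses (tuples of signed ints) for a toy package problem.

    var: dist -> positive integer variable
    depends: dist -> list of dependencies, each a list of dists satisfying it
    spec_matches: for each requested spec, the list of dists matching it
    """
    clauses = []
    # Rule 1: at most one dist per package name, (NOT a OR NOT b) per pair.
    by_name = {}
    for dist in var:
        by_name.setdefault(dist.rsplit("-", 2)[0], []).append(dist)
    for group in by_name.values():
        for a, b in combinations(group, 2):
            clauses.append((-var[a], -var[b]))
    # Rule 2: dist -> (dep1 OR dep2 OR ...), i.e. (NOT dist OR dep1 OR ...).
    for dist, deps in depends.items():
        for alternatives in deps:
            clauses.append((-var[dist],) + tuple(var[d] for d in alternatives))
    # Rule 3: some dist matching each requested spec must be installed.
    for matches in spec_matches:
        clauses.append(tuple(var[d] for d in matches))
    return clauses


var = {
    "pandas-0.14.1-np18py34_0": 1,
    "numpy-1.8.2-py34_0": 2,
    "numpy-1.9.2-py34_0": 3,  # hypothetical dist, for illustration
}
depends = {"pandas-0.14.1-np18py34_0": [["numpy-1.8.2-py34_0"]]}
spec_matches = [["pandas-0.14.1-np18py34_0"], ["numpy-1.8.2-py34_0"]]
clauses = build_clauses(var, depends, spec_matches)
print(clauses)
```

The pairwise exclusion in rule 1 grows quadratically with the number of versions of a package, which is one reason the clause sets get large.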

Slide 38

Slide 38 text

Finding the package solution • Find solutions. • If the SAT solver returns “UNSAT”, then generate some hints (won’t go into here), and error.

Slide 39

Slide 39 text

Finding the package solution • Problem: there are lots of solutions to the above clauses. • Every version of pandas from 0.13.0 to 0.14.1 has a numpy 1.8 build. • conda could also install some completely unrelated package (say, matplotlib), and the clauses would still be satisfied.

Slide 40

Slide 40 text

Finding the package solution • Solution (easy): when gathering dists, only get the latest version of pandas and numpy 1.8. • Iterate over all solutions and find the one with the fewest TRUE variables (fewest installed packages). • If there is indeed only one such minimal solution, use that.

Slide 41

Slide 41 text

Finding the package solution • This method fails in this case. The latest version of pandas, 0.16.1, has not been built against numpy 1.8. • What we really want is to look at all the solutions with all the dists, and find the one with the newest versions of everything (and also minimal number of installed packages). • We can’t just literally iterate over all solutions. There are combinatorially too many.

Slide 42

Slide 42 text

Finding the package solution • Idea: construct linear equations of the packages, where install (TRUE) = 1, and don’t install (FALSE) = 0. • Say all of our versions of pandas are 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, and 0.16.1. • Construct an expression 0·p_0.16.1 + 1·p_0.16.0 + 2·p_0.15.2 + 3·p_0.15.1 + 4·p_0.15.0 + 5·p_0.14.1 + 6·p_0.14.0, where p_v is the variable for pandas version v. • Clearly our constraints will allow no more than one variable to be equal to 1. • We want to minimize this expression relative to our constraints.

Slide 43

Slide 43 text

Finding the package solution • We really want to minimize all versions of every package. • Add equations like this for every package together into one big expression. • eq = 0·p_0.16.1 + 1·p_0.16.0 + 2·p_0.15.2 + 3·p_0.15.1 + 4·p_0.15.0 + 5·p_0.14.1 + 6·p_0.14.0 + 0·np_1.9.2 + 1·np_1.9.1 + 2·np_1.9.0 + … (p for pandas versions, np for numpy versions)
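A small sketch of how such an expression could be built and evaluated. The helper names and the input shape (versions listed newest first) are assumptions for illustration, not conda's data structures.

```python
def version_weights(versions_by_name):
    """Give each (name, version) a coefficient: 0 for newest, 1 for next, ...

    versions_by_name: name -> list of versions, assumed sorted newest first.
    """
    weights = {}
    for name, versions in versions_by_name.items():
        for rank, version in enumerate(versions):
            weights[(name, version)] = rank
    return weights


def evaluate(weights, installed):
    """Value of eq when exactly the given (name, version) pairs are 1."""
    return sum(weights[pkg] for pkg in installed)


weights = version_weights({
    "pandas": ["0.16.1", "0.16.0", "0.15.2", "0.15.1", "0.15.0", "0.14.1", "0.14.0"],
    "numpy": ["1.9.2", "1.9.1", "1.9.0"],
})
print(evaluate(weights, [("pandas", "0.16.1"), ("numpy", "1.9.2")]))
print(evaluate(weights, [("pandas", "0.14.1"), ("numpy", "1.9.1")]))
```

The all-newest solution evaluates to 0, and every step back to an older version adds to the total, so a smaller value means newer packages overall.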

Slide 44

Slide 44 text

Finding the package solution • How do we minimize a linear boolean equation relative to SAT constraints? • Answer: A pseudo-boolean SAT solver. • Or, if you don’t have one of those, convert the equation into SAT clauses.

Slide 45

Slide 45 text

Finding the package solution • eq is not actually a boolean predicate (it’s a number, not TRUE or FALSE). • But a ≤ eq ≤ b is. • Convert a ≤ eq ≤ b to SAT clauses and minimize it by bisecting a and b. • Two algorithms are implemented in conda to do this: BDD (binary decision diagram) and sorter (as in “sorter network”).

Slide 46

Slide 46 text

Finding the package solution • BDD stands for binary decision diagram.

Slide 47

Slide 47 text

Finding the package solution [Figure 2 (page header in the image: N. Eén and N. Sörensson): The BDD, and corresponding RBC, for the constraint “a + b + 2c + 2d + 3e + 3f + 3g + 3h + 7i ≥ 8”. The BDD terminals are denoted by “0” and “1”.] • BDD for a + b + 2c + 2d + 3e + 3f + 3g + 3h + 7i ≥ 8. • The BDD is translated to SAT clauses via ITEs and Tseitin transformations. Image from A Translation of Pseudo-Boolean Constraints to SAT, Bailleux, Boufkhad, and Roussel

Slide 48

Slide 48 text

Finding the package solution • Convert a ≤ eq ≤ b to SAT clauses and minimize it by bisecting a and b. • We expect the solution to be near 0 (newest versions of everything), so the bisection checks [0, 10], [11, 21], … and bisects until it finds a solution.
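The window scan followed by bisection can be sketched as below. This is not conda's code: feasible(a, b) is a hypothetical callback that stands in for "add the clauses for a ≤ eq ≤ b and ask the SAT solver", and the window size of 10 is inferred from the intervals on the slide.

```python
def minimize(feasible, max_value, window=10):
    """Find the smallest value v of eq such that a solution exists.

    feasible(a, b) -> True if some solution has a <= eq <= b.
    Returns None if no value in [0, max_value] is feasible.
    """
    lo = 0
    # Scan windows [0, 10], [11, 21], ... until one contains a solution.
    while lo <= max_value:
        hi = min(lo + window, max_value)
        if feasible(lo, hi):
            break
        lo = hi + 1
    else:
        return None
    # Bisect inside the window; nothing below lo is feasible at this point.
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(lo, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Toy oracle: pretend the achievable values of eq are 13, 17 and 40.
achievable = {13, 17, 40}
oracle = lambda a, b: any(a <= v <= b for v in achievable)
print(minimize(oracle, 50))
```

Scanning small windows first keeps the number of solver calls low when the answer is near 0, which is the expected case.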

Slide 49

Slide 49 text

Finding the package solution • You can see this happening with conda --debug. • Each . after “Solving package specifications” is a bisection step.

Slide 50

Slide 50 text

Finding the package solution • There can still be multiple solutions, because you can always add some redundant package to the solution. • We still need to find the solution with the fewest installed packages. • Try iterating through all solutions, as before, to find the one with the fewest number of TRUE variables.

Slide 51

Slide 51 text

Finding the package solution • There could be a lot of solutions, meaning iteration would be too slow. • Set an iteration limit of 1000. If that is surpassed, set up an equation p1 + p2 + … + pn where each pi is a package, and minimize it. • Same idea as before, except this time, BDDs are too slow, so set up a sorter network.

Slide 52

Slide 52 text

Sorter networks • A sorter network is a predetermined set of comparisons to sort 2^n items. Image from Wikipedia

Slide 53

Slide 53 text

Sorter networks • Sorter networks are commonly used to build predetermined sorters in hardware. • If the input to a sorter network is all 0s and 1s, then in the output, all 0s will come before all 1s. • This can be used to count the number of 1s, i.e., compute p1 + p2 + … + pn.

Slide 54

Slide 54 text

Sorter networks • Example: 1 ≤ x1 + x2 + x3 + x4 ≤ 3 • 0s will sort to the top and 1s to the bottom. [Figure: a sorter network on the inputs x1, x2, x3, x4. The top output should be 0 (≤ 3) and the bottom output should be 1 (≥ 1).]

Slide 55

Slide 55 text

Sorter networks • A sorter network can be generated for k = 2^n numbers using odd-even merge sort (using O(k log² k) comparisons). • If k ≠ 2^n then pick the next largest power of 2 and set the extra inputs to 0. • Comparisons in the sorter network are (min, max), which for TRUE and FALSE is (AND, OR).
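A sketch of the idea in plain Python, evaluating the network on concrete booleans rather than generating SAT clauses. The comparator generation follows the well-known recursive formulation of Batcher's odd-even mergesort; the function names and the sum_in_range helper are made up for illustration.

```python
def oddeven_merge(lo, hi, r):
    # Comparators that merge two sorted halves of the index range lo..hi.
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort_range(lo, hi):
    # Comparators that sort indices lo..hi (inclusive).
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort_range(lo, mid)
        yield from oddeven_merge_sort_range(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def sorter_network(k):
    """Fixed list of comparators for k inputs, where k is a power of 2."""
    return list(oddeven_merge_sort_range(0, k - 1))


def run_network(bits, network):
    out = list(bits)
    for a, b in network:
        # (min, max) on booleans is (AND, OR): FALSE moves up, TRUE moves down.
        out[a], out[b] = out[a] and out[b], out[a] or out[b]
    return out


def sum_in_range(bits, lo, hi):
    """Check lo <= number of TRUE bits <= hi, assuming 0 <= lo <= hi."""
    k = 1
    while k < len(bits):
        k *= 2
    padded = list(bits) + [False] * (k - len(bits))  # extra inputs set to 0
    out = run_network(padded, sorter_network(k))
    at_least_lo = lo == 0 or (lo <= k and out[k - lo])
    at_most_hi = hi >= k or not out[k - hi - 1]
    return at_least_lo and at_most_hi


print(sorter_network(4))
print(sum_in_range([True, False, False, False], 1, 3))
```

Because the comparator list depends only on k, changing the bounds only changes which outputs are inspected, which mirrors the "generate the clauses once" property.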

Slide 56

Slide 56 text

Sorter networks • Great thing about sorter networks: only have to generate the clauses for the network once. • To change a and b to bisect a ≤ eq ≤ b just change which output variables are asserted. • This could also be used for the version equation with 3x = x + x + x, but the SAT solver has a harder time with this than with BDD (even though generating the clauses is faster).

Slide 57

Slide 57 text

Finding the package solution • So we have a solution of packages that has the newest possible versions and the fewest number of packages. • If there are multiple solutions, print a warning. • If there are no solutions, generate a hint • Find a minimal subset of the dists that break together.

Slide 58

Slide 58 text

Can we do better? • This can be slow (especially if older versions are required) • The problem is the pseudo-boolean algorithm. • SAT solvers are designed for feasibility, not optimization. • Other kinds of solvers can solve the same problems more directly (like MILP or SMT solvers). • Can we find a good, lightweight, open source, Python solver that is faster? • Answer: I hope so.

Slide 59

Slide 59 text

The package plan Package plan for installation in environment /Users/aaronmeurer/anaconda/envs/myenv

Slide 60

Slide 60 text

The package plan • Now we know what packages to install from the dependency solver. • What do we actually do to install them?

Slide 61

Slide 61 text

The package plan • The conda.instructions module has a set of instructions that conda can perform: • FETCH • EXTRACT • UNLINK • LINK • RM_EXTRACTED • RM_FETCHED • PREFIX • PRINT • PROGRESS • SYMLINK_CONDA

Slide 62

Slide 62 text

The package plan • FETCH: Download the package from the channel to the package cache. • EXTRACT: Extract the downloaded tarball in the package cache. • UNLINK: Remove an existing package from the prefix. • LINK: Install the package to the prefix. This instruction has an argument, linktype, which can be LINK_HARD, LINK_SOFT, or LINK_COPY.

Slide 63

Slide 63 text

The package plan • RM_EXTRACTED: Remove an existing extracted package from the package cache. • RM_FETCHED: Remove an existing tarball from the package cache. • PREFIX: The install prefix. Does not represent an actual action. • PRINT: Print a message to the screen. • PROGRESS: Create a progress bar. • SYMLINK_CONDA: Create a symlink for bin/conda, bin/activate, and bin/deactivate in the prefix to the root environment.

Slide 64

Slide 64 text

The package plan • Here is the plan for my example.

[('PREFIX', '/Users/aaronmeurer/anaconda/envs/myenv'),
 ('PRINT', 'Fetching packages ...'),
 ('FETCH', 'numpy-1.8.2-py34_0'),
 ('FETCH', 'wheel-0.26.0-py34_0'),
 ('FETCH', 'scipy-0.14.0-np18py34_0'),
 ('FETCH', 'pandas-0.14.1-np18py34_0'),
 ('PRINT', 'Extracting packages ...'),
 ('PROGRESS', '4'),
 ('EXTRACT', 'numpy-1.8.2-py34_0'),
 ('EXTRACT', 'wheel-0.26.0-py34_0'),
 ('EXTRACT', 'scipy-0.14.0-np18py34_0'),
 ('EXTRACT', 'pandas-0.14.1-np18py34_0'),
 ('PRINT', 'Linking packages ...'),
 ('PROGRESS', '17'),
 ('LINK', 'ncurses-5.9-1 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'openssl-1.0.1k-1 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'sqlite-3.8.4.1-1 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'tk-8.5.18-0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'xz-5.0.5-0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'zlib-1.2.8-1 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'readline-6.2.5-1 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'python-3.4.3-0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'numpy-1.8.2-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'pytz-2015.6-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'setuptools-18.1-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'six-1.9.0-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'wheel-0.26.0-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'dateutil-2.4.1-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'pip-7.1.2-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'scipy-0.14.0-np18py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('LINK', 'pandas-0.14.1-np18py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
 ('SYMLINK_CONDA', '/Users/aaronmeurer/anaconda')]

Slide 65

Slide 65 text

The package plan • You are probably more used to seeing this pretty output.

Package plan for installation in environment /Users/aaronmeurer/anaconda/envs/myenv:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    numpy-1.8.2                |           py34_0         2.6 MB  defaults
    wheel-0.26.0               |           py34_0          77 KB  defaults
    scipy-0.14.0               |       np18py34_0        10.6 MB  defaults
    pandas-0.14.1              |       np18py34_0         4.2 MB  defaults
    ------------------------------------------------------------
                                           Total:        17.5 MB

The following NEW packages will be INSTALLED:

    dateutil:   2.4.1-py34_0      defaults
    ncurses:    5.9-1             asmeurer
    numpy:      1.8.2-py34_0      defaults
    openssl:    1.0.1k-1          defaults
    pandas:     0.14.1-np18py34_0 defaults
    pip:        7.1.2-py34_0      defaults
    python:     3.4.3-0           defaults
    pytz:       2015.6-py34_0     defaults
    readline:   6.2.5-1           asmeurer
    scipy:      0.14.0-np18py34_0 defaults
    setuptools: 18.1-py34_0       defaults
    six:        1.9.0-py34_0      defaults
    sqlite:     3.8.4.1-1         defaults
    tk:         8.5.18-0          defaults
    wheel:      0.26.0-py34_0     defaults
    xz:         5.0.5-0           defaults
    zlib:       1.2.8-1           asmeurer

Slide 66

Slide 66 text

Terminology • I’m going to call high level commands that perform many actions operations (e.g., everything done by conda install scipy). • An action is one of the ten operations defined above. A single action may result in several instructions (for example, the LINK action will result in a LINK instruction for every package to be linked). • An instruction is a specific action with specific arguments. Instructions are ordered relative to other instructions. • A plan is an ordered list of instructions.

Slide 67

Slide 67 text

The package plan • Different operations generate different sets of actions. • For example, RM_EXTRACTED is used by conda install -f. • We are interested in conda.plan.install_actions().

Slide 68

Slide 68 text

conda.plan.install_actions() • Takes as arguments the install prefix, the index, and the install specs (and some options). • First gathers linked packages (conda.install.linked(prefix)). • Then the specs are modified.

Slide 69

Slide 69 text

Spec modification • Unless --no-pinned was used, any specs in prefix/conda-meta/pinned are added to the specs. • Python is added to the specs.

Slide 70

Slide 70 text

Automatic Python version tracking • Conda does not consider what packages are already installed when solving package specifications. • This would be a nice feature, but we need a faster solver. • It might also break many people’s setups. A lot of people have environments that don’t strictly have consistent dependencies.

Slide 71

Slide 71 text

Automatic Python version tracking • However, we want to know about what Python version is installed, as a common case. • conda packages require different builds for different Python versions. • C extensions must be rebuilt separately. • Even for pure Python, the directories are different on Unix (lib/python2.7/site-packages/). • Also could have different dependencies, different code (e.g., if it uses 2to3). • Note: the build string typically contains the Python version as a convention, but conda does not use this! It looks at the dependency metadata directly.

Slide 72

Slide 72 text

Automatic Python version tracking • conda.plan.add_defaults_to_specs() uses the following logic to add Python to the specs: • If Python is already in the specs with a version, don’t do anything. Example: conda install scipy python=3.3 • If none of the specs depend on Python, don’t do anything. For simplicity, we only check packages from the latest version for each spec. Example: conda install hdf5

Slide 73

Slide 73 text

Automatic Python version tracking • If a spec is explicit (includes version and build string), don’t do anything. Example: conda install scipy=0.16.0=np19py34_0 • If Python is already linked in the prefix, add that version. Example: conda create -n env python=3.3; conda install -n env scipy

Slide 74

Slide 74 text

Automatic Python version tracking • Finally, if the root environment Python is Python 2, and none of the above conditions apply, add Python 2 to the specs. This way, people who want to use Python 2 will create Python 2 environments. Example: conda create -n env scipy (where conda info shows Python 2.7). • The Python spec is always added like python 2.7*, python 3.3*, etc. • The * in the version spec matches any characters. More on this below.

Slide 75

Slide 75 text

conda.plan.install_actions() • Now the specs are sent to the solver (what was already discussed). • The solver returns a list of packages to be installed. • Some packages may be filtered out at this point. For example, if --no-deps is used, only the specified packages are kept. • Packages are sorted topologically by dependency, so that they can be installed in reverse dependency order (if possible). • This way post-link scripts can assume that dependencies are installed (useful if the script needs Python, for instance).
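The topological ordering can be illustrated with a depth-first sketch. This is a generic technique, not conda's implementation: the toposort name and the input shape are assumptions, and this version simply raises on a cycle rather than reproducing whatever the "if possible" fallback does.

```python
def toposort(depends):
    """Order package names so that dependencies come before dependents.

    depends: name -> set of names it depends on (toy input shape).
    """
    order, visiting, done = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError("dependency cycle involving %s" % name)
        visiting.add(name)
        # Visit dependencies first (sorted only to make the result stable).
        for dep in sorted(depends.get(name, ())):
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    for name in sorted(depends):
        visit(name)
    return order


deps = {
    "pandas": {"numpy", "python"},
    "numpy": {"python"},
    "python": set(),
}
print(toposort(deps))
```

Linking in this order means python is already in the prefix when the post-link scripts of numpy or pandas run.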

Slide 76

Slide 76 text

conda.plan.ensure_linked_actions() • A dictionary of actions is generated (conda.plan.ensure_linked_actions()). • For each package: • If it is already installed, do nothing. • If it is not already extracted and/or fetched, add actions to do that. • If a different version of it is installed, add an UNLINK action for it.

Slide 77

Slide 77 text

conda.plan.ensure_linked_actions() • There is also a lot of logic here to determine if a package should be linked via hard links, soft links, or copying. • Soft links are used automatically if hard links from the pkgs cache are not possible. • Or copying if configured, or if soft links are not possible (e.g., on Windows).

Slide 78

Slide 78 text

conda.plan.install_actions() • The SYMLINK_CONDA action is performed on every non-root prefix on Unix, to ensure each environment can be activated (where the root environment will be removed from the PATH). • The PREFIX meta-action is also added.

Slide 79

Slide 79 text

The package plan • There are similar functions for different kinds of operations, like removing a package, removing a feature, and reverting to a revision from the history file.

Slide 80

Slide 80 text

The package plan • Once the actions are generated, they are printed in a pretty form (conda.plan.display_actions()). • Next the actions are converted to a plan (conda.plan.plan_from_actions()).

Slide 81

Slide 81 text

conda.plan.plan_from_actions() • The actions dictionary is sorted using an operations order. • Either actions['op_order'] or the default order. • The default order is (FETCH, EXTRACT, UNLINK, LINK, SYMLINK_CONDA, RM_EXTRACTED, RM_FETCHED).

Slide 82

Slide 82 text

conda.plan.plan_from_actions() • Some new instructions are injected into the plan. • “Verb” actions get PRINT instructions (like “Fetching packages…”) • “Progress” actions get PROGRESS instructions. These set up progress bars that update for the next n instructions. • The result is a sorted list of instructions.
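A simplified sketch of turning an actions dictionary into a plan. It is not the real conda.plan.plan_from_actions(): the function name sketch_plan is made up, the VERBS and PROGRESS tables are assumptions read off the example plan shown earlier (which has PRINT for FETCH, EXTRACT, and LINK, and PROGRESS for EXTRACT and LINK), and LINK arguments are shortened to the dist only.

```python
DEFAULT_ORDER = ("FETCH", "EXTRACT", "UNLINK", "LINK",
                 "SYMLINK_CONDA", "RM_EXTRACTED", "RM_FETCHED")

# Assumed from the example plan; the real tables may differ.
VERBS = {"FETCH": "Fetching", "EXTRACT": "Extracting", "LINK": "Linking"}
PROGRESS = {"EXTRACT", "LINK"}


def sketch_plan(actions):
    """Flatten an actions dict into an ordered list of (instruction, arg)."""
    plan = [("PREFIX", actions["PREFIX"])]
    for op in actions.get("op_order", DEFAULT_ORDER):
        args = actions.get(op)
        if not args:
            continue
        if op in VERBS:
            plan.append(("PRINT", "%s packages ..." % VERBS[op]))
        if op in PROGRESS:
            # The progress bar covers the next len(args) instructions.
            plan.append(("PROGRESS", str(len(args))))
        plan.extend((op, arg) for arg in args)
    return plan


actions = {
    "PREFIX": "/envs/myenv",
    "LINK": ["a-1.0-0", "b-2.0-0"],
    "FETCH": ["a-1.0-0"],
    "EXTRACT": ["a-1.0-0"],
}
for instruction in sketch_plan(actions):
    print(instruction)
```

The order of keys in the actions dictionary does not matter; the operations order alone decides where each group of instructions lands in the plan.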

Slide 83

Slide 83 text

Executing the instructions • Each instruction is mapped to a function. • The functions are executed in the order they are given by the plan. • This modular system makes it easy to add different kinds of operations.
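The "each instruction is mapped to a function" design can be sketched as a dispatch table. The handler names, the state dictionary, and the recorded tuples are all hypothetical; real handlers would download, extract, or link files.

```python
def execute_plan(plan, commands):
    """Run each (instruction, arg) pair through its handler, in plan order."""
    state = {"prefix": None, "done": []}
    for instruction, arg in plan:
        commands[instruction](state, arg)
    return state


def set_prefix(state, arg):
    # PREFIX is a meta-instruction: it only records where later ones apply.
    state["prefix"] = arg


def record(name):
    # Stand-in handler: record what would have been done, and where.
    def command(state, arg):
        state["done"].append((name, arg, state["prefix"]))
    return command


commands = {
    "PREFIX": set_prefix,
    "PRINT": lambda state, arg: print(arg),
    "FETCH": record("FETCH"),
    "LINK": record("LINK"),
}
plan = [
    ("PREFIX", "/envs/myenv"),
    ("PRINT", "Fetching packages ..."),
    ("FETCH", "numpy-1.8.2-py34_0"),
    ("LINK", "numpy-1.8.2-py34_0"),
]
state = execute_plan(plan, commands)
```

Adding a new kind of instruction in this style only requires a new entry in the table, which is the modularity the slide refers to.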

Slide 84

Slide 84 text

has_prefix • One note on the LINK action: the has_prefix file. • A conda package may have an info/has_prefix file. • This file specifies files with a placeholder prefix, which should be replaced with the final install prefix. • The default placeholder is /opt/anaconda1anaconda2anaconda3, but it can also be set to anything (such as the original build prefix).

Slide 85

Slide 85 text

has_prefix • This allows files in the package to reference where they are installed as an absolute path, and it will always be correct. • A common example is shebang lines, but shebang lines are not special-cased. It works with any instance of the install prefix in any text file.

Slide 86

Slide 86 text

has_prefix • This works well for text files, but it can also work with binary files, with some caveats. • For binary files, it assumes C-style null-terminated strings. • If the placeholder is longer than the install prefix, it can be replaced by padding the end of the string with NULL characters.
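The binary case can be sketched as follows. This illustrates the padding idea only and is not conda's code: replace_prefix_binary is a made-up name, and the sample bytes and the short placeholder are invented for the example.

```python
def replace_prefix_binary(data, placeholder, new_prefix):
    """Replace placeholder inside null-terminated strings, keeping file size.

    data, placeholder, and new_prefix are bytes. Each affected C string is
    padded at its end with NUL bytes so all later offsets stay unchanged.
    """
    if len(new_prefix) > len(placeholder):
        raise ValueError("install prefix is longer than the placeholder")
    result = bytearray()
    pos = 0
    while True:
        start = data.find(placeholder, pos)
        if start == -1:
            result += data[pos:]
            return bytes(result)
        # The C string runs up to the next NUL byte (or the end of the data).
        end = data.find(b"\0", start)
        if end == -1:
            end = len(data)
        old = data[start:end]
        new = old.replace(placeholder, new_prefix)
        result += data[pos:start] + new + b"\0" * (len(old) - len(new))
        pos = end


blob = b"\x7fELF" + b"/opt/placeholder/lib/libfoo.so\0" + b"tail"
patched = replace_prefix_binary(blob, b"/opt/placeholder", b"/env")
print(len(blob) == len(patched))
```

This is why a long placeholder is useful: any install prefix that is not longer than the placeholder can be patched in without shifting the rest of the binary.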

Slide 87

Slide 87 text

That’s it • That’s it for installing. • We can touch on a few other things.

Slide 88

Slide 88 text

The spec • Conda has a well-defined spec (not to be confused with “specs” as defined above), which defines what .tar.bz2 packages and repodata.jsons look like. • The spec is outlined at http://conda.pydata.org/docs/spec.html • Note that the behavior of conda (just described) is not part of the spec. It’s just an implementation detail. • You could write your own “conda” that does things completely differently.

Slide 89

Slide 89 text

Dependency specifications • One thing of note is the dependency specification spec. • A package dependency is a string with one, two, or three parts, separated by spaces. • The first part is the name of the package, like python or numpy. • The second part is a version string.

Slide 90

Slide 90 text

Dependency specifications • Version strings can start with ==, !=, >=, <=, >, or <, which do what you expect. • A version string can also contain a *, which matches 0 or more characters (like .* in a regular expression). • Versions can be separated by |, meaning “or” and ,, meaning “and” (, has higher precedence than |).

Slide 91

Slide 91 text

Dependency specifications • The third part can be the build string. If there are three parts, the version needs to be an exact match (no inequalities, stars, or booleans). • Such exact specs are called explicit. For example, numpy 1.8.1 py27_0. • If conda’s solver sees only explicit specs, it can side-step the SAT solver and just install those exact packages.

Slide 92

Slide 92 text

Dependency specifications • Example: the following all match numpy-1.8.1-py27_0: • numpy • numpy 1.8* • numpy 1.8.1 • numpy >=1.8 • numpy ==1.8.1 • numpy 1.8|1.8* • numpy >=1.8,<2 • numpy >=1.8,<2|1.9 • numpy 1.8.1 py27_0
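A toy matcher that accepts all of the examples above. It is a simplification under stated assumptions, not conda's matching code: versions are compared as tuples of integers (real version ordering is more involved), and * is handled with fnmatchcase, which also treats ? and [ specially. All function names are made up.

```python
import operator
from fnmatch import fnmatchcase

# Two-character operators come first so ">=" is not read as ">".
OPS = [("==", operator.eq), ("!=", operator.ne), (">=", operator.ge),
       ("<=", operator.le), (">", operator.gt), ("<", operator.lt)]


def parse(version):
    # Simplification: purely numeric, dot-separated versions only.
    return tuple(int(part) for part in version.split("."))


def match_atom(atom, version):
    for symbol, op in OPS:
        if atom.startswith(symbol):
            return op(parse(version), parse(atom[len(symbol):]))
    if "*" in atom:
        return fnmatchcase(version, atom)
    return atom == version


def match_version(vspec, version):
    # "," (and) binds tighter than "|" (or).
    return any(all(match_atom(atom, version) for atom in alternative.split(","))
               for alternative in vspec.split("|"))


def match(spec, dist):
    name, version, build = dist.rsplit("-", 2)
    parts = spec.split()
    if parts[0] != name:
        return False
    if len(parts) >= 2 and not match_version(parts[1], version):
        return False
    if len(parts) == 3 and parts[2] != build:
        return False
    return True


specs = ["numpy", "numpy 1.8*", "numpy 1.8.1", "numpy >=1.8", "numpy ==1.8.1",
         "numpy 1.8|1.8*", "numpy >=1.8,<2", "numpy >=1.8,<2|1.9",
         "numpy 1.8.1 py27_0"]
print(all(match(spec, "numpy-1.8.1-py27_0") for spec in specs))
```

Splitting on | before , is what gives , the higher precedence: numpy >=1.8,<2|1.9 reads as (>=1.8 and <2) or 1.9.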

Slide 93

Slide 93 text

Dependency specifications • A gotcha: The dependency specifications we are talking about here are not the same as what is specified at the command line. • Packages at the command line are separated by =’s, not spaces. • python=3 generates the spec python 3|3*. • You can use inequalities at the command line, like “python>2” (although be sure to quote them, or else they might be interpreted by your shell).

Slide 94

Slide 94 text

Dependency specifications • See the full docs for dependency specifications at http://conda.pydata.org/docs/spec.html#package-match-specifications.

Slide 95

Slide 95 text

Building packages • Conda proper is only concerned with installing packages, which is a relatively simple process (unpack the tarball and link the files). • A lot of complexity of packages should be handled at build time. • Anyone can build a conda package if they follow the spec, but it’s best to use conda-build.

Slide 96

Slide 96 text

Building packages • conda-build does a lot of stuff for you automatically. • Validates your metadata, and generates a valid conda package. • Creates a separate, reproducible build environment with your build dependencies, and gathers the files into a package automatically. • Can run tests on the package after it is built. • A lot of convenience things for Python packages (like automatic entry point generation and automatically depending on a major Python version like python 3.4*).

Slide 97

Slide 97 text

Building packages • It also does a lot “behind the scenes”. • Automatically generates has_prefix files so that hard-coded instances of the install prefix in plain text files are always set to the install prefix. • has_prefix can also be generated automatically for binary files by setting detect_binary_files_with_prefix: true in the meta.yaml. • This automatically uses a long build prefix so that the replacement can happen for any install prefix.

Slide 98

Slide 98 text

Building packages • It also does a lot “behind the scenes”. • On Linux, it automatically changes the RPATH in shared libraries to be relative (otherwise, they would reference the build prefix). • On OS X, a similar thing is done with the install names. • Absolute symlinks to the build prefix are made relative. • Without these so-called “relocatable” steps, packages would have to be installed in the same prefix where they were built.

Slide 99

Slide 99 text

Building packages • It also does a lot “behind the scenes”. • Egg directories for Python packages are “flattened” out (keeps the sys.path cleaner). • Permissions on files are fixed.

Slide 100

Slide 100 text

Building packages • Most importantly, packages built with conda build are reproducible. • Someone should be able to take the same recipe on a similar machine and build the same package. • I wouldn’t recommend building a package without using conda build, unless you are doing it as an exercise, or you (really) know what you are doing.

Slide 101

Slide 101 text

Questions? • Read the docs (http://conda.pydata.org/docs/). • Read the conda source code (it’s pretty readable, once you understand the terms defined in the glossary at the beginning of this presentation). • Ask on the conda mailing list (http://groups.google.com/a/continuum.io/group/conda/) • @mention me on Twitter or GitHub (@asmeurer). Note that I don’t work for Continuum anymore.