Conda Internals

A set of technical notes on how the internals of conda work.

Note: this talk is intended for people who are already familiar with conda. If you aren't, you should read the conda docs (http://conda.pydata.org/docs/) or watch my SciPy talk about conda (there is a link in the docs).

Aaron Meurer

October 02, 2015

Transcript

  1. Basic outline of what happens • argparse parses the arguments.

    • The index is fetched. • The package solution is found. • The installation plan is generated. • The plan is executed.
  2. argparse parses the arguments. The index is fetched. Package solution

    is found. The installation plan is generated.
  3. Terminology • package A single specific instance of what you

    might think of as a “package”. For example, numpy-1.8.2-py34_0.tar.bz2. • dist The filename of a package without the .tar.bz2. For example, numpy-1.8.2-py34_0. The name, version, and build string of a dist are dist.rsplit('-', 2). • prefix The base path where packages are installed. This is roughly what you think of as an “environment”. • index All the metadata for all the packages from all the channels. It is generated by accumulating repodata.json files from the channels.
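As a small illustration of the dist definition above, the sketch below strips the .tar.bz2 extension and splits from the right. The helper name split_dist is made up for this example; it is not a conda function.

```python
def split_dist(filename):
    """Split a package filename or dist into (name, version, build string)."""
    suffix = ".tar.bz2"
    dist = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    # Split from the right so that hyphens inside the package name survive.
    name, version, build = dist.rsplit("-", 2)
    return name, version, build


print(split_dist("numpy-1.8.2-py34_0.tar.bz2"))
print(split_dist("scikit-learn-0.16.1-np19py34_0"))
```

Splitting from the right matters because package names may contain hyphens, while the version and build string are always the last two fields.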
  4. Terminology • spec A specifier that refers to a collection

    of packages. For example, scipy, numpy >1.8, or python 3.4.2 0. This is similar to, but not exactly the same as, what you specify at the command line. • link Roughly, a synonym for “install”. Files from a package are “linked” from the package cache to the install prefix (usually using hard-links, but could be using soft-links or copying) • channel A directory structure of platform directories containing packages and repodata.json.bz2 files. Channels are accessed by conda through URLs.
  5. Fetching the index • The conda.fetch module. • Channel urls

    like asmeurer and defaults are converted to https://conda.anaconda.org/asmeurer and https://repo.continuum.io, and the current platform string is appended to the end (like https://conda.anaconda.org/asmeurer/osx-64/). • The repodata.json.bz2 is then downloaded from that URL.
  6. repodata.json Example

    {
      "info": {
        "arch": "x86_64",
        "platform": "osx"
      },
      "packages": {
        …
        "pandas-0.14.1-np18py34_0.tar.bz2": {
          "build": "np18py34_0",
          "build_number": 0,
          "depends": [
            "dateutil",
            "numpy 1.8*",
            "python 3.4*",
            "pytz",
            "scipy",
            "setuptools"
          ],
          "license": "BSD",
          "md5": "2666cd5f884f793df5e29e924315abb6",
          "name": "pandas",
          "requires": [],
          "size": 4424913,
          "version": "0.14.1"
        },
        …
      }
    }
  7. Fetching the index • The channels are fetched in reverse

    order (concurrently in Python 3, and in Python 2 if you have futures installed), and the index is generated. • (Pseudocode)

        index = {}
        for repodata in reversed(channels):
            index.update(repodata)

    • Hence, if two channels have the exact same package filename, the one earlier in the channel list is preferred. • If they have different filenames (like pandas-0.14.1-np18py34_0.tar.bz2 and pandas-0.16.1-np19py34_0.tar.bz2), they are both included (the conda.resolve module will sort out which one should be used later).
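The merge order can be tried out with a runnable version of the pseudocode. This is only a sketch: build_index, the "channel" field, and the toy metadata are assumptions for illustration, not conda's actual data structures.

```python
def build_index(channel_repodata):
    """Merge per-channel package metadata into one index.

    channel_repodata is a list of (channel, packages) pairs in priority
    order, where packages maps a package filename to its metadata dict.
    """
    index = {}
    # Walk the channels in reverse so that earlier channels overwrite later ones.
    for channel, packages in reversed(channel_repodata):
        for filename, info in packages.items():
            index[filename] = dict(info, channel=channel)  # illustrative field
    return index


channels = [
    ("asmeurer", {"pandas-0.14.1-np18py34_0.tar.bz2": {"version": "0.14.1"}}),
    ("defaults", {"pandas-0.14.1-np18py34_0.tar.bz2": {"version": "0.14.1"},
                  "pandas-0.16.1-np19py34_0.tar.bz2": {"version": "0.16.1"}}),
]
index = build_index(channels)
for filename in sorted(index):
    print(filename, "from", index[filename]["channel"])
```

The filename that appears in both channels ends up attributed to the first channel in the list, while the filename that is unique to the second channel is simply added.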
  8. Finding the package solution • The conda.resolve module (and also

    conda.logic). • This is the most complicated part (and also my favorite :-). • You’ve probably heard something like, “conda uses a SAT solver to solve dependencies”.
  9. Interlude (the SAT problem) • A propositional formula is a

    Boolean formula with variables (x1, x2, …) and boolean operations (¬ (NOT), ∨ (OR), ∧ (AND), → (IMPLIES), …). • Example: (¬x1 ∨ x2) → x3 • Variables can be assigned either TRUE or FALSE.
  10. Interlude (the SAT problem) • Given a variable assignment, a

    formula itself is either TRUE or FALSE. • Example: under the assignment x1 = TRUE, x2 = FALSE, x3 = FALSE, (¬x1 ∨ x2) → x3 is TRUE. • A formula with n variables has 2^n possible assignments.
  11. Interlude (the SAT problem) • A formula is called satisfiable

    if there is some variable assignment that makes it TRUE. • Example: (¬x1 ∨ x2) → x3 is satisfiable because the assignment x1 = TRUE, x2 = FALSE, x3 = FALSE, makes it TRUE. • A formula is called unsatisfiable if it is not satisfiable, i.e., every variable assignment makes it FALSE. • Example: x1 ∧ ¬x1 is unsatisfiable because all possible assignments (x1 = TRUE and x1 = FALSE) make it FALSE.
  12. Interlude (the SAT problem) • Naively, to check if a

    formula is satisfiable, plug in every possible assignment until we find one that makes it TRUE. • (Pseudocode)

        for assignment in all_possible_assignments(variables):
            if formula(assignment) == True:
                return True
        return False

    • The problem is there are 2^n possible assignments for n variables, and we want to be able to deal with formulas with many variables.
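The naive check can be written out as runnable Python. This is a minimal sketch using the example formula from the earlier slides; the function names are made up for illustration.

```python
from itertools import product


def formula(x1, x2, x3):
    """(¬x1 ∨ x2) → x3, written using A → B ≡ ¬A ∨ B."""
    return (not ((not x1) or x2)) or x3


def brute_force_sat(f, n):
    """Return a satisfying assignment of an n-variable formula, or None."""
    for assignment in product([True, False], repeat=n):  # 2**n candidates
        if f(*assignment):
            return assignment
    return None


print(brute_force_sat(formula, 3))
print(brute_force_sat(lambda x1: x1 and not x1, 1))
```

The loop visits up to 2**n assignments, which is exactly why this approach stops being practical once formulas have more than a few dozen variables.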
  13. Interlude (the SAT problem) • Fortunately, there are much more

    efficient algorithms. • Also fortunately, they are implemented in very fast, black box libraries called SAT solvers. conda uses one called picosat (the Python bindings are called pycosat). • Yes, SAT is NP-complete, but for our purposes, that won’t matter. • As a general rule, if you convert a problem that isn’t NP-complete to SAT, a SAT solver will be able to solve it efficiently.
  14. Interlude (the SAT problem) • A boolean formula is in

    conjunctive normal form (CNF) if it is an AND of ORs of atoms (variables or negations of variables). • Example: (x1 ∨ x2 ∨ ¬x3) ∧ (x1 ∨ ¬x4) ∧ ¬x1 • FACT: Every propositional formula can be rewritten in CNF.
  15. Interlude (the SAT problem) • The ORs in a CNF

    formula are called clauses. • Think of clauses as constraints. The ANDs require that each constraint be satisfied. •

        AND x1 ∨ x2 ∨ ¬x3
        AND x1 ∨ ¬x4
        AND ¬x1

    • A formula is satisfiable if there is a solution to the “constraints”.
  16. Interlude (the SAT problem) • Interpretation of common clauses: •

    x1 : x1 must be TRUE. • ¬x1 : x1 must be FALSE. • x1 ∨ x2 ∨ x3 : at least one of x1, x2, x3 must be TRUE. • ¬x1 ∨ x2 : The formula x1 → x2 must be TRUE (if x1 is TRUE, x2 must also be TRUE). • Variables typically have some meaning associated with them. For instance, in conda, a variable might represent a package, where TRUE means the package should be installed.
  17. Interlude (the SAT problem) • A SAT solver will take

    a formula in CNF and return a solution if the formula is SAT, or “UNSAT”. The solution is a mapping of variables that makes the formula TRUE (also called a model). • Say we have a formula and get a solution from a SAT solver. We want to know if there are any other solutions. • Say the model is {x1=FALSE, x2=TRUE, x3=TRUE, x4=FALSE}. • We want a solution that isn’t (x1=FALSE AND x2=TRUE AND x3=TRUE AND x4=FALSE)
  18. Interlude (the SAT problem) • We want a solution that

    isn’t (x1=FALSE AND x2=TRUE AND x3=TRUE AND x4=FALSE). • We want a solution that isn’t (¬x1 ∧ x2 ∧ x3 ∧ ¬x4). • We want a solution that is ¬(¬x1 ∧ x2 ∧ x3 ∧ ¬x4). • We want a solution that satisfies (x1 ∨ ¬x2 ∨ ¬x3 ∨ x4) (De Morgan’s law).
  19. Interlude (the SAT problem) • Add (x1 ∨ ¬x2 ∨

    ¬x3 ∨ x4) to the list of clauses and repeat. • We can repeat this until we get “UNSAT”, meaning there are no more solutions. • Note that finding all solutions in this way is almost always infeasibly inefficient, even if finding one is fast (there can be up to 2^n solutions to a formula with n variables). • Unless you structure your clauses so that there should be very few solutions.
  20. pycosat • Variables are represented by integers. 1 means x1.

    -1 means ¬x1. • A clause is a tuple of integers ((1, 2, -3)). • A formula is a set of clauses ({(1, 2, -3), (1, -4), (-1,)}).
  21. pycosat • pycosat has two functions, solve() and itersolve(). •

    solve() returns a single solution, or “UNSAT”; itersolve() returns an iterator over all solutions.

        >>> pycosat.solve({(1, 2, -3), (1, -4), (-1,)})
        [-1, -2, -3, -4]
        >>> list(pycosat.itersolve({(1, 2, -3), (1, -4), (-1,)}))
        [[-1, -2, -3, -4], [-1, 2, 3, -4], [-1, 2, -3, -4]]

    • [-1, 2, 3, -4] means {x1=FALSE, x2=TRUE, x3=TRUE, x4=FALSE}.
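The "add a clause that forbids the last model and repeat" loop from the interlude can be sketched without pycosat. The brute-force solve() below is only a stand-in for a real SAT solver that happens to use the same integer encoding; it is not how picosat works internally.

```python
from itertools import product


def solve(clauses, n):
    """Brute-force stand-in for a SAT solver over variables 1..n.

    Returns a model as a list of signed integers, or "UNSAT".
    """
    for values in product([False, True], repeat=n):
        # A clause holds if at least one literal agrees with the assignment.
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return [i + 1 if v else -(i + 1) for i, v in enumerate(values)]
    return "UNSAT"


def all_solutions(clauses, n):
    """Enumerate models by forbidding each model after it is found."""
    clauses = list(clauses)
    while True:
        model = solve(clauses, n)
        if model == "UNSAT":
            return
        yield model
        # De Morgan: "not this exact model" is the clause of negated literals.
        clauses.append(tuple(-lit for lit in model))


solutions = list(all_solutions([(1, 2, -3), (1, -4), (-1,)], 4))
print(solutions)
```

The models found are the same three that itersolve() reports above, although the order in which a solver returns them is not something to rely on.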
  22. Finding the package solution • Back to conda create -n

    myenv pandas numpy=1.8 • conda recursively gathers all packages and dependencies related to pandas and numpy 1.8 (basically, anything that might be installed). • Each dist is assigned to a variable:

        {"pandas-0.16.1-np19py34_0": 1,
         "pandas-0.14.1-np18py34_0": 2,
         …
         "python-3.4.3-0": 100,
         "python-3.4.2-0": 101,
         …
        }

    • A variable will be TRUE if the package should be installed.
  23. Finding the package solution • Generate clauses corresponding to package

    “rules”: • Only one version of a given package name is installed: (¬numpy1.8 ∨ ¬numpy1.9), (¬numpy1.7 ∨ ¬numpy1.8), (¬numpy1.7 ∨ ¬numpy1.9) (that is, NOT (numpy-1.8 AND numpy-1.9), …). • If a package is installed, so should its dependencies: (pandas0.14.1 → (python3.4.3 ∨ python3.4.2 ∨ …)), (pandas0.14.1 → (numpy1.8.2 ∨ numpy1.8.1 ∨ …)), … (as clauses, (NOT pandas-0.14.1 OR python-3.4.3 OR python-3.4.2 OR …)).
  24. Finding the package solution • Generate clauses corresponding to package

    “rules”: • A package matching each given spec should be installed: (numpy1.8.2 ∨ numpy1.8.1 ∨ …), (pandas0.16.1 ∨ pandas0.16.0 ∨ …) (for pandas and numpy 1.8). • Some stuff surrounding “features” that I won’t get into.
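The three kinds of rules can be sketched on a toy index. Everything here is a simplified assumption for illustration: the index layout, the shortened dist names, and generate_clauses() are hypothetical and much simpler than what conda.resolve does (features are ignored entirely).

```python
from itertools import combinations

# Hypothetical toy index: dist -> (package name, list of dependencies), where
# each dependency is given as the list of dists that would satisfy it.
index = {
    "pandas-0.14.1": ("pandas", [["numpy-1.8.2", "numpy-1.8.1"]]),
    "pandas-0.16.1": ("pandas", [["numpy-1.9.2"]]),
    "numpy-1.8.1": ("numpy", []),
    "numpy-1.8.2": ("numpy", []),
    "numpy-1.9.2": ("numpy", []),
}
# Assign each dist an integer variable, as in the pycosat encoding.
var = {dist: i for i, dist in enumerate(sorted(index), start=1)}


def generate_clauses(index, spec_matches):
    clauses = []
    # Rule 1: at most one dist per package name (NOT a OR NOT b for each pair).
    by_name = {}
    for dist, (name, _) in index.items():
        by_name.setdefault(name, []).append(dist)
    for dists in by_name.values():
        for a, b in combinations(sorted(dists), 2):
            clauses.append((-var[a], -var[b]))
    # Rule 2: dist implies one of the candidates for each of its dependencies.
    for dist, (_, deps) in index.items():
        for candidates in deps:
            clauses.append((-var[dist],) + tuple(var[c] for c in candidates))
    # Rule 3: at least one dist matching each requested spec is installed.
    for candidates in spec_matches:
        clauses.append(tuple(var[c] for c in candidates))
    return clauses


clauses = generate_clauses(index, [
    ["pandas-0.14.1", "pandas-0.16.1"],  # spec: pandas
    ["numpy-1.8.2", "numpy-1.8.1"],      # spec: numpy 1.8
])
print(clauses)
```

The resulting tuples are in the form a pycosat-style solver expects, so they could be handed to a SAT solver as they are.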
  25. Finding the package solution • Find solutions. • If the

    SAT solver returns “UNSAT”, then generate some hints (won’t go into here), and error.
  26. Finding the package solution • Problem: there are lots of

    solutions to the above clauses. • Every version of pandas from 0.13.0 to 0.14.1 has a numpy 1.8 build. • conda could also install some completely unrelated package (say, matplotlib), and the clauses would still be satisfied.
  27. Finding the package solution • Solution (easy): when gathering dists,

    only get the latest version of pandas and numpy 1.8. • Iterate over all solutions and find the one with the fewest TRUE variables (fewest installed packages). • If there is indeed only one such minimal solution, use that.
  28. Finding the package solution • This method fails in this

    case. The latest version of pandas, 0.16.1, has not been built against numpy 1.8. • What we really want is to look at all the solutions with all the dists, and find the one with the newest versions of everything (and also minimal number of installed packages). • We can’t just literally iterate over all solutions. There are combinatorially too many.
  29. Finding the package solution • Idea: construct linear equations of

    the packages, where install (TRUE) = 1, and don’t install (FALSE) = 0. • Say all of our versions of pandas are 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, and 0.16.1. • Construct an expression 0·p_0.16.1 + 1·p_0.16.0 + 2·p_0.15.2 + 3·p_0.15.1 + 4·p_0.15.0 + 5·p_0.14.1 + 6·p_0.14.0, where p_v is the variable for pandas version v. • Clearly our constraints will allow no more than one variable to be equal to 1. • We want to minimize this expression relative to our constraints.
  30. Finding the package solution • We really want to minimize

    all versions of every package. • Add equations like this for every package together into one big expression. • eq = 0·p_0.16.1 + 1·p_0.16.0 + 2·p_0.15.2 + 3·p_0.15.1 + 4·p_0.15.0 + 5·p_0.14.1 + 6·p_0.14.0 + 0·np_1.9.2 + 1·np_1.9.1 + 2·np_1.9.0 + …
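The coefficients of such an expression can be derived by ranking each package's versions from newest to oldest. The sketch below assumes purely numeric dotted versions (conda's real version ordering is richer), and the helper names are made up for illustration.

```python
def version_key(version):
    # Assumption: versions are dotted integers such as "0.16.1".
    return tuple(int(part) for part in version.split("."))


def version_coefficients(versions_by_name):
    """Map (name, version) to its coefficient: 0 for the newest, then 1, 2, ..."""
    coeffs = {}
    for name, versions in versions_by_name.items():
        newest_first = sorted(versions, key=version_key, reverse=True)
        for rank, version in enumerate(newest_first):
            coeffs[(name, version)] = rank
    return coeffs


def evaluate(coeffs, installed):
    """Value of the expression when exactly the given dists are TRUE."""
    return sum(coeffs[dist] for dist in installed)


coeffs = version_coefficients({
    "pandas": ["0.14.0", "0.14.1", "0.15.0", "0.15.1", "0.15.2", "0.16.0", "0.16.1"],
    "numpy": ["1.9.0", "1.9.1", "1.9.2"],
})
print(evaluate(coeffs, [("pandas", "0.14.1"), ("numpy", "1.9.1")]))
```

A value of 0 means every package is at its newest version; each step away from the newest version of any package adds 1.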
  31. Finding the package solution • How do we minimize a

    linear boolean equation relative to SAT constraints? • Answer: A pseudo-boolean SAT solver. • Or, if you don’t have one of those, convert the equation into SAT clauses.
  32. Finding the package solution • eq is not actually a

    boolean predicate (it’s a number, not TRUE or FALSE). • But a ≤ eq ≤ b is. • Convert a ≤ eq ≤ b to SAT clauses and minimize it by bisecting a and b. • Two algorithms are implemented in conda to do this: BDD (binary decision diagram) and sorter (as in “sorter network”).
  33. Finding the package solution

    [Figure (page header: N. Eén and N. Sörensson): “Figure 2. The BDD, and corresponding RBC, for the constraint ‘a + b + 2c + 2d + 3e + 3f + 3g + 3h + 7i ≥ 8’. The BDD terminals are denoted by ‘0’ and ‘1’. …”] • BDD for a + b + 2c + 2d + 3e + 3f + 3g + 3h + 7i ≥ 8. • The BDD is translated to SAT clauses via ITEs and Tseitin transformations. Image from A Translation of Pseudo-Boolean Constraints to SAT, Bailleux, Boufkhad, and Roussel
  34. Finding the package solution • Convert a ≤ eq ≤

    b to SAT clauses and minimize it by bisecting a and b. • We expect the solution to be near 0 (newest versions of everything), so the bisection checks [0, 10], [11, 21], … and bisects until it finds a solution.
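The windowed search followed by bisection can be sketched with a stand-in oracle. Here satisfiable(a, b) is assumed to answer "do the package clauses plus a ≤ eq ≤ b have a solution?"; in conda each such call would be a SAT solve, while this example fakes it with a fixed set of achievable values.

```python
def minimize(satisfiable, max_value, step=10):
    """Return the smallest achievable value of eq in [0, max_value], or None."""
    lo, hi = 0, step
    # Scan windows [0, 10], [11, 21], ... until one contains a solution.
    while not satisfiable(lo, hi):
        lo, hi = hi + 1, hi + 1 + step
        if lo > max_value:
            return None
    # Bisect inside the window, always keeping the lower half if it is feasible.
    while lo < hi:
        mid = (lo + hi) // 2
        if satisfiable(lo, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Stand-in oracle: pretend only these values of eq are achievable.
achievable = {17, 23, 40}
oracle = lambda a, b: any(a <= v <= b for v in achievable)
print(minimize(oracle, max_value=100))
```

Starting with small windows near 0 reflects the expectation that the newest versions of everything are usually installable, so few solver calls are needed in the common case.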
  35. Finding the package solution • You can see this happening

    with conda --debug. • Each . after “Solving package specifications” is a bisection step.
  36. Finding the package solution • There can still be multiple

    solutions, because you can always add some redundant package to the solution. • We still need to find the solution with the fewest installed packages. • Try iterating through all solutions, as before, to find the one with the fewest number of TRUE variables.
  37. Finding the package solution • There could be a lot

    of solutions, meaning iteration would be too slow. • Set an iteration limit of 1000. If that is surpassed, set up an equation p1 + p2 + … + pn where each pi is a package, and minimize it. • Same idea as before, except this time, BDDs are too slow, so set up a sorter network.
  38. Sorter networks • A sorter network is a predetermined set

    of comparisons to sort 2^n items. Image from Wikipedia
  39. Sorter networks • Sorter networks are commonly used to build

    predetermined sorters in hardware. • If the input to a sorter network is all 0s and 1s, then in the output, all 0s will come before all 1s. • This can be used to count the number of 1s, i.e., compute p1 + p2 + … + pn.
  40. Sorter networks • Example: 1 ≤ x1 + x2 +

    x3 + x4 ≤ 3 • 0s will sort to the top and 1s to the bottom. [Figure: a sorter network with inputs x1, x2, x3, x4; one output is labeled “this should be 0 (≤ 3)” and another “this should be 1 (≥ 1)”.]
  41. Sorter networks • A sorter network can be generated for

    k = 2^n numbers using odd-even merge sort (using O(k log² k) comparisons). • If k ≠ 2^n, pick the next largest power of 2 and set the extra inputs to 0. • Comparisons in the sorter network are (min, max), which for TRUE and FALSE is (AND, OR).
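The construction can be sketched in plain Python using the textbook recursive form of Batcher's odd-even merge sort. Here the network is evaluated directly on booleans rather than translated to SAT clauses, and the function names and the in_range() helper are assumptions for illustration, not conda's code.

```python
def oddeven_merge(lo, hi, r):
    """Comparators merging the two sorted halves of wires lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Comparators sorting wires lo..hi (inclusive); the count must be 2**n."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def run_network(bits):
    k = 1
    while k < len(bits):
        k *= 2
    wires = list(bits) + [False] * (k - len(bits))  # pad extra inputs with 0
    for a, b in oddeven_merge_sort(0, k - 1):
        # (min, max) on booleans is (AND, OR); FALSE moves toward the top.
        wires[a], wires[b] = wires[a] and wires[b], wires[a] or wires[b]
    return wires


def in_range(bits, a, b):
    """Check a <= sum(bits) <= b by looking at two output wires only."""
    out = run_network(bits)
    k = len(out)
    at_least_a = a <= 0 or (a <= k and out[k - a])
    at_most_b = b >= k or not out[k - b - 1]
    return at_least_a and at_most_b


print(run_network([True, False, True, True]))
print(in_range([True, False, True, True], 1, 3))
```

Because the comparator list depends only on the number of wires, changing the bounds a and b only changes which output wires are inspected, which mirrors the point that the clauses need to be generated just once.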
  42. Sorter networks • Great thing about sorter networks: only have

    to generate the clauses for the network once. • To change a and b to bisect a ≤ eq ≤ b just change which output variables are asserted. • This could also be used for the version equation with 3x = x + x + x, but the SAT solver has a harder time with this than with BDD (even though generating the clauses is faster).
  43. Finding the package solution • So we have a solution

    of packages that has the newest possible versions and the fewest number of packages. • If there are multiple solutions, print a warning. • If there are no solutions, generate a hint. • Find a minimal subset of the dists that break together.
  44. Can we do better? • This can be slow (especially

    if older versions are required) • The problem is the pseudo-boolean algorithm. • SAT solvers are designed for feasibility, not optimization. • Other kinds of solvers can solve the same problems more directly (like MILP or SMT solvers). • Can we find a good, lightweight, open source, Python solver that is faster? • Answer: I hope so.
  45. The package plan • Now we know what packages to

    install from the dependency solver. • What do we actually do to install them?
  46. The package plan • The conda.instructions module has a set

    of instructions that conda can perform: • FETCH • EXTRACT • UNLINK • LINK • RM_EXTRACTED • RM_FETCHED • PREFIX • PRINT • PROGRESS • SYMLINK_CONDA
  47. The package plan • FETCH: Download the package from the

    channel to the package cache. • EXTRACT: Extract the downloaded tarball in the package cache. • UNLINK: Remove an existing package from the prefix. • LINK: Install the package to the prefix. This instruction has an argument, linktype, which can be LINK_HARD, LINK_SOFT, or LINK_COPY.
  48. The package plan • RM_EXTRACTED: Remove an existing extracted package

    from the package cache. • RM_FETCHED: Remove an existing tarball from the package cache. • PREFIX: The install prefix. Does not represent an actual action. • PRINT: Print a message to the screen. • PROGRESS: Create a progress bar. • SYMLINK_CONDA: Create a symlink for bin/conda, bin/activate, and bin/deactivate in the prefix to the root environment.
  49. The package plan • Here is the plan for

    my example.

        [('PREFIX', '/Users/aaronmeurer/anaconda/envs/myenv'),
         ('PRINT', 'Fetching packages ...'),
         ('FETCH', 'numpy-1.8.2-py34_0'),
         ('FETCH', 'wheel-0.26.0-py34_0'),
         ('FETCH', 'scipy-0.14.0-np18py34_0'),
         ('FETCH', 'pandas-0.14.1-np18py34_0'),
         ('PRINT', 'Extracting packages ...'),
         ('PROGRESS', '4'),
         ('EXTRACT', 'numpy-1.8.2-py34_0'),
         ('EXTRACT', 'wheel-0.26.0-py34_0'),
         ('EXTRACT', 'scipy-0.14.0-np18py34_0'),
         ('EXTRACT', 'pandas-0.14.1-np18py34_0'),
         ('PRINT', 'Linking packages ...'),
         ('PROGRESS', '17'),
         ('LINK', 'ncurses-5.9-1 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'openssl-1.0.1k-1 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'sqlite-3.8.4.1-1 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'tk-8.5.18-0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'xz-5.0.5-0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'zlib-1.2.8-1 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'readline-6.2.5-1 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'python-3.4.3-0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'numpy-1.8.2-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'pytz-2015.6-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'setuptools-18.1-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'six-1.9.0-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'wheel-0.26.0-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'dateutil-2.4.1-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'pip-7.1.2-py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'scipy-0.14.0-np18py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('LINK', 'pandas-0.14.1-np18py34_0 /Users/aaronmeurer/anaconda/pkgs 1'),
         ('SYMLINK_CONDA', '/Users/aaronmeurer/anaconda')]
  50. The package plan • You are probably more used to

    seeing this pretty output.

        Package plan for installation in environment /Users/aaronmeurer/anaconda/envs/myenv:

        The following packages will be downloaded:

            package                    |            build
            ---------------------------|-----------------
            numpy-1.8.2                |           py34_0         2.6 MB  defaults
            wheel-0.26.0               |           py34_0          77 KB  defaults
            scipy-0.14.0               |       np18py34_0        10.6 MB  defaults
            pandas-0.14.1              |       np18py34_0         4.2 MB  defaults
            ------------------------------------------------------------
                                                   Total:        17.5 MB

        The following NEW packages will be INSTALLED:

            dateutil:   2.4.1-py34_0      defaults
            ncurses:    5.9-1             asmeurer
            numpy:      1.8.2-py34_0      defaults
            openssl:    1.0.1k-1          defaults
            pandas:     0.14.1-np18py34_0 defaults
            pip:        7.1.2-py34_0      defaults
            python:     3.4.3-0           defaults
            pytz:       2015.6-py34_0     defaults
            readline:   6.2.5-1           asmeurer
            scipy:      0.14.0-np18py34_0 defaults
            setuptools: 18.1-py34_0       defaults
            six:        1.9.0-py34_0      defaults
            sqlite:     3.8.4.1-1         defaults
            tk:         8.5.18-0          defaults
            wheel:      0.26.0-py34_0     defaults
            xz:         5.0.5-0           defaults
            zlib:       1.2.8-1           asmeurer
  51. Terminology • I’m going to call high-level commands that

    perform many actions “operations” (e.g., everything done by conda install scipy). • An action is one of the ten instruction types listed above. A single action may result in several instructions (for example, the LINK action will result in a LINK instruction for every package to be linked). • An instruction is a specific action with specific arguments. Instructions are ordered relative to other instructions. • A plan is an ordered list of instructions.
  52. The package plan • Different operations generate different sets of

    actions. • For example, RM_EXTRACTED is used by conda install -f. • We are interested in conda.plan.install_actions().
  53. conda.plan.install_actions() • Takes as arguments the install prefix, the index,

    and the install specs (and some options). • First gathers linked packages (conda.install.linked(prefix)). • Then the specs are modified.
  54. Spec modification • Unless --no-pinned was used, any specs in

    prefix/conda-meta/pinned are added to the specs. • Python is added to the specs.
  55. Automatic Python version tracking • Conda does not consider what

    packages are already installed when solving package specifications. • This would be a nice feature, but we need a faster solver. • It might also break many people’s setups. A lot of people have environments that don’t strictly have consistent dependencies.
  56. Automatic Python version tracking • However, we want to know

    about what Python version is installed, as a common case. • conda packages require different builds for different Python versions. • C extensions must be rebuilt separately. • Even for pure Python, the directories are different on Unix (lib/python2.7/site-packages/). • Also could have different dependencies, different code (e.g., if it uses 2to3). • Note: the build string typically contains the Python version as a convention, but conda does not use this! It looks at the dependency metadata directly.
  57. Automatic Python version tracking • conda.plan.add_defaults_to_specs() uses the following logic

    to add Python to the specs: • If Python is already in the specs with a version, don’t do anything. Example: conda install scipy python=3.3 • If none of the specs depend on Python, don’t do anything. For simplicity, we only check packages from the latest version for each spec. Example: conda install hdf5
  58. Automatic Python version tracking • If a spec is explicit

    (includes version and build string), don’t do anything. Example: conda install scipy=0.16.0=np19py34_0 • If Python is already linked in the prefix, add that version. Example: conda create -n env python=3.3; conda install -n env scipy
  59. Automatic Python version tracking • Finally, if the root environment

    Python is Python 2, and none of the above conditions apply, add Python 2 to the specs. This way, people who want to use Python 2 will create Python 2 environments. Example: conda create -n env scipy (where conda info shows Python 2.7). • The Python spec is always added like python 2.7*, python 3.3*, etc. • The * in the version spec matches any characters. More on this below.
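The rules from the last three slides can be condensed into one function. This is a simplified sketch, not conda.plan.add_defaults_to_specs() itself: the function name and its arguments are hypothetical, and the checks that conda performs against the index and the prefix are passed in as plain values. It also assumes that a single explicit spec is enough to skip adding Python.

```python
def python_spec_to_add(specs, depends_on_python, linked_python, root_python):
    """Return the Python spec to append to specs, or None.

    specs: dependency-style spec strings, e.g. ["scipy", "python 3.3*"].
    depends_on_python: whether any requested package depends on Python.
    linked_python: version of Python already linked in the prefix, or None.
    root_python: version of Python in the root environment.
    """
    for spec in specs:
        parts = spec.split()
        if parts[0] == "python" and len(parts) > 1:
            return None  # Python was already requested with a version
        if len(parts) == 3:
            return None  # explicit spec: name, version, and build string
    if not depends_on_python:
        return None
    if linked_python is not None:
        version = linked_python  # follow the Python already in the prefix
    elif root_python.startswith("2."):
        version = root_python    # Python 2 root: default to Python 2
    else:
        return None
    major_minor = ".".join(version.split(".")[:2])
    return "python %s*" % major_minor


print(python_spec_to_add(["scipy"], True, "3.3.5", "2.7.10"))
print(python_spec_to_add(["scipy"], True, None, "2.7.10"))
```

When the function returns None, the specs are left alone and the solver is free to pick whichever Python the other constraints allow.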
  60. conda.plan.install_actions() • Now the specs are sent to the solver

    (what was already discussed). • The solver returns a list of packages to be installed. • Some packages may be filtered out at this point. For example, if --no-deps is used, only the specified packages are kept. • Packages are sorted topologically by dependency, so that they can be installed in reverse dependency order (if possible). • This way post-link scripts can assume that dependencies are installed (useful if the script needs Python, for instance).
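A dependency-first ordering can be produced with a short depth-first traversal. This is a generic sketch rather than conda's implementation; it assumes the dependency graph has no cycles and uses package names instead of full dists.

```python
def toposort(depends):
    """Order packages so that each one comes after all of its dependencies.

    depends maps a package name to the list of names it depends on.
    Assumes there are no dependency cycles.
    """
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in sorted(depends.get(name, ())):
            visit(dep)
        ordered.append(name)  # appended only after all of its dependencies

    for name in sorted(depends):
        visit(name)
    return ordered


depends = {
    "pandas": ["numpy", "python", "pytz"],
    "numpy": ["python"],
    "pytz": ["python"],
    "python": ["zlib"],
    "zlib": [],
}
order = toposort(depends)
print(order)
```

Linking in this order means that by the time a package's post-link script runs, everything it depends on is already in the prefix.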
  61. conda.plan.ensure_linked_actions() • A dictionary of actions is generated (conda.plan.ensure_linked_actions()). •

    For each package: • If it is already installed, do nothing. • If it is not already extracted and/or fetched, add actions to do that. • If a different version of it is installed, add an UNLINK action for it.
  62. conda.plan.ensure_linked_actions() • There is also a lot of logic here

    to determine if a package should be linked via hard links, soft links, or copying. • Soft links are used automatically if hard links from the pkgs cache are not possible. • Or copying if configured, or if soft links are not possible (e.g., on Windows).
  63. conda.plan.install_actions() • The SYMLINK_CONDA action is performed on every non-root

    prefix on Unix, to ensure each environment can be activated (where the root environment will be removed from the PATH). • The PREFIX meta-action is also added.
  64. The package plan • There are similar functions for different

    kinds of operations, like removing a package, removing a feature, and reverting to a revision from the history file.
  65. The package plan • Once the actions are generated, they

    are printed in a pretty form (conda.plan.display_actions()). • Next the actions are converted to a plan (conda.plan.plan_from_actions()).
  66. conda.plan.plan_from_actions() • The actions dictionary is sorted using an operations

    order. • Either actions['op_order'] or the default order. • The default order is (FETCH, EXTRACT, UNLINK, LINK, SYMLINK_CONDA, RM_EXTRACTED, RM_FETCHED).
  67. conda.plan.plan_from_actions() • Some new instructions are injected into the plan.

    • “Verb” actions get PRINT instructions (like “Fetching packages…”) • “Progress” actions get PROGRESS instructions. These set up progress bars that update for the next n instructions. • The result is a sorted list of instructions.
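The conversion from an actions dictionary to a plan can be sketched as follows. This is a simplified re-creation based only on the description above and the example plan shown earlier, not conda's source: the function name sketch_plan_from_actions and the choice of which actions count as "verb" and "progress" actions are assumptions.

```python
DEFAULT_ORDER = ("FETCH", "EXTRACT", "UNLINK", "LINK", "SYMLINK_CONDA",
                 "RM_EXTRACTED", "RM_FETCHED")
# Assumed for illustration, inferred from the example plan.
VERBS = {"FETCH": "Fetching", "EXTRACT": "Extracting", "LINK": "Linking"}
PROGRESS_ACTIONS = {"EXTRACT", "LINK"}


def sketch_plan_from_actions(actions):
    """Turn {action: [arguments]} into an ordered list of instructions."""
    order = actions.get("op_order", DEFAULT_ORDER)
    plan = [("PREFIX", actions["PREFIX"])]
    for op in order:
        args = actions.get(op)
        if not args:
            continue
        if op in VERBS:
            plan.append(("PRINT", "%s packages ..." % VERBS[op]))
        if op in PROGRESS_ACTIONS:
            # The progress bar covers the next len(args) instructions.
            plan.append(("PROGRESS", str(len(args))))
        plan.extend((op, arg) for arg in args)
    return plan


plan = sketch_plan_from_actions({
    "PREFIX": "/envs/myenv",
    "LINK": ["python-3.4.3-0", "numpy-1.8.2-py34_0"],
    "FETCH": ["numpy-1.8.2-py34_0"],
    "SYMLINK_CONDA": ["/anaconda"],
})
for instruction in plan:
    print(instruction)
```

Note that the order of keys in the input dictionary does not matter; only the operations order decides where each group of instructions lands.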
  68. Executing the instructions • Each instruction is mapped to a

    function. • The functions are executed in the order they are given by the plan. • This modular system makes it easy to add different kinds of operations.
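The instruction-to-function mapping is essentially a dispatch table. The sketch below shows the idea with made-up handler functions that only record what they would do; the names execute_plan and commands are illustrative and are not taken from conda.instructions.

```python
def execute_plan(plan, commands):
    """Run each (instruction, argument) pair with its registered function."""
    state = {}
    for instruction, arg in plan:
        commands[instruction](state, arg)
    return state


log = []
# Hypothetical handlers: each takes the shared state and one argument.
commands = {
    "PREFIX": lambda state, arg: state.update(prefix=arg),
    "PRINT": lambda state, arg: log.append(arg),
    "FETCH": lambda state, arg: log.append("fetch %s" % arg),
    "LINK": lambda state, arg: log.append(
        "link %s into %s" % (arg, state["prefix"])),
}

state = execute_plan([
    ("PREFIX", "/envs/myenv"),
    ("PRINT", "Fetching packages ..."),
    ("FETCH", "numpy-1.8.2-py34_0"),
    ("LINK", "numpy-1.8.2-py34_0"),
], commands)
print(log)
```

Supporting a new kind of instruction then only requires registering one more function in the table.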
  69. has_prefix • One note on the LINK action: the has_prefix

    file. • A conda package may have an info/has_prefix file. • This file specifies files with a placeholder prefix, which should be replaced with the final install prefix. • The default placeholder is /opt/anaconda1anaconda2anaconda3, but it can also be set to anything (such as the original build prefix).
  70. has_prefix • This allows files in the package to reference

    where they are installed as an absolute path, and it will always be correct. • A common example is shebang lines, but shebang lines are not special-cased. It works with any instance of the install prefix in any text file.
  71. has_prefix • This works well for text files, but it

    can also work with binary files, with some caveats. • For binary files, it assumes C-style null-terminated strings. • If the placeholder is longer than the install prefix, it can be replaced by padding the end of the string with null characters.
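Both kinds of replacement can be sketched in a few lines. The function names here are illustrative and the code is a simplified take on the idea described above: text files get a plain substitution, while binary data is patched one null-terminated string at a time so that the total length never changes.

```python
import re


def replace_text(data, placeholder, prefix):
    """Text files: substitute every occurrence of the placeholder."""
    return data.replace(placeholder, prefix)


def replace_binary(data, placeholder, prefix):
    """Binary files: substitute inside C strings and pad with NUL bytes."""
    if len(prefix) > len(placeholder):
        raise ValueError("install prefix is longer than the placeholder")
    # Match from a placeholder up to the NUL that terminates its C string.
    pattern = re.compile(re.escape(placeholder) + b"[^\0]*?\0")

    def pad(match):
        original = match.group()
        count = original.count(placeholder)
        shrink = (len(placeholder) - len(prefix)) * count
        # Pad after the terminator so the byte length stays the same.
        return original.replace(placeholder, prefix) + b"\0" * shrink

    return pattern.sub(pad, data)


placeholder = b"/opt/anaconda1anaconda2anaconda3"
data = b"\x7fELF" + placeholder + b"/lib/libz.so\0" + b"tail"
patched = replace_binary(data, placeholder, b"/home/me/env")
print(len(data), len(patched))
print(replace_text("#!/opt/anaconda1anaconda2anaconda3/bin/python\n",
                   "/opt/anaconda1anaconda2anaconda3", "/home/me/env"))
```

Keeping the length fixed is what makes this safe for binaries, since offsets elsewhere in the file stay valid; it is also why the install prefix cannot be longer than the placeholder.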
  72. The spec • Conda has a well-defined spec (not to

    be confused with “specs” as defined above), which defines what .tar.bz2 packages and repodata.jsons look like. • The spec is outlined at http://conda.pydata.org/docs/spec.html • Note that the behavior of conda (just described) is not part of the spec. It’s just an implementation detail. • You could write your own “conda” that does things completely differently.
  73. Dependency specifications • One thing of note is the dependency

    specification spec. • A package dependency is a string with one, two, or three parts, separated by spaces. • The first part is the name of the package, like python or numpy. • The second part is a version string.
  74. Dependency specifications • Version strings can start with ==, !=,

    >=, <=, >, or <, which do what you expect. • A version string can also contain a *, which matches 0 or more characters (like .* in a regular expression). • Versions can be separated by |, meaning “or” and ,, meaning “and” (, has higher precedence than |).
  75. Dependency specifications • The third part can be the build

    string. If there are three parts, the version needs to be an exact match (no inequalities, stars, or booleans). • Such exact specs are called explicit. For example, numpy 1.8.1 py27_0. • If conda’s solver sees only explicit specs, it can side-step the SAT solver and just install those exact packages.
  76. Dependency specifications • Example: the following all match numpy-1.8.1-py27_0: •

    numpy • numpy 1.8* • numpy 1.8.1 • numpy >=1.8 • numpy ==1.8.1 • numpy 1.8|1.8* • numpy >=1.8,<2 • numpy >=1.8,<2|1.9 • numpy 1.8.1 py27_0
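The matching rules can be checked against this list with a small matcher. This is a sketch under simplifying assumptions, not conda's MatchSpec: versions are compared as tuples of integers (conda's real version ordering is richer), and the function names are made up for illustration.

```python
import operator
import re

OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
       "<=": operator.le, ">": operator.gt, "<": operator.lt}


def version_key(version):
    # Assumption: versions are dotted integers such as "1.8.1".
    return tuple(int(part) for part in version.split("."))


def match_atom(atom, version):
    m = re.match(r"(==|!=|>=|<=|>|<)(.+)", atom)
    if m:
        return OPS[m.group(1)](version_key(version), version_key(m.group(2)))
    if "*" in atom:
        # "*" matches zero or more characters, like ".*" in a regex.
        regex = re.escape(atom).replace(r"\*", ".*")
        return re.fullmatch(regex, version) is not None
    return atom == version


def match_version(version_spec, version):
    # "," binds tighter than "|": the spec is an OR of ANDs.
    return any(all(match_atom(atom, version) for atom in alt.split(","))
               for alt in version_spec.split("|"))


def match(spec, name, version, build):
    parts = spec.split()
    if parts[0] != name:
        return False
    if len(parts) >= 2 and not match_version(parts[1], version):
        return False
    if len(parts) == 3 and parts[2] != build:
        return False
    return True


specs = ["numpy", "numpy 1.8*", "numpy 1.8.1", "numpy >=1.8", "numpy ==1.8.1",
         "numpy 1.8|1.8*", "numpy >=1.8,<2", "numpy >=1.8,<2|1.9",
         "numpy 1.8.1 py27_0"]
print(all(match(s, "numpy", "1.8.1", "py27_0") for s in specs))
```

Splitting on "|" first and on "," second is what gives "," the higher precedence, so numpy >=1.8,<2|1.9 reads as (>=1.8 and <2) or 1.9.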
  77. Dependency specifications • A gotcha: The dependency specifications we are

    talking about here are not the same as what is specified at the command line. • Packages at the command line are separated by =’s, not spaces. • python=3 generates the spec python 3|3*. • You can use inequalities at the command line, like “python>2” (although be sure to quote them, or else they might be interpreted by your shell).
  78. Dependency specifications • See the full docs for dependency specifications

    at http://conda.pydata.org/docs/spec.html#package-match-specifications.
  79. Building packages • Conda proper is only concerned with installing

    packages, which is a relatively simple process (unpack the tarball and link the files). • A lot of complexity of packages should be handled at build time. • Anyone can build a conda package if they follow the spec, but it’s best to use conda-build.
  80. Building packages • conda-build does a lot of stuff for

    you automatically. • Validates your metadata, and generates a valid conda package. • Creates a separate, reproducible build environment with your build dependencies, and gathers the files into a package automatically. • Can run tests on the package after it is built. • A lot of convenience things for Python packages (like automatic entry point generation and automatically depending on a major Python version like python 3.4*).
  81. Building packages • It also does a lot “behind the

    scenes”. • Automatically generates has_prefix files so that hard-coded instances of the install prefix in plain text files are always set to the install prefix. • has_prefix can also be generated automatically for binary files by setting detect_binary_files_with_prefix: true in the meta.yaml. • This automatically uses a long build prefix so that the replacement can happen for any install prefix.
  82. Building packages • It also does a lot “behind the

    scenes”. • On Linux, it automatically changes the RPATH in shared libraries to be relative (otherwise, they would reference the build prefix). • On OS X, a similar thing is done with the install names. • Absolute symlinks to the build prefix are made relative. • Without these so-called “relocatable” steps, packages would have to be installed in the same prefix where they were built.
  83. Building packages • It also does a lot “behind the

    scenes”. • Egg directories for Python packages are “flattened” out (keeps the sys.path cleaner). • Permissions on files are fixed.
  84. Building packages • Most importantly, packages built with conda build

    are reproducible. • Someone should be able to take the same recipe on a similar machine and build the same package. • I wouldn’t recommend building a package without using conda build, unless you are doing it as an exercise, or you (really) know what you are doing.
  85. Questions? • Read the docs (http://conda.pydata.org/docs/). • Read the conda

    source code (it’s pretty readable, once you understand the terms defined in the glossary at the beginning of this presentation). • Ask on the conda mailing list (http://groups.google.com/a/continuum.io/group/conda/). • @mention me on Twitter or GitHub (@asmeurer). Note that I don’t work for Continuum anymore.