Using Docker Containers to Improve Reproducibility in Software Engineering Research

Jürgen Cito

May 17, 2016

Transcript

  1. ICSE’16 Technical Briefing, May 17th, Austin, TX
    Photo Credits: Astrid Westvang, https://flic.kr/p/pWJLCW
    Using Docker Containers to Improve Reproducibility
    in Software Engineering Research
    Jürgen Cito, Harald C. Gall

  2. Jürgen Cito
    PhD @ UZH
    Harald Gall
    Prof @ UZH
    Software Evolution
    Cloud-based Software Engineering
    Human Factors in Software Engineering

  3. This technical briefing…
    Conceptual, more abstract notion of Reproducibility in Research
    Concrete instructions to aid Reproducibility in Research

  4. What is Reproducibility?
    Reproducibility is the ability of an entire experiment or
    study to be duplicated, either by the same researcher
    or by someone else working independently.
    Reproducing an experiment is called replicating it.
    No research paper can ever be considered to be the final word, and the replication
    and corroboration of research results is key to the scientific process.

  5. What is Reproducibility?
    Repeatability of a certain process in order to establish a fact or
    the conditions under which we are able to observe the same fact*
    A process to share methods and describe the environment in order to recreate results.
    * Mockus et al. “Experiences from replicating a case study to investigate reproducibility of software development.”

  6. Scientific Process
    Experiment Design
    Data Collection
    Data Analysis
    Interpret Results
    Hypotheses
    Reproducibility

  7. What is Reproducibility?
    Establishing facts
    > Steps (or method) to establish the fact
    > Sharing computational knowledge
    Controlling environment
    > Execution Environment
    > Dependencies
    Providing data
    > Ability to interpret data
    > Computational Analyses
    Low barriers to replicate
    > Comprehensible results
    > Ease of achieving replication

  8. What is Reproducibility in SE Research?
    Algorithms / Computational Analyses
    Developed Tools or Prototypes
    Quantitative Evaluations
    + internal knowledge of the necessary process to
    derive/establish results


  9. Artifact Evaluation & Replication Packages
    In Software Engineering Research:

    - FSE’15/16, MSR’15/16
    In Programming Languages Research:

    - PLDI, POPL, OOPSLA

  10. Current State of Sharing Artifacts in SE
    Researcher’s website

  11. Case Study: ChangeDistiller
    https://bitbucket.org/sealuzh/tools-changedistiller
    Research Project

  12. Why is reproducibility hard?
    Why does it fail?

  13. Reasons for Failed Reproducibility (1/2)
    Here is a link to download the code for this paper.
    Good luck trying to download the only post-doc who
    knows how to run this thing*
    * Paraphrase of a tweet I cannot seem to find anymore

  14. Reasons for Failed Reproducibility (2/2)
    Source: Collberg et al., “Measuring Reproducibility in Computer Systems Research”, http://reproducibility.cs.arizona.edu/tr.pdf

  15. Case Study: ChangeDistiller
    https://bitbucket.org/sealuzh/tools-changedistiller
    Research Project
    - Developed 2006-2009

    How many Java versions have we passed?

    - Dependencies defined in a Maven pom file

    Are they all still available in the repository?

    - How does analysis in ChangeDistiller work?

    What is the entry point?

  16. Challenges in Reproducibility
    > No standard way of describing experiments, environments, (derived)
    data, and workflows
    > No transparency in creating environments and the steps/methods to
    establish facts or recreate analysis

    > Experimental nature of research code and ecosystems often makes it
    hard to build
    > Unresolved or undocumented dependencies
    > Infrastructure for storage and distribution

  17. … to the rescue

  18. Docker Containers to the rescue (1/2)
    What is Docker?
    Docker allows you to package an application with all of its
    dependencies into a standardized unit for software development*
    Containers consist of everything that enables software to run:

    > Code

    > Runtime

    > System Tools

    > System Libraries
    * https://www.docker.com/what-docker

  19. Docker Containers to the rescue (2/2)
    What can Docker be for SE research?
    Docker allows you to package a 

    - Prototype

    - Proof-of-concept Implementation

    - Computational analysis or experiment 

    with all of its dependencies into a standardized unit for
    reproducible research
    * https://www.docker.com/what-docker

  20. Technical Overview / Virtual Machines vs Containers
    https://www.docker.com/what-is-docker
    “Lightweight” VM
    > Container is an isolated process (“chroot on steroids”)
    > Own process space
    > Own network interface
    > Feels like a VM
    > Shares kernel with the host
    > Isolation through cgroups/namespaces
    https://blog.docker.com/2016/03/containers-are-not-vms/
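
    The shared kernel can be observed directly by comparing the kernel release on the host with the one reported inside a container. The following is a minimal sketch, assuming a Linux host with Docker installed and access to the public alpine image; the run helper and the RUN_DOCKER variable are hypothetical conveniences, not part of Docker. On macOS or Windows, containers run inside a Linux VM, so the container would report that VM's kernel instead.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1,
# so the sketch can be read through on a machine without Docker.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Kernel release reported by the host.
run uname -r

# Kernel release reported from inside a container (--rm deletes it on exit).
# On a Linux host both commands are expected to report the same kernel.
run docker run --rm alpine uname -r
```

    Run with RUN_DOCKER=1 to actually execute the commands; a virtual machine would report the kernel of its own guest OS instead.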

  21. Docker Engine
    Centralized runtime environment for containers
    Enables portability
    Sole dependency for running Docker containers
    No Emulation layer (almost no performance impact)
    https://www.docker.com/products/docker-engine

  22. Benefits of Docker Containers
    Fast instantiation (~1-3 seconds)
    Almost native performance
    Transparent build process
    Smaller Images
    Easy to build, share, and publish
    (also compared to other container technology)
    * https://www.docker.com/what-docker

  23. Local Docker Workflow
    Dockerfile --build--> Docker Image --run--> Docker Container
    # Build redis from source
    # Make sure you have the redis source code checked out in
    # the same directory as this Dockerfile
    FROM ubuntu:12.04
    MAINTAINER dockerfiles http://dockerfiles.github.io
    RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
    RUN apt-get update
    RUN apt-get upgrade -y
    RUN apt-get install -y gcc make g++ build-essential libc6-dev tcl wget
    RUN wget http://download.redis.io/redis-stable.tar.gz -O - | tar -xvz
    # RUN tar -zvzf /redis/redis-stable.tar.gz
    RUN (cd /redis-stable && make)
    RUN (cd /redis-stable && make test)
    RUN mkdir -p /redis-data
    VOLUME ["/redis-data"]
    EXPOSE 6379
    ENTRYPOINT ["/redis-stable/src/redis-server"]
    CMD ["--dir", "/redis-data"]

  24. Terminology
    Dockerfile
    Declarative definition of an environment for producing an image
    Docker Image
    Immutable artifact built from a Dockerfile; consists of one or more layers
    Docker Container
    Execution environment: an instantiated, running version of an image (can be
    parameterized)
    Docker Registry
    Public or private repository that stores images and allows for their distribution
    (Docker Hub - https://hub.docker.com/ or CoreOS Quay - https://quay.io/)


  25. Local Docker Workflow
    Dockerfile --build--> Docker Image --run--> Docker Container
    (same Dockerfile and workflow diagram as on slide 23)

  26. Dockerfile
    Definition of infrastructure and dependencies of a container through instructions
    # Build redis from source
    # Make sure you have the redis source code checked out in
    # the same directory as this Dockerfile
    FROM ubuntu:12.04
    MAINTAINER dockerfiles
    RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
    RUN apt-get update
    RUN apt-get upgrade -y
    RUN apt-get install -y gcc make g++ build-essential libc6-dev tcl wget
    RUN wget http://download.redis.io/redis-stable.tar.gz -O - | tar -xvz
    # RUN tar -zvzf /redis/redis-stable.tar.gz
    RUN (cd /redis-stable && make)
    RUN (cd /redis-stable && make test)
    COPY redis.conf /var/www/redis.conf
    RUN mkdir -p /redis-data
    VOLUME ["/redis-data"]
    EXPOSE 6379
    ENTRYPOINT ["/redis-stable/src/redis-server"]
    CMD ["--dir", "/redis-data"]
    Annotations:
    FROM: Base Image. Can be an OS (Ubuntu) or a different, existing image
    RUN: Install Dependencies. Runs commands as if you were typing them in the command line
    COPY: Copies local files from the build context into the container
    VOLUME: Volume
    EXPOSE: Open Port
    ENTRYPOINT / CMD: Start Server

  27. Data Volumes
    A specially-designated directory within one or more
    containers that bypasses the Union File System*


    Volumes allow you to manage data within containers
    > Mount a host directory (dependency on the host filesystem)
    > Mount a data volume container (dependency on another container)
    > Mount a shared-storage volume (NFS, iSCSI, etc.)
    * https://docs.docker.com/engine/userguide/containers/dockervolumes/
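
    The first two mounting options can be sketched as follows. This is an illustration under stated assumptions, not a prescribed setup: the public alpine image, the container name datastore, the paths /results and /data, and the run helper with its RUN_DOCKER switch are all chosen here for the example.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Host directory that will receive the results (absolute path, as -v expects).
results_dir="$(mktemp -d)"

# 1) Mount a host directory: files written to /results inside the container
#    appear in $results_dir on the host.
run docker run --rm -v "$results_dir:/results" alpine \
  sh -c 'echo "experiment finished" > /results/log.txt'

# 2) Data volume container: create a container that owns the volume /data,
#    then mount its volumes into a second container.
run docker create -v /data --name datastore alpine
run docker run --rm --volumes-from datastore alpine ls /data
```

    Mounting a host directory is the simpler way to get results out of a container, at the cost of tying the run to the host's filesystem layout.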

  28. Local Docker Workflow
    Dockerfile --build--> Docker Image --run--> Docker Container
    (same Dockerfile and workflow diagram as on slide 23)

  29. Dockerfile —> Image
    Definition of infrastructure and dependencies of a container through instructions
    docker build -t <image-name> .
    The final argument (.) is the Build Context, containing all local dependencies and the Dockerfile
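
    A build can be tried end to end with a throwaway build context. The sketch below assumes the public alpine:3.19 base image; the script analysis.sh, the image name my-experiment:1.0, and the run helper with its RUN_DOCKER switch are hypothetical names introduced for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Build context: a directory holding the Dockerfile and every local file it copies.
ctx="$(mktemp -d)"

# Placeholder analysis script standing in for the research code.
cat > "$ctx/analysis.sh" <<'EOF'
#!/bin/sh
echo "running analysis"
EOF

# Minimal Dockerfile: base image, local dependency, default command.
cat > "$ctx/Dockerfile" <<'EOF'
FROM alpine:3.19
COPY analysis.sh /analysis.sh
CMD ["sh", "/analysis.sh"]
EOF

# -t names (tags) the resulting image; the last argument is the build context.
run docker build -t my-experiment:1.0 "$ctx"
```

    Only files inside the build context can be referenced by COPY, which is why the script is written into the same directory as the Dockerfile.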

  30. Docker Images
    # docker images
    REPOSITORY          TAG      IMAGE ID       CREATED       SIZE
    mhart/alpine-node   latest   2a15d8568f75   1 week ago    36.76 MB
    hakyll              latest   d575da1e730c   2 weeks ago   1.487 GB
    redis               alpine   50405530a7e5   4 weeks ago   15.95 MB
    Lists all previously built images
    # docker rmi hakyll
    Untagged: hakyll:latest
    Deleted: sha256:3240943c9ea3f72db51…
    Deleted: sha256:a3aeefae0d4b8f61…
    Deleted: sha256:16a7ebd378002f1261…
    Removes image ‘hakyll’ and all its layers from disk

  31. Local Docker Workflow
    Dockerfile --build--> Docker Image --run--> Docker Container
    (same Dockerfile and workflow diagram as on slide 23)

  32. Image —> Container
    docker run -d --name <container-name> -p 80:5000 <image-name>
    (A typical example)
    -d: Run container in the background (d for detached)
    --name: Give the container a unique name
    -p: Port mapping. First the port published on the host (80), second the port within the container (5000)
    Many more possibilities to run containers, see full reference here:
    https://docs.docker.com/engine/reference/run/

  33. Container Management
    docker ps
    List all running containers
    docker ps -a
    List all containers (also stopped)
    docker stop <container>
    Stop a running container
    docker rm <container>
    Remove a stopped container
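
    Put together, the life cycle of a container could look like the sketch below. It assumes the public nginx:latest image, which listens on port 80 inside the container; the container name web, the host port 8080, and the run helper with its RUN_DOCKER switch are choices made for this example.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Start a container in the background; host port 8080 maps to container port 80.
run docker run -d --name web -p 8080:80 nginx:latest

# The container shows up in the list of running containers.
run docker ps

# Stop it; it no longer appears in "docker ps" but is still listed by "docker ps -a".
run docker stop web
run docker ps -a

# Remove the stopped container; its name can then be reused.
run docker rm web
```

    Stopping and removing are separate steps, so a stopped container can still be inspected before it is deleted.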

  34. Container Debugging
    # docker run -ti --entrypoint=bash <image>
    Start image with a different entrypoint
    # docker exec -ti <container> bash
    Start an interactive shell in a running container
    # docker inspect <container-or-image>
    Low-level information on a container or image
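
    As a concrete illustration of the three debugging commands, the sketch below assumes the public nginx:latest image and a running container named web; both names, as well as the run helper and its RUN_DOCKER switch, are assumptions for this example. It uses sh rather than bash because not every image ships with bash.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Open a shell instead of the image's default entrypoint; --rm cleans up on exit.
run docker run --rm -ti --entrypoint=sh nginx:latest

# Open a shell inside an already running container named "web".
run docker exec -ti web sh

# Extract a single field from the low-level JSON description with a Go template.
run docker inspect --format '{{.State.Status}}' web
```

    The --format option avoids searching through the full JSON output when only one value, such as the container state, is of interest.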

  35. Docker Hub: Public Registry

  36. Pulling Docker Images

    Getting started with existing images
    docker pull nginx:latest

  37. Pulling Docker Images

    Getting started with existing images
    docker pull nginx:latest
    Reference to a Docker Image
    in the Docker Hub

  38. Pulling Docker Images

    Getting started with existing images
    docker pull nginx:latest
    Images can have
    many “tags”

  39. Pulling Docker Images

    Getting started with existing images
    docker pull nginx:latest
    Pulls an image from a Docker registry

  40. Pushing Docker Images to a Registry

    Tag Image
    docker tag c6fdd6639541 <username>/<repository>:<tag>
    c6fdd6639541 is the Image Id (retrieve through docker images)

  41. Pushing Docker Images to a Registry

    Push Image
    docker login --username=<username> --email=<email>
    docker push <username>/<repository>:<tag>
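
    The tag, login, and push steps can be chained as in the following sketch. The account name your-username, the image my-experiment:1.0, and the run helper with its RUN_DOCKER switch are placeholders introduced here. The sketch omits --email, on the assumption that a recent Docker release is used, since newer releases no longer accept that flag.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a command and execute it only when RUN_DOCKER=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then "$@"; fi
}

# Registry account (placeholder) and image names used in this example.
DOCKER_USER="${DOCKER_USER:-your-username}"
local_image="my-experiment:1.0"
remote_image="$DOCKER_USER/my-experiment:1.0"

# Give the local image a name that includes the registry account.
run docker tag "$local_image" "$remote_image"

# Authenticate (prompts for the password), then upload the image.
run docker login --username "$DOCKER_USER"
run docker push "$remote_image"

# Anyone with access can afterwards retrieve the same image.
run docker pull "$remote_image"
```

    Using an explicit version tag such as 1.0 rather than latest makes it possible to cite one specific image in a paper.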

  42. Case Study: ChangeDistiller
    https://bitbucket.org/sealuzh/tools-changedistiller
    Research Project
    Dockerfile --build--> Docker Image --run--> Docker Container
    (same Dockerfile and workflow diagram as on slide 23)

  43. Recap: Challenges in Reproducibility
    > No standard way of describing experiments, environments,
    (derived) data, and workflows
    > No transparency in creating environments and the steps/methods
    to establish facts or recreate analysis
    > Experimental nature of research code and ecosystems often makes it
    hard to build
    > Unresolved or undocumented dependencies
    > Infrastructure for storage and distribution
    Addressed by: Dockerfile, Docker Image, Docker Container, Registries (Docker Hub, Quay, …)

  44. Limitations
    > Performance sensitivity [Jimenez et al., The Role of Container Technology in Reproducible Computer Systems Research]
    > Proprietary Software and Dependencies
    > Non-Disclosure Agreements / Intellectual Property


    > Can we build the same artifact from the specification
    (Dockerfile) even in 10 years? [Suggestion: Version Pinning]
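
    Version pinning can be illustrated on the redis example from the earlier slides, which downloads a moving "stable" tarball. The sketch below is one possible variant, not the authors' setup: the base tag ubuntu:16.04 and the release 3.0.7 are illustrative choices, and the system packages installed through apt-get remain unpinned.

```shell
#!/bin/sh
set -eu

# Throwaway build context for the pinned Dockerfile.
ctx="$(mktemp -d)"

# The quoted 'EOF' keeps ${REDIS_VERSION} literal so that Docker, not the
# shell, substitutes it during the build.
cat > "$ctx/Dockerfile" <<'EOF'
# Pinned base image tag instead of a moving tag such as "latest".
FROM ubuntu:16.04
# Pinned upstream release instead of the moving stable tarball.
ARG REDIS_VERSION=3.0.7
RUN apt-get update && apt-get install -y gcc make libc6-dev wget
RUN wget -O - "http://download.redis.io/releases/redis-${REDIS_VERSION}.tar.gz" | tar -xz
RUN make -C "/redis-${REDIS_VERSION}"
EOF

echo "Pinned Dockerfile written to $ctx/Dockerfile"
```

    Tags can still be re-published with different content, so referencing the base image by its digest (FROM ubuntu@sha256:...) would be a stricter form of pinning; archiving the built image in a registry covers the case where upstream downloads disappear.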

  45. Conclusions
    > Containers enable a standard, fast, and easy way of
    describing experiments and environments
    > Containers help your future self, reviewers, and other
    researchers make use of your work

  46. Using Docker Containers to Improve
    Reproducibility in SE Research
    Jürgen Cito, Harald Gall
    Photo Credits: Nan Palmero, https://flic.kr/p/nPLSpe
    @citostyle
    Slides: speakerdeck.com/citostyle
    Photo Credits: Astrid Westvang, https://flic.kr/p/pWJLCW
