Using Docker Containers to Improve Reproducibility in Software Engineering Research

Jürgen Cito

May 17, 2016

Transcript

  1. ICSE’16 Technical Briefing, May 17th, Austin, TX

    Using Docker Containers to Improve Reproducibility in Software Engineering Research
    Jürgen Cito, Harald C. Gall
    Photo Credits: Astrid Westvang, https://flic.kr/p/pWJLCW
  2. Jürgen Cito, PhD @ UZH; Harald Gall, Prof @ UZH

    Software Evolution; Cloud-based Software Engineering; Human Factors in Software Engineering
  3. This technical briefing…

    > Conceptual, more abstract notion of Reproducibility in Research
    > Concrete instructions to aid Reproducibility in Research
  4. What is Reproducibility?

    Reproducibility is the ability of an entire experiment or study to be duplicated, either by the same researcher or by someone else working independently. Reproducing an experiment is called replicating it.
    No research paper can ever be considered to be the final word, and the replication and corroboration of research results is key to the scientific process.
  5. What is Reproducibility?

    > Repeatability of a certain process in order to establish a fact or the conditions under which we are able to observe the same fact*
    > A process to share methods and describe the environment in order to recreate results.
    * Mockus et al., “Experiences from replicating a case study to investigate reproducibility of software development.”
  6. Scientific Process

    Diagram labels: Experiment Design, Data Collection, Data Analysis, Interpret Results, Hypotheses, Reproducibility
  7. What is Reproducibility?

    Establishing facts
    > Steps (or method) to establish the fact
    > Sharing computational knowledge
    Controlling environment
    > Execution Environment
    > Dependencies
    Providing data
    > Ability to interpret data
    > Computational Analyses
    Low barriers to replicate
    > Comprehensible results
    > Ease of achieving replication
  8. What is Reproducibility in SE Research?

    > Algorithms / Computational Analyses
    > Developed Tools or Prototypes
    > Quantitative Evaluations
    + internal knowledge of the necessary process to derive/establish results

  9. Artifact Evaluation & Replication Packages

    In Software Engineering Research:
    - FSE’15/16, MSR’15/16
    In Programming Languages Research:
    - PLDI, POPL, OOPSLA
  10. Current State of Sharing Artifacts in SE: Researcher’s website

  11. Case Study: ChangeDistiller https://bitbucket.org/sealuzh/tools-changedistiller Research Project

  12. Why is reproducibility hard? Why does it fail?

  13. Reasons for Failed Reproducibility (1/2)

    Here is a link to download the code for this paper. Good luck trying to download the only post-doc who knows how to run this thing*
    * Paraphrase of a tweet I cannot seem to find anymore
  14. Reasons for Failed Reproducibility (2/2)

    Source: Collberg et al., “Measuring Reproducibility in Computer Systems Research”, http://reproducibility.cs.arizona.edu/tr.pdf
  15. Case Study: ChangeDistiller

    https://bitbucket.org/sealuzh/tools-changedistiller
    Research Project
    - Developed 2006-2009
      How many Java versions have we passed?
    - Dependencies defined in a Maven pom file
      Are they all still available in the repository?
    - How does analysis in ChangeDistiller work?
      What is the entry point?
  16. Challenges in Reproducibility

    > No standard way of describing experiments, environments, (derived) data, and workflows
    > No transparency in creating environments and the steps/methods to establish facts or recreate analysis
    > Experimental nature of research code and ecosystems often makes it hard to build
    > Unresolved or undocumented dependencies
    > Infrastructure for storage and distribution
  17. … to the rescue

  18. Docker Containers to the rescue (1/2)

    What is Docker?
    Docker allows you to package an application with all of its dependencies into a standardized unit for software development*
    Containers consist of everything that enables software to run:
    > Code
    > Runtime
    > System Tools
    > System Libraries
    * https://www.docker.com/what-docker
  19. Docker Containers to the rescue (2/2)

    What can Docker be for SE research?
    Docker allows you to package a
    - Prototype
    - Proof-of-concept Implementation
    - Computational analysis or experiment
    with all of its dependencies into a standardized unit for reproducible research
    * https://www.docker.com/what-docker
  20. Technical Overview / Virtual Machines vs Containers

    “Lightweight” VM
    > Container is an isolated process (“chroot on steroids”)
    > Own process space
    > Own network interface
    > Feels like a VM
    > Shares the kernel with the host
    > Isolation through cgroups/namespaces
    https://www.docker.com/what-is-docker
    https://blog.docker.com/2016/03/containers-are-not-vms/
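
    The point that a container shares the kernel with the host can be checked directly. The following is a minimal sketch, not part of the slides: it assumes a Linux host with a running Docker daemon, and both the function name `kernel_report` and the default `alpine` image are illustrative choices.

```shell
# Print the kernel release seen on the host and inside a container.
# On a Linux host both lines should match, because the container is an
# isolated process on the host kernel rather than a guest operating system.
kernel_report() {
    image="${1:-alpine}"    # assumed example image; any image shipping uname works
    host_kernel="$(uname -r)"
    container_kernel="$(docker run --rm "$image" uname -r)"
    echo "host:      $host_kernel"
    echo "container: $container_kernel"
}

# Usage (needs a running Docker daemon):
#   kernel_report alpine
```

    On macOS or Windows the two lines are expected to differ, since Docker runs its containers inside a Linux virtual machine there.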
  21. Docker Engine

    > Centralized runtime environment for containers
    > Enables portability
    > Sole dependency for running Docker containers
    > No emulation layer (almost no performance impact)
    https://www.docker.com/products/docker-engine
  22. Benefits of Docker Containers

    > Fast instantiation (~1-3 seconds)
    > Almost native performance
    > Transparent build process
    > Smaller Images
    > Easy to build, share, and publish
    * https://www.docker.com/what-docker (also compared to other container technology)
  23. Local Docker Workflow

    Dockerfile -> build -> Docker Image -> run -> Docker Container

        # Build redis from source
        # Make sure you have the redis source code checked out in
        # the same directory as this Dockerfile
        FROM ubuntu:12.04
        MAINTAINER dockerfiles http://dockerfiles.github.io
        RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
        RUN apt-get update
        RUN apt-get upgrade -y
        RUN apt-get install -y gcc make g++ build-essential libc6-dev tcl wget
        RUN wget http://download.redis.io/redis-stable.tar.gz -O - | tar -xvz
        # RUN tar -zvzf /redis/redis-stable.tar.gz
        RUN (cd /redis-stable && make)
        RUN (cd /redis-stable && make test)
        RUN mkdir -p /redis-data
        VOLUME ["/redis-data"]
        EXPOSE 6379
        ENTRYPOINT ["/redis-stable/src/redis-server"]
        CMD ["--dir", "/redis-data"]
  24. Terminology

    Dockerfile: Declarative definition of an environment for producing an image
    Docker Image: Immutable artifact built from a Dockerfile, consisting of one or more layers
    Docker Container: Execution environment, i.e. a running instance of an image (can be parameterized)
    Docker Registry: Public or private repository that stores images and allows for their distribution
    (Docker Hub - https://hub.docker.com/ or CoreOS Quay - https://quay.io/)
 

  25. Local Docker Workflow

    Dockerfile -> build -> Docker Image -> run -> Docker Container (same redis Dockerfile as on slide 23)
  26. Dockerfile

    Definition of infrastructure and dependencies of a container through instructions

        # Build redis from source
        # Make sure you have the redis source code checked out in
        # the same directory as this Dockerfile
        FROM ubuntu:12.04
        MAINTAINER dockerfiles
        RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
        RUN apt-get update
        RUN apt-get upgrade -y
        RUN apt-get install -y gcc make g++ build-essential libc6-dev tcl wget
        RUN wget http://download.redis.io/redis-stable.tar.gz -O - | tar -xvz
        # RUN tar -zvzf /redis/redis-stable.tar.gz
        RUN (cd /redis-stable && make)
        RUN (cd /redis-stable && make test)
        COPY redis.conf /var/www/redis.conf
        RUN mkdir -p /redis-data
        VOLUME ["/redis-data"]
        EXPOSE 6379
        ENTRYPOINT ["/redis-stable/src/redis-server"]
        CMD ["--dir", "/redis-data"]

    Annotations:
    > Base Image (FROM): can be an OS (Ubuntu) or a different, existing image
    > Install Dependencies (RUN): runs commands as if you were typing them in the command line
    > COPY: copies local files from build context into container
    > Volume (VOLUME), Open Port (EXPOSE), Start Server (ENTRYPOINT, CMD)
  27. Data Volumes

    A specially-designated directory within one or more containers that bypasses the Union File System*
    Volumes allow you to manage data within containers
    > Mount a host directory (dependency on the host filesystem)
    > Mount a data volume container (dependency on another container)
    > Mount a shared-storage volume (NFS, iSCSI, etc.)
    * https://docs.docker.com/engine/userguide/containers/dockervolumes/
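
    The first two mount options can be sketched as thin wrappers around `docker run`. This is an illustrative sketch rather than material from the talk: the function names are hypothetical, and the container path `/redis-data` is borrowed from the redis Dockerfile shown earlier. The shared-storage option is left out because it depends on which volume driver is installed.

```shell
# Option 1: mount a host directory into the container.
# The host path should be absolute; the container now depends on the
# host filesystem layout.
run_with_host_dir() {        # usage: run_with_host_dir <host-dir> <image>
    docker run -d -v "$1":/redis-data "$2"
}

# Option 2: reuse the volumes declared by another container.
# The new container now depends on that data volume container.
run_with_volumes_from() {    # usage: run_with_volumes_from <data-container> <image>
    docker run -d --volumes-from "$1" "$2"
}

# Usage (needs a running Docker daemon):
#   run_with_host_dir /srv/experiment-data <imagename>
#   run_with_volumes_from <datacontainer> <imagename>
```

    For a replication package, the host-directory variant is convenient for handing input data to an analysis, at the price of documenting where that directory has to live.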
  28. Local Docker Workflow

    Dockerfile -> build -> Docker Image -> run -> Docker Container (same redis Dockerfile as on slide 23)
  29. Dockerfile -> Image

    Definition of infrastructure and dependencies of a container through instructions

        docker build -t <imagename> .

    The trailing "." is the build context: the directory containing all local dependencies and the Dockerfile
  30. Docker Images

        # docker images
        REPOSITORY          TAG      IMAGE ID       CREATED       SIZE
        mhart/alpine-node   latest   2a15d8568f75   1 week ago    36.76 MB
        hakyll              latest   d575da1e730c   2 weeks ago   1.487 GB
        redis               alpine   50405530a7e5   4 weeks ago   15.95 MB

    Lists all images available locally (previously built or pulled)

        # docker rmi hakyll
        Untagged: hakyll:latest
        Deleted: sha256:3240943c9ea3f72db51…
        Deleted: sha256:a3aeefae0d4b8f61…
        Deleted: sha256:16a7ebd378002f1261…

    Removes image ‘hakyll’ and all its layers from disk (layers still used by other images are kept)
  31. Local Docker Workflow

    Dockerfile -> build -> Docker Image -> run -> Docker Container (same redis Dockerfile as on slide 23)
  32. Image -> Container

        docker run -d --name <containername> -p 80:5000 <imagename>

    (A typical example)
    > -d: run container in the background (d for detached)
    > --name: give the container a unique name
    > -p: port mapping; first the port exposed on the host (80), second the port within the container (5000)
    Many more possibilities to run containers, see full reference here: https://docs.docker.com/engine/reference/run/
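
    Because the order of the two ports in `-p` is easy to mix up, a small wrapper that names them can help. This is only a sketch; the function name `run_published` and its argument order are assumptions made for illustration.

```shell
# Start <image> in the background and publish one container port on the host.
# -p takes HOST_PORT:CONTAINER_PORT, so "80:5000" makes a service listening
# on port 5000 inside the container reachable on port 80 of the host.
run_published() {   # usage: run_published <name> <host-port> <container-port> <image>
    docker run -d --name "$1" -p "$2:$3" "$4"
}

# Usage (needs a running Docker daemon):
#   run_published webapp 80 5000 <imagename>
#   docker port webapp     # lists the active port mappings of the container
```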
  33. Container Management

    > docker ps: list all running containers
    > docker ps -a: list all containers (also stopped)
    > docker stop <container>: stop a running container
    > docker rm <container>: remove a stopped container
  34. Container Debugging

        # docker run -ti --entrypoint=bash <imagename>
    Start image with a different entrypoint

        # docker exec -ti <container> bash
    Start an interactive shell in a running container

        # docker inspect <container>
    Low-level information on a container or image
  35. Docker Hub: Public Registry

  36. Pulling Docker Images

    Getting started with existing images

        docker pull nginx:latest
  37. Pulling Docker Images

    Getting started with existing images

        docker pull nginx:latest

    nginx: reference to a Docker Image in the Docker Hub
  38. Pulling Docker Images

    Getting started with existing images

        docker pull nginx:latest

    latest: images can have many “tags”
  39. Pulling Docker Images

    Getting started with existing images

        docker pull nginx:latest

    docker pull: pulls an image from a Docker registry
  40. Pushing Docker Images to a Registry

    Tag Image

        docker tag c6fdd6639541 <username>/<imagename>:<tagname>

    c6fdd6639541 is the Image Id (retrieve through docker images)
  41. Pushing Docker Images to a Registry

    Push Image

        docker login --username=<username> --email=<email>
        docker push <username>/<imagename>:<tagname>
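
    The tag and push steps from the last two slides can be combined into one helper. This is a sketch under assumptions: the function name `publish_image` is hypothetical, and it presumes that `docker login` has already succeeded for the target registry.

```shell
# Tag a local image for a registry namespace and push it.
# The push only runs if tagging succeeded.
publish_image() {   # usage: publish_image <image-id> <username> <imagename> <tagname>
    target="$2/$3:$4"
    docker tag "$1" "$target" && docker push "$target"
}

# Usage (after docker login, with a running Docker daemon):
#   publish_image c6fdd6639541 <username> <imagename> <tagname>
```

    Choosing an explicit tag per paper or release, rather than reusing one moving tag, keeps it clear which image a publication refers to.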
  42. Case Study: ChangeDistiller

    https://bitbucket.org/sealuzh/tools-changedistiller
    Research Project
    Dockerfile -> build -> Docker Image -> run -> Docker Container (the slide reuses the redis Dockerfile from slide 23)
  43. Recap: Challenges in Reproducibility

    > No standard way of describing experiments, environments, (derived) data, and workflows
    > No transparency in creating environments and the steps/methods to establish facts or recreate analysis
    > Experimental nature of research code and ecosystems often makes it hard to build
    > Unresolved or undocumented dependencies
    > Infrastructure for storage and distribution
    Slide annotations: Dockerfile; Docker Image; Docker Container; Registries (Docker Hub, Quay, …); Dockerfile
  44. Limitations

    > Performance sensitivity [Jimenez et al., The Role of Container Technology in Reproducible Computer Systems Research]
    > Proprietary Software and Dependencies
    > Non-Disclosure Agreements / Intellectual Property
    > Can we build the same artifact from the specification (Dockerfile) even in 10 years? [Suggestion: Version Pinning]
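
    One way to read the version pinning suggestion is to replace every moving reference in the Dockerfile with a fixed one. The sketch below writes such a variant of the redis Dockerfile; the directory name `pinned-redis` and the release number 3.0.7 are assumed examples, not choices made in the talk.

```shell
# Write a Dockerfile that pins what the earlier redis Dockerfile leaves floating.
mkdir -p pinned-redis
cat > pinned-redis/Dockerfile <<'EOF'
# Fixed base image tag (a digest, FROM ubuntu@sha256:<digest>, is stricter)
FROM ubuntu:12.04
RUN apt-get update && apt-get install -y gcc make libc6-dev wget
# Fixed release tarball instead of the moving redis-stable.tar.gz
# (3.0.7 is an assumed example version)
RUN wget http://download.redis.io/releases/redis-3.0.7.tar.gz -O - | tar -xz
RUN (cd /redis-3.0.7 && make)
RUN mkdir -p /redis-data
VOLUME ["/redis-data"]
EXPOSE 6379
ENTRYPOINT ["/redis-3.0.7/src/redis-server"]
CMD ["--dir", "/redis-data"]
EOF

# Build it with (needs a running Docker daemon):
#   docker build -t redis-pinned:3.0.7 pinned-redis
```

    The apt packages are still unpinned in this sketch; `apt-get install <package>=<version>` can fix them too. Pinning also does not help once an upstream server retires the pinned files, so pushing the built image to a registry remains the safer archive.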
  45. Conclusions

    > Containers enable a standard, fast, and easy way of describing experiments and environments
    > They help your future self, reviewers, and other researchers to make use of your work
  46. Using Docker Containers to Improve Reproducibility in SE Research

    Jürgen Cito <cito@ifi.uzh.ch>, Harald Gall <gall@ifi.uzh.ch>
    @citostyle
    Slides: speakerdeck.com/citostyle
    Photo Credits: Nan Palmero, https://flic.kr/p/nPLSpe; Astrid Westvang, https://flic.kr/p/pWJLCW