Implement the Basics of Docker in 100 Lines of Code!

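This session walks through a project that implements the basic features of Docker (pulling, listing, and deleting images; creating and deleting containers; running commands inside a container; viewing logs; and so on) in 100 lines of shell commands. It explains the basic mechanics of Docker and how to implement them with Linux namespaces, cgroups, and iptables. Demos will be given as needed.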

Wenhan Shi

July 23, 2019

Transcript

  1. • Hitachi Ltd.
      ◦ RHEL support
      ◦ Linux kernel module development
      ◦ SSD firmware development
    • Red Hat K.K.
      ◦ GlusterFS, OpenShift support
    • Canonical Japan K.K.
      ◦ Ubuntu, OpenStack support
    Twitter: @shi_wenhan
    GitHub: https://github.com/xibuka/bocker
  2. • This topic is based on https://github.com/p8952/bocker
    • I forked it at https://github.com/xibuka/bocker and made some updates.
  3. • Manage images
      ◦ docker pull
      ◦ docker build
        ▪ very limited implementation
      ◦ docker images
      ◦ docker rmi
    • Manage containers
      ◦ docker ps
      ◦ docker run
      ◦ docker exec
      ◦ docker logs
      ◦ docker commit
      ◦ docker rm
    • Others
      ◦ Networking
      ◦ Resource control
  4. • A Bearer token can be obtained with a GET request to the URL below:
      ◦ https://auth.docker.io/token?service=registry.docker.io&scope=repository:<USER>/<APP>:<ACTION>
        i. USER/APP: the repository name; use library as USER this time to get an official image
        ii. ACTION: pull, push, or both
        iii. E.g.: library/ubuntu:pull,push
    https://docs.docker.com/registry/spec/auth/token/#requesting-a-token
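The URL pattern above can be assembled mechanically. The following is a minimal sketch; `token_url` is a hypothetical helper name that does not appear in bocker, and the actual request is left as a comment because it needs network access.

```shell
#!/bin/bash
# Build the token endpoint URL for a repository and a comma-separated
# list of actions. "token_url" is a hypothetical helper, not part of bocker.
token_url() {
  repo="$1"            # e.g. library/ubuntu
  actions="${2:-pull}" # e.g. pull or pull,push
  printf 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:%s:%s\n' \
    "$repo" "$actions"
}

# Official images live under the "library" namespace.
token_url library/ubuntu pull

# To actually request a token (requires curl and network access):
#   curl -s "$(token_url library/ubuntu pull)"
```

Keeping the URL construction separate from the request makes it easy to check the scope string before sending anything to the registry.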
  5. { "token":"eyJh...DMFg", "access_token":"eyJh...DMFg", "expires_in":300, "issued_at":"2019-07-13T21:59:41.043389073Z" }
    • token: a Bearer token for the client to use in the Authorization header of its requests.
    • access_token: a Bearer token for OAuth2.
    • expires_in: the lifetime of the token in seconds.
    • issued_at: the UTC time at which the token was issued.
    https://docs.docker.com/registry/spec/auth/token/#requesting-a-token
  6. The token is placed in the HTTP Authorization header. Use curl to request the manifest (Version 2, Schema 2) of an <image> with a specified <version>:
    $ token=eyJh...DMFg
    $ curl -sL -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
        -H "Authorization: Bearer $token" \
        https://registry-1.docker.io/v2/library/<image>/manifests/<version>
    Or try the command below for the manifest in Version 2, Schema 1:
    $ curl -sL -H "Authorization: Bearer $token" \
        https://registry-1.docker.io/v2/library/<image>/manifests/<version>
    https://docs.docker.com/registry/spec/auth/token/#using-the-bearer-token
    https://docs.docker.com/registry/spec/manifest-v2-2/
  7. {
      ...
      "layers":[
        {
          "mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip",
          "size":27619579,
          "digest":"sha256:29c2f229a1281554ccce1972738cfb3232c4ace59343b4090d6e09f148fee588"
        },
        {
          "mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip",
          "size":30946,
          "digest":"sha256:5d724aeeb8846c1a94a323f756fcf2ae6ca4e573edd64266c2c220a7951827b3"
        },
        {
          "mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip",
          "size":865,
          "digest":"sha256:60682405dd89aa6c1621f0f2df2ba6b430402c5b59c27b6c71af92b07e535893"
        },
        {
          "mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip",
          "size":162,
          "digest":"sha256:3b9cd020258303b6913dc8e15db14030877decb816e87cde6fc4d24b060881a4"
        }
      ]
    }
  8. • layers
      ◦ The ordered list of layers, starting from the base image.
      ◦ An image directly references one or more layers.
    • digest
      ◦ The content of each layer is identified by this hex digest.
      ◦ The hex part is the SHA256 hash of the layer's content.
        ▪ It changes whenever the content changes.
      ◦ A layer can be pulled from https://registry-1.docker.io/v2/library/<image>/blobs/<digest>
    https://docs.docker.com/registry/spec/manifest-v2-2/#image-manifest-field-descriptions
    https://docs.docker.com/registry/spec/api/#digest-parameter
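Because the digest is the SHA256 hash of the layer's content, a downloaded blob can be checked against the digest from the manifest. The sketch below assumes `sha256sum` from GNU coreutils is available; `verify_layer` is a hypothetical helper that is not part of bocker, and a throwaway file stands in for a real layer.

```shell
#!/bin/bash
# Return 0 when the SHA256 of a file matches a digest of the form
# "sha256:<hex>". "verify_layer" is a hypothetical helper, not part of bocker.
verify_layer() {
  file="$1"
  digest="$2"
  actual="sha256:$(sha256sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$digest" ]
}

# Demo with a small temporary file instead of a real layer blob.
tmp="$(mktemp)"
printf 'abc' > "$tmp"
if verify_layer "$tmp" "sha256:ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"; then
  echo "digest matches"
else
  echo "digest mismatch"
fi
rm -f "$tmp"
```

A check like this could run after each blob download, before the layer is unpacked.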
  9. [Figure: Images A to D drawn as stacks of layers. Image A: Base OS, OS custom. Image B: Base OS, OS custom, Nginx. Image C: Base OS, OS custom, Nginx. Image D: Base OS, OS custom, Apache, Git, Source.]
  10. FROM ubuntu:18.04
      COPY . /app
      RUN make /app
      CMD python /app/app.py
    [Figure: image layers, labeled 18.04]
    https://docs.docker.com/storage/storagedriver/#images-and-layers
  11. [Figure: a READ and a WRITE of a file across the image layer (R/O) and the container layer (R/W)]
    • Read a file/dir that exists in a lower layer
    • Modify a file/dir that exists in a lower layer
    https://docs.docker.com/storage/storagedriver/#the-copy-on-write-cow-strategy
  12. • Use Btrfs, a next-generation copy-on-write filesystem, to store images.
      ◦ Copy-on-write snapshots (redirect-on-write in Btrfs terminology)
      ◦ Btrfs filesystems are easy to manage, without unmounting or restarting Docker
        i. $ sudo btrfs device add /dev/svdh /var/lib/docker
        ii. $ sudo btrfs filesystem balance /var/lib/docker
    • Use a subvolume to save all image layers as a base image
      ◦ $ sudo btrfs subvolume create /var/bocker/imageA
    • Use a snapshot of a subvolume for the container's filesystem
      ◦ $ sudo btrfs subvolume snapshot /var/bocker/imageA /var/bocker/containerA
    https://docs.docker.com/storage/storagedriver/btrfs-driver/
  13. Subvolumes and snapshots look and feel the same, but snapshots are much smaller because data is shared between them.
  14. • A container must have its own resources, isolated from the host.
      ◦ Hostname
      ◦ Process tree
      ◦ Mounts
      ◦ Interprocess Communication (IPC)
      ◦ Network
      ◦ Filesystem
  15. • A container must have its own resources, isolated from the host.
      ◦ Namespace: Hostname, Process tree, Mounts, Interprocess Communication (IPC), Network
      ◦ chroot: Filesystem
  16. • Each process belongs to a namespace and can only see or use the resources in that namespace (and its child namespaces).
    [Figure: Namespace 27 (Hostname: cloud), Namespace 28 (Hostname: native), Namespace 29 (Hostname: days); PIDs 1, 2, 3, 4(1), 5(1), 6(2)]
  17. • unshare -fmuip --mount-proc [cmd]
      Options:
        -m, --mount[=<file>]   unshare mounts namespace
        -u, --uts[=<file>]     unshare UTS namespace (hostname etc)
        -i, --ipc[=<file>]     unshare System V IPC namespace
        -n, --net[=<file>]     unshare network namespace
        -p, --pid[=<file>]     unshare pid namespace
        -f, --fork             fork before launching <program>
        --mount-proc[=<dir>]   mount proc filesystem first (implies --mount)
    • ip netns
      Usage:
        ip netns add [NAME]
        ip link set [NIC] netns [NAME]
        ip [-all] netns exec [NAME] cmd ...
  18. • 2 ways to check the namespaces of a process
    $ sudo ls -l /proc/4823/ns
    total 0
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 cgroup -> 'cgroup:[4026532656]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 ipc -> 'ipc:[4026532595]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 mnt -> 'mnt:[4026532593]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 net -> 'net:[4026532598]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 pid -> 'pid:[4026532596]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 pid_for_children -> 'pid:[4026532596]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 user -> 'user:[4026532592]'
    lrwxrwxrwx 1 1000000 1000000 0 Jul 23 01:55 uts -> 'uts:[4026532594]'
    $ sudo ps -h -o pidns,ipcns -p 4823
    4026532596 4026532595
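The same links can be read without root for the current process. This is a minimal sketch for a Linux system with /proc mounted; `ns_id` is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Print the namespace identifier of a given type for a PID by resolving
# the symlink under /proc/<pid>/ns. "ns_id" is a hypothetical helper.
ns_id() {
  pid="$1"  # a numeric PID, or "self"
  type="$2" # uts, pid, mnt, net, ipc, ...
  readlink "/proc/$pid/ns/$type"
}

# Show the UTS and PID namespaces of the current shell.
ns_id "$$" uts
ns_id "$$" pid
```

Two processes that print the same identifier for a type share that namespace, which is a quick way to see whether a container process is isolated from the host shell.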
  19. • nsenter -t <PID> -a|-m -u -i -n -p
      Options:
        -a, --all              enter all namespaces
        -t, --target <pid>     target process to get namespaces from
        -m, --mount[=<file>]   enter mount namespace
        -u, --uts[=<file>]     enter UTS namespace (hostname etc)
        -i, --ipc[=<file>]     enter System V IPC namespace
        -n, --net[=<file>]     enter network namespace
        -p, --pid[=<file>]     enter pid namespace
  20. • chroot <path> [cmd]
    $ chroot /var/bocker/ps_xxx /bin/bash
    [Figure: the host tree / (bin, etc, var) contains var/bocker/ps_xxx (bin, etc, var); ps_xxx becomes the new root, and only the tree under it is accessible to bash]
  21. Create a bridge and enable IP forwarding and SNAT & DNAT
    [Figure: Internet, ens3, Bridge0, ipv4 ip_forward]
    iptables -t nat -A POSTROUTING -o bridge0 -j MASQUERADE
    iptables -t nat -A POSTROUTING -o ens3 -j MASQUERADE
  22. Create veth pairs
    [Figure: Internet, ens3, Bridge0, ipv4 ip_forward; namespaces ns_9527 and ns_9528 with proc_27 and proc_28; veth pairs veth0_27/veth1_27 and veth0_28/veth1_28]
    ip link set veth0_[x] master bridge0
    ip link set veth1_[x] netns ns_[y]
  23. • Created with the command:
      ip link add dev veth0_[x] type veth peer name veth1_[x]
    • A veth (virtual Ethernet) device is a local Ethernet tunnel.
    • Devices are created in pairs.
    • Packets transmitted on one device are immediately received on the other device.
    • When either device is DOWN, the state of the veth pair is DOWN.
  24. A cgroup (control group) defines a group with resource limits. All processes running in that group share the limited resources defined in the cgroup.
    $ cat /proc/cgroups
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        11         1            1
    cpu           3          71           1
    cpuacct       3          71           1
    blkio         6          39           1
    memory        5          105          1
    devices       7          39           1
    freezer       8          1            1
    net_cls       4          1            1
    perf_event    12         1            1
    net_prio      4          1            1
    hugetlb       2          1            1
    pids          10         44           1
    rdma          9          1            1
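The table above is plain whitespace-separated text, so the enabled controllers can be listed with a short filter. This is a sketch; `enabled_controllers` is a hypothetical helper that reads text in the /proc/cgroups format from standard input.

```shell
#!/bin/bash
# Print the names of controllers whose "enabled" column (4th field) is 1.
# Reads /proc/cgroups-style text on stdin. Hypothetical helper name.
enabled_controllers() {
  awk '!/^#/ && $4 == 1 { print $1 }'
}

# Use the real file when it is readable on this system.
if [ -r /proc/cgroups ]; then
  enabled_controllers < /proc/cgroups
fi
```

bocker itself uses the cpu, cpuacct, and memory controllers, so a check like this could confirm they are enabled before a container is started.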
  25. • cgcreate
      Create control group(s)
        -g <controllers>:<path>   Control group which should be added
    • cgset
      Usage: cgset [-r <name=value>] <cgroup_path> ...
      Set the parameters of given cgroup(s)
        -r, --variable <name>     Define parameter to set
    • cgexec
      Usage: cgexec [-h] [-g <controllers>:<path>] [--sticky] command [arguments] ...
      Run the task in given control group(s)
        -g <controllers>:<path>   Control group which should be added
  26. function bocker_pull() { #HELP Pull an image from Docker Hub:\nBOCKER pull <name> <tag>
        token=$(curl "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/$1:pull" | jq '.token'| sed 's/\"//g')
        registry_base='https://registry-1.docker.io/v2'
        tmp_uuid="$(uuidgen)" && mkdir /tmp/"$tmp_uuid"
        manifest=$(curl -sL -H "Accept:application/vnd.docker.distribution.manifest.v2+json" -H "Authorization: Bearer $token" "$registry_base/library/$1/manifests/$2" | jq -r '.layers' | jq -r '.[].digest' )
        for id in ${manifest[@]}; do
          curl -#L -H "Authorization: Bearer $token" "$registry_base/library/$1/blobs/$id" -o /tmp/"$tmp_uuid"/layer.tar
          tar xf /tmp/"$tmp_uuid"/layer.tar -C /tmp/"$tmp_uuid"
        done
        echo "$1:$2" > /tmp/"$tmp_uuid"/img.source
        bocker_init /tmp/"$tmp_uuid" && rm -rf /tmp/"$tmp_uuid"
      }
  27. function bocker_init() { #HELP Create an image from a directory:\nBOCKER init <directory>
        uuid="img_$(shuf -i 42002-42254 -n 1)"
        if [[ -d "$1" ]]; then
          [[ "$(bocker_check "$uuid")" == 0 ]] && echo "UUID conflict, retrying..." && bocker_init "$@" && return
          btrfs subvolume create "$btrfs_path/$uuid" > /dev/null
          cp -rf --reflink=auto "$1"/* "$btrfs_path/$uuid" > /dev/null
          [[ ! -f "$btrfs_path/$uuid"/img.source ]] && echo "$1" > "$btrfs_path/$uuid"/img.source
          echo "Created: $uuid"
        else
          echo "No directory named '$1' exists"
        fi
      }
  28. function bocker_images() { #HELP List images:\nBOCKER images
        echo -e "IMAGE_ID\t\tSOURCE"
        for img in "$btrfs_path"/img_*; do
          img=$(basename "$img")
          echo -e "$img\t\t$(cat "$btrfs_path/$img/img.source")"
        done
      }
      function bocker_rm() { #HELP Delete an image or container:\nBOCKER rm <image_id or container_id>
        [[ "$(bocker_check "$1")" == 1 ]] && echo "No container named '$1' exists" && exit 1
        btrfs subvolume delete "$btrfs_path/$1" > /dev/null
        ip link del dev veth0_$1 2&> /dev/null
        cgdelete -g "$cgroups:/$1" &> /dev/null || true
        echo "Removed: $1"
      }
  29. function bocker_run() { #HELP Create a container:\nBOCKER run <image_id> <command>
        uuid="ps_$(shuf -i 42002-42254 -n 1)"
        [[ "$(bocker_check "$1")" == 1 ]] && echo "No image named '$1' exists" && exit 1
        [[ "$(bocker_check "$uuid")" == 0 ]] && echo "UUID conflict, retrying..." && bocker_run "$@" && return
        cmd="${@:2}" && ip="$(echo "${uuid: -3}" | sed 's/0//g')" && mac="${uuid: -2}"
        ip link add dev veth0_"$uuid" type veth peer name veth1_"$uuid"
        ip link set dev veth0_"$uuid" up
        ip link set veth0_"$uuid" master bridge0
        ip netns add netns_"$uuid"
        ip link set veth1_"$uuid" netns netns_"$uuid"
        ip netns exec netns_"$uuid" ip link set dev lo up
        ip netns exec netns_"$uuid" ip link set veth1_"$uuid" address 02:42:ac:11:00:"$mac"
        ip netns exec netns_"$uuid" ip addr add 10.0.0."$ip"/24 dev veth1_"$uuid"
        ip netns exec netns_"$uuid" ip link set dev veth1_"$uuid" up
        ip netns exec netns_"$uuid" ip route add default via 10.0.0.1
        <...>
  30. <...>
        btrfs subvolume snapshot "$btrfs_path/$1" "$btrfs_path/$uuid" > /dev/null
        echo 'nameserver 8.8.8.8' > "$btrfs_path/$uuid"/etc/resolv.conf
        echo "$cmd" > "$btrfs_path/$uuid/$uuid.cmd"
        cgcreate -g "$cgroups:/$uuid"
        : "${BOCKER_CPU_SHARE:=512}" && cgset -r cpu.shares="$BOCKER_CPU_SHARE" "$uuid"
        : "${BOCKER_MEM_LIMIT:=512}" && cgset -r memory.limit_in_bytes="$((BOCKER_MEM_LIMIT * 1000000))" "$uuid"
        cgexec -g "$cgroups:$uuid" \
          ip netns exec netns_"$uuid" \
          unshare -fmuip --mount-proc \
          chroot "$btrfs_path/$uuid" \
          /bin/sh -c "/bin/mount -t proc proc /proc && $cmd" \
          2>&1 | tee "$btrfs_path/$uuid/$uuid.log" || true
        ip link del dev veth0_"$uuid"
        ip netns del netns_"$uuid"
      }
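bocker_run derives the container's IP address and MAC address from the numeric suffix of the container ID. The sketch below isolates those two expansions so their results can be inspected; `container_addr` is a hypothetical wrapper that is not part of bocker, and it needs bash because of the `${var: -N}` substring syntax.

```shell
#!/bin/bash
# Reproduce how bocker_run turns a container ID into an IP and a MAC.
# "container_addr" is a hypothetical wrapper used only for illustration.
container_addr() {
  local uuid="$1" ip mac
  ip="$(echo "${uuid: -3}" | sed 's/0//g')" # last 3 characters, every 0 removed
  mac="${uuid: -2}"                         # last 2 characters, unchanged
  echo "10.0.0.$ip 02:42:ac:11:00:$mac"
}

container_addr ps_42002 # prints: 10.0.0.2 02:42:ac:11:00:02
container_addr ps_42254 # prints: 10.0.0.254 02:42:ac:11:00:54
container_addr ps_42105 # prints: 10.0.0.15 02:42:ac:11:00:05
container_addr ps_42015 # prints: 10.0.0.15 02:42:ac:11:00:15
```

The last two calls show a consequence of removing every zero rather than only the leading ones: IDs such as ps_42105 and ps_42015 map to the same IP address, so two such containers running at once would conflict on the bridge.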
  31. function bocker_ps() { #HELP List containers:\nBOCKER ps
        echo -e "CONTAINER_ID\t\tCOMMAND"
        for ps in "$btrfs_path"/ps_*; do
          ps=$(basename "$ps")
          echo -e "$ps\t\t$(cat "$btrfs_path/$ps/$ps.cmd")"
        done
      }
      function bocker_exec() { #HELP Execute a command in a running container:\nBOCKER exec <container_id> <command>
        [[ "$(bocker_check "$1")" == 1 ]] && echo "No container named '$1' exists" && exit 1
        unshare_pid=$(ps o pid,cmd -u root | grep "unshare.*$1" | awk '{print $1}')
        cid="$(ps o ppid,pid -u root| grep "$unshare_pid" | tail -n 1 | awk '{print $2}')"
        [[ ! "$cid" =~ ^\ *[0-9]+$ ]] && echo "Container '$1' exists but is not running" && exit 1
        nsenter -t "$cid" -m -u -i -n -p chroot "$btrfs_path/$1" "${@:2}"
      }
  32. function bocker_commit() { #HELP Commit a container to an image:\nBOCKER commit <container_id> <image_id>
        [[ "$(bocker_check "$1")" == 1 ]] && echo "No container named '$1' exists" && exit 1
        [[ "$(bocker_check "$2")" == 1 ]] && echo "No image named '$2' exists" && exit 1
        bocker_rm "$2" && btrfs subvolume snapshot "$btrfs_path/$1" "$btrfs_path/$2" > /dev/null
        echo "Created: $2"
      }
      function bocker_logs() { #HELP View logs from a container:\nBOCKER logs <container_id>
        [[ "$(bocker_check "$1")" == 1 ]] && echo "No container named '$1' exists" && exit 1
        cat "$btrfs_path/$1/$1.log"
      }
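bocker_logs only prints a file; the log itself is written when the container runs, by merging stderr into stdout and piping the result through tee. The following is a small sketch of that pattern outside of any container; `run_logged` is a hypothetical helper name.

```shell
#!/bin/bash
# Run a command, show its combined stdout and stderr on the terminal,
# and keep a copy in a log file. "run_logged" is a hypothetical helper.
run_logged() {
  log="$1"
  shift
  "$@" 2>&1 | tee "$log"
}

# Demo: both streams end up in the same log file.
demo_log="$(mktemp)"
run_logged "$demo_log" sh -c 'echo "to stdout"; echo "to stderr" >&2'
echo "--- replay from log ---"
cat "$demo_log"
rm -f "$demo_log"
```

Because the copy is made while the command runs, the log can be read again later with cat, which is all that bocker_logs needs to do.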
  33. #!/bin/bash
      set -o errexit -o nounset -o pipefail; shopt -s nullglob
      btrfs_path='/var/bocker' && cgroups='cpu,cpuacct,memory';
      [[ $# -gt 0 ]] && while [ "${1:0:2}" == '--' ]; do OPTION=${1:2}; [[ $OPTION =~ = ]] && declare "BOCKER_${OPTION/=*/}=${OPTION/*=/}" || declare "BOCKER_${OPTION}=x"; shift; done
      function bocker_check() {
        btrfs subvolume list "$btrfs_path" | grep -qw "$1" && echo 0 || echo 1
      }
      function bocker_help() { #HELP Display this message:\nBOCKER help
        sed -n "s/^.*#HELP\\s//p;" < "$1" | sed "s/\\\\n/\n\t/g;s/$/\n/;s!BOCKER!${1/!/\\!}!g"
      }
      [[ -z "${1-}" ]] && bocker_help "$0"
      case $1 in
        pull|init|rm|images|ps|run|exec|logs|commit) bocker_"$1" "${@:2}" ;;
        *) bocker_help "$0" ;;
      esac
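The one-line option loop at the top of the script turns leading `--NAME=VALUE` arguments into `BOCKER_NAME` variables and leaves the subcommand and its arguments in place. The sketch below replays that loop on sample arguments, written with if/else instead of the `&& ... ||` chain for readability; the argument values and the `quiet` flag are made up for the demo, and the code needs bash.

```shell
#!/bin/bash
# Sample command line (assumed values, for illustration only).
set -- --CPU_SHARE=256 --quiet run img_42002 /bin/sh

# Consume leading "--" options; each becomes a BOCKER_<NAME> variable.
while [ $# -gt 0 ] && [ "${1:0:2}" == '--' ]; do
  OPTION=${1:2}
  if [[ $OPTION =~ = ]]; then
    # --NAME=VALUE: text before "=" is the name, text after it is the value
    declare "BOCKER_${OPTION/=*/}=${OPTION/*=/}"
  else
    # --NAME without a value is stored as "x"
    declare "BOCKER_${OPTION}=x"
  fi
  shift
done

echo "CPU_SHARE=$BOCKER_CPU_SHARE quiet=$BOCKER_quiet remaining: $*"
```

This is how a setting such as BOCKER_CPU_SHARE or BOCKER_MEM_LIMIT can be overridden on the command line before the defaults in bocker_run apply.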