RPC Library for Distributed Processing Used for Machine Learning

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Machine Learning Job with Kubernetes

    - Machine learning jobs run on multiple nodes that communicate with each other.
    - CPU nodes fetch data from storage.
    - After preprocessing, they send it to the GPU node.
    - Pods can get access credentials for the storage from a Secret.
    - Resource limits (CPU, memory, GPU).
    - Collect and read logs from each Pod.
    [Diagram labels: CPU Node, GPU Node, Pod, Train Pod, data storage]
  3. Job manifest

      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: hello-
      spec:
        backoffLimit: 0
        ttlSecondsAfterFinished: 10
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: job
                image: bash
                args:
                  - echo
                  - Hello, world!
                resources:
                  limits:
                    cpu: 200m
                    memory: 200Mi
  4. Job manifest (same as slide 3): generateName

    generateName: hello- adds a suffix to the Job name, which makes the manifest re-runnable.
  5. Job manifest (same as slide 3): backoffLimit

    backoffLimit: 0 is set for all jobs to avoid retries.
  6. Job manifest (same as slide 3): ttlSecondsAfterFinished

    The Job is cleaned up ttlSecondsAfterFinished seconds (here 10) after it has finished.
  7. Job manifest (same as slide 3): image and args

    Image name and command arguments.
  8. Job manifest (same as slide 3): resources.limits

    CPU and memory usage can be limited.
  9. Job manifest (same as slide 3): running it

      kubectl create -f example.yaml
      kubectl logs -f -l job-name=hello-xxxxx
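The annotated fields can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_job_manifest` that is not part of any tool shown in the talk. Since `kubectl create -f` accepts JSON as well as YAML, the resulting dict can be written out with `json.dumps`.

```python
import json


def build_job_manifest(prefix, image, args, cpu="200m", memory="200Mi"):
    """Return a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # generateName lets the API server append a random suffix,
        # so the same manifest can be created repeatedly.
        "metadata": {"generateName": f"{prefix}-"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed Pod
            "ttlSecondsAfterFinished": 10,  # clean up shortly after the Job finishes
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "job",
                            "image": image,
                            "args": list(args),
                            "resources": {"limits": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest("hello", "bash", ["echo", "Hello, world!"])
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` to a file and passing it to `kubectl create -f` should behave like the YAML version on slide 3.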
  10. Job template and launcher script

    Template:

      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: ${job_prefix}-
      spec:
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              example/gpu: "v100"
            tolerations:
              - key: nvidia.com/gpu
                effect: NoSchedule
                operator: Exists
            volumes:
              - name: dshm
                emptyDir:
                  medium: Memory
            containers:
              - name: job
                image: ${IMAGE}
                args: [bash, -c, "${args}"]
                env:
                  - name: ACCESS_ID
                    valueFrom:
                      secretKeyRef:
                        name: my-storage
                        key: access_id
                  - name: SECRET_KEY
                    valueFrom:
                      secretKeyRef:
                        name: my-storage
                        key: secret_key
                resources:
                  limits:
                    nvidia.com/gpu: 1
                    cpu: 30
                    memory: 100Gi
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm

    Script:

      set -eux
      export template_yaml=$1
      export job_prefix=$2
      echo IMAGE: $IMAGE
      if [ "$#" -gt 3 ]; then
        shift 2
        export args="$@"
      fi
      basedir=$(dirname "$0")
      jobyaml=$job_prefix.yaml
      tmp_job=$(mktemp)
      envsubst < $basedir/$template_yaml > $jobyaml
      cat $jobyaml
      trap 'kubectl delete job $job_name; rm -f $tmp_job' EXIT
      kubectl create -f $jobyaml -o json > $tmp_job
      namespace=$(jq -r .metadata.namespace $tmp_job)
      job_name=$(jq -r .metadata.name $tmp_job)
      for ((k = 0; k < 20; ++k)); do
        # wait until pod is ready
        if kubectl wait --for=condition=ready pod -l job-name=$job_name --timeout=10s -n ${namespace}; then
          kubectl logs -f -n ${namespace} -l job-name=$job_name
          break
        else
          # check init status
          initStatus=$(kubectl get po -o go-template='{{range .items}}{{range .status.containerStatuses}}\
      {{if .state.terminated}}{{"Exit"}}{{end}}{{end}}{{"\n"}}{{end}}' -n ${namespace} -l job-name=${job_name})
          if [ X"${initStatus}" != X ]; then
            break
          fi
        fi
      done
      errorStatus=$(kubectl get po -o go-template='{{range .items}}{{range .status.containerStatuses}}\
      {{.state.terminated.exitCode}}{{end}}{{end}}' -n ${namespace} -l job-name=${job_name} | sed 's/0\.//g')
      if [ -n "${errorStatus}" ]; then
        echo "${errorStatus}"
        exit 1
      fi
  11. Job template and launcher script (same as slide 10): substitution

    An image and args can be replaced: envsubst fills ${IMAGE}, ${args}, and ${job_prefix} in the template.
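The substitution step can be illustrated without the shell. The following is a minimal sketch that uses Python's standard `string.Template` as a stand-in for `envsubst`. The helper name `render_template` and the sample values are assumptions made for illustration.

```python
from string import Template


def render_template(text, variables):
    """Replace $name and ${name} placeholders, similar in spirit to envsubst."""
    # safe_substitute leaves unknown placeholders untouched instead of raising;
    # envsubst itself would replace unset variables with an empty string.
    return Template(text).safe_substitute(variables)


template_text = 'image: ${IMAGE}\nargs: [bash, -c, "${args}"]\n'
rendered = render_template(
    template_text,
    {"IMAGE": "example.com/predictor", "args": "python run-predict.py --all"},
)
```

Arguments containing double quotes would break the quoted YAML string in either approach, which is one reason to build the manifest as structured data instead of text.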
  12. Job template and launcher script (same as slide 10): GPU

    Common configurations for GPU: nodeSelector, tolerations, the nvidia.com/gpu limit, and a memory-backed emptyDir mounted at /dev/shm.
  13. Job template and launcher script (same as slide 10): ObjectStorage

    For ObjectStorage: ACCESS_ID and SECRET_KEY are read from the my-storage Secret through secretKeyRef.
  14. Job template and launcher script (same as slide 10): wait and logs

    - "logs" fails if it is called before the pod has started.
    - "wait" waits until the pod is running, so if the pod fails immediately after it is created, "wait" waits forever.
    [Diagram labels: Preparing, Running, Finished; wait, logs]
  15. Job template and launcher script (same as slide 10): pod fails during startup

    If the pod fails immediately after it is created, "wait" waits forever.
    [Diagram labels: Preparing, Failed; wait forever]
  16. Job template and launcher script (same as slide 10): logs called too early

    "logs" fails if it is called before the pod has started.
    [Diagram labels: Preparing, Running, Finished; logs failed]
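The race between "wait" and "logs" comes down to polling with a deadline and treating a terminal state as a reason to stop waiting. The following is a minimal sketch, assuming a caller-supplied `get_phase` function (hypothetical) that returns the Pod phase as a string such as "Pending", "Running", "Succeeded", or "Failed". It does not call kubectl or the Kubernetes API itself.

```python
import time

TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_until_started(get_phase, timeout=200.0, interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the pod is running or already finished; raise on timeout."""
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        # Stop on Running, but also on a terminal phase: a pod that failed
        # right after creation will never become ready.
        if phase == "Running" or phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"pod still in phase {phase!r} after {timeout}s")
        sleep(interval)


# Simulated pod that fails during startup.
phases = iter(["Pending", "Pending", "Failed"])
result = wait_until_started(lambda: next(phases), sleep=lambda _seconds: None)
```

Here the loop returns "Failed" instead of blocking, so the caller can go straight to reading logs and the exit code.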
  17. Job template and launcher script (same as slide 10): exit code

    Depending on the kubectl version, the exitCode is printed as 0 or 0.0.
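Parsing the value numerically avoids depending on how it is printed. The following is a minimal sketch; the function name `exit_codes_ok` is an assumption made for illustration.

```python
def exit_codes_ok(raw_codes):
    """Return True when every exit code string denotes zero ("0" or "0.0")."""
    return all(int(float(code)) == 0 for code in raw_codes)


ok = exit_codes_ok(["0", "0.0"])
failed = exit_codes_ok(["0", "1.0"])
```

Both spellings of zero are accepted, and any non-zero code in either spelling is reported as a failure.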
  18. swimmy.cmd

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        python run-predict.py --all

    - No need to deploy anything beforehand
    - Execute a command on the image
    - Print pod logs and events
  19. swimmy.cmd: resource limits

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi \
        python run-predict.py --all

    - Resource limits
  20. swimmy.cmd: environment variables

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        python run-predict.py --all

    - Environment variables
  21. swimmy.cmd: pod template system

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage.yaml \
        python run-predict.py --all

    - Pod template system

    object-storage.yaml:

      containers:
        - env:
            - name: ACCESS_ID
              valueFrom:
                secretKeyRef:
                  name: my-storage
                  key: access_id
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: my-storage
                  key: secret_key
  22. swimmy.cmd: multiple templates

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage.yaml --template gpu-v100.yaml \
        python run-predict.py --all

    gpu-v100.yaml:

      nodeSelector:
        example/gpu: "v100"
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
          operator: Exists
      containers:
        - resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
  23. swimmy.cmd: template locations

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage --template configmap:mycm/gpu-v100.yaml \
        python run-predict.py --all

    Template location priority:
    - prefix with configmap
    - local file
    - read as configmap swimmy templates {template} yaml
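Combining several templates into one pod spec amounts to a deep merge. The slides do not show how Swimmy merges templates, so the following is only a sketch under stated assumptions: dicts merge recursively, the `containers` list is matched by position, and other lists are concatenated. The function name `merge_pod_templates` is hypothetical.

```python
import copy


def merge_pod_templates(base, overlay):
    """Deep-merge two pod-template dicts without mutating either input."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_pod_templates(current, value)
        elif key == "containers" and isinstance(current, list) and isinstance(value, list):
            # Assumption: containers are matched by position, so the first
            # entry of every template describes the same container.
            items = []
            for i in range(max(len(current), len(value))):
                if i < len(current) and i < len(value):
                    items.append(merge_pod_templates(current[i], value[i]))
                else:
                    items.append(copy.deepcopy(current[i] if i < len(current) else value[i]))
            merged[key] = items
        elif isinstance(current, list) and isinstance(value, list):
            # Assumption: other lists (env, tolerations, ...) are concatenated.
            merged[key] = current + copy.deepcopy(value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


object_storage = {"containers": [{"env": [{"name": "ACCESS_ID"}, {"name": "SECRET_KEY"}]}]}
gpu_v100 = {
    "nodeSelector": {"example/gpu": "v100"},
    "containers": [{"resources": {"limits": {"nvidia.com/gpu": 1}}}],
}
pod_spec = merge_pod_templates(object_storage, gpu_v100)
```

With these two inputs, the single container in `pod_spec` carries both the `env` entries and the GPU limit, and the `nodeSelector` from the second template is added at the top level.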
  24. swimmy.cmd: sending files

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage.yaml --template gpu-v100.yaml \
        --files run-predict.py \
        python run-predict.py --all

    - Send any files
  25. swimmy.cmd: time limits

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage.yaml --template gpu-v100.yaml \
        --files run-predict.py --connect-timeout 120 --ttl-seconds 3600 \
        python run-predict.py --all

    - TTL and startup time limit
  26. swimmy.cmd: summary

      python -m swimmy.cmd --name example-predict --image example.com/predictor \
        --cpu 4000m --mem 4Gi -e KEY_SECRET -e STAGE=dev \
        --template object-storage.yaml --template gpu-v100.yaml \
        --files run-predict.py --connect-timeout 120 --ttl-seconds 3600 \
        python run-predict.py --all

    - No need to deploy anything beforehand
    - Execute a command on the image
    - Print pod logs and events
    - Resource limits
    - Environment variables
    - Pod template system
    - Send any files
    - TTL and startup time limit

    The same version of Python is required in the image.
  27. Hello, Swimmy!

      import asyncio
      import logging

      import swimmy
      import foo


      @swimmy.remotefn
      def hello(name: str) -> str:
          return foo.bang(f"Hello, {name}")


      async def main():
          async with swimmy.KubeCluster(name="swimmy-test") as cluster:
              x = await cluster.run(
                  swimmy.PodSpec(
                      image="example.com/py37",
                      cpu="200m",
                      mem="200Mi",
                      count=2,
                      pyfiles=["foo.py"],
                  )
              )
              print(await x.remote(hello("Swimmy")))


      logging.basicConfig(level=logging.INFO)
      asyncio.run(main())
  28. Internals: Create Job

    Same driver code as slide 27 ("Hello, Swimmy!").
    [Diagram labels: Driver, Kubernetes API Server, Create Job]

    Role used by the driver:

      kind: Role
      rules:
        - apiGroups:
            - ""
          resources:
            - configmaps
            - pods
            - pods/status
            - pods/log
            - events
          verbs:
            - get
            - list
            - watch
        - apiGroups:
            - batch
          resources:
            - jobs
          verbs:
            - create
            - patch
  30. Internals: Job and Pod

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Pod]
  31. Internals: Fetch Swimmy

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Fetch Swimmy]
  32. Internals: Keep Minimal Dependencies

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Hadoop, Kafka, DataFrame, Redis, NumPy]
  33. Internals: Open ZMQ Connection

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Open ZMQ Connection]
  34. Internals: Why ZMQ?

    Same driver code as slide 27.
    - Fast: high-speed asynchronous IO
    - Small: tiny single wheel
    - Flexible: fine-grained flow control
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Open ZMQ Connection]
  35. Internals: HTTP for files

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; HTTP for files]
  36. Internals: Health Check

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Health Check]
  37. Internals: Extend activeDeadlineSeconds

    Same driver code as slide 27, with the Role from slide 28 (the patch verb on jobs).
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Extend activeDeadlineSeconds]
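One way to read "Extend activeDeadlineSeconds" together with the patch verb on jobs is as a keep-alive. The driver keeps pushing the Job's deadline forward while it is alive, so the Job is terminated some time after the driver disappears. This is an interpretation and not something the slides state. The following is a minimal sketch that only computes the patch body; the function name `deadline_patch` and the `keep_alive_seconds` parameter are assumptions. It relies on `activeDeadlineSeconds` being counted from the Job's start time.

```python
from datetime import datetime, timedelta, timezone


def deadline_patch(job_start, now, keep_alive_seconds):
    """Build a patch body that moves the Job deadline to now + keep_alive_seconds."""
    # activeDeadlineSeconds is relative to the Job start time, so the
    # elapsed run time has to be added to the keep-alive window.
    elapsed = int((now - job_start).total_seconds())
    return {"spec": {"activeDeadlineSeconds": elapsed + keep_alive_seconds}}


start = datetime(2021, 11, 10, 12, 0, 0, tzinfo=timezone.utc)
patch = deadline_patch(start, start + timedelta(minutes=5), keep_alive_seconds=120)
```

A driver following this idea would send such a body with a patch request on the Job at an interval shorter than the keep-alive window.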
  38. Internals: Read pod logs and events

    Same driver code as slide 27, with the Role from slide 28 (get, list, watch on pods, pods/status, pods/log, and events).
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; Read pod logs and events]
  39. Internals: RPC

    Same driver code as slide 27.
    [Diagram labels: Driver, Kubernetes API Server, Job, Agent, PyPI; RPC]
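In the driver code, `hello("Swimmy")` is passed to `x.remote(...)` rather than printed directly, which suggests that a decorated function returns a description of the call instead of running it. The following is a minimal sketch of that pattern. It is not Swimmy's implementation: the `Call` class and this `remotefn` are stand-ins written for illustration, and the call is executed locally instead of being sent to an agent.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Call:
    """A deferred function call: what to run and with which arguments."""
    func: Callable[..., Any]
    args: tuple
    kwargs: dict

    def run(self):
        # In an RPC setting this would happen on the remote side.
        return self.func(*self.args, **self.kwargs)


def remotefn(func):
    """Make calling func return a Call object instead of executing it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return Call(func, args, kwargs)
    return wrapper


@remotefn
def hello(name: str) -> str:
    return f"Hello, {name}"


call = hello("Swimmy")   # nothing has run yet
greeting = call.run()    # executes the captured call
```

A real RPC layer would additionally serialize the call, ship it over the connection, and return the result or the exception to the driver.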
  40. What is it used for?

    - Distributed Machine Learning Library
    - Redis Monitoring
    - Scalable Load Testing
  41. Ghee

      async def mpi_train(train, preprocess):
          async with swimmy.KubeCluster(name="mpi-train") as cluster:
              cp = await cluster.run(preprocess.podspec, name='preprocess')
              ct = await cluster.run(train.podspec, name='train')
              conn = connection_manager(cp, ct)
              fp = cp.remote(mpi_preprocess(preprocess.func))
              ft = ct.remote(mpi_fit(train.func))
              tasks = [asyncio.ensure_future(t) for t in [conn, fp, ft]]
              await asyncio.gather(*tasks)

    [Diagram labels: CPU Node, GPU Node, Pod, Train Pod, HDFS; PyArrow, MPI, ZMQ]
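The structure of `mpi_train` is several long-running coroutines started together and awaited with `asyncio.gather`. The following is a minimal sketch of the same shape with local stand-ins. The `preprocess` and `train` coroutines and the queue between them are assumptions made for illustration; they replace the remote calls and the connection manager.

```python
import asyncio


async def preprocess(queue, n_batches):
    """Stand-in for the CPU side: produce batches, then signal completion."""
    for i in range(n_batches):
        await queue.put(f"batch-{i}")
    await queue.put(None)  # sentinel: no more data
    return n_batches


async def train(queue):
    """Stand-in for the GPU side: consume batches until the sentinel arrives."""
    seen = []
    while True:
        batch = await queue.get()
        if batch is None:
            return seen
        seen.append(batch)


async def run_pipeline(n_batches=3):
    # maxsize=1 forces the producer to wait for the consumer (back-pressure).
    queue = asyncio.Queue(maxsize=1)
    tasks = [asyncio.ensure_future(t) for t in [preprocess(queue, n_batches), train(queue)]]
    # gather returns the results in the order the tasks were listed.
    return await asyncio.gather(*tasks)


produced, consumed = asyncio.run(run_pipeline())
```

Both coroutines make progress concurrently, and `gather` finishes only when the producer and the consumer have both returned.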
  42. Redis Monitoring

    - 10x performance
    - Scanning 100M keys from the Redis cluster took more than an hour.
    [Diagram labels: Client, Swimmy Driver, Agent (x3), Redis (x6)]
  43. Thank you!

    - Python library for Kubernetes Job
    - Send function and files
    - Minimal dependencies
    - Pod template
    - Health check