Zero Scale Abstraction in Knative Serving

ServerlessDays Tokyo 2019, 2019-10-22
Tsubasa Nagasawa

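About me
• Tsubasa Nagasawa (長澤翼), toVersus
• Infrastructure engineer
• Fujitsu Limited (2016.04-2018.10)
• CyberAgent (2018.11-)
  ◦ Media Business Division, Service Reliability Group
  ◦ Responsible for the infrastructure of the point system and the authentication platform
• Other
  ◦ Started working with Kubernetes/Knative half a year ago
  ◦ Recently started contributing to Knative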

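• What is the Knative project? (3 min)
• Knative Serving
  ◦ Introduction and a simple demo (7 min)
  ◦ How scale-to-zero works (7 min)
  ◦ Strengths (5 min)
  ◦ Constraints and open issues (5 min)
• Toward adopting Knative Serving in production
  ◦ Architecture overview (5 min)
  ◦ Problems we faced and how we solved them (5 min)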

Knative [kay-native]
Kubernetes-native platform for serverless workloads
• Build: Source to container
• Serving: Request-driven, scale-to-zero, container-based compute
• Eventing: Attach work to event sources

'Kubernetes is becoming the Linux of the cloud' - Jim Zemlin, Linux Foundation
[Diagram labels: K8S (GKE, EKS, Rancher), Knative, CaaS, PaaS, FaaS, Eirini + Runtime, Cloud Run, OpenShift; annotation: "conforms to the Knative Serving API"]
※ This is just my personal insight.

Knative — Kubernetes-native PaaS with Serverless
https://itnext.io/knative-kubernetes-native-paas-with-serverless-a1e0a0612943
> The PaaS handles source-to-deployment workflow, building the user’s container image, rolling out a new deployment, and configuring a new route and DNS subdomain to allow traffic to reach the deployed containers.

Knative serverless Kubernetes bypasses FaaS to revive PaaS
https://searchitoperations.techtarget.com/news/252469607/Knative-serverless-Kubernetes-bypasses-FaaS-to-revive-PaaS
> Unlike the earlier generation of VM-based, fully managed PaaS offerings, Knative and Kubernetes allow enterprise DevOps and SRE teams to retain control over the Kubernetes infrastructure behind the scenes.

What Exactly Is Knative?
https://medium.com/datadriveninvestor/what-exactly-is-knative-252ec94e4de7
> PaaS providers such as CloudFoundry and OpenShift are also actively making contributions to the Knative community.

Knative
Raises the level of abstraction for running and connecting stateless applications on Kubernetes in practical ways

Knative Launch Partners

Knative Community Enabled
[Diagram layers: Platform (Kubernetes), Primitives (Build, Serving, Events), Products]
Products: Google Cloud Run, SAP Kyma, Pivotal Function Service, IBM Cloud Functions, Red Hat OpenShift Cloud Functions, Pivotal riff, TriggerMesh, T-Mobile Jazz, Rancher Rio, ...

Knative Serving
Stateless, HTTP request-driven, container autoscaling platform on top of Kubernetes

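Brief Review - Kubernetes
• Container orchestration platform
  ◦ Service discovery and load balancing
  ◦ Storage management
  ◦ Simplified rolling updates and rollbacks of applications
  ◦ Resource-based optimization of application placement
  ◦ Self-healing
  ◦ Management of secrets and configuration
• Kubernetes components
  ◦ kube-proxy
    ▪ Data plane for service discovery
    ▪ Runs on every worker node
    ▪ Manipulates Netfilter rules through iptables
    ▪ Maps VIPs to backends
[Diagram: master and worker nodes]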

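Brief Review - Kubernetes Primitives

Pod
• The smallest resource unit that hosts an application
• Consists of one or more containers
• Containers share a single IP address and port space
• Short-lived

Deployment
• A set of multiple Pods
• Manages the Pod lifecycle (creation, updates, and so on)
• Self-healing
• A resource that developers work with directly
• Persistent

Service
• Abstracts how an application is accessed
• Service discovery
• Provides an internal VIP
• L4 load balancing
• A resource that developers work with directly
• Persistent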

Kubernetes Deployment and others...

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-k8s
spec:
  selector:
    matchLabels:
      app: grpc-k8s
  template:
    metadata:
      labels:
        app: grpc-k8s
    spec:
      containers:
      - name: user-container
        image: toversus/grpc-ping-go
        ports:
        - name: h2c
          containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: grpc-k8s
spec:
  ports:
  - name: http2
    port: 80
    targetPort: h2c
  selector:
    app: grpc-k8s
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: grpc-istio
spec:
  gateways:
  - knative-ingress-gateway.knative-serving.svc.cluster.local
  hosts:
  - k8s.mattmoor.io
  http:
  - match:
    - authority:
        regex: ^k8s\.default\.io(?::\d{1,5})?$
    route:
    - destination:
        host: grpc-k8s.default.svc.cluster.local
        port:
          number: 80
      weight: 100
---
# The target.type / averageUtilization form below is the v2beta2 schema
# (the slide said v2beta1, which uses targetAverageUtilization instead).
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: grpc-k8s
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grpc-k8s
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Knative Service

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: grpc-knative
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
    spec:
      containers:
      - image: toversus/grpc-ping-go
        ports:
        - name: h2c
          containerPort: 8080

Demo #1: Hello, Knative!


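Knative Serving Primitives

Knative Service
• Manages the application lifecycle from creation to deletion

Configuration
• Manages application state (current and desired)
• Follows the Twelve-Factor App philosophy: code and configuration are kept separate

Revision
• An immutable snapshot of code and configuration

Route
• Maps traffic to Revisions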

Knative Service

name: grpc-knative-v1
spec:
  containers:
  - image: toversus/grpc-ping-go
    ports:
    - name: h2c
      containerPort: 8080

[Diagram: Service (grpc-knative) with Configuration (grpc-knative) and Route (grpc-knative). The Configuration records the history of Revision (grpc-knative-v1); the Route sends 100% of traffic to it.]

Knative Service

name: grpc-knative-v1
spec:
  containers:
  - image: toversus/grpc-ping-go
    ports:
    - name: h2c
      containerPort: 8080
traffic:
- latestRevision: true
  percent: 100

[Diagram: Service (grpc-knative) with Configuration (grpc-knative) and Route (grpc-knative). The Configuration records the history of Revision (grpc-knative-v1); the Route sends 100% of traffic to it.]

Knative Service (rolling update)

name: grpc-knative-v2
spec:
  containers:
  - image: toversus/grpc-ping-go
    args: ["-value=KNATIVE"]
    ports:
    - name: h2c
      containerPort: 8080
traffic:
- revisionName: grpc-knative-v1
  percent: 100
  tag: blue
- revisionName: grpc-knative-v2
  percent: 0
  tag: green

[Diagram: the Configuration (grpc-knative) records the history of Revision (grpc-knative-v1) and Revision (grpc-knative-v2); the Route (grpc-knative) still sends 100% of traffic to grpc-knative-v1.]

Knative Service (rolling update)

name: grpc-knative-v2
spec:
  containers:
  - image: toversus/grpc-ping-go
    args: ["-value=KNATIVE"]
    ports:
    - name: h2c
      containerPort: 8080
traffic:
- revisionName: grpc-knative-v1
  percent: 0
  tag: blue
- revisionName: grpc-knative-v2
  percent: 100
  tag: green

[Diagram: the Configuration (grpc-knative) records the history of Revision (grpc-knative-v1) and Revision (grpc-knative-v2); the Route (grpc-knative) now sends 100% of traffic to grpc-knative-v2.]
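
The `traffic` block on the two rolling-update slides can be read as a weighted choice between Revisions. The following is a minimal Python sketch of that idea only. `pick_revision` is a hypothetical helper and not Knative code; in a real cluster the split is applied by the ingress layer (Istio in this deck), not by application code.

```python
import random


def pick_revision(traffic, rng=random):
    """Pick a revision name with probability proportional to its percent.

    `traffic` mirrors the shape of the `traffic:` list on the slides.
    Hypothetical helper, for illustration only.
    """
    total = sum(entry["percent"] for entry in traffic)
    if total != 100:
        raise ValueError("traffic percentages must add up to 100")
    names = [entry["revisionName"] for entry in traffic]
    weights = [entry["percent"] for entry in traffic]
    return rng.choices(names, weights=weights, k=1)[0]


# Before the switch: every request goes to v1 (blue).
before = [
    {"revisionName": "grpc-knative-v1", "percent": 100, "tag": "blue"},
    {"revisionName": "grpc-knative-v2", "percent": 0, "tag": "green"},
]
# After the switch: every request goes to v2 (green).
after = [
    {"revisionName": "grpc-knative-v1", "percent": 0, "tag": "blue"},
    {"revisionName": "grpc-knative-v2", "percent": 100, "tag": "green"},
]
```

Because both Revisions stay listed with a tag, flipping the percentages back is enough to roll back.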

Demo #2: Rolling Update


Demo #2: Rolling Update

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: account
spec:
  template:
    metadata:
      name: account-grpc-v0-1-0
      labels:
        serving.knative.dev/visibility: cluster-local
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "3"
        # ...
    spec:
      containers:
      - image: toversus/rosetta-account:v0.1.0
        ports:
        - name: h2c
          containerPort: 8070
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 128Mi
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gateway
spec:
  template:
    metadata:
      name: gateway-graphql-v0-1-0
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "3"
        # ...
    spec:
      containers:
      - image: toversus/rosetta-gateway:v0.1.0
        ports:
        - name: h2c
          containerPort: 8080
        env:
        - name: ACCOUNT_SERVICE_URL
          value: account.default.svc.cluster.local:80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 128Mi

Demo #2: Rolling Update

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: account
spec:
  template:
    metadata:
      name: account-grpc-v0-2-0
      labels:
        serving.knative.dev/visibility: cluster-local
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "3"
        # ...
    spec:
      containers:
      - image: toversus/rosetta-account:v0.2.0
        ports:
        - name: h2c
          containerPort: 8070
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 128Mi
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gateway
spec:
  template:
    metadata:
      name: gateway-graphql-v0-2-0
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "3"
        # ...
    spec:
      containers:
      - image: toversus/rosetta-gateway:v0.2.0
        ports:
        - name: h2c
          containerPort: 8080
        env:
        - name: ACCOUNT_SERVICE_URL
          value: account.default.svc.cluster.local:80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 128Mi


Inside Knative Serving - Dominik Tornow, SAP & Andrew Chen, Google
https://www.youtube.com/watch?v=-tvQgLbcNtg

Inside Zero Scale Abstraction

To achieve scaling to zero...
How can we hold synchronous user requests until the desired number of Pods has been scaled out?

The key elements are...
• Queuing/buffering requests
• Fast scaling out
• Switching the request path
• Observing request concurrency
• Making the autoscale decision

The key components are...
• Activator (a.k.a. KQueue)
  ◦ Centralized request queue
  ◦ Routes and balances requests
  ◦ Exposes metrics such as the concurrency of requests routed to the Activator
• Autoscaler
  ◦ Scrapes metrics
  ◦ Calculates and makes the autoscale decision
• queue-proxy
  ◦ Distributed request queue
  ◦ Reverse proxy
  ◦ Exposes metrics such as the concurrency of requests routed to queue-proxy
  ◦ Deployed as a sidecar container

A journey through initial scaling to zero

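Initial state
• The Activator and the Autoscaler keep a permanent WebSocket connection
• Metrics are sent periodically

Deploy Knative Service
• The Revision controller creates a KPA (Knative Pod Autoscaler)
• The KPA manages the Revision's scaling method, its scale, and the source of the metrics that drive scaling

Serverless-style K8S Service
• An important resource that switches the request path

K8S Services
• The public K8S Service acts as the switch for the request path (non-Endpoints)
• The Revision K8S Service is used to watch the request forwarding targets (ClusterIP or Endpoints)

Queuing requests
• Endpoints are copied manually to point the request path at the Activator
• This lets the Activator hold requests in a queue for the time being

Probing Endpoints
• The Activator watches the Endpoints of the Revision Service to see whether the Pods are ready to handle requests

Healthy Endpoints
• The Pods are found to be ready to handle requests

Switching Endpoints
• The Endpoints of the public Service are overwritten with the Endpoints object of the Revision Service

Monitoring metrics
• The Autoscaler collects metrics from queue-proxy
• It compares them with the Activator's metrics to decide whether scaling is needed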
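
The decision made on the "Monitoring metrics" slide can be approximated as: desired Pods = ceil(observed concurrency / target concurrency per Pod), clamped to the configured min/max scale. The sketch below assumes exactly that simplified rule. The function name, its parameters, and the convention that `max_scale=0` means "no upper bound" are assumptions made for illustration; the real Autoscaler also works with averaging windows, which are left out here.

```python
import math


def desired_pods(observed_concurrency, target_per_pod, min_scale=0, max_scale=0):
    """Simplified autoscale decision (illustrative, not the Knative algorithm).

    observed_concurrency: in-flight requests across queue-proxy and Activator
    target_per_pod:       like the autoscaling.knative.dev/target annotation
    min_scale/max_scale:  like minScale/maxScale; max_scale=0 is assumed
                          to mean "no upper bound"
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    want = math.ceil(observed_concurrency / target_per_pod)
    want = max(want, min_scale)
    if max_scale > 0:
        want = min(want, max_scale)
    return want
```

With `target: "10"`, `minScale: "1"` and `maxScale: "3"` as in the demo manifests, 25 concurrent requests would ask for 3 Pods and 45 would still be capped at 3.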

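Scaling down to zero
• Before scaling down, the Endpoints of the public Service are pointed at the Activator
• The Deployment's replicas field is rewritten to 0

Zero scaled
• The Pods are deleted and the Revision is scaled to zero
• The Endpoints of the Revision K8S Service are removed as well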
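
The request-path switch walked through above can be modeled as a small state machine: while the Revision has no ready Pods the public Service points at the Activator, which buffers requests, and once Pods are ready the buffered requests are released and traffic goes straight to the Pods. This is a toy Python model written for this transcript; the class and method names are hypothetical and do not correspond to Knative types.

```python
from collections import deque


class RequestPath:
    """Toy model of the public Service switching between Activator and Pods."""

    def __init__(self):
        self.ready_pods = []    # stands in for the Revision Service Endpoints
        self.buffer = deque()   # requests held by the Activator

    def route(self, request):
        """Return where a request goes right now."""
        if not self.ready_pods:
            self.buffer.append(request)   # Activator queues it
            return "activator"
        return self.ready_pods[0]         # direct path, Activator bypassed

    def pods_ready(self, pod_ips):
        """Pods passed probing: switch Endpoints and release the queue."""
        self.ready_pods = list(pod_ips)
        released = list(self.buffer)
        self.buffer.clear()
        return released

    def scale_to_zero(self):
        """Point the path back at the Activator before Pods go away."""
        self.ready_pods = []
```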

A journey through scaling from zero

Incoming requests
• The cloud load balancer receives the request

Forwarding requests
• The cloud load balancer distributes requests across the K8S worker nodes

Pass-through requests
• The K8S Service passes requests received on the nodePort to the front Envoy (L7 LB)

Istio magic
• The destination public Service is identified from the request's Host header
• The request is forwarded to the public Service

Activator queuing
• The public Service forwards the request to the Activator
• The Activator holds the request in its queue

Making autoscale decision
• The Autoscaler detects from the metrics that scaling is needed
• It instructs the KPA to scale

Scaling from zero
• The KPA rewrites the Deployment's replica count
• Pods are being created

Probing endpoints
• The Activator watches the Endpoints of the Revision Service to see whether the Pods are ready to handle requests

Transferring requests
• The Activator forwards requests to the ClusterIP or the Pod IP

Switching Endpoints
• The Endpoints of the public Service are overwritten with the Endpoints object of the Revision Service

Stable request path
• Requests flow to the backend without passing through the Activator

Tracing results
• While scaling
  ◦ First proxy hop: Activator -> queue-proxy
  ◦ Second proxy hop: queue-proxy -> user-container
• Normal operation
  ◦ Only the queue-proxy -> user-container proxy hop
[Figures: trace results while scaling and during normal operation]

Autoscaling ultimate goal
• Bring Knative's autoscaling mechanism into Kubernetes
  ◦ Support scale-to-zero in K8S Service/Ingress
  ◦ Feedback is actively being given to Ingress v2
    ▪ SIG-NETWORK: [PUBLIC] A sketch of the API
• Current state of the Kubernetes HPA
  ◦ `minReplicas: 0` is supported at the API level (v1.16)
    ▪ Support scaling HPA to/from zero pods for object/external metrics
  ◦ Kubernetes itself has no request-buffering mechanism
  ◦ Scale to zero based on self-built or external metrics
    ▪ External queue length is zero, LB rps is zero, ...

Autoscaling ultimate goal
• Scale-to-zero mechanism using kube-proxy (Proposal)
  ◦ Requests to a ClusterIP pass through kube-proxy
  ◦ While scaled to zero, kube-proxy queues requests until a Pod starts
  ◦ kube-proxy watches Endpoints to judge backend health
• Osiris (by Microsoft)
  ◦ Provides scale-to-zero based on HTTP requests
  ◦ Manually swaps the Endpoints of a K8S Service
    ▪ The mechanism Knative used as a reference
  ◦ Uses two Services to switch the request path
    ▪ Has no counterpart to Knative's public Service
    ▪ Switches the Revision Service itself between having and not having a selector

Off-topic: Don't scale to 1 upon deploy
• When should billing start?
  ◦ Exclude from billing the 90 seconds between startup and shutdown on the initial deploy
  ◦ Stop the Pod immediately on the initial deploy
  ◦ Do not start a Pod at all on the initial deploy
[Figure: timeline of the initial deploy, 90 sec]
When no readinessProbe is explicitly specified, the direction being worked on is not to start a Pod.

Knative Serving Pros.

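Auto scaling with request metrics
• Can scale based on HTTP requests
  ◦ Number of requests the app is handling concurrently
  ◦ Number of requests the app handled per second
  ◦ CPU utilization (HPA)
  ◦ Memory utilization (HPA)
• The scaling policy can be customized to some extent
  ◦ Concurrency of the app
  ◦ Maximum/minimum scale
  ◦ Time until scale-to-zero
  ◦ Burst capacity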

Free from `imagePullPolicy: Always`
• In the Kubernetes world
  ◦ Container image tags are mutable
  ◦ In the Docker Registry v1 era, a predecessor of GCR was built on GAE
    ▪ google/docker-registry was Python-based
    ▪ google/docker-registry had a syntax error, and a fix was rolled out
    ▪ This caused DDoS-level traffic and a DockerHub outage lasting several hours
    ▪ This is reportedly why imagePullPolicy exists in Kubernetes

Free from `imagePullPolicy: Always`
• In the Kubernetes world
  ◦ For tags other than latest, the default is `imagePullPolicy: IfNotPresent`
  ◦ Specifying immutable tags in a Deployment is recommended
  ◦ ...the image was updated, but the change does not show up in the Deployment!
    ▪ Back to `imagePullPolicy: Always`...
This is not what I wanted...

Free from `imagePullPolicy: Always`
• In the Kubernetes world
  ◦ If the same tag keeps being updated and the nodes scale out...
[Figure: nodes running v1.0.0 and v1.0.0’]

Free from `imagePullPolicy: Always`
• In the Knative world
  ◦ The Revision controller resolves the image tag to a digest
  ◦ The digest obtained from tag resolution is set as the Deployment's `image`
  ◦ A Revision is immutable
    ▪ No accidents even when nodes scale
  ◦ Achieves fail-safe behavior in a more practical way
  ◦ The tag-to-digest feature can also be turned off
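
To make the tag-to-digest idea concrete, here is a small Python sketch of the rewriting step only: given an image reference and a digest that has already been resolved, it produces the pinned `repo@sha256:...` form that would go into the Deployment. `pin_to_digest` is a hypothetical helper; the actual lookup against the registry is not shown, and the digest in the example is a made-up placeholder.

```python
def pin_to_digest(image, digest):
    """Replace a mutable tag with an already-resolved digest.

    Hypothetical helper: it only rewrites the string. Resolving the tag
    against a registry is out of scope for this sketch.
    """
    if not digest.startswith("sha256:"):
        raise ValueError("expected a sha256 digest")
    repo = image.split("@", 1)[0]        # drop any existing digest
    last_slash = repo.rfind("/")
    last_colon = repo.rfind(":")
    if last_colon > last_slash:          # a colon after the last slash is a tag,
        repo = repo[:last_colon]         # not a registry port
    return f"{repo}@{digest}"


# Placeholder digest, not a real image digest.
FAKE_DIGEST = "sha256:" + "0" * 64
pinned = pin_to_digest("toversus/grpc-ping-go:latest", FAKE_DIGEST)
```

Once the reference is pinned, every node pulls the same bytes regardless of later pushes to the tag, which is why the node scale-out scenario on the previous slide stops being a problem.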

Why we resolve tags in Knative
https://docs.google.com/presentation/d/1gjcVniYD95H1DmGM_n7dYJ69vD9d6KgJiA-D9dydWGU
(You need to join knative-users@ to view the Google Slides.)

10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You
https://youtu.be/QKI-JRs2RIE?t=483
> Jobs are not starting, image pull fails

https://github.com/knative/serving/issues/4098#issuecomment-506194163
> Another syntax error, but this one actually managed to take down DockerHub for a few hours. This outage was effectively the reason imagePullPolicy: exists in Kubernetes, and has such a terrible default value (that makes me sad).

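Safe rolling update
• Rolling out an app update without service interruption is hard because updates to the K8S Service (L4) and Ingress (L7) have latency
Ideal
1. The Pod to delete is chosen
2. The Pod IP list of the Service/Ingress is updated
3. SIGTERM is sent to the app
4. Graceful shutdown kicks in
Reality
1. The Pod to delete is chosen
2. The Pod IP list of the Service/Ingress is updated (the update takes time)
3. SIGTERM is sent to the app
4. Graceful shutdown kicks in
5. Requests still flow to the Pod
6. The stopped app returns 5xx

Safe rolling update
• Rolling update of a Knative Service
  ◦ New-version Pods start
    ▪ Requests wait in the Activator's request queue
  ◦ Old-version Pods are removed after they finish handling requests
    ▪ The ReplicaSet controller randomly picks the Pod to delete
    ▪ The Istio proxy configuration is updated (this takes time)
    ▪ SIGTERM is sent to the Pod
    ▪ queue-proxy sleeps for 20 seconds after receiving SIGTERM
    ▪ The request path has already changed, so no new requests arrive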
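
The "sleep after SIGTERM" behavior described above is a general drain pattern: stop advertising readiness, keep serving for a quiet period while routing tables catch up, then exit once in-flight work is done. The Python sketch below illustrates that pattern under stated assumptions. `Drainer` is a hypothetical class, the readiness flip is an assumption about how such a proxy would behave, and only the 20-second figure comes from the slide.

```python
import time


class Drainer:
    """Illustrative drain-on-SIGTERM logic (not queue-proxy's actual code)."""

    def __init__(self, quiet_period_s=20.0, sleep=time.sleep):
        self.ready = True          # what a readiness probe would report
        self.in_flight = 0         # requests currently being handled
        self._quiet_period_s = quiet_period_s
        self._sleep = sleep        # injectable so the logic can be tested

    def on_sigterm(self):
        self.ready = False                    # assumed: stop attracting new traffic
        self._sleep(self._quiet_period_s)     # keep serving while routes update
        while self.in_flight > 0:             # then wait for remaining requests
            self._sleep(0.1)
        return "stopped"
```

In a real process `on_sigterm` would be registered with `signal.signal(signal.SIGTERM, ...)`; that wiring is omitted so the sketch stays side-effect free.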

Reliability, reliability, reliability...
• Testing and reliability of Knative Serving
  ◦ Unit tests using fake clients
  ◦ E2E tests using test-infra
    ▪ Knative is installed on GKE for every commit of a PR
    ▪ The state of created resources and the logs of each component are searchable
  ◦ Automatic detection of flaky tests
  ◦ Conformance tests that check compliance with the Serving API
  ◦ Performance tests and SLA definitions
    ▪ Monitored with google/mako
  ◦ Profiles of each component are collected with pprof
    ▪ The plan is to find bottlenecks and optimize them

Reliability, reliability, reliability...
• Performance tests
  ◦ Every 30 minutes, vegeta applies load in steps: 0 -> 1k -> 2k -> 3k
  ◦ Latency, error rate, and desired vs. actual Pod counts are measured
  ◦ Maximum latency => scale-from-zero latency
    ▪ https://mako.dev/benchmark?benchmark_key=5352009922248704&~l=max&tseconds=345600&tag=tbc%3D200

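Others...
• Multi-tenancy
  ◦ Multiple domains can be configured on a single Istio IngressGateway
  ◦ The Istio IngressGateway used by Knative Serving can be specified
• Moving away from the Istio dependency
  ◦ Gloo and Ambassador are already supported (SMI and Ingress v2 are also in view)
  ◦ Cold start (non Istio side-car) ~ 3 s
• Revision GC
  ◦ Cleans up Revisions that are not referenced by any request path
• Custom autoscalers can be implemented
• Metrics and log monitoring
  ◦ A Fluentd DaemonSet/Prometheus/Grafana/ES stack can be built inside the cluster
  ◦ On GKE, data can be shipped to Stackdriver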

Knative Serving Cons.

Single container per Pod
• Users can define only one container per Pod
• Init/sidecar containers are not supported
  ◦ CloudSQL Proxy, Dapr, logging and monitoring agents...
  ◦ Rejected at creation time by the Validating Admission Webhook
• Then why do istio-proxy and queue-proxy exist?
  ◦ A Mutating Admission Webhook is used
• If you want to bundle multiple containers, write your own webhook

Single container per Pod
• [Proposal] Support Multiple Containers
  ◦ The app's ports need to be mapped properly
    ▪ queue-proxy needs to know the port to forward requests to
    ▪ Currently, exposing multiple ports is not possible either
    ▪ Eventually this will converge on [KEP] Sidecar Containers

Limits on Pod mounted Volume
• The Volumes that can be attached to a Pod are restricted
• Only the following Volume types are supported
  ◦ ConfigMap
  ◦ Secret
  ◦ Projected Volume
• Even a read-only Persistent Volume cannot be mounted
• Because there are many requests, support is being considered carefully
  ◦ Feature Request: Relax volume constraint to support more volume types

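Limits on Node scheduling policy
• No policy can be set to place Pods on specific nodes
  ◦ nodeSelector: place Pods on nodes carrying specific labels
  ◦ Node Affinity: place Pods on nodes carrying specific labels
  ◦ Tolerations: used together with node taints to keep Pods off specific nodes
• Use cases
  ◦ Using preemptible (Spot) instances
  ◦ Targeting GPU machines
  ◦ Targeting Windows Server nodes
  ◦ Combining with Virtual Kubelet
• This makes app developers aware of nodes...
  ◦ It does not fit the serverless philosophy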

And other masked fields...
• Knative Serving API v1 is an MVP implementation
  ◦ Many fields under the Pod Spec cannot be specified
    ▪ DownwardAPI
    ▪ InitContainers
    ▪ PriorityClassName
    ▪ Priority
    ▪ ReadinessGates
    ▪ RuntimeClassName
    ▪ Capabilities
    ▪ Privileged
    ▪ ...

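Pod phantom killer
• Scale-down rewrites the Deployment's replica count
  ◦ The ReplicaSet controller decides which Pod to delete
  ◦ There is some prioritization by Pod status, but deletion is otherwise random
  ◦ This is more of a Kubernetes issue than a Knative one...
• Knative's graceful termination
  ◦ Becomes a problem when processing takes longer than 30 seconds...
    ▪ Batch processing, running machine learning models, and so on
• [draft proposal] Graceful Scaledown
  ◦ Decide which Pod to delete based on metrics
  ◦ Make the queue-proxy readinessProbe fail
  ◦ Update the Deployment's replica count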
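
The three steps of the Graceful Scaledown draft can be laid out as a plan: choose a victim from metrics, make it fail readiness, then lower the replica count. The sketch below is an illustration only. The slide does not say which metric is used, so picking the Pod with the fewest in-flight requests is an assumption, and `plan_scaledown` is a hypothetical name.

```python
def plan_scaledown(in_flight_by_pod, current_replicas):
    """Return the ordered steps for removing one Pod.

    Assumption: "based on metrics" is read here as "fewest in-flight
    requests"; ties are broken by Pod name for determinism.
    """
    if not in_flight_by_pod or current_replicas <= 0:
        raise ValueError("nothing to scale down")
    victim = min(sorted(in_flight_by_pod), key=in_flight_by_pod.get)
    return [
        ("fail_readiness", victim),              # step 2 of the proposal
        ("set_replicas", current_replicas - 1),  # step 3 of the proposal
    ]
```

Failing readiness first matters because lowering the replica count still leaves the choice of Pod to the ReplicaSet controller, so the plan leans on the status-based prioritization mentioned above.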

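Others...
• Cost of K8S Services
  ◦ 3 K8S Services per Revision
  ◦ As the number of K8S Services grows, latency of connections through ClusterIP gets worse
• Non-intelligent autoscaling algorithm
  ◦ Cannot detect a steadily rising request rate and scale Pods ahead of time
• No support for asynchronous processing
  ◦ Alternatives such as combining with Knative Eventing need to be considered
• No support for stateful applications
• A Pod can expose only one port
  ◦ To expose metrics, use the port intended for custom metrics
• IAM Roles for Service Accounts cannot be used
  ◦ Tag-to-digest resolution runs when a Revision is created
  ◦ The version of the AWS SDK referenced by credential_helper.go...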

Road to Knative Serving v1
• 2019/09/17: v0.9.0 (v1rc2)
• 2019/10/29: v0.10.0 (v1)
• 2019/12/10: v0.11.0 (v1)
Items shown on the timeline:
  ◦ Introduction of burst capacity
  ◦ Better autoscaling stability
  ◦ Shorter scale-from-zero time
  ◦ Validating Webhook
  ◦ Performance measurement / SLA introduction

Knative Feature Tracks

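Autoscaling
• Node local scheduling
  ◦ Do custom scheduling only under high load or when scaling from zero
    ▪ Pod scaling depends entirely on the K8S control plane
    ▪ Place an Activator on every node
    ▪ The Activator instructs the kubelet to scale via kube-apiserver
  ◦ If this still does not bring a significant improvement, optimize the kubelet
• Request load balancing and routing
  ◦ Turn the Activator into more of a load balancer

Autoscaling
• Pipeline cold start
  ◦ Improve response time when Knative Services are deeply nested
    ▪ KService A -> KService B -> KService C -> ...
  ◦ Could trace information be used to start Pods ahead of time?
• Merge KPA to HPA
  ◦ Port features to the HPA where needed
  ◦ Build the autoscaling mechanism on the HPA
  ◦ Lower the maintenance cost of the custom KPA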

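Networking
• Extending how traffic is routed to Revisions
  ◦ Feature-flag routing based on HTTP request headers, and so on
• Zero to 1M QPS
  ◦ Be able to go from zero to handling 1M QPS
  ◦ Document the Istio/Knative resources this requires
• Replacing the Activator
  ◦ Replace the Activator with Envoy
    ▪ Besides request buffering and load balancing, it also sends metrics
  ◦ Implement a Mixer adapter
• Replacing queue-proxy
  ◦ Replace queue-proxy with Envoy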

Off-topic: CNCF Donation
https://groups.google.com/forum/#!topic/knative-dev/YmL2vgMC4rc
> Since the start of the Knative project, there have been questions about whether Knative would be donated to a foundation, such as CNCF. Google leadership has considered this, and has decided not to donate Knative to any foundation for the foreseeable future.

Off-topic: CNCF Donation
https://twitter.com/brendandburns/status/1179176440647913472

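Off-topic: CNCF Donation
• The announcement that Knative will not be donated to the CNCF triggered distrust of the Steering Committee (SC)...
  ◦ The SC's discussions and decision process are not visible from outside
  ◦ 4 of the 6 SC seats are held by Google
  ◦ How are SC members chosen based on each organization's contributions?
• This is reportedly under discussion in the Steering Committee, with an announcement expected soon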

Knative Adoption
• Why Knative?
• High-level architecture walkthrough
• gRPC load balancing
• Strategy for upgrading EKS cluster
• Event driven system
• Resource requests and limits

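Why Knative?
• An abstraction over Kubernetes
  ◦ Kubernetes is hard and has a high learning cost
  ◦ A reasonably opinionated abstraction of K8S that also covers the network layer
  ◦ Who should bear the burden?
• Escaping YAML hell
  ◦ A different approach from Helm and kustomize
• Why being able to scale to zero matters
  ◦ Availability over performance (performance < availability)
• Dual stack
  ◦ Knative and Kubernetes resources can be used side by side
• Better portability
  ◦ Deployable anywhere Kubernetes runs, and easy to set up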

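• Clusters are upgraded Blue/Green
  ◦ Global Accelerator distributes requests at the cluster level
  ◦ Switching DNS records was not adopted because fast rollback has priority

• Because Global Accelerator is used, the cloud load balancer is a Network Load Balancer (NLB)
• TLS is terminated at the NLB
• mTLS is not used

• The HTTP -> HTTPS redirect is configured on the Istio Gateway resource

• To do gRPC load balancing in the Envoy proxy, the mesh features are reluctantly enabled (discussed later)
• For HTTP/1.1 with random distribution, a service mesh is not required

• SQS is polled periodically and the app's HTTP endpoint is called
• This is the part where we wanted to use Knative Eventing (discussed later)

• Scale-to-zero is used only for the event-driven processing
• Due to a constraint, concurrency should always be fixed at 1 (discussed later)

• Everything else does not use scale-to-zero
  ◦ It is strictly a mechanism for improving availability
  ◦ We might use it if scale-from-zero latency drops below 1 second...?

• Metrics and logs are collected with DaemonSets
  ◦ The sidecar pattern is not available
  ◦ It also has resource-sharing issues, and it may be less common these days...?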

gRPC load balancing
• gRPC (HTTP/2) reuses connections
  ◦ With a K8S Service (HTTP/1.1), connections to the backends become skewed
• Approaches to gRPC load balancing
  ◦ Client-side
    ▪ Headless Services have to be managed by yourself
  ◦ Sidecar pattern
    ▪ With Istio, the Automatic Sidecar Injection feature is required
      • Istio does not seem all that pluggable
      • The cost of operating a service mesh
    ▪ Use a lighter-weight option than Istio (Gloo, Ambassador)
      • Maturity is a concern
      • Knative's E2E/performance tests for them are not thorough
We are forced to adopt Istio with Service Mesh...!

gRPC load balancing
• If the Activator did gRPC load balancing, wouldn't the mesh be unnecessary!?
  ◦ It also acts as an L7 load balancer (Throttler)
  ◦ HTTP/2 without TLS is supported too
  ◦ The Activator can be kept in the request path at all times (target-burst-capacity: "-1")
  ◦ Requests go to whichever of Pod IP and ClusterIP becomes available first
  ◦ In v0.9.0, ClusterIP takes precedence over Pod IP
• Better Load Balancing in Activator
  ◦ Prefer Pod IPs to ClusterIPs when possible #5820
    ▪ Requests are sent preferentially to healthy Pod IPs
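
The point of preferring Pod IPs is that an L7 proxy can then spread individual requests across Pods, whereas a single ClusterIP hides the Pods behind one address. The Python sketch below illustrates that selection rule with plain round-robin. `Balancer` is a hypothetical class, the IP addresses are placeholders, and round-robin is a stand-in rather than the Activator's actual policy.

```python
from itertools import cycle


class Balancer:
    """Per-request balancing that prefers Pod IPs over the ClusterIP."""

    def __init__(self, pod_ips, cluster_ip):
        # Use individual Pods when they are known; otherwise fall back
        # to the single ClusterIP.
        self.destinations = list(pod_ips) if pod_ips else [cluster_ip]
        self._next = cycle(self.destinations)

    def pick(self):
        """Choose the destination for one request (round-robin)."""
        return next(self._next)


# Placeholder addresses for illustration.
balancer = Balancer(["10.0.0.1", "10.0.0.2"], cluster_ip="10.96.0.10")
picks = [balancer.pick() for _ in range(4)]
```

Because the choice is made per request rather than per connection, a long-lived gRPC (HTTP/2) connection from a client no longer pins all of its requests to one Pod.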

Strategy for upgrading EKS cluster
• Problems with Istio
  ◦ Upgrades are the hard part
    ▪ https://istio.io/docs/setup/upgrade/steps/
      > The upgrade process may install new binaries and may change configuration and API schemas. The upgrade process may result in service downtime. To minimize downtime, please ensure your Istio control plane components and your applications are highly available with multiple replicas.
  ◦ The LTS cycle is fast
    ▪ https://istio.io/about/release-cadence/
      > Support is provided until 3 months after the next LTS

Strategy for upgrading EKS cluster
• Problems with Knative
  ◦ Care is taken around upgrades
    ▪ https://docs.google.com/document/d/1GXDe6163lO8ohtLUMPCplbjm3OTpYo_4U8f9K1fgZPo
    ▪ It states explicitly which version can be upgraded to which
    ▪ E2E tests for Knative upgrades are planned as well
  ◦ Downgrades are supported only to the previous version
  ◦ Multiple API versions are supported
    ▪ https://docs.google.com/presentation/d/1mOhnhy8kA4-K9Necct-NeIwysxze_FUule-8u5ZHmwA

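Strategy for upgrading EKS cluster
• Upgrade clusters Blue/Green
  ◦ We do not want to upgrade Istio/Knative on a cluster serving production traffic
  ◦ Global Accelerator provides temporary routing at the EKS cluster level
    ▪ Traffic is distributed to the NLBs that connect the inside and outside of each cluster
• Migrating Knative Services from the existing cluster
  ◦ A Revision is immutable and cannot be created directly
    ▪ It goes through the Configuration resource
  ◦ Revision history is not managed in the Configuration
  ◦ We want to migrate Revisions that receive no traffic as well

$ kn service migrate --source-context xxx:xxxxxxx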

Event driven system
• We want to run specific processing (a Function) driven by SQS queue messages
  ◦ Lambda
  ◦ ECS on EC2/Fargate
  ◦ Knative Eventing
  ◦ K8S Deployment + custom scheduler
• Concerns
  ◦ We do not want the deployment method to differ from other apps such as API servers
    ▪ Cost of building and maintaining the deploy flow
    ▪ Learning curve of using the service
  ◦ Customizability of the base image for the Function's dependencies
  ◦ Maturity of Knative Eventing
    ▪ Still far from production-ready...
Decision: K8S Deployment + custom scheduler. Ultimately we want to use Knative Eventing...

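Event driven system
• Constraints on the subscriber (the processing part) of SQS events
  ◦ The subscriber must not process in parallel
    ▪ Specify `containerConcurrency: 1` and `maxScale: 1`
• What happens when more requests than that arrive?
  ◦ Excess requests wait in the queue-proxy queue
  ◦ When processing finishes, the message is deleted from the SQS queue
  ◦ When the previous request finishes, queue-proxy forwards the next one to the subscriber
  ◦ If the subscriber gets stuck, the scheduler times out
  ◦ The order in which the queue is processed does not matter
  ◦ The scheduler keeps picking up messages from SQS and resending them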
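
The effect of `containerConcurrency: 1` can be pictured as a semaphore in front of the subscriber: one request is let through, the rest wait their turn. The asyncio sketch below demonstrates that behavior in miniature. It is an analogy written for this transcript, not queue-proxy code; `handle`, `main`, and the 10 ms sleep standing in for the subscriber's work are all made up for the example.

```python
import asyncio


async def handle(limit, name, done, state):
    """Process one request, waiting while the concurrency limit is reached."""
    async with limit:                     # excess requests queue up here
        state["in_flight"] += 1
        state["max_seen"] = max(state["max_seen"], state["in_flight"])
        await asyncio.sleep(0.01)         # stand-in for the subscriber's work
        done.append(name)
        state["in_flight"] -= 1


async def main(container_concurrency=1, requests=5):
    limit = asyncio.Semaphore(container_concurrency)
    done, state = [], {"in_flight": 0, "max_seen": 0}
    await asyncio.gather(
        *(handle(limit, f"msg-{i}", done, state) for i in range(requests))
    )
    return done, state["max_seen"]


done, max_seen = asyncio.run(main())
```

With a limit of 1, `max_seen` never exceeds 1 even though five requests arrive at once, which is the property the SQS subscriber relies on.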

Resource requests and limits
• [WIP]: Total resources.requests of Istio/Knative
  ◦ Assuming ~ 1k rps at the Istio IngressGateway
    ▪ min: 7.65 vCPU
    ▪ min: 10.2 Gi
• [WIP]: Worker node size and count
  ◦ c5.2xlarge (8 vCPU, 16 GB RAM) x 3 nodes (> 1178.83 $)
  ◦ Cluster Autoscaler scales up to a maximum of 10 nodes
    ▪ Nodes hosting Knative components are not scaled down

Default setting

Container            min.  max.  requests           limits
                                 cpu     memory     cpu     memory
Activator            1     20    300m    60Mi       1000m   600Mi
Autoscaler           1     1     30m     40Mi       300m    400Mi
Knative Webhook      1     1     20m     20Mi       200m    200Mi
Knative Controller   1     1     100m    100Mi      1000m   1000Mi
Networking Istio     1     1     100m    100Mi      1000m   1000Mi
ClusterLocalGW       2     4     3000m   2048Mi     3000m   2048Mi
IngressGateway       2     4     3000m   2048Mi     3000m   2048Mi
Pilot                2     5     100m    128Mi      2000m   1024Mi
Telemetry            1     5     1000m   1024Mi     4800m   4096Mi
Policy               1     5     100m    128Mi      2000m   1024Mi
queue-proxy          ?     ?     25m     -          -       -
istio-proxy          ?     ?     100m    128Mi      2000m   1024Mi
user-container       ?     ?     ?       ?          ?       ?

Our planned settings for a future load test

Container            min.  max.  requests           limits
                                 cpu     memory     cpu     memory
Activator            1     5     300m    600Mi      -       600Mi
Autoscaler           1     1     30m     40Mi       -       400Mi
Knative Webhook      1     1     20m     20Mi       -       200Mi
Knative Controller   1     1     100m    100Mi      -       1000Mi
Networking Istio     1     1     100m    100Mi      -       1000Mi
ClusterLocalGW       2     4     1000m   1024Mi     -       1024Mi
IngressGateway       2     4     1000m   1024Mi     -       1024Mi
Pilot                2     5     1000m   2048Mi     -       2048Mi
Telemetry            1     5     1000m   1024Mi     -       1024Mi
Policy               1     5     100m    128Mi      -       1024Mi
queue-proxy          ?     ?     ?       ?          -       ?
istio-proxy          ?     ?     100m    128Mi      -       512Mi
user-container       ?     ?     ?       ?          ?       ?
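
As a cross-check, multiplying each row's minimum replica count by its requests in the planned-settings table and summing the results reproduces the "min: 7.65 vCPU" figure from the earlier slide, and gives 10204 Mi of memory, which is consistent with the quoted 10.2 Gi if that number was obtained by dividing by 1000. That this is how the totals were derived is an assumption. Rows with unknown values (queue-proxy, istio-proxy, user-container) are left out.

```python
# (min replicas, cpu request in millicores, memory request in Mi),
# copied from the planned-settings table; rows with "?" are omitted.
planned = {
    "Activator":          (1, 300, 600),
    "Autoscaler":         (1, 30, 40),
    "Knative Webhook":    (1, 20, 20),
    "Knative Controller": (1, 100, 100),
    "Networking Istio":   (1, 100, 100),
    "ClusterLocalGW":     (2, 1000, 1024),
    "IngressGateway":     (2, 1000, 1024),
    "Pilot":              (2, 1000, 2048),
    "Telemetry":          (1, 1000, 1024),
    "Policy":             (1, 100, 128),
}

total_cpu_m = sum(n * cpu for n, cpu, _ in planned.values())
total_mem_mi = sum(n * mem for n, _, mem in planned.values())

# Capacity of the 3 x c5.2xlarge baseline from the sizing slide.
node_cpu_m = 3 * 8 * 1000

print(f"requests: {total_cpu_m / 1000} vCPU, {total_mem_mi} Mi")
print(f"cpu left for workloads: {(node_cpu_m - total_cpu_m) / 1000} vCPU")
```

On this reading, the platform components alone reserve roughly a third of the CPU of the three-node baseline before any user container is scheduled.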

Off-topic: Envoy as first-class citizens
https://twitter.com/mattklein123/status/1179455833265987584

Off-topic: Knative Sub Projects
• Serving Operator
  ◦ Manages Knative Serving from installation through updates
• Client (kn)
  ◦ A CLI tool and library for working with Knative/Tekton resources
• Observability
  ◦ Forwards metrics and logs to external services, based on Telegraf/Fluent Bit
• ko
  ◦ Builds (like Jib in Java) and deploys Go apps from an import path
• kf
  ◦ CF (Cloud Foundry) implemented on top of Knative

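Off-topic: Skenario
• A simulator for the Knative Pod Autoscaler (KPA)
• Lets you check behavior in the simulator after changing the scaler code
• Might be usable for the HPA too!? Under discussion, including a donation to the CNCF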

Knative References
• Knative Serving API Specification
• Knative Runtime Contract
• [Tech Talk] Knative Serving API 101
• [TechTalk] Knative Autoscaling
• [Blog] Knative v0.3 Autoscaling — A Love Story
• [Video] Knative: Scaling From 0 to Infinity
• 2019 Autoscaling Roadmap
• Knativeで実現するKubernetes上のサーバーレスアーキテクチャ
• Knative serverless Kubernetes bypasses FaaS to revive PaaS
• Serverless On Your Own Terms Using Knative

Ingredients for Serverless
• Set of primitives (build, events, serving...)
• Solves for modern development patterns
• Implements learnings from Google, partners

Simple
• 1-click build and deploy, configurable when you need to
• Source to container or URL safely within your cluster

Extendable
• Easy to configure event sources
• Pluggable event bus and persistence

Automatic
• Automatically deploys containers and provisions ingress
• Scale based on requests
• Scale down to zero

github.com/knative
Join Knative community: knative/docs/community
Have questions? Knative.slack.com
Knative News? @KnativeProject
