Slide 1

Slide 1 text

KloudNFV: Declarative and Hierarchical Software-Defined Networking Platform using Kubernetes Extension
LINE Corporation
Hiroki Shirokura, Hirofumi Ichihara

Slide 2

Slide 2 text

TL;DR (claims)
(1) A declarative and hierarchical SDN C-plane shortens the “VNF life-cycle” to “hours”
(2) A “short VNF life-cycle” lowers “Engineering Cost” in the commercial world
(3) By respecting the K8s design principles, we can get (1) without a big cost

Slide 3

Slide 3 text

Virtual Private Cloud Networking in Production

[Diagram labels: Service A (net-a1, net-a2), Service B (net-b1), Data Center, Server, Computing Service, VM, ASBR, DCI-BB, internet, Other Region, VPN, Collaborator]

Networking Requirements
● Isolation & Routing
● Function (NAT, ACL, Mirror, S2S-VPN, etc.)

Operation & Software Requirements
● Reliability & Scalability
● Many Mid Software Upgrades
● Fundamental Internal System upgrade
● Efficiency for development

[Diagram labels: SDN DB, Controller (x3), Data Model (x4); VFP, Orion, Zeta (NSDI ’17, ’18, ’22); DSR (NSDI ’21); Orion (NSDI ’21); ONIX (OSDI ’10); Un-revealed: This Research]

Slide 4

Slide 4 text

Many Mid Software Upgrades in the Commercial Case
● example1: Flow Metering with ACL Action
● example2: Dedicated/Shared Cluster option
[Diagram labels: user vm, App1, App2, dplane, On-demand Isolation]

Slide 5

Slide 5 text

Data-Plane Overview
● How the networking requirements are met:
○ Isolation -> SRv6 L3VPN using a custom Neutron plugin
○ Routing -> VM-based Router-VM (a normal VM from OpenStack's viewpoint)
○ Functions -> Linux networking features (tc, netfilter, eBPF, vti, netns, FRR, Libreswan, etc.)
● Each Router-VM lives in a single failure domain; the control plane creates the Router-VMs in different failure domains

Slide 6

Slide 6 text

Summary: VM Based vRouter Cluster
Each VM is just an OpenStack VM!

Slide 7

Slide 7 text

Summary: VM Based vRouter Cluster
Network interfaces are created by Neutron and connect the VM to each network

Slide 8

Slide 8 text

Summary: VM Based vRouter Cluster
A single VM is a SPoF

Slide 9

Slide 9 text

Summary: VM Based vRouter Cluster
So the Router VM must be redundant

Slide 10

Slide 10 text

Summary: VM Based vRouter Cluster
Endpoint1 is damaged by a failure-domain outage. Availability -> 66%

Slide 11

Slide 11 text

Summary: VM Based vRouter Cluster
Endpoint1 is updated as “service-out”. Availability -> 100%
[Diagram label on Endpoint1: Service OUT]
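The availability figures on these slides can be reproduced with a small calculation. The sketch below assumes three endpoints and traffic spread evenly over every endpoint that is still in service; both the `Endpoint` class and the even-spread model are illustrative assumptions, not part of KloudNFV.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool      # False when its failure domain is down
    in_service: bool   # False once the endpoint is marked "service-out"


def availability(endpoints):
    """Fraction of traffic that reaches a healthy endpoint.

    Assumes traffic is spread evenly over all in-service endpoints.
    """
    serving = [e for e in endpoints if e.in_service]
    if not serving:
        return 0.0
    return sum(e.healthy for e in serving) / len(serving)


cluster = [
    Endpoint("Endpoint1", healthy=False, in_service=True),  # failure-domain outage
    Endpoint("Endpoint2", healthy=True, in_service=True),
    Endpoint("Endpoint3", healthy=True, in_service=True),
]

before = availability(cluster)   # one of three in-service endpoints is broken
cluster[0].in_service = False    # mark Endpoint1 as "service-out"
after = availability(cluster)    # only healthy endpoints still receive traffic
```

Here `before` is 2/3 (the roughly 66% on the slide) and `after` is 1.0, because the failed endpoint no longer attracts traffic.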

Slide 12

Slide 12 text

Summary: VM Based vRouter Cluster
Endpoint1 is updated as “service-out”. Availability -> 100%
Next step: how to construct and manage the vRouter cluster as a managed service

Slide 13

Slide 13 text

Control-Plane Overview “KloudNFV”
● Generic control-plane platform for private cloud networking at LINE
● Design Concept
○ All APIs are represented as K8s CRDs (CRUD only)
○ All controllers are implemented as plain K8s custom controllers
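A custom controller in this style usually runs a level-triggered reconcile loop: it compares the declared state with what currently exists and issues only create and delete calls. The sketch below is a minimal illustration with an in-memory set standing in for the K8s API server; the `reconcile` function and the resource names are assumptions for illustration, not KloudNFV code.

```python
def reconcile(desired, observed):
    """Return (to_create, to_delete) so that observed converges to desired.

    Both arguments are sets of child resource names.
    """
    to_create = sorted(desired - observed)
    to_delete = sorted(observed - desired)
    return to_create, to_delete


# In-memory stand-in for the API server's view of child resources.
observed = {"gw1-ep1", "gw1-ep9"}
desired = {"gw1-ep1", "gw1-ep2", "gw1-ep3"}

to_create, to_delete = reconcile(desired, observed)
for name in to_create:
    observed.add(name)      # a real controller would call the create API here
for name in to_delete:
    observed.discard(name)  # ... and the delete API here

# Once the state has converged, reconciling again changes nothing.
second_pass = reconcile(desired, observed)
```

Because the function only looks at current state, it can be re-run safely after a crash or a missed event.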

Slide 14

Slide 14 text

Resources and Controllers
● Gateway: HA-aware cluster of Endpoints
● Endpoint: network function in a single failure domain
● NfvMachine: modular virtual-router base abstraction for NFV purposes

Slide 15

Slide 15 text

NfvMachine
● Virtual router module in a single failure domain
● Represented as a normal OpenStack VM

Slide 16

Slide 16 text

NfvMachine

Slide 17

Slide 17 text

NfvMachine

Slide 18

Slide 18 text

NfvMachine does not need to care about the upper side
● Many upper-side kinds exist
● The Endpoint Controller becomes “dummy yaml translator logic” ← eliminates engineering

Slide 19

Slide 19 text

NfvMachine does not need to care about the upper side
● Many upper-side kinds exist
● The Endpoint Controller becomes “dummy yaml translator logic” ← eliminates engineering

(1) A declarative and hierarchical SDN C-plane shortens the “VNF life-cycle” to “hours”
(2) A “short VNF life-cycle” lowers “Engineering Cost” in the commercial world
(3) By respecting the K8s design principles, we can get (1) without a big cost, by inserting “deployment of nfv-stack” into the SDN

Slide 20

Slide 20 text

Gateway -> Endpoint -> NfvMachine
NfvMachine as a backend of Endpoint

Slide 21

Slide 21 text

Gateway -> Endpoint -> NfvMachine

RoutingGateway manifest (Watch):

    kind: RoutingGateway
    metadata:
      name: gw1
    spec:
      networks:
        - ext-network1
        - pri-network1
      endpoints:
        - ep1
        - ep2
        - ep3
      server:
        flavor: 3vCPU_4gbRAM
        image: vRouter

RoutingEndpoint manifests (Create):

    kind: RoutingEndpoint
    metadata:
      name: gw1-ep1
    spec:
      networks:
        - ext-network1
        - pri-network1
      server:
        flavor: 3vCPU_4gbRAM
        image: vRouter
    ---
    kind: RoutingEndpoint
    metadata:
      name: gw1-ep2
    spec:
      networks:
        - ext-network1
        - pri-network1
      server:
        flavor: 3vCPU_4gbRAM
        image: vRouter
    ---
    kind: RoutingEndpoint
    metadata:
      name: gw1-ep3
    spec:
      networks:
        - ext-network1
        - pri-network1
      server:
        flavor: 3vCPU_4gbRAM
        image: vRouter
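The Gateway-to-Endpoint step can be read as a pure function from one manifest to several. The sketch below uses plain dictionaries shaped like the manifests on this slide; `expand_gateway` is a hypothetical name, and the naming rule `<gateway>-<endpoint>` is inferred from the `gw1-ep1` style names shown here rather than confirmed by the source.

```python
import copy


def expand_gateway(gateway):
    """Build one RoutingEndpoint manifest per entry in spec.endpoints."""
    gw_name = gateway["metadata"]["name"]
    spec = gateway["spec"]
    return [
        {
            "kind": "RoutingEndpoint",
            "metadata": {"name": f"{gw_name}-{ep}"},
            "spec": {
                # Each endpoint inherits the gateway's networks and server settings.
                "networks": list(spec["networks"]),
                "server": copy.deepcopy(spec["server"]),
            },
        }
        for ep in spec["endpoints"]
    ]


gateway = {
    "kind": "RoutingGateway",
    "metadata": {"name": "gw1"},
    "spec": {
        "networks": ["ext-network1", "pri-network1"],
        "endpoints": ["ep1", "ep2", "ep3"],
        "server": {"flavor": "3vCPU_4gbRAM", "image": "vRouter"},
    },
}

endpoints = expand_gateway(gateway)
endpoint_names = [e["metadata"]["name"] for e in endpoints]
```

Keeping the expansion free of side effects makes it easy to test separately from the API calls that create the resulting objects.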

Slide 22

Slide 22 text

Gateway -> Endpoint -> NfvMachine

[Diagram label: Routing Gateway Manifests]

RoutingEndpoint manifests (Watch), one each for gw1-ep1, gw1-ep2, and gw1-ep3, differing only in metadata.name:

    kind: RoutingEndpoint
    metadata:
      name: gw1-ep1
    spec:
      networks:
        - ext-network1
        - pri-network1
      server:
        flavor: 3vCPU_4gbRAM
        image: vRouter

NfvMachine manifests (Create), one each for gw1-ep1, gw1-ep2, and gw1-ep3, differing only in metadata.name:

    kind: NfvMachine
    metadata:
      name: gw1-ep1
    spec:
      networks:
        - ext-network1
        - pri-network1
      server: {...}
      containers: {...}
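The Endpoint-to-NfvMachine step is mostly a field-by-field translation. The sketch below shows one way that could look; `to_nfv_machine` and `CONTAINERS_BY_KIND` are hypothetical names, and the container values are placeholders because the slides elide them as `containers: {...}`.

```python
import copy

# Hypothetical mapping from upper-side kind to the containers run on the machine.
# The real values are elided on the slides, so these are placeholders only.
CONTAINERS_BY_KIND = {
    "RoutingEndpoint": {"router": {"image": "example/router"}},
}


def to_nfv_machine(endpoint):
    """Translate an upper-side Endpoint manifest into an NfvMachine manifest."""
    kind = endpoint["kind"]
    return {
        "kind": "NfvMachine",
        "metadata": {"name": endpoint["metadata"]["name"]},
        "spec": {
            "networks": list(endpoint["spec"]["networks"]),
            "server": copy.deepcopy(endpoint["spec"]["server"]),
            # Only this part depends on which upper-side kind is being translated.
            "containers": copy.deepcopy(CONTAINERS_BY_KIND[kind]),
        },
    }


endpoint = {
    "kind": "RoutingEndpoint",
    "metadata": {"name": "gw1-ep1"},
    "spec": {
        "networks": ["ext-network1", "pri-network1"],
        "server": {"flavor": "3vCPU_4gbRAM", "image": "vRouter"},
    },
}

machine = to_nfv_machine(endpoint)
```

Under this assumption, supporting a new upper-side kind means adding a table entry rather than writing new machine-management logic.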

Slide 23

Slide 23 text

Gateway -> Endpoint -> NfvMachine

[Diagram: Routing Gateway Manifests -> three Routing Endpoint Manifests -> (Create) -> NfvMachine manifests (Watch)]

NfvMachine manifests for gw1-ep1, gw1-ep2, and gw1-ep3 (partially hidden on the slide):

    kind: NfvMachine
    metadata:
      name: gw1-ep1
    spec:
      networks:
        - ext-network1
        - pri-network1
      ..(snip)..

Slide 24

Slide 24 text

Gateway -> Endpoint -> NfvMachine

[Diagram: Routing Gateway Manifests -> three Routing Endpoint Manifests -> (Create) -> NfvMachine manifests (Watch)]

NfvMachine manifests for gw1-ep1, gw1-ep2, and gw1-ep3 (partially hidden on the slide):

    kind: NfvMachine
    metadata:
      name: gw1-ep1
    spec:
      networks:
        - ext-network1
        - pri-network1
      ..(snip)..

Slide 25

Slide 25 text

RollingUpdate
● RollingUpdate Controller Capability
○ policy: which action to take for each “Endpoint”
● Endpoint Capability
○ maintenance mode: stop routing advertisement, etc.

Slide 26

Slide 26 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}

(1) create
(2) watch

Slide 27

Slide 27 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

(3) status set

Slide 28

Slide 28 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

(4) Update Endpoint1 to maint-mode
[Diagram labels: Maintenance Mode]

Slide 29

Slide 29 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

(5) Delete the child NfvMachine
[Diagram labels: Maintenance Mode, Delete]

Slide 30

Slide 30 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: WAIT}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

(6) Wait for boot-up
(7) The RoutingEndpoint controller (RE Ctrlr) reconciles and creates the NfvMachine for the Endpoint
[Diagram labels: Maintenance Mode, Creating]

Slide 31

Slide 31 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: WAIT}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

[Diagram labels: RE Ctrlr, Maintenance Mode, Created]

Slide 32

Slide 32 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: FINISHED}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

(8) Disable maintenance mode
[Diagram labels: Maintenance Mode]

Slide 33

Slide 33 text

RollingUpdate

    kind: RollingUpdate
    metadata:
      name: update-20221015
    spec:
      policy: machine-recreate
      preAction: maint-mode
      target:
        kind: RoutingGateway
        name: Gateway
      params: {...}
    status:
      childTargets:
        - {kind: RoutingEndpoint, name: Endpoint1, state: FINISHED}
        - {kind: RoutingEndpoint, name: Endpoint2, state: NOT_STARTED}
        - {kind: RoutingEndpoint, name: Endpoint3, state: NOT_STARTED}

The same procedure is applied to the remaining Endpoints -> Done
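Steps (3) to (8) of the walkthrough can be summarized as a sequential loop over the child Endpoints. The sketch below replays that loop against a recording stub; `rolling_update` and `RecordingActions` are hypothetical names, and a real controller would drive these steps through watches and status updates rather than a blocking loop.

```python
def rolling_update(endpoint_names, actions):
    """Process endpoints one at a time and return the final childTargets list."""
    # (3) status set: every child target starts as NOT_STARTED.
    status = [
        {"kind": "RoutingEndpoint", "name": name, "state": "NOT_STARTED"}
        for name in endpoint_names
    ]
    for target in status:
        name = target["name"]
        actions.set_maintenance(name, True)    # (4) preAction: maint-mode
        actions.delete_machine(name)           # (5) delete the child NfvMachine
        target["state"] = "WAIT"
        actions.wait_ready(name)               # (6)(7) recreated machine boots up
        actions.set_maintenance(name, False)   # (8) disable maintenance mode
        target["state"] = "FINISHED"
    return status


class RecordingActions:
    """Stub that records calls instead of touching any real system."""

    def __init__(self):
        self.log = []

    def set_maintenance(self, name, enabled):
        self.log.append((name, "maint-on" if enabled else "maint-off"))

    def delete_machine(self, name):
        self.log.append((name, "delete-machine"))

    def wait_ready(self, name):
        self.log.append((name, "wait-ready"))


actions = RecordingActions()
final_status = rolling_update(["Endpoint1", "Endpoint2", "Endpoint3"], actions)
```

The recorded log shows that Endpoint2 is not touched until Endpoint1 has left maintenance mode, which is what keeps at most one endpoint out of service at a time.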

Slide 34

Slide 34 text

Growing Journey of RollingUpdate
● 2021.09 initial release
○ machine-recreate
● 2022.05 policy abstraction
○ container-refresh
○ loader-container-refresh
● 2022.08 no-maint-mode
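One common way to implement a policy abstraction like this is a registry that maps policy names to step generators. The sketch below is only an illustration: the step descriptions are assumptions (the slides do not say what `container-refresh` does internally), and modelling no-maint-mode as a `maint_mode=False` flag is likewise a guess.

```python
POLICIES = {}


def policy(name):
    """Decorator that registers a step generator under a policy name."""
    def register(fn):
        POLICIES[name] = fn
        return fn
    return register


@policy("machine-recreate")
def machine_recreate(endpoint):
    return [
        f"delete NfvMachine of {endpoint}",
        f"wait for new NfvMachine of {endpoint}",
    ]


@policy("container-refresh")
def container_refresh(endpoint):
    # Assumption: containers are refreshed in place, without recreating the VM.
    return [f"refresh containers on {endpoint}"]


def plan(policy_name, endpoint, maint_mode=True):
    """Return the ordered steps for one endpoint under the given policy."""
    steps = POLICIES[policy_name](endpoint)
    if maint_mode:
        steps = [
            f"enable maint-mode on {endpoint}",
            *steps,
            f"disable maint-mode on {endpoint}",
        ]
    return steps


recreate_plan = plan("machine-recreate", "Endpoint1")
refresh_plan = plan("container-refresh", "Endpoint1", maint_mode=False)
```

With this shape, adding a policy such as `loader-container-refresh` would mean registering one more function, leaving the per-endpoint loop unchanged.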

Slide 35

Slide 35 text

Growing Journey of RollingUpdate
● 2021.09 initial release
○ machine-recreate
● 2022.05 policy abstraction
○ container-refresh
○ loader-container-refresh
● 2022.08 no-maint-mode

Slide 36

Slide 36 text

Production Experience
● Service development lead time -> ½

Slide 37

Slide 37 text

Production Experience
● Service development lead time -> ½

Slide 38

Slide 38 text

In case of Previous Project (2020.01~)
Timeline: Jan, Feb, Mar, Apr, May, Jun
● Day1 (Service Development): Project Start, System Design, System Implement, Test-Env Release, Real-Env Release, Operation design
○ Why / What / How
○ base network technology verification
○ base distributed system technology verification
○ Operation Kit (ansible playbooks), Operation Manual, Service Level Objective
○ System Development: Base component (apiserver, information-transfer, database manipulator), SDN algorithm
● Day2: Daily/Weekly Task, Customer Support, Encourage mechanism to another member, Additional Feature and Improvement

Slide 39

Slide 39 text

In case of Previous Project (2020.01~)
Timeline: Jan, Feb, Mar, Apr, May, Jun
● Day1 (Service Development): Project Start, System Design, System Implement, Test-Env Release, Real-Env Release, Operation design
● Day2: Daily/Weekly Task, Customer Support, Encourage mechanism to another member, Additional Feature and Improvement
● Day1-Cost: Development, Operation
● Day2-Cost: Development, Operation

Slide 40

Slide 40 text

In case of Previous Project (2020.01~)
Timeline: Jan, Feb, Mar, Apr, May, Jun
● Day1 (Service Development): Project Start, System Design, System Implement, Test-Env Release, Real-Env Release, Operation design
● Day2

In case of KloudNFV (2020.09~)
● Day1: Project Start, System Design, System Implement, Test-Env Release, Real-Env Release, Operation design
● Day2

Slide 41

Slide 41 text

Production Experience (including next issues)
● Service development lead time -> ½
● K8s storage limitation
○ Can not only VPC resources but also LB, DNS, and other resources be stored in a single K8s cluster, or not?
○ etcd has an 8GB storage limit, and many resources make the controllers slower
● Noisy-neighbor effects on NfvMachine VMs
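The storage concern can be sized with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the object counts and the 4 KiB average object size are invented for illustration, the 8GB limit from the slide is treated as 8 GiB, and the calculation ignores revision history, keys, and index overhead.

```python
ETCD_QUOTA_BYTES = 8 * 1024**3  # the 8GB limit quoted on the slide, taken as 8 GiB


def max_objects(avg_object_bytes, quota_bytes=ETCD_QUOTA_BYTES):
    """Upper bound on stored objects if each takes avg_object_bytes."""
    return quota_bytes // avg_object_bytes


def usage_fraction(object_counts, avg_object_bytes, quota_bytes=ETCD_QUOTA_BYTES):
    """Share of the quota used by the given per-kind object counts."""
    total_objects = sum(object_counts.values())
    return total_objects * avg_object_bytes / quota_bytes


# Hypothetical numbers, for illustration only.
hypothetical_counts = {"vpc": 50_000, "lb": 20_000, "dns": 30_000}
fraction = usage_fraction(hypothetical_counts, avg_object_bytes=4096)
ceiling = max_objects(4096)
```

An estimate like this only bounds raw storage; the controller slowdown mentioned on the slide depends on list and watch volume and would need to be measured separately.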

Slide 42

Slide 42 text

Conclusion
(1) A declarative and hierarchical SDN C-plane shortens the “VNF life-cycle” to “hours”
(2) A “short VNF life-cycle” lowers “Engineering Cost” in the commercial world
(3) By respecting the K8s design principles, we can get (1) without a big cost
[Diagram labels: A-Endpoint, Gateway, B-Endpoint, C-Endpoint, NfvMachine, RollingUpd