Slide 1

Slide 1 text

The Story of Revamping LINE's Network Orchestrator. LINE Corporation, 福田守昴

Slide 2

Slide 2 text

About Me
Subaru Fukuda
@LINE
・Network Orchestrator development
・White box NOS development
・Telecom Infra Project
・IoT Gateway firmware development
・IoT protocol stack development
・enterprise NOS test and release engineering
・test automation system development
Timeline: 2016.Apr - 2018.Apr, 2018.Mar - 2019.Sep, 2019.Oct - 2020.Oct, 2020.Nov - NOW

Slide 3

Slide 3 text

What is Verda?
• 80,000+ Virtual Machine
• 40,000+ Baremetal
• 6,000+ Hypervisor
NAT, Load Balancer, VM / Baremetal, MySQL, Elasticsearch, Image Repo, Shared Filesystem, DNS, App engine (like heroku), Controller, And More…

Slide 4

Slide 4 text

Underlay Network: "The Story of Redesigning LINE's Network from Scratch" @JANOG43

Slide 5

Slide 5 text

Multi Components Architecture

Slide 6

Slide 6 text

Problems
1) SCALABILITY
2) MULTIVENDOR
3) TRIGGER
4) BATCH CHANGE
5) HUMAN ERROR

Slide 7

Slide 7 text

SCALABILITY

Slide 8

Slide 8 text

Problem
Config Update Process
① Update Database
② Create Inventory
③ Apply Config (Run Ansible)

Slide 9

Slide 9 text

Problem
• The load on the Ansible server is high.
• Applying config takes a long time.
• Manual operations are required:
  • To update the database
  • To generate the inventory
  • To run Ansible

Slide 10

Slide 10 text

Agent Application 1:N 1:1

Slide 11

Slide 11 text

Agent Sync Config
Config Update Process
0) The agent watches the DB.
1) The operator updates the DB.
2) The agent detects the change.
3) The agent updates the config (runs Ansible).
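
The watch-and-apply flow above can be sketched as follows. This is only a minimal illustration, assuming a hypothetical in-memory store in place of the real DB; `InMemoryStore`, `Agent`, and the `applied` list (a placeholder for running Ansible) are names invented for this sketch.

```python
from typing import Callable, Dict, List


class InMemoryStore:
    """Hypothetical stand-in for the config DB (not a real DB client)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str, str], None]]] = {}

    def watch(self, key: str, callback: Callable[[str, str], None]) -> None:
        # Register a callback that fires when the value under `key` changes.
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key: str, value: str) -> None:
        changed = self._data.get(key) != value
        self._data[key] = value
        if changed:
            for callback in self._watchers.get(key, []):
                callback(key, value)


class Agent:
    """One agent per switch (1:1); it reacts only to its own key."""

    def __init__(self, switch: str, store: InMemoryStore) -> None:
        self.switch = switch
        self.applied: List[str] = []
        store.watch(switch, self.on_change)  # step 0: watch the DB

    def on_change(self, key: str, value: str) -> None:
        # Steps 2-3: the change is detected and the config is updated.
        # Appending to a list is a placeholder for running Ansible.
        self.applied.append(value)
```

With this shape the operator only performs step 1 (`store.put(...)`); nothing else is triggered by hand.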

Slide 15

Slide 15 text

Agent Deployment Process
PROVISION
• SOME INITIAL SETUP
• INSTALL Docker
• DEPLOY AGENT
ZTP SCRIPT
• SETUP FOR SSH
• PROVISION REQUEST

Slide 16

Slide 16 text

MULTI VENDOR

Slide 17

Slide 17 text

Problem
ARISTA / Cumulus Linux
The way config is applied differs between Cumulus Linux and ARISTA.

Slide 18

Slide 18 text

Data Flow (vendor agnostic / vendor specific)
Only the Ansible playbook should contain vendor-specific code.

Slide 19

Slide 19 text

Ansible Tag
Environment variable: nos={cumulus | arista}
Cumulus Linux: target: localhost, tag: cumulus
ARISTA: target: localhost, tag: arista

    - name: example-task1
      XXX:
        XXXARG: "example"
      tags: cumulus

    - name: example-task1
      XXX:
        XXXARG: "example"
      tags: arista
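
A minimal sketch of how an agent might pick the tag from the `nos` environment variable and build the `ansible-playbook` command line. `--tags` is a standard `ansible-playbook` option; the playbook name `site.yml` and the function name are assumptions made for this example.

```python
from typing import List, Mapping

SUPPORTED_NOS = ("cumulus", "arista")


def build_ansible_command(env: Mapping[str, str],
                          playbook: str = "site.yml") -> List[str]:
    """Return the ansible-playbook argv for the NOS named in env["nos"]."""
    nos = env.get("nos", "")
    if nos not in SUPPORTED_NOS:
        raise ValueError(f"unsupported nos: {nos!r}")
    # Only tasks tagged with the matching vendor tag are executed.
    return ["ansible-playbook", playbook, "--tags", nos]
```

In an agent this would typically be called with `os.environ` and the result passed to `subprocess.run`.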

Slide 21

Slide 21 text

Data Flow (vendor agnostic / vendor specific)
How did we realize a vendor-agnostic config parameter DB?

Slide 22

Slide 22 text

Config Parameter Sheet
1. SWITCH
   • hostname, os-version, server-room, etc.
2. INTERFACE
   • mac, speed, mtu, ip, etc.
3. BGP
   • AS, neighbor, peer-group
4. QOS
   • config for shaping
5. ROUTEMAP
   • ingress/egress routemap
6. PREFIXLIST
   • IPv4/IPv6 prefixlist

Slide 23

Slide 23 text

Config Parameter Sheet
EX) ROUTEMAP PARAMETER SHEET

    {
      "routemap-001": {
        "entries": [
          {
            "action": "permit",
            "sequence": 10,
            "set_actions": [
              {
                "action": "as-path prepend",
                "value": "auto auto auto auto auto"
              }
            ]
          }
        ]
      }
      …
    }

Slide 24

Slide 24 text

Config Parameter Sheet

    KEY                    VALUE (JSON FORMAT)
    SWITCH001/SWITCH       ...
    SWITCH001/INTERFACE    ...
    SWITCH001/BGP          ...
    SWITCH001/QOS          ...
    SWITCH001/ROUTEMAP     ...
    SWITCH001/PREFIXLIST   ...
    ...                    ...
    SWITCH002/SWITCH       ...
    SWITCH002/INTERFACE    ...
    SWITCH002/BGP          ...
    ...                    ...

Switches: switch001 ・・・
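
The key layout above (`<SWITCH>/<SHEET>`) can be handled with a couple of small helpers. This is a sketch only; the helper names are invented here, and the sheet types are the six listed in the Config Parameter Sheet slide.

```python
from typing import List, NamedTuple

SHEET_TYPES = ("SWITCH", "INTERFACE", "BGP", "QOS", "ROUTEMAP", "PREFIXLIST")


class SheetKey(NamedTuple):
    switch: str
    sheet: str


def make_key(switch: str, sheet: str) -> str:
    """Build a DB key such as 'SWITCH001/BGP'."""
    if sheet not in SHEET_TYPES:
        raise ValueError(f"unknown sheet type: {sheet!r}")
    return f"{switch}/{sheet}"


def parse_key(key: str) -> SheetKey:
    """Split a per-switch key back into (switch, sheet)."""
    switch, sep, sheet = key.partition("/")
    if not sep or not switch or sheet not in SHEET_TYPES:
        raise ValueError(f"not a per-switch sheet key: {key!r}")
    return SheetKey(switch, sheet)


def keys_for_switch(switch: str) -> List[str]:
    """All keys one switch's agent would need to watch."""
    return [make_key(switch, sheet) for sheet in SHEET_TYPES]
```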

Slide 25

Slide 25 text

SYNC-AGENT Handle Config Parameter Sheet
switch001 watches the config parameter DB:

    KEY                    VALUE (JSON FORMAT)
    SWITCH001/SWITCH       ...
    SWITCH001/INTERFACE    ...
    SWITCH001/BGP          ...
    SWITCH001/QOS          ...
    SWITCH001/ROUTEMAP     ...
    SWITCH001/PREFIXLIST   ...
    ...                    ...
    SWITCH002/SWITCH       ...
    SWITCH002/INTERFACE    ...
    SWITCH002/BGP          ...
    ...                    ...

Slide 26

Slide 26 text

SYNC-AGENT Handle Config Parameter Sheet
switch001 watches its own keys (SWITCH001/SWITCH, SWITCH001/INTERFACE, SWITCH001/BGP, SWITCH001/QOS, SWITCH001/ROUTEMAP, SWITCH001/PREFIXLIST) and holds one parameter sheet per type: SWITCH, INTERFACE, ROUTEMAP, PREFIXLIST, BGP, QOS.

Slide 27

Slide 27 text

SYNC-AGENT Handle Config Parameter Sheet
switch001 watches SWITCH001/SWITCH, SWITCH001/INTERFACE, SWITCH001/BGP, SWITCH001/QOS, SWITCH001/ROUTEMAP, and SWITCH001/PREFIXLIST (VALUE: JSON FORMAT).
0) SWITCH001/INTERFACE is updated.
1) The sync-agent detects the change and gets the INTERFACE config param sheet.
2) The sync-agent updates the switch config (runs Ansible).

Slide 30

Slide 30 text

SYNC-AGENT Handle Config Parameter Sheet
switch001 holds the sheets SWITCH, INTERFACE, ROUTEMAP, PREFIXLIST, BGP, QOS as CFG_PARAM_PATH/XXX.json.

playbook:

    - name: include config params
      include_vars:
        dir: "CFG_PARAM_PATH"
    ・・・

include_vars imports every config parameter sheet as Ansible vars.
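
A hedged sketch of the agent side of this handoff: writing a fetched parameter sheet to `CFG_PARAM_PATH/XXX.json` so the playbook's `include_vars` can pick it up. The function names and the write-then-rename step are assumptions of this example, not something the slides describe.

```python
import json
import tempfile
from pathlib import Path


def write_param_sheet(cfg_param_path: Path, sheet_name: str, sheet: dict) -> Path:
    """Write one parameter sheet as <cfg_param_path>/<sheet_name>.json."""
    cfg_param_path.mkdir(parents=True, exist_ok=True)
    target = cfg_param_path / f"{sheet_name}.json"
    tmp = cfg_param_path / f"{sheet_name}.json.tmp"
    tmp.write_text(json.dumps(sheet, indent=2), encoding="utf-8")
    # Rename last so a reader never sees a half-written JSON file.
    tmp.replace(target)
    return target


def demo() -> dict:
    """Round-trip a tiny, made-up INTERFACE sheet through a temp directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = write_param_sheet(Path(tmp_dir) / "cfg_params",
                                 "INTERFACE", {"mtu": 9000})
        return json.loads(path.read_text(encoding="utf-8"))
```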

Slide 31

Slide 31 text

Vendor Agnostic?
• Ensure operations
  • On Arista and Cumulus
  • On LINE’s Network
• Need to change schema
  • When we introduce new vendor switches.
  • When we change our network architecture drastically.

Slide 32

Slide 32 text

Yang Schema
RFC7951: JSON Encoding of Data Modeled with YANG
The parameter sheet schema (JSON) is defined in YANG.

EXAMPLE SCHEMA

    module interface {
      import ietf-inet-types { prefix "inet"; }
      import ietf-yang-types { prefix "yang"; }
      ...
      leaf mac_address { type yang:mac-address; }
      ...
      leaf-list ipv4 { type inet:ipv4-prefix; min-elements 0; }
      ...

Slide 33

Slide 33 text

Schema Driven Development
YANG (input) → Pyang → JSON Schema (output)
Pyang generates the JSON schema!

Slide 34

Slide 34 text

Schema Driven Development
CONFIG-PARAMETER CHANGE PROCESS
1. Update the schemas in YANG.
2. Generate JSON schemas from the YANG schemas.
3. Deploy the generated JSON schemas to the API server.
The API-SERVER validates the data just before updating etcd.
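
The "validate just before update" step can be sketched like this. It is a rough illustration under stated assumptions: the schema below is a made-up, heavily reduced stand-in for a generated JSON schema, the checker covers only `type`, `required`, and `properties` (a real API server would use a full JSON Schema validator), and a plain dict stands in for etcd.

```python
from typing import Any, Dict, List

# Hypothetical reduced schema for an INTERFACE sheet entry.
INTERFACE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["name", "mac_address"],
    "properties": {
        "name": {"type": "string"},
        "mac_address": {"type": "string"},
        "mtu": {"type": "integer"},
        "ip": {"type": "array"},
    },
}

_TYPES = {"object": dict, "array": list, "string": str}


def _type_ok(value: Any, type_name: str) -> bool:
    if type_name == "integer":
        # bool is a subclass of int in Python; reject it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, _TYPES[type_name])


def validate(data: Any, schema: Dict[str, Any]) -> List[str]:
    """Return a list of error strings; an empty list means valid."""
    if not _type_ok(data, schema["type"]):
        return [f"expected {schema['type']}"]
    errors: List[str] = []
    if schema["type"] == "object":
        for name in schema.get("required", []):
            if name not in data:
                errors.append(f"missing required property: {name}")
        for name, sub in schema.get("properties", {}).items():
            if name in data:
                errors.extend(f"{name}: {e}" for e in validate(data[name], sub))
    return errors


def update_store(store: Dict[str, Any], key: str, data: Any,
                 schema: Dict[str, Any]) -> None:
    """Write to the store only if the data passes validation."""
    errors = validate(data, schema)
    if errors:
        raise ValueError("; ".join(errors))
    store[key] = data
```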

Slide 35

Slide 35 text

DHCP Option
config parameter sheet:

    { ... "hostname": "SWITCH00X", "network_os": "cumulus", ... }
    { ... "name": "eth0", "type": "management", "mac_address": "xxxx.xxxx.xxxx", "ip": ["X.X.X.X/X"], ... }

update dhcpd.conf:

    ...
    - "mac": "xxxx.xxxx.xxxx",
    - "ip": "X.X.X.X/X"
    - "ztp-script option code": "XX",
    ...

Cumulus Linux / ARISTA: request / response
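
As an illustration of the "update dhcpd.conf" step, the sketch below renders a `host` entry from the two parameter sheets. Assumptions: the function names are invented, the dotted MAC notation is converted to the colon-separated form that `hardware ethernet` expects, and the ZTP script option is left out because the slide elides its option code.

```python
def to_colon_mac(dotted: str) -> str:
    """Convert 'xxxx.xxxx.xxxx' notation to 'xx:xx:xx:xx:xx:xx'."""
    digits = dotted.replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {dotted!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def render_host_entry(switch: dict, interface: dict) -> str:
    """Render one dhcpd.conf host block for a switch's management port."""
    mac = to_colon_mac(interface["mac_address"])
    # The sheet stores prefixes such as '192.0.2.10/24'; keep the address part.
    address = interface["ip"][0].split("/")[0]
    return (
        f"host {switch['hostname']} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {address};\n"
        "}\n"
    )
```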

Slide 36

Slide 36 text

TRIGGER

Slide 37

Slide 37 text

Operator Trigger
Config Update Process
0) The agent watches the DB.
1) The operator updates the DB.
2) The agent detects the change.
3) The agent updates the config (runs Ansible).

Slide 38

Slide 38 text

Problem
A BGP SESSION runs between the ToR SWITCH and the SERVER (BGPD RUNNING).

Slide 39

Slide 39 text

Problem
The ToR SWITCH uses a whitelist to filter unwanted prefixes coming from the SERVER (BGPD RUNNING).

Slide 40

Slide 40 text

Problem
Need to identify the switch a server is connecting to.

Slide 41

Slide 41 text

Connection Trigger
CONFIG PARAM DB (server-config parameter):

    KEY          VALUE
    ...          ...
    SERVER001    { "hostname": "SERVER001", "ipv4_prefixes": ["192.0.2.0/24"], "ipv6_prefixes": [] }
    ...          ...

1) Store the prefixes for SERVER001 in the CONFIG PARAM DB.
2) The SERVER connects to a switch.
3) The SYNC AGENT detects the connection of SERVER001 by LLDP.
4) The SYNC AGENT gets the prefixes for SERVER001.
5) The SYNC AGENT updates the whitelist (runs Ansible).
6) The SYNC AGENT watches SERVER001.
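
Steps 3 to 6 can be sketched as a single handler. This is an assumption-laden illustration: a dict stands in for the CONFIG PARAM DB, the class and method names are invented, and updating a Python set is a placeholder for running Ansible against the switch's prefix-list.

```python
from typing import Dict, Set


class ConnectionTriggerAgent:
    """Hypothetical sync-agent reacting to LLDP neighbor events."""

    def __init__(self, param_db: Dict[str, dict]) -> None:
        self.param_db = param_db
        self.whitelist: Set[str] = set()
        self.watched: Set[str] = set()

    def on_lldp_neighbor(self, hostname: str) -> bool:
        """Handle step 3; return True if the whitelist was updated."""
        params = self.param_db.get(hostname)      # step 4: Get (key=hostname)
        if params is None:
            return False                          # unknown server: no change
        prefixes = params["ipv4_prefixes"] + params["ipv6_prefixes"]
        self.whitelist.update(prefixes)           # step 5: update whitelist
        self.watched.add(hostname)                # step 6: watch the entry
        return True
```

The switch a server attaches to no longer has to be known in advance: whichever agent sees the server over LLDP pulls the prefixes.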

Slide 43

Slide 43 text

Connection Trigger
Step 3: the SYNC AGENT detects SERVER001 via LLDP.

Slide 44

Slide 44 text

Connection Trigger
Step 4: the SYNC AGENT gets the entry from the CONFIG PARAM DB: Get (key=SERVER001).

Slide 45

Slide 45 text

Connection Trigger
Step 5: the SYNC AGENT adds "192.0.2.0/24" to the prefix-list (runs Ansible).

Slide 46

Slide 46 text

Connection Trigger
Step 6: the SYNC AGENT watches SERVER001 (LLDP and the CONFIG PARAM DB).

Slide 47

Slide 47 text

BATCH CHANGE

Slide 48

Slide 48 text

Problem
Automation is good, but batch change is dangerous.

Slide 49

Slide 49 text

Grouping
We want to apply a config to multiple switches (a GROUP).

Slide 50

Slide 50 text

Group Config Parameter Sheet

    KEY             VALUE
    SW001/SWITCH    { "hostname": "SW001", "switch_groups": ["GRP-A"], … }
    ...             ...
    /SWGRP/GRP-A    { "ipv4_prefixes": [{"action": "deny", "prefix": "192.0.2.0/24"}], … }

Introduce a new config-param; type=group.
Multiple switches watch the group entry (SW001/SWITCH: watch x1, /SWGRP/GRP-A: watch x3), but batch change is dangerous.

Slide 51

Slide 51 text

Sync-Group Config Parameter Sheet

    KEY                      VALUE
    SW001/SWITCH             ...
    ...                      ...
    /SWGRP/GRP-A             ...
    ...                      ...
    /SWGRP/GRP-A/SYNC_GRP    …
    ...                      ...

Introduce a new config-param; type=group.
Also, introduce a new config-param; type=sync-group.
Multiple switches watch the sync-group entry.

Slide 52

Slide 52 text

Sync-Group Config Parameter Sheet
The sync-group config parameter (JSON) holds a group-sync-state-machine: one state-machine for each of the switches in the group.
STATE
• DONE
• NOT-YET
• SYNC

Slide 53

Slide 53 text

Group Sync
0) Every switch’s state is DONE.
1) The operator updates SWG-A’s config param.
2) The API-SERVER sets every switch’s state to NOT-YET.
3) The API-SERVER sets SW001’s state to SYNC.
4) The sync-agent fetches SWG-A’s config param.
5) The sync-agent updates SW001’s config.
6) The sync-agent sets SW001’s state to DONE.
7) The operator sets the others to SYNC.
8) The sync-agents fetch SWG-A’s config param.
9) The sync-agents update their switches’ config.
10) The sync-agents set both states to DONE.
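
The staged rollout above can be sketched as a small per-switch state machine. This is a simplified model under assumptions: the class and method names are invented, the first switch is called a "canary" only for this example, and fetching and applying the config is reduced to a state transition.

```python
from enum import Enum
from typing import Dict, Iterable


class State(Enum):
    DONE = "DONE"
    NOT_YET = "NOT-YET"
    SYNC = "SYNC"


class SyncGroup:
    """Hypothetical model of the sync-group entry: one state per switch."""

    def __init__(self, switches: Iterable[str]) -> None:
        self.states: Dict[str, State] = {sw: State.DONE for sw in switches}

    def start_change(self, canary: str) -> None:
        # Steps 2-3: hold everyone back, then release only the first switch.
        for sw in self.states:
            self.states[sw] = State.NOT_YET
        self.states[canary] = State.SYNC

    def agent_sync(self, switch: str) -> bool:
        # Steps 4-6 and 8-10: an agent applies the change only when its
        # own state is SYNC, then reports DONE.
        if self.states[switch] is not State.SYNC:
            return False
        self.states[switch] = State.DONE
        return True

    def release_rest(self) -> None:
        # Step 7: the operator lets the remaining switches proceed.
        for sw, state in self.states.items():
            if state is State.NOT_YET:
                self.states[sw] = State.SYNC
```

The point of the extra NOT-YET state is that a group-wide update never reaches every switch at once; the operator checks the first switch before step 7.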

Slide 59

Slide 59 text

How To Join Group
The switch_groups property in the SWITCH parameter sheet lists the groups the switch belongs to.

Slide 60

Slide 60 text

How To Join Group
1) The operator updates SW004’s switch_groups.
2) The sync-agent fetches the SWG-A param.
3) The sync-agent updates the switch config.
4) The sync-agent adds SW004’s state machine.
5) The sync-agent watches the sync-group parameter.

Slide 64

Slide 64 text

HUMAN ERROR

Slide 65

Slide 65 text

One Command Operation
• Ex) DEVICE ISOLATION

    $ device-isolation isolate SW-001 --level 1

Slide 66

Slide 66 text

Monitoring
• Every application we develop includes a Prometheus exporter function.
• Service discovery by Consul
• Slack notification
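
To give a feel for what an exporter function serves, here is a stdlib-only sketch that renders counters in the Prometheus text exposition format. The metric name in the usage note is hypothetical, and a real application would more likely use a Prometheus client library than format the text by hand.

```python
from typing import Dict, Tuple


def render_metrics(counters: Dict[str, Tuple[str, int]]) -> str:
    """Render {name: (help_text, value)} as Prometheus exposition text."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_metrics({"sync_agent_ansible_runs_total": ("Number of Ansible runs.", 3)})` yields the three lines a scrape of `/metrics` would return for that counter.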

Slide 67

Slide 67 text

CONCLUSION

Slide 68

Slide 68 text

LINE’s Network Orchestrator
1) SCALABILITY
2) MULTIVENDOR
3) TRIGGER
4) BATCH CHANGE
5) HUMAN ERROR

Slide 69

Slide 69 text

Current
• 2020.May - 2021.Feb: DEVELOPMENT
• 2021.Feb - NOW: MAINTENANCE
• 2,000+ SWITCHES
• Running in the production environment since 2021.Feb

Slide 70

Slide 70 text

Future Work
• Rollback feature
• Dry-Run feature
• Introduce k8s CR
Timeline: 2020.May - 2021.Feb DEVELOPMENT, 2021.Feb - NOW MAINTENANCE

Slide 71

Slide 71 text

DISCUSSION