Slide 1

Slide 1 text

03/12/08 Nuova Systems Inc.
Data Center Networking: Real vs. Ideal
Tom Lyon
Stanford Clean Slate Seminar, May 13, 2008
[email protected]

Slide 2

Slide 2 text

Disclaimer
Opinions expressed herein are my own, and are unlikely to correspond with any official opinions of Nuova Systems or Cisco Systems.

Slide 3

Slide 3 text

The Data Center Explosion
• Google
  • Around 1,000,000 servers
  • 20-50 datacenters, up to 100MW each
• Microsoft
  • Adding 20,000 servers per month
  • 15x servers and power over next 5 years
• Electronic Trading
  • Whoever trades the fastest and smartest wins
• Cloud Computing
• Oil & Gas
• Drug Discovery
• ...

Slide 4

Slide 4 text

Networking in the Data Center

Slide 5

Slide 5 text

Dimensions of Data Center Networks
• Functionality
• Compatibility
• Reliability
• Cost
• Manageability
• Performance
  • Bandwidth
  • Latency
  • Jitter/Fairness

Slide 6

Slide 6 text

Functionality: Types of Networks
• “Normal” Ethernet
• “Special” Ethernets – console, backup, IPC, ...
• Storage Network – [Ethernet, Fibre Channel]
• IPC Network – [Infiniband, Ethernet, exotic]
• Console Network [Serial, Ethernet, KVM]
• Power [AC, DC]
• SMP Interconnect [exotic]

Slide 7

Slide 7 text

Compatibility
• Serial ports are direct descendants of the Telegraph. Still work with 50-year-old teletypes!
• SCSI is almost 30 years old
• Ethernet packets backwards compatible to 1980
• x86 instruction set: 1978

Slide 8

Slide 8 text

Virtualization
• Hard to believe in 'Clean Slate'
• But, Virtualization can “contain” compatibility problems
  • Telnet, VLANs, VPNs, VMware, Virtual disks, ...
• Virtualization used to share resources
  • 1 mainframe, many systems
• Virtualization used for consolidation
  • Multiple physical objects to 1
• Virtualization used for encapsulation
  • Ability to capture, manage, redeploy state

Slide 9

Slide 9 text

Reliability: No single point of failure
• 2 general Ethernet connections
• 2 storage network connections
• 2 power connections
• usually 1 console connection
• Every server on 7 networks!
• Enter the blade server: shared wiring for servers
• Net consolidation: Storage, Ethernet & Console
  • iSCSI or FCOE
  • KVM/IP

Slide 10

Slide 10 text

Cost
• Driven by volume
  • Why the PC architecture is dominant
• Inhibited by complexity
  • “Features” are the enemy

Slide 11

Slide 11 text

Manageability
• Most systems require human touch, training, configuration, monitoring
  • What if I don't want to set the time on my coffee maker?
• But systems are just components of a datacenter
  • Reverse Turing Test – management software pretending to be humans
• We need a standard paradigm for devices to manage other devices

Slide 12

Slide 12 text

Performance
• Cost of bandwidth 10,000x less than WAN
  • 1Gb links nearing zero cost
  • This is the year of affordable 10Gbps
• Higher level protocols are speed-independent
• PHY layers have largely converged among different standards
• Most performance problems not related to “raw” performance

Slide 13

Slide 13 text

Normalized Line Rates
[Bar chart: normalized line rate in Gbps (axis 0 to 18) for 1G ETH, 4G FC, 8G FC, 10G IB, 10G ETH, 20G IB]
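
The bar values of the chart did not survive in this transcript. One plausible reading of "normalized" is the usable data rate once line-coding overhead is removed, and the sketch below computes that. The signaling rates and encodings are commonly published figures for these link types, not numbers taken from the slide, so treat them as assumptions.

```python
# Usable data rate = signaling rate x line-coding efficiency.
# Signaling rates and encodings are assumed from commonly published
# figures; they are not values read off the original chart.
LINKS = {
    "1G ETH":  (1.25,    8, 10),   # (Gbaud, payload bits, coded bits)
    "4G FC":   (4.25,    8, 10),
    "8G FC":   (8.5,     8, 10),
    "10G IB":  (10.0,    8, 10),   # 4 lanes x 2.5 Gbaud (SDR)
    "10G ETH": (10.3125, 64, 66),
    "20G IB":  (20.0,    8, 10),   # 4 lanes x 5 Gbaud (DDR)
}

def normalized_gbps(name):
    """Return the data rate in Gbps left after line coding."""
    gbaud, payload_bits, coded_bits = LINKS[name]
    return gbaud * payload_bits / coded_bits

for name in LINKS:
    print(f"{name:8s} {normalized_gbps(name):5.1f} Gbps")
```

Under these assumptions the nominal "10G" and "20G" Infiniband links carry less payload than their names suggest, which would be consistent with a chart axis that stops at 18 Gbps.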

Slide 14

Slide 14 text

Performance vs Application
• Storage
  • Large block transfers
  • Latency sensitive
  • “Hardware” endpoints
• IPC
  • Extremely latency sensitive
  • Wide range of packet sizes
  • “Software” endpoints
• Generic Ethernet/TCP/IP
  • No huge packets
  • Not as latency sensitive
  • Default for all software

Slide 15

Slide 15 text

Storage Networks
• Storage Access slowly evolving from hardware bus to open network
• NAS vs SAN
• NFS & CIFS vs SCSI's many flavors
• Ethernet vs Fibre Channel vs Infiniband

Slide 16

Slide 16 text

Storage Networks: Ethernet vs EtherNot

Ethernet                EtherNot
iSCSI, NFS, CIFS        SCSI-FCP, SCSI-SRP
TCP & Ethernet          F.C. and Infiniband
Congestion Loss         Credit Flow Control
Stream Oriented         Block Oriented
Software Transport      Hardware Transport
High CPU overhead       Low CPU overhead

Slide 17

Slide 17 text

Storage Networks: Convergence
• Data Center Ethernet
• Choice of congestion classes
  • Lossy vs lossless
• Choice of storage transports
  • TCP or F.C. (FCOE)
• Choice of hardware or software transport
  • TOE w/ TCP, software FCOE, ...

Slide 18

Slide 18 text

Turning over the rocks...
The Ugly Reality & Some Clean Alternatives

Slide 19

Slide 19 text

Topology
• Data Center networks are tree structured
  • Only topology that Ethernet supports
  • parallel trees for redundancy
• Bandwidth of core nodes limits bandwidth of entire network
• Need to evolve to support fat-tree, mesh, arbitrary topologies - “multi-path”
• Redundancy / Incompatibility between different network layers
  • L2 / Ethernet
  • L3 / IP
  • Storage multi-pathing
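
To make the tree restriction concrete, here is a minimal sketch of what happens when a tree is imposed on a richer physical topology. It uses a plain breadth-first spanning tree as a stand-in for the Spanning Tree Protocol (real STP elects a root by bridge ID and weighs path costs), and the four-switch full mesh and its names are hypothetical.

```python
from collections import deque

# Hypothetical 4-switch full mesh; names are illustrative only.
LINKS = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]

def spanning_tree(links, root):
    """Return the set of links kept by a breadth-first spanning tree."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    kept, seen, queue = set(), {root}, deque([root])
    while queue:
        node = queue.popleft()
        for peer in neighbors[node]:
            if peer not in seen:
                seen.add(peer)
                kept.add(frozenset((node, peer)))
                queue.append(peer)
    return kept

kept = spanning_tree(LINKS, root="A")
# Links outside the tree carry no traffic, even though they are cabled.
blocked = [link for link in LINKS if frozenset(link) not in kept]
print(f"{len(kept)} links forwarding, {len(blocked)} links idle: {blocked}")
```

In this toy mesh half of the installed links sit idle and all traffic funnels through the root, which is the "bandwidth of core nodes" limit the slide describes.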

Slide 20

Slide 20 text

Control Planes & Politics
• Servers are redundantly connected, yet don't participate in network topology determination
  • Onto which interface should I send a packet?
• Multi-path access to storage is very important
  • But storage doesn't participate in network topology either
• What if there was just one control plane?
  • Unify Ethernet, IP, SCSI addressing
  • Arbitrary topology – simple graph theory
  • Huge boost in possible bandwidth
  • Huge reduction in congestion & latency
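
As a sketch of the "simple graph theory" a single control plane could apply, the function below counts the distinct minimum-hop paths between two endpoints when servers are treated as nodes of the same graph as the switches. The fabric (two dual-homed servers, four edge switches, two core switches) and all node names are hypothetical, and counting equal-length paths is only one possible multi-path criterion.

```python
from collections import deque

def shortest_path_count(links, src, dst):
    """Count distinct minimum-hop paths from src to dst using BFS."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    dist, ways, queue = {src: 0}, {src: 1}, deque([src])
    while queue:
        node = queue.popleft()
        for peer in neighbors[node]:
            if peer not in dist:            # first time we reach this node
                dist[peer] = dist[node] + 1
                ways[peer] = 0
                queue.append(peer)
            if dist[peer] == dist[node] + 1:
                ways[peer] += ways[node]    # every path to node extends to peer
    return ways.get(dst, 0)

# Hypothetical fabric: s1 and s2 are servers, e* edge switches, c* core switches.
FABRIC = [
    ("s1", "e1"), ("s1", "e2"), ("s2", "e3"), ("s2", "e4"),
    ("e1", "c1"), ("e1", "c2"), ("e2", "c1"), ("e2", "c2"),
    ("e3", "c1"), ("e3", "c2"), ("e4", "c1"), ("e4", "c2"),
]

print(shortest_path_count(FABRIC, "s1", "s2"))
```

A spanning tree over the same cabling would leave exactly one usable path between the two servers, so the gap between that and the count printed here is a rough measure of the bandwidth left on the table.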

Slide 21

Slide 21 text

Congestion
• Protocols have different congestion management approaches
  • Traditional Ethernet / TCP – drop & retransmit
  • Fibre Channel – never drop, spread congestion
  • Infiniband – no drop, virtual channels
• Ideal: Application chooses behavior
  • How to achieve fairness?
• Ideal: Integrate congestion & topology
  • route around congested nodes
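
The difference between "drop & retransmit" and "never drop" can be seen in a toy single-link model. This is a sketch under stated assumptions, not a model of any real switch: time advances in ticks, the sender offers two packets per tick, the receiver drains one, and the buffer holds four. All parameter names and values are illustrative.

```python
def simulate(mode, ticks=10, offered_per_tick=2, drain_per_tick=1, buffer_slots=4):
    """Return (delivered, dropped, stalled) after `ticks` steps.

    mode "drop":   packets arriving at a full buffer are discarded.
    mode "credit": the sender transmits only while it holds a credit,
                   and each credit stands for one free buffer slot.
    """
    queue = 0                  # packets sitting in the receiver buffer
    credits = buffer_slots     # used only in "credit" mode
    delivered = dropped = stalled = 0
    for _ in range(ticks):
        for _ in range(offered_per_tick):
            if mode == "credit":
                if credits > 0:
                    credits -= 1
                    queue += 1
                else:
                    stalled += 1   # back-pressure: the packet waits at the sender
            else:
                if queue < buffer_slots:
                    queue += 1
                else:
                    dropped += 1   # tail drop: a transport such as TCP must resend
        drained = min(queue, drain_per_tick)
        queue -= drained
        delivered += drained
        if mode == "credit":
            credits += drained     # receiver returns one credit per freed slot
    return delivered, dropped, stalled

print("drop:  ", simulate("drop"))
print("credit:", simulate("credit"))
```

Both modes deliver the same number of packets because the drain rate is the bottleneck. What differs is where the excess goes: it is lost in one case and pushed back onto the sender in the other, which is how congestion spreads upstream in a lossless fabric.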

Slide 22

Slide 22 text

Ethernet Congestion Directions
• Ethernet today supports “pause”
  • All or nothing – congestion spreading
• Soon: “Per Priority Pause”
  • 8 classes of traffic – independent pause
• Congestion signaling – IEEE P802.1Qau / QCN
  • backwards congestion notification to source
  • will take a long time to diffuse to many products
• Adapters need better queueing
  • Single deep queue leads to massive unfairness
  • Multi-queue / flow awareness needed
  • Better integration between hw/sw queue management
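
The unfairness of a single deep queue can be illustrated with a small ordering exercise. The sketch below compares one shared FIFO with one queue per flow served round-robin; the two flows ("bulk" and "rpc") and their packet counts are hypothetical, and round-robin is just one simple flow-aware discipline, not necessarily what any adapter implements.

```python
from collections import deque

def fifo_order(arrivals):
    """Single shared queue: packets leave in arrival order."""
    return list(arrivals)

def round_robin_order(arrivals):
    """One queue per flow, served one packet at a time in rotation."""
    queues = {}
    for flow, seq in arrivals:
        queues.setdefault(flow, deque()).append((flow, seq))
    order = []
    while any(queues.values()):
        for q in queues.values():
            if q:
                order.append(q.popleft())
    return order

def finish_slot(order, flow):
    """1-based transmit slot in which the flow's last packet leaves."""
    return max(i for i, (f, _) in enumerate(order, start=1) if f == flow)

# Hypothetical burst: "bulk" queues 8 packets just before "rpc" queues 2.
arrivals = [("bulk", i) for i in range(8)] + [("rpc", i) for i in range(2)]

for name, scheduler in (("single FIFO", fifo_order), ("per-flow RR", round_robin_order)):
    order = scheduler(arrivals)
    print(name, "rpc done at slot", finish_slot(order, "rpc"),
          "| bulk done at slot", finish_slot(order, "bulk"))
```

In this example the short flow finishes much earlier with per-flow queues, while the bulk flow finishes in the same slot either way, so the fairness gain costs the heavy flow nothing here.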

Slide 23

Slide 23 text

TCP Directions
• TCP today supports ECN
  • But it's turned off because there are too many broken routers in the world
  • Layer 2 switches don't mark, need L3 awareness
• Extend TCP to select ECN or not based on routes
  • Enable for “local” or “known good” subnets
• TCP timeouts are ludicrously high
  • Need to decrease by 1000x for the datacenter
  • OS impact
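
To see why the timeouts look so high from inside a datacenter, the sketch below applies the retransmission-timeout smoothing rules of RFC 6298 and then clamps the result to a configurable floor. The 100 microsecond round-trip time is an assumed figure for illustration. The 1 second floor is the RFC's recommendation, 200 ms is the floor commonly used by Linux, and the 200 microsecond floor is a hypothetical value 1000x lower.

```python
def rto_after(samples, rto_min, clock_granularity=0.0):
    """Retransmission timeout in seconds after the given RTT samples.

    Follows the RFC 6298 smoothing rules (alpha = 1/8, beta = 1/4, K = 4)
    and clamps the result to `rto_min`. Expects at least one sample.
    """
    alpha, beta, k = 1 / 8, 1 / 4, 4
    srtt = rttvar = None
    for r in samples:
        if srtt is None:
            srtt, rttvar = r, r / 2          # first measurement
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
    rto = srtt + max(clock_granularity, k * rttvar)
    return max(rto, rto_min)

DC_RTT = 100e-6                  # assumed datacenter round trip: 100 microseconds
samples = [DC_RTT] * 10

for floor in (1.0, 0.2, 0.0002):
    rto = rto_after(samples, rto_min=floor)
    print(f"floor {floor:g} s -> RTO {rto:g} s, about {rto / DC_RTT:.0f} round trips")
```

With a stable sub-millisecond RTT the computed value collapses toward the RTT itself, so the floor is what actually determines the timeout. That is why the change has "OS impact": the floor is tied to kernel constants and timer granularity rather than to the protocol arithmetic.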

Slide 24

Slide 24 text

OSI Reference Model

Slide 25

Slide 25 text

Layer “Violations”
• Layer 2 - “transparent” switching
• Virtual machines
• Virtual networks
• IB/FC transport
• Proxies, firewalls, appliances

Slide 26

Slide 26 text

World Views - Topology
• Old world: The network (world) is flat
  • Keep going and you'll eventually get to the edge
• New world: The network (world) is round
  • Wherever the data goes, it's still in the network
  • Servers just talk to other servers
  • The data goes in, rattles around, and comes back out again
• Storage: just a network with a time dimension
• Spacetime Data Fabric?
  • Data constantly in motion/transformation
  • distance == time == latency

Slide 27

Slide 27 text

World Views - Manageability
• Geocentric
  • The CPU/OS is the center of the universe
• Heliocentric
  • The DataCenter as a holistic unit
• Galactic
  • DataCenters all over the world
[Image label: You are here]

Slide 28

Slide 28 text

Monitoring
• How much time, space, bandwidth & energy does my application use?
  • We can almost answer this within a single server
  • But how can we answer this for distributed applications?
• Monitoring needs transparency – the opposite of virtualization

Slide 29

Slide 29 text

“Mobile” Processing
• 99.9% of networking moves the data to the processor
• Why not move the processing to the data?
  • “function shipping”
  • Hypervisor on disk drive?
  • Map/Reduce model
• What if the data is distributed?
  • Move the processing into the network – EC2
  • Steiner points?
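
A rough sketch of the "function shipping" trade-off: the same count is computed once by pulling every record to a central processor and once by running the filter where the data lives and returning only partial results, in the spirit of the Map/Reduce model. The shard names and contents are hypothetical, traffic is counted in items rather than bytes, and the cost of shipping the function itself is ignored.

```python
# Hypothetical storage nodes, each holding a shard of integer records.
SHARDS = {
    "disk1": list(range(0, 1000)),
    "disk2": list(range(1000, 2000)),
    "disk3": list(range(2000, 3000)),
}

def is_multiple_of_7(record):
    return record % 7 == 0

def move_data_to_processor(shards, predicate):
    """Pull every record over the network, then filter centrally."""
    matches = moved = 0
    for records in shards.values():
        moved += len(records)                   # every record crosses the network
        matches += sum(1 for r in records if predicate(r))
    return matches, moved

def move_processing_to_data(shards, predicate):
    """Run the predicate at each node; only a partial count comes back."""
    matches = moved = 0
    for records in shards.values():
        partial = sum(1 for r in records if predicate(r))   # "map" at the data
        moved += 1                              # one partial result crosses the network
        matches += partial                      # "reduce" at the requester
    return matches, moved

print("move data:      ", move_data_to_processor(SHARDS, is_multiple_of_7))
print("move processing:", move_processing_to_data(SHARDS, is_multiple_of_7))
```

Both approaches return the same answer; only the amount of traffic differs, and the gap grows with the size of the shards.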

Slide 30

Slide 30 text

Summary
• Data Centers are the new heavy industry
• Networking is the raison d'être for data centers
• From inter-continental to intra-chip, networking issues are the major problems to be solved