Slide 1

Machines That Think!
Tom Lyon
For Brocade Communications
9/24/2003

Slide 2

What's New in Computing?
"The thing that hath been, it is that which shall be; and that which is done is that which shall be done: and there is no new thing under the sun."
– Ecclesiastes 1:9

Slide 3

Nothing New…
• Alan Turing, 1936 – what can be done
  – "On Computable Numbers"
• Vannevar Bush, 1945 – what should be done
  – "As We May Think"
• John von Neumann, 1945 – how to build it
  – "First Draft of a Report on the EDVAC"

Slide 4

Then & Now
• Sequential tape → sequential hard drives
• Swap to drives → swap to DRAM
• I/O bus → network
• Fast DRAM → fast cache
• SMP systems → SMP chips
• Software is cheap → hardware is cheap
• Memory is cheap → memory is least cheap

Slide 5

The Last Hardware Problem: Latency
• Speed of light ~ 1 ft/ns
• Speed of DRAM ~ 70 ns
• Hard drive rotational latency – 2–16 ms
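
A back-of-the-envelope sketch of that gap (the drive RPM figures are illustrative assumptions, not from the slide): average rotational latency is half a revolution, so even a fast drive pays tens of thousands of DRAM-access times per rotational wait.

```python
# Rough latency arithmetic; RPM values are assumed for illustration.
DRAM_NS = 70  # DRAM access time from the slide, in nanoseconds

def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency = time for half a revolution, in ms."""
    return (60_000.0 / rpm) / 2

for rpm in (5400, 7200, 15000):
    ms = avg_rotational_latency_ms(rpm)
    dram_ops = ms * 1_000_000 / DRAM_NS
    print(f"{rpm:>5} RPM: {ms:5.2f} ms ≈ {dram_ops:,.0f} DRAM accesses")
```

At 15,000 RPM the average wait is 2 ms – the slide's low end – which still costs roughly 28,000 DRAM accesses of idle time.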

Slide 6

Processors
• Xeon DP (no L3) = 55M transistors
• Xeon MP (2MB L3) = 108M transistors
• Intel 1B-transistor process – 20 nm features, ~80 atoms wide

Slide 7

Power
• Desktop & server – ignore power, go for performance
  – Pentium, Itanium, Opteron, PPC 970
• Embedded – balanced power & performance
  – MIPS
• Handheld – power is paramount
  – ARM – XScale, OMAP, …
  – AMD Alchemy
  – PPC 405LP

Slide 8

64-Bit
• AMD Opteron – x86-64
• Intel Itanium
• IBM/Apple PowerPC 970
• UltraSPARC
• MIPS64 – Broadcom, PMC-Sierra
• SuperH – SH-5

Slide 9

Deep vs. Wide
• ILP – instruction-level parallelism
  – Deep pipelining, speculative execution – more ops/clock
• CLP (VLIW) – compiler-level parallelism
  – More ops/instruction – Itanium, Transmeta
• TLP – thread-level parallelism (see the sketch below)
  – Intel Hyper-Threading, Sun Niagara
• OLP – OS-level parallelism
  – Vanderpool, VMware
• NLP – network-level parallelism – clusters
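
A minimal sketch of what exploiting thread-level parallelism looks like from software, under stated assumptions: the workload is invented, and a process pool stands in for hardware threads because CPython's GIL serializes CPU-bound threads.

```python
# Illustrative TLP-style fan-out: independent tasks, no shared state.
# A process pool stands in for hardware threads/cores here, since
# CPython's GIL keeps CPU-bound *threads* from running in parallel.
from concurrent.futures import ProcessPoolExecutor

def unit_of_work(n: int) -> int:
    # Hypothetical CPU-bound task; any independent computation works.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # defaults to one worker per core
        totals = list(pool.map(unit_of_work, [200_000] * 8))
    print(totals)
```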

Slide 10

Beyond Price/Performance
• What matters when hardware is free?
  – Density, cooling, cabling
• Network computing
• Virtualization
• Consolidation
• Autonomic computing
• Recovery-oriented computing

Slide 11

Blade Servers
• Address density, power, & cabling problems in the datacenter
• Blades & clusters should be made for each other
  – Choose the best price/performance processors and replicate
• But too much profit is at stake in servers
  – Exotic processors are where the $$s are

Slide 12

Network Computing
• Clusters – Oracle, Top500, J2EE
  – Performance, availability, or both?
  – Homogeneous, local area
• Web services – XML everywhere
• Grid computing
  – How many supercomputers can one scientist use?
  – Heterogeneous, wide area
• Utility computing
  – Pay by the play

Slide 13

Virtualization
"Any problem in Computer Science can be solved with another level of indirection."
– Butler Lampson

Slide 14

Virtualization: Seeing What Is Not There
• Virtual drives
  – RAID, LUN mapping (sketched below), etc.
• Networks
  – VLANs, VPNs, proxies, etc.
• Processors
  – VMware, Hyper-Threading, Vanderpool
  – Software emulations – VirtualPC, etc.
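
A toy sketch of the indirection behind one item above, LUN mapping; the table layout, names, and numbers are hypothetical, for illustration only. The host addresses virtual drives, and a lookup table decides what is really there.

```python
# Hypothetical LUN-mapping table: virtual LUN id -> (physical device,
# starting block). Two virtual drives can share one physical disk.
lun_map: dict[int, tuple[str, int]] = {
    0: ("disk_a", 0),
    1: ("disk_a", 1_000_000),
    2: ("disk_b", 0),
}

def resolve(lun: int, block: int) -> tuple[str, int]:
    """Translate a (virtual LUN, block) address to a physical address."""
    device, base = lun_map[lun]
    return device, base + block

print(resolve(1, 42))  # -> ('disk_a', 1000042)
```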

Slide 15

The Management Challenge
• Virtualization creates new management problems
  – At least doubles the number of managed objects
  – Creates new security risks, …

Slide 16

Consolidation
• Cost control rules today
• Systems have greater capacity
• IT always wants fewer things to manage
  – Easily managed systems never seem to appear
• Better networking enables central services

Slide 17

Autonomic Computing

Slide 18

Recovery-Oriented Computing
Aaron Brown, Dan Hettenna, David Oppenheimer, Noah Treuhaft, Leonard Chung, Patty Enriquez, Susan Housand, Archana Ganapathi, Dan Patterson, Jon Kuroda, Mike Howard, Matthew Mertzbacher, Dave Patterson, and Kathy Yelick
University of California at Berkeley
In cooperation with George Candea, James Cutler, and Armando Fox
Stanford University

Slide 19

ROC: Goals and Assumptions of the Last 15 Years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions:
  – Humans are perfect (they don't make mistakes during installation, wiring, upgrade, maintenance, or repair)
  – Software will eventually be bug-free (good programmers write bug-free code, debugging works)
  – Hardware MTBF is already very large (~100 years between failures) and will continue to increase

Slide 20

Today, After 15 Years of Improving Performance
• Availability is now the vital metric for servers
  – Near-100% availability is becoming mandatory for e-commerce, enterprise apps, online services, ISPs
• But service outages are frequent
  – 65% of IT managers report that their websites were unavailable to customers over a 6-month period
  – 25% report 3 or more outages
• Outage costs are high
  – Social effects: negative press, loss of customers who "click over" to a competitor
  – $500,000 to $5,000,000 per hour in lost revenues (see the arithmetic below)
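
A quick sketch connecting "near-100%" availability with the per-hour cost above; the cost range is the slide's, while the availability levels are the conventional "nines", added here for illustration.

```python
# Annual downtime and lost revenue at standard availability levels.
# The $500k-$5M/hour cost range is from the slide; the "nines" are
# conventional availability levels added for illustration.
HOURS_PER_YEAR = 365 * 24

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    low, high = downtime_h * 500_000, downtime_h * 5_000_000
    print(f"{availability:.3%}: {downtime_h:8.2f} h/yr down, "
          f"${low:,.0f} to ${high:,.0f} in lost revenue")
```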

Slide 21

New Goals: ACME
• Availability – failures are common
  – Traditional fault tolerance doesn't solve the problems
• Change
  – In back-end system tiers, software upgrades are difficult, failure-prone, or ignored
  – For application services over the WWW, change is daily
• Maintainability
  – Human operator error may be the single largest failure source
  – System maintenance environments are unforgiving
• Evolutionary growth
  – 1U-PC cluster front-ends scale and evolve well
  – Back-end scalability is still limited

Slide 22

Recovery-Oriented Computing Philosophy
• Failures are a fact, and recovery/repair is how we cope with them
• Improving recovery/repair improves availability
  – UnAvailability = MTTR / MTTF
  – 1/10th the MTTR is just as valuable as 10× the MTBF (see the worked example below)
• If the major sysadmin job is recovery after failure, ROC also helps with system administration
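
A worked example of the slide's formula with assumed baseline numbers: cutting MTTR to a tenth lowers unavailability exactly as much as raising MTTF tenfold.

```python
# UnAvailability = MTTR / MTTF, per the slide. The baseline MTTR and
# MTTF values below are assumptions for illustration.
def unavailability(mttr_h: float, mttf_h: float) -> float:
    return mttr_h / mttf_h

base          = unavailability(mttr_h=4.0, mttf_h=10_000)   # 4.0e-04
faster_repair = unavailability(mttr_h=0.4, mttf_h=10_000)   # 4.0e-05
better_mttf   = unavailability(mttr_h=4.0, mttf_h=100_000)  # 4.0e-05

print(base, faster_repair, better_mttf)  # the last two are equal
```

This is why ROC targets recovery time rather than only failure rate.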

Slide 23

R.O.C.
http://roc.CS.Berkeley.EDU/

Slide 24

Software Complexity
• It's in software because it couldn't be specified precisely
• Too many ways to write buggy code
• Interfaces (APIs, ABIs, protocols, …) are not stable
• Co-resident software modules interact in weird ways
  – Combinatorial explosion in testing (quantified in the sketch below)
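
To put numbers on that combinatorial explosion, a small sketch; the module and version counts are assumptions chosen for illustration. Pairwise interactions grow quadratically, and full version combinations grow exponentially.

```python
# How co-resident-module testing blows up. The module and version
# counts are assumptions chosen for illustration.
from math import comb

n_modules, versions_each = 10, 4
print(f"pairwise interactions: {comb(n_modules, 2)}")            # 45
print(f"full configurations:   {versions_each ** n_modules:,}")  # 1,048,576
```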

Slide 25

Leaning Tower of APIs
• APIs are rarely stable or upward compatible
• Any app uses lots of APIs
• APIs capture developers by keeping them busy!
• APIs are meaningless to anyone except developers

Slide 26

Dedicated Servers
• Each major application dictates the exact versions & patches of the underlying middleware & OS
• Therefore, each app requires its own OS image
• Each OS requires its own server

Slide 27

Appliances
• Traditionally, the end user or VAR integrates hardware, middleware, and application
• If this is done by the product vendor, you have an appliance
• Appliance hardware can be configured for the best value for the application

Slide 28

Linux Rising
• Linux == Recall Windows!
• Unfortunately, 136 candidates on the ballot:
  – Red Hat, SuSE, Debian, Mandrake, MontaVista, …
• "Write once, run anywhere" is truer of Linux than of Java, but applies at the source level
• The leaning tower of APIs can be "patched"

Slide 29

Closer Than Ever…
Machines That Think!