Slide 1

Slide 1 text

Software Disasters ARNO HUETTER

Slide 2

Slide 2 text

About the Author: Arno Huetter
Arno wrote his first lines of code on a Sinclair ZX80 in 1984. Over the years he has programmed in C/C++, Java and C#, and has done a fair amount of database development. Today he is Development Lead at Dynatrace (an APM vendor).

Slide 3

Slide 3 text

OS/2 (1985-2001)

Slide 4

Slide 4 text

The PC World in 1985
➢ The PC (with DOS) is the clear market leader, but the Apple Macintosh is the new cool thing
➢ Windows 1.0 is merely a GUI extension for DOS
➢ IBM's TopView has flopped (a rudimentary shell that allowed multitasking of DOS programs and copy/paste between them)

Slide 5

Slide 5 text

Windows 1.0 30 years of innovation

Slide 6

Slide 6 text

Enter OS/2
➢ OS/2 is intended as the protected-mode successor to DOS
➢ IBM decides to form another partnership with Microsoft
➢ The Plan:
  ➢ IBM programmers would develop significant parts
  ➢ Microsoft would be paid at contractor rates per kLOC
  ➢ Must run on the 286, be compatible with TopView, and run DOS programs in a "compatibility box"
  ➢ Presentation Manager should allow recompiled Windows applications to run (this never worked that way; applications required a rewrite or, starting with OS/2 2.0, a VDM)

Slide 7

Slide 7 text

1987: OS/2 1.0 1988: OS/2 1.1

Slide 8

Slide 8 text

1987 to 1991
➢ Marketing OS/2 alongside IBM's PS/2 platform (although a PS/2 is not required) leads to customer confusion
➢ RAM prices shoot up in 1987 (USD 133 for 1 MB); OS/2 requires 4 MB compared to the usual 1 MB for DOS
➢ USD 340 for a retail copy (DOS ships for free with new PCs)
➢ USD 3,000 for the OS/2 SDK
➢ No printer support except for IBM printers, no drivers for common devices
➢ Missing guidance, support and ecosystem for third-party software vendors
➢ 1989: OS/2 1.2 introduces HPFS, Ethernet, TCP/IP
➢ 1990: Windows 3.0 takes off, the IBM/Microsoft collaboration unravels
➢ 1991: OS/2 1.3 turns out to be a modest success, but fades compared to Windows 3.x

Slide 9

Slide 9 text

1992: OS/2 2.0 1994: OS/2 Warp

Slide 10

Slide 10 text

1992 to 1994
➢ 1992: Windows 3.1 released
➢ 1992: OS/2 2.0 released, a true 32-bit operating system that takes full advantage of the 386 and is technically ahead of Windows 3.x (preemptive multitasking, memory protection)
➢ Workplace Shell with "object-oriented" UI behavior, and a 32-bit API
➢ Multiple DOS programs running side by side
➢ Windows 3.0/3.1 compatibility via VDM; Windows code is included in OS/2
➢ Because of this Windows compatibility, developers simply decide to develop for Windows only (and can state "it runs on OS/2 as well")
➢ OS/2 versions of Lotus 1-2-3 or Corel Draw are sluggish compared to their Windows versions
➢ 1993: Mainframe market collapses. IBM CEO John Akers is ousted and replaced by Louis Gerstner, who turns the struggling company around
➢ 1994: OS/2 3.0 (Warp) introduced
➢ 1994: Windows NT 3.5 introduced (modern, rock-solid, multiprocessor support)

Slide 11

Slide 11 text

1995 to 2001
➢ 1995: Windows 95 hits the market and becomes an instant success
➢ IBM is weak on marketing and hardly gets PC clone makers on board
➢ OS/2 is sold mainly to corporate customers for networking environments, but finally loses there as well to Windows NT
➢ Even IBM's "Mr. OS/2", David Barnes, is quoted as saying: "OS/2 is great, but then Sony's Betamax was way better than VHS…"
➢ 1996: OS/2 Warp 4 released, adds Java and speech recognition
➢ IBM finally stops development, but continues to sell OS/2 until 2001
➢ Gerstner quote #1: “The pro-OS/2 argument was based on technical superiority... What my colleagues seemed unwilling or unable to accept was that the war was already over and was a resounding defeat”
➢ Gerstner quote #2: “The battle between OS/2 and Microsoft Windows was draining tens of millions of dollars, absorbing huge chunks of senior management’s time, and making a mockery of our image.”

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

1998 to 2002: Netscape
➢ 1998: Consensus is that the Netscape 4 code base is pretty bad, so let’s do a complete rewrite! Mozilla organization formed
➢ The code base might have been bad, but it worked quite well for most users (browser market share at 50%)
➢ 1999: Netscape acquired by AOL
➢ 2000: Netscape 6 released. It is not really ready and fails miserably
➢ 2002: Mozilla 1.0 released, the first real release in four years. Browser market share at 6%
➢ 2003: AOL closes the Netscape division, the Mozilla Foundation continues independently
➢ 2004: Resurrection: Firefox 1.0, based on Mozilla

Slide 15

Slide 15 text

Ariane 5 (1996)

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

declare
  vertical_veloc_sensor: float;
  horizontal_veloc_sensor: float;
  vertical_veloc_bias: integer;
  horizontal_veloc_bias: integer;
  ...
begin
  declare
    pragma suppress(numeric_error, horizontal_veloc_bias);
  begin
    sensor_get(vertical_veloc_sensor);
    sensor_get(horizontal_veloc_sensor);
    vertical_veloc_bias := integer(vertical_veloc_sensor);
    horizontal_veloc_bias := integer(horizontal_veloc_sensor);
    ...
  exception
    when numeric_error => calculate_vertical_veloc();
    when others => use_irs1();
  end;
end irs2;

Slide 18

Slide 18 text

Ariane 5 - Summary of Events
➢ Conversion of a 64-bit floating point value to a 16-bit signed integer
➢ Numeric overflow when the horizontal velocity sensor value exceeds 32767 (internal unit), the largest value a 16-bit signed integer can hold
➢ Exception handling deactivated for this conversion
➢ Redundant system contained different hardware but the same software, hence ran into the same problem
➢ Unhandled exception triggered self-destruction in order to avoid the rocket breaking apart
➢ Code originated from Ariane 4, which was slower and flew at a different angle
➢ Calculation not even needed during flight (just during preparation), but still running
➢ USD 5 billion overall development costs
➢ USD 500 million for rocket + satellites
➢ Program delayed by years
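The conversion problem summarized above can be illustrated with a small Python sketch. It is not the flight software: the function names and the two strategies (raise on overflow, or saturate at the limits) are assumptions chosen for illustration, and which strategy is appropriate depends on the system.

```python
import math

# Range of a 16-bit signed integer.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    if not math.isfinite(value):
        raise OverflowError(f"{value!r} is not a finite number")
    converted = int(value)  # truncates toward zero
    if not INT16_MIN <= converted <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit into 16 bits")
    return converted


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    if math.isnan(value):
        raise ValueError("cannot convert NaN")
    return int(max(INT16_MIN, min(INT16_MAX, value)))
```

Either way, the out-of-range case is decided explicitly at the point of conversion instead of surfacing as an unhandled exception elsewhere.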

Slide 19

Slide 19 text

2000 to 2005: FBI Virtual Case File
➢ Software system to manage all documents relating to cases being investigated by the FBI
➢ Modern web interface for 22,000 users to replace the previous ACS system (which was already obsolete at its introduction due to outdated technology)
➢ Estimated completion time: 22 months
➢ By 2005, 700,000 lines of code had been written and five different project leads had been in charge

Slide 20

Slide 20 text

2000 to 2005: FBI Virtual Case File
➢ VCF turns out to be incomplete, inadequate and poorly designed, essentially unusable under real-world conditions
➢ Even in rudimentary tests the system did not comply with basic requirements
➢ After having invested USD 170 million, the FBI decided to buy off-the-shelf software instead
➢ Causes: no architecture blueprints, repeated changes in specification, engineers with little or no computer science training, code bloat, scope creep

Slide 21

Slide 21 text

2003: US Northeast Blackout
➢ Race condition in General Electric's Unix-based XA/21 energy management system
➢ Bug stalls FirstEnergy's control room alarm system; operators no longer receive alerts
➢ Unprocessed events queue up and the primary server fails within 30 minutes
➢ Applications are automatically transferred to the backup server, which itself fails
➢ Operator screen refresh rate drops from once per second to once per minute
➢ Operators hence dismiss a call about the tripping and reclosure of a 345 kV shared line
➢ More lines go offline in a chain reaction; undervoltage and overcurrent are interpreted as a short circuit
➢ 30 minutes later 256 power plants are offline, most due to automatic protective controls
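A silently stalled alarm system is the kind of failure a heartbeat watchdog is meant to surface. The following is a minimal sketch under stated assumptions, not a description of XA/21: the class name `AlarmWatchdog`, the timeout and the explicit `now` timestamps (passed in to keep the example deterministic) are all hypothetical.

```python
from typing import Optional


class AlarmWatchdog:
    """Flags an alarm processor that has stopped reporting heartbeats."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Called by the alarm processor each time it finishes a work cycle."""
        self.last_heartbeat = now

    def is_stalled(self, now: float) -> bool:
        """True if no heartbeat was ever seen or the last one is too old."""
        if self.last_heartbeat is None:
            return True
        return now - self.last_heartbeat > self.timeout_s


# Usage: the alarm processor last reported at t=100 s, timeout is 5 s.
watchdog = AlarmWatchdog(timeout_s=5.0)
watchdog.heartbeat(now=100.0)
stalled_early = watchdog.is_stalled(now=104.0)  # still within the timeout
stalled_late = watchdog.is_stalled(now=106.0)   # heartbeat is overdue
```

The point of the design is that the monitoring path is independent of the component being monitored, so operators are told when alerts have stopped flowing.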

Slide 22

Slide 22 text

2005: WoW Glitch
➢ Game update on September 13th introduced the new character "Hakkar"
➢ Hakkar was able to inflict a disease, "Corrupted Blood", on player characters, draining their health points and finally killing them
➢ The disease could be passed to other players
➢ The effect was meant to be localized to one game area
➢ Developers didn't consider WoW's teleporting functionality
➢ Infected players teleported into other areas, soon leading to corpses littering the streets
➢ Fortunately, player death is not permanent in WoW, and admins reset the game
➢ (Virtual) death toll: unknown

Slide 23

Slide 23 text

2012: Knight Capital loses USD 440 million
➢ August 1st: New trading software goes live
➢ Administrator forgets to deploy to one out of eight server nodes
➢ New code repurposed a flag previously used for testing scenarios
➢ On that one server node, the old trading algorithm interprets the flag differently and starts buying and selling 100 different stocks randomly without human verification
➢ NYSE has to suspend trading of several stocks
➢ Knight Capital loses USD 440 million in only 45 minutes, until the system is suspended
➢ Investors have to raise USD 400 million in order to rescue the company
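One safeguard against a partial deployment like this is to refuse to activate a repurposed flag until every node reports the expected build. The sketch below illustrates that idea only; the function names, node names and version strings are hypothetical and say nothing about Knight Capital's actual tooling.

```python
from typing import Dict, List


def find_stale_nodes(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the names of nodes that do not run the expected build."""
    return sorted(node for node, version in deployed.items() if version != expected)


def safe_to_enable_flag(deployed: Dict[str, str], expected: str) -> bool:
    """Allow flag activation only if all known nodes run the expected build."""
    return bool(deployed) and not find_stale_nodes(deployed, expected)


# Usage: eight nodes, one of which was missed during the rollout.
versions = {f"node{i}": "2.0" for i in range(1, 9)}
versions["node8"] = "1.9"
stale = find_stale_nodes(versions, expected="2.0")
can_enable = safe_to_enable_flag(versions, expected="2.0")
```

With one node left on the old build, `stale` names that node and `can_enable` is false, so the flag stays off until the rollout is complete.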

Slide 24

Slide 24 text

Source: http://www.typemock.com/software-bugs-infographic

Slide 25

Slide 25 text

Why do SW projects fail (IEEE)
➢ Unrealistic or unarticulated project goals
➢ Inaccurate estimates of needed resources
➢ Badly defined system requirements
➢ Poor reporting of the project's status
➢ Unmanaged risks
➢ Poor communication among customers, developers, and users
➢ Use of immature technology
➢ Inability to handle the project's complexity
➢ Sloppy development practices
➢ Poor project management
➢ Stakeholder politics
➢ Commercial pressures

Slide 26

Slide 26 text

Thank you!
Twitter: https://twitter.com/ArnoHu
Blog: http://arnosoftwaredev.blogspot.com