Software Disasters

Arno Huetter
August 20, 2015
Transcript

  1. About the Author: Arno Huetter

    Arno wrote his first lines of code on a Sinclair ZX80 in 1984. Over the years, he has programmed in C/C++, Java and C#, and has also done a fair amount of database development. Today he is Development Lead at Dynatrace (an APM vendor).
  2. The PC World in 1985

    ➢ PC (with DOS) is clear market leader, but Apple Macintosh is the new cool thing
    ➢ Windows 1.0 merely a DOS GUI extension
    ➢ IBM's TopView has flopped (rudimentary shell that allowed for copy/paste between and multitasking of DOS programs)
  3. Enter OS/2

    ➢ OS/2 intended as the protected mode successor of DOS
    ➢ IBM decides to form another partnership with Microsoft
    ➢ The Plan:
      ➢ IBM programmers would develop significant parts
      ➢ Microsoft to be paid by kLOC contractor rates
      ➢ Must run on 286, be compatible with TopView, run DOS programs in “compatibility box”
      ➢ Presentation Manager should allow recompiled Windows applications to run (never worked that way, required rewrite or VDM starting with OS/2 2.0)
  4. 1987 to 1991

    ➢ Marketing along with IBM's PS/2 platform (although PS/2 not required) leads to customer confusion
    ➢ RAM prices shoot up in 1987 (USD 133 for 1MB), OS/2 requires 4MB compared to the usual 1MB for DOS
    ➢ USD 340 for retail copy (DOS shipped for free with new PCs)
    ➢ USD 3,000 for OS/2 SDK
    ➢ No printer support except IBM printers, no drivers for common devices
    ➢ Missing guidance / support / ecosystem for 3rd party software vendors
    ➢ 1989: OS/2 1.2 introduces HPFS, Ethernet, TCP/IP
    ➢ 1990: Windows 3.0 takes off, IBM/Microsoft collaboration unravels
    ➢ 1991: OS/2 1.3 turns out to be a modest success, but fades compared to Windows 3.x
  5. 1992 to 1994

    ➢ 1992: Windows 3.1 released
    ➢ 1992: OS/2 2.0, true 32bit operating system, taking full advantage of 386, and technically ahead of Windows 3.x (preemptive multitasking, memory protection)
    ➢ Workplace with “object-oriented” UI behavior and 32bit API
    ➢ Multiple DOS programs running side-by-side
    ➢ Windows 3.0/3.1 compatibility via VDM. Windows code included in OS/2
    ➢ Due to Windows compatibility, developers simply decided to develop for Windows only (and could state “it runs on OS/2 as well”)
    ➢ OS/2 versions of Lotus 1-2-3 or Corel Draw sluggish compared to Windows
    ➢ 1993: Mainframe market collapses. IBM CEO John Akers ousted, replaced by Louis Gerstner. Gerstner turns struggling company around
    ➢ 1994: OS/2 3.0 (Warp) introduced
    ➢ 1994: Windows NT 3.5 introduced (modern, rock-solid, multi-core support)
  6. 1995 to 2001

    ➢ 1995: Windows 95 hits market, becomes instant success
    ➢ IBM weak on marketing, hardly getting PC clone makers on board
    ➢ OS/2 sold mainly to corporate customers for networking environments, but finally loses there as well to Windows NT
    ➢ Even IBM's “Mr. OS/2”, David Barnes, is quoted saying: “OS/2 is great, but then Sony's Betamax was way better than VHS…”
    ➢ 1996: OS/2 Warp 4 released, adds Java and speech recognition
    ➢ IBM finally stops development, but continues to sell OS/2 until 2001
    ➢ Gerstner quote #1: “The pro-OS/2 argument was based on technical superiority... What my colleagues seemed unwilling or unable to accept was that the war was already over and was a resounding defeat”
    ➢ Gerstner quote #2: “The battle between OS/2 and Microsoft Windows was draining tens of millions of dollars, absorbing huge chunks of senior management’s time, and making a mockery of our image.”
  7. 1998 to 2002: Netscape

    ➢ 1998: Consensus: Netscape 4 code base is pretty bad. So let’s do a complete rewrite! Mozilla organization formed.
    ➢ Code base might have been bad, but it worked quite well for most users (browser market share at 50%)
    ➢ 1999: Netscape acquired by AOL
    ➢ 2000: Netscape 6 released. Wasn’t really ready, fails miserably
    ➢ 2002: Mozilla 1.0 released. First real release in four years. Browser market share at 6%
    ➢ 2003: AOL closes Netscape division, Mozilla Foundation continues independently
    ➢ 2004: Resurrection: Firefox 1.0 based on Mozilla
  8. 

    declare
      vertical_veloc_sensor: float;
      horizontal_veloc_sensor: float;
      vertical_veloc_bias: integer;
      horizontal_veloc_bias: integer;
      ...
    begin
      declare
        pragma suppress(numeric_error, horizontal_veloc_bias);
      begin
        sensor_get(vertical_veloc_sensor);
        sensor_get(horizontal_veloc_sensor);
        vertical_veloc_bias := integer(vertical_veloc_sensor);
        horizontal_veloc_bias := integer(horizontal_veloc_sensor);
        ...
      exception
        when numeric_error => calculate_vertical_veloc();
        when others => use_irs1();
      end;
    end irs2;
  9. Ariane 5 - Summary of Events

    ➢ 64bit floating point to 16bit signed integer conversion
    ➢ Numeric overflow when horizontal velocity sensor value > 32767 (internal unit), the largest 16bit signed value
    ➢ Exception handling deactivated
    ➢ Redundant system contained different hardware but same software, hence ran into same problem
    ➢ Unhandled exception triggered self destruction in order to avoid rocket breaking apart
    ➢ Code originated from Ariane 4, which was slower and flew at different angle
    ➢ Calculation not even needed during flight (just during prep), but still running
    ➢ USD 5 billion overall development costs
    ➢ USD 500 million for rocket + satellites
    ➢ Program delayed by years
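The narrowing conversion summarized above can be illustrated outside Ada. The following is a minimal Python sketch, not the flight software: the helper names `to_int16_checked` and `to_int16_saturating` are hypothetical, inputs are assumed to be finite numbers, and truncation toward zero is a simplification of how the original conversion rounds. It contrasts a conversion that fails loudly with one that clamps to the 16bit range.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16bit signed range, raising if the value does not fit.

    Hypothetical helper: the caller must handle OverflowError explicitly.
    """
    n = int(value)  # truncates toward zero
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit into a 16bit signed integer")
    return n


def to_int16_saturating(value: float) -> int:
    """Convert to a 16bit signed range, clamping out-of-range values.

    Hypothetical helper: clamping hides the overflow, so it only suits
    values where the range limit is an acceptable stand-in.
    """
    clamped = max(INT16_MIN, min(INT16_MAX, value))
    return int(clamped)
```

Which variant is appropriate depends on whether the caller can do something sensible with the error. The slide's point is that the overflow was neither prevented nor handled.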
  10. 2000 to 2005: FBI Virtual Case File

    ➢ Software system to manage all documents relating to cases being investigated by the FBI
    ➢ Modern web interface for 22,000 users to replace previous ACS system (which was already obsolete at introduction due to outdated technology)
    ➢ Estimated completion time: 22 months
    ➢ By 2005, 700,000 lines of code written, five different project leads in charge
  11. 2000 to 2005: FBI Virtual Case File

    ➢ VCF turns out to be incomplete, inadequate and poorly designed, essentially unusable under real-world conditions
    ➢ Even in rudimentary tests, the system did not comply with basic requirements
    ➢ After having invested USD 170 million, the FBI decided to buy off-the-shelf software instead
    ➢ Causes: No architecture blueprints, repeated changes in specification, engineers with little or no computer science training, code bloat, scope creep
  12. 2003: US Northeast Blackout

    ➢ Race condition in General Electric's Unix-based XA/21 energy management system
    ➢ Bug stalls FirstEnergy's control room alarm system – operators do not receive alerts any more
    ➢ Unprocessed events queue up and the primary server fails within 30 minutes
    ➢ Applications automatically transfer to the backup server, which itself fails
    ➢ Operator screen refresh rate drops from 1sec to 1min
    ➢ Operators hence dismiss a call about the tripping and reclosure of a 345 kV shared line
    ➢ More lines go offline in a chain reaction, undervoltage and overcurrent interpreted as a short circuit
    ➢ 30 minutes later 256 power plants are off-line, most due to automatic protective controls
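A race condition arises when concurrent writers touch shared state without coordination. The following Python sketch shows only the general remedy (mutual exclusion around a shared event list) and is in no way a reconstruction of the XA/21 code. The class `AlarmQueue` and the function `run_demo` are hypothetical names introduced for illustration.

```python
import threading


class AlarmQueue:
    """Hypothetical shared event list; every access goes through one lock."""

    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def publish(self, event):
        # Only one thread at a time may modify the list.
        with self._lock:
            self._events.append(event)

    def drain(self):
        # Swap the list out under the lock so no event is lost or seen twice.
        with self._lock:
            drained, self._events = self._events, []
        return drained


def run_demo(writers=4, events_per_writer=1000):
    """Start several writer threads and return how many events arrived."""
    queue = AlarmQueue()

    def produce(writer_id):
        for i in range(events_per_writer):
            queue.publish((writer_id, i))

    threads = [threading.Thread(target=produce, args=(w,)) for w in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(queue.drain())
```

With the lock in place, `run_demo()` accounts for every published event regardless of how the threads interleave.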
  13. 2005: WoW Glitch

    ➢ Game update on September 13th introduced new character “Hakkar”
    ➢ Hakkar was able to inflict a disease, “Corrupted Blood”, on player characters, draining their health points and finally killing them
    ➢ Disease could be passed to other players
    ➢ Effect was meant to be localized to one game area
    ➢ Developers didn't consider WoW teleporting functionality
    ➢ Infected players teleported into other areas, soon leading to corpses littering the streets
    ➢ Fortunately, player death is not permanent in WoW and admins reset the game
    ➢ (Virtual) death toll: unknown
  14. 2012: Knight Capital loses USD 440 million

    ➢ August 1st: New trading software installed
    ➢ Administrator forgets to deploy on one out of eight server nodes
    ➢ New code repurposed a flag previously used for testing scenarios
    ➢ On that one server node, old trading algorithm interprets flag differently and starts buying and selling 100 different stocks randomly without human verification
    ➢ NYSE has to suspend trade of several stocks
    ➢ Knight Capital loses USD 440 million in only 30 minutes, until system is suspended
    ➢ Investors have to raise USD 400 million in order to rescue the company
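The deployment gap described above is the kind of condition a pre-activation check can catch. The following Python sketch assumes a hypothetical inventory that maps node names to the build version each node reports. The names `nodes_out_of_date` and `can_enable_flag` are illustrative and do not refer to any real deployment tool or to Knight Capital's systems.

```python
from typing import List, Mapping


def nodes_out_of_date(deployed: Mapping[str, str], expected: str) -> List[str]:
    """Return the sorted names of nodes whose reported version differs."""
    return sorted(node for node, version in deployed.items() if version != expected)


def can_enable_flag(deployed: Mapping[str, str], expected: str) -> bool:
    """Allow a repurposed flag only if every node runs the expected version."""
    return not nodes_out_of_date(deployed, expected)
```

For eight nodes where seven report version `"2.0"` and one still reports `"1.0"`, `can_enable_flag(nodes, "2.0")` returns `False` and `nodes_out_of_date` names the stale node.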
  15. Why do SW projects fail (IEEE)

    ➢ Unrealistic or unarticulated project goals
    ➢ Inaccurate estimates of needed resources
    ➢ Badly defined system requirements
    ➢ Poor reporting of the project's status
    ➢ Unmanaged risks
    ➢ Poor communication among customers, developers, and users
    ➢ Use of immature technology
    ➢ Inability to handle the project's complexity
    ➢ Sloppy development practices
    ➢ Poor project management
    ➢ Stakeholder politics
    ➢ Commercial pressures