ICSE 2015: Evaluating Legal Implementation Readiness Decision-making

In 2015, ICSE decided to hold a journal-first track featuring papers published in TOSEM or TSE during the past year that had not previously been presented in a conference setting. Our paper was selected for this track, and this is that presentation. The abstract of the paper follows:

Software systems are increasingly regulated. Software engineers therefore must determine which requirements have met or exceeded their legal obligations and which requirements have not. Requirements that have met or exceeded their legal obligations are legally implementation ready, whereas requirements that have not met or exceeded their legal obligations need further refinement. In this paper, we examine how software engineers make these determinations using a multi-case study with three cases. Each case involves assessment of requirements for an electronic health record system that must comply with the U.S. Health Insurance Portability and Accountability Act (HIPAA) and is measured against the evaluations of HIPAA compliance subject matter experts. Our first case examines how individual graduate-level software engineering students assess whether the requirements met or exceeded their HIPAA obligations. Our second case replicates the findings from our first case using a different set of participants. Our third case examines how graduate-level software engineering students assess requirements using the Wideband Delphi approach to deriving consensus in groups. Our findings suggest that the average graduate-level software engineering student is ill-prepared to write legally compliant software with any confidence and that domain experts are an absolute necessity.

akmassey

May 22, 2015
Transcript

  1. Evaluating Legal Implementation Readiness Decision-making

    Aaron K. Massey, Paul N. Otto and Annie I. Antón
    May 22, 2015
    TSE: http://dx.doi.org/10.1109/TSE.2014.2383374
    akmassey@gatech.edu
    http://www.cc.gatech.edu/~akmassey
    @akmassey
  2. Overview

    © 2006-2014 Aaron Massey et al., Georgia Institute of Technology
    § Background and Motivation
    § Case #1: initial study
    § Case #2: replication study
    § Case #3: Wideband Delphi study
    § Summary and Findings
    § Questions
  3. Why examine regulatory compliance?

    § State of the Art in software engineering is not enough for regulatory compliance.
    § Software systems are failing to meet their legal obligations.
      – ChoicePoint data breach total cost: $28M
      – Target data breach total cost: $162M and counting
      – Anthem data breach total cost: ????
  4. Domain: Healthcare

    § Health Insurance Portability and Accountability Act (HIPAA)
      – Passed in 1996, Amended in 2009
      – Serious penalties for non-criminal violations
    § HIPAA Settlement Actions:
      – Concentra Health Services – $1.7M (April 2014)
      – New York and Presbyterian Hospital – $3.3M (May 2014)
      – Columbia University Hospital – $1.5M (May 2014)
    § Laws and regulations continue to evolve
      – Traditional Law: 21st Century Cures bill
      – Case Law: Emily Byrne vs. Avery Center for Obstetrics and Gynecology
  5. Research Question

    How are legal implementation readiness decisions made?
  6. Legal Implementation Readiness

    Requirements that meet or exceed their legal obligations are Legally Implementation Ready (LIR).
  7. Example LIR Requirement

    Consider Requirement A: iTrust shall generate a unique user ID and default password upon account creation by a system administrator.
    Traces to § 164.312(a)(1) and § 164.312(a)(2)(i)
    Relevant HIPAA Sections:
    (a)(1) Standard: Access control. Implement technical policies and procedures for electronic information systems that maintain electronic protected health information to allow access only to those persons or software programs that have been granted access rights as specified in § 164.308(a)(4).
    (2) Implementation specifications: (i) Unique user identification (Required). Assign a unique name and/or number for identifying and tracking user identity.
  8. Example Non-LIR Requirement

    Consider Requirement B: iTrust shall allow an authenticated user to change their user ID and password.
    Traces to § 164.312(a)(1) and § 164.312(a)(2)(i)
    Relevant HIPAA Sections:
    (a)(1) Standard: Access control. Implement technical policies and procedures for electronic information systems that maintain electronic protected health information to allow access only to those persons or software programs that have been granted access rights as specified in § 164.308(a)(4).
    (2) Implementation specifications: (i) Unique user identification (Required). Assign a unique name and/or number for identifying and tracking user identity.
  9. Cases and Participant Population

    § All participants had…
      – No prior experience with legal compliance
      – Coursework included:
        • 150 minutes of lectures on requirements engineering
        • 75 minutes of lectures on regulatory and policy compliance
    § Case #1: 32 graduate-level software engineering students
    § Case #2: 34 graduate-level software engineering students
    § Case #3: 14 graduate-level software engineering students
  10. Case Study Materials

    § 31 iTrust requirements traced to Health Insurance Portability and Accountability Act (HIPAA) § 164.312
    § Traceability Matrix
    § Text of HIPAA § 164.312
      – Familiarity [BA08, MA10a, MA10b, MOH10, MOA09, MA09a, MA09b]
      – Focuses on Technical Measures of protection
      – Complete, self-contained section of the legal text
    § Comparison with subject matter expert consensus
      – Three software engineers with HIPAA experience, one of whom is also a lawyer
      – Consensus achieved using Wideband Delphi
  11. Case #1: Research Questions

    § Is there consensus among:
      – [Q1] subject matter experts about which requirements are LIR?
      – [Q2] graduate students about which requirements are LIR?
    § [Q3] Can graduate students accurately assess which requirements are LIR?
  12. Results: Consensus among Subject Matter Experts

    § Q1: Is there consensus among subject matter experts on which requirements are LIR?
    § Result: Moderate agreement among the experts about the requirements prior to the discussion session.
      – Fleiss’ Kappa = 0.517 (p < 0.0001)
      – Universal agreement on 19 of the 31 requirements
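A minimal sketch of how a Fleiss' kappa value such as the one above can be computed, using the standard textbook formula. The function name and the toy rating counts are illustrative assumptions; they are not the study's data or tooling.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (e.g. LIR / not LIR)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy data: 3 raters, 4 requirements, columns = [LIR, not LIR].
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

With these toy counts, raters fully agree on two requirements and split 2-1 on the other two, which gives a kappa of one third.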
  13. Case #1 Results: Consensus among participants

    § Q2: Is there consensus among participants on which requirements are LIR?
    § Result: Slight agreement about the requirements.
      – Fleiss’ Kappa = 0.0792 (p < 0.0001)
      – Only somewhat better than “agreement” found in perfectly random responses.
  14. Case #1 Results: Assessment of LIR

    § Q3: Can graduate students accurately assess which requirements are LIR?
    § Measurement: Best possible voting cutoff (20/32)
    § Result: Students cannot accurately assess the LIR status of a requirement and are more likely to miss requirements that are not LIR.
      – Average Cohen’s Kappa = 0.110
      – Voting Cohen’s Kappa: 0.357 (fair), Agreement 67.74%
        • Best Individual: Kappa 0.362, Agreement 67.74%
        • Third Quartile: Kappa 0.219, Agreement 61.29%
      – Sensitivity = 0.714, Specificity = 0.647
  15. Case #2: Replication Study Design

    § Study Goal: Repeat our previous results with a different graduate student population.
    § Focusing on two Research Questions:
      – [Q2] Is there consensus among graduate students about which requirements are LIR?
      – [Q3] Can graduate students accurately assess which requirements are LIR?
  16. Case #2 Results: Consensus among participants

    § Q2: Is there consensus among participants on which requirements are LIR?
    § Result: Slight agreement about the requirements.
      – Fleiss’ κ = 0.114 (p < 0.0001)
      – Marginally better than the agreement between participants in the LIR Assessment Case Study
    These results confirm our previous findings.
  17. Case #2 Results: Assessment of LIR

    § Q3: Can graduate students accurately assess which requirements are LIR?
    § Measurement: Best possible voting cutoff (19/34)
    § Result: Students cannot accurately assess the LIR status of a requirement and are more likely to miss requirements that are not LIR.
      – Average Cohen’s Kappa = 0.103
      – Voting Cohen’s Kappa: 0.413 (fair), Agreement 71.0%
        • Best Individual: Kappa 0.545, Agreement 77.4%
        • Third Quartile: Kappa 0.218, Agreement 61.29%
      – Sensitivity = 0.556, Specificity = 0.508
    These results confirm our previous findings.
  18. Case #3: Research Questions

    § Study Goal: Examine how LIR decisions are made in groups working together
    § [Q4] Can graduate students working together using a Wideband Delphi method accurately assess which requirements are LIR?
      – Q4 mirrors Q3 from the previous study. The difference is that this study uses Wideband Delphi for consensus, whereas the previous study used the “best” voting cutoff
    § [Q5] What is the extent of the discussion on requirements during the application of the Wideband Delphi method?
  19. Case #3: Wideband Delphi Study Design

    § 14 graduate student participants
      – All participants made an initial determination for each of the 31 requirements.
      – For each requirement, participants either:
        • Achieved unanimous consensus that the requirement was LIR or was not LIR
        • Were unable to achieve unanimous consensus, which everyone agreed meant the requirement should be considered not LIR
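The decision rule described for this design can be sketched as a small function: a requirement is treated as LIR only when every participant's final vote is LIR, and any lack of unanimity falls back to not LIR. The function name and return shape are illustrative assumptions, not part of the study materials.

```python
from typing import Sequence, Tuple


def delphi_outcome(final_votes: Sequence[bool]) -> Tuple[bool, bool]:
    """Return (is_lir, consensus_reached) for one requirement.

    final_votes holds each participant's vote after discussion
    (True = LIR, False = not LIR).
    """
    if not final_votes:
        raise ValueError("at least one vote is required")
    all_lir = all(final_votes)
    none_lir = not any(final_votes)
    consensus_reached = all_lir or none_lir
    # Without unanimity the requirement is treated as not LIR.
    return all_lir, consensus_reached


unanimous_lir = delphi_outcome([True] * 14)
unanimous_not_lir = delphi_outcome([False] * 14)
split_group = delphi_outcome([True] * 13 + [False])
```

Under this rule a single dissenting participant is enough to mark a requirement as not LIR, so the group outcome can only be as permissive as its most cautious member.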
  20. Results: Wideband Delphi Assessment of LIR

    § Q4: Can graduate students working together using a Wideband Delphi method accurately assess which requirements are LIR?
    § Result: Students cannot accurately assess the LIR status of a requirement and are more likely to miss requirements that are not LIR.
      – Cohen’s Kappa 0.111, and Agreement 54.8%
      – Sensitivity = 0.625, Specificity = 0.522
    § The participants were much more conservative when working together to achieve consensus than they were individually.
  21. Results: Consensus among Wideband Delphi Participants

    § Q5: What is the extent of the discussion on requirements during the application of the Wideband Delphi method?
    § Result: Fair agreement among the participants about the requirements prior to the discussion session.
      – Fleiss’ Kappa = 0.252 (p < 0.0001)
      – Recall: Experts from LIR Assessment Case Study started at Fleiss’ Kappa = 0.517 (p < 0.0001)
    § Unable to achieve consensus on 7 of the 31 requirements after discussion
  22. Results Summary

    Case                  Percent Agreement  Cohen’s Kappa  Sensitivity  Specificity
    Case #1 (Average)     55.95%             0.110          0.576        0.548
    Case #1 (Best Vote)   67.74%             0.357          0.714        0.647
    Case #2 (Average)     55.69%             0.103          0.556        0.509
    Case #2 (Best Vote)   70.94%             0.413          0.667        0.800
    Case #3 (Average)     51.25%             0.044          0.521        0.492
    Case #3 (Delphi)      54.84%             0.111          0.625        0.522
  23. Findings

    § First Case: graduate-level software engineers are ill-prepared to make LIR determinations.
      – Maxwell found professional engineers are similarly ill-prepared to identify cross references in legal texts.
    § Second Case (Replication Study): confirms our findings from the First Case.
    § Third Case (Wideband Delphi Study): Wideband Delphi consensus technique slightly improves LIR assessment accuracy.
      – But our findings suggest that using the Wideband Delphi method to achieve consensus results in overly cautious assessments.
    Subject Matter Experts are critical!
  24. None
  25. Thank you! Questions?

    Aaron K. Massey, Paul N. Otto and Annie I. Antón
    May 22, 2015
    TSE: http://dx.doi.org/10.1109/TSE.2014.2383374
    akmassey@gatech.edu
    http://www.cc.gatech.edu/~akmassey
    @akmassey