Smart Like A Fox: How clever students trick dumb automated programming assignment assessment systems

This case study reports on two first-semester programming courses with more than 190 students. Both courses used automated assessment of students' code submissions. We observed how students trick these systems by analyzing the version history of suspect submissions. By analyzing more than 3,300 submissions we revealed four astonishingly simple cheat patterns (overfitting, evasion, redirection, and injection) that students can use to trick automated programming assignment assessment systems (APAAS), and we also propose corresponding countermeasures. This immaturity of existing APAAS solutions may have implications for courses that rely heavily on automation, such as MOOCs. We therefore conclude that APAAS solutions should be examined much more from a security (code injection) point of view. Moreover, we identify the need to evolve existing unit testing frameworks into more evaluation-oriented teaching solutions that provide better cheat detection capabilities and differentiated grading support.

Nane Kratzke

March 02, 2019

Transcript

  1. Agenda: Introduction • Methodology • Analysis • Discussion, Counter Measures • Limitations, Conclusion. Presentation on SpeakerDeck. Preprint on ResearchGate. Presented at CSEDU 2019, Heraklion, Crete, Greece (2 – 4 May 2019).
  2. Introduction: • We are at a transition point between the industrialisation age and the digitisation age. • Computer-science-related skills are a vital asset in this context; one of these basic skills is practical programming. • The course sizes of university and college programming courses are steadily increasing. • Even MOOCs are used more frequently to convey necessary programming capabilities to students of different disciplines. • The coursework is composed of assignments that are highly suited to automatic assessment. • However, it is very often underestimated how astonishingly easy it is to trick these systems! The question arises whether "robots" certify the expertise to program or to cheat.
  3. Introduction: A small example to get your attention ... VPL == Virtual Programming Lab. • Count the occurrence of a character c in a String s. • Develop a method countChar(). How to get full points in Moodle/VPL? The same trick works for every assignment!
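For reference, a straightforward (case-sensitive) solution to this warm-up assignment could look like the following minimal sketch; the deck does not show the actual reference implementation, so class and method layout here are assumptions:

```java
// Minimal sketch of the countChar() assignment: count how often
// character c occurs in string s (case-sensitive variant).
public class Reference {
    public static int countChar(String s, char c) {
        int count = 0;
        for (char ch : s.toCharArray()) {
            if (ch == c) count++;   // count each matching character
        }
        return count;
    }
}
```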
  4. INTRODUCTION APAAS == Code Injection System. • APAAS solutions are systems that execute injected code (student submissions). • Code injection is known as a severe threat from a security point of view. • APAAS solutions protect the host system via sandbox mechanisms. • Much effort is invested in sophisticated code plagiarism detection and authorship control of student submissions. • But it was astonishing to see that APAAS solutions like VPL overlook the cheating cleverness of students. • The grading component can be cheated in a very straightforward way. • Unattended automated programming examinations must therefore be regarded as suspect.
  5. Methodology: • Two first-semester Java programming courses in the winter semester 2018/19: a regular computer science study programme (CS) and an information technology and design focused study programme (ITD). • In both courses we searched for student submissions that intentionally trick the grading component. • APAAS: Moodle/VPL (version 3.3.3). • To minimise Hawthorne and experimenter effects, neither the students nor the advisers were aware that they were part of this study. • Even if cheating was detected, this had no consequences for the students; it was not even communicated. • Students were unaware that the version history of their submissions was logged and analyzed.
  6. METHODOLOGY Searching for cheats: automated sample selection, manual sample analysis. • VPL submissions were downloaded from Moodle. • Python/Jupyter-based sample selection: S1 triggered evaluations, S2 maximum versions, S3 low average high end, S4 condition-related terms, S5 unusual terms (System.exit, ...), S6 random submissions. • NumPy, matplotlib, statistics, and Javaparser libraries. • Exported weekly into archived PDF documents (for manual analysis).
  7. ANALYSIS Continuous example assignment: Count the occurrence of a character c in a String s (not case-sensitive). We searched for solutions that differed significantly from the intended (reference) solution. The reference solution was used to check for correctness.
  8. ANALYSIS CHEAT PATTERN (1): Overfitting (63%). • Get a maximum of points but do not solve the given problem in a general way. • The solution is completely useless outside the scope of the test cases. • It simply maps input parameters to expected output parameters.
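A hypothetical overfitted submission for the countChar() assignment might look like the following; the hard-coded test inputs are invented for illustration, not taken from the actual course:

```java
// Hypothetical overfitted "solution": instead of solving the problem
// generally, it maps the grader's (guessed or observed) test inputs
// directly to the expected outputs.
public class Overfit {
    public static int countChar(String s, char c) {
        if (s.equals("banana") && c == 'a') return 3;  // known test case
        if (s.equals("Hello") && c == 'l') return 2;   // known test case
        return 0;  // useless outside the fixed test cases
    }
}
```

With fixed test data, this scores full points while being wrong for every other input.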
  9. ANALYSIS CHEAT PATTERN (2): Problem Evasion (30%). Example assignment: Count the occurrence of a character c in a String s recursively. The solution pretends to be recursive, but is merely a redirection to an overloaded method using loops (non-recursive). (Slide contrasts the intended solution with the evasion solution.)
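A sketch of such an evasion submission, assuming the assignment demands a recursive countChar(); the overload signature is illustrative:

```java
// Hypothetical evasion: the method with the required signature merely
// delegates to an overloaded helper that uses a plain loop, so the
// submission never actually recurses.
public class Evasion {
    // Looks like the requested recursive method ...
    public static int countChar(String s, char c) {
        return countChar(s, c, 0);  // ... but just redirects
    }

    // ... to an iterative (non-recursive) overload.
    private static int countChar(String s, char c, int ignored) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) count++;
        }
        return count;
    }
}
```

Output-based tests cannot distinguish this from a genuinely recursive solution, which is why the deck later proposes AST-based inspection.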
  10. ANALYSIS CHEAT PATTERN (3): Redirection (6%). (1) A small spelling error will result in compiler messages indicating that a specific method is expected by the test logic. (2) Compiler error messages can thus reveal the reference solution. (3) A clever student might now simply redirect the submission to the reference method (letting the grader evaluate itself).
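A sketch of a redirecting submission, assuming the leaked reference class is called Solution (a made-up name; the real name would come from the compiler messages):

```java
// Stand-in for the grader's hidden reference solution, whose name was
// revealed by compiler error messages. In a real attack this class
// already exists in the evaluation environment.
class Solution {
    static int countChar(String s, char c) {
        int count = 0;
        for (char ch : s.toCharArray()) if (ch == c) count++;
        return count;
    }
}

// Hypothetical redirecting submission: it delegates straight to the
// reference method, so the grader effectively evaluates itself.
public class Redirection {
    public static int countChar(String s, char c) {
        return Solution.countChar(s, c);  // redirect to the reference
    }
}
```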
  11. ANALYSIS CHEAT PATTERN (4): Injection (2%). Simply print the points you want to get, in an APAAS-specific format, on standard out. • Change the intended workflow of the evaluation logic. • Use the standard out stream to place text that is evaluated by the APAAS system. • The evaluator calls the code to be evaluated. • The submission code can print to standard out and then terminate further evaluation calls. • The evaluator parses standard out's content and will give full points! Some strings have a specific meaning for VPL.
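A sketch of such an injection submission. VPL parses grading directives from the evaluation output (lines such as "Grade :=>> 100"); the exact directive syntax shown here is from memory and may vary by VPL version, so treat it as illustrative:

```java
// Hypothetical injection: the submission writes grading directives to
// standard out, which the APAAS parses, and could then terminate the JVM
// so no further (honest) evaluation code runs.
public class Injection {
    public static int countChar(String s, char c) {
        System.out.println("Comment :=>> All tests passed.");  // fake comment
        System.out.println("Grade :=>> 100");                  // fake grade
        // System.exit(0);  // would abort the evaluator's remaining tests
        return 0;  // the real problem is never solved
    }
}
```

This works whenever the submission shares its standard out stream with the evaluation logic, which motivates the stream isolation countermeasure below.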
  12. DISCUSSION Counter Measures: • Overfitting: randomize test cases. • Problem Evasion: AST-based code inspection. • Redirection: AST-based code inspection. • Injection: separate standard out streams for evaluation and submission logic. A more detailed discussion can be found in the paper.
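The randomization countermeasure can be sketched in plain Java as follows (JEdUnit's actual DSL is not reproduced here; helper names are assumptions): fresh random inputs are generated per evaluation and the submission is compared against the reference, so an overfitted input-to-output mapping cannot anticipate the test data.

```java
import java.util.Random;
import java.util.function.BiFunction;

// Sketch of randomized test cases: compare a submission against the
// reference solution on freshly generated random inputs.
public class RandomizedCheck {
    public static int reference(String s, char c) {
        int n = 0;
        for (char ch : s.toCharArray()) if (ch == c) n++;
        return n;
    }

    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(26)));
        return sb.toString();
    }

    // True if the submission agrees with the reference on 'trials'
    // randomly generated cases; an overfitted mapping will fail.
    public static boolean agrees(BiFunction<String, Character, Integer> submission,
                                 int trials) {
        Random rnd = new Random();
        for (int i = 0; i < trials; i++) {
            String s = randomString(rnd, 1 + rnd.nextInt(20));
            char c = (char) ('a' + rnd.nextInt(26));
            if (!submission.apply(s, c).equals(reference(s, c))) return false;
        }
        return true;
    }
}
```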
  13. DISCUSSION JEdUnit (https://github.com/nkratzke/JEdUnit) is a unit testing framework with a special focus on educational aspects. It strives to simplify automatic evaluation of (small) Java programming assignments using Moodle/VPL. It is used and developed for programming classes at the Lübeck University of Applied Sciences. However, this framework might be helpful for other programming instructors, so it has been open-sourced.
  14. DISCUSSION Randomize Test Cases: Don't hard-code test values; instead, use the JEdUnit DSL to express randomized test values, e.g., by applying regular expressions inversely to generate random strings.
  15. DISCUSSION AST-based code inspections: E.g., don't allow students to bypass recursion requirements; inspect for and penalize the presence of loops. The JEdUnit DSL can express selectors on abstract syntax trees (ASTs) to check for the presence or absence of language constructs. The selector model of JEdUnit works similarly to how CSS selectors work on DOM trees.
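JEdUnit's real inspections operate on proper ASTs (the methodology slide mentions Javaparser); as a deliberately crude, dependency-free stand-in, the idea of flagging loop constructs in a "recursive" assignment can be sketched with a token scan. Note that this approximation can be fooled by comments or string literals, which is exactly why real AST selectors are preferable:

```java
import java.util.regex.Pattern;

// Crude stand-in for AST-based loop inspection: flag submissions to a
// "solve it recursively" assignment that contain loop keywords. A real
// implementation would query the parsed AST instead of scanning text.
public class LoopInspector {
    private static final Pattern LOOP = Pattern.compile("\\b(for|while|do)\\b");

    public static boolean containsLoop(String source) {
        return LOOP.matcher(source).find();
    }
}
```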
  16. DISCUSSION Isolation of submission and evaluation logic: In the VPL approach, the submission shares stdout with the evaluation process. In the JEdUnit approach, the submission logic gets an isolated fake console.
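The isolation idea can be sketched with the JDK's stream redirection (this is not JEdUnit's actual implementation, just a minimal illustration of the principle): before invoking submission code, System.out is swapped for a throwaway buffer, so nothing the submission prints can reach the stream the evaluator's verdict is written to.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Sketch of stdout isolation: run submission code against a captured,
// fake console and restore the real stream for the evaluator afterwards.
public class Sandbox {
    // Runs the submission with an isolated stdout; returns what it printed.
    public static String runIsolated(Runnable submission) {
        PrintStream original = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        PrintStream fake = new PrintStream(captured);
        System.setOut(fake);
        try {
            submission.run();
        } finally {
            fake.flush();
            System.setOut(original);  // evaluator keeps the real stdout
        }
        return captured.toString();
    }
}
```

Any injected "Grade" directive the submission prints ends up in the captured buffer, where the evaluator can inspect (and penalize) it instead of the APAAS parsing it as a grading command.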
  17. DISCUSSION Further Features of JEdUnit (https://github.com/nkratzke/JEdUnit): • Weighting of test cases (by annotations). • Checkstyle integration (weighted rules). • A DSL to formulate test cases in a check, explain, onError pattern, to randomize test cases, and to write arbitrary code inspections based on a selector model. • Predefined code inspections (switchable on/off): proper collection usage, loops, lambdas, inner classes, data fields, console output, etc. • Automated class structure comparison (OO use cases: comparing the structural equality of a multi-class submission with a multi-class reference solution).
  18. LIMITATIONS Threats to Validity: We searched qualitatively, not quantitatively, for cheat patterns. • Do not draw any conclusions about which cheat patterns occur at which level of programming expertise. • Do not draw any conclusions on the quantitative aspects of cheating. • The study does not claim to have identified all kinds of cheat patterns. • The study does not claim that all APAAS solutions have the same set of vulnerabilities; do not generalize Moodle/VPL-specific problems. • However, the Overfitting, Problem Evasion, Redirection, and Injection patterns can be used to check for vulnerabilities in other APAAS solutions.
  19. Conclusion: • We have to be aware that (even first-year) students are clever enough to trick automated grading solutions. • Cheat patterns: Overfitting, Problem Evasion, Redirection, Injection. • Options we currently investigate: randomise test cases, pragmatic code inspection, isolation of submission and evaluation logic. • Exactly these features seem to be provided only incompletely by current APAAS systems. JEdUnit: https://github.com/nkratzke/JEdUnit
  20. Acknowledgement: Presentation on SpeakerDeck. Preprint on ResearchGate. Advisers of the practical courses: David Engelhardt, Thomas Hamer, Clemens Stauner, Volker Völz, Patrick Willnow. Student tutors: Franz Bretterbauer, Francisco Cardoso, Jannik Gramann, Till Hahn, Thorleif Harder, Jan Steffen Krohn, Diana Meier, Jana Schwieger, Jake Stradling, and Janos Vinz. Picture references: Hacker: Pixabay.com (CC0); Robot: Pixabay.com (CC0).
  21. About: Nane Kratzke. Web: http://nane.kratzke.pages.mylab.th-luebeck.de/about • Twitter: @NaneKratzke • LinkedIn: https://de.linkedin.com/in/nanekratzke • GitHub: https://github.com/nkratzke • ResearchGate: https://www.researchgate.net/profile/Nane_Kratzke • SlideShare: http://de.slideshare.net/i21aneka