Smart Like A Fox: How clever students trick dumb automated programming assignment assessment systems

This case study reports on two first-semester programming courses with more than 190 students. Both courses used automated assessment of students' code submissions. We observed how students trick these systems by analyzing the version history of suspect submissions. By analyzing more than 3,300 submissions we revealed four astonishingly simple cheat patterns (overfitting, evasion, redirection, and injection) that students can use to trick automated programming assignment assessment systems (APAAS), and we also propose corresponding countermeasures. This immaturity of existing APAAS solutions may have implications for courses that rely heavily on automation, such as MOOCs. We therefore conclude that APAAS solutions should be examined much more from a security (code injection) point of view. Moreover, we identify the need to evolve existing unit testing frameworks into more evaluation-oriented teaching solutions that provide better cheat detection capabilities and differentiated grading support.

Nane Kratzke

March 02, 2019

Transcript

  1. Agenda: Introduction • Methodology • Analysis • Discussion, Counter Measures • Limitations, Conclusion. Presentation on SpeakerDeck. Preprint on ResearchGate. Presented at CSEDU 2019, Heraklion, Crete, Greece (2 – 4 May 2019).
  2. Introduction: • We are at a transition point between the industrialisation age and the digitisation age. • Computer-science-related skills are a vital asset in this context; one of these basic skills is practical programming. • The course sizes of university and college programming courses are steadily increasing. • Even MOOCs are used more frequently to convey necessary programming capabilities to students of different disciplines. • The coursework is composed of assignments that are highly suited to automatic assessment. • However, it is very often underestimated how astonishingly easy it is to trick these systems! The question arises whether "robots" certify the expertise to program or to cheat.
  3. Introduction: A small example to get your attention ... VPL == Virtual Programming Lab. • Count the occurrence of a character c in a String s. • Develop a method countChar(). How to get full points in Moodle/VPL? The same trick works for every assignment!
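For reference, a straightforward (case-sensitive) solution to this warm-up assignment could look like the following minimal sketch; the deck does not show the actual reference implementation, so class and method layout here are assumptions:

```java
// Minimal sketch of the countChar() assignment: count how often
// character c occurs in string s (case-sensitive variant).
public class Reference {
    public static int countChar(String s, char c) {
        int count = 0;
        for (char ch : s.toCharArray()) {
            if (ch == c) count++;   // count each matching character
        }
        return count;
    }
}
```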
  4. INTRODUCTION APAAS == Code Injection System. • APAAS solutions are systems that execute injected code (student submissions). • Code injection is known as a severe threat from a security point of view. • APAAS solutions protect the host system via sandbox mechanisms. • Much effort is invested in sophisticated code plagiarism detection and authorship control of student submissions. • But it was astonishing to see that APAAS solutions like VPL overlook the cheating cleverness of students. • The grading component can be cheated in a very straightforward way. • Unattended automated programming examinations must therefore be regarded as suspect.
  5. Methodology: • Two first-semester Java programming courses in the winter semester 2018/19: a regular computer science study programme (CS) and an information technology and design focused study programme (ITD). • In both courses we searched for student submissions that intentionally trick the grading component. • APAAS: Moodle/VPL (version 3.3.3). • To minimise Hawthorne and experimenter effects, neither the students nor the advisers were aware that they were part of this study. • Even if cheating was detected, this had no consequences for the students; it was not even communicated. • Students were unaware that the version history of their submissions was logged and analyzed.
  6. METHODOLOGY Searching for cheats: automated sample selection, manual sample analysis. • VPL submissions were downloaded from Moodle. • Python/Jupyter-based sample selection: S1 triggered evaluations, S2 maximum versions, S3 low average high end, S4 condition-related terms, S5 unusual terms (System.exit, ...), S6 random submissions. • NumPy, matplotlib, statistics, and Javaparser libraries. • Exported weekly into archived PDF documents (for manual analysis).
  7. ANALYSIS Continuous example assignment: Count the occurrence of a character c in a String s (not case-sensitive). We searched for solutions that differed significantly from the intended (reference) solution. The reference solution was used to check for correctness.
  8. ANALYSIS CHEAT PATTERN (1): Overfitting (63%). • Get a maximum of points but do not solve the given problem in a general way. • The solution is completely useless outside the scope of the test cases. • It simply maps input parameters to expected output parameters.
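A hypothetical overfitted submission for the countChar() assignment might look like the following; the hard-coded test inputs are invented for illustration, not taken from the actual course:

```java
// Hypothetical overfitted "solution": instead of solving the problem
// generally, it maps the grader's (guessed or observed) test inputs
// directly to the expected outputs.
public class Overfit {
    public static int countChar(String s, char c) {
        if (s.equals("banana") && c == 'a') return 3;  // known test case
        if (s.equals("Hello") && c == 'l') return 2;   // known test case
        return 0;  // useless outside the fixed test cases
    }
}
```

With fixed test data, this scores full points while being wrong for every other input.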
  9. ANALYSIS CHEAT PATTERN (2): Problem Evasion (30%). Example assignment: Count the occurrence of a character c in a String s recursively. The solution pretends to be recursive, but is merely a redirection to an overloaded method using loops (non-recursive). (Slide contrasts the intended solution with the evasion solution.)
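A sketch of such an evasion submission, assuming the assignment demands a recursive countChar(); the overload signature is illustrative:

```java
// Hypothetical evasion: the method with the required signature merely
// delegates to an overloaded helper that uses a plain loop, so the
// submission never actually recurses.
public class Evasion {
    // Looks like the requested recursive method ...
    public static int countChar(String s, char c) {
        return countChar(s, c, 0);  // ... but just redirects
    }

    // ... to an iterative (non-recursive) overload.
    private static int countChar(String s, char c, int ignored) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) count++;
        }
        return count;
    }
}
```

Output-based tests cannot distinguish this from a genuinely recursive solution, which is why the deck later proposes AST-based inspection.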
  10. ANALYSIS CHEAT PATTERN (3): Redirection (6%). (1) A small spelling error will result in compiler messages indicating that a specific method is expected by the test logic. (2) Compiler error messages can thus reveal the reference solution. (3) A clever student might now simply redirect the submission to the reference method (letting the grader evaluate itself).
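A sketch of a redirecting submission, assuming the leaked reference class is called Solution (a made-up name; the real name would come from the compiler messages):

```java
// Stand-in for the grader's hidden reference solution, whose name was
// revealed by compiler error messages. In a real attack this class
// already exists in the evaluation environment.
class Solution {
    static int countChar(String s, char c) {
        int count = 0;
        for (char ch : s.toCharArray()) if (ch == c) count++;
        return count;
    }
}

// Hypothetical redirecting submission: it delegates straight to the
// reference method, so the grader effectively evaluates itself.
public class Redirection {
    public static int countChar(String s, char c) {
        return Solution.countChar(s, c);  // redirect to the reference
    }
}
```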
  11. ANALYSIS CHEAT PATTERN (4): Injection (2%). Simply print the points you want to get, in an APAAS-specific format, on standard out. • Change the intended workflow of the evaluation logic. • Use the standard out stream to place text that is evaluated by the APAAS system. • The evaluator calls the code to be evaluated. • The submission code can print to standard out and then terminate further evaluation calls. • The evaluator parses standard out's content and will give full points! Some strings have a specific meaning for VPL.
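A sketch of such an injection submission. VPL parses grading directives from the evaluation output (lines such as "Grade :=>> 100"); the exact directive syntax shown here is from memory and may vary by VPL version, so treat it as illustrative:

```java
// Hypothetical injection: the submission writes grading directives to
// standard out, which the APAAS parses, and could then terminate the JVM
// so no further (honest) evaluation code runs.
public class Injection {
    public static int countChar(String s, char c) {
        System.out.println("Comment :=>> All tests passed.");  // fake comment
        System.out.println("Grade :=>> 100");                  // fake grade
        // System.exit(0);  // would abort the evaluator's remaining tests
        return 0;  // the real problem is never solved
    }
}
```

This works whenever the submission shares its standard out stream with the evaluation logic, which motivates the stream isolation countermeasure below.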
  12. DISCUSSION Counter Measures: • Overfitting: randomize test cases. • Problem Evasion: AST-based code inspection. • Redirection: AST-based code inspection. • Injection: separate standard out streams for evaluation and submission logic. A more detailed discussion can be found in the paper.
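The randomization countermeasure can be sketched in plain Java as follows (JEdUnit's actual DSL is not reproduced here; helper names are assumptions): fresh random inputs are generated per evaluation and the submission is compared against the reference, so an overfitted input-to-output mapping cannot anticipate the test data.

```java
import java.util.Random;
import java.util.function.BiFunction;

// Sketch of randomized test cases: compare a submission against the
// reference solution on freshly generated random inputs.
public class RandomizedCheck {
    public static int reference(String s, char c) {
        int n = 0;
        for (char ch : s.toCharArray()) if (ch == c) n++;
        return n;
    }

    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(26)));
        return sb.toString();
    }

    // True if the submission agrees with the reference on 'trials'
    // randomly generated cases; an overfitted mapping will fail.
    public static boolean agrees(BiFunction<String, Character, Integer> submission,
                                 int trials) {
        Random rnd = new Random();
        for (int i = 0; i < trials; i++) {
            String s = randomString(rnd, 1 + rnd.nextInt(20));
            char c = (char) ('a' + rnd.nextInt(26));
            if (!submission.apply(s, c).equals(reference(s, c))) return false;
        }
        return true;
    }
}
```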
  13. DISCUSSION JEdUnit (https://github.com/nkratzke/JEdUnit) is a unit testing framework with a special focus on educational aspects. It strives to simplify automatic evaluation of (small) Java programming assignments using Moodle/VPL. It is used and developed for programming classes at the Lübeck University of Applied Sciences. However, this framework might be helpful for other programming instructors, so it has been open-sourced.
  14. DISCUSSION Randomize Test Cases: Don't hard-code test values; instead, use the JEdUnit DSL to express randomized test values, e.g., by applying regular expressions inversely to generate random strings.
  15. DISCUSSION AST-based code inspections: E.g., don't allow students to bypass recursion requirements; inspect for and penalize the presence of loops. The JEdUnit DSL can express selectors on abstract syntax trees (ASTs) to check for the presence or absence of language constructs. The selector model of JEdUnit works similarly to how CSS selectors work on DOM trees.
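JEdUnit's real inspections operate on proper ASTs (the methodology slide mentions Javaparser); as a deliberately crude, dependency-free stand-in, the idea of flagging loop constructs in a "recursive" assignment can be sketched with a token scan. Note that this approximation can be fooled by comments or string literals, which is exactly why real AST selectors are preferable:

```java
import java.util.regex.Pattern;

// Crude stand-in for AST-based loop inspection: flag submissions to a
// "solve it recursively" assignment that contain loop keywords. A real
// implementation would query the parsed AST instead of scanning text.
public class LoopInspector {
    private static final Pattern LOOP = Pattern.compile("\\b(for|while|do)\\b");

    public static boolean containsLoop(String source) {
        return LOOP.matcher(source).find();
    }
}
```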
  16. DISCUSSION Isolation of submission and evaluation logic: In the VPL approach, the submission shares stdout with the evaluation process. In the JEdUnit approach, the submission logic gets an isolated fake console.
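The isolation idea can be sketched with the JDK's stream redirection (this is not JEdUnit's actual implementation, just a minimal illustration of the principle): before invoking submission code, System.out is swapped for a throwaway buffer, so nothing the submission prints can reach the stream the evaluator's verdict is written to.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Sketch of stdout isolation: run submission code against a captured,
// fake console and restore the real stream for the evaluator afterwards.
public class Sandbox {
    // Runs the submission with an isolated stdout; returns what it printed.
    public static String runIsolated(Runnable submission) {
        PrintStream original = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        PrintStream fake = new PrintStream(captured);
        System.setOut(fake);
        try {
            submission.run();
        } finally {
            fake.flush();
            System.setOut(original);  // evaluator keeps the real stdout
        }
        return captured.toString();
    }
}
```

Any injected "Grade" directive the submission prints ends up in the captured buffer, where the evaluator can inspect (and penalize) it instead of the APAAS parsing it as a grading command.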
  17. DISCUSSION Further Features of JEdUnit (https://github.com/nkratzke/JEdUnit): • Weighting of test cases (by annotations). • Checkstyle integration (weighted rules). • A DSL to formulate test cases in a check, explain, onError pattern, to randomize test cases, and to write arbitrary code inspections based on a selector model. • Predefined code inspections (switchable on/off): proper collection usage, loops, lambdas, inner classes, data fields, console output, etc. • Automated class structure comparison (OO use cases: comparing the structural equality of a multi-class submission with a multi-class reference solution).
  18. LIMITATIONS Threats to Validity: We searched qualitatively, not quantitatively, for cheat patterns. • Do not draw any conclusions about which cheat patterns occur at which level of programming expertise. • Do not draw any conclusions on the quantitative aspects of cheating. • The study does not claim to have identified all kinds of cheat patterns. • The study does not claim that all APAAS solutions have the same set of vulnerabilities; do not generalize Moodle/VPL-specific problems. • However, the Overfitting, Problem Evasion, Redirection, and Injection patterns can be used to check for vulnerabilities in other APAAS solutions.
  19. Conclusion: • We have to be aware that (even first-year) students are clever enough to trick automated grading solutions. • Cheat patterns: Overfitting, Problem Evasion, Redirection, Injection. • Options we currently investigate: randomise test cases, pragmatic code inspection, isolation of submission and evaluation logic. • Exactly these features seem to be provided only incompletely by current APAAS systems. JEdUnit: https://github.com/nkratzke/JEdUnit
  20. Acknowledgement: Presentation on SpeakerDeck. Preprint on ResearchGate. Advisers of the practical courses: David Engelhardt, Thomas Hamer, Clemens Stauner, Volker Völz, Patrick Willnow. Student tutors: Franz Bretterbauer, Francisco Cardoso, Jannik Gramann, Till Hahn, Thorleif Harder, Jan Steffen Krohn, Diana Meier, Jana Schwieger, Jake Stradling, and Janos Vinz. Picture references: Hacker: Pixabay.com (CC0); Robot: Pixabay.com (CC0).
  21. About: Nane Kratzke. Web: http://nane.kratzke.pages.mylab.th-luebeck.de/about • Twitter: @NaneKratzke • LinkedIn: https://de.linkedin.com/in/nanekratzke • GitHub: https://github.com/nkratzke • ResearchGate: https://www.researchgate.net/profile/Nane_Kratzke • SlideShare: http://de.slideshare.net/i21aneka