Slide 1

KNOWLEDGE INFERENCE Amar Lalwani

Slide 2

Goal Of Knowledge Inference • Measuring what a student knows at a specific time • Measuring what relevant knowledge components a student knows at a specific time

Slide 3

Knowledge Component • Anything a student can know that is meaningful to the current learning situation • Skill • Fact • Concept • Principle • Schema

Slide 4

Knowledge Inference • Knowledge is latent • Not directly measurable

Slide 5

Why measure student knowledge? • Enhancing student knowledge is the primary goal of education • Measure the efficacy of the system • Report to stakeholders and instructors • Make automated pedagogical decisions

Slide 6

Different from measuring performance • Inferring whether a student’s performance right now is associated with successfully demonstrating a skill • Not the same as knowing whether the student has the latent skill • Guessing • Slipping (carelessness, lack of concentration)

Slide 7

How do we get at latent knowledge? • Can’t measure it directly • Can’t look directly into the brain, yet! • But, can look at the performance • Performance over time • More information than performance at one specific instant

Slide 8

Bayesian Knowledge Tracing • The classical approach for measuring tightly defined skills in online learning • Based on the idea that practice on a skill leads to mastery of that skill • Goal: track student knowledge over time • Measuring how well a student knows a specific skill/knowledge component at a specific time • Based on their past history of performance with that skill/KC

Slide 9

Tightly defined skills • Unlike Item Response Theory • The goal is not to measure overall skill for a broadly-defined construct • Such as arithmetic • But to measure a specific skill or knowledge component • Such as addition of two-digit numbers where no carrying is needed

Slide 10

Typical use of BKT • Assess a student’s knowledge of skill/KC X • Based on a sequence of items that are dichotomously scored • E.g. the student can get a score of 0 or 1 on each item • Where each item corresponds to a single skill • Where the student can learn on each item, due to help, feedback, scaffolding, etc.

Slide 11

Key Assumptions • Single latent trait/skill per item • Each skill has four parameters • From these skill parameters and the student’s historical performance, we can compute • Latent knowledge P(Ln) • The probability P(correct) that the learner will get the item correct

Slide 12

Key Assumptions • Two-state learning model • Each skill is either learned or unlearned • Each problem is an opportunity for the student to apply the skill and hence learn it • Once a skill is known, the student does not forget it • Guess and slip are possible

Slide 13

BKT • For some skill K • Given the student’s response sequence from 1 to n, predict response n+1 • [figure: chronological response sequence for student Y, e.g. 0 0 0 1 1 1 … 1 at opportunities 1 to n, with opportunity n+1 marked ?; 0 = incorrect response, 1 = correct response]

Slide 14

BKT • Track knowledge over time (a model of learning) • [figure: estimated knowledge plotted over the response sequence 0 0 0 1 1 1 1]

Slide 15

BKT as a Knowledge Tracing network • [figure: a chain of latent knowledge nodes K, each emitting an observed question node Q; knowledge nodes are linked by learning transitions P(T), with initial knowledge P(L0) and emission parameters P(G) and P(S)] • Node representations: K = knowledge node (latent), Q = question node (observed) • Node states: K = two states (0 or 1), Q = two states (0 or 1)

Slide 16

Knowledge Tracing • [figure: the same network as the previous slide, with its parameters labeled] • Four parameters of the KT model: P(L0) = probability of initial knowledge, P(T) = probability of learning, P(G) = probability of guess, P(S) = probability of slip • The probability of forgetting is assumed to be zero (fixed)

Slide 17

Simple HMM • [figure: two-state hidden Markov model. Hidden states: Learned (knows) and Unlearned (does not know), with initial probabilities P(L0) and 1−P(L0), and transition P(T) from Unlearned to Learned. Observations: Learned emits Correct with probability 1−P(S) and Incorrect with P(S); Unlearned emits Correct with P(G) and Incorrect with 1−P(G)]

Slide 18

BKT • Formulas for inference and prediction:

$$P(L_{n-1} \mid \text{correct}_n) = \frac{P(L_{n-1}) \cdot (1 - P(S))}{P(L_{n-1}) \cdot (1 - P(S)) + (1 - P(L_{n-1})) \cdot P(G)} \tag{1}$$

$$P(L_{n-1} \mid \text{incorrect}_n) = \frac{P(L_{n-1}) \cdot P(S)}{P(L_{n-1}) \cdot P(S) + (1 - P(L_{n-1})) \cdot (1 - P(G))} \tag{2}$$

$$P(L_n) = P(L_{n-1} \mid \text{evidence}_n) + \left(1 - P(L_{n-1} \mid \text{evidence}_n)\right) \cdot P(T) \tag{3}$$
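To make equations (1)–(3) concrete, here is a minimal Python sketch of the update; the function names (bkt_update, bkt_predict_correct) and structure are illustrative, not from the slides.

```python
def bkt_update(p_L, correct, p_T, p_G, p_S):
    """One BKT step: condition P(L) on the observed response
    (equations 1 and 2), then apply the learning transition (equation 3)."""
    if correct:
        # Equation (1): evidence from a correct response
        p_L_obs = p_L * (1 - p_S) / (p_L * (1 - p_S) + (1 - p_L) * p_G)
    else:
        # Equation (2): evidence from an incorrect response
        p_L_obs = p_L * p_S / (p_L * p_S + (1 - p_L) * (1 - p_G))
    # Equation (3): the student may also learn on this opportunity
    return p_L_obs + (1 - p_L_obs) * p_T

def bkt_predict_correct(p_L, p_G, p_S):
    """Standard BKT prediction: know the skill and don't slip,
    or don't know it and guess."""
    return p_L * (1 - p_S) + (1 - p_L) * p_G
```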

Slide 19

BKT • Predicting current student correctness • Whenever the student has an opportunity to use the skill • The probability that the student knows the skill is updated

Slide 20

Example

Slide 21

Influence of Parameter Values • P(L0): 0.50, P(T): 0.20, P(G): 0.14, P(S): 0.09 • Estimate of knowledge for a student with response sequence: 0 1 1 1 1 1 1 1 1 1 • The student reached 95% probability of knowledge at the 4th opportunity
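The slide’s numbers can be reproduced with the hypothetical bkt_update sketch from slide 18:

```python
p_L, params = 0.50, dict(p_T=0.20, p_G=0.14, p_S=0.09)
for n, correct in enumerate([0, 1, 1, 1, 1, 1, 1, 1, 1, 1], start=1):
    print(f"knowledge estimate entering opportunity {n}: {p_L:.3f}")
    p_L = bkt_update(p_L, correct, **params)
# Estimates entering opportunities 1-4 are roughly 0.500, 0.276, 0.770, 0.965,
# crossing the 95% threshold at the 4th opportunity, as stated above.
```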

Slide 22

Influence of Parameter Values • Previous model: P(L0): 0.50, P(T): 0.20, P(G): 0.14, P(S): 0.09 • New model: P(L0): 0.50, P(T): 0.20, P(G): 0.64, P(S): 0.03 • Estimate of knowledge for a student with response sequence: 0 1 1 1 1 1 1 1 1 1 • With the higher guess probability, the student reached 95% probability of knowledge only at the 8th opportunity

Slide 23

BKT • Only uses the first problem attempt on each item • Throws out information… • But uses the clearest information…

Slide 24

Parameter Constraints • Typically, the potential values of BKT parameters are constrained • To avoid model degeneracy • A knowledge model is degenerate when it violates the basic idea of BKT • E.g., when knowing a skill leads to worse performance • Or when getting an item wrong means the student knows the skill

Slide 25

Constraints Proposed • P(G) + P(S) < 1.0 • P(G) < 0.5, P(S) < 0.5 • P(G) < 0.3, P(S) < 0.1
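One possible way to encode these proposals when fitting a model; the helper name and default bounds are illustrative, and tighter bounds can be passed in.

```python
def violates_constraints(p_G, p_S, max_G=0.5, max_S=0.5):
    """True if a candidate guess/slip pair breaks the proposed constraints,
    e.g. P(G) + P(S) >= 1.0; use max_G=0.3, max_S=0.1 for the strictest set."""
    return (p_G + p_S >= 1.0) or (p_G >= max_G) or (p_S >= max_S)
```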

Slide 26

Knowledge Tracing • How do we know if a knowledge tracing model is any good? • Our primary goal is to predict knowledge • But knowledge is latent • So we instead check our knowledge predictions • by checking how well the model predicts performance

Slide 27

Fitting the Model • EM (Expectation Maximization) Algorithm • Grid Search • Genetic Algorithms
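As one example among the options above, a brute-force grid search can be sketched as follows; it reuses the hypothetical bkt_update, bkt_predict_correct, and violates_constraints helpers from earlier slides and scores each candidate by the log-likelihood of the observed responses.

```python
import itertools
import math

def sequence_log_likelihood(seq, p_L0, p_T, p_G, p_S):
    """Log-likelihood of one dichotomous response sequence under BKT."""
    p_L, ll = p_L0, 0.0
    for correct in seq:
        p_c = bkt_predict_correct(p_L, p_G, p_S)
        ll += math.log(p_c if correct else 1 - p_c)
        p_L = bkt_update(p_L, correct, p_T, p_G, p_S)
    return ll

def grid_search_bkt(sequences, step=0.05):
    """Brute-force fit: evaluate every parameter combination on a grid and
    keep the one that maximizes the likelihood of the observed responses."""
    grid = [round(i * step, 3) for i in range(1, round(1 / step))]
    best, best_ll = None, float("-inf")
    for p_L0, p_T, p_G, p_S in itertools.product(grid, repeat=4):
        if violates_constraints(p_G, p_S):  # skip degenerate candidates
            continue
        ll = sum(sequence_log_likelihood(s, p_L0, p_T, p_G, p_S) for s in sequences)
        if ll > best_ll:
            best, best_ll = (p_L0, p_T, p_G, p_S), ll
    return best
```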

Slide 28

Performance Factor Analysis (PFA) • An alternative to BKT • Addresses some of the limitations of BKT • But does not have all of the nice features of BKT

Slide 29

PFA • Measures how much latent skill a student has, while they are learning • But expresses it in terms of probability of correctness, the next time the skill is encountered • No direct expression of the amount of latent skill, except this probability of correctness

Slide 30

Key Assumptions • Each item may involve multiple latent skills or knowledge components • Different from BKT • Each skill has success learning rate γ and failure learning rate ρ • Different from BKT where learning rate is the same, success or failure

Slide 31

Key Assumptions • There is also a difficulty parameter β, but its semantics can vary • From these parameters, and the number of successes and failures the student has had on each relevant skill so far, we can compute the probability P(m) that the learner will get the item correct

Slide 32

PFA

$$m(i, j \in KCs, s, f) = \sum_{j \in KCs} \left( \beta_j + \gamma_j \, s_{i,j} + \rho_j \, f_{i,j} \right)$$

$$P(m) = \frac{1}{1 + e^{-m}}$$

where $s_{i,j}$ and $f_{i,j}$ are student $i$’s prior successes and failures on skill $j$.
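A minimal Python sketch of these two formulas; the data layout, function name, and parameter values are assumptions for illustration.

```python
import math

def pfa_probability(skills, successes, failures):
    """PFA: sum each relevant skill's difficulty (beta) plus its success
    (gamma) and failure (rho) learning rates weighted by the student's
    prior counts, then squash through the logistic function."""
    m = sum(
        p["beta"] + p["gamma"] * successes[j] + p["rho"] * failures[j]
        for j, p in skills.items()
    )
    return 1 / (1 + math.exp(-m))

# Hypothetical item tapping one skill: 2 prior successes, 1 prior failure.
skills = {"carrying": {"beta": -0.5, "gamma": 0.3, "rho": 0.1}}
print(pfa_probability(skills, {"carrying": 2}, {"carrying": 1}))  # ~0.55
```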

Slide 33

Example

Slide 34

Degenerate Example

Slide 35

Negative Learning

Slide 36

Key Points • Values of ρ below 0 don’t actually mean negative learning • They mean that a failure provides more evidence of lack of knowledge than the learning opportunity causes improvement • Parameters in PFA combine information about correctness with improvement from practice • This makes PFA models a little harder to interpret than BKT

Slide 37

Adjusting γ

Slide 38

Adjusting ρ

Slide 39

Adjusting β

Slide 40

β Parameters • Three different β parameters proposed • Item • Item-Type • Skill • These result in different numbers of parameters • And greater or lesser potential concern about over-fitting

Slide 41

Fitting PFA • EM (Expectation Maximization) Algorithm • Vulnerable to local minima, so randomized restarts are used

Slide 42

Item Response Theory (IRT) • Classical approach for assessments, used in tests • Measures how much of an overall trait a person has • Assess a student’s current knowledge of topic X

Slide 43

Key Assumptions • There is only one latent trait or skill being measured per set of items • No learning is occurring in between items • E.g. a testing situation with no help or feedback • Learner has ability θ • Item has difficulty b and discriminability a • Based on these, we can compute the probability P(θ) that the learner will get the item correct

Slide 44

Note • The assumption that all items tap the same latent construct, but have different difficulties • Is a very different assumption than is seen in PFA or BKT

Slide 45

The Rasch Model • The simplest IRT model, a 1-parameter model

$$P(\theta) = \frac{1}{1 + e^{-(\theta - b)}}$$
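A one-function Python sketch of the Rasch item characteristic curve; the function name is illustrative.

```python
import math

def rasch(theta, b):
    """Rasch model: probability of a correct response for a learner of
    ability theta on an item of difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

print(rasch(theta=0.0, b=0.0))  # 0.5 when ability equals difficulty
```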

Slide 46

Item Characteristic Curve • b = 0 • When θ = b (knowledge = difficulty), p = 0.5

Slide 47

P(correct) increases with student skill

Slide 48

Changing the difficulty parameter • Easy (green, b = −2), hard (orange, b = 2)

Slide 49

Note • The good student finds the easy and medium items almost equally easy • The weak student finds the medium and hard items almost equally hard • When b = θ, performance is 50%

Slide 50

The 2-parameter Model • A discriminability parameter a is added

$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$

Slide 51

Different values of a • a = 2 (higher discriminability) • a = 0.5 (lower discriminability)

Slide 52

Discriminability at extremes • a=0, a approaches infinity

Slide 53

Model Degeneracy • a below 0 • A negative a would mean that higher ability predicts worse performance

Slide 54

The 3-parameter Model • A more complex model • Adds a guessing parameter c • Either you guess (and get it right), or you don’t guess (and get it right based on knowledge)

$$P(\theta) = c + (1 - c) \cdot \frac{1}{1 + e^{-a(\theta - b)}}$$
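A hedged sketch of the 3-parameter model, extending the earlier Rasch snippet; setting c = 0 recovers the 2-parameter model, and a = 1 with c = 0 recovers the Rasch model.

```python
import math

def irt_3pl(theta, a, b, c):
    """3PL: guess correctly with probability c, otherwise answer
    from knowledge via the 2-parameter logistic curve."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

print(irt_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625 with a guessing floor of 0.25
```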

Slide 55

3-parameter model

Slide 56

Fitting IRT Models • Can be done with Expectation Maximization • Estimate knowledge and difficulty together • Then, given item difficulty estimates, you can assess a student’s knowledge in real time

Slide 57

Uses and Applications • IRT is used quite a bit in computer-adaptive testing • Not used quite so often in online learning, where student knowledge is changing as we assess it • For those situations, BKT and PFA are more popular

Slide 58

Non-KT (Knowledge Tracing) Approach • Motivation: the Bayesian method only uses the KC, opportunity count, and success/failure as features; much information is left unutilized, so another machine learning method is required • Strategy: engineer additional features from the dataset and use other learning algorithms to train a model

Slide 59

Features • Features extracted from the training set:
• Student progress features – Number of data points [today, since the start of the unit] – Number of correct responses out of the last [3, 5, 10] – Z-score sums for step duration, hint requests, and incorrects – Skill-specific versions of all these features
• Percent-correct features – % correct for unit, section, problem, and step, plus total, for each skill and also for each student (10 features)
• Student modeling approach features – The predicted probability of correct for the test row – The number of data points used in training the parameters – The final EM log-likelihood fit of the parameters / data points
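As an illustration of this kind of feature engineering, here is a hedged pandas sketch; the column names (student, skill, correct, step_duration), the helper name, and the assumption of a chronologically ordered log are mine, not the slides’.

```python
import pandas as pd

def engineer_features(log: pd.DataFrame) -> pd.DataFrame:
    """Derive a few of the listed features from a chronologically ordered
    interaction log with columns: student, skill, correct (0/1), step_duration."""
    out = log.copy()
    grp = out.groupby(["student", "skill"])["correct"]
    out["n_prior"] = grp.cumcount()  # data points so far for this student-skill pair
    out["correct_last_3"] = grp.transform(
        lambda s: s.shift().rolling(3, min_periods=1).sum()
    ).fillna(0.0)  # correct responses out of the last 3 attempts
    out["pct_correct_so_far"] = grp.transform(
        lambda s: s.shift().expanding().mean()
    ).fillna(0.0)  # running percent correct, excluding the current row
    out["duration_z"] = (
        out["step_duration"] - out["step_duration"].mean()
    ) / out["step_duration"].std()  # z-score of step duration
    return out
```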

Slide 60

Non-KT Approach • Modelled as a classification problem • ML algorithms like Logistic Regression, SVMs, Neural Networks, and Decision Trees can be used • Combining user features with skill features is a very powerful classification approach • Model-tracing-based predictions perform formidably against pure machine learning techniques
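A minimal scikit-learn sketch of the classification setup; the feature columns come from the hypothetical engineer_features helper above, and all names here are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 'engineered' is assumed to be the DataFrame returned by engineer_features().
features = ["n_prior", "correct_last_3", "pct_correct_so_far", "duration_z"]
X, y = engineered[features], engineered["correct"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Any classifier could be swapped in here (SVM, neural network, decision tree).
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```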

Slide 61

Thank You!