
Knowledge Inference

Amar
June 13, 2016


This presentation is about techniques to model students' changing knowledge state during the process of skill (knowledge) acquisition, also known as Knowledge Tracing. This model enables the system to maintain an estimate of the probability that the student has mastered each concept. Based on these probability estimates, the system individualizes the learning path and provides assistance as necessary.


Transcript

  1. KNOWLEDGE INFERENCE Amar Lalwani

  2. Goal Of Knowledge Inference • Measuring what a student knows

    at a specific time • Measuring what relevant knowledge components a student knows at a specific time
  3. Knowledge Component • Anything a student can know that is

    meaningful to the current learning situation • Skill • Fact • Concept • Principle • Schema
  4. Knowledge Inference • Knowledge is latent • Not directly measurable

  5. Why measure student knowledge? • Primary goal of education •

    Enhancing student knowledge • Measure efficacy of the system • Report to the stakeholders, instructors • Make automated pedagogical decisions
  6. Different than measuring performance • Inferring if a student’s performance

    right now is associated with successfully demonstrating a skill • Not the same as knowing whether the student has the latent skill • Guessing • Slipping (carelessness, lack of concentration)
  7. How do we get at latent knowledge? • Can’t measure

    it directly • Can’t look directly into the brain, yet! • But, can look at the performance • Performance over time • More information than performance at one specific instant
  8. Bayesian Knowledge Tracing • The classical approach for measuring a tightly

    defined skill in online learning • Based on the idea that practice on a skill leads to mastery of that skill • Goal: Track student knowledge over time • Measuring how well a student knows a specific skill/knowledge component at a specific time • Based on their past history of performance with that skill/KC
  9. Tightly defined skills • Unlike Item Response Theory • The

    goal is not to measure overall skill for a broadly-defined construct • Such as arithmetic • But to measure a specific skill or knowledge component • Such as addition of two-digit numbers where no carrying is needed
  10. Typical use of BKT • Assess a student’s knowledge of

    skill/KC X • Based on a sequence of items that are dichotomously scored • E.g. the student can get a score of 0 or 1 on each item • Where each item corresponds to a single skill • Where the student can learn on each item, due to help, feedback, scaffolding, etc.
  11. Key Assumptions • Single latent trait/skill per item • Each

    skill has four parameters • From these skill parameters and student’s historical performances, we can compute • Latent Knowledge P(Ln) • The probability P(correct) that the learner will get the item correct
  12. Key Assumptions • Two state learning model • Each skill

    is either learned/unlearned • Each problem is an opportunity for the student to apply the skill and hence learn it • Once known, the student does not forget the skill • Guess, slip
  13. BKT • For some skill K • Given the student's

    response sequence 1 to n, predict response n+1 • Chronological response sequence for student Y: 0 0 0 1 1 1 … 1 ? (positions 1 … n, then the unknown response n+1) [0 = incorrect response, 1 = correct response]
  14. BKT • Track knowledge over time (model of learning) • Example response sequence: 0 0 0 1 1 1 1
  15. BKT [Diagram: the Knowledge Tracing model as a chain of knowledge nodes K, each emitting a question node Q]

    • Node representations: K = Knowledge node (latent), Q = Question node (observed) • Node states: K = two states (0 or 1), Q = two states (0 or 1) • Parameters on the diagram: P(L0) on the first K node, P(T) on K→K transitions, P(G) and P(S) on K→Q emissions
  16. [Diagram: the same Knowledge Tracing model] Four parameters of the KT model:

    • P(L0) = Probability of initial knowledge • P(T) = Probability of learning • P(G) = Probability of guess • P(S) = Probability of slip • Probability of forgetting assumed to be zero (fixed)
  17. Simple HMM • Two hidden states: Learned (knows), Unlearned (does not know) • Two observations: Correct, Incorrect

    • Initial probabilities: P(L0) Learned, 1−P(L0) Unlearned • Transition Unlearned→Learned: P(T) • Emissions: Learned produces Correct with probability 1−P(S) and Incorrect with P(S); Unlearned produces Correct with P(G) and Incorrect with 1−P(G)
  18. BKT • Formulas for inference and prediction

    If the response at opportunity n is correct:
    P(Ln-1 | correct) = P(Ln-1) * (1 − P(S)) / [ P(Ln-1) * (1 − P(S)) + (1 − P(Ln-1)) * P(G) ]   (1)
    If the response is incorrect:
    P(Ln-1 | incorrect) = P(Ln-1) * P(S) / [ P(Ln-1) * P(S) + (1 − P(Ln-1)) * (1 − P(G)) ]   (2)
    Accounting for learning after the observation:
    P(Ln) = P(Ln-1 | evidence) + (1 − P(Ln-1 | evidence)) * P(T)
    Predicting correctness at opportunity n:
    P(correct) = P(Ln-1) * (1 − P(S)) + (1 − P(Ln-1)) * P(G)   (3)
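The update and prediction formulas above can be sketched in Python. The function and variable names (`bkt_update`, `p_l`, and so on) are illustrative choices, not from the slides:

```python
def bkt_update(p_l, correct, p_t, p_g, p_s):
    """One BKT step: Bayesian posterior given the observation (eqs. 1-2),
    then the no-forgetting learning transition P(T)."""
    if correct:
        posterior = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        posterior = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
    # Knowledge can only increase: no forgetting in classical BKT.
    return posterior + (1 - posterior) * p_t


def bkt_predict(p_l, p_g, p_s):
    """Predicted probability of a correct response at the next opportunity (eq. 3)."""
    return p_l * (1 - p_s) + (1 - p_l) * p_g
```

For example, with P(L) = 0.5, P(T) = 0.2, P(G) = 0.14, P(S) = 0.09, a correct answer moves the knowledge estimate from 0.5 to about 0.893.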
  19. BKT • Predicting Current student correctness • Whenever the student

    has an opportunity to use the skill • The probability that the student knows the skill is updated
  20. Example

  21. Influence of Parameter Values • P(L0): 0.50, P(T): 0.20, P(G): 0.14, P(S): 0.09

    • Student reached 95% probability of knowledge after the 4th opportunity • [Plot: estimate of knowledge for a student with response sequence 0 1 1 1 1 1 1 1 1 1]
  22. Influence of Parameter Values • Original parameters: P(L0): 0.50, P(T): 0.20, P(G): 0.14, P(S): 0.09

    • Modified parameters: P(L0): 0.50, P(T): 0.20, P(G): 0.64, P(S): 0.03 • With the modified parameters, the student reached 95% probability of knowledge only after the 8th opportunity • [Plot: estimate of knowledge for the same response sequence 0 1 1 1 1 1 1 1 1 1]
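As a sanity check, a short self-contained script (illustrative names, not from the deck) reproduces both trajectories for the response sequence 0 1 1 1 1 1 1 1 1 1:

```python
def bkt_update(p_l, correct, p_t, p_g, p_s):
    # Posterior given the observation, then the learning transition (no forgetting).
    if correct:
        post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
    return post + (1 - post) * p_t


def trajectory(responses, p_l0, p_t, p_g, p_s):
    """Knowledge estimate P(Ln) after each response in the sequence."""
    traj, p_l = [], p_l0
    for r in responses:
        p_l = bkt_update(p_l, r == 1, p_t, p_g, p_s)
        traj.append(p_l)
    return traj


def first_crossing(traj, threshold=0.95):
    # 1-indexed observation after which the estimate first reaches the threshold.
    return next(i + 1 for i, p in enumerate(traj) if p >= threshold)


seq = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
low_guess = trajectory(seq, 0.50, 0.20, 0.14, 0.09)   # slide 21 parameters
high_guess = trajectory(seq, 0.50, 0.20, 0.64, 0.03)  # slide 22 parameters
```

With the original parameters the estimate first reaches 95% after the 3rd observation, so the estimate entering the 4th opportunity is above 95%; with the high-guess parameters this happens only after the 7th observation (entering the 8th opportunity), consistent with the slides.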
  23. BKT • Only uses first problem attempt on each item

    • Throws out information… • But uses the clearest information…
  24. Parameter Constraints • Typically, the potential values of BKT parameters

    are constrained • To avoid model degeneracy • A knowledge model is degenerate when it violates the basic idea of BKT • When knowing a skill leads to worse performance • When getting a skill wrong means you know it
  25. Constraints Proposed • P(G) + P(S) < 1.0 • P(G)

    < 0.5, P(S) < 0.5 • P(G) < 0.3, P(S) < 0.1
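The proposed constraints can be expressed as a small guard; treating the tighter bounds as the default `strict` mode is an assumption for illustration:

```python
def check_bkt_constraints(p_g, p_s, strict=True):
    """Flag degenerate guess/slip values per the proposed constraints.
    strict: P(G) < 0.3 and P(S) < 0.1
    loose:  P(G) < 0.5, P(S) < 0.5, and P(G) + P(S) < 1.0"""
    if strict:
        return p_g < 0.3 and p_s < 0.1
    return p_g < 0.5 and p_s < 0.5 and (p_g + p_s) < 1.0
```

The high-guess parameters from slide 22 (P(G) = 0.64) fail both versions, which is one way to catch a degenerate fit before deploying it.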
  26. Knowledge Tracing • How do we know if a knowledge

    tracing model is any good? • Our primary goal is to predict knowledge • But knowledge is latent • So we instead check our knowledge predictions • by checking how well the model predicts performance
  27. Fitting the Model • EM (Expectation Maximization) Algorithm • Grid

    Search • Genetic Algorithms
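Of the three fitting methods, grid search is the simplest to sketch. The brute-force fit below (illustrative names; squared error on predicted correctness as the objective, with the stricter constraints from slide 25 applied) is a sketch, not a production fitting routine:

```python
from itertools import product


def bkt_predict_seq(responses, p_l0, p_t, p_g, p_s):
    """Predicted P(correct) before each response, updating knowledge after it."""
    p_l, preds = p_l0, []
    for r in responses:
        preds.append(p_l * (1 - p_s) + (1 - p_l) * p_g)
        if r == 1:
            post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
        else:
            post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
        p_l = post + (1 - post) * p_t
    return preds


def grid_search_fit(sequences, step=0.05):
    """Exhaustively try parameter combinations on a grid, keeping the one with
    the lowest sum of squared errors over all response sequences."""
    grid = [round(step * i, 2) for i in range(1, int(1 / step))]
    best, best_sse = None, float("inf")
    for p_l0, p_t, p_g, p_s in product(grid, repeat=4):
        if p_g >= 0.3 or p_s >= 0.1:  # skip degenerate regions (slide 25)
            continue
        sse = 0.0
        for seq in sequences:
            preds = bkt_predict_seq(seq, p_l0, p_t, p_g, p_s)
            sse += sum((p - r) ** 2 for p, r in zip(preds, seq))
        if sse < best_sse:
            best, best_sse = (p_l0, p_t, p_g, p_s), sse
    return best, best_sse


best_params, err = grid_search_fit([[0, 0, 1, 1, 1], [0, 1, 1, 1, 1]])
```

EM and genetic algorithms search the same space more cleverly; the grid simply trades compute for simplicity and avoids local minima within its resolution.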
  28. Performance Factor Analysis (PFA) • An alternative to BKT •

    Addresses some of the limitations of BKT • But does not have all of the nice features of BKT
  29. PFA • Measures how much latent skill a student has,

    while they are learning • But expresses it in terms of probability of correctness, the next time the skill is encountered • No direct expression of the amount of latent skill, except this probability of correctness
  30. Key Assumptions • Each item may involve multiple latent skills

    or knowledge components • Different from BKT • Each skill has success learning rate γ and failure learning rate ρ • Different from BKT where learning rate is the same, success or failure
  31. Key Assumptions • There is also a difficulty parameter β,

    but its semantics can vary • From these parameters, and the number of successes and failures the student has had on each relevant skill so far, we can compute the probability P(m) that the learner will get the item correct
  32. PFA • m(i, j ∈ KCs, s, f) = Σ j∈KCs ( βj + γj * s i,j + ρj * f i,j )

    • P(m) = 1 / (1 + e^(−m))
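The two PFA formulas read as below in Python; the dict-based interface (skill name mapped to its β, γ, ρ and to prior success/failure counts) is an assumption for illustration:

```python
import math


def pfa_m(skills, successes, failures):
    """m for one student-item pair. `skills` maps each KC on the item to
    (beta, gamma, rho); `successes`/`failures` give prior counts per KC."""
    return sum(beta + gamma * successes[kc] + rho * failures[kc]
               for kc, (beta, gamma, rho) in skills.items())


def pfa_p_correct(skills, successes, failures):
    """Probability of correctness via the logistic link P(m) = 1/(1 + e^-m)."""
    m = pfa_m(skills, successes, failures)
    return 1.0 / (1.0 + math.exp(-m))
```

Note that, unlike BKT, one item may list several KCs in `skills`, and each success/failure count feeds its own γ or ρ.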
  33. Example

  34. Degenerate Example

  35. Negative Learning

  36. Key Points • Values of ρ below 0 don’t actually

    mean negative learning • They mean that failure provides more evidence of lack of knowledge • Than the learning opportunity causes improvement • Parameters in PFA combine information from correctness with improvement from practice • Makes PFA models a little harder to interpret than BKT
  37. Adjusting γ

  38. Adjusting ρ

  39. Adjusting β

  40. β Parameters • Three different β parameters proposed • Item

    • Item-Type • Skill • Result in a different number of parameters • And greater or lesser potential concern about over-fitting
  41. Fitting PFA • EM (Expectation Maximization) Algorithm • Vulnerable to

    local minima • Randomized restarts
  42. Item Response Theory (IRT) • Classical approach for assessments, used

    in tests • Measures how much of an overall trait a person has • Assess a student’s current knowledge of topic X
  43. Key Assumptions • There is only one latent trait or

    skill being measured per set of items • No learning is occurring in between items • E.g. a testing situation with no help or feedback • Learner has ability θ • Item has difficulty b, discriminability a • Based on these, we can compute the probability P(θ) that the learner will get the item correct
  44. Note • The assumption that all items tap the same

    latent construct, but have different difficulties • Is a very different assumption than is seen in PFA or BKT
  45. The Rasch Model • Simplest IRT model, 1-parameter model

    • P(θ) = 1 / (1 + e^(−(θ − b)))
  46. Item Characteristic Curve • b=0 • When θ=b (knowledge=difficulty), p=0.5

  47. P(correct) increases with student skill

  48. Changing difficulty parameter • Easy (green, b=-2), Hard (Orange, b=2)

  49. Note • The good student finds the easy and medium

    items almost equally difficult • The weak student finds the medium and hard items almost equally hard • When b=θ, Performance is 50%
  50. The 2-parameter Model • Discriminability parameter “a” added

    • P(θ) = 1 / (1 + e^(−a(θ − b)))
  51. Different values of a • a=2 (higher discriminability) • a=0.5 (lower discriminability)
  52. Discriminability at extremes • a=0, a approaches infinity

  53. Model Degeneracy • a below 0

  54. The 3-parameter Model • A more complex model • Adds

    a guessing parameter c • P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))) • Either you guess (and get it right) • Or you don’t guess (and get it right based on knowledge)
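The 2- and 3-parameter models can share one sketch, since setting a=1 and c=0 recovers the Rasch model; the defaults and the name `irt_p` are illustrative:

```python
import math


def irt_p(theta, b, a=1.0, c=0.0):
    """2PL when c = 0 (discriminability a); 3PL when the guessing floor c > 0.
    P(theta) = c + (1 - c) / (1 + e^(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

With c = 0.25, even a very weak student answers correctly about a quarter of the time, which is the point of the guessing parameter on a 4-option multiple-choice item.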
  55. 3-parameter model

  56. Fitting IRT Models • Can be done with Expectation Maximization

    • Estimate knowledge and difficulty together • Then, given item difficulty estimates, you can assess a student’s knowledge in real time
  57. Uses and Applications • IRT is used quite a bit

    in computer-adaptive testing • Not used quite so often in online learning, where student knowledge is changing as we assess it • For those situations, BKT and PFA are more popular
  58. Non KT (Knowledge Tracing) Approach • Motivation • Bayesian method

    only uses KC, opportunity count and success/failure as features. Much information is left unutilized. Another machine learning method is required • Strategy: • Engineer additional features from the dataset and use other learning algorithms to train a model
  59. Features • Features extracted from training set: • Student progress

    features – Number of data points [today, since the start of unit] – Number of correct responses out of the last [3, 5, 10] – Z-score sum for step duration, hint requests, incorrects – Skill-specific version of all these features • Percent correct features – % correct of unit, section, problem and step, and total for each skill and also for each student (10 features) • Student Modeling Approach features – The predicted probability of correct for the test row – The number of data points used in training the parameters – The final EM log likelihood fit of the parameters / data points
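A couple of the listed features are easy to sketch in plain Python; the function names and the dict output format are assumptions, not from the deck:

```python
def recent_correct_features(responses, windows=(3, 5, 10)):
    """Number of correct responses among the last k attempts, one feature per
    window. Mirrors the 'correct out of the last [3, 5, 10]' features above."""
    return {f"correct_last_{k}": sum(responses[-k:]) for k in windows}


def percent_correct(responses):
    """Overall % correct for a student (or any unit/section/skill slice)."""
    return sum(responses) / len(responses) if responses else 0.0
```

Each feature is computed per student and, where the slide says so, again per skill; the resulting vectors are what the classifiers on the next slide consume.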
  60. Non KT Approach • Modelled as a Classification Problem • ML

    algorithms like Logistic Regression, SVM, Neural Networks, Decision Trees can be used • Combining user features with skill features is a very powerful classification approach • Model-tracing-based predictions perform formidably against pure machine learning techniques
  61. Thank You!