Tech Talk at Animoto - ALL DEM MODELS (and R)

ALL DEM MODELS (and R)
Predicting Student Performance
Tech Talk, January 26, 2012

Predicting Student Test Performance
The data
Mixed Linear Models
Data Parsing/filtering
Clustering and model optimization
Overall performance
Questions?

The Data: “What do you know?”
Problem: Not knowing what to study wastes time and focus
Question: How well can we predict areas of difficulty for students so they can study smarter?
Data: 4 million samples, 93,100 to be predicted
Goal: Predict outcome % per question

What’s a Mixed Linear Model?
Supervised Learning
Random and Fixed Effects
Logistic/Linear Regression

library(lme4)

track_models = list()
for (track in unique(training$track_name)) {
  print(sprintf("Starting model for track %s.", track))
  # Random intercepts for user and question. glmer() is the current lme4 call
  # for binomial models; the slide used lmer(..., family = binomial, REML = F).
  rasch = glmer(correct ~ 1 + (1 | user_id) + (1 | question_id),
                data = training[training$track_name == track,
                                c('correct', 'user_id', 'question_id')],
                family = binomial)
  # The slide cuts off here; storing the fit per track is presumably the next step.
  track_models[[as.character(track)]] = rasch
}

0/1 predictions for user and question
Builds a model per track (0 - 8)
Each row is independent
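The formula correct ~ 1 + (1|user_id) + (1|question_id) with a binomial family puts the log-odds of a correct answer at intercept + user effect + question effect. The base-R sketch below shows only how such effects combine into a probability. The function name rasch_prob and all effect values are hypothetical, chosen for illustration, and the question effect is additive here (an "easiness"), which is the sign convention the formula implies rather than the classical ability-minus-difficulty form.

```r
# Rasch-style probability: logit P(correct) = intercept + user effect + question effect
rasch_prob <- function(intercept, user_effect, question_effect) {
  plogis(intercept + user_effect + question_effect)
}

# Hypothetical effects on the log-odds scale (illustrative values, not fitted)
intercept <- 0.2
user_effects <- c(alice = 1.0, bob = -0.5)
question_effects <- c(q1 = 0.8, q2 = -1.2)

# Probability table: one row per user, one column per question
probs <- outer(user_effects, question_effects,
               function(u, q) rasch_prob(intercept, u, q))
print(round(probs, 3))
```

A stronger user or an easier question raises the predicted probability; in the fitted mixed model these effects are estimated from the data rather than set by hand.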

Benchmark’s CBD: 0.25663
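CBD here presumably stands for capped binomial deviance, the metric on the competition leaderboard. A minimal sketch of that metric follows, assuming the usual definition: base-10 log loss with predictions clipped to [0.01, 0.99]. The function name and the default bounds are assumptions, not taken from the slides.

```r
# Capped binomial deviance (assumed definition): mean base-10 log loss,
# with predictions clipped so a confident wrong answer cannot score infinity.
capped_binomial_deviance <- function(actual, predicted, lower = 0.01, upper = 0.99) {
  p <- pmin(pmax(predicted, lower), upper)
  -mean(actual * log10(p) + (1 - actual) * log10(1 - p))
}

# Hypothetical outcomes (0/1) and predicted probabilities
actual <- c(1, 0, 1, 1, 0)
predicted <- c(0.9, 0.2, 0.6, 1.0, 0.0)
print(capped_binomial_deviance(actual, predicted))
```

Under this definition a constant prediction of 0.5 scores log10(2), about 0.301, which gives a reference point for the benchmark's 0.25663. Lower is better.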

Improving the model

Filtering out unusable data

Clustering: Improving the tagging structure

Why change it?

Poor Man’s K-Means
Data: Edwin Chen, “Quick Introduction to ggplot2”, blog.echen.me
Poorly drawn circles: Thomson Nguyen, “Introduction to R (And Machine Learning)”, Jan 24 2012, Lookout

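K-means Clustering

m <- read.csv('gtagsall.csv', header = T)
# (row, tag) index pairs: one pair for every tag listed on each row of m
id <- cbind(rowid = as.vector(t(row(m))), colid = as.vector(t(m)))
id <- id[complete.cases(id), ]
# Binary indicator matrix: tag.matrix[i, j] is 1 when row i carries tag j
tag.matrix <- matrix(0, nrow = nrow(m), ncol = max(m, na.rm = T))
tag.matrix[id] <- 1
# Within-cluster sum of squares for 1 cluster, then for 2 to 281 clusters
# (the slide used nrow(mydata) here, but mydata is never defined)
wss <- (nrow(tag.matrix) - 1) * sum(apply(tag.matrix, 2, var))
for (i in 2:281) {
  wss[i] <- sum(kmeans(tag.matrix, centers = i)$withinss)
}
plot(1:281, wss, type = 'b', xlab = 'Clusters', ylab = 'WSS')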
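The WSS loop above is the common "elbow" heuristic: fit k-means for a range of cluster counts and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below runs the same idea on hypothetical toy data (three well-separated 2-D blobs), not on the talk's tag matrix; the data, max_k, and nstart = 10 are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical toy data: three blobs of 50 points each, centred at (0,0), (5,5), (10,10)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

max_k <- 8
wss <- numeric(max_k)
# With one cluster, WSS is simply the total sum of squares around the column means
wss[1] <- (nrow(toy) - 1) * sum(apply(toy, 2, var))
for (k in 2:max_k) {
  # nstart = 10 reruns k-means from several random starts and keeps the best fit
  wss[k] <- kmeans(toy, centers = k, nstart = 10)$tot.withinss
}

# Share of the total sum of squares left unexplained at each k
print(round(wss / wss[1], 4))
# plot(1:max_k, wss, type = 'b', xlab = 'Clusters', ylab = 'WSS')
```

On data like this the curve falls steeply up to k = 3 and flattens afterwards. Real tag data rarely shows such a clean elbow, so the plot is a guide rather than a rule.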

Performance / Last remarks
Kaggle Leaderboard: http://www.kaggle.com/c/WhatDoYouKnow/leaderboard

Questions/Feedback

podopie
