Slide 1

Intro to AI / Machine Learning
Ronny Bjarnason, Principal Data Scientist, Red Brain Labs
[email protected]

Slide 2

Intro to Me
★ Education: BYU (‘00, ‘02), Oregon State (‘09)
★ Marathoner, father of five, scoutmaster
★ Principal Data Scientist at Red Brain Labs

Slide 3

Red Brain Labs, Draper, UT
★ Predictive Analytics / Business Intelligence
★ Team: Business, Devs, Statistics, Machine Learning
★ Always Looking for Like-Minded Individuals
★ Commercial Over

Slide 4

Why are you here?
★ Have a project in mind and are interested in getting your hands dirty. (yes)
★ Want some intuition about how Machine Learning algorithms work. (YES!)
★ Curious about available libraries. (yes)
★ Nothing else to do before lunch?

Slide 5

What is Artificial Intelligence?
Actually, tons of stuff we take for granted:
★ Driving Directions
★ Natural Language Processing (Siri)
★ Computer Vision (CAPTCHA solvers)
★ Robotics
★ Internet Search
★ Recommender Systems (Netflix Prize)
★ Solving Tic-Tac-Toe
★ Constraint Satisfaction (Sudoku)
★ Nest

Slide 6

What is Machine Learning?
★ A subset of Artificial Intelligence
★ All that other stuff, if it improves automatically as it gets more experience (data)
★ Unsupervised: Clustering, Reinforcement Learning
★ Supervised: Regression, Classification
★ Data Mining

Slide 7

Data Mining
★ An example of Supervised Learning
★ Given a history (training data), predict an unseen variable
★ Which class does this belong to? Classification
★ What value will this be? Regression (where I spend most of my effort)
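
A toy contrast of the two tasks in plain Python; the data and helper functions are invented for illustration, not from the talk:

```python
# Classification predicts a discrete label; regression predicts a number.

def nearest_neighbor_classify(train, query):
    """1-NN: return the label of the training point closest to `query`."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]

def mean_regress(train):
    """Baseline regression: predict the mean of the observed targets."""
    return sum(target for _, target in train) / len(train)

# (feature, class) pairs for classification...
clf_data = [(1.0, "spam"), (1.2, "spam"), (5.0, "ham")]
# ...and (feature, value) pairs for regression.
reg_data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
```

Given a history (`clf_data`, `reg_data`), each function predicts an unseen variable: a class in the first case, a numeric value in the second.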

Slide 8

Popular Languages
★ Python (SciPy, scikit-learn, Pandas)
★ R
★ Java (Weka)
★ Lisp (John McCarthy)
★ SQL?
August 2011

Slide 9

Machine Learning Algorithms
★ Decision Trees
★ Neural Nets
★ Linear and Logistic Regression
★ Naive Bayes / Bayes Nets
★ Random Forests
★ Support Vector Machines
★ Many, many more
More with less data? More with more data?
UCI Repository

Slide 10

Prepping the Data
★ Most of the practical work in Machine Learning and Data Mining is getting the data right. When you receive a client’s data, listen and pay attention, but assume they don’t really know what data they have or where it is, because they don’t.
★ Feature Discovery: what are the important attributes?
★ Data Visualization: big hints for prediction
★ Data Cleansing: how many databases do you really need?
★ Data Leakage
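
A toy example of the cleansing step; the raw rows and field names below are invented for illustration:

```python
# Hypothetical raw records, as they might arrive from a client database:
# strings everywhere, missing values, stray formatting.
raw = [
    {"age": "34", "income": "52000", "churned": "yes"},
    {"age": "",   "income": "61000", "churned": "no"},   # missing age
    {"age": "29", "income": "48,500", "churned": "no"},  # stray comma
]

def clean(rows):
    """Drop incomplete rows, coerce types, derive a boolean target."""
    out = []
    for r in rows:
        if not r["age"]:                # cleansing: skip incomplete rows
            continue
        out.append({
            "age": int(r["age"]),
            "income": float(r["income"].replace(",", "")),
            "churned": r["churned"] == "yes",
        })
    return out

cleaned = clean(raw)
```

In practice this step (plus feature discovery and visualization) dominates the schedule; the model training itself is often the short part.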

Slide 11

Data Leakage
★ “Time Machining”: you can only use data that you will have at the time of prediction
‣ Predicting stock values
‣ Attributes added after the fact
★ What attributes have been picked by your algorithm? UserID?
★ Cross-fold validation
★ Can we beat random? Does it make sense?
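
One way to honor the “time machining” rule is to validate on a chronological split instead of a random shuffle, so no training example postdates the examples you evaluate on. A minimal sketch, with an invented helper:

```python
def chronological_split(records, train_frac=0.8):
    """records: list of (timestamp, features, target) tuples.
    Train on the earliest train_frac of the data, test on the rest,
    so the test set is strictly in the training set's future."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Toy records: timestamp, features, target.
data = [(t, {"x": t}, t * 2) for t in range(10)]
train, test = chronological_split(data)
```

A plain random shuffle here would let the model peek at the future, which is exactly the leakage the slide warns about.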

Slide 12

Not Data Leakage
★ Early Kaggle winners: odd/even de-anonymization, outside data sources
★ Take every advantage you possibly can; this is data mining at its finest (it only feels like cheating)

Slide 13

Decision Trees
★ Predicting membership in a class
★ Split the data on the single best attribute; repeat for each child node (BigML)
★ Easily interpretable
★ Scalable: easily handle streaming data (why might this be important?)
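
The greedy split step can be sketched as picking the attribute whose value groups are purest. This one-level version (a stump) uses misclassification count as the impurity measure, an assumption on my part since the slide doesn't name one:

```python
from collections import Counter

def best_split(rows, attrs):
    """rows: list of (attribute_dict, label). Return the attribute whose
    value groups have the fewest misclassified examples when each group
    predicts its majority label."""
    def impurity(groups):
        err = 0
        for g in groups.values():
            majority = Counter(label for _, label in g).most_common(1)[0][1]
            err += len(g) - majority
        return err

    def group_by(a):
        groups = {}
        for row, label in rows:
            groups.setdefault(row[a], []).append((row, label))
        return groups

    return min(attrs, key=lambda a: impurity(group_by(a)))

# Toy data: "outlook" separates the labels perfectly, "windy" does not.
rows = [
    ({"outlook": "sunny", "windy": True},  "no"),
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "rain",  "windy": True},  "yes"),
    ({"outlook": "rain",  "windy": False}, "yes"),
]
best = best_split(rows, ["outlook", "windy"])
```

A full tree applies `best_split` recursively to each child node; real implementations typically use information gain or Gini impurity rather than raw error counts.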

Slide 14

Commentary on “Big Data”
★ Hadoop/MapReduce is fabulous
★ But only if you need it
★ For most of what you do, it just isn’t necessary (really, how big is a hard drive these days?)
★ There are other options
★ GraphLab is awesome (Ben)

Slide 15

Linear and Logistic Regression
★ Learn a weight for each of the features
★ Adjust the weights with each example (either individually or in batches)
★ Simple and fast
★ Good for regression, not so much for classification
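
The per-example weight adjustment can be sketched for logistic regression in plain Python; the toy data and helper names are invented:

```python
import math

def sgd_logistic(examples, lr=0.5, epochs=200):
    """Online logistic regression: after every example, nudge each
    feature's weight so the predicted probability moves toward the
    observed label (0 or 1)."""
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            for i in range(n_features):
                w[i] += lr * (y - p) * x[i]     # per-weight update
    return w

def predict_prob(w, x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# Toy data: label is 1 exactly when the first feature is positive;
# the constant second feature acts as a bias term.
data = [([1.0, 1.0], 1), ([2.0, 1.0], 1), ([-1.0, 1.0], 0), ([-2.0, 1.0], 0)]
weights = sgd_logistic(data)
```

Processing examples one at a time (as here) is the "individually" case from the slide; summing the updates over the whole dataset before applying them is the batch case.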

Slide 16

Neural Networks
★ Minsky (perceptron, XOR), Rumelhart
★ Input nodes, output nodes, hidden nodes
★ Assign weights to all connecting edges (backpropagation)
★ Learn much more complex relationships
★ Difficult to interpret results
★ Typically poor for regression
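
Minsky's XOR point can be shown directly: a single step-function perceptron cannot compute XOR, but one hidden layer can. The weights below are hand-set for illustration; in practice backpropagation would learn them:

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two hidden nodes plus one output node compute XOR."""
    h1 = step(x1 + x2 - 0.5)        # OR-like hidden unit
    h2 = step(x1 + x2 - 1.5)        # AND-like hidden unit
    return step(h1 - 2 * h2 - 0.5)  # output: OR and not AND
```

No single line through the input plane separates XOR's outputs, which is why the hidden layer is essential; the same stacking is what lets networks learn much more complex relationships.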

Slide 17

Random Forests
★ A Kaggle regular; averages the results of very simple algorithms:
‣ Subselect the attributes
‣ Build a simple decision tree
‣ Create n trees
‣ Aggregate the results
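
The four-step recipe above can be sketched in plain Python, with one-level trees standing in for the "simple decision tree"; all names and the toy data are invented:

```python
import random

def train_forest(rows, attrs, n_trees=5, k=1, seed=0):
    """For each tree: bootstrap the rows, subselect k attributes, and fit
    a trivial one-level tree (majority label per attribute value)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap resample
        a = rng.sample(attrs, k)[0]                 # attribute subselection
        table = {}
        for row, label in sample:
            table.setdefault(row[a], []).append(label)
        majority = {v: max(set(ls), key=ls.count) for v, ls in table.items()}
        forest.append((a, majority))
    return forest

def predict(forest, row, default="no"):
    """Aggregate: majority vote across the trees (default for unseen values
    is an arbitrary choice in this sketch)."""
    votes = [table.get(row[a], default) for a, table in forest]
    return max(set(votes), key=votes.count)

# Toy data where every attribute tracks the label.
rows = ([({"a": "yes", "b": "yes"}, "yes")] * 6
        + [({"a": "no", "b": "no"}, "no")] * 6)
forest = train_forest(rows, ["a", "b"])
```

Each individual stump is weak, but the bootstrap plus attribute subselection de-correlates their errors, so the aggregated vote is far more stable than any single tree.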

Slide 18

Meta
★ Scaling for accuracy: Platt scaling, isotonic regression
★ K-fold models: extra training, decreased chance of overfitting
★ Ensemble methods: Random Forests, boosted forests
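
The K-fold idea (train k models, each holding out a different slice of the data) can be sketched as plain index splitting; the helper name is invented:

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists
    where each fold serves exactly once as the held-out test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Averaging the k resulting models (or their cross-validated scores) costs extra training time but gives a much less optimistic estimate than evaluating on the training data itself.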

Slide 19

Ensemble Building
★ Can’t decide on a single algorithm? Do you have to?
★ Benefits:
‣ If your data changes slightly over time
‣ If you aren’t sure of the structure of the data
★ Problems: time constraints
‣ Training time
‣ Initial tuning time
★ Example: LMP, 500+ models, fight for your life, keep yesterday’s winners, add some extras randomly
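
The "keep yesterday's winners, add some extras randomly" loop might look like this sketch; the function names are invented, and in practice `score` would be validation performance and `make_random_model` a randomized training run:

```python
def rebuild_pool(models, score, pool_size, make_random_model):
    """One generation: score every model, keep yesterday's winners
    (the top half), and top the pool back up with random newcomers."""
    survivors = sorted(models, key=score, reverse=True)[:pool_size // 2]
    newcomers = [make_random_model()
                 for _ in range(pool_size - len(survivors))]
    return survivors + newcomers
```

Run daily over a pool of hundreds of models, a loop like this keeps whatever currently predicts best while continually auditioning fresh candidates, which is what makes ensembles resilient when the data drifts over time.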

Slide 21

Monte-Carlo
★ Unknown state of the world
★ Known transitions
★ Known probabilities
★ Roll the dice: what happened? Repeat until termination
★ Application: March Madness
★ Application: Solitaire (Persi Diaconis)
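
The roll-the-dice loop can be sketched with a toy bracket-style question: estimating the chance of winning a best-of-7 series given a known per-game win probability (the setting and names are invented for illustration):

```python
import random

def win_probability(p_win, n_trials=100000, seed=1):
    """Monte-Carlo estimate of winning a best-of-7 series when each
    game is won independently with probability p_win: roll the dice
    n_trials times and count how often we reach 4 wins."""
    rng = random.Random(seed)
    series_wins = 0
    for _ in range(n_trials):
        games_won = sum(rng.random() < p_win for _ in range(7))
        series_wins += games_won >= 4
    return series_wins / n_trials
```

With p_win = 0.6 the exact answer is about 0.71; the simulation converges on it as n_trials grows. A March Madness simulator is the same loop with a transition per bracket matchup instead of a single series.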

Slide 22

Where to Get Started?
★ kaggle.com: data mining competitions (prize money); why hire an ML developer?
★ Andrew Ng’s Coursera course
★ Machine Learning for Hackers (O’Reilly, examples in R)
★ Wikipedia

Slide 23

Projects I’d Love to Work On
★ Predicting race times: more than just the standard race tables. Attributes? The Suunto Ambit predicts recovery time.
★ Cheating at Solitaire: instant recognition, faster prediction
★ NICU: lots of intuition going on

Slide 24

Questions?

Slide 25

Thank You.