
An Introduction to Basics of Data Science Part 1


Part 1: a basic introduction to collaborative filtering and classification, using pandas and scikit-learn

Serhan Baker

April 12, 2013

Transcript

  1. Introduction to Data Science

    First steps into machine learning and its applications. /serhanbaker/datasci-presentation
  2. Contents

    Vocabulary
    Data Modeling and Data Preparation
    Recommendation Systems with Collaborative Filtering
      - Manhattan Distance
      - Euclidean Distance
      - Pearson Correlation Coefficient
      - Cosine Similarity
      - Sample Cases
      - K Nearest Neighbor
    Linear Regression and Sample Code
    Probability and Bayes Law
      - Naïve Bayes Classification
    Using scikit-learn with SVM Sample Code
    K-Means Clustering Sample Code (@ repository)
  3. Terms

    Statistics is concerned with probabilistic models, specifically inference on these models using data. Artificial Intelligence is anything concerned with intelligence in computers, so it covers a lot of ground. Data Mining is applied machine learning; it focuses on the practical aspects of deploying machine learning algorithms on large datasets. Machine Learning is concerned with predicting a particular outcome given some data. It looks like statistics, but focuses on computational efficiency and large datasets.
  4. In Short…

    Statistics quantifies numbers. Artificial Intelligence behaves and reasons. Data Mining explains patterns. Machine Learning predicts with models.
  5. Machine Learning Topics

    In general, a machine learning problem has a set of data samples and tries to predict properties of unknown data. If each sample is more than just one number (multivariate data), then our samples are said to have several attributes (aka features). We can separate learning problems into 2 large categories: Supervised Learning (Classification, Regression) and Unsupervised Learning (Clustering, Density Estimation).
  6. Supervised Learning

    Assumes you have a set of data with labels. The problem can be either: Classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. I know what these things are; can I take a known thing and put it in the right bucket? Handwritten digit recognition is an example. Regression: if the desired output consists of one or more continuous variables, the task is called regression, e.g. prediction of the length of a salmon as a function of its age and weight.
  7. Unsupervised Learning

    Assumes you have a bunch of data with no labels. Clustering: I have a universe of things; can I find groups in that universe? Density Estimation: determine the distribution of data within the input space.
  8. Entity Disambiguation

    Are FB, FBOOK, FACEBOOK the same company? Are Tr, TUR, Turkey the same country? Turkey (animal), Turkey (country): which is which? NY? NYC? New York?
  9. Data Modeling: What the Data Looks Like

    Users:
    user_id  name   age  gender
    101      Luke   26   male
    102      Chloe  17   female
    103      Leia   23   female

    Books:
    book_id  title               year  author               genre
    Xa2Jh1   Great Gatsby        1925  F. Scott Fitzgerald  fiction
    Zb8Km7   A Farewell to Arms  1926  Ernest Hemingway     fiction
    Ao9Cs0   A Beautiful Mind    1994  Sylvia Nasar         biography

    Interactions:
    user_id  book_id  visited  rented  purchased  rating
    101      Ao9Cs0   Y        N       Y          4
    103      Xa2Jh1   Y        Y       N          3
    102      Ao9Cs0   Y        N       N          0
    101      Zb8Km7   Y        Y       Y          5
  10. Data Modeling: What Machine Learning Expects

    age  gender  year  age_range  genre      author               visited  rented  purchased  rating
    26   male    1994  young      biography  Sylvia Nasar         Y        N       Y          4
    23   female  1926  young      fiction    F. Scott Fitzgerald  Y        Y       N          3
    17   female  1994  teen       biography  Sylvia Nasar         Y        N       N          0
    26   male    1926  young      fiction    Ernest Hemingway     Y        Y       Y          5

    Even further, machine learning expects something like this (book_ratings.csv, after discretization):
    age, gender, age_range, year, genre, author, visited, rented, purchased, rating
    26, "male", "young", 1994, "biography", "Sylvia Nasar", "Y", "N", "Y", 4
    23, "female", "young", 1926, "fiction", "F. Scott Fitzgerald", "Y", "Y", "N", 3
    17, "female", "teen", 1994, "biography", "Sylvia Nasar", "Y", "N", "N", 0
    26, "male", "young", 1926, "fiction", "Ernest Hemingway", "Y", "Y", "Y", 5
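
    A minimal sketch (not from the deck) of producing this flat, discretized form with pandas. The column names and the book_ratings.csv layout are assumptions for illustration; the age buckets are the ones the deck defines a couple of slides later.

    import pandas as pd

    ratings = pd.DataFrame({
        "age":    [26, 23, 17, 26],
        "gender": ["male", "female", "female", "male"],
        "year":   [1994, 1926, 1994, 1926],
        "rating": [4, 3, 0, 5],
    })

    # Discretize age into the deck's buckets: child, teen, young, middle, senior
    bins   = [0, 12, 19, 35, 59, 200]
    labels = ["child", "teen", "young", "middle", "senior"]
    ratings["age_range"] = pd.cut(ratings["age"], bins=bins, labels=labels)

    ratings.to_csv("book_ratings.csv", index=False)  # one flat row per sample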
  11. Data Modeling

    De-normalization: de-normalize your data! Multiple records -> a single record. Transform: some algorithms only work with nominal values, others only work with numbers. Discretization: divide the range of possible values into buckets, e.g. {[child : 0-12], [teen : 13-19], [young : 20-35], [middle : 36-59], [senior : 60+]}. Dictionary values: like primary key columns in a database, e.g. {[1 : Lion King], [2 : Toy Story], [3 : Finding Nemo]}. Dividing up data: n% for training, (100-n)% for test. Possible splits: 67/33, 80/20, 90/10.
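
    A minimal sketch (my own, not the deck's code) of the n% / (100-n)% split idea with plain Python:

    import random

    rows = list(range(1000))          # stand-in for your samples
    random.seed(42)
    random.shuffle(rows)              # shuffle before splitting

    split = int(len(rows) * 0.8)      # 80% training, 20% test
    train, test = rows[:split], rows[split:]
    print(len(train), len(test))      # 800 200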
  12. Data Modeling: Discretization Methods

    What if we don't know which subranges make sense? Equal-width binning divides the range of possible values into N subranges of the same size: bin width = (max value - min value) / N. For 0-100 and N = 5, bins = {[0 : 0-20], [1 : 20-40], [2 : 40-60], [3 : 60-80], [4 : 80-100]}. Equal-frequency binning divides the range of possible values into N bins, each holding the same number of training instances. Training data: 5, 7, 12, 35, 65, 82, 84, 88, 90, 95; N = 5 -> 5, 7 | 12, 35 | 65, 82 | 84, 88 | 90, 95. Boundary values: (7 + 12)/2 = 9.5, (35 + 65)/2 = 50. Discretize before splitting into training and test sets!
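
    A rough sketch (mine, not from the deck) of both binning methods on the training data above:

    def equal_width_bins(values, n):
        """Return the bin index (0..n-1) for each value, using equal-width bins."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / float(n)
        return [min(int((v - lo) / width), n - 1) for v in values]

    def equal_frequency_boundaries(sorted_values, n):
        """Return the n-1 boundaries, each halfway between neighboring bins."""
        per_bin = len(sorted_values) // n
        return [(sorted_values[i * per_bin - 1] + sorted_values[i * per_bin]) / 2.0
                for i in range(1, n)]

    data = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
    print(equal_width_bins(data, 5))             # bin index per value
    print(equal_frequency_boundaries(data, 5))   # [9.5, 50.0, 83.0, 89.0]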
  13. Data Preparation

    Machines learn from data! Get the right data for the right problem. Better data handling = more consistent results = better models. Steps: Select Data, Preprocess Data, Transform Data.
  14. Data Preparation: Step 1 → Select Data

    Is "more" always better? Not necessarily. Go for the right data for the right job. Consider: what data do you actually need to address the problem? What data do you actually have? What data is missing? What data can be removed?
  15. Data Preparation: Step 2 → Preprocess Data

    Getting the selected data into a form that you can work with. Consider how you are going to use this data. Formatting is shaping the data to fit your needs. Cleaning is the removal or fixing of missing or corrupted data. Sampling is taking a representative sample for faster prototyping.
  16. Data Preparation: Step 3 → Transform Data

    The specific algorithm you are working with, and your knowledge of the problem domain, will influence this step. Scaling: features like $, kg, or sales volume live on very different scales; rescale them to the 0-1 range. Decomposition: split out unnecessary details. Aggregation: combine features into a more meaningful one.
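
    A minimal sketch of the scaling step (an assumption on my part; the deck does not show this code), using scikit-learn's MinMaxScaler:

    from sklearn.preprocessing import MinMaxScaler

    X = [[50000, 70],     # e.g. salary in $, weight in kg
         [120000, 92],
         [80000, 61]]

    scaler = MinMaxScaler()            # maps each column to the 0..1 range
    X_scaled = scaler.fit_transform(X)
    print(X_scaled)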
  17. What is Data Science?

    (Venn diagram.) Data Science sits at the intersection of Substantive Expertise, Programming Skills / Software Eng., and Math & Statistics Knowledge. Math & Statistics plus Substantive Expertise alone is Traditional Research; Programming plus Math & Statistics alone is Machine Learning.
  18. Stack I Use

    Development environment, version control, deployment to production (if it's a web app), programming with scikit-learn (numpy, scipy, matplotlib libraries).
  19. How to Find People Similar to Each Other

           Dallas Buyers Club  Wolf of Wallstreet
    Alice  4                   2
    Bob    5                   3
    Eve    4                   5

    How would you select a movie to recommend to someone who rated Dallas Buyers Club 2 and Wolf of Wallstreet 4?
  20. You do this by computing distance

    In a 2D world, each person is represented by (x, y). Let's call Eve (x1, y1) and X (x2, y2). Eve's distance to X is calculated by |x1 - x2| + |y1 - y2|. This is the Manhattan Distance. Distance from X: Alice 4, Bob 4, Eve 3. We see that Eve is the closest match to X, so we can look at Eve's history and recommend her favorite movie to X. In which circumstances would the Manhattan Distance be the best choice? What would be its main benefit?
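
    A quick check of the numbers above (my own sketch): X rated Dallas Buyers Club 2 and Wolf of Wallstreet 4, so X = (2, 4).

    def manhattan(p, q):
        # sum of the absolute coordinate differences
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    x = (2, 4)
    for name, ratings in [("Alice", (4, 2)), ("Bob", (5, 3)), ("Eve", (4, 5))]:
        print(name, manhattan(x, ratings))   # Alice 4, Bob 4, Eve 3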
  21. Answer

    It's very fast to compute. If we were developing a social media application and were trying to find people similar to someone among millions of users, running this computation against every user constantly, the speed would come in handy.
  22. Euclidean Distance

    Pythagorean Theorem: unlike the Manhattan Distance, we measure the bird's-eye-view distance. c = √(a² + b²), and our formula is √((x1 - x2)² + (y1 - y2)²). Distance from X: Alice √8 ≈ 2.83, Bob √10 ≈ 3.16, Eve √5 ≈ 2.24. We see that Eve is the closest match again.
  23. Let's kick it up a notch

                        Bob  Alice  Eve  Merlin  Carol  Carlos  Wendy  Erin
    12 Years a Slave    3.5  2      5    3       -      -       5      3
    Gravity             2    3.5    1    4       4      4.5     2      -
    Dallas Buyers Club  -    4      1    4.5     1      4       -      -
    Blue Jasmine        4.5  -      3    -       4      5       3      5
    American Hustle     5    2      5    3       -      5       5      4
    The Great Gatsby    1.5  3.5    1    4.5     -      4.5     4      2.5
    Frozen              2.5  -      -    4       4      4       5      3
    Captain Phillips    2    3      -    2       1      4       -      -

    Notice that some users didn't rate some movies. We will compute the distance based only on the movies they both reviewed. Let's compute the distance between Bob and Alice.
  24. Alice & Bob

                        Bob  Alice  Diff
    12 Years a Slave    3.5  2      1.5
    Gravity             2    3.5    1.5
    Dallas Buyers Club  -    4      -
    Blue Jasmine        4.5  -      -
    American Hustle     5    2      3
    The Great Gatsby    1.5  3.5    2
    Frozen              2.5  -      -
    Captain Phillips    2    3      1

    Manhattan Distance: 9 (the sum of the differences: 1.5 + 1.5 + 3 + 2 + 1). Sum of squares: 18.5. Euclidean Distance: √((3.5 - 2)² + (2 - 3.5)² + (5 - 2)² + (1.5 - 3.5)² + (2 - 3)²) = 4.3. Can you see a drawback with this process?
  25. The Drawback

    Let's compute the distance between Carol and Erin: √((4 - 5)² + (4 - 3)²) = 1.41. Now compute the distance between Carol and Carlos: √((4 - 4.5)² + (1 - 4)² + (4 - 5)² + (4 - 4)² + (1 - 4)²) = 4.39. When we computed the distance between Carol and Carlos there were 5 dimensions; between Carol and Erin there are only 2. Manhattan and Euclidean distances work best when there are no missing values. This situation puts our consistency at risk.
  26. The Formula

    When we generalize the Euclidean and Manhattan distances, we get the Minkowski Distance Metric: d(x, y) = (Σ_k |x_k - y_k|^r)^(1/r). If r = 1, it's the Manhattan Distance; if r = 2, it's the Euclidean Distance; if r = ∞, it's the Supremum Distance. The bigger the r, the more a large difference in one dimension will influence the total difference.
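
    A sketch of this generalized metric (my own; it matches the commented-out minkowski_distance call in the next slide's code, whose implementation the deck does not show):

    def minkowski_distance(user1, user2, r):
        """Minkowski distance over the ratings two users have in common.
        r = 1 -> Manhattan, r = 2 -> Euclidean."""
        distance = 0
        has_common_ratings = False
        for key in user1:
            if key in user2:
                distance += abs(user1[key] - user2[key]) ** r
                has_common_ratings = True
        if has_common_ratings:
            return distance ** (1.0 / r)
        return -1  # no ratings in common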
  27. The Code

    def nearest_neighbor(username, user_dict):
        """
        Input : a username that can be found in our dict, and our dict
        Output: a sorted list of users based on their distance to username
        Usage : nearest_neighbor('username', dictname)
        """
        distances = []
        for user in user_dict:
            if user != username:
                distance = manhattan_distance(user_dict[user], user_dict[username])
                # distance = minkowski_distance(user_dict[user], user_dict[username], 2)
                # 1 -> Manhattan Distance
                # 2 -> Euclidean Distance
                distances.append((distance, user))
        distances.sort()  # sort based on min(distance)
        return distances

    def manhattan_distance(user1, user2):
        """
        Input : 2 dict elements
        Output: manhattan distance value of two users
        Usage : manhattan_distance(users['key1'], users['key2'])
        """
        distance = 0
        has_common_ratings = False
        for key in user1:
            if key in user2:
                distance += abs(user1[key] - user2[key])
                has_common_ratings = True
        if has_common_ratings:
            return distance
        else:
            return -1  # No ratings in common
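
    A usage sketch of my own, feeding the functions above a few users from the ratings table on slide 23 (keys are simply omitted for movies a user didn't rate):

    users = {
        "Bob":   {"12 Years a Slave": 3.5, "Gravity": 2, "Blue Jasmine": 4.5,
                  "American Hustle": 5, "The Great Gatsby": 1.5, "Frozen": 2.5,
                  "Captain Phillips": 2},
        "Alice": {"12 Years a Slave": 2, "Gravity": 3.5, "Dallas Buyers Club": 4,
                  "American Hustle": 2, "The Great Gatsby": 3.5, "Captain Phillips": 3},
        "Carol": {"Gravity": 4, "Dallas Buyers Club": 1, "Blue Jasmine": 4,
                  "Frozen": 4, "Captain Phillips": 1},
    }

    print(manhattan_distance(users["Bob"], users["Alice"]))  # 9.0
    print(nearest_neighbor("Bob", users))                    # [(5.0, 'Carol'), (9.0, 'Alice')]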
  28. Alice, Carol, Carlos

                        Alice  Carol  Carlos
    12 Years a Slave    2      -      -
    Gravity             3.5    4      4.5
    Dallas Buyers Club  4      1      4
    Blue Jasmine        -      4      5
    American Hustle     2      -      5
    The Great Gatsby    3.5    -      4.5
    Frozen              -      4      4
    Captain Phillips    3      1      4

    Alice gives medium points; Carol gives extreme points, very low or very high; Carlos likes everything, 4 or 5 every time. How would you compare Carol to Carlos? Can we say Carol's 4 = Carlos' 5? Can you see a drawback with all these processes?
  29. Pearson Correlation Coefficient

    We measure the correlation between two variables. It ranges between -1 (total mismatch) and 1 (total match). (Scatter plots of Person A's ratings vs. Person B's ratings, illustrating progressively weaker correlation.)
  30. The Code

    from math import sqrt

    def pearson_corr(rating1, rating2):
        sum_xy, sum_x, sum_y, sum_xsq, sum_ysq, n = 0, 0, 0, 0, 0, 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_xsq += pow(x, 2)
                sum_ysq += pow(y, 2)
        if n == 0:
            return 0  # no ratings in common
        # denominator of the approximation formula
        denominator = sqrt(sum_xsq - pow(sum_x, 2) / n) * sqrt(sum_ysq - pow(sum_y, 2) / n)
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator
  31. Cosine Similarity

    In a huge dataset there would be very little shared data, and we don't want to use shared zeros. Think of word frequency in a book, or number of plays:

           Alanis Morissette - Ironic  B.o.B - So Good  Coldplay - Magic
    Bob    10                          5                32
    Alice  15                          25               1
    Eve    12                          36               27

    Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them (in our case, vectors of play counts). One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered.
  32. The Formula

    cos(x, y) = (x · y) / (‖x‖ ‖y‖)
  33. The Code

    import math

    def cosine_similarity(v1, v2):
        sumxx, sumxy, sumyy = 0, 0, 0
        for i in range(len(v1)):
            x = v1[i]; y = v2[i]
            sumxx += x * x
            sumyy += y * y
            sumxy += x * y
        return sumxy / math.sqrt(sumxx * sumyy)
  34. Answer

    - Use Manhattan/Euclidean distances if your data is dense (no zero-valued attributes).
    - Use the Pearson Correlation Coefficient if your data uses different scales.
    - Use Cosine Similarity if your data is sparse.
  35. Case #1: A Movie Rating System

    People: Bob, Alice, Carlos. Bob & Alice: 20 movies in common, difference in ratings = 0.5 (on a 1-5 scale). Manhattan(Bob, Alice) = 20 x 0.5 = 10; Euclidean(Bob, Alice) = √(0.5² x 20) ≈ 2.24. Alice & Carlos: 1 movie in common, difference in ratings = 2 (on a 1-5 scale). Manhattan(Alice, Carlos) = 1 x 2 = 2; Euclidean(Alice, Carlos) = √(2²) = 2. Output: Carlos is a better match to Alice (which is wrong).
  36. Idea to fix this

    If a person didn't rate a movie, assign 0 to that one. This would solve the sparse-data problem, but let's test it before arriving at any conclusion. People: Merlin, Carol, Carlos (the same Carlos as before). Merlin & Carol are very, very similar. Merlin & Carlos: 25 of 26 movies are the same; Merlin's distance to Carlos = 0.25. Carol: 25 of her 150 ratings are in common with both; Carol's distance to Carlos = 0.25. Are Carol and Merlin equally close matches to Carlos? Output: with the zeros filled in, Merlin is much closer to Carlos, and Carol is far away from Carlos (wrong), because we assigned 0 to non-rated movies.
  37. Take-Home Lesson

    Zero values dominate the distance measures. If data is not dense, filling in the missing values artificially creates new problems while solving one. If data is sparse, using cosine similarity will serve us much better.
  38. Case #2: A Book Rating System

    People: Bob, Carol, and Bob's wife Alice. Bob & Carol both gave 5 stars to the same books (Game of Thrones, Lord of the Rings, Dune). Bob's wife wrote a book about botany, and Bob gave it 5 stars because the author is his wife. Since Bob is the closest person to Carol, we recommend this botany book to Carol. Why? Because we are still recommending by looking at 1-to-1 relationships. What can we do?
  39. K-Nearest Neighbor

    We use the K most similar people to determine recommendations. What should K be? It's up to you and your application. Recommendation to Mr. X with kNN where k = 3: the Pearson correlations are Bob 0.8, Alice 0.7, Eve 0.5, and 0.8 + 0.7 + 0.5 = 2.0, so the influence coefficients are Bob 40%, Alice 35%, Eve 25%. Their ratings for Frozen: Bob 3.5, Alice 5, Eve 4.5.
  40. K-Nearest Neighbor

    Users  Pearson  Frozen  Influence
    Bob    0.8      3.5     40%
    Alice  0.7      5       35%
    Eve    0.5      4.5     25%

    Frozen's projected rating for Mr. X: (3.5 x 0.4) + (5 x 0.35) + (4.5 x 0.25) = 4.275
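
    A small sketch of my own of this weighted projection: each neighbor's rating is weighted by its share of the total Pearson correlation.

    neighbors = [("Bob", 0.8, 3.5), ("Alice", 0.7, 5), ("Eve", 0.5, 4.5)]

    total = sum(corr for _, corr, _ in neighbors)                         # 2.0
    projected = sum((corr / total) * rating for _, corr, rating in neighbors)
    print(round(projected, 3))                                            # 4.275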
  41. Classification

    A drawback of Collaborative Filtering: it tends to recommend already popular items. Brand new items will rarely be found and rated, thus never recommended, and they get pushed to the side. Why not classify the items themselves? That way, we can find similar items, not only similar users.
  42. Importance of Selecting Appropriate Values

    Y axis (Genre): Country 1, Jazz 2, Rock 3, Soul 4, Rap 5. X axis (Mood): Melancholy 1, Joyful 2, Passion 3, Angry 4, Other 5. James Blunt - You're Beautiful = (1, 3); Song A = (1, 2); Song B = (4, 4); Song C = (4, 2). Song A is the closest to our sample. But with that logic, we are saying that Jazz is closer to Rock than it is to Soul, and Melancholy is closer to Joyful than it is to Angry. It's a big mistake, and it won't go away even if we change the order of the values.
  43. Importance of Selecting Appropriate Values

    Why not put them on their own scales? Melancholic? How melancholic? Rock? What is the coefficient? E.g. Country (5/5), Soul (3/5), Rock (1/5) and Joyful (3/5), Passion (3/5). A nice example (each attribute has its own scale): amount of piano, amount of vocals, driving beat, blues influence, presence of dirty electric guitar, presence of backup vocals, rap influence. Once we define our values and scales, it's just a matter of applying the Nearest Neighbor algorithm, as in the sketch below.
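
    A sketch of that last step (the feature values below are hypothetical, invented for illustration, and only a few of the attributes are used):

    from math import sqrt

    # each song: [piano, vocals, driving beat, blues influence, dirty guitar], all 0-5
    songs = {
        "You're Beautiful": [4, 5, 1, 1, 0],
        "Song A":           [4, 4, 2, 1, 1],
        "Song B":           [0, 2, 5, 4, 5],
    }

    def euclidean(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    target = songs["You're Beautiful"]
    closest = min((name for name in songs if name != "You're Beautiful"),
                  key=lambda name: euclidean(songs[name], target))
    print(closest)   # Song A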
  44. Classification Schema

    (Diagram: training data goes through feature extraction to build a trained classifier; new text goes through the same feature extraction and the classifier assigns a label such as Cat, Dog, or Mouse.) Feature extraction: selecting a subset of relevant features.
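
    One way to do the text feature extraction step, as a sketch of my own (the deck does not show this code; it uses scikit-learn's CountVectorizer, available in recent versions):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog chased the cat", "a mouse ran away"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)         # one row per document, one column per word
    print(vectorizer.get_feature_names_out())  # the vocabulary used as features
    print(X.toarray())                         # word-count feature vectors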
  45. Probability

    P(A) is the probability that A is true. Axioms of Probability: P(True) = 1; P(False) = 0; 0 ≤ P(A) ≤ 1; P(A or B) = P(A) + P(B) - P(A and B).
  46. Naïve Bayes

    Nearest Neighbor is a lazy learner: when we give it a set of data, it simply stores the set. Each time it classifies an instance, it goes through the whole training set; with 100,000 music tracks, it goes through all of them for every classification. Bayesian methods are eager learners: they immediately analyze the data and build a model, and when classifying an instance they use this internal model. Probabilistic classifications and faster classification are the main benefits of Bayesian methods.
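
    A minimal sketch of the eager-learner workflow (my own, assuming scikit-learn's naive Bayes implementation rather than the deck's code): the model is built once with fit(), then classification just reuses it.

    from sklearn import datasets
    from sklearn.naive_bayes import GaussianNB

    iris = datasets.load_iris()
    model = GaussianNB().fit(iris.data, iris.target)   # analyze data, build the model once
    print(model.predict(iris.data[:3]))                # fast lookups against the model
    print(model.predict_proba(iris.data[:3]))          # probabilistic classification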
  47. Bayes Law

    P(B|A) = P(A|B) P(B) / P(A). This is the cornerstone of all Bayesian methods. You can calculate the probability of event B given A if you already know the probability of A given B and the probabilities of A and B individually. P(B|A) asks: what is the probability that event B occurs, given that A has occurred?
  48. Bayes Law Example

    There are 10,000 people; 1% have a rare disease, and there is a test that is 99% effective: 99% of sick patients test positive, and 99% of healthy patients test negative. Given a positive test result, what is the probability that the patient is sick? Sick population: 100 (99% -> 99 test positive, 1% -> 1 tests negative). Healthy population: 9,900 (99% -> 9,801 test negative, 1% -> 99 test positive). So 99 sick people and 99 healthy people test positive: given a positive test, there is a 50% probability that the patient is sick. With Bayes Law: P(sick|test_pos) = P(test_pos|sick) P(sick) / P(test_pos) = (0.99 x 0.01) / (0.99 x 0.01 + 0.01 x 0.99) = 0.5.
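
    Checking the arithmetic above with a quick sketch of my own:

    p_sick = 0.01
    p_pos_given_sick = 0.99
    p_pos_given_healthy = 0.01

    # total probability of a positive test, over sick and healthy patients
    p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
    print(p_pos_given_sick * p_sick / p_pos)   # 0.5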
  49. Linear Regression

    Explains the relationship between a dependent variable and one or more explanatory variables. This is a tool to show the relationship between the inputs and outputs of a system: does a change in x cause a change in y? Commonly used in customer satisfaction research. More effective with short-term trend data. Check the code!
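
    The deck's regression code lives in its repository; as a stand-in, here is a minimal sketch using scikit-learn's LinearRegression (an assumption on my part, not the original code):

    from sklearn.linear_model import LinearRegression

    X = [[1], [2], [3], [4], [5]]         # explanatory variable, e.g. months
    y = [1.1, 1.9, 3.2, 3.9, 5.1]         # dependent variable, e.g. satisfaction score

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # how much y changes per unit of x
    print(model.predict([[6]]))           # short-term trend prediction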
  50. scikit-learn

    Statistical learning with scikit-learn (also the SVM). A simple example, the Iris dataset:

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> data = iris.data
    >>> data.shape
    (150, 4)

    It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width. When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by scikit-learn.
  51. >>> import pylab as pl
    >>> from sklearn import datasets, svm, metrics
    >>> digits = datasets.load_digits()
    >>> for index, (image, label) in enumerate(list(zip(digits.images, digits.target))[:4]):
            pl.subplot(2, 4, index + 1)
            pl.axis('off')
            pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
            pl.title('Training: %i' % label)
    >>> # To apply a classifier on this data, we need to flatten the images
    >>> # to turn the data into a (samples, features) matrix
    >>> n_samples = len(digits.images)
    >>> data = digits.images.reshape((n_samples, -1))
    >>> # support vector classifier
    >>> classifier = svm.SVC(gamma=0.001)
    >>> # We learn the digits on the first half of the digits
    >>> classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])
    >>> # Now predict the value of the digit on the second half:
    >>> expected = digits.target[n_samples // 2:]
    >>> predicted = classifier.predict(data[n_samples // 2:])
  52. >>> for index, (image, prediction) in enumerate(list(zip(digits.images[n_samples // 2:], predicted))[:4]):
            pl.subplot(2, 4, index + 5)
            pl.axis('off')
            pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
            pl.title('Prediction: %i' % prediction)
    >>> pl.show()