Slide 1

Slide 1 text

CS685: Data Mining
Introduction
Arnab Bhattacharya
[email protected]
Computer Science and Engineering, Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs685/
1st semester, 2012-13
Tue, Wed, Fri 0900-1000 at CS101
Arnab Bhattacharya ([email protected]) CS685: Introduction 2012-13 1 / 19

Slide 2

Slide 2 text

Outline
1. General information
2. Course information
3. Data mining

Slide 6

Slide 6 text

Rules
- No pre-requisites except general aptitude
  - Linear algebra, probability and statistics expected
- Email [email protected] to set up appointment
  - Put “CS685” in the subject so that automatic filter catches it
- Participate
  - Attend classes
  - Clear doubts
  - Answer questions
- Do homeworks (i.e., assignments) individually
- No extension of deadlines for degradation of health of
  - Your computer
  - Your family members
  - Your (special) friend(s)
- If you are unwell, follow standard IITK procedure
  - Produce a sick certificate, etc.

Slide 7

Slide 7 text

Grading policy
- Exams: 25%
  - End-semester: 17.5%
  - Mid-semester: 7.5%
- Project (groups of at most 3): 40%
  - Results: 20%
  - Presentation and/or demonstration: 10%
  - Report: 10%
- Assignments: 20%
- Paper presentation and discussion: 10%
- Class participation: 5%

Slide 8

Slide 8 text

Project details
- Groups of at most three
- Form your own idea
- Just implementation or survey will not be enough
  - Back it up with analysis
- Deadlines
  1. Area of project: Aug 13
  2. Initial write-up: Sep 02
  3. Mid-term report: Sep 30
  4. Presentation and demonstration: Nov 10 – Nov 12
  5. Final report: Nov 16

Slide 9

Slide 9 text

Outline
1. General information
2. Course information
3. Data mining

Slide 10

Slide 10 text

Course material
- Slides
- Classwork
- Book: no textbook
- Reference books
  - “Data Mining: Concepts and Techniques” by Han, Kamber. Morgan Kaufmann.
  - “Data Mining: Practical Machine Learning Tools and Techniques” by Witten, Frank, Hall. Morgan Kaufmann.
  - “Introduction to Data Mining” by Tan, Steinbach, Kumar. Pearson Education.
- References mentioned in slides
- Conference proceedings and journal articles
  - KDD, ICDM, SDM, PKDD, PAKDD, etc.
  - DAMI, DMKD, TKDE, KDD, etc.

Slide 15

Slide 15 text

Course contents
1. What is data mining?
   - Connection to machine learning, statistics, databases
   - What is not data mining?
2. Data pre-processing
   - Data cleaning
   - Missing values
   - Data transformation
3. Data warehousing and data cube
   - Multi-dimensional data model
   - OLAP: on-line analytical processing
4. Itemset mining
   - Frequent itemsets
   - Association rule mining
5. Classification
   - Tree-based classification
   - Bayesian classification
   - Rule-based classification
   - Support vector machines

Slide 19

Slide 19 text

Course contents (contd.)
6. Prediction
   - Regression
7. Clustering
   - Partition-based methods
   - Hierarchical methods
   - Model-based methods
8. Anomaly detection
   - Rule-based methods
   - Statistical methods
9. Mining special kinds of data (if time and interests permit)
   - Graph mining
   - Text mining
   - Image analysis
   - Biological data

Slide 20

Slide 20 text

Outline
1. General information
2. Course information
3. Data mining

Slide 23

Slide 23 text

What is data mining?
- Extracting or mining knowledge from large amounts of data
- Knowledge discovery from data (KDD)
- We are in a data-rich but information-poor scenario
- Data mining is supported by three major technologies
  1. Massive data collection
  2. Data mining algorithms
  3. Powerful multiprocessor computers
- It is at the confluence of
  - Machine learning
  - Statistics
  - Databases
  - Information retrieval
  - Visualization techniques

Slide 24

Slide 24 text

Data analysis challenges
- Scalability
- High dimensionality
- Heterogeneous and complex data
  - Web
  - Unstructured text
  - Graph
- Distributed data
- Data ownership and privacy
  - How to access knowledge without violating privacy

Slide 25

Slide 25 text

Data analysis challenges
- Classification
  - Predicting the class of a data object
- Clustering
  - Finding groups in data
- Association
  - Finding co-occurring and related itemsets
- Visualization
  - Facilitating human discovery of patterns
- Summarization
  - Succinctly describing a group
- Anomaly detection
  - Identifying abnormal behavior
- Estimation
  - Predicting values of a data object
- Link analysis
  - Finding relationships among data objects

Slide 28

Slide 28 text

Extra-sensory perception (ESP)
- Rhine, a parapsychologist, set out to show that people experience extra-sensory perception (ESP)
- He asked many people to guess a sequence of 10 red or blue cards
- About 1 in every 1000 guessed the whole sequence correctly
- Rhine declared that they had ESP
- He called them back for further investigation
- They lost ESP
- His conclusion was that one should not inform people that they have ESP
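
The "1 in every 1000" figure is what pure chance predicts, which can be checked with a short calculation. The following is a minimal sketch, assuming each card is guessed independently with probability 1/2; the function name and the crowd size in the usage line are illustrative and not from the slides.

```python
from fractions import Fraction

CARDS = 10   # cards per sequence, as stated on the slide
COLORS = 2   # each card is either red or blue

# Chance that one purely random guesser gets the whole sequence right.
p_all_correct = Fraction(1, COLORS) ** CARDS


def expected_lucky_guessers(num_people: int) -> Fraction:
    """Expected number of people who guess every card right by chance alone."""
    return num_people * p_all_correct


print(p_all_correct)  # 1/1024, i.e. about 1 in 1000
# Hypothetical crowd size, chosen only for illustration.
print(float(expected_lucky_guessers(10_000)))
```

Under the same assumption, a person who succeeded once has only a 1/1024 chance of succeeding again in a follow-up test, which is consistent with the "lost" ESP.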

Slide 32

Slide 32 text

Terrorism example
- Is it sensible to try to detect possible terror links among people?
- Setting: assume terrorists meet at least twice in a hotel to plot something sinister
- Government method: they will scan hotel logs to identify such occurrences
- Data assumptions
  - Number of people: $10^9$
  - Tracked over $10^3$ days (about 3 years)
  - On any given day, a person stays in a hotel with a probability of 1%
  - Each hotel hosts $10^2$ people at a time
- Deductions
  - A person stays in a hotel for 10 days (1% of $10^3$ days)
  - Total number of hotels is $10^5$
  - Each day, $10^7$ people stay in a hotel
  - Per hotel, $10^2$ people stay
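
The deductions on this slide follow from the data assumptions by simple arithmetic. A minimal sketch of that arithmetic is given below; the constant and variable names are illustrative, and integer arithmetic is used to avoid rounding.

```python
PEOPLE = 10**9           # number of people tracked
DAYS = 10**3             # length of the observation period in days
HOTEL_CAPACITY = 10**2   # people hosted by one hotel at a time
STAY_PERCENT = 1         # chance (in percent) that a person is in a hotel on a given day

# Expected number of days one person spends in a hotel: 1% of 1000 days.
days_in_hotel_per_person = DAYS * STAY_PERCENT // 100

# Expected number of people staying in some hotel on a given day: 1% of 10^9.
people_in_hotels_per_day = PEOPLE * STAY_PERCENT // 100

# Hotels needed to host them if every hotel is filled to capacity.
num_hotels = people_in_hotels_per_day // HOTEL_CAPACITY

print(days_in_hotel_per_person, people_in_hotels_per_day, num_hotels)
```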

Slide 39

Slide 39 text

Terrorism example (contd.)
- In a day, probability that persons A and B stay in the same hotel is $10^{-9}$
  - Probability that A stays in a hotel that day is $10^{-2}$
  - Probability that B stays in a hotel that day is $10^{-2}$
  - Probability that B chooses A’s hotel is $10^{-5}$
- Probability that A and B meet on two given days is $10^{-18}$
  - Two independent events: $10^{-9} \times 10^{-9}$
- Total pairs of days is (roughly) $5 \times 10^{5}$
  - Any 2 out of $10^3$: $\binom{10^3}{2}$
- Probability that A and B meet twice in some pair of days is (roughly) $5 \times 10^{-13}$
  - $10^{-18} \times 5 \times 10^{5}$
- Total pairs of people is (roughly) $5 \times 10^{17}$
  - Any 2 out of $10^9$: $\binom{10^9}{2}$
- Expected number of suspicions, i.e., expected number of pairs of people who meet twice on some pair of days, is $2.5 \times 10^{5}$
  - $5 \times 10^{-13} \times 5 \times 10^{17}$
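
The chain of estimates above can be reproduced exactly. The sketch below assumes, as the slides do, that stays are independent across people and days and that hotels are chosen uniformly at random; the variable names are illustrative. It uses exact binomial coefficients instead of the rounded values, so the result is slightly below $2.5 \times 10^{5}$.

```python
from fractions import Fraction
from math import comb

PEOPLE = 10**9
DAYS = 10**3
NUM_HOTELS = 10**5
P_STAY = Fraction(1, 100)   # chance that a person is in a hotel on a given day

# A and B are both in a hotel, and B happens to pick A's hotel.
p_same_hotel_one_day = P_STAY * P_STAY * Fraction(1, NUM_HOTELS)

# The same coincidence on two given days (independent events).
p_two_given_days = p_same_hotel_one_day ** 2

day_pairs = comb(DAYS, 2)        # ways to choose the two days
people_pairs = comb(PEOPLE, 2)   # ways to choose the two people

# Expected number of pairs flagged as suspicious purely by chance.
expected_suspicious_pairs = people_pairs * day_pairs * p_two_given_days

print(float(expected_suspicious_pairs))   # roughly 2.5e5
```

Even with no terrorists at all, about a quarter of a million pairs would be flagged, which is why scanning hotel logs this way is not sensible.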

Slide 42

Slide 42 text

Ice-cream
- A man goes to an ice-cream parlor every night while returning home
- He observes that his car stalls only on days he orders vanilla flavor
- When any other flavor is ordered, the car does not stall
- He observes it over an extended period of time
- He tries changing other attributes such as shirt color, boot type, person accompanying him, etc.
  - No other attribute has any consistent effect
- A data mining researcher comes
  - She finds out that since vanilla is the most popular flavor, ordering vanilla induces a significantly longer waiting time
  - The car stalls when the man waits longer and not otherwise

Slide 45

Slide 45 text

What is not data mining but can be
- Rhine paradox
  - ESP story (extra-sensory perception)
  - Moral: Knowing what data mining is and is not will help you look smarter (than others not taking this course)
- Bonferroni’s principle: if you look in more places for interesting patterns than your amount of data supports, you are bound to “find” something “interesting” (most likely spurious)
  - Terrorism story
  - Moral: When checking for a particular rule or property, if there are enough possibilities to check, it will turn up somewhere by chance
- Obvious rules may not always make sense
  - Ice-cream story
  - Moral: When deducing rules, look at the correct attributes, i.e., those that explain the phenomenon