
CS685 Intro

pankajmore
August 07, 2012

Transcript

  1. CS685: Data Mining
    Introduction
    Arnab Bhattacharya, arnabb@cse.iitk.ac.in
    Computer Science and Engineering, Indian Institute of Technology, Kanpur
    http://web.cse.iitk.ac.in/~cs685/
    1st semester, 2012-13
    Tue, Wed, Fri 0900-1000 at CS101
    Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Introduction 2012-13 1 / 19
  2. Outline
    1 General information
    2 Course information
    3 Data mining
  6. Rules
    No pre-requisites except general aptitude
        Linear algebra, probability and statistics expected
    Email arnabb@cse.iitk.ac.in to set up appointment
        Put “CS685” in the subject so that automatic filter catches it
    Participate
        Attend classes
        Clear doubts
        Answer questions
    Do homeworks (i.e., assignments) individually
    No extension of deadlines for degradation of health of
        Your computer
        Your family members
        Your (special) friend(s)
    If you are unwell, follow standard IITK procedure
        Produce a sick certificate, etc.
  7. Grading policy
    Exams: 25%
        End-semester: 17.5%
        Mid-semester: 7.5%
    Project (groups of at most 3): 40%
        Results: 20%
        Presentation and/or Demonstration: 10%
        Report: 10%
    Assignments: 20%
    Paper presentation and discussion: 10%
    Class participation: 5%
  8. Project details
    Groups of at most three
    Form your own idea
    Just implementation or survey will not be enough
        Back it up with analysis
    Deadlines
        1 Area of project: Aug 13
        2 Initial write-up: Sep 02
        3 Mid-term report: Sep 30
        4 Presentation and Demonstration: Nov 10 – Nov 12
        5 Final report: Nov 16
  10. Course material
    Slides
    Classwork
    Book: no text book
    Reference books
        “Data Mining: Concepts and Techniques” by Han, Kamber. Morgan Kaufmann.
        “Data Mining: Practical Machine Learning Tools and Techniques” by Witten, Frank, Hall. Morgan Kaufmann.
        “Introduction to Data Mining” by Tan, Steinbach, Kumar. Pearson Education.
    References mentioned in slides
    Conference proceedings and journal articles
        KDD, ICDM, SDM, PKDD, PAKDD, etc.
        DAMI, DMKD, TKDE, KDD, etc.
  15. Course contents
    1 What is data mining?
        Connection to machine learning, statistics, databases
        What is not data mining?
    2 Data pre-processing
        Data cleaning
        Missing values
        Data transformation
    3 Data warehousing and data cube
        Multi-dimensional data model
        OLAP: on-line analytical processing
    4 Itemset mining
        Frequent itemsets
        Association rule mining
    5 Classification
        Tree-based classification
        Bayesian classification
        Rule-based classification
        Support vector machines
  19. Course contents (contd.)
    6 Prediction
        Regression
    7 Clustering
        Partition-based methods
        Hierarchical methods
        Model-based methods
    8 Anomaly detection
        Rule-based methods
        Statistical methods
    9 Mining special kinds of data (if time and interests permit)
        Graph mining
        Text mining
        Image analysis
        Biological data
  23. What is data mining?
    Extracting or mining knowledge from large amounts of data
    Knowledge discovery from data (KDD)
    We are in a data rich but information poor scenario
    Data mining is supported by three major technologies
        1 Massive data collection
        2 Data mining algorithms
        3 Powerful multiprocessor computers
    It is in the confluence of
        Machine learning
        Statistics
        Databases
        Information retrieval
        Visualization techniques
  24. Data analysis challenges
    Scalability
    High dimensionality
    Heterogeneous and complex data
        Web
        Unstructured text
        Graph
    Distributed data
    Data ownership and privacy
        How to access knowledge without violating privacy
  25. Data analysis challenges
    Classification: Predicting the class of a data object
    Clustering: Finding groups in data
    Association: Finding co-occurring and related itemsets
    Visualization: Facilitating human discovery of patterns
    Summarization: Succinctly describing a group
    Anomaly detection: Identifying abnormal behavior
    Estimation: Predicting values of a data object
    Link analysis: Finding relationships among data objects
  28. Extra-sensory perception (ESP)
    Rhine, a para-psychologist, proceeded to show that people experience extra-sensory perception (ESP)
    Asked many people to correctly guess a sequence of 10 red or blue cards
    About 1 in every 1000 was right
    Rhine declared that they had ESP
    Called them for further investigation
    They lost ESP
    Conclusion was one should not inform people that they have ESP
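The "1 in every 1000" figure in the ESP story is roughly what pure luck would produce. The sketch below assumes every guess is an independent, fair choice between red and blue; the crowd size of 10,000 is a hypothetical number used only for illustration and does not come from the slides.

```python
from fractions import Fraction

CARDS = 10  # length of the red/blue sequence (from the slide)

# Chance of guessing all 10 cards correctly by luck alone,
# assuming each guess is an independent fair coin flip.
p_all_correct = Fraction(1, 2) ** CARDS


def expected_lucky_guessers(num_people: int) -> Fraction:
    """Expected number of people who get the whole sequence right by chance."""
    return num_people * p_all_correct


print(p_all_correct)                           # 1/1024
print(float(expected_lucky_guessers(10_000)))  # 9.765625
```

Under these assumptions about one person in 1024 succeeds without any special ability, so a retest of the "successful" guessers would be expected to fail most of the time.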
  32. Terrorism example
    Is it sensible to try and detect possible terror links among people?
    Setting: assume terrorists meet at least twice in a hotel to plot something sinister
    Government method: they will scan hotel logs to identify such occurrences
    Data assumptions
        Number of people: $10^9$
        Tracked over $10^3$ days (about 3 years)
        A person stays in a hotel with a probability of 1%
        Each hotel hosts $10^2$ people at a time
    Deductions
        A person stays in hotel for 10 days
        Total number of hotels is $10^5$
        Each day, $10^7$ people stay in a hotel
        Per hotel, $10^2$ people stay
  39. Terrorism example (contd.)
    In a day, probability that persons A and B stay in the same hotel is $10^{-9}$
        Probability that A stays in a hotel that day is $10^{-2}$
        Probability that B stays in a hotel that day is $10^{-2}$
        Probability that B chooses A’s hotel is $10^{-5}$
    Probability that A and B meet on both of two given days is $10^{-18}$
        Two independent events: $10^{-9} \times 10^{-9}$
    Total pairs of days is (roughly) $5 \times 10^{5}$
        Any 2 out of $10^3$: $\binom{10^3}{2}$
    Probability that A and B meet twice in some pair of days is $5 \times 10^{-13}$
        $10^{-18} \times 5 \times 10^{5}$
    Total pairs of people is (roughly) $5 \times 10^{17}$
        Any 2 out of $10^9$: $\binom{10^9}{2}$
    Expected number of suspicions, i.e., pairs of people who meet twice on some pair of days, is $2.5 \times 10^{5}$
        $5 \times 10^{-13} \times 5 \times 10^{17}$
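The arithmetic of the terrorism example can be reproduced in a few lines. This is a minimal sketch using only the numbers stated on the slides (people, days, stay probability, hotel capacity); the variable names are illustrative, and exact fractions are used so that no rounding hides the result.

```python
from fractions import Fraction
from math import comb

PEOPLE = 10**9             # number of people tracked
DAYS = 10**3               # tracking period in days
P_STAY = Fraction(1, 100)  # chance a person is in some hotel on a given day
HOTEL_CAPACITY = 10**2     # guests per hotel at a time

# 10^7 guests per day spread over hotels of 100 guests each gives 10^5 hotels.
hotels = int(PEOPLE * P_STAY) // HOTEL_CAPACITY

# A and B are both in a hotel, and B picks A's hotel: 10^-2 * 10^-2 * 10^-5.
p_same_hotel_one_day = P_STAY * P_STAY * Fraction(1, hotels)

# Meeting on both of two given days (assumed independent): 10^-18.
p_two_given_days = p_same_hotel_one_day ** 2

day_pairs = comb(DAYS, 2)       # roughly 5 * 10^5
people_pairs = comb(PEOPLE, 2)  # roughly 5 * 10^17

# Linearity of expectation over every (pair of people, pair of days).
expected_suspicious_pairs = people_pairs * day_pairs * p_two_given_days

print(hotels)
print(float(expected_suspicious_pairs))
```

The result is close to 250,000 pairs of entirely innocent people flagged as suspicious, which is why scanning the hotel logs alone would not be informative.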
  42. Ice-cream
    A man goes to an ice-cream parlor every night while returning home
    He observes that only on days he orders vanilla flavor, his car stalls
        When any other flavor is ordered, the car does not stall
    He observes it over an extended period of time
    He tries changing other attributes such as shirt color, boot type, person accompanying him, etc.
        No other attribute has any consistent effect
    A data mining researcher comes
        She finds out that since vanilla is the most popular flavor, ordering vanilla induces a significantly longer waiting time
        Car stalls when the man waits longer and not otherwise
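One way to see why waiting time, not flavor, is the attribute that explains the stalls is to group the same observations by each attribute in turn. The nightly log below is invented purely for illustration (the slides give no data), and it deliberately includes two hypothetical nights where flavor and waiting time disagree.

```python
from collections import defaultdict

# Hypothetical nightly log: (flavor, long_wait, car_stalled).
# All records are made up for illustration; none come from the slides.
NIGHTS = [
    ("vanilla", True, True),
    ("vanilla", True, True),
    ("vanilla", True, True),
    ("vanilla", False, False),   # assumed quiet night: vanilla served quickly
    ("chocolate", False, False),
    ("chocolate", False, False),
    ("strawberry", False, False),
    ("chocolate", True, True),   # assumed busy night: long wait, other flavor
]


def stall_rate_by(key_index: int) -> dict:
    """Fraction of nights with a stall, grouped by the attribute at key_index."""
    totals = defaultdict(int)
    stalls = defaultdict(int)
    for night in NIGHTS:
        key = night[key_index]
        totals[key] += 1
        stalls[key] += night[2]  # True counts as 1, False as 0
    return {key: stalls[key] / totals[key] for key in totals}


print(stall_rate_by(0))  # grouped by flavor: vanilla looks risky
print(stall_rate_by(1))  # grouped by waiting time: fully separates stalls
```

In this toy log, flavor only partly predicts a stall, while waiting time predicts it exactly, which matches the researcher's explanation in the story.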
  45. What is not data mining but can be
    Rhine paradox
        ESP story (extra-sensory perception)
        Moral: Knowing what data mining is and is not will help you look smarter (than others not taking this course)
    Bonferroni’s principle: if you look in more places for interesting patterns than your amount of data supports, you are bound to “find” something “interesting” (most likely spurious)
        Terrorism story
        Moral: When checking a particular rule or property, if there are many possibilities, then it will happen by chance
    Obvious rules may not always make sense
        Ice-cream story
        Moral: When deducing rules, look at correct attributes, i.e., those that explain the phenomenon