CS685: Data Mining
Data Preprocessing

Arnab Bhattacharya ([email protected])
Computer Science and Engineering, Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs685/

1st semester, 2012-13
Tue, Wed, Fri 0900-1000 at CS101

Outline
1 Data reduction
    Numerosity reduction
    Data discretization
    Data modeling
2 Data integration
3 Data transformation

Fourier analysis
Fourier analysis represents a periodic wave as a sum of (infinitely many) sine and cosine waves
A way to analyze the frequency components in a signal
The Fourier transform is a transformation from the time domain f(x) to the frequency domain g(u):
    g(u) = \int_{-\infty}^{\infty} f(x) e^{-2\pi u x i} \, dx
f(x) can be obtained from g(u) by the inverse transformation:
    f(x) = \int_{-\infty}^{\infty} g(u) e^{2\pi u x i} \, du
f(x) and g(u) form a Fourier transform pair

Discrete Fourier transform (DFT)
For the discrete case, assume the vector x has N components
The DFT of x, denoted by X, has N components, given by
    X_k = 1 \cdot \sum_{n=0}^{N-1} x_n e^{-(2\pi/N)kni}, \quad k = 0, \dots, N-1
The inverse transformation (IDFT) is given by
    x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{(2\pi/N)kni}, \quad n = 0, \dots, N-1
To avoid using separate scaling factors, both can be taken as 1/\sqrt{N}

Properties
Parseval's theorem: ||x||_2 = ||X||_2, i.e., the length of the vector is preserved when both scaling factors are 1/\sqrt{N}:
    \sum_{n=0}^{N-1} |x_n|^2 = \sum_{k=0}^{N-1} |X_k|^2
Retaining only a subset of the coefficients gives a contractive mapping, i.e., lengths can only get reduced
Invertible, linear transformation
Essentially, a rotation in N-dimensional space
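
A minimal sketch of these properties, assuming NumPy and an arbitrary example vector (the slides prescribe no implementation): the orthonormal DFT, with the 1/\sqrt{N} factor on both sides, preserves squared length and inverts exactly.

```python
import numpy as np

x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])

# Orthonormal DFT: both the forward and the inverse transform use the 1/sqrt(N) factor
X = np.fft.fft(x, norm="ortho")
x_back = np.fft.ifft(X, norm="ortho")

print(np.allclose(x, x_back.real))                        # True: invertible
print(np.allclose(np.sum(x**2), np.sum(np.abs(X)**2)))    # True: Parseval, length preserved
```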

Coefficients
Expanding e^{\theta i} = \cos\theta + i \sin\theta,
    X_k = \sum_{n=0}^{N-1} x_n e^{-(2\pi/N)kni} = \sum_{n=0}^{N-1} x_n \left[ \cos\left(\tfrac{2\pi}{N}kn\right) - i \sin\left(\tfrac{2\pi}{N}kn\right) \right]
    \therefore X_0 = \sum_{n=0}^{N-1} x_n (\cos 0 - i \sin 0) = \sum_{n=0}^{N-1} x_n
The first coefficient is the (scaled) sum or average
The other coefficients define the frequency components (cosine and sine) at frequencies 2\pi(k/N) for k = 0, 1, \dots, N-1

Dimensionality reduction
Retain the k lower frequency components
Discard the higher frequency noise
Alternatively, for a database of objects, retain those coefficients whose variances are highest
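
A sketch of the first strategy, assuming NumPy; keeping the first k coefficient indices follows the slide's wording literally, though for real-valued signals the conjugate-symmetric high-index partners are often retained as well.

```python
import numpy as np

def dft_reduce(x, k):
    """Keep only the k lowest-frequency DFT coefficients (indices 0..k-1)."""
    return np.fft.fft(x, norm="ortho")[:k]

def dft_approximate(coeffs, n):
    """Reconstruct an approximation with the discarded coefficients set to zero."""
    X = np.zeros(n, dtype=complex)
    X[:len(coeffs)] = coeffs
    return np.fft.ifft(X, norm="ortho")

x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])
x_hat = dft_approximate(dft_reduce(x, 3), len(x))
print(np.round(x_hat.real, 2))    # smoothed, low-frequency approximation of x
```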

Discrete cosine transform (DCT)
A way to represent a signal in terms of cosine waves of different frequencies (and amplitudes)
Various definitions:
    X^{(1)}_k = \sqrt{\tfrac{2}{N}} \sum_{n=0}^{N-1} u^{(1)}_n x_n \cos\left(\tfrac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right), \quad k = 0, \dots, N-1
    X^{(2)}_k = \sqrt{\tfrac{2}{N}} \sum_{n=0}^{N-1} u^{(2)}_n x_n \cos\left(\tfrac{\pi}{N}\left(n + \tfrac{1}{2}\right)\left(k + \tfrac{1}{2}\right)\right), \quad k = 0, \dots, N-1
where
    u^{(1)}_n = 1/\sqrt{2} for n = 0, and u^{(1)}_n = 1 for n = 1, \dots, N-1
    u^{(2)}_n = 1 for n = 0, \dots, N-1
Length preserving
The inverses are the same functions
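
A short check of the length-preservation and invertibility claims, assuming SciPy's orthonormal type-2 DCT (SciPy is not mentioned by the slides; this is only an illustrative sketch).

```python
import numpy as np
from scipy.fft import dct, idct

x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])

X = dct(x, type=2, norm="ortho")        # orthonormal DCT-II
x_back = idct(X, type=2, norm="ortho")  # its inverse with the same scaling

print(np.allclose(x, x_back))                     # True: invertible
print(np.allclose(np.sum(x**2), np.sum(X**2)))    # True: length preserving
```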

Wavelets
The Fourier transform analyzes frequency, but not time
Wavelets analyze a function in both the time and frequency domains
Good time resolution and poor frequency resolution at high frequencies
Good frequency resolution and poor time resolution at low frequencies
Wavelets are thus useful for short-duration signals of high frequency and long-duration signals of low frequency
Wavelets are generated from a mother wavelet function \psi
    Zero mean (oscillatory, i.e., wave nature): \int \psi(x) \, dx = 0
    Unit length: \int \psi^2(x) \, dx = 1
Basis functions are generated by scaling (s) and shifting (l) the mother wavelet:
    \psi_{s,l}(t) = \frac{1}{\sqrt{s}} \psi\left(\frac{t - l}{s}\right)

Discrete wavelet transform (DWT)
DWT generates a set of basis functions or vectors
Two functions:
1 Wavelet function
2 Scaling function
The space spanned by n = 2^j basis vectors at level j can be spanned by two sets of basis vectors \psi and \phi at level j - 1
\psi and \phi are the wavelet and scaling functions respectively
DWT generates basis vectors for the wavelet and scaling functions at different levels
[Figure: decomposition cascade s_j → (s_{j-1}, d_{j-1}) → (s_{j-2}, d_{j-2}) → ... → (s_0, d_0); at each level the scaling function φ produces the next sum coefficients s and the wavelet function ψ produces the detail coefficients d]

Haar wavelets: Example
Compute the sum and difference of each consecutive pair of entries (and scale)
Repeat the steps on the sum coefficients
    f_3 = \{2, 5, 8, 9, 7, 4, -1, 1\}
    f_2 = \frac{1}{\sqrt{2}} \{2+5, 8+9, 7+4, -1+1, 2-5, 8-9, 7-4, -1-1\}
        = \left\{ \frac{7}{\sqrt{2}}, \frac{17}{\sqrt{2}}, \frac{11}{\sqrt{2}}, \frac{0}{\sqrt{2}}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}} \right\}
    f_1 = \left\{ \frac{24}{2}, \frac{11}{2}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}} \right\}
    f_0 = \left\{ \frac{35}{2\sqrt{2}}, \frac{13}{2\sqrt{2}}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}} \right\}
Length is preserved: ||f_0||_2 = 15.524 = ||f_3||_2
Invertible: f_3 is losslessly obtained from f_0
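
The following sketch (Python with NumPy is an assumption; the slides give no code) reproduces the example by repeatedly replacing each consecutive pair with its scaled sum and difference.

```python
import numpy as np

def haar_dwt(x):
    """Full Haar decomposition: scaled pairwise sums and differences, repeated on the sums."""
    out = np.asarray(x, dtype=float).copy()
    n = len(out)                 # assumed to be a power of 2
    while n > 1:
        sums = (out[0:n:2] + out[1:n:2]) / np.sqrt(2)
        diffs = (out[0:n:2] - out[1:n:2]) / np.sqrt(2)
        out[:n // 2] = sums      # sum coefficients go to the front
        out[n // 2:n] = diffs    # detail coefficients follow
        n //= 2
    return out

f3 = [2, 5, 8, 9, 7, 4, -1, 1]
f0 = haar_dwt(f3)
print(np.round(f0, 3))                          # [12.374 4.596 -5. 5.5 -2.121 -0.707 2.121 -1.414]
print(np.linalg.norm(f0), np.linalg.norm(f3))   # both ~15.524: length preserved
```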

Haar wavelets: Theory
Wavelet (\psi) and scaling (\phi) functions:
    \psi(x) = +1 when 0 \le x < 1/2; -1 when 1/2 < x \le 1; 0 otherwise
    \phi(x) = 1 when 0 \le x \le 1; 0 otherwise
Shifting (i) and scaling (j):
    \psi_{j,i}(x) = 2^{j/2} \psi(2^j x - i), \quad j = 0, 1, \dots, \quad i = 0, \dots, 2^j - 1
    \phi_{j,i}(x) = 2^{j/2} \phi(2^j x - i), \quad j = 0, 1, \dots, \quad i = 0, \dots, 2^j - 1
Binary dilation (scaling) and dyadic translation (shifting)
Coefficients corresponding to \phi_{j,i} are the averages or sum coefficients
Coefficients corresponding to \psi_{j,i} are the differences or detail coefficients

Matrix form
\psi_{j,i}, \phi_{j,i}, etc. form the basis vectors
Only \phi_{0,0} is needed, since the other \phi_{j,i} can be expressed in terms of \phi_{0,0} and the \psi_{\cdot,\cdot}
When the size of the vector is n = 2^j, j levels of basis vectors are needed
There are 2^{j-1} + 2^{j-2} + \cdots + 2^0 = 2^j - 1 basis vectors corresponding to detail coefficients and 1 basis vector corresponding to the sum coefficient at the 0th level
The transformation matrix H has these basis vectors as columns
The transformed vector is v' = v \cdot H for a data (row) vector v
Each step can be defined as multiplication by a matrix
Composition of these matrices gives the final transformation matrix
H is orthonormal: H^{-1} = H^T
Hence, the inverse operation is easy

Haar wavelet matrices
    H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix}
    1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
    1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
    0 & 0 & 0 & 1 & 0 & 0 & 0 & -1
    \end{pmatrix}
    f_2 = f_3 \cdot H_2

Haar wavelet matrices (contd.)
    H_1 = \begin{pmatrix}
    \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\
    \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\
    0 & \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
    0 & \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
    \end{pmatrix}
    f_1 = f_2 \cdot H_1

Haar wavelet matrices (contd.)
    H_0 = \begin{pmatrix}
    \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\
    \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
    \end{pmatrix}
    f_0 = f_1 \cdot H_0

Haar wavelet matrices (contd.)
    H_3 = H_2 \cdot H_1 \cdot H_0 = \begin{pmatrix}
    \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2} & 0 & \frac{1}{\sqrt{2}} & 0 & 0 & 0 \\
    \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2} & 0 & -\frac{1}{\sqrt{2}} & 0 & 0 & 0 \\
    \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & -\frac{1}{2} & 0 & 0 & \frac{1}{\sqrt{2}} & 0 & 0 \\
    \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & -\frac{1}{2} & 0 & 0 & -\frac{1}{\sqrt{2}} & 0 & 0 \\
    \frac{1}{2\sqrt{2}} & -\frac{1}{2\sqrt{2}} & 0 & \frac{1}{2} & 0 & 0 & \frac{1}{\sqrt{2}} & 0 \\
    \frac{1}{2\sqrt{2}} & -\frac{1}{2\sqrt{2}} & 0 & \frac{1}{2} & 0 & 0 & -\frac{1}{\sqrt{2}} & 0 \\
    \frac{1}{2\sqrt{2}} & -\frac{1}{2\sqrt{2}} & 0 & -\frac{1}{2} & 0 & 0 & 0 & \frac{1}{\sqrt{2}} \\
    \frac{1}{2\sqrt{2}} & -\frac{1}{2\sqrt{2}} & 0 & -\frac{1}{2} & 0 & 0 & 0 & -\frac{1}{\sqrt{2}}
    \end{pmatrix}
    f_0 = f_3 \cdot H_3
Inversely, f_3 = f_0 \cdot H_3^{-1} = f_0 \cdot H_3^T
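
A sketch that builds the step matrices as NumPy arrays, composes them into H_3, and checks both the worked example and the orthonormality claim. NumPy and the helper pair_step are assumptions for illustration, not part of the slides.

```python
import numpy as np

s = 1 / np.sqrt(2)

def pair_step(n, m):
    """m x m matrix that maps the first n entries to n/2 scaled pairwise sums
    followed by n/2 scaled pairwise differences, leaving the rest untouched."""
    H = np.eye(m)
    H[:n, :n] = 0.0
    for i in range(n // 2):
        H[2 * i, i] = s               # contribution to the i-th sum coefficient
        H[2 * i + 1, i] = s
        H[2 * i, n // 2 + i] = s      # contribution to the i-th detail coefficient
        H[2 * i + 1, n // 2 + i] = -s
    return H

H2, H1, H0 = pair_step(8, 8), pair_step(4, 8), pair_step(2, 8)
H3 = H2 @ H1 @ H0

f3 = np.array([2.0, 5, 8, 9, 7, 4, -1, 1])
f0 = f3 @ H3
print(np.round(f0, 3))               # matches the worked example, first entry 35/(2*sqrt(2))
print(np.allclose(f3, f0 @ H3.T))    # True: H3 is orthonormal, so the inverse is the transpose
```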

Dimensionality reduction using Haar wavelets
Retain the sum and lower level detail coefficients
For example, retain 1 sum (at level 0) and 3 detail (1 at level 0 and 2 at level 1) coefficients
Contractive mapping, i.e., lengths get reduced
Alternatively, for a database of objects, retain those coefficients whose variances are highest (sketched below)
Another option is to zero out coefficients whose absolute values are below a threshold
What happens when the dimensionality is not a power of 2?
    Pad zeros at the end: the latter half becomes less important
    Pad an equal number of zeros in each half: should be done recursively
Interestingly, shuffling the dimensions produces different wavelet coefficients
How to shuffle so as to satisfy some criteria?
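
A sketch of the variance-based alternative, assuming NumPy and a small hypothetical database: transform every object, then keep the coefficient positions that vary most across the database.

```python
import numpy as np

def haar(x):
    """Full Haar transform: repeated scaled pairwise sums and differences."""
    out, n = np.asarray(x, dtype=float).copy(), len(x)
    while n > 1:
        a, b = out[0:n:2].copy(), out[1:n:2].copy()
        out[:n // 2], out[n // 2:n] = (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)
        n //= 2
    return out

# Hypothetical database of four 4-dimensional objects
db = np.array([[2.0, 5, 8, 9],
               [7.0, 4, -1, 1],
               [3.0, 3, 2, 8],
               [6.0, 6, 5, 5]])
coeffs = np.apply_along_axis(haar, 1, db)

k = 2
keep = np.argsort(coeffs.var(axis=0))[::-1][:k]   # coefficient positions with highest variance
reduced = coeffs[:, keep]                         # k-dimensional representation of each object
print(keep, reduced.shape)
```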

Data discretization
Transformation from a continuous domain to discrete intervals, i.e., from quantitative data to qualitative data
Reduces the number of possible values of an attribute
May be encoded in a more compact form
If discretization uses information about groups (classes), then it is called supervised discretization
Otherwise, it is unsupervised discretization
Top-down discretization: repeatedly finds split points or cut points to divide the data
Bottom-up discretization: considers all continuous values as split points and then repeatedly merges
Discretization often leads to conceptualization, where each distinct group represents a particular concept

Concept hierarchy
Discrete intervals or concepts can be generated at multiple levels
A concept hierarchy captures more basic concepts towards the top and finer details towards the leaves
Example
    Higher level: brilliant, ordinary, deficient
    Middle level: CPI ≥ 9, 8-9, etc.
    Lower level: actual CPI
Can be used to represent knowledge hierarchies as well (ontologies)
Example
    WordNet: ontology for English words
    Gene Ontology: three separate ontologies that capture different attributes of a gene

Methods of discretization
For numeric data
    Binning and histogram analysis
    Entropy-based discretization
    Chi-square merging
    Clustering
    Intuitive partitioning

Entropy
Entropy captures the amount of randomness or chaos in the data
Entropy encodes the number of bits required to describe a distribution
It is also called the information content
    entropy(D) = -\sum_{i=1}^{n} p_i \log_2 p_i
When is entropy minimum?
    When one p_i is 1 (the others are 0)
    Least random
    Entropy = 0
When is entropy maximum?
    When every p_i is 1/n
    Most random
    Entropy = \log_2 n; for n = 2, it is 1
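
A small sketch of the definition and of the two extreme cases, assuming NumPy (the slides give no code).

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (zero-probability terms contribute nothing)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0, 0.0]))     # 0.0 : least random, one p_i is 1
print(entropy([0.5, 0.5]))     # 1.0 : most random for n = 2
print(entropy([0.25] * 4))     # 2.0 : log2(4)
```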

Entropy-based discretization
Supervised
Top-down
For attribute A, choose n partitions D_1, D_2, \dots, D_n of the dataset D
If D_i has instances from m classes C_1, C_2, \dots, C_m, then the entropy of partition D_i is
    entropy(D_i) = -\sum_{j=1}^{m} p_{ji} \log_2 p_{ji}
where p_{ji} is the probability of class C_j in partition D_i:
    p_{ji} = \frac{|C_j \in D_i|}{|D_i|}

Choosing partitions
Choose n - 1 partition values s_1, s_2, \dots, s_{n-1} such that \forall v \in D_i, s_{i-1} < v \le s_i
Implicitly, s_0 is the minimum and s_n is the maximum
How to choose these partition values?
Use the concept of information gain or expected information requirement

Information gain
When D is split into n partitions, the expected information requirement is defined as
    info(D) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} entropy(D_i)
The information gain obtained by this partitioning is defined as
    gain(D) = entropy(D) - info(D)
Choose partition values s_1, s_2, \dots, s_{n-1} such that the information gain is maximized (equivalently, the expected information requirement is minimized)
Keep on partitioning into two parts
Stopping criteria:
    The number of categories exceeds a threshold
    The expected information requirement is below a threshold

Example
Dataset D consists of two classes, 1 and 2
    Class 1: 10, 14, 22, 28
    Class 2: 26, 28, 34, 36, 38
The probabilities of the two classes are p_1 = 4/9 and p_2 = 5/9
    entropy(D) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 = 0.99
How to choose a splitting point?

Splitting point 1
Suppose the splitting point is s = 24
Then, the probabilities of the classes per partition are
    p_{11} = 3/3, \quad p_{21} = 0/3, \quad p_{12} = 1/6, \quad p_{22} = 5/6
The entropies are
    entropy(D_1) = -(3/3) \log_2 (3/3) - (0/3) \log_2 (0/3) = 0.00
    entropy(D_2) = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65
The expected information requirement and the entropy gain are
    info(D) = (|D_1|/|D|) entropy(D_1) + (|D_2|/|D|) entropy(D_2) = (3/9) × 0.00 + (6/9) × 0.65 = 0.43
    gain(D) = entropy(D) - info(D) = 0.99 - 0.43 = 0.56

Splitting point 2
Suppose the splitting point is s = 31
Then, the probabilities of the classes per partition are
    p_{11} = 4/6, \quad p_{21} = 2/6, \quad p_{12} = 0/3, \quad p_{22} = 3/3
The entropies are
    entropy(D_1) = -(4/6) \log_2 (4/6) - (2/6) \log_2 (2/6) = 0.92
    entropy(D_2) = -(0/3) \log_2 (0/3) - (3/3) \log_2 (3/3) = 0.00
The expected information requirement and the entropy gain are
    info(D) = (|D_1|/|D|) entropy(D_1) + (|D_2|/|D|) entropy(D_2) = (6/9) × 0.92 + (3/9) × 0.00 = 0.61
    gain(D) = entropy(D) - info(D) = 0.99 - 0.61 = 0.38

Example (contd.)
So, 24 is a better splitting point than 31
What is the optimal splitting point?
The exhaustive algorithm tests all possible n - 1 binary split positions (see the sketch below)
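
A sketch of the exhaustive search over the candidate cut points (taken here as midpoints between consecutive sorted values), run on the running example. NumPy is an assumption; the slides give no code.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(values, labels):
    """Exhaustively test every midpoint between consecutive distinct sorted values."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    base, n = entropy(labels), len(values)
    best = (None, -1.0)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue                                   # no cut between equal values
        s = (values[i - 1] + values[i]) / 2
        info = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        gain = base - info
        if gain > best[1]:
            best = (s, gain)
    return best

values = [10, 14, 22, 28, 26, 28, 34, 36, 38]
labels = [1, 1, 1, 1, 2, 2, 2, 2, 2]
print(best_split(values, labels))    # (24.0, ~0.56): s = 24 is the best binary cut point
```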

Chi-square test
A statistical test; more correctly, this is Pearson's chi-square test
Can test for goodness of fit and independence
Uses the chi-square statistic and the chi-square distribution
The value of the chi-square statistic is
    \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
where O_i is the observed frequency, E_i is the expected frequency and n is the number of observations
The chi-square statistic asymptotically approaches the chi-square distribution
The chi-square distribution is characterized by a single parameter: the degrees of freedom
Here, the degrees of freedom is n - 1

Chi-square test for independence
Two distributions with k frequencies each
The value of the chi-square statistic is
    \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
Here, the degrees of freedom is (2 - 1) × (k - 1) = k - 1
The value of the statistic is compared against the chi-square distribution with df = k - 1
Choose a significance level, say, 0.05
If the statistic obtained is less than the theoretical (critical) value, then conclude that the two distributions are independent at the chosen level of significance
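
A minimal sketch of the statistic for a 2 × k contingency table, assuming NumPy; cells with zero expected frequency are taken to contribute nothing, matching the convention used in the chi-merge examples below.

```python
import numpy as np

def chi_square(table):
    """Pearson chi-square statistic for a 2 x k contingency table.

    Expected counts: E_ij = (row total * column total) / grand total.
    """
    O = np.asarray(table, dtype=float)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(E > 0, (O - E) ** 2 / E, 0.0)
    return float(terms.sum())

# Intervals {1} (class 1) and {3} (class 2) from the chi-merge example that follows
print(chi_square([[1, 0], [0, 1]]))   # 2.0
```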

Chi-merge discretization
Supervised
Bottom-up
Tests every pair of adjacent intervals
If they are independent of the class, then they can be merged
Otherwise, the difference in frequencies is statistically significant, and they should not be merged
Keep on merging the adjacent intervals with the lowest chi-square value
Stopping criteria:
    When the lowest chi-square value is greater than the threshold chosen at a particular level of significance
    When the number of categories reaches a threshold

Example
Dataset D consists of two classes, 1 and 2
    Class 1: 1, 7, 8, 9, 37, 45, 46, 59
    Class 2: 3, 11, 23, 39
Test the first merging, i.e., the first two values: (1, C1) and (3, C2)
The contingency table is
              C1   C2   total
    I1         1    0     1
    I2         0    1     1
    total      1    1     2
    \chi^2 = \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} + \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} = 2

Example (contd.)
Test the second merging, i.e., the second and third values: (3, C2) and (7, C1)
    \chi^2 is again 2
Test the third merging, i.e., the third and fourth values: (7, C1) and (8, C1)
The contingency table is
              C1   C2   total
    I1         1    0     1
    I2         1    0     1
    total      2    0     2
    \chi^2 = \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} + \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} = 0
(cells with zero expected frequency are taken to contribute 0)

Merging
Initially, each value is its own interval; the chi-square value between each pair of adjacent intervals is
    Data:  1 | 3 | 7 | 8 | 9 | 11 | 23 | 37 | 39 | 45 | 46 | 59
    χ²:      2   2   0   0   2    0    2    2    2    0    0
After merging the adjacent intervals with the lowest (zero) chi-square values:
    Data:  1 | 3 | 7, 8, 9 | 11, 23 | 37 | 39 | 45, 46, 59
    χ²:      2   4         5        3    2    4
Continue till a threshold
From the chi-square distribution table, the chi-square value at significance level 0.1 with 1 degree of freedom is 2.70
So, when none of the chi-square values is less than 2.70, stop

Intuitive partitioning
Relatively uniform, easy-to-remember and easy-to-read intervals
For example, (500, 600) is a better interval than (512.23, 609.87)
The idea is to break the range up into "natural" partitions (see the sketch after this list)
Uses the 3-4-5 rule
Hierarchical partitioning
At each level, examine the number of distinct values of the most significant digit for each interval
    If it is 3, 6, 7 or 9 (3n), partition into 3 equi-width intervals (may be 2-3-2 for 7)
    If it is 2, 4 or 8 (2n), partition into 4 equi-width intervals
    If it is 1, 5 or 10 (5n), partition into 5 equi-width intervals
Recursively applied at each level
To avoid outliers, it is applied to the data that represents the majority, i.e., from the 5th percentile to the 95th percentile
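
A one-level helper for the 3-4-5 rule; this is an assumption about how the rule could be coded (the slides give no pseudocode), and it expects an interval whose endpoints are already rounded to the most significant digit.

```python
import numpy as np

def msd_width(low, high):
    """Width of one unit at the most significant digit position of the range."""
    return 10 ** int(np.floor(np.log10(high - low)))

def rule_3_4_5(low, high):
    """One level of the 3-4-5 rule: split (low, high) into 3, 4 or 5 equi-width
    intervals (2-3-2 for 7) depending on the distinct most-significant-digit values covered."""
    width = msd_width(low, high)
    distinct = round((high - low) / width)
    if distinct == 7:
        cuts = [low, low + 2 * width, low + 5 * width, high]          # 2-3-2 split
        return [(float(a), float(b)) for a, b in zip(cuts[:-1], cuts[1:])]
    parts = {3: 3, 6: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}[distinct]
    edges = np.linspace(low, high, parts + 1)
    return [(float(a), float(b)) for a, b in zip(edges[:-1], edges[1:])]

print(rule_3_4_5(-1000, 2000))   # 3 intervals: (-1000, 0), (0, 1000), (1000, 2000)
print(rule_3_4_5(0, 1000))       # 5 intervals of width 200
```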

Example
Suppose the data has range (-351, 4700), i.e., min = -351 and max = 4700
The 5th and 95th percentiles are low = -159 and high = 1838 respectively
The most significant digit is at the 1000s position
Rounding yields low' = -1000 and high' = 2000
The number of distinct digit values is 3
3 equi-width partitions: (-1000, 0), (0, 1000) and (1000, 2000)
Since low' < min, adjust the boundary of the first interval to (-400, 0)
Since max > high', a new interval needs to be formed: (2000, 5000)
These partitions are broken up recursively
    (-400, 0) into 4 partitions: (-400, -300), (-300, -200), (-200, -100), (-100, 0)
    (0, 1000) into 5 partitions: (0, 200), (200, 400), (400, 600), (600, 800), (800, 1000)
    (1000, 2000) into 5 partitions: (1000, 1200), (1200, 1400), (1400, 1600), (1600, 1800), (1800, 2000)
    (2000, 5000) into 3 partitions: (2000, 3000), (3000, 4000), (4000, 5000)

Regression
The aim is to describe how the response variable Y is generated
Can predict missing values
There is a set of k predictors, denoted by X
There are n observations
The general form of a regression model is
    Y = f(X) + ε
The function f can be chosen using domain knowledge
For linear regression, f is linear
ε encodes the error terms associated with each observation

Linear regression
The function f is chosen to be linear
The response variable depends only linearly on the predictor variables
The general form of the linear regression model is
    Y = XW + ε
W holds the regression coefficients or weights on the predictors
Sizes of the matrices:
    Y: n × 1, \quad X: n × k, \quad W: k × 1, \quad ε: n × 1
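
A minimal least-squares sketch with the matrix shapes above; NumPy, the synthetic data and the "true" weights are assumptions made only to illustrate the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))                       # n observations, k predictors
W_true = np.array([[2.0], [-1.0], [0.5]])         # hypothetical true weights (k x 1)
Y = X @ W_true + 0.1 * rng.normal(size=(n, 1))    # Y = XW + error terms

# Least-squares estimate of W (shape k x 1)
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(W_hat.round(2))                             # close to W_true
```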


Data integration
Data integration is the process of transforming multiple data sources into one single coherent source
Useful when there are multiple databases about the same set of objects
Schema matching and entity identification
    Is cust id equal to cust number?
Correlation analysis to reduce redundancy
    Chi-square test for categorical data
De-duplication


Data transformation
Data transformation is useful for
    Identifying trends
    Normalizing to correctly compute statistics
    Applying particular data mining algorithms
Common transformations include
    Smoothing of bins using histograms
    Aggregation and summarization
    Generalization
    Normalization

Normalization
Normalization changes the range of values
Min-max normalization:
    x' = \frac{x - min}{max - min}
This maps the range to (0, 1)
If the new range is (min', max'):
    x' = \frac{x - min}{max - min} (max' - min') + min'
Z-score normalization:
    x' = \frac{x - \mu}{\sigma}
This maps the range to (-\infty, +\infty)
Also called the standard score because it is the value under the standard normal distribution N(0, 1)
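
A short sketch of both normalizations, assuming NumPy and an arbitrary example vector.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization into the range (new_min, new_max)."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (new_max - new_min) + new_min

def z_score(x):
    """Z-score (standard score) normalization: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([10.0, 14, 22, 26, 28, 34, 36, 38])
print(min_max(x))                                               # values in [0, 1]
print(min_max(x, -1, 1))                                        # values in [-1, 1]
print(z_score(x).mean().round(6), z_score(x).std().round(6))    # ~0.0 and 1.0
```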