# Data Preprocessing 2

## pankajmore

September 11, 2012

## Transcript

1. ### CS685: Data Mining: Data Preprocessing

Arnab Bhattacharya (arnabb@cse.iitk.ac.in), Computer Science and Engineering, Indian Institute of Technology, Kanpur

http://web.cse.iitk.ac.in/~cs685/

1st semester, 2012-13. Tue, Wed, Fri 0900-1000 at CS101

Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 2 2012-13

2. ### Outline

   1. Data reduction
      - Numerosity reduction
      - Data discretization
      - Data modeling
   2. Data integration
   3. Data transformation

4. ### Fourier analysis

- Fourier analysis represents a periodic wave as a sum of (infinite) sine and cosine waves
- Way to analyze the frequency components in a signal
- The Fourier transform is a transformation from the time domain $f(x)$ to the frequency domain $g(u)$:

$$g(u) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi u x i}\, dx$$

- $f(x)$ can be obtained from $g(u)$ by the inverse transformation:

$$f(x) = \int_{-\infty}^{\infty} g(u)\, e^{2\pi u x i}\, du$$

- $f(x)$ and $g(u)$ form a Fourier transform pair

5. ### Discrete Fourier transform (DFT)

- For the discrete case, assume vector $x$ has $N$ components
- The DFT of $x$, denoted by $X$, has $N$ components given by

$$X_k = 1 \cdot \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N) k n i}, \quad k = 0, \dots, N-1$$

- The inverse transformation (IDFT) is given by

$$x_n = \frac{1}{N} \cdot \sum_{k=0}^{N-1} X_k\, e^{(2\pi/N) k n i}, \quad n = 0, \dots, N-1$$

- To avoid using separate scaling factors, both can be taken as $1/\sqrt{N}$

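
As a minimal sketch of the two formulas above, the following pure-Python code evaluates the DFT and IDFT sums directly, with scaling factors 1 and 1/N. The function names `dft` and `idft` and the sample vector are illustrative assumptions, not part of the slides; the direct sums take O(N²) time.

```python
import cmath

def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-(2*pi/N) * k * n * i)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Direct evaluation of x_n = (1/N) * sum_k X_k * exp((2*pi/N) * k * n * i)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [2, 5, 8, 9, 7, 4, -1, 1]   # illustrative sample vector
X = dft(x)                      # X[0] is the plain sum of the entries
x_back = idft(X)                # recovers x up to floating-point error
```

The round trip `idft(dft(x))` returning the input is what makes `x` and `X` a transform pair.
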
6. ### Properties

- Parseval's theorem: $\|x\|_2 = \|X\|_2$, i.e., the length of vectors is preserved when both scaling factors are $1/\sqrt{N}$:

$$\sum_{n=0}^{N-1} |x_n|^2 = \sum_{k=0}^{N-1} |X_k|^2$$

- Contractive mapping once coefficients are dropped, i.e., lengths can only get reduced
- Invertible, linear transformation
- Essentially, a rotation in $N$-dimensional space

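
The length-preservation property can be checked numerically. The sketch below assumes the symmetric scaling 1/√N and uses hypothetical helper names (`unitary_dft`, `norm`) and an illustrative vector; it also shows that keeping only a few coefficients cannot increase the length.

```python
import cmath
import math

def unitary_dft(x):
    """DFT with scaling factor 1/sqrt(N), so that lengths are preserved."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)) / math.sqrt(N)
            for k in range(N)]

def norm(v):
    """Euclidean length of a (possibly complex) vector."""
    return math.sqrt(sum(abs(c) ** 2 for c in v))

x = [2, 5, 8, 9, 7, 4, -1, 1]    # illustrative sample vector
X = unitary_dft(x)
length_x = norm(x)               # sqrt(241)
length_X = norm(X)               # equal to length_x up to rounding
length_truncated = norm(X[:3])   # keeping only some coefficients cannot increase the length
```
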
7. ### Coefficients

- Expanding $e^{\theta i} = \cos\theta + i \sin\theta$,

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N) k n i} = \sum_{n=0}^{N-1} x_n \left[\cos\left(\frac{2\pi}{N} k n\right) - i \sin\left(\frac{2\pi}{N} k n\right)\right]$$

$$\therefore X_0 = \sum_{n=0}^{N-1} x_n (\cos 0 - i \sin 0) = \sum_{n=0}^{N-1} x_n$$

- The first coefficient is the (scaled) sum or average
- The other coefficients define the frequency components (cosine and sine) at frequencies $2\pi(k/N)$ for $k = 0, 1, \dots, N-1$

9. ### Dimensionality reduction

- Retain $k$ lower frequency components
- Discard higher frequency noise
- Alternatively, for a database of objects, retain those coefficients whose variances are highest

10. ### Discrete cosine transform (DCT)

- Way to represent a signal in terms of cosine waves of different frequencies (and amplitudes)
- Various definitions:

$$X^{(1)}_k = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} u^{(1)}_n x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \quad k = 0, \dots, N-1$$

$$X^{(2)}_k = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} u^{(2)}_n x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad k = 0, \dots, N-1$$

$$u^{(1)}_n = \begin{cases} 1/\sqrt{2} & n = 0 \\ 1 & n = 1, \dots, N-1 \end{cases} \qquad u^{(2)}_n = 1, \quad n = 0, \dots, N-1$$

- Length preserving
- Inverses are the same functions

11. ### Wavelets

- The Fourier transform resolves a signal in frequency, but not in time
- Wavelets analyze a function in both the time and frequency domains
  - Good time resolution and poor frequency resolution at high frequencies
  - Good frequency resolution and poor time resolution at low frequencies
- Wavelets are useful for short duration signals of high frequency and long duration signals of low frequency
- Wavelets are generated from a mother wavelet function $\psi$
  - Zero mean (oscillatory, i.e., wave nature): $\int \psi(x)\, dx = 0$
  - Unit length: $\int \psi^2(x)\, dx = 1$
- Basis functions are generated by scaling ($s$) and shifting ($l$) the mother wavelet:

$$\psi_{s,l}(t) = \frac{1}{\sqrt{s}}\, \psi\left(\frac{t - l}{s}\right)$$

12. ### Discrete wavelet transform (DWT)

- DWT generates a set of basis functions or vectors
- Two functions:
  1. Wavelet function
  2. Scaling function
- The space spanned by $n = 2^j$ basis vectors at level $j$ can be spanned by two sets of basis vectors $\psi$ and $\phi$ at level $j - 1$
- $\psi$ and $\phi$ are the wavelet and scaling functions respectively
- DWT generates basis vectors for the wavelet and scaling functions at different levels

[Figure: decomposition of $s_j$ into $s_{j-1}$ and $d_{j-1}$, then $s_{j-2}$ and $d_{j-2}$, and so on down to $s_0$ and $d_0$]

13. ### Haar wavelets: Example

- Compute the sum and difference between each consecutive pair of entries (and scale)
- Repeat the steps for the sum coefficients

$$f_3 = \{2, 5, 8, 9, 7, 4, -1, 1\}$$

$$f_2 = \frac{1}{\sqrt{2}} \{2 + 5,\ 8 + 9,\ 7 + 4,\ -1 + 1,\ 2 - 5,\ 8 - 9,\ 7 - 4,\ -1 - 1\} = \left\{\frac{7}{\sqrt{2}}, \frac{17}{\sqrt{2}}, \frac{11}{\sqrt{2}}, \frac{0}{\sqrt{2}}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$$

$$f_1 = \left\{\frac{24}{2}, \frac{11}{2}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$$

$$f_0 = \left\{\frac{35}{2\sqrt{2}}, \frac{13}{2\sqrt{2}}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$$

- Length is preserved: $\|f_0\|_2 = 15.524 = \|f_3\|_2$
- Invertible: $f_3$ is losslessly obtained from $f_0$

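
The sum-and-difference steps of this example can be sketched in a few lines of Python. The function names `haar` and `inverse_haar` are illustrative assumptions, and the input length is assumed to be a power of 2.

```python
import math

def haar(v):
    """Repeatedly replace the leading block by scaled pairwise sums, then differences."""
    v = list(v)
    m = len(v)                      # size of the block that is still being transformed
    while m > 1:
        half = m // 2
        sums = [(v[2 * i] + v[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(v[2 * i] - v[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        v = sums + diffs + v[m:]    # detail coefficients already computed stay in place
        m = half
    return v

def inverse_haar(c):
    """Undo the steps of haar, starting from the coarsest level."""
    c = list(c)
    m = 2
    while m <= len(c):
        half = m // 2
        block = []
        for i in range(half):
            s, d = c[i], c[half + i]
            block += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        c = block + c[m:]
        m *= 2
    return c

f3 = [2, 5, 8, 9, 7, 4, -1, 1]
f0 = haar(f3)                # first two entries are 35/(2*sqrt(2)) and 13/(2*sqrt(2))
f3_back = inverse_haar(f0)   # recovers f3 up to floating-point error
```
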
14. ### Haar wavelets: Theory

- Wavelet ($\psi$) and scaling ($\phi$) functions:

$$\psi(x) = \begin{cases} +1 & \text{when } 0 \le x < 1/2 \\ -1 & \text{when } 1/2 < x \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad \phi(x) = \begin{cases} 1 & \text{when } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

- Shifting ($i$) and scaling ($j$):

$$\psi_{j,i}(x) = 2^{j/2}\, \psi(2^j x - i), \quad j = 0, \dots; \quad i = 0, \dots, 2^j - 1$$

$$\phi_{j,i}(x) = 2^{j/2}\, \phi(2^j x - i), \quad j = 0, \dots; \quad i = 0, \dots, 2^j - 1$$

- Binary dilation (scaling) and dyadic translation (shifting)
- Coefficients corresponding to the scaling functions $\phi_{j,i}$ are averages or sum coefficients
- Coefficients corresponding to the wavelet functions $\psi_{j,i}$ are differences or detail coefficients

16. ### Matrix form

- $\psi_{j,i}$, $\phi_{j,i}$ etc. form the basis vectors
- Among the scaling functions, only $\phi_{0,0}$ is needed since the others are expressed in terms of $\psi_{\cdot,\cdot}$
- When the size of the vector is $n = 2^j$, $j$ levels of basis vectors are needed
- There are $2^{j-1} + 2^{j-2} + \dots + 2^0 = 2^j - 1$ basis vectors corresponding to detail coefficients and 1 basis vector corresponding to the sum coefficient at the 0th level
- The transformation matrix $H$ has these basis vectors as columns
- The transformed vector is $v' = v \cdot H$ for a data (row) vector $v$
- Each step can be defined as multiplication by a matrix
- Composition of these matrices gives the final transformation matrix
- $H$ is orthonormal: $H^{-1} = H^T$
  - Hence, the inverse operation is easy

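
The matrix view can be sketched for vectors of size 8: one matrix per sum-and-difference step, their product as the full transform, and a check that the product is orthonormal. The helper names (`step_matrix`, `matmul`, `vecmat`) and the column ordering (sums first, then differences) are assumptions of this sketch.

```python
import math

def step_matrix(n, m):
    """n x n matrix that turns the first m entries of a row vector into scaled pairwise
    sums followed by scaled pairwise differences; the remaining entries are unchanged."""
    a = 1 / math.sqrt(2)
    H = [[0.0] * n for _ in range(n)]
    for i in range(m // 2):
        H[2 * i][i] = a                 # column i holds the i-th sum
        H[2 * i + 1][i] = a
        H[2 * i][m // 2 + i] = a        # column m/2 + i holds the i-th difference
        H[2 * i + 1][m // 2 + i] = -a
    for r in range(m, n):
        H[r][r] = 1.0
    return H

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def vecmat(v, H):
    """Row vector times matrix."""
    return [sum(v[r] * H[r][c] for r in range(len(v))) for c in range(len(H[0]))]

H2, H1, H0 = step_matrix(8, 8), step_matrix(8, 4), step_matrix(8, 2)
H3 = matmul(matmul(H2, H1), H0)                  # composition of the three steps
f3 = [2, 5, 8, 9, 7, 4, -1, 1]
f0 = vecmat(f3, H3)                              # full transform in one multiplication
H3_T = [list(col) for col in zip(*H3)]           # transpose
should_be_identity = matmul(H3, H3_T)            # orthonormal: H3 * H3^T = I
f3_back = vecmat(f0, H3_T)                       # inverse via the transpose
```
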
17. ### Haar wavelet matrices

$$H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -1
\end{pmatrix}$$

$$f_2 = f_3 \cdot H_2$$

18. ### Haar wavelet matrices (contd.)

$$H_1 = \begin{pmatrix}
\tfrac{1}{\sqrt{2}} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\
\tfrac{1}{\sqrt{2}} & 0 & \tfrac{-1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\
0 & \tfrac{1}{\sqrt{2}} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
0 & \tfrac{1}{\sqrt{2}} & 0 & \tfrac{-1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$

$$f_1 = f_2 \cdot H_1$$

19. ### Haar wavelet matrices (contd.)

$$H_0 = \begin{pmatrix}
\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{1}{\sqrt{2}} & \tfrac{-1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$

$$f_0 = f_1 \cdot H_0$$

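20. ### Haar wavelet matrices (contd.)

$$H_3 = H_2 \cdot H_1 \cdot H_0 = \begin{pmatrix}
\tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2} & 0 & \tfrac{-1}{\sqrt{2}} & 0 & 0 & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2} & 0 & 0 & \tfrac{-1}{\sqrt{2}} & 0 & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2\sqrt{2}} & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2\sqrt{2}} & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{-1}{\sqrt{2}} & 0 \\
\tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2\sqrt{2}} & 0 & \tfrac{-1}{2} & 0 & 0 & 0 & \tfrac{1}{\sqrt{2}} \\
\tfrac{1}{2\sqrt{2}} & \tfrac{-1}{2\sqrt{2}} & 0 & \tfrac{-1}{2} & 0 & 0 & 0 & \tfrac{-1}{\sqrt{2}}
\end{pmatrix}$$

$$f_0 = f_3 \cdot H_3$$

Inversely, $f_3 = f_0 \cdot H_3^{-1} = f_0 \cdot H_3^T$
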
24. ### Dimensionality reduction using Haar wavelets

- Retain the sum and lower level detail coefficients
  - For example, retain 1 sum (at level 0) and 3 detail (1 at level 0 and 2 at level 1) coefficients
- Contractive mapping, i.e., lengths get reduced
- Alternatively, for a database of objects, retain those coefficients whose variances are highest
- Another option is to zero out coefficients whose absolute values are below a threshold
- What happens when the dimensionality is not a power of 2?
  - Pad zeros at the end: the latter half becomes less important
  - Pad an equal amount of zeros in each half: should be done recursively
- Interestingly, shuffling the dimensions produces different wavelet coefficients
  - How to shuffle to satisfy some criteria?

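
A minimal sketch of two of the options above: padding zeros at the end when the dimensionality is not a power of 2, and zeroing out small coefficients. The helper names, the 6-dimensional sample vector and the threshold value 1.0 are illustrative assumptions.

```python
import math

def haar(v):
    """Haar transform by repeated scaled pairwise sums and differences."""
    v = list(v)
    m = len(v)
    while m > 1:
        half = m // 2
        sums = [(v[2 * i] + v[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(v[2 * i] - v[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        v = sums + diffs + v[m:]
        m = half
    return v

def pad_to_power_of_two(v):
    """Pad zeros at the end until the length is a power of 2."""
    n = 1
    while n < len(v):
        n *= 2
    return list(v) + [0.0] * (n - len(v))

def zero_small(coeffs, eps):
    """Zero out coefficients whose absolute values are below eps."""
    return [c if abs(c) >= eps else 0.0 for c in coeffs]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

v = [2, 5, 8, 9, 7, 4]                  # 6 dimensions: not a power of 2
padded = pad_to_power_of_two(v)         # length 8
coeffs = haar(padded)                   # same length as v, since the padding adds only zeros
reduced = zero_small(coeffs, eps=1.0)   # can only shrink the length
```
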
28. ### Data discretization

- Transformation from a continuous domain to discrete intervals, i.e., from quantitative data to qualitative data
- Reduces the number of possible values of an attribute
- May be encoded in a more compact form
- If discretization uses information about groups, then it is called supervised discretization
  - Otherwise, unsupervised discretization
- Top-down discretization: repeatedly finds split points or cut points to divide the data
- Bottom-up discretization: considers all continuous values as split points and then repeatedly merges
- Discretization often leads to conceptualization where each distinct group represents a particular concept

29. ### Concept hierarchy

- Discrete intervals or concepts can be generated at multiple levels
- A concept hierarchy captures more basic concepts towards the top and finer details towards the leaves
- Example
  - Higher level: brilliant, ordinary, deficient
  - Middle level: CPI ≥ 9, 8-9, etc.
  - Lower level: actual CPI
- Can be used to represent knowledge hierarchies as well
  - Ontology
- Example
  - WordNet: ontology for English words
  - Gene Ontology: three separate ontologies that capture different attributes of a gene

30. ### Methods of discretization

- For numeric data
  - Binning and histogram analysis
  - Entropy-based discretization
  - Chi-square merging
  - Clustering
  - Intuitive partitioning


39. ### Entropy

- Entropy captures the amount of randomness or chaos in the data
- Entropy encodes the number of bits required to describe a distribution
- It is also called the information content

$$\mathrm{entropy}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

- When is entropy minimum?
  - When one $p_i$ is 1 (others are 0)
  - Least random
  - Entropy = 0
- When is entropy maximum?
  - When every $p_i$ is $1/n$
  - Most random
  - Entropy = $\log_2 n$
  - For $n = 2$, it is 1

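
The two extreme cases can be checked with a short function. This is a sketch; the name `entropy` and the convention that terms with $p_i = 0$ contribute 0 are assumptions stated here rather than taken from the slides.

```python
import math

def entropy(probabilities):
    """Entropy in bits; terms with p = 0 are skipped (taken as 0 by convention)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

h_min = entropy([1.0, 0.0, 0.0])   # one outcome is certain: least random
h_max = entropy([0.25] * 4)        # uniform over n = 4 outcomes: log2(4)
h_coin = entropy([0.5, 0.5])       # n = 2
```
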
41. ### Entropy-based discretization

- Supervised
- Top-down
- For attribute $A$, choose $n$ partitions $D_1, D_2, \dots, D_n$ for the dataset $D$
- If $D_i$ has instances from $m$ classes $C_1, C_2, \dots, C_m$, then the entropy of partition $D_i$ is

$$\mathrm{entropy}(D_i) = -\sum_{j=1}^{m} p_{ji} \log_2 p_{ji}$$

where $p_{ji}$ is the probability of class $C_j$ in partition $D_i$:

$$p_{ji} = \frac{|\{x \in D_i : x \in C_j\}|}{|D_i|}$$

43. ### Choosing partitions

- Choose $n - 1$ partition values $s_1, s_2, \dots, s_{n-1}$ such that $\forall v \in D_i,\ s_{i-1} < v \le s_i$
  - Implicitly, $s_0$ is the minimum and $s_n$ is the maximum
- How to choose these partition values?
  - Use the concept of information gain or expected information requirement

48. ### Information gain

- When $D$ is split into $n$ partitions, the expected information requirement is defined as

$$\mathrm{info}(D) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, \mathrm{entropy}(D_i)$$

- The information gain obtained by this partitioning is defined as

$$\mathrm{gain}(D) = \mathrm{entropy}(D) - \mathrm{info}(D)$$

- Choose partition values $s_1, s_2, \dots, s_{n-1}$ such that the information gain is maximized (equivalently, the expected information requirement is minimized)
- Keep on partitioning into two parts
- Stopping criterion
  - Number of categories is greater than a threshold
  - Expected information requirement is below a threshold

49. ### Example

- Dataset $D$ consists of two classes 1 and 2

| Class | Values |
|---|---|
| 1 | 10, 14, 22, 28 |
| 2 | 26, 28, 34, 36, 38 |

- Probabilities of the two classes are $p_1 = 4/9$ and $p_2 = 5/9$

$$\mathrm{entropy}(D) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 = 0.99$$

- How to choose a splitting point?

50. ### Splitting point 1

- Suppose the splitting point is $s = 24$
- Then, the probabilities of classes per partition are

$$p_{11} = 3/3 \quad p_{21} = 0/3 \quad p_{12} = 1/6 \quad p_{22} = 5/6$$

- Entropies are

$$\mathrm{entropy}(D_1) = -(3/3) \log_2 (3/3) - (0/3) \log_2 (0/3) = 0.00$$

$$\mathrm{entropy}(D_2) = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$$

- Expected information requirement and information gain are

$$\mathrm{info}(D) = \frac{|D_1|}{|D|}\, \mathrm{entropy}(D_1) + \frac{|D_2|}{|D|}\, \mathrm{entropy}(D_2) = (3/9) \times 0.00 + (6/9) \times 0.65 = 0.43$$

$$\mathrm{gain}(D) = \mathrm{entropy}(D) - \mathrm{info}(D) = 0.99 - 0.43 = 0.56$$

52. ### Splitting point 2

- Suppose the splitting point is $s = 31$
- Then, the probabilities of classes per partition are

$$p_{11} = 4/6 \quad p_{21} = 2/6 \quad p_{12} = 0/3 \quad p_{22} = 3/3$$

- Entropies are

$$\mathrm{entropy}(D_1) = -(4/6) \log_2 (4/6) - (2/6) \log_2 (2/6) = 0.92$$

$$\mathrm{entropy}(D_2) = -(0/3) \log_2 (0/3) - (3/3) \log_2 (3/3) = 0.00$$

- Expected information requirement and information gain are

$$\mathrm{info}(D) = \frac{|D_1|}{|D|}\, \mathrm{entropy}(D_1) + \frac{|D_2|}{|D|}\, \mathrm{entropy}(D_2) = (6/9) \times 0.92 + (3/9) \times 0.00 = 0.61$$

$$\mathrm{gain}(D) = \mathrm{entropy}(D) - \mathrm{info}(D) = 0.99 - 0.61 = 0.38$$

53. ### Example (contd.)

- So, 24 is a better splitting point than 31
- What is the optimal splitting point?
  - Exhaustive algorithm
  - Tests all possible $n - 1$ partitions

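
The exhaustive search can be sketched on the two-class example. The code below assumes that the candidate splitting points are the midpoints between consecutive distinct values (the slides do not fix this choice), and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(data, s):
    """Information gain of splitting (value, label) pairs into value <= s and value > s."""
    left = [label for value, label in data if value <= s]
    right = [label for value, label in data if value > s]
    info = sum(len(part) / len(data) * entropy(part) for part in (left, right) if part)
    return entropy([label for _, label in data]) - info

def best_split(data):
    """Assumption: candidates are midpoints between consecutive distinct values."""
    values = sorted({value for value, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda s: info_gain(data, s))

data = [(v, 1) for v in (10, 14, 22, 28)] + [(v, 2) for v in (26, 28, 34, 36, 38)]
gain_24 = info_gain(data, 24)   # about 0.56, as in the worked example
gain_31 = info_gain(data, 31)   # about 0.38
s_best = best_split(data)       # midpoint between 22 and 26
```

With midpoints as candidates, the best split for this dataset falls between 22 and 26, which is the same partition as the splitting point 24 used above.
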
55. ### Chi-square test

- Statistical test
  - More correctly, this is Pearson's chi-square test
- Can test for goodness of fit and independence
- Uses the chi-square statistic and the chi-square distribution
- The value of the chi-square statistic is

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency, $E_i$ is the expected frequency and $n$ is the number of categories for which frequencies are observed
- The chi-square statistic asymptotically approaches the chi-square distribution
- The chi-square distribution is characterized by a single parameter: degrees of freedom
  - Here, degrees of freedom is $n - 1$

56. ### Chi-square test for independence

- Two distributions with $k$ frequencies each
- The value of the chi-square statistic is

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

- Here, degrees of freedom is $(2 - 1) \times (k - 1)$
- The value of the statistic is compared against the chi-square distribution with $df = k - 1$
- Choose a significance level, say, 0.05
- If the statistic obtained is less than the theoretical (critical) value at that level, then conclude that the two distributions are independent at the chosen level of significance

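
As a sketch of the statistic for a contingency table, the function below derives each expected frequency from the row and column totals. The function name, the sample frequencies and the handling of cells with zero expected count are assumptions of this sketch.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:   # assumption: cells with zero expected count are skipped
                stat += (observed - expected) ** 2 / expected
    return stat

table = [[10, 20, 30],   # hypothetical frequencies of the first distribution
         [20, 20, 20]]   # hypothetical frequencies of the second distribution
stat = chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)   # (2 - 1) * (k - 1)
```

The value `stat` would then be compared against the chi-square table entry for `df` degrees of freedom at the chosen significance level.
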
60. ### Chi-merge discretization

- Supervised
- Bottom-up
- Tests every pair of adjacent intervals
  - If they are independent, then they can be merged
  - Otherwise, the difference in frequencies is statistically significant, and they should not be merged
- Keep on merging the intervals with the lowest chi-square value
- Stopping criterion
  - When the lowest chi-square value is greater than a threshold chosen at a particular level of significance
  - When the number of categories falls to a threshold

62. ### Example

- Dataset $D$ consists of two classes 1 and 2

| Class | Values |
|---|---|
| 1 | 1, 7, 8, 9, 37, 45, 46, 59 |
| 2 | 3, 11, 23, 39 |

- Test the first merging, i.e., the first two values: $(1, C_1)$ and $(3, C_2)$
- The contingency table is

| | $C_1$ | $C_2$ | Total |
|---|---|---|---|
| $I_1$ | 1 | 0 | 1 |
| $I_2$ | 0 | 1 | 1 |
| Total | 1 | 1 | 2 |

$$\chi^2 = \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} + \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} = 2$$

64. ### Example (contd.)

- Test the second merging, i.e., the second and third values: $(3, C_2)$ and $(7, C_1)$
  - $\chi^2$ is again 2
- Test the third merging, i.e., the third and fourth values: $(7, C_1)$ and $(8, C_1)$
- The contingency table is

| | $C_1$ | $C_2$ | Total |
|---|---|---|---|
| $I_1$ | 1 | 0 | 1 |
| $I_2$ | 1 | 0 | 1 |
| Total | 2 | 0 | 2 |

$$\chi^2 = \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} + \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} = 0$$

where the terms with zero expected frequency are taken as 0

67. ### Merging

- Initial intervals: {1}, {3}, {7}, {8}, {9}, {11}, {23}, {37}, {39}, {45}, {46}, {59}
  - $\chi^2$ between adjacent intervals: 2, 2, 0, 0, 2, 0, 2, 2, 2, 0, 0
- After merging: {1}, {3}, {7, 8, 9}, {11, 23}, {37}, {39}, {45, 46, 59}
  - $\chi^2$ between adjacent intervals: 2, 4, 5, 3, 2, 4
- Continue till a threshold
  - From the chi-square distribution table, the chi-square value at significance level 0.1 with degrees of freedom 1 is 2.70
  - So, when none of the chi-square values is less than 2.70, stop

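
The merging procedure on this example can be sketched as follows. The code assumes that one pair is merged per iteration, that ties are broken by taking the first pair with the lowest value, and that terms with zero expected frequency count as 0; the function names are illustrative.

```python
def chi_square(a, b):
    """Chi-square statistic of the 2 x m table formed by two lists of class counts."""
    table = [a, b]
    rows = [sum(a), sum(b)]
    cols = [x + y for x, y in zip(a, b)]
    total = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(len(cols)):
            expected = rows[i] * cols[j] / total
            if expected > 0:   # 0/0 terms are taken as 0
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def chi_merge(points, threshold):
    """points: sorted (value, class_index) pairs; returns the merged intervals."""
    num_classes = max(c for _, c in points) + 1
    intervals = [[v] for v, _ in points]
    counts = [[int(c == k) for k in range(num_classes)] for _, c in points]
    while len(intervals) > 1:
        chis = [chi_square(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)   # first pair with the lowest value
        if chis[i] >= threshold:                           # nothing below the threshold: stop
            break
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
        counts[i:i + 2] = [[x + y for x, y in zip(counts[i], counts[i + 1])]]
    return intervals

class1 = [1, 7, 8, 9, 37, 45, 46, 59]
class2 = [3, 11, 23, 39]
points = sorted([(v, 0) for v in class1] + [(v, 1) for v in class2])
initial = [chi_square([int(c == 0), int(c == 1)], [int(d == 0), int(d == 1)])
           for (_, c), (_, d) in zip(points, points[1:])]
intervals = chi_merge(points, threshold=2.70)
```

Under these assumptions, `initial` reproduces the first row of chi-square values above, and the run ends with the three intervals {1, 3, 7, 8, 9}, {11, 23, 37, 39} and {45, 46, 59}.
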
69. ### Intuitive partitioning

- Relatively uniform, easy to remember and easy to read intervals
  - For example, (500, 600) is a better interval than (512.23, 609.87)
- The idea is to break up into "natural" partitions
- Uses the 3-4-5 rule
  - Hierarchical partitioning
  - At each level, examine the number of distinct values of the most significant digit for each interval
    - If it is 3, 6, 7 or 9 (3n), partition into 3 equi-width intervals
      - May be 2-3-2 for 7 partitions
    - If it is 2, 4 or 8 (2n), partition into 4 equi-width intervals
    - If it is 1, 5 or 10 (5n), partition into 5 equi-width intervals
  - Recursively applied at each level
- To avoid outliers, it is applied for data that represents the majority, i.e., 5th percentile to 95th percentile

79. ### Example

Suppose the data has range (-351, 4700), i.e., min = -351 and max = 4700.

- The 5th and 95th percentiles are low = -159 and high = 1838, respectively.
- The most significant digit is at the 1000s position.
- Rounding yields low' = -1000 and high' = 2000.
- The number of distinct values at the most significant digit is (2000 - (-1000)) / 1000 = 3.
- The 3 equi-width partitions are (-1000, 0), (0, 1000) and (1000, 2000).
- Since low' < min, the boundary of the first interval is adjusted to (-400, 0).
- Since max > high', a new interval needs to be formed: (2000, 5000).

These partitions are then broken down recursively:

- (-400, 0) into 4 partitions: (-400, -300), (-300, -200), (-200, -100), (-100, 0)
- (0, 1000) into 5 partitions: (0, 200), (200, 400), (400, 600), (600, 800), (800, 1000)
- (1000, 2000) into 5 partitions: (1000, 1200), (1200, 1400), (1400, 1600), (1600, 1800), (1800, 2000)
- (2000, 5000) into 3 partitions: (2000, 3000), (3000, 4000), (4000, 5000)
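The partitioning step in this example can be sketched in Python. This is a minimal sketch, not the lecture's own code: it assumes the 3-4-5 rule in its commonly stated form (3, 6 or 9 distinct values give 3 equal-width partitions, 7 gives a 2-3-2 split, 2, 4 or 8 give 4 partitions, and 1, 5 or 10 give 5 partitions). The function names `msd_unit` and `partition_345` are hypothetical, and the adjustment of the outer boundaries to min and max is left to the caller.

```python
import math

def msd_unit(low, high):
    """Place value of the most significant digit of the larger endpoint."""
    largest = int(max(abs(low), abs(high)))
    return 10 ** (len(str(largest)) - 1)

def partition_345(low, high):
    """Split (low, high) once, following the assumed 3-4-5 rule.

    Assumes integer endpoints for which the partition widths divide evenly.
    """
    unit = msd_unit(low, high)
    lo = math.floor(low / unit) * unit   # round low down at the msd
    hi = math.ceil(high / unit) * unit   # round high up at the msd
    distinct = (hi - lo) // unit         # distinct values at the msd
    if distinct == 7:
        widths = [2 * unit, 3 * unit, 2 * unit]
    elif distinct in (3, 6, 9):
        widths = [(hi - lo) // 3] * 3
    elif distinct in (2, 4, 8):
        widths = [(hi - lo) // 4] * 4
    elif distinct in (1, 5, 10):
        widths = [(hi - lo) // 5] * 5
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    intervals, start = [], lo
    for width in widths:
        intervals.append((start, start + width))
        start += width
    return intervals

# Top level uses the 5th and 95th percentiles from the example.
top_level = partition_345(-159, 1838)
# Second level, applied to the adjusted first interval and the new last one.
first = partition_345(-400, 0)
last = partition_345(2000, 5000)
```

Calling the function again on each resulting interval reproduces the recursive breakdown listed in the example.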
80. ### Regression

- The aim is to describe how the response variable Y is generated.
- Regression can be used to predict missing values.
- There is a set of k predictors, denoted by X.
- There are n observations.
- The general form of a regression model is Y = f(X) + ε.
- The function f can be chosen using domain knowledge.
- For linear regression, f is linear.
- ε encodes the error terms associated with each observation.
82. ### Linear regression

- The function f is chosen to be linear.
- The response variable depends only linearly on the predictor variables.
- The general form of the linear regression model is Y = XW + ε.
- W holds the regression coefficients, i.e., the weights on the predictors.
- Sizes of the matrices:
  - Y : n × 1
  - X : n × k
  - W : k × 1
  - ε : n × 1
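To make the matrix sizes concrete, here is a minimal sketch that estimates W from data. The slide does not say how W is obtained; this sketch assumes ordinary least squares, solved through the normal equations (XᵀX)W = XᵀY, and assumes X has full column rank. The function name `fit_linear` is hypothetical, and in practice a numerical library would normally do this step.

```python
def fit_linear(X, Y):
    """Least-squares weights W for Y ~ XW.

    X is a list of n rows with k predictors each; Y is a list of n responses.
    Assumes X has full column rank (otherwise a pivot below becomes zero).
    """
    n, k = len(X), len(X[0])
    # Build the k x k matrix A = X^T X and the k-vector b = X^T Y.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * Y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution yields the k weights.
    W = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * W[j] for j in range(i + 1, k))
        W[i] = (b[i] - tail) / A[i][i]
    return W

# n = 4 observations, k = 2 predictors (a constant column plus one variable).
# The data are made up so that Y = 1 + 2x holds exactly, i.e. the error is zero.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 5, 7]
W = fit_linear(X, Y)
```

The constant column of ones in X is a common way to fold an intercept into W; it is an illustrative choice here, not something the slide specifies.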
85. ### Data integration

- Data integration is the process of transforming multiple data sources into one single coherent source.
- It is useful when there are multiple databases about the same set of objects.
- Schema matching and entity identification
  - Is cust_id equal to cust_number?
- Correlation analysis to reduce redundancy
  - Chi-square test for categorical data
- De-duplication
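The chi-square test mentioned above compares the observed co-occurrence counts of two categorical attributes with the counts expected if the attributes were independent. The following is a minimal sketch of the statistic only, χ² = Σ (observed − expected)² / expected. The function name `chi_square` is hypothetical, and the contingency table is made-up illustrative data, not taken from the lecture.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows.

    Assumes every row and column total is positive, so that no expected
    count is zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Illustrative 2 x 2 table: rows are values of one attribute,
# columns are values of the other.
table = [[250, 200],
         [50, 1000]]
stat = chi_square(table)
```

The statistic is compared against the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom. A large value suggests the two attributes are correlated, so one of them may be redundant after integration.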
89. ### Data transformation

Data transformation is useful when

- Identifying trends
- Normalizing to correctly get statistics
- Applying particular data mining algorithms

Transformation methods include

- Smoothing of bins using histograms
- Aggregation and summarization
- Generalization
- Normalization
95. ### Normalization

Normalization changes the range of values.

- Min-max normalization

  $$x' = \frac{x - \min}{\max - \min}$$

  This puts the range to (0, 1).
- If the new range is (min', max'):

  $$x' = \frac{x - \min}{\max - \min}\,(\max' - \min') + \min'$$

- Z-score normalization

  $$x' = \frac{x - \mu}{\sigma}$$

  This puts the range to (−∞, +∞). It is also called the standard score because the transformed values have mean 0 and standard deviation 1, like the standard normal distribution N(0, 1).
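Both normalizations can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `min_max` and `z_score` are hypothetical, the input is assumed to contain at least two distinct values (otherwise the denominators are zero), and σ is taken to be the population standard deviation, which the slide does not specify.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span (new_min, new_max)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population, not sample
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5 and sigma 2
unit_range = min_max(data)         # spans 0 to 1
custom_range = min_max(data, 10, 20)
standardized = z_score(data)       # mean 0, standard deviation 1
```

With the default arguments `min_max` reduces to the first formula. Replacing `statistics.pstdev` with `statistics.stdev` would use the sample standard deviation instead.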