Data Preprocessing 2

pankajmore
September 11, 2012

Transcript

  1. CS685: Data Mining Data Preprocessing Arnab Bhattacharya [email protected] Computer Science

    and Engineering, Indian Institute of Technology, Kanpur http://web.cse.iitk.ac.in/~cs685/ 1st semester, 2012-13 Tue, Wed, Fri 0900-1000 at CS101 Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 1 / 45
  2. Outline 1 Data reduction Numerosity reduction Data discretization Data modeling

    2 Data integration 3 Data transformation Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 2 / 45
  4. Fourier analysis

    Fourier analysis represents a periodic wave as a sum of (possibly infinite) sine and cosine waves It is a way to analyze the frequency components in a signal The Fourier transform is a transformation from the time domain f(x) to the frequency domain g(u): $g(u) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi u x i}\, dx$ f(x) can be obtained from g(u) by the inverse transformation $f(x) = \int_{-\infty}^{\infty} g(u)\, e^{2\pi u x i}\, du$ f(x) and g(u) form a Fourier transform pair Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 4 / 45
  5. Discrete Fourier transform (DFT)

    For the discrete case, assume vector x has N components The DFT of x, denoted by X, has N components given by $X_k = \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N)kni}$, k = 0, ..., N − 1 The inverse transformation (IDFT) is given by $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{(2\pi/N)kni}$, n = 0, ..., N − 1 To avoid using separate scaling factors, both can be taken as $1/\sqrt{N}$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 5 / 45
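
A minimal NumPy check of this transform pair (an editor-added sketch, not part of the original deck); with norm="ortho" NumPy uses the symmetric $1/\sqrt{N}$ scaling mentioned above in both directions:

```python
import numpy as np

# An arbitrary 8-point signal
x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])

X = np.fft.fft(x, norm="ortho")        # DFT with 1/sqrt(N) scaling
x_back = np.fft.ifft(X, norm="ortho")  # IDFT with 1/sqrt(N) scaling

print(np.allclose(x, x_back.real))            # True: the transform is invertible
print(np.linalg.norm(x), np.linalg.norm(X))   # equal lengths (see Parseval's theorem below)
```
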
  6. Properties

    Parseval's theorem: $\|x\|_2 = \|X\|_2$, i.e., the length of vectors is preserved: $\sum_{n=0}^{N-1} |x_n|^2 = \sum_{k=0}^{N-1} |X_k|^2$ when both scaling factors are $1/\sqrt{N}$ Contractive mapping, i.e., lengths get reduced Invertible, linear transformation Essentially, a rotation in N-dimensional space Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 6 / 45
  7. Coefficients

    Expanding $e^{\theta i} = \cos\theta + i \sin\theta$, $X_k = \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N)kni} = \sum_{n=0}^{N-1} x_n \left[ \cos\!\left(\tfrac{2\pi}{N}kn\right) - i \sin\!\left(\tfrac{2\pi}{N}kn\right) \right]$ $\therefore X_0 = \sum_{n=0}^{N-1} x_n (\cos 0 - i \sin 0) = \sum_{n=0}^{N-1} x_n$ The first coefficient is the (scaled) sum or average The other coefficients define the frequency components (cosine and sine) at frequencies $2\pi(k/N)$ for k = 0, 1, ..., N − 1 Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 7 / 45
  8. Dimensionality reduction Retain k lower frequency components Discard higher frequency

    noise Alternatively, for a database of objects, retain those coefficients whose variances are highest Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 8 / 45
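
One way to illustrate this reduction (an editor-added sketch; fourier_reduce is a made-up helper name) is to keep only the k lowest-frequency coefficients, plus their conjugate partners so that a real signal reconstructs as a real signal:

```python
import numpy as np

def fourier_reduce(x, k):
    """Keep the k lowest-frequency DFT coefficients (and their conjugate
    partners), zero out the rest, and reconstruct an approximation of x."""
    X = np.fft.fft(x, norm="ortho")
    Xk = np.zeros_like(X)
    Xk[:k] = X[:k]
    if k > 1:
        Xk[-(k - 1):] = X[-(k - 1):]   # conjugate-symmetric partners
    return np.fft.ifft(Xk, norm="ortho").real

x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])
print(np.round(fourier_reduce(x, 3), 2))   # smoothed, low-frequency approximation of x
```
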
  9. Discrete cosine transform (DCT)

    A way to represent a signal in terms of cosine waves of different frequencies (and amplitudes) Various definitions: $X^{(1)}_k = \sqrt{\tfrac{2}{N}}\, u^{(1)}_k \sum_{n=0}^{N-1} x_n \cos\!\left[\tfrac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right]$, k = 0, ..., N − 1, with $u^{(1)}_0 = 1/\sqrt{2}$ and $u^{(1)}_k = 1$ for k = 1, ..., N − 1 $X^{(2)}_k = \sqrt{\tfrac{2}{N}} \sum_{n=0}^{N-1} x_n \cos\!\left[\tfrac{\pi}{N}\left(n + \tfrac{1}{2}\right)\left(k + \tfrac{1}{2}\right)\right]$, k = 0, ..., N − 1 Length preserving Inverses are the same functions Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 9 / 45
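
These two definitions appear to correspond to the orthonormal DCT-II and DCT-IV conventions; under that assumption, SciPy can be used to check the length-preserving and self-inverse claims (editor-added sketch):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.array([2.0, 5.0, 8.0, 9.0, 7.0, 4.0, -1.0, 1.0])

X2 = dct(x, type=2, norm="ortho")   # first definition (DCT-II)
X4 = dct(x, type=4, norm="ortho")   # second definition (DCT-IV)

print(np.isclose(np.linalg.norm(x), np.linalg.norm(X2)))   # True: length preserving
print(np.allclose(x, idct(X2, type=2, norm="ortho")))      # True: invertible
print(np.allclose(x, dct(X4, type=4, norm="ortho")))       # True: DCT-IV is its own inverse
```
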
  10. Wavelets

    The Fourier transform analyzes frequency resolution, but not time Wavelets analyze a function in both the time and frequency domains Good time resolution and poor frequency resolution at high frequencies Good frequency resolution and poor time resolution at low frequencies Wavelets are therefore useful for short-duration signals of high frequency and long-duration signals of low frequency Wavelets are generated from a mother wavelet function ψ Zero mean (oscillatory, i.e., wave nature): $\int \psi(x)\, dx = 0$ Unit length: $\int \psi^2(x)\, dx = 1$ Basis functions are generated by scaling (s) and shifting (l) the mother wavelet: $\psi_{s,l}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - l}{s}\right)$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 10 / 45
  11. Discrete wavelet transform (DWT)

    The DWT generates a set of basis functions or vectors using two functions: 1 the wavelet function ψ and 2 the scaling function φ The space spanned by the $n = 2^j$ basis vectors at level j can be spanned by two sets of basis vectors, ψ and φ, at level j − 1 The DWT generates basis vectors for the wavelet and scaling functions at different levels [Diagram: $s_j$ is recursively decomposed by the scaling (φ) and wavelet (ψ) functions into $s_{j-1}, d_{j-1}$, then $s_{j-2}, d_{j-2}$, ..., down to $s_0, d_0$] Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 11 / 45
  12. Haar wavelets: Example

    Compute the sum and difference of each consecutive pair of entries (and scale); repeat the steps on the sum coefficients $f_3 = \{2, 5, 8, 9, 7, 4, -1, 1\}$ $f_2 = \frac{1}{\sqrt{2}} \{2+5,\ 8+9,\ 7+4,\ -1+1,\ 2-5,\ 8-9,\ 7-4,\ -1-1\} = \left\{\frac{7}{\sqrt{2}}, \frac{17}{\sqrt{2}}, \frac{11}{\sqrt{2}}, \frac{0}{\sqrt{2}}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$ $f_1 = \left\{\frac{24}{2}, \frac{11}{2}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$ $f_0 = \left\{\frac{35}{2\sqrt{2}}, \frac{13}{2\sqrt{2}}, \frac{-10}{2}, \frac{11}{2}, \frac{-3}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, \frac{3}{\sqrt{2}}, \frac{-2}{\sqrt{2}}\right\}$ Length is preserved: $\|f_0\|_2 = 15.524 = \|f_3\|_2$ Invertible: $f_3$ is losslessly obtained from $f_0$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 12 / 45
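
The pairwise sum/difference procedure above can be sketched in a few lines of Python (editor-added; it reproduces the numbers of the example):

```python
import numpy as np

def haar(signal):
    """Full Haar decomposition: repeatedly replace the current sum part by
    pairwise (sums, differences) / sqrt(2), as in the example above."""
    out = np.asarray(signal, dtype=float).copy()
    n = len(out)
    while n > 1:
        sums  = (out[0:n:2] + out[1:n:2]) / np.sqrt(2)
        diffs = (out[0:n:2] - out[1:n:2]) / np.sqrt(2)
        out[:n] = np.concatenate([sums, diffs])
        n //= 2
    return out

f3 = [2, 5, 8, 9, 7, 4, -1, 1]
f0 = haar(f3)
print(np.round(f0, 3))                          # [12.374  4.596 -5.  5.5 -2.121 -0.707  2.121 -1.414]
print(np.linalg.norm(f3), np.linalg.norm(f0))   # both 15.524...: length is preserved
```
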
  13. Haar wavelets: Theory

    Wavelet (ψ) and scaling (φ) functions: $\psi(x) = \begin{cases} +1 & 0 \le x < 1/2 \\ -1 & 1/2 < x \le 1 \\ 0 & \text{otherwise} \end{cases}$ and $\phi(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$ Shifting (i) and scaling (j): $\psi_{j,i}(x) = 2^{j/2}\, \psi(2^j x - i)$ and $\phi_{j,i}(x) = 2^{j/2}\, \phi(2^j x - i)$ for j = 0, 1, ... and $i = 0, \ldots, 2^j - 1$ Binary dilation (scaling) and dyadic translation (shifting) Coefficients corresponding to $\phi_{j,i}$ are the averages or sum coefficients Coefficients corresponding to $\psi_{j,i}$ are the differences or detail coefficients Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 13 / 45
  15. Matrix form

    $\psi_{j,i}$, $\phi_{j,i}$, etc. form the basis vectors Only $\phi_{0,0}$ is needed among the scaling functions, since the others can be expressed in terms of the $\psi_{\cdot,\cdot}$ (and $\phi_{0,0}$) When the size of the vector is $n = 2^j$, j levels of basis vectors are needed There are $2^{j-1} + 2^{j-2} + \cdots + 2^0 = 2^j - 1$ basis vectors corresponding to detail coefficients and 1 basis vector corresponding to the sum coefficient at the 0th level The transformation matrix H has these basis vectors as columns The transformed vector is $v' = v \cdot H$ for a data (row) vector v Each step can be defined as multiplication by a matrix; the composition of these matrices gives the final transformation matrix H is orthonormal: $H^{-1} = H^T$ Hence, the inverse operation is easy Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 14 / 45
  16. Haar wavelet matrices

    $H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \end{bmatrix}$ $f_2 = f_3 \cdot H_2$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 15 / 45
  17. Haar wavelet matrices (contd.)

    $H_1 = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\ \tfrac{1}{\sqrt{2}} & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\ 0 & \tfrac{1}{\sqrt{2}} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\ 0 & \tfrac{1}{\sqrt{2}} & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$ $f_1 = f_2 \cdot H_1$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 16 / 45
  18. Haar wavelet matrices (contd.)

    $H_0 = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$ $f_0 = f_1 \cdot H_0$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 17 / 45
  19. Haar wavelet matrices (contd.)

    $H_3 = H_2 \cdot H_1 \cdot H_0 = \begin{bmatrix} \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2} & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\ \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2} & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\ \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0 & 0 \\ \tfrac{1}{2\sqrt{2}} & \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2} & 0 & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 \\ \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2\sqrt{2}} & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{\sqrt{2}} & 0 \\ \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2\sqrt{2}} & 0 & \tfrac{1}{2} & 0 & 0 & -\tfrac{1}{\sqrt{2}} & 0 \\ \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2\sqrt{2}} & 0 & -\tfrac{1}{2} & 0 & 0 & 0 & \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{2\sqrt{2}} & -\tfrac{1}{2\sqrt{2}} & 0 & -\tfrac{1}{2} & 0 & 0 & 0 & -\tfrac{1}{\sqrt{2}} \end{bmatrix}$ $f_0 = f_3 \cdot H_3$ Inversely, $f_3 = f_0 \cdot H_3^{-1} = f_0 \cdot H_3^T$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 18 / 45
  23. Dimensionality reduction using Haar wavelets

    Retain the sum and lower-level detail coefficients For example, retain 1 sum (at level 0) and 3 detail (1 at level 0 and 2 at level 1) coefficients Contractive mapping, i.e., lengths get reduced Alternatively, for a database of objects, retain those coefficients whose variances are highest Another option is to zero out coefficients whose absolute values are below a threshold (see the sketch below) What happens when the dimensionality is not a power of 2? Pad zeros at the end: the latter half becomes less important Pad an equal amount of zeros in each half: should be done recursively Interestingly, shuffling the dimensions produces different wavelet coefficients How to shuffle to satisfy some criteria? Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 19 / 45
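
A tiny editor-added sketch of the thresholding option mentioned above (threshold_coefficients is a made-up name):

```python
import numpy as np

def threshold_coefficients(coeffs, eps):
    """Zero out wavelet coefficients whose absolute value falls below eps."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) < eps, 0.0, c)

# Haar coefficients f0 computed earlier for {2, 5, 8, 9, 7, 4, -1, 1}
f0 = np.array([12.374, 4.596, -5.0, 5.5, -2.121, -0.707, 2.121, -1.414])
print(threshold_coefficients(f0, 1.5))   # the -0.707 and -1.414 entries are zeroed out
```
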
  27. Data discretization Transformation from continuous domain to discrete intervals, i.e.,

    from quantitative data to qualitative data Reduces number of possible values of an attribute May be encoded in a more compact form If discretization uses information about groups, then it is called supervised discretization Otherwise unsupervised discretization Top-down discretization: repeatedly finds split points or cut points to divide the data Bottom-up discretization: considers all continuous values as split points and then repeatedly merges Discretization often leads to conceptualization where each distinct group represents a particular concept Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 20 / 45
  28. Concept hierarchy Discrete intervals or concepts can be generated at

    multiple levels Concept hierarchy captures more basic concepts towards the top and finer details towards the leaves Example Higher level: brilliant, ordinary, deficient Middle level: CPI ≥ 9, 8 − 9, etc. Lower level: actual CPI Can be used to represent knowledge hierarchies as well Ontology Example WordNet: ontology for English words Gene Ontology: three separate ontologies that capture different attributes of a gene Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 21 / 45
  29. Methods of discretization For numeric data Binning and Histogram analysis

    Entropy-based discretization Chi-square merging Clustering Intuitive partitioning Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 22 / 45
  37. Entropy

    Entropy captures the amount of randomness or chaos in the data Entropy encodes the number of bits required to describe a distribution It is also called the information content: $\text{entropy}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$ When is entropy minimum? When one $p_i$ is 1 (the others are 0): least random, entropy = 0 When is entropy maximum? When every $p_i$ is 1/n: most random, entropy = $\log_2 n$ (for n = 2, it is 1) Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 23 / 45
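
An editor-added helper matching this definition and its two extreme cases:

```python
import math

def entropy(probabilities):
    """entropy(D) = -sum(p_i * log2(p_i)); zero probabilities contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0, 0.0]))     # 0.0 : least random
print(entropy([0.5, 0.5]))     # 1.0 : most random for n = 2
print(entropy([0.25] * 4))     # 2.0 : log2(4) for a uniform distribution over 4 outcomes
```
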
  38. Entropy-based discretization

    Supervised, top-down For attribute A, choose n partitions $D_1, D_2, \ldots, D_n$ of the dataset D If $D_i$ has instances from m classes $C_1, C_2, \ldots, C_m$, then the entropy of partition $D_i$ is $\text{entropy}(D_i) = -\sum_{j=1}^{m} p_{ji} \log_2 p_{ji}$ where $p_{ji}$ is the probability of class $C_j$ in partition $D_i$: $p_{ji} = \frac{|C_j \in D_i|}{|D_i|}$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 24 / 45
  40. Choosing partitions

    Choose n − 1 partition values $s_1, s_2, \ldots, s_{n-1}$ such that $\forall v \in D_i,\ s_{i-1} < v \le s_i$ Implicitly, $s_0$ is the minimum and $s_n$ is the maximum How to choose these partition values? Use the concept of information gain or expected information requirement Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 25 / 45
  45. Information gain

    When D is split into n partitions, the expected information requirement is defined as $\text{info}(D) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, \text{entropy}(D_i)$ The information gain obtained by this partitioning is defined as $\text{gain}(D) = \text{entropy}(D) - \text{info}(D)$ Choose partition values $s_1, s_2, \ldots, s_{n-1}$ such that the information gain is maximized (equivalently, the expected information requirement is minimized) Keep on partitioning into two parts Stopping criterion: number of categories greater than a threshold, or expected information requirement below a threshold Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 26 / 45
  46. Example

    Dataset D consists of two classes 1 and 2 Class 1: 10, 14, 22, 28 Class 2: 26, 28, 34, 36, 38 The probabilities of the two classes are $p_1 = 4/9$ and $p_2 = 5/9$ $\text{entropy}(D) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 = 0.99$ How to choose a splitting point? Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 27 / 45
  47. Splitting point 1

    Suppose the splitting point is s = 24 Then, the probabilities of the classes per partition are $p_{11} = 3/3$, $p_{21} = 0/3$, $p_{12} = 1/6$, $p_{22} = 5/6$ The entropies are $\text{entropy}(D_1) = -(3/3)\log_2(3/3) - (0/3)\log_2(0/3) = 0.00$ and $\text{entropy}(D_2) = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.65$ The expected information requirement and the entropy gain are $\text{info}(D) = (|D_1|/|D|)\,\text{entropy}(D_1) + (|D_2|/|D|)\,\text{entropy}(D_2) = (3/9) \times 0.00 + (6/9) \times 0.65 = 0.43$ and $\text{gain}(D) = \text{entropy}(D) - \text{info}(D) = 0.99 - 0.43 = 0.56$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 28 / 45
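
An editor-added sketch that reproduces these numbers, as well as those for the second candidate split point on the next slide (the helper names are made up):

```python
import math

def entropy(counts):
    """Entropy of a partition given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_for_split(class1, class2, s):
    """Information gain of splitting the two labelled value lists at point s."""
    d1 = [len([v for v in class1 if v <= s]), len([v for v in class2 if v <= s])]
    d2 = [len([v for v in class1 if v > s]),  len([v for v in class2 if v > s])]
    n = len(class1) + len(class2)
    info = (sum(d1) / n) * entropy(d1) + (sum(d2) / n) * entropy(d2)
    return entropy([len(class1), len(class2)]) - info

class1 = [10, 14, 22, 28]
class2 = [26, 28, 34, 36, 38]
print(round(gain_for_split(class1, class2, 24), 2))   # 0.56
print(round(gain_for_split(class1, class2, 31), 2))   # 0.38
```
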
  49. Splitting point 2

    Suppose the splitting point is s = 31 Then, the probabilities of the classes per partition are $p_{11} = 4/6$, $p_{21} = 2/6$, $p_{12} = 0/3$, $p_{22} = 3/3$ The entropies are $\text{entropy}(D_1) = -(4/6)\log_2(4/6) - (2/6)\log_2(2/6) = 0.92$ and $\text{entropy}(D_2) = -(0/3)\log_2(0/3) - (3/3)\log_2(3/3) = 0.00$ The expected information requirement and the entropy gain are $\text{info}(D) = (|D_1|/|D|)\,\text{entropy}(D_1) + (|D_2|/|D|)\,\text{entropy}(D_2) = (6/9) \times 0.92 + (3/9) \times 0.00 = 0.61$ and $\text{gain}(D) = \text{entropy}(D) - \text{info}(D) = 0.99 - 0.61 = 0.38$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 29 / 45
  50. Example (contd.) So, 24 is a better splitting point than

    31 What is the optimal splitting point? Exhaustive algorithm Tests all possible n − 1 partitions Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 30 / 45
  52. Chi-square test

    A statistical test; more precisely, this is Pearson's chi-square test Can test for goodness of fit and for independence Uses the chi-square statistic and the chi-square distribution The value of the chi-square statistic is $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$ where $O_i$ is the observed frequency, $E_i$ is the expected frequency and n is the number of observations The chi-square statistic asymptotically approaches the chi-square distribution The chi-square distribution is characterized by a single parameter, the degrees of freedom Here, the degrees of freedom is n − 1 Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 31 / 45
  53. Chi-square test for independence

    Two distributions with k frequencies each The value of the chi-square statistic is $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ Here, the degrees of freedom is (2 − 1) × (k − 1) The value of the statistic is compared against the chi-square distribution with df = k − 1 Choose a significance level, say, 0.05 If the statistic obtained is less than the theoretical level, then conclude that the two distributions are independent at the chosen level of significance Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 32 / 45
  56. Chi-merge discretization

    Supervised, bottom-up Tests every pair of adjacent intervals If they are independent, then they can be merged Otherwise, the difference in frequencies is statistically significant, and they should not be merged Keep on merging the pair of intervals with the lowest chi-square value Stopping criterion: the lowest chi-square value is greater than the threshold chosen at a particular level of significance, or the number of categories has reached a desired threshold Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 33 / 45
  58. Example

    Dataset D consists of two classes 1 and 2 Class 1: 1, 7, 8, 9, 37, 45, 46, 59 Class 2: 3, 11, 23, 39 Test the first merging, i.e., the first two values: (1, C1) and (3, C2) The contingency table is
              C1   C2   total
        I1     1    0     1
        I2     0    1     1
        total  1    1     2
    $\chi^2 = \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} + \frac{(1 - 1/2)^2}{1/2} + \frac{(0 - 1/2)^2}{1/2} = 2$ Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 34 / 45
  60. Example (contd.)

    Test the second merging, i.e., the second and third values: (3, C2) and (7, C1) $\chi^2$ is again 2 Test the third merging, i.e., the third and fourth values: (7, C1) and (8, C1) The contingency table is
              C1   C2   total
        I1     1    0     1
        I2     1    0     1
        total  2    0     2
    $\chi^2 = \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} + \frac{(1 - 1)^2}{1} + \frac{(0 - 0)^2}{0} = 0$ (cells with expected frequency 0 are taken as contributing 0) Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 35 / 45
  63. Merging

    Data: 1 | 3 | 7 | 8 | 9 | 11 | 23 | 37 | 39 | 45 | 46 | 59
    χ²:     2   2   0   0   2    0    2    2    2    0    0
    After merging the zero-χ² neighbours:
    Data: 1 | 3 | 7, 8, 9 | 11, 23 | 37 | 39 | 45, 46, 59
    χ²:     2      4          5      3    2       4
    Continue till a threshold From the chi-square distribution table, the chi-square value at significance level 0.1 with 1 degree of freedom is 2.70 So, when none of the chi-square values is less than 2.70, stop Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 36 / 45
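
An editor-added sketch of the chi-square computation used in this example, with the same convention that cells with expected frequency 0 contribute 0:

```python
def chi_square(interval1, interval2):
    """Pearson chi-square statistic for the 2x2 contingency table built from
    the per-class counts of two adjacent intervals."""
    observed = [interval1, interval2]
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

print(chi_square([1, 0], [0, 1]))   # 2.0 : intervals {1} (class 1) and {3} (class 2)
print(chi_square([1, 0], [1, 0]))   # 0.0 : intervals {7} and {8}, both class 1
print(chi_square([0, 1], [3, 0]))   # 4.0 : intervals {3} and {7, 8, 9} after merging
```
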
  65. Intuitive partitioning Relatively uniform, easy to remember and easy to

    read intervals For example, (500, 600) is a better interval than (512.23, 609.87) Idea is to break up into “natural” partitions Uses the 3-4-5 rule Hierarchical partitioning At each level, examine number of distinct values of the most significant digit for each interval If it is 3, 6, 7 or 9 (3n), partition into 3 equi-width intervals May be 2-3-2 for 7 partitions If it is 2, 4 or 8 (2n), partition into 4 equi-width intervals If it is 1, 5 or 10 (5n), partition into 5 equi-width intervals Recursively applied at each level To avoid outliers, it is applied for data that represents the majority, i.e., 5th percentile to 95th percentile Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 37 / 45
  75. Example Suppose data has range (-351, 4700), i.e., min =

    -351, max = 4700 5th and 95th percentile are low = -159 and high = 1838 respectively Most significant digit is for 1000 Rounding yields low’ = -1000 and high’ = 2000 Number of distinct digits is 3 3 equi-width partitions are (-1000, 0), (0, 1000) and (1000, 2000) Since low’ < min, adjust boundary of first interval to (-400, 0) Since max > high’, a new interval needs to be formed: (2000, 5000) These partitions are broken recursively (-400, 0) into 4 partitions: (-400, -300), (-300, -200), (-200, -100), (-100, 0) (0, 1000) into 5 partitions: (0, 200), (200, 400), (400, 600), (600, 800), (800, 1000) (1000, 2000) into 5 partitions: (1000, 1200), (1200, 1400), (1400, 1600), (1600, 1800), (1800, 2000) (2000, 5000) into 3 partitions: (2000, 3000), (3000, 4000), (4000, 5000) Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 38 / 45
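
A rough editor-added sketch of just the top-level step of the 3-4-5 rule on the numbers of this example; the boundary adjustment against min/max and the recursion into sub-intervals are omitted, and the helper names are made up:

```python
import math

def msd_width(value):
    """Width of the most significant digit position, e.g. 1838 -> 1000."""
    return 10 ** int(math.floor(math.log10(abs(value))))

def top_level_345(low, high):
    """Round low/high to the most significant digit and pick 3, 4 or 5
    equi-width intervals based on the number of distinct leading digits."""
    width = msd_width(max(abs(low), abs(high)))
    low_r = math.floor(low / width) * width
    high_r = math.ceil(high / width) * width
    distinct = round((high_r - low_r) / width)
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                      # 1, 5, 10
        parts = 5
    step = (high_r - low_r) // parts
    return [(low_r + i * step, low_r + (i + 1) * step) for i in range(parts)]

# 5th/95th percentiles from the example: low = -159, high = 1838
print(top_level_345(-159, 1838))   # [(-1000, 0), (0, 1000), (1000, 2000)]
```
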
  76. Regression

    Aim is to describe how the response variable Y is generated Can predict missing values There is a set of k predictors, denoted by X There are n observations The general form of the regression model is $Y = f(X) + \varepsilon$ The function f can be chosen using domain knowledge For linear regression, f is linear $\varepsilon$ encodes the error terms associated with each observation Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 39 / 45
  78. Linear regression

    The function f is chosen to be linear The response variable depends only linearly on the predictor variables The general form of the linear regression model is $Y = XW + \varepsilon$ W are the regression coefficients or weights on the predictors Sizes of the matrices: Y is n × 1, X is n × k, W is k × 1, ε is n × 1 Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 40 / 45
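
An editor-added sketch fitting W by least squares on synthetic data with exactly these shapes:

```python
import numpy as np

# Y = X W + eps with n observations and k predictors
rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))                       # n x k predictor matrix
true_W = np.array([[2.0], [-1.0], [0.5]])         # k x 1 weights
Y = X @ true_W + 0.1 * rng.normal(size=(n, 1))    # n x 1 response with noise

W, *_ = np.linalg.lstsq(X, Y, rcond=None)         # k x 1 estimated coefficients
print(W.round(2))                                  # close to the true weights
```
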
  79. Outline 1 Data reduction Numerosity reduction Data discretization Data modeling

    2 Data integration 3 Data transformation Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 41 / 45
  81. Data integration Data integration is the process of transforming multiple

    data sources into one single coherent source Useful when there are multiple databases about the same set of objects Schema matching and entity identification Is cust id equal to cust number? Correlation analysis to reduce redundancy Chi-square test for categorical data De-duplication Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 42 / 45
  82. Outline 1 Data reduction Numerosity reduction Data discretization Data modeling

    2 Data integration 3 Data transformation Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 43 / 45
  84. Data transformation Data transformation is useful when Identifying trends Normalizing

    to correctly get statistics Applying particular data mining algorithms Smoothing of bins using histograms Aggregation and summarization Generalization Normalization Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 44 / 45
  89. Normalization

    Normalization changes the range of values Min-max normalization: $x' = \frac{x - \min}{\max - \min}$ This puts the range to (0, 1) If the new range is (min', max'): $x' = \frac{x - \min}{\max - \min}(\max' - \min') + \min'$ Z-score normalization: $x' = \frac{x - \mu}{\sigma}$ This puts the range to (−∞, +∞) Also called the standard score because it is the value for the standard normal distribution N(0, 1) Arnab Bhattacharya ([email protected]) CS685: Preprocessing 2 2012-13 45 / 45
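
Editor-added sketches of the two normalizations on a small list of values:

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization to the range (new_min, new_max)."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (new_max - new_min) + new_min

def z_score(x):
    """Z-score (standard score) normalization: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = [10, 14, 22, 26, 28, 34, 36, 38]
print(min_max(values))             # mapped into (0, 1)
print(min_max(values, -1, 1))      # mapped into a custom range (-1, 1)
print(z_score(values).round(2))    # zero mean, unit standard deviation
```
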