sum of (infinite) sine and cosine waves
- Way to analyze the frequency components in a signal
- Fourier transform is a transformation from time domain $f(x)$ to frequency domain $g(u)$
$$g(u) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi u x i}\, dx$$
- $f(x)$ can be obtained from $g(u)$ by the inverse transformation
$$f(x) = \int_{-\infty}^{\infty} g(u)\, e^{2\pi u x i}\, du$$
- $f(x)$ and $g(u)$ form a Fourier transform pair
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 2 2012-13

has N components
- DFT of x, denoted by X, has N components given by
$$X_k = 1 \cdot \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N) k n i}, \qquad k = 0, \dots, N-1$$
- Inverse transformation (IDFT) is given by
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{(2\pi/N) k n i}, \qquad n = 0, \dots, N-1$$
- To avoid using separate scaling factors, both can be taken as $1/\sqrt{N}$

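The two sums above can be evaluated directly. The following is a minimal sketch in plain Python, not an efficient FFT; the function names `dft` and `idft` and the sample signal are illustrative assumptions, and the scaling follows the slide (1 for the forward transform, 1/N for the inverse).

```python
import cmath

def dft(x):
    """Forward DFT with scaling factor 1: X_k = sum_n x_n * exp(-(2*pi/N)*k*n*i)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT with scaling factor 1/N: x_n = (1/N) * sum_k X_k * exp(+(2*pi/N)*k*n*i)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

signal = [1.0, 2.0, 0.0, -1.0]   # hypothetical input sequence
spectrum = dft(signal)           # N complex coefficients
recovered = idft(spectrum)       # matches signal up to floating-point error
```

Applying `idft` to the output of `dft` returns the original sequence, which is what makes the pair usable for lossless transformation.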
is preserved
$$\sum_{n=0}^{N-1} |x_n|^2 = \sum_{k=0}^{N-1} |X_k|^2$$
- When both scaling factors are $1/\sqrt{N}$
- Contractive mapping when only a subset of the coefficients is retained, i.e., lengths can only get reduced
- Invertible, linear transformation
- Essentially, a rotation in N-dimensional space

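A quick numerical check of the energy identity, as a sketch that assumes the $1/\sqrt{N}$ scaling on the forward transform. The helper names and the input vector are hypothetical. Dropping coefficients afterwards can only lose energy, which is the contractive behaviour.

```python
import cmath
import math

def unitary_dft(x):
    """DFT with scaling factor 1/sqrt(N), so that energy is preserved."""
    N = len(x)
    scale = 1 / math.sqrt(N)
    return [scale * sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def energy(v):
    """Sum of squared magnitudes of the components."""
    return sum(abs(c) ** 2 for c in v)

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]   # hypothetical data vector
X = unitary_dft(x)
truncated = X[:3]   # retain only the first 3 coefficients

full_energy = energy(x)               # energy in the time domain
transformed_energy = energy(X)        # same value in the frequency domain
truncated_energy = energy(truncated)  # never larger than full_energy
```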
$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-(2\pi/N) k n i} = \sum_{n=0}^{N-1} x_n \left[ \cos\left(\frac{2\pi}{N} k n\right) - i \sin\left(\frac{2\pi}{N} k n\right) \right]$$
$$\therefore\; X_0 = \sum_{n=0}^{N-1} x_n \left( \cos 0 - i \sin 0 \right) = \sum_{n=0}^{N-1} x_n$$
- First coefficient is (scaled) sum or average
- Other coefficients define the frequency components (cosine and sine) at frequencies $2\pi(k/N)$ for $k = 0, 1, \dots, N-1$

noise
- Alternatively, for a database of objects, retain those coefficients whose variances are highest

terms of cosine waves of different frequencies (and amplitudes)
- Various definitions
$$X^{(1)}_k = \frac{2}{N} \sum_{n=0}^{N-1} u^{(1)}_n\, x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right], \qquad k = 0, \dots, N-1$$
$$X^{(2)}_k = \frac{2}{N} \sum_{n=0}^{N-1} u^{(2)}_n\, x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) \left( k + \frac{1}{2} \right) \right], \qquad k = 0, \dots, N-1$$
$$u^{(1)}_n = \begin{cases} 1/\sqrt{2} & n = 0 \\ 1 & n = 1, \dots, N-1 \end{cases} \qquad u^{(2)}_n = 1, \quad n = 0, \dots, N-1$$
- Length preserving
- Inverses are the same functions

analyze a function in both time and frequency domains
- Good time resolution and poor frequency resolution at high frequencies
- Good frequency resolution and poor time resolution at low frequencies
- Wavelets are useful for short duration signals of high frequency and long duration signals of low frequency
- Wavelets are generated from a mother wavelet function $\psi$
  - Zero mean (oscillatory, i.e., wave nature): $\int \psi(x)\, dx = 0$
  - Unit length: $\int \psi^2(x)\, dx = 1$
- Basis functions are generated by scaling (s) and shifting (l) the mother wavelet
$$\psi_{s,l}(t) = \frac{1}{\sqrt{s}}\, \psi\left( \frac{t - l}{s} \right)$$

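To see the scaling and shifting formula at work, here is a small sketch that assumes the Haar function as the mother wavelet (the slide does not fix a particular $\psi$). The helper names and the midpoint-rule integrator are illustrative only. It checks numerically that a scaled and shifted copy still has zero mean and unit length.

```python
import math

def haar_mother(x):
    """Assumed mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def scaled_shifted(psi, s, l):
    """Return psi_{s,l}(t) = (1/sqrt(s)) * psi((t - l) / s)."""
    return lambda t: psi((t - l) / s) / math.sqrt(s)

def integrate(f, a, b, steps=6000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

psi_2_1 = scaled_shifted(haar_mother, s=2.0, l=1.0)   # supported on [1, 3)
mean = integrate(psi_2_1, -1.0, 5.0)                          # close to 0
norm_sq = integrate(lambda t: psi_2_1(t) ** 2, -1.0, 5.0)     # close to 1
```

The factor $1/\sqrt{s}$ is what keeps the length at 1 after stretching the support by s.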
function or vectors
- Two functions:
  1. Wavelet function
  2. Scaling function
- Space spanned by $n = 2^j$ basis vectors at level $j$ can be spanned by two sets of basis vectors $\psi$ and $\phi$ at level $j - 1$
- $\psi$ and $\phi$ are wavelet and scaling functions respectively
- DWT generates basis vectors for wavelet and scaling functions at different levels
[Figure: DWT cascade from $s_j$ through $s_{j-1}, d_{j-1}$, $s_{j-2}, d_{j-2}$, ... down to $s_1, d_1$ and $s_0, d_0$]

- Only $\psi_{0,0}$ is needed since others are expressed in terms of $\phi_{\cdot,\cdot}$
- When size of vector is $n = 2^j$, $j$ levels of basis vectors are needed
- There are $2^{j-1} + 2^{j-2} + \dots + 2^0 = 2^j - 1$ basis vectors corresponding to detail coefficients and 1 basis vector corresponding to sum coefficient at 0th level
- Transformation matrix $H$ has these basis vectors as columns
- Transformed vector $v' = vH$ for data (row) vector $v$
- Each step can be defined as multiplication by a matrix
- Composition of these matrices gives the final transformation matrix
- $H$ is orthonormal: $H^{-1} = H^T$
- Hence, inverse operation is easy

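The matrix view can be sketched as follows, assuming the orthonormal Haar basis (the slide text here does not name the wavelet) and n = 4. The function names and the data vector are hypothetical. The columns of `H` are the one sum vector followed by the 2^j − 1 detail vectors, and multiplying by the transpose undoes the transform.

```python
import math

def haar_matrix(n):
    """Return H as a list of rows; its columns are orthonormal Haar basis vectors."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of 2"
    columns = [[1 / math.sqrt(n)] * n]   # sum (scaling) vector at level 0
    width = n                             # support of the detail vectors at this level
    while width >= 2:
        half = width // 2
        for start in range(0, n, width):
            col = [0.0] * n
            for i in range(start, start + half):
                col[i] = 1 / math.sqrt(width)
            for i in range(start + half, start + width):
                col[i] = -1 / math.sqrt(width)
            columns.append(col)
        width = half
    # Turn the list of columns into a list of rows.
    return [[columns[c][r] for c in range(n)] for r in range(n)]

def row_times_matrix(v, M):
    """Row vector times matrix: (vM)_c = sum_r v_r * M[r][c]."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

def transpose(M):
    return [list(row) for row in zip(*M)]

H = haar_matrix(4)
v = [2.0, 5.0, 8.0, 9.0]                          # hypothetical data (row) vector
v_prime = row_times_matrix(v, H)                  # v' = vH
v_back = row_times_matrix(v_prime, transpose(H))  # v' H^T recovers v
```

For this input the first component of `v_prime` is the scaled sum (2 + 5 + 8 + 9)/2 and the second is the coarsest detail (2 + 5 − 8 − 9)/2.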
detail coefficients
- For example, retain 1 sum (at level 0) and 3 detail (1 at level 0 and 2 at level 1) coefficients
- Contractive mapping, i.e., lengths get reduced
- Alternatively, for a database of objects, retain those coefficients whose variances are highest
- Another option is to zero out coefficients whose absolute values are below a threshold
- What happens when dimensionality is not a power of 2?
  - Pad zeros at end: latter half becomes less important
  - Pad equal amount of zeros in each half: should be done recursively
- Interestingly, shuffling the dimensions produces different wavelet coefficients
  - How to shuffle to satisfy some criteria?

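Two of the options above, padding zeros at the end and zeroing out small coefficients, can be sketched together. This assumes an orthonormal Haar transform computed by repeated pairwise sums and differences; the function names, the data vector and the threshold value are all hypothetical.

```python
import math

def pad_to_power_of_two(v):
    """Pad zeros at the end until the length is a power of 2."""
    n = 1
    while n < len(v):
        n *= 2
    return list(v) + [0.0] * (n - len(v))

def haar_dwt(v):
    """Orthonormal Haar DWT; output order is [sum, level-0 detail, level-1 details, ...]."""
    sums = list(v)
    details = []
    while len(sums) > 1:
        pairs = [(sums[i], sums[i + 1]) for i in range(0, len(sums), 2)]
        level_details = [(a - b) / math.sqrt(2) for a, b in pairs]
        sums = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return sums + details

def hard_threshold(coeffs, threshold):
    """Zero out coefficients whose absolute value is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

data = [2.0, 5.0, 8.0, 9.0, 7.0]        # length 5 is not a power of 2
padded = pad_to_power_of_two(data)      # length 8, last three entries are 0
coeffs = haar_dwt(padded)
reduced = hard_threshold(coeffs, threshold=2.0)
```

The full transform keeps the length of the vector, while the thresholded version can only be shorter or equal, which matches the contractive property.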
from quantitative data to qualitative data
- Reduces number of possible values of an attribute
- May be encoded in a more compact form
- If discretization uses information about groups, then it is called supervised discretization
  - Otherwise unsupervised discretization
- Top-down discretization: repeatedly finds split points or cut points to divide the data
- Bottom-up discretization: considers all continuous values as split points and then repeatedly merges
- Discretization often leads to conceptualization where each distinct group represents a particular concept

multiple levels
- Concept hierarchy captures more basic concepts towards the top and finer details towards the leaves
- Example
  - Higher level: brilliant, ordinary, deficient
  - Middle level: CPI ≥ 9, 8−9, etc.
  - Lower level: actual CPI
- Can be used to represent knowledge hierarchies as well
  - Ontology
- Example
  - WordNet: ontology for English words
  - Gene Ontology: three separate ontologies that capture different attributes of a gene

the data
- Entropy encodes the number of bits required to describe a distribution
- It is also called the information content
$$\text{entropy}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
- When is entropy minimum?
  - When one $p_i$ is 1 (others are 0)
  - Least random
  - Entropy = 0
- When is entropy maximum?
  - When every $p_i$ is $1/n$
  - Most random
  - Entropy = $\log_2 n$
  - For $n = 2$, it is 1

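The formula and its two extreme cases can be checked with a few lines. This is a minimal sketch; the function name and the sample distributions are illustrative, and terms with $p_i = 0$ are taken to contribute 0 by the usual convention.

```python
import math

def entropy(probabilities):
    """Entropy in bits of a discrete distribution; p = 0 terms contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

certain = entropy([1.0, 0.0, 0.0])   # one p_i is 1: least random
uniform2 = entropy([0.5, 0.5])       # every p_i is 1/n with n = 2
uniform4 = entropy([0.25] * 4)       # every p_i is 1/n with n = 4
```

The uniform cases give $\log_2 n$, i.e. 1 bit for n = 2 and 2 bits for n = 4, and the certain case gives 0.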
$D_1, D_2, \dots, D_n$ for the dataset $D$
- If $D_i$ has instances from $m$ classes $C_1, C_2, \dots, C_m$, then entropy of partition $D_i$ is
$$\text{entropy}(D_i) = -\sum_{j=1}^{m} p_{ji} \log_2 p_{ji}$$
- where $p_{ji}$ is probability of class $C_j$ in partition $D_i$
$$p_{ji} = \frac{|C_j \in D_i|}{|D_i|}$$

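In practice $p_{ji}$ is estimated by counting class labels inside the partition. A small sketch, with a hypothetical function name and made-up labels:

```python
import math
from collections import Counter

def partition_entropy(labels):
    """Entropy of one partition D_i, with p_ji = (count of class C_j in D_i) / |D_i|."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

partition = ["yes", "yes", "yes", "no"]    # hypothetical class labels in D_i
h = partition_entropy(partition)           # p = 3/4 and 1/4
```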
$\dots, s_{n-1}$ such that $\forall v \in D_i,\; s_{i-1} < v \le s_i$
- Implicitly, $s_0$ is minimum and $s_n$ is maximum
- How to choose these partition values?
- Use the concept of information gain or expected information requirement

expected information requirement is defined as
$$\text{info}(D) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, \text{entropy}(D_i)$$
- The information gain obtained by this partitioning is defined as
$$\text{gain}(D) = \text{entropy}(D) - \text{info}(D)$$
- Choose partition values $s_1, s_2, \dots, s_{n-1}$ such that information gain is maximized (equivalently, expected information requirement is minimized)
- Keep on partitioning into two parts
- Stopping criterion
  - Number of categories greater than a threshold
  - Expected information requirement is below a threshold

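The two definitions translate directly into code. The sketch below uses hypothetical function names and toy labels, and compares a split that separates the classes perfectly with one that does not separate them at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the class distribution in a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info(partitions):
    """Expected information requirement: size-weighted entropy of the partitions."""
    size = sum(len(p) for p in partitions)
    return sum(len(p) / size * entropy(p) for p in partitions)

def gain(dataset, partitions):
    """Information gain of a partitioning of the dataset."""
    return entropy(dataset) - info(partitions)

dataset = ["a", "a", "a", "a", "b", "b", "b", "b"]          # hypothetical labels
clean_split = [["a", "a", "a", "a"], ["b", "b", "b", "b"]]  # classes fully separated
mixed_split = [["a", "a", "b", "b"], ["a", "a", "b", "b"]]  # no separation

clean_gain = gain(dataset, clean_split)   # entropy drops from 1 bit to 0
mixed_gain = gain(dataset, mixed_split)   # entropy stays at 1 bit
```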
31
- What is the optimal splitting point?
- Exhaustive algorithm
  - Tests all possible $n - 1$ partitions

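A sketch of the exhaustive search for a single binary split follows. It is not the slide's exact procedure: the function names and the data are hypothetical, and placing the cut at the midpoint between two consecutive sorted values is an assumption made here for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Try every boundary between consecutive sorted values; return (cut, info) with minimum info."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):                  # the n - 1 candidate boundaries
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # equal values cannot be separated
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        expected = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # assumed: midpoint as split point
        if best is None or expected < best[1]:
            best = (cut, expected)
    return best

values = [1, 2, 3, 10, 11, 12]             # hypothetical attribute values
labels = ["a", "a", "a", "b", "b", "b"]    # hypothetical class labels
cut, expected_info = best_split(values, labels)
```

Each candidate costs one pass over the labels in this naive form, so the search is quadratic in n; incremental class counts would reduce that.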
chi-square test
- Can test for goodness of fit and independence
- Uses chi-square statistic and chi-square distribution
- The value of the chi-square statistic is
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
- where $O_i$ is the observed frequency, $E_i$ is the expected frequency and $n$ is the number of categories over which frequencies are observed
- Chi-square statistic asymptotically approaches the chi-square distribution
- Chi-square distribution is characterized by a single parameter: degrees of freedom
- Here, degrees of freedom is $n - 1$

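As a worked instance of the statistic, the sketch below tests hypothetical counts over n = 4 categories against a uniform expectation. The function name and the data are made up; the critical value 7.815 is the standard table entry for 3 degrees of freedom at the 0.05 level.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]     # hypothetical observed frequencies
expected = [25, 25, 25, 25]     # uniform expectation over the same 100 objects
statistic = chi_square(observed, expected)

degrees_of_freedom = len(observed) - 1   # n - 1 = 3
critical_value = 7.815                   # chi-square table, df = 3, significance 0.05
fits = statistic < critical_value        # False here: the fit is rejected
```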
- The value of the chi-square statistic is
$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
- Here, degrees of freedom is $(2 - 1) \times (k - 1)$
- The value of the statistic is compared against the chi-square distribution with $df = k - 1$
- Choose a significance level, say, 0.05
- If the statistic obtained is less than the theoretical (critical) value at that level, then conclude that the two distributions are independent at the chosen level of significance

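For a 2 × k table the expected frequencies are usually taken as (row total × column total) / grand total; that formula is the standard one for contingency tables and is an assumption here, since the slide does not spell it out. The function name and the tables are hypothetical.

```python
def chi_square_2xk(table):
    """Chi-square statistic of a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:              # skip classes absent from both rows
                statistic += (observed - expected) ** 2 / expected
    return statistic

same_distribution = [[10, 10], [10, 10]]   # two intervals, k = 2 classes
different_distribution = [[20, 0], [0, 20]]

low = chi_square_2xk(same_distribution)        # identical class proportions
high = chi_square_2xk(different_distribution)  # completely different proportions
```

With k = 2 the degrees of freedom are 1, for which the 0.05 critical value is 3.841; the first table falls below it and the second far above.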
are independent, then they can be merged
- Otherwise, the difference in frequencies is statistically significant, and they should not be merged
- Keep on merging intervals with lowest chi-square value
- Stopping criterion
  - When lowest chi-square value is greater than threshold chosen at a particular level of significance
  - Number of categories drops to a threshold

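The merging loop can be sketched as below. This is an illustrative reading of the procedure, not a reference implementation: intervals are represented only by their per-class counts, the function names and the counts are hypothetical, and `min_intervals` is an assumed parameter for the second stopping criterion.

```python
def chi_square_pair(a, b):
    """Chi-square statistic for the 2 x k table formed by two adjacent intervals."""
    row_totals = [sum(a), sum(b)]
    col_totals = [x + y for x, y in zip(a, b)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate((a, b)):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic

def chi_merge(intervals, threshold, min_intervals=1):
    """Repeatedly merge the adjacent pair with the lowest chi-square value."""
    intervals = [list(counts) for counts in intervals]
    while len(intervals) > min_intervals:
        scores = [chi_square_pair(intervals[i], intervals[i + 1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break                                    # remaining differences are significant
        merged = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
        intervals[i:i + 2] = [merged]
    return intervals

# Hypothetical class counts (class A, class B) for four adjacent intervals.
counts = [[5, 0], [4, 1], [0, 5], [1, 4]]
result = chi_merge(counts, threshold=3.841)          # df = 1, significance 0.05
```

On this input the first two and the last two intervals are merged, and the loop stops because the remaining pair has a chi-square value of 12.8.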
read intervals
- For example, (500, 600) is a better interval than (512.23, 609.87)
- Idea is to break up into "natural" partitions
- Uses the 3-4-5 rule
  - Hierarchical partitioning
  - At each level, examine number of distinct values of the most significant digit for each interval
    - If it is 3, 6, 7 or 9 (3n), partition into 3 equi-width intervals
      - May be 2-3-2 for 7 partitions
    - If it is 2, 4 or 8 (2n), partition into 4 equi-width intervals
    - If it is 1, 5 or 10 (5n), partition into 5 equi-width intervals
  - Recursively applied at each level
- To avoid outliers, it is applied for data that represents the majority, i.e., 5th percentile to 95th percentile

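One level of the rule can be sketched as a function of an interval and the unit of its most significant digit. The function name, its signature and the way distinct values are counted, as (high − low) / unit, are assumptions made for illustration.

```python
def three_four_five(low, high, unit):
    """Split (low, high) by the 3-4-5 rule; unit is the place value of the most significant digit."""
    distinct = round((high - low) / unit)      # distinct values of the most significant digit
    if distinct == 7:
        widths = [2 * unit, 3 * unit, 2 * unit]    # the 2-3-2 grouping
    elif distinct in (3, 6, 9):
        widths = [(high - low) / 3] * 3
    elif distinct in (2, 4, 8):
        widths = [(high - low) / 4] * 4
    elif distinct in (1, 5, 10):
        widths = [(high - low) / 5] * 5
    else:
        raise ValueError("3-4-5 rule does not cover %d distinct values" % distinct)
    intervals, start = [], low
    for width in widths:
        intervals.append((start, start + width))
        start += width
    return intervals

ten_values = three_four_five(0, 1000, 100)     # 10 distinct values: 5 intervals
seven_values = three_four_five(0, 700, 100)    # 7 distinct values: 2-3-2
```

Applying the same function again to each resulting interval, with a smaller unit, gives the recursive levels of the hierarchy.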
-351, max = 4700
- 5th and 95th percentile are low = -159 and high = 1838 respectively
- Most significant digit is for 1000
- Rounding yields low' = -1000 and high' = 2000
- Number of distinct digits is 3
- 3 equi-width partitions are (-1000, 0), (0, 1000) and (1000, 2000)
- Since low' < min, adjust boundary of first interval to (-400, 0)
- Since max > high', a new interval needs to be formed: (2000, 5000)
- These partitions are broken recursively
  - (-400, 0) into 4 partitions: (-400, -300), (-300, -200), (-200, -100), (-100, 0)
  - (0, 1000) into 5 partitions: (0, 200), (200, 400), (400, 600), (600, 800), (800, 1000)
  - (1000, 2000) into 5 partitions: (1000, 1200), (1200, 1400), (1400, 1600), (1600, 1800), (1800, 2000)
  - (2000, 5000) into 3 partitions: (2000, 3000), (3000, 4000), (4000, 5000)

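The top-level steps of this example can be reproduced with the sketch below. The helper names are hypothetical, and rounding min and max at their own most significant digit is an assumption chosen because it matches the boundaries -400 and 5000 in the example. The sketch does not handle the 2-3-2 case or zero-valued percentiles.

```python
import math

def msd_unit(value):
    """Place value of the most significant digit, e.g. 1838 -> 1000, -351 -> 100."""
    return 10 ** math.floor(math.log10(abs(value)))

def round_down(value, unit):
    return math.floor(value / unit) * unit

def round_up(value, unit):
    return math.ceil(value / unit) * unit

def parts_for(distinct):
    """Number of equi-width intervals prescribed by the 3-4-5 rule."""
    if distinct in (3, 6, 7, 9):
        return 3
    if distinct in (2, 4, 8):
        return 4
    if distinct in (1, 5, 10):
        return 5
    raise ValueError("3-4-5 rule does not cover %d distinct values" % distinct)

def top_level_intervals(minimum, maximum, low, high):
    unit = msd_unit(max(abs(low), abs(high)))
    low_r, high_r = round_down(low, unit), round_up(high, unit)   # low', high'
    parts = parts_for(round((high_r - low_r) / unit))
    width = (high_r - low_r) // parts
    intervals = [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]
    if low_r < minimum:      # tighten the first interval to the rounded minimum
        intervals[0] = (round_down(minimum, msd_unit(minimum)), intervals[0][1])
    if maximum > high_r:     # add an interval to cover values above high'
        intervals.append((high_r, round_up(maximum, msd_unit(maximum))))
    return intervals

top = top_level_intervals(minimum=-351, maximum=4700, low=-159, high=1838)
```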
is generated
- Can predict missing values
- There is a set of $k$ predictors, denoted by $X$
- There are $n$ observations
- The general form of regression model is
$$Y = f(X) + \varepsilon$$
- The function $f$ can be chosen using domain knowledge
- For linear regression, $f$ is linear
- $\varepsilon$ encodes the error terms associated with each observation

- The response variable depends linearly on the predictor variables
- The general form of linear regression model is
$$Y = XW + \varepsilon$$
- $W$ are the regression coefficients or weights on the predictors
- Sizes of matrices
  - $Y$: $n \times 1$
  - $X$: $n \times k$
  - $W$: $k \times 1$
  - $\varepsilon$: $n \times 1$

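As a concrete case, the sketch below fits the model with k = 2, where X has a column of ones (intercept) and one predictor. Estimating W by ordinary least squares is an assumption here, since the slide does not say how W is obtained; the function name and the data are hypothetical. The fitted weights are then used to fill in a missing response.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimate of W = (intercept, slope) for y = intercept + slope * x + error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]        # hypothetical predictor values (n = 4 observations)
ys = [3.0, 5.0, 7.0, 9.0]        # hypothetical responses, exactly y = 1 + 2x
intercept, slope = fit_simple_linear(xs, ys)

missing_x = 5.0
predicted_y = intercept + slope * missing_x   # estimate for the missing response
```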
data sources into one single coherent source
- Useful when there are multiple databases about the same set of objects
- Schema matching and entity identification
  - Is cust_id equal to cust_number?
- Correlation analysis to reduce redundancy
  - Chi-square test for categorical data
- De-duplication

to correctly get statistics
- Applying particular data mining algorithms
- Smoothing of bins using histograms
- Aggregation and summarization
- Generalization
- Normalization

$$x' = \frac{x - \min}{\max - \min}$$
- This puts range to [0, 1]
- If new range is (min', max')
$$x' = \frac{x - \min}{\max - \min} (\max' - \min') + \min'$$
- Z-score normalization
$$x' = \frac{x - \mu}{\sigma}$$
- This puts range to $(-\infty, +\infty)$
- Also called standard score because it is the value for the standard normal distribution $N(0, 1)$

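Both normalizations are one-liners once min, max, μ and σ are known. In the sketch below the function names and the data are hypothetical, and σ is taken as the population standard deviation, which is an assumption since the slide does not specify it.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization using the mean and the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

values = [10.0, 20.0, 30.0, 40.0, 50.0]    # hypothetical attribute values
unit_range = min_max(values)               # smallest value maps to 0, largest to 1
symmetric = min_max(values, -1.0, 1.0)     # new range (min', max') = (-1, 1)
standard = z_score(values)                 # mean 0, standard deviation 1
```

Min-max normalization is sensitive to outliers because a single extreme value fixes min or max, whereas the z-score is unbounded but keeps relative distances in units of σ.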