
# CS685 Data Preprocessing 1

August 07, 2012

## Transcript

1. ### CS685: Data Mining

Data Preprocessing

Arnab Bhattacharya (arnabb@cse.iitk.ac.in), Computer Science and Engineering, Indian Institute of Technology, Kanpur

http://web.cse.iitk.ac.in/~cs685/

1st semester, 2012-13. Tue, Wed, Fri 0900-1000 at CS101

Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13

2. ### Outline

- 1 Data
- 2 Data preprocessing
- 3 Data cleaning
- 4 Data reduction
  - Dimensionality reduction
  - Numerosity reduction

4. ### Categorical data

Categorical data is qualitative.

- Nominal
  - Categories
  - Example: color
  - Operations: equal, not equal
- Binary
  - Special case of nominal
  - Example: gender, diabetic
  - Symmetric: the two cases are equally important
  - Asymmetric: one case is more important
- Ordinal (rank, or ordered scalar)
  - Values can be ordered
  - Example: small, medium, large
  - Operations: equality, lesser, greater
  - Difference has no meaning

5. ### Numeric data

Numeric data is quantitative.

- Interval-scaled
  - Measured on equal-sized units
  - Example: temperature in Celsius, date
  - Operations: difference
  - No zero point: absolute value has no meaning
- Ratio-scaled
  - Has a zero point: values can be expressed as ratios of each other
  - Example: temperature in Kelvin, age, mass, length
  - Operations: difference, ratio

6. ### Types of data

Data values can also be classified as discrete or continuous.

- Discrete
  - Finite or countably infinite set of values
  - Countably infinite sets have a one-to-one correspondence with the set of natural numbers
- Continuous
  - Real numbers
  - Precision of measurement and machine representation limit the possible values
  - Not continuous in the actual sense

7. ### Data quality

Data should have the following qualities:

- Accuracy
- Completeness
- Consistency
- Timeliness
- Reliability
- Interpretability
- Availability

11. ### Data quality errors and parameters

- Errors in data are due to
  - Measurement error
  - Data collection error
  - Noise: probabilistic
  - Artifact: deterministic distortions
- Parameters to measure the quality of measurements
  - Precision: closeness of repeated measurements
  - Bias: systematic variation of measurements
  - Accuracy: closeness of measurements to the true value
- Data problems
  - Missing values
  - Noise
  - Outliers
  - Inconsistent values
  - Duplicate objects
- Domain knowledge about data and attributes helps data mining

13. ### Data preprocessing

- Data preprocessing is the process of preparing the data to be fit for data mining algorithms and methods
- It may involve one or more of the following steps
  - Data cleaning
  - Data reduction
  - Data integration
  - Data transformation

15. ### Data cleaning

Process of handling errors in data. Five major ways:

- Filling in missing values
- Handling noise
- Removing outliers
  - One of the main methods in handling noise
- Resolving inconsistent data
  - Example: values out of range
  - Once identified as inconsistent, the data is handled as a missing value
- De-duplicating duplicated objects

17. ### Missing values

- Ignore the data object
- Ignore only the missing attribute during analysis
- Estimate the missing value
  - Use a measure of overall central tendency
    - Mean or median
  - Use a measure of central tendency from only the neighborhood
    - Interpolation
    - Useful for temporal and spatial data
  - Use the most probable value
    - Mode

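
The estimation strategies for missing values can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names `fill_overall` and `fill_interpolate` and the `readings` series are hypothetical, missing entries are assumed to be marked with `None`, and the series is assumed to be evenly spaced.

```python
from statistics import mean, median, mode

def fill_overall(values, how="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    centre = {"mean": mean, "median": median, "mode": mode}[how](observed)
    return [centre if v is None else v for v in values]

def fill_interpolate(values):
    """Linearly interpolate interior gaps in an evenly spaced series.

    Leading or trailing gaps have no neighbour on one side and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

readings = [10.0, None, 14.0, 14.0, None, None, 20.0]
print(fill_overall(readings, "mean"))   # gaps become 14.5
print(fill_overall(readings, "mode"))   # gaps become 14.0
print(fill_interpolate(readings))       # [10.0, 12.0, 14.0, 14.0, 16.0, 18.0, 20.0]
```

Interpolation uses only the neighbouring values, which is why it suits temporal and spatial data, while the overall mean, median, or mode ignores position altogether.
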
20. ### Noise

- Random perturbations in the data
- It is generally assumed that the magnitude of noise is smaller than the magnitude of the attribute of interest
  - Signal-to-noise ratio should not be too low
- White noise
  - Gaussian distribution with zero mean
- Histogram binning
  - Bin values are replaced by the bin mean or median
  - Equi-width histograms are more common than equi-depth
- Regression
  - Fitting a function to describe the values
  - Small values of noise do not affect the overall fit
  - Noisy value is replaced by the most likely value predicted by the function
- Outlier identification and removal
- As opposed to noise, bias can be corrected since it is deterministic

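
Smoothing by histogram binning can be sketched as below. The helper names `smooth_equi_width` and `smooth_equi_depth` and the `prices` data are assumptions made for illustration; the slides do not prescribe an implementation, and the bin mean is used here although the median would work the same way.

```python
from statistics import mean

def smooth_equi_width(values, n_bins):
    """Replace each value by the mean of its equi-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # avoid zero width for constant data
    # Bin index of each value; the maximum is placed in the last bin.
    idx = [min(int((v - lo) / width), n_bins - 1) for v in values]
    bin_mean = {b: mean(v for v, i in zip(values, idx) if i == b) for b in set(idx)}
    return [bin_mean[i] for i in idx]

def smooth_equi_depth(values, n_bins):
    """Replace each value by the mean of its equi-depth (equal-count) bin."""
    order = sorted(range(len(values)), key=values.__getitem__)
    smoothed = [0.0] * len(values)
    for b in range(n_bins):
        chunk = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if chunk:
            m = mean(values[i] for i in chunk)
            for i in chunk:
                smoothed[i] = m
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_equi_width(prices, 3))
print(smooth_equi_depth(prices, 3))
```

With three bins, the equi-depth version groups the sorted values three at a time, whereas the equi-width version splits the range 4 to 34 into intervals of width 10, so the bins hold different numbers of values.
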
21. ### Outliers

- Outliers are data objects that are generated from a perceivably different process
  - Values are considerably unusual
  - Also called anomalies
- It is not straightforward to identify outliers
  - Unusual values may be the ones that are of interest
  - Statistical methods and tests are mostly used

22. ### Data duplication

- Duplicate objects may appear during data insertion or data transfer
- Duplicates introduce errors in statistics about the data
- If most attributes are exact copies, duplicates are easy to remove
- Sometimes one or more attributes are slightly different
  - Domain knowledge needs to be utilized to identify such cases

24. ### Data reduction

- Benefits of data reduction
  - More understandable model
    - Fewer rules
    - Less complex rules, i.e., involving fewer attributes
  - Faster algorithms
  - Easier visualization
- Important ways of data reduction
  - 1 Dimensionality reduction
  - 2 Numerosity reduction
  - 3 Data discretization
  - 4 Data modeling

25. ### Dimensionality reduction

- Dimensionality reduction reduces the number of dimensions
  - New dimensions are generally different from the original ones
- Curse of dimensionality
  - Data becomes too sparse as dimensions increase
  - Classification: not enough data to create good models or methods
  - Clustering: density becomes irrelevant and distances between points become similar

26. ### Singular value decomposition (SVD)

- Factorization of a matrix: $A = U \Sigma V^T$
- If $A$ is of size $m \times n$, then $U$ is $m \times m$, $V$ is $n \times n$ and $\Sigma$ is an $m \times n$ matrix
- Columns of $U$ are eigenvectors of $A A^T$
  - Left singular vectors
  - $U U^T = I_m$ (orthonormal)
- Columns of $V$ are eigenvectors of $A^T A$
  - Right singular vectors
  - $V^T V = I_n$ (orthonormal)
- $\sigma_{ii}$ are the singular values
  - $\Sigma$ is diagonal
  - Singular values are the positive square roots of the eigenvalues of $A A^T$ or $A^T A$
  - $\sigma_{11} \geq \sigma_{22} \geq \cdots \geq \sigma_{nn}$ (assuming $n$ singular values)

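
The properties listed for the SVD can be checked numerically. The sketch below assumes NumPy is available and uses `numpy.linalg.svd` on the 4 × 2 matrix from the deck's worked example; the variable names are illustrative only.

```python
import numpy as np

A = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])   # m = 4, n = 2

# full_matrices=True gives U (m x m) and V^T (n x n); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.round(s, 2))                         # singular values, largest first
print(np.allclose(U @ Sigma @ Vt, A))         # A = U Sigma V^T
print(np.allclose(U @ U.T, np.eye(4)))        # U U^T = I_m
print(np.allclose(Vt @ Vt.T, np.eye(2)))      # V^T V = I_n

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigvalsh returns ascending order
print(np.allclose(np.sqrt(eigvals), s))
```
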
27. ### SVD of real symmetric matrix

- $A$ is real symmetric of size $n \times n$
  - $A = A^T$
- $U = V$ since $A^T A = A A^T = A^2$
- $A = Q \Sigma Q^T$
  - $Q$ is of size $n \times n$ and contains eigenvectors of $A^2$
  - This is called the spectral decomposition of $A$
  - $\Sigma$ contains $n$ singular values
- Eigenvectors of $A$ = eigenvectors of $A^2$
- Eigenvalues of $A$ = square roots of eigenvalues of $A^2$
- Eigenvalues of $A$ = singular values of $A$
  - This holds exactly when $A$ has no negative eigenvalues; in general, the singular values are the absolute values of the eigenvalues

30. ### Transformation using SVD

- Transformed data: $T = AV = U\Sigma$
- $V$ is called the SVD transform matrix
- Essentially, $T$ is just a rotation of $A$
- Dimensionality of $T$ is $n$
  - $n$ basis vectors, different from those of the original space
  - Columns of $V$ give the basis vectors in the rotated space
- $V$ shows how each dimension can be represented as a linear combination of other dimensions
  - Columns are input basis vectors
- $U$ shows how each object can be represented as a linear combination of other objects
  - Columns are output basis vectors
- Lengths of vectors are preserved: $\|a_i\|_2 = \|t_i\|_2$

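
A short numerical check of the transformation, assuming NumPy and the same 4 × 2 example matrix used elsewhere in the deck. The sign of each singular vector is arbitrary, so a library may return columns of $V$ (and hence of $T$) negated relative to hand-computed values.

```python
import numpy as np

A = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact form: U is m x n
V = Vt.T

T = A @ V                                          # transformed data
print(np.allclose(T, U * s))                       # T = A V = U Sigma

# A rotation preserves the L2 length of every object (row).
print(np.allclose(np.linalg.norm(A, axis=1), np.linalg.norm(T, axis=1)))
```
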
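31. ### Example

$$
\underbrace{\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix} -0.80 & 0.22 & 0.05 & 0.54 \\ -0.56 & -0.20 & -0.34 & -0.71 \\ -0.16 & -0.31 & 0.90 & -0.21 \\ -0.01 & -0.89 & -0.22 & 0.37 \end{bmatrix}}_{U}
\times
\underbrace{\begin{bmatrix} 5.54 & 0 \\ 0 & 1.24 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}}_{\Sigma}
\times
\underbrace{\begin{bmatrix} -0.39 & -0.92 \\ 0.92 & -0.39 \end{bmatrix}}_{V^T}
$$

$$
T = AV = U\Sigma =
\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
\times
\begin{bmatrix} -0.39 & 0.92 \\ -0.92 & -0.39 \end{bmatrix}
=
\begin{bmatrix} -4.46 & 0.27 \\ -3.15 & -0.25 \\ -0.92 & -0.39 \\ -0.06 & -1.12 \end{bmatrix}
$$
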
32. ### Example: compact form

$$
\underbrace{\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix} -0.80 & 0.22 \\ -0.56 & -0.20 \\ -0.16 & -0.31 \\ -0.01 & -0.89 \end{bmatrix}}_{U}
\times
\underbrace{\begin{bmatrix} 5.54 & 0 \\ 0 & 1.24 \end{bmatrix}}_{\Sigma}
\times
\underbrace{\begin{bmatrix} -0.39 & -0.92 \\ 0.92 & -0.39 \end{bmatrix}}_{V^T}
$$

- If $A$ is of size $m \times n$, then $U$ is $m \times n$, $V$ is $n \times n$ and $\Sigma$ is an $n \times n$ matrix
- Works because there are at most $n$ non-zero singular values in $\Sigma$

33. ### Dimensionality reduction using SVD

- $A = U \Sigma V^T = \sum_{i=1}^{n} (u_i \sigma_{ii} v_i^T)$
- Use only $k$ dimensions
  - Retain the first $k$ columns of $U$ and $V$ and the first $k$ values of $\Sigma$
  - First $k$ columns of $V$ give the basis vectors in the reduced space
- Best rank-$k$ approximation in terms of sum squared error
  - $A \approx \sum_{i=1}^{k} (u_i \sigma_{ii} v_i^T) = U_{1 \ldots k} \Sigma_{1 \ldots k} V_{1 \ldots k}^T$
  - $T \approx A V_{1 \ldots k}$

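34. ### Example of dimensionality reduction (k = 1)

$$
A \approx A_k =
\underbrace{\begin{bmatrix} -0.80 & 0 & 0 & 0 \\ -0.56 & 0 & 0 & 0 \\ -0.16 & 0 & 0 & 0 \\ -0.01 & 0 & 0 & 0 \end{bmatrix}}_{U_k}
\times
\underbrace{\begin{bmatrix} 5.54 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}}_{\Sigma_k}
\times
\underbrace{\begin{bmatrix} -0.39 & -0.92 \\ 0 & 0 \end{bmatrix}}_{V_k^T}
=
\begin{bmatrix} 1.74 & 4.10 \\ 1.23 & 2.90 \\ 0.35 & 0.84 \\ 0.02 & 0.06 \end{bmatrix}
$$

$$
T \approx T_k = A V_k = U_k \Sigma_k =
\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
\times
\begin{bmatrix} -0.39 & 0 \\ -0.92 & 0 \end{bmatrix}
=
\begin{bmatrix} -4.46 & 0 \\ -3.15 & 0 \\ -0.92 & 0 \\ -0.06 & 0 \end{bmatrix}
$$
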
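35. ### Compact way of dimensionality reduction (k = 1)

$$
A \approx A_k =
\underbrace{\begin{bmatrix} -0.80 \\ -0.56 \\ -0.16 \\ -0.01 \end{bmatrix}}_{U_k}
\times
\underbrace{\begin{bmatrix} 5.54 \end{bmatrix}}_{\Sigma_k}
\times
\underbrace{\begin{bmatrix} -0.39 & -0.92 \end{bmatrix}}_{V_k^T}
=
\begin{bmatrix} 1.74 & 4.10 \\ 1.23 & 2.90 \\ 0.35 & 0.84 \\ 0.02 & 0.06 \end{bmatrix}
$$

$$
T \approx T_k = A V_k = U_k \Sigma_k =
\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
\times
\begin{bmatrix} -0.39 \\ -0.92 \end{bmatrix}
=
\begin{bmatrix} -4.46 \\ -3.15 \\ -0.92 \\ -0.06 \end{bmatrix}
$$
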
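
The rank-$k$ truncation can be reproduced with a few lines of NumPy. This is a sketch under the assumption that NumPy is available; `truncate_svd` is a hypothetical helper name, and the printed values may differ from the slide in the last digit because the slide appears to truncate rather than round.

```python
import numpy as np

def truncate_svd(A, k):
    """Return the rank-k approximation A_k and the reduced data T_k = A V_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    T_k = A @ Vt[:k, :].T
    return A_k, T_k

A = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])
A_1, T_1 = truncate_svd(A, k=1)
print(np.round(A_1, 2))
print(np.round(np.abs(T_1), 2))      # the sign of a singular vector is arbitrary

# The sum squared error equals the sum of the squared dropped singular values.
print(np.sum((A - A_1) ** 2))
```
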
40. ### How many dimensions to retain?

- There is no easy answer
- Concept of energy of a dataset
  - Total energy is the sum of squares of singular values (aka spread or variance): $E = \sum_{i=1}^{n} \sigma_{ii}^2$
  - Retain $k$ dimensions such that $p\%$ of the energy is retained: $E_k = \sum_{i=1}^{k} \sigma_{ii}^2$, with $E_k / E \geq p$
  - Generally, $p$ is between 80% and 95%
- In the above example, $k = 1$ retains 95.22% of the energy
- Running time: $O(mnr)$ for $A$ of size $m \times n$ and rank $r$

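
The energy criterion translates directly into code. The sketch below assumes NumPy; `dims_to_retain` is a hypothetical helper, and `p` is given as a fraction (0.90) rather than a percentage.

```python
import numpy as np

def dims_to_retain(singular_values, p):
    """Return (k, retained): the smallest k with E_k / E >= p, and all ratios E_k / E."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    retained = np.cumsum(energy) / energy.sum()
    # First index whose cumulative ratio reaches p, converted to a count.
    # The tiny tolerance guards against rounding when p is 1.0.
    k = int(np.searchsorted(retained, p - 1e-12) + 1)
    return k, retained

A = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])
s = np.linalg.svd(A, compute_uv=False)
k, retained = dims_to_retain(s, p=0.90)
print(k, np.round(retained, 4))
```

For the example matrix the first singular value alone carries roughly 95% of the energy, so any threshold up to that level keeps a single dimension.
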
41. ### Principal component analysis (PCA)

- Way of identifying patterns in data
  - How input basis vectors are correlated for the given data
- A transformation from a set of (possibly correlated) axes to another set of uncorrelated axes
  - Orthogonal linear transformation (i.e., rotation)
  - New axes are the principal components
- First principal component produces projections that are best in the squared error sense
  - Optimal least squares solution

43. ### Algorithm

- Mean center the data (optional)
- Compute the covariance matrix of the dimensions
- Find eigenvectors of the covariance matrix
- Sort eigenvectors in decreasing order of eigenvalues
- Project onto eigenvectors in order

In detail:

- Assume the data matrix is $B$ of size $m \times n$
- For each dimension $i$, compute the mean $\mu_i$
- Mean center $B$ by subtracting $\mu_i$ from each column $i$ to get $A$
- Compute the covariance matrix $C$ of size $n \times n$
  - If mean centered, $C = A^T A$ (up to a constant factor such as $1/m$, which does not change the eigenvectors)
- Find eigenvectors and corresponding eigenvalues $(V, E)$ of $C$
- Sort eigenvalues such that $e_1 \geq e_2 \geq \cdots \geq e_n$
- Project step-by-step onto the principal components $v_1, v_2, \ldots$

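44. ### Example

$$
B = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix};
\quad
\mu(B) = \begin{bmatrix} 0.500 & 2.125 \end{bmatrix}
$$

$$
\therefore A = \begin{bmatrix} 1.5 & 1.875 \\ 0.5 & 0.875 \\ -0.5 & -1.125 \\ -1.5 & -1.625 \end{bmatrix}
\quad \text{and} \quad
C = A^T A = \begin{bmatrix} 5.000 & 6.250 \\ 6.250 & 8.187 \end{bmatrix}
$$

$$
\text{Eigenvectors } V = \begin{bmatrix} 0.613 & -0.789 \\ 0.789 & 0.613 \end{bmatrix};
\quad
\text{eigenvalues } E = \begin{bmatrix} 13.043 & 0.143 \end{bmatrix}
$$

$$
\text{Transformed data } T = AV = \begin{bmatrix} 2.400 & -0.034 \\ 0.997 & 0.142 \\ -1.195 & -0.295 \\ -2.203 & 0.187 \end{bmatrix}
$$
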
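
The PCA steps can be followed on the same matrix $B$ with NumPy. This is a sketch rather than the author's code: it assumes NumPy, uses `numpy.linalg.eigh` for the symmetric matrix $C$, and keeps $C = A^T A$ unscaled as in the example. Eigenvector signs are arbitrary, so individual columns may come out negated.

```python
import numpy as np

B = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])

mu = B.mean(axis=0)                    # per-dimension means
A = B - mu                             # mean-centred data
C = A.T @ A                            # unscaled covariance (scatter) matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
E, V = eigvals[order], eigvecs[:, order]

T = A @ V                              # projections onto the principal components
print(np.round(E, 3))
print(np.round(E[0] / E.sum(), 4))     # energy retained by the first component

# The SVD of the centred matrix A gives the same directions (up to sign).
_, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(s ** 2, E), np.allclose(np.abs(Vt.T), np.abs(V)))
```
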

52. ### Properties

- Also known as the Karhunen-Loève transform (KLT)
- Works for L2 distances only, as others are not invariant to rotation
- Mean-centering
  - Easier way to compute covariance: $A^T A$ is the covariance matrix
  - Allows use of SVD to compute PCA
- Can be done using SVD
  - Eigenvector matrix $V$ of $C$ is really the SVD transform matrix $V$ for $A$
  - Different from the SVD of $B$ though
- How many dimensions to retain?
  - Based on energy (similar to SVD)
  - Total energy is the sum of eigenvalues $e_i$
  - Retain $k$ dimensions such that 80% to 95% of the energy is retained
- In the above example, $k = 1$ retains 98.91% of the energy
- Running time: $O(mn^2 + n^3)$ for $A$ of size $m \times n$

54. ### Numerosity reduction

- Numerosity reduction reduces the volume of data
  - Reduction in number of data objects
- Data compression
- Modeling
- Discretization

55. ### Aggregation

- Considers a set of data objects having some similar attribute(s)
  - Aggregates some other attribute(s) into single value(s)
  - Example: sum, average
- Benefits of aggregation
  - Aggregate value has less variability
  - Absorbs individual errors

57. ### Data cube

[Figure: data cube with axis labels S1, S2, S3; C1, C2, C3, C4; and A, E, P; visible cell values 90, 75, 80, 95, 45, 60, 60]

- For multi-dimensional datasets, aggregation can happen along different dimensions
- Data cubes are essentially multi-dimensional arrays
  - Each cell, face, or lower-dimensional surface represents a certain projection operation
- Aggregation can also happen at different resolutions in each dimension

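
Treating a data cube as a multi-dimensional array, aggregation along a dimension is a sum over one or more axes. The sketch below assumes NumPy; the cube contents are made-up placeholder numbers, not the values from the figure, and the axis meanings (S, C, and a third attribute) are only loosely modelled on its labels.

```python
import numpy as np

# Hypothetical cube: 3 x 4 x 3 cells indexed by (S, C, third attribute).
cube = np.arange(36).reshape(3, 4, 3)

by_s_and_c = cube.sum(axis=2)        # collapse the third dimension: a 3 x 4 face
by_s = cube.sum(axis=(1, 2))         # collapse two dimensions: one value per S
total = cube.sum()                   # collapse everything: a single cell

print(by_s_and_c.shape, by_s, total)
```
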
62. ### Sampling

- A sample is representative if it has (approximately) the same property of interest as the full dataset
- Sampling approaches
  - Simple random sampling
    - Sampling without replacement (SRSWOR): each object is picked at most once, so a sample as large as the dataset reproduces the population
    - Sampling with replacement (SRSWR): an object can be picked more than once
  - For different types of objects, stratified sampling
    - Picks an equal or representative number of objects from each group
- Sample size
  - Sample should have enough data to capture variability
  - What is the probability of obtaining objects from all $k$ groups in a sample of size $n$?
- Progressive sampling or adaptive sampling
  - Start with a small sample size
  - Keep on increasing it until the sample is acceptable

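
The sampling approaches can be sketched with Python's standard `random` module. All function names here are hypothetical. The last helper answers the sample-size question only under two stated assumptions that the slides do not make explicit: the $k$ groups are equally likely, and draws are made with replacement; it then uses inclusion-exclusion.

```python
import random
from collections import defaultdict
from math import comb

def srswor(data, n, rng):
    """Simple random sample without replacement: each object at most once."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: an object may repeat."""
    return rng.choices(data, k=n)

def stratified(data, group_of, per_group, rng):
    """Pick the same number of objects (without replacement) from each group."""
    groups = defaultdict(list)
    for obj in data:
        groups[group_of(obj)].append(obj)
    return [obj for members in groups.values()
            for obj in rng.sample(members, min(per_group, len(members)))]

def prob_all_groups(k, n):
    """P(all k equally likely groups appear in n draws with replacement)."""
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** n for j in range(k + 1))

rng = random.Random(0)                       # fixed seed only for repeatability
data = list(range(100))
print(srswor(data, 5, rng))
print(srswr(data, 5, rng))
print(stratified(data, lambda x: x % 4, 2, rng))
print(prob_all_groups(k=4, n=10))
```
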
65. ### Histograms

- Method of discretizing the data
  - Mostly useful for one-dimensional data
- Equi-width histograms: bins are equally spaced apart
- Equi-height (equi-depth) histograms: each bin has the same height, i.e., holds the same number of values
- MaxDiff histograms
  - Values are first sorted
  - To get $b$ bins, the largest $b - 1$ differences are made bin boundaries
- V-optimal histograms

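
The three simpler binning schemes can be sketched in plain Python. The function names and the sample `data` are illustrative assumptions; the MaxDiff variant follows the description here literally, cutting at the largest gaps between adjacent sorted values.

```python
def equi_width_edges(values, b):
    """Edges of b bins of equal width between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / b for i in range(b + 1)]

def equi_depth_bins(values, b):
    """b bins holding (nearly) the same number of sorted values each."""
    v = sorted(values)
    return [v[i * len(v) // b:(i + 1) * len(v) // b] for i in range(b)]

def maxdiff_bins(values, b):
    """Cut the sorted values at the b - 1 largest gaps between neighbours."""
    v = sorted(values)
    gaps = sorted(range(1, len(v)), key=lambda i: v[i] - v[i - 1], reverse=True)
    cuts = sorted(gaps[:b - 1])
    return [v[s:e] for s, e in zip([0] + cuts, cuts + [len(v)])]

data = [1, 2, 3, 10, 11, 12, 13, 30, 31]
print(equi_width_edges(data, 3))   # [1.0, 11.0, 21.0, 31.0]
print(equi_depth_bins(data, 3))    # [[1, 2, 3], [10, 11, 12], [13, 30, 31]]
print(maxdiff_bins(data, 3))       # [[1, 2, 3], [10, 11, 12, 13], [30, 31]]
```

On this data the equi-depth bins split the natural cluster 10 to 13, while MaxDiff places its boundaries at the two largest gaps and keeps the clusters intact.
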
67. ### V-optimal histograms

- A dataset with $n$ objects
- Reduce $n$ to $b$ bins where $b \ll n$
- Formally, assume a set $V$ of $n$ (sorted) values $v_1, v_2, \ldots, v_n$ having frequencies $f_1, f_2, \ldots, f_n$ respectively
- Problem is to output another histogram $H$ having $b$ bins, i.e., $b$ non-overlapping intervals on $V$
- Interval $I_i$ is of the form $[l_i, r_i]$ and has a value $h_i$
- If value $v_j \in I_i$, the estimate $e(v_j)$ of $f_j$ is $h_i$
- Error in estimation is the distance $d(f, e)$

68. ### Details

- Histogram value $h_i$ is the average of values in $I_i$, i.e.,

  $$h_i = \mathrm{avg}([l_i, r_i]) = \left( \sum_{k=l_i}^{r_i} f_k \right) / (r_i - l_i + 1)$$

- Error function is sum squared error (SSE) (or L2 error)

  $$\mathrm{SSE}([l, r]) = \sum_{k=l}^{r} \left( f_k - \mathrm{avg}([l, r]) \right)^2$$

71. ### Algorithm

- Assume the error of the optimal partitioning of the first $i$ values with at most $k$ bins is $\mathrm{SSE}^*(i, k)$
- Consider placement of the last bin
  - Choice is any of the $i$ gaps
  - For each such placement at gap $j$, at most $k - 1$ bins have been placed optimally for the first $j$ values
- This leads to the recursion

  $$\mathrm{SSE}^*(i, k) = \min_{1 \leq j \leq i} \left\{ \mathrm{SSE}^*(j, k - 1) + \mathrm{SSE}([l_{j+1}, r_i]) \right\}$$

- Dynamic programming (DP) solution
  - Table of size $n \times b$
  - Start with cell $(1, 1)$ and proceed in a column-scan order
  - Computation for cell $(i, k)$ requires values at cells $(j, k - 1), \forall 1 \leq j \leq i$
- Running time: $O(n^2 b)$

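
The dynamic program can be sketched as follows. This is one possible realization, not the author's code: `v_optimal` is a hypothetical name, indices are 0-based, and the empty-last-bin case of the recursion ($j = i$) is handled through the base case `best[k][0] = 0` so that the inner loop runs over $0 \leq j < i$. Prefix sums are an added convenience that make each SSE lookup constant time, which keeps the total at $O(n^2 b)$.

```python
def v_optimal(freqs, b):
    """Minimum total SSE when n frequencies are split into at most b contiguous bins.

    Returns (error, bins), where bins are inclusive 0-based index ranges.
    """
    n = len(freqs)
    # Prefix sums of f and f^2 give the SSE of any interval in O(1).
    s1, s2 = [0.0], [0.0]
    for f in freqs:
        s1.append(s1[-1] + f)
        s2.append(s2[-1] + f * f)

    def sse(l, r):                      # interval covers freqs[l:r], r exclusive
        if r <= l:
            return 0.0
        total = s1[r] - s1[l]
        return (s2[r] - s2[l]) - total * total / (r - l)

    INF = float("inf")
    # best[k][i]: optimal error for the first i values using at most k bins.
    best = [[0.0] + [INF] * n for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    for k in range(1, b + 1):
        for i in range(1, n + 1):
            for j in range(i):          # last bin covers values j+1 .. i (1-based)
                err = best[k - 1][j] + sse(j, i)
                if err < best[k][i]:
                    best[k][i], cut[k][i] = err, j

    # Walk the cut table backwards to recover the bin boundaries.
    bins, i, k = [], n, b
    while i > 0:
        j = cut[k][i]
        bins.append((j, i - 1))
        i, k = j, k - 1
    return best[b][n], bins[::-1]

err, bins = v_optimal([1, 1, 1, 9, 9, 2, 2], 3)
print(err, bins)                        # 0.0 [(0, 2), (3, 4), (5, 6)]
```
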
77. ### Data summarization

- Central tendency measures
  - Mean: may be weighted
  - Median: “middle” value
  - Mode: dataset may be unimodal or multimodal
  - Midrange: average of largest and smallest value
- Dispersion measures
  - Variance
  - Standard deviation
  - Range
  - Percentile (quartile)
  - Five-number summary: minimum, first quartile, median, third quartile, maximum
  - Box plot: plot of the five values

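
These summary measures are all available in Python's standard `statistics` module (version 3.8 or later for `quantiles` and `multimode`). The sample `data` is made up for illustration, and the choice of population variance and the "inclusive" quartile method are assumptions; other conventions give slightly different quartiles.

```python
from statistics import mean, median, multimode, pstdev, pvariance, quantiles

data = [2, 4, 4, 5, 7, 9, 10, 12, 12, 12, 15]

# Central tendency
print(mean(data), median(data), multimode(data))
print((min(data) + max(data)) / 2)              # midrange

# Dispersion
print(pvariance(data), pstdev(data), max(data) - min(data))

# Five-number summary; method="inclusive" treats data as the whole population.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
five = (min(data), q1, q2, q3, max(data))
print(five)                                     # (2, 4.5, 9.0, 12.0, 15)
```
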
88. ### Data summarization (contd.) Distributive measures Can be computed by partitioning

the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
89. ### Data summarization (contd.) Distributive measures Can be computed by partitioning

the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Plot of one variable against another Scatter plot matrix: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
90. ### Data summarization (contd.) Distributive measures Can be computed by partitioning

the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Plot of one variable against another Scatter plot matrix: n(n − 1)/2 scatter plots for n variables Loess curve: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
91. ### Data summarization (contd.)

Distributive measures
- Can be computed by partitioning the dataset
- Example: mean

Holistic measures
- Cannot be computed by partitioning the dataset
- Example: median

Graphical measures
- Histograms
- Bar chart: histograms where bins are categorical
- Pie chart: relative frequencies shown as sectors in a circle
- Quantile plot: percentiles against value
- Quantile-quantile plot (q-q plot): quantiles of one variable against another
- Scatter plot: plot of one variable against another
- Scatter plot matrix: n(n − 1)/2 scatter plots for n variables
- Loess curve: plot of regression polynomial against actual values
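
The difference between distributive and holistic measures can be illustrated with a small Python sketch. It assumes that each partition reports its sum and its count, from which the overall mean is assembled; combining per-partition medians, by contrast, does not in general recover the true median. The helper names and the sample partitions are hypothetical and chosen only for illustration.

```python
import statistics

def partial_sum_count(partition):
    """Per-partition summary: (sum, count). Both can be merged by addition."""
    return sum(partition), len(partition)

def mean_from_partitions(partitions):
    """Assemble the overall mean from per-partition sums and counts."""
    summaries = [partial_sum_count(p) for p in partitions]
    total_sum = sum(s for s, _ in summaries)
    total_count = sum(c for _, c in summaries)
    return total_sum / total_count

# Hypothetical partitioning of one dataset into three chunks.
partitions = [[1, 2, 3], [4, 100], [5, 6, 7, 8]]
full = [x for p in partitions for x in p]

merged_mean = mean_from_partitions(partitions)
true_median = statistics.median(full)

# Naive attempt to treat the median the same way: median of partition medians.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)

print(merged_mean, statistics.mean(full))  # the two means agree
print(true_median, median_of_medians)      # the two medians differ here
```

On this data the merged mean matches the mean of the full dataset, while the median of partition medians differs from the true median, which is why the median needs access to the whole dataset.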