CS685 Data Preprocessing 1

CS685: Data Mining
Data Preprocessing
Arnab Bhattacharya ([email protected])
Computer Science and Engineering, Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs685/
1st semester, 2012-13; Tue, Wed, Fri 0900-1000 at CS101

Outline
1 Data
2 Data preprocessing
3 Data cleaning
4 Data reduction
  - Dimensionality reduction
  - Numerosity reduction

Categorical data
- Categorical data is qualitative
- Nominal
  - Categories
  - Example: color
  - Operations: equal, not equal
- Binary
  - Special case of nominal
  - Example: gender, diabetic
  - Symmetric: the two cases are equally important
  - Asymmetric: one case is more important
- Ordinal (also called rank or ordered scalar)
  - Values can be ordered
  - Example: small, medium, large
  - Operations: equality, lesser, greater
  - Difference has no meaning

Numeric data
- Numeric data is quantitative
- Interval-scaled
  - Measured in equal-sized units
  - Example: temperature in Celsius, date
  - Operations: difference
  - No true zero point: absolute values (and hence ratios) have no meaning
- Ratio-scaled
  - Has a zero point: absolute values are ratios of each other
  - Example: temperature in Kelvin, age, mass, length
  - Operations: difference, ratio

Types of data
- Data values can also be classified as discrete or continuous
- Discrete
  - Finite or countably infinite set of values
  - Countably infinite sets have a one-to-one correspondence with the set of natural numbers
- Continuous
  - Real numbers
  - Precision of measurement and machine representation limit the possible values
  - Hence not continuous in the strict sense

Data quality
- Data should have the following qualities
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Reliability
  - Interpretability
  - Availability

Data quality errors and parameters
- Errors in data are due to
  - Measurement error
  - Data collection error
  - Noise: probabilistic
  - Artifact: deterministic distortions
- Parameters to measure the quality of measurements
  - Precision: closeness of repeated measurements
  - Bias: systematic variation of measurements
  - Accuracy: closeness of measurements to the true value
- Data problems
  - Missing values
  - Noise
  - Outliers
  - Inconsistent values
  - Duplicate objects
- Domain knowledge about data and attributes helps data mining

Data preprocessing
- Data preprocessing is the process of preparing the data to be fit for data mining algorithms and methods
- It may involve one or more of the following steps
  - Data cleaning
  - Data reduction
  - Data integration
  - Data transformation

Data cleaning
- Process of handling errors in data
- Five major ways
  - Filling in missing values
  - Handling noise
  - Removing outliers
    - One of the main methods of handling noise
  - Resolving inconsistent data
    - Example: out-of-range values
    - Once identified as inconsistent, a value is handled as a missing value
  - De-duplicating objects

Missing values
- Ignore the data object
- Ignore only the missing attribute during analysis
- Estimate the missing value
  - Use a measure of overall central tendency
    - Mean or median
  - Use a measure of central tendency from only the neighborhood
    - Interpolation
    - Useful for temporal and spatial data
  - Use the most probable value
    - Mode
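The estimation strategies above can be sketched as follows. This is only an illustration under stated assumptions: missing entries are represented by `None`, the values form an equally spaced sequence (so linear interpolation between the nearest known neighbours makes sense), and the function names `fill_global` and `fill_interpolate` are hypothetical, not from the lecture.

```python
import statistics


def fill_global(values, how="mean"):
    """Replace None by an overall measure of central tendency."""
    known = [v for v in values if v is not None]
    if how == "mean":
        fill = statistics.fmean(known)
    elif how == "median":
        fill = statistics.median(known)
    elif how == "mode":
        fill = statistics.mode(known)  # most probable (most frequent) value
    else:
        raise ValueError(f"unknown strategy: {how}")
    return [fill if v is None else v for v in values]


def fill_interpolate(values):
    """Replace None by linear interpolation between the nearest known neighbours.

    Assumes equally spaced positions (e.g. a regularly sampled time series).
    Missing values at either end are copied from the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            span = values[right] - values[left]
            filled[i] = values[left] + span * (i - left) / (right - left)
    return filled
```

For example, `fill_global([1, None, 3])` uses the overall mean of the known values, while `fill_interpolate([10, None, None, 40])` uses only the two neighbours on either side of each gap.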

Noise
- Random perturbations in the data
- It is generally assumed that the magnitude of noise is smaller than the magnitude of the attribute of interest
  - Signal-to-noise ratio should not be too low
- White noise
  - Gaussian distribution with zero mean
- Histogram binning
  - Values in a bin are replaced by the bin mean or median
  - Equi-width histograms are more common than equi-depth
- Regression
  - Fitting a function to describe the values
  - Small amounts of noise do not affect the overall fit
  - Noisy value is replaced by the most likely value predicted by the function
- Outlier identification and removal
- As opposed to noise, bias can be corrected since it is deterministic
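Histogram binning as a noise-handling step might look like the sketch below. It assumes equi-width bins over the observed range and replaces every value by the mean of its bin; the function name `smooth_by_bin_means` and the choice of mean over median are illustrative assumptions.

```python
def smooth_by_bin_means(values, b):
    """Replace each value by the mean of its equi-width bin (b bins)."""
    lo, hi = min(values), max(values)
    # Fall back to a dummy width when all values are identical.
    width = (hi - lo) / b or 1.0

    def bin_of(v):
        # The maximum value is placed in the last bin.
        return min(int((v - lo) / width), b - 1)

    sums = [0.0] * b
    counts = [0] * b
    for v in values:
        sums[bin_of(v)] += v
        counts[bin_of(v)] += 1
    return [sums[bin_of(v)] / counts[bin_of(v)] for v in values]
```

With two bins, `smooth_by_bin_means([1, 2, 3, 10, 11, 12], 2)` maps the first three values to their common mean and the last three to theirs, which removes small perturbations inside each bin.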

Outliers
- Outliers are data objects that are generated from a perceivably different process
  - Values are considerably unusual
  - Also called anomalous
- It is not straightforward to identify outliers
  - Unusual values may be the ones that are of interest
- Statistical methods and tests are mostly used

Data duplication
- Duplicate objects may appear during data insertion or data transfer
- Introduces errors in statistics about the data
- If most attributes are exact copies, then duplicates are easy to remove
- Sometimes one or more attributes are slightly different
  - Domain knowledge needs to be utilized to identify such cases

Data reduction
- Benefits of data reduction
  - More understandable model
    - Fewer rules
    - Less complex rules, i.e., involving fewer attributes
  - Faster algorithms
  - Easier visualization
- Important ways of data reduction
  1 Dimensionality reduction
  2 Numerosity reduction
  3 Data discretization
  4 Data modeling

Dimensionality reduction
- Dimensionality reduction reduces the number of dimensions
  - New dimensions are generally different from the original ones
- Curse of dimensionality
  - Data becomes too sparse as dimensions increase
  - Classification: not enough data to create good models or methods
  - Clustering: density becomes irrelevant and distances between points become similar

Singular value decomposition (SVD)
- Factorization of a matrix: $A = U \Sigma V^T$
- If $A$ is of size $m \times n$, then $U$ is $m \times m$, $V$ is $n \times n$ and $\Sigma$ is an $m \times n$ matrix
- Columns of $U$ are eigenvectors of $A A^T$
  - Left singular vectors
  - $U U^T = I_m$ (orthonormal)
- Columns of $V$ are eigenvectors of $A^T A$
  - Right singular vectors
  - $V^T V = I_n$ (orthonormal)
- $\sigma_{ii}$ are the singular values
  - $\Sigma$ is diagonal
  - Singular values are positive square roots of eigenvalues of $A A^T$ or $A^T A$
  - $\sigma_{11} \ge \sigma_{22} \ge \dots \ge \sigma_{nn}$ (assuming $n$ singular values)

SVD of real symmetric matrix
- $A$ is real symmetric of size $n \times n$
  - $A = A^T$
- $U = V$ (up to the signs of columns) since $A^T A = A A^T = A^2$
- $A = Q \Sigma Q^T$
  - $Q$ is of size $n \times n$ and contains eigenvectors of $A^2$
  - This is called the spectral decomposition of $A$
  - $\Sigma$ contains $n$ singular values
- Eigenvectors of $A$ = eigenvectors of $A^2$
- Eigenvalues of $A$ = square roots of eigenvalues of $A^2$ (up to sign)
- Eigenvalues of $A$ = singular values of $A$ when $A$ is positive semi-definite; in general, the singular values are the absolute values of the eigenvalues

Transformation using SVD
- Transformed data: $T = AV = U\Sigma$
- $V$ is called the SVD transform matrix
- Essentially, $T$ is just a rotation of $A$
  - Dimensionality of $T$ is $n$
  - $n$ basis vectors that are different from those of the original space
  - Columns of $V$ give the basis vectors in the rotated space
- $V$ shows how each dimension can be represented as a linear combination of other dimensions
  - Columns are input basis vectors
- $U$ shows how each object can be represented as a linear combination of other objects
  - Columns are output basis vectors
- Lengths of vectors are preserved
  - $\|a_i\|_2 = \|t_i\|_2$
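The length-preservation claim follows in one line from the orthonormality of $V$. The derivation below is a sketch that assumes $a_i$ and $t_i$ denote the $i$-th rows of $A$ and $T$, written as row vectors, so that $t_i = a_i V$.

```latex
\|t_i\|_2^2 = t_i t_i^T = (a_i V)(a_i V)^T = a_i V V^T a_i^T = a_i I_n a_i^T = \|a_i\|_2^2
```

The step $V V^T = I_n$ holds because $V$ is a square matrix with orthonormal columns.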

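Example
$$
A = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix} = U \Sigma V^T
$$
$$
U = \begin{bmatrix} -0.80 & 0.22 & 0.05 & 0.54 \\ -0.56 & -0.20 & -0.34 & -0.71 \\ -0.16 & -0.31 & 0.90 & -0.21 \\ -0.01 & -0.89 & -0.22 & 0.37 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} 5.54 & 0 \\ 0 & 1.24 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad
V^T = \begin{bmatrix} -0.39 & -0.92 \\ 0.92 & -0.39 \end{bmatrix}
$$
$$
T = AV = U\Sigma = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix} \times \begin{bmatrix} -0.39 & 0.92 \\ -0.92 & -0.39 \end{bmatrix} = \begin{bmatrix} -4.46 & 0.27 \\ -3.15 & -0.25 \\ -0.92 & -0.39 \\ -0.06 & -1.11 \end{bmatrix}
$$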
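As a quick check on the example above, the singular values can be recovered by hand from the eigenvalues of $A^T A$. The worked computation below uses rounded numbers and is only meant as a sanity check, not as the way SVD is computed in practice.

```latex
A^T A = \begin{bmatrix} 6 & 10.5 \\ 10.5 & 26.25 \end{bmatrix}, \qquad
\lambda^2 - 32.25\,\lambda + 47.25 = 0
\;\Rightarrow\; \lambda_1 \approx 30.71, \quad \lambda_2 \approx 1.54

\sigma_{11} = \sqrt{\lambda_1} \approx 5.54, \qquad \sigma_{22} = \sqrt{\lambda_2} \approx 1.24
```

Here 32.25 is the trace and 47.25 the determinant of $A^T A$, which are the coefficients of the characteristic polynomial of a $2 \times 2$ matrix.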

Example: compact form
$$
A = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
= \underbrace{\begin{bmatrix} -0.80 & 0.22 \\ -0.56 & -0.20 \\ -0.16 & -0.31 \\ -0.01 & -0.89 \end{bmatrix}}_{U}
\times \underbrace{\begin{bmatrix} 5.54 & 0 \\ 0 & 1.24 \end{bmatrix}}_{\Sigma}
\times \underbrace{\begin{bmatrix} -0.39 & -0.92 \\ 0.92 & -0.39 \end{bmatrix}}_{V^T}
$$
- If $A$ is of size $m \times n$, then $U$ is $m \times n$, $V$ is $n \times n$ and $\Sigma$ is an $n \times n$ matrix
- Works because there are at most $n$ non-zero singular values in $\Sigma$

Dimensionality reduction using SVD
- $A = U \Sigma V^T = \sum_{i=1}^{n} u_i \sigma_{ii} v_i^T$
- Use only $k$ dimensions
  - Retain the first $k$ columns of $U$ and $V$ and the first $k$ values of $\Sigma$
  - First $k$ columns of $V$ give the basis vectors in the reduced space
  - Best rank-$k$ approximation in terms of sum squared error
- $A \approx \sum_{i=1}^{k} u_i \sigma_{ii} v_i^T = U_{1 \dots k} \Sigma_{1 \dots k} V_{1 \dots k}^T$
- $T \approx A V_{1 \dots k}$

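Example of dimensionality reduction (k = 1)
$$
A \approx A_k = U_k \Sigma_k V_k^T
= \begin{bmatrix} -0.80 & 0 & 0 & 0 \\ -0.56 & 0 & 0 & 0 \\ -0.16 & 0 & 0 & 0 \\ -0.01 & 0 & 0 & 0 \end{bmatrix}
\times \begin{bmatrix} 5.54 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}
\times \begin{bmatrix} -0.39 & 0 \\ -0.92 & 0 \end{bmatrix}^T
= \begin{bmatrix} 1.74 & 4.10 \\ 1.23 & 2.90 \\ 0.35 & 0.84 \\ 0.02 & 0.06 \end{bmatrix}
$$
$$
T \approx T_k = A V_k = U_k \Sigma_k
= \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
\times \begin{bmatrix} -0.39 & 0 \\ -0.92 & 0 \end{bmatrix}
= \begin{bmatrix} -4.46 & 0 \\ -3.15 & 0 \\ -0.92 & 0 \\ -0.06 & 0 \end{bmatrix}
$$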

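Compact way of dimensionality reduction (k = 1)
$$
A \approx A_k = U_k \Sigma_k V_k^T
= \begin{bmatrix} -0.80 \\ -0.56 \\ -0.16 \\ -0.01 \end{bmatrix}
\times \begin{bmatrix} 5.54 \end{bmatrix}
\times \begin{bmatrix} -0.39 \\ -0.92 \end{bmatrix}^T
= \begin{bmatrix} 1.74 & 4.10 \\ 1.23 & 2.90 \\ 0.35 & 0.84 \\ 0.02 & 0.06 \end{bmatrix}
$$
$$
T \approx T_k = A V_k = U_k \Sigma_k
= \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}
\times \begin{bmatrix} -0.39 \\ -0.92 \end{bmatrix}
= \begin{bmatrix} -4.46 \\ -3.15 \\ -0.92 \\ -0.06 \end{bmatrix}
$$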

How many dimensions to retain?
- There is no easy answer
- Concept of energy of a dataset
  - Total energy is the sum of squares of singular values (aka spread or variance): $E = \sum_{i=1}^{n} \sigma_{ii}^2$
  - Retain $k$ dimensions such that $p\,\%$ of the energy is retained: $E_k = \sum_{i=1}^{k} \sigma_{ii}^2$ with $E_k / E \ge p$
  - Generally, $p$ is between 80% and 95%
- In the above example, $k = 1$ retains 95.22% of the energy
- Running time: $O(mnr)$ for $A$ of size $m \times n$ and rank $r$
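The energy criterion above can be turned into a few lines of code. The sketch below assumes the singular values are already available (for instance from a linear-algebra library) and that the threshold `p` is given as a fraction in (0, 1]; the name `choose_k` is hypothetical.

```python
def choose_k(singular_values, p):
    """Return (k, fraction) for the smallest k with E_k / E >= p."""
    if not 0 < p <= 1:
        raise ValueError("p must be a fraction in (0, 1]")
    # Energy of each dimension is the squared singular value, largest first.
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    if total == 0:
        raise ValueError("all singular values are zero")
    retained = 0.0
    for k, energy in enumerate(energies, start=1):
        retained += energy
        if retained / total >= p:
            return k, retained / total
    # Only reached through floating-point round-off when p is 1.
    return len(energies), 1.0
```

With the rounded singular values of the running example, `choose_k([5.54, 1.24], 0.90)` selects a single dimension, in line with the roughly 95% figure quoted on the slide.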

Principal component analysis (PCA)
- Way of identifying patterns in data
  - How input basis vectors are correlated for the given data
- A transformation from a set of (possibly correlated) axes to another set of uncorrelated axes
- Orthogonal linear transformation (i.e., rotation)
- New axes are principal components
- First principal component produces projections that are best in the squared error sense
- Optimal least squares solution

Algorithm
- Mean center the data (optional)
- Compute the covariance matrix of the dimensions
- Find eigenvectors of the covariance matrix
- Sort eigenvectors in decreasing order of eigenvalues
- Project onto eigenvectors in order

- Assume the data matrix is $B$ of size $m \times n$
- For each dimension $i$, compute the mean $\mu_i$
- Mean center $B$ by subtracting $\mu_i$ from each column $i$ to get $A$
- Compute the covariance matrix $C$ of size $n \times n$
  - If mean centered, $C = A^T A$
- Find eigenvectors and corresponding eigenvalues $(V, E)$ of $C$
- Sort eigenvalues such that $e_1 \ge e_2 \ge \dots \ge e_n$
- Project step-by-step onto the principal components $v_1, v_2, \dots$

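Example
$$
B = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 1 \\ -1 & 0.5 \end{bmatrix}; \quad
\mu(B) = \begin{bmatrix} 0.500 & 2.125 \end{bmatrix}
$$
$$
\therefore A = \begin{bmatrix} 1.5 & 1.875 \\ 0.5 & 0.875 \\ -0.5 & -1.125 \\ -1.5 & -1.625 \end{bmatrix}
\quad \text{and} \quad
C = A^T A = \begin{bmatrix} 5.000 & 6.250 \\ 6.250 & 8.187 \end{bmatrix}
$$
$$
\text{Eigenvectors } V = \begin{bmatrix} 0.613 & -0.789 \\ 0.789 & 0.613 \end{bmatrix}; \quad
\text{eigenvalues } E = \begin{bmatrix} 13.043 & 0.143 \end{bmatrix}
$$
$$
\text{Transformed data } T = AV = \begin{bmatrix} 2.400 & -0.034 \\ 0.997 & 0.142 \\ -1.195 & -0.295 \\ -2.203 & 0.187 \end{bmatrix}
$$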

Visual example
[Figure not included in the transcript]

Properties
- Also known as the Karhunen-Loève transform (KLT)
- Works for L2 distances only, as other distances are not invariant to rotation
- Mean-centering
  - Easier way to compute covariance: $A^T A$ is the covariance matrix (up to a constant scaling factor)
  - Allows use of SVD to compute PCA
- Can be done using SVD
  - Eigenvector matrix $V$ of $C$ is really the SVD transform matrix $V$ for $A$
  - Different from the SVD of $B$ though
- How many dimensions to retain?
  - Based on energy (similar to SVD)
  - Total energy is the sum of eigenvalues $e_i$
  - Retain $k$ dimensions such that 80% to 95% of the energy is retained
  - In the above example, $k = 1$ retains 98.91% of the energy
- Running time: $O(mn^2 + n^3)$ for $A$ of size $m \times n$

Numerosity reduction
- Numerosity reduction reduces the volume of data
  - Reduction in number of data objects
  - Data compression
  - Modeling
  - Discretization

Aggregation
- Considers a set of data objects having some similar attribute(s)
- Aggregates some other attribute(s) into single value(s)
  - Example: sum, average
- Benefits of aggregation
  - Aggregate value has less variability
  - Absorbs individual errors

Data cube
[Figure: data cube with axis labels S1-S3, C1-C4 and A, E, P; visible cell values 90, 75, 80, 95, 45, 60, 60]
- For multi-dimensional datasets, aggregation can happen along different dimensions
- Data cubes are essentially multi-dimensional arrays
- Each cell or face or lower dimensional surface represents a certain projection operation
- Aggregation can also happen along different resolutions in each dimension
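Aggregation along different dimensions of a cube can be sketched with a dictionary keyed by coordinate tuples. Everything concrete here is an assumption made for illustration: the two-dimensional toy cube, the meaning of its axes, the cell values, and the helper name `aggregate`.

```python
from collections import defaultdict


def aggregate(cube, keep_axes, combine=sum):
    """Collapse a cube {coordinate tuple: value} onto the axes in keep_axes.

    All cells that agree on the kept axes are combined into one value.
    """
    groups = defaultdict(list)
    for coords, value in cube.items():
        key = tuple(coords[axis] for axis in keep_axes)
        groups[key].append(value)
    return {key: combine(values) for key, values in groups.items()}


# Hypothetical 2-D cube: axis 0 and axis 1 are two categorical dimensions.
cube = {("S1", "C1"): 90, ("S1", "C2"): 75, ("S2", "C1"): 80, ("S2", "C2"): 95}

per_s = aggregate(cube, keep_axes=[0])    # sums over axis 1
per_c_avg = aggregate(cube, keep_axes=[1], combine=lambda v: sum(v) / len(v))
grand_total = aggregate(cube, keep_axes=[])  # collapses every axis
```

Each call corresponds to one projection of the cube: keeping one axis gives a face, and keeping none gives the single apex cell.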

Sampling
- A sample is representative if it has (approximately) the same property of interest as the full dataset
- Sampling approaches
  - Sampling without replacement (SRSWOR): each object can be picked at most once, so a sample as large as the dataset reproduces the population
  - Sampling with replacement (SRSWR): an object can be picked more than once
- Simple random sampling
- For different types of objects, stratified sampling
  - Picks an equal or representative number of objects from each group
- Sample size
  - Sample should have enough data to capture variability
  - What is the probability of obtaining objects from all $k$ groups in a sample of size $n$?
- Progressive sampling or adaptive sampling
  - Start with a small sample size
  - Keep on increasing it till the sample is acceptable
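The sampling approaches above map directly onto Python's standard `random` module. The sketch also gives one possible answer to the sample-size question, under assumptions that are not stated on the slide: the $k$ groups are equally likely and objects are drawn independently with replacement, so inclusion-exclusion applies. All function names are hypothetical.

```python
import random
from math import comb


def srswor(data, n, rng):
    """Simple random sample without replacement: no object is repeated."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement: objects may repeat."""
    return rng.choices(data, k=n)


def stratified(groups, per_group, rng):
    """Pick up to per_group objects (without replacement) from every group."""
    return {
        name: rng.sample(members, min(per_group, len(members)))
        for name, members in groups.items()
    }


def prob_all_groups(k, n):
    """P(a sample of size n contains objects from all k groups).

    Assumes k equally likely groups and independent draws with replacement.
    Inclusion-exclusion over the set of groups that are missed.
    """
    return sum((-1) ** i * comb(k, i) * ((k - i) / k) ** n for i in range(k + 1))


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = srswor(list(range(100)), 10, rng)
```

For instance, `prob_all_groups(2, 2)` is 0.5: with two groups and two draws, the second draw must land in the other group. Evaluating the function for growing `n` is one way to decide when a progressive sample has become large enough.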

Histograms
- Method of discretizing the data
- Mostly useful for one-dimensional data
- Equi-width histograms: bins are equally spaced apart
- Equi-height (equi-depth) histograms: each bin has the same height
- MaxDiff histograms
  - Values are first sorted
  - To get $b$ bins, the largest $b - 1$ differences between adjacent values are made bin boundaries
- V-optimal histograms
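The first three histogram types can be sketched in a few lines each. The code assumes a plain list of numeric values; how ties between equal gaps are broken in `maxdiff_bins` and how leftovers are spread in `equi_depth_bins` are implementation choices of this sketch, not something the slide specifies.

```python
def equi_width_edges(values, b):
    """Bin edges of b equally wide bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    return [lo + i * width for i in range(b + 1)]


def equi_depth_bins(values, b):
    """Split the sorted values into b bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // b:(i + 1) * n // b] for i in range(b)]


def maxdiff_bins(values, b):
    """Cut the sorted values at the b - 1 largest gaps between neighbours."""
    ordered = sorted(values)
    # Index i stands for the gap between ordered[i] and ordered[i + 1].
    by_gap = sorted(
        range(len(ordered) - 1),
        key=lambda i: ordered[i + 1] - ordered[i],
        reverse=True,
    )
    bins, start = [], 0
    for i in sorted(by_gap[:b - 1]):
        bins.append(ordered[start:i + 1])
        start = i + 1
    bins.append(ordered[start:])
    return bins
```

On `[1, 2, 3, 10, 11, 20]` with three bins, `maxdiff_bins` cuts at the gaps 3 to 10 and 11 to 20, whereas `equi_depth_bins` simply puts two values in each bin.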

V-optimal histograms
- A dataset with $n$ objects
- Reduce $n$ to $b$ bins where $b \ll n$
- Formally, assume a set $V$ of $n$ (sorted) values $v_1, v_2, \dots, v_n$ having frequencies $f_1, f_2, \dots, f_n$ respectively
- Problem is to output another histogram $H$ having $b$ bins, i.e., $b$ non-overlapping intervals on $V$
- Interval $I_i$ is of the form $[l_i, r_i]$ and has a value $h_i$
- If value $v_j \in I_i$, the estimate $e(v_j)$ of $f_j$ is $h_i$
- Error in estimation is the distance $d(f, e)$

Details
- Histogram value $h_i$ is the average of the frequencies of the values in $I_i$, i.e.,
  $h_i = \mathrm{avg}([l_i, r_i]) = \left( \sum_{k=l_i}^{r_i} f_k \right) / (r_i - l_i + 1)$
- Error function is sum squared error (SSE) (or L2 error)
  $SSE([l, r]) = \sum_{k=l}^{r} \left( f_k - \mathrm{avg}([l, r]) \right)^2$

Algorithm
- Let $SSE^*(i, k)$ be the error of the optimal partitioning of the first $i$ values with at most $k$ bins
- Consider placement of the last bin
  - Choice is any of the $i$ gaps
  - For each such placement at gap $j$, at most $k - 1$ bins have been placed optimally for the first $j$ values
- This leads to the recursion
  $SSE^*(i, k) = \min_{1 \le j \le i} \{ SSE^*(j, k - 1) + SSE([l_{j+1}, r_i]) \}$
- Dynamic programming (DP) solution
  - Table of size $n \times b$
  - Start with cell $(1, 1)$ and proceed in a column-scan order
  - Computation for cell $(i, k)$ requires values at cells $(j, k - 1)$, $\forall\, 1 \le j \le i$
- Running time: $O(n^2 b)$
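The recursion above translates into the following dynamic program. This is a sketch rather than the lecture's exact formulation: it indexes by prefix length (`j` runs from 0 to `i - 1`, so the last bin is never empty), uses prefix sums so that each $SSE([l, r])$ costs $O(1)$, and additionally records the cut points so the bins can be recovered. The name `v_optimal` and the return format are assumptions.

```python
def v_optimal(freqs, b):
    """Partition freqs into at most b contiguous bins with minimum total SSE.

    Returns (error, bins) where bins is a list of inclusive (start, end)
    index pairs into freqs. Runs in O(n^2 b) time.
    """
    if b < 1:
        raise ValueError("need at least one bin")
    n = len(freqs)
    # Prefix sums of f and f^2 give the SSE of any interval in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        prefix[i + 1] = prefix[i] + f
        prefix_sq[i + 1] = prefix_sq[i] + f * f

    def sse(l, r):
        """SSE of one bin covering freqs[l..r], both ends inclusive."""
        count = r - l + 1
        total = prefix[r + 1] - prefix[l]
        return (prefix_sq[r + 1] - prefix_sq[l]) - total * total / count

    inf = float("inf")
    # best[k][i]: minimum error for the first i values using at most k bins.
    best = [[inf] * (n + 1) for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    for k in range(b + 1):
        best[k][0] = 0.0
    for k in range(1, b + 1):
        for i in range(1, n + 1):
            for j in range(i):  # last bin covers freqs[j..i-1]
                cost = best[k - 1][j] + sse(j, i - 1)
                if cost < best[k][i]:
                    best[k][i], cut[k][i] = cost, j

    # Walk the recorded cuts backwards to recover the bins.
    bins, i, k = [], n, b
    while i > 0:
        j = cut[k][i]
        bins.append((j, i - 1))
        i, k = j, k - 1
    bins.reverse()
    return best[b][n], bins
```

For frequencies `[1, 1, 1, 10, 10, 10]` and two bins, the optimal split separates the low frequencies from the high ones and the total error is zero.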

Data summarization
- Central tendency measures
  - Mean: may be weighted
  - Median: "middle" value
  - Mode: dataset may be unimodal or multimodal
  - Midrange: average of largest and smallest value
- Dispersion measures
  - Variance
  - Standard deviation
  - Range
  - Percentile (quartile)
  - Five-number summary: minimum, first quartile, median, third quartile, maximum
  - Box plot: plot of the five values
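Most of these measures are available in Python's standard `statistics` module (Python 3.8 or later). The sketch below makes two choices the slide leaves open: it uses the population variance and standard deviation, and it computes quartiles with the `"inclusive"` interpolation method. The function name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Central tendency and dispersion measures for a list of numbers."""
    ordered = sorted(values)
    # Three cut points that split the data into four equal-probability parts.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": median,
        "modes": statistics.multimode(ordered),  # more than one if multimodal
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "variance": statistics.pvariance(ordered),
        "stdev": statistics.pstdev(ordered),
        "range": ordered[-1] - ordered[0],
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
    }
```

The `five_number` tuple is exactly what a box plot draws: the box spans the first to third quartile, with a line at the median and whiskers towards the minimum and maximum.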

Data summarization (contd.)
- Distributive measures
  - Can be computed by partitioning the dataset
  - Example: mean
- Holistic measures
  - Cannot be computed by partitioning the dataset
  - Example: median
- Graphical measures
  - Histograms
  - Bar chart: histogram where bins are categorical
  - Pie chart: relative frequencies shown as sectors in a circle
  - Quantile plot: percentiles against value
  - Quantile-quantile plot (q-q plot): quantiles of one variable against another
  - Scatter plot: plot of one variable against another
  - Scatter plot matrix: $n(n - 1)/2$ scatter plots for $n$ variables
  - Loess curve: plot of regression polynomial against actual values
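The difference between the two kinds of measures shows up on a tiny partitioned dataset. In this sketch the partitions, the numbers, and the helper names are made up for illustration; the mean is assembled from per-partition (sum, count) pairs, while combining per-partition medians does not give the true median.

```python
import statistics


def partial(chunk):
    """Per-partition summary that is enough to rebuild the mean."""
    return sum(chunk), len(chunk)


def combine_mean(partials):
    """Mean of the whole dataset from per-partition (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


chunks = [[1, 2, 3], [4, 100]]  # hypothetical partitioning of five values
everything = [x for chunk in chunks for x in chunk]

mean_from_parts = combine_mean([partial(c) for c in chunks])
true_mean = statistics.fmean(everything)

true_median = statistics.median(everything)
median_of_medians = statistics.median([statistics.median(c) for c in chunks])
```

Here `mean_from_parts` equals `true_mean`, but `median_of_medians` differs from `true_median`, which is why the median needs access to the whole dataset.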
