Operations: equal, not equal Binary Special case of nominal Example: gender, diabetic Symmetric: Two cases are equally important Asymmetric: One case is more important Ordinal or Rank or Ordered scalar Can order Example: small, medium, large Operations: equality, lesser, greater Diﬀerence has no meaning Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 4 / 43
sized units Example: temperature in Celsius, date Operations: diﬀerence No zero point: absolute value has no meaning Ratio-scaled Has a zero point: absolute values are ratios of each other Example: temperature in Kelvin, age, mass, length Operations: diﬀerence, ratio Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 5 / 43
discrete or continuous Discrete Finite or countably inﬁnite set of values Countably inﬁnite sets have a one-to-one correspondence with the set of natural numbers Continuous Real numbers Precision of measurement and machine-representation limit possibilities Not continuous in the actual sense Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 6 / 43
Measurement error Data collection error Noise: probabilistic Artifact: deterministic distortions Parameters to measure the quality of measurements Precision: closeness of repeated measurements Bias: systematic variation of measurements Accuracy: closeness of measurements to true value Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 8 / 43
Measurement error Data collection error Noise: probabilistic Artifact: deterministic distortions Parameters to measure the quality of measurements Precision: closeness of repeated measurements Bias: systematic variation of measurements Accuracy: closeness of measurements to true value Data problems Missing values Noise Outliers Inconsistent values Duplicate objects Domain knowledge about data and attributes helps data mining Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 8 / 43
data to be ﬁt for data mining algorithms and methods It may involve one or more of the following steps Data cleaning Data reduction Data integration Data transformation Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 10 / 43
ways Filling in missing values Handling noise Removing outliers One of the main methods in handling noise Resolving inconsistent data Out of range Once identiﬁed as inconsistent data, handled as missing value De-duplicating duplicated objects Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 12 / 43
attribute during analysis Estimate the missing value Use a measure of overall central tendency Mean or median Use a measure of central tendency from only the neighborhood Interpolation Useful for temporal and spatial data Use the most probable value Mode Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 13 / 43
that magnitude of noise is smaller than magnitude of attribute of interest Signal-to-noise ratio should not be too low White noise Gaussian distribution with zero mean Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 14 / 43
that magnitude of noise is smaller than magnitude of attribute of interest Signal-to-noise ratio should not be too low White noise Gaussian distribution with zero mean Histogram binning Bin values are replaced by mean or median Equi-width histograms are more common than equi-depth Regression Fitting a function to describe the values Small values of noise do not aﬀect the overall ﬁt Noisy value replaced by most likely value predicted by the function Outlier identiﬁcation and removal Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 14 / 43
that magnitude of noise is smaller than magnitude of attribute of interest Signal-to-noise ratio should not be too low White noise Gaussian distribution with zero mean Histogram binning Bin values are replaced by mean or median Equi-width histograms are more common than equi-depth Regression Fitting a function to describe the values Small values of noise do not aﬀect the overall ﬁt Noisy value replaced by most likely value predicted by the function Outlier identiﬁcation and removal As opposed to noise, bias can be corrected since it is deterministic Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 14 / 43
perceivably diﬀerent process Values are considerably unusual Also called anomalous It is not straightforward to identify outliers Unusual values may be the one that are of interest Statistical methods and tests are mostly used Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 15 / 43
data transfer Introduces errors in statistics about the data If most attributes are exact copies, then it is easy to remove Sometimes one or more attributes are slightly diﬀerent Domain knowledge needs to be utilized to identify such cases Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 16 / 43
number of rules Less complex rules, i.e., involving less number of attributes Faster algorithms Easier visualization Important ways of data reduction 1 Dimensionality reduction 2 Numerosity reduction 3 Data discretization 4 Data modeling Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 18 / 43
dimensions are generally diﬀerent from original ones Curse of dimensionality Data becomes too sparse as dimensions increase Classiﬁcation: Not enough data to create good models or methods Clustering: Density becomes irrelevant and distance between points becomes similar Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 19 / 43
UΣV T If A is of size m × n, then U is m × m, V is n × n and Σ is m × n matrix Columns of U are eigenvectors of AAT Left singular vectors UUT = Im (orthonormal) Columns of V are eigenvectors of AT A Right singular vectors V T V = In (orthonormal) σii are the singular values Σ is diagonal Singular values are positive square roots of eigenvalues of AAT or AT A σ11 ≥ σ22 ≥ · · · ≥ σnn (assuming n singular values) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 20 / 43
size n × n A = AT U = V since AT A = AAT = A2 A = QΣQT Q is of size n × n and contains eigenvectors of A2 This is called spectral decomposition of A Σ contains n singular values Eigenvectors of A = eigenvectors of A2 Eigenvalues of A = square root of eigenvalues of A2 Eigenvalues of A = singular values of A Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 21 / 43
V is called SVD transform matrix Essentially, T is just a rotation of A Dimensionality of T is n n diﬀerent basis vectors than the original space Columns of V give the basis vectors in rotated space Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 22 / 43
V is called SVD transform matrix Essentially, T is just a rotation of A Dimensionality of T is n n diﬀerent basis vectors than the original space Columns of V give the basis vectors in rotated space V shows how each dimension can be represented as a linear combination of other dimensions Columns are input basis vectors U shows how each object can be represented as a linear combination of other objects Columns are output basis vectors Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 22 / 43
V is called SVD transform matrix Essentially, T is just a rotation of A Dimensionality of T is n n diﬀerent basis vectors than the original space Columns of V give the basis vectors in rotated space V shows how each dimension can be represented as a linear combination of other dimensions Columns are input basis vectors U shows how each object can be represented as a linear combination of other objects Columns are output basis vectors Lengths of vectors are preserved ||ai ||2 = ||ti ||2 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 22 / 43
1 3 0 1 −1 0.5 =U −0.80 0.22 −0.56 −0.20 −0.16 −0.31 −0.01 −0.89 × Σ 5.54 0 0 1.24 × V T −0.39 −0.92 0.92 −0.39 T If A is of size m × n, then U is m × n, V is n × n and Σ is n × n matrix Works because there at most n non-zero singular values in Σ Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 24 / 43
i=1 (ui σii vT i ) Use only k dimensions Retain ﬁrst k columns for U and V and ﬁrst k values for Σ First k columns of V give the basis vectors in reduced space Best rank k approximation in terms of sum squared error A ≈ k i=1 (ui σii vT i ) = U1...kΣ1...kV T 1...k T ≈ AV1...k Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 25 / 43
Concept of energy of a dataset Total energy is sum of squares of singular values (aka spread or variance) E = n i=1 σ2 ii Retain k dimensions such that p % of the energy is retained Ek = k i=1 σ2 ii Ek/E ≥ p Generally, p is between 80 % to 95 % Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 28 / 43
Concept of energy of a dataset Total energy is sum of squares of singular values (aka spread or variance) E = n i=1 σ2 ii Retain k dimensions such that p % of the energy is retained Ek = k i=1 σ2 ii Ek/E ≥ p Generally, p is between 80 % to 95 % In the above example, k = 1 retains 95.22 % of the energy Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 28 / 43
Concept of energy of a dataset Total energy is sum of squares of singular values (aka spread or variance) E = n i=1 σ2 ii Retain k dimensions such that p % of the energy is retained Ek = k i=1 σ2 ii Ek/E ≥ p Generally, p is between 80 % to 95 % In the above example, k = 1 retains 95.22 % of the energy Running time: O(mnr) for A of size m × n and rank r Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 28 / 43
How input basis vectors are correlated for the given data A transformation from a set of (possibly correlated) axes to another set of uncorrelated axes Orthogonal linear transformation (i.e., rotation) New axes are principal components First principal component produces projections that are best in the squared error sense Optimal least squares solution Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 29 / 43
of the dimensions Find eigenvectors of covariance matrix Sort eigenvectors in decreasing order of eigenvalues Project onto eigenvectors in order Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 30 / 43
of the dimensions Find eigenvectors of covariance matrix Sort eigenvectors in decreasing order of eigenvalues Project onto eigenvectors in order Assume data matrix is B of size m × n For each dimension, compute mean µi Mean center B by subtracting µi from each column i to get A Compute covariance matrix C of size n × n If mean centered, C = AT A Find eigenvectors and corresponding eigenvalues (V , E) of C Sort eigenvalues such that e1 ≥ e2 ≥ · · · ≥ en Project step-by-step onto the principal components v1, v2, . . . , etc. Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 30 / 43
L2 distances only as others are not invariant to rotation Mean-centering Easier way to compute covariance: AT A is covariance matrix Allows use of SVD to compute PCA Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 33 / 43
L2 distances only as others are not invariant to rotation Mean-centering Easier way to compute covariance: AT A is covariance matrix Allows use of SVD to compute PCA Can be done using SVD Eigenvector matrix V of C is really the SVD transform matrix V for A Diﬀerent from SVD of B though Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 33 / 43
L2 distances only as others are not invariant to rotation Mean-centering Easier way to compute covariance: AT A is covariance matrix Allows use of SVD to compute PCA Can be done using SVD Eigenvector matrix V of C is really the SVD transform matrix V for A Diﬀerent from SVD of B though How many dimensions to retain? Based on energy (similar to SVD) Total energy is sum of eigenvalues ei Retain k dimensions such that 80 % − 95 % of the energy is retained Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 33 / 43
L2 distances only as others are not invariant to rotation Mean-centering Easier way to compute covariance: AT A is covariance matrix Allows use of SVD to compute PCA Can be done using SVD Eigenvector matrix V of C is really the SVD transform matrix V for A Diﬀerent from SVD of B though How many dimensions to retain? Based on energy (similar to SVD) Total energy is sum of eigenvalues ei Retain k dimensions such that 80 % − 95 % of the energy is retained In the above example, k = 1 retains 98.91 % of the energy Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 33 / 43
L2 distances only as others are not invariant to rotation Mean-centering Easier way to compute covariance: AT A is covariance matrix Allows use of SVD to compute PCA Can be done using SVD Eigenvector matrix V of C is really the SVD transform matrix V for A Diﬀerent from SVD of B though How many dimensions to retain? Based on energy (similar to SVD) Total energy is sum of eigenvalues ei Retain k dimensions such that 80 % − 95 % of the energy is retained In the above example, k = 1 retains 98.91 % of the energy Running time: O(mn2 + n3) for A of size m × n Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 33 / 43
attribute(s) Aggregates some other attribute(s) into single value(s) Example: sum, average Beneﬁts of aggregation Aggregate value has less variability Absorbs individual errors Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 35 / 43
E P 90 75 80 95 45 60 60 For multi-dimensional datasets, aggregation can happen along diﬀerent dimensions Data cubes are essentially multi-dimensional arrays Each cell or face or lower dimensional surface represents a certain projection operation Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 36 / 43
E P 90 75 80 95 45 60 60 For multi-dimensional datasets, aggregation can happen along diﬀerent dimensions Data cubes are essentially multi-dimensional arrays Each cell or face or lower dimensional surface represents a certain projection operation Aggregation can also happen along diﬀerent resolutions in each dimension Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 36 / 43
the same property of interest as the full dataset Sampling approaches Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 37 / 43
the same property of interest as the full dataset Sampling approaches Sampling without replacement (SRSWOR): produces population Sampling with replacement (SRSWR): can be picked more than once Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 37 / 43
the same property of interest as the full dataset Sampling approaches Sampling without replacement (SRSWOR): produces population Sampling with replacement (SRSWR): can be picked more than once Simple random sampling For diﬀerent types of objects, stratiﬁed sampling Picks equal or representative number of objects from each group Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 37 / 43
the same property of interest as the full dataset Sampling approaches Sampling without replacement (SRSWOR): produces population Sampling with replacement (SRSWR): can be picked more than once Simple random sampling For diﬀerent types of objects, stratiﬁed sampling Picks equal or representative number of objects from each group Sample size Sample should have enough data to capture variability What is the probability of obtaining objects from all k groups in a sample of size n? Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 37 / 43
the same property of interest as the full dataset Sampling approaches Sampling without replacement (SRSWOR): produces population Sampling with replacement (SRSWR): can be picked more than once Simple random sampling For diﬀerent types of objects, stratiﬁed sampling Picks equal or representative number of objects from each group Sample size Sample should have enough data to capture variability What is the probability of obtaining objects from all k groups in a sample of size n? Progressive sampling or adaptive sampling Start with a small sample size Keep on increasing till it is acceptable Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 37 / 43
data Equi-width histograms: bins are equally spaced apart Equi-height (equi-depth) histograms: each bin has the same height Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 38 / 43
data Equi-width histograms: bins are equally spaced apart Equi-height (equi-depth) histograms: each bin has the same height MaxDiﬀ histograms Values are ﬁrst sorted To get b bins, the largest b − 1 diﬀerences are made bin boundaries Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 38 / 43
data Equi-width histograms: bins are equally spaced apart Equi-height (equi-depth) histograms: each bin has the same height MaxDiﬀ histograms Values are ﬁrst sorted To get b bins, the largest b − 1 diﬀerences are made bin boundaries V-optimal histograms Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 38 / 43
b bins where b n Formally, assume a set V of n (sorted) values v1, v2, . . . , vn having frequencies f1, f2, . . . , fn respectively Problem is to output another histogram H having b bins, i.e., b non-overlapping intervals on V Interval Ii is of the form [li , ri ] and has a value hi If value vj ∈ Ii , estimate e(vj ) of fj is hi Error in estimation is distance d(f , e) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 39 / 43
at most k bins is SSE∗(i, k) Consider placement of the last bin Choice is any of the i gaps For each such placement at gap j, at most k − 1 bins have been placed optimally for the ﬁrst j values This leads to the recursion SSE∗(i, k) = min 1≤j≤i {SSE∗(j, k − 1) + SSE([lj+1, ri ])} Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 41 / 43
at most k bins is SSE∗(i, k) Consider placement of the last bin Choice is any of the i gaps For each such placement at gap j, at most k − 1 bins have been placed optimally for the ﬁrst j values This leads to the recursion SSE∗(i, k) = min 1≤j≤i {SSE∗(j, k − 1) + SSE([lj+1, ri ])} Dynamic programming (DP) solution Table of size n × b Start with cell (1, 1) and proceed in a column-scan order Computation for cell (i, k) requires values at cells (j, k − 1), ∀1 ≤ j ≤ i Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 41 / 43
at most k bins is SSE∗(i, k) Consider placement of the last bin Choice is any of the i gaps For each such placement at gap j, at most k − 1 bins have been placed optimally for the ﬁrst j values This leads to the recursion SSE∗(i, k) = min 1≤j≤i {SSE∗(j, k − 1) + SSE([lj+1, ri ])} Dynamic programming (DP) solution Table of size n × b Start with cell (1, 1) and proceed in a column-scan order Computation for cell (i, k) requires values at cells (j, k − 1), ∀1 ≤ j ≤ i Running time: O(n2b) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 41 / 43
“middle” value Mode: dataset may be unimodal or multimodal Midrange: average of largest and smallest value Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 42 / 43
“middle” value Mode: dataset may be unimodal or multimodal Midrange: average of largest and smallest value Dispersion measures Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 42 / 43
“middle” value Mode: dataset may be unimodal or multimodal Midrange: average of largest and smallest value Dispersion measures Variance Standard deviation Range Percentile (quartile) Five-number summary: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 42 / 43
“middle” value Mode: dataset may be unimodal or multimodal Midrange: average of largest and smallest value Dispersion measures Variance Standard deviation Range Percentile (quartile) Five-number summary: minimum, ﬁrst quartile, median, third quartile, maximum Box plot: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 42 / 43
“middle” value Mode: dataset may be unimodal or multimodal Midrange: average of largest and smallest value Dispersion measures Variance Standard deviation Range Percentile (quartile) Five-number summary: minimum, ﬁrst quartile, median, third quartile, maximum Box plot: plot of ﬁve values Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 42 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Plot of one variable against another Scatter plot matrix: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Plot of one variable against another Scatter plot matrix: n(n − 1)/2 scatter plots for n variables Loess curve: Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43
the dataset Example: mean Holistic measures Cannot be computed by partitioning the dataset Example: median Graphical measures Histograms Bar chart: histograms where bins are categorical Pie chart: relative frequencies shown as sectors in a circle Quantile plot: percentiles against value Quantile-quantile plot (q-q plot): Quantiles of one variable against another Scatter plot: Plot of one variable against another Scatter plot matrix: n(n − 1)/2 scatter plots for n variables Loess curve: plot of regression polynomial against actual values Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Preprocessing 1 2012-13 43 / 43