Exploratory Data Analysis

Basic Introduction of Exploratory Data Analysis as taught in UTSEUS in January 2014

Fabien Pfaender

January 06, 2014

Transcript

  1. Initiated by J. W. Tukey (1915 - 2000)

    "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." J. W. Tukey (1962, page 13), "The future of data analysis". Annals of Mathematical Statistics 33(1), pp. 1-67.
  2. Exploratory Analysis versus Classical Analysis

    ‣ Classical analysis: Problem - data - model - analysis - conclusions
    ‣ Exploratory analysis: Problem - data - analysis - model - conclusions
    ‣ Bayesian analysis: Problem - data - model - prior distribution - analysis - conclusions
  3. Mantra

    Shneiderman’s mantra: Overview first, zoom and filter, then details-on-demand.
    EDA mantra: See the whole, zoom and focus, attend to particulars.
  4. A novel approach that

    ‣ Maximizes insight into a dataset
    ‣ Discovers underlying structures
    ‣ Extracts important variables
    ‣ Detects abnormalities
    ‣ Tests suppositions issuing from the data
    ‣ Develops minimal models
    ‣ Tunes to discover the best parameters
  5. The objective is to maximize the insights of the explorer

    To do that, we have to give them:
    ‣ A not-so-bad model that fits well
    ‣ Extreme data
    ‣ Robust conclusions
    ‣ An estimation for the parameters
    ‣ An error estimation for all the parameters
    ‣ A list of the important factors and their relative individual importance
    ‣ Optimal parameters
  6. Insights

    "When the course of action must respond to new comprehension, new insights and new intuitive flashes of possible explanations or solutions, it will not be an orderly process. Existing means of composing and working with symbol structures penalize disorderly processes very heavily, and it is part of the real promise in the automated H-LAM/T systems of tomorrow that the human can have the freedom and power of disorderly processes." Engelbart (1962).
  7. Data

    [Figure: a data table. Labels shown: "references" (http://www.itl.nist.gov/, http://books.google.com, http://www.utc.fr/~wic05), "Data + year" (2010, 2006, 2009), "attributes" (HITS, degree, PR, ngram, Referrers) and "characteristics" (cell values 0,2 4 6 29; 0,6 80 9 12; 0,1 2 3 50).]
  8. EDA steps

    Principle 1: See the whole
    Principle 2: Simplify and look for models
    Principle 3: Divide and group
    Principle 4: See in relation
    Principle 5: Look for the recognizable
    Principle 6: Zoom and focus
    Principle 7: Pay attention to particularities
    Principle 8: Establish links
    Principle 9: Establish structure
    Principle 10: Integrate knowledge from the domain
  9. Tasks

    ‣ Elementary: Lookup (direct, indirect); Comparison (direct, inverse); Relation-seeking
    ‣ Synoptic: Pattern identification (pattern definition, pattern search); Behaviour comparison (direct, inverse); Relation-seeking
  10. "Visualization can play a key role for such activities, for example:

    in presenting a visual overview of the data so that categories might be hypothesised (abductively), in evaluating individual examples with respect to their “representativeness” (inductively), and showing the results of applying the new knowledge to structure the data (deductively)." M. Gahegan, M. Takatsuka, M. Wheeler, and F. Hardisty. Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for exploration and .... Computers, Environment and Urban Systems, 26(4):267–292, Jan 2002.
  11. Graphical techniques

    Simple techniques
    ‣ Plot the raw data (data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
    ‣ Plot simple statistics (mean plots, standard deviation plots, box plots).
    ‣ Use multiple diagrams and put them on one page to maximise our ability to recognise patterns.
  12. Figure 75: A visualization of county-level election results for

    the State of Michigan from 1998 to 2004 (see appendix A.3). A tinted lens highlights views, using labeled arrows to reveal
  13. Examples of EDA software

    Figure 75: A visualization of county-level election results for the State of Michigan from 1998 to 2004 (see appendix A.3). A tinted lens highlights views, using labeled arrows to reveal coordination on the user’s selection of counties in the Votes v. Counties scatter plot.
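
The simple techniques of slide 11 and the "detect abnormalities" goal of slide 4 can be tried without any plotting library. The sketch below is not taken from the deck: it computes the numbers a box plot would draw, flags extreme values with the conventional 1.5 × IQR fences (an assumption, the slides do not name a rule), and builds the pairs a lag plot would scatter. The function names and the sample data are hypothetical, chosen only for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the values a box plot draws."""
    data = sorted(values)
    # The "inclusive" method treats the sample minimum and maximum as the
    # extremes; other quartile conventions give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def tukey_outliers(values, k=1.5):
    """Return the points outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def lag_pairs(values, lag=1):
    """Return the (y[i], y[i + lag]) pairs that a lag plot scatters."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return list(zip(values[:-lag], values[lag:]))


if __name__ == "__main__":
    sample = [2, 3, 3, 4, 5, 5, 6, 7, 50]  # made-up data, not from the deck
    print("five-number summary:", five_number_summary(sample))
    print("suspected outliers:", tukey_outliers(sample))
    print("lag-1 pairs:", lag_pairs(sample))
```

Seeing the summary, the flagged points and the lag pairs side by side follows the third bullet of slide 11: several simple views of the same data make a pattern, or a point that breaks it, easier to spot than any single view.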