Exploratory Data Analysis


A basic introduction to Exploratory Data Analysis, as taught at UTSEUS in January 2014


Fabien Pfaender

January 06, 2014


  1. Exploratory Data Analysis (EDA): general principles

  2. What is Exploratory Data Analysis?

    1. A philosophy
    2. Principles & methods
    3. Tools
  3. What is a complex system?

  5. EDA is an attitude, a philosophy for revealing the unknown directly from the data
  6. Initiated by J. W. Tukey (1915-2000)

    "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." J. W. Tukey (1962, page 13), "The future of data analysis". Annals of Mathematical Statistics 33(1), pp. 1-67.
  7. Exploratory analysis versus classical analysis

    ‣ Classical analysis: problem - data - model - analysis - conclusions
    ‣ Exploratory analysis: problem - data - analysis - model - conclusions
    ‣ Bayesian analysis: problem - data - model - prior distribution - analysis - conclusions
  8. Mantra

    ‣ Shneiderman’s mantra: overview first, zoom and filter, then details-on-demand
    ‣ EDA mantra: see the whole, zoom and focus, attend to particulars
  9. A novel approach that aims to:

    ‣ Maximize insight into a dataset
    ‣ Discover underlying structures
    ‣ Extract important variables
    ‣ Detect abnormalities
    ‣ Test assumptions suggested by the data
    ‣ Develop minimal models
    ‣ Tune to discover the best parameters
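
One common way to make "detect abnormalities" concrete is Tukey's fences, which flag values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below is an illustration and not part of the original slides: the function names and the sample values are invented, and the 1.5 multiplier is only the conventional default.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside Tukey's fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements, invented for illustration.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
print(find_outliers(sample))  # prints [95]
```

A flagged value is a candidate for closer inspection, not automatically an error: in an exploratory setting the extreme points are often the interesting ones.
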
  10. The objective is to maximize the explorer's insight. To do that, we have to provide:

    ‣ A reasonably good model that fits well
    ‣ Extreme data
    ‣ Robust conclusions
    ‣ An estimate of the parameters
    ‣ An error estimate for all the parameters
    ‣ A list of the important factors and their relative individual importance
    ‣ Optimal parameters
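
The slides do not say how a parameter estimate and its error should be obtained. One assumption-light option, chosen here purely for illustration, is the bootstrap: resample the data with replacement many times and use the spread of the re-computed estimate as its error. The function name, the default estimator (the median), the number of resamples, and the data are all hypothetical.

```python
import random
import statistics

def bootstrap_estimate(values, estimator=statistics.median,
                       n_resamples=2000, seed=0):
    """Return (estimate, standard_error) for `estimator` on `values`.

    The standard error is the standard deviation of the estimator over
    resamples drawn with replacement from the observed data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    resampled = [estimator(rng.choices(values, k=n))
                 for _ in range(n_resamples)]
    return estimator(values), statistics.stdev(resampled)

# Hypothetical observations, invented for illustration.
durations = [2.1, 2.4, 2.4, 2.7, 3.0, 3.1, 3.3, 3.8, 4.2, 9.5]
estimate, error = bootstrap_estimate(durations)
print(f"median = {estimate:.2f} +/- {error:.2f}")
```

Using a robust estimator such as the median keeps the single extreme value from dominating the result, which fits the emphasis on robust conclusions.
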
  11. Insights

    "When the course of action must respond to new comprehension, new insights and new intuitive flashes of possible explanations or solutions, it will not be an orderly process. Existing means of composing and working with symbol structures penalize disorderly processes very heavily, and it is part of the real promise in the automated H-LAM/T systems of tomorrow that the human can have the freedom and power of disorderly processes." Engelbart (1962).
  12. EDA principles

    ‣ Get data
    ‣ Mine & structure
    ‣ Create a model
    ‣ Present results
  13. Data

    [Figure: example data table. Labels in the figure: references (http://www.itl.nist.gov/, http://books.google.com, http://www.utc.fr/~wic05), data + year (2010, 2006, 2009), attributes (HITS, degree, PR, ngram), referrers, characteristics (0,2 4 6 29; 0,6 80 9 12; 0,1 2 3 50).]
  14. Data

  15. EDA steps

    ‣ Principle 1: See the whole
    ‣ Principle 2: Simplify and look for models
    ‣ Principle 3: Divide & group
    ‣ Principle 4: See in relation
    ‣ Principle 5: Look for the recognizable
    ‣ Principle 6: Zoom and focus
    ‣ Principle 7: Pay attention to particularities
    ‣ Principle 8: Establish links
    ‣ Principle 9: Establish structure
    ‣ Principle 10: Integrate knowledge from the domain
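
Principles 3 and 4 (divide & group, then see in relation) can be sketched as splitting records by a categorical attribute and comparing a few summary numbers per group. This is a minimal illustration, not the course's own procedure: the field names, the records, and the choice of summary statistics are all invented.

```python
from collections import defaultdict
import statistics

def summarize_by_group(records, key, value):
    """Group `records` by `key` and summarize the `value` field per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        name: {
            "n": len(vals),
            "min": min(vals),
            "median": statistics.median(vals),
            "max": max(vals),
        }
        for name, vals in groups.items()
    }

# Hypothetical records, invented for illustration.
records = [
    {"region": "north", "year": 2007, "value": 12.0},
    {"region": "north", "year": 2008, "value": 15.0},
    {"region": "north", "year": 2009, "value": 18.0},
    {"region": "south", "year": 2007, "value": 40.0},
    {"region": "south", "year": 2008, "value": 42.0},
    {"region": "south", "year": 2009, "value": 70.0},
]

for region, summary in summarize_by_group(records, "region", "value").items():
    print(region, summary)
```

Placing the per-group summaries side by side is the "see in relation" step: here it would show that one group sits at a different level and has a much wider range than the other.
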
  16. Tasks

    ‣ Elementary: lookup (direct, indirect); comparison (direct, inverse); relation-seeking
    ‣ Synoptic: pattern identification (pattern definition, pattern search); behaviour comparison (direct, inverse); relation-seeking
  17. How to do that? Use visualizations!

  18. [Figure: chart with a value axis from 0 to 70, years 2007 to 2010, and a "Region" label]

  19. "Visualization can play a key role for such activities, for example: in presenting a visual overview of the data so that categories might be hypothesised (abductively), in evaluating individual examples with respect to their “representativeness” (inductively), and showing the results of applying the new knowledge to structure the data (deductively)."

    M. Gahegan, M. Takatsuka, M. Wheeler, and F. Hardisty. Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for exploration and .... Computers, Environment and Urban Systems, 26(4):267–292, Jan 2002.
  20. Graphical techniques: simple techniques

    ‣ Plot the raw data (data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots)
    ‣ Plot simple statistics (mean plots, standard deviation plots, box plots)
    ‣ Use multiple diagrams and put them on one page to maximise our ability to recognise patterns
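
Several of the plots listed above are simple transformations of the raw data. The sketch below computes the numbers behind three of them (a histogram, a box plot, and a lag plot) using only the Python standard library, and draws the histogram as text. The function names and the series are hypothetical; a real analysis would normally hand these numbers to a plotting library.

```python
import statistics
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def lag_pairs(values, lag=1):
    """Return (y[i - lag], y[i]) pairs, the points of a lag plot."""
    return list(zip(values, values[lag:]))

# Hypothetical series, invented for illustration.
series = [3, 5, 4, 8, 7, 12, 11, 15]

for edge, count in histogram_counts(series, bin_width=5).items():
    print(f"{edge:>3} | {'#' * count}")
print(five_number_summary(series))
print(lag_pairs(series)[:3])
```

In a lag plot, points that line up along a diagonal suggest that each value depends on the previous one, while a structureless cloud suggests the series behaves randomly.
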
  26. Figure 75: A visualization of county-level election results for the State of Michigan from 1998 to 2004 (see appendix A.3). A tinted lens highlights views, using labeled arrows to reveal coordination on the user’s selection of counties in the Votes v. Counties scatter plot.
  28. Improve EDA: find clues

  29. Principle 1: See the Whole

  30. Improve EDA: use semiology

  33. Examples of EDA software

    Figure 75: A visualization of county-level election results for the State of Michigan from 1998 to 2004 (see appendix A.3). A tinted lens highlights views, using labeled arrows to reveal coordination on the user’s selection of counties in the Votes v. Counties scatter plot.