Fabien Pfaender
January 06, 2014

# Exploratory Data Analysis

A basic introduction to Exploratory Data Analysis, as taught at UTSEUS in January 2014.


## Transcript

2. ### What is Exploratory Data Analysis?

1. A philosophy
2. Principles & methods
3. Tools

5. ### EDA is an attitude, a philosophy for revealing the unknown directly from the data
6. ### Initiated by J. W. Tukey (1915 - 2000)

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." J. W. Tukey (1962, page 13), "The future of data analysis". Annals of Mathematical Statistics 33(1), pp. 1-67.
7. ### Exploratory Analysis versus Classical Analysis

‣ Classical analysis: Problem - data - model - analysis - conclusions
‣ Exploratory analysis: Problem - data - analysis - model - conclusions
‣ Bayesian analysis: Problem - data - model - prior distribution - analysis - conclusions
8. ### Mantra

Shneiderman's mantra: Overview first, zoom and filter, then details-on-demand.

EDA mantra: See the whole, zoom and focus, attend to particulars.
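The three stages of the mantra can be sketched on a tiny dataset. This is only an illustration: the records, the field layout and the function names (`overview`, `zoom_and_filter`, `details_on_demand`) are hypothetical and do not come from the course material.

```python
from statistics import mean

# Hypothetical records (label, year, value), invented for this illustration.
records = [
    ("A", 2006, 4.0),
    ("A", 2009, 6.0),
    ("B", 2006, 29.0),
    ("B", 2010, 80.0),
    ("C", 2009, 9.0),
    ("C", 2010, 12.0),
]

def overview(rows):
    """Overview first: summarize the value column of all rows."""
    values = [value for _, _, value in rows]
    return {"n": len(values), "min": min(values), "max": max(values), "mean": mean(values)}

def zoom_and_filter(rows, label=None, min_year=None):
    """Zoom and filter: keep only the rows matching the given criteria."""
    selected = rows
    if label is not None:
        selected = [r for r in selected if r[0] == label]
    if min_year is not None:
        selected = [r for r in selected if r[1] >= min_year]
    return selected

def details_on_demand(rows, label, year):
    """Details on demand: return one full record, or None if it is absent."""
    for row in rows:
        if row[0] == label and row[1] == year:
            return row
    return None

print(overview(records))
subset = zoom_and_filter(records, min_year=2009)
print(overview(subset))
print(details_on_demand(subset, "B", 2010))
```

In an interactive tool each stage would be a view rather than a function call, but the order of operations is the same.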
9. ### A novel approach that aims to

‣ Maximize insight into a dataset
‣ Discover underlying structures
‣ Extract important variables
‣ Detect abnormalities
‣ Test assumptions arising from the data
‣ Develop minimal models
‣ Tune to discover the best parameters
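As one possible way to "detect abnormalities", the sketch below uses Tukey's fences: values lying more than k times the interquartile range outside the quartiles are flagged. The slides do not say which method the course used, so this choice, the multiplier `k=1.5`, the quartile convention (`method="inclusive"`) and the sample values are all assumptions made for illustration.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Illustrative sample, not real course data.
sample = [2, 3, 4, 6, 9, 12, 29, 50, 80]
print(tukey_fences(sample))
print(flag_outliers(sample))
```

A flagged value is a candidate for closer inspection, not automatically an error: in EDA the extreme points are often where the insight is.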
10. ### The objective is to maximize the explorer's insight

To do that, we have to give them:

‣ A reasonably good model that fits well
‣ Extreme data
‣ Robust conclusions
‣ An estimate of the parameters
‣ An error estimate for all the parameters
‣ A list of the important factors and their relative individual importance
‣ Optimal parameters
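Several of these deliverables (a simple model, parameter estimates, their error estimates, and the most extreme point) can be illustrated with an ordinary least-squares line. This is a minimal sketch under assumptions: the straight-line model, the function name `fit_line` and the data are hypothetical, and the standard errors use the usual textbook formulas for simple linear regression.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x with standard errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    s2 = sum(r * r for r in residuals) / (n - 2)
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": sqrt(s2 / sxx),
        "se_intercept": sqrt(s2 * (1 / n + mean_x ** 2 / sxx)),
        # Index of the observation the model explains worst.
        "most_extreme_index": max(range(n), key=lambda i: abs(residuals[i])),
    }

# Hypothetical measurements, invented for this illustration.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
fit = fit_line(xs, ys)
print(fit)
```

The function needs at least three points with distinct x values; with fewer, the residual variance is undefined.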
11. ### Insights

"When the course of action must respond to new comprehension, new insights and new intuitive flashes of possible explanations or solutions, it will not be an orderly process. Existing means of composing and working with symbol structures penalize disorderly processes very heavily, and it is part of the real promise in the automated H-LAM/T systems of tomorrow that the human can have the freedom and power of disorderly processes." Engelbart (1962).
12. ### Principles

EDA:

‣ Get data
‣ Mine & structure
‣ Create a model
‣ Present results
13. ### Data

http://www.itl.nist.gov/, http://books.google.com, http://www.utc.fr/~wic05

[Diagram: data plus year attributes (2006, 2009, 2010) and HITS, degree, PR, ngram and referrers, arranged as references against characteristics.]

15. ### EDA steps

‣ Principle 1: See the whole
‣ Principle 2: Simplify and look for models
‣ Principle 3: Divide & group
‣ Principle 4: See in relation
‣ Principle 5: Look for the recognizable
‣ Principle 6: Zoom and focus
‣ Principle 7: Pay attention to particularities
‣ Principle 8: Establish links
‣ Principle 9: Establish structure
‣ Principle 10: Integrate knowledge from the domain
16. ### Tasks

Elementary:

‣ Lookup (direct, indirect)
‣ Comparison (direct, inverse)
‣ Relation-seeking

Synoptic:

‣ Pattern identification (pattern definition, pattern search)
‣ Behaviour comparison (direct, inverse)
‣ Relation-seeking

19. ### Visualization can play a key role for such activities, for example:

‣ in presenting a visual overview of the data so that categories might be hypothesised (abductively),
‣ in evaluating individual examples with respect to their "representativeness" (inductively),
‣ and showing the results of applying the new knowledge to structure the data (deductively).

M. Gahegan, M. Takatsuka, M. Wheeler, and F. Hardisty. Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for exploration and .... Computers, Environment and Urban Systems, 26(4):267–292, Jan 2002.
20. ### Graphical techniques

Simple techniques:

‣ Plot the raw data (data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
‣ Plot simple statistics (mean plots, standard deviation plots, box plots).
‣ Use multiple diagrams and put them on one page to maximise our ability to recognise patterns.
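The numbers behind three of these plots can be computed without any plotting library. The sketch below is an illustration only: the function names are hypothetical, the quartiles use the `inclusive` convention (one of several in use), and the equal-width binning assumes the data are not all identical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Box plot ingredients: minimum, Q1, median, Q3, maximum."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def histogram_counts(values, n_bins):
    """Histogram ingredients: counts in n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for v in values:
        # The maximum value is clamped into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

def lag_pairs(values, lag=1):
    """Lag plot ingredients: (y[i], y[i + lag]) pairs, for a positive lag."""
    return list(zip(values[:-lag], values[lag:]))

# Illustrative series, not real course data.
series = [1, 2, 3, 4, 5, 6, 7, 8]
print(five_number_summary(series))
print(histogram_counts(series, 4))
print(lag_pairs(series)[:3])
```

A lag plot of random data shows no structure, so any visible shape in the pairs suggests the observations are not independent.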
26. ### Figure 75: A visualization of county-level election results for the State of Michigan from 1998 to 2004 (see appendix A.3).

A tinted lens highlights views, using labeled arrows to reveal
33. ### EDA software examples

Figure 75: A visualization of county-level election results for the State of Michigan from 1998 to 2004 (see appendix A.3). A tinted lens highlights views, using labeled arrows to reveal coordination on the user's selection of counties in the Votes v. Counties scatter plot.