Slide 1

Slide 1 text

DATA WRANGLING

Slide 2

Slide 2 text

Rachel Shadoan @rachelshadoan @akashiclabs

Slide 3

Slide 3 text

What do you need for a data visualization?

Slide 4

Slide 4 text

Data

Slide 5

Slide 5 text

A Spectrum of Data: Quantitative and Qualitative
Nominal: • City • Gender
Ordinal: • Months • Seasons • Agreement
Interval: • Latitude • Longitude • Dates
Ratio: • Amount • Age • Height

Slide 6

Slide 6 text

Another Spectrum of Data: Structured and Unstructured
Unstructured: • Books • Blogs • Video • Images • Music
Structured: • Databases • Spreadsheets • Etc.

Slide 7

Slide 7 text

What is structure? Unstructured Structured

Slide 8

Slide 8 text

Unstructured Data
• Examples: blog posts, books, images, video
• Not easily searchable
• Has no consistent underlying organization

Slide 9

Slide 9 text

Structured Data
• Searchable
• Has a consistent underlying organization
• Examples: spreadsheets, .csv files, XML files, databases

Slide 10

Slide 10 text

What Can Be Visualized? Quantitative Qualitative Unstructured Structured

Slide 11

Slide 11 text

What Can Be Visualized? Quantitative Qualitative Unstructured Structured

Slide 12

Slide 12 text

What Can Be Visualized? Quantitative Qualitative Unstructured Structured

Slide 13

Slide 13 text

What Can Be Visualized? Quantitative Qualitative Unstructured Structured

Slide 14

Slide 14 text

Unstructured data can be transformed into structured data
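As one illustration of this transformation, the sketch below turns a free-form sentence into rows of (word, count). The function name `word_counts` and the sample sentence are made up for this example; counting words is only one of many possible ways to impose structure on text.

```python
import re
from collections import Counter

def word_counts(text):
    """Turn free-form text into (word, count) rows, most frequent first."""
    # Lowercase first so "Messy" and "messy" count as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Made-up sample text standing in for a blog post.
post = "Data is messy. Messy data is still data."
rows = word_counts(post)

for word, count in rows:
    print(word, count)
```

Each row now has the same two fields, so the result can be searched, sorted, or charted.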

Slide 15

Slide 15 text

What Can Be Visualized? Quantitative Qualitative Unstructured Structured

Slide 16

Slide 16 text

ANY DATA CAN BE STRUCTURED.

Slide 17

Slide 17 text

ANY DATA CAN BE VISUALIZED.

Slide 18

Slide 18 text

Structuring Data: Abstract Data Models
This is a person. This is a person data model, which is an abstraction of a person.
Person: o Person ID o Name o Age o Height o Profession o Hobbies o Nationality o Native Language
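A minimal sketch of the person data model as a Python dataclass. The field names follow the slide; the types, the centimetre unit for height, and the sample values are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Person:
    """An abstraction of a person: only the fields the model cares about."""
    person_id: int
    name: str
    age: int
    height_cm: float          # unit is an assumption; the slide only says "Height"
    profession: str
    nationality: str
    native_language: str
    hobbies: List[str] = field(default_factory=list)

# Made-up example record.
someone = Person(
    person_id=1,
    name="Example Person",
    age=30,
    height_cm=170.0,
    profession="Developer",
    nationality="Canadian",
    native_language="French",
    hobbies=["cycling", "knitting"],
)
print(someone)
```

Every record built from this model has the same fields in the same form, which is what makes the data structured.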

Slide 19

Slide 19 text

An OSBridge Attendee Data Model • Name • Email address • Home address • Company/organization • Twitter/identi.ca • Website • Age • Gender • Years in open source • Food preference • Favorite language • Current projects • Favorite color

Slide 20

Slide 20 text

One Rule for Transforming Unstructured Data into Structured Data CONSISTENCY. CONSISTENCY. CONSISTENCY.
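A small sketch of what consistency can mean in practice: mapping free-text answers for "favorite language" onto one spelling per language. The alias table, the function name `normalize_language`, and the sample responses are all hypothetical.

```python
# Hypothetical alias table: every known variant maps to one canonical spelling.
ALIASES = {"js": "javascript", "py": "python"}

def normalize_language(raw):
    """Trim whitespace, ignore case, and collapse known aliases."""
    key = raw.strip().lower()
    return ALIASES.get(key, key)

# Made-up survey responses with inconsistent spelling.
responses = ["Python", " python ", "PY", "JavaScript", "js"]
cleaned = [normalize_language(r) for r in responses]
print(cleaned)
```

After cleaning, five differently typed answers collapse into two values, so counts per language are meaningful.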

Slide 21

Slide 21 text

What you need for data visualization: Structured Data

Slide 22

Slide 22 text

But what do you need for a good data visualization?

Slide 23

Slide 23 text

The Dimensionality Problem

Slide 24

Slide 24 text

A dimension is a variable in the data
• Name • Email address • Home address • Company/organization • Twitter/identi.ca • Website • Age • Gender • Years in open source • Food preference • Favorite language • Current projects • Favorite color
These are all dimensions of the data

Slide 25

Slide 25 text

Dimensionality Increases Quickly
• Email • Home address • Company/Org • Twitter • Website • Age • Gender • Years in OS • Food preference • Favorite language • Current projects • Favorite color
78 pairwise relationships!
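The count of 78 matches the number of ways to choose 2 dimensions from the 13 in the attendee data model, n(n-1)/2 with n = 13. A short check, with the dimension names written as Python identifiers chosen for this example:

```python
from itertools import combinations

# The 13 dimensions of the attendee data model.
dimensions = [
    "name", "email_address", "home_address", "company_organization",
    "twitter_identica", "website", "age", "gender", "years_in_open_source",
    "food_preference", "favorite_language", "current_projects",
    "favorite_color",
]

# Every unordered pair of distinct dimensions is one possible relationship.
pairs = list(combinations(dimensions, 2))
n = len(dimensions)

print(len(pairs))          # number of pairwise relationships
print(n * (n - 1) // 2)    # same count from the closed-form formula
```

Adding one more dimension to 13 adds 13 new pairs, which is why the count grows so quickly.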

Slide 26

Slide 26 text

Questions Reduce Dimensionality
• Email • Home address • Company/Org • Twitter • Website • Age • Gender • Years in OS • Food preference • Favorite language • Current projects • Favorite color
How are food preferences distributed among languages and projects?

Slide 27

Slide 27 text

Questions Reduce Dimensionality
• Email • Home address • Company/Org • Twitter • Website • Age • Gender • Years in OS • Food preference • Favorite language • Current projects • Favorite color
How are languages and projects distributed geographically?
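One way to read "questions reduce dimensionality" in code: each question names the few dimensions it needs, and everything else is dropped before visualizing. The helper `keep_dimensions` and the attendee records below are made up for illustration.

```python
def keep_dimensions(records, dimensions):
    """Return copies of the records containing only the named dimensions."""
    return [{d: record[d] for d in dimensions} for record in records]

# Made-up attendee records with a handful of dimensions each.
attendees = [
    {"name": "A", "home_address": "Portland, OR", "food_preference": "vegan",
     "favorite_language": "Python", "current_projects": ["proj-x"],
     "favorite_color": "green"},
    {"name": "B", "home_address": "Seattle, WA", "food_preference": "omnivore",
     "favorite_language": "Ruby", "current_projects": ["proj-y"],
     "favorite_color": "blue"},
]

# "How are food preferences distributed among languages and projects?"
food_question = keep_dimensions(
    attendees, ["food_preference", "favorite_language", "current_projects"]
)

# "How are languages and projects distributed geographically?"
geo_question = keep_dimensions(
    attendees, ["home_address", "favorite_language", "current_projects"]
)

print(food_question[0])
print(geo_question[0])
```

Each question leaves three dimensions, which means three pairwise relationships to examine instead of 78.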

Slide 28

Slide 28 text

Multiple comparison problem: The more relationships you look at, the more likely you are to find a pattern that exists only because of random chance.
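A rough sense of scale, under two simplifying assumptions that are not from the slides: each comparison has a 5% chance of showing a spurious pattern, and the comparisons are independent. The chance of at least one false pattern among m comparisons is then 1 - (1 - 0.05)^m.

```python
def chance_of_false_pattern(num_comparisons, alpha=0.05):
    """Probability of at least one chance-only 'pattern'.

    Assumes each comparison independently has probability `alpha`
    of looking like a pattern when nothing real is there.
    """
    return 1 - (1 - alpha) ** num_comparisons

for m in (1, 3, 10, 78):
    print(m, round(chance_of_false_pattern(m), 3))
```

Under these assumptions, looking at all 78 pairwise relationships makes a spurious finding very likely, while asking a question that involves three relationships keeps the risk far lower.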

Slide 29

Slide 29 text

The View: The Smallest Unit of Visualization

Slide 30

Slide 30 text

The View: The Smallest Unit of Visualization

Slide 31

Slide 31 text

The View: The Smallest Unit of Visualization

Slide 32

Slide 32 text

The View: The Smallest Unit of Visualization

Slide 33

Slide 33 text

Why is reducing dimensionality especially important for visualization?

Slide 34

Slide 34 text

Dimension count: 1
Date dimension encoded as position along the horizontal axis

Slide 35

Slide 35 text

Dimension count: 2
Time dimension encoded as position along the vertical axis

Slide 36

Slide 36 text

Dimension count: 3
Type of training dimension encoded as color

Slide 37

Slide 37 text

Dimension count: 4
Number of stat points gained from training dimension encoded as bubble size
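The four encodings on these slides can be sketched as a mapping from one data record to four visual channels. The record layout, the color table, and the size scale factor are assumptions made for this example; the resulting values could be handed to a plotting call such as Matplotlib's `scatter`, which takes x, y, size, and color arguments.

```python
from datetime import date, time

# Hypothetical color table for the "type of training" dimension.
COLORS = {"strength": "red", "endurance": "blue"}

def encode(session):
    """Map one training session onto four visual channels."""
    return {
        # Date -> position along the horizontal axis (days as an integer).
        "x": session["date"].toordinal(),
        # Time of day -> position along the vertical axis (hours as a float).
        "y": session["time"].hour + session["time"].minute / 60,
        # Type of training -> color.
        "color": COLORS[session["training_type"]],
        # Stat points gained -> bubble size (scale factor chosen arbitrarily).
        "size": 20 * session["stat_points"],
    }

# Made-up sample record.
session = {
    "date": date(2013, 6, 18),
    "time": time(9, 30),
    "training_type": "strength",
    "stat_points": 3,
}
mark = encode(session)
print(mark)
```

Each added dimension needs its own visual channel, and there are only so many channels a viewer can read at once.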

Slide 38

Slide 38 text

So what do you need for a good data visualization?

Slide 39

Slide 39 text

What you need: Structured Data Questions

Slide 40

Slide 40 text

Choosing the right views for your data

Slide 41

Slide 41 text

Quantitative Data

Slide 42

Slide 42 text

Spatial (Location) Data

Slide 43

Slide 43 text

Temporal (Time) Data

Slide 44

Slide 44 text

Relational (Network) Data

Slide 45

Slide 45 text

What if one view isn’t enough?

Slide 46

Slide 46 text

Multiple Coordinated Views to the rescue!

Slide 47

Slide 47 text

Filtering, Brushing + Linking
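A minimal sketch of brushing and linking, assuming each view is represented by a callback that receives the set of selected record ids. The class name `LinkedViews`, its methods, and the sample records are hypothetical and are not the API of any particular tool.

```python
class LinkedViews:
    """Shares one selection among several views of the same records."""

    def __init__(self, records):
        self.records = records
        self.selected_ids = set()
        self.listeners = []

    def register(self, callback):
        """Add a view; it will be told about every new selection."""
        self.listeners.append(callback)

    def brush(self, predicate):
        """Select the records matching `predicate` and notify every view."""
        self.selected_ids = {r["id"] for r in self.records if predicate(r)}
        for callback in self.listeners:
            callback(self.selected_ids)

# Made-up records.
records = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 34},
    {"id": 3, "age": 41},
]

views = LinkedViews(records)
map_highlights, chart_highlights = [], []
views.register(map_highlights.append)    # stands in for a map view
views.register(chart_highlights.append)  # stands in for a chart view

# Brushing in one view: select attendees aged 30 or older.
views.brush(lambda r: r["age"] >= 30)
print(map_highlights, chart_highlights)
```

Because both views receive the same set of ids, a selection made in one is highlighted in the other.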

Slide 48

Slide 48 text

Focus + Context
This view shows the context, or overview, of the data

Slide 49

Slide 49 text

Focus + Context
This view shows the details of the data selected in the context view
(Figure label: Selection)
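A small sketch of the focus + context idea, assuming the data are records with a year field: the context view summarizes everything, and the focus view returns full detail for the selected range only. The function names and sample records are made up.

```python
from collections import Counter

def context_view(records):
    """Overview: how many records fall in each year."""
    return Counter(r["year"] for r in records)

def focus_view(records, start_year, end_year):
    """Detail: the full records inside the selected range (inclusive)."""
    return [r for r in records if start_year <= r["year"] <= end_year]

# Made-up records.
events = [
    {"year": 2010, "title": "first"},
    {"year": 2011, "title": "second"},
    {"year": 2011, "title": "third"},
    {"year": 2013, "title": "fourth"},
]

overview = context_view(events)
details = focus_view(events, 2011, 2012)   # the selection made in the overview
print(overview)
print(details)
```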

Slide 50

Slide 50 text

Tools

Slide 51

Slide 51 text

D3.js
• JavaScript visualization library
• Beautiful browser-based visualizations
• Limited support for multiple coordinated views, large data sets

Slide 52

Slide 52 text

Improvise
• Java-based desktop design environment
• Powerful, flexible tool for creating multiple coordinated view visualizations
• Few resources for learning

Slide 53

Slide 53 text

Python
• Great for data wrangling and rapid prototyping
• Check out the Matplotlib and ggplot visualization libraries

Slide 54

Slide 54 text

Go forth and visualize!

Slide 55

Slide 55 text

Credits
• Liz Mc, CC 2.0, via Flickr
• Kent K. Barns, CC 2.0, kentkb.com
• Jerome Collins, CC 2.0, via Flickr
• Cristinacosta, CC 2.0, via Flickr
• Thinking Machine 4, Mid-game, by Martin Wattenberg and Marek Walczak