Tony Ojeda - Human-Machine Collaboration for Improved Analytical Processes

Slide 1

Slide 1 text

HUMAN + MACHINE COLLABORATION FOR IMPROVED ANALYTICAL PROCESSES

Slide 2

Slide 2 text

TONY OJEDA (ME) • Data Scientist @ Follett • Founder @ District Data Labs • Co-Author: Applied Text Analysis with Python (O’Reilly, Fall 2017); Practical Data Science Cookbook (Packt, Fall 2014) • Conference Speaker: Data Day Seattle 2016; PyData - Carolinas & DC 2016

Slide 3

Slide 3 text

SOME BACKGROUND…

Slide 8

Slide 8 text

RECENT AI HEADLINES • Can The Rise Of Artificial Intelligence End Humanity? • Will Artificial Intelligence Leave You Jobless? • Essential Skills To Keep Your Job In The Era Of Artificial Intelligence • Artificial intelligence probably won't kill you, but it could take your job

Slide 9

Slide 9 text

How can we combine human & machine abilities to produce better outcomes than either could on their own?

Slide 10

Slide 10 text

Not Human vs. Machine But Human + Machine

Slide 11

Slide 11 text

DESIGNING COLLABORATIVE ANALYTICAL PROCESSES

Slide 12

Slide 12 text

WHAT IS AN ANALYTICAL PROCESS? • A series of tasks for ingesting, transforming, analyzing, modeling, or visualizing data. • Data Science Pipeline: Ingestion → Wrangling → Analysis → Modeling → Visualization

Slide 13

Slide 13 text

DECONSTRUCTING A PROCESS • Process → Tasks → Steps

Slide 14

Slide 14 text

What types of tasks are humans better at? What types of tasks are machines better at?

Slide 15

Slide 15 text

TYPES OF TASKS HUMANS ARE GOOD AT • Sensory Tasks • Social/Language/Communication Tasks • General or Domain Knowledge Tasks • Tasks Requiring Flexibility, Adaptability, or Creativity • Exploratory or Investigative Tasks

Slide 16

Slide 16 text

TYPES OF TASKS MACHINES ARE GOOD AT • Tasks Where Precision is Important • Tasks that Require Processing Vast Amounts of Information • Memory and Recollection Tasks • Repetitive Tasks Where Consistency is Important

Slide 17

Slide 17 text

DESIGNING COLLABORATIVE PROCESSES • Deconstruct the process into tasks and steps. • Determine which steps should be performed by the human and which should be performed by the machine. • Identify the points of interaction and ensure those are intuitive.
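
The three bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` class, the `interaction_points` helper, and the example steps are hypothetical names and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Step:
    task: str       # the task this step belongs to
    name: str       # what happens in this step
    performer: str  # "human" or "machine"


# Hypothetical deconstruction of one task into steps, each with a performer.
process = [
    Step("binning", "identify continuous variables", "machine"),
    Step("binning", "decide whether high or low values are better", "human"),
    Step("binning", "assign values to bins", "machine"),
    Step("binning", "name the resulting categories", "human"),
]


def interaction_points(steps):
    """Return consecutive step pairs where work passes between human and machine."""
    return [(a, b) for a, b in zip(steps, steps[1:]) if a.performer != b.performer]


handoffs = interaction_points(process)
for before, after in handoffs:
    print(f"{before.performer} -> {after.performer}: {before.name} / {after.name}")
```

Each pair in `handoffs` marks a place where the interface has to carry information from one collaborator to the other, which is where the design effort goes.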

Slide 18

Slide 18 text

THE INTERFACE IS IMPORTANT

Slide 19

Slide 19 text

COLLABORATIVE DATA EXPLORATION

Slide 20

Slide 20 text

DATA EXPLORATION FRAMEWORK Prep Phase Explore Phase

Slide 21

Slide 21 text

CREATE: CATEGORY AGGREGATIONS • Take categorical variables with a lot of categories (e.g., more than 10). • Distill them down into fewer categories.

Slide 22

Slide 22 text

CATEGORY AGGREGATION REQUIREMENTS • Identification of categorical variables and unique values • Natural language understanding • General and/or domain knowledge • Similarity in meaning • Sometimes creativity
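
A minimal sketch of how these requirements might be split, assuming pandas. The machine finds categorical columns with many unique values, a human supplies the mapping (the part that needs language and domain knowledge), and the machine applies it. The column names, values, threshold, and mapping are all hypothetical.

```python
import pandas as pd

# Hypothetical data: one categorical column with many fine-grained values.
data = pd.DataFrame({
    "product": ["latte", "espresso", "green tea", "black tea", "muffin",
                "scone", "mocha", "chai", "bagel", "cookie", "americano"],
    "units": [3, 5, 2, 4, 1, 2, 6, 3, 2, 4, 5],
})


def high_cardinality_columns(df, threshold=10):
    """Machine step: non-numeric columns with more than `threshold` unique values."""
    candidates = df.select_dtypes(exclude="number").columns
    return [col for col in candidates if df[col].nunique() > threshold]


# Human step: a mapping written by someone who knows what the values mean.
product_groups = {
    "latte": "coffee", "espresso": "coffee", "mocha": "coffee", "americano": "coffee",
    "green tea": "tea", "black tea": "tea", "chai": "tea",
    "muffin": "bakery", "scone": "bakery", "bagel": "bakery", "cookie": "bakery",
}

flagged = high_cardinality_columns(data)

# Machine step: apply the mapping; values the human did not map become "other".
data["product group"] = data["product"].map(product_groups).fillna("other")
```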

Slide 23

Slide 23 text

CREATE: CONTINUOUS BINS • Identify continuous variables. • Assign them to buckets or bins based on how high or low their values are. • Example bins: Very Low, Low, Moderate, High, Very High

Slide 24

Slide 24 text

BINNING REQUIREMENTS • Identification of continuous variables • Comparison, ordering, and segregation • Knowing whether higher or lower values are better • Meaningful naming of resulting categories

Slide 25

Slide 25 text

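CONTINUOUS BINNING EXAMPLE

    import numpy as np
    import pandas as pd

    # Assumes `data` is an existing pandas DataFrame.
    numeric_cols = data.select_dtypes(include=[np.number]).columns.values
    quint_levels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

    for column in numeric_cols:
        # Quantile-based bins: named quintiles, then deciles and percentiles.
        # pd.qcut raises ValueError when bin edges are not unique (many tied values).
        data[column + ' Level'] = pd.qcut(data[column], 5, labels=quint_levels)
        data[column + ' Decile'] = pd.qcut(data[column], 10, labels=list(range(1, 11)))
        data[column + ' Perc'] = pd.qcut(data[column], 100, labels=list(range(1, 101)))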

Slide 26

Slide 26 text

CREATE: CLUSTER CATEGORIES

Slide 27

Slide 27 text

CLUSTERING REQUIREMENTS • Identification of numeric variables • Clustering similar records together • Determining quality and appropriate numbers of clusters • Meaningful naming of resulting categories
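
One possible division of labor for these requirements, sketched with NumPy and pandas. The small `kmeans` function is a plain Lloyd's-algorithm stand-in written for this example, not the method used in the talk; the data, column names, and the choice of k are hypothetical.

```python
import numpy as np
import pandas as pd


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    def distances(centers):
        # Distance from every point to every centroid.
        return np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

    for _ in range(n_iter):
        labels = distances(centroids).argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    final = distances(centroids)
    return final.argmin(axis=1), float((final.min(axis=1) ** 2).sum())


# Hypothetical data: an ID column plus two numeric fields.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "customer": [f"c{i}" for i in range(40)],
    "visits": np.concatenate([rng.normal(2, 0.5, 20), rng.normal(10, 0.5, 20)]),
    "spend": np.concatenate([rng.normal(20, 5, 20), rng.normal(200, 5, 20)]),
})

# Machine step: pick out the numeric variables and standardize them.
numeric = data.select_dtypes(include=[np.number])
scaled = ((numeric - numeric.mean()) / numeric.std()).to_numpy()

# Machine step: try several cluster counts and record the within-cluster
# sum of squares so a human can look for an "elbow".
inertias = {k: kmeans(scaled, k)[1] for k in range(1, 5)}

# Human step: choose k (a hypothetical choice here), then read the
# per-cluster averages to decide on meaningful names.
chosen_k = 2
labels, _ = kmeans(scaled, chosen_k)
data["cluster"] = labels
summary = data.groupby("cluster")[["visits", "spend"]].mean()
```

The machine does the grouping and the bookkeeping; judging whether the clusters are good and what to call them is left to the person reading `inertias` and `summary`.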

Slide 28

Slide 28 text

DATA EXPLORATION FRAMEWORK Prep Phase Explore Phase

Slide 29

Slide 29 text

EXPLORE: FILTER + AGGREGATE

Slide 30

Slide 30 text

FILTER + AGGREGATE REQUIREMENTS • Identifying categorical and numeric variables. • Filtering/sub-setting the data set by categories. • Aggregating on categories and calculation of numeric fields. • Interpreting results and determining what is useful.
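
A short pandas sketch of these four requirements. The data, column names, and the filter chosen are hypothetical; interpreting the resulting table is the step left to the human.

```python
import pandas as pd

# Hypothetical data set with two categorical fields and one numeric field.
data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West", "South"],
    "segment": ["Retail", "Online", "Retail", "Online", "Retail", "Online"],
    "sales": [100.0, 150.0, 200.0, 50.0, 300.0, 80.0],
})

# Machine step: separate categorical from numeric fields.
numeric_cols = data.select_dtypes(include="number").columns.tolist()
categorical_cols = data.select_dtypes(exclude="number").columns.tolist()

# Human step: pick a subset worth looking at (a hypothetical choice here).
subset = data[data["segment"] == "Retail"]

# Machine step: aggregate the numeric fields over a category.
summary = subset.groupby("region")[numeric_cols].agg(["count", "sum", "mean"])
print(summary)
```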

Slide 31

Slide 31 text

EXPLORE: FIELD RELATIONSHIPS

Slide 32

Slide 32 text

FIELD RELATIONSHIP REQUIREMENTS • Identifying numeric fields. • Comparing cross-distributions of values across all combinations of numeric fields. • Identifying existence, direction, strength, and type of relationship. • Determining which relationships (or lack thereof) are interesting or insightful.
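
As a rough sketch of the machine's share of this work, the snippet below compares every pair of numeric fields and ranks the pairs by strength. It assumes pandas and uses Pearson correlation, which only captures linear relationships, so identifying the type of relationship and judging what is interesting still fall to a person. The data and column names are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical data set.
data = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "output": [2, 4, 6, 8, 10, 12],   # rises in step with hours
    "errors": [12, 10, 8, 6, 4, 2],   # falls as hours rise
    "noise": [3, 1, 4, 1, 5, 9],
})

# Machine step: identify numeric fields and correlate all of them.
numeric = data.select_dtypes(include=[np.number])
corr = numeric.corr()  # Pearson correlation by default

# Machine step: one row per pair of fields, with direction and strength.
pairs = []
for a, b in itertools.combinations(numeric.columns, 2):
    r = corr.loc[a, b]
    pairs.append({
        "field_a": a,
        "field_b": b,
        "r": r,
        "direction": "positive" if r > 0 else "negative",
        "strength": abs(r),
    })

ranked = pd.DataFrame(pairs).sort_values("strength", ascending=False).reset_index(drop=True)
print(ranked)
```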

Slide 33

Slide 33 text

EXPLORE: ENTITY RELATIONSHIPS

Slide 34

Slide 34 text

GRAPH ANALYSIS REQUIREMENTS • Identifying hierarchical entity levels in the data. • Identifying similarities and strength of similarities between entities. • Identifying clusters, communities, sub-networks and other important groupings within the network. • Interpreting those relationships and what they mean in the real world.
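
A small standard-library sketch of the middle two requirements. Entities are linked when they share attributes, the number of shared attributes stands in for strength of similarity, and connected components stand in for communities. These are simplifying assumptions made for the example, and the records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (entity, attribute) pairs, e.g. customer and product.
records = [
    ("ann", "tent"), ("ann", "stove"), ("bob", "tent"), ("bob", "stove"),
    ("cat", "stove"), ("dan", "novel"), ("eve", "novel"), ("eve", "atlas"),
]

# Collect the attributes seen for each entity.
items = defaultdict(set)
for entity, attribute in records:
    items[entity].add(attribute)

# Edge weight = number of shared attributes between two entities.
edges = {}
for a, b in combinations(sorted(items), 2):
    shared = len(items[a] & items[b])
    if shared:
        edges[(a, b)] = shared


def connected_components(nodes, edges):
    """Group nodes that are linked directly or indirectly."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(group)
    return groups


groups = connected_components(items, edges)
```

What each group means in the real world is the interpretive step that stays with the human.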

Slide 35

Slide 35 text

KEY TAKE-AWAYS • Human-machine collaboration is important and very useful. • We can design these processes by deconstructing them into tasks and steps. • Pay special attention to the interfaces. • There is plenty of room for development and advancement in this area, and Python already contains a lot of the tools we need to make progress.

Slide 36

Slide 36 text

WHERE TO LEARN MORE & GET INVOLVED • Blog: blog.districtdatalabs.com • Cultivar: github.com/DistrictDataLabs/cultivar • Yellowbrick: github.com/DistrictDataLabs/yellowbrick • Twitter: @tonyojeda3 • LinkedIn: linkedin.com/in/tonyojeda

Slide 37

Slide 37 text

THANK YOU!