Tony Ojeda - Human-Machine Collaboration for Improved Analytical Processes

Over the last several years, Python developers interested in data science and analytics have acquired a variety of tools and libraries that aim to facilitate analytical processes. Libraries such as Pandas, Statsmodels, Scikit-learn, Matplotlib, Seaborn, and Yellowbrick have made tasks such as data wrangling, statistical modeling, machine learning, and data visualization much quicker and easier. They have accomplished this by automating and abstracting away some of the more tedious, repetitive processes involved with analyzing and modeling data.

Over the next few years, we are sure to witness the introduction of new tools that are increasingly intelligent and have the ability to automate more complex analytical processes. However, as we begin using these tools (and developing new ones), we should strongly consider the level of automation that is most appropriate for each case. Some analytical processes are technically difficult to automate and therefore require a large degree of human steering. Others are relatively easy to automate but perhaps should not be, either because their results are unpredictable or because their outputs require a level of compassionate decision-making that machines simply don’t possess. Such processes would benefit greatly from collaboration between automated machine tasks and uniquely human ones. After all, systems that combine human and machine intelligence often achieve better results than either could on their own.

In this talk, we will discuss human-machine collaboration as it applies to analyzing data with Python. We will review a framework for exploratory data analysis with the goal of identifying which tasks should be automated, which tasks should not, and which tasks would benefit from a more interactive, symbiotic, and collaborative process between the human and the machine. We will explore Python libraries that we can use to build tools that allow us to perform different types of analysis. We’ll also introduce the Cultivar project, an example of a hybrid analytics tool that combines the Django framework with JavaScript visualizations and Celery for task management to facilitate more efficient and effective human-machine systems for data analysis.


PyCon 2017

May 21, 2017


  1. TONY OJEDA (ME)
    • Data Scientist @ Follett
    • Founder @ District Data Labs
    • Co-Author
        • Applied Text Analysis with Python (O’Reilly, Fall 2017)
        • Practical Data Science Cookbook (Packt, Fall 2014)
    • Conference Speaker
        • Data Day Seattle 2016
        • PyData - Carolinas & DC 2016
  2. RECENT AI HEADLINES
    • Can The Rise Of Artificial Intelligence End Humanity?
    • Will Artificial Intelligence Leave You Jobless?
    • Essential Skills To Keep Your Job In The Era Of Artificial Intelligence
    • Artificial intelligence probably won't kill you, but it could take your job
  3. How can we combine human & machine abilities to produce better outcomes than either could on their own?
  4. WHAT IS AN ANALYTICAL PROCESS?
    • A series of tasks for ingesting, transforming, analyzing, modeling, or visualizing data.
    Data Science Pipeline: Ingestion, Wrangling, Analysis, Modeling, Visualization
  5. What types of tasks are humans better at? What types of tasks are machines better at?

    • Social/Language/Communication Tasks
    • General or Domain Knowledge Tasks
    • Tasks Requiring Flexibility, Adaptability, or Creativity
    • Exploratory or Investigative Tasks

    Precision is Important
    • Tasks that Require Processing Vast Amounts of Information
    • Memory and Recollection Tasks
    • Repetitive Tasks Where Consistency is Important
  8. DESIGNING COLLABORATIVE PROCESSES
    • Deconstruct the process into tasks and steps.
    • Determine which steps should be performed by the human and which should be performed by the machine.
    • Identify the points of interaction and ensure those are intuitive.
  9. CREATE: CATEGORY AGGREGATIONS
    Take categorical variables with a lot of categories (e.g., more than 10) and distill them down into fewer categories.
  10. CATEGORY AGGREGATION REQUIREMENTS
    • Identification of categorical variables and unique values
    • Natural language understanding
    • General and/or domain knowledge
    • Similarity in meaning
    • Sometimes creativity
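
One way to read the requirements above is as a split of work: the machine finds the categorical columns worth aggregating, and the human supplies the grouping that depends on language and domain knowledge. The following is a minimal sketch under that assumption. The function names, the `state` column, and the state-to-region mapping are all hypothetical and do not come from the talk.

```python
import pandas as pd


def high_cardinality_columns(df, max_categories=10):
    """Machine step: non-numeric columns with more than `max_categories` unique values."""
    candidates = df.select_dtypes(exclude=["number"]).columns
    return [col for col in candidates if df[col].nunique() > max_categories]


def aggregate_categories(series, mapping, other_label="Other"):
    """Apply a human-supplied value -> group mapping; unmapped values become `other_label`."""
    return series.map(mapping).fillna(other_label)


# Hypothetical example data: 12 distinct state codes.
data = pd.DataFrame({
    "state": ["VA", "MD", "DC", "NY", "NJ", "CA", "OR", "WA", "TX", "FL", "GA", "IL"],
    "sales": range(12),
})

# Machine step: flag columns that have too many categories.
to_review = high_cardinality_columns(data)

# Human step: the grouping itself relies on domain knowledge (hypothetical mapping).
state_to_region = {
    "VA": "Mid-Atlantic", "MD": "Mid-Atlantic", "DC": "Mid-Atlantic",
    "NY": "Northeast", "NJ": "Northeast",
    "CA": "West", "OR": "West", "WA": "West",
    "TX": "South", "FL": "South", "GA": "South",
}
data["region"] = aggregate_categories(data["state"], state_to_region)
```

Values the human did not map (here `IL`) fall into an explicit "Other" group, so they stay visible for a later review instead of silently disappearing.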
  11. CREATE: CONTINUOUS BINS
    Identify continuous variables. Assign them to buckets or bins based on how high or low their values are.
    Very Low | Low | Moderate | High | Very High
  12. BINNING REQUIREMENTS
    • Identification of continuous variables
    • Comparison, ordering, and segregation
    • Knowing whether higher or lower values are better
    • Meaningful naming of resulting categories
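  13. CONTINUOUS BINNING EXAMPLE

        import numpy as np
        import pandas as pd

        # `data` is assumed to be an existing pandas DataFrame.
        numeric_cols = data.select_dtypes(include=[np.number]).columns.values
        quint_levels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

        for column in numeric_cols:
            # Quantile-based bins: labeled quintiles, then deciles and percentiles.
            # pd.qcut raises a ValueError if the bin edges are not unique.
            data[column + ' Level'] = pd.qcut(data[column], 5, labels=quint_levels)
            data[column + ' Decile'] = pd.qcut(data[column], 10, labels=list(range(1, 11)))
            data[column + ' Perc'] = pd.qcut(data[column], 100, labels=list(range(1, 101)))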
  14. CLUSTERING REQUIREMENTS
    • Identification of numeric variables
    • Clustering similar records together
    • Determining quality and appropriate numbers of clusters
    • Meaningful naming of resulting categories
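
The first three clustering requirements lend themselves to automation, while naming the clusters is left to the human. Below is a rough sketch of that split, assuming scikit-learn's `KMeans` and the silhouette score as the quality measure. The talk does not say which algorithm or metric it used, and the synthetic data and column names here are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated groups of points plus a non-numeric id.
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 10), (0, 10)]
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])
data = pd.DataFrame(points, columns=["x", "y"])
data["id"] = [f"rec{i}" for i in range(len(data))]

# Machine step 1: identify the numeric variables.
numeric = data.select_dtypes(include=[np.number])

# Machine step 2: try several cluster counts and score each one.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(numeric)
    scores[k] = silhouette_score(numeric, labels)

# Machine step 3: keep the cluster count with the best silhouette score.
best_k = max(scores, key=scores.get)
data["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(numeric)

# Human step: inspect each cluster and give it a meaningful name.
```

On real data the numeric columns would usually be scaled first, and the silhouette score is only one of several reasonable ways to compare cluster counts.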
  15. FILTER + AGGREGATE REQUIREMENTS
    • Identifying categorical and numeric variables.
    • Filtering/sub-setting the data set by categories.
    • Aggregating on categories and calculation of numeric fields.
    • Interpreting results and determining what is useful.
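
The filtering and aggregation steps above map fairly directly onto pandas operations. This is a small sketch with a hypothetical helper, `filter_and_aggregate`, and made-up sales data. Interpreting the resulting table remains the human's job.

```python
import pandas as pd

# Hypothetical example data.
data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100.0, 150.0, 200.0, 50.0, 300.0],
})


def filter_and_aggregate(df, filters, group_by, value, funcs=("sum", "mean", "count")):
    """Subset `df` by {column: allowed values}, then aggregate `value` per `group_by`."""
    subset = df
    for column, allowed in filters.items():
        subset = subset[subset[column].isin(allowed)]
    return subset.groupby(group_by)[value].agg(list(funcs))


# Revenue by product, restricted to the West region.
summary = filter_and_aggregate(data, {"region": ["West"]}, "product", "revenue")
```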
  16. FIELD RELATIONSHIP REQUIREMENTS
    • Identifying numeric fields.
    • Comparing cross-distributions of values across all combinations of numeric fields.
    • Identifying existence, direction, strength, and type of relationship.
    • Determining which relationships (or lack thereof) are interesting or insightful.
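
As a simplified illustration of the machine side of these requirements, the sketch below ranks every pair of numeric fields by correlation. Correlation is used here as a stand-in for "relationship" and only captures linear association with the default Pearson method. The function name and the example columns are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def rank_relationships(df, method="pearson"):
    """Score every pair of numeric fields and sort by strength of correlation."""
    corr = df.select_dtypes(include=[np.number]).corr(method=method)
    rows = []
    for field_a, field_b in combinations(corr.columns, 2):
        r = corr.loc[field_a, field_b]
        rows.append({
            "field_a": field_a,
            "field_b": field_b,
            "correlation": r,
            "strength": abs(r),
            "direction": "positive" if r >= 0 else "negative",
        })
    pairs = pd.DataFrame(rows)
    return pairs.sort_values("strength", ascending=False).reset_index(drop=True)


# Hypothetical example data: `cost` depends linearly on `hours`, `noise` does not.
hours = np.arange(20.0)
data = pd.DataFrame({
    "hours": hours,
    "cost": 2 * hours + 1,
    "noise": np.where(hours % 2 == 0, 1.0, -1.0),
})

pairs = rank_relationships(data)
```

The ranked table gives the human a short list to review. Deciding which of those relationships are interesting or insightful is not something the ranking can answer.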
  17. GRAPH ANALYSIS REQUIREMENTS
    • Identifying hierarchical entity levels in the data.
    • Identifying similarities and strength of similarities between entities.
    • Identifying clusters, communities, sub-networks and other important groupings within the network.
    • Interpreting those relationships and what they mean in the real world.
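
The similarity and grouping requirements can be sketched with the standard library alone. The example below assumes each entity is described by a set of attributes, uses Jaccard similarity as the edge weight, and treats connected components as a simple stand-in for community detection. All names, the threshold, and the store data are hypothetical, and identifying hierarchical entity levels is not covered.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similarity_edges(entities, threshold=0.3):
    """Return {(entity_a, entity_b): weight} for pairs at or above `threshold`."""
    edges = {}
    for (name_a, attrs_a), (name_b, attrs_b) in combinations(entities.items(), 2):
        weight = jaccard(attrs_a, attrs_b)
        if weight >= threshold:
            edges[(name_a, name_b)] = weight
    return edges


def connected_groups(nodes, edges):
    """Group nodes that are linked directly or indirectly by an edge."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(neighbors[current] - seen)
        groups.append(group)
    return groups


# Hypothetical entities: stores described by the product lines they carry.
entities = {
    "store_1": {"books", "supplies", "apparel"},
    "store_2": {"books", "supplies"},
    "store_3": {"coffee", "snacks"},
    "store_4": {"coffee", "snacks", "gifts"},
}

edges = build_similarity_edges(entities)
groups = connected_groups(entities, edges)
```

What each resulting group means in the real world is the interpretive step that stays with the human.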
  18. KEY TAKE-AWAYS
    • Human-machine collaboration is important and very useful.
    • We can design these processes via deconstruction into tasks and steps.
    • Pay special attention to the interfaces.
    • There is plenty of room for development and advancement in this area, and Python already contains a lot of the tools we need to make progress.
  19. WHERE TO LEARN MORE & GET INVOLVED
    • Blog: blog.districtdatalabs.com
    • Cultivar: github.com/DistrictDataLabs/cultivar
    • Yellowbrick: github.com/DistrictDataLabs/yellowbrick
    • Twitter: @tonyojeda3
    • LinkedIn: linkedin.com/in/tonyojeda