

AI Observability ツ - Examples with Spark, Pandas, and Scikit-Learn

Presentation to the Data Engineering Melbourne meetup: https://www.meetup.com/fr-FR/Data-Engineering-Melbourne/events/277038718/

In this talk, Andy Petrella will do 50% live coding (Scala, Python) and 50% highlights on best practices to help data teams keep calm and let AI go to production.

Andy is an entrepreneur with a Mathematics and Distributed Data background, focused on unleashing unexploited business potential by leveraging new technologies in machine learning, artificial intelligence, and cognitive systems.

In the data community, Andy is known as an early evangelist of Apache Spark (2011-), the Spark Notebook creator (2013-), a public speaker at various events (Spark Summit, Strata, Big Data Spain), and an O'Reilly author (Distributed Data Science, Data Lineage Essentials, Data Governance, and Machine Learning Model Monitoring).

Andy is the CEO of Kensu, bringing the Data Intelligence Management (DIM) Platform for data-driven companies to leverage AI sustainably, combining AI Observability with Data Usage Catalog.

We’ll take the opportunity to introduce concepts like datastrophes, data as a product, and federated governance.

In the end, you’ll have a grasp of the relationship between monitoring data and AI, data mesh, and advanced data discovery.

Andy Petrella

April 01, 2021

Transcript

  1. © Kensu, inc. 2021 AI Observability ツ Examples with Spark,
    Pandas, and Scikit-Learn

    AGENDA
    1. “Who’s that dude” (73’)
    2. Introduction to `Datastrophes` (10’)
    3. Solution: needs and methods (15’)
    4. Showtime: implementation examples (15’)
  2. © Kensu, inc. 2021 Introduction to `Datastrophes`

    Like any project, a data project needs to limit its scope. To do so, many assumptions are necessary. Those assumptions are made both by the business (about the market) and by the engineering team (about the system), which leads, inevitably, to Datastrophes.

    Catastrophe = denouement of a drama.
    Datastrophe = catastrophe with data.
    Datastrophe = denouement of a DAMA (*). (*) DAta MAnagement
  3. © Kensu, inc. 2021 Datastrophes ⇢? AI Winter 🥶🥶🥶

    “The AI winter was a result of such hype, due to over-inflated promises by developers, unnaturally high expectations from end-users, and extensive promotion in the media.”
    https://www.actuaries.digital/2018/09/05/history-of-ai-winters/
  4. © Kensu, inc. 2021 Datastrophes ⇢? AI Winter 🥶🥶🥶

    Why? The POC syndrome: ROI drops over time. It is not about garbage in, garbage out, or data quality, but about trusting what’s going on with data in production.

    “15% of documentation overhead to ensure compliance and Data Catalog usefulness” -- project manager
    “Data is not available on time in production” -- data ops
    “Data suppliers changed schema or semantics (business definitions), impacting business rule accuracy” -- data engineer
    “The data is different than 6 months ago; all predictions are wrong” -- data scientist

    “Datastrophes”:
    1. Data is hard to find and to use in production
    2. Cost of maintenance reduces team capabilities
    3. Impact assessments are ineffective and incomplete
    4. Data variations are uncontrolled and unknown
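The fourth datastrophe, uncontrolled data variations, can be made concrete with a small pandas sketch. This is an illustrative check, not Kensu's implementation: the function name and the drift tolerance are assumptions. It compares a production batch against a reference batch for the two failures quoted above, a silent schema change and a distribution shift.

```python
import pandas as pd

def check_batch(reference: pd.DataFrame, batch: pd.DataFrame,
                drift_tolerance: float = 0.5) -> list:
    """Return a list of detected issues; empty means the batch looks consistent."""
    issues = []
    # Schema check: a supplier renaming or dropping columns.
    missing = set(reference.columns) - set(batch.columns)
    if missing:
        issues.append(f"schema: missing columns {sorted(missing)}")
    # Distribution check: "the data is different than 6 months ago".
    numeric = reference.select_dtypes("number").columns.intersection(batch.columns)
    for col in numeric:
        ref_mean, ref_std = reference[col].mean(), reference[col].std()
        if ref_std and abs(batch[col].mean() - ref_mean) > drift_tolerance * ref_std:
            issues.append(f"drift: '{col}' mean moved beyond tolerance")
    return issues

reference = pd.DataFrame({"amount": [10.0, 11.0, 9.0, 10.5], "country": ["BE"] * 4})
ok_batch = pd.DataFrame({"amount": [10.2, 9.8, 10.1, 10.6], "country": ["BE"] * 4})
bad_batch = pd.DataFrame({"price": [100.0, 120.0]})  # renamed column
```

In production the same checks would run on every batch and feed an alerting channel instead of returning a list.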
  5. © Kensu, inc. 2021 AI Observability: Wave-Particle Duality

    A. Einstein: “It seems as though we must use sometimes the one theory and sometimes the other, while at times we may use either. We are faced with a new kind of difficulty. We have two contradictory pictures of reality; separately neither of them fully explains the phenomena of light, but together they do.”
  6. © Kensu, inc. 2021 AI Observability

    A Machine Learning model can be seen as:
    - Data: it is a bunch of doubles resulting from the training process on the observations (i.e. the known world).
    - Application: it is used as a function (e.g. to predict).

    Moreover, the behavior of the application part depends on the observations used in the training phase, while our control resides in the hyperparameters we provide (or find) during training. It is as if Java, Scala, Python, R, Go, SQL, etc. code changed automatically with its context, and what it becomes is unknown.
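The duality above can be shown in a few lines with scikit-learn (a minimal sketch with made-up numbers): the same fitted model is data, the doubles produced by training, and an application, the function those doubles define.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Observations: the "known world" the model is trained on.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression().fit(X, y)

# The "data" view: a bunch of doubles resulting from training.
learned = (model.coef_, model.intercept_)

# The "application" view: a function used to predict, whose behavior is
# entirely determined by those doubles. Change the observations, and the
# "code" of this function changes with them.
prediction = model.predict(np.array([[5.0]]))
```

Retraining on different observations silently rewrites `learned`, and with it the behavior of `predict`, which is exactly why the model's inputs need to be observed.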
  7. © Kensu, inc. 2021 AI Observability

    I can hear that it is raining cats and dogs. I see a poor person outside walking the street. However, I don’t have to help, as the umbrella already does the job.
    Note: in this case, it is raining Schrödinger’s cats.
  8. © Kensu, inc. 2021 AI Observability

    As with Schrödinger’s cat, an AI system can be considered, after a certain amount of time, good and bad simultaneously, unless an observer looks into it and identifies its real state. However, especially with AI, the question is: what do we have to observe? In other words, which outputs shall we use to infer the internal state?
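One possible answer, sketched here with pandas under assumed facet names: observe lightweight summaries ("facets") of the data flowing through the system at each run, so an observer can later infer the internal state without opening the box.

```python
import pandas as pd

def observe(df: pd.DataFrame) -> dict:
    """Collect lightweight, comparable observations about a dataset."""
    return {
        "rows": len(df),
        # Completeness facet: share of missing values per column.
        "null_rate": {c: float(df[c].isna().mean()) for c in df.columns},
        # Distribution facet: mean of each numeric column (NaNs skipped).
        "numeric_mean": {c: float(df[c].mean())
                         for c in df.select_dtypes("number").columns},
    }

facets = observe(pd.DataFrame({"score": [0.2, 0.9, None],
                               "label": ["a", "b", "a"]}))
```

Comparing such facets across runs is what turns "the system might be good or bad" into an observable state.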
  9. © Kensu, inc. 2021 (some) Links with Data Mesh

    • Responsibility: the domain becomes responsible for the data it exposes → the consumer shares the responsibility by exposing its usages and constraints.
    • Data as a Product: linked to responsibility, SLAs (SLOs) have to be defined and communicated. More importantly, their failures must be detected, or anticipated.
    • Federated Governance: as data products are shared and promoted, (analytical) applications mostly cross several domain boundaries.
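The "data as a product" point can be sketched as code: a hypothetical freshness SLO that the producing domain checks (and a consumer can subscribe to) so a failure is detected before it propagates. The function name and threshold are illustrative, not a Kensu API.

```python
from datetime import datetime, timedelta, timezone

def freshness_slo_ok(last_updated: datetime, max_age: timedelta) -> bool:
    """True if the data product still meets its freshness SLO."""
    return datetime.now(timezone.utc) - last_updated <= max_age

# A product updated 10 minutes ago vs. one stale for 2 days,
# both checked against a 1-hour freshness objective.
recent = datetime.now(timezone.utc) - timedelta(minutes=10)
stale = datetime.now(timezone.utc) - timedelta(days=2)
```

The same pattern extends to other SLOs (completeness, schema stability), each one a small predicate evaluated on observed facets rather than on the data itself.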
  10. © Kensu, inc. 2021 Let’s jump in Jupyter to see HOW!
    Examples in Spark, Pandas, and Scikit-Learn
  11. © Kensu, inc. 2021 Stay calm and let your code/tools speak while they run

    At least 3 strategies have been used successfully (running in prod 😁):
    • Catch events or use the APIs of high-end tools (e.g. Tableau): lineage, for example, is more and more commonly implemented nowadays.
    • Wrap your preferred libraries with auto-logging capabilities: Spark, Pandas, dplyr, Spring, and so on can be beefed up with internal log reporting.
    • Use the OpenTracing philosophy to capture facets from your data usage that can be reconsolidated later on (trace reporter).
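The second strategy can be sketched as a toy wrapper around `pandas.read_csv` (the decorator and the in-memory log are assumptions; a real agent would ship these events to a collector): every read also reports what was read, which is the beginning of lineage.

```python
import functools
import io
import pandas as pd

LINEAGE_LOG = []  # stand-in for a real event sink

def with_logging(read_fn):
    """Wrap a pandas reader so each call also emits a lineage event."""
    @functools.wraps(read_fn)
    def wrapper(source, *args, **kwargs):
        df = read_fn(source, *args, **kwargs)
        LINEAGE_LOG.append({
            "op": read_fn.__name__,
            "source": getattr(source, "name", str(source)),
            "rows": len(df),
            "columns": list(df.columns),
        })
        return df
    return wrapper

read_csv = with_logging(pd.read_csv)

# Existing notebook code keeps its shape; only the reader is swapped.
df = read_csv(io.StringIO("a,b\n1,2\n3,4"))
```

The same idea scales from one function to whole libraries, which is what the auto-logging agents mentioned above do under the hood.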
  12. © Kensu, inc. 2021 Link with Data Intelligence Management (DIM)

    Data Management, or more specifically Data Governance, is often thought of as a Data Catalog (metadata repository, glossary, workflow management, …). DM by essence focuses on the data itself, for example to allow one to find a dataset based on its metadata, e.g. “where is the customers data?” AI Observability allows an organization to also capture the purposes of data usages through the lens of the applications, such that a usage-based catalog allows finding datasets based on purpose, e.g. “how can I predict my churn?”
  13. THANKS! Ping me on @nooostab or LinkedIn. Check out Kensu DIM on https://kensu.io
    🎺 📣 O’Reilly training (4/28): ML Monitoring in Python