Slide 1

Slide 1 text

State of MLOps in 2022 Asei Sugiyama

Slide 2

Slide 2 text

TL;DR There has been a lot of progress on the technical side; for example, many players now build their own ML pipelines. That progress has also exposed problems of machine learning, especially fairness, security, and transparency. Even though statistics is a well-known field with a long history, we still have unsolved problems in the statistical measurement of ML impact. This year, governments are planning to define standards and regulations (e.g., the EU AI Act). The details are still unclear, but we must keep watching that activity.

Slide 3

Slide 3 text

TOC MLOps in 2022 Technical <- Measurement Process MLOps in 2023 Measurement Process Culture Regulations & Standards

Slide 4

Slide 4 text

Technical Fairness <- Security Transparency

Slide 5

Slide 5 text

Traditional fairness problem: unfairness based on sensitive features (race, gender, etc.), measured by the difference of conditional probabilities. Fairness-Aware Machine Learning and Data Mining p.20
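
A minimal sketch of that measurement (demographic parity as a difference of conditional probabilities), assuming Python/NumPy and hypothetical binary predictions `y_pred` with a binary sensitive feature `group`:

```python
import numpy as np

def demographic_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """|P(y_pred = 1 | group = 1) - P(y_pred = 1 | group = 0)|."""
    p1 = y_pred[group == 1].mean()  # positive rate for one group
    p0 = y_pred[group == 0].mean()  # positive rate for the other group
    return abs(p1 - p0)

# Hypothetical predictions and a binary sensitive feature
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(demographic_parity_difference(y_pred, group))  # 0.5 -> a large gap, potentially unfair
```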

Slide 6

Slide 6 text

Unfairness in generative AI: a well-known problem. If the dataset is biased with respect to a sensitive feature, a generative model trained on that dataset may generate biased images. Note: the image on the right was not generated naturally but by forcing the model to create a cute girl image. "I claimed that a world-famous picture is a picture of a girl to AI." https://youtu.be/MZUtn9EvoRo

Slide 7

Slide 7 text

Gender bias: illustration generation models tend to generate girls. This is caused by bias in the training dataset. 珠洲ノらめる on Twitter: "Hakushu personified...!!!!! AI-chan, this is amazing!!!? https://t.co/uUVoUARM8x" / Twitter https://twitter.com/piyo_rameru/status/1615156487801950211?s=20&t=lZhYKkGcaAeyW7aXERvlpw

Slide 8

Slide 8 text

Unfairness of the reward: the developers of the generative models can gain wealth, but the creators of the original images cannot get any money. Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content - The Verge https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit

Slide 9

Slide 9 text

Awful AI Award: 2022 daviddao/awful-ai: Awful AI is a curated list to track current scary usages of AI - hoping to raise awareness https://github.com/daviddao/awful-ai

Slide 10

Slide 10 text

Who gained from OpenAI? A huge investment from Microsoft into OpenAI; no news about a huge investment from Microsoft into the creators of the training dataset. Inside the structure of OpenAI's looming new investment from Microsoft and VCs | Fortune https://fortune.com/2023/01/11/structure-openai-investment-microsoft/

Slide 11

Slide 11 text

Similar problem: Annotation I see companies get this wrong all the time: their in-house annotation teams are left in the dark about the impact that their work is having on daily or long-term goals. That's disrespectful to the people doing the work and will lead to poor motivation, high churn, and low quality annotations. So, it doesn't help anyone. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 12

Slide 12 text

Fairness: recap. Before the rise of Stable Diffusion, the unfairness problem was mainly about bias in the dataset. We are now facing a new unfairness problem: unfairness in the data collection process and in the business model. It is not enough to check for bias in the dataset; we also have to check the business model behind the ML model.

Slide 13

Slide 13 text

Technical Fairness Security <- Transparency

Slide 14

Slide 14 text

Traditional security problem: adversarial examples. Explaining and Harnessing Adversarial Examples https://arxiv.org/abs/1412.6572
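
The method introduced in that paper is the fast gradient sign method (FGSM); a minimal sketch assuming PyTorch and a hypothetical pretrained classifier `model` with inputs scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss the attacker wants to increase
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # one step in the sign of the gradient
    return x_adv.clamp(0.0, 1.0).detach() # keep pixels in a valid range
```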

Slide 15

Slide 15 text

Real Attackers Don't Compute Gradients real-world evidence suggests that actual attackers use simple tactics to subvert ML-driven systems, and as a result security practitioners have not prioritized adversarial ML defenses. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315

Slide 16

Slide 16 text

Case 1. Facebook (1/2) An attacker attempts to spread spam on Facebook For example, they want to post a pornographic image with some text, which may lure a user to click on an embedded URL The attacker—aware of the existence of the ML system—tries to evade the detector by perturbing the content and/or changing their behavior "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315

Slide 17

Slide 17 text

Case 1. Facebook (2/2) Multi-layered security: Automation: bot detection; Access: deny illegal access; Activity: spam detection; Application: hate speech classifier, nudity detector. The first three layers are standard security practice. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315

Slide 18

Slide 18 text

Case 2. Phishing webpage detection (1/2) Phishing webpage detector (image classification, input form detection, etc.). Attackers try to evade the phishing detector by masking, cropping, blurring, etc. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315

Slide 19

Slide 19 text

Case 2. Phishing webpage detection (2/2) "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315

Slide 20

Slide 20 text

Reality of ML security (my assumptions): a flood of poor-quality attempts; adversarial attacks seem too expensive for the attackers. Even users without malicious intent tend to try to change the system's behavior ("Please don't forget to like and subscribe to my channel!"). Malicious users may not behave in the same manner as regular users, for example by making a huge number of attempts.

Slide 21

Slide 21 text

Machine Learning Lens (Best practice from AWS) MLSEC-10: Protect against data poisoning threats Protect against data injection and data manipulation that pollutes the training dataset. Data injections add corrupt training data that will result in incorrect model and outputs. Data manipulations change existing data (for example labels) that can result in inaccurate and weak predictive models. Identify and address corrupt data and inaccurate models using security methods and anomaly detection algorithms. MLSEC-10: Protect against data poisoning threats - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlsec-10.html
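
As one illustration of the "anomaly detection algorithms" mentioned above, a minimal sketch using scikit-learn's IsolationForest to flag suspicious training rows; the data here is synthetic and the contamination threshold is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for a real training feature matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
X_train[:5] += 10.0  # pretend a few rows were injected or manipulated

detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(X_train)         # -1 = anomalous, 1 = normal
suspicious = np.where(flags == -1)[0]
print(f"{len(suspicious)} rows flagged for manual review before training")
```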

Slide 22

Slide 22 text

Machine Learning Lens (Best practice from AWS) MLSEC-11: Protect against adversarial and malicious activities Add protection inside and outside of the deployed code to detect malicious inputs that might result in incorrect predictions. Automatically detect unauthorized changes by examining the inputs in detail. Repair and validate the inputs before they are added back to the pool. MLSEC-11: Protect against adversarial and malicious activities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlsec-11.html

Slide 23

Slide 23 text

Metamorphic testing: add "noise" to the test dataset; equivalent to augmentation. Metamorphic Testing of Machine-Learning Based Systems | by Teemu Kanstrén | Towards Data Science https://towardsdatascience.com/metamorphic-testing-of-machine-learning-based-systems-e1fe13baf048
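
A minimal sketch of such a metamorphic test, assuming a hypothetical `model` with a `predict` method and NumPy inputs; the metamorphic relation is "small input noise should not change the predicted class":

```python
import numpy as np

def test_prediction_stable_under_noise(model, x, n_trials=10, scale=0.01, seed=0):
    """Metamorphic relation: small noise added to x should not flip the prediction."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(x)
    for _ in range(n_trials):
        noisy = x + rng.normal(0.0, scale, size=x.shape)  # same transform used for augmentation
        assert model.predict(noisy) == baseline, "prediction changed under small noise"
```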

Slide 24

Slide 24 text

Security: Recap. Traditional security work mainly focused on research settings. As machine learning systems spread much more broadly, we can now observe more realistic attacks: real attackers don't compute gradients. Use augmentation and metamorphic testing to build robust (secure) models.

Slide 25

Slide 25 text

Technical Fairness Security Transparency <-

Slide 26

Slide 26 text

Mechanical Turk as a holy grail of annotation. crowdsourcing, benchmarking & other cool things https://www.image-net.org/static_files/papers/ImageNet_2010.pdf

Slide 27

Slide 27 text

Reproducibility Crisis: a well-known crisis in psychology; the same problem exists in ML. we found 20 reviews across 17 scientific fields that find errors in a total of 329 papers that use ML-based science. Reproducibility workshop https://sites.google.com/princeton.edu/rep-workshop

Slide 28

Slide 28 text

Too Good to Be True: Bots and Bad Data From Mechanical Turk I summarize my own experience with MTurk and how I deduced that my sample was—at best—only 2.6% valid, by my estimate Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027

Slide 29

Slide 29 text

Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027

Slide 30

Slide 30 text

Eligibility criteria (529 -> 336) Target age: 18-24 years old; MTurk filter of 18 to 25 years & an additional question about age. Consent quiz (336 -> 200) Quiz about their right to end participation, their right to confidentiality, and researchers' ability to contact them (threshold: 2/3). Completion (200 -> 140) Some of the participants didn't finish the 45-min survey. Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027

Slide 31

Slide 31 text

Attention checks (140 -> 124) Selecting another option even when the item says "1 – Select this option". Unrealistic response time (124 -> 77) Finished too fast (less than 20 min) or too slowly (several hours).

Slide 32

Slide 32 text

Examination of qualitative responses (77 -> 14) Consistency of answers to the following requests: "Who are you?" "Write ten sentences below, describing yourself as you are today." "Who will you be? Think about 1 week [1 year/10 years] from today. Write ten sentences below, describing yourself as you imagine you will be in 1 week." If a participant answers "a great man" to one question and "a great woman" to another, that participant fails this filter.
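
The whole filtering funnel (529 -> 14) amounts to a chain of data-quality checks; a minimal pandas sketch with hypothetical column names, not the authors' actual code:

```python
import pandas as pd

df = pd.read_csv("mturk_responses.csv")        # hypothetical export of the raw survey data

df = df[df["age"].between(18, 24)]             # eligibility criteria
df = df[df["consent_quiz_score"] >= 2]         # consent quiz (threshold 2/3)
df = df[df["completed"]]                       # finished the 45-min survey
df = df[df["attention_check_passed"]]          # picked the "Select this option" item
df = df[df["duration_min"].between(20, 120)]   # plausible response time
df = df[df["qualitative_ok"]]                  # manually reviewed free-text answers
print(len(df), "valid responses remain")
```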

Slide 33

Slide 33 text

Can Transparency solve the Reproducibility Crisis? Transparency through documentation is not enough. ImageNet describes how they collected and annotated the dataset; those are the requirements of transparency. But if we hire MTurk to annotate our dataset, we cannot expect the same labeled data even if we follow strictly the same workflow, and according to the paper we cannot rely on the quality of the labels. We should build a well-skilled team to solve the Reproducibility Crisis: in-house specialists, or outsourcing to an annotation vendor.

Slide 34

Slide 34 text

Can outsourcing be the next MTurk? Outsourcing itself is nothing new. Outsourced workers are the fastest growing workforce for data annotation. Finally, not all outsourced workers should be considered low skilled! Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 35

Slide 35 text

Recap: Transparency. In the data science field, many players create their datasets using Mechanical Turk. The Reproducibility Crisis: if we hire MTurk, we may not be able to create a dataset with reproducibility. We should consider building an in-house or outsourced specialist team.

Slide 36

Slide 36 text

TOC MLOps in 2022 Technical Measurement <- Process MLOps in 2023 Measurement Process Culture Regulations & Standards

Slide 37

Slide 37 text

Measurement Four keys Metrics for MLOps Capability of ML: ML Test Score Health check: Percentage of toil Business impact: Experiment & Agreement

Slide 38

Slide 38 text

Four keys: a well-defined set of metrics with thresholds, developed by DORA. Use Four Keys metrics like change failure rate to measure your DevOps performance | Google Cloud Blog https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance?hl=en

Slide 39

Slide 39 text

DORA (DevOps Research & Assessment) DORA publishes research reports on DevOps based on their surveys

Slide 40

Slide 40 text

Machine Learning Operations (MLOps): Overview, Definition, and Architecture. A comprehensive research paper on MLOps based on a survey and interviews: we furnish a definition of MLOps and highlight open challenges in the field. Machine Learning Operations (MLOps): Overview, Definition, and Architecture https://arxiv.org/abs/2205.02302

Slide 41

Slide 41 text

Metrics for MLOps (1/3) No best practice yet: the current definition of MLOps lacks the "Measurement" principle. Machine Learning Operations (MLOps): Overview, Definition, and Architecture https://arxiv.org/abs/2205.02302

Slide 42

Slide 42 text

Metrics for MLOps (2/3) Requirements for metrics of an ML team: 1. Observable: well-defined, easy to measure 2. Comparable: thresholds (good / bad), benchmarks 3. Actionable: able to understand what we should do

Slide 43

Slide 43 text

Metrics for MLOps (3/3) Consider three metrics for three objectives: 1. Capability of the team: ML Test Score 2. Health check of relationships: toil 3. Business impact: agreements & experiments

Slide 44

Slide 44 text

Capability of ML: ML Test Score. Tests to measure the ML team: 28 tests; 0.5 point if a test is run manually, 1.0 point if it is automated. Right: average scores of the Google ML teams. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction https://research.google/pubs/pub46555/
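
A minimal sketch of scoring the rubric (0.5 per manually run test, 1.0 per automated test); aggregating by taking the minimum over the four sections follows my reading of the paper and should be treated as an assumption:

```python
# Scores per test: 0.0 = not done, 0.5 = done manually, 1.0 = automated.
sections = {
    "data":           [1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0],
    "model":          [0.5, 0.5, 1.0, 0.0, 0.5, 0.0, 0.5],
    "infrastructure": [1.0, 1.0, 0.5, 0.5, 0.0, 0.5, 1.0],
    "monitoring":     [0.5, 0.0, 0.5, 0.5, 1.0, 0.0, 0.5],
}

section_scores = {name: sum(tests) for name, tests in sections.items()}
final_score = min(section_scores.values())  # assumption: overall score is the weakest section
print(section_scores, "final:", final_score)
```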

Slide 45

Slide 45 text

Health check: Percentage of toil. Measuring and managing the amount of toil helps the MLOps team focus on its own tasks, and also avoid spending too much effort on other teams' tasks. Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE's time. Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/
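
A minimal sketch of tracking that percentage from a hypothetical time log, flagging when toil exceeds the 50% guideline:

```python
# Hypothetical weekly time log (hours) for one ML engineer; categories marked as toil
# are manual, repetitive, automatable work with no enduring value.
time_log = {
    "manual model redeploys": 6,   # toil
    "ad-hoc data fixes": 5,        # toil
    "pipeline development": 18,    # engineering work
    "design & reviews": 11,        # engineering work
}
toil_categories = {"manual model redeploys", "ad-hoc data fixes"}

toil_hours = sum(h for task, h in time_log.items() if task in toil_categories)
toil_pct = 100 * toil_hours / sum(time_log.values())
print(f"toil: {toil_pct:.0f}%")
if toil_pct > 50:
    print("over the 50% guideline: prioritize automation work")
```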

Slide 46

Slide 46 text

Definition of toil: Manual, Repetitive, Automatable, Tactical, No enduring value, O(n) with service growth. Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/

Slide 47

Slide 47 text

Business impact: Experiment & Agreement. Define relationships between ML metrics and business metrics, and measure them by experiments: experimental design, A/B testing, single case experiments.

Slide 48

Slide 48 text

Business impact does not satisfy the requirements:

Requirement  | ML test score | toil (%) | Business impact
Observable   | ✓             | ✓        |
Comparable   | ✓             | ✓        |
Actionable   | ✓             | ✓        |

Experiments are required to define the business metric, compare the impact with a baseline, and consider what we should do.

Slide 49

Slide 49 text

A/B Testing A/B testing is based on the statistical RCT (randomized controlled trial), introduced in 1948 by Hill (left side). We are still discussing the methodology of A/B testing (including causal inference). Use of randomisation in the Medical Research Council’s clinical trial of streptomycin in pulmonary tuberculosis in the 1940s - PMC https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1114162/ Trustworthy Online Controlled Experiments: 9781108724265: Computer Science Books @ Amazon.com https://www.amazon.com/dp/1108724264
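
A minimal sketch of analyzing a simple A/B test on a conversion rate, assuming statsmodels and made-up counts (not data from the cited books):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for control (A) and treatment (B).
conversions = [420, 470]
visitors = [10000, 10000]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
# A small p-value means the observed difference in conversion rate is unlikely under the null.
```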

Slide 50

Slide 50 text

Single case experiment (1/2) One of the quasi-experimental designs (experiments without a control group). Example: DeepMind AI reduces energy used for cooling Google data centers by 40% https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-energy-used-for/

Slide 51

Slide 51 text

Single case experiment (2/2) Recommended: A/B/A design or A/B/A/B design (A: control, B: treatment). The image on the right is A/B/A. Introduce the new feature (or service) within a limited time span. DeepMind AI reduces energy used for cooling Google data centers by 40% https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-energy-used-for/
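
A minimal sketch of reading out an A/B/A design, with made-up readings loosely inspired by the data-center example; the effect estimate is deliberately naive:

```python
import numpy as np

# Hypothetical daily readings, split into A (control) / B (treatment) / A phases.
phase_a1 = np.array([100, 102, 98, 101, 99])
phase_b  = np.array([61, 63, 60, 62, 64])    # feature switched on
phase_a2 = np.array([99, 101, 100, 98, 102])

baseline = np.concatenate([phase_a1, phase_a2]).mean()
effect = phase_b.mean() - baseline
print(f"baseline {baseline:.1f}, treatment {phase_b.mean():.1f}, effect {effect:.1f}")
# Returning to the baseline level in the second A phase supports attributing the change to B.
```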

Slide 52

Slide 52 text

TOC MLOps in 2022 Technical Measurement Process <- MLOps in 2023 Measurement Process Culture Regulations & Standards

Slide 53

Slide 53 text

Process As we saw in the technical section (transparency), we have realized the importance of the annotation process. Section 7 of Human-in-the-Loop Machine Learning is a great resource for understanding annotation best practices. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 54

Slide 54 text

Three types of workforce: 1. Crowdsourcing 2. BPO (outsourcing) 3. In-house specialists. If the annotation task requires high specialization and confidence, combine BPO & in-house specialists. Human-in-the-Loop Machine Learning fig 8.21 https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 55

Slide 55 text

Three principles (1/2) Salary - Fair pay: annotators should be paid as much as other workers, including data scientists. Job security - Pay regularly: (data scientists should) structure the amount of work available to be as consistent as possible. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 56

Slide 56 text

Three principles (2/2) Ownership - Provide transparency: The best way to make any repetitive task interesting is to make it clear how important that work is. An annotator who spends 400 hours annotating data that powers a new application should feel as much ownership as an engineer who spends 400 hours coding it. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 57

Slide 57 text

Three tips In-house experts: Always run in-house annotation sessions. Outsourced workers: Talk to your outsourced workers. Crowdsourcing: Create a path to secure work and career advancement. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning

Slide 58

Slide 58 text

TOC MLOps in 2022 Technical Measurement Process MLOps in 2023 <- Measurement Process Culture Regulations & Standards

Slide 59

Slide 59 text

Challenges of A/B Testing Measuring network effects <- Managing real-world dynamism Supporting diverse lines of business Supporting our culture of experimentation Challenges in Experimentation. In this post, we provide an overview of… | by John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4

Slide 60

Slide 60 text

Measuring network effects (2/2) Three kinds of randomization: (a) alternating time intervals (one hour), (b) randomized coarse and fine spatial units, (c) randomized user sessions. The best randomization for measuring the effect of Prime Time was (a). Challenges in Experimentation. In this post, we provide an overview of… | by John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
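
A minimal sketch of randomization (a), alternating treatment and control every hour; the hour-parity rule is a simplification for illustration, not Lyft's actual scheme:

```python
from datetime import datetime

def assignment_for(ts: datetime) -> str:
    """Alternate treatment and control every hour so both sides share the same market."""
    return "treatment" if ts.hour % 2 == 0 else "control"

print(assignment_for(datetime(2023, 1, 20, 14, 30)))  # even hour -> "treatment"
print(assignment_for(datetime(2023, 1, 20, 15, 5)))   # odd hour  -> "control"
```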

Slide 61

Slide 61 text

PROCESS (1/3) MLOE-02: Establish ML roles and responsibilities Establish cross-functional teams with roles and responsibilities An ML project typically consists of multiple roles, with defined tasks and responsibilities for each. In many cases, the separation of roles and responsibilities is not clear and there is overlap. MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html

Slide 62

Slide 62 text

PROCESS (2/3) 13 ML roles defined by AWS. MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html

Slide 63

Slide 63 text

PROCESS (3/3) Experiment tracking: integrate business metrics and ML metrics into one dashboard, so that all stakeholders can discuss based on a single source of truth.
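
A minimal sketch of logging ML metrics and business metrics into one tracking backend, assuming MLflow; the run name, metric names, and values are hypothetical:

```python
import mlflow

# Hypothetical values; in practice these come from evaluation jobs and the business dashboard.
with mlflow.start_run(run_name="recsys-v2"):
    mlflow.log_param("model_version", "v2")
    mlflow.log_metric("auc", 0.83)               # ML metric
    mlflow.log_metric("conversion_rate", 0.047)  # business metric from the A/B test
    mlflow.log_metric("revenue_per_user", 1.92)  # business metric
# All stakeholders can then read the same run in the MLflow UI as a single source of truth.
```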

Slide 64

Slide 64 text

CULTURE: Netflix Netflix has a strong culture of experimentation, and results from A/B tests, or other applications of the scientific method, are generally expected to inform decisions about how to improve our product and deliver more joy to members. Netflix: A Culture of Learning https://netflixtechblog.com/netflix-a-culture-of-learning-394bc7d0f94c

Slide 65

Slide 65 text

AI Act: Regulations & Standards for AI. The AI Act is a European legal initiative to establish AI law. Similar to GDPR: a global regulation (not only within the EU). The second half of 2024 is the earliest the regulation could become applicable. EUのAI規制法案の概要 (Overview of the EU's proposed AI regulation) https://www.soumu.go.jp/main_content/000826707.pdf Regulatory framework proposal on artificial intelligence https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

Slide 66

Slide 66 text

Recap There has been a lot of progress on the technical side; for example, many players now build their own ML pipelines. That progress has also exposed problems of machine learning, especially fairness, security, and transparency. Even though statistics is a well-known field with a long history, we still have unsolved problems in the statistical measurement of ML impact. This year, governments are planning to define standards and regulations (e.g., the EU AI Act). The details are still unclear, but we must keep watching that activity.