
State of MLOps in 2022

Asei Sugiyama
February 24, 2023


Summary of MLOps in 2022. This deck was used for a Money Forward internal event. My opinions are my own.

This deck summarizes MLOps activities in 2022. A companion Japanese deck also exists and covers some material not included here: https://speakerdeck.com/asei/mlops-nokoremadetokorekara-eb9ed3f9-3635-4709-8b48-5ccf201c4ae7 . This deck was created for an internal study session on MLOps at Money Forward.

Although this deck touches on the activities of various organizations, the views expressed are the author's own. They are based on publicly available materials at the time of writing and do not represent the official positions of those organizations.



Transcript

  1. State of MLOps in 2022
    Asei Sugiyama


  2. TL;DR
    So much progress in the technical field; for example, many players
    now build their own ML pipelines.
    That technical progress has unveiled problems of machine
    learning, especially fairness, security, and transparency.
    Even though it is a well-known field with a long history, we still have
    unsolved problems in the statistical measurement of ML impact.
    This year, governments are planning to define standards and
    regulations (e.g., the EU AI Act). Much is still unclear, but we must keep
    watching those activities.


  3. TOC
    MLOps in 2022
    Technical <-
    Measurement
    Process
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  4. Technical
    Fairness <-
    Security
    Transparency


  5. Traditional problem of fairness
    Unfairness based on sensitive features (race, gender, etc.)
    Measured by the difference of conditional probabilities
    Fairness-Aware Machine Learning and Data Mining p.20
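
    A minimal Python sketch of that conditional-probability difference (the demographic-parity gap); the predictions and the binary sensitive feature below are hypothetical:

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """|P(Y_hat = 1 | A = 0) - P(Y_hat = 1 | A = 1)| for a binary
    prediction Y_hat and a binary sensitive feature A."""
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    p0 = y_pred[sensitive == 0].mean()  # positive rate in group 0
    p1 = y_pred[sensitive == 1].mean()  # positive rate in group 1
    return abs(p0 - p1)

# Hypothetical model outputs and group memberships
y_pred    = [1, 0, 1, 1, 0, 1, 0, 0]
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(y_pred, sensitive))  # 0.5
```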


  6. Unfairness in generative AI: a well-known problem
    If the dataset is biased with respect to a sensitive
    feature, a generative model trained on that dataset
    may generate biased images.
    Note: The image on the right was not generated
    naturally but by insisting that the model create a
    cute-girl image.
    有名な絵をAIさんに美少女の絵だと言い張った。/I claimed that a world-famous picture is a
    picture of a girl to AI. https://youtu.be/MZUtn9EvoRo


  7. Gender bias
    Illustration-generation models tend to generate girls.
    This is caused by bias in the training dataset.
    珠洲ノらめる on Twitter: 「白州擬人化…!!!!! これは、AIちゃんすごいぞ!!!?(๑°⌓°๑)」
    ("Hakushu personified...!!! This AI is amazing!?")
    https://twitter.com/piyo_rameru/status/1615156487801950211


  8. Unfairness of the reward
    The developers of generative models can gain wealth
    The creators of the original images cannot get any money
    Getty Images is suing the creators of AI art tool Stable Diffusion for
    scraping its content - The Verge
    https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit


  9. Awful AI Award: 2022
    daviddao/awful-ai: Awful AI is a curated list to track current scary usages of AI - hoping to raise awareness https://github.com/daviddao/awful-ai


  10. Who gained from OpenAI?
    A huge investment from Microsoft to OpenAI
    No news about any comparable investment from Microsoft to the
    creators of the training dataset
    Inside the structure of OpenAI’s looming new investment from Microsoft and VCs |
    Fortune https://fortune.com/2023/01/11/structure-openai-investment-microsoft/


  11. Similar problem: Annotation
    I see companies get this wrong all the time: their in-house annotation
    teams are left in the dark about the impact that their work is having on
    daily or long-term goals. That's disrespectful to the people doing the
    work and will lead to poor motivation, high churn, and low quality
    annotations. So, it doesn't help anyone.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  12. Fairness: recap
    Before the rise of Stable Diffusion, the unfairness problem was mainly
    about bias in the dataset
    We are now facing a new unfairness problem: unfairness in the data
    collection process and the business model
    It is not enough to check for bias in the dataset; we also have to
    examine the business model behind the ML model.


  13. Technical
    Fairness
    Security <-
    Transparency


  14. Traditional problem of security
    Explaining and Harnessing Adversarial Examples https://arxiv.org/abs/1412.6572
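
    The cited paper introduces the fast gradient sign method (FGSM). A minimal PyTorch sketch of that attack; the toy model, the random "image", and the epsilon value are placeholders of my own:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """FGSM (Goodfellow et al.): one step in the sign of the input
    gradient, the direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # stay in a valid pixel range

# Hypothetical toy classifier and input
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)  # a fake 28x28 "image"
y = torch.tensor([3])         # its (claimed) label
x_adv = fgsm_attack(model, x, y)
```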


  15. Real Attackers Don't
    Compute Gradients
    real-world evidence suggests that
    actual attackers use simple tactics
    to subvert ML-driven systems, and
    as a result security practitioners
    have not prioritized adversarial ML
    defenses.
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML
    Research and Practice https://arxiv.org/abs/2212.14315


  16. Case 1. Facebook (1/2)
    An attacker attempts to spread spam on Facebook
    For example, they want to post a pornographic image with some text,
    which may lure a user to click on an embedded URL
    The attacker—aware of the existence of the ML system—tries to evade
    the detector by perturbing the content and/or changing their behavior
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  17. Case 1. Facebook (2/2)
    Multi-layered security:
    Automation: bot detection
    Access: deny illegal access
    Activity: spam detection
    Application: hate speech classifier, nudity detector
    The first three layers are standard security practice
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between
    Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  18. Case 2. Phishing webpage detection (1/2)
    Phishing webpage detector
    (image classification, input form detection, etc.)
    Attackers try to get past the phishing detector
    by masking, cropping, blurring, etc.
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between
    Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  19. Case 2. Phishing webpage detection (2/2)
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  20. Reality of ML security (my assumptions)
    A flood of poor-quality attempts
    Adversarial attacks seem too expensive for the attackers
    Even users without malicious intent tend to try to change the
    system's behavior
    "Please don't forget to like and subscribe to my channel!"
    Malicious users may not behave in the same manner as regular users
    For example: a huge number of attempts


  21. Machine Learning Lens (Best practice from AWS)
    MLSEC-10: Protect against data poisoning threats
    Protect against data injection and data manipulation that pollutes the
    training dataset. Data injections add corrupt training data that will
    result in incorrect model and outputs. Data manipulations change
    existing data (for example labels) that can result in inaccurate and
    weak predictive models. Identify and address corrupt data and
    inaccurate models using security methods and anomaly detection
    algorithms.
    MLSEC-10: Protect against data poisoning threats - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-
    lens/mlsec-10.html
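
    The last sentence points at anomaly detection. A minimal sketch, assuming scikit-learn and purely synthetic features, of screening rows before they reach the training set:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for incoming training features
rng = np.random.default_rng(0)
X_incoming = rng.normal(size=(1000, 8))

# Flag the most anomalous rows before they enter the training set
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X_incoming)  # -1 = anomaly, 1 = inlier
X_clean = X_incoming[labels == 1]
print(f"dropped {(labels == -1).sum()} suspicious rows")
```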


  22. Machine Learning Lens (Best practice from AWS)
    MLSEC-11: Protect against adversarial and malicious activities
    Add protection inside and outside of the deployed code to detect
    malicious inputs that might result in incorrect predictions.
    Automatically detect unauthorized changes by examining the inputs in
    detail. Repair and validate the inputs before they are added back to the
    pool.
    MLSEC-11: Protect against adversarial and malicious activities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-
    learning-lens/mlsec-11.html


  23. Metamorphic
    testing
    Add "noise" to
    test dataset
    Equivalent to
    augmentation
    Metamorphic Testing of Machine-Learning Based
    Systems | by Teemu Kanstrén | Towards Data
    Science
    https://towardsdatascience.com/metamorphic-
    testing-of-machine-learning-based-systems-
    e1fe13baf048
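
    A minimal sketch of that noise-based metamorphic relation: a small perturbation of the input should not flip the predicted class. The `predict` callable, noise scale, and pass criterion are assumptions:

```python
import numpy as np

def metamorphic_noise_test(predict, X, sigma=0.01, n_trials=5, seed=0):
    """Add small Gaussian noise and measure how often the predicted
    class matches the prediction on the clean input."""
    rng = np.random.default_rng(seed)
    baseline = predict(X)
    stable = 0.0
    for _ in range(n_trials):
        noisy = X + rng.normal(scale=sigma, size=X.shape)
        stable += np.mean(predict(noisy) == baseline)
    return stable / n_trials

# Hypothetical usage: rate = metamorphic_noise_test(model.predict, X_test)
# A rate well below 1.0 suggests the model is brittle to tiny perturbations.
```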


  24. Security: Recap
    Traditional ML security work mainly focuses on research settings
    Since machine learning systems are now deployed much more broadly,
    we can observe more realistic attacks
    Real attackers don't compute gradients
    Use augmentation & metamorphic testing to build robust (secure)
    models


  25. Technical
    Fairness
    Security
    Transparency <-


  26. Mechanical Turk as a holy grail of annotation
    crowdsourcing, benchmarking & other cool things https://www.image-net.org/static_files/papers/ImageNet_2010.pdf


  27. Reproducibility Crisis
    A well-known crisis in psychology
    The same problem exists in ML:
    we found 20 reviews across 17 scientific fields that find errors in
    a total of 329 papers that use ML-based science.
    Reproducibility workshop https://sites.google.com/princeton.edu/rep-
    workshop


  28. Too Good to Be True: Bots and Bad
    Data From Mechanical Turk
    I summarize my own experience
    with MTurk and how I deduced that
    my sample was—at best—only
    2.6% valid, by my estimate
    Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P.
    Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027


  29. Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022
    https://journals.sagepub.com/doi/10.1177/17456916221120027


  30. Eligibility criteria (529 -> 336)
    Target age: 18 - 24 years old
    MTurk filter of 18 to 25 years & additional question about age
    Consent quiz (336 -> 200)
    Quiz about their right to end participation, their right to confidentiality,
    and researchers’ ability to contact them (threshold: 2/3)
    Completion (200 -> 140)
    Some participants didn't finish the 45-minute survey
    Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022
    https://journals.sagepub.com/doi/10.1177/17456916221120027


  31. Attention checks (140 -> 124)
    Selecting another option even when the item reads "1 – Select this option"
    140 participants -> 124 participants
    Unrealistic response time (124 -> 77)
    Finished too fast (less than 20 min) or too long (several hours)


  32. Examination of qualitative responses (77 -> 14)
    Consistent answers to the following requests:
    "Who are you?"
    "Write ten sentences below, describing yourself as you are today."
    "Who will you be? Think about 1 week [1 year/10 years] from
    today. Write ten sentences below, describing yourself as you
    imagine you will be in 1 week"
    If a participant answers "a great man" to one question and "a great
    woman" to another, that participant fails this filter.


  33. Can transparency solve the Reproducibility Crisis?
    Transparency through documentation is not enough
    ImageNet describes how the dataset was collected and annotated;
    that is what transparency requires
    If we hire MTurk workers to annotate our dataset, we cannot expect the
    same labeled data even if we follow exactly the same workflow
    According to the paper, we cannot rely on the quality of the labels
    We should build a well-skilled team to solve the Reproducibility Crisis:
    In-house specialists
    Outsourcing: an annotation vendor


  34. Can outsourcing be the next MTurk?
    Outsourcing itself is nothing new.
    Outsourced workers are the fastest
    growing workforce for data annotation.
    Finally, not all outsourced workers
    should be considered low skilled!
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  35. Recap: Transparency
    In the data science field, many players create their datasets using
    Mechanical Turk
    The Reproducibility Crisis: if we hire MTurk workers, we may not be
    able to create a reproducible dataset
    We should consider building an in-house or outsourced team of specialists


  36. TOC
    MLOps in 2022
    Technical
    Measurement <-
    Process
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  37. Measurement
    Four keys
    Metrics for MLOps
    Capability of ML: ML Test Score
    Health check: Percentage of toil
    Business impact: Experiment & Agreement


  38. Four keys
    A well-defined set of metrics
    with thresholds
    Developed by
    DORA
    Use Four Keys metrics like change failure rate to
    measure your DevOps performance | Google Cloud
    Blog
    https://cloud.google.com/blog/products/devops-
    sre/using-the-four-keys-to-measure-your-devops-
    performance?hl=en
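
    A minimal sketch of how the four keys could be computed from a deployment log; the schema and numbers are hypothetical, and this is not DORA's own tooling:

```python
import pandas as pd

# Hypothetical deployment log: one row per production deploy
deploys = pd.DataFrame({
    "deployed_at": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-10"]),
    "caused_failure": [False, True, False],
    "lead_time_hours": [20.0, 48.0, 16.0],   # commit -> production
    "restore_hours": [None, 3.0, None],      # filled only for failures
})

weeks = 1  # the log above covers roughly one week
deployment_frequency = len(deploys) / weeks
lead_time_for_changes = deploys["lead_time_hours"].median()
change_failure_rate = deploys["caused_failure"].mean()
time_to_restore = deploys["restore_hours"].dropna().median()
print(deployment_frequency, lead_time_for_changes,
      change_failure_rate, time_to_restore)
```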


  39. DORA (DevOps Research &
    Assessment)
    DORA publishes research on
    DevOps based on its surveys


  40. Machine Learning Operations
    (MLOps): Overview, Definition,
    and Architecture
    A comprehensive research paper
    on MLOps based on a survey
    and interviews
    we furnish a definition of
    MLOps and highlight open
    challenges in the field.
    Machine Learning Operations (MLOps): Overview, Definition, and Architecture
    https://arxiv.org/abs/2205.02302


  41. Metrics for MLOps (1/3)
    No best practice yet
    The current definition of
    MLOps lacks a
    "Measurement" principle
    Machine Learning Operations (MLOps): Overview, Definition, and
    Architecture https://arxiv.org/abs/2205.02302


  42. Metrics for MLOps (2/3)
    Requirements for the metrics of an ML team:
    1. Observable: well-defined, easy to measure
    2. Comparable: thresholds (good / bad), benchmarks
    3. Actionable: able to understand what we should do


  43. Metrics for MLOps (3/3)
    Consider three metrics for three targets
    1. Capability of Team: ML Test Score
    2. Health check of relationship: Toils
    3. Business impact: Agreements & Experiments


  44. Capability of ML: ML Test Score
    Tests that measure an ML team
    28 tests
    0.5 points: run the test manually
    1.0 point: run the test automatically
    Right: average scores of
    Google ML teams
    The ML Test Score: A Rubric for ML Production Readiness and Technical Debt
    Reduction https://research.google/pubs/pub46555/
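
    A minimal sketch of the rubric as I read the paper: each of the 28 tests scores 0.5 if run manually and 1.0 if automated, points are summed within each of the four sections, and the final score is the minimum over sections. The filled-in values are hypothetical:

```python
MANUAL, AUTO, SKIPPED = 0.5, 1.0, 0.0

# Hypothetical results for the 7 tests in each of the 4 sections
sections = {
    "data":           [AUTO, AUTO, MANUAL] + [SKIPPED] * 4,
    "model":          [AUTO, MANUAL, MANUAL] + [SKIPPED] * 4,
    "infrastructure": [AUTO, AUTO, AUTO, MANUAL] + [SKIPPED] * 3,
    "monitoring":     [AUTO, MANUAL] + [SKIPPED] * 5,
}

section_scores = {name: sum(tests) for name, tests in sections.items()}
ml_test_score = min(section_scores.values())  # the weakest section dominates
print(section_scores)   # {'data': 2.5, 'model': 2.0, ...}
print(ml_test_score)    # 1.5
```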


  45. Health check: Percentage of toil
    Measuring and managing the
    amount of toil helps the MLOps
    team focus on its own tasks
    It also helps the team avoid spending
    too much effort on other teams' tasks
    Our SRE organization has an
    advertised goal of keeping
    operational work (i.e., toil) below
    50% of each SRE’s time.
    Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/


  46. Definition of the Toil
    Manual
    Repetitive
    Automatable
    Tactical
    No enduring value
    O(n) with service growth
    Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-
    toil/


  47. Business impact: Experiment & Agreement
    Define relationships between ML metrics and business metrics
    Measure them by experiments
    Experimental Design
    A/B Testing
    Single case experiment


  48. Business impact doesn't satisfy the requirements
    Requirements   ML test score   toil (%)   Business impact
    Observable     ✓               ✓
    Comparable     ✓               ✓
    Actionable     ✓               ✓
    Experiments are required to define the business metric, compare the
    impact with a baseline, and decide what we should do


  49. A/B Testing
    A/B testing is based on the
    statistical RCT (randomized
    controlled trial) introduced
    in 1948 by Hill (left side)
    We are still debating the
    methodology of A/B testing
    (including causal inference)
    Use of randomisation in the Medical Research Council’s clinical trial of
    streptomycin in pulmonary tuberculosis in the 1940s - PMC
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1114162/
    Trustworthy Online Controlled Experiments: 9781108724265: Computer
    Science Books @ Amazon.com https://www.amazon.com/dp/1108724264
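
    For example, the classic two-proportion z-test behind many A/B conversion comparisons needs only the standard library; the counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: control vs. treatment conversions
z, p = two_proportion_ztest(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z = 2.23, p = 0.026
```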


  50. Single case experiment (1/2)
    One of the quasi-experimental
    designs (experiments
    without a control group)
    Example: DeepMind AI
    reduces energy used for
    cooling Google data
    centers by 40%
    DeepMind AI reduces energy used for cooling Google data centers by 40%
    https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-
    energy-used-for/


  51. Single case experiment (2/2)
    Recommended: A/B/A
    design or A/B/A/B design
    A: control
    B: treatment
    The image on the right shows an A/B/A design
    Introduce the new feature (or
    service) within a limited time
    span
    DeepMind AI reduces energy used for cooling Google data centers by 40%
    https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-
    energy-used-for/
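
    A toy simulation of an A/B/A schedule, comparing the metric level in the treatment window against the surrounding control windows; all numbers are synthetic:

```python
import numpy as np

# Two weeks of control (A), two of treatment (B), two of control again
phases = np.array(["A"] * 14 + ["B"] * 14 + ["A"] * 14)
rng = np.random.default_rng(1)
metric = np.where(phases == "B",
                  rng.normal(0.60, 0.05, phases.size),   # treatment level
                  rng.normal(0.50, 0.05, phases.size))   # control level

a_mean = metric[phases == "A"].mean()
b_mean = metric[phases == "B"].mean()
print(f"A: {a_mean:.3f}  B: {b_mean:.3f}  lift: {b_mean - a_mean:+.3f}")
# If the metric returns to the A level after B ends, the change is more
# plausibly caused by the treatment than by an underlying trend.
```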


  52. TOC
    MLOps in 2022
    Technical
    Measurement
    Process <-
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  53. Process
    As we saw in the technical section
    (transparency), the annotation
    process is critically important.
    Section 7 of Human-in-the-Loop
    Machine Learning is a great resource
    for understanding annotation best
    practices.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  54. Three types of workforce
    1. Crowdsourcing
    2. BPO (outsourcing)
    3. In-house specialists
    If the annotation task requires deep
    expertise and high confidence,
    combine BPO & in-house specialists.
    Human-in-the-Loop Machine Learning fig 8.21
    https://www.manning.com/books/human-in-the-loop-machine-learning


  55. Three principles (1/2)
    Salary - Fair pay
    Annotators should be paid as much as other
    workers, including data scientists
    Job Security - Pay regularly
    (data scientists should) structure the amount of
    work available to be as consistent as possible
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  56. Three principles (2/2)
    Ownership - Provide transparency
    The best way to make any repetitive task
    interesting is to make it clear how important that
    work is.
    An annotator who spends 400 hours annotating
    data that powers a new application should feel
    as much ownership as an engineer who spends
    400 hours coding it.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  57. Three tips
    In-house experts: Always run in-house
    annotation sessions
    Outsourced workers: Talk to your outsourced
    workers
    Crowd sourcing: Create a path to secure work
    and career advancement
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  58. TOC
    MLOps in 2022
    Technical
    Measurement
    Process
    MLOps in 2023 <-
    Measurement
    Process
    Culture
    Regulations & Standards


  59. Challenges of A/B Testing
    Measuring network
    effects <-
    Managing real-world
    dynamism
    Supporting diverse lines of
    business
    Supporting our culture of
    experimentation
    Challenges in Experimentation. In this post, we provide an overview of… | by
    John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-
    experimentation-be9ab98a7ef4


  60. Measuring network effects (2/2)
    Three kinds of randomization:
    (a) alternating time intervals (one
    hour)
    (b) randomized coarse and fine
    spatial units
    (c) randomized user sessions
    The best randomization for measuring
    the effect of Prime Time was (a)
    Challenges in Experimentation. In this post, we provide an overview of… | by John Kirn | Lyft
    Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
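
    A sketch of randomization scheme (a), assigning traffic by alternating one-hour intervals; the hour granularity and the salt are my own illustration, not Lyft's implementation:

```python
from datetime import datetime

def assign_variant(event_time: datetime, salt: int = 0) -> str:
    """Alternate treatment and control every other hour; a salt
    shifts the schedule so experiments don't share boundaries."""
    hours_since_epoch = int(event_time.timestamp() // 3600)
    return "treatment" if (hours_since_epoch + salt) % 2 else "control"

print(assign_variant(datetime(2022, 6, 1, 10, 30)))  # one variant
print(assign_variant(datetime(2022, 6, 1, 11, 30)))  # the other variant
```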


  61. PROCESS (1/3)
    MLOE-02: Establish ML roles and responsibilities
    Establish cross-functional teams with roles and responsibilities
    An ML project typically consists of multiple roles, with defined tasks
    and responsibilities for each. In many cases, the separation of roles
    and responsibilities is not clear and there is overlap.
    MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-
    lens/mloe-02.html


  62. PROCESS (2/3)
    13 ML roles defined by AWS
    MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens
    https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html


  63. PROCESS (3/3)
    Experiment tracking
    Integrate business metrics
    and ML metrics into one
    dashboard
    Discussion among all stakeholders
    based on a single source of truth
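
    One possible shape for this, using MLflow purely as an example tracker; the run name and metric values are made up:

```python
import mlflow

# Log ML metrics and business metrics from the same experiment into one
# run, so the tracking UI can serve as the single source of truth.
with mlflow.start_run(run_name="reco-v2-ab-test"):
    mlflow.log_param("model_version", "v2")
    # ML metrics
    mlflow.log_metric("auc", 0.873)
    mlflow.log_metric("ndcg_at_10", 0.412)
    # Business metrics from the same experiment window
    mlflow.log_metric("conversion_rate", 0.065)
    mlflow.log_metric("revenue_per_user", 1.32)
```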


  64. CULTURE: Netflix
    Netflix has a strong culture of
    experimentation, and results
    from A/B tests, or other
    applications of the scientific
    method, are generally expected
    to inform decisions about how
    to improve our product and
    deliver more joy to members.
    Netflix: A Culture of Learning https://netflixtechblog.com/netflix-a-culture-of-
    learning-394bc7d0f94c


  65. AI Act: Regulations & Standards of AI
    AI Act: a European legal effort to
    establish AI law
    Similar to the GDPR: global reach
    (not only within the EU)
    The second half of 2024 is the
    earliest the regulation could
    become applicable
    EUのAI規制法案の概要 (Overview of the EU's proposed AI regulation) https://www.soumu.go.jp/main_content/000826707.pdf
    Regulatory framework proposal on artificial intelligence https://digital-
    strategy.ec.europa.eu/en/policies/regulatory-framework-ai


  66. Recap
    So much progress in the technical field; for example, many players
    now build their own ML pipelines.
    That technical progress has unveiled problems of machine
    learning, especially fairness, security, and transparency.
    Even though it is a well-known field with a long history, we still have
    unsolved problems in the statistical measurement of ML impact.
    This year, governments are planning to define standards and
    regulations (e.g., the EU AI Act). Much is still unclear, but we must keep
    watching those activities.
