
State of MLOps in 2022

Asei Sugiyama
February 24, 2023


Summary of MLOps in 2022. This deck was used for a Money Forward internal event. My opinions are my own.

This deck summarizes MLOps activities in 2022. A companion Japanese deck also exists and covers some material not included here: https://speakerdeck.com/asei/mlops-nokoremadetokorekara-eb9ed3f9-3635-4709-8b48-5ccf201c4ae7 . This deck was created for an internal study session on MLOps at Money Forward.

Although this deck touches on the activities of various organizations, the views expressed are the author's own. They are based on publicly available materials at the time of writing and do not represent the official positions of those organizations.



Transcript

  1. State of MLOps in 2022
    Asei Sugiyama


  2. TL;DR
    So much progress in the technical field; for example, many players
    now build their own ML pipelines.
    That technical progress has unveiled problems of machine
    learning, especially fairness, security, and transparency.
    Even though it is a well-known field with a long history, we still have
    unsolved problems in the statistical measurement of ML impact.
    This year, governments are planning to define standards and
    regulations (e.g., the EU AI Act). Much is still unclear, but we must keep
    watching those activities.


  3. TOC
    MLOps in 2022
    Technical <-
    Measurement
    Process
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  4. Technical
    Fairness <-
    Security
    Transparency


  5. Traditional problem of fairness
    Unfairness based on sensitive features (race, gender, etc.)
    Measured by the difference of conditional probabilities
    Fairness-Aware Machine Learning and Data Mining p.20
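
    A minimal Python sketch of that conditional-probability difference (the demographic-parity gap); the predictions and the binary sensitive feature below are hypothetical:

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """|P(Y_hat = 1 | A = 0) - P(Y_hat = 1 | A = 1)| for a binary
    prediction Y_hat and a binary sensitive feature A."""
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    p0 = y_pred[sensitive == 0].mean()  # positive rate in group 0
    p1 = y_pred[sensitive == 1].mean()  # positive rate in group 1
    return abs(p0 - p1)

# Hypothetical model outputs and group memberships
y_pred    = [1, 0, 1, 1, 0, 1, 0, 0]
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(y_pred, sensitive))  # 0.5
```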


  6. Unfairness in generative AI: a well-known problem
    If the dataset is biased with respect to a sensitive
    feature, a generative model trained on that dataset
    may generate biased images.
    Note: The image on the right was not generated
    naturally but by insisting that the model create a
    cute-girl image.
    有名な絵をAIさんに美少女の絵だと言い張った。/I claimed that a world-famous picture is a
    picture of a girl to AI. https://youtu.be/MZUtn9EvoRo


  7. Gender bias
    Illustration-generation models tend to generate girls.
    This is caused by bias in the training dataset.
    珠洲ノらめる on Twitter: 「白州擬人化…!!!!! これは、AIちゃんすごいぞ!!!?(๑°⌓°๑)」
    ("Hakushu personified...!!! This AI is amazing!?")
    https://twitter.com/piyo_rameru/status/1615156487801950211


  8. Unfairness of the reward
    The developers of generative models can gain wealth
    The creators of the original images cannot get any money
    Getty Images is suing the creators of AI art tool Stable Diffusion for
    scraping its content - The Verge
    https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit


  9. Awful AI Award: 2022
    daviddao/awful-ai: Awful AI is a curated list to track current scary usages of AI - hoping to raise awareness https://github.com/daviddao/awful-ai


  10. Who gained from OpenAI?
    A huge investment from Microsoft to OpenAI
    No news about any comparable investment from Microsoft to the
    creators of the training dataset
    Inside the structure of OpenAI’s looming new investment from Microsoft and VCs |
    Fortune https://fortune.com/2023/01/11/structure-openai-investment-microsoft/


  11. Similar problem: Annotation
    I see companies get this wrong all the time: their in-house annotation
    teams are left in the dark about the impact that their work is having on
    daily or long-term goals. That's disrespectful to the people doing the
    work and will lead to poor motivation, high churn, and low quality
    annotations. So, it doesn't help anyone.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  12. Fairness: recap
    Before the rise of Stable Diffusion, the unfairness problem was mainly
    about bias in the dataset
    We are now facing a new unfairness problem: unfairness in the data
    collection process and the business model
    It is not enough to check for bias in the dataset; we also have to
    examine the business model behind the ML model.


  13. Technical
    Fairness
    Security <-
    Transparency


  14. Traditional problem of security
    Explaining and Harnessing Adversarial Examples https://arxiv.org/abs/1412.6572
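
    The cited paper introduces the fast gradient sign method (FGSM). A minimal PyTorch sketch of that attack; the toy model, the random "image", and the epsilon value are placeholders of my own:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """FGSM (Goodfellow et al.): one step in the sign of the input
    gradient, the direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # stay in a valid pixel range

# Hypothetical toy classifier and input
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)  # a fake 28x28 "image"
y = torch.tensor([3])         # its (claimed) label
x_adv = fgsm_attack(model, x, y)
```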


  15. Real Attackers Don't
    Compute Gradients
    real-world evidence suggests that
    actual attackers use simple tactics
    to subvert ML-driven systems, and
    as a result security practitioners
    have not prioritized adversarial ML
    defenses.
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML
    Research and Practice https://arxiv.org/abs/2212.14315


  16. Case 1. Facebook (1/2)
    An attacker attempts to spread spam on Facebook
    For example, they want to post a pornographic image with some text,
    which may lure a user to click on an embedded URL
    The attacker—aware of the existence of the ML system—tries to evade
    the detector by perturbing the content and/or changing their behavior
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  17. Case 1. Facebook (2/2)
    Multi-layered security:
    Automation: bot detection
    Access: deny illegal access
    Activity: spam detection
    Application: hate speech classifier, nudity detector
    The first three layers are standard security practice
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between
    Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  18. Case 2. Phishing webpage detection (1/2)
    Phishing webpage detector
    (image classification, input form detection, etc.)
    Attackers try to get past the phishing detector
    by masking, cropping, blurring, etc.
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between
    Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  19. Case 2. Phishing webpage detection (2/2)
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315


  20. Reality of ML security (my assumptions)
    A flood of poor-quality attempts
    Adversarial attacks seem too expensive for the attackers
    Even users without malicious intent tend to try to change the
    system's behavior
    "Please don't forget to like and subscribe to my channel!"
    Malicious users may not behave in the same manner as regular users
    For example: a huge number of attempts


  21. Machine Learning Lens (Best practice from AWS)
    MLSEC-10: Protect against data poisoning threats
    Protect against data injection and data manipulation that pollutes the
    training dataset. Data injections add corrupt training data that will
    result in incorrect model and outputs. Data manipulations change
    existing data (for example labels) that can result in inaccurate and
    weak predictive models. Identify and address corrupt data and
    inaccurate models using security methods and anomaly detection
    algorithms.
    MLSEC-10: Protect against data poisoning threats - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-
    lens/mlsec-10.html
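
    The last sentence points at anomaly detection. A minimal sketch, assuming scikit-learn and purely synthetic features, of screening rows before they reach the training set:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for incoming training features
rng = np.random.default_rng(0)
X_incoming = rng.normal(size=(1000, 8))

# Flag the most anomalous rows before they enter the training set
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X_incoming)  # -1 = anomaly, 1 = inlier
X_clean = X_incoming[labels == 1]
print(f"dropped {(labels == -1).sum()} suspicious rows")
```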


  22. Machine Learning Lens (Best practice from AWS)
    MLSEC-11: Protect against adversarial and malicious activities
    Add protection inside and outside of the deployed code to detect
    malicious inputs that might result in incorrect predictions.
    Automatically detect unauthorized changes by examining the inputs in
    detail. Repair and validate the inputs before they are added back to the
    pool.
    MLSEC-11: Protect against adversarial and malicious activities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-
    learning-lens/mlsec-11.html


  23. Metamorphic
    testing
    Add "noise" to
    test dataset
    Equivalent to
    augmentation
    Metamorphic Testing of Machine-Learning Based
    Systems | by Teemu Kanstrén | Towards Data
    Science
    https://towardsdatascience.com/metamorphic-
    testing-of-machine-learning-based-systems-
    e1fe13baf048
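
    A minimal sketch of that noise-based metamorphic relation: a small perturbation of the input should not flip the predicted class. The `predict` callable, noise scale, and pass criterion are assumptions:

```python
import numpy as np

def metamorphic_noise_test(predict, X, sigma=0.01, n_trials=5, seed=0):
    """Add small Gaussian noise and measure how often the predicted
    class matches the prediction on the clean input."""
    rng = np.random.default_rng(seed)
    baseline = predict(X)
    stable = 0.0
    for _ in range(n_trials):
        noisy = X + rng.normal(scale=sigma, size=X.shape)
        stable += np.mean(predict(noisy) == baseline)
    return stable / n_trials

# Hypothetical usage: rate = metamorphic_noise_test(model.predict, X_test)
# A rate well below 1.0 suggests the model is brittle to tiny perturbations.
```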


  24. Security: Recap
    Traditional ML security work mainly focuses on research settings
    Since machine learning systems are now deployed much more broadly,
    we can observe more realistic attacks
    Real attackers don't compute gradients
    Use augmentation & metamorphic testing to build robust (secure)
    models


  25. Technical
    Fairness
    Security
    Transparency <-


  26. Mechanical Turk as a holy grail of annotation
    crowdsourcing, benchmarking & other cool things https://www.image-net.org/static_files/papers/ImageNet_2010.pdf


  27. Reproducibility Crisis
    A well-known crisis in psychology
    The same problem exists in ML:
    we found 20 reviews across 17 scientific fields that find errors in
    a total of 329 papers that use ML-based science.
    Reproducibility workshop https://sites.google.com/princeton.edu/rep-
    workshop


  28. Too Good to Be True: Bots and Bad
    Data From Mechanical Turk
    I summarize my own experience
    with MTurk and how I deduced that
    my sample was—at best—only
    2.6% valid, by my estimate
    Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P.
    Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027


  29. Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022
    https://journals.sagepub.com/doi/10.1177/17456916221120027


  30. Eligibility criteria (529 -> 336)
    Target age: 18 - 24 years old
    MTurk filter of 18 to 25 years & additional question about age
    Consent quiz (336 -> 200)
    Quiz about their right to end participation, their right to confidentiality,
    and researchers’ ability to contact them (threshold: 2/3)
    Completion (200 -> 140)
    Some participants didn't finish the 45-minute survey
    Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022
    https://journals.sagepub.com/doi/10.1177/17456916221120027


  31. Attention checks (140 -> 124)
    Selecting another option even when the item reads "1 – Select this option"
    140 participants -> 124 participants
    Unrealistic response time (124 -> 77)
    Finished too fast (less than 20 min) or too long (several hours)


  32. Examination of qualitative responses (77 -> 14)
    Consistent answers to the following requests:
    "Who are you?"
    "Write ten sentences below, describing yourself as you are today."
    "Who will you be? Think about 1 week [1 year/10 years] from
    today. Write ten sentences below, describing yourself as you
    imagine you will be in 1 week"
    If a participant answers "a great man" to one question and "a great
    woman" to another, that participant fails this filter.


  33. Can transparency solve the Reproducibility Crisis?
    Transparency through documentation is not enough
    ImageNet describes how the dataset was collected and annotated;
    that is what transparency requires
    If we hire MTurk workers to annotate our dataset, we cannot expect the
    same labeled data even if we follow exactly the same workflow
    According to the paper, we cannot rely on the quality of the labels
    We should build a well-skilled team to solve the Reproducibility Crisis:
    In-house specialists
    Outsourcing: an annotation vendor


  34. Can outsourcing be the next MTurk?
    Outsourcing itself is nothing new.
    Outsourced workers are the fastest
    growing workforce for data annotation.
    Finally, not all outsourced workers
    should be considered low skilled!
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  35. Recap: Transparency
    In the data science field, many players create their datasets using
    Mechanical Turk
    The Reproducibility Crisis: if we hire MTurk workers, we may not be
    able to create a reproducible dataset
    We should consider building an in-house or outsourced team of specialists


  36. TOC
    MLOps in 2022
    Technical
    Measurement <-
    Process
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  37. Measurement
    Four keys
    Metrics for MLOps
    Capability of ML: ML Test Score
    Health check: Percentage of toil
    Business impact: Experiment & Agreement


  38. Four keys
    A well-defined set of metrics
    with thresholds
    Developed by
    DORA
    Use Four Keys metrics like change failure rate to
    measure your DevOps performance | Google Cloud
    Blog
    https://cloud.google.com/blog/products/devops-
    sre/using-the-four-keys-to-measure-your-devops-
    performance?hl=en
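
    A minimal sketch of how the four keys could be computed from a deployment log; the schema and numbers are hypothetical, and this is not DORA's own tooling:

```python
import pandas as pd

# Hypothetical deployment log: one row per production deploy
deploys = pd.DataFrame({
    "deployed_at": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-10"]),
    "caused_failure": [False, True, False],
    "lead_time_hours": [20.0, 48.0, 16.0],   # commit -> production
    "restore_hours": [None, 3.0, None],      # filled only for failures
})

weeks = 1  # the log above covers roughly one week
deployment_frequency = len(deploys) / weeks
lead_time_for_changes = deploys["lead_time_hours"].median()
change_failure_rate = deploys["caused_failure"].mean()
time_to_restore = deploys["restore_hours"].dropna().median()
print(deployment_frequency, lead_time_for_changes,
      change_failure_rate, time_to_restore)
```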


  39. DORA (DevOps Research &
    Assessment)
    DORA publishes research on
    DevOps based on its surveys


  40. Machine Learning Operations
    (MLOps): Overview, Definition,
    and Architecture
    A comprehensive research paper
    on MLOps based on a survey
    and interviews
    we furnish a definition of
    MLOps and highlight open
    challenges in the field.
    Machine Learning Operations (MLOps): Overview, Definition, and Architecture
    https://arxiv.org/abs/2205.02302


  41. Metrics for MLOps (1/3)
    No best practice yet
    The current definition of
    MLOps lacks a
    "Measurement" principle
    Machine Learning Operations (MLOps): Overview, Definition, and
    Architecture https://arxiv.org/abs/2205.02302


  42. Metrics for MLOps (2/3)
    Requirements for the metrics of an ML team:
    1. Observable: well-defined, easy to measure
    2. Comparable: thresholds (good / bad), benchmarks
    3. Actionable: able to understand what we should do


  43. Metrics for MLOps (3/3)
    Consider three metrics for three targets
    1. Capability of Team: ML Test Score
    2. Health check of relationship: Toils
    3. Business impact: Agreements & Experiments


  44. Capability of ML: ML Test Score
    Tests that measure an ML team
    28 tests
    0.5 points: run the test manually
    1.0 point: run the test automatically
    Right: average scores of
    Google ML teams
    The ML Test Score: A Rubric for ML Production Readiness and Technical Debt
    Reduction https://research.google/pubs/pub46555/
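
    A minimal sketch of the rubric as I read the paper: each of the 28 tests scores 0.5 if run manually and 1.0 if automated, points are summed within each of the four sections, and the final score is the minimum over sections. The filled-in values are hypothetical:

```python
MANUAL, AUTO, SKIPPED = 0.5, 1.0, 0.0

# Hypothetical results for the 7 tests in each of the 4 sections
sections = {
    "data":           [AUTO, AUTO, MANUAL] + [SKIPPED] * 4,
    "model":          [AUTO, MANUAL, MANUAL] + [SKIPPED] * 4,
    "infrastructure": [AUTO, AUTO, AUTO, MANUAL] + [SKIPPED] * 3,
    "monitoring":     [AUTO, MANUAL] + [SKIPPED] * 5,
}

section_scores = {name: sum(tests) for name, tests in sections.items()}
ml_test_score = min(section_scores.values())  # the weakest section dominates
print(section_scores)   # {'data': 2.5, 'model': 2.0, ...}
print(ml_test_score)    # 1.5
```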


  45. Health check: Percentage of toil
    Measuring and managing the
    amount of toil helps the MLOps
    team focus on its own tasks
    It also helps the team avoid spending
    too much effort on other teams' tasks
    Our SRE organization has an
    advertised goal of keeping
    operational work (i.e., toil) below
    50% of each SRE’s time.
    Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/


  46. Definition of the Toil
    Manual
    Repetitive
    Automatable
    Tactical
    No enduring value
    O(n) with service growth
    Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-
    toil/


  47. Business impact: Experiment & Agreement
    Define relationships between ML metrics and business metrics
    Measure them by experiments
    Experimental Design
    A/B Testing
    Single case experiment


  48. Business impact doesn't satisfy the requirements
    Requirements   ML test score   toil (%)   Business impact
    Observable     ✓               ✓
    Comparable     ✓               ✓
    Actionable     ✓               ✓
    Experiments are required to define the business metric, compare the
    impact with a baseline, and decide what we should do


  49. A/B Testing
    A/B testing is based on the
    statistical RCT (randomized
    controlled trial) introduced
    in 1948 by Hill (left side)
    We are still debating the
    methodology of A/B testing
    (including causal inference)
    Use of randomisation in the Medical Research Council’s clinical trial of
    streptomycin in pulmonary tuberculosis in the 1940s - PMC
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1114162/
    Trustworthy Online Controlled Experiments: 9781108724265: Computer
    Science Books @ Amazon.com https://www.amazon.com/dp/1108724264
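
    For example, the classic two-proportion z-test behind many A/B conversion comparisons needs only the standard library; the counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: control vs. treatment conversions
z, p = two_proportion_ztest(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z = 2.23, p = 0.026
```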


  50. Single case experiment (1/2)
    One of the quasi-experimental
    designs (experiments
    without a control group)
    Example: DeepMind AI
    reduces energy used for
    cooling Google data
    centers by 40%
    DeepMind AI reduces energy used for cooling Google data centers by 40%
    https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-
    energy-used-for/


  51. Single case experiment (2/2)
    Recommended: A/B/A
    design or A/B/A/B design
    A: control
    B: treatment
    The image on the right shows an A/B/A design
    Introduce the new feature (or
    service) within a limited time
    span
    DeepMind AI reduces energy used for cooling Google data centers by 40%
    https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-
    energy-used-for/
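
    A toy simulation of an A/B/A schedule, comparing the metric level in the treatment window against the surrounding control windows; all numbers are synthetic:

```python
import numpy as np

# Two weeks of control (A), two of treatment (B), two of control again
phases = np.array(["A"] * 14 + ["B"] * 14 + ["A"] * 14)
rng = np.random.default_rng(1)
metric = np.where(phases == "B",
                  rng.normal(0.60, 0.05, phases.size),   # treatment level
                  rng.normal(0.50, 0.05, phases.size))   # control level

a_mean = metric[phases == "A"].mean()
b_mean = metric[phases == "B"].mean()
print(f"A: {a_mean:.3f}  B: {b_mean:.3f}  lift: {b_mean - a_mean:+.3f}")
# If the metric returns to the A level after B ends, the change is more
# plausibly caused by the treatment than by an underlying trend.
```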


  52. TOC
    MLOps in 2022
    Technical
    Measurement
    Process <-
    MLOps in 2023
    Measurement
    Process
    Culture
    Regulations & Standards


  53. Process
    As we saw in the technical section
    (transparency), the annotation
    process is critically important.
    Section 7 of Human-in-the-Loop
    Machine Learning is a great resource
    for understanding annotation best
    practices.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-
    machine-learning


  54. Three types of workforce
    1. Crowdsourcing
    2. BPO (outsourcing)
    3. In-house specialists
    If the annotation task requires deep
    expertise and high confidence,
    combine BPO & in-house specialists.
    Human-in-the-Loop Machine Learning fig 8.21
    https://www.manning.com/books/human-in-the-loop-machine-learning


  55. Three principles (1/2)
    Salary - Fair pay
    Annotators should be paid as much as other
    workers, including data scientists
    Job Security - Pay regularly
    (data scientists should) structure the amount of
    work available to be as consistent as possible
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  56. Three principles (2/2)
    Ownership - Provide transparency
    The best way to make any repetitive task
    interesting is to make it clear how important that
    work is.
    An annotator who spends 400 hours annotating
    data that powers a new application should feel
    as much ownership as an engineer who spends
    400 hours coding it.
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  57. Three tips
    In-house experts: Always run in-house
    annotation sessions
    Outsourced workers: Talk to your outsourced
    workers
    Crowd sourcing: Create a path to secure work
    and career advancement
    Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning


  58. TOC
    MLOps in 2022
    Technical
    Measurement
    Process
    MLOps in 2023 <-
    Measurement
    Process
    Culture
    Regulations & Standards


  59. Challenges of A/B Testing
    Measuring network
    effects <-
    Managing real-world
    dynamism
    Supporting diverse lines of
    business
    Supporting our culture of
    experimentation
    Challenges in Experimentation. In this post, we provide an overview of… | by
    John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-
    experimentation-be9ab98a7ef4


  60. Measuring network effects (2/2)
    Three kinds of randomization:
    (a) alternating time intervals (one
    hour)
    (b) randomized coarse and fine
    spatial units
    (c) randomized user sessions
    The best randomization for measuring
    the effect of Prime Time was (a)
    Challenges in Experimentation. In this post, we provide an overview of… | by John Kirn | Lyft
    Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
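
    A sketch of randomization scheme (a), assigning traffic by alternating one-hour intervals; the hour granularity and the salt are my own illustration, not Lyft's implementation:

```python
from datetime import datetime

def assign_variant(event_time: datetime, salt: int = 0) -> str:
    """Alternate treatment and control every other hour; a salt
    shifts the schedule so experiments don't share boundaries."""
    hours_since_epoch = int(event_time.timestamp() // 3600)
    return "treatment" if (hours_since_epoch + salt) % 2 else "control"

print(assign_variant(datetime(2022, 6, 1, 10, 30)))  # one variant
print(assign_variant(datetime(2022, 6, 1, 11, 30)))  # the other variant
```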


  61. PROCESS (1/3)
    MLOE-02: Establish ML roles and responsibilities
    Establish cross-functional teams with roles and responsibilities
    An ML project typically consists of multiple roles, with defined tasks
    and responsibilities for each. In many cases, the separation of roles
    and responsibilities is not clear and there is overlap.
    MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-
    lens/mloe-02.html


  62. PROCESS (2/3)
    13 ML roles defined by AWS
    MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens
    https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html


  63. PROCESS (3/3)
    Experiment tracking
    Integrate business metrics
    and ML metrics into one
    dashboard
    Discussion among all stakeholders
    based on a single source of truth
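
    One possible shape for this, using MLflow purely as an example tracker; the run name and metric values are made up:

```python
import mlflow

# Log ML metrics and business metrics from the same experiment into one
# run, so the tracking UI can serve as the single source of truth.
with mlflow.start_run(run_name="reco-v2-ab-test"):
    mlflow.log_param("model_version", "v2")
    # ML metrics
    mlflow.log_metric("auc", 0.873)
    mlflow.log_metric("ndcg_at_10", 0.412)
    # Business metrics from the same experiment window
    mlflow.log_metric("conversion_rate", 0.065)
    mlflow.log_metric("revenue_per_user", 1.32)
```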


  64. CULTURE: Netflix
    Netflix has a strong culture of
    experimentation, and results
    from A/B tests, or other
    applications of the scientific
    method, are generally expected
    to inform decisions about how
    to improve our product and
    deliver more joy to members.
    Netflix: A Culture of Learning https://netflixtechblog.com/netflix-a-culture-of-
    learning-394bc7d0f94c


  65. AI Act: Regulations & Standards of AI
    AI Act: a European legal effort to
    establish AI law
    Similar to the GDPR: global reach
    (not only within the EU)
    The second half of 2024 is the
    earliest the regulation could
    become applicable
    EUのAI規制法案の概要 (Overview of the EU's proposed AI regulation) https://www.soumu.go.jp/main_content/000826707.pdf
    Regulatory framework proposal on artificial intelligence https://digital-
    strategy.ec.europa.eu/en/policies/regulatory-framework-ai


  66. Recap
    So much progress in the technical field; for example, many players
    now build their own ML pipelines.
    That technical progress has unveiled problems of machine
    learning, especially fairness, security, and transparency.
    Even though it is a well-known field with a long history, we still have
    unsolved problems in the statistical measurement of ML impact.
    This year, governments are planning to define standards and
    regulations (e.g., the EU AI Act). Much is still unclear, but we must keep
    watching those activities.
