
How_to_create_MLOps_Team.pdf

Daisuke Akagawa (Akasan)

April 22, 2026

Transcript

  1. Profile. Name: Daisuke AKAGAWA. Role: ML Engineer at 3-shake.
     Interests: machine learning / artificial intelligence / GenAI, cloud infrastructure / computing. Personal activities: developing applications, writing tech blogs.
  2. Agenda:
     01 What is MLOps? Understand what MLOps is.
     02 Essential capabilities: what abilities are required?
     03 Is ML knowledge required? Don't be afraid to become a member of an MLOps team.
     04 Initial organization design: let's start small.
     05 Grow a team: refinement through experience.
  3. Overview. You can learn:
     • A common language for discussing MLOps team building
     • Criteria for determining the "first step" that aligns with your company's current phase
     • Specific actions you can start tomorrow
     Not covered:
     • Detailed explanations of tools
     • Design patterns of specific cloud vendors
     • Guidance for existing MLOps organizations, especially large teams
  4. MLOps components:
     • Data handling: fetch and curate data; feature engineering
     • Train / Val model: train the model on data; validate performance
     • Deploy / Monitoring: provide your model; detect drift / degradation
     • CI / CD / CT: keep your model up to date
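The four component groups above form a sequential loop. As a minimal sketch (the `run_pipeline` helper and stage names are illustrative, not from the deck), the groups can be chained as ordered stages:

```python
def run_pipeline(payload, stages):
    """Run the component groups in order; each stage transforms the payload.

    `stages` is a list of (name, callable) pairs. This is a stand-in for a
    real orchestrator (Airflow, Kubeflow Pipelines, etc.).
    """
    history = []
    for name, stage in stages:
        payload = stage(payload)   # hand the evolving artifact to the next stage
        history.append(name)
    return payload, history
```

In a real system each stage would be a containerized job; the point is only that the groups depend on each other in this order.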
  5. Handle data before feeding it to the model:
     • Data ingestion: ingest data into the system
     • Data validation: handle missing values and outliers
     • Feature engineering: transform raw data into valuable input
     • Data versioning: version your data for reproducibility
     • Data labeling: assign ground truth to raw data
     • Privacy and security: protect data and ensure compliance
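A minimal, stdlib-only sketch of two of these steps: data validation (missing values and z-score outliers) and data versioning via a content hash. The function names and the z-score threshold are assumptions for illustration, not from the deck:

```python
import hashlib
import json
import statistics

def validate_rows(rows, field, z_thresh=3.0):
    """Split rows into clean and rejected (missing values and z-score outliers).

    `rows` is a list of dicts; `field` is the numeric column to check.
    Hypothetical helper: the deck names the steps, not an implementation.
    """
    present = [r for r in rows if r.get(field) is not None]
    missing = [r for r in rows if r.get(field) is None]
    values = [r[field] for r in present]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0   # avoid division by zero
    clean = [r for r in present if abs((r[field] - mean) / stdev) <= z_thresh]
    outliers = [r for r in present if abs((r[field] - mean) / stdev) > z_thresh]
    return clean, missing + outliers

def dataset_fingerprint(rows):
    """Content hash of a dataset, usable as a version id for reproducibility."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]
```

Storing the fingerprint alongside each training run gives the reproducibility the versioning bullet asks for: the same data always hashes to the same id.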
  6. Train and evaluate the model:
     • Data splitting: split data into train / val / test datasets
     • Design model architecture: from scratch, from a pre-trained model, etc.
     • Train model: fit your model to the training data
     • Evaluate model: evaluate and tune hyperparameters on the val dataset
     • Validate business KPI: go/no-go decision for production
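The splitting and validation bullets can be sketched as follows; `split_dataset`, `pick_hyperparameter`, and the fraction defaults are hypothetical helpers (in practice you would reach for scikit-learn's `train_test_split` and a proper search):

```python
import random

def split_dataset(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve out train / val / test slices (illustrative ratios)."""
    data = list(data)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

def pick_hyperparameter(train, val, candidates, fit, score):
    """Fit one model per candidate on `train`; keep the best by `score` on `val`.

    `fit` and `score` are caller-supplied callables (assumptions, not a real API).
    The test set is deliberately untouched here: it is reserved for the final check.
    """
    return max(candidates, key=lambda c: score(fit(train, c), val))
```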
  7. Deploy and monitor your ML environment:
     • Infra management: maintain the model execution environment
     • Serve model: provide model accessibility
     • Observability: track health and data integrity
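A sketch combining the "serve model" and "observability" bullets: a request handler that serves predictions while logging latency and input integrity. Illustrative only; a real deployment would sit behind a serving framework (FastAPI, KServe, etc.), and all names here are assumptions:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model-server")

def make_handler(model, feature_names):
    """Wrap a model callable in a handler that also emits observability signals."""
    def handle(request_body: str) -> str:
        start = time.perf_counter()
        payload = json.loads(request_body)
        # Data-integrity check: reject requests with missing features
        missing = [f for f in feature_names if f not in payload]
        if missing:
            log.warning("bad request, missing features: %s", missing)
            return json.dumps({"error": f"missing features: {missing}"})
        prediction = model([payload[f] for f in feature_names])
        latency_ms = (time.perf_counter() - start) * 1000
        log.info("prediction served in %.2f ms", latency_ms)   # health signal
        return json.dumps({"prediction": prediction,
                           "latency_ms": round(latency_ms, 2)})
    return handle
```

The latency log line and the missing-feature warning are exactly the two signals the slide asks the team to track: service health and data integrity.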
  8. Keep your ML service up to date (CI / CD / CT):
     • Code quality check: keep your code production-ready
     • Packaging models: containerize artifacts and dependencies
     • Keep training the model: triggered by data drift, concept drift, etc.
     • Validate pipeline: confirm the pipeline works properly
     • Deploy: serve models as prediction services
     • Manage metadata: track the history of trained models
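The "manage metadata" bullet can be sketched as a tiny in-memory registry; in practice a tool such as MLflow plays this role, and the class and method names here are assumptions for illustration:

```python
import datetime
import hashlib
import json

class ModelRegistry:
    """Minimal stand-in for a model registry: records which data version and
    parameters produced each trained model, and with what metrics."""

    def __init__(self):
        self._runs = []

    def log_run(self, model_name, data_version, params, metrics):
        run = {
            "run_id": hashlib.sha1(json.dumps(
                [model_name, data_version, params],
                sort_keys=True).encode()).hexdigest()[:8],
            "model": model_name,
            "data_version": data_version,
            "params": params,
            "metrics": metrics,
            "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        self._runs.append(run)
        return run["run_id"]

    def history(self, model_name):
        """All runs for a model, in training order."""
        return [r for r in self._runs if r["model"] == model_name]

    def best(self, model_name, metric):
        """The run with the highest value of `metric` (higher is better)."""
        return max(self.history(model_name), key=lambda r: r["metrics"][metric])
```

Tying each run to a data version is what makes CT auditable: when a drift trigger retrains the model, the registry shows what changed and why.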
  9. The three pillars of MLOps:
     Organization: collaboration protocol between ML engineers and other engineers; adjusting expectations with the business side; governance & compliance; psychological safety.
     Process (Operation): release criteria; incident response and rollback; documentation; data/feature/model review.
     Technology (Fundamental): versioning; reproducible pipelines; experiment tracking; monitoring & alerting; infrastructure.
  10. Checklist for identifying missing capabilities:
     01 Where did accidents and setbacks occur? Incidents, release failures, retraining abandoned because it was impossible.
     02 What factors make model development complicated? Lack of reproducibility; inability to see production logs.
     03 Which responsibilities fall between roles ("between whom is the ball lying")? Solve with structure what technology cannot.
  11. Rather than depth of ML knowledge, what matters is the breadth to "talk about your company's model as if it were your own."
  12. Requirements for MLOps team members:
     ML Engineer. Must know: deep insight into ML (creating and evaluating models, etc.); managing experiment results; writing code for serving, not notebooks. Nice to have: infrastructure experience; distributed computing technology.
     Non-ML Engineer. Must know: container technologies (Docker and Kubernetes); how to build CI/CD pipelines; experience building infrastructure with IaC. Nice to have: knowledge of machine learning; ML-specific security and compliance.
  13. Situations where ML knowledge is required:
     ML-specific: model performance degradation; reviewing ML-specific metrics; translating requirements from data scientists into "foundation design"; data / concept drift.
     Not ML-specific: building infrastructure; IaC; security; pipeline maintenance; SRE-based operational design.
  14. Common misconceptions:
     01 "An MLOps team is a group of ML engineers." If only ML engineers build MLOps, the operational design will be weak and will break in production.
     02 "It can be built with just SRE/SWE." This overlooks ML-specific issues such as drift.
     03 "First, everyone must undergo ML training." Learning by doing, rather than waiting for classroom lectures, is the most practical approach.
  15. Situation 1: Response stopped due to excessive access.
     Problem: Responses stopped; checking the logs and metrics revealed a GPU memory shortage caused by excessive access.
     Solution: The model's memory usage may be at fault, but this can potentially be addressed by scaling the GPU out and up.
  16. Situation 2: CTR dropped 20% week over week.
     Problem: The CTR of the recommendation model decreased by 20% compared to the previous week.
     Solution without an ML engineer: the only conclusion might be "the inference API latency is normal, so it is not infrastructure-related."
  17. Situation 2: CTR dropped 20% week over week.
     Problem: The CTR of the recommendation model decreased by 20% compared to the previous week.
     Solution with an ML engineer, who can hypothesize: 1. Drift is occurring. 2. Inputs contain data not present in the training dataset. 3. There is a fault in the label collection pipeline. 4. There are problems with data preprocessing.
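Hypothesis 1 (drift) can be checked quantitatively. A common technique, not named in the deck, is the Population Stability Index (PSI) between the training-time feature distribution and live traffic; the thresholds in the docstring are an industry rule of thumb, not the deck's:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a live sample.

    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")   # catch out-of-range values

    def frac(values):
        counts = [0] * bins
        for v in values:
            for i in range(bins):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # floor at a tiny value so the log ratio is defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An ML engineer would compute this per feature and per prediction score; a high PSI on the recommendation inputs points at hypothesis 1 or 2, while a low PSI pushes the investigation toward the label pipeline or preprocessing.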
  18. The first step isn't recruitment or infrastructure development, but rather focusing on stakeholder mapping, initial members, and decision-making criteria.
  19. Who are the stakeholders? The process addresses Why → What → How → Risk:
     • Why: management layer, business departments
     • What: ML engineers, data scientists, data engineers
     • How: SREs, platform engineers
     • Risk: legal, compliance, IT systems, security
  20. Let's start small. Initial roles:
     • Data Engineer: responsible for handling data (ingest, preprocess, store)
     • Infra Engineer: builds infrastructure to handle data, develop models, and serve models
     • ML Engineer: develops performant models based on the data
     • Ops Engineer: production monitoring, alerting (drift/latency), and on-call/incident response
     Staffing options:
     • Internal members: ◎ culture building, business knowledge, low cost; △ possibly limited ML experience
     • Mainly external recruitment: ◎ expertise and best practices; △ time, cost, and adoption risk
  21. Clearly define decision-making criteria (R: Responsible, A: Accountable, C: Consulted, I: Informed):
     Activity                                    | ML Eng | Data Eng | Infra Eng | Ops Eng | Stakeholders
     Model development                           | R/A    | C        | I         | I       | C
     Build training pipeline                     | R/A    | R/A      | C         | C       | I
     Production deployment                       | R      | I        | C         | C       | A
     Monitoring / Operation                      | C      | R        | R         | R/A     | I
     Incident primary response                   | C      | C        | C         | R/A     | I
     Decision to keep serving or withdraw model  | C      | I        | I         | I       | R/A
     Data quality and features                   | C      | R/A      | I         | I       | R
     The "deployment decision" and "withdrawal decision" must always sit with the business unit.
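A RACI matrix like this can also be kept as data, so ownership questions are answerable programmatically. The role abbreviations and the helper below are my encoding of the slide's matrix, not part of the deck:

```python
# RACI matrix from the slide. Keys: ML = ML Engineer, DE = Data Engineer,
# Infra = Infra Engineer, Ops = Ops Engineer, Biz = Stakeholders.
RACI = {
    "Model development":             {"ML": "R/A", "DE": "C",   "Infra": "I", "Ops": "I",   "Biz": "C"},
    "Build training pipeline":       {"ML": "R/A", "DE": "R/A", "Infra": "C", "Ops": "C",   "Biz": "I"},
    "Production deployment":         {"ML": "R",   "DE": "I",   "Infra": "C", "Ops": "C",   "Biz": "A"},
    "Monitoring / Operation":        {"ML": "C",   "DE": "R",   "Infra": "R", "Ops": "R/A", "Biz": "I"},
    "Incident primary response":     {"ML": "C",   "DE": "C",   "Infra": "C", "Ops": "R/A", "Biz": "I"},
    "Keep serving / withdraw model": {"ML": "C",   "DE": "I",   "Infra": "I", "Ops": "I",   "Biz": "R/A"},
    "Data quality and features":     {"ML": "C",   "DE": "R/A", "Infra": "I", "Ops": "I",   "Biz": "R"},
}

def accountable_for(activity):
    """Return the roles holding the 'A' for an activity, i.e. the final owner."""
    return [role for role, code in RACI[activity].items() if "A" in code]
```

Note how the encoding makes the slide's closing rule checkable: both the deployment decision and the withdrawal decision resolve to the business side.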
  22. Create a flagship use case:
     01 Don't build the entire company infrastructure from scratch. It will almost certainly fail; take small, steady steps rather than attempting a large-scale construction project all at once.
     02 Utilization first, then foundation. Develop the foundation through actual operation.
     03 MLOps can also be considered a product. Don't create things that aren't used; deliver value through ML models.
  23. How to choose a flagship model. Selection criteria:
     • Business value can be expressed in numbers: focus on sales/cost/man-hours rather than accuracy.
     • Data already exists: "we'll collect it from now on" is a long process.
     • Even if you fail, it won't be fatal: unsuitable for situations directly tied to human life or compliance.
     You should NOT choose:
     • Utilizing internal chat / turning everything into an LLM
     • A perfect general-purpose prediction platform
     • Something chosen only because it's popular at other companies
  24. Top-down: start with a clear goal.
     Strengths: quick decision making; easy to attract investment and talent.
     Risks: discrepancy with on-site needs; unused MLOps infrastructure; neglecting the foundation due to pressure for short-term results.
     Steps:
     01 Translate goals into business KPIs (sales/costs).
     02 Run 2-3 use cases and extract common issues.
     03 The central team prioritizes hands-on accompaniment of each use case.
     04 Quantify milestones at 3 → 6 → 12 months.
  25. Bottom-up: prepare for the future.
     Strengths: driven by on-site challenges, eliminating waste; can start small; high probability of being used.
     Risks: stagnation due to insufficient budget and authority; dependence on individual effort (running on one person's enthusiasm); hard to see, so not appreciated by the company.
     Steps:
     01 Start by visualizing small improvements.
     02 Pick up the problems ML engineers face.
     03 Measure results with numbers and stories.
     04 Expand your network: study groups, Slack, etc.
  26. Also make a list of things you won't do:
     01 Building your learning infrastructure from scratch: managed services are sufficient.
     02 Company-wide standardization: there are no targets for horizontal expansion yet.
     03 Perfect documentation: first, create the foundation for what works.
     04 Completely copying another company's example: it does not align with your company's own problems.
  27. Summary:
     Top-down. Reason: management decisions and large-scale projects. Speed: fast, with a risk of divergence. Primary risk: unused infrastructure. First step: business-KPI translation + support. Recommended organizational structure: independent division or under the Platform umbrella. Key to success: bridging the gap with the field.
     Bottom-up. Reason: problems encountered on site. Budget: almost none / volunteers. Speed: slow, but close to the field. Primary risk: dependence on individuals and stagnation. First step: visualizing small improvements. Recommended organizational structure: directly under the business unit or the data organization. Key to success: visualizing results + building a network.
  28. Key takeaways:
     01 MLOps is an organizational problem. Code, data, and models cannot be handled without collaboration across multiple professions, and the problem cannot be solved with tools alone.
     02 Organizational design begins with "a single sheet of paper." The first task is to clearly define the stakeholders, the initial members, and the RACI framework.
     03 Grow it with the first use case.
  29. Please discuss the following within your company:
     01 Conflict over organizational structure (platform subsidiary vs. direct report to the business unit)
     02 How to involve management and business units (when budgets and proposals are not approved)
     03 Unable to assemble the initial team (internal transfers vs. external recruitment)
     04 How to get started with a small team (up to 5 people)