Essential Capabilities: What abilities are required?
03 Is ML knowledge required? Don't be afraid to become a member of an MLOps team.
04 Initial organization design: Let's start small.
05 Grow a team: Refinement through experience
discussing MLOps team building
• Criteria for determining the "first step" that aligns with your company's current phase
• Specific actions you can start tomorrow
Not covered…
• Detailed explanations of tools
• Design patterns for specific cloud vendors
• Not suitable for existing MLOps organizations, especially large teams
handling
Deploy / Monitoring: Provide your model; detect drift / degradation
Train / Val model: Train model based on data; validate performance
CI / CD / CT: Keep your model up to date
from into the system
Data validation: Handle missing values and outliers
Feature engineering: Transform raw data into valuable input
Data versioning: Version your data for reproducibility
Data labeling: Assign ground truth to raw data
Privacy and security: Protect data and ensure compliance
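As a minimal sketch of the "Data validation" step above (handling missing values and outliers), the following uses only the Python standard library. The function name `validate_column`, the median fill strategy, and the `lower`/`upper` limits are illustrative assumptions, not something the slides prescribe.

```python
from statistics import median


def validate_column(values, lower, upper):
    """Fill missing values with the median and flag out-of-range entries.

    `lower` and `upper` are hypothetical business limits chosen for this
    example; a real pipeline would take them from a data contract or schema.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-missing values")
    fill = median(present)
    # Replace missing entries with the median of the observed values.
    cleaned = [fill if v is None else v for v in values]
    # Record the positions of values outside the allowed range.
    outliers = [i for i, v in enumerate(cleaned) if not (lower <= v <= upper)]
    return cleaned, outliers


ages = [34, None, 29, 41, 250, 38]  # made-up sample data
cleaned, outliers = validate_column(ages, lower=0, upper=120)
print(cleaned)   # missing value replaced by the median
print(outliers)  # indices of out-of-range values
```

Flagging outliers rather than silently dropping them keeps the decision (fix, drop, or escalate to the data owner) visible to whoever is responsible for data quality.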
train / val / test dataset
Evaluate model: Evaluate / fix hyperparameters based on val dataset
Design model architecture: From scratch / pre-trained model etc.
Train model: Fit your model to training data
Validate business KPI: Go/No-go decision for production
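The split and the "fix hyperparameters based on val dataset" step can be sketched as follows. The split ratios, the seed, and the validation scores are made-up values for illustration; `split_dataset` is a hypothetical helper, not a library function.

```python
import random


def split_dataset(rows, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train / val / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_ratio)
    n_val = int(len(rows) * val_ratio)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))

# Hypothetical validation scores per learning rate; the hyperparameter is
# chosen on the validation set only, and the test set stays untouched.
val_scores = {0.01: 0.81, 0.1: 0.86, 1.0: 0.79}
best_lr = max(val_scores, key=val_scores.get)
print(len(train), len(val), len(test), best_lr)
```

Keeping the test set out of hyperparameter selection is what makes the later go/no-go decision trustworthy.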
your code for production
Packaging models: Containerize artifacts and dependencies
Keep training model: Triggered by data drift, concept drift etc.
Validate pipeline: Confirm the pipeline works properly
Deploy: Serve models as prediction services
Manage metadata: Track the history of trained models
CI CT
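The "Keep training model" item (continuous training triggered by data drift or concept drift) can be sketched as a simple decision function. The `MonitoringSnapshot` fields and both thresholds are assumptions made for this example; real values depend on the model and the business.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    drift_score: float        # assumed input-distribution drift statistic
    live_accuracy: float      # accuracy on recently labeled production data
    baseline_accuracy: float  # accuracy measured at release time


def should_retrain(s, drift_threshold=0.2, max_accuracy_drop=0.05):
    """Return True when either drift signal crosses its (hypothetical) threshold."""
    data_drift = s.drift_score > drift_threshold
    concept_drift = (s.baseline_accuracy - s.live_accuracy) > max_accuracy_drop
    return data_drift or concept_drift


print(should_retrain(MonitoringSnapshot(0.05, 0.90, 0.91)))  # stable
print(should_retrain(MonitoringSnapshot(0.30, 0.90, 0.91)))  # inputs shifted
print(should_retrain(MonitoringSnapshot(0.05, 0.80, 0.91)))  # accuracy dropped
```

In practice a function like this would gate a pipeline run rather than retrain directly, so that pipeline validation still happens before deployment.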
make model development complicated? Incident / release failure / gave up because the model couldn't be retrained. Lack of reproducibility / unable to see production logs.
Whose court is the ball in? Solve with structure what tech cannot.
Checklist for identifying missing capabilities
Deep insight into ML: creating models, evaluating models, etc.
• Manage experiment results
• Write code for serving, not just notebooks
Nice-to-have
• Experience with infrastructure
• Distributed computing technology
Non-ML Engineer
Must know
• Container technologies: Docker and Kubernetes (k8s)
• How to build CI/CD pipelines
• Experience building infrastructure with IaC
Nice-to-have
• Knowledge of machine learning
• ML-specific security and compliance
performance degradation
• Review ML-specific metrics
• Translate requirements from DS into a "foundation design"
• Data / concept drift
Not ML specific
• Build infrastructure
• IaC
• Security
• Pipeline maintenance
• SRE-based operational design
01 MLOps team = group of ML engineers: operational design will be weak and it will break in production.
02 Can be built with just SRE/SWE: Overlooking ML-specific issues (such as drift).
03 First, everyone will undergo ML training: Learning by doing, rather than waiting for classroom lectures, is the most practical solution.
metrics, it was found that there was a shortage of GPU memory due to excessive access.
Solution: There may be issues with how the model uses memory, but this could potentially be addressed by scaling out and scaling up the GPU.
Situation 1: Response stopped due to excessive access
CTR of the recommendation model decreased by 20% compared to the previous week.
Solution (without ML engineer): Without an ML engineer, the only conclusion might be, "The inference API latency is normal and not infrastructure-related."
CTR of the recommendation model decreased by 20% compared to the previous week.
Solution (with ML engineer):
1. Drift is occurring
2. Data not present in the training dataset
3. Fault in the label collection pipeline
4. Problems with data preprocessing
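One way an ML engineer might check the first hypothesis (drift) is to compare the binned distribution of a feature at training time with the same feature in production. The sketch below uses the Population Stability Index (PSI); the bin proportions are made-up numbers, and the slides do not say which drift metric the team actually uses.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are lists of bin proportions over the same bins.
    `eps` guards against log(0) for empty bins.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


train_dist = [0.25, 0.25, 0.25, 0.25]  # hypothetical training-time bins
live_dist = [0.10, 0.20, 0.30, 0.40]   # hypothetical production bins
score = psi(train_dist, live_dist)
print(round(score, 3))
```

A PSI of zero means the two distributions match; larger values mean a larger shift. Teams commonly set their own alert threshold, so any cutoff should be treated as a starting point to tune rather than a fixed rule.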
What: ML Engineer, Data Scientist, Data Engineer
How: SRE, Platform Engineer
Risk: Legal, Compliance, IT Systems, Security
The process involves addressing Why → What → How → Risk.
handling data, such as ingesting, preprocessing, and storing data
Build infrastructure to handle data, develop models, and serve models
Develop performant models based on data
Production monitoring, alerting (drift/latency), and on-call/incident response
Let's start small
Internal members: ◎ Culture building, business knowledge, low cost △ Possibility of limited ML experience
Mainly external recruitment: ◎ Expertise and best practices △ Time, cost, and adoption risk
| … | … | Infra Engineer | Ops Engineer | Stakeholders
Model development | R/A | C | I | I | C
Build training pipeline | R/A | R/A | C | C | I
Production deployment | R | I | C | C | A
Monitoring / Operation | C | R | R | R/A | I
Incident primary response | C | C | C | R/A | I
Decision on whether to keep serving the model or withdraw it | C | I | I | I | R/A
Data Quality and Features | C | R/A | I | I | R
R: Responsible, A: Accountable, C: Consulted, I: Informed
"Deployment Decision" and "Withdrawal Decision" must always be placed in the business unit.
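A RACI matrix like the one above can be kept as plain data and checked automatically. The sketch below encodes the rows from the slide and flags any activity that does not have exactly one Accountable role, which is a common RACI convention rather than a rule stated in the slides. The first two role names are not recoverable from the slide, so `role_1` and `role_2` are placeholders.

```python
# Column order follows the slide; the first two role names are placeholders
# because they are missing from the source.
ROLES = ["role_1", "role_2", "Infra Engineer", "Ops Engineer", "Stakeholders"]

RACI = {
    "Model development": ["R/A", "C", "I", "I", "C"],
    "Build training pipeline": ["R/A", "R/A", "C", "C", "I"],
    "Production deployment": ["R", "I", "C", "C", "A"],
    "Monitoring / Operation": ["C", "R", "R", "R/A", "I"],
    "Incident primary response": ["C", "C", "C", "R/A", "I"],
    "Keep serving or withdraw model": ["C", "I", "I", "I", "R/A"],
    "Data Quality and Features": ["C", "R/A", "I", "I", "R"],
}


def accountable_roles(cells):
    """Return the roles whose cell contains 'A' (Accountable)."""
    return [role for role, cell in zip(ROLES, cells) if "A" in cell.split("/")]


def find_ambiguous(raci):
    """Return activities that do not have exactly one Accountable role."""
    result = {}
    for task, cells in raci.items():
        owners = accountable_roles(cells)
        if len(owners) != 1:
            result[task] = owners
    return result


print(find_ambiguous(RACI))
print(accountable_roles(RACI["Production deployment"]))
```

Shared accountability may be intentional in a small team, so a flagged row is a prompt for discussion, not necessarily an error.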
01 Don't build the entire company infrastructure from scratch: … fail. It's better to take small, steady steps rather than attempting a large-scale construction project all at once.
02 Not "foundation → utilization": Develop the foundation through actual operation.
03 MLOps can also be considered a product: Don't create things that aren't used to deliver value to ML models.
value can be expressed in numbers: Rather than accuracy, focus on sales/cost/man-hours.
Data already exists: "Collecting it from now on" will be a long process.
Even if you fail, it won't be fatal: Unsuitable for situations directly related to human life or compliance.
You should NOT choose
• Utilizing internal chat / turning everything into an LLM
• A perfect general-purpose prediction platform
• Because it's popular at other companies
• Quick decision making
• Easy to attract investment and talent
• Discrepancy with on-site needs
• Unused MLOps infrastructure
• Neglecting the foundation due to pressure for short-term results
01. Translate goals into business KPIs (sales/costs)
02. 2 to 3 use cases → Extract common issues
03. The central team prioritizes working alongside the teams it supports.
04. Quantifying milestones: 3 → 6 → 12 months
Driven by on-site challenges, eliminating waste
• Can be started small
• High probability of being used
• Stagnation due to insufficient budget and authority
• Dependence on individual effort (running on the enthusiasm of one person)
• Difficult to see and not appreciated by the company
01. Start by visualizing small improvements.
02. Picking up on the problems faced by ML engineers
03. Results are measured by numbers and stories.
04. Expanding your network (study groups, Slack, etc.)
01 Building your training infrastructure from scratch: Managed services are sufficient.
02 Company-wide standardization: There are no targets for horizontal expansion yet.
03 Perfect documentation: First, create the foundation for what works.
04 Completely copying another company's example: This does not align with the company's own problems.
projects | Problems encountered on site
Budget: Almost none / Volunteers
Speed: Fast but with a risk of divergence | Slow but close to the field
Primary risk: Unused infrastructure | Dependence on individuals and stagnation
First step: Business KPI translation + support | Visualizing small improvements
Recommended organizational structure: Independent division or under the Platform umbrella | Directly under the business unit or data organization
Key to success: Bridging the gap with the field | Visualizing results + building a network
01. and models cannot be handled without collaboration across multiple professions. They cannot be solved with tools alone.
02. Organizational design begins with "a single sheet of paper". The first task is to clearly define the stakeholders, the initial three individuals, and the RACI framework.
03. Grow it with the first use case.
04
Those experiencing conflict over organizational structure (under the platform organization vs. directly under a business unit)
How to involve management and business units (when budgets and proposals are not approved)
Unable to assemble the initial team (internal transfers vs. external recruitment)
How to get started with a small team (up to 5 people)