
How_to_create_MLOps_Team.pdf

Daisuke Akagawa (Akasan)

April 22, 2026

Transcript

  1. Profile. Name: Daisuke AKAGAWA. Role: ML Engineer at 3-shake.
     Interests: machine learning / artificial intelligence / GenAI, cloud infrastructure / computing. Personal activities: developing applications, writing tech blogs.
  2. Agenda:
     01 What is MLOps? Understand what MLOps is.
     02 Essential capabilities: what abilities are required?
     03 Is ML knowledge required? Don't be afraid to become a member of an MLOps team.
     04 Initial organization design: let's start small.
     05 Grow a team: refinement through experience.
  3. Overview. You can learn:
     • A common language for discussing MLOps team building
     • Criteria for determining the "first step" that aligns with your company's current phase
     • Specific actions you can start tomorrow
     Not covered:
     • Detailed explanations of tools
     • Design patterns of specific cloud vendors
     • Guidance for existing MLOps organizations, especially large teams
  4. MLOps components:
     • Data handling: fetch and curate data; feature engineering
     • Train / Val model: train the model on data; validate performance
     • Deploy / Monitoring: provide your model; detect drift / degradation
     • CI / CD / CT: keep your model up to date
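The four component groups above form a sequential loop. As a minimal sketch (the `run_pipeline` helper and stage names are illustrative, not from the deck), the groups can be chained as ordered stages:

```python
def run_pipeline(payload, stages):
    """Run the component groups in order; each stage transforms the payload.

    `stages` is a list of (name, callable) pairs. This is a stand-in for a
    real orchestrator (Airflow, Kubeflow Pipelines, etc.).
    """
    history = []
    for name, stage in stages:
        payload = stage(payload)   # hand the evolving artifact to the next stage
        history.append(name)
    return payload, history
```

In a real system each stage would be a containerized job; the point is only that the groups depend on each other in this order.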
  5. Handle data before feeding it to the model:
     • Data ingestion: ingest data into the system
     • Data validation: handle missing values and outliers
     • Feature engineering: transform raw data into valuable input
     • Data versioning: version your data for reproducibility
     • Data labeling: assign ground truth to raw data
     • Privacy and security: protect data and ensure compliance
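A minimal, stdlib-only sketch of two of these steps: data validation (missing values and z-score outliers) and data versioning via a content hash. The function names and the z-score threshold are assumptions for illustration, not from the deck:

```python
import hashlib
import json
import statistics

def validate_rows(rows, field, z_thresh=3.0):
    """Split rows into clean and rejected (missing values and z-score outliers).

    `rows` is a list of dicts; `field` is the numeric column to check.
    Hypothetical helper: the deck names the steps, not an implementation.
    """
    present = [r for r in rows if r.get(field) is not None]
    missing = [r for r in rows if r.get(field) is None]
    values = [r[field] for r in present]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0   # avoid division by zero
    clean = [r for r in present if abs((r[field] - mean) / stdev) <= z_thresh]
    outliers = [r for r in present if abs((r[field] - mean) / stdev) > z_thresh]
    return clean, missing + outliers

def dataset_fingerprint(rows):
    """Content hash of a dataset, usable as a version id for reproducibility."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]
```

Storing the fingerprint alongside each training run gives the reproducibility the versioning bullet asks for: the same data always hashes to the same id.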
  6. Train and evaluate the model:
     • Data splitting: split data into train / val / test datasets
     • Design model architecture: from scratch, from a pre-trained model, etc.
     • Train model: fit your model to the training data
     • Evaluate model: evaluate and tune hyperparameters on the val dataset
     • Validate business KPI: go/no-go decision for production
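The splitting and validation bullets can be sketched as follows; `split_dataset`, `pick_hyperparameter`, and the fraction defaults are hypothetical helpers (in practice you would reach for scikit-learn's `train_test_split` and a proper search):

```python
import random

def split_dataset(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve out train / val / test slices (illustrative ratios)."""
    data = list(data)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

def pick_hyperparameter(train, val, candidates, fit, score):
    """Fit one model per candidate on `train`; keep the best by `score` on `val`.

    `fit` and `score` are caller-supplied callables (assumptions, not a real API).
    The test set is deliberately untouched here: it is reserved for the final check.
    """
    return max(candidates, key=lambda c: score(fit(train, c), val))
```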
  7. Deploy and monitor your ML environment:
     • Infra management: maintain the model execution environment
     • Serve model: provide model accessibility
     • Observability: track health and data integrity
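A sketch combining the "serve model" and "observability" bullets: a request handler that serves predictions while logging latency and input integrity. Illustrative only; a real deployment would sit behind a serving framework (FastAPI, KServe, etc.), and all names here are assumptions:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model-server")

def make_handler(model, feature_names):
    """Wrap a model callable in a handler that also emits observability signals."""
    def handle(request_body: str) -> str:
        start = time.perf_counter()
        payload = json.loads(request_body)
        # Data-integrity check: reject requests with missing features
        missing = [f for f in feature_names if f not in payload]
        if missing:
            log.warning("bad request, missing features: %s", missing)
            return json.dumps({"error": f"missing features: {missing}"})
        prediction = model([payload[f] for f in feature_names])
        latency_ms = (time.perf_counter() - start) * 1000
        log.info("prediction served in %.2f ms", latency_ms)   # health signal
        return json.dumps({"prediction": prediction,
                           "latency_ms": round(latency_ms, 2)})
    return handle
```

The latency log line and the missing-feature warning are exactly the two signals the slide asks the team to track: service health and data integrity.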
  8. Keep your ML service up to date (CI / CD / CT):
     • Code quality check: keep your code production-ready
     • Packaging models: containerize artifacts and dependencies
     • Keep training the model: triggered by data drift, concept drift, etc.
     • Validate pipeline: confirm the pipeline works properly
     • Deploy: serve models as prediction services
     • Manage metadata: track the history of trained models
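The "manage metadata" bullet can be sketched as a tiny in-memory registry; in practice a tool such as MLflow plays this role, and the class and method names here are assumptions for illustration:

```python
import datetime
import hashlib
import json

class ModelRegistry:
    """Minimal stand-in for a model registry: records which data version and
    parameters produced each trained model, and with what metrics."""

    def __init__(self):
        self._runs = []

    def log_run(self, model_name, data_version, params, metrics):
        run = {
            "run_id": hashlib.sha1(json.dumps(
                [model_name, data_version, params],
                sort_keys=True).encode()).hexdigest()[:8],
            "model": model_name,
            "data_version": data_version,
            "params": params,
            "metrics": metrics,
            "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        self._runs.append(run)
        return run["run_id"]

    def history(self, model_name):
        """All runs for a model, in training order."""
        return [r for r in self._runs if r["model"] == model_name]

    def best(self, model_name, metric):
        """The run with the highest value of `metric` (higher is better)."""
        return max(self.history(model_name), key=lambda r: r["metrics"][metric])
```

Tying each run to a data version is what makes CT auditable: when a drift trigger retrains the model, the registry shows what changed and why.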
  9. The three pillars of MLOps:
     Organization: collaboration protocol between ML engineers and other engineers; adjusting expectations with the business side; governance & compliance; psychological safety.
     Process (Operation): release criteria; incident response and rollback; documentation; data/feature/model review.
     Technology (Fundamental): versioning; reproducible pipelines; experiment tracking; monitoring & alerting; infrastructure.
  10. Checklist for identifying missing capabilities:
     01 Where did accidents and setbacks occur? Incidents, release failures, retraining abandoned because it was impossible.
     02 What factors make model development complicated? Lack of reproducibility; inability to see production logs.
     03 Which responsibilities fall between roles ("between whom is the ball lying")? Solve with structure what technology cannot.
  11. Rather than depth of ML knowledge, what matters is the breadth to "talk about your company's model as if it were your own."
  12. Requirements for MLOps team members:
     ML Engineer. Must know: deep insight into ML (creating and evaluating models, etc.); managing experiment results; writing code for serving, not notebooks. Nice to have: infrastructure experience; distributed computing technology.
     Non-ML Engineer. Must know: container technologies (Docker and Kubernetes); how to build CI/CD pipelines; experience building infrastructure with IaC. Nice to have: knowledge of machine learning; ML-specific security and compliance.
  13. Situations where ML knowledge is required:
     ML-specific: model performance degradation; reviewing ML-specific metrics; translating requirements from data scientists into "foundation design"; data / concept drift.
     Not ML-specific: building infrastructure; IaC; security; pipeline maintenance; SRE-based operational design.
  14. Common misconceptions:
     01 "An MLOps team is a group of ML engineers." If only ML engineers build MLOps, the operational design will be weak and will break in production.
     02 "It can be built with just SRE/SWE." This overlooks ML-specific issues such as drift.
     03 "First, everyone must undergo ML training." Learning by doing, rather than waiting for classroom lectures, is the most practical approach.
  15. Situation 1: Response stopped due to excessive access.
     Problem: Responses stopped; checking the logs and metrics revealed a GPU memory shortage caused by excessive access.
     Solution: The model's memory usage may be at fault, but this can potentially be addressed by scaling the GPU out and up.
  16. Situation 2: CTR dropped 20% week over week.
     Problem: The CTR of the recommendation model decreased by 20% compared to the previous week.
     Solution without an ML engineer: the only conclusion might be "the inference API latency is normal, so it is not infrastructure-related."
  17. Situation 2: CTR dropped 20% week over week.
     Problem: The CTR of the recommendation model decreased by 20% compared to the previous week.
     Solution with an ML engineer, who can hypothesize: 1. Drift is occurring. 2. Inputs contain data not present in the training dataset. 3. There is a fault in the label collection pipeline. 4. There are problems with data preprocessing.
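Hypothesis 1 (drift) can be checked quantitatively. A common technique, not named in the deck, is the Population Stability Index (PSI) between the training-time feature distribution and live traffic; the thresholds in the docstring are an industry rule of thumb, not the deck's:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a live sample.

    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")   # catch out-of-range values

    def frac(values):
        counts = [0] * bins
        for v in values:
            for i in range(bins):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # floor at a tiny value so the log ratio is defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An ML engineer would compute this per feature and per prediction score; a high PSI on the recommendation inputs points at hypothesis 1 or 2, while a low PSI pushes the investigation toward the label pipeline or preprocessing.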
  18. The first step isn't recruitment or infrastructure development, but rather focusing on stakeholder mapping, initial members, and decision-making criteria.
  19. Who are the stakeholders? The process addresses Why → What → How → Risk:
     • Why: management layer, business departments
     • What: ML engineers, data scientists, data engineers
     • How: SREs, platform engineers
     • Risk: legal, compliance, IT systems, security
  20. Let's start small. Initial roles:
     • Data Engineer: responsible for handling data (ingest, preprocess, store)
     • Infra Engineer: builds infrastructure to handle data, develop models, and serve models
     • ML Engineer: develops performant models based on the data
     • Ops Engineer: production monitoring, alerting (drift/latency), and on-call/incident response
     Staffing options:
     • Internal members: ◎ culture building, business knowledge, low cost; △ possibly limited ML experience
     • Mainly external recruitment: ◎ expertise and best practices; △ time, cost, and adoption risk
  21. Clearly define decision-making criteria (R: Responsible, A: Accountable, C: Consulted, I: Informed):
     Activity                                    | ML Eng | Data Eng | Infra Eng | Ops Eng | Stakeholders
     Model development                           | R/A    | C        | I         | I       | C
     Build training pipeline                     | R/A    | R/A      | C         | C       | I
     Production deployment                       | R      | I        | C         | C       | A
     Monitoring / Operation                      | C      | R        | R         | R/A     | I
     Incident primary response                   | C      | C        | C         | R/A     | I
     Decision to keep serving or withdraw model  | C      | I        | I         | I       | R/A
     Data quality and features                   | C      | R/A      | I         | I       | R
     The "deployment decision" and "withdrawal decision" must always sit with the business unit.
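A RACI matrix like this can also be kept as data, so ownership questions are answerable programmatically. The role abbreviations and the helper below are my encoding of the slide's matrix, not part of the deck:

```python
# RACI matrix from the slide. Keys: ML = ML Engineer, DE = Data Engineer,
# Infra = Infra Engineer, Ops = Ops Engineer, Biz = Stakeholders.
RACI = {
    "Model development":             {"ML": "R/A", "DE": "C",   "Infra": "I", "Ops": "I",   "Biz": "C"},
    "Build training pipeline":       {"ML": "R/A", "DE": "R/A", "Infra": "C", "Ops": "C",   "Biz": "I"},
    "Production deployment":         {"ML": "R",   "DE": "I",   "Infra": "C", "Ops": "C",   "Biz": "A"},
    "Monitoring / Operation":        {"ML": "C",   "DE": "R",   "Infra": "R", "Ops": "R/A", "Biz": "I"},
    "Incident primary response":     {"ML": "C",   "DE": "C",   "Infra": "C", "Ops": "R/A", "Biz": "I"},
    "Keep serving / withdraw model": {"ML": "C",   "DE": "I",   "Infra": "I", "Ops": "I",   "Biz": "R/A"},
    "Data quality and features":     {"ML": "C",   "DE": "R/A", "Infra": "I", "Ops": "I",   "Biz": "R"},
}

def accountable_for(activity):
    """Return the roles holding the 'A' for an activity, i.e. the final owner."""
    return [role for role, code in RACI[activity].items() if "A" in code]
```

Note how the encoding makes the slide's closing rule checkable: both the deployment decision and the withdrawal decision resolve to the business side.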
  22. Create a flagship use case:
     01 Don't build the entire company infrastructure from scratch. It will almost certainly fail; take small, steady steps rather than attempting a large-scale construction project all at once.
     02 Utilization first, then foundation. Develop the foundation through actual operation.
     03 MLOps can also be considered a product. Don't create things that aren't used; deliver value through ML models.
  23. How to choose a flagship model. Selection criteria:
     • Business value can be expressed in numbers: focus on sales/cost/man-hours rather than accuracy.
     • Data already exists: "we'll collect it from now on" is a long process.
     • Even if you fail, it won't be fatal: unsuitable for situations directly tied to human life or compliance.
     You should NOT choose:
     • Utilizing internal chat / turning everything into an LLM
     • A perfect general-purpose prediction platform
     • Something chosen only because it's popular at other companies
  24. Top-down: start with a clear goal.
     Strengths: quick decision making; easy to attract investment and talent.
     Risks: discrepancy with on-site needs; unused MLOps infrastructure; neglecting the foundation due to pressure for short-term results.
     Steps:
     01 Translate goals into business KPIs (sales/costs).
     02 Run 2-3 use cases and extract common issues.
     03 The central team prioritizes hands-on accompaniment of each use case.
     04 Quantify milestones at 3 → 6 → 12 months.
  25. Bottom-up: prepare for the future.
     Strengths: driven by on-site challenges, eliminating waste; can start small; high probability of being used.
     Risks: stagnation due to insufficient budget and authority; dependence on individual effort (running on one person's enthusiasm); hard to see, so not appreciated by the company.
     Steps:
     01 Start by visualizing small improvements.
     02 Pick up the problems ML engineers face.
     03 Measure results with numbers and stories.
     04 Expand your network: study groups, Slack, etc.
  26. Also make a list of things you won't do:
     01 Building your learning infrastructure from scratch: managed services are sufficient.
     02 Company-wide standardization: there are no targets for horizontal expansion yet.
     03 Perfect documentation: first, create the foundation for what works.
     04 Completely copying another company's example: it does not align with your company's own problems.
  27. Summary:
     Top-down. Reason: management decisions and large-scale projects. Speed: fast, with a risk of divergence. Primary risk: unused infrastructure. First step: business-KPI translation + support. Recommended organizational structure: independent division or under the Platform umbrella. Key to success: bridging the gap with the field.
     Bottom-up. Reason: problems encountered on site. Budget: almost none / volunteers. Speed: slow, but close to the field. Primary risk: dependence on individuals and stagnation. First step: visualizing small improvements. Recommended organizational structure: directly under the business unit or the data organization. Key to success: visualizing results + building a network.
  28. Key takeaways:
     01 MLOps is an organizational problem. Code, data, and models cannot be handled without collaboration across multiple professions, and the problem cannot be solved with tools alone.
     02 Organizational design begins with "a single sheet of paper." The first task is to clearly define the stakeholders, the initial members, and the RACI framework.
     03 Grow it with the first use case.
  29. Please discuss the following within your company:
     01 Conflict over organizational structure (platform subsidiary vs. direct report to the business unit)
     02 How to involve management and business units (when budgets and proposals are not approved)
     03 Unable to assemble the initial team (internal transfers vs. external recruitment)
     04 How to get started with a small team (up to 5 people)