
MLCommons - IT Press Tour #60 Jan. 2025


The IT Press Tour

January 27, 2025


Transcript

  1. 2 Agenda • Overview of MLCommons® • MLPerf® Benchmarks • MLPerf Training • MLPerf Inference • MLPerf Client • MLPerf Storage
  2. 3 Better AI for Everyone: improving the accuracy, safety, speed, and efficiency of AI. 56,000+ MLPerf results to date. A global collective engineering effort spanning industry and academia, with 125+ members & affiliates.
  3. 5 Why benchmarks? Benchmarks drive progress and transparency. Benchmarking aligns the entire community in pursuit of the same clear objective. "What gets measured, gets improved." - Peter Drucker
  4. 10 ML is a full system problem. We need to measure data, algorithms, software, architecture, silicon, and scale.
  5. 11 MLPerf Goals: enforce performance result replicability to ensure reliable results; use representative workloads reflecting production use-cases; encourage innovation to improve the state-of-the-art of ML; accelerate progress in ML via a fair and useful measurement; serve both the commercial and research communities; keep benchmarking affordable so that all can participate.
  6. 12 MLPerf Expansion: improving technical maturity over time. • Standardized methodology for Training • Power measurement for Inference, Tiny • Power measurement for Training and cloud systems • Throughput metric for HPC • Apps for Android, iOS, Windows, etc. • Constantly evolving workloads. Benchmark suites added from 2018 to 2025: Training, Training-HPC, Inference - Datacenter, Inference - Edge, Inference - Mobile, Inference - Tiny (IoT), Storage, Client, Automotive* (* new benchmarks in development). Compatibility and historical comparisons: https://github.com/mlcommons/policies/blob/master/MLPerf_Compatibility_Table.adoc
  7. 13 MLPerf Training and Moore's Law. [Chart: relative performance gains over time for MLPerf submissions (Available category, Closed division), relative performance on a log scale.]
  8. 14 MLPerf Training and Moore's Law. [Chart: relative performance gains over time for MLPerf submissions (Available category, Closed division), relative performance on a log scale, with per-model gains ranging from roughly 2X to 10.2X.] System-wide optimizations result in increased performance gains for most ML models.
  9. 15 MLPerf Training. Developed by the MLPerf Training working group. Chairs: Hiwot Kassim (Meta), Shriya Palsamudram (NVIDIA). MLPerf Training benchmark paper: https://arxiv.org/abs/1910.01500 https://mlcommons.org/benchmarks/training/
  10. 17 Two divisions with different model restrictions. Each benchmark fixes a dataset (e.g. ImageNet) and a target quality (e.g. 75.9%). Closed division: a specific model (e.g. ResNet v1.5) → direct comparisons. Open division: any model → innovation.
  11. 18 Metric: time-to-train (end-to-end), i.e. time to solution. It is computationally expensive to measure and has high variance, but it is the least bad choice. The alternative is throughput: easy and cheap to measure, and lower precision or a higher batch size raises throughput, while higher precision or a lower batch size means fewer epochs to converge, so throughput can be increased at the cost of total time to train! (See the sketch below.)
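
A minimal sketch of why MLPerf Training reports time-to-train rather than throughput; train_one_epoch, evaluate, and target_quality are hypothetical stand-ins, not MLPerf reference code.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality):
    """Run training until the model reaches the target quality and report
    the end-to-end wall-clock time (the MLPerf Training metric)."""
    start = time.perf_counter()
    epochs = 0
    while evaluate() < target_quality:
        train_one_epoch()
        epochs += 1
    return time.perf_counter() - start, epochs

# Throughput (samples/second) alone can mislead: lowering precision or raising
# the batch size may boost throughput yet require more epochs to hit the same
# quality target, increasing the total time-to-train.
```
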
  12. 19 Time-to-train excludes: system initialization (depends on cluster configuration and state); model initialization (disproportionate for big systems with small benchmarking datasets); data reformatting (mandating a format would give an advantage to some systems).
  13. 20 MLPerf categories and divisions • Two Divisions ◦ Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ▪ Example changes: batch size, numerics, padding, framework, data layout ▪ Cannot change: # of layers, # of weights / pruning ◦ Open: model not mathematically equivalent to the reference ▪ Could be very different or a small difference; submitters should describe the changes • Three Categories ◦ Available: commercially available at submission ◦ Preview: commercially available soon (~6 months from submission) ◦ RDI: not commercially available, e.g. research, prototype, or internal systems
  14. MLPerf Training v4.1 21 17 Submitting Organizations ASUSTeK, Azure, Cisco,

    Clemson University, Dell**, FlexAI*, Fujitsu, Giga Computing, Google, KRAI, Lambda*, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, tinycorp * New submitters to MLPerf Training ** New submitter to MLPerf Training + power
  15. MLPerf Divisions and Categories, Training v4.1. Divisions: CLOSED DIVISION (level playing field for comparable results; fixed: data format, ML framework and data loader, # of samples per file, batch size) and OPEN DIVISION (showcase novel techniques; code changes are allowed but must be disclosed). Categories: AVAILABLE (can be purchased today), on cloud: Azure, Google, Lambda, NVIDIA, Oracle; on-premise: ASUSTek, Cisco, Clemson University, Dell, Fujitsu, Giga Computing, KRAI, Lenovo, NVIDIA, Quanta Cloud Technology, Supermicro, tinycorp. PREVIEW (will be submitted into the Available category in the next round): FlexAI, Google, NVIDIA. RDI (research and development, not for production).
  16. 23 MLPerf Training v4.1 detail (Task | Real-world application examples | Dataset | Reference model | Quality target):
     • Recommendation | Content or shopping recommendation, e.g., search, social media, ads | Criteo 4TB multi-hot | DLRM-dcnv2 | 0.8032 AUC
     • Large language model pre-training | Search, translation, chatbots, summarization | Wikipedia 2020-01-01 / C4 | BERT-large / GPT3 | 0.72 Mask-LM / 2.69 log perplexity
     • Large language model fine-tuning | High-accuracy domain-specific translation, chatbots, summarization | GovRep r1/r2/r3 | Llama 2 70B | 0.925 cross entropy loss
     • Object detection | Pedestrian detection, manufacturing defect detection, red-eye reduction | Open Images (800x800) | RetinaNet | 0.34 mAP
     • Image generation | Animation, content, marketing, art | LAION-400M-filtered | Stable Diffusion v2 | FID <= 90 and CLIP >= 0.15
     • Graph node classification (GNN) | Fraud detection, drug discovery | IGBH-Full | R-GAT | 72% classification accuracy
  17. MLPerf Training v4.1 Overview 24 • 154 performance results successfully submitted ◦ Up to 1.26x better performance versus the last round in closed available • New processors tested in the preview category ◦ Google Trillium TPU ◦ NVIDIA B200 GPU • Second MLPerf Training round with power measurement results ◦ 1 power submission by Dell, on a PowerEdge XE9680 datacenter system with NVIDIA H100-SXM-80GB GPUs
  18. Key results & insights • “Generative AI” training is more

    popular ◦ 46% increase in total submissions across GPT3, Stable Diffusion, and Llama 2 70B LoRA • Adoption and performance gains from benchmarks introduced in v4.0: ◦ Llama 2 70B LoRA: 16% increase in submissions and 1.26X performance speedup ◦ GNN: 55% increase in submissions and a 1.23X performance speedup 25
  19. 26 MLPerf Inference. Developed by the MLPerf Inference working group. Chairs: Miro Hodak (AMD), Mitchelle Rasquinha (Google). MLPerf Inference benchmark paper: https://arxiv.org/abs/1911.02549 https://mlcommons.org/benchmarks/inference/
  20. 27 MLPerf Inference scenarios. Datacenter (required for all Datacenter submissions): Server (e.g. a translation app): random query arrival with a latency constraint, metric queries/second (QPS); Offline (e.g. a photo sorting app): all queries available up front, no latency constraint, metric throughput. Edge: Single stream (e.g. cell phone augmented vision): one query at a time, metric 90%-ile latency; Multiple stream (e.g. multiple-camera driving assistance): metric throughput for 8 images. (See the sketch below.)
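
A toy illustration of the two Datacenter scenarios described above, assuming a hypothetical process_batch callable and a list of measured query latencies; the real harness is MLPerf LoadGen, which generates the query schedule itself.

```python
import time

def offline_throughput(process_batch, queries, batch_size=32):
    """Offline scenario: every query is available up front; the metric is raw
    throughput (samples/second) with no latency constraint."""
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        process_batch(queries[i:i + batch_size])
    return len(queries) / (time.perf_counter() - start)

def server_meets_bound(latencies_ms, bound_ms, percentile=0.99):
    """Server scenario: queries arrive at random times; a run is valid only if
    the chosen percentile of per-query latencies stays under the per-model bound."""
    ranked = sorted(latencies_ms)
    idx = min(int(percentile * len(ranked)), len(ranked) - 1)
    return ranked[idx] <= bound_ms
```
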
  21. 28 MLPerf Inference v4.1 workloads (Task | Dataset | Reference network | Reference quality):
     • Image classifier | ImageNet 2012 | ResNet-50 v1.5 | 76.456% top-1
     • Object detector (large) | Open Images (800x800) | RetinaNet-ResNext50 | 0.3755 mAP
     • Object detector (small) | COCO 2017 (300x300) | SSD MobileNet v1 (edge only) | 0.22 mAP
     • 3D medical imaging | KiTS 2019 (602x512x512) | 3D-UNet | 0.86330 mean*
     • NLP / Q&A | SQuAD 1.1 | BERT-Large | 90.874% F1 score*
     • Recommendation | Criteo 4TB multi-hot | DLRM-DCNv2 | 80.32% AUC*
     • LLM / text summarization | CNN/DailyMail | GPT-J 6B | ROUGE-1 42.9865, ROUGE-2 20.1235, ROUGE-L 29.9881*
     • LLM / Q&A | OpenOrca | Llama 2 70B | ROUGE-1 44.4312, ROUGE-2 22.0352, ROUGE-L 28.6162
     • Image generation | COCO-2014 | SDXL | FID ∈ (23.0108, 23.9501), CLIP ∈ (31.686, 31.813)
     • LLM / text generation | OpenOrca/GSM8K/MBXP | Mixtral-8x7B | ROUGE-1 45.4911, ROUGE-2 23.2829, ROUGE-L 30.3615; GSM8K accuracy 73.78; MBXP accuracy 60.12
     • Datacenter: Offline and Server scenarios • Edge: Single Stream, Offline, Multiple Stream (vision only) • * indicates both 99% and 99.9% reference accuracy targets • Optional power/energy measurement, building on SPEC and MLC developed tooling; performance-only and performance + power results are not directly comparable
  22. 29 MLPerf Inference v4.1 server latencies (Task | Reference network | Latency constraint):
     • Image classifier | ResNet-50 v1.5 | 15 ms
     • Object detector | RetinaNet-ResNext50 | 100 ms
     • 3D medical imaging | 3D-UNet | N/A
     • Speech-to-text | RNN-T | 1000 ms
     • NLP / Q&A | BERT-Large | 130 ms
     • Recommendation | DLRM-DCNv2 | 60 ms
     • LLM / text summarization | GPT-J 6B | 20 seconds
     • LLM / Q&A | Llama 2 70B | TTFT: 2 s & TPOT: 200 ms
     • Image generation | SDXL | 20 seconds
     • LLM / text generation | Mixtral-8x7B | TTFT: 2 s & TPOT: 200 ms
     • The Server scenario requires 99% of all queries to complete under the latency constraint • Latencies were selected based on input from advisory boards and customers to represent real deployments (see the TTFT/TPOT sketch below)
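
For the LLM workloads the constraint is expressed as time-to-first-token (TTFT) and time-per-output-token (TPOT) rather than one end-to-end number. Below is a minimal sketch of deriving both from a hypothetical per-token timestamp trace; it is not LoadGen's actual accounting.

```python
def ttft_tpot(token_times_s):
    """token_times_s: completion time of each output token, in seconds,
    relative to when the query was issued (a hypothetical trace)."""
    ttft = token_times_s[0]
    # Average gap between consecutive output tokens after the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / max(len(token_times_s) - 1, 1)
    return ttft, tpot

# A Llama 2 70B or Mixtral-8x7B Server run must keep 99% of queries within
# TTFT <= 2 s and TPOT <= 200 ms.
```
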
  23. MLPerf Inference v4.1 overview 30 • 22 submitting organizations: AMD*, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud*, and Untether AI* • Results: ◦ 964 performance results ◦ 31 power results ◦ 34 new MoE benchmark results • Power results from NVIDIA and Untether AI* * denotes a new submitter to the MLPerf Inference benchmark
  24. MLPerf Inference v4.1 Insights 31 • Submissions showcase the future of AI computing: AMD MI300X accelerator (available), AMD EPYC "Turin" CPU (preview), Google "Trillium" TPUv6e accelerator (preview), Intel "Granite Rapids" Xeon CPUs (preview), NVIDIA "Blackwell" B200 accelerator (preview), Untether AI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators • Initial adoption of FP4 • 2-4X generational performance increases • Up to 50% improvement in benchmark performance from v4.0
  25. 32 MLPerf Client. Developed by the MLPerf Client working group. Chairs: Yanni Minadakis (Microsoft), Ramesh Jaladi (Intel), Jani Joki (NVIDIA). https://mlcommons.org/working-groups/benchmarks/client/
  26. Introducing MLPerf Client v0.5 What’s now 33 • First release,

    early access ◦ Put the benchmark into the hands of users ◦ Gather feedback and learn how to improve • Four tests based on one AI model • Supports Windows x64 • Two GPU acceleration paths • CLI only • Open source, available on GitHub What’s next • Broader hardware support with more accelerated paths • Support for Windows on ARM and macOS • More tests based on varied end-user scenarios and AI models • A graphical user interface • A regular update cadence
  27. The large language model tests 34 • Large language models (LLMs) are one of the most popular forms of generative AI ◦ They take in text prompts and produce text responses ◦ Many possible applications, from chat bots to AI agents • Key specs for MLPerf Client v0.5: Tasks: content generation, creative writing, summarization (light), summarization (moderate); Model: Llama 2 7B; Dataset: OpenOrca; Mode: single stream; Quality: MMLU score
  28. What is a token? 35 A token is a component part of language that the machine-learning model uses to understand a language's syntax. 100 tokens typically translate into about 75 English words. [Figure: an example sentence with each colored region marking one token.] (See the rough estimate below.)
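
A back-of-the-envelope sketch of the 100-tokens-to-75-words rule of thumb quoted above; real tokenizers (e.g. the Llama 2 vocabulary) split text into subword pieces, so actual counts depend on the content.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Approximate token count from word count using the ~100 tokens
    per ~75 English words rule of thumb."""
    return round(len(text.split()) / words_per_token)

print(estimate_tokens("MLPerf Client measures on-device LLM performance"))  # -> 8 (approximate)
```
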
  29. Four different LLM work categories 36 (Category | Approximate input tokens | Approximate expected output tokens):
     • Content generation | 128 | 256
     • Creative writing | 512 | 512
     • Summarization, light | 1024 | 128
     • Summarization, moderate | 1566 | 256
     Prompts come from the OpenOrca dataset. Larger context lengths are more computationally intensive.
  30. Model optimization and accuracy 37 • We allow room for

    IHVs to modify the model to work best on their devices • The Llama 2 7B-parameter source model from HuggingFace stores weights in 16-bit floating-point format (fp16) ◦ A common practice for LLMs on client systems is to quantize the weights to a four-bit integer (int4) format ◦ MLPerf Client’s LLM models have been quantized to int4 • However, quantization and other optimizations can impact model output quality ◦ IHV implementations must pass an accuracy test ◦ That test is based on the MMLU data set ◦ Verifies functionality is not too compromised
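
Slide 30 notes that the Llama 2 7B weights are quantized from fp16 to int4 for client devices. Below is a minimal sketch of one common scheme (symmetric, per-group scales); the quantization each IHV actually ships may differ in group size, zero points, and calibration, and this is not the MLPerf Client code.

```python
import numpy as np

def quantize_int4_symmetric(weights, group_size=128):
    """Quantize an fp16/fp32 weight tensor to int4 values in [-8, 7] with one
    scale per group of `group_size` weights. Assumes weights.size is a
    multiple of group_size (a simplification for illustration)."""
    w = weights.astype(np.float32).reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)                # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # int4 values stored in int8
    return q, scale

def dequantize(q, scale, shape):
    """Reconstruct approximate weights for use at inference time."""
    return (q.astype(np.float32) * scale).reshape(shape)
```

Accuracy is then checked against the MMLU-based test described above to confirm the quantized model is not too degraded.
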
  31. Supported acceleration paths 38 • Two supported acceleration paths in

    MLPerf Client v0.5 ◦ ONNX Runtime GenAI with the DirectML execution provider for GPUs ◦ Intel OpenVINO native path for Intel GPUs • The benchmark ships with config JSON files for each participating IHV ◦ Specifying the path and config options they recommend • Detailed info is in the reviewer’s guide! ◦ Recommended test hardware ◦ Supported hardware ◦ Driver and memory requirements per IHV
  32. 39 MLPerf Storage Developed by MLPerf Storage Working Group Chairs:

    Curtis Anderson, Oana Balmau, Johnu George, and Huihuo Zheng https://mlcommons.org/working-groups/benchmarks/storage/
  33. 41 • Understand storage bottlenecks in ML workloads ◦ When

    the dataset cannot be cached in system memory • Help AI/ML practitioners make an informed storage decision ◦ Can the storage system keep GPUs running at >90% utilization? • Help storage suppliers optimize for ML workloads ◦ “What gets measured, gets improved” Why measure storage?
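
The >90% accelerator-utilization criterion above can be expressed as a simple ratio; a sketch, assuming we can attribute wall-clock time to (simulated) compute versus waiting on storage:

```python
def accelerator_utilization(compute_time_s, io_wait_s):
    """Fraction of wall-clock time the accelerator spends computing rather
    than stalled on data; MLPerf Storage requires this to stay above a
    workload-specific threshold (e.g. >90%)."""
    return compute_time_s / (compute_time_s + io_wait_s)

print(accelerator_utilization(90.0, 8.0))  # ~0.918 -> passes a 90% threshold
```
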
  34. MLPerf Storage Primer. Accelerators that are not heavily utilized do not test the capabilities of the storage resources. MLPerf requires high accelerator utilization to push the storage system to keep up with the data needs of AI training.
  35. MLPerf Storage Benchmark v1.0 Sleep for the time it takes

    to process a batch before requesting the next batch. Sleep time is configurable to simulate many types of accelerators. Simulated training “think time”
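
A minimal sketch of the emulation slide 35 describes: read a batch from storage, then sleep for the configured per-batch "think time" of the simulated accelerator, so only the storage path is exercised. The next_batch callable and timing bookkeeping are hypothetical; the actual benchmark builds on the DLIO tool and overlaps prefetching.

```python
import time

def simulated_training(next_batch, n_batches, batch_compute_time_s):
    """Fetch batches from storage and sleep to emulate accelerator compute;
    returns the achieved (simulated) accelerator utilization."""
    io_time = 0.0
    start = time.perf_counter()
    for _ in range(n_batches):
        t0 = time.perf_counter()
        next_batch()                        # real storage I/O under test
        io_time += time.perf_counter() - t0
        time.sleep(batch_compute_time_s)    # simulated A100/H100 "think time"
    wall = time.perf_counter() - start
    return (wall - io_time) / wall          # fraction of time spent "computing"
```
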
  36. 44 MLPerf Storage v1.0 significant variables • Framework – e.g., PyTorch vs TensorFlow • Storage network – how the storage is connected to the host nodes • Storage solution – the hardware and/or software that runs the storage • Type of training – e.g., single-host vs distributed training • Type of data – e.g., images vs text samples • Caching – the benchmark guarantees that datasets are larger than the host node cache, but in-storage caching can have important performance benefits
  37. 45 Workloads simulated by MLPerf Storage v1.0 (Task | Dataset | Reference network | Sample size | Framework | Reference quality):
     • Image segmentation (medical) | Synthetic, from KiTS19 | 3D-UNet | 146 MB | PyTorch | maximize MB/s and # of accelerators with >90% accelerator utilization
     • Image classification | Synthetic, from ImageNet | ResNet-50 | 150 KB | TensorFlow | maximize MB/s and # of accelerators with >90% accelerator utilization
     • Scientific (cosmology) | Synthetic, from CosmoFlow N-body simulation | Parameter prediction | 2 MB | TensorFlow | maximize MB/s and # of accelerators with >70% accelerator utilization
     • Workloads can be run with simulated NVIDIA A100 or H100 accelerators • Note the variety in sample size → bandwidth vs. IOPS intensity • "Compute time" per sample varies by two orders of magnitude
  38. Results overview 46 • 13 submitting organizations with over 100

    results across three workloads • Congratulations to all submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech
  39. MLPerf Divisions and Categories (v1.0). AVAILABLE: can be purchased today. PREVIEW: will be submitted into the Available category in the next round. CLOSED DIVISION: level playing field for comparable results; fixed: data format, ML framework and data loader, # of samples per file, batch size. OPEN DIVISION: showcase novel techniques; code changes are allowed but must be disclosed.
  40. Key results insights • The explosion of the AI market is driving a wave of innovation in the storage market • Those waves are relatively rare; the storage market is very focused on stability, so major innovation typically only happens when it has to • There is a broad range and diversity of storage systems and architectures responding to the needs of the AI community • Distributed training applies a much heavier metadata load to the storage system than single-node training does • For example, the up to 76% reduction in per-batch computation time for the Unet3D workload using the H100* indicates a shift from what was previously a bandwidth-sensitive workload to a latency-sensitive one (* compared to v0.5 results using V100 emulated accelerators)
  41. 49 Listening to the results Every result says something interesting,

    but it may not be obvious • Lots of work went into each result • Look at submissions that are similar across some dimensions, e.g., same vendor, same scale, best performance...but different in other dimensions • Distributed vs. single host node • Scale of distributed solutions • Different ML frameworks • First open and transparent benchmark that was developed by the AI community for storage
  42. 54 GNN overview • An important area for studying and modeling graph-structured data such as social networks, molecules, and webpage links • Wide range of commercial applications, e.g., recommendation, fraud detection, drug discovery, knowledge graph answering, etc. ◦ E.g., Alibaba's Taobao recommender applies a GNN to a user behavior graph with more than one billion vertices and tens of billions of edges • Novel characteristics compared to other benchmarks – sparse operations, feature-fetching over the network, memory access patterns
  43. Task/dataset/model 55 • Task: node classification in a heterogeneous graph • Dataset: IGBH-Full dataset (paper, repo) ◦ Largest open-source dataset available ◦ 547 million nodes, 5.8 billion edges ◦ Total classes: 2983 ◦ Dataset augmented by adding reverse edges ◦ Feature size: 2.2 TB for fp32; graph size: 160 GB • R-GAT model: ◦ 3 layers with [5,10,15] fanout ◦ Hidden dimension of 512 ◦ 4 attention heads. MLPerf R-GAT is #1 on the IGBH leaderboard.
  44. LLM fine-tuning overview 57 Fine-tuning starts with a pre-trained LLM and further trains it on a specific task or dataset to optimize for that particular task. This increases training efficiency by: • reducing computational intensity • enhancing performance on specific tasks without starting training from scratch. [Diagram: LLM pretraining on an unlabeled dataset is computationally demanding and produces a pretrained model; LLM fine-tuning on a labeled dataset is computationally efficient and performs better on specific tasks.]
  45. Parameter-efficient fine-tuning 58 Parameter-efficient fine-tuning (PEFT): a method for fine-tuning LLMs on specific tasks by adjusting a subset of model parameters; significantly more computationally efficient than traditional fine-tuning. Low-rank adaptation (LoRA) trains dense layers indirectly by optimizing rank-decomposition matrices while keeping the pre-trained weights frozen (see the sketch below). • Reduces hardware requirements by up to 3X with adaptive optimizers by eliminating gradient calculation and optimizer state maintenance for the frozen weights • Memory and storage usage are significantly reduced
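
A minimal LoRA sketch in PyTorch, illustrating the idea described above (a frozen base weight plus a trainable low-rank update B·A); this is illustrative only, not the MLPerf reference implementation, and the r/alpha values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear: freeze its weight and learn a low-rank
    update, so only r * (d_in + d_out) parameters are trained."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # starts as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```
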
  46. Model and dataset details 59 • Pretrained model – Llama-2-70B-hf, context length 8K • Dataset – SCROLLS GovReport ◦ Task – summarization ◦ Size: 1.1 GB; 17.5K samples for training and 1.9K samples for validation ◦ Average input tokens 13K, average output tokens 900 ◦ Context length 8K (the largest context that fits in one system) ◦ License: CC-BY-4.0
  47. 62 Mixture of Experts (MoE) Language Modeling • MoE is

    the gateway to multi-modal models (e.g., vision, text, speech) • Monolithic LLMs increase compute proportional to parameter count • MoEs scale parameters with less compute and maintain high accuracy ◦ Improve inference costs and performance (throughput, latency) ◦ Combine multiple specialized neural networks, or "experts" ◦ Task specialization and adaptability (e.g., coding, summarization) • MoEs have demonstrated results comparable to larger monolithic models • Task force developed MoE benchmark reference implementation
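
To make the MoE idea above concrete, here is a toy top-k routed mixture-of-experts layer in PyTorch: all experts contribute parameters, but only k of them run per token. It is loosely in the spirit of Mixtral 8x7B (8 experts, 2 active per token), not its actual implementation.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a linear router scores experts per token
    and only the top-k experts are executed for that token."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```
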
  48. 64 Mixtral 8x7B • MoE architecture requires a new approach

    to evaluate • Three tasks and datasets ◦ Q&A - OpenOrca dataset as in Llama 2 70B benchmark ◦ Math Reasoning: GSM8K (Grade School Math 8K) dataset ◦ Coding: MBXP (Mostly Basic X Programming) dataset ◦ 5,000 samples drawn from each dataset for a total of 15,000 samples • Accuracy criteria: ◦ Q&A: ROUGE-1 45.4911, ROUGE-2 23.2829, ROUGE-L 30.3615 ◦ GSM8K: 73.78 & MBXP: 60.12 • Latency: TTFT: 2s & TPOT: 200 ms (same as Llama 2 70B) • Most complex MLPerf Inference benchmark to date
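
Slide 48's Q&A accuracy criterion is expressed as ROUGE scores. A sketch of how such scores can be computed with the open-source rouge_score package, assuming lists of reference answers and model outputs; the reference implementation's exact scoring pipeline may differ.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def mean_rouge(references, predictions):
    """Average ROUGE-1/2/L F-measures (scaled by 100, as reported in MLPerf
    result tables) over a set of generated answers."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: 100.0 * total / len(references) for key, total in totals.items()}
```
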
  49. Query Examples
     Q&A: Question: Leo: If "A woman sitting on some steps smokes a cigarette while accessing her laptop." does that mean that "A person sitting on some steps smokes a cigarette while accessing her laptop."? Options: - yes - it is not possible to tell - no Answer: Yes, the sentence "A woman sitting on some steps smokes a cigarette while accessing her laptop" can be changed to "A person sitting on some steps smokes a cigarette while accessing her laptop." The gender of the person has been changed from specifically a woman to a person in general. However, it is important to note that the pronoun "her" is still used, which indicates that the person referred to is female. If you want to make the sentence completely gender-neutral, you could change "her laptop" to "the laptop" or "a laptop."
     Math Reasoning: Question: The square footage of the two bedrooms in the apartment that Jenny and Martha share totals 300 square feet. If Jenny's bedroom is 60 square feet larger than Martha's, how large, in square feet, is Martha's bedroom? Answer: Let's denote the size of Martha's bedroom as M. Jenny's bedroom is then M + 60 square feet. The total size of the two bedrooms is 300 square feet. So we have the equation M + (M + 60) = 300. Solving this equation for M, we get 2M + 60 = 300, which simplifies to 2M = 240, so M = 120. The answer is 120 square feet.
     Coding: Problem: <s> [INST] Complete the following code. Be concise, don't output anything that isn't necessary. #include <bits/stdc++.h> using namespace std; /** * Write a function to access dictionary key's element by index. * > accessKey(unordered_map<string, int>{{string("physics"), 80}, {string("math"), 90}, {string("chemistry"), 86}}, 0) * string("physics") * > accessKey(unordered_map<string, int>{{string("python"), 10}, {string("java"), 20}, {string("C++"), 30}}, 2) * string("C++") * > accessKey(unordered_map<string, int>{{string("program"), 15}, {string("computer"), 45}}, 1) * string("computer") */ string accessKey(unordered_map<string, int> ditionary, int key) { [/INST] Here's the completed code:
```cpp
#include <bits/stdc++.h>
using namespace std;

string accessKey(unordered_map<string, int> dictionary, int key) {
    auto it = begin(dictionary);
    advance(it, key);
    return it->first;
}
```
  50. Power Measurement in MLPerf 67 • First MLPerf Training round with power measurement results • Industry's first datacenter-scale full-system power measurement methodology, applicable to both on-prem and cloud data centers • Includes power consumed by all compute nodes (measured) and the interconnect network (measured/estimated) during workload runtime • Measured power is integrated over Time-to-Train (TTT) to derive Energy-to-Train (ETT); result metrics are presented as (TTT, ETT) (see the sketch below) • Submissions by Sustainable Metal Cloud (SMC): ◦ Submitted RN50, BERT, Llama 2 70B LoRA, and GPT3 pre-training at scales ranging from 1 node (8 GPUs) to 64 nodes (512 GPUs)
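
A minimal sketch of the TTT-to-ETT relationship described above, assuming a hypothetical list of full-system power samples taken at a fixed interval over the training run; the actual methodology specifies how power is measured and validated.

```python
def energy_to_train_kwh(power_samples_w, sample_interval_s):
    """Integrate sampled full-system power (watts) over the run to get
    Energy-to-Train; a simple rectangle rule stands in for whatever
    integration the real tooling performs."""
    energy_joules = sum(p * sample_interval_s for p in power_samples_w)
    return energy_joules / 3.6e6  # joules -> kWh

# Example: 10 kW average over a 2-hour time-to-train is about 20 kWh.
print(energy_to_train_kwh([10_000.0] * 7200, 1.0))  # -> 20.0
```
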
  51. 68 MLPerf Training v4.0 divisions and categories Two Divisions • Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ◦ Example changes: batch size, numerics, padding, framework, data layout ◦ Cannot change: # of layers, # of weights / pruning • Open: model not mathematically equivalent to the reference ◦ Could be very different or a small difference; submitters should describe the changes Three Categories • Available: commercially available at submission • Preview: commercially available soon (~6 months from submission) • RDI: not commercially available, e.g. research, prototype, or internal systems
  52. 69 MLPerf Inference Benchmark Definition. A trained model (e.g. ResNet) takes an input (e.g. an image) and produces a result (e.g. 'cat') with a required quality (e.g. 75.1%). MLPerf benchmarking scope: ML systems (HW + SW). Submission divisions: Closed (strict rules, apples-to-apples ML system comparison) and Open (permissive rules, better models than the reference).
  53. 70 MLPerf Categories and Divisions • Two Divisions ◦ Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ▪ Example changes: batch size, numerics, padding, framework, data layout ▪ Cannot change: # of layers, # of weights / pruning ◦ Open: model not mathematically equivalent to the reference ▪ Could be very different or a small difference; submitters should describe the changes • Three Categories ◦ Available: commercially available at submission ◦ Preview: commercially available soon (~6 months from submission) ◦ RDI: not commercially available, e.g. research, prototype, or internal systems
  54. MLPerf Inference Power Measurement 71 • Measured power is the

    measured average AC power (and implicitly energy) consumed by the entire system while executing a performance benchmark. • MLPerf Power is only capable of measuring and validating the full system power (and implicitly energy). Any other references to power in any description (e.g., a TDP configuration) are not measured or validated by MLCommons. • Performance in a performance-only submission is not comparable to performance in performance + power submission ◦ Different optimization targets, e.g., different voltage/frequency
  55. MLPerf Automotive 73 Partnering with AVCC on an ML benchmark suite for automotive requests for information/quotes (RFI/RFQs). Focus areas include: • Safety – collision avoidance, lane departure warning, etc. • Efficiency – fuel optimization, predictive maintenance, etc. • User Experience – voice recognition, LLMs, etc. • Regulatory Compliance – meet global regulations around AI/ML in vehicles, especially for safety. MLCommons and AVCC released an automotive benchmark proof-of-concept in June 2024.