
MLCommons - IT Press Tour #60 Jan. 2025


The IT Press Tour

January 27, 2025


Transcript

  1. 2 Agenda • Overview of MLCommons® • MLPerf® Benchmarks • MLPerf Training • MLPerf Inference • MLPerf Client • MLPerf Storage
  2. 3 Better AI for Everyone: improving the accuracy, safety, speed, and efficiency of AI. 56,000+ MLPerf results to date. A global collective engineering effort spanning industry and academia, with 125+ members & affiliates.
  3. 5 Why benchmarks? Benchmarks drive progress and transparency. Benchmarking aligns the entire community in pursuit of the same clear objective. "What gets measured, gets improved." - Peter Drucker
  4. 10 ML is a full system problem. We need to measure data, algorithms, software, architecture, silicon, and scale.
  5. 11 MLPerf Goals: enforce performance result replicability to ensure reliable results; use representative workloads reflecting production use-cases; encourage innovation to improve the state-of-the-art of ML; accelerate progress in ML via a fair and useful measurement; serve both the commercial and research communities; keep benchmarking affordable so that all can participate.
  6. 12 MLPerf Expansion: improving technical maturity over time. • Standardized methodology for Training • Power measurement for Inference, Tiny • Power measurement for Training and cloud systems • Throughput metric for HPC • Apps for Android, iOS, Windows, etc. • Constantly evolving workloads. Benchmark suites added from 2018 to 2025: Training, Training-HPC, Inference - Datacenter, Inference - Edge, Inference - Mobile, Inference - Tiny (IoT), Storage, Client, Automotive* (* new benchmarks in development). Compatibility and historical comparisons: https://github.com/mlcommons/policies/blob/master/MLPerf_Compatibility_Table.adoc
  7. 13 MLPerf Training and Moore's Law. [Chart: relative performance gains over time for MLPerf submissions (Available category, Closed division), relative performance on a log scale.]
  8. 14 MLPerf Training and Moore's Law. [Chart: relative performance gains over time for MLPerf submissions (Available category, Closed division), relative performance on a log scale, with per-model gains ranging from roughly 2X to 10.2X.] System-wide optimizations result in increased performance gains for most ML models.
  9. 15 MLPerf Training. Developed by the MLPerf Training working group. Chairs: Hiwot Kassim (Meta), Shriya Palsamudram (NVIDIA). MLPerf Training benchmark paper: https://arxiv.org/abs/1910.01500 https://mlcommons.org/benchmarks/training/
  10. 17 Two divisions with different model restrictions. Each benchmark fixes a dataset (e.g. ImageNet) and a target quality (e.g. 75.9%). Closed division: a specific model (e.g. ResNet v1.5) → direct comparisons. Open division: any model → innovation.
  11. 18 Metric: time-to-train (end-to-end), i.e. time to solution. It is computationally expensive to measure and has high variance, but it is the least bad choice. The alternative is throughput: easy and cheap to measure, and lower precision or a higher batch size raises throughput, while higher precision or a lower batch size means fewer epochs to converge, so throughput can be increased at the cost of total time to train! (See the sketch below.)
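
A minimal sketch of why MLPerf Training reports time-to-train rather than throughput; train_one_epoch, evaluate, and target_quality are hypothetical stand-ins, not MLPerf reference code.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality):
    """Run training until the model reaches the target quality and report
    the end-to-end wall-clock time (the MLPerf Training metric)."""
    start = time.perf_counter()
    epochs = 0
    while evaluate() < target_quality:
        train_one_epoch()
        epochs += 1
    return time.perf_counter() - start, epochs

# Throughput (samples/second) alone can mislead: lowering precision or raising
# the batch size may boost throughput yet require more epochs to hit the same
# quality target, increasing the total time-to-train.
```
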
  12. 19 Time-to-train excludes: system initialization (depends on cluster configuration and state); model initialization (disproportionate for big systems with small benchmarking datasets); data reformatting (mandating a format would give an advantage to some systems).
  13. 20 MLPerf categories and divisions • Two Divisions ◦ Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ▪ Example changes: batch size, numerics, padding, framework, data layout ▪ Cannot change: # of layers, # of weights / pruning ◦ Open: model not mathematically equivalent to the reference ▪ Could be very different or a small difference; submitters should describe the changes • Three Categories ◦ Available: commercially available at submission ◦ Preview: commercially available soon (~6 months from submission) ◦ RDI: not commercially available, e.g. research, prototype, or internal systems
  14. MLPerf Training v4.1 21 17 Submitting Organizations ASUSTeK, Azure, Cisco,

    Clemson University, Dell**, FlexAI*, Fujitsu, Giga Computing, Google, KRAI, Lambda*, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, tinycorp * New submitters to MLPerf Training ** New submitter to MLPerf Training + power
  15. MLPerf Divisions and Categories, Training v4.1. Divisions: CLOSED DIVISION (level playing field for comparable results; fixed: data format, ML framework and data loader, # of samples per file, batch size) and OPEN DIVISION (showcase novel techniques; code changes are allowed but must be disclosed). Categories: AVAILABLE (can be purchased today), on cloud: Azure, Google, Lambda, NVIDIA, Oracle; on-premise: ASUSTek, Cisco, Clemson University, Dell, Fujitsu, Giga Computing, KRAI, Lenovo, NVIDIA, Quanta Cloud Technology, Supermicro, tinycorp. PREVIEW (will be submitted into the Available category in the next round): FlexAI, Google, NVIDIA. RDI (research and development, not for production).
  16. 23 MLPerf Training v4.1 detail (Task | Real-world application examples | Dataset | Reference model | Quality target):
     • Recommendation | Content or shopping recommendation, e.g., search, social media, ads | Criteo 4TB multi-hot | DLRM-dcnv2 | 0.8032 AUC
     • Large language model pre-training | Search, translation, chatbots, summarization | Wikipedia 2020-01-01 / C4 | BERT-large / GPT3 | 0.72 Mask-LM / 2.69 log perplexity
     • Large language model fine-tuning | High-accuracy domain-specific translation, chatbots, summarization | GovRep r1/r2/r3 | Llama 2 70B | 0.925 cross entropy loss
     • Object detection | Pedestrian detection, manufacturing defect detection, red-eye reduction | Open Images (800x800) | RetinaNet | 0.34 mAP
     • Image generation | Animation, content, marketing, art | LAION-400M-filtered | Stable Diffusion v2 | FID <= 90 and CLIP >= 0.15
     • Graph node classification (GNN) | Fraud detection, drug discovery | IGBH-Full | R-GAT | 72% classification accuracy
  17. MLPerf Training v4.1 Overview 24 • 154 performance results successfully submitted ◦ Up to 1.26x better performance versus the last round in closed available • New processors tested in the preview category ◦ Google Trillium TPU ◦ NVIDIA B200 GPU • Second MLPerf Training round with power measurement results ◦ 1 power submission by Dell, on a PowerEdge XE9680 datacenter system with NVIDIA H100-SXM-80GB GPUs
  18. Key results & insights • “Generative AI” training is more

    popular ◦ 46% increase in total submissions across GPT3, Stable Diffusion, and Llama 2 70B LoRA • Adoption and performance gains from benchmarks introduced in v4.0: ◦ Llama 2 70B LoRA: 16% increase in submissions and 1.26X performance speedup ◦ GNN: 55% increase in submissions and a 1.23X performance speedup 25
  19. 26 MLPerf Inference. Developed by the MLPerf Inference working group. Chairs: Miro Hodak (AMD), Mitchelle Rasquinha (Google). MLPerf Inference benchmark paper: https://arxiv.org/abs/1911.02549 https://mlcommons.org/benchmarks/inference/
  20. 27 MLPerf Inference scenarios. Datacenter (required for all Datacenter submissions): Server (e.g. a translation app): random query arrival with a latency constraint, metric queries/second (QPS); Offline (e.g. a photo sorting app): all queries available up front, no latency constraint, metric throughput. Edge: Single stream (e.g. cell phone augmented vision): one query at a time, metric 90%-ile latency; Multiple stream (e.g. multiple-camera driving assistance): metric throughput for 8 images. (See the sketch below.)
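
A toy illustration of the two Datacenter scenarios described above, assuming a hypothetical process_batch callable and a list of measured query latencies; the real harness is MLPerf LoadGen, which generates the query schedule itself.

```python
import time

def offline_throughput(process_batch, queries, batch_size=32):
    """Offline scenario: every query is available up front; the metric is raw
    throughput (samples/second) with no latency constraint."""
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        process_batch(queries[i:i + batch_size])
    return len(queries) / (time.perf_counter() - start)

def server_meets_bound(latencies_ms, bound_ms, percentile=0.99):
    """Server scenario: queries arrive at random times; a run is valid only if
    the chosen percentile of per-query latencies stays under the per-model bound."""
    ranked = sorted(latencies_ms)
    idx = min(int(percentile * len(ranked)), len(ranked) - 1)
    return ranked[idx] <= bound_ms
```
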
  21. 28 MLPerf Inference v4.1 workloads (Task | Dataset | Reference network | Reference quality):
     • Image classifier | ImageNet 2012 | ResNet-50 v1.5 | 76.456% top-1
     • Object detector (large) | Open Images (800x800) | RetinaNet-ResNext50 | 0.3755 mAP
     • Object detector (small) | COCO 2017 (300x300) | SSD MobileNet v1 (edge only) | 0.22 mAP
     • 3D medical imaging | KiTS 2019 (602x512x512) | 3D-UNet | 0.86330 mean*
     • NLP / Q&A | SQuAD 1.1 | BERT-Large | 90.874% F1 score*
     • Recommendation | Criteo 4TB multi-hot | DLRM-DCNv2 | 80.32% AUC*
     • LLM / text summarization | CNN/DailyMail | GPT-J 6B | ROUGE-1 42.9865, ROUGE-2 20.1235, ROUGE-L 29.9881*
     • LLM / Q&A | OpenOrca | Llama 2 70B | ROUGE-1 44.4312, ROUGE-2 22.0352, ROUGE-L 28.6162
     • Image generation | COCO-2014 | SDXL | FID ∈ (23.0108, 23.9501), CLIP ∈ (31.686, 31.813)
     • LLM / text generation | OpenOrca/GSM8K/MBXP | Mixtral-8x7B | ROUGE-1 45.4911, ROUGE-2 23.2829, ROUGE-L 30.3615; GSM8K accuracy 73.78; MBXP accuracy 60.12
     • Datacenter: Offline and Server scenarios • Edge: Single Stream, Offline, Multiple Stream (vision only) • * indicates both 99% and 99.9% reference accuracy targets • Optional power/energy measurement, building on SPEC and MLC developed tooling; performance-only and performance + power results are not directly comparable
  22. 29 MLPerf Inference v4.1 server latencies (Task | Reference network | Latency constraint):
     • Image classifier | ResNet-50 v1.5 | 15 ms
     • Object detector | RetinaNet-ResNext50 | 100 ms
     • 3D medical imaging | 3D-UNet | N/A
     • Speech-to-text | RNN-T | 1000 ms
     • NLP / Q&A | BERT-Large | 130 ms
     • Recommendation | DLRM-DCNv2 | 60 ms
     • LLM / text summarization | GPT-J 6B | 20 seconds
     • LLM / Q&A | Llama 2 70B | TTFT: 2 s & TPOT: 200 ms
     • Image generation | SDXL | 20 seconds
     • LLM / text generation | Mixtral-8x7B | TTFT: 2 s & TPOT: 200 ms
     • The Server scenario requires 99% of all queries to complete under the latency constraint • Latencies were selected based on input from advisory boards and customers to represent real deployments (see the TTFT/TPOT sketch below)
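
For the LLM workloads the constraint is expressed as time-to-first-token (TTFT) and time-per-output-token (TPOT) rather than one end-to-end number. Below is a minimal sketch of deriving both from a hypothetical per-token timestamp trace; it is not LoadGen's actual accounting.

```python
def ttft_tpot(token_times_s):
    """token_times_s: completion time of each output token, in seconds,
    relative to when the query was issued (a hypothetical trace)."""
    ttft = token_times_s[0]
    # Average gap between consecutive output tokens after the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / max(len(token_times_s) - 1, 1)
    return ttft, tpot

# A Llama 2 70B or Mixtral-8x7B Server run must keep 99% of queries within
# TTFT <= 2 s and TPOT <= 200 ms.
```
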
  23. MLPerf Inference v4.1 overview 30 • 22 submitting organizations: AMD*, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud*, and Untether AI* • Results: ◦ 964 performance results ◦ 31 power results ◦ 34 new MoE benchmark results • Power results from NVIDIA and Untether AI* * denotes a new submitter to the MLPerf Inference benchmark
  24. MLPerf Inference v4.1 Insights 31 • Submissions showcase the future of AI computing: AMD MI300X accelerator (available), AMD EPYC "Turin" CPU (preview), Google "Trillium" TPUv6e accelerator (preview), Intel "Granite Rapids" Xeon CPUs (preview), NVIDIA "Blackwell" B200 accelerator (preview), Untether AI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators • Initial adoption of FP4 • 2-4X generational performance increases • Up to 50% improvement in benchmark performance from v4.0
  25. 32 MLPerf Client. Developed by the MLPerf Client working group. Chairs: Yanni Minadakis (Microsoft), Ramesh Jaladi (Intel), Jani Joki (NVIDIA). https://mlcommons.org/working-groups/benchmarks/client/
  26. Introducing MLPerf Client v0.5 What’s now 33 • First release,

    early access ◦ Put the benchmark into the hands of users ◦ Gather feedback and learn how to improve • Four tests based on one AI model • Supports Windows x64 • Two GPU acceleration paths • CLI only • Open source, available on GitHub What’s next • Broader hardware support with more accelerated paths • Support for Windows on ARM and macOS • More tests based on varied end-user scenarios and AI models • A graphical user interface • A regular update cadence
  27. The large language model tests 34 • Large language models (LLMs) are one of the most popular forms of generative AI ◦ They take in text prompts and produce text responses ◦ Many possible applications, from chat bots to AI agents • Key specs for MLPerf Client v0.5: Tasks: content generation, creative writing, summarization (light), summarization (moderate); Model: Llama 2 7B; Dataset: OpenOrca; Mode: single stream; Quality: MMLU score
  28. What is a token? 35 A token is a component part of language that the machine-learning model uses to understand a language's syntax. 100 tokens typically translate into about 75 English words. [Figure: an example sentence with each colored region marking one token.] (See the rough estimate below.)
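
A back-of-the-envelope sketch of the 100-tokens-to-75-words rule of thumb quoted above; real tokenizers (e.g. the Llama 2 vocabulary) split text into subword pieces, so actual counts depend on the content.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Approximate token count from word count using the ~100 tokens
    per ~75 English words rule of thumb."""
    return round(len(text.split()) / words_per_token)

print(estimate_tokens("MLPerf Client measures on-device LLM performance"))  # -> 8 (approximate)
```
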
  29. Four different LLM work categories 36 (Category | Approximate input tokens | Approximate expected output tokens):
     • Content generation | 128 | 256
     • Creative writing | 512 | 512
     • Summarization, light | 1024 | 128
     • Summarization, moderate | 1566 | 256
     Prompts come from the OpenOrca dataset. Larger context lengths are more computationally intensive.
  30. Model optimization and accuracy 37 • We allow room for

    IHVs to modify the model to work best on their devices • The Llama 2 7B-parameter source model from HuggingFace stores weights in 16-bit floating-point format (fp16) ◦ A common practice for LLMs on client systems is to quantize the weights to a four-bit integer (int4) format ◦ MLPerf Client’s LLM models have been quantized to int4 • However, quantization and other optimizations can impact model output quality ◦ IHV implementations must pass an accuracy test ◦ That test is based on the MMLU data set ◦ Verifies functionality is not too compromised
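
Slide 30 notes that the Llama 2 7B weights are quantized from fp16 to int4 for client devices. Below is a minimal sketch of one common scheme (symmetric, per-group scales); the quantization each IHV actually ships may differ in group size, zero points, and calibration, and this is not the MLPerf Client code.

```python
import numpy as np

def quantize_int4_symmetric(weights, group_size=128):
    """Quantize an fp16/fp32 weight tensor to int4 values in [-8, 7] with one
    scale per group of `group_size` weights. Assumes weights.size is a
    multiple of group_size (a simplification for illustration)."""
    w = weights.astype(np.float32).reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)                # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # int4 values stored in int8
    return q, scale

def dequantize(q, scale, shape):
    """Reconstruct approximate weights for use at inference time."""
    return (q.astype(np.float32) * scale).reshape(shape)
```

Accuracy is then checked against the MMLU-based test described above to confirm the quantized model is not too degraded.
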
  31. Supported acceleration paths 38 • Two supported acceleration paths in

    MLPerf Client v0.5 ◦ ONNX Runtime GenAI with the DirectML execution provider for GPUs ◦ Intel OpenVINO native path for Intel GPUs • The benchmark ships with config JSON files for each participating IHV ◦ Specifying the path and config options they recommend • Detailed info is in the reviewer’s guide! ◦ Recommended test hardware ◦ Supported hardware ◦ Driver and memory requirements per IHV
  32. 39 MLPerf Storage Developed by MLPerf Storage Working Group Chairs:

    Curtis Anderson, Oana Balmau, Johnu George, and Huihuo Zheng https://mlcommons.org/working-groups/benchmarks/storage/
  33. 41 • Understand storage bottlenecks in ML workloads ◦ When

    the dataset cannot be cached in system memory • Help AI/ML practitioners make an informed storage decision ◦ Can the storage system keep GPUs running at >90% utilization? • Help storage suppliers optimize for ML workloads ◦ “What gets measured, gets improved” Why measure storage?
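
The >90% accelerator-utilization criterion above can be expressed as a simple ratio; a sketch, assuming we can attribute wall-clock time to (simulated) compute versus waiting on storage:

```python
def accelerator_utilization(compute_time_s, io_wait_s):
    """Fraction of wall-clock time the accelerator spends computing rather
    than stalled on data; MLPerf Storage requires this to stay above a
    workload-specific threshold (e.g. >90%)."""
    return compute_time_s / (compute_time_s + io_wait_s)

print(accelerator_utilization(90.0, 8.0))  # ~0.918 -> passes a 90% threshold
```
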
  34. MLPerf Storage Primer. Accelerators that are not heavily utilized do not test the capabilities of the storage resources. MLPerf requires high accelerator utilization to push the storage system to keep up with the data needs of AI training.
  35. MLPerf Storage Benchmark v1.0 Sleep for the time it takes

    to process a batch before requesting the next batch. Sleep time is configurable to simulate many types of accelerators. Simulated training “think time”
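
A minimal sketch of the emulation slide 35 describes: read a batch from storage, then sleep for the configured per-batch "think time" of the simulated accelerator, so only the storage path is exercised. The next_batch callable and timing bookkeeping are hypothetical; the actual benchmark builds on the DLIO tool and overlaps prefetching.

```python
import time

def simulated_training(next_batch, n_batches, batch_compute_time_s):
    """Fetch batches from storage and sleep to emulate accelerator compute;
    returns the achieved (simulated) accelerator utilization."""
    io_time = 0.0
    start = time.perf_counter()
    for _ in range(n_batches):
        t0 = time.perf_counter()
        next_batch()                        # real storage I/O under test
        io_time += time.perf_counter() - t0
        time.sleep(batch_compute_time_s)    # simulated A100/H100 "think time"
    wall = time.perf_counter() - start
    return (wall - io_time) / wall          # fraction of time spent "computing"
```
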
  36. 44 MLPerf Storage v1.0 significant variables • Framework – e.g., PyTorch vs TensorFlow • Storage network – how the storage is connected to the host nodes • Storage solution – the hardware and/or software that runs the storage • Type of training – e.g., single-host vs distributed training • Type of data – e.g., images vs text samples • Caching – the benchmark guarantees that datasets are larger than the host node cache, but in-storage caching can have important performance benefits
  37. 45 Workloads simulated by MLPerf Storage v1.0 (Task | Dataset | Reference network | Sample size | Framework | Reference quality):
     • Image segmentation (medical) | Synthetic, from KiTS19 | 3D-UNet | 146 MB | PyTorch | maximize MB/s and # of accelerators with >90% accelerator utilization
     • Image classification | Synthetic, from ImageNet | ResNet-50 | 150 KB | TensorFlow | maximize MB/s and # of accelerators with >90% accelerator utilization
     • Scientific (cosmology) | Synthetic, from CosmoFlow N-body simulation | Parameter prediction | 2 MB | TensorFlow | maximize MB/s and # of accelerators with >70% accelerator utilization
     • Workloads can be run with simulated NVIDIA A100 or H100 accelerators • Note the variety in sample size → bandwidth vs. IOPS intensity • "Compute time" per sample varies by two orders of magnitude
  38. Results overview 46 • 13 submitting organizations with over 100

    results across three workloads • Congratulations to all submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech
  39. MLPerf Divisions and Categories (v1.0). AVAILABLE: can be purchased today. PREVIEW: will be submitted into the Available category in the next round. CLOSED DIVISION: level playing field for comparable results; fixed: data format, ML framework and data loader, # of samples per file, batch size. OPEN DIVISION: showcase novel techniques; code changes are allowed but must be disclosed.
  40. Key results insights • The explosion of the AI market is driving a wave of innovation in the storage market • Those waves are relatively rare; the storage market is very focused on stability, so major innovation typically only happens when it has to • There is a broad range and diversity of storage systems and architectures responding to the needs of the AI community • Distributed training applies a much heavier metadata load to the storage system than single-node training does • For example, the up to 76% reduction in per-batch computation time for the Unet3D workload using the H100* indicates a shift from what was previously a bandwidth-sensitive workload to a latency-sensitive one (* compared to v0.5 results using V100 emulated accelerators)
  41. 49 Listening to the results Every result says something interesting,

    but it may not be obvious • Lots of work went into each result • Look at submissions that are similar across some dimensions, e.g., same vendor, same scale, best performance...but different in other dimensions • Distributed vs. single host node • Scale of distributed solutions • Different ML frameworks • First open and transparent benchmark that was developed by the AI community for storage
  42. 54 GNN overview • An important area for studying and modeling graph-structured data such as social networks, molecules, and webpage links • Wide range of commercial applications, e.g., recommendation, fraud detection, drug discovery, knowledge graph answering, etc. ◦ E.g., Alibaba's Taobao recommender applies a GNN to a user behavior graph with more than one billion vertices and tens of billions of edges • Novel characteristics compared to other benchmarks – sparse operations, feature-fetching over the network, memory access patterns
  43. Task/dataset/model 55 • Task: node classification in a heterogeneous graph • Dataset: IGBH-Full dataset (paper, repo) ◦ Largest open-source dataset available ◦ 547 million nodes, 5.8 billion edges ◦ Total classes: 2983 ◦ Dataset augmented by adding reverse edges ◦ Feature size: 2.2 TB for fp32; graph size: 160 GB • R-GAT model: ◦ 3 layers with [5,10,15] fanout ◦ Hidden dimension of 512 ◦ 4 attention heads. MLPerf R-GAT is #1 on the IGBH leaderboard.
  44. LLM fine-tuning overview 57 Fine-tuning starts with a pre-trained LLM and further trains it on a specific task or dataset to optimize for that particular task. This increases training efficiency by: • reducing computational intensity • enhancing performance on specific tasks without starting training from scratch. [Diagram: LLM pretraining on an unlabeled dataset is computationally demanding and produces a pretrained model; LLM fine-tuning on a labeled dataset is computationally efficient and performs better on specific tasks.]
  45. Parameter-efficient fine-tuning 58 Parameter-efficient fine-tuning (PEFT): a method for fine-tuning LLMs on specific tasks by adjusting a subset of model parameters; significantly more computationally efficient than traditional fine-tuning. Low-rank adaptation (LoRA) trains dense layers indirectly by optimizing rank-decomposition matrices while keeping the pre-trained weights frozen (see the sketch below). • Reduces hardware requirements by up to 3X with adaptive optimizers by eliminating gradient calculation and optimizer state maintenance for the frozen weights • Memory and storage usage are significantly reduced
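
A minimal LoRA sketch in PyTorch, illustrating the idea described above (a frozen base weight plus a trainable low-rank update B·A); this is illustrative only, not the MLPerf reference implementation, and the r/alpha values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear: freeze its weight and learn a low-rank
    update, so only r * (d_in + d_out) parameters are trained."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # starts as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```
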
  46. Model and dataset details 59 • Pretrained model – Llama-2-70B-hf, context length 8K • Dataset – SCROLLS GovReport ◦ Task – summarization ◦ Size: 1.1 GB; 17.5K samples for training and 1.9K samples for validation ◦ Average input tokens 13K, average output tokens 900 ◦ Context length 8K (the largest context that fits in one system) ◦ License: CC-BY-4.0
  47. 62 Mixture of Experts (MoE) Language Modeling • MoE is

    the gateway to multi-modal models (e.g., vision, text, speech) • Monolithic LLMs increase compute proportional to parameter count • MoEs scale parameters with less compute and maintain high accuracy ◦ Improve inference costs and performance (throughput, latency) ◦ Combine multiple specialized neural networks, or "experts" ◦ Task specialization and adaptability (e.g., coding, summarization) • MoEs have demonstrated results comparable to larger monolithic models • Task force developed MoE benchmark reference implementation
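
To make the MoE idea above concrete, here is a toy top-k routed mixture-of-experts layer in PyTorch: all experts contribute parameters, but only k of them run per token. It is loosely in the spirit of Mixtral 8x7B (8 experts, 2 active per token), not its actual implementation.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a linear router scores experts per token
    and only the top-k experts are executed for that token."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```
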
  48. 64 Mixtral 8x7B • MoE architecture requires a new approach

    to evaluate • Three tasks and datasets ◦ Q&A - OpenOrca dataset as in Llama 2 70B benchmark ◦ Math Reasoning: GSM8K (Grade School Math 8K) dataset ◦ Coding: MBXP (Mostly Basic X Programming) dataset ◦ 5,000 samples drawn from each dataset for a total of 15,000 samples • Accuracy criteria: ◦ Q&A: ROUGE-1 45.4911, ROUGE-2 23.2829, ROUGE-L 30.3615 ◦ GSM8K: 73.78 & MBXP: 60.12 • Latency: TTFT: 2s & TPOT: 200 ms (same as Llama 2 70B) • Most complex MLPerf Inference benchmark to date
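
Slide 48's Q&A accuracy criterion is expressed as ROUGE scores. A sketch of how such scores can be computed with the open-source rouge_score package, assuming lists of reference answers and model outputs; the reference implementation's exact scoring pipeline may differ.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def mean_rouge(references, predictions):
    """Average ROUGE-1/2/L F-measures (scaled by 100, as reported in MLPerf
    result tables) over a set of generated answers."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: 100.0 * total / len(references) for key, total in totals.items()}
```
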
  49. Query Examples
     Q&A: Question: Leo: If "A woman sitting on some steps smokes a cigarette while accessing her laptop." does that mean that "A person sitting on some steps smokes a cigarette while accessing her laptop."? Options: - yes - it is not possible to tell - no Answer: Yes, the sentence "A woman sitting on some steps smokes a cigarette while accessing her laptop" can be changed to "A person sitting on some steps smokes a cigarette while accessing her laptop." The gender of the person has been changed from specifically a woman to a person in general. However, it is important to note that the pronoun "her" is still used, which indicates that the person referred to is female. If you want to make the sentence completely gender-neutral, you could change "her laptop" to "the laptop" or "a laptop."
     Math Reasoning: Question: The square footage of the two bedrooms in the apartment that Jenny and Martha share totals 300 square feet. If Jenny's bedroom is 60 square feet larger than Martha's, how large, in square feet, is Martha's bedroom? Answer: Let's denote the size of Martha's bedroom as M. Jenny's bedroom is then M + 60 square feet. The total size of the two bedrooms is 300 square feet. So we have the equation M + (M + 60) = 300. Solving this equation for M, we get 2M + 60 = 300, which simplifies to 2M = 240, so M = 120. The answer is 120 square feet.
     Coding: Problem: <s> [INST] Complete the following code. Be concise, don't output anything that isn't necessary. #include <bits/stdc++.h> using namespace std; /** * Write a function to access dictionary key's element by index. * > accessKey(unordered_map<string, int>{{string("physics"), 80}, {string("math"), 90}, {string("chemistry"), 86}}, 0) * string("physics") * > accessKey(unordered_map<string, int>{{string("python"), 10}, {string("java"), 20}, {string("C++"), 30}}, 2) * string("C++") * > accessKey(unordered_map<string, int>{{string("program"), 15}, {string("computer"), 45}}, 1) * string("computer") */ string accessKey(unordered_map<string, int> ditionary, int key) { [/INST] Here's the completed code:
```cpp
#include <bits/stdc++.h>
using namespace std;

string accessKey(unordered_map<string, int> dictionary, int key) {
    auto it = begin(dictionary);
    advance(it, key);
    return it->first;
}
```
  50. Power Measurement in MLPerf 67 • First MLPerf Training round with power measurement results • Industry's first datacenter-scale full-system power measurement methodology, applicable to both on-prem and cloud data centers • Includes power consumed by all compute nodes (measured) and the interconnect network (measured/estimated) during workload runtime • Measured power is integrated over Time-to-Train (TTT) to derive Energy-to-Train (ETT); result metrics are presented as (TTT, ETT) (see the sketch below) • Submissions by Sustainable Metal Cloud (SMC): ◦ Submitted RN50, BERT, Llama 2 70B LoRA, and GPT3 pre-training at scales ranging from 1 node (8 GPUs) to 64 nodes (512 GPUs)
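
A minimal sketch of the TTT-to-ETT relationship described above, assuming a hypothetical list of full-system power samples taken at a fixed interval over the training run; the actual methodology specifies how power is measured and validated.

```python
def energy_to_train_kwh(power_samples_w, sample_interval_s):
    """Integrate sampled full-system power (watts) over the run to get
    Energy-to-Train; a simple rectangle rule stands in for whatever
    integration the real tooling performs."""
    energy_joules = sum(p * sample_interval_s for p in power_samples_w)
    return energy_joules / 3.6e6  # joules -> kWh

# Example: 10 kW average over a 2-hour time-to-train is about 20 kWh.
print(energy_to_train_kwh([10_000.0] * 7200, 1.0))  # -> 20.0
```
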
  51. 68 MLPerf Training v4.0 divisions and categories Two Divisions • Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ◦ Example changes: batch size, numerics, padding, framework, data layout ◦ Cannot change: # of layers, # of weights / pruning • Open: model not mathematically equivalent to the reference ◦ Could be very different or a small difference; submitters should describe the changes Three Categories • Available: commercially available at submission • Preview: commercially available soon (~6 months from submission) • RDI: not commercially available, e.g. research, prototype, or internal systems
  52. 69 MLPerf Inference Benchmark Definition. A trained model (e.g. ResNet) takes an input (e.g. an image) and produces a result (e.g. 'cat') with a required quality (e.g. 75.1%). MLPerf benchmarking scope: ML systems (HW + SW). Submission divisions: Closed (strict rules, apples-to-apples ML system comparison) and Open (permissive rules, better models than the reference).
  53. 70 MLPerf Categories and Divisions • Two Divisions ◦ Closed: mathematically equivalent to the reference model, to enable optimization on many different systems with a level playing field ▪ Example changes: batch size, numerics, padding, framework, data layout ▪ Cannot change: # of layers, # of weights / pruning ◦ Open: model not mathematically equivalent to the reference ▪ Could be very different or a small difference; submitters should describe the changes • Three Categories ◦ Available: commercially available at submission ◦ Preview: commercially available soon (~6 months from submission) ◦ RDI: not commercially available, e.g. research, prototype, or internal systems
  54. MLPerf Inference Power Measurement 71 • Measured power is the

    measured average AC power (and implicitly energy) consumed by the entire system while executing a performance benchmark. • MLPerf Power is only capable of measuring and validating the full system power (and implicitly energy). Any other references to power in any description (e.g., a TDP configuration) are not measured or validated by MLCommons. • Performance in a performance-only submission is not comparable to performance in performance + power submission ◦ Different optimization targets, e.g., different voltage/frequency
  55. MLPerf Automotive 73 Partnering with AVCC on an ML benchmark suite for automotive requests for information/quotes (RFI/RFQs). Focus areas include: • Safety – collision avoidance, lane departure warning, etc. • Efficiency – fuel optimization, predictive maintenance, etc. • User Experience – voice recognition, LLMs, etc. • Regulatory Compliance – meet global regulations around AI/ML in vehicles, especially for safety. MLCommons and AVCC released an automotive benchmark proof-of-concept in June 2024.