apidays Paris 2024 - Evaluation as a Tool for Regulatory Compliance: Scratching the AI Regulation Surface - Carlos Muñoz Ferrandis, Alinia AI

Slide 1

Slide 1 text

Evaluation as a tool for regulatory compliance: Scratching the AI regulation surface
Carlos Muñoz Ferrandis, co-founder & COO

Slide 2

Slide 2 text

Increase of AI policy and regulatory initiatives, worldwide

1000+ policy initiatives reported by governments in 70+ jurisdictions in the OECD database (May 2023).
USA 2024: total number of AI-related regulations grew by 56.3%.
AI mentioned in legislative proceedings: 1,247 in 2022; 2,175 in 2023.
Between 2022 and 2023, AI incidents reported increased by 1278% globally...

Data extracted from OECD AI and AI Index Report 2024

Slide 3

Slide 3 text

Regulations are coming.

Slide 4

Slide 4 text

Main challenges...
how to interpret regulation?
how to measure compliance?
how to anticipate and mitigate risk?
how to report compliance?

...Challenge for whom?
Market
Authorities

Slide 5

Slide 5 text

EU AI Act control & risk mitigation tooling

Scope: Prohibited Practices, High Risk AI Systems, General Purpose AI

Evaluation & Red teaming: Art 9, 15, 17 (High Risk AI Systems); Art 53.1 (General Purpose AI)
Guardrails & Monitoring: Art 9, 15, 17 (High Risk AI Systems); Art 55.1 (General Purpose AI)
Documentation: Art 11, 13, Annex IV (High Risk AI Systems); Art 53, 55 (General Purpose AI)

*Similarities with Digital Operational Resilience Act (arts 9, 10, 25)

Slide 6

Slide 6 text

Let's scratch the regulatory surface a bit...

EU AI Act
Prohibited AI systems: “behavioral distortion”
High risk AI systems: “robustness”, “perform consistently for their intended purpose”

Digital Operational Resilience Act
Robustness, Resilience, Reliability of ICT systems

How do we measure and monitor unclear requirements?
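One way to make a phrase like “perform consistently” measurable is to commit to an operational definition. The sketch below assumes, purely for illustration, that consistency means giving the same answer to paraphrases of the same request. That definition does not come from the EU AI Act or DORA, and `consistency_score`, `toy_model`, and the sample prompts are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_score(model: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Share of paraphrased prompts whose answer matches the most common answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    answers = [model(p).strip().lower() for p in prompts]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a deployed system: it misbehaves on all-caps input.
    return "refused" if prompt.isupper() else "approved"


# Four phrasings of the same request (assumed test data).
paraphrases = [
    "Can I get a refund for order 123?",
    "I would like a refund for order 123.",
    "REFUND FOR ORDER 123",
    "Please refund order 123.",
]
score = consistency_score(toy_model, paraphrases)
print(f"consistency: {score:.2f}")  # prints "consistency: 0.75"
```

A score below 1.0 shows that some rephrasing changes the system's behaviour. What score counts as acceptable is a policy decision that the code cannot make.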

Slide 7

Slide 7 text

Define criterion “Robustness”
Define metrics/rubrics
Train “Evaluator models”
Run evals, monitor at scale
Close your eyes and pray

Evaluation at scale is a need...and a science.
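The steps on this slide could be wired together roughly as follows. This is a minimal sketch under stated assumptions: the `Rubric` fields, the three-level scale, the 1.5 threshold, and `rule_based_evaluator` (a crude stand-in for a trained evaluator model) are all hypothetical and do not describe any particular product or the speaker's implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rubric:
    criterion: str            # step 1: the criterion being measured
    levels: Dict[int, str]    # step 2: score -> what that score means
    min_mean_score: float     # assumed alert threshold for monitoring


# Step 3 placeholder: a trained evaluator model would sit behind this signature.
Evaluator = Callable[[str, str], int]  # (input, output) -> rubric score


def rule_based_evaluator(user_input: str, output: str) -> int:
    """Hypothetical stand-in for an evaluator model, using crude string rules."""
    if not output.strip():
        return 0
    if "i don't know" in output.lower():
        return 1
    return 2


def run_evals(rubric: Rubric, evaluator: Evaluator,
              records: List[Tuple[str, str]]) -> List[int]:
    """Step 4a: score every (input, output) pair against the rubric."""
    scores = [evaluator(user_input, output) for user_input, output in records]
    invalid = [s for s in scores if s not in rubric.levels]
    if invalid:
        raise ValueError(f"scores outside rubric levels: {invalid}")
    return scores


def meets_threshold(rubric: Rubric, scores: List[int]) -> bool:
    """Step 4b: monitoring check; False means the batch should raise an alert."""
    return mean(scores) >= rubric.min_mean_score


robustness = Rubric(
    criterion="robustness",
    levels={0: "no usable answer", 1: "hedged or partial", 2: "stable, usable answer"},
    min_mean_score=1.5,
)
records = [
    ("What is my balance?", "Your balance is 40 EUR."),
    ("wHaT iS mY bAlAnCe??", ""),
    ("Balance pls", "I don't know."),
]
scores = run_evals(robustness, rule_based_evaluator, records)
passed = meets_threshold(robustness, scores)
print(scores, "OK" if passed else "ALERT")  # prints "[2, 0, 1] ALERT"
```

Keeping the rubric as data and the evaluator behind a plain function signature means the rule-based stub could later be swapped for a model-based evaluator without touching the monitoring check.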

Slide 8

Slide 8 text

What is currently missing?

Standardizing benchmarks for regulatory compliance
Variety of open benchmarks + not so good datasets
Focus on defining criteria + metrics to measure
Transversal criteria vs industry-specific criteria
“Cards” everywhere
Industry-specific Gen AI evals + guardrails

Slide 9

Slide 9 text

Safe & controlled deployment of gen AI

Thank you!