apidays
December 31, 2024

apidays Paris 2024 - Evaluation as a Tool for Regulatory Compliance: Scratching the AI Regulation Surface - Carlos Muñoz Ferrandis, Alinia AI

AI Evaluation as a Tool for Regulatory Compliance
Carlos Muñoz Ferrandis, Creator of OpenRAIL and Co-Founder & COO at Alinia AI

apidays Paris 2024 - The Future API Stack for Mass Innovation
December 3 - 5, 2024

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

Transcript

  1. Evaluation as a tool for regulatory compliance: Scratching the AI regulation surface
     Carlos Muñoz Ferrandis, co-founder & COO
  2. Increase of AI policy and regulatory initiatives, worldwide
     1000+ policy initiatives reported by governments in 70+ jurisdictions in the OECD database (May 2023).
     USA 2024: total number of AI-related regulations grew by 56.3%.
     AI mentioned in legislative proceedings: 1,247 in 2022; 2,175 in 2023.
     Between 2022 and 2023, AI incidents reported globally increased by 1278%.
     Data extracted from OECD AI and AI Index Report 2024.
  3. Main challenges...
     How to interpret regulation?
     How to measure compliance?
     How to anticipate and mitigate risk?
     How to report compliance?
     ...Challenge for whom? Market Authorities
  4. EU AI Act control & risk mitigation tooling
     Categories: Prohibited Practices, High Risk AI Systems, General Purpose AI
     Evaluation & Red teaming: Art 9, 15, 17 (High Risk AI Systems); Art 53.1 (General Purpose AI)
     Guardrails & Monitoring: Art 9, 15, 17 (High Risk AI Systems); Art 55.1 (General Purpose AI)
     Documentation: Art 11, 13, Annex IV (High Risk AI Systems); Art 53, 55 (General Purpose AI)
     *Similarities with Digital Operational Resilience Act (arts 9, 10, 25)
  5. Let's scratch the regulatory surface a bit...
     EU AI Act
     Prohibited AI systems: “behavioral distortion”
     High risk AI systems: “robustness”, “perform consistently for their intended purpose”
     Digital Operational Resilience Act: robustness, resilience, reliability of ICT systems
     How do we measure and monitor unclear requirements?
  6. Evaluation at scale is a need...and a science.
     Define criterion: “Robustness”
     Define metrics/rubrics
     Train “Evaluator models”
     Run evals, monitor at scale
     Close your eyes and pray
  7. What is currently missing?
     Standardizing benchmarks for regulatory compliance
     Variety of open benchmarks + not so good datasets
     Focus on defining criteria + metrics to measure
     Transversal criteria vs industry-specific criteria
     “Cards” everywhere
     Industry-specific Gen AI evals + guardrails
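
The steps on slide 6 (define a criterion, define a metric, run evals) can be illustrated with a minimal Python sketch. Nothing here comes from the talk or from Alinia AI's tooling: the `Criterion` class, the `perturb` function, the 0.9 threshold and the `toy_classifier` stand-in are all hypothetical. The sketch also swaps the trained "evaluator model" for a simple exact-match rule, treating "robustness" as output consistency under small input perturbations, which is only one possible reading of "perform consistently for their intended purpose".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """A named compliance criterion with a pass threshold (assumed value)."""
    name: str
    description: str
    threshold: float


def perturb(prompt: str) -> List[str]:
    """Hypothetical perturbations: casing, padding, trailing punctuation."""
    return [prompt.upper(), prompt.lower(), f"  {prompt}  ", prompt + "!!"]


def consistency_score(system: Callable[[str], str], prompts: List[str]) -> float:
    """Share of perturbed prompts whose output matches the unperturbed output."""
    matches = 0
    total = 0
    for prompt in prompts:
        reference = system(prompt)
        for variant in perturb(prompt):
            total += 1
            matches += system(variant) == reference
    return matches / total if total else 0.0


def toy_classifier(text: str) -> str:
    """Stand-in for the system under evaluation; case-sensitive on purpose."""
    return "refund" if "refund" in text else "other"


# Step 1: define the criterion. Step 2: the metric is consistency_score.
robustness = Criterion(
    name="robustness",
    description="performs consistently for its intended purpose",
    threshold=0.9,  # assumed acceptance level, not taken from any regulation
)

# Step 3: run the eval on a tiny, made-up prompt set.
prompts = ["I want a refund", "Where is my order"]
score = consistency_score(toy_classifier, prompts)
passed = score >= robustness.threshold
print(f"{robustness.name}: score={score:.3f}, passed={passed}")
```

With this toy system the upper-cased variant of the first prompt flips the label, so 7 of 8 variants agree with the reference and the score (0.875) falls below the assumed threshold. In practice the exact-match comparison would be replaced by a rubric or an evaluator model, and the same score would be tracked over time for monitoring.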