Open Data For Accountability

This is a talk I delivered at the FOSSAsia Summit 2025 held in Bangkok, Thailand on 14 March 2025.

It is about a shortcoming of the Open Source AI Definition (OSAID) v1: it does not require full publication of, and access to, the training data of models that want to claim to be open source.

The proposal at the end of the talk is to have OSAID v1 recast as a Lesser OSAID (along the lines and intent of the Lesser GPL/Library GPL) and to have an OSAID v2 that requires full publication of training data.

Original editable deck is here: https://docs.google.com/presentation/d/1mvO23-KWXYFuSSwnpnPDenT4krBe-ZbR-4C9u-jXW-g/edit?usp=sharing.

Harish Pillay

March 14, 2025

Transcript

  1. Open Data For Accountability
    Harish Pillay, Advisor, Gen AI and AI Governance, Straits Interactive, h.pillay@ieee.org
  2. Back To First Principles
    • How Do We Build and Release Software - all types
    • How We Build and Release Generative AI - all types
    • The Open Source AI Definition v1
    • The Shortcomings of OSAID v1
    • A Proposed Way Forward
  3. HOW DO WE BUILD SOFTWARE: COMPILERS MAKE SOURCE INTO BINARIES
    [Diagram labels: DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER.]
    Source code: int main() { printf("Hello world.\n"); exit(0); }
    Binary: shown as a string of 1s and 0s
    User's terminal: $ ./hello-world prints "Hello world."
  4. HOW DO WE BUILD OPEN SOURCE SOFTWARE: COMPILERS MAKE SOURCE INTO BINARIES
    [Diagram labels: DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER.]
    Source code: int main() { printf("Hello world.\n"); exit(0); }
    Binary: shown as a string of 1s and 0s
    User's terminal: $ ./hello-world prints "Hello world."
  5. HOW DO WE RELEASE OPEN SOURCE SOFTWARE: COMPILERS MAKE SOURCE INTO BINARIES
    [Diagram labels: DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER, Any Open Source License.]
    Source code: int main() { printf("Hello world.\n"); exit(0); }
    Binary: shown as a string of 1s and 0s
    User's terminal: $ ./hello-world prints "Hello world."
  6. Ten Principles of the OSD https://opensource.org/osd
    1. Free Redistribution
    2. Source Code
    3. Derived Works
    4. Integrity of The Author’s Source Code
    5. No Discrimination Against Persons or Groups
    6. No Discrimination Against Fields of Endeavor
    7. Distribution of License
    8. License Must Not Be Specific to a Product
    9. License Must Not Restrict Other Software
    10. License Must Be Technology-Neutral
  7. Four Freedoms by the Free Software Foundation https://fsf.org
    Freedom 0: Free to use
      • Anyone can use it, however they like
    Freedom 1: Free to copy
      • Anyone can get a copy for the cost of media
    Freedom 2: Free to modify
      • If I don’t like how it works, I can change it
    Freedom 3: Free to distribute
      • I can share my changes
  8. The Open Source Way
    0: Create
    1: Share
    2: Collaborate
    3: Let everyone else do the same
  9. HOW DO WE BUILD AND RELEASE GEN AI SOFTWARE: Compiled Model Code Uses Data To Create The Model Binary
    [Diagram labels: DEVELOPER, MODEL SOURCE CODE, DATA, COMPILER, BINARY (Foundation Models), USER.]
    Model source code: int main() { printf("Hello world.\n"); exit(0); }
    Data: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
    Binary: shown as a string of 1s and 0s
  10. HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI: Compiled Model Code Uses Data To Create The Model Binary
    [Diagram labels: DEVELOPER, MODEL SOURCE CODE, DATA, COMPILER, BINARY (Foundation Models), USER, Any Open Source License, Open Data License.]
    Model source code: int main() { printf("Hello world.\n"); exit(0); }
    Data: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
    Binary: shown as a string of 1s and 0s
  11. HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI: Compiled Model Code Uses Data To Create The Model Binary
    [Diagram labels: DEVELOPER, MODEL SOURCE CODE, DATA, COMPILER, BINARY (Foundation Models), USER, Any Open Source License, Open Data License.]
    Model source code: int main() { printf("Hello world.\n"); exit(0); }
    Data: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
    Binary: shown as a string of 1s and 0s
  12. https://opensource.org/ai/open-source-ai-definition
    Preferred form to make modifications to machine-learning systems
    The preferred form of making modifications to a machine-learning system must include all the elements below:
    • Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
      ◦ In particular, this must include:
        (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies;
        (2) a listing of all publicly available training data and where to obtain it; and
        (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
  13. Let’s Recognize Some Challenges With Training Data
    ◦ Not all data can be shared
    ◦ Not all data can be published
    ◦ Not all data has rights established
  14. In my view the following idea needs to be pursued:
    a) Rename OSAID v1 as Lesser OSAID
    b) OSAID v2 requires full and complete publication of training data
  15. HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI: Compiled Model Code Uses Data To Create The Model Binary
    [Diagram labels: DEVELOPER, MODEL SOURCE CODE, DATA, COMPILER, BINARY (Foundation Models), USER, Any Open Source License, Open Data License.]
    Model source code: int main() { printf("Hello world.\n"); exit(0); }
    Data: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
    Binary: shown as a string of 1s and 0s
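
The Data Information requirement quoted on slide 12 and the proposal on slide 14 can be read as a checklist. What follows is a minimal sketch in C and is not part of the talk: the struct `data_information`, its field names, and both functions are hypothetical, and OSAID v1 defines no such schema. It treats an empty entry as missing, and it assumes for illustration that the proposed OSAID v2 would keep the three v1 items and add publication of the training data itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what a model release states about its training data.
 * Field names are illustrative only; OSAID v1 does not define a schema. */
struct data_information {
    const char *description;       /* (1) provenance, scope, selection, labeling, filtering */
    const char *public_data_list;  /* (2) publicly available data and where to obtain it */
    const char *third_party_list;  /* (3) third-party data and where to obtain it, including for fee */
    bool training_data_published;  /* not required by OSAID v1; proposed for OSAID v2 */
};

/* A string counts as present when it is non-NULL and non-empty. */
static bool is_present(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Rough check modelled on the three listed items of OSAID v1 Data Information.
 * Simplification: a release with no third-party data would still need to say so. */
bool meets_data_information_v1(const struct data_information *d)
{
    return is_present(d->description)
        && is_present(d->public_data_list)
        && is_present(d->third_party_list);
}

/* Assumed reading of the proposed v2: the v1 items plus the published training data. */
bool meets_proposed_v2(const struct data_information *d)
{
    return meets_data_information_v1(d) && d->training_data_published;
}
```

Under this reading, a release that documents its data but does not publish it passes `meets_data_information_v1` and fails `meets_proposed_v2`, which is the gap between a Lesser OSAID and an OSAID v2 that the talk points at.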