Open Data For Accountability

Open Data For Accountability
Harish Pillay
Advisor, Gen AI and AI Governance
Straits Interactive, [email protected]

Back To First Principles
• How Do We Build and Release Software - all types
• How We Build and Release Generative AI - all types
• The Open Source AI Definition v1
• The Shortcomings of OSAID v1
• A Proposed Way Forward

Back To First Principles

How Do We Build Software?

HOW DO WE BUILD SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Diagram: the DEVELOPER writes SOURCE CODE; the COMPILER turns it into a BINARY; the USER runs the binary.]
SOURCE CODE:
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        printf("Hello world.\n");
        exit(0);
    }
USER:
    $ ./hello-world
    Hello world.
    $

How Do We Build Open Source Software?

HOW DO WE BUILD OPEN SOURCE SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Diagram: the same hello-world pipeline. The DEVELOPER writes SOURCE CODE; the COMPILER turns it into a BINARY; the USER runs ./hello-world and sees "Hello world."]

How Do We Release Open Source Software?

By adding an Open Source Software License to it and releasing it

HOW DO WE RELEASE OPEN SOURCE SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Diagram: the same hello-world pipeline (DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER), now labeled "Any Open Source License".]

These Open Source Licenses adhere to the 10 Principles of the Open Source Definition

Ten Principles of the OSD
https://opensource.org/osd
1. Free Redistribution
2. Source Code
3. Derived Works
4. Integrity of The Author’s Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

And these 10 OSD Principles are expanded from the FSF’s Four Freedoms

Four Freedoms by the Free Software Foundation
https://fsf.org
Freedom 0: Free to use
• Anyone can use it, however they like
Freedom 1: Free to copy
• Anyone can get a copy for the cost of media
Freedom 2: Free to modify
• If I don’t like how it works, I can change it
Freedom 3: Free to distribute
• I can share my changes

The Open Source Way
0: Create
1: Share
2: Collaborate
3: Let everyone else do the same

Permissionless Innovation

Ask For Forgiveness, Not Permission

How Do We Build & Release AI Software?

Specifically Generative AI Software?

HOW DO WE BUILD AND RELEASE GEN AI SOFTWARE
Compiled Model Code Uses Data To Create The Model Binary
[Diagram: the DEVELOPER writes MODEL SOURCE CODE (shown on the slide as the hello-world program); the COMPILER combines it with DATA to produce a BINARY, the Foundation Models, for the USER.]
DATA: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?

HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI
Compiled Model Code Uses Data To Create The Model Binary
[Diagram: the same pipeline (DEVELOPER, MODEL SOURCE CODE, DATA, COMPILER, BINARY, Foundation Models, USER), now with two labels: "Any Open Source License" and "Open Data License".]
DATA: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?

What Does the Open Source AI Definition Say?

https://opensource.org/ai

And That Looks GOOD!

What About Training Data?

https://opensource.org/ai/open-source-ai-definition
Preferred form to make modifications to machine-learning systems
The preferred form of making modifications to a machine-learning system must include all the elements below:
• Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
  ◦ In particular, this must include:
    (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies;
    (2) a listing of all publicly available training data and where to obtain it; and
    (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
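
As a rough illustration of what the quoted clause asks for, the three listed items can be treated as a checklist. The sketch below is only that: the `DataInformation` record, its field names, and the `missing_items` helper are hypothetical names made up for this example and are not part of OSAID or any OSI tooling. It checks only whether each item is present, not whether it is "sufficiently detailed".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataInformation:
    """Hypothetical record of the three items listed in the quoted clause."""
    # (1) description of all training data: provenance, scope, selection, filtering
    description: str = ""
    # (2) publicly available training data and where to obtain it
    public_sources: List[str] = field(default_factory=list)
    # (3) training data obtainable from third parties, including for fee
    third_party_sources: List[str] = field(default_factory=list)
    # Assumed flags: which kinds of data the model actually used
    uses_public_data: bool = True
    uses_third_party_data: bool = False


def missing_items(info: DataInformation) -> List[str]:
    """Return the checklist items that are absent from the record."""
    missing = []
    if not info.description.strip():
        missing.append("description of all training data")
    if info.uses_public_data and not info.public_sources:
        missing.append("listing of publicly available data")
    if info.uses_third_party_data and not info.third_party_sources:
        missing.append("listing of third-party data")
    return missing


example = DataInformation(
    description="Web crawl text, deduplicated and filtered",
    public_sources=["CommonCrawl"],
)
print(missing_items(example))
```

Note that a record can pass this checklist without the training data itself ever being published; the checklist covers information about the data only.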

Let’s Recognize Some Challenges With Training Data
◦ Not all data can be shared
◦ Not all data can be published
◦ Not all data has rights established

In my view, OSAID v1 does not meet the Four Freedoms (yet)

In my view, the following idea needs to be pursued:
a) Rename OSAID v1 as Lesser OSAID
b) OSAID v2 requires full and complete publication of training data

For Accountability, full and complete access to training data is needed.

Harish Pillay [email protected] @[email protected] THANK YOU
