Slide 1

Slide 1 text

Open Data For Accountability
Harish Pillay
Advisor, Gen AI and AI Governance
Straits Interactive, [email protected]

Slide 2

Slide 2 text

Back To First Principles
● How Do We Build and Release Software - all types
● How We Build and Release Generative AI - all types
● The Open Source AI Definition v1
● The Shortcomings of OSAID v1
● A Proposed Way Forward

Slide 3

Slide 3 text

Back To First Principles

Slide 4

Slide 4 text

How Do We Build Software?

Slide 5

Slide 5 text

HOW DO WE BUILD SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Diagram: the DEVELOPER writes SOURCE CODE, the COMPILER turns it into a BINARY, and the USER runs the BINARY]
SOURCE CODE: int main() { printf("Hello world.\n"); exit(0); }
BINARY: 101010110101010101010101010101 010010101010111010101010111101 010101111010001001010101010100 11001010101101
USER:
$ ./hello-world
Hello world.
$

Slide 6

Slide 6 text

How Do We Build Open Source Software?

Slide 7

Slide 7 text

HOW DO WE BUILD OPEN SOURCE SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Same diagram as Slide 5: DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER]

Slide 8

Slide 8 text

How Do We Release Open Source Software?

Slide 9

Slide 9 text

By adding an Open Source Software License to it and releasing

Slide 10

Slide 10 text

HOW DO WE RELEASE OPEN SOURCE SOFTWARE
COMPILERS MAKE SOURCE INTO BINARIES
[Same diagram as Slide 5: DEVELOPER, SOURCE CODE, COMPILER, BINARY, USER, with one added label: Any Open Source License]

Slide 11

Slide 11 text

These Open Source Licenses adhere to the 10 Principles of the Open Source Definition

Slide 12

Slide 12 text

Ten Principles of the OSD
https://opensource.org/osd
1. Free Redistribution
2. Source Code
3. Derived Works
4. Integrity of The Author’s Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

Slide 13

Slide 13 text

And these 10 OSD Principles are expanded from the FSF’s Four Freedoms

Slide 14

Slide 14 text

Four Freedoms by the Free Software Foundation
https://fsf.org
Freedom 0: Free to use
• Anyone can use it, however they like
Freedom 1: Free to copy
• Anyone can get a copy for the cost of media
Freedom 2: Free to modify
• If I don’t like how it works, I can change it
Freedom 3: Free to distribute
• I can share my changes

Slide 15

Slide 15 text

The Open Source Way
0: Create
1: Share
2: Collaborate
3: Let everyone else do the same

Slide 16

Slide 16 text

Permissionless Innovation

Slide 17

Slide 17 text

Ask For Forgiveness, Not Permission

Slide 18

Slide 18 text

How Do We Build & Release AI Software?

Slide 19

Slide 19 text

Specifically Generative AI Software?

Slide 20

Slide 20 text

HOW DO WE BUILD AND RELEASE GEN AI SOFTWARE
Compiled Model Code Uses Data To Create The Model Binary
[Diagram: the DEVELOPER writes MODEL SOURCE CODE, the COMPILER builds it, and the compiled code takes in DATA to create the BINARY (Foundation Models) that the USER runs]
MODEL SOURCE CODE: int main() { printf("Hello world.\n"); exit(0); }
DATA: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
BINARY: 1010101101010101010101010101010100 1010101011101010101011110101010111 1010001001010101010100110010101011 0

Slide 21

Slide 21 text

HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI
Compiled Model Code Uses Data To Create The Model Binary
[Diagram: the DEVELOPER writes MODEL SOURCE CODE, the COMPILER builds it, and the compiled code takes in DATA to create the BINARY (Foundation Models) that the USER runs]
MODEL SOURCE CODE: int main() { printf("Hello world.\n"); exit(0); }
DATA: Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?
BINARY: 1010101101010101010101010101010100 1010101011101010101011110101010111 1010001001010101010100110010101011 0
Labels: Any Open Source License, Open Data License

Slide 22

Slide 22 text

What Does the Open Source AI Definition Say?

Slide 23

Slide 23 text

https://opensource.org/ai

Slide 24

Slide 24 text

https://opensource.org/ai

Slide 25

Slide 25 text

https://opensource.org/ai

Slide 26

Slide 26 text

https://opensource.org/ai

Slide 27

Slide 27 text

And That Looks GOOD!

Slide 28

Slide 28 text

What About Training Data?

Slide 29

Slide 29 text

HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI
Compiled Model Code Uses Data To Create The Model Binary
[Same diagram as Slide 21: DEVELOPER, MODEL SOURCE CODE, COMPILER, DATA (Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?), BINARY (Foundation Models), USER, with the labels Any Open Source License and Open Data License]

Slide 30

Slide 30 text

https://opensource.org/ai/open-source-ai-definition
Preferred form to make modifications to machine-learning systems
The preferred form of making modifications to a machine-learning system must include all the elements below:
● Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
  ○ In particular, this must include:
    (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies;
    (2) a listing of all publicly available training data and where to obtain it; and
    (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.

Slide 31

Slide 31 text

Let’s Recognize Some Challenges With Training Data
○ Not all data can be shared
○ Not all data can be published
○ Not all data has rights established

Slide 32

Slide 32 text

In my view, OSAID v1 does not meet the Four Freedoms (yet)

Slide 33

Slide 33 text

In my view, the following idea needs to be pursued:
a) Rename OSAID v1 as Lesser OSAID
b) OSAID v2 requires full and complete publication of training data

Slide 34

Slide 34 text

HOW DO WE BUILD & RELEASE OPEN SOURCE GEN AI
Compiled Model Code Uses Data To Create The Model Binary
[Same diagram as Slide 21: DEVELOPER, MODEL SOURCE CODE, COMPILER, DATA (Books1, Books2, CommonCrawl, Internet, ???, &^%$, WhoKnows?), BINARY (Foundation Models), USER, with the labels Any Open Source License and Open Data License]

Slide 35

Slide 35 text

For Accountability, full and complete access to training data is needed.

Slide 36

Slide 36 text

Harish Pillay
[email protected]
@[email protected]
THANK YOU