Slide 1

Slide 1 text

Synthetic Data 101
Luca Gilli
Turin, 18 October 2023

Slide 2

Slide 2 text

About myself
● Co-founded Clearbox AI, a synthetic data startup hosted by the university incubator of Politecnico di Torino
● I have a background in applied mathematics and scientific software development
● Lived in the Netherlands for 10 years before moving to Val di Susa

Slide 3

Slide 3 text

Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models.
Source: Gartner, "Is Synthetic Data the Future of AI?", 2022

Slide 4

Slide 4 text

Synthetic data
Definition
A synthetic dataset is obtained by generating fictitious data that reproduces the statistical properties and distributions of an original dataset, so that the result looks realistic.
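As a rough illustration of the definition above, the sketch below fits a deliberately simple statistical model (one independent Gaussian per numeric column) to a small made-up dataset and then samples fictitious rows from it. The data, the column names, and the helpers `fit_gaussian_model` and `sample_synthetic` are assumptions for illustration only; real generators model the joint distribution with much richer models.

```python
import random
import statistics

def fit_gaussian_model(rows):
    """Estimate (mean, standard deviation) for every numeric column."""
    columns = list(rows[0].keys())
    return {
        col: (
            statistics.mean([r[col] for r in rows]),
            statistics.stdev([r[col] for r in rows]),
        )
        for col in columns
    }

def sample_synthetic(model, n_rows, seed=0):
    """Draw fictitious rows, sampling each column independently."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n_rows)
    ]

# Hypothetical "original" dataset, invented for this example.
original = [
    {"age": 34, "income": 41000},
    {"age": 51, "income": 58000},
    {"age": 27, "income": 32000},
    {"age": 45, "income": 50000},
    {"age": 39, "income": 47000},
]

model = fit_gaussian_model(original)
synthetic = sample_synthetic(model, n_rows=1000)

# The synthetic column means should land close to the original ones.
for col, (mu, _) in model.items():
    synthetic_mean = statistics.mean([r[col] for r in synthetic])
    print(f"{col}: original mean = {mu:.1f}, synthetic mean = {synthetic_mean:.1f}")
```

Because each column is sampled independently, this toy model keeps the marginal distributions but loses the correlation between age and income. Capturing that kind of structure is what the more expressive generative models are for.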

Slide 5

Slide 5 text

Why the hype: Access and Quality
01 Data Privacy
● Privacy issues related to data sharing (GDPR/CCPA)
New anonymization paradigm
02 Data Augmentation
● Class imbalance and ML model generalization issues
It improves models' performance

Slide 7

Slide 7 text

Why the hype: Access and Quality
01 Data Privacy
● Privacy issues related to data sharing (GDPR/CCPA)
New anonymization paradigm
https://edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en

Slide 8

Slide 8 text

Synthetic data as a Privacy Enhancing Technology

Slide 9

Slide 9 text

Anonymisation != Pseudonymisation
Anonymising behavioural data is really challenging.
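A minimal sketch of why the two are not the same: replacing a direct identifier with a token (pseudonymisation) leaves the remaining attributes intact, and those can be linked against auxiliary data. All records, the salt, and the helper names `pseudonymise` and `link` are made up for illustration.

```python
import hashlib

def pseudonymise(record, secret_salt="hypothetical-salt"):
    """Replace the name with a salted hash; keep every other attribute."""
    token = hashlib.sha256((secret_salt + record["name"]).encode()).hexdigest()[:12]
    return {
        "id": token,
        "zip": record["zip"],
        "birth_year": record["birth_year"],
        "diagnosis": record["diagnosis"],
    }

# Invented patient records.
patients = [
    {"name": "Alice Rossi", "zip": "10121", "birth_year": 1985, "diagnosis": "asthma"},
    {"name": "Bruno Bianchi", "zip": "10122", "birth_year": 1990, "diagnosis": "diabetes"},
    {"name": "Carla Verdi", "zip": "10121", "birth_year": 1972, "diagnosis": "migraine"},
]
released = [pseudonymise(p) for p in patients]

# Auxiliary knowledge an attacker might hold (also invented).
public_register = [
    {"name": "Alice Rossi", "zip": "10121", "birth_year": 1985},
    {"name": "Carla Verdi", "zip": "10121", "birth_year": 1972},
]

def link(released_rows, register):
    """Re-identify people whose (zip, birth_year) pair is unique in the release."""
    reidentified = {}
    for person in register:
        key = (person["zip"], person["birth_year"])
        matches = [r for r in released_rows if (r["zip"], r["birth_year"]) == key]
        if len(matches) == 1:
            reidentified[person["name"]] = matches[0]["diagnosis"]
    return reidentified

reidentified = link(released, public_register)
print(reidentified)
```

No name appears in the released rows, yet both people in the register are linked back to a diagnosis through their quasi-identifiers. With behavioural data, where each record is often close to unique, this kind of linkage tends to be even easier.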

Slide 10

Slide 10 text

http://www.tdp.cat/issues16/tdp.a363a19.pdf

Slide 11

Slide 11 text

http://www.tdp.cat/issues16/tdp.a363a19.pdf

Slide 12

Slide 12 text

https://edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en

Slide 13

Slide 13 text

How to generate synthetic data

Slide 14

Slide 14 text

Synthetic Data Generation

Slide 15

Slide 15 text

Synthetic Data Generation
There are several families of generative models. GANs gained popularity in 2017; diffusion models are becoming the state of the art.
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

Slide 16

Slide 16 text

Synthetic Data Generation

Slide 17

Slide 17 text

Synthetic Data Generation
El Emam, K., “Practical Synthetic Data Generation”, O’Reilly

Slide 18

Slide 18 text

Synthetic Data Generation
● Measuring the information contained within the synthetic data.
● Measuring the ‘novelty’ of the synthetic dataset.
El Emam, K., “Practical Synthetic Data Generation”, O’Reilly
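The two measurements above could be sketched as follows. The metrics here are stand-ins chosen for simplicity, not necessarily the ones used in the cited book: the total variation distance between category frequencies serves as a crude proxy for retained information, and the share of synthetic rows that exactly copy a real row serves as a crude novelty check. The datasets and function names are hypothetical.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """0.0 means identical category frequencies, 1.0 means disjoint ones."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )

def exact_copy_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    real_set = set(real_rows)
    copies = sum(1 for row in synthetic_rows if row in real_set)
    return copies / len(synthetic_rows)

# Invented (gender, city) rows for illustration.
real = [("F", "Torino"), ("M", "Milano"), ("F", "Milano"), ("M", "Torino")]
synthetic = [("F", "Torino"), ("F", "Roma"), ("M", "Milano"), ("M", "Roma")]

tvd_gender = total_variation_distance([r[0] for r in real], [s[0] for s in synthetic])
tvd_city = total_variation_distance([r[1] for r in real], [s[1] for s in synthetic])
copy_rate = exact_copy_rate(real, synthetic)

print(f"TVD gender: {tvd_gender}, TVD city: {tvd_city}, exact copies: {copy_rate}")
```

A low distance suggests the synthetic data preserves that column's distribution, while a high copy rate suggests the generator is memorising real records rather than producing new ones. In practice both checks would be run across many columns and with distance-based rather than exact matching.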

Slide 19

Slide 19 text

In conclusion
● Synthetic data generators are a modern and powerful Privacy Enhancing Technology.
● Trust in synthetic data → generating synthetic data goes beyond using generative models: it also requires extensive privacy and quality tests.

Slide 20

Slide 20 text

Thanks for listening!
Feel free to contact us:
@ClearboxAI
www.clearbox.ai
info@clearbox.ai