Synthetic Data 101
Luca Gilli
Torino, 18 October 2023
Slide 2
About myself
● Co-founded Clearbox AI, a synthetic data startup hosted by the university incubator of Politecnico di Torino
● I have a background in applied mathematics and scientific software development
● Lived in the Netherlands for 10 years before moving to Val di Susa
Slide 3
Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models.
Gartner, Is Synthetic Data the Future of AI?, 2022
Slide 4
Synthetic data
Definition
A synthetic dataset is obtained by generating fictitious data that incorporates the statistical properties and distributions of an original dataset, so that the result looks realistic.
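As an illustration of this definition, the sketch below fits a very simple statistical model (a multivariate Gaussian) to a made-up numeric dataset and samples fictitious rows from it. The column meanings, the sizes, and the choice of a Gaussian are assumptions made only for this example; real generators use far richer models.

```python
import numpy as np

def fit_gaussian(original: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Estimate the mean vector and covariance matrix of the original rows."""
    mean = original.mean(axis=0)
    cov = np.cov(original, rowvar=False)
    return mean, cov

def sample_synthetic(mean: np.ndarray, cov: np.ndarray, n_rows: int, seed: int = 0) -> np.ndarray:
    """Draw fictitious rows from the fitted Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Hypothetical "original" dataset: two correlated numeric columns.
rng = np.random.default_rng(42)
age = rng.normal(45, 12, size=5000)
income = 800 * age + rng.normal(0, 5000, size=5000)
original = np.column_stack([age, income])

mean, cov = fit_gaussian(original)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# The synthetic rows are new draws, yet the correlation between columns is preserved.
orig_corr = np.corrcoef(original, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"original corr={orig_corr:.3f}, synthetic corr={synth_corr:.3f}")
```

No synthetic row is taken from the original table, but summary statistics such as means and correlations stay close, which is the sense in which the result looks realistic.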
Slide 5
Why the hype: Access and Quality
01 Data Privacy
● Privacy issues related to data sharing (GDPR/CCPA)
New anonymization paradigm
02 Data Augmentation
● Class imbalance and ML model generalization issues
It improves model performance
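To make the data augmentation point concrete, here is a minimal sketch of rebalancing an imbalanced dataset with generated minority-class rows. It uses random interpolation between pairs of minority rows, a simplification of SMOTE that skips the nearest-neighbour step; it stands in for a generative model and is not the method used in the talk. The function name, class sizes, and distributions are hypothetical.

```python
import numpy as np

def interpolate_minority(minority: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Create n_new rows by interpolating between random pairs of minority rows."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(minority), size=n_new)
    j = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return minority[i] + t * (minority[j] - minority[i])

# Hypothetical imbalanced dataset: 950 majority rows, 50 minority rows.
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(950, 2))
X_minor = rng.normal(3.0, 0.5, size=(50, 2))

# Generate enough new minority rows to match the majority class.
X_new = interpolate_minority(X_minor, n_new=len(X_major) - len(X_minor))

X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.concatenate([np.zeros(len(X_major)), np.ones(len(X_minor) + len(X_new))])
print(X_balanced.shape, y_balanced.mean())
```

A classifier trained on `X_balanced` sees both classes equally often; whether that helps on a given task still has to be checked on held-out real data.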
Slide 7
Why the hype: Access and Quality
01 Data Privacy
● Privacy issues related to data sharing (GDPR/CCPA)
New anonymization paradigm
https://edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en
Slide 8
Synthetic data as a Privacy Enhancing Technology
Slide 9
Anonymisation != Pseudonymisation
Anonymising behavioural data is really challenging.
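The difference can be sketched with a toy linkage attack. In the example below, names are replaced by a keyed hash (pseudonymisation), but the remaining attributes still single out one person when joined with auxiliary data. All records, field names, and the key are hypothetical and exist only for this illustration.

```python
import hashlib
import hmac

def pseudonymise(record: dict, secret: str) -> dict:
    """Replace the direct identifier with a keyed hash; other fields are untouched."""
    token = hmac.new(secret.encode(), record["name"].encode(), hashlib.sha256).hexdigest()[:12]
    return {"id": token, **{k: v for k, v in record.items() if k != "name"}}

def link(released: list[dict], public: list[dict]) -> list[tuple[str, str]]:
    """Re-identify released rows whose quasi-identifiers match exactly one public row."""
    matches = []
    for pub in public:
        candidates = [
            r for r in released
            if r["zip"] == pub["zip"] and r["birth_year"] == pub["birth_year"]
        ]
        if len(candidates) == 1:
            matches.append((pub["name"], candidates[0]["diagnosis"]))
    return matches

# Hypothetical sensitive table.
patients = [
    {"name": "Person A", "zip": "10121", "birth_year": 1984, "diagnosis": "asthma"},
    {"name": "Person B", "zip": "10059", "birth_year": 1971, "diagnosis": "diabetes"},
]
released = [pseudonymise(p, secret="hypothetical-key") for p in patients]

# Hypothetical auxiliary data available to an attacker.
public = [{"name": "Person B", "zip": "10059", "birth_year": 1971}]

reidentified = link(released, public)
print(reidentified)
```

The released table contains no names, yet the attacker recovers a diagnosis from postcode and birth year alone, which is why pseudonymised data does not count as anonymised.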
Synthetic Data Generation
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Several families of generative models exist.
GANs gained popularity in 2017; diffusion models are becoming state of the art.
Slide 16
Synthetic Data Generation
Slide 17
Synthetic Data Generation
El Emam, K., “Practical Synthetic Data Generation”, O’Reilly
Slide 18
Synthetic Data Generation
El Emam, K., “Practical Synthetic Data Generation”, O’Reilly
Measuring the information contained within the synthetic data.
Measuring the ‘novelty’ of the synthetic dataset.
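These two measurements can be sketched with simple proxies. As assumptions for this example, information content is approximated by the largest gap between the correlation matrices of the two datasets, and novelty by the distance from each synthetic row to its closest original row. Both function names and the toy data are hypothetical, and neither metric is taken from the cited book.

```python
import numpy as np

def correlation_gap(original: np.ndarray, synthetic: np.ndarray) -> float:
    """Largest absolute difference between the two correlation matrices (lower is better)."""
    diff = np.corrcoef(original, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    return float(np.abs(diff).max())

def distance_to_closest_record(synthetic: np.ndarray, original: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest original row."""
    diffs = synthetic[:, None, :] - original[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
original = rng.multivariate_normal([0, 0], cov, size=500)

# Stand-in for the output of a well-behaved generator: fresh draws from the same distribution.
synthetic = rng.multivariate_normal([0, 0], cov, size=500)
# Degenerate "generator" that simply memorises the original rows.
copied = original.copy()

gap = correlation_gap(original, synthetic)
dcr_synth = distance_to_closest_record(synthetic, original)
dcr_copy = distance_to_closest_record(copied, original)

print(f"correlation gap: {gap:.3f}")
print(f"share of exact copies, synthetic: {(dcr_synth == 0).mean():.2f}")
print(f"share of exact copies, memorised: {(dcr_copy == 0).mean():.2f}")
```

A generator that memorises its training data scores perfectly on the first metric and fails the second, so the two measurements have to be read together.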
Slide 19
In conclusion
Synthetic data generators are a modern and powerful Privacy Enhancing Technology.
Trust in synthetic data: generating synthetic data goes beyond using generative models and requires extensive privacy and quality tests.
Slide 20
Thanks for listening!
Feel free to contact us:
@ClearboxAI
www.clearbox.ai
info@clearbox.ai