Open science for health data, applying the scikit-learn recipe?

Gaël Varoquaux

1 Scikit-learn: democratizing machine learning
2 Open science in brain imaging
3 Better decisions from electronic health records

Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
- Integration of non-curated data
- Enabling the human data scientist
- Bringing in other stakeholders
- Facilitating access to data
Can tools help?

1 Scikit-learn: democratizing machine learning
Machine learning for all

scikit-learn: from scientists to an industry standard
[Figure: number of monthly users, 2010 to 2022; vertical axis from 200k to 1M]

1 Embracing the Python stack
Python
- Interactive, easy
- General-purpose
An ecosystem
- numpy: numerical operations, a memory model (float*)
- pandas
- pytorch

1 Focus on usability
API design
- Grey box: all models interchangeable, but still inspectable
Documentation & examples
- Good documentation required to add a feature
- Easy-to-understand examples guide API design
- Teach statistical learning, rather than code
Models, solvers, hyperparameters
- Choices that do not require tinkering
- Lots of use-case-driven empirical testing

1 Community-driven development
Our DNA: distributed development & decision making
- Gave the team and the right focus
  [Figure: number of monthly contributors, 2010 to 2018; vertical axis from 0 to 50]
- People fix & improve what’s important to them
Open source has won
- But it needs sustainability and investment
- mid-2018: a foundation for scikit-learn + the community

1 Difference makes better software & science
Scikit-learn = computer science for non computer scientists
- We all do different things
- We can all benefit from others, though we don’t know how
Being didactic outside one’s community is crucial
- Avoiding jargon (take that, machine learning)
- Prioritizing information
- “Simple is better than complex”
- Students learning numerics don’t care about unicode
- Build documentation upon very simple examples
- Think stackoverflow

1 10 rules for community-driven development
1 Choose a project scope & vision
2 Use Github, work online
3 Do not own a project
4 Seek quality
5 Release early, release often
6 Limit technicity
7 Foster a good project culture
8 Organize sprints
9 Invest in recruitment
10 Communicate

1 Faster computing: ongoing speed-up
Improving Newton solver ⇒ more robust to infrequent categories
...0, 0, 0, 0, 0, 1, 0, 0...

1 Better statistics: quantile regression
Conditional quantile model
- For uncertainty quantification
- With heterogeneous errors
In gradient-boosted trees

1 Teaching: the scikit-learn MOOC
https://inria.github.io/scikit-learn-mooc
- From zero to hero: didactic, but thorough
- Fully open, free, reusable, no tracking

Scikit-learn: democratizing machine learning
A solvable problem: fit / predict
Focus on simplifying the user’s life
- Algorithmic choices, API choices, documentation efforts
Building a community
- On-boarding, trusting
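The fit / predict contract can be sketched as follows: two unrelated models are swapped behind the same calls, yet each remains inspectable through its fitted attributes. The synthetic dataset and the choice of estimators are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Interchangeable: the same fit / score calls work for every estimator
scores = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

# Inspectable: fitted attributes end with an underscore
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
coef_shape = linear.coef_.shape  # one row of weights, one column per feature
```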

2 Open science in brain imaging

2 Nilearn: machine-learning for brain imaging
Nilearn vision
- Better machine learning
- To help understanding brain images

2 Nilearn: machine-learning for brain imaging
In practice

Getting the data
    files = datasets.fetch_haxby()
- Curating light but meaningful data
- High-quality download (caching, resume)

Massaging the data for machine learning
    masker = NiftiMasker(mask_img='mask.nii', standardize=True)
    data = masker.fit_transform('fmri.nii')
- Filenames to data matrix (memory-efficient I/O)
- Common preprocessing steps included

Learning with scikit-learn
    estimator.fit(data, labels)
That’s easy!

Output
    plot_stat_map(masker.inverse_transform(estimator.weights_))

2 Easy use: example-driven development
The 3-liner as the new cool
- Teaching others
- Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
User flow on the nilearn website: Examples
Sphinx-gallery: compiling scripts in an examples gallery
- reStructuredText formatting
- Capturing outputs
- Links to function docs

2 Building great documentation
- Focus on explaining concepts (hint: write plain English)
- Less is more: prioritize, avoid redundancy
- Code examples must be short (link to full tutorial examples)
- Links everywhere: users will land at the wrong place
- Teach with the docs
Maintenance of docs: continuous integration
- Checks links
- Runs examples
- Doctests
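As an illustration of the doctest item above, here is a small sketch in which the short example in a docstring doubles as a test. The function `normalize_label` is hypothetical, written only for this example. The doctest is run programmatically here so that the snippet is self-contained.

```python
import doctest


def normalize_label(label):
    """Lower-case a label and collapse runs of whitespace.

    >>> normalize_label("  Diabetes   Type 2 ")
    'diabetes type 2'
    """
    return " ".join(label.lower().split())


# Parse the docstring into a test and run it, as a CI job would
parser = doctest.DocTestParser()
test = parser.get_doctest(
    normalize_label.__doc__,
    {"normalize_label": normalize_label},  # names visible to the example
    "normalize_label",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

In a project, the same check is usually run over whole modules, for instance with `python -m doctest` or a test runner's doctest option, so that documentation examples cannot silently go stale.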

2 Beyond software: data requires online platforms
NeuroVault.org: sharing tens of thousands of brain images [Gorgolewski... 2015]
https://neurovault.org/collections/2138/

2 Consolidation: data + software ⇒ science online
neuroquery.org: AI trained on scientific literature to generate topic-specific brain maps [Dockès... 2020]

Open data science in brain imaging
nilearn: simplifies advanced statistical processing
The hard part: a community of data
- file standards (nifti)
- opening data, and data platforms
A community ripe for openness
- Publications, brainhack: enthusiasm

3 Better decisions from electronic health records

3 Electronic Health Records – source of real-life data
Patient records (anything available, really)
- Claims databases, accounting, measurement history, doctors’ notes
- Great longitudinal coverage
- Great population coverage
AP-HP (Paris hospitals)
- 39 hospitals
- 8 million patients a year
External validity and practical usefulness

3 Electronic Health Records
Data preparation is the bottleneck
Data are not numerical matrices

3 Electronic Health Records: dirty data challenges
Missing values
- Uneven data on patients, across hospital sites
- Data not measured because not applicable, or for lack of time in the face of an emergency
- Much larger rate of missingness than in clinical studies (often 80%)
Non-normalized information
- Manual input, different conventions
- “Diabetes Type 2” | “Diabetes Mellitus, Type 2” | “DM2”

3 Missing values: a simple imputation example
http://dirtydata.science/python/
[Figure panels: fully-observed data; missing at random; missing at random, imputed; missing not at random, imputed]
Any imputation imperfection makes regression hard

3 Missing values: any deterministic imputation is consistent
Theorem: for almost any deterministic imputation function, and for any missing-values mechanism, a flexible learner is consistent (approaches the ideal predictor) [Le Morvan... 2021]
Intuition: imputation creates submanifolds, to which the learner adapts
Simple learners are not sufficient, even in linear, Gaussian settings [Morvan... 2020]

3 Missing values: what’s a good imputation? [Le Morvan... 2021]
Surely, imputing data where they are most likely!? Not
- Oracle conditional imputation: X_imputed = E[X_missing | X_obs]
- Oracle fully-observed regressor: f s.t. y = f(X) + noise
Chaining oracles may be biased
- Conditional variance turned into bias
Best: joint training of imputation & regression: differentiable imputation [Le Morvan... 2020]
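A worked illustration of why chaining the two oracles may be biased. It assumes a hypothetical regressor that is quadratic in the missing feature and noise with zero mean given the observed values. The choice of f is made here for illustration and is not from the talk.

```latex
\begin{align*}
\text{Ideal predictor:}\quad
  & \mathbb{E}[y \mid X_{\text{obs}}]
    = \mathbb{E}[f(X) \mid X_{\text{obs}}] \\
\text{With } f(X) = X_{\text{missing}}^2:\quad
  & \mathbb{E}[X_{\text{missing}}^2 \mid X_{\text{obs}}]
    = \big(\mathbb{E}[X_{\text{missing}} \mid X_{\text{obs}}]\big)^2
    + \operatorname{Var}(X_{\text{missing}} \mid X_{\text{obs}}) \\
\text{Chained oracles:}\quad
  & f(X_{\text{imputed}})
    = \big(\mathbb{E}[X_{\text{missing}} \mid X_{\text{obs}}]\big)^2 \\
\text{Bias:}\quad
  & \mathbb{E}[y \mid X_{\text{obs}}] - f(X_{\text{imputed}})
    = \operatorname{Var}(X_{\text{missing}} \mid X_{\text{obs}})
\end{align*}
```

Under these assumptions the conditional variance of the missing feature reappears as a systematic error of the chained predictor. The error vanishes only if f is linear in the missing feature or the conditional variance is zero.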

3 Missing values in practice
Benchmarks on health data [Perez-Lebel... 2022]
- Data most often missing not at random
- Imputation is hard and expensive
- Adding a missing indicator helps imputation
- Handling missing values inside trees [Josse... 2019]
Trees: HistGradientBoostingClassifier
[Figure: decision tree whose splits (x10 < -1.5?, x2 < 2?, x7 < 0.3?, x1 < 0.5?) each send missing values to either the Yes or the No branch; leaf: predict +1.3]
Focus on learner, not imputation
Consider missing indicator

3 Non-normalized text
Problem 2: non-normalized data

3 Non-normalized text: substring information
Drug Name
- alcohol
- ethyl alcohol
- isopropyl alcohol
- polyvinyl alcohol
- isopropyl alcohol swab
- 62% ethyl alcohol
- alcohol 68%
- alcohol denat
- benzyl alcohol
- dehydrated alcohol
Employee Position Title
- Police Aide
- Master Police Officer
- Mechanic Technician II
- Police Officer III
- Senior Architect
- Senior Engineer Technician
- Social Worker III
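To make the substring idea concrete, here is a small sketch that compares entries through the character 3-grams they share. The Jaccard overlap used here is an illustrative choice and not the method of the talk. The function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


titles = ["Police Aide", "Master Police Officer",
          "Police Officer III", "Senior Architect"]
query = "Police Officer III"
# Entries that share substrings with the query get a higher similarity
similarities = {title: ngram_similarity(query, title) for title in titles}
```

Entries such as "Master Police Officer" overlap with the query on substrings like "pol" and "off", while "Senior Architect" shares none, even though no two strings match exactly.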

3 Non-normalized text: GaP Encoder for latent categories [Cerda and Varoquaux 2020]
Topic model on sub-strings (GaP: Gamma-Poisson factorization)
Model strings as a linear combination of substrings
[Figure: a matrix of entries (police officer, pol off, polis, policeman, policier) by 3-grams (er_, cer, fic, off, _of, ce_, ice, lic, pol), factorized into topics]
- What substrings are in a latent category
- What latent categories are in an entry
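A rough sketch of the factorization idea using scikit-learn only. It swaps in non-negative matrix factorization with a Kullback-Leibler loss as a stand-in for the Gamma-Poisson model. This is a related but different estimator, not the GaP encoder itself. The entries, the number of topics, and the variable names are illustrative assumptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = [
    "police officer", "pol off", "polis", "policeman", "policier",
    "senior engineer", "senior architect", "engineer technician",
]

# Count character 3-grams within word boundaries: entries x 3-grams
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Stand-in for Gamma-Poisson factorization: NMF with a KL (Poisson-like) loss
nmf = NMF(
    n_components=2,
    beta_loss="kullback-leibler",
    solver="mu",
    init="nndsvda",
    max_iter=500,
    random_state=0,
)
activations = nmf.fit_transform(counts)  # what latent categories are in an entry
topic_ngrams = nmf.components_           # what substrings are in a latent category
```

Each row of `activations` can then serve as a low-dimensional numerical encoding of the corresponding string.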

3 Non-normalized text: GaP Encoder for latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: latent categories (truncated labels: brary, rator, alist, house, nager, unity, escue, ficer) against the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]

3 Dirty categories in practice
DirtyCat: dirty category software, http://dirty-cat.github.io [Cerda and Varoquaux 2020]
    from dirty_cat import GapEncoder
    gap_encoder = GapEncoder()
    transformed_values = gap_encoder.fit_transform(df)
- SuperVectorizer: dataframe to numerical matrix
- fuzzy_join: joining tables despite typos
Dirty data in practice
- Gradient-boosted trees > deep learning on tabular data [Grinsztajn... 2022]
    sklearn.ensemble.HistGradientBoostingRegressor

Electronic health records
Dirty data: solving preprocessing for ML
- missing values: more learning, rather than more imputation
- non-normalized text entries: string-level models
- non-random data degradation in outcomes or treatment
Hard to build a community
- Data privacy prevents good open examples
- Hospitals have a vertical culture
- Huge legacy infrastructure, much money

@GaelVaroquaux
Open science for health data, applying the scikit-learn recipe
The scikit-learn recipe
- Simple usage – requires identifying sub-problems
- Quality – do less, do better
- Community – to go far, go together
Lessons from brain imaging
- Data is central, requires more infrastructure
- Build a full pipeline from data to domain output
Electronic Health Records
- Privacy concerns freeze data and collaboration
- Data quality: dirty data is solvable, biases are hard
Soda research team – social data
https://team.inria.fr/soda

4 References
- P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
- P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
- J. Dockès, R. A. Poldrack, R. Primet, H. Gözükan, T. Yarkoni, F. Suchanek, B. Thirion, and G. Varoquaux. NeuroQuery, comprehensive meta-analysis of human brain mapping. eLife, 9:e53385, 2020.
- K. J. Gorgolewski, G. Varoquaux, G. Rivera, Y. Schwarz, S. S. Ghosh, C. Maumet, V. V. Sochat, T. E. Nichols, R. A. Poldrack, J.-B. Poline, ... NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in Neuroinformatics, 9:8, 2015.
- L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
- M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020.
- M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? Neural Information Processing Systems, 34, 2021.
- M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020.
- A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, 2022.
