Slide 1

Slide 1 text

Open science for health data, Applying the scikit-learn recipe? Gaël Varoquaux

Slide 2

Slide 2 text

Open science for health data, Applying the scikit-learn recipe? Gaël Varoquaux
1 Scikit-learn: democratizing machine learning
2 Open science in brain imaging
3 Better decisions from electronic health records

Slide 3

Slide 3 text

Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster G Varoquaux 2

Slide 4

Slide 4 text

Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
- Integration of non-curated data
- Enabling the human data scientist
- Bringing in other stakeholders
- Facilitating access to data
Can tools help?

Slide 5

Slide 5 text

1 Scikit-learn: democratizing machine learning
Machine learning for all

Slide 7

Slide 7 text

scikit-learn: from scientists to an industry standard
[Figure: number of monthly users per year, 2010 to 2022; y-axis from 200k to 1M]

Slide 10

Slide 10 text

1 Embracing the Python stack
Python
- Interactive, easy
- General-purpose
An ecosystem
- numpy: numerical operations, a memory model (float*) [Figure: array of numbers]
- pandas
- pytorch

Slide 11

Slide 11 text

1 Focus on usability
API design
- Grey box: all models interchangeable, but still inspectable
Documentation & examples
- Good documentation required to add a feature
- Easy-to-understand examples guide API design
- Teach statistical learning, rather than code
Models, solvers, hyperparameters
- Choices that do not require tinkering
- Lots of use-case-driven empirical testing
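A minimal sketch of the "grey box" idea, assuming scikit-learn is installed: two unrelated estimators are trained through the same fit / score calls, and each still exposes its fitted parameters for inspection. The synthetic dataset and the choice of the two models are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Interchangeable: the loop does not care which model it is handed
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
scores = {}
for model in models:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

# Inspectable: fitted attributes end with an underscore
coef = models[0].coef_                         # linear weights, shape (1, 10)
importances = models[1].feature_importances_   # one value per feature
```

Swapping in another estimator only requires changing the list, which is the point of a shared API.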

Slide 14

Slide 14 text

1 Community-driven development
Our DNA: distributed development & decision making
- Gave the team, and the right focus
- People fix & improve what’s important to them
[Figure: number of monthly contributors, 2010 to 2018; y-axis from 0 to 50]
Open source has won
- But it needs sustainability and investment
- mid-2018: a foundation for scikit-learn + the community

Slide 16

Slide 16 text

1 Difference makes better software & science
Scikit-learn = computer science for non computer scientists
- We all do different things
- We can all benefit from others, though we don’t know how
Being didactic outside one’s community is crucial
- Avoiding jargon (take that, machine learning)
- Prioritizing information: “Simple is better than complex”
- Students learning numerics don’t care about unicode
- Build documentation upon very simple examples
- Think stackoverflow

Slide 17

Slide 17 text

1 10 rules for community-driven development
1 Choose a project scope & vision
2 Use GitHub, work online
3 Do not own a project
4 Seek quality
5 Release early, release often
6 Limit technicity
7 Foster a good project culture
8 Organize sprints
9 Invest in recruitment
10 Communicate

Slide 18

Slide 18 text

1 Faster computing: ongoing speed-up
Improving Newton solver ⇒ more robust to infrequent categories
...0, 0, 0, 0, 0, 1, 0, 0...

Slide 19

Slide 19 text

1 Better statistics: quantile regression
Conditional quantile model
- For uncertainty quantification
- With heterogeneous errors
In gradient-boosted trees

Slide 20

Slide 20 text

1 Teaching: the scikit-learn MOOC
https://inria.github.io/scikit-learn-mooc
- From zero to hero: didactic, but thorough
- Fully open, free, reusable, no tracking

Slide 21

Slide 21 text

Scikit-learn: democratizing machine learning
A solvable problem: fit / predict
Focus on simplifying the user’s life
- Algorithmic choices, API choices, documentation efforts
Building a community
- On-boarding, trusting

Slide 22

Slide 22 text

2 Open science in brain imaging

Slide 23

Slide 23 text

2 Nilearn: machine-learning for brain imaging
Nilearn vision: better machine learning, to help understand brain images

Slide 24

Slide 24 text

2 Nilearn: machine-learning for brain imaging
In practice
Getting the data
    files = datasets.fetch_haxby()
- Curating light but meaningful data
- High-quality download (caching, resume)

Slide 25

Slide 25 text

2 Nilearn: machine-learning for brain imaging
In practice
Getting the data
    files = datasets.fetch_haxby()
Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii', standardize=True)
    data = masker.fit_transform('fmri.nii')
- Filenames to data matrix (memory-efficient I/O)
- Common preprocessing steps included

Slide 26

Slide 26 text

2 Nilearn: machine-learning for brain imaging
In practice
Getting the data
    files = datasets.fetch_haxby()
Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii', standardize=True)
    data = masker.fit_transform('fmri.nii')
Learning with scikit-learn
    estimator.fit(data, labels)
That’s easy!

Slide 27

Slide 27 text

2 Nilearn: machine-learning for brain imaging
In practice
Getting the data
    files = datasets.fetch_haxby()
Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii', standardize=True)
    data = masker.fit_transform('fmri.nii')
Learning with scikit-learn
    estimator.fit(data, labels)
Output
    plot_stat_map(masker.inverse_transform(estimator.weights_))

Slide 30

Slide 30 text

2 Easy use: Example-driven development
The 3-liner as the new cool
- Teaching others
- Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
User flow on the nilearn website: Examples

Slide 32

Slide 32 text

2 Easy use: Example-driven development
The 3-liner as the new cool
- Teaching others
- Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Sphinx-gallery: compiling scripts in an examples gallery
- reStructuredText formatting
- Capturing outputs
- Links to function docs

Slide 34

Slide 34 text

2 Building great documentation
- Focus on explaining concepts (hint: write plain English)
- Less is more: prioritize, avoid redundancy
- Code examples must be short (link to full tutorial examples)
- Links everywhere: users will land at the wrong place
- Teach with the docs
Maintenance of docs: continuous integration
- Checks links
- Runs examples
- Doctests
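As an illustration of the "Doctests" item, here is a small sketch using Python's standard doctest module: the short example in the docstring doubles as a test that continuous integration can run. The zscore function is a hypothetical helper invented for this example.

```python
import doctest


def zscore(values):
    """Standardize a list of numbers to zero mean and unit variance.

    >>> zscore([1.0, 2.0, 3.0])
    [-1.2247, 0.0, 1.2247]
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [round((v - mean) / std, 4) for v in values]


# Collect and run the examples found in the docstring
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(zscore, "zscore", module=False, globs={"zscore": zscore}):
    runner.run(test)
results = runner.summarize(verbose=False)  # TestResults(failed, attempted)
```

If the documented output drifts from what the code returns, results.failed becomes non-zero and the documentation build can be made to fail.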

Slide 35

Slide 35 text

2 Beyond software: data requires online platforms
NeuroVault.org: sharing tens of thousands of brain images [Gorgolewski... 2015]
https://neurovault.org/collections/2138/

Slide 36

Slide 36 text

2 Consolidation: data + software ⇒ science online
neuroquery.org: AI trained on scientific literature to generate topic-specific brain maps [Dockès... 2020]

Slide 37

Slide 37 text

Open data science in brain imaging
nilearn: simplifies advanced statistical processing
The hard part: a community of data
- file standards (nifti)
- opening data, and data platforms
A community ripe for openness
Publications, brainhack: enthusiasm

Slide 38

Slide 38 text

3 Better decisions from electronic health records

Slide 39

Slide 39 text

3 Electronic Health Records – source of real-life data
Patient records (anything available, really)
- Claims databases, accounting, measurement history, doctors’ notes
- Great longitudinal coverage
- Great population coverage
AP-HP (Paris hospitals): 39 hospitals, 8 million patients a year
External validity and practical usefulness

Slide 40

Slide 40 text

3 Electronic Health Records
Data preparation is the bottleneck
Data are not numerical matrices

Slide 42

Slide 42 text

3 Electronic Health Records: dirty data challenges
Missing values
- Uneven data on patients, across hospital sites
- Data not measured because not applicable, or for lack of time in the face of urgency...
- Much larger rate of missingness than in clinical studies (often 80%)
Non-normalized information
- Manual input, different conventions
- “Diabetes Type 2” | “Diabetes Mellitus, Type 2” | “DM2”

Slide 43

Slide 43 text

3 Missing values: A simple imputation example
http://dirtydata.science/python/
[Figure: fully-observed data]

Slide 44

Slide 44 text

3 Missing values: A simple imputation example
http://dirtydata.science/python/
[Figure: missing at random]

Slide 45

Slide 45 text

3 Missing values: A simple imputation example
http://dirtydata.science/python/
[Figure: missing at random, imputed]
Any imputation imperfection makes regression hard

Slide 46

Slide 46 text

3 Missing values: A simple imputation example
http://dirtydata.science/python/
[Figure: missing not at random, imputed]
Any imputation imperfection makes regression hard

Slide 47

Slide 47 text

3 Missing values: Any deterministic imputation is consistent
Theorem: for almost any deterministic imputation function, and for any missing-values mechanism, a flexible learner is consistent (approaches the ideal predictor) [Le Morvan... 2021]
Intuition: imputation creates submanifolds, to which the learner adapts
Simple learners are not sufficient, even in linear, Gaussian settings [Morvan... 2020]

Slide 48

Slide 48 text

3 Missing values: What’s a good imputation? [Le Morvan... 2021]
Surely, imputing data where they are most likely!? No
- Oracle conditional imputation: $X_\text{imputed} = \mathbb{E}[X_\text{missing} \mid X_\text{obs}]$
- Oracle fully-observed regressor: $f$ s.t. $y = f(X) + \text{noise}$
Chaining oracles may be biased
- Conditional variance turned into bias
Best: joint training of imputation & regression: differentiable imputation [Le Morvan... 2020]
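A worked illustration of why chaining the two oracles can be biased. It assumes, beyond what the slide states, a single missing feature, a quadratic regressor $f(x) = x^2$, and noise with zero mean given the observed features:

```latex
\begin{align*}
\text{ideal prediction:}\quad
  \mathbb{E}[y \mid X_\text{obs}]
  &= \mathbb{E}[X_\text{missing}^2 \mid X_\text{obs}] \\
  &= \big(\mathbb{E}[X_\text{missing} \mid X_\text{obs}]\big)^2
     + \operatorname{Var}(X_\text{missing} \mid X_\text{obs}) \\
\text{chained oracles:}\quad
  f(X_\text{imputed})
  &= \big(\mathbb{E}[X_\text{missing} \mid X_\text{obs}]\big)^2
\end{align*}
```

In this toy case the chained prediction falls short of the ideal one by exactly the conditional variance, which is one way to read "conditional variance turned into bias".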

Slide 49

Slide 49 text

3 Missing values in practice
Benchmarks on health data [Perez-Lebel... 2022]
- Data most often missing not at random
- Imputation is hard and expensive
- Adding a missing indicator helps imputation
Handling missing values inside trees [Josse... 2019]
- Trees: HistGradientBoostingClassifier
[Figure: decision tree with splits "x10 < -1.5 ?", "x2 < 2 ?", "x7 < 0.3 ?" and "x1 < 0.5 ?"; at each split, missing values follow one of the two branches (labelled "Yes/Missing" or "No/Missing"); example leaf "Predict +1.3"]
Focus on learner not imputation
Consider missing indicator

Slide 50

Slide 50 text

3 Non-normalized text
Problem 2: non-normalized data

Slide 51

Slide 51 text

3 Non-normalized text: Substring information
Drug Name
- alcohol
- ethyl alcohol
- isopropyl alcohol
- polyvinyl alcohol
- isopropyl alcohol swab
- 62% ethyl alcohol
- alcohol 68%
- alcohol denat
- benzyl alcohol
- dehydrated alcohol
Employee Position Title
- Police Aide
- Master Police Officer
- Mechanic Technician II
- Police Officer III
- Senior Architect
- Senior Engineer Technician
- Social Worker III
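To make the substring idea concrete, here is a minimal standard-library sketch that compares entries through their shared character 3-grams (Jaccard overlap). The helper names char_ngrams and ngram_similarity are hypothetical, and this simple overlap score is an illustration rather than the encoder presented in the talk.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Entries that share substrings score higher than unrelated ones
sim_close = ngram_similarity("Police Officer III", "Master Police Officer")
sim_far = ngram_similarity("Police Officer III", "Senior Architect")
```

Entries that a one-hot encoding would treat as unrelated categories become measurably close once substrings are taken into account.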

Slide 52

Slide 52 text

3 Non-normalized text: GaP Encoder for latent categories [Cerda and Varoquaux 2020]
Topic model on sub-strings (GaP: Gamma-Poisson factorization)
Model strings as a linear combination of substrings
[Figure: matrix of 3-gram indicators (columns: er_, cer, fic, off, _of, ce_, ice, lic, pol) for the entries "police officer", "pol off", "polis", "policeman", "policier", factorized into a documents × topics matrix and a topics × 3-grams matrix]
- What substrings are in a latent category
- What latent categories are in an entry

Slide 53

Slide 53 text

3 Non-normalized text: GaP Encoder for latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: entries against inferred latent categories. Entries: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]

Slide 54

Slide 54 text

3 Dirty categories in practice
DirtyCat: Dirty category software http://dirty-cat.github.io
    from dirty_cat import GapEncoder
    gap_encoder = GapEncoder()
    transformed_values = gap_encoder.fit_transform(df)
[Cerda and Varoquaux 2020]
- SuperVectorizer: dataframe to numerical matrix
- fuzzy_join: joining tables despite typos
Dirty data in practice
Gradient-boosted trees > deep learning on tabular data [Grinsztajn... 2022]
    sklearn.ensemble.HistGradientBoostingRegressor

Slide 55

Slide 55 text

Electronic health records
Dirty data: solving preprocessing for ML
- missing values: more learning, rather than more imputation
- non-normalized text entries: string-level models
- non-random data degradation in outcomes or treatment
Hard to build a community
- Data privacy prevents good open examples
- Hospitals have a vertical culture
- Huge legacy infrastructure, much money

Slide 56

Slide 56 text

@GaelVaroquaux
Open science for health data, Applying the scikit-learn recipe
The scikit-learn recipe
- Simple usage – requires identifying sub-problems
- Quality – do less, do better
- Community – to go far, go together

Slide 57

Slide 57 text

@GaelVaroquaux
Open science for health data, Applying the scikit-learn recipe
The scikit-learn recipe
Lessons from brain imaging
- Data is central, requires more infrastructure
- Build a full pipeline from data to domain output

Slide 58

Slide 58 text

@GaelVaroquaux
Open science for health data, Applying the scikit-learn recipe
The scikit-learn recipe
Lessons from brain imaging
Electronic Health Records
- Privacy concerns freeze data and collaboration
- Data quality: dirty data is solvable, biases are hard
Soda research team – social data https://team.inria.fr/soda

Slide 59

Slide 59 text

4 References I
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
J. Dockès, R. A. Poldrack, R. Primet, H. Gözükan, T. Yarkoni, F. Suchanek, B. Thirion, and G. Varoquaux. NeuroQuery, comprehensive meta-analysis of human brain mapping. eLife, 9:e53385, 2020.
K. J. Gorgolewski, G. Varoquaux, G. Rivera, Y. Schwarz, S. S. Ghosh, C. Maumet, V. V. Sochat, T. E. Nichols, R. A. Poldrack, J.-B. Poline, ... NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in Neuroinformatics, 9:8, 2015.

Slide 60

Slide 60 text

4 References II
L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020.
M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? Advances in Neural Information Processing Systems 34, 2021.

Slide 61

Slide 61 text

4 References III
M. L. Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020.
A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, 2022.