Open science for health data, applying the scikit-learn recipe?

Gael Varoquaux
November 22, 2022

Developing data-analysis pipelines around open source, open data, and open communities has been highly successful. The most widely used machine-learning tool to date, scikit-learn, was assembled by a large community, with different contributors bringing different expertise. Can such success be carried over to health data? I will discuss my experience building first scikit-learn, then nilearn in the brain imaging community, and finally more recent work on electronic health records. Spoiler: things are harder with electronic health records.

Transcript

  1. Open science for health data, Applying the scikit-learn recipe? Gaël

    Varoquaux. 1 Scikit-learn: democratizing machine learning. 2 Open science in brain imaging. 3 Better decisions from electronic health records.
  2. Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster Integration of non curated data

    Enabling the human data scientist Bringing in other stakeholders Facilitating access to data Can tools help? G Varoquaux 2
  3. scikit-learn: from scientists to an industry standard

    [figure: number of monthly users, 2010 to 2022, y-axis from 200k to 1M]
  4. 1 Embracing the Python stack Python Interactive, easy General-purpose An

    ecosystem. numpy [figure: an array of numbers]: numerical operations, a memory model (float*)
  5. 1 Embracing the Python stack Python Interactive, easy General-purpose An

    ecosystem. numpy [figure: an array of numbers]: numerical operations, a memory model (float*). pandas, pytorch
  6. 1 Focus on usability API design Grey box: all models

    interchangeable, but still inspectable. Documentation & examples: good documentation required to add a feature; easy-to-understand examples guide API design; teach statistical learning, rather than code. Models, solvers, hyperparameters: choices that do not require tinkering; lots of use-case-driven empirical testing
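The "grey box" idea on this slide can be sketched with two standard scikit-learn estimators. This is a minimal illustration, not taken from the talk: the synthetic dataset and the choice of models are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here purely for illustration (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Interchangeable: every estimator exposes the same fit / predict / score
# interface, so swapping one model for another changes a single line.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Inspectable: what a model learned is stored in attributes that end
# with an underscore and can be read after fitting.
coef_shape = models["logistic regression"].coef_.shape
importances = models["random forest"].feature_importances_
```

The loop never needs to know which model it is fitting, while the fitted attributes remain available for anyone who wants to look inside.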
  7. 1 Community-driven development Our DNA: distributed development & decision making

    Gave the team and the right focus [figure: number of monthly contributors, 2010 to 2018]. People fix & improve what’s important to them
  8. 1 Community-driven development Our DNA: distributed development & decision making

    Gave the team and the right focus [figure: number of monthly contributors, 2010 to 2018]. People fix & improve what’s important to them. Open source has won, but it needs sustainability and investment
  9. 1 Community-driven development Our DNA: distributed development & decision making

    Gave the team and the right focus [figure: number of monthly contributors, 2010 to 2018]. People fix & improve what’s important to them. Open source has won, but it needs sustainability and investment. Mid-2018: a foundation for scikit-learn + the community
  10. 1 Difference makes better software & science Scikit-learn = computer

    science for non-computer scientists. We all do different things. We can all benefit from others, though we don’t know how
  11. 1 Difference makes better software & science Scikit-learn = computer

    science for non-computer scientists. We all do different things. We can all benefit from others, though we don’t know how. Being didactic outside one’s community is crucial: avoiding jargon (take that, machine learning); prioritizing information (“Simple is better than complex”; students learning numerics don’t care about unicode); build documentation upon very simple examples; think stackoverflow
  12. 1 10 rules for community-driven development 1 Choose a project

    scope & vision; 2 Use GitHub, work online; 3 Do not own a project; 4 Seek quality; 5 Release early, release often; 6 Limit technicity; 7 Foster a good project culture; 8 Organize sprints; 9 Invest in recruitment; 10 Communicate
  13. 1 Faster computing: Ongoing speed up Improving Newton solver ⇒

    more robust to infrequent categories (...0, 0, 0, 0, 0, 1, 0, 0...)
  14. 1 Better statistics: quantile regression Conditional quantile model - For

    uncertainty quantification - With heterogeneous errors. In gradient-boosted trees
  15. 1 Teaching: the scikit-learn MOOC https://inria.github.io/scikit-learn-mooc From zero to hero:

    didactic, but thorough. Fully open, free, reusable, no tracking
  16. Scikit-learn: democratizing machine learning A solvable problem: fit / predict

    Focus on simplifying the user’s life: algorithmic choices, API choices, documentation efforts. Building a community: on-boarding, trusting
  17. 2 Nilearn: machine-learning for brain imaging Nilearn Vision Better machine

    learning, to help understand brain images
  18. 2 Nilearn: machine-learning for brain imaging In practice Getting the

    data: files = datasets.fetch_haxby() - Curating light but meaningful data - High-quality download (caching, resume)
  19. 2 Nilearn: machine-learning for brain imaging In practice Getting the

    data: files = datasets.fetch_haxby() - Massaging the data for machine learning: masker = NiftiMasker(mask_img='mask.nii', standardize=True); data = masker.fit_transform('fmri.nii') - Filenames to data matrix (memory-efficient I/O) - Common preprocessing steps included
  20. 2 Nilearn: machine-learning for brain imaging In practice Getting the

    data: files = datasets.fetch_haxby() - Massaging the data for machine learning: masker = NiftiMasker(mask_img='mask.nii', standardize=True); data = masker.fit_transform('fmri.nii') - Learning with scikit-learn: estimator.fit(data, labels) - That’s easy!
  21. 2 Nilearn: machine-learning for brain imaging In practice Getting the

    data: files = datasets.fetch_haxby() - Massaging the data for machine learning: masker = NiftiMasker(mask_img='mask.nii', standardize=True); data = masker.fit_transform('fmri.nii') - Learning with scikit-learn: estimator.fit(data, labels) - Output: plot_stat_map(masker.inverse_transform(estimator.weights_))
  23. 2 Easy use: Example-driven development The 3-liner as the new

    cool. Teaching others, teaching yourself. Write examples that solve end problems. Iterate on your API until these are simple
  24. 2 Easy use: Example-driven development The 3-liner as the new

    cool. Teaching others, teaching yourself. Write examples that solve end problems. Iterate on your API until these are simple. User flow on the nilearn website: examples
  25. 2 Easy use: Example-driven development The 3-liner as the new

    cool. Teaching others, teaching yourself. Write examples that solve end problems. Iterate on your API until these are simple. Sphinx-gallery: compiling scripts in an examples gallery
  26. 2 Easy use: Example-driven development The 3-liner as the new

    cool. Teaching others, teaching yourself. Write examples that solve end problems. Iterate on your API until these are simple. Sphinx-gallery: compiling scripts in an examples gallery: reStructuredText formatting, capturing outputs, links to function docs
  28. 2 Building great documentation Focus on explaining concepts (hint: write

    plain English). Less is more: prioritize, avoid redundancy. Code examples must be short (link to full tutorial examples). Links everywhere: users will land at the wrong place. Teach with the docs. Maintenance of docs: continuous integration, check links, run examples, doctests
  29. 2 Beyond software: data requires online platforms NeuroVault.org Sharing tens

    of thousands of brain images [Gorgolewski... 2015] https://neurovault.org/collections/2138/
  30. 2 Consolidation: data + software ⇒ science online neuroquery.org AI

    trained on scientific literature to generate topic-specific brain maps [Dockès... 2020]
  31. Open data science in brain imaging nilearn: simplifies advanced statistical

    processing. The hard part: a community of data - file standards (nifti) - opening data, and data platforms. A community ripe for openness. Publications, brainhack: enthusiasm
  32. 3 Electronic Health Records – source of real-life data Patient

    records (anything available, really): claims databases, accounting, measurement history, doctors’ notes. Great longitudinal coverage. Great population coverage: AP-HP (Paris hospitals), 39 hospitals, 8 million patients a year. External validity and practical usefulness
  33. 3 Electronic Health Records: dirty data challenges Missing values Uneven

    data on patients, across hospital sites. Data not measured because not applicable, or no time in the face of urgency... Much larger rate of missingness than in clinical studies (often 80%)
  34. 3 Electronic Health Records: dirty data challenges Missing values Uneven

    data on patients, across hospital sites. Data not measured because not applicable, or no time in the face of urgency... Much larger rate of missingness than in clinical studies (often 80%). Non-normalized information: manual input, different conventions: “Diabetes Type 2” | “Diabetes Mellitus, Type 2” | “DM2”
  35. 3 Missing values: A simple imputation example http://dirtydata.science/python/ Missing at

    random, imputed. Any imputation imperfection makes regression hard
  36. 3 Missing values: A simple imputation example http://dirtydata.science/python/ Missing not

    at random, imputed. Any imputation imperfection makes regression hard
  37. 3 Missing values: Any deterministic imputation is consistent Theorem For

    almost any deterministic imputation function, for any missing-values mechanism, a flexible learner is consistent (approaches the ideal predictor) [Le Morvan... 2021]. Intuition: imputation creates submanifolds, to which the learner adapts. Simple learners are not sufficient, even in linear, Gaussian settings [Morvan... 2020]
  38. 3 Missing values: What’s a good imputation? [Le Morvan... 2021]

    Surely, imputing data where they are most likely!? No. Oracle conditional imputation: $X_\text{imputed} = \mathbb{E}[X_\text{missing} \mid X_\text{obs}]$. Oracle fully-observed regressor: $f$ such that $y = f(X) + \text{noise}$. Chaining oracles may be biased: conditional variance turned into bias. Best: joint training of imputation & regression: differentiable imputation [Le Morvan... 2020]
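A one-line worked case shows how conditional variance turns into bias when the two oracles are chained. The quadratic regressor below is an assumption chosen for illustration, not an example from the slides: take $y = f(X) + \text{noise}$ with $f(X) = X_\text{missing}^2$ and zero-mean noise independent of $X$.

```latex
\begin{aligned}
\text{Ideal predictor:}\quad
  \mathbb{E}[y \mid X_\text{obs}]
  &= \mathbb{E}\big[X_\text{missing}^2 \mid X_\text{obs}\big]
   = \big(\mathbb{E}[X_\text{missing} \mid X_\text{obs}]\big)^2
   + \operatorname{Var}\big(X_\text{missing} \mid X_\text{obs}\big) \\
\text{Chained oracles:}\quad
  f\big(\mathbb{E}[X_\text{missing} \mid X_\text{obs}]\big)
  &= \big(\mathbb{E}[X_\text{missing} \mid X_\text{obs}]\big)^2
\end{aligned}
```

In this case the chained prediction falls short of the ideal one by exactly $\operatorname{Var}(X_\text{missing} \mid X_\text{obs})$, so the bias vanishes only when the missing value is fully determined by the observed ones.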
  39. 3 Missing values in practice Benchmarks on health data [Perez-Lebel...

    2022]: data most often missing not at random; imputation is hard and expensive; adding a missing indicator helps imputation. Handling missing values inside trees [Josse... 2019]: HistGradientBoostingClassifier [figure: decision tree in which each split, such as x10 < -1.5, routes missing values to one of its branches (Yes/Missing or No/Missing)]. Focus on learner, not imputation. Consider missing indicator
  40. 3 Non-normalized text: Substring information

    Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol. Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III
  41. 3 Non-normalized text: GaP Encoder for latent categories Topic model

    on sub-strings (GaP: Gamma-Poisson factorization). Model strings as a linear combination of substrings [figure: a binary matrix of 3-gram occurrences (er_, cer, fic, off, _of, ce_, ice, lic, pol) for the entries police officer, pol off, polis, policeman, policier, factorized into a documents × topics matrix and a topics × 3-grams matrix]. The factors show what substrings are in a latent category, and what latent categories are in an entry [Cerda and Varoquaux 2020]
  42. 3 Non-normalized text: GaP Encoder for latent categories Encodings that

    extract latent categories [figure: activations of latent categories, labelled by truncated substrings (brary, rator, alist, house, nager, unity, escue, ficer), for entries such as Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant] [Cerda and Varoquaux 2020]
  43. 3 Dirty categories in practice DirtyCat: Dirty category software http://dirty-cat.github.io

    from dirty_cat import GapEncoder; gap_encoder = GapEncoder(); transformed_values = gap_encoder.fit_transform(df) [Cerda and Varoquaux 2020] - SuperVectorizer: dataframe to numerical matrix - fuzzy_join: joining tables despite typos. Dirty data in practice: gradient-boosted trees > deep learning on tabular data, sklearn.ensemble.HistGradientBoostingRegressor [Grinsztajn... 2022]
  44. Electronic health records Dirty data: solving preprocessing for ML -

    missing values: more learning, rather than more imputation - non-normalized text entries: string-level models - non-random data degradation in outcomes or treatment. Hard to build a community - Data privacy prevents good open examples - Hospitals have a vertical culture - Huge legacy infrastructure, much money
  45. @GaelVaroquaux Open science for health data, Applying the scikit-learn recipe

    The scikit-learn recipe: Simple usage – requires identifying sub-problems; Quality – do less, do better; Community – to go far, go together
  46. @GaelVaroquaux Open science for health data, Applying the scikit-learn recipe

    The scikit-learn recipe. Lessons from brain imaging: data is central, requires more infrastructure; build a full pipeline from data to domain output
  47. @GaelVaroquaux Open science for health data, Applying the scikit-learn recipe

    The scikit-learn recipe. Lessons from brain imaging. Electronic Health Records: privacy concerns freeze data and collaboration; data quality: dirty data is solvable, biases are hard. Soda research team – social data https://team.inria.fr/soda
  48. 4 References I P. Cerda and G. Varoquaux. Encoding high-cardinality

    string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020. P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018. J. Dockès, R. A. Poldrack, R. Primet, H. Gözükan, T. Yarkoni, F. Suchanek, B. Thirion, and G. Varoquaux. NeuroQuery, comprehensive meta-analysis of human brain mapping. eLife, 9:e53385, 2020. K. J. Gorgolewski, G. Varoquaux, G. Rivera, Y. Schwarz, S. S. Ghosh, C. Maumet, V. V. Sochat, T. E. Nichols, R. A. Poldrack, J.-B. Poline, ... NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in Neuroinformatics, 9:8, 2015.
  49. 4 References II L. Grinsztajn, E. Oyallon, and G. Varoquaux.

    Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019. M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020. M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? Neural Information Processing Systems, 34, 2021.
  50. 4 References III M. L. Morvan, N. Prost, J. Josse,

    E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020. A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, 2022.