
Open science for health data, applying the scikit-learn recipe?

Gael Varoquaux
November 22, 2022

Developing data-analysis pipelines around open source, open data, and open communities has shown great successes. The most used machine-learning tool to date, scikit-learn, was assembled by a large community, with different contributors bringing different expertise. Can such success be carried over to health data? I will discuss my experience building first scikit-learn, then nilearn in the brain-imaging community, and finally more recent work on electronic health records. Spoiler: things are harder with electronic health records.

Transcript

  1. Open science for health data,
    Applying the scikit-learn recipe?
    Gaël Varoquaux,

  2. Open science for health data,
    Applying the scikit-learn recipe?
    Gaël Varoquaux,
    1 Scikit-learn: democratizing machine learning
    2 Open science in brain imaging
    3 Better decisions from electronic health records

  3. Challenges to data science
    www.kaggle.com/ash316/novice-to-grandmaster
    G Varoquaux 2

  4. Challenges to data science
    www.kaggle.com/ash316/novice-to-grandmaster
    Integration of non curated data
    Enabling the human data scientist
    Bringing in other stakeholders
    Facilitating access to data
    Can tools help?

  5. 1 Scikit-learn: democratizing machine learning
    Machine learning for all

  6. scikit-learn
    From scientists

  7. scikit-learn
    From scientists to an industry standard
    [Figure: number of monthly users, 2010 to 2022; vertical axis from 200k to 1M]

  8. 1 Embracing the Python stack
    Python
    Interactive, easy
    General-purpose

  9. 1 Embracing the Python stack
    Python
    Interactive, easy
    General-purpose
    An ecosystem
    numpy: numerical operations, a memory model (float*)
    [Figure: grid of digits illustrating an array of numbers]

  10. 1 Embracing the Python stack
    Python
    Interactive, easy
    General-purpose
    An ecosystem
    numpy: numerical operations, a memory model (float*)
    [Figure: grid of digits illustrating an array of numbers]
    pandas
    pytorch

  11. 1 Focus on usability
    API design
    Grey box: all models interchangeable,
    but still inspectable
    Documentation & examples
    Good documentation required to add a feature
    Easy-to-understand examples guide API design
    Teach statistical learning, rather than code
    Models, solvers, hyperparameters
    Choices that do not require tinkering
    Lots of use-case-driven empirical testing
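
The "grey box" idea on this slide can be illustrated with a minimal sketch. The dataset is synthetic and the two models are arbitrary choices made for illustration; they are not taken from the talk. Both estimators are driven through the same `fit` / `score` calls, and their fitted parameters remain available as attributes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # Interchangeable: every estimator exposes the same fit / score methods
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Inspectable: fitted parameters are public attributes ending in "_"
coef_shape = models["logistic"].coef_.shape
importances = models["forest"].feature_importances_
print(scores, coef_shape, importances.round(2))
```

Swapping a model means changing one entry of `models`; the loop itself does not change.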

  12. 1 Community-driven development
    Our DNA: distributed development & decision making
    Gave the team
    [Figure: number of monthly contributors, 2010 to 2018; vertical axis from 0 to 50]
    and the right focus
    People fix & improve what’s
    important to them

  13. 1 Community-driven development
    Our DNA: distributed development & decision making
    Gave the team
    [Figure: number of monthly contributors, 2010 to 2018; vertical axis from 0 to 50]
    and the right focus
    People fix & improve what’s
    important to them
    Open source has won
    But it needs sustainability and investment

  14. 1 Community-driven development
    Our DNA: distributed development & decision making
    Gave the team
    [Figure: number of monthly contributors, 2010 to 2018; vertical axis from 0 to 50]
    and the right focus
    People fix & improve what’s
    important to them
    Open source has won
    But it needs sustainability and investment
    mid-2018: A foundation for scikit-learn
    + the community

  15. 1 Difference makes better software & science
    Scikit-learn = computer science for non computer scientists
    We all do different things
    We can all benefit from others though we don’t know how

  16. 1 Difference makes better software & science
    Scikit-learn = computer science for non computer scientists
    We all do different things
    We can all benefit from others though we don’t know how
    Being didactic outside one’s community is crucial
    Avoiding jargon (take that, machine learning)
    Prioritizing information
    “Simple is better than complex”
    Students learning numerics don’t care about unicode
    Build documentation upon very simple examples
    Think Stack Overflow

  17. 1 10 rules for community-driven development
    1 Choose a project scope & vision
    2 Use GitHub, work online
    3 Do not own a project
    4 Seek quality
    5 Release early, release often
    6 Limit technicality
    7 Foster a good project culture
    8 Organize sprints
    9 Invest in recruitment
    10 Communicate

  18. 1 Faster computing: Ongoing speed up
    Improving Newton solver
    ⇒ more robust to
    infrequent categories
    ...0, 0, 0, 0, 0, 1, 0, 0...

  19. 1 Better statistics: quantile regression
    Conditional quantile model
    - For uncertainty quantification
    - With heterogeneous errors
    In gradient-boosted trees

  20. 1 Teaching: the scikit-learn MOOC
    https://inria.github.io/scikit-learn-mooc
    From zero to hero: didactic, but thorough
    Fully-open, free, reusable, no tracking

  21. Scikit-learn: democratizing machine learning
    A solvable problem: fit / predict
    Focus on simplifying the user’s life
    Algorithmic choices, API choices, Documentation efforts
    Building a community
    On-boarding, trusting

  22. 2 Open science in brain imaging

  23. 2 Nilearn: machine-learning for brain imaging
    Nilearn Vision
    Better machine learning
    To help understand brain images

  24. 2 Nilearn: machine-learning for brain imaging
    In practice
    Getting the data
    files = datasets.fetch_haxby()
    Curating light but meaningful data
    High-quality download (Caching, resume)

  25. 2 Nilearn: machine-learning for brain imaging
    In practice
    Getting the data
    files = datasets.fetch_haxby()
    Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii',
                         standardize=True)
    data = masker.fit_transform('fmri.nii')
    Filenames to data matrix (memory-efficient I/O)
    Common preprocessing steps included

  26. 2 Nilearn: machine-learning for brain imaging
    In practice
    Getting the data
    files = datasets.fetch_haxby()
    Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii',
                         standardize=True)
    data = masker.fit_transform('fmri.nii')
    Learning with scikit-learn
    estimator.fit(data, labels)
    That’s easy!

  27. 2 Nilearn: machine-learning for brain imaging
    In practice
    Getting the data
    files = datasets.fetch_haxby()
    Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii',
                         standardize=True)
    data = masker.fit_transform('fmri.nii')
    Learning with scikit-learn
    estimator.fit(data, labels)
    Output
    plot_stat_map(masker.inverse_transform(
        estimator.weights_))

  28. 2 Nilearn: machine-learning for brain imaging
    In practice
    Getting the data
    files = datasets.fetch_haxby()
    Massaging the data for machine-learning
    masker = NiftiMasker(mask_img='mask.nii',
                         standardize=True)
    data = masker.fit_transform('fmri.nii')
    Learning with scikit-learn
    estimator.fit(data, labels)
    Output
    plot_stat_map(masker.inverse_transform(
        estimator.weights_))
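
What the masking step does conceptually can be sketched with NumPy alone. This is a stand-in on synthetic arrays, not nilearn's implementation: the volume size, the box-shaped mask, and the planted signal are all assumptions made for illustration, and `coef_` plays the role of the weights shown on the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for an fMRI series: 100 volumes of 6 x 6 x 6 voxels
n_volumes, shape = 100, (6, 6, 6)
volumes = rng.normal(size=(n_volumes, *shape))
labels = np.repeat([0, 1], n_volumes // 2)
# Plant a signal in one voxel so that there is something to decode
volumes[labels == 1, 3, 3, 3] += 5.0

# Masking: keep only voxels inside a (hypothetical) brain mask,
# turning each 3D volume into one row of a 2D data matrix
mask = np.zeros(shape, dtype=bool)
mask[1:5, 1:5, 1:5] = True
data = volumes[:, mask]              # shape (n_volumes, n_voxels_in_mask)
# Standardize each voxel's time series (zero mean, unit variance)
data = (data - data.mean(axis=0)) / data.std(axis=0)

estimator = LogisticRegression().fit(data, labels)

# Inverse transform: put the learned weights back into the 3D volume
weight_map = np.zeros(shape)
weight_map[mask] = estimator.coef_[0]
print(data.shape, weight_map.shape)
```

Once the images are a samples-by-voxels matrix, any scikit-learn estimator applies, and the inverse step turns its weights back into an image that can be plotted.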

  29. 2 Easy use: Example-driven development
    The 3-liner as the new cool
    Teaching others
    Teaching yourself
    Write examples that solve end problems
    Iterate on your API until these are simple

  30. 2 Easy use: Example-driven development
    The 3-liner as the new cool
    Teaching others
    Teaching yourself
    Write examples that solve end problems
    Iterate on your API until these are simple
    User flow on the nilearn website:
    Examples

  31. 2 Easy use: Example-driven development
    The 3-liner as the new cool
    Teaching others
    Teaching yourself
    Write examples that solve end problems
    Iterate on your API until these are simple
    Sphinx-gallery: compiling scripts in an examples gallery

  32. 2 Easy use: Example-driven development
    The 3-liner as the new cool
    Teaching others
    Teaching yourself
    Write examples that solve end problems
    Iterate on your API until these are simple
    Sphinx-gallery: compiling scripts in an examples gallery
    [Figure annotations: reStructuredText formatting; capturing outputs; links to function docs]

  33. 2 Easy use: Example-driven development
    The 3-liner as the new cool
    Teaching others
    Teaching yourself
    Write examples that solve end problems
    Iterate on your API until these are simple
    Sphinx-gallery: compiling scripts in an examples gallery

  34. 2 Building great documentation
    Focus on explaining concepts (hint: write plain English)
    Less is more: prioritize, avoid redundancy
    Code examples must be short (link to full tutorial examples)
    Links everywhere: users will land at the wrong place
    Teach with the docs
    Maintenance of docs:
    Continuous integration
    Check links
    Runs examples
    Doctests
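
As a minimal sketch of the doctest item above: the example in a docstring doubles as a test that continuous integration can run. The function `normalize_label` is hypothetical and exists only for this illustration; the doctest calls are from the Python standard library.

```python
import doctest


def normalize_label(label):
    """Lower-case a label and collapse repeated whitespace.

    >>> normalize_label("  Diabetes   Type 2 ")
    'diabetes type 2'
    >>> normalize_label("DM2")
    'dm2'
    """
    return " ".join(label.lower().split())


# Extract the examples from the docstring and run them, as a CI job would
parser = doctest.DocTestParser()
test = parser.get_doctest(
    normalize_label.__doc__,
    globs={"normalize_label": normalize_label},
    name="normalize_label",
    filename=None,
    lineno=0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.attempted, "examples run,", results.failed, "failed")
```

If the function's behavior drifts away from its documentation, the run reports a failure, so the docs cannot silently go stale.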

  35. 2 Beyond software: data requires online platforms
    NeuroVault.org
    Sharing tens of thousands of brain images
    [Gorgolewski... 2015]
    https://neurovault.org/collections/2138/

  36. 2 Consolidation: data + software ⇒ science online
    neuroquery.org
    AI trained on scientific literature to generate topic-specific brain maps
    [Dockès... 2020]

  37. Open data science in brain imaging
    nilearn: simplifies advanced statistical processing
    The hard part: a community of data
    - file standards (nifti)
    - opening data, and data platforms
    A community ripe for openness
    Publications, brainhack: enthusiasm

  38. 3 Better decisions from electronic health records

  39. 3 Electronic Health Records – source of real-life data
    Patient records (anything available, really)
    Claims databases, accounting, measurement history, doctors’ notes
    Great longitudinal coverage
    Great population coverage
    AP-HP (Paris hospitals)
    39 hospitals
    8 million patients a year
    External validity and practical usefulness

  40. 3 Electronic Health Records
    Data preparation is the bottleneck
    Data are not numerical matrices

  41. 3 Electronic Health Records: dirty data challenges
    Missing values
    Uneven data on patients, across hospital sites
    Data not measured because not applicable, no time in the face of urgency, ...
    Much larger rate of missingness than in clinical studies (often 80%)

  42. 3 Electronic Health Records: dirty data challenges
    Missing values
    Uneven data on patients, across hospital sites
    Data not measured because not applicable, no time in the face of urgency, ...
    Much larger rate of missingness than in clinical studies (often 80%)
    Non-normalized information
    Manual input, different conventions
    “Diabetes Type 2” | “Diabetes Mellitus, Type 2” | “DM2”

  43. 3 Missing values: A simple imputation example
    http://dirtydata.science/python/
    Fully-observed data

  44. 3 Missing values: A simple imputation example
    http://dirtydata.science/python/
    Missing at random

  45. 3 Missing values: A simple imputation example
    http://dirtydata.science/python/
    Missing at random imputed
    Any imputation imperfection makes regression hard

  46. 3 Missing values: A simple imputation example
    http://dirtydata.science/python/
    Missing not at random imputed
    Any imputation imperfection makes regression hard

  47. 3 Missing values: Any deterministic imputation is consistent
    Theorem: for almost any deterministic imputation function and
    for any missing-values mechanism, a flexible learner is consistent
    (approaches the ideal predictor) [Le Morvan... 2021]
    Intuition: imputation creates submanifolds, to which the learner adapts
    Simple learners are not sufficient, even in linear, Gaussian settings
    [Morvan... 2020]

  48. 3 Missing values: What’s a good imputation? [Le Morvan... 2021]
    Surely, imputing data where they are most likely!? Not
    Oracle conditional imputation: X_imputed = E[X_missing | X_obs]
    Oracle fully-observed regressor: f s.t. y = f(X) + noise
    Chaining oracles may be biased
    Conditional variance turned into bias
    Best: joint training imputation & regression: differentiable imputation
    [Le Morvan... 2020]
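
Why chaining the two oracles may be biased can be checked on a toy simulation. The setting below is an assumption made for illustration and is not taken from the cited papers: one variable is always missing, its conditional law given the observed one is Gaussian, and the regressor is quadratic and noise-free.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200_000, 1.0

x_obs = rng.normal(size=n)
# x_mis is never observed at prediction time; x_mis | x_obs ~ N(x_obs, sigma^2)
x_mis = x_obs + sigma * rng.normal(size=n)


def f(x):
    """Oracle regressor on fully-observed data (noise-free for simplicity)."""
    return x ** 2


y = f(x_mis)

# Oracle conditional imputation: E[x_mis | x_obs] = x_obs
x_imputed = x_obs
chained = f(x_imputed)             # oracle regressor applied to imputed data
bayes = x_obs ** 2 + sigma ** 2    # E[y | x_obs], the best possible predictor

bias_chained = np.mean(y - chained)  # approaches sigma^2
bias_bayes = np.mean(y - bayes)      # approaches 0
print(f"chained oracles: {bias_chained:.3f}, Bayes predictor: {bias_bayes:.3f}")
```

The imputation discards the conditional variance of the missing variable, and the nonlinear regressor turns that lost variance into a systematic offset of size sigma squared.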

  49. 3 Missing values in practice
    Benchmarks on health data [Perez-Lebel... 2022]
    Data most-often missing not at random
    Imputation is hard and expensive
    Adding a missing indicator helps imputation
    Handling missing values inside trees [Josse... 2019]
    - Trees: HistGradientBoostingClassifier
    [Figure: decision tree with splits "x10 < -1.5 ?", "x2 < 2 ?", "x7 < 0.3 ?" and "x1 < 0.5 ?"; each split sends missing values down one of its branches ("Yes/Missing" or "No/Missing"); one leaf reads "Predict +1.3"]
    Focus on learner not imputation
    Consider missing indicator

  50. 3 Non-normalized text
    Problem 2
    Non-normalized data

  51. 3 Non-normalized text: Substring information
    Drug Name
    alcohol
    ethyl alcohol
    isopropyl alcohol
    polyvinyl alcohol
    isopropyl alcohol swab
    62% ethyl alcohol
    alcohol 68%
    alcohol denat
    benzyl alcohol
    dehydrated alcohol
    Employee Position Title
    Police Aide
    Master Police Officer
    Mechanic Technician II
    Police Officer III
    Senior Architect
    Senior Engineer Technician
    Social Worker III
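
The substring information shared by such entries can be made concrete with character 3-grams. The sketch below is a plain cosine similarity on 3-gram counts, written for illustration; it is not the encoder presented in the talk, and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the character n-grams of a string, padded with one space."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Entries taken from the "Employee Position Title" column above
close = ngram_similarity("Police Officer III", "Master Police Officer")
far = ngram_similarity("Police Officer III", "Senior Architect")
print(f"police titles: {close:.2f}, unrelated titles: {far:.2f}")
```

Entries that share words or word fragments score high even though they are distinct categories, which is the signal a one-hot encoding throws away.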

  52. 3 Non-normalized text: GaP Encoder for latent categories
    Topic model on sub-strings (GaP: Gamma-Poisson factorization)
    Model strings as a linear combination of substrings
    [Figure: a binary matrix of entries ("police", "officer", "pol off", "polis", "policeman", "policier") by 3-grams ("pol", "lic", "ice", "ce_", "_of", "off", "fic", "cer", "er_") is factorized into two matrices: what latent categories (topics) are in an entry, and what substrings are in a latent category]
    [Cerda and Varoquaux 2020]

  53. 3 Non-normalized text: GaP Encoder for latent categories
    Encodings that extract latent categories
    [Figure: encodings of job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) along latent categories, whose labels are truncated in the source (brary, rator, alist, house, nager, unity, escue, ficer)]
    [Cerda and Varoquaux 2020]

  54. 3 Dirty categories in practice
    DirtyCat: Dirty category software
    http://dirty-cat.github.io
    from dirty_cat import GapEncoder
    gap_encoder = GapEncoder()
    transformed_values = gap_encoder.fit_transform(df)
    [Cerda and Varoquaux 2020]
    - SuperVectorizer: dataframe to numerical matrix
    - fuzzy_join: joining tables despite typos
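
The idea of joining tables despite typos can be sketched with the standard library. This is a stand-in based on `difflib`, not dirty_cat's `fuzzy_join`; the tables, the department values, the helper `fuzzy_lookup`, and the 0.8 cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical reference table, keyed by a clean diagnosis label
departments = {"Diabetes Type 2": "endocrinology",
               "Hypertension": "cardiology"}
# Hypothetical table with manually entered labels
visits = [("patient_1", "diabetes type 2"),
          ("patient_2", "Hypertention"),
          ("patient_3", "Asthma")]


def fuzzy_lookup(key, candidates, cutoff=0.8):
    """Return the candidate closest to `key`, or None if none is close."""
    lowered = {c.lower(): c for c in candidates}
    matches = get_close_matches(key.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


joined = []
for patient, label in visits:
    match = fuzzy_lookup(label, departments)
    # Left join: keep the row, with None when no close label exists
    joined.append((patient, label, departments.get(match)))

for row in joined:
    print(row)
```

The cutoff trades false matches against missed ones, so in practice it needs checking against a sample of the data.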
    Dirty data in practice
    Gradient-boosted trees > deep learning on tabular data
    sklearn.ensemble.HistGradientBoostingRegressor
    [Grinsztajn... 2022]

  55. Electronic health records
    Dirty data: solving preprocessing for ML
    - missing values: more learning, rather than more imputation
    - non-normalized text entries: string-level models
    - non-random data degradation in outcomes or treatment
    Hard to build a community
    - Data privacy prevents good open examples
    - Hospitals have a vertical culture
    - Huge legacy infrastructure, much money

  56. @GaelVaroquaux
    Open science for health data,
    Applying the scikit-learn recipe
    The scikit-learn recipe
    Simple usage – requires identifying sub-problems
    Quality – do less, do better
    Community – to go far, go together

  57. @GaelVaroquaux
    Open science for health data,
    Applying the scikit-learn recipe
    The scikit-learn recipe
    Lessons from brain imaging
    Data is central, requires more infrastructure
    Build a full pipeline from data to domain output

  58. @GaelVaroquaux
    Open science for health data,
    Applying the scikit-learn recipe
    The scikit-learn recipe
    Lessons from brain imaging
    Electronic Health Records
    Privacy concerns freeze data and collaboration
    Data quality: dirty data is solvable, biases are hard
    Soda research team – social data
    https://team.inria.fr/soda

  59. 4 References I
    P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical
    variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
    P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty
    categorical variables. Machine Learning, pages 1–18, 2018.
    J. Dockès, R. A. Poldrack, R. Primet, H. Gözükan, T. Yarkoni, F. Suchanek,
    B. Thirion, and G. Varoquaux. Neuroquery, comprehensive meta-analysis of
    human brain mapping. Elife, 9:e53385, 2020.
    K. J. Gorgolewski, G. Varoquaux, G. Rivera, Y. Schwarz, S. S. Ghosh,
    C. Maumet, V. V. Sochat, T. E. Nichols, R. A. Poldrack, J.-B. Poline, ...
    Neurovault.org: a web-based repository for collecting and sharing
    unthresholded statistical maps of the human brain. Frontiers in
    neuroinformatics, 9:8, 2015.

  60. 4 References II
    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still
    outperform deep learning on typical tabular data? In Thirty-sixth Conference
    on Neural Information Processing Systems Datasets and Benchmarks Track,
    2022.
    J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of
    supervised learning with missing values. arXiv preprint arXiv:1902.06931,
    2019.
    M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss
    networks: differentiable programming for supervised learning with missing
    values. In Advances in Neural Information Processing Systems 33, 2020.
    M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good
    imputation to predict with missing values? Neural Information Processing
    Systems, 34, 2021.

  61. 4 References III
    M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear
    predictor on linearly-generated data with missing values: non consistency
    and solutions. AISTATS, 2020.
    A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline.
    Benchmarking missing-values approaches for predictive models on health
    databases. GigaScience, 11, 2022.
