AI Assisted Discovery: from UX to Eureka!

Democratization of AI/ML in astronomy has been fostered by increased awareness, powerful software tools, and improving education. Yet as diverse AI/ML methods are infused into workflows and inference chains, it is legitimate to ask how AI/ML has fundamentally and uniquely contributed to novel science. I address this question by treating AI as an assistive tool in three contexts: 1) to leapfrog people-centric bottlenecks, 2) as a model-based computational accelerant, and 3) as a hypothesis generation engine. One recent effort of ours surfaces insights from large language models (LLMs) with a focus on user experience (UX). Another demonstrates an unexpected fundamental breakthrough in our understanding of the theory of microlensing via simulation-based inference.

Plenary talk given at "Cosmic Connections: A ML X Astrophysics Symposium at Simons Foundation" May 23, 2023 (NYC)

Joshua Bloom

May 24, 2023


  1. AI Assisted Discovery: from UX to Eureka!

    Josh Bloom, UC Berkeley (Astronomy), @profjsb
    Data Driven Discovery Investigator Faculty Award
    ML X Astrophysics Symposium, NYC, May 23, 2023
    Agenda
    ๏ Decision support in time-domain astronomy
      ‣ Images: real-bogus
      ‣ Light curves: probabilistic catalogs
    ๏ Fast inference surrogate as an intuition guide
    ๏ LLMs in production for user experience
  2. Don’t do ML unless you have to

    Overcome Resource Constraints
    Computation
    • Accelerate physics-based simulation
    • Simulation-based inference
    Hardware
    • Data transport bottlenecks
    • Survey & instrument design
    • Optimize observing plans
    People
    • Scaling decision support
    • Automated hypothesis generation
    • Guided exploration & discovery
  3. Too Many Transients Tax (Follow-Up) Resources

    Survey                                   Years       Image data rate   Transient alerts per night
    Palomar Transient Factory (PTF)          2009-2016   1 GB/90 s         4×10^4
    Zwicky Transient Facility (ZTF)          2017-2024   3 GB/45 s         3×10^5
    Large Synoptic Survey Telescope (LSST)   2024-2034   6 GB/5 s          2×10^6

    “Cheap” discovery: the surveys above
    “Expensive” followup: Hubble Space Telescope (HST), James Webb Space Telescope (JWST), Thirty-Meter Telescope (TMT)
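The growth across survey generations is easier to see as per-second rates. The arithmetic below is a small sketch that uses only the figures quoted on this slide; the dictionary layout and function name are illustrative, not part of the talk.

```python
# Figures as quoted on the slide:
# (gigabytes per exposure, seconds per exposure, transient alerts per night)
surveys = {
    "PTF": (1.0, 90.0, 4e4),
    "ZTF": (3.0, 45.0, 3e5),
    "LSST": (6.0, 5.0, 2e6),
}


def data_rate_gb_per_s(gigabytes, seconds):
    """Average image data rate in GB per second."""
    return gigabytes / seconds


rates = {name: data_rate_gb_per_s(gb, sec) for name, (gb, sec, _) in surveys.items()}

# Growth from the first to the last survey generation.
rate_growth = rates["LSST"] / rates["PTF"]
alert_growth = surveys["LSST"][2] / surveys["PTF"][2]

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} GB/s, {surveys[name][2]:.0e} alerts/night")
print(f"data rate grows ~{rate_growth:.0f}x, alerts grow ~{alert_growth:.0f}x")
```

With these numbers the image data rate grows by roughly two orders of magnitude from PTF to LSST, while the nightly alert count grows by a factor of 50.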
  4. Human Decision Support: ML Discovery Engine in Production

    Excerpt from H. Brink et al.:
    Figure 1. Examples of bogus (top) and real (bottom) thumbnails. Note that the shapes of the bogus sources can be quite varied, …
    … values lie between -1 and 1. As the pixel values for real candidates can take on a wide range of values depending on the astrophysical source and observing conditions, this normalization ensures that our features are not overly sensitive to the peak brightness of the residual nor the residual level of background flux, and instead capture the sizes and shapes of the subtraction residual. Starting with the raw subtraction thumbnail, I, normalization is achieved by first subtracting the median pixel value from the subtraction thumbnail and then dividing by the maximum absolute value across all median-subtracted pixels via
        I_N(x, y) = \frac{I(x, y) - \mathrm{med}[I(x, y)]}{\max\{\mathrm{abs}[I(x, y)]\}}    (1)
    Analysis of the features derived from these normalized real and bogus subtraction images showed that the transformation in (1) is superior to other alternatives, such as the Frobenius norm (\sqrt{\mathrm{trace}(I^T I)}) and truncation schemes where extreme pixel values are removed.
    Using Figure 1 as a guide, our first intuition about real candidates is that their subtractions are typically azimuthally symmetric in nature, and well-represented by a 2-dimensional Gaussian function, whereas bogus candidates are not well behaved. To this end, we define a spherical 2D Gaussian, G(x, y), over pixels x, y as
        G(x, y) = A \cdot \exp\left\{ -\frac{1}{2} \left[ (c_x - x)^2 + (c_y - y)^2 \right] \right\}    (2)
    which we fit to the normalized PTF subtraction image, I_N, of each candidate by minimizing the sum-of-squared differ-

    Image “subtractions”: “bogus” and “real” examples
    A real-time framework to discover variable/transient sources without people
    • fast (compared to people)
    • parallelizable
    • transparent
    • deterministic
    • versionable
    1000 to 1 needle in the haystack problem
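The normalization and Gaussian fit described in the Brink et al. excerpt can be sketched in a few lines. This is an illustration under stated assumptions, not the PTF pipeline: the denominator follows the prose (maximum absolute value of the median-subtracted pixels), the Gaussian has unit width as in the equation as transcribed on the slide, and the brute-force search over integer centers with a closed-form amplitude is a stand-in for whatever optimizer the authors used. All function names are hypothetical.

```python
import math
from statistics import median


def normalize_thumbnail(image):
    """Subtract the median pixel, then divide by the largest absolute residual."""
    med = median(p for row in image for p in row)
    shifted = [[p - med for p in row] for row in image]
    peak = max(abs(p) for row in shifted for p in row)
    if peak == 0:
        return shifted  # perfectly flat image: nothing to scale
    return [[p / peak for p in row] for row in shifted]


def gaussian(x, y, amp, cx, cy):
    """Unit-width spherical 2D Gaussian (assumed form, see lead-in)."""
    return amp * math.exp(-0.5 * ((cx - x) ** 2 + (cy - y) ** 2))


def sum_squared_diff(image, amp, cx, cy):
    """Sum of squared differences between the image and the Gaussian model."""
    return sum(
        (value - gaussian(x, y, amp, cx, cy)) ** 2
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def fit_gaussian_grid(image):
    """Try every integer center; the best amplitude for a fixed center is
    the linear least-squares solution sum(I * g) / sum(g * g)."""
    best = None
    for cy in range(len(image)):
        for cx in range(len(image[0])):
            num = den = 0.0
            for y, row in enumerate(image):
                for x, value in enumerate(row):
                    g = gaussian(x, y, 1.0, cx, cy)
                    num += value * g
                    den += g * g
            amp = num / den
            sse = sum_squared_diff(image, amp, cx, cy)
            if best is None or sse < best[0]:
                best = (sse, amp, cx, cy)
    return best


# Synthetic 5x5 thumbnail: a point-like source on a constant background.
raw = [[10.0 + 5.0 * gaussian(x, y, 1.0, 2, 2) for x in range(5)] for y in range(5)]
normalized = normalize_thumbnail(raw)
sse, amp, cx, cy = fit_gaussian_grid(normalized)
print(f"best center=({cx}, {cy}), amplitude={amp:.3f}, residual={sse:.4f}")
```

The residual of such a fit is the kind of quantity that could serve as a feature for a real-bogus classifier: a symmetric, point-like subtraction fits well, while an irregular artifact does not.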
  5. Supernova Discovery in the Pinwheel Galaxy (M101)

    11 hr after explosion
    Nearest SN Ia in >3 decades
    ML-assisted “real-bogus” discovery
    Image ©Peter Nugent
    Nugent, …, JSB+12, Nature, 1110.6201
  6. Variable Star Science

    50k variables, 26 classes, 810 with known labels (timeseries, colors)
    Richards+11, 12
    Also, Armstrong+16 (10k K2 stars)
  7. Self-Supervised (Autoencoder) Recurrent NN

    1. AE learns to reproduce irregularly sampled light curves using an information bottleneck (B)
       [Diagram: encoder E → bottleneck B → decoder D, with the output ≈ the input light curve]
    2. Use B as features and learn a traditional classifier (e.g., random forest)
    • Self-supervised feature learning → leverage large corpus of unlabelled light curves
    SOTA permutation invariant version: Zhang & Bloom (ICLR20, arxiv:2011.01243)
    F. Pérez, S. van der Walt
  8. Probabilistic Classification of Variable Stars

    • 106 “DEB” candidates, 12 new mass-radii (Shivvers, JSB, Richards, MNRAS, 2014)
    • 15 “RCB/DYP” candidates, 8 new discoveries, triple # of Galactic DYPer stars (Miller, Richards, JSB, .., ApJ 2012)
    • Local distance ladder: spectroscopic metallicity measurements for RRL, Cepheids, Mira…
    → Inform the use of precious followup resources
  9. Microlensing for Exoplanet Discovery & Characterization

    Goal: measure masses, separations, orbits.
    Grid search + MCMC is slow (millions of forward model computations) & requires experts in the loop
    Animation: B. S. Gaudi
  10. Microlensing for Exoplanet Discovery & Characterization

    Goal: measure masses, separations, orbits.
    Grid search + MCMC is slow (millions of forward model computations) & requires experts in the loop
    Expecting thousands of events with Roman. Calls for automated & more efficient inference approaches
    Animation: B. S. Gaudi
    Figure from Zhu & Dong 2021
    >10x yield
    Keming Zhang, Caltech/IPAC, Feb 9th 2022
  11. Fast Inference with Neural Density Estimator

    θ ∈ ℝ^8
    → Amortized inference, 10^5 × faster
    Zhang, JSB, … NeurIPS ML4PS (2010.04156)
    Zhang et al., AJ 161 262 (2021)
  12. Automating Inference of Binary Microlensing Events with Neural Density Estimation

    Abstract: Automated inference of binary microlensing events with traditional sampling-based algorithms such as MCMC has been hampered by the slowness of the physical forward model and the pathological parameter space. Current analysis of such events requires both expert knowledge and large-scale grid searches to locate the approximate solution as a prerequisite to MCMC posterior sampling. As the next generation, space-based [1] microlensing survey with the Roman Space Observatory [2] is expected to yield thousands of binary microlensing events [3], a new fast, accurate, scalable, and automated approach is desired. In this paper, we present an automated inference method based on neural density estimation (NDE). We show that the trained NDE not only produces fast, accurate, and precise posteriors but also captures expected posterior degeneracies. A hybrid NDE-MCMC framework can further be applied to produce the exact posterior.
    Zhang, JSB, … NeurIPS ML4PS (2010.04156)
    Zhang et al., AJ 161 262 (2021)
    Recovery of Known Caustic Degeneracies
  13. Discovery of Magnification Degeneracies

    Excerpt (Nature Astronomy, Letters):
    Because of this unifying feature, we expected the offset degeneracy to be ubiquitous in past events with twofold degenerate solutions and speculate that a large number of cases may have been mistakenly attributed to the close–wide degeneracy. Therefore, we systematically searched for previously published events with twofold degenerate solutions satisfying q_A ≃ q_B ≪ 1 (see Methods). We found 23 such events, and then first compared the intercept of the source trajectory on the star–planet axis to the location of the null predicted with equation (1). We also invert equation (1) to predict one degenerate s_A from the other s_B:
        s_A = \frac{1}{2} \left( 2x_0 - (s_B - 1/s_B) + \sqrt{[2x_0 - (s_B - 1/s_B)]^2 + 4} \right)    (2)
    [Figure: Δx_null (%) versus x_null for log10(s_B) = –0.40, –0.25 and –0.10; close–wide and inner–outer cases marked; lens A wide, resonant, or close]

    Reanalysis of 23 previous 2-mode solutions shows one source location predicts the other
    Continuous set of “offset” degenerate light curves with inner-outer/close-wide as limiting cases
    “suggests the existence of a deeper symmetry in the equations governing two-body lenses than previously recognized.”
    Zhang, Gaudi, Bloom, Nat. Ast. 2022
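Equation (2) of the excerpt is a closed-form inversion and is easy to check numerically. Equation (1) itself is not reproduced on the slide, so the `x_null` function below assumes the form that equation (2) inverts, namely x_null = ½[(s_A − 1/s_A) + (s_B − 1/s_B)]; treat that form and both function names as assumptions made for this sketch.

```python
import math


def x_null(s_a, s_b):
    """Assumed form of equation (1): midpoint of (s - 1/s) for the two solutions."""
    return 0.5 * ((s_a - 1.0 / s_a) + (s_b - 1.0 / s_b))


def predict_s_a(s_b, x0):
    """Equation (2): predict one degenerate separation from the other.

    x0 is the intercept of the source trajectory on the star-planet axis.
    """
    u = 2.0 * x0 - (s_b - 1.0 / s_b)
    return 0.5 * (u + math.sqrt(u * u + 4.0))


s_b = 10 ** (-0.25)

# Limiting case: a trajectory through the origin (x0 = 0) gives s_A = 1/s_B,
# which is the classic close-wide pairing.
close_wide = predict_s_a(s_b, 0.0)

# Round trip: the predicted s_A reproduces the intercept it was derived from.
s_a = predict_s_a(s_b, 0.3)
recovered_x0 = x_null(s_a, s_b)

print(f"x0 = 0.0 -> s_A = {close_wide:.4f} (1/s_B = {1.0 / s_b:.4f})")
print(f"x0 = 0.3 -> s_A = {s_a:.4f}, recovered x0 = {recovered_x0:.4f}")
```

The x0 = 0 case illustrates the slide's point that close-wide is a limiting case of the continuous offset degeneracy: any other intercept gives a pair of separations that are not reciprocals of each other.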
  14. Advancing astronomy by guiding human intuition with AI…

    “…while AI is unlikely to replace scientists in the foreseeable future, [this work] demonstrates that it can be harnessed to help us understand deeper mathematical patterns in the underlying theory.”
    Mroz, Nat. Ast. News and Views (2022)
    See also Davies et al. Nature, 2021
    https://joshbloom.org/post/just_the_beginning/
  15. SkyPortal: Collaborative Platform for Time-Domain Astronomy

    ๏ Single-source-of-truth marshal for MMA, transient, variable, and Solar system use. Facilitates follow-up observation management: robotic and classical facilities
    ๏ ZTF-II: 300+ users, 100-1000 events per night; 2.8M sources total, 170k comments, 3.3M annotations
    ๏ Teeming with ML: rb scores, classifications, ML followup triggers
    ๏ How can we reduce the cognitive load in trying to remember (& act upon) so many sources/data?
  16. LLM Summarization

    Package comments, redshift, classifications, annotations etc. with the query:
    “In one succinct (less than 250 words) paragraph written in the 3rd person summarize the following comments about this astronomical source. If classifications and/or the redshift are given, then note those in the summary. This summary should be most useful to expert astrophysicists who already know the definitions and meaning of all classifications.”
    Ship off to OpenAI and save the embedding in Pinecone (vector DB)
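The packaging step can be pictured as plain string assembly. The sketch below reuses the query text from the slide verbatim, but the function name, argument names, and the layout of the appended fields are hypothetical, and it deliberately stops before any network call: how SkyPortal actually sends the request to OpenAI or writes to Pinecone is not shown in the talk.

```python
SUMMARY_QUERY = (
    "In one succinct (less than 250 words) paragraph written in the 3rd person "
    "summarize the following comments about this astronomical source. "
    "If classifications and/or the redshift are given, then note those in the summary. "
    "This summary should be most useful to expert astrophysicists who already know "
    "the definitions and meaning of all classifications."
)


def build_summary_request(comments, classifications=(), redshift=None):
    """Assemble the text sent to the LLM (field layout is an assumption)."""
    parts = [SUMMARY_QUERY, "", "Comments:"]
    parts.extend(f"- {comment}" for comment in comments)
    if classifications:
        parts.append("Classifications: " + ", ".join(classifications))
    if redshift is not None:
        parts.append(f"Redshift: {redshift}")
    return "\n".join(parts)


# Made-up example source, for illustration only.
request_text = build_summary_request(
    comments=["Rising quickly in g band", "Spectrum requested"],
    classifications=["SN Ia"],
    redshift=0.02,
)
print(request_text)
```

Keeping the optional fields out of the text when they are missing avoids prompting the model to comment on a classification or redshift that does not exist.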
  17. Embeddings-based Similar Sources

    Sources with LLM summaries are compared against others. The sources with the largest cosine similarity are shown.
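A minimal sketch of the comparison, assuming that "similar" means the highest cosine similarity between summary embeddings (equivalently, the smallest cosine distance). In production the ranking would be delegated to the vector database; the in-memory dictionary, the three-dimensional toy vectors, and the function names here are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero embedding is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=3):
    """Rank every other source by similarity to the query source."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real summary embeddings have far more dimensions.
toy_embeddings = {
    "src_a": [1.0, 0.0, 0.0],
    "src_b": [0.9, 0.1, 0.0],
    "src_c": [0.0, 1.0, 0.0],
    "src_d": [-1.0, 0.0, 0.0],
}
ranked = most_similar("src_a", toy_embeddings, top_k=2)
print(ranked)
```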
  18. UI/UX Considerations

    ๏ Prediction speed matters - ideally on <100ms latency but otherwise
    ๏ Don’t show summarization button unless there is enough raw data
    ๏ Highlight/colorize query results based on quality
    ๏ Do not show results on similar sources below user-specified threshold
    ๏ Design for fault tolerance (e.g., vector DB is offline)
    ๏ Capture usage for post-mortems, improve user experience

    XAI for Decision Support
    ๏ Explainability (with verifiability) will become more important in astronomy HCI
    See Fok & Weld (2023), Tan 2022
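Three of these considerations (hide the button without enough data, drop results below a user threshold, survive an offline vector DB) translate directly into small guard functions. The sketch below is one hypothetical way to express them; the function names, the default thresholds, and the `query_fn` callable standing in for a vector DB client are all assumptions, not SkyPortal code.

```python
def should_show_summary_button(comments, min_comments=3):
    """Only offer summarization when there is enough raw data to summarize."""
    return len(comments) >= min_comments


def similar_sources_for_display(query_fn, source_id, min_score=0.8):
    """Return (source_id, score) pairs at or above the user-specified threshold.

    query_fn is a stand-in for a vector DB lookup. Any failure is treated as
    "no results" so the rest of the page still renders.
    """
    try:
        hits = query_fn(source_id)
    except Exception:
        # Broad on purpose in this sketch: an offline or misbehaving
        # vector DB should degrade the feature, not break the UI.
        return []
    return [(other_id, score) for other_id, score in hits if score >= min_score]


def offline_db(source_id):
    """Simulates the vector DB being unreachable."""
    raise ConnectionError("vector DB is offline")


def healthy_db(source_id):
    """Simulates a working lookup with one strong and one weak match."""
    return [("src_b", 0.93), ("src_c", 0.41)]


print(similar_sources_for_display(healthy_db, "src_a"))  # weak match filtered out
print(similar_sources_for_display(offline_db, "src_a"))  # empty list, no exception
print(should_show_summary_button(["only one comment"]))
```

A production version would presumably also log the failure, in line with the slide's point about capturing usage for post-mortems.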
  19. ML in Production is Hard

    The only real test of a model is whether it is falsifiable on data that does not yet exist
    Since all models are fallible & people are always on the receiving end, we need to invest in how models are hot-swapped and how predictions are consumed & acted upon
  20. AI Assisted Discovery: from UX to Eureka!

    Josh Bloom, UC Berkeley (Astronomy), @profjsb
    ML X Astrophysics Symposium, NYC, May 23, 2023
    J. Richards (Stats/Astro), T. Broderick (Stats), J. Long (Stats), Jorge Martínez-Palomera, Ellie Abrahams, Sara Jamal, Keming Zhang, Sydney Jenkins, Ben Nachman (LBL), Jackie Blaum, Natalie LeBaron, Stefan van der Walt, Fernando Pérez