AI Assisted Discovery: from UX to Eureka!

Josh Bloom
UC Berkeley (Astronomy)
@profjsb
AI Assisted Discovery: from UX to Eureka!
Data Driven Discovery Investigator Faculty Award
ML X Astrophysics Symposium, NYC, May 23, 2023
Agenda
๏ Decision support in time-domain astronomy
  ‣ Images: real-bogus
  ‣ Light curves: probabilistic catalogs
๏ Fast inference surrogate as an intuition guide
๏ LLMs in Production for User experience

Don’t do ML unless you have to
Overcome Resource Constraints
Computation
• Accelerate physics-based simulation
• Simulation-based inference
Hardware
• Data transport bottlenecks
• Survey & instrument design
• Optimize observing plans
People
• Scaling decision support
• Automated Hypothesis generation
• Guided exploration & discovery

Too Many Transients Tax (Follow-Up) Resources
Survey                                   Years      Image data rate   Transient alerts per night
Palomar Transient Factory (PTF)          2009-2016  1 GB/90 s         4×10^4
Zwicky Transient Facility (ZTF)          2017-2024  3 GB/45 s         3×10^5
Large Synoptic Survey Telescope (LSST)   2024-2034  6 GB/5 s          2×10^6
Follow-up facilities: Hubble Space Telescope (HST), James Webb Space Telescope (JWST), Thirty-Meter Telescope (TMT)
“cheap” discovery vs. “expensive” followup
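The growth across the three surveys is easier to see as ratios. The following is a small arithmetic sketch using only the figures quoted on the slide; the 1 GB = 1000 MB convention and the function names are assumptions made for illustration.

```python
# Figures as quoted on the slide:
# (GB per exposure, seconds per exposure, transient alerts per night).
SURVEYS = {
    "PTF": (1.0, 90.0, 4e4),
    "ZTF": (3.0, 45.0, 3e5),
    "LSST": (6.0, 5.0, 2e6),
}


def data_rate_mb_per_s(gigabytes, seconds):
    """Sustained image data rate in MB/s, assuming 1 GB = 1000 MB."""
    return 1000.0 * gigabytes / seconds


def summarize(surveys, baseline="PTF"):
    """Return the data rate and the growth factors relative to the baseline survey."""
    base_gb, base_sec, base_alerts = surveys[baseline]
    base_rate = data_rate_mb_per_s(base_gb, base_sec)
    rows = {}
    for name, (gb, sec, alerts) in surveys.items():
        rate = data_rate_mb_per_s(gb, sec)
        rows[name] = {
            "mb_per_s": rate,
            "rate_vs_baseline": rate / base_rate,
            "alerts_vs_baseline": alerts / base_alerts,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(SURVEYS).items():
        print(name, row)
```

With these inputs the LSST image data rate is roughly 100 times that of PTF and the nightly alert count is 50 times larger, while the pool of expensive follow-up facilities does not grow in step.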

Josh Bloom (GE Digital) @profjsb Harvard College Observatory c. 1890

Human Decision Support: ML Discovery Engine in Production
H. Brink et al.
Figure 1. Examples of bogus (top) and real (bottom) thumbnails. Note that the shapes of the bogus sources can be quite varied,
... values lie between -1 and 1. As the pixel values for real candidates can take on a wide range of values depending on the astrophysical source and observing conditions, this normalization ensures that our features are not overly sensitive to the peak brightness of the residual nor the residual level of background flux, and instead capture the sizes and shapes of the subtraction residual. Starting with the raw subtraction thumbnail, I, normalization is achieved by first subtracting the median pixel value from the subtraction thumbnail and then dividing by the maximum absolute value across all median-subtracted pixels via
    I_N(x, y) = \left\{ \frac{I(x, y) - \mathrm{med}[I(x, y)]}{\max\{\mathrm{abs}[I(x, y)]\}} \right\}.    (1)
Analysis of the features derived from these normalized real and bogus subtraction images showed that the transformation in (1) is superior to other alternatives, such as the Frobenius norm (\sqrt{\mathrm{trace}(I^T I)}) and truncation schemes where extreme pixel values are removed. Using Figure 1 as a guide, our first intuition about real candidates is that their subtractions are typically azimuthally symmetric in nature, and well-represented by a 2-dimensional Gaussian function, whereas bogus candidates are not well behaved. To this end, we define a spherical 2D Gaussian, G(x, y), over pixels x, y as
    G(x, y) = A \cdot \exp\left\{ -\frac{1}{2} \left[ (c_x - x)^2 + (c_y - y)^2 \right] \right\},    (2)
which we fit to the normalized PTF subtraction image, I_N, of each candidate by minimizing the sum-of-squared differ-
“bogus” / “real” image “subtractions”
a real-time framework to discover variable/transient sources without people
• fast (compared to people)
• parallelizable
• transparent
• deterministic
• versionable
1000 to 1 needle in the haystack problem
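The normalization and Gaussian model in the excerpt can be sketched in a few lines. This is a minimal illustration, not the Brink et al. pipeline: it follows the prose (the maximum is taken over the median-subtracted pixels), uses the Gaussian exactly as it appears in the excerpt, and only evaluates the sum-of-squares objective rather than minimizing it. The function names and the `image[y][x]` indexing are assumptions.

```python
import math
from statistics import median


def normalize_thumbnail(image):
    """Eq. (1): subtract the median pixel, then divide by the largest
    absolute median-subtracted value, so results lie in [-1, 1]."""
    flat = [pixel for row in image for pixel in row]
    med = median(flat)
    shifted = [[pixel - med for pixel in row] for row in image]
    peak = max(abs(pixel) for row in shifted for pixel in row)
    if peak == 0:
        return shifted  # constant image: nothing to scale
    return [[pixel / peak for pixel in row] for row in shifted]


def gaussian(x, y, amplitude, cx, cy):
    """Eq. (2) as printed in the excerpt: a symmetric 2D Gaussian at (cx, cy)."""
    return amplitude * math.exp(-0.5 * ((cx - x) ** 2 + (cy - y) ** 2))


def sum_squared_difference(image_n, amplitude, cx, cy):
    """Objective a fitter would minimize over (amplitude, cx, cy).
    Assumes image_n[y][x] indexing (hypothetical convention)."""
    return sum(
        (image_n[y][x] - gaussian(x, y, amplitude, cx, cy)) ** 2
        for y in range(len(image_n))
        for x in range(len(image_n[y]))
    )


if __name__ == "__main__":
    thumb = [[0, 0, 0], [0, 10, 0], [0, 0, -5]]
    norm = normalize_thumbnail(thumb)
    print(norm)
    print(sum_squared_difference(norm, amplitude=1.0, cx=1, cy=1))
```

A lower objective value for a centered Gaussian is the kind of shape feature that helps separate "real" from "bogus" subtractions.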

Supernova Discovery in the Pinwheel Galaxy (M101)
11 hr after explosion
nearest SN Ia in >3 decades
ML-assisted “real-bogus” discovery
©Peter Nugent
Nugent, …, JSB+12, Nature, 1110.6201

Bloom+12 see also Nugent+12, Nature

Discovery (& classification) on images is now a cottage industry
Adapted from D. Goldstein

Variable Star Science
50k variables, 26 classes, 810 with known labels (timeseries, colors)
Richards+11, 12
Also, Armstrong+16 (10k K2 stars)

Self-Supervised (Autoencoder) Recurrent NN
1. AE learns to reproduce irregularly sampled light curves using an information bottleneck (B): E(light curve) → B, D(B) ≈ light curve
2. Use B as features and learn a traditional classifier (e.g., random forest)
F. Pérez, S. van der Walt
SOTA permutation invariant version: Zhang & Bloom (ICLR20, arxiv:2011.01243)
• self-supervised feature learning → leverage large corpus of unlabelled light curves

Probabilistic Classification Of Variable Stars
Shivvers, JSB, Richards MNRAS, 2014
106 “DEB” candidates
12 new mass-radii
15 “RCB/DYP” candidates
8 new discoveries
Triple # of Galactic DYPer Stars
Miller, Richards, JSB, .. ApJ 2012
Local Distance Ladder: Spectroscopic Metallicity measurements for RRL, Cepheids, Mira…
→ Inform the use of precious followup resources

An Unexpected Fundamental Theoretical Discovery Using SBI

Microlensing for Exoplanet Discovery & Characterization
Goal: measure masses, separations, orbits.
Grid search+MCMC is slow (millions of forward model computations) & requires experts in the loop
Expecting thousands of events with Roman. Calls for automated & more efficient inference approaches
>10x yield
Figure from Zhu & Dong 2021
Animation: B. S. Gaudi
Feb 9th 2022, Caltech/IPAC, Keming Zhang

Fast Inference with Neural Density Estimator
Zhang, JSB, … NeurIPS ML4PS (2010.04156); Zhang et al., AJ 161 262 (2021)
→ Amortized inference, 10^5 × faster
θ ∈ ℝ^8
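For readers unfamiliar with the term, "amortized" inference can be written down compactly. The block below is the generic neural posterior estimation objective, offered as a sketch; the slide does not state the exact loss used in the cited papers, and the symbols q_\phi, p(\theta), and x_{\mathrm{obs}} are notation introduced here.

```latex
% Train once on simulated pairs (\theta, x) drawn from the prior and forward model:
\phi^{*} = \arg\min_{\phi} \;
  \mathbb{E}_{\theta \sim p(\theta)} \,
  \mathbb{E}_{x \sim p(x \mid \theta)}
  \left[ -\log q_{\phi}(\theta \mid x) \right]

% Then, for any newly observed light curve, the posterior is a single network evaluation:
p(\theta \mid x_{\mathrm{obs}}) \approx q_{\phi^{*}}(\theta \mid x_{\mathrm{obs}}),
\qquad \theta \in \mathbb{R}^{8}
```

The cost of the forward-model simulations is paid once at training time, which is where the per-event speedup over grid search plus MCMC comes from.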

Automating Inference of Binary Microlensing Events with Neural Density Estimation
Abstract: Automated inference of binary microlensing events with traditional sampling-based algorithms such as MCMC has been hampered by the slowness of the physical forward model and the pathological parameter space. Current analysis of such events requires both expert knowledge and large-scale grid searches to locate the approximate solution as a prerequisite to MCMC posterior sampling. As the next generation, space-based [1] microlensing survey with the Roman Space Observatory [2] is expected to yield thousands of binary microlensing events [3], a new fast, accurate, scalable, and automated approach is desired. In this paper, we present an automated inference method based on neural density estimation (NDE). We show that the trained NDE not only produces fast, accurate, and precise posteriors but also captures expected posterior degeneracies. A hybrid NDE-MCMC framework can further be applied to produce the exact posterior.
Zhang, JSB, … NeurIPS ML4PS (2010.04156); Zhang et al., AJ 161 262 (2021)
Recovery of Known Caustic Degeneracies

Discovery of Magnification Degeneracies
LETTERS, NATURE ASTRONOMY
Because of this unifying feature, we expected the offset degeneracy to be ubiquitous in past events with twofold degenerate solutions and speculate that a large number of cases may have been mistakenly attributed to the close–wide degeneracy. Therefore, we systematically searched for previously published events with twofold degenerate solutions satisfying q_A ≃ q_B ≪ 1 (see Methods). We found 23 such events, and then first compared the intercept of the source trajectory on the star–planet axis to the location of the null predicted with equation (1). We also invert equation (1) to predict one degenerate s_A from the other s_B:
    s_A = \frac{1}{2} \left\{ 2x_0 - (s_B - 1/s_B) + \sqrt{\left[ 2x_0 - (s_B - 1/s_B) \right]^2 + 4} \right\},    (2)
[Figure: axes x_null and Δx_null (%); curves for log_10(s_B) = -0.40, -0.25, -0.10; legend: Close–wide, Inner–outer; regions: Lens A wide, Lens A resonant, Lens A close]
Reanalysis of 23 previous 2-mode solutions shows one source location predicts the other
Continuous set of “offset” degenerate light curves with inner-outer/close-wide as limiting cases
“suggests the existence of a deeper symmetry in the equations governing two-body lenses than previously recognized.”
Zhang, Gaudi, Bloom, Nat. Ast. 2022
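Equation (2) is simple enough to check numerically. The sketch below evaluates it and verifies the algebra: the expression is the positive root of s_A - 1/s_A = 2x_0 - (s_B - 1/s_B), so the two separations and the intercept x_0 satisfy x_0 = [(s_A - 1/s_A) + (s_B - 1/s_B)] / 2. The function names are hypothetical and the input values are arbitrary test numbers, not values from the 23 reanalysed events.

```python
import math


def separation_term(s):
    """The combination s - 1/s that appears in equation (2)."""
    return s - 1.0 / s


def predict_s_a(s_b, x0):
    """Equation (2): predict the degenerate separation s_A from s_B and
    the source-trajectory intercept x0 on the star-planet axis."""
    u = 2.0 * x0 - separation_term(s_b)
    return 0.5 * (u + math.sqrt(u * u + 4.0))


def implied_x0(s_a, s_b):
    """Invert the same relation: the intercept implied by a pair (s_A, s_B)."""
    return 0.5 * (separation_term(s_a) + separation_term(s_b))


if __name__ == "__main__":
    s_b, x0 = 0.8, 0.1
    s_a = predict_s_a(s_b, x0)
    print(s_a, implied_x0(s_a, s_b))  # second value recovers x0
    print(predict_s_a(s_b, 0.0))      # x0 = 0 gives s_A = 1 / s_B
```

Setting x_0 = 0 reduces the relation to s_A = 1/s_B, the familiar close–wide pairing, which is consistent with the slide's description of close–wide as a limiting case of the offset degeneracy.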

Advancing astronomy by guiding human intuition with AI…
“…while AI is unlikely to replace scientists in the foreseeable future, [this work] demonstrates that it can be harnessed to help us understand deeper mathematical patterns in the underlying theory.”
Mroz, Nat. Ast. News and Views (2022)
See also Davies et al. Nature, 2021
https://joshbloom.org/post/just_the_beginning/

AI/ML Guided Exploration

SkyPortal

SkyPortal: Collaborative Platform for Time-Domain Astronomy
๏ Single-source-of-truth marshal for MMA, transient, variable, and Solar system use. Facilitates follow-up observation management: robotic and classical facilities
๏ ZTF-II: 300+ users, 100-1000 events per night; 2.8M sources total, 170k comments, 3.3M annotations
๏ Teeming with ML: rb scores, classifications, ML followup triggers
๏ How can we reduce the cognitive load in trying to remember (& act upon) so many sources/data?

LLM Summarization
Package comments, redshift, classifications, annotations etc. with the query:
    In one succinct (less than 250 words) paragraph written in the 3rd person summarize the following comments about this astronomical source. If classifications and/or the redshift are given, then note those in the summary. This summary should be most useful to expert astrophysicists who already know the definitions and meaning of all classifications.
Ship off to OpenAI and save the embedding in Pinecone (vector DB)
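The packaging step can be sketched as plain string assembly. The instruction text is copied from the slide; everything else (the `build_summary_prompt` name, the dictionary keys, the layout of the appended data) is a hypothetical illustration and not SkyPortal's actual schema or code. The call to the LLM provider and the vector-database write are deliberately left out.

```python
SUMMARY_INSTRUCTION = (
    "In one succinct (less than 250 words) paragraph written in the 3rd person "
    "summarize the following comments about this astronomical source. "
    "If classifications and/or the redshift are given, then note those in the summary. "
    "This summary should be most useful to expert astrophysicists who already know "
    "the definitions and meaning of all classifications."
)


def build_summary_prompt(source):
    """Assemble the prompt text from a source record.

    `source` is a plain dict with optional keys 'redshift', 'classifications',
    'comments' and 'annotations' (hypothetical field names).
    """
    parts = [SUMMARY_INSTRUCTION, ""]
    if source.get("redshift") is not None:
        parts.append(f"Redshift: {source['redshift']}")
    if source.get("classifications"):
        parts.append("Classifications: " + ", ".join(source["classifications"]))
    for label in ("comments", "annotations"):
        for text in source.get(label, []):
            parts.append(f"- {text}")
    return "\n".join(parts)


if __name__ == "__main__":
    example = {
        "redshift": 0.0008,
        "classifications": ["SN II"],
        "comments": ["Rising fast", "Spectrum obtained"],
    }
    print(build_summary_prompt(example))
```

Keeping prompt assembly separate from the network call makes the prompt easy to test and to version alongside the model.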

Embeddings-based Natural Language Queries
Similar idea as paper-qa
Constrained searches (e.g. by redshift)

Embeddings-based Similar Sources
Sources with LLM summaries are compared against others. Sources with the largest cosine similarity (smallest cosine distance) are shown.
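One common way to rank similar sources is by cosine similarity between their summary embeddings (equivalently, smallest cosine distance). The sketch below does this in plain Python over an in-memory dictionary; a production system would delegate the search to the vector database. The function names, the toy 2-D vectors, and the default threshold are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_sources(target_id, embeddings, min_similarity=0.8, top_k=5):
    """Return up to top_k (source_id, similarity) pairs, most similar first,
    dropping anything below the user-specified similarity threshold."""
    target = embeddings[target_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != target_id
    ]
    kept = [item for item in scored if item[1] >= min_similarity]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    toy = {"src_a": [1.0, 0.0], "src_b": [0.9, 0.1], "src_c": [0.0, 1.0]}
    print(similar_sources("src_a", toy))
```

The threshold keeps weak matches out of the interface entirely, rather than showing them with a low score.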

UI/UX Considerations
๏ Prediction speed matters - ideally on <100ms latency but otherwise
๏ Don’t show summarization button unless there is enough raw data
๏ Highlight/colorize query results based on quality
๏ Do not show results on similar sources below user-specified threshold
๏ Design for fault tolerance (e.g., vector DB is offline)
๏ Capture usage for post-mortems, improve user experience
XAI for Decision Support
๏ Explainability (with verifiability) will become more important in astronomy HCI
See Fok & Weld (2023), Tan 2022
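Several of these considerations (fault tolerance, latency awareness, usage capture, hiding the summarization button) can be combined in a thin wrapper around whatever performs the query. This is a hedged sketch only: `safe_query`, `show_summary_button`, the record fields, and the default thresholds are hypothetical names and values, not part of SkyPortal.

```python
import time


def safe_query(query_fn, *args, usage_log=None, latency_budget_s=0.1):
    """Run query_fn(*args); if the backend fails (e.g. vector DB offline),
    return an empty result instead of raising, and record what happened."""
    start = time.perf_counter()
    try:
        results, ok = list(query_fn(*args)), True
    except Exception:
        results, ok = [], False  # degrade gracefully: the UI shows "no results"
    elapsed = time.perf_counter() - start
    record = {
        "ok": ok,
        "elapsed_s": elapsed,
        "within_budget": elapsed <= latency_budget_s,
        "n_results": len(results),
    }
    if usage_log is not None:
        usage_log.append(record)  # kept for post-mortems
    return results, record


def show_summary_button(n_comments, n_classifications, min_items=3):
    """Only offer summarization when there is enough raw data to summarize."""
    return (n_comments + n_classifications) >= min_items


if __name__ == "__main__":
    def offline_backend(query):
        raise ConnectionError("vector DB is offline")

    log = []
    print(safe_query(offline_backend, "nearby type Ia supernovae", usage_log=log))
    print(show_summary_button(n_comments=1, n_classifications=0))
```

Catching every exception is a deliberate simplification here; a real deployment would likely distinguish timeouts from bad requests.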

“Yes Netflix, because my 6 year old will enjoy the animated fun of Sons of Anarchy”
https://www.reddit.com/r/funny/comments/3e7gy4/yes_netflix_because_my_6_year_old_will_enjoy_the/

ML in Production is Hard
Only real test of the model is if it's falsifiable on data that does not yet exist
Since all models are fallible & people are always on the receiving end, we need to invest in how models are hot-swapped, predictions are consumed & acted upon

Josh Bloom (GE Digital) @profjsb Harvard College Observatory c. 1890

Josh Bloom, UC Berkeley (Astronomy), @profjsb
AI Assisted Discovery: from UX to Eureka!
ML X Astrophysics Symposium, NYC, May 23, 2023
J. Richards (Stats/Astro), T. Broderick (Stats), J. Long (Stats), Jorge Martínez-Palomera, Ellie Abrahams, Sara Jamal, Keming Zhang, Sydney Jenkins, Ben Nachman (LBL), Jackie Blaum, Natalie LeBaron, Stefan van der Walt, Fernando Pérez

Koichi Itagaki https://stargazerslounge.com/topic/410027-m101-supernova-gif/ SN2023ixf: Nearest SN in a decade
