Data-Mining and Machine Learning in the LSST Era

Presented at the Royal Society "New windows on transients across the Universe" (2012)

Audio: http://downloads.royalsociety.org/audio/DM/DM2012-06/Bloom.mp3

Joshua Bloom
University of California, Berkeley
LSST Transient & Variable Star, Co-Chair
AURA Management Council for LSST (AMCL)


April 24, 2012


  1. Royal Society Discussion Meeting, 24 April 2012. Josh Bloom, University of California, Berkeley; LSST Transient & Variable Star, Co-Chair; AURA Management Council for LSST (AMCL). Data-Mining and Machine Learning in the LSST Era.
  2. (Image slide: Austin, Texas, 1/8/12; "NOW" and "BEFORE" photos)

  3. Large Synoptic Survey Telescope: 3.2-gigapixel camera; 5.4M images over 10 years; 18,000 sq deg; 15 TB/night → 30 TB/night (+metadata). Tyson 02; Tyson+03; Ivezic+11
  4. (image-only slide; no extracted text)
  5. 217th Meeting of the AAS • Seattle • January 2011. LSST Science Book: http://www.lsst.org/lsst/science/scibook and http://www.lsst.org/files/docs/sciencebook/SB_8.pdf. Light curves for ~billion sources every 3 days: 10^6 supernovae/yr; 10^5 eclipsing binaries; (lots)×(your favourite event). Also: microlensing, transiting planets...
  6. How do we do discovery, follow-up, and inference when the

    data rates (& requisite timescales) preclude human involvement?
  7. Machine Learning As Surrogate: trained to quickly make concrete, deterministic, & repeatable statements about abstract concepts. “Is this varying source astrophysical in nature or spurious?” Discovery: PTF yields 1.5M candidates/night, of which ~1:1000 are astrophysical; the machine has opined on 800M candidates. Bloom+11
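The "surrogate" idea above reduces to a scoring function plus a threshold. A minimal sketch, assuming a toy logistic scorer with invented feature names and weights (the actual PTF realbogus score comes from a trained ensemble, per Bloom+11):

```python
import math

def realbogus_score(features, weights, bias):
    """Toy logistic scorer mapping candidate features to a
    real/bogus probability. Illustrative only; the production
    classifier is learned from labelled candidates."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features for one subtraction candidate
# (e.g. roundness, flux ratio vs. the PSF): names are assumptions.
candidate = (0.9, 1.1)
weights, bias = (2.0, 1.5), -2.5

score = realbogus_score(candidate, weights, bias)
is_real = score > 0.5  # threshold tuned offline against vetted labels
```

In practice the threshold is tuned so that a stream in which only ~1 in 1000 candidates is astrophysical still yields a manageable, high-purity discovery list.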
  8. (Image subtraction panels: Reference, New, Difference)

  9. SN 2011fe identified with machine-learned discovery algorithms. The discovery image was ~11 hours after explosion; within a few hours, a spectrum confirmed it to be a SN Ia. Nearest SN Ia in more than 3 decades; 5th brightest supernova in 100 years. (Figures 2-3: early-time relative g-band flux and residuals from the t^2 law, flux = C (t - t_expl)^2, with FRODOSpec, Lick, and Keck I+HIRES spectral epochs marked; Li, Bloom et al.; Nugent et al.)
  10. Machine Learning As Surrogate: trained to quickly make concrete, deterministic, & repeatable statements about abstract concepts. “What is the nature (origin/reason...) of the variability?” Classification
  11. Machine-Learning Approach to Classification. “Features”: homogenize the data; real-number metrics that describe the time-domain characteristics & context of a source. ~100 features computed in < 1 sec (including periodogram analysis). Variability metrics: e.g. Stetson indices, χ²/dof (constant hypothesis). Periodic metrics: e.g. dominant frequencies in Lomb-Scargle, phase offsets between periods. Shape analysis: e.g. skewness, kurtosis, Gaussianity. Context metrics: e.g. distance to nearest galaxy, type of nearest galaxy, location in the ecliptic plane. Woźniak et al. 2004; Protopapas+06; Willemsen & Eyer 2007; Debosscher et al. 2007; Mahabal et al. 2008; Sarro et al. 2009; Blomme et al. 2010; Kim+11; Richards+11
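Two of the feature families above can be sketched in a few lines. A minimal illustration on invented toy data (not the talk's ~100-feature pipeline): χ²/dof against a constant-source model, and the skewness of the magnitude distribution:

```python
import statistics

def variability_features(mags, errs):
    """Compute two simple time-domain features for a light curve:
    chi^2 per degree of freedom under a constant-source hypothesis
    (large values flag variability) and the sample skewness of the
    magnitudes (asymmetry of the brightness distribution)."""
    n = len(mags)
    # Best-fit constant model: inverse-variance weighted mean magnitude.
    weights = [1.0 / (e * e) for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2_dof = sum(((m - mean) / e) ** 2
                   for m, e in zip(mags, errs)) / (n - 1)
    # Sample skewness of the magnitude distribution.
    mu = statistics.fmean(mags)
    sd = statistics.pstdev(mags)
    skew = sum(((m - mu) / sd) ** 3 for m in mags) / n
    return chi2_dof, skew

# Toy light curve: a constant source with one bright outlier epoch.
mags = [15.0, 15.01, 14.99, 15.0, 14.5]
errs = [0.02] * 5
chi2_dof, skew = variability_features(mags, errs)
```

The outlier epoch drives χ²/dof far above 1 and pulls the skew negative (brighter means a smaller magnitude), which is exactly the kind of signal these metrics are built to expose.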
  12. Variable Star Classification Confusion (Richards+11). (Figure: confusion matrix of predicted vs. true class across pulsating, eruptive, and multi-star variables.) Global classification errors on well-observed sources approach 15%; structured learning on the taxonomy brings gross errors to ~5%; random forest with missing-data imputation is emerging as superior, e.g. Dubath+11, Richards+11.
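The error rates quoted above come from confusion-matrix bookkeeping, which can be sketched as follows (toy labels, not the Richards+11 catalog):

```python
from collections import Counter

def confusion_and_error(true_labels, pred_labels):
    """Tally a confusion matrix as a Counter keyed by
    (true class, predicted class), and return it along with the
    global error rate: the fraction of off-diagonal assignments."""
    matrix = Counter(zip(true_labels, pred_labels))
    wrong = sum(n for (t, p), n in matrix.items() if t != p)
    return matrix, wrong / len(true_labels)

# Invented toy sample with one RR Lyrae misclassified as a Mira.
true_ = ["RRL", "RRL", "Mira", "EB", "EB", "Mira"]
pred  = ["RRL", "Mira", "Mira", "EB", "EB", "Mira"]
matrix, err = confusion_and_error(true_, pred)
```

The diagonal entries of `matrix` count correct assignments per class; off-diagonal entries such as `("RRL", "Mira")` are exactly the confusions the slide's figure visualizes.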
  13. Decision Boundaries are Survey-Specific (Long+12; Richards+11). How do we transfer learning from one survey to the next? (Fig. 1: (a) a CART classifier constructed on Hipparcos data separates Hipparcos sources well, with 0.6% error as measured by cross-validation; (b) OGLE-III sources plotted over the same feature space, feature #1 vs. feature #2.)
  14. Classification Statements are Inherently Fuzzy: classification probabilities should reflect uncertainty in the data & training, with higher confidence for sources closer to the training data. The classification probability vector must be calibrated, e.g. 20% of transients classified as SN Ib with P = 0.2 should be SN Ib. Catalogs of transients and variable stars must become probabilistic.
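The calibration requirement can be checked empirically: bin sources by their reported probability and compare the observed class fraction in each bin to the bin's probability. A sketch with invented toy numbers:

```python
def calibration_fraction(probs, is_class, lo, hi):
    """Among sources whose reported class probability falls in
    [lo, hi), return the fraction that truly belong to the class.
    For a calibrated classifier this tracks the bin's probability."""
    in_bin = [(p, y) for p, y in zip(probs, is_class) if lo <= p < hi]
    if not in_bin:
        return None
    return sum(y for _, y in in_bin) / len(in_bin)

# Toy catalog: reported P(SN Ib) and whether each source is a SN Ib.
probs   = [0.18, 0.22, 0.21, 0.19, 0.20, 0.85, 0.90]
is_snib = [0,    0,    1,    0,    0,    1,    1]

frac = calibration_fraction(probs, is_snib, 0.15, 0.25)
# Five sources fall in the [0.15, 0.25) bin and one is a true SN Ib,
# so the observed fraction is 0.2, matching the quoted P ~ 0.2.
```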
  15. MACC: Machine-Learned ASAS Classification Catalog (Richards+12, arXiv:1204.4180): 50,000+ variable stars from ACVS (~14th mag). http://bigmacc.info Training labels over 28 classes: ASAS 777, Hipparcos 644, OGLE 524; 407 “actively learned”, with experts weighing in on poorly labelled regions of feature space.
  16. http://bigmacc.info

  17. Doing Science with Probabilistic Catalogs. Demographics (with little follow-up): trade for high purity at the cost of lower efficiency, e.g. using RR Lyrae to find new Galactic structure. Novelty discovery (with lots of follow-up): trade for high efficiency at the cost of lower purity, e.g. discovering new instances of rare classes.
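The purity/efficiency trade amounts to choosing a probability threshold on the catalog. A sketch with invented toy scores (the names `is_rrl` etc. are illustrative only):

```python
def purity_efficiency(probs, is_target, threshold):
    """Select all catalog sources with class probability >= threshold
    and report (purity, efficiency): the fraction of selected sources
    that are real members, and the fraction of real members selected."""
    selected = [y for p, y in zip(probs, is_target) if p >= threshold]
    n_true = sum(is_target)
    purity = sum(selected) / len(selected) if selected else None
    efficiency = sum(selected) / n_true
    return purity, efficiency

# Toy probabilistic catalog: P(RR Lyrae) and true membership.
probs  = [0.95, 0.9, 0.8, 0.6, 0.55, 0.3, 0.2]
is_rrl = [1,    1,   0,   1,   0,    1,   0]

loose  = purity_efficiency(probs, is_rrl, 0.5)   # demographics-style cut
strict = purity_efficiency(probs, is_rrl, 0.85)  # purity-style cut
```

Raising the threshold from 0.5 to 0.85 lifts purity (0.6 → 1.0) while efficiency drops (0.75 → 0.5): the demographics vs. novelty-discovery choice on the slide.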
  18. Miller, A. A., Richards, J. W., Bloom, J. S., Cenko, S. B., Silverman, J. M., Starr, D. L., & Stassun, K. G., “Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables: Rare Gems Mined from ASAS” (draft, April 20, 2012). (Figure: ASAS V-band light curves of the newly discovered RCB stars and DY Per candidates, with magnitude ranges shown for each light curve; spectroscopic observations confirm the upper candidates to be RCB stars, while the bottom four are DY Pers.)
  19. Follow-Up-Resource-Aware Classification: collect burst data from the Swift satellite feed and, from the "immediately" available data, predict in real time which events are "high redshift" versus less interesting or unclassified GRBs.
  20. Follow-Up-Resource-Aware Classification (Morgan+11). (Figure: efficiency vs. α, i.e. the fraction of high-redshift (z > 4) GRBs observed as a function of the fraction α of GRBs followed up; the predicted ranking improves on random selection at 90% c.l.) “59% (86%) of high-z GRBs can be captured from following up the top 20% (40%) of the ranked candidates”
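The quoted numbers come from ranking candidates by predicted probability and following up only the top α fraction. A sketch with invented scores (not the Morgan+11 Swift sample):

```python
def captured_fraction(scores, is_high_z, alpha):
    """Rank candidates by predicted high-z score (descending),
    follow up the top alpha fraction, and return the fraction of
    true high-z events captured by that follow-up budget."""
    ranked = sorted(zip(scores, is_high_z), key=lambda t: -t[0])
    n_follow = max(1, round(alpha * len(ranked)))
    captured = sum(y for _, y in ranked[:n_follow])
    return captured / sum(is_high_z)

# Toy sample of 10 GRBs: predicted scores and true high-z flags.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_high_z = [1,   0,   1,   0,   0,   1,   0,   0,   0,   0]

frac = captured_fraction(scores, is_high_z, 0.2)  # follow up top 20%
```

Sweeping α from 0 to 1 traces out the efficiency-vs-α curve in the slide's figure; a useful ranking captures true events far faster than the random-selection diagonal.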
  21. PTF Real-time Discovery and Classification Pipeline (Bloom+11): immediate classification of >15k sources in PTF. The candidate & subtraction database at LBNL/NERSC handles image ingest, image analysis, and realbogus generation. Once a subtraction is completed, the pipeline queries for real candidates in a date range or at a specified position (with other candidate matches), creates contextualized scores of possible sources based on nearby candidates, and builds the light curve of each source. Discovered sources are stored in the internal Oarical database, time-domain features are determined, and context features are determined by querying webservices (SDSS, SIMBAD, USNO B1.0); the PTF type, robotclass, and classification are then determined and stored, with a web query interface to Oarical. High-scoring discoveries are saved & annotated in the PTF Marshal at Caltech, web summary pages are generated for human scanners & science-specific project groups (Fig. 9), and email summaries are pushed to the PTF collaboration. (Fig. 7: the taxonomy of classification used by Oarical. The top tier is the PTF-type initial classification used when saving candidates as sources; the second tier, “robotclass,” gives the four classifications determined by Oarical for a new source; the bottom tier shows example classifications determined from SIMBAD identifications and SDSS spectroscopic analysis, using the context and time-domain features. The classifier relies on a hierarchy of inputs from most to less reliable. 1. Minor-Planet Center: after the context and time-domain features are assembled, Oarical queries a parallelized minor-planet webservice to determine if the source is consistent in time and position with a known asteroid; if so, the source is classified as Rock with high confidence and all other confidences are set to zero.)
  22. Smith+11

  23. Machine-learned, immediate classification is just now competitive with crowdsourced SN discovery (Richards+12, in prep). (Figure 4: classification of 345 known PTF SNe, missed detection rate (MDR) vs. false positive rate (FPR). Lowering the SN selection threshold trades lower MDR for higher FPR; employing the random forest score to select objects is uniformly better, in terms of MDR and FPR, than using the SN Zoo score. At 10% FPR, the RF criterion (threshold = 0.035) attains a 14.5% MDR compared to 34.2% MDR for SN Zoo (threshold = 0.4), yielding samples of higher efficiency and purity.)
  24. (Image slide: Austin, Texas; "NOW" and "BEFORE" photos, repeating slide 2)

  25. Summary. LSST is on track for an FY2014 construction start. Precursor machine-learning work is already having an impact in the time domain: SN 2011fe was a rapid needle-in-a-haystack discovery, and classification rarities (RCB/DY Per stars) were confirmed with a new probabilistic catalog. There is a crucial role for experts/the crowd in training ML, but their role in the real-time loop is less certain.