
Classification Challenges


LSST All-Hands Meeting
August 13 – 17, 2012

Josh Bloom, UC Berkeley
@profjsb



Transcript

  1. LSST AHM 2012 Classification Challenges Josh Bloom, UC Berkeley @profjsb

  2. “Discovery” itself is Hard
     [image panels: Reference, New, and Difference cutouts for PTF11kly and PTF11kx]
     In PTF, ~1000 bogus for every 1 real
  3. “Discovery” itself is Hard: Real or Bogus? (Brink+2012)

     [Fig. 2: Histograms of a selection of features divided into real (purple) and
     bogus (cyan) populations. First, two newly introduced features, gauss and amp,
     the goodness-of-fit and amplitude of the Gaussian fit. Then mag_ref, the
     magnitude of the source in the reference image; flux_ratio, the ratio of the
     fluxes in the new and reference images; and lastly ccid, the ID of the camera
     CCD where the source was detected. The fact that this feature is useful at all
     is surprising, but we can clearly see that candidates have a higher probability
     of being real or bogus on some of the CCDs.]

     A description of the random forest algorithm can be found in Breiman (2001).
     Briefly, the method aggregates a collection of hundreds to thousands of
     classification trees and, for a given new candidate, outputs the fraction of
     classifiers that vote real. If this fraction is greater than some threshold τ,
     random forest classifies the candidate as real; otherwise it is deemed to be
     bogus.

     While an ideal classifier will have no missed detections (i.e., no real
     identified as bogus) with zero false positives (bogus identified as real), a
     realistic classifier will typically offer a trade-off between the two types of
     errors. A receiver operating characteristic (ROC) curve is a commonly used
     diagram which displays the missed detection rate (MDR) versus the false
     positive rate (FPR) of a classifier; varying τ maps out the curve, and the
     lower the curve the better.

     [Fig. 3: Comparison of a few well-known classification algorithms applied to
     the full dataset. ROC curves enable a trade-off between false positives and
     missed detections, but the best classifier pushes closer towards the origin.
     Linear models (logistic regression or linear SVMs) perform poorly as expected,
     while non-linear models (SVMs with radial basis function kernels or random
     forests) do better. A line is plotted to show the 1% FPR to which the figure
     of merit is fixed.]
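The vote-then-threshold logic described above can be sketched in a few lines. This is an illustrative helper, not code from Brink+2012: sweep the vote threshold τ and record the missed detection rate (MDR) and false positive rate (FPR) at each setting, tracing out an ROC curve. All names are hypothetical.

```python
def roc_points(vote_fractions, labels, thresholds):
    """Sweep the random forest vote threshold tau; return (tau, MDR, FPR)
    triples. vote_fractions: fraction of trees voting 'real' per candidate;
    labels: True for real, False for bogus."""
    n_real = sum(labels)
    n_bogus = len(labels) - n_real
    points = []
    for tau in thresholds:
        # a candidate is called "real" when the fraction of trees
        # voting real exceeds tau
        pred = [f > tau for f in vote_fractions]
        mdr = sum(1 for p, y in zip(pred, labels) if y and not p) / n_real
        fpr = sum(1 for p, y in zip(pred, labels) if p and not y) / n_bogus
        points.append((tau, mdr, fpr))
    return points
```

Raising τ lowers the FPR at the cost of a higher MDR; the figure of merit above fixes FPR at 1% and asks for the lowest achievable MDR there.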
  4. Classification from Level I Photometry
     • noisy, irregularly sampled
     • telltale signature event may not have happened yet
     • spurious data
  5. Machine-Learning Approach to Classification
     variability metrics: e.g., Stetson indices, χ²/dof (constant hypothesis)
     periodic metrics: e.g., dominant frequencies in Lomb-Scargle, phase offsets
     between periods
     shape analysis: e.g., skewness, kurtosis, Gaussianity
     context metrics: e.g., distance to nearest galaxy, type of nearest galaxy,
     location in the ecliptic plane
     “Features”: homogenize the data; real-number metrics that describe the
     time-domain characteristics & context of a source... LSST calls this
     “characterization”
     ~100 features computed in < 1 sec (including periodogram analysis)
     Wózniak et al. 2004; Protopapas+06; Willemsen & Eyer 2007; Debosscher et al.
     2007; Mahabal et al. 2008; Sarro et al. 2009; Blomme et al. 2010; Kim+11;
     Richards+11
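A few of the feature families above can be illustrated with a toy extractor. This is a minimal sketch, not the actual PTF/LSST feature code; the function and feature names are hypothetical, and only three simple features are shown (χ²/dof against a constant-brightness hypothesis, plus skewness and excess kurtosis of the magnitudes).

```python
import statistics

def variability_features(mags, errs):
    """Toy time-domain feature extractor for a light curve.
    mags: observed magnitudes; errs: per-point uncertainties."""
    n = len(mags)
    # inverse-variance weighted mean under the constant-source hypothesis
    w = [1.0 / e**2 for e in errs]
    wmean = sum(wi * m for wi, m in zip(w, mags)) / sum(w)
    chi2_dof = sum(((m - wmean) / e) ** 2 for m, e in zip(mags, errs)) / (n - 1)
    # distribution-shape features
    mu = statistics.fmean(mags)
    sd = statistics.pstdev(mags)
    skew = sum(((m - mu) / sd) ** 3 for m in mags) / n
    kurt = sum(((m - mu) / sd) ** 4 for m in mags) / n - 3.0  # excess kurtosis
    return {"chi2_dof": chi2_dof, "skew": skew, "kurt": kurt}
```

A constant source should give χ²/dof near 1; a genuine variable pushes it well above 1 even with honest error bars, which is why it appears among the variability metrics.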
  6. Variable Star Classification Confusion
     [confusion matrix figure (Richards+11): predicted vs. True Class, with
     classes grouped into pulsating, eruptive, and multi-star]
     - global classification errors on well-observed sources approaching 15%
     - structured learning on taxonomy: gross errors ~5%
     - random forest (via rpy2) with missing-data imputation emerging as superior
       e.g., Dubath+12, Richards+12
  7. Decision Boundaries are Survey Specific (Long+12; Richards+11)
     How do we transfer learning from one survey to the next?
     [Fig. 1 (Long+12): (a) The grey lines represent the CART classifier
     constructed using Hipparcos data; the points are Hipparcos sources. This
     classifier separates Hipparcos sources well (0.6% error as measured by
     cross-validation). (b) Here the OGLE sources are plotted over the feature …
     (caption truncated). Axes: feature #1 vs. feature #2; panels: Hipparcos,
     OGLE-III, ASAS (testing), OGLE+Hip (training), “Expert”.]
  8. Machine-learned varstar catalog: http://bigmacc.info

  9. Doing Science with Probabilistic Catalogs
     Demographics (with little followup): trading high purity at the cost of
     lower efficiency, e.g., using RRL to find new Galactic structure
     Novelty Discovery (with lots of followup): trading high efficiency for
     lower purity, e.g., discovering new instances of rare classes
     [shown: draft (April 20, 2012) of Miller, Richards, Bloom, Cenko,
     Silverman, Starr & Stassun, “Discovery of Bright Galactic R Coronae
     Borealis and DY Persei Variables: Rare Gems Mined from ASAS”]
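The purity-versus-efficiency trade-off above is just a threshold choice on the catalog's class probabilities. A minimal sketch, with hypothetical names and made-up inputs:

```python
def purity_efficiency(probs, labels, threshold):
    """Select sources whose class probability exceeds `threshold`.
    Returns (purity, efficiency): purity = fraction of selected sources
    that are truly the class; efficiency = fraction of all true class
    members recovered (completeness)."""
    selected = [(p, y) for p, y in zip(probs, labels) if p >= threshold]
    n_true = sum(labels)
    if not selected:
        return 0.0, 0.0
    tp = sum(1 for _, y in selected if y)
    return tp / len(selected), tp / n_true
```

A demographic study (e.g., RRL tracing Galactic structure) would pick a high threshold for purity; a novelty search for rare classes would pick a low one, accepting contamination in exchange for completeness.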
  10. Real-time Classifications too...

  11. Real-time Classifications too...
      [Fig. 14: Confusion matrix for robotclass random forest classification.
      Classes are aligned so that entries along the diagonal correspond to
      correct classification. Probabilities are normalized to sum to unity for
      each column. Recovery rates are 90%, with very high purity, for the three
      dominant classes. Classification accuracy suffers for the two classes
      with small amounts of data (note: class … caption truncated).]
      see also Joey’s talk Wednesday on supernova classification
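The per-column normalization mentioned in the figure caption is a one-liner worth making explicit. A minimal sketch (hypothetical helper name, rows = predicted class, columns = true class):

```python
def column_normalize(confusion):
    """Normalize each column of a square confusion matrix of raw counts so
    it sums to unity, i.e., convert counts to P(predicted | true class)."""
    n = len(confusion)  # assumes a square n x n matrix
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    return [[confusion[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n)]
            for r in range(n)]
```

With this convention the diagonal entries are the per-class recovery rates quoted on the slide.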
  12. Parting Thoughts
      • discovery is hard; need “real” data to train
      • Machine-learned real-time & retrospective classification without humans
        in the loop is already happening (e.g., ASAS, PTF...)
      • Transfer learning from one survey to the next is non-trivial
      • Posterior probabilities need good priors: we need to characterize the
        transient universe as LSST might see it
  13. Private donations allowed us to fabricate all three large mirrors:
      Primary/Tertiary at SOML, Secondary at Corning
  14. ~10,000 variable stars & transients from PTF (2009: green, 2010: blue)
      [visualized with mayavi2]
  15. PTF11kly (SN 2011fe) ©Peter Nugent
      Supernova Discovery in the Pinwheel Galaxy 11 hr after explosion; nearest
      SN Ia in >3 decades
      Promoted to the top of the candidate list by our machine-learned codes
  16. None
  17. Best constraints on a SN Ia “progenitor” system. Red giant

    and Helium star progenitors are ruled out for the first time pysynphot,matplotlib Li, Bloom et al. Nature (2011) log(brightness) Bloom et al. ApJ Letters (2012)
  18. Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables:
      Rare Gems Mined from ASAS
      A. A. Miller, J. W. Richards, J. S. Bloom, S. B. Cenko, J. M. Silverman,
      D. L. Starr, and K. G. Stassun (arXiv:1204.4181 [astro-ph.SR])
      ABSTRACT: We present the results of a machine-learning (ML) based search
      for new R Coronae Borealis (RCB) stars and DY Persei-like stars (DYPers)
      in the Galaxy using cataloged light curves obtained by the All-Sky
      Automated Survey (ASAS). RCB stars, a rare class of hydrogen-deficient
      carbon-rich supergiants, are of great interest owing to the insights they
      can provide on the late stages of stellar evolution. DYPers are possibly
      the low-temperature, low-luminosity analogs to the RCB phenomenon, though
      additional examples are needed to fully establish this connection. While
      RCB stars and DYPers are traditionally identified by epochs of extreme
      dimming that occur without regularity, the ML search framework more fully
      captures the richness and diversity of their photometric behavior. We
      demonstrate that our ML method recovers ASAS candidates that would have
      been missed by traditional search methods employing hard cuts on
      amplitude and periodicity. Our search yields 13 candidates that we
      consider likely RCB stars/DYPers: new and archival spectroscopic
      observations confirm that four of these candidates are RCB stars and four
      are DYPers. Our discovery of four new DYPers increases the number of
      known Galactic DYPers from two to six; noteworthy is that one of the new
      DYPers has a measured parallax and is m ≈ 7 mag, making it the brightest
      known DYPer to date. Future observations of these new DYPers should prove
      instrumental in establishing the RCB connection. We consider these
      results, derived from a machine-learned probabilistic …
      17 known Galactic RCB/DY Per
  19. PTF Real-time Discovery and Classification Pipeline (Bloom+11)
      Candidate & Subtraction Database (LBNL/NERSC): image ingest, image
      analysis, Realbogus generation. Query for real candidates in a date range
      or specified position, with other candidate matches. Create
      contextualized scores of possible sources based on nearby candidates;
      build the light curve of a source once subtraction is completed. Store
      the discovered source in the internal Oarical database; determine
      time-domain features. Web query interface to Oarical. Determine context
      features by querying webservices (SDSS, SIMBAD, USNO B1.0). Determine PTF
      type, robotclass, and classification; store in internal database.
      PTF Marshal (Caltech): save & annotate high-scoring discoveries; generate
      web summary pages for human scanners & science-specific project groups
      (Fig. 9); push email summaries to the PTF collaboration.
      [Fig. 7: Taxonomy of classification used by Oarical. The top bar shows
      the PTF type initial classification used when saving candidates as
      sources. The second tier, “robotclass,” shows the four classifications
      determined by Oarical for a new source. The bottom tier shows example
      classifications determined from SIMBAD identifications and SDSS
      spectroscopic analysis.]
      For the context and time-domain features, this classifier relies on a
      hierarchy of inputs, from most reliable to less reliable:
      1. Minor-Planet Center: after the context and time-domain features are
         assembled, Oarical queries our parallelized minor-planet webservice to
         determine if the source is consistent in time and position with a
         known asteroid. If so, the source is classified as class Rock with
         immediate classification.
      > 15k sources in PTF
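The "hierarchy of inputs, from most reliable to less reliable" can be sketched as a short cascade. This is a hypothetical illustration of the tiered logic, not the actual Oarical code; the field names and fallbacks are invented for the example.

```python
def robotclass(candidate):
    """Tiered classification sketch: consult the most reliable inputs
    first, falling back to the machine-learned class only when no
    authoritative external match exists. `candidate` is a dict of
    pre-computed flags/features (names illustrative)."""
    # Tier 1: Minor-Planet Center match in time and position
    if candidate.get("matches_known_asteroid"):
        return "Rock"
    # Tier 2: external catalog identification (e.g., SIMBAD)
    if candidate.get("simbad_type"):
        return candidate["simbad_type"]
    # Tier 3: machine-learned class from context + time-domain features
    return candidate.get("ml_class", "unknown")
```

The point of the ordering is that a positional match to a known asteroid settles the question immediately, so the (noisier) learned classifier is never consulted for those sources.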
  20. SN 2011fe identified w/ Machine-Learned Discovery Algorithms
      Discovery image was ~11 hours after explosion
      Within a few hours, a spectrum confirmed it to be a SN Ia
      Nearest SN Ia in more than 3 decades
      5th brightest supernova in 100 years
  21. Smith+11

  22. Machine-learned, immediate classification just now competitive with
      crowdsourced SN discovery (Richards, this meeting)
      [Figure 4: By lowering the SN selection threshold, we trade lower MDR for
      higher FPR. For a sample of 345 known PTF SNe, employing the RF score to
      select objects is uniformly better, in terms of MDR and FPR, than using
      the SN Zoo score, yielding samples of higher efficiency and purity. At
      10% FPR, the RF criterion (threshold = 0.035) attains a 14.5% MDR
      compared to 34.2% MDR for SN Zoo (threshold = 0.4). Axes: False Positive
      Rate (FPR) vs. Missed Detection Rate (MDR), 0.0-0.5; curves: Random
      Forest, Supernova Zoo.]
  23. Classification Statements are Inherently Fuzzy
      - classification probabilities should reflect uncertainty in the data &
        training
      - calibration of the classification probability vector, e.g.: 20% of
        transients classified as SN Ib with P=0.2 should be SN Ib
      - higher confidence with greater proximity to training data
      Catalogs of Transients and Variable Stars Must Become Probabilistic
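The calibration requirement stated above (of sources assigned P=0.2 for a class, about 20% should truly be that class) can be checked with a simple reliability binning. A minimal sketch with hypothetical names, not a published calibration method:

```python
def calibration_error(probs, is_class, bins=5):
    """Bin sources by predicted probability; within each bin, compare the
    mean predicted probability to the empirical fraction that truly belong
    to the class. Returns the mean absolute gap over non-empty bins
    (0.0 means perfectly calibrated on this sample)."""
    gaps = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        in_bin = [(p, y) for p, y in zip(probs, is_class) if lo <= p < hi]
        if in_bin:
            mean_p = sum(p for p, _ in in_bin) / len(in_bin)
            frac = sum(y for _, y in in_bin) / len(in_bin)
            gaps.append(abs(mean_p - frac))
    return sum(gaps) / len(gaps) if gaps else 0.0
```

A probabilistic catalog whose gaps are small can be used directly in population inference; large gaps mean the probability vector needs recalibration before the "20% at P=0.2" reading is valid.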