Data-Mining and Machine Learning in the LSST Era

Presented at the Royal Society "New windows on transients across the Universe" (2012)

Audio: http://downloads.royalsociety.org/audio/DM/DM2012-06/Bloom.mp3

Joshua Bloom
University of California, Berkeley
LSST Transient & Variable Star, Co-Chair
AURA Management Council for LSST (AMCL)


April 24, 2012


  1. Royal Society Discussion Meeting, 24 April 2012. Josh Bloom, University of California, Berkeley; LSST Transient & Variable Star, Co-Chair; AURA Management Council for LSST (AMCL). Data-Mining and Machine Learning in the LSST Era.
  2. (Image slide: Austin, Texas, 1/8/12; "NOW" and "BEFORE" photos)

  3. Large Synoptic Survey Telescope: 3.2-gigapixel camera; 5.4M images over 10 years; 18,000 sq deg; 15 TB/night → 30 TB/night (+metadata). Tyson 02; Tyson+03; Ivezic+11
  4. (image-only slide; no extracted text)
  5. 217th Meeting of the AAS • Seattle • January 2011. LSST Science Book: http://www.lsst.org/lsst/science/scibook and http://www.lsst.org/files/docs/sciencebook/SB_8.pdf. Light curves for ~billion sources every 3 days: 10^6 supernovae/yr; 10^5 eclipsing binaries; (lots)×(your favourite event). Also: microlensing, transiting planets...
  6. How do we do discovery, follow-up, and inference when the

    data rates (& requisite timescales) preclude human involvement?
  7. Machine Learning As Surrogate: trained to quickly make concrete, deterministic, & repeatable statements about abstract concepts. “Is this varying source astrophysical in nature or spurious?” Discovery: PTF yields 1.5M candidates/night, of which ~1:1000 are astrophysical; the machine has opined on 800M candidates. Bloom+11
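The "surrogate" idea above reduces to a scoring function plus a threshold. A minimal sketch, assuming a toy logistic scorer with invented feature names and weights (the actual PTF realbogus score comes from a trained ensemble, per Bloom+11):

```python
import math

def realbogus_score(features, weights, bias):
    """Toy logistic scorer mapping candidate features to a
    real/bogus probability. Illustrative only; the production
    classifier is learned from labelled candidates."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features for one subtraction candidate
# (e.g. roundness, flux ratio vs. the PSF): names are assumptions.
candidate = (0.9, 1.1)
weights, bias = (2.0, 1.5), -2.5

score = realbogus_score(candidate, weights, bias)
is_real = score > 0.5  # threshold tuned offline against vetted labels
```

In practice the threshold is tuned so that a stream in which only ~1 in 1000 candidates is astrophysical still yields a manageable, high-purity discovery list.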
  8. (Image subtraction panels: Reference, New, Difference)

  9. SN 2011fe identified with machine-learned discovery algorithms. The discovery image was ~11 hours after explosion; within a few hours, a spectrum confirmed it to be a SN Ia. Nearest SN Ia in more than 3 decades; 5th brightest supernova in 100 years. (Figures 2-3: early-time relative g-band flux and residuals from the t^2 law, flux = C (t - t_expl)^2, with FRODOSpec, Lick, and Keck I+HIRES spectral epochs marked; Li, Bloom et al.; Nugent et al.)
  10. Machine Learning As Surrogate: trained to quickly make concrete, deterministic, & repeatable statements about abstract concepts. “What is the nature (origin/reason...) of the variability?” Classification
  11. Machine-Learning Approach to Classification. “Features”: homogenize the data; real-number metrics that describe the time-domain characteristics & context of a source. ~100 features computed in < 1 sec (including periodogram analysis). Variability metrics: e.g. Stetson indices, χ²/dof (constant hypothesis). Periodic metrics: e.g. dominant frequencies in Lomb-Scargle, phase offsets between periods. Shape analysis: e.g. skewness, kurtosis, Gaussianity. Context metrics: e.g. distance to nearest galaxy, type of nearest galaxy, location in the ecliptic plane. Woźniak et al. 2004; Protopapas+06; Willemsen & Eyer 2007; Debosscher et al. 2007; Mahabal et al. 2008; Sarro et al. 2009; Blomme et al. 2010; Kim+11; Richards+11
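Two of the feature families above can be sketched in a few lines. A minimal illustration on invented toy data (not the talk's ~100-feature pipeline): χ²/dof against a constant-source model, and the skewness of the magnitude distribution:

```python
import statistics

def variability_features(mags, errs):
    """Compute two simple time-domain features for a light curve:
    chi^2 per degree of freedom under a constant-source hypothesis
    (large values flag variability) and the sample skewness of the
    magnitudes (asymmetry of the brightness distribution)."""
    n = len(mags)
    # Best-fit constant model: inverse-variance weighted mean magnitude.
    weights = [1.0 / (e * e) for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2_dof = sum(((m - mean) / e) ** 2
                   for m, e in zip(mags, errs)) / (n - 1)
    # Sample skewness of the magnitude distribution.
    mu = statistics.fmean(mags)
    sd = statistics.pstdev(mags)
    skew = sum(((m - mu) / sd) ** 3 for m in mags) / n
    return chi2_dof, skew

# Toy light curve: a constant source with one bright outlier epoch.
mags = [15.0, 15.01, 14.99, 15.0, 14.5]
errs = [0.02] * 5
chi2_dof, skew = variability_features(mags, errs)
```

The outlier epoch drives χ²/dof far above 1 and pulls the skew negative (brighter means a smaller magnitude), which is exactly the kind of signal these metrics are built to expose.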
  12. Variable Star Classification Confusion (Richards+11). (Figure: confusion matrix of predicted vs. true class across pulsating, eruptive, and multi-star variables.) Global classification errors on well-observed sources approach 15%; structured learning on the taxonomy brings gross errors to ~5%; random forest with missing-data imputation is emerging as superior, e.g. Dubath+11, Richards+11.
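The error rates quoted above come from confusion-matrix bookkeeping, which can be sketched as follows (toy labels, not the Richards+11 catalog):

```python
from collections import Counter

def confusion_and_error(true_labels, pred_labels):
    """Tally a confusion matrix as a Counter keyed by
    (true class, predicted class), and return it along with the
    global error rate: the fraction of off-diagonal assignments."""
    matrix = Counter(zip(true_labels, pred_labels))
    wrong = sum(n for (t, p), n in matrix.items() if t != p)
    return matrix, wrong / len(true_labels)

# Invented toy sample with one RR Lyrae misclassified as a Mira.
true_ = ["RRL", "RRL", "Mira", "EB", "EB", "Mira"]
pred  = ["RRL", "Mira", "Mira", "EB", "EB", "Mira"]
matrix, err = confusion_and_error(true_, pred)
```

The diagonal entries of `matrix` count correct assignments per class; off-diagonal entries such as `("RRL", "Mira")` are exactly the confusions the slide's figure visualizes.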
  13. Decision Boundaries are Survey-Specific (Long+12; Richards+11). How do we transfer learning from one survey to the next? (Fig. 1: (a) a CART classifier constructed on Hipparcos data separates Hipparcos sources well, with 0.6% error as measured by cross-validation; (b) OGLE-III sources plotted over the same feature space, feature #1 vs. feature #2.)
  14. Classification Statements are Inherently Fuzzy: classification probabilities should reflect uncertainty in the data & training, with higher confidence for sources closer to the training data. The classification probability vector must be calibrated, e.g. 20% of transients classified as SN Ib with P = 0.2 should be SN Ib. Catalogs of transients and variable stars must become probabilistic.
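The calibration requirement can be checked empirically: bin sources by their reported probability and compare the observed class fraction in each bin to the bin's probability. A sketch with invented toy numbers:

```python
def calibration_fraction(probs, is_class, lo, hi):
    """Among sources whose reported class probability falls in
    [lo, hi), return the fraction that truly belong to the class.
    For a calibrated classifier this tracks the bin's probability."""
    in_bin = [(p, y) for p, y in zip(probs, is_class) if lo <= p < hi]
    if not in_bin:
        return None
    return sum(y for _, y in in_bin) / len(in_bin)

# Toy catalog: reported P(SN Ib) and whether each source is a SN Ib.
probs   = [0.18, 0.22, 0.21, 0.19, 0.20, 0.85, 0.90]
is_snib = [0,    0,    1,    0,    0,    1,    1]

frac = calibration_fraction(probs, is_snib, 0.15, 0.25)
# Five sources fall in the [0.15, 0.25) bin and one is a true SN Ib,
# so the observed fraction is 0.2, matching the quoted P ~ 0.2.
```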
  15. MACC: Machine-Learned ASAS Classification Catalog (Richards+12, arXiv:1204.4180): 50,000+ variable stars from ACVS (~14th mag). http://bigmacc.info Training labels over 28 classes: ASAS 777, Hipparcos 644, OGLE 524; 407 “actively learned”, with experts weighing in on poorly labelled regions of feature space.
  16. http://bigmacc.info

  17. Doing Science with Probabilistic Catalogs. Demographics (with little follow-up): trade for high purity at the cost of lower efficiency, e.g. using RR Lyrae to find new Galactic structure. Novelty discovery (with lots of follow-up): trade for high efficiency at the cost of lower purity, e.g. discovering new instances of rare classes.
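The purity/efficiency trade amounts to choosing a probability threshold on the catalog. A sketch with invented toy scores (the names `is_rrl` etc. are illustrative only):

```python
def purity_efficiency(probs, is_target, threshold):
    """Select all catalog sources with class probability >= threshold
    and report (purity, efficiency): the fraction of selected sources
    that are real members, and the fraction of real members selected."""
    selected = [y for p, y in zip(probs, is_target) if p >= threshold]
    n_true = sum(is_target)
    purity = sum(selected) / len(selected) if selected else None
    efficiency = sum(selected) / n_true
    return purity, efficiency

# Toy probabilistic catalog: P(RR Lyrae) and true membership.
probs  = [0.95, 0.9, 0.8, 0.6, 0.55, 0.3, 0.2]
is_rrl = [1,    1,   0,   1,   0,    1,   0]

loose  = purity_efficiency(probs, is_rrl, 0.5)   # demographics-style cut
strict = purity_efficiency(probs, is_rrl, 0.85)  # purity-style cut
```

Raising the threshold from 0.5 to 0.85 lifts purity (0.6 → 1.0) while efficiency drops (0.75 → 0.5): the demographics vs. novelty-discovery choice on the slide.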
  18. Miller, A. A., Richards, J. W., Bloom, J. S., Cenko, S. B., Silverman, J. M., Starr, D. L., & Stassun, K. G., “Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables: Rare Gems Mined from ASAS” (draft, April 20, 2012). (Figure: ASAS V-band light curves of the newly discovered RCB stars and DY Per candidates, with magnitude ranges shown for each light curve; spectroscopic observations confirm the upper candidates to be RCB stars, while the bottom four are DY Pers.)
  19. Follow-Up-Resource-Aware Classification: collect burst data from the Swift satellite feed and, from the "immediately" available data, predict in real time which events are "high redshift" versus less interesting or unclassified GRBs.
  20. Follow-Up-Resource-Aware Classification (Morgan+11). (Figure: efficiency vs. α, i.e. the fraction of high-redshift (z > 4) GRBs observed as a function of the fraction α of GRBs followed up; the predicted ranking improves on random selection at 90% c.l.) “59% (86%) of high-z GRBs can be captured from following up the top 20% (40%) of the ranked candidates”
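The quoted numbers come from ranking candidates by predicted probability and following up only the top α fraction. A sketch with invented scores (not the Morgan+11 Swift sample):

```python
def captured_fraction(scores, is_high_z, alpha):
    """Rank candidates by predicted high-z score (descending),
    follow up the top alpha fraction, and return the fraction of
    true high-z events captured by that follow-up budget."""
    ranked = sorted(zip(scores, is_high_z), key=lambda t: -t[0])
    n_follow = max(1, round(alpha * len(ranked)))
    captured = sum(y for _, y in ranked[:n_follow])
    return captured / sum(is_high_z)

# Toy sample of 10 GRBs: predicted scores and true high-z flags.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_high_z = [1,   0,   1,   0,   0,   1,   0,   0,   0,   0]

frac = captured_fraction(scores, is_high_z, 0.2)  # follow up top 20%
```

Sweeping α from 0 to 1 traces out the efficiency-vs-α curve in the slide's figure; a useful ranking captures true events far faster than the random-selection diagonal.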
  21. PTF Real-time Discovery and Classification Pipeline (Bloom+11): immediate classification of >15k sources in PTF. The candidate & subtraction database at LBNL/NERSC handles image ingest, image analysis, and realbogus generation. Once a subtraction is completed, the pipeline queries for real candidates in a date range or at a specified position (with other candidate matches), creates contextualized scores of possible sources based on nearby candidates, and builds the light curve of each source. Discovered sources are stored in the internal Oarical database, time-domain features are determined, and context features are determined by querying webservices (SDSS, SIMBAD, USNO B1.0); the PTF type, robotclass, and classification are then determined and stored, with a web query interface to Oarical. High-scoring discoveries are saved & annotated in the PTF Marshal at Caltech, web summary pages are generated for human scanners & science-specific project groups (Fig. 9), and email summaries are pushed to the PTF collaboration. (Fig. 7: the taxonomy of classification used by Oarical. The top tier is the PTF-type initial classification used when saving candidates as sources; the second tier, “robotclass,” gives the four classifications determined by Oarical for a new source; the bottom tier shows example classifications determined from SIMBAD identifications and SDSS spectroscopic analysis, using the context and time-domain features. The classifier relies on a hierarchy of inputs from most to less reliable. 1. Minor-Planet Center: after the context and time-domain features are assembled, Oarical queries a parallelized minor-planet webservice to determine if the source is consistent in time and position with a known asteroid; if so, the source is classified as Rock with high confidence and all other confidences are set to zero.)
  22. Smith+11

  23. Machine-learned, immediate classification is just now competitive with crowdsourced SN discovery (Richards+12, in prep). (Figure 4: classification of 345 known PTF SNe, missed detection rate (MDR) vs. false positive rate (FPR). Lowering the SN selection threshold trades lower MDR for higher FPR; employing the random forest score to select objects is uniformly better, in terms of MDR and FPR, than using the SN Zoo score. At 10% FPR, the RF criterion (threshold = 0.035) attains a 14.5% MDR compared to 34.2% MDR for SN Zoo (threshold = 0.4), yielding samples of higher efficiency and purity.)
  24. (Image slide: Austin, Texas; "NOW" and "BEFORE" photos, repeating slide 2)

  25. Summary. LSST is on track for an FY2014 construction start. Precursor machine-learning work is already having an impact in the time domain: SN 2011fe was a rapid needle-in-a-haystack discovery, and classification rarities (RCB/DY Per stars) were confirmed with a new probabilistic catalog. There is a crucial role for experts/the crowd in training ML, but their role in the real-time loop is less certain.