Update on LOFAR TKP Database

transientskp
December 04, 2012

Bart Scheers

Transcript

  1. Update on LOFAR TKP Database

    Bart Scheers, Astronomical Institute "Anton Pannekoek", University of Amsterdam, and Centrum Wiskunde & Informatica, Amsterdam. TKP Meeting, Amsterdam, December 4th, 2012. Bart Scheers | TKP Meeting | 2012-12-04 LOFAR Databases
  4. The Transients Database

    The Aim
    ◮ Store all LOFAR measurements
    ◮ Build light-curve catalogue
    ◮ Enable fast processing and access (exploit the database engine)
    The Schema Design
    ◮ Propagate algorithms to the data
    ◮ Optimise for comparison of latest measurements with a statistical model of all measurements
    ◮ Recently: redesign, renaming, explicit table relations, installing & upgrading
    The Content
    ◮ External catalogues: VLSS(r), WENSS, NVSS, exoplanets
    ◮ Standard frequency bands (as defined for MSSS)
    ◮ Original measurements
    ◮ Deduced data: associations between measurements, cataloguing measurements
    ◮ Meta-data: pipeline configuration and task settings
  5. LOFAR Characteristics, expected volumes & data rates

    Data production
    ◮ Raw data ∼25 TB/hr
    Here, we focus on the database
    ◮ Distinct sources: ∼10^7 − 10^8,
    ⊲ which are measured/revisited many, many, many times
    ◮ A single measurement stores ∼300 B of data
    ◮ Overall data accumulation about 50 − 100 TB/yr
    ◮ Peaks may be over 10,000 source measurements per second
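
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal terabytes (1 TB = 10^12 B) and a 365.25-day year; the constant and function names are illustrative, not taken from the talk.

```python
# Back-of-envelope check of the quoted database volumes.
# Assumptions (not from the slides): 1 TB = 1e12 bytes, 365.25-day year.
BYTES_PER_MEASUREMENT = 300          # ~300 B per stored measurement
PEAK_MEASUREMENTS_PER_SECOND = 10_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Ingest bandwidth at the quoted peak rate, in bytes per second.
peak_bytes_per_second = BYTES_PER_MEASUREMENT * PEAK_MEASUREMENTS_PER_SECOND


def mean_measurement_rate(tb_per_year):
    """Average measurements per second implied by a yearly volume in TB."""
    measurements_per_year = tb_per_year * 1e12 / BYTES_PER_MEASUREMENT
    return measurements_per_year / SECONDS_PER_YEAR


print(peak_bytes_per_second / 1e6, "MB/s at peak")
print(round(mean_measurement_rate(50)), "to",
      round(mean_measurement_rate(100)), "measurements/s on average")
```

Under these assumptions the peak rate corresponds to about 3 MB/s of new rows, and 50 to 100 TB/yr corresponds to a sustained average of roughly 5,000 to 10,000 measurements per second.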
  6. Exploit the Database Engine

    Move the algorithms to the data, inside the database engine, reducing I/O
    ◮ Build & maintain an up-to-date statistical sky model
    ◮ Source association
    ◮ Monitoring list
    ◮ Transient & variability search
    ◮ Feature extraction
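
As a rough illustration of moving an algorithm to the data, the sketch below updates a running mean with a single SQL statement, so no rows travel to the client. It uses Python's built-in `sqlite3` purely for convenience (the talk itself concerns MonetDB and MySQL), and the table and column names `sky_model`, `new_measurement`, `n` and `avg_flux` are hypothetical, not the TKP schema. It also assumes at most one new measurement per source.

```python
import sqlite3

# In-memory database with hypothetical tables (not the TKP schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sky_model (source_id INTEGER PRIMARY KEY, n INTEGER, avg_flux REAL);
    CREATE TABLE new_measurement (source_id INTEGER, flux REAL);
    INSERT INTO sky_model VALUES (1, 2, 10.0), (2, 4, 5.0);
    INSERT INTO new_measurement VALUES (1, 16.0);
""")

# The update runs entirely inside the engine: for every source that has a
# new measurement, fold it into the stored mean and bump the counter.
# Right-hand sides see the old row values, so n is still the old count here.
conn.execute("""
    UPDATE sky_model
       SET avg_flux = (n * avg_flux
                       + (SELECT m.flux FROM new_measurement AS m
                           WHERE m.source_id = sky_model.source_id)) / (n + 1),
           n = n + 1
     WHERE source_id IN (SELECT source_id FROM new_measurement)
""")

rows = conn.execute(
    "SELECT source_id, n, avg_flux FROM sky_model ORDER BY source_id"
).fetchall()
print(rows)
```

Source 1 moves from a mean of 10.0 over 2 points to 12.0 over 3 points; source 2 has no new measurement and is left unchanged.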
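  12. Building & maintaining an up-to-date statistical sky model

    ◮ We want to summarise/reduce our data statistically, instead of using all individual datapoints
    ◮ Therefore, we use a more database-friendly approach
    Average: $\bar{x}_N = \frac{1}{N} \sum_{i=1}^{N} x_i \;\Rightarrow\; \bar{x}_{N+1} = \frac{N \bar{x}_N + x_{N+1}}{N+1}$
    Weighted average: $\xi_N = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \;\Rightarrow\; \xi_{N+1} = \frac{N \bar{w}_N \xi_N + w_{N+1} x_{N+1}}{N \bar{w}_N + w_{N+1}}$, with $\bar{w}_N = \frac{1}{N} \sum_{i=1}^{N} w_i$ and $w_{N+1} = 1/e_{N+1}^2$
    Variability indices per band:
    Magnitude: $V_\nu = s_\nu / \bar{I}_\nu = \frac{1}{\bar{I}_\nu} \sqrt{\frac{N}{N-1} \left( \overline{I_\nu^2} - \bar{I}_\nu^2 \right)}$
    Significance: $\eta_\nu = \frac{N}{N-1} \left( \overline{w_\nu I_\nu^2} - \frac{\overline{w_\nu I_\nu}^2}{\bar{w}_\nu} \right)$
    ◮ Store factors for fast calculation
    ◮ http://docs.transientskp.org/tkp/database/schema.html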
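
A minimal sketch of the "store factors for fast calculation" idea: keep a handful of running averages per source and band, update them incrementally with each new flux measurement, and derive both variability indices from those factors alone. The class and attribute names are hypothetical and do not claim to match the TKP schema; weights are taken as 1/error^2 as on the slide.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running averages for one source in one band (hypothetical names)."""
    n: int = 0
    avg_i: float = 0.0      # mean of I
    avg_i_sq: float = 0.0   # mean of I^2
    avg_w: float = 0.0      # mean of w
    avg_wi: float = 0.0     # mean of w * I
    avg_wi_sq: float = 0.0  # mean of w * I^2

    def add(self, flux, error):
        """Fold one measurement into every stored average."""
        w = 1.0 / error ** 2
        n = self.n

        def step(avg, value):
            # Incremental mean: (N * avg_N + x_{N+1}) / (N + 1)
            return (n * avg + value) / (n + 1)

        self.avg_i = step(self.avg_i, flux)
        self.avg_i_sq = step(self.avg_i_sq, flux ** 2)
        self.avg_w = step(self.avg_w, w)
        self.avg_wi = step(self.avg_wi, w * flux)
        self.avg_wi_sq = step(self.avg_wi_sq, w * flux ** 2)
        self.n = n + 1

    def magnitude(self):
        """V = s / mean(I), from the stored factors only."""
        if self.n < 2:
            raise ValueError("need at least two measurements")
        var = self.n / (self.n - 1) * (self.avg_i_sq - self.avg_i ** 2)
        return math.sqrt(max(var, 0.0)) / self.avg_i

    def significance(self):
        """eta = N/(N-1) * (mean(w I^2) - mean(w I)^2 / mean(w))."""
        if self.n < 2:
            raise ValueError("need at least two measurements")
        return self.n / (self.n - 1) * (
            self.avg_wi_sq - self.avg_wi ** 2 / self.avg_w
        )


stats = RunningStats()
for flux in (1.0, 2.0, 3.0):
    stats.add(flux, error=0.5)
print(stats.magnitude(), stats.significance())
```

For fluxes 1, 2, 3 with equal errors of 0.5, the sample standard deviation is 1 and the mean is 2, so V is 0.5; the reduced chi-squared about the weighted mean is 4, which is the value of eta.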
  13. Source Association (by Position only)

    De Ruiter radius: a dimensionless distance that takes errors into account
    $r_{ij} = \sqrt{ \frac{(\alpha_i \cos\delta_i - \alpha_j \cos\delta_j)^2}{\sigma_{\alpha_i}^2 + \sigma_{\alpha_j}^2} + \frac{(\delta_i - \delta_j)^2}{\sigma_{\delta_i}^2 + \sigma_{\delta_j}^2} } < r_{\mathrm{lim}}$
  14. Source Association (by Position only)

    Rayleigh distribution: probability of finding a source at $r \geq \rho$
    $p(r \geq \rho) = \exp(-\rho^2 / 2)$
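
Inverting the Rayleigh tail probability gives the limiting radius for a chosen fraction of missed genuine associations. The sketch below is plain arithmetic on the formula; the function names are illustrative.

```python
import math


def prob_beyond(rho):
    """p(r >= rho) = exp(-rho^2 / 2) for a Rayleigh-distributed radius."""
    return math.exp(-rho ** 2 / 2.0)


def radius_for_miss_probability(p):
    """Radius rho such that a genuine counterpart lies beyond it with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.sqrt(-2.0 * math.log(p))


print(prob_beyond(3.0))                   # about 0.011
print(radius_for_miss_probability(1e-3))  # radius missing 1 in 1000 counterparts
```

A smaller accepted miss probability pushes the limit outwards, at the cost of admitting more chance associations.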
  15. Source Association (by Position only)

    Taking care of the types of association: one-to-one, one-to-many, many-to-one, many-to-many (http://docs.transientskp.org/tkp/database/assoc.html)
    Missed ones are processed by the monitoring-list recipe
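
One simple way to label the association types is to count how often each identifier appears among the candidate pairs. The sketch below is an assumption-laden illustration, not the TKP procedure: pairs are (catalogue source, new measurement) tuples, "one-to-many" is taken to mean one catalogue source matching several measurements, and the labelling is done per pair rather than per connected group.

```python
from collections import Counter


def classify_associations(pairs):
    """Label each (catalogue_id, measurement_id) candidate pair by type."""
    cat_counts = Counter(cat for cat, _ in pairs)
    meas_counts = Counter(meas for _, meas in pairs)
    kinds = {}
    for cat, meas in pairs:
        many_meas = cat_counts[cat] > 1   # catalogue source has several matches
        many_cat = meas_counts[meas] > 1  # measurement has several matches
        if many_meas and many_cat:
            kinds[(cat, meas)] = "many-to-many"
        elif many_meas:
            kinds[(cat, meas)] = "one-to-many"
        elif many_cat:
            kinds[(cat, meas)] = "many-to-one"
        else:
            kinds[(cat, meas)] = "one-to-one"
    return kinds


pairs = [
    (1, "a"),                                # isolated pair
    (2, "b"), (2, "c"),                      # one catalogue source, two measurements
    (3, "d"), (4, "d"),                      # two catalogue sources, one measurement
    (5, "e"), (5, "f"), (6, "e"), (6, "f"),  # fully crossed group
]
kinds = classify_associations(pairs)
for pair, kind in kinds.items():
    print(pair, kind)
```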
  16. Monitoring Sources

    ◮ List of sources to be monitored based on position
    ◮ User-defined sources
    ◮ Picked up by the TraP
    ◮ Forced fits at locations by sourcefinder
    ◮ RMS upper limits if no source is found
  17. Transient & Variability Detection

    ◮ Look for deviations in all light curves
    ◮ Use the variability magnitude ($V_\nu$) and significance ($\eta_\nu$) indices
    ◮ The reduced $\chi^2$ probability justifies rejection or acceptance of $H_0$ (i.e. the source is not variable)
    ⊲ $p_{\eta_\nu} = \int_{\eta'_\nu = \eta_\nu}^{\infty} p(\eta'_\nu, N-1) \, d\eta'_\nu$
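
The tail integral can be evaluated without external libraries for moderate values. The sketch below assumes that the significance index follows a reduced chi-squared distribution with N-1 degrees of freedom under H0, so the integral equals the ordinary chi-squared upper-tail probability at (N-1) times the index. It uses a power series for the regularised incomplete gamma function, which loses precision far in the tail; a library routine such as `scipy.stats.chi2.sf` would be the more robust choice. Function names are illustrative.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-squared variable."""
    if x <= 0:
        return 1.0
    a = dof / 2.0
    z = x / 2.0
    # Series for the regularised lower incomplete gamma function:
    # P(a, z) = z^a e^-z / Gamma(a) * sum_n z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total and n < 10_000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def variability_probability(eta, n_measurements):
    """Probability of a value at least as large as eta if H0 holds."""
    dof = n_measurements - 1
    return chi2_sf(eta * dof, dof)


print(variability_probability(4.0, 3))
```

A small probability argues against H0, that is, in favour of the source being variable; the threshold for rejection is a choice the slides do not specify.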
  18. Feature Extraction

    ◮ Obtain characteristics from detected transient sources
    ◮ Duration
    ◮ Peak flux
    ◮ Absolute and relative increase and decrease from background to peak flux, and the increase/decrease ratio
  19. MonetDB−MySQL, or comparing a column- and row-store

    [Figure: MySQL 5.0.45 (red line) and MonetDB v5.20.4 Jun2010-SP1 (blue line)]
    Test machine: desktop computer with a dual-core 64-bit Intel(R) Pentium(R) 4 CPU at 3.00 GHz and 1 GB of RAM, running Fedora 8 (Linux kernel 2.6.26.8-57)
  20. Non-digestivity of Recipes

    [Figure: query processing time (seconds, 0.0 to 3.5) against images, grouped per 9 (∼20 sources per image; 0 to 900). Series: total, ins_temp]
  21. Non-digestivity of Recipes

    [Figure: query processing time (seconds, 0.0 to 3.5) against images, grouped per 9 (∼20 sources per image; 0 to 900). Series: assoc, xtr]
  23. Transients Database, from single to multiple sharded nodes

    [Figure: Load and Alter SQL Statements for Table1 and Table2. Y-axis: Time [s], logarithmic, 10^-1 to 10^4. Series: Load on single node, Alter on single node, Load over 9 nodes, Alter over 9 nodes]
    Load data; alter table, add and update 4 DBL columns
    T1: 4.5 GB, row size 1023 B, 4 Mrows
    T2: 85 GB, row size 467 B, 165 Mrows
  25. Transients Database, from single to multiple sharded nodes

    [Figure: Queries Q1 to Q6. Y-axis: Time [s], logarithmic, 10^-3 to 10^2. Series: Cold Q on single node, Cold Q over 9 nodes, Hot Q on single node, Hot Q over 9 nodes]
    Cold mode: after server start, no in-memory data
    Hot mode: in-memory data
  26. Summary & Open Issues

    ◮ Column-stores perform better at high data volumes
    ◮ Maintain good statistical models
    ◮ Sharded databases reduce data replication
    ◮ Merge TraDB with/to LTA
    ◮ More unit tests
    ◮ Refactoring on monitoring
    ◮ Keep monitoring database performance