Software Sustainability and Reproducible Research in Remote Sensing

Sustainable Software & Reproducible Research in Remote Sensing
Robin Wilson
Geography and Environment, University of Southampton & Software Sustainability Institute
www.rtwilson.com/academic [email protected] @sciremotesense

Discussion Questions
Not just Yes/No, but why?

Given your most recent:
•  journal article
•  conference paper
•  presentation at this conference
Could I reproduce all of your results, from the raw input data + the paper/thesis?

Think of some data you’ve collected yourself…
Would it still be usable in 10 years’ time? 20 years? 30 years?

If you’ve written scripts or code of any sort…
Would it still be usable in 10, 20 or 30 years’ time?
If you disappeared, would someone else be able to understand it?

Two problems:
Reproducibility: Can you re-do exactly what you did for a project? Could I or someone else? KEY TO SCIENCE
Sustainability: Data, Code, Methods. Will they be usable in the future? A long time in the future?

“Non-reproducible single occurrences are of no significance to science”
Karl Popper (1959)

"The person most likely to reproduce your work is your own future self" -- Sergey Fomel at ICERM workshop
The most convincing reason for me to be reproducible is that somewhere down the line:
•  I will have to re-do the graph with different axes because a reviewer asked,
•  I will have to reinterpret the data for an updated conclusion,
•  I will write a journal paper based on a conference paper,
•  I will (hopefully) write a book or book chapter based on previous results,
•  …
Sometime in the future you will need to:
•  Re-create a graph to deal with reviewers’ comments
•  Write a journal paper based on a dissertation/thesis/conference paper
•  Work out what on earth you did for the project…
You will need to reproduce your work

If your research is reproducible:
•  Other people can build on it more easily
•  People who don’t believe the result can verify it themselves
•  People can generally DO STUFF with it
Your work will be cited more, applied more, become more well known and generally BE USED
50-100%

Amazon Forests Green-Up During 2005 Drought
Scott R. Saleska, Kamel Didan, Alfredo R. Huete, Humberto R. da Rocha
[Screenshot of the paper. Recoverable text:]
Large-scale numerical models that simulate the interactions between changing global climate and terrestrial vegetation predict substantial carbon loss from tropical ecosystems (1), including the drought-induced collapse of the Amazon forest and conversion to savanna (2). … Properly filtered to remove atmospheric aerosol and cloud effects, EVI tracks variations in canopy photosynthesis, as confirmed by ecosystem flux measurements on the ground (3, 4).
Fig. 1. Spatial pattern of July to September 2005 standardized anomalies (3) in (A) precipitation (derived from Tropical Rainfall Measuring Mission satellite observations during 1998–2006) and in (B) forest canopy “greenness” (the EVI derived from MODIS satellite observations during 2000–2006). (C) Frequency distribution of EVI anomalies from intact forest areas in (B) that fall within the drought area [red areas in (A), see fig. S2], significantly (P < 0.001) (3) skewed toward greenness.
Amazing result…or was it?

Amazon forests did not green-up during the 2005 drought
Arindam Samanta, Sangram Ganguly, Hirofumi Hashimoto, Sadashiva Devadiga, Eric Vermote, Yuri Knyazikhin, Ramakrishna R. Nemani, and Ranga B. Myneni
Received 11 December 2009; accepted 26 January 2010; published 5 March 2010. Geophys. Res. Lett., 37, L05401.
[Screenshot of the paper. Recoverable text:]
[1] The sensitivity of Amazon rainforests to dry-season droughts is still poorly understood, with reports of enhanced tree mortality and forest fires on one hand, and excessive forest greening on the other. Here, we report that the previous results of large-scale greening of the Amazon, obtained from an earlier version of satellite-derived vegetation greenness data - Collection 4 (C4) Enhanced Vegetation Index (EVI), are irreproducible, with both this earlier version as well as the improved, current version (C5), owing to inclusion of atmosphere-corrupted data in those results. We find no evidence of large-scale greening of intact Amazon forests during the 2005 drought - approximately 11%–12% of these drought-stricken forests display greening, while, 28%–29% show browning or no-change, and for the rest, the data are not of sufficient quality to characterize any changes. These changes are also not unique - approximately similar changes are observed in non-drought years as well.
Our analysis here is focused on answering the following five questions:
(a) Are the results published by SDHR07 reproducible with both the current and previous versions of EVI data?
(b) What fraction of the intact forest area impacted by the drought exhibited anomalous greening in year 2005?
(c) Is there evidence of higher than normal amounts of sunlight during the 2005 drought, which may have somehow caused the forests to green-up, as speculated by SDHR07?
(d) If drought caused the forests to green-up, is there a relationship between the severity of drought and the spatial extent or magnitude of greening?
(e) Are greenness changes during the 2005 drought unique compared to changes in non-drought years?
2. Data and Methods
[4] Detailed information on data and methods is provided in the auxiliary material.
Aerosol effects not taken into account
Not enough details in paper

Characterization of the Landsat-7 ETM+ Automated Cloud-Cover Assessment (ACCA) Algorithm
Richard R. Irish, John L. Barker, Samuel N. Goward, and Terry Arvidson
[Screenshot of the paper, including Plate 1: Overview of L7 ETM+ automated cloud-cover assessment (ACCA) algorithm software flow. Recoverable text:]
Abstract
A scene-average automated cloud-cover assessment (ACCA) algorithm has been used for the Landsat-7 Enhanced Thematic Mapper Plus (ETM+) mission since its launch by NASA in 1999. ACCA assists in scheduling and confirming the acquisition of global “cloud-free” imagery for the U.S. archive. This paper documents the operational ACCA algorithm and validates its performance to a standard error of ±5 percent. Visual assessment of clouds in three-band browse imagery were used for comparison to the five-band ACCA scores from a stratified sample of 212 ETM+ 2001 scenes. This comparison of independent cloud-cover estimators produced a 1:1 correlation with no offset. The largest commission errors were at high altitudes or at low solar illumination where snow was misclassified as clouds. The largest omission errors were associated with undetected optically thin cirrus clouds over water. There were no statistically significant systematic errors in ACCA scores analyzed by latitude, seasonality, or solar elevation angle. Enhancements for additional spectral bands, per-pixel masks, land/water boundaries, topography, shadows, multi-date and multi-sensor imagery were identified for possible use in future ACCA algorithms.
Full flowcharts, parameter values, examples given with data details

Aerosol optical thickness determination by exploiting the synergy of TERRA and AQUA MODIS
Jiakui Tang, Yong Xue, Tong Yu, Yanning Guan
Remote Sensing of Environment 94 (2005) 327–334
[Screenshots of several pages of the paper. Recoverable text:]
Abstract
Aerosol retrieval over land remains a difficult task because the solar light reflected by the Earth–atmospheric system mainly comes from the ground surface. The dark dense vegetation (DDV) algorithm for MODIS data has shown excellent competence at retrieving the aerosol distribution and properties. However, this algorithm is restricted to lower surface reflectance, such as water bodies and dense vegetation. In this paper, we attempt to derive aerosol optical thickness (AOT) by exploiting the synergy of TERRA and AQUA MODIS data (SYNTAM), which can be used for various ground surfaces, including for high-reflective surface. Preliminary validation results by comparing with Aerosol Robotic Network (AERONET) data show good accuracy and promising potential.
Flowerdew and Haigh (1995) proposed that the surface reflectance be approximated by a part that describes the variation with the wavelength and a part that describes the variation with the geometry. Under this assumption, the ratio of two views’ surface reflectance can be written as follows:
$K_{\lambda_i} = A_{1,\lambda_i} / A_{2,\lambda_i}$ (7)
where $A_{1,\lambda_i}$ is the surface reflectance for the first view and $A_{2,\lambda_i}$ for the second view. The ratio K is assumed to depend only on the variation of the surface reflectance with the geometry and to be independent of the wavelength (Flowerdew & Haigh, 1995; Veefkind et al., 1998, 2000). Because aerosol extinction decreases rapidly with wavelength, the AOT at 2.13 μm will be very small as compared to the AOT in the visible. This assumption will not be valid when the aerosol is dominated by the coarse mode, such as desert dust. Ignoring the atmospheric contribution at 2.13 μm, K at 2.13 μm can be approximated as the ratio between the top of the atmosphere reflectances for the two overpasses at this wavelength.
Actually, it is very difficult to directly get the analytical solution of nonlinear Eq. (6). However, an approximate numerical solution can be obtained by means of many numerical methods. In this paper, Newton iteration algorithm is used for our solution.
[Eqs. (2) and (6), a group of nonlinear equations for the TERRA and AQUA observations, are not recoverable from the extracted text.]
Fig. 2. Aqua/MODIS reflectance RGB (R for Band 1; G for Band 4; B for Band 3) composed image (400×400), Gaussian enhancement is made.
Fig. 3. The flowchart of aerosol retrieval by SYNTAM.
Not enough information to reproduce!

Standard in the physical sciences
Lab notebook of Graham Bell, 1876
Everything is documented: Inputs, Outputs, Procedures, Sources of chemicals, Locations, Times, Sample sizes, Temperatures…

What about remote sensing?
“An unsupervised classification was performed…”
What algorithm? How many classes? How many iterations? What termination parameters?
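One way to answer those questions before anyone has to ask is to save the parameters next to the output. A minimal sketch in Python (standard library only); the field names and the `.params.json` sidecar naming are illustrative assumptions, not part of any particular remote sensing package:

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class ClassificationParams:
    """Everything needed to re-run the classification (hypothetical fields)."""
    algorithm: str            # e.g. "k-means" or "ISODATA"
    n_classes: int            # how many classes were requested
    max_iterations: int       # upper limit on the number of iterations
    change_threshold: float   # termination: stop below this fraction of changed pixels
    input_file: str           # the image that was classified


def write_params(params: ClassificationParams, output_file: str) -> Path:
    """Write the parameters to a JSON sidecar next to the classified output."""
    sidecar = Path(output_file + ".params.json")
    sidecar.write_text(json.dumps(asdict(params), indent=2, sort_keys=True))
    return sidecar


if __name__ == "__main__":
    # Demonstration in a temporary folder so nothing is left behind.
    with tempfile.TemporaryDirectory() as folder:
        params = ClassificationParams("k-means", 5, 20, 0.05, "scene.tif")
        print(write_params(params, str(Path(folder) / "Output3.tif")).read_text())
```

The sidecar then travels with the output file, so the sentence in the paper can be backed by the exact values used.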

How to make research reproducible?
1.  Do it in code
– If everything from data import through processing to creating a graph/table is done in code then it can be ‘one-click’ reproducible
2.  Document it
– Very thoroughly! Every single parameter, every operation. Every piece of data used as input.
– (Electronically, on paper – whatever)
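As a rough illustration of ‘one-click’, here is a sketch of a single Python script that goes from data import through processing to a results table. The input layout (a CSV file with a `reflectance` column) and all function names are assumptions made up for this example:

```python
import csv
import statistics
import sys
from pathlib import Path


def load_data(input_csv: Path) -> list[dict]:
    """Data import: read every row of the raw input file."""
    with input_csv.open(newline="") as f:
        return list(csv.DictReader(f))


def process(rows: list[dict]) -> dict:
    """Processing: summarise the (assumed) 'reflectance' column."""
    values = [float(row["reflectance"]) for row in rows]
    return {
        "n": len(values),
        "mean": round(statistics.mean(values), 4),
        "stdev": round(statistics.stdev(values), 4),  # needs at least two rows
    }


def write_table(summary: dict, output_csv: Path) -> None:
    """Output: write the summary as a one-row table."""
    with output_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)


def run_all(input_csv: str, output_csv: str) -> dict:
    """The 'one click': raw data in, finished table out."""
    summary = process(load_data(Path(input_csv)))
    write_table(summary, Path(output_csv))
    return summary


if __name__ == "__main__" and len(sys.argv) == 3:
    print(run_all(sys.argv[1], sys.argv[2]))
```

Running it as `python make_table.py raw.csv table.csv` (file names hypothetical) regenerates the table from scratch, with no manual steps to forget.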

Then share it with people…
•  Supplementary Information with a journal paper -> Soon to be a requirement?
•  In an Appendix to a paper/thesis
•  On your personal webpage
Doesn’t really matter where it is as long as:
•  People can get hold of it
•  People know where to look

Example: GPS Precipitable Water
•  Validation of a new & novel data source against AERONET & Radiosonde data
•  Method must be robust, accurate, repeatable etc.
British Isles GNSS Facility

R Code:
library(ProjectTemplate)
load.project()
All graphs, all tables: all automatically produced
‘One Click’ Reproducibility (+ comments/docs)

Example: ArcGIS provenance tool
“I’ve forgotten what I did to create Output3.tif”
“I can’t remember the parameters I used for the unsupervised classification”
Data Provenance
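The idea behind provenance recording can be sketched in a few lines of Python. This is not the ArcGIS tool mentioned on the slide; it is a generic illustration, and `unsupervised_classification` here is a hypothetical stand-in that does no real processing:

```python
import functools
import json
from datetime import datetime, timezone

PROVENANCE_LOG = []  # in a real project this would be appended to a file


def record_provenance(func):
    """Record which tool was run, when, and with which parameters."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "tool": func.__name__,
            "when": datetime.now(timezone.utc).isoformat(),
            "args": [repr(a) for a in args],
            "kwargs": {k: repr(v) for k, v in kwargs.items()},
        })
        return result
    return wrapper


@record_provenance
def unsupervised_classification(input_file, output_file, n_classes=5, max_iterations=20):
    """Hypothetical stand-in for a real GIS tool; it only returns the output name."""
    return output_file


if __name__ == "__main__":
    unsupervised_classification("scene.tif", "Output3.tif", n_classes=8)
    print(json.dumps(PROVENANCE_LOG, indent=2))
```

With a log like this, "what did I do to create Output3.tif?" becomes a lookup rather than a memory test. Note that only arguments passed explicitly are recorded, so defaults that were not overridden would need to be captured separately.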

What happened, when, how
1434: painting dated by van Eyck;
1516: in possession of Don Diego de Guevara, a Spanish career courtier of the Habsburgs;
1516: portrait given to Margaret of Austria, Habsburg Regent of the Netherlands;
1530: inherited by Margaret’s niece Mary of Hungary;
1558: inherited by Philip II of Spain;
1599: on display in the Alcazar Palace in Madrid;
1794: now in the Palacio Nuevo in Madrid;
1816: in London, probably plundered by a certain Colonel James Hay after the Battle of Vitoria (1813), from a coach loaded with easily portable artworks by King Joseph Bonaparte;
1841: the painting was included in a public exhibition;
1842: bought by the National Gallery, London for £600, where it remains.

Field spectra collected in 1989 used in my PhD

Sustainability Data Code Methods

Sustainable Data: Formats, Metadata

Metadata – What is this crazy data?
Source, Units, Location, Date/Time, Person, Method, General Notes & Explanation

How to store metadata
•  Inside the file
– Almost all formats can store georeferencing
– ENVI header files can store Sensor, Wavelengths, FWHMs, Units and more…
– ArcGIS geodatabases can store metadata
•  In a metadata database
– Name of file -> List of metadata
README files: Simple + Effective
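A README can be as simple as one line per metadata item. The sketch below writes such a file in Python; the field list follows the metadata items named in these slides, while the `README.txt` file name and the `Field: value` layout are assumptions chosen for illustration:

```python
from pathlib import Path

# Metadata items from the slides; the file layout itself is an assumption.
FIELDS = ["Source", "Units", "Location", "Date/Time", "Person", "Method",
          "General Notes & Explanation"]


def write_readme(folder: str, metadata: dict) -> Path:
    """Write README.txt in `folder`, refusing to skip any metadata field."""
    missing = [field for field in FIELDS if field not in metadata]
    if missing:
        raise ValueError("Missing metadata fields: " + ", ".join(missing))
    lines = [f"{field}: {metadata[field]}" for field in FIELDS]
    readme = Path(folder) / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Raising an error on missing fields is a deliberate choice here: it is easier to fill in "unknown" on the day than to reconstruct the answer years later.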

How to choose a format?
•  ASCII: simple text, no special chars
•  Binary + Header: ENVI files
•  Well-known format: TIFF, SHP
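To show why "Binary + Header" ages well, here is a sketch that writes a small grid as raw little-endian 32-bit floats with a plain-text header modelled on the ENVI header style, then reads it back using only the header. It covers a single band and a handful of header keys, and the `.hdr` naming is an assumption; treat it as an illustration of the idea rather than a complete ENVI writer:

```python
import struct
from pathlib import Path


def write_envi_style(path: str, rows: list[list[float]]) -> None:
    """Write a single-band float32 grid as raw binary plus a text header."""
    n_lines, n_samples = len(rows), len(rows[0])
    flat = [value for row in rows for value in row]
    Path(path).write_bytes(struct.pack(f"<{len(flat)}f", *flat))
    header = "\n".join([
        "ENVI",
        f"samples = {n_samples}",
        f"lines = {n_lines}",
        "bands = 1",
        "header offset = 0",
        "data type = 4",     # 4 = 32-bit float
        "interleave = bsq",
        "byte order = 0",    # 0 = little-endian
    ])
    Path(path + ".hdr").write_text(header + "\n")


def read_envi_style(path: str) -> list[list[float]]:
    """Read the grid back using nothing but the text header."""
    fields = {}
    for line in Path(path + ".hdr").read_text().splitlines()[1:]:
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    n_samples, n_lines = int(fields["samples"]), int(fields["lines"])
    flat = struct.unpack(f"<{n_samples * n_lines}f", Path(path).read_bytes())
    return [list(flat[i * n_samples:(i + 1) * n_samples]) for i in range(n_lines)]
```

The reader needs no special library: anyone who can open the header in a text editor can work out how to decode the binary file.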

Beware of ‘Well-known formats’
The most popular word processor in the 1980s…
…Can you read its files now?
OPEN formats are better

How to code sustainably?
Good Design, Commenting, Version Control, Automated Testing…
Best Practices for Scientific Computing: http://arxiv.org/pdf/1210.0530.pdf
Do you spend too much time wrestling with computers, and not enough doing research? We can help

So what?
This stuff is important for you and for others
Think about it! (tell others about it)
Read up about it (www.rtwilson.com/academic/rr)

Easy Idea: Spend an afternoon creating some README files in your work folders:
•  What is this?
•  Where did it come from?
•  What did I do with it?
•  What do I need to remember in a year about it?

Easy Idea: Hide your results/outputs and try to reproduce them again – check they’re exactly the same
•  What did you need to know that wasn’t written down?
•  Write that down somewhere before you forget!
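Checking that two sets of outputs are "exactly the same" can itself be scripted. A minimal sketch using SHA-256 checksums from the Python standard library; the folder arguments and status strings are hypothetical choices for this example:

```python
import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large rasters are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(folder: Path) -> set:
    """Relative names of all files below `folder`."""
    return {p.relative_to(folder).as_posix() for p in folder.rglob("*") if p.is_file()}


def compare_outputs(original_dir: str, reproduced_dir: str) -> dict:
    """Return {relative file name: status} for every file in either folder."""
    original, reproduced = Path(original_dir), Path(reproduced_dir)
    report = {}
    for name in sorted(list_files(original) | list_files(reproduced)):
        a, b = original / name, reproduced / name
        if not a.is_file():
            report[name] = "only in reproduced"
        elif not b.is_file():
            report[name] = "only in original"
        elif checksum(a) == checksum(b):
            report[name] = "identical"
        else:
            report[name] = "DIFFERENT"
    return report
```

Byte-for-byte comparison is strict: a file that embeds a creation timestamp will be reported as different even when the data match, which is itself worth writing down.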

Harder Idea: Script/Automate some of your work – then it’s easier to repeat, and self-documenting
•  Use the ArcGIS Model Builder (can use ENVI commands too!)
•  Learn some basic coding (e.g. Python)
•  If that isn’t possible then document it thoroughly

Harder Idea: Look at the Software Carpentry lessons – can you apply those to your code?
•  Does it have comments?
•  Do you know what the dependencies are?
•  Does it have tests?
•  Is it under version control?
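As a small illustration of "comments" and "tests" together, here is a sketch of a commented function with an automated test. The `ndvi` function is a hypothetical example chosen because the formula is short; the same pattern applies to any processing step:

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    if nir + red == 0:
        raise ValueError("NIR + Red is zero; NDVI is undefined")
    return (nir - red) / (nir + red)


def test_ndvi():
    """Known inputs with known answers, so a later edit cannot silently break it."""
    assert ndvi(0.5, 0.5) == 0.0                 # equal bands give zero
    assert abs(ndvi(0.6, 0.2) - 0.5) < 1e-12     # typical vegetated pixel
    assert ndvi(0.3, 0.0) == 1.0                 # no red reflectance gives the maximum
    try:
        ndvi(0.0, 0.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero denominator")


if __name__ == "__main__":
    test_ndvi()
    print("all tests passed")
```

Put the file under version control and re-run the test after every change; a test runner such as pytest would pick up `test_ndvi` automatically, though plain `python` works too.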

Practical Ideas:
•  Spend an afternoon creating some README files in your work folders
•  Hide your results/outputs and try to reproduce them again – check they’re exactly the same
•  Script/Automate some of your work – then it’s easier to repeat, and self-documenting
•  Look at the Software Carpentry lessons – can you apply those to your code?
[email protected] www.rtwilson.com/academic/rr
