
Ian Mulvany
September 14, 2017

Maximising the value of research data through software publishing

Beyond Data: How containers are making the sharing and publishing of complete computational environments a possibility:
Funders have already begun to request data sharing from their researchers, and the publishing industry is responding with initiatives such as enabling data citation and linking publications to primary data. In this short talk we will ask: what might come next? Data itself rarely emerges into the world complete, but often needs a series of computational steps for collection, cleaning and analysis. Up to now, sharing of software has usually been a fairly static affair, but software can be notoriously difficult to get running, with dependencies on operating systems, libraries and host systems. Container technology is making it possible to encapsulate software in a lightweight and robust way that allows sharing of computational environments. We will give an overview of this technology, look at how it is currently being used by researchers, and discuss possible future implications for the publishing industry, both in terms of how we think about core infrastructure and how we might provide better services to authors and readers.

Transcript

  1. Ian Mulvany Head of product innovation SAGE Publishing Maximising the value of data Through software publishing Hi, I’m Ian and I work at SAGE to help build new products to support computational social science, but today I want to provoke some thinking around how we might work towards better software publishing, with some side benefits. I’m going to be talking about using container technology as a way to get there!
  2. (This is the obligatory pitch slide: we just launched courses to teach social science researchers how to code! Check it out at campus.sagepub.com)
  3. But back to the talk. I do want to talk about how we maximise the value of data, but let’s think for a moment about what we mean by data. It’s presented in its final form in the research paper, but I think it’s useful to think about where it comes from.
  4. It’s inexorably tied to the means of acquisition, and as our tools have become more powerful the scale of our data has grown.
  5. Acquire Data But it’s never as simple as going from acquiring the data to having the data ready to publish!
  6. Acquire Data Acquire Data Clean Analyse A huge amount of work often goes into cleaning and analysing the data.
  7. Acquire Data Acquire Data Clean Analyse Generate Data Clean Analyse Increasingly data is actually generated “in silico”
  8. Acquire Data Acquire Data Clean Analyse Generate Data Clean Analyse SOFTWARE All of these green steps require software, and often custom software created by the researcher
  9. < > So my claim is that if we want to get serious about data, we have to get serious about software. I want us to treat software as a First Class Citizen. And I strongly believe that code will become the next object of interest from funders, after OA and data.
  10. Just a file, right? Today, though, we often just treat software as a file that we expect people to download.
  11. App Dependency Dependency Dependency It will often have dependencies on other scripts that the researcher has written.
  12. App Dependency Dependency Dependency Specific language version It might require a specific version of a language to be installed on the user’s system, Python 3.4 vs Python 2.7, for example (see the sketch below).
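
    (Illustration, not from the talk: a minimal sketch of what that kind of language-version dependency looks like in practice. The required version and messages are hypothetical.)

        import sys

        # Hypothetical requirement: the analysis script was written for Python 3.4+.
        # Checking up front gives the user a clear error instead of a confusing
        # crash deep inside the code.
        if sys.version_info < (3, 4):
            sys.exit("This script needs Python 3.4 or newer; found %d.%d"
                     % sys.version_info[:2])

        print("Python version OK:", sys.version.split()[0])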
  13. App Dependency Dependency Dependency Specific language version Underlying Library 1 Underlying Library 2 There might be underlying system libraries that are needed
  14. App Dependency Dependency Dependency Specific language version Underlying Library 1 Underlying Library 2 Database It might need to talk to a database
  15. Specific Operating System App Dependency Dependency Dependency Specific language version Underlying Library 1 Underlying Library 2 Database It might need a specific version of the actual operating system
  16. Specific Operating System App Dependency Dependency Dependency Specific language version Underlying Library 1 Underlying Library 2 Database App So we can think of the app as living on top of a large stack of other software
  17. This is a problem that has a long history of potential solutions from the systems ops world, and I’m going to talk a bit about two of these approaches.
  18. http://www.nature.com/encode/#/threads Back in 2012 researchers used one of these tools - virtualisation - to make all of their code and data available.
  19. You create a Digital snapshot of the entire operating system that your app is running in, including all of the dependencies
  20. The data and code remain separate from the paper Hosting is $$$ Complex instructions 18 GB Large artefacts But this has some downsides for research use cases
  21. So I want to talk about another technology that is getting a lot of traction over the last few years
  22. That technology is containerisation, and I’m going to talk about the tool Docker, which is not the only route to containerisation, but which is gaining a lot of traction (see the sketch below).
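
    (Illustration, not from the talk: a rough sketch of what running code inside a container looks like from Python, assuming the Docker SDK for Python (the docker package) is installed and a Docker daemon is running locally. The image tag is just an example.)

        import docker

        # Connect to the local Docker daemon.
        client = docker.from_env()

        # Run a short-lived container from a published Python image and capture
        # its output. The interpreter, libraries and OS userland all come from
        # the image, not from the host machine.
        output = client.containers.run(
            "python:3.6",
            ["python", "-c", "import sys; print(sys.version)"],
            remove=True,  # clean up the container when it exits
        )
        print(output.decode())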
  23. Physical Hardware (Host) Virtual Machine Approach Operating System VM system (Hypervisor) App OS App OS App Guest OS Physical Hardware (Host) Container / Docker Approach Operating System Docker Engine Shared Libraries Shared Dependencies App 1 App 2 App 3 LOTS O’SPACE!! Containers -> Are a lot more lightweight than virtual machines <-
  24. Virtual Machines vs Containers: Hosts per box, Tens vs Thousands; Startup time, Minutes vs Fractions of a second. Containers are more performant than virtual machines
  25. However containers run from a fixed, immutable image of the application, and by default any data they write or state they change while running is ephemeral, so orchestrating how containers work together can be a bit complex
  26. There is another aspect of containers that I’d like to tell you about. We used to care for our software like we care for our pets. We would feed them with updates, and if they got sick we would ssh in and try to make them better.
  27. In the containerised world we treat instances of the software more like livestock than like pets. As it’s cheap to just spin up a new container, if one container gets sick we just kill it and replace it (see the sketch below).
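
    (Illustration, not from the talk: a minimal sketch of the “kill it and replace it” workflow using the Docker SDK for Python; the image name is illustrative and the health check is left as a comment.)

        import docker

        client = docker.from_env()

        # Start an instance of the application (image name is illustrative).
        old = client.containers.run("nginx:alpine", detach=True)

        # ...suppose monitoring reports this instance as unhealthy...
        # Rather than ssh-ing in to repair it, discard it entirely.
        old.stop()
        old.remove()

        # And replace it with a fresh, identical instance from the same image.
        replacement = client.containers.run("nginx:alpine", detach=True)
        print("replaced", old.short_id, "with", replacement.short_id)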
  28. We have used this technology to provide a neat feature for our new online learning platform
  29. We use a containerised version of Jupyterhub to provide a hub environment to each of our learners for one of our courses (see the sketch below)
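
    (Illustration, not SAGE’s actual configuration: a minimal sketch of how a JupyterHub deployment can spawn one container per learner using the dockerspawner package; the image name is illustrative.)

        # jupyterhub_config.py
        c = get_config()  # provided by JupyterHub when it loads this file

        # Spawn each user's notebook server in its own Docker container...
        c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

        # ...built from a single shared course image (name is illustrative).
        c.DockerSpawner.image = 'jupyter/scipy-notebook'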
  30. These other companies are making heavy use of containers to create new classes of data/software driven products!
  31. ~3300/s Google spins up about 3.3k containers per second (2B per week) to power search. This is important because, as a by-product of going all in on containers, Google has created some great open source tools to help with the orchestration issue (see the sketch below). https://kubernetes.io
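
    (Illustration, not from the talk: a small sketch of what driving an orchestrator programmatically looks like, using the official Kubernetes Python client and assuming an existing cluster and a local kubeconfig.)

        from kubernetes import client, config

        # Load credentials from the local kubeconfig (e.g. ~/.kube/config).
        config.load_kube_config()

        v1 = client.CoreV1Api()

        # List every pod (group of running containers) the cluster is managing.
        for pod in v1.list_pod_for_all_namespaces(watch=False).items:
            print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)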
  32. Summary • can solve the problem of software dependencies • allows “live coding” to be delivered over the web • can wrap code, data and environment together • growing ecosystem of tools to support containers • requires a level of infrastructural sophistication
  33. Some resources to look at • Docker https://www.docker.com • Jupyterhub https://zero-to-jupyterhub.readthedocs.io/en/latest/ • Data packages http://frictionlessdata.io/data-packages/ (just metadata) • Kubernetes https://kubernetes.io • Executable Research Compendium http://o2r.info/erc-spec/spec/ • Smart Figures http://smartfigures.net • Singularity http://singularity.lbl.gov/faq • Kubernetes guide http://partiallyattended.com/2017/09/14/kubernetes_-_as_i_learn/