Maximising the value of research data through software publishing

Ian Mulvany
September 14, 2017

Beyond Data: How containers are making the sharing and publishing of complete computational environments a possibility
Funders have already begun to request data sharing from their researchers, and the publishing industry is responding with initiatives such as enabling data citation and linking publications to primary data. In this short talk we will ask what might come next. Data itself rarely emerges into the world complete, but often needs a series of computational steps for collection, cleaning and analysis. Up to now, sharing of software has usually been a fairly static affair, but software can be notoriously difficult to get running, with dependencies on operating systems, libraries and host systems. Container technology is making it possible to encapsulate software in a lightweight and robust way that allows sharing of computational environments. We will give an overview of this technology, look at how it is currently being used by researchers, and discuss possible future implications for the publishing industry, both in terms of how we think about core infrastructure and in terms of how we might provide better services to authors and readers.

Transcript

  1. Ian Mulvany
    Head of product innovation, SAGE Publishing
    Maximising the value of data
    Through software publishing
    Hi, I’m Ian and I work at SAGE to help build new products
    to support computational social science, but today I want
    to provoke some thinking around how we might work
    towards better software publishing, with some side
    benefits. I’m going to be talking about using container
    technology as a way to get there!

  2. (This is the obligatory pitch slide: we just launched
    courses to teach social science researchers how
    to code! Check them out at campus.sagepub.com)

  3. But back to the talk. I do want to talk about how we
    maximise the value of data, but let’s think for a moment
    about what we mean by data. It’s presented in its final
    form in the research paper, but I think it’s useful to think
    about where it comes from.

  4. It’s inextricably tied to the
    means of acquisition, and as
    our tools have become more
    powerful the scale of our data
    has grown.

  5. Acquire Data
    But it’s never as simple
    as going from acquiring the
    data to having the data ready to
    publish!

  6. Acquire Data -> Clean -> Analyse
    A huge amount of work
    often goes into cleaning and
    analysing the data.

  7. Acquire Data -> Clean -> Analyse
    Generate Data -> Clean -> Analyse
    Increasingly data is
    actually generated “in silico”.
  8. Acquire Data -> Clean -> Analyse
    Generate Data -> Clean -> Analyse
    SOFTWARE
    All of these green steps
    require software, and often
    custom software created by
    the researcher

  9. < >
    So my claim is that if
    we want to get serious about
    data, we have to get serious
    about software.
    I want us to treat software as a
    First Class Citizen.
    And I strongly believe that
    code will become the next
    object of interest from funders,
    after OA and data.

  10. Just a file, right?
    Today, though, we often just
    treat software as a file that we
    expect people to download.

  11. But the problem with this is
    that software often does not just
    work out of the box.

  12. App
    Let’s think about the app
    that the researcher has written.

  13. App
    Dependency Dependency Dependency
    It will often have
    dependencies on other scripts that
    the researcher has written.

  14. App
    Dependency Dependency Dependency
    Specific language version
    It might require a specific
    version of a language to be installed
    on the user’s system, Python 3.4 vs
    Python 2.7, for example

  15. App
    Dependency Dependency Dependency
    Specific language version
    Underlying Library 1
    Underlying Library 2
    There might be underlying
    system libraries that are needed

  16. App
    Dependency Dependency Dependency
    Specific language version
    Underlying Library 1
    Underlying Library 2
    Database
    It might need to
    talk to a database

  17. Specific Operating System
    App
    Dependency Dependency Dependency
    Specific language version
    Underlying Library 1
    Underlying Library 2
    Database
    It might need a specific version
    of the actual operating system

  18. Specific Operating System
    App
    Dependency Dependency Dependency
    Specific language version
    Underlying Library 1
    Underlying Library 2
    Database
    App
    So we can think of the app as
    living on top of a large stack of other
    software
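
The layers of this stack can be made visible from inside a running script. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the function name `describe_environment` is hypothetical and not from the talk. It records the operating system, the language version and the installed packages, which is roughly the information someone else would need to rebuild the same stack.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Return a dictionary describing the layers this script is running on."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries whose metadata has no name
    )
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "language": "Python",
        "language_version": platform.python_version(),
        "installed_packages": packages,
    }


if __name__ == "__main__":
    # Print the description as JSON so it can be saved next to the code
    print(json.dumps(describe_environment(), indent=2))
```

A listing like this only describes the stack. It does not capture system libraries or a database, which is part of why the talk goes on to look at whole-environment approaches.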

  19. And software dependencies
    can get really complicated.

  20. This is a problem that has a long
    history of potential solutions from the systems
    ops world and I’m going to talk a bit about two
    of these approaches.

  21. http://www.nature.com/encode/#/threads
    Back in 2012,
    researchers used one of
    these tools, virtualisation,
    to make all of their code
    and data available.

  22. Let’s talk for a
    moment about how
    virtualisation works.

  23. It provides a
    piece of software that
    emulates a computer,
    running inside your
    computer.

  24. You create a digital
    snapshot of the entire operating
    system that your app is running
    in, including all of the
    dependencies

  25. And you load that
    image into the virtual machine
    running inside your machine

  26. But this has some
    downsides for research use
    cases:
    • Large artefacts (18 GB)
    • Hosting is $$$
    • Complex instructions
    • The data and code remain
      separate from the paper

  27. So I want to talk
    about another technology that
    has been getting a lot of traction over
    the last few years

  28. That technology is
    containerisation, and I’m going to talk about
    the tool Docker, which is not the only route to
    containerisation, but which has been gaining a lot of
    traction over the last few years
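
With Docker, the environment for a piece of software is described in a text file called a Dockerfile. The following is a minimal sketch, not anything shown in the talk: a hypothetical helper `render_dockerfile` that writes out such a file for a single analysis script. The base image, package versions and script name are illustrative assumptions. `FROM`, `WORKDIR`, `RUN`, `COPY` and `CMD` are standard Dockerfile instructions.

```python
def render_dockerfile(base_image, requirements, script):
    """Build the text of a Dockerfile that packages one analysis script."""
    lines = [f"FROM {base_image}", "WORKDIR /app"]
    if requirements:
        # Pinned versions let the image be rebuilt with the same libraries
        lines.append("RUN pip install --no-cache-dir " + " ".join(requirements))
    lines.append(f"COPY {script} /app/{script}")
    lines.append(f'CMD ["python", "/app/{script}"]')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only; swap in whatever the analysis really needs
    print(render_dockerfile(
        base_image="python:3.6-slim",
        requirements=["numpy==1.13.1", "pandas==0.20.3"],
        script="analysis.py",
    ))
```

The point of the sketch is that the operating system layer, the language version and the libraries all end up written down in one small file that can be shared alongside the code.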

  29. Virtual Machine Approach (from the hardware up):
    Physical Hardware (Host)
    Operating System
    VM system (Hypervisor)
    Guest OS + App, Guest OS + App, Guest OS + App
    Container / Docker Approach (from the hardware up):
    Physical Hardware (Host)
    Operating System
    Docker Engine
    Shared Libraries, Shared Dependencies
    App 1, App 2, App 3
    LOTS O’SPACE!!
    Containers are a lot more
    lightweight than virtual
    machines

  30. Containers are more performant
    than virtual machines:
                     Virtual Machines   Containers
    Hosts per box    Tens               Thousands
    Startup time     Minutes            Fractions of a second

  31. However, containers run a fixed
    image of the application, and by default
    anything they write while running is lost
    when the container is replaced, so
    orchestrating how containers work
    together can be a bit complex

  32. There is another aspect of containers that
    I’d like to tell you about.
    We used to care for our software like we
    care for our pets.
    We would feed them with updates, if they
    got sick we would ssh in and try to make
    them better.

  33. In the containerised world we treat
    instances of the software more like
    livestock than like pets.
    As it’s cheap to just spin up a new container,
    if one container gets sick, we just kill it and
    replace it.
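
The "kill it and replace it" idea can be sketched in a few lines. This is a toy model under stated assumptions, not how any real orchestrator is implemented: containers are plain dictionaries, and `start_container` and `reconcile` are hypothetical names invented for the illustration.

```python
import itertools

_ids = itertools.count(1)


def start_container(image):
    """Stand-in for starting a container: returns a small record, not a real one."""
    return {"id": next(_ids), "image": image, "healthy": True}


def reconcile(containers, image):
    """Replace every unhealthy container with a fresh one from the same image."""
    return [c if c["healthy"] else start_container(image) for c in containers]


if __name__ == "__main__":
    fleet = [start_container("analysis:1.0") for _ in range(3)]
    fleet[1]["healthy"] = False  # one container "gets sick"
    fleet = reconcile(fleet, "analysis:1.0")
    print([c["id"] for c in fleet])
```

Nothing in `reconcile` tries to repair the sick container. It is simply dropped and a new one, started from the same image, takes its place.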

  34. We have used this technology to
    provide a neat feature for our new online
    learning platform

  35. We use a containerised version
    of JupyterHub to provide a hub
    environment to each of our learners for one
    of our courses

  36. These other companies are
    making heavy use of containers to create
    new classes of data/software driven
    products!

  37. Video
    Metadata
    Text
    Code
    https://www.oreilly.com/learning/regex-golf-with-peter-norvig is amazing!

  38. ~3300/s
    Google spins up about 3.3k containers per second (2B per week) to power search.
    This is important because, as a by-product of going all in on containers, Google has created
    some great open source tools to help with the orchestration issue.
    https://kubernetes.io
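
As a quick check that the two figures on this slide agree with each other, the per-second rate can be scaled up to a week. The only input is the 3,300 per second figure from the slide; the variable names are just for the illustration.

```python
CONTAINERS_PER_SECOND = 3300          # figure quoted on the slide
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604,800 seconds in a week

containers_per_week = CONTAINERS_PER_SECOND * SECONDS_PER_WEEK
print(f"{containers_per_week:,} containers per week")
# prints "1,995,840,000 containers per week"
```

That is just under two billion, which matches the rounded "2B per week" on the slide.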

  39. Summary
    • Containers can solve the problem of software dependencies
    • They allow “live coding” to be delivered over the web
    • They can wrap code, data and environment together
    • There is a growing ecosystem of tools to support containers
    • They require a level of infrastructural sophistication

  40. Some resources to look at
    • Docker https://www.docker.com
    • JupyterHub https://zero-to-jupyterhub.readthedocs.io/en/latest/
    • Data packages http://frictionlessdata.io/data-packages/ (just metadata)
    • Kubernetes https://kubernetes.io
    • Executable Research Compendium http://o2r.info/erc-spec/spec/
    • Smart Figures http://smartfigures.net
    • Singularity http://singularity.lbl.gov/faq
    • Kubernetes guide http://partiallyattended.com/2017/09/14/kubernetes_-_as_i_learn/
