Maximising the value of research data through software publishing

Ian Mulvany, Head of Product Innovation, SAGE Publishing
Maximising the value of data through software publishing

Hi, I’m Ian and I work at SAGE, where I help build new products to support computational social science. Today, though, I want to provoke some thinking about how we might work towards better software publishing, with some side benefits. I’m going to talk about container technology as a way to get there!

(This is the obligatory pitch slide: we just launched courses to teach social science researchers how to code! Check them out at campus.sagepub.com.)

But back to the talk. I do want to talk about how we maximise the value of data, but let’s think for a moment about what we mean by data. It’s presented in its final form in the research paper, but I think it’s useful to think about where it comes from.

It’s inextricably tied to the means of acquisition, and as our tools have become more powerful, the scale of our data has grown.

[Slide: Acquire Data]
But it’s never as simple as going from acquiring the data to having the data ready to publish!

[Slide: Acquire Data, Clean, Analyse]
A huge amount of work often goes into cleaning and analysing the data.

[Slide: Acquire Data, Clean, Analyse; Generate Data, Clean, Analyse]
Increasingly, data is actually generated “in silico”.

[Slide: Acquire Data, Clean, Analyse; Generate Data, Clean, Analyse; SOFTWARE]
All of these green steps require software, often custom software created by the researcher.

So my claim is that if we want to get serious about data, we have to get serious about software. I want us to treat software as a first-class citizen. And I strongly believe that code will become the next object of interest for funders, after OA and data.

Just a file, right?
Today, though, we often just treat software as a file that we expect people to download.

But the problem with this is that software often does
not just work out of the box.

[Slide: App]
Let’s think about the app that the researcher has written.

[Slide: App, Dependencies]
It will often depend on other scripts that the researcher has written.

[Slide: App, Dependencies, Specific language version]
It might require a specific version of a language to be installed on the user’s system, Python 3.4 vs Python 2.7, for example.
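One low-tech guard against this is for a script to check the interpreter version when it starts. A minimal sketch, assuming a hypothetical minimum of Python 3.4 (the names REQUIRED and version_ok are illustrative, not from the talk):

```python
import sys

# Hypothetical minimum version, chosen only for this illustration.
REQUIRED = (3, 4)


def version_ok(current, required=REQUIRED):
    """Return True if the (major, minor) version meets the requirement."""
    return tuple(current[:2]) >= tuple(required)


if __name__ == "__main__":
    if not version_ok(sys.version_info):
        sys.exit("This script needs Python %d.%d or newer" % REQUIRED)
    print("Interpreter version is fine")
```

A check like this only reports the mismatch; it does nothing to install the right interpreter for the reader.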

[Slide: App, Dependencies, Specific language version, Underlying Library 1, Underlying Library 2]
There might be underlying system libraries that are needed.

[Slide: App, Dependencies, Specific language version, Underlying Library 1, Underlying Library 2, Database]
It might need to talk to a database.

[Slide: Specific Operating System, App, Dependencies, Specific language version, Underlying Library 1, Underlying Library 2, Database]
It might need a specific version of the actual operating system.

[Slide: Specific Operating System, App, Dependencies, Specific language version, Underlying Library 1, Underlying Library 2, Database]
So we can think of the app as living on top of a large stack of other software.
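To make that stack a little more concrete, here is a rough sketch of how a script could record part of the environment it is running in, using only the Python standard library. The function name describe_environment is hypothetical, and the sketch assumes Python 3.8 or newer. It only sees the operating system, the interpreter and the installed Python packages; system libraries and databases are not captured.

```python
import platform
from importlib import metadata


def describe_environment():
    """Collect a rough description of the software stack under a script."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs with no recorded name
            packages[name] = dist.metadata.get("Version")
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "packages": packages,
    }


if __name__ == "__main__":
    env = describe_environment()
    print(env["operating_system"], env["os_release"])
    print("Python", env["python_version"])
    print(len(env["packages"]), "installed Python packages")
```

Even a partial record like this shows how much sits underneath a single script.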

And software dependencies can get really complicated.

This is a problem that has a long history of
potential solutions from the systems ops world and I’m going to talk a bit about two of these approaches.

http://www.nature.com/encode/#/threads
Back in 2012, researchers used one of these tools, virtualisation, to make all of their code and data available.

Let’s talk for a moment about how virtualisation works.

It provides a piece of software that emulates a computer,
running inside your computer.

You create a digital snapshot of the entire operating system that your app is running in, including all of the dependencies.

And you load that image into the virtual machine running inside your machine.

But this has some downsides for research use cases:
• The data and code remain separate from the paper
• Hosting is $$$
• Complex instructions
• Large artefacts (18 GB)

So I want to talk about another technology that has been getting a lot of traction over the last few years.

That technology is containerisation, and I’m going to talk about the tool Docker, which is not the only route to containerisation, but which has been gaining a lot of traction over the last few years.

[Slide: Virtual Machine Approach: Physical Hardware (Host), Operating System, VM system (Hypervisor), then a Guest OS under each App. Container / Docker Approach: Physical Hardware (Host), Operating System, Docker Engine, Shared Libraries, Shared Dependencies, then App 1, App 2, App 3, with the room left over labelled “LOTS O’SPACE!!”]
Containers are a lot more lightweight than virtual machines.

                 Virtual Machines    Containers
Hosts per box    Tens                Thousands
Startup time     Minutes             Fractions of a second

Containers are more performant than virtual machines.

However, containers run from a fixed image of the application, and changes they make while running do not persist once the container is replaced (unless the data is stored outside the container), so orchestrating how containers work together can be a bit complex.

There is another aspect of containers that I’d like to
tell you about. We used to care for our software like we care for our pets. We would feed them with updates, if they got sick we would ssh in and try to make them better.

In the containerised world we treat instances of the software
more like livestock than like pets. As it’s cheap to just spin up a new container, if one container gets sick, we just kill it and replace it.
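As a toy illustration of the livestock approach, the sketch below replaces any unhealthy container with a fresh one started from the same image instead of trying to repair it. The Container class and replace_unhealthy function are hypothetical stand-ins written for this example; they are not part of Docker or any orchestration tool.

```python
import itertools

_ids = itertools.count(1)


class Container:
    """Toy stand-in for a running container (not a Docker API object)."""

    def __init__(self, image):
        self.image = image
        self.id = next(_ids)
        self.healthy = True


def replace_unhealthy(containers):
    """Keep healthy containers; swap each sick one for a fresh copy."""
    return [c if c.healthy else Container(c.image) for c in containers]


if __name__ == "__main__":
    pool = [Container("app:1.0") for _ in range(3)]
    pool[1].healthy = False  # one container "gets sick"
    pool = replace_unhealthy(pool)
    print([(c.id, c.healthy) for c in pool])
```

Nothing here logs in to fix the sick container; it is simply discarded and a new one takes its place.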

We have used this technology to provide a neat feature for our new online learning platform.

We use a containerised version of JupyterHub to provide a hub environment to each of our learners for one of our courses.

These other companies are making heavy use of containers to
create new classes of data/software driven products!

[Slide: Video, Metadata, Text, Code]
https://www.oreilly.com/learning/regex-golf-with-peter-norvig is amazing!

[Slide: ~3300/s]
Google spins up about 3.3k containers per second (2B per week) to power search. This is important because, as a by-product of going all in on containers, Google has created some great open source tools to help with the orchestration issue: https://kubernetes.io
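The two figures on the slide are consistent with each other, as a quick calculation shows (the variable names are just for this illustration):

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

containers_per_second = 3300  # figure quoted on the slide
containers_per_week = containers_per_second * SECONDS_PER_WEEK

print(containers_per_week)  # 1995840000, i.e. roughly 2 billion
```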

Summary
Containers:
• can solve the problem of software dependencies
• allow “live coding” to be delivered over the web
• can wrap code, data and environment together
• have a growing ecosystem of tools to support them
• require a level of infrastructural sophistication

Some resources to look at
• Docker https://www.docker.com
• JupyterHub https://zero-to-jupyterhub.readthedocs.io/en/latest/
• Data packages http://frictionlessdata.io/data-packages/ (just metadata)
• Kubernetes https://kubernetes.io
• Executable Research Compendium http://o2r.info/erc-spec/spec/
• Smart Figures http://smartfigures.net
• Singularity http://singularity.lbl.gov/faq
• Kubernetes guide http://partiallyattended.com/2017/09/14/kubernetes_-_as_i_learn/
