Preserving the World’s Software: A Hands-On Introduction to Software Heritage

Jaime Arias Almeida

November 05, 2025

Transcript

  1. Preserving the World’s Software: A Hands-On Introduction to Software Heritage

    Jaime Arias, Research Engineer, CNRS, LIPN, Université Sorbonne Paris Nord
    November 4, 2025
    THE GREAT LIBRARY OF SOURCE CODE
    Jaime Arias [email protected] (CC-BY 4.0) A Hands-On Introduction to Software Heritage November 4, 2025 1 / 25
  2. Who am I?

    Hello! I am Jaime Arias: CNRS Research Engineer @ LIPN, Ambassador @ Software Heritage, Member @ Collège Codes Sources et Logiciels, Chargé de mission Logiciels @ CNRS Sciences Info.
    You can find me at: [email protected], https://www.jaime-arias.fr
  3. Software is a pillar of Open Science

    French Open Science Monitor 2025
  7. Source code is special (software is not data)

    Software evolves over time: projects may last decades; the development history is key to its understanding.
    Complexity: millions of lines of code; large web of dependencies; easy to break, difficult to maintain; research software: a thin top layer; sophisticated developer communities.
    [Figure: dependency graph of the Debian package python3-matplotlib. Legend: Matplotlib library; Python dependencies; real dependencies; fake OS dependencies induced by package granularity.]
    The human side: design, algorithm, code, test, documentation, community, funding, and so many more facets ...
  11. Fundamental needs for software in Open Science (selection)

    Archive: research software artifacts must be properly archived; make sure we can retrieve them (reproducibility).
    Reference: research software artifacts must be properly referenced; make sure we can identify them (reproducibility).
    Describe: research software artifacts must be properly described; make it easy to discover and reuse them (visibility).
    Cite/Credit: research software artifacts must be properly cited (not the same as referenced!) to give credit to authors (evaluation!).
  12. Outline

    1 Software Heritage in a nutshell
    2 Time to Try It Out!
  16. Software Heritage in a nutshell

    www.softwareheritage.org: THE GREAT LIBRARY OF SOURCE CODE
    Collect, preserve and share all software source code. Preserving our heritage, enabling better software and better science for all.
    Reference catalog: find and reference all software source code.
    Universal archive: preserve and share all software source code.
    Research infrastructure: enable analysis of all software source code.
  18. Software Heritage: a radically different approach to archiving

    [Figure: the archive pipeline, from software origins to the archive.]
    Software origins (git, hg, svn, dsc, tar, zip): forges, package repos, distros, ...
    Listing (full/incremental): GitHub lister, GitLab lister, Debian lister, PyPI lister, ...
    Scheduling.
    Loading & deduplication: Git loader, Mercurial loader, Debian source package loader, PyPI source package loader, ...
    Software Heritage Archive: Merkle DAG + blob storage (origins, snapshots, releases, revisions, directories, contents).
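
    The "Merkle DAG + blob storage" and "loading & deduplication" steps above can be illustrated with a toy content-addressed store. This is a minimal sketch and not the archive's actual implementation: the `ToyArchive` class, its method names, and the plain SHA-1 over an ad hoc serialization are all assumptions made for illustration.

```python
import hashlib


class ToyArchive:
    """Hypothetical in-memory content-addressed store (illustration only)."""

    def __init__(self):
        self.objects = {}  # hex digest -> serialized node

    def _put(self, kind, payload):
        # The node's identifier is derived from its own bytes, so identical
        # nodes always get the same key and are stored only once.
        data = kind.encode() + b"\0" + payload
        key = hashlib.sha1(data).hexdigest()
        self.objects.setdefault(key, data)
        return key

    def add_content(self, blob):
        return self._put("content", blob)

    def add_directory(self, entries):
        # entries maps a file name to the id of a content or directory node.
        # A directory's id depends on its children's ids: a Merkle DAG.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())


archive = ToyArchive()
readme = archive.add_content(b"Hello, archive!\n")
license_id = archive.add_content(b"MIT\n")

# Two "origins" (say, a repository and its fork) with identical files.
repo = archive.add_directory({"README": readme, "LICENSE": license_id})
fork = archive.add_directory({"README": readme, "LICENSE": license_id})

print(repo == fork)          # True: same tree, same identifier
print(len(archive.objects))  # 3: two contents + one directory
```

    Because every node is keyed by a hash of its content, loading the fork adds nothing new to the store; only genuinely new files or directories would create new objects.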
  19. Meet the SWHID identifier

    SWHIDs (SoftWare Hash IDentifiers) are persistent, intrinsic identifiers for software source code artifacts. SWHID was officially adopted as ISO/IEC 18670:2025 on April 23, 2025.
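
    To make "intrinsic" concrete: the identifier is computed from the artifact itself. Below is a minimal sketch for the simplest case, the content of a single file, assuming the Git-compatible blob hashing (SHA-1 over a `blob <length>` header, a NUL byte, then the raw bytes) used for `cnt` objects. The function name `content_swhid` is ours and not part of any official tool.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (object type `cnt`)."""
    # Same salted hash Git uses for blobs: header, NUL byte, raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file hashes to Git's well-known empty-blob id.
print(content_swhid(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

    Since the identifier depends only on the bytes, anyone can recompute it to check that a retrieved file is the one that was referenced.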
  22. Outline

    1 Software Heritage in a nutshell
    2 Time to Try It Out!
  23. Software Heritage

    1 Prepare your public repository: README, AUTHORS & LICENSE files
    2 Save your code: http://save.softwareheritage.org/
    3 Reference your work (full repository, specific version or code fragment)
  24. 1. Browse: Software Heritage Archive

    https://archive.softwareheritage.org/
  25. 2. Archive: Software Heritage Archive

    https://archive.softwareheritage.org/save/
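
    Archival can also be requested from a script. The sketch below assumes that the archive's public Web API exposes a "Save Code Now" endpoint of the form `/api/1/origin/save/<visit_type>/url/<origin_url>/` accepting a POST request; check the API documentation before relying on it. The helper names are hypothetical, and the network call is left commented out so the snippet runs offline.

```python
import json
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the (assumed) Save Code Now endpoint for an origin."""
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def request_archival(origin_url: str, visit_type: str = "git") -> dict:
    """POST a save request and return the decoded JSON response."""
    req = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


print(save_request_url("https://github.com/rdicosmo/parmap"))
# https://archive.softwareheritage.org/api/1/origin/save/git/url/https://github.com/rdicosmo/parmap/

# Uncomment to actually submit the request (requires network access):
# status = request_archival("https://github.com/rdicosmo/parmap")
# print(status)
```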
  26. 2. Archive: the UpdateSWH Browser Extension

    https://www.softwareheritage.org/browser-extensions
  27. 3. Reference: the SWHID for a full directory
  28. 3. Reference: the SWHID for a code fragment
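
    A SWHID for a code fragment is a core identifier followed by `;`-separated qualifiers. Here is a minimal sketch of splitting one into its parts, using the Parmap identifier from the biblatex example later in this deck. The `parse_swhid` helper and the `Swhid` tuple are hypothetical names, and the validation is deliberately light; dedicated tooling should be preferred for real use.

```python
from typing import NamedTuple


class Swhid(NamedTuple):
    scheme_version: str
    object_type: str   # cnt, dir, rev, rel or snp
    object_id: str     # 40 hex digits
    qualifiers: dict


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core fields and its qualifiers."""
    core, *quals = [part.strip() for part in text.split(";")]
    scheme, version, object_type, object_id = core.split(":")
    if scheme != "swh" or len(object_id) != 40:
        raise ValueError(f"not a SWHID: {text!r}")
    # Each qualifier is key=value; split once, since values may contain "=".
    qualifiers = dict(q.split("=", 1) for q in quals if q)
    return Swhid(version, object_type, object_id, qualifiers)


fragment = parse_swhid(
    "swh:1:cnt:43a6b232768017b03da934ba22d9cc3f2726a6c5;"
    "origin=https://github.com/rdicosmo/parmap;"
    "visit=swh:1:snp:2a6c348c53eb77d458f24c9cbcecaf92e3c45615;"
    "anchor=swh:1:rel:373e2604d96de4ab1d505190b654c5c4045db773;"
    "path=/src/parmap.ml;"
    "lines=192-228"
)
print(fragment.object_type)           # cnt
print(fragment.qualifiers["path"])    # /src/parmap.ml
print(fragment.qualifiers["lines"])   # 192-228
```

    The core part alone identifies the file content; the qualifiers add the context (where it was found, in which release, at which path and lines) that a reader needs to locate the fragment.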
  35. 3. Reference: the biblatex-software Package

    biblatex-software is integrated into CTAN and TeX Live, and it works out of the box in Overleaf. As of April 2022, it is integrated in the ACM article style.
    Four dedicated entry types that reflect different levels of granularity:
    @software: for general references to computer software
    @softwaremodule: for citing a specific module within a larger software project
    @softwareversion: for referencing a particular version of the software
    @codefragment: for pinpointing a specific code fragment, such as an algorithm or a key function within a program or library
    https://ctan.org/pkg/biblatex-software
  36. 3. Reference: the biblatex-software Package

    https://ctan.org/pkg/biblatex-software
  37. 3. Reference: the biblatex-software Package

    @software{parmap,
      title = {The Parmap library},
      author = {Di Cosmo, Roberto and Marco Danelutto},
      date = {2012},
      institution = {{Inria} and {University of Paris} and {University of Pisa}},
      license = {LGPL-2.0},
      url = {https://rdicosmo.github.io/parmap/},
      repository = {https://github.com/rdicosmo/parmap},
    }

    @softwareversion{parmap-1.1.1,
      crossref = {parmap},
      date = {2020},
      version = {1.1.1},
      swhid = {swh:1:rel:373e2604d96de4ab1d505190b654c5c4045db773;
               origin=https://github.com/rdicosmo/parmap;
               visit=swh:1:snp:2a6c348c53eb77d458f24c9cbcecaf92e3c45615},
    }

    @codefragment{simplemapper,
      subtitle = {Core mapping routine},
      swhid = {swh:1:cnt:43a6b232768017b03da934ba22d9cc3f2726a6c5;
               origin=https://github.com/rdicosmo/parmap;
               visit=swh:1:snp:2a6c348c53eb77d458f24c9cbcecaf92e3c45615;
               anchor=swh:1:rel:373e2604d96de4ab1d505190b654c5c4045db773;
               path=/src/parmap.ml;
               lines=192-228},
      crossref = {parmap-1.1.1}
    }

    https://bit.ly/3JKXJbT
  38. 3. Reference: Describing Software Projects with CodeMeta

    https://codemeta.github.io/codemeta-generator/
  39. 3. Reference: Describing Software Projects with CodeMeta
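
    For readers who prefer a script to the web generator, here is a minimal sketch of writing a `codemeta.json` file by hand. The values come from the Parmap example earlier in this deck; the selection of fields, the SPDX license URL, and the CodeMeta 2.0 context URL are assumptions to adapt to your project and to the CodeMeta version you target.

```python
import json

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "Parmap",
    "version": "1.1.1",
    "license": "https://spdx.org/licenses/LGPL-2.0",
    "codeRepository": "https://github.com/rdicosmo/parmap",
    "author": [
        {"@type": "Person", "givenName": "Roberto", "familyName": "Di Cosmo"},
        {"@type": "Person", "givenName": "Marco", "familyName": "Danelutto"},
    ],
}

# Write the description at the root of the repository.
with open("codemeta.json", "w", encoding="utf-8") as fh:
    json.dump(codemeta, fh, indent=2, ensure_ascii=False)
    fh.write("\n")
```

    Committing this file next to README, AUTHORS and LICENSE keeps the description versioned together with the code it describes.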
  40. 4. Bonus: HAL and Software Heritage

    https://doc.hal.science/deposer/deposer-le-code-source/
  41. 4. Bonus: HAL and Software Heritage
  42. A walkthrough

    1 Browse (e.g. Imitator [excerpt]; your work may already be there!)
    2 Trigger archival, use the updateswh browser extension, configure the webhooks
    3 Get and use SWHIDs (full specification available online)
    4 Cite software with the biblatex-software package from CTAN; an Overleaf ACMART template is available
    5 Make a curated deposit in SWH via HAL; see for example: LinBox, SLALOM, Givaro, NS2DDV, SumGra, Coq proof, ...
    6 Extract all the software products for Inria, for CNRS, for CNES, for LIRMM or for Rémi Gribonval using HalTools
  43. Credits

    This presentation reuses material from Roberto Di Cosmo’s presentations.