Building a content-focused, scientific document authoring workflow for Data Scientists and Engineers alike

Colin Dean
August 10, 2020

At my Forbes 50 employer, I observed a problem with how teams collaborated on white papers: a tedious workflow built around legacy tooling caused undue stress, headaches, rework, and, ultimately, a poor-looking document with inconsistent content and styles. Knowing that a good document requires both good content and good presentation, I proposed and led the creation of a simple workflow amenable to our team's software engineers and data scientists. We treat the white paper text as code, using Markdown, GitHub Enterprise, Pandoc, and LaTeX, with a review process that gets the tooling out of the way so that content authors can focus less on logistics and more on writing and reviewing.

A team of seven engineers and data scientists created a 50-page document containing text, diagrams, equations, graphics, and more in just two weeks. The result greatly pleased our directors and executives, who praised our team not only for the valuable content but also for the professional appearance of the document. When they learned about the peer review process we used to create it, they wanted more teams to use it.

This talk focuses on the problems of passing around files by email or shared drives, the problems of collaborative editing of online documentation, and the problems we're still addressing in our solution that we've now used to author several significant internal documents.

Transcript

  1. I AM COLIN DEAN. I wear a hat and a scarf at conferences.
  2. The views expressed herein are my own and do not necessarily represent the views of my employers or associated organizations, past, present, or future.
  3. TASK: Write a high-level overview about our product for
      - executive review
      - senior director briefing
      - director deep-dive
  4. Our product development was paused, so we needed to document everything. It may not be our team that continues development.
  5. AUDIENCE AND DEPTH EXPANSION: Detailed white paper for
      - executive review
      - senior director briefing
      - director deep-dive
      - data scientists and engineers
  6. PRIMARY VALUES
      - Reviewable content: prose, diagrams, equations
      - Content-focused with minimal markup
      - Minimize structural exceptions with standardized styling and typesetting
  7. SECONDARY VALUES
      - Accommodate some preferences for LaTeX over simpler formats (Markdown)
      - Easy to use: one command to generate output
      - Automation: artifact built from versioned “code”
  8. AVOID AT ALL COSTS
      - Binary files or XML
      - Passing around a file via email/Slack
      - Manual copy-paste to merge changes
      - Difficult exports from wiki format
      - Forcing everyone to (re)learn LaTeX
  9. WHAT THIS GETS US
      - Write in a simple text format
      - Distribute changes and settle conflicts
      - Review and suggest changes
      - Push button to receive PDF, archived forever
  10. pandoc, LESS BRIEFLY
      - 1.0 in 2008, 2.0 in 2017, 2.9.x in 2019
      - Open source, GPL-2.0-or-later
      - Written in Haskell with a Lua scripting engine
      - 33 input formats, dozens of output formats
  11. INSTALL

          brew install pandoc   # macOS with Homebrew
          apt install pandoc    # Debian/Ubuntu/Pop_OS
          scoop install pandoc  # Windows with Scoop
          crew install pandoc   # Chrome OS with chromebrew
  12. [No transcribed text]
  13. REAL INVOCATION

          pandoc \
            01_intro.md 02_problem.md 03_diagnosis.md \
            04_remedy.md 05_summary.md \
            --output documentation.pdf \
            --filter pandoc-crossref \
            --filter pandoc-citeproc \
            --lua-filter .filters/glossary/pandoc-gls.lua \
            --pdf-engine xelatex \
            --top-level-division=chapter \
            --number-sections \
            --toc --toc-depth=3 \
            -M lof -M lot \
            --bibliography=bibliography.bib \
            …
  14. [Slide source shown as Markdown; the slide is cut off after the last line]

          ---
          title: >
            A documentation workflow loved by both Data Scientists and Engineers
          author: '@colindean'
          date: August 11, 2020
          theme: white
          css: custom.css
          ---
          # Task
          Write a white paper about our product for
          * ti i
  15. THIS PRESENTATION IS WRITTEN IN MARKDOWN and converted to a Reveal.js presentation:

          PRESENTATION = document_workflow
          MARKDOWN = $(PRESENTATION).md
          HTML = $(PRESENTATION).html
          DEPS_DIR = deps
          all: $(HTML)
          %.html: %.md
              pandoc \
              --to=revealjs --standalone \
              $< --output=$@ \
              -M revealjs-url=$(DEPS_DIR)/reveal.js/reveal.js-3.9.2
  16. COMMON PLUGINS

          Plugin           Purpose
          pandoc-citeproc  Processes citations, enables BibTeX use
          pandoc-crossref  Cross-referencing for figures, equations, sections, etc.
  17. OTHER GREAT PLUGINS

          Plugin               Purpose
          pandoc-include-code  Includes code from files instead of embedding
          pandoc-placetable    Nicely render CSV data into a table
          panpipe              Execute code blocks during document rendering
  18. MORE PLUGINS
      - Plugins written in Haskell, Lua, Python, and more
      - Loads more: https://github.com/jgm/pandoc/wiki/Pandoc-Filters
  19. DIAGRAMS: External or embedded [the embedded example is cut off on the slide]

          ![Figure caption](diagram.svg)

          \begin{figure}
          \centering
          \tikz{
          \draw[->, thick]{ (0,0) -- (10,0) };
          \node[circle,radius=2pt,fill=blue] at (0,0){};
          \node[circle,radius=2pt,fill=blue] at (1,0){};
          \node[circle,radius=2pt,fill=blue] at (2,0){};
          \node[circle,radius=2pt,fill=blue, color=blue, align=cen
          \node[circle,radius=2pt,fill=blue] at (4,0){};
          \node[circle,radius=2pt,fill=blue] at (5,0){};
  20. CITATIONS [the BibTeX entry is cut off on the slide]

          --filter pandoc-citeproc --bibliography bib.bib

          As described in @hendry1995dynamic, we conclude that…

          @book{hendry1995dynamic,
            title={Dynamic Econometrics},
            author={Hendry, D.F. and F, H.D. and Hendry, P.E.O.U.F.D.F.
            isbn={9780198283164},
            lccn={gb95034438},
            series={Advanced texts in econometrics},
            url={https://books.google.com/books?id=XcWVN2-2ZqIC},
            year={1995},
  21. FOUR PRIMARY TOOLS

          Tool      Utility
          pandoc    Write in a simple text format, Markdown
          git       Distribute changes and settle conflicts
          GitHub    Review and suggest changes, treat docs as code
          Drone CI  Push button to receive PDF, archived forever
  22. FLOW OF DATA [Diagram; labels in extraction order: Working Copy, Write, pandoc, Compile, Committed Work, Commit, Fix, Refactor, GitHub, Push, Clone, CI, Check, Validate, pandoc in Docker, Compile, Save Build Artifacts, Notify of Build Errors]
  23. AUTHORING
      - Use a Markdown-specific text editor with preview: PanWriter, MacDown, vim + entr + PDF viewer
      - Writing one sentence per line makes review suggestions easier.
  24. CONTINUOUS INTEGRATION
      - Block PR merging with CI system automation.
      - Ensure valid markup and view changes compiled
      - Run proselint or a grammar/spelling tool
  25. GITHUB’S PR SUGGESTIONS
      - Push a button to accept changes
      - Discuss suggestions, provide alternative suggestions
      - Establish consensus on controversial suggestions
  26. PAIN GETTING STARTED
      - Dependency installation
      - Learning Pandoc’s flavor of Markdown
      - “Why can’t I just use LaTeX?”
      - Converting from Word or LaTeX loses cross-references*
      * as of Pandoc 2.9.x
  27. PRODUCTIVITY PAIN POINTS
      - Incomplete WYSIWYG
      - Bugs in workflow, sole developer stakeholder
      - Equation writing workflow disjointed
          - Editor with TeX equations support
          - Separate renderer (LaTeXiT, MathJax.com)
          - Just render it
  28. ACCOMMODATING OBJECTIONS: I want to use X
      - “But I want to use LaTeX”: only if you’ll own that file!
      - “But I want to write my section in X and export it to Pandoc Markdown”: only if you can effect changes suggested in the PR
  29. OVERWRITING. Changes made to a versioned file overwritten by the output of an external tool cost us a lot of time.
      [Diagram: 0200_widgets.Rmd → (Rmd manual conversion) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.md]
  30. RECOMMENDATION: Convert in the build system.
      [Diagram: 0200_widgets.Rmd → (Rmd conversion by build system) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.Rmd]
  31. POSITIVE - PhD who loves LaTeX: “Leveled the playing field for contributions, great for collaborating and building documents with all of the features of LaTeX”
  32. NEGATIVE - PhD who loves LaTeX: “I miss having fine-level control of figures, subfigures, positioning, etc.”
  33. GROWING ADOPTION
      - Two large papers (~50 pgs and 176 pgs)
      - Several smaller papers
      - Nearly two dozen authors
  34. FUTURE GROWTH
      - Output HTML, too, with CI workflow for GitHub Pages
      - Output ePub for easier consumption on mobiles
      - Well-styled LaTeX to make our documents ours
  35. REFERENCES AND ATTRIBUTIONS
      - Icons by Font Awesome. CC-BY-4.0 / SIL OFL 1.1.
      - A Friendly Introduction to Software Testing repo by Bill Laboon
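
The "REAL INVOCATION" slide passes a long list of chapter files and flags to pandoc. A build script could assemble that argument list programmatically so that new chapters are picked up without editing the command. The following is a minimal Python sketch under that assumption, not the build script used for the talk: the function name `build_pandoc_command` and its parameters are hypothetical, only flags visible on the slide are included, and the options hidden behind the slide's trailing "…" are deliberately not filled in.

```python
import shlex
from typing import List, Sequence


def build_pandoc_command(
    sources: Sequence[str],
    output: str = "documentation.pdf",
    bibliography: str = "bibliography.bib",
) -> List[str]:
    """Assemble a pandoc argument list using the flags shown on the slide."""
    # Chapter files carry numeric prefixes (01_intro.md, 02_problem.md, ...),
    # so a plain lexicographic sort puts them in reading order.
    ordered = sorted(sources)
    return [
        "pandoc",
        *ordered,
        "--output", output,
        # Filter order follows the slide: pandoc-crossref before pandoc-citeproc.
        "--filter", "pandoc-crossref",
        "--filter", "pandoc-citeproc",
        "--lua-filter", ".filters/glossary/pandoc-gls.lua",
        "--pdf-engine", "xelatex",
        "--top-level-division=chapter",
        "--number-sections",
        "--toc", "--toc-depth=3",
        "-M", "lof", "-M", "lot",
        f"--bibliography={bibliography}",
    ]


if __name__ == "__main__":
    command = build_pandoc_command(["02_problem.md", "01_intro.md"])
    # Print the command for inspection instead of running it.
    print(shlex.join(command))
```

To actually build the PDF, the returned list could be handed to `subprocess.run(command, check=True)` on a machine with pandoc, the two filters, and XeLaTeX installed. The filter names follow the slide, which targets Pandoc 2.9.x.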
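
The "AUTHORING" slide recommends writing one sentence per line, and the "CONTINUOUS INTEGRATION" slide suggests running checks on every pull request. One way to combine the two is a small lint step that flags lines holding more than one sentence. The sketch below is an illustration only, not a tool from the talk: the function name `lines_with_multiple_sentences` is hypothetical, and the sentence-boundary rule is a rough heuristic stated in the comments.

```python
import re
from typing import List, Tuple

# Build the fence marker from single characters so this example can itself be
# embedded in a fenced code block.
FENCE = "`" * 3

# Assumption: a sentence boundary is '.', '!' or '?' preceded by a lowercase
# letter and followed by whitespace and an uppercase letter. Numbered list
# markers such as "1. Item" are therefore ignored, sentences ending in a digit
# or bracket are missed, and abbreviations such as "e.g. Pandoc" are flagged.
_BOUNDARY = re.compile(r"(?<=[a-z])[.!?]\s+(?=[A-Z])")


def lines_with_multiple_sentences(markdown: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that appear to hold 2+ sentences."""
    flagged = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Toggle on every fence line; code inside fences is not prose.
            in_fence = not in_fence
            continue
        if in_fence or not stripped:
            continue
        if _BOUNDARY.search(stripped):
            flagged.append((number, line))
    return flagged


if __name__ == "__main__":
    sample = "First sentence. Second sentence.\nOnly one sentence here.\n"
    for number, line in lines_with_multiple_sentences(sample):
        print(f"line {number}: {line}")
```

A CI job could fail the build when the returned list is non-empty, or only report the lines as a warning while a team is still adopting the convention.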
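
The "OVERWRITING" and "RECOMMENDATION" slides describe a hazard: when a generated `.md` file is versioned next to its `.Rmd` source, review suggestions applied to the `.md` are lost at the next conversion. A build system that owns the conversion can also detect that situation. The following Python sketch shows one possible check, assuming the generated file shares the source's base name as in the slides' `0200_widgets.Rmd` and `0200_widgets.md` example. The function name `plan_conversions` is hypothetical, and the conversion command itself is not shown on the slides, so it is left out here.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def plan_conversions(
    versioned_files: Iterable[str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Plan Rmd -> md conversions and flag generated files that are also versioned.

    Returns (conversions, conflicts):
      conversions: (source.Rmd, generated.md) pairs for the build system to produce
      conflicts:   generated .md paths that are also checked in, where edits or
                   review suggestions would be overwritten by the next conversion
    """
    files = sorted(set(versioned_files))
    known = set(files)
    conversions = []
    conflicts = []
    for name in files:
        path = PurePosixPath(name)
        if path.suffix != ".Rmd":
            continue
        # Assumption: the generated file keeps the base name and swaps the suffix.
        generated = str(path.with_suffix(".md"))
        conversions.append((name, generated))
        if generated in known:
            conflicts.append(generated)
    return conversions, conflicts


if __name__ == "__main__":
    conversions, conflicts = plan_conversions(
        ["0200_widgets.Rmd", "0200_widgets.md", "0100_intro.md"]
    )
    print("to convert:", conversions)
    print("versioned but generated:", conflicts)
```

In a CI job, the list of versioned files could come from `git ls-files`, and a non-empty `conflicts` list could block the merge so that reviewers leave their suggestions on the `.Rmd` source instead.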