
Building a content-focused, scientific document authoring workflow for Data Scientists and Engineers alike

Colin Dean
August 10, 2020

I observed a white paper authoring collaboration problem at my Forbes 50 employer: a tedious workflow built around legacy tooling caused undue stress, headaches, rework, and, ultimately, a poor-looking document with inconsistent content and styles. Knowing that a good document requires both good content and good presentation, I proposed and led the creation of a simple workflow amenable to our team's software engineers and data scientists. It treats the white paper text as code, using technologies including Markdown, GitHub Enterprise, Pandoc, and LaTeX, together with a review process that gets the tooling out of the way so that content authors can focus less on logistics and more on writing and reviewing.

A team of seven engineers and data scientists created a 50-page document containing text, diagrams, equations, graphics, and more in just two weeks. The outcome greatly pleased our directors and executives, who praised our team not only for the incredibly valuable content but also for the professional appearance of the document. When they learned about the peer review process we used to create it, they wanted more teams to use it.

This talk focuses on the problems of passing around files by email or shared drives, the problems of collaborative editing of online documentation, and the problems we're still addressing in our solution that we've now used to author several significant internal documents.



  1. @colindean August 11, 2020
  2. I AM COLIN DEAN. I wear a hat and a scarf at conferences.
  3. The views expressed herein are my own and do not necessarily represent the views of my employers or associated organization, past, present, or future.
  4. Lead AI Engineer at Target Corporation
  5. Managing Director, Code & Supply Co. (Abstractions, Heartifacts conferences)
    Secretreasurer, Code & Supply Scholarship Fund
  6. President of the Board, Meta Mesh Wireless Communities
  7. TASK
    Write a high-level overview about our product for
    - executive review
    - senior director briefing
    - director deep-dive
  8. Multiple audiences meant:
    1. Deep content coverage
    2. Summaries
    3. Navigability

  11. Our product development was paused, so we needed to document everything. It may not be our team that continues development.
  12. AUDIENCE AND DEPTH EXPANSION
    Detailed white paper for
    - executive review
    - senior director briefing
    - director deep-dive
    - data scientists and engineers
  13. DRAMATIS PERSONAE
    A team of seven colocated data scientists and engineers
  14. Engineering detail (architecture, implementation)
    Mathematical detail (equations, proofs)
    Both have a lot of terminology
  15. REAL NEED
    A content-focused, scientific document authoring workflow for Data Scientists and Engineers alike

  17. PRIMARY VALUES
    - Reviewable content: prose, diagrams, equations
    - Content-focused with minimal markup
    - Minimize structural exceptions with standardized styling and typesetting
  18. SECONDARY VALUES
    - Accommodate some preferences for LaTeX over simpler formats (Markdown)
    - Easy to use: one command to generate output
    - Automation: artifact built from versioned “code”
  19. Treat documentation as source code.
  20. AVOID AT ALL COSTS
    - Binary files or XML
    - Passing around a file via email/Slack
    - Manual copy-paste to merge changes
    - Difficult exports from wiki format
    - Forcing everyone to (re)learn LaTeX
  21. HighLevelOverview.docx
    HighLevelOverview-COLIN.docx
    HighLevelOverview-COLIN_20200626.docx
    HighLevelOverview-COLIN-JAY.docx
    HighLevelOverview-COLIN_20200626-FAN.docx
    HighLevelOverview-FINALFINAL.docx
  22. SOLUTION
  23. pandoc + git + GITHUB + DRONE CI
  24. WHAT THIS GETS US
    - Write in a simple text format
    - Distribute changes and settle conflicts
    - Review and suggest changes
    - Push button to receive PDF, archived forever
  25. BIGGEST BENEFIT?
    LaTeX typesetting without suffering writing LaTeX
  26. or, LaTeX when you need it
  27. pandoc, BRIEFLY
    “A universal document converter”
  28. pandoc, LESS BRIEFLY
    - 1.0 in 2008, 2.0 in 2017, 2.9.x in 2019
    - Open source, GPL-2.0-or-later
    - Written in Haskell with a Lua scripting engine
    - 33 input formats, dozens of output formats
  29. pandoc BASICS
  30. INSTALL
        brew install pandoc   # macOS with Homebrew
        apt install pandoc    # Debian/Ubuntu/Pop_OS
        scoop install pandoc  # Windows with Scoop
        crew install pandoc   # Chrome OS with chromebrew
  31. INVOCATION
        pandoc document.md -o document.pdf

  33. REAL INVOCATION
        pandoc \
          01_intro.md 02_problem.md 03_diagnosis.md \
          04_remedy.md 05_summary.md \
          --output documentation.pdf \
          --filter pandoc-crossref \
          --filter pandoc-citeproc \
          --lua-filter .filters/glossary/pandoc-gls.lua \
          --pdf-engine xelatex \
          --top-level-division=chapter \
          --number-sections \
          --toc --toc-depth=3 \
          -M lof -M lot \
          --bibliography=bibliography.bib \
          …
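An invocation this long is easier to generate than to type. The following is a minimal Python sketch of that idea, not the build tooling used in the talk: the function name `build_pandoc_command` and the directory argument are hypothetical, and only a subset of the flags shown on the slide is used. It collects chapter files in filename order and returns the argument list without running pandoc.

```python
from pathlib import Path
import tempfile


def build_pandoc_command(chapter_dir, output="documentation.pdf"):
    """Return a pandoc argument list for every .md chapter in chapter_dir.

    Chapters are ordered by filename, so numeric prefixes such as
    01_ and 02_ decide their order in the final document.
    """
    chapters = sorted(str(p) for p in Path(chapter_dir).glob("*.md"))
    if not chapters:
        raise ValueError(f"no Markdown chapters found in {chapter_dir}")
    # Flags below are copied from the slide; adjust them to taste.
    return [
        "pandoc",
        *chapters,
        "--output", output,
        "--pdf-engine", "xelatex",
        "--top-level-division=chapter",
        "--number-sections",
        "--toc", "--toc-depth=3",
    ]


if __name__ == "__main__":
    # Demonstrate with throwaway files; pandoc itself is never executed here.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("02_problem.md", "01_intro.md"):
            Path(tmp, name).write_text("# placeholder\n")
        print(" ".join(build_pandoc_command(tmp)))
```

Returning a list instead of a shell string means the result could be handed to `subprocess.run` without quoting concerns, and adding a chapter only requires adding a file with the next numeric prefix.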
  34. CLI metadata options can be put into the YAML front-matter of the document
  35. ---
        title: >
          A documentation workflow loved by both Data Scientists and Engineers
        author: '@colindean'
        date: August 11, 2020
        theme: white
        css: custom.css
        ---
        # Task
        Write a white paper about our product for
        * …
  36. … Reveal.js presentation:
        PRESENTATION = document_workflow
        MARKDOWN = $(PRESENTATION).md
        HTML = $(PRESENTATION).html
        DEPS_DIR = deps
        all: $(HTML)
        %.html: %.md
            pandoc \
              --to=revealjs --standalone \
              $< --output=$@ \
              -M revealjs-url=$(DEPS_DIR)/reveal.js/reveal.js-3.9.2
  37. BUILD SYSTEM VS. A SCRIPT
    - Make
    - Gradle
  38. COMMON PLUGINS
    Plugin           Purpose
    pandoc-citeproc  Processes citations, enables BibTeX use
    pandoc-crossref  Cross-referencing for figures, equations, sections, etc.
  39. OTHER GREAT PLUGINS
    Plugin               Purpose
    pandoc-include-code  Includes code from files instead of embedding
    pandoc-placetable    Nicely render CSV data into a table
    panpipe              Execute code blocks during document rendering
  40. MORE PLUGINS
    - Plugins written in Haskell, Lua, Python, and more
    - Loads more: https://github.com/jgm/pandoc/wiki/Pandoc-Filters
  41. DIAGRAMS
    External or embedded
        ![Figure caption](diagram.svg)
        \begin{figure}
        \centering
        \tikz{
          \draw[->, thick]{ (0,0) -- (10,0) };
          \node[circle,radius=2pt,fill=blue] at (0,0){};
          \node[circle,radius=2pt,fill=blue] at (1,0){};
          \node[circle,radius=2pt,fill=blue] at (2,0){};
          \node[circle,radius=2pt,fill=blue, color=blue, align=cen…
          \node[circle,radius=2pt,fill=blue] at (4,0){};
          \node[circle,radius=2pt,fill=blue] at (5,0){};
          …
  42. CITATIONS
        --filter pandoc-citeproc --bibliography bib.bib
        As described in @hendry1995dynamic, we conclude that…
        @book{hendry1995dynamic,
          title={Dynamic Econometrics},
          author={Hendry, D.F. and F, H.D. and Hendry, P.E.O.U.F.D.F.…
          isbn={9780198283164},
          lccn={gb95034438},
          series={Advanced texts in econometrics},
          url={https://books.google.com/books?id=XcWVN2-2ZqIC},
          year={1995},
          …
  43. git
    - Distributed version control system
    - Predominant/preeminent/prevailing use for software and more
    - Great for text, not for binaries
  44. GitHub


  46. FOUR PRIMARY TOOLS
    Tool      Utility
    pandoc    Write in a simple text format, Markdown
    git       Distribute changes and settle conflicts
    GitHub    Review and suggest changes, treat docs as code
    Drone CI  Push button to receive PDF, archived forever
  47. FLOW OF DATA
    [Diagram labels: Working Copy; Write; pandoc; Compile; Committed Work; Commit; Fix; Refactor; GitHub; Push; Clone; CI; Check; Validate; pandoc in Docker; Compile; Save Build Artifacts; Notify of Build Errors]
  48. AUTHORING
    - Use a Markdown-specific text editor with preview (PanWriter, MacDown), or vim + entr + PDF viewer
    - Writing one sentence per line makes review suggestions easier.
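One sentence per line works because review tools diff and comment line by line, so a suggestion touches exactly one sentence, while Markdown still renders consecutive lines as a single paragraph. As an illustration only, here is a naive Python sketch that reflows an existing paragraph into that layout. The helper name `one_sentence_per_line` is hypothetical, and the sentence detection is deliberately simplistic: it is assumed that every sentence ends in `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g." will be split incorrectly.

```python
import re

# Naive boundary: sentence-ending punctuation followed by whitespace.
# Assumption: abbreviations such as "e.g." will be split incorrectly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(paragraph):
    """Reflow a paragraph so that each sentence sits on its own line."""
    # Collapse existing line breaks and repeated spaces first.
    flattened = " ".join(paragraph.split())
    return "\n".join(_SENTENCE_END.split(flattened))


if __name__ == "__main__":
    text = "Pandoc reads Markdown. It writes PDF via LaTeX. Review the diff!"
    print(one_sentence_per_line(text))
```

Running a helper like this once when a document is imported keeps later pull-request diffs small. After that, authors simply press Enter at the end of each sentence.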
  49. MANAGING CONTENT
    - One chapter per file: enables extraction
    - Transforms necessitate a build directory
  50. COMMITTING
    Use git commits to tell a story about the changes.
  51. REVIEWING
  52. PULL REQUESTS
    - Assign reviewers automatically with CODEOWNERS
    - Choose submitter-merge or reviewer-merge
  53. CONTINUOUS INTEGRATION
    - Block PR merging with CI system automation.
    - Ensure valid markup and view changes compiled
    - Run proselint or a grammar/spelling tool
  54. GITHUB’S PR SUGGESTIONS
    - Push a button to accept changes
    - Discuss suggestions, provide alternative suggestions
    - Establish consensus on controversial suggestions
  55. PAIN POINTS
  56. PAIN GETTING STARTED
    - Dependency installation
    - Learning Pandoc’s flavor of Markdown
    - “Why can’t I just use LaTeX?”
    - Converting from Word or LaTeX loses cross-references*
    *as of Pandoc 2.9.x
  57. PRODUCTIVITY PAIN POINTS
    - Incomplete WYSIWYG
    - Bugs in workflow, sole developer stakeholder
    - Equation writing workflow disjointed
      - Editor with TeX equations support
      - Separate renderer (LaTeXiT, MathJax.com)
      - Just render it
  58. ACCOMMODATING OBJECTIONS: I want to use X
    - “But I want to use LaTeX”: only if you’ll own that file!
    - “But I want to write my section in X and export it to Pandoc Markdown”: only if you can effect changes suggested in the PR

  60. OVERWRITING. Changes made to a versioned file overwritten by the output of an external tool cost us a lot of time.
    [Diagram: 0200_widgets.Rmd → (manual Rmd conversion) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.md]
  61. RECOMMENDATION Convert in the build system.
    [Diagram: 0200_widgets.Rmd → (Rmd conversion by build system) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.Rmd]
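Converting in the build system amounts to treating the generated `.md` file as a disposable build product. Below is a minimal Python sketch of the timestamp comparison that tools such as make perform, under stated assumptions: the function name `stale_sources`, its parameters, and the `build` directory are hypothetical, and the actual Rmd-to-Markdown converter is left out because the slides do not name one.

```python
from pathlib import Path


def stale_sources(source_dir, build_dir, source_ext=".Rmd", target_ext=".md"):
    """Return (source, target) pairs whose generated target must be rebuilt.

    A target is stale when it does not exist yet or is older than its source.
    Targets live in build_dir, so generated files never sit next to (or
    overwrite) the versioned sources that reviewers edit.
    """
    pairs = []
    for source in sorted(Path(source_dir).glob(f"*{source_ext}")):
        target = Path(build_dir) / (source.stem + target_ext)
        if not target.exists() or target.stat().st_mtime < source.stat().st_mtime:
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    # List what a build would regenerate; no conversion is performed here.
    for source, target in stale_sources(".", "build"):
        print(f"would convert {source} -> {target}")
```

A build script would run its converter of choice for each returned pair and then pass the files in the build directory to pandoc. Because only the `.Rmd` files are committed, review suggestions can never be overwritten by a regenerated file.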
  62. FEEDBACK
    You don’t have to take my word for it!
  63. POSITIVE - PhD who loves LaTeX
    “Leveled the playing field for contributions, great for collaborating and building documents with all of the features of LaTeX”
  64. NEGATIVE - PhD who loves LaTeX
    “I miss having fine-level control of figures, subfigures, positioning, etc.”
  65. CURRENT USE
  66. GROWING ADOPTION
    - Two large papers (~50 pgs and 176 pgs)
    - Several smaller papers
    - Nearly two dozen authors
  67. MANY TOOLS
    MacTeX, pandoc, pandoc-crossref, pandoc-citeproc, XeTeX, Tectonic, make, Homebrew, librsvg, proselint, docker, pandocker, git
  68. FUTURE GROWTH
    - Output HTML, too, with CI workflow for GitHub Pages
    - Output ePub for easier consumption on mobiles
    - Well-styled LaTeX to make our documents ours
  69. pandoc.org


  70. … Bill Laboon, A Friendly Introduction to Software Testing repo
    Icons by Font Awesome
  71. THESE SLIDES
    - Raw code including use of pandoc: https://github.com/colindean/talks (document_workflow)
    - Rendered version: https://speakerdeck.com/colindean
  72. FIN @colindean