
Building a content-focused, scientific document authoring workflow for Data Scientists and Engineers alike

Colin Dean
August 10, 2020

At my Forbes 50 employer, I observed a collaboration problem in white paper authoring: a tedious workflow built around legacy tooling caused undue stress, headaches, rework, and, ultimately, a cosmetically poor document with inconsistent content and styles. Knowing that a good document requires both good content and good presentation, I proposed and led the creation of a simple workflow amenable to our team's software engineers and data scientists. We treat the white paper text as code, using Markdown, GitHub Enterprise, Pandoc, and LaTeX, with a review process that gets the tooling out of the way so that content authors can focus less on logistics and more on writing and reviewing.

A team of seven engineers and data scientists created a 50-page document containing text, diagrams, equations, graphics, and more in just two weeks. The result greatly pleased our directors and executives, who praised our team not only for the incredibly valuable content but also for the professional appearance of the document. When they learned about the peer review process we used to create it, they wanted more teams to use it.

This talk focuses on the problems of passing around files by email or shared drives, the problems of collaborative editing of online documentation, and the problems we're still addressing in our solution that we've now used to author several significant internal documents.


Transcript

  1. A DOCUMENTATION
    WORKFLOW LOVED BY
    BOTH DATA SCIENTISTS
    AND ENGINEERS
    @colindean
    August 11, 2020

  2. I AM COLIN DEAN.
    I wear a hat and a scarf at conferences.

  3. The views expressed herein are my
    own and do not necessarily
    represent the views of my
    employers or associated
    organization, past, present, or
    future.

  4. Lead AI Engineer at Target Corporation

  5. Managing Director
    Code & Supply Co.
    (Abstractions, Heartifacts conferences)
    Secretreasurer
    Code & Supply Scholarship Fund

  6. President of the Board
    Meta Mesh Wireless Communities

  7. TASK
    Write a high-level overview about our product for
    executive review
    senior director brieng
    director deep-dive

  8. Multiple audiences meant:
    1. Deep content coverage
    2. Summaries
    3. Navigability

  9. A BIG CHANGE


  10. (no transcribed text)

  11. Our product development was paused, so we
    needed to document everything.
    It may not be our team that continues
    development.

  12. AUDIENCE AND DEPTH
    EXPANSION
    Detailed white paper for
    executive review
    senior director brieng
    director deep-dive
    data scientists and engineers

  13. DRAMATIS
    PERSONAE
    A team of seven colocated data scientists and
    engineers

  14. Engineering detail
    (architecture, implementation)
    Mathematical detail
    (equations, proofs)
    Both have a lot of terminology

  15. REAL NEED
    A content-focused, scientic document authoring
    workow for Data Scientists and Engineers alike

  16. THINKING
    ARCHITECTURALLY

  17. PRIMARY VALUES
    Reviewable content: prose, diagrams, equations
    Content-focused with minimal markup
    Minimize structural exceptions with standardized
    styling and typesetting

  18. SECONDARY VALUES
    Accommodate some preferences for LaTeX over
    simpler formats (Markdown)
    Easy to use: one command to generate output
    Automation: artifact built from versioned “code”

  19. Treat documentation as source code.

  20. AVOID AT ALL COSTS
    Binary les or XML
    Passing around a le via email/Slack
    Manual copy-paste to merge changes
    Difficult exports from wiki format
    Forcing everyone to (re)learn LaTeX

  21.  HighLevelOverview.docx
     HighLevelOverview-COLIN.docx
     HighLevelOverview-COLIN_20200626.docx
     HighLevelOverview-COLIN-JAY.docx
     HighLevelOverview-COLIN_20200626-FAN.docx
     HighLevelOverview-FINALFINAL.docx

  22. SOLUTION

  23. pandoc + git +
    GITHUB + DRONE
    CI

  24. WHAT THIS GETS US
    Write in a simple text format
    Distribute changes and settle conicts
    Review and suggest changes
    Push button to receive PDF, archived forever

  25. BIGGEST BENEFIT?
    LaTeX typesetting without suffering writing LaTeX

  26. or, LaTeX when you need it

  27. pandoc, BRIEFLY
    “A universal document converter”

  28. pandoc, LESS BRIEFLY
    1.0 in 2008, 2.0 in 2017, 2.9.x in 2019
    Open source, GPL-2.0-or-later
    Written in Haskell with a Lua scripting engine
    33 input formats, dozens of output formats

  29. pandoc BASICS

  30. INSTALL
    brew install pandoc # macOS with Homebrew
    apt install pandoc # Debian/Ubuntu/Pop_OS
    scoop install pandoc # Windows with Scoop
    crew install pandoc # Chrome OS with chromebrew

  31. INVOCATION
    pandoc document.md -o document.pdf

  32. (no transcribed text)

  33. REAL INVOCATION
    pandoc \
    01_intro.md 02_problem.md 03_diagnosis.md \
    04_remedy.md 05_summary.md \
    --output documentation.pdf \
    --filter pandoc-crossref \
    --filter pandoc-citeproc \
    --lua-filter .filters/glossary/pandoc-gls.lua \
    --pdf-engine xelatex \
    --top-level-division=chapter \
    --number-sections \
    --toc --toc-depth=3 \
    -M lof -M lot \
    --bibliography=bibliography.bib


  34. CLI metadata options can be put into the YAML
    front-matter of the document

  35. ---
    title: >
      A documentation workflow loved
      by both Data Scientists and Engineers
    author: '@colindean'
    date: August 11, 2020
    theme: white
    css: custom.css
    ---
    # Task
    Write a white paper about our product for
    * …

  36. THIS PRESENTATION IS
    WRITTEN IN MARKDOWN
    and converted to a Reveal.js presentation:
    PRESENTATION = document_workflow
    MARKDOWN = $(PRESENTATION).md
    HTML = $(PRESENTATION).html
    DEPS_DIR = deps
    all: $(HTML)
    %.html: %.md
    	pandoc \
    	  --to=revealjs --standalone \
    	  $< --output=$@ \
    	  -M revealjs-url=$(DEPS_DIR)/reveal.js/reveal.js-3.9.2

  37. BUILD SYSTEM VS. A SCRIPT
    Make
    Gradle

  38. COMMON PLUGINS
    Plugin            Purpose
    pandoc-citeproc   Processes citations, enables BibTeX use
    pandoc-crossref   Cross-referencing for figures, equations, sections, etc.

  39. OTHER GREAT PLUGINS
    Plugin                Purpose
    pandoc-include-code   Includes code from files instead of embedding
    pandoc-placetable     Nicely render CSV data into a table
    panpipe               Execute code blocks during document rendering

  40. MORE PLUGINS
    Plugins written in Haskell, Lua, Python, and more
    Loads more:
    https://github.com/jgm/pandoc/wiki/Pandoc-Filters

  41. DIAGRAMS
    External or embedded
    ![Figure caption](diagram.svg)
    \begin{figure}
    \centering
    \tikz{
    \draw[->, thick]{
    (0,0) -- (10,0)
    };
    \node[circle,radius=2pt,fill=blue] at (0,0){};
    \node[circle,radius=2pt,fill=blue] at (1,0){};
    \node[circle,radius=2pt,fill=blue] at (2,0){};
    \node[circle,radius=2pt,fill=blue, color=blue, align=cen
    \node[circle,radius=2pt,fill=blue] at (4,0){};
    \node[circle,radius=2pt,fill=blue] at (5,0){};

  42. CITATIONS
    --filter pandoc-citeproc
    --bibliography bib.bib
    As described in @hendry1995dynamic, we conclude that…
    @book{hendry1995dynamic,
    title={Dynamic Econometrics},
    author={Hendry, D.F. and F, H.D. and Hendry, P.E.O.U.F.D.F.
    isbn={9780198283164},
    lccn={gb95034438},
    series={Advanced texts in econometrics},
    url={https://books.google.com/books?id=XcWVN2-2ZqIC},
    year={1995},


  43. Distributed version control system
    Predominant/preeminent/prevailing use for
    software and more
    Great for text, not for binaries


  44. GitHub

  45. OUR WORKFLOW

  46. FOUR PRIMARY TOOLS
    Tool       Utility
    pandoc     Write in a simple text format, Markdown
    git        Distribute changes and settle conflicts
    GitHub     Review and suggest changes, treat docs as code
    Drone CI   Push button to receive PDF, archived forever

  47. FLOW OF DATA
    [Flow diagram. Nodes: Working Copy, pandoc, Committed Work, GitHub, CI, pandoc in Docker, Save Build Artifacts, Notify of Build Errors. Edge labels: Write, Compile, Commit, Fix, Refactor, Push, Clone, Check, Validate, Compile.]

  48. AUTHORING
    Use a Markdown-specic text editor with
    preview
    ,
    vim + entr + PDF viewer
    Writing one sentence per line makes review
    suggestions easier.
    PanWriter MacDown
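
The one-sentence-per-line convention can be checked mechanically before review. The following is a minimal Python sketch and not part of the talk: the function name `lines_with_multiple_sentences` and the sentence-boundary heuristic are assumptions for illustration, and the heuristic will misfire on abbreviations followed by a capitalized word.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
# Hypothetical heuristic for illustration; abbreviations can cause false positives.
SENTENCE_BOUNDARY = re.compile(r"[.!?]\s+(?=[A-Z])")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def lines_with_multiple_sentences(text):
    """Return (line_number, line) pairs for prose lines holding more than one sentence."""
    offenders = []
    in_code_block = False
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Toggle on fence markers so code blocks are never inspected.
        if stripped.startswith(FENCE):
            in_code_block = not in_code_block
            continue
        # Skip code, blank lines, and headings.
        if in_code_block or not stripped or stripped.startswith("#"):
            continue
        if SENTENCE_BOUNDARY.search(stripped):
            offenders.append((number, line))
    return offenders


if __name__ == "__main__":
    sample = (
        "# Task\n"
        "Write a white paper. Keep it short.\n"
        "This line holds one sentence.\n"
    )
    for number, line in lines_with_multiple_sentences(sample):
        print(f"line {number}: {line}")
```

A check like this could run next to proselint in CI, reporting offending lines as warnings rather than failing the build.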

  49. MANAGING CONTENT
    One chapter per le - enables extraction
    Transforms necessitate a build directory
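
With one chapter per file and a separate build directory, the pandoc command line can be assembled from whatever chapter files exist. The sketch below is an illustration under assumptions rather than the team's actual build: the function name `build_pandoc_command`, the directory layout, and the default output name are hypothetical, and only a subset of the flags from the "REAL INVOCATION" slide is used (filters are omitted).

```python
import shlex
import tempfile
from pathlib import Path


def build_pandoc_command(source_dir, build_dir, output_name="documentation.pdf"):
    """Assemble a pandoc argument list from numbered chapter files.

    Chapters are ordered by file name, so prefixes such as 01_, 02_ set the order.
    The layout and the output_name default are assumptions for illustration.
    """
    chapters = sorted(Path(source_dir).glob("*.md"))
    if not chapters:
        raise FileNotFoundError(f"no Markdown chapters found in {source_dir}")
    output_path = Path(build_dir) / output_name
    return [
        "pandoc",
        *[str(chapter) for chapter in chapters],
        "--output", str(output_path),
        "--pdf-engine", "xelatex",
        "--top-level-division=chapter",
        "--number-sections",
        "--toc", "--toc-depth=3",
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "chapters"
        source.mkdir()
        # Create the files out of order to show that sorting fixes the sequence.
        for name in ("02_problem.md", "01_intro.md"):
            (source / name).write_text("# Placeholder\n")
        command = build_pandoc_command(source, Path(tmp) / "build")
        # Print the command instead of running it; subprocess.run(command) would build.
        print(shlex.join(command))
```

Keeping the output path under the build directory means generated artifacts never sit next to the versioned sources.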

  50. COMMITTING
    Use git commits to tell a story about the
    changes.

  51. REVIEWING

  52. PULL REQUESTS
    Assign reviewers automatically with
    CODEOWNERS
    Choose submitter-merge or reviewer-merge

  53. CONTINUOUS INTEGRATION
    Block PR merging with CI system automation.
    Ensure valid markup and view changes compiled
    Run proselint or a grammar/spelling tool
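
One possible "valid markup" check for CI is to confirm that every citation key used in the Markdown exists in the bibliography. This is a minimal Python sketch, not the check the team ran: the function name `missing_citation_keys` and both regular expressions are assumptions, only simple `@key` citations are handled, and labels with pandoc-crossref prefixes such as `fig:` are skipped because they are cross-references rather than citations. Handles such as `@colindean` in prose would be reported as missing keys.

```python
import re

# Pandoc-style citation such as @hendry1995dynamic; the lookbehind skips e-mail addresses.
CITATION = re.compile(r"(?<![\w.])@([A-Za-z0-9_][\w:.\-]*)")
# Start of a BibTeX entry such as @book{hendry1995dynamic,
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# Label prefixes used by pandoc-crossref; these are cross-references, not citations.
CROSSREF_PREFIXES = ("fig:", "eq:", "sec:", "tbl:", "lst:")


def missing_citation_keys(markdown_text, bibtex_text):
    """Return the sorted citation keys used in the Markdown but absent from the BibTeX."""
    cited = set()
    for key in CITATION.findall(markdown_text):
        key = key.rstrip(".:-")  # drop sentence punctuation captured after the key
        if key and not key.startswith(CROSSREF_PREFIXES):
            cited.add(key)
    defined = set(BIB_ENTRY.findall(bibtex_text))
    return sorted(cited - defined)


if __name__ == "__main__":
    markdown = "As described in @hendry1995dynamic, we conclude. See @missing2020."
    bibtex = "@book{hendry1995dynamic,\n  title={Dynamic Econometrics},\n}\n"
    missing = missing_citation_keys(markdown, bibtex)
    for key in missing:
        print(f"undefined citation key: {key}")
    # A non-zero exit status lets the CI system block the pull request.
    raise SystemExit(1 if missing else 0)
```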

  54. GITHUB’S PR SUGGESTIONS
    Push a button to accept changes
    Discuss suggestions, provide alternative
    suggestions
    Establish consensus on controversial
    suggestions

  55. PAIN POINTS

  56. PAIN GETTING STARTED
    Dependency installation
    Learning Pandoc’s flavor of Markdown
    “Why can’t I just use LaTeX?”
    Converting from Word or LaTeX loses cross-references*
    *as of Pandoc 2.9.x

  57. PRODUCTIVITY PAIN POINTS
    Incomplete WYSIWYG
    Bugs in workow, sole developer stakeholder
    Equation writing workow disjointed
    Editor with TeX equations support
    Separate renderer (LaTeXiT, MathJax.com)
    Just render it
    57

    View Slide

  58. ACCOMMODATING
    OBJECTIONS
    I want to use X
    “But I want to use LaTeX”
    only if you’ll own that file!
    “But I want to write my section in X and export
    it to Pandoc Markdown”
    only if you can effect changes suggested in
    the PR

  59. GREATEST RISK OF
    ADDITIONAL
    TRANSFORMATION TOOLS?

  60. OVERWRITING.
    Changes made to a versioned file overwritten by
    the output of an external tool cost us a lot of time.
    [Diagram: 0200_widgets.Rmd → (manual Rmd conversion) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.md]

  61. RECOMMENDATION
    Convert in the build system.
    [Diagram: 0200_widgets.Rmd → (Rmd conversion by build system) → 0200_widgets.md → (Pandoc) → document.pdf; review suggestions land on 0200_widgets.Rmd]
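
Converting in the build system can be sketched as follows: generated Markdown goes only into a build directory, and a file is regenerated when it is missing or older than its versioned source. This is a hedged Python illustration, not the talk's implementation. The names `generated_path`, `needs_rebuild`, and `convert_stale` are hypothetical, and the `convert` callable stands in for whatever tool performs the real Rmd conversion.

```python
import os
import tempfile
from pathlib import Path


def generated_path(source, build_dir):
    """Map a versioned source such as 0200_widgets.Rmd to build_dir/0200_widgets.md."""
    return Path(build_dir) / (Path(source).stem + ".md")


def needs_rebuild(source, target):
    """True when the target is missing or older than its source."""
    source, target = Path(source), Path(target)
    return not target.exists() or target.stat().st_mtime < source.stat().st_mtime


def convert_stale(sources, build_dir, convert):
    """Run convert(source, target) for each stale target and return the rebuilt targets.

    `convert` is a hypothetical callable standing in for the real Rmd converter.
    """
    Path(build_dir).mkdir(parents=True, exist_ok=True)
    rebuilt = []
    for source in sources:
        target = generated_path(source, build_dir)
        if needs_rebuild(source, target):
            convert(Path(source), target)
            rebuilt.append(target)
    return rebuilt


if __name__ == "__main__":
    def copy_text(source, target):
        # Stand-in converter: copies the text unchanged.
        target.write_text(source.read_text())

    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "0200_widgets.Rmd"
        source.write_text("# Widgets\n")
        build = Path(tmp) / "build"
        print(convert_stale([source], build, copy_text))  # first run converts the file
        print(convert_stale([source], build, copy_text))  # second run finds nothing stale
        # Simulate a later edit to the source by moving its timestamp forward.
        newer = generated_path(source, build).stat().st_mtime + 10
        os.utime(source, (newer, newer))
        print(convert_stale([source], build, copy_text))  # edited source converts again
```

Because the generated `.md` never lives in the versioned tree, review suggestions can only target the `.Rmd` source, which removes the overwrite risk described above.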

  62. FEEDBACK
    You don’t have to take my word for it!

  63. POSITIVE
    - PhD who loves LaTeX
    “Leveled the playing field for
    contributions, great for
    collaborating and building
    documents with all of the features
    of LaTeX”

  64. NEGATIVE
    - PhD who loves LaTeX
    “I miss having fine-level control of
    figures, subfigures, positioning, etc.”

  65. CURRENT USE

  66. GROWING ADOPTION
    Two large papers (~50 pgs and 176 pgs)
    Several smaller papers
    Nearly two dozen authors

  67. MANY TOOLS
    MacTeX pandoc pandoc-crossref
    pandoc-citeproc XeTeX Tectonic
    make Homebrew librsvg proselint
    docker pandocker git

  68. FUTURE GROWTH
    Output HTML, too, with CI workow for GitHub
    Pages
    Output ePub for easier consumption on mobiles
    Well-styled LaTeX to make our documents ours
    68

    View Slide

  69. pandoc.org

  70. REFERENCES AND
    ATTRIBUTIONS
    Icons by Font Awesome. CC-BY-4.0 / SIL OFL 1.1.
    A Friendly Introduction to Software Testing by Bill Laboon (repo)

  71. THESE SLIDES
    Raw code including use of pandoc:
    https://github.com/colindean/talks
    document_workflow
    Rendered version
    https://speakerdeck.com/colindean

  72. FIN
    @colindean