The Conf 2018 - Building your own flavored Markdown with Ruby

Markdown is a simple markup language that reuses earlier plain-text formatting conventions from emails and documentation as a syntax that can be converted to many different targets, from the most common, HTML, to the lesser-known LaTeX, or used as a solution for writing PDF e-books.

Not all Markdown is born the same: a few standards exist, each with its own set of extensions and quirks. You are going to learn how to build yet another one that suits your own needs, by following the path of GitLab Kramdown.

Gabriel Mazetto

September 22, 2018

Transcript

  1. 1

  2. @brodock
    blog.gabrielmazetto.eti.br
    Gabriel Mazetto
    2

  3. 3
    Photo by Grillot edouard

  4. 4

  5. 5

  6. 6

  7. 7
    Building your own
    Flavored Markdown
    with Ruby

  8. What is Markdown?
    8

  9. “Markdown is intended to be an
    easy-to-read and easy-to-write
    text to HTML markup
    language”
    9
    Markdown

  10. What is a Markup
    Language?
    10

  11. “A markup language is a system
    for annotating a document in a
    way that is syntactically
    distinguishable from the text.”
    11
    Markup Language

  12. Welcome to Markdown
    -------------------
    This is a **simple** markup that supports _formatting_ using
    just plain text symbols.
    Not all markdowns are born the same. While the
    [original syntax](https://daringfireball.net/projects/markdown/syntax) is mostly
    unchanged between the many different **flavors**, many have
    their own **extensions** and can interpret things like _multiple
    spaces_ or headers differently.
    ```ruby
    puts "There are also many ways to share code snippets"
    ```

  13. Welcome to Markdown
    -------------------
    This is a **simple** markup that supports _formatting_ using
    just plain text symbols.
    Not all markdowns are born the same. While the
    [original syntax](https://daringfireball.net/projects/markdown/syntax) is mostly
    unchanged between the many different **flavors**, many have
    their own **extensions** and can interpret things like _multiple
    spaces_ or headers differently.
    ```ruby
    puts "There are also many ways to share code snippets"
    ```

  14. Text to HTML
    conversion?
    14


  15. 15
    Let’s try to convert a very simple
    Markdown document

  16. Input:
    This is a very simple text with _emphasis_ here also
    *emphasis* on something else and a really **important**
    message here.
    Expected Output:
    This is a very simple text with <em>emphasis</em> here also
    <em>emphasis</em> on something else and a really
    <strong>important</strong> message here.

  17. 1. Anything surrounded by _ or *
    (a single character) should be modified
    to be surrounded by <em> and </em>

    2. Anything surrounded by **
    should be modified to be
    surrounded by <strong> and </strong>

    17
    Experiment Rules

  18. Let’s see the places
    where we should
    modify the text
    18

  19. This is a very simple text with _emphasis_ here also
    *emphasis* on something else and really **important**
    message here.

  20. Tentative solution:
    Use Find and Replace
    20

  21. ### First attempt
    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!('_', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really **important** message here."
    content.gsub!('*', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <em><em>important<em><em> message here."
    content.gsub!('**', '<strong>')
    # => nil, because the previous step already replaced every *, so no ** is left and content is unchanged


  22. 22

  23. • We can find and replace _, but if
    we try to find just the *, we can’t
    easily distinguish the ones that
    need to become <em> from the
    others that need to become <strong>

    • Alternative: find the doubled
    symbols first and convert them, then
    convert the single-character ones.
    23
    Tentative solution:
    Use find and replace

  24. The order in which we
    process the rules
    DOES matter
    24

  25. ### Second attempt
    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!('**', '<strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important<strong> message here."
    content.gsub!('_', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really <strong>important<strong> message here."
    content.gsub!('*', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."


  26. 26

  27. ### Third attempt
    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    # One pass: the hash maps each matched symbol to its replacement
    content.gsub!(/\*\*|_|\*/, '**' => '<strong>', '_' => '<em>', '*' => '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."

  28. This is a very simple text with _emphasis_ here also
    *emphasis* on something else and really **important**
    message here.
    Output:
    This is a very simple text with <em>emphasis<em> here also
    <em>emphasis<em> on something else and really
    <strong>important<strong> message here.


  29. 29

  30. We can’t use simple find and
    replace as the symbols are
    ambiguous: _, * and ** can be
    either a start or an ending symbol
    30
    Tentative solution:
    Use find and replace

  31. We need some way to substitute the first
    _ with <em> and its second
    occurrence with </em>: you need
    to keep a memory state while
    reading the stream of characters.
    31
    What's missing
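
One way to picture that "memory state" is a per-tag flag that flips every time a symbol is seen. The following is only a minimal sketch under that assumption; the `TAGS` table and the `toggle_tags` helper are hypothetical names, not code from the talk.

```ruby
# Hypothetical lookup table: which HTML tag each symbol stands for.
TAGS = { '**' => 'strong', '_' => 'em', '*' => 'em' }.freeze

# Hypothetical helper: replaces each symbol with an opening or closing tag,
# remembering per tag whether it is currently open.
def toggle_tags(text)
  opened = Hash.new(false) # the "memory": is this tag currently open?
  text.gsub(/\*\*|_|\*/) do |symbol|
    tag = TAGS.fetch(symbol)
    opened[tag] = !opened[tag]
    opened[tag] ? "<#{tag}>" : "</#{tag}>"
  end
end

puts toggle_tags("with _emphasis_ also *emphasis* and **important** here")
# => with <em>emphasis</em> also <em>emphasis</em> and <strong>important</strong> here
```

Because _ and * share the same flag here, mixed pairs such as `_text*` would still be treated as a match, so the sketch only shows the state-keeping idea, not a full solution.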

  32. Let’s try to use
    RegExp with
    “patterns”…
    32

  33. • /\*\*|_|\*/ will match
    all the special characters we
    need, but without holding the
    state of the first/second match.
    33
    Regular Expression

  34. • /(\*\*)(.*)(\*\*)/
    we expect it to match content
    surrounded by **

    • /(_|\*)(.*)(_|\*)/
    we expect it to match content
    surrounded by * or _
    34
    Regular Expression

  35. /(\*\*)(.*)(\*\*)/
    35
    formatting
    symbols
    capture groups
    . matches any character
    * repeats zero or more

  36. /(_|\*)(.*)(_|\*)/
    36
    formatting
    symbols
    capture groups
    . matches any character
    * repeats zero or more
    | logical “or”

  37. ### Fourth attempt
    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!(/(\*\*)(.*)(\*\*)/, '<strong>\2</strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
    content.gsub!(/(_|\*)(.*)(_|\*)/, '<em>\2</em>')
    # => "This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."

  38. 38
    "This is a very simple text with
    <em>emphasis_ here also
    *emphasis</em> on something else
    and really <strong>important</strong>
    message here."

  39. 39

  40. /(_|\*)(.*)(_|\*)/
    40
    . matches any character
    * repeats zero or more (greedy operator)
    | logical “or”
    .* first matches every character up to the end of the text, so the
    last capture group fails to match; then it will
    backtrack (give characters back) until the RegExp is satisfied.

  41. /(_|\*)(.*?)(_|\*)/
    41
    . matches any character
    * repeats zero or more (greedy operator)
    ? after a greedy operator makes it lazy
    | logical “or”
    a lazy matcher first matches as little as possible (even nothing) and
    then evaluates the rest of the RegExp; if that fails it will
    backtrack, but instead of giving characters back, it expands by one more
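
The difference between the greedy and the lazy quantifier is easiest to see on a string with two emphasized words. This is a small illustration with a made-up input string, not an example from the slides.

```ruby
text = "_one_ and _two_"

greedy = text[/_(.*)_/, 1]   # .* grabs as much as possible, then backtracks
lazy   = text[/_(.*?)_/, 1]  # .*? grabs as little as possible, then expands

puts greedy # => one_ and _two
puts lazy   # => one
```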

  42. ### Fifth attempt
    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!(/(\*\*)(.*?)(\*\*)/, '<strong>\2</strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
    content.gsub!(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')
    # => "This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and really <strong>important</strong> message here."

  43. 43

  44. We can still improve
    it…
    44

  45. /(_|\*)(.*?)(_|\*)/
    45
    formatting symbols not
    guaranteed to match

  46. /(_|\*)(.*?)(\1)/
    46
    \1 the content matched by the first capture group
    when the symbol in the first capture group is *, the
    closing \1 will match only another *.
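
A short sketch of the backreference in action, using a made-up input where the two symbols are deliberately interleaved:

```ruby
content = "Mixing _under*score_ and *aster_isk* symbols"

# \1 in the pattern must repeat whatever group 1 matched (_ or *),
# and \2 in the replacement is the text between the two symbols.
html = content.gsub(/(_|\*)(.*?)\1/, '<em>\2</em>')

puts html
# => Mixing <em>under*score</em> and <em>aster_isk</em> symbols
```

With the earlier pattern `/(_|\*)(.*?)(_|\*)/` the first match would have ended at the `*` inside `under*score`, pairing an opening _ with a closing *.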

  47. Things can get complicated for
    block level markups
    47

  48. Welcome to Markdown
    -------------------
    ```ruby
    puts "There are also many ways to **share** code snippets"
    ```
    | This is a table              | With columns          |
    |------------------------------|-----------------------|
    | How will you convert this    | Into an actual table? |
    | How about some ``` ambiguous | symbols inside        |
    | And **formatting**           |                       |
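
To see why block-level markup complicates things, the span-level rule from the earlier attempts can be applied to a fenced code block. This is an illustrative sketch: the fence is assembled from a `fence` variable only so that this example does not nest code fences.

```ruby
fence = '`' * 3 # three backticks, built this way to avoid nesting fences here
markdown = [
  "#{fence}ruby",
  'puts "many ways to **share** code"',
  fence
].join("\n")

# The span-level rule knows nothing about code blocks, so it rewrites the code.
html = markdown.gsub(/(\*\*)(.*?)\1/, '<strong>\2</strong>')

puts html.lines[1]
# => puts "many ways to <strong>share</strong> code"
```

The code sample has been altered even though everything inside the fence should have been left untouched, which is the kind of context a plain find and replace cannot track.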

  49. Let’s talk about compilers

  50. “Compiler is a program that
    can read a program in one
    language (source code) and
    translate it into an equivalent
    program in another language
    (machine code)”
    50

  51. ‣ Lexical Analyzer (Scanner)
    ‣ Syntax Analyzer (Parser)
    ‣ Semantic Analyzer
    ‣ Intermediate Code Generation
    ‣ Code Generation
    ‣ Code Optimizer
    51
    Fundamental parts of a compiler

  52. 52
    Compiler pipeline (stage -> what it hands to the next stage):
    Program text input (file) -> characters
    Lexical Analysis -> tokens
    Syntax Analysis -> AST
    Context Handling -> annotated AST
    Intermediate Code Generation -> IC
    IC optimization -> IC
    Code generation -> symbolic instructions
    Target code optimization -> symbolic instructions
    Machine Code Generation -> bit patterns
    Executable code output (file)

  53. 53
    We need only a slice of it

  54. 54
    What a generic Markdown
    to HTML transpiler looks
    like
    Program text input (file) -> characters
    Lexical Analysis -> tokens
    Syntax Analysis -> AST
    HTML Converter -> HTML
    HTML Document (file)

  55. 55
    Program text input (file) -> characters
    Lexical Analysis -> tokens
    Syntax Analysis -> AST
    HTML Converter -> HTML
    HTML Document (file)

  56. A scanner only needs to identify
    a finite set of valid tokens/lexemes
    that belong to the language's
    rules.
    56
    Understanding how a
    Scanner works…

  57. • Token: a string with an assigned
    meaning, structured as a tuple
    with a type and an optional value
    (e.g. keywords, literals, numbers,
    operators…)
    • Lexeme: a sequence
    of characters in the source
    program that matches the
    pattern for a token
    57
    Glossary

  58. 58
    A simple scanner example

  59. Rules:
    1. A "Word" token is formed by a
    consecutive stream of ASCII
    letters.
    2. Only visible ASCII characters from
    a US keyboard are accepted.
    3. Spaces or special characters
    are token separators.
    59
    Simple scanner example:
    Return only words from a
    text.

  60. Content -> (Word | Separator)*
    Word -> Char+
    Char -> [a-zA-Z]
    Separator -> [!@#$%^&*()\-_=+ ]
    60
    EBNF Grammar
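
The rules and grammar above can be sketched with Ruby's standard `StringScanner`. This is a minimal sketch, not the scanner shown in the talk: the `WORD` and `SEPARATOR` constants and the `scan_words` helper are hypothetical names, and raising on characters outside the grammar is an assumption about how invalid input should be handled.

```ruby
require 'strscan'

# Hypothetical patterns mirroring the grammar: letters form a Word,
# the listed symbols and the space only separate tokens.
WORD      = /[a-zA-Z]+/
SEPARATOR = /[ !@\#$%^&*()\-_=+]+/

def scan_words(text)
  scanner = StringScanner.new(text)
  words = []
  until scanner.eos?
    if (word = scanner.scan(WORD))
      words << word
    elsif scanner.scan(SEPARATOR)
      next # separators produce no token
    else
      # Assumption: anything outside the grammar is reported as an error.
      raise ArgumentError, "unexpected character at #{scanner.pos}: #{scanner.peek(1).inspect}"
    end
  end
  words
end

p scan_words("Hello World! foo_bar")
# => ["Hello", "World", "foo", "bar"]
```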

  61. 61
    Program text input (file) -> characters
    Lexical Analysis -> tokens
    Syntax Analysis -> AST
    HTML Converter -> HTML
    HTML Document (file)

  62. This text is in *bold*
    62
    Example of how the tokens
    are parsed…
    ['This', 'text', 'is', 'in', '*', 'bold', '*']
    [{0: 'This'}, {5: 'text'},
    {10: 'is'}, {13: 'in'},
    {16: '*'}, {17: 'bold'}, {21: '*'}]
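
The offsets in the second list can be reproduced by recording the scanner position before each match. A minimal sketch, assuming a hypothetical `tokenize_with_offsets` helper that keeps * as its own token and skips everything else:

```ruby
require 'strscan'

# Hypothetical tokenizer: returns [offset, lexeme] pairs, keeping * as a token.
def tokenize_with_offsets(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    start = scanner.pos
    if (lexeme = scanner.scan(/[a-zA-Z]+|\*/))
      tokens << [start, lexeme]
    else
      scanner.getch # skip separators such as spaces
    end
  end
  tokens
end

p tokenize_with_offsets("This text is in *bold*")
# => [[0, "This"], [5, "text"], [10, "is"], [13, "in"], [16, "*"], [17, "bold"], [21, "*"]]
```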

  63. 63
    Tokenization

  64. Input: This is a simple
    text: Hello World!
    Tokens:
    (content
      (word This)
      (word is)
      (word a)
      (word simple)
      (word text)
      (word Hello)
      (word World))
    64
    Representation using
    s-expression syntax
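
Printing tokens in this s-expression form takes only a few lines. A sketch, assuming tokens are represented as hypothetical `[type, value]` pairs and that a simple `String#scan` stands in for the scanner:

```ruby
# Hypothetical token format: [type, value] pairs, as a scanner might emit them.
def to_sexp(tokens)
  body = tokens.map { |type, value| "(#{type} #{value})" }.join(' ')
  "(content #{body})"
end

words  = "This is a simple text: Hello World!".scan(/[a-zA-Z]+/)
tokens = words.map { |word| [:word, word] }

puts to_sexp(tokens)
# => (content (word This) (word is) (word a) (word simple) (word text) (word Hello) (word World))
```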

  65. 65
    AST

  66. After tokenization, tokens are
    converted to a tree structure, and
    they can be annotated with
    information required by later steps.

    A few example use cases: error
    handling (displaying the file and line
    number), dynamic type inference,
    ambiguous grammars (context
    dependent), transformations, etc.
    66
    Why do compilers have an AST?

  67. When we generate a header token
    and place it inside an AST, we
    can annotate it with the header
    level it represents, for example.
    An image token only has the URL, but
    there is probably an alternate text
    after it that can be combined into a
    single <img> element in the AST.
    67
    A few other Markdown-specific
    examples
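
The header case can be sketched as a node that carries its level as an annotation, which a later step reads when producing HTML. The `Node` struct and the `build_node` and `to_html` helpers are hypothetical and much simpler than a real Markdown AST.

```ruby
# Hypothetical node type; :options holds annotations added while building the tree.
Node = Struct.new(:type, :value, :options, :children)

# Turns one line of text into an annotated node.
def build_node(line)
  if (match = line.match(/\A(\#{1,6}) +(.*)\z/))
    Node.new(:header, match[2], { level: match[1].length }, [])
  else
    Node.new(:paragraph, line, {}, [])
  end
end

# A later step (here: HTML output) relies on the annotation, not on the raw text.
def to_html(node)
  case node.type
  when :header
    level = node.options[:level]
    "<h#{level}>#{node.value}</h#{level}>"
  when :paragraph
    "<p>#{node.value}</p>"
  end
end

node = build_node("### Building your own")
puts node.options[:level] # => 3
puts to_html(node)        # => <h3>Building your own</h3>
```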

  68. How can we use
    all of that in real life?
    68

  69. 69

  70. 70
    https://gitlab.com/gitlab-com/gitlab-docs

  71. 71
    Introducing GitLab Kramdown:

    https://gitlab.com/gitlab-org/gitlab_kramdown

  72. Many of the things we talked about
    earlier were instrumental to
    the Kramdown implementation
    72

  73. 73
    Program text input (file) -> characters
    Lexical Analysis -> tokens
    Syntax Analysis -> AST
    HTML Converter -> HTML
    HTML Document (file)

  74. Kramdown differentiates Span
    Parsers from Block Parsers.
    The Parser delegates pattern
    matching to the StringScanner,
    and it generates an AST in the
    form of Kramdown::Document
    and Kramdown::Element.
    74
    Kramdown::Parser
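
The division of labor described here, where a scanner does the pattern matching and the parser decides which element each match becomes, can be illustrated with Ruby's standard `StringScanner`. This is not Kramdown's actual API: the `Element` struct and the `parse_spans` helper below are hypothetical and only loosely modeled on a type/value/children node.

```ruby
require 'strscan'

# Hypothetical element type, loosely modeled on a type/value/children node.
Element = Struct.new(:type, :value, :children)

# Hypothetical span parser: the StringScanner does the pattern matching,
# the parser decides which element each match becomes.
def parse_spans(text)
  scanner = StringScanner.new(text)
  root = Element.new(:root, nil, [])
  until scanner.eos?
    if scanner.scan(/\*\*(.+?)\*\*/)
      root.children << Element.new(:strong, scanner[1], [])
    elsif (plain = scanner.scan(/[^*]+/))
      root.children << Element.new(:text, plain, [])
    else
      root.children << Element.new(:text, scanner.getch, []) # stray *
    end
  end
  root
end

tree = parse_spans("a really **important** message")
p tree.children.map { |el| [el.type, el.value] }
# => [[:text, "a really "], [:strong, "important"], [:text, " message"]]
```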

  75. 75
    Let’s see some source-code

  76. 76
    Questions?

  77. If you want to learn more:

    https://gitlab.com/gitlab-org/gitlab_kramdown
    77

  78. 78
    Thank You :)
