The Conf 2018 - Building your own flavored Markdown with Ruby

Markdown is a nice and simple markup language that re-uses earlier plaintext formatting conventions from emails and documentation as a markup syntax. It can be converted to many different targets, from the most common, HTML, to the lesser-known LaTeX, and it can even serve as a PDF e-book writing solution.

Not all Markdown is born the same: a few standards exist, but each one has its own set of extensions and quirks. You are going to learn how to build yet another one that suits your own needs, by following the path of GitLab Kramdown.

Gabriel Mazetto

September 22, 2018

Transcript

  5. Markup Language

    “A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text.”
  6. Welcome to Markdown ——————————————————— This is a **simple** markup that

    supports _formatting_ using just plain text symbols. Not all markdowns are born the same, while the [original syntax] (https://daringfireball.net/projects/markdown/syntax) is mostly unchanged between the many different **flavors**, many have their own **extensions** and can interpret things like _multiple spaces_ or headers differently. ```ruby puts "There is also many ways to share code snippets" ```
  8. Input: This is a very simple text with _emphasis_ here also *emphasis* on something else and a really **important** message here.

    Expected Output: This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and a really <strong>important</strong> message here.
  9. Experiment Rules

    1. Anything surrounded by _ or * (a single character) should be modified to be surrounded by <em> and </em>
    2. Anything surrounded by ** should be modified to be surrounded by <strong> and </strong>
  10. This is a very simple text with _emphasis_ here also

    *emphasis* on something else and really **important** message here.
  11. ### First attempt

    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!('_', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really **important** message here."
    content.gsub!('*', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <em><em>important<em><em> message here."
    content.gsub!('**', '<strong>')
    # => nil (no ** is left to replace), so content is unchanged:
    # "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <em><em>important<em><em> message here."
  13. Attempted solution: Use find and replace

    • We can find and replace _, but if we try to find just the *, we can’t easily distinguish the ones that need to become <em> from the ones that need to become <strong>
    • Alternative: find the markers made of two repeated characters first and convert them, then convert the single-character ones.
  14. ### Second attempt

    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!('**', '<strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important<strong> message here."
    content.gsub!('_', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really <strong>important<strong> message here."
    content.gsub!('*', '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."
  16. ### Third attempt

    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!(/\*\*|_|\*/, '**' => '<strong>', '_' => '<em>', '*' => '<em>')
    # => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."
  17. This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here.

    Output: This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here.
  19. Attempted solution: Use find and replace

    We can’t use simple find and replace, as the symbols are ambiguous: _, * and ** can each be either a starting or an ending symbol.
  20. What’s missing

    To substitute the first _ with <em> and its second occurrence with </em>, you need to keep state in memory while reading the stream of characters.
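
A minimal sketch of what "keeping state" could look like, assuming a hypothetical helper `toggle_markers` and lookup table `MARKER_TAGS` (neither comes from the talk). It remembers, per marker, whether a tag is currently open, and emits the opening or closing tag accordingly:

```ruby
# Hypothetical lookup table: which HTML tag each marker maps to.
MARKER_TAGS = { '**' => 'strong', '_' => 'em', '*' => 'em' }.freeze

# Hypothetical helper: replaces each marker with an opening or closing tag,
# depending on whether that marker has already been seen an odd number of times.
def toggle_markers(text)
  is_open = Hash.new(false) # marker => are we currently inside it?
  text.gsub(/\*\*|_|\*/) do |marker|
    tag = MARKER_TAGS.fetch(marker)
    html = is_open[marker] ? "</#{tag}>" : "<#{tag}>"
    is_open[marker] = !is_open[marker] # flip the remembered state
    html
  end
end

puts toggle_markers('simple text with _emphasis_ here, *emphasis* there and an **important** message')
```

This only illustrates the idea of state; it does nothing about unbalanced or nested markers, which is part of why the talk moves on to regular expressions and then to a real parser.
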
  21. Regular Expression

    • /\*\*|_|\*/ will match all the special characters we need, but without holding the state of the first/second match.
  22. Regular Expression

    • /(\*\*)(.*)(\*\*)/ we expect it to match content surrounded by **
    • /(_|\*)(.*)(_|\*)/ we expect it to match content surrounded by * or _
  23. ### Fourth attempt

    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!(/(\*\*)(.*)(\*\*)/, '<strong>\2</strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
    content.gsub!(/(_|\*)(.*)(_|\*)/, '<em>\2</em>')
    # => "This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."
  24. "This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."
  26. /(_|\*)(.*)(_|\*)/

    . matches any character
    * repeats zero or more times (greedy operator)
    | logical “or”

    The greedy .* matches every element of the text until the end of it. It then fails to find the last capture group, so it backtracks (discards elements) until the RegExp is satisfied.
  27. /(_|\*)(.*?)(_|\*)/

    . matches any character
    * repeats zero or more times (greedy operator)
    ? after a greedy operator makes it lazy
    | logical “or”

    The lazy matcher consumes as few characters as possible and then evaluates the rest of the RegExp. If that fails it backtracks, but instead of discarding characters it expands the match by one more character and tries again.
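
To see the difference in isolation, here is a small sketch on a made-up sample string (the string and variable names are illustrative only). Both patterns start at the first `_`, but they stop at different closing underscores:

```ruby
text = 'some _first_ and _second_ words'

# Greedy: .* runs to the end of the string, then backtracks to the LAST underscore.
greedy = text[/_(.*)_/, 1]

# Lazy: .*? grows one character at a time and stops at the FIRST closing underscore.
lazy = text[/_(.*?)_/, 1]

puts greedy # prints: first_ and _second
puts lazy   # prints: first
```
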
  28. ### Fifth attempt

    content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
    content.gsub!(/(\*\*)(.*?)(\*\*)/, '<strong>\2</strong>')
    # => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
    content.gsub!(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')
    # => "This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and really <strong>important</strong> message here."
  30. /(_|\*)(.*?)(\1)/

    \1 is a backreference to whatever the first capture group matched. When the symbol in the first capture group is *, the closing group will look for * only (and likewise for _).
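
A short sketch of why the backreference matters, using a made-up input where an underscore appears inside a `*...*` span (the sample text and variable names are assumptions for illustration):

```ruby
text = 'use *file_name* here'

# Without the backreference, any of _ or * may close the span,
# so the underscore inside the word ends the emphasis too early.
without_backref = text.gsub(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')

# With \1, the closing marker must be the same symbol that opened the span.
with_backref = text.gsub(/(_|\*)(.*?)\1/, '<em>\2</em>')

puts without_backref # prints: use <em>file</em>name* here
puts with_backref    # prints: use <em>file_name</em> here
```
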
  31. Welcome to Markdown ——————————————————— ```ruby puts "There is also many

    ways to **share** code snippets" ``` | This is a table | With columns | |——————————-————————————————-—|—————————————-————————-| | How will you convert this | Into an actual table? | | How about som ``` ambiguous | symbols inside | | And **formating** | |
  32. “A compiler is a program that can read a program in one language (source code) and translate it into an equivalent program in another language (machine code)”
  33. Fundamental parts of a compiler

    ‣ Lexical Analyzer (Scanner)
    ‣ Syntax Analyzer (Parser)
    ‣ Semantic Analyzer
    ‣ Intermediate Code Generation
    ‣ Code Generation
    ‣ Code Optimizer
  34. [Diagram: stages of a generic compiler] file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Context Handling → annotated AST → Intermediate Code Generation → IC → IC optimization → IC → Code generation → symbol instructions → Target code optimization → symbol instructions → Machine Code Generation → bit patterns → Executable code output → file
  35. What a generic Markdown-to-HTML transpiler looks like

    [Diagram] file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → HTML Converter → HTML → HTML Document → file
  37. Understanding how a Scanner works…

    A scanner needs to identify only a finite set of valid tokens/lexemes that belong to the language rules.
  38. Glossary

    • Token: a string with an assigned meaning, structured as a tuple with a type and an optional value (e.g. keywords, literals, numbers, operators…)
    • Lexeme: a sequence of characters in the source program that matches the pattern for a token
  39. Simple scanner example: Return only words from a text.

    Rules:
    1. A "Word" token is formed by a consecutive stream of ASCII characters
    2. Accepts only visible ASCII characters found on a US keyboard
    3. Spaces or special characters are token separators
  40. EBNF Grammar

    Content -> Word | Separator
    Word -> Char
    Char -> [a-zA-Z]
    Separator -> [!@#$%^&*()-_=+ ]
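
One possible way to sketch the "return only words" scanner in Ruby, using `StringScanner` from the standard library. The method name `scan_words` is hypothetical, and the sketch treats every non-letter character as a separator rather than enforcing the exact separator set from the grammar:

```ruby
require 'strscan'

# Hypothetical scanner: walks the text once and collects only the words.
def scan_words(text)
  scanner = StringScanner.new(text)
  words = []
  until scanner.eos?
    if (word = scanner.scan(/[a-zA-Z]+/))
      words << word   # a run of letters is one Word token
    else
      scanner.getch   # anything else acts as a separator and is skipped
    end
  end
  words
end

p scan_words('This is a simple text: Hello World!')
```

For this input the scanner collects the seven words and drops the spaces, the colon and the exclamation mark.
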
  42. Example of how the tokens are parsed…

    This text is in *bold*

    ['This', 'text', 'is', 'in', '*', 'bold', '*']
    [{0: 'This'}, {5: 'text'}, {10: 'is'}, {13: 'in'}, {16: '*'}, {17: 'bold'}, {21: '*'}]
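
A rough sketch of how tokens with their starting offsets could be produced. The method name `tokenize_with_offsets` and the token shape (a one-entry hash of offset to lexeme, mirroring the slide) are assumptions. `StringScanner#pos` is a byte offset, which equals the character offset for ASCII input like this:

```ruby
require 'strscan'

# Hypothetical tokenizer: records where each word or marker starts.
def tokenize_with_offsets(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    start = scanner.pos
    if (lexeme = scanner.scan(/[a-zA-Z]+|\*\*|\*|_/))
      tokens << { start => lexeme } # offset => lexeme, as on the slide
    else
      scanner.getch                 # whitespace and other separators are skipped
    end
  end
  tokens
end

p tokenize_with_offsets('This text is in *bold*')
```
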
  43. Representation using s-expression syntax

    Input: This is a simple text: Hello World!

    Tokens: (content (word This) (word is) (word a) (word simple) (word text) (word Hello) (word World))
  44. Why do compilers have an AST?

    After tokenization, tokens are converted to a tree structure, and they can be annotated with information required by later steps.

    A few example use cases: error handling (displaying the file and line number), dynamic type inference, ambiguous grammars (context dependent), transformations, etc.
  45. A few other Markdown-specific examples

    When we generate a header token and it’s placed inside an AST, we can annotate it with the header level it represents, for example.

    An image token only has the URL, but there is probably an alternate text after it that can be combined with it into a single node in the AST.
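
As an illustration of annotating a node, here is a minimal sketch with a hypothetical `AstNode` struct and `header_node` helper (these are not Kramdown's classes). The header level is stored as an option on the node, and the title text becomes a child node:

```ruby
# Hypothetical AST node: a type, an optional value, annotations and children.
AstNode = Struct.new(:type, :value, :options, :children)

# Hypothetical helper: turns a line such as "### Title" into an annotated header node.
# Returns nil when the line is not a header.
def header_node(line)
  match = line.match(/\A(#+)[ \t]+(.*)\z/)
  return nil unless match

  text = AstNode.new(:text, match[2], {}, [])
  # The number of leading # characters is kept as the :level annotation.
  AstNode.new(:header, nil, { level: match[1].length }, [text])
end

node = header_node('### Third level title')
puts node.options[:level]       # prints: 3
puts node.children.first.value  # prints: Third level title
```

A later step, such as an HTML converter, can then read the `:level` annotation to pick the right heading tag without re-parsing the original text.
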
  48. Kramdown::Parser

    Kramdown differentiates Span Parsers from Block Parsers. The Parser delegates pattern matching to the StringScanner, and it generates an AST in the form of Kramdown::Document and Kramdown::Element.
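
To give a feel for the general shape of that approach, here is a simplified sketch of a span parser that delegates pattern matching to `StringScanner` and builds a small tree. `SpanElement` and `parse_spans` are hypothetical names and do not reproduce Kramdown's actual classes or API:

```ruby
require 'strscan'

# Hypothetical element type for the tree; not Kramdown::Element.
SpanElement = Struct.new(:type, :value, :children)

# Hypothetical span parser: plain text, *em*, _em_ and **strong** only.
def parse_spans(text)
  scanner = StringScanner.new(text)
  root = SpanElement.new(:root, nil, [])
  until scanner.eos?
    if (plain = scanner.scan(/[^*_]+/))
      root.children << SpanElement.new(:text, plain, [])
    elsif scanner.scan(/(\*\*|\*|_)(.+?)\1/)
      # scanner[1] is the marker, scanner[2] the content between the markers.
      type = scanner[1] == '**' ? :strong : :em
      inner = SpanElement.new(:text, scanner[2], [])
      root.children << SpanElement.new(type, nil, [inner])
    else
      # A stray marker with no closing partner is kept as literal text.
      root.children << SpanElement.new(:text, scanner.getch, [])
    end
  end
  root
end

tree = parse_spans('a *b* and **c**')
p tree.children.map(&:type)
```

The resulting tree has one child per span, so a separate converter could walk it and emit `<em>` or `<strong>` tags without touching the original text again.
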