The Conf 2018 - Building your own flavored Markdown with Ruby

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Gabriel Mazetto @brodock blog.gabrielmazetto.eti.br

Slide 3

Slide 3 text

Photo by Grillot edouard

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Building your own Flavored Markdown with Ruby

Slide 8

Slide 8 text

What is Markdown?

Slide 9

Slide 9 text

Markdown
“Markdown is intended to be an easy-to-read and easy-to-write text to HTML markup language”

Slide 10

Slide 10 text

What is a Markup Language?

Slide 11

Slide 11 text

Markup Language
“A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text.”

Slide 12

Slide 12 text

Welcome to Markdown
-------------------

This is a **simple** markup that supports _formatting_ using just plain text symbols.

Not all markdowns are born the same, while the [original syntax](https://daringfireball.net/projects/markdown/syntax) is mostly unchanged between the many different **flavors**, many have their own **extensions** and can interpret things like _multiple spaces_ or headers differently.

```ruby
puts "There are also many ways to share code snippets"
```

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Text to HTML conversion? …

Slide 15

Slide 15 text

Let’s try to convert a very simple Markdown document

Slide 16

Slide 16 text

Input: This is a very simple text with _emphasis_ here also *emphasis* on something else and a really **important** message here.
Expected Output: This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and a really <strong>important</strong> message here.

Slide 17

Slide 17 text

Experiment Rules
1. Anything surrounded by _ or * (a single symbol) should be modified to be surrounded by <em> and </em>
2. Anything surrounded by ** should be modified to be surrounded by <strong> and </strong>

Slide 18

Slide 18 text

Let’s see the places where we should modify the text

Slide 19

Slide 19 text

This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here.

Slide 20

Slide 20 text

Tentative solution: Use Find and Replace

Slide 21

Slide 21 text

### First attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really **important** message here."
content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <em><em>important<em><em> message here."
content.gsub!('**', '<strong>')
# => nil (every * was already replaced, so nothing matches and content is unchanged)

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Tentative solution: Use find and replace
• We can find and replace _, but if we try to find just the * we can’t easily distinguish the ones that need to become <em> from the ones that need to become <strong>.
• Alternative: find the ones with two repeating symbols first and convert them, then convert the single ones.

Slide 24

Slide 24 text

The order in which we process the rules DOES matter

Slide 25

Slide 25 text

### Second attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!('**', '<strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important<strong> message here."
content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really <strong>important<strong> message here."
content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."

Slide 26

Slide 26 text

Slide 27

Slide 27 text

### Third attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
# One pass: the hash maps each matched symbol to its replacement
content.gsub!(/\*\*|_|\*/, '**' => '<strong>', '_' => '<em>', '*' => '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."

Slide 28

Slide 28 text

This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here.
Output: This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here.

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Tentative solution: Use find and replace
We can’t use simple find and replace as the symbols are ambiguous: _, * and ** can each be either a start or an ending symbol.

Slide 31

Slide 31 text

What’s missing
Some way to substitute the first _ with <em> and its second occurrence with </em>. You need to keep state in memory while reading the stream of characters.
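The "memory state" idea can be sketched with the block form of gsub. This is only an illustration, not code from the talk: the TAGS table and the toggle_tags helper are hypothetical names, and the sketch assumes symbols are never nested or left unbalanced.

```ruby
# Hypothetical lookup table: which HTML tag each Markdown symbol maps to.
TAGS = { '**' => 'strong', '_' => 'em', '*' => 'em' }.freeze

# Hypothetical helper: remembers, per symbol, whether it is currently open,
# and emits an opening or a closing tag accordingly.
def toggle_tags(text)
  opened = Hash.new(false) # the memory state kept while reading the text
  text.gsub(/\*\*|_|\*/) do |symbol|
    tag = TAGS.fetch(symbol)
    opened[symbol] = !opened[symbol]
    opened[symbol] ? "<#{tag}>" : "</#{tag}>"
  end
end

puts toggle_tags('text with _emphasis_ and *emphasis* and **important** words')
# prints: text with <em>emphasis</em> and <em>emphasis</em> and <strong>important</strong> words
```

The state lives outside the pattern itself, which is exactly what plain find and replace cannot express.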

Slide 32

Slide 32 text

Let’s try to use RegExp with “patterns”…

Slide 33

Slide 33 text

Regular Expression
• /\*\*|_|\*/ will match all the special characters we need, but without holding the state of the first/second match.

Slide 34

Slide 34 text

Regular Expression
• /(\*\*)(.*)(\*\*)/ we expect it to match content surrounded by **
• /(_|\*)(.*)(_|\*)/ we expect it to match content surrounded by * or _

Slide 35

Slide 35 text

/(\*\*)(.*)(\*\*)/
• \*\* formatting symbols
• ( ) capture groups
• . matches any character
• * repeats zero or more

Slide 36

Slide 36 text

/(_|\*)(.*)(_|\*)/
• _ and \* formatting symbols
• ( ) capture groups
• . matches any character
• * repeats zero or more
• | logical “or”

Slide 37

Slide 37 text

### Fourth attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!(/(\*\*)(.*)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
content.gsub!(/(_|\*)(.*)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."

Slide 38

Slide 38 text

"This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."

Slide 39

Slide 39 text

Slide 40

Slide 40 text

/(_|\*)(.*)(_|\*)/
• . matches any character
• * repeats zero or more (greedy operator)
• | logical “or”
The greedy .* matches every character of the text until the end of it, fails to find the last capture group, and then backtracks (discards characters) until the RegExp is satisfied.

Slide 41

Slide 41 text

/(_|\*)(.*?)(_|\*)/
• . matches any character
• * repeats zero or more (greedy operator)
• ? after a greedy operator makes it lazy
• | logical “or”
A lazy matcher consumes as little as possible with the dot and then evaluates the rest of the RegExp. If that fails it backtracks, but instead of discarding characters it expands by one more character and tries again.
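A small sketch of the difference between the greedy and the lazy quantifier. The sample string and the two helper names are made up for illustration; only the quantifier changes between the two patterns.

```ruby
SAMPLE = 'some _first_ and _second_ words'

# Greedy: .* runs to the end of the text, then backtracks to the LAST underscore.
def greedy_capture(text)
  text[/_(.*)_/, 1]
end

# Lazy: .*? grows one character at a time and stops at the FIRST closing underscore.
def lazy_capture(text)
  text[/_(.*?)_/, 1]
end

puts greedy_capture(SAMPLE) # prints: first_ and _second
puts lazy_capture(SAMPLE)   # prints: first
```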

Slide 42

Slide 42 text

### Fifth attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!(/(\*\*)(.*?)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."
content.gsub!(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and really <strong>important</strong> message here."

Slide 43

Slide 43 text

Slide 44

Slide 44 text

We can still improve it…

Slide 45

Slide 45 text

/(_|\*)(.*?)(_|\*)/
The opening and closing formatting symbols are not guaranteed to match each other.

Slide 46

Slide 46 text

/(_|\*)(.*?)(\1)/
• \1 matches the content of the first capture group
When the symbol in the first capture group is *, the last capture group will look for * only.
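To see what the backreference buys us, here is a hedged comparison on an input that mixes both symbols. The constant and method names are hypothetical, and the `<em>` replacement string is an assumption about what the slides used.

```ruby
LOOSE  = /(_|\*)(.*?)(_|\*)/ # closing symbol may differ from the opening one
STRICT = /(_|\*)(.*?)(\1)/   # \1 forces the same symbol to close the span

def emphasize(text, pattern)
  text.gsub(pattern, '<em>\2</em>')
end

input = 'a *snake_case* word and _b_'

puts emphasize(input, LOOSE)
# prints: a <em>snake</em>case<em> word and </em>b_
puts emphasize(input, STRICT)
# prints: a <em>snake_case</em> word and <em>b</em>
```

With the loose pattern the underscore inside snake_case closes a span that was opened by *, which throws off every later match.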

Slide 47

Slide 47 text

Things can get complicated for block level markups

Slide 48

Slide 48 text

Slide 49

Slide 49 text

Let’s talk about compilers

Slide 50

Slide 50 text

“Compiler is a program that can read a program in one language (source code) and translate it into an equivalent program in another language (machine code)”

Slide 51

Slide 51 text

Fundamental parts of a compiler
‣ Lexical Analyzer (Scanner)
‣ Syntax Analyzer (Parser)
‣ Semantic Analyzer
‣ Intermediate Code Generation
‣ Code Generation
‣ Code Optimizer

Slide 52

Slide 52 text

Diagram: file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Context Handling → annotated AST → Intermediate Code Gen. → IC → IC optimization → IC → Code generation → symbol instructions → Target code optimization → symbol instructions → Machine Code Generation → bit patterns → Executable code output → file

Slide 53

Slide 53 text

We need only a slice of it

Slide 54

Slide 54 text

What a generic Markdown to HTML transpiler looks like
Diagram: file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → HTML Converter → HTML → HTML Document → file

Slide 55

Slide 55 text

Diagram: file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → HTML Converter → HTML → HTML Document → file

Slide 56

Slide 56 text

Understanding how a Scanner works…
A scanner needs to identify only a finite set of valid tokens/lexemes that belong to the language rules.

Slide 57

Slide 57 text

Glossary
• Token: a string with an assigned meaning, structured as a tuple with a type and an optional value (e.g. keywords, literals, numbers, operators…)
• Lexeme: a sequence of characters in the source program that matches the pattern for a token

Slide 58

Slide 58 text

A simple scanner example

Slide 59

Slide 59 text

Simple scanner example: return only words from a text.
Rules:
1. A "Word" token is formed by a consecutive stream of ASCII letters.
2. Accepts only visible ASCII characters from a US keyboard.
3. Spaces or special characters are token separators.

Slide 60

Slide 60 text

EBNF Grammar
Content -> Word | Separator
Word -> Char
Char -> [a-zA-Z]
Separator -> [!@#$%^&*()-_=+ ]
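A minimal sketch of a scanner for this grammar, using StringScanner from Ruby's standard library. Several things here are assumptions rather than part of the slides: Word is read as one or more Char, the hyphen in the Separator set is escaped so it is not treated as a range, characters outside the grammar are silently dropped, and the names WORD, SEPARATOR and scan_words are hypothetical.

```ruby
require 'strscan'

WORD = /[a-zA-Z]+/
# Built from a single-quoted string so that "#$" is not read as interpolation.
SEPARATOR = Regexp.new('[!@#$%^&*()\-_=+ ]+')

# Returns only the words found in the text, in order of appearance.
def scan_words(text)
  scanner = StringScanner.new(text)
  words = []
  until scanner.eos?
    if (word = scanner.scan(WORD))
      words << word
    elsif scanner.skip(SEPARATOR)
      next # separators only delimit tokens, they produce none
    else
      scanner.getch # assumption: ignore any character outside the grammar
    end
  end
  words
end

puts scan_words('Hello World!').join(', ')
# prints: Hello, World
```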

Slide 61

Slide 61 text

Diagram: file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → HTML Converter → HTML → HTML Document → file

Slide 62

Slide 62 text

Example of how the tokens are parsed…
This text is in *bold*
['This', 'text', 'is', 'in', '*', 'bold', '*']
[{0: 'This'}, {5: 'text'}, {10: 'is'}, {13: 'in'}, {16: '*'}, {17: 'bold'}, {21: '*'}]
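The position-annotated token list above could be produced along these lines. This is a sketch under assumptions: the tokenize name is hypothetical, only letters and * are recognized as lexemes, and positions are byte offsets (equal to character offsets for ASCII input).

```ruby
require 'strscan'

# Returns an array of { start_position => lexeme } hashes.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    start = scanner.pos
    lexeme = scanner.scan(/[a-zA-Z]+/) || scanner.scan(/\*/)
    if lexeme
      tokens << { start => lexeme }
    else
      scanner.getch # whitespace and anything else only separates tokens
    end
  end
  tokens
end

tokenize('This text is in *bold*').each do |token|
  position, lexeme = token.first
  puts "#{position}: #{lexeme}"
end
```

Keeping the start position next to each lexeme is what later allows error messages to point at the right place in the source.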

Slide 63

Slide 63 text

Tokenization

Slide 64

Slide 64 text

Representation using s-expression syntax
Input: This is a simple text: Hello World!
Tokens: (content (word This) (word is) (word a) (word simple) (word text) (word Hello) (word World))
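For illustration, the s-expression form can be generated from a list of words in a couple of lines. The to_sexp helper is a hypothetical name, and String#scan stands in for a real scanner here.

```ruby
# Wraps each word in (word ...) and the whole list in (content ...).
def to_sexp(words)
  items = words.map { |word| "(word #{word})" }
  "(content #{items.join(' ')})"
end

words = 'This is a simple text: Hello World!'.scan(/[a-zA-Z]+/)
puts to_sexp(words)
# prints: (content (word This) (word is) (word a) (word simple) (word text) (word Hello) (word World))
```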

Slide 65

Slide 65 text

AST

Slide 66

Slide 66 text

Why do compilers have an AST?
After tokenization, tokens are converted to a tree structure, and they can be annotated with information required by later steps.
A few example use cases: error handling (displaying the file and line number), dynamic type inference, ambiguous (context-dependent) grammars, transformations, etc.

Slide 67

Slide 67 text

A few other Markdown-specific examples
When we generate a header token and place it inside an AST, we can annotate it with the header level it represents, for example.
An image token only has the URL, but there is probably an alternate text after it that can be combined with it into a single node in the AST.
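Both annotations can be sketched with a small node type. Everything below is hypothetical and for illustration only: the Node struct, the helper names and the attribute keys are not taken from any library, and no HTML escaping is done.

```ruby
# Hypothetical AST node: a type, an optional value and a hash of annotations.
Node = Struct.new(:type, :value, :attributes)

# Turns a line such as "## Title" into a header node annotated with its level.
def header_node(line)
  match = line.match(/\A(#+) +(.*)\z/)
  return nil unless match

  Node.new(:header, match[2], { level: match[1].length })
end

# Combines the URL token and the alternate text token into one image node.
def image_node(url, alt_text)
  Node.new(:image, nil, { src: url, alt: alt_text })
end

# A later step can rely on the annotations instead of re-reading the source.
def render(node)
  case node.type
  when :header
    level = node.attributes[:level]
    "<h#{level}>#{node.value}</h#{level}>"
  when :image
    %(<img src="#{node.attributes[:src]}" alt="#{node.attributes[:alt]}">)
  end
end

puts render(header_node('## Tokenization'))
# prints: <h2>Tokenization</h2>
```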

Slide 68

Slide 68 text

How can we use all of that in real life?

Slide 69

Slide 69 text

Slide 70

Slide 70 text

https://gitlab.com/gitlab-com/gitlab-docs

Slide 71

Slide 71 text

Introducing GitLab Kramdown: https://gitlab.com/gitlab-org/gitlab_kramdown

Slide 72

Slide 72 text

Many of the things we talked about earlier were instrumental to the Kramdown implementation

Slide 73

Slide 73 text

Diagram: file → Program text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → HTML Converter → HTML → HTML Document → file

Slide 74

Slide 74 text

Kramdown::Parser
Kramdown differentiates Span Parsers from Block Parsers. The Parser delegates pattern matching to the StringScanner and generates an AST in the form of Kramdown::Document and Kramdown::Element.
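The general shape of "a parser that delegates pattern matching to StringScanner and builds a tree of elements" can be sketched with the standard library alone. This is not Kramdown's code or API: the Element struct, parse_spans and to_html are hypothetical stand-ins, only span-level emphasis is handled, and no HTML escaping is done.

```ruby
require 'strscan'

# Hypothetical stand-in for an AST element (type, text value, children).
Element = Struct.new(:type, :value, :children)

# Span parser: tries each pattern at the current position and appends a node.
def parse_spans(text)
  scanner = StringScanner.new(text)
  root = Element.new(:root, nil, [])
  until scanner.eos?
    if scanner.scan(/\*\*(.+?)\*\*/)
      root.children << Element.new(:strong, nil, [Element.new(:text, scanner[1], [])])
    elsif scanner.scan(/([_*])(.+?)\1/)
      root.children << Element.new(:em, nil, [Element.new(:text, scanner[2], [])])
    else
      # Plain text up to the next symbol, or a lone symbol taken literally.
      plain = scanner.scan(/[^_*]+/) || scanner.getch
      root.children << Element.new(:text, plain, [])
    end
  end
  root
end

# Converter: walks the tree and emits HTML.
def to_html(element)
  inner = element.children.map { |child| to_html(child) }.join
  case element.type
  when :text then element.value
  when :root then inner
  else "<#{element.type}>#{inner}</#{element.type}>"
  end
end

puts to_html(parse_spans('a _simple_ and **important** text'))
# prints: a <em>simple</em> and <strong>important</strong> text
```

Separating the parser from the converter mirrors the pipeline from the earlier slides: the tree is the hand-off point between the two steps.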