The Conf 2018 - Building your own flavored Markdown with Ruby

Gabriel Mazetto, @brodock, blog.gabrielmazetto.eti.br

Photo by Grillot edouard


What is Markdown?

“Markdown is intended to be an easy-to-read and easy-to-write text to HTML markup language”

What is a Markup Language?

“A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text.”

Welcome to Markdown
-------------------

This is a **simple** markup that supports _formatting_ using just plain text symbols.

Not all markdowns are born the same, while the [original syntax](https://daringfireball.net/projects/markdown/syntax) is mostly unchanged between the many different **flavors**, many have their own **extensions** and can interpret things like _multiple spaces_ or headers differently.

```ruby
puts "There is also many ways to share code snippets"
```

Text to HTML conversion?

Let’s try to convert a very simple Markdown document

Input:
This is a very simple text with _emphasis_ here also *emphasis* on something else and a really **important** message here.

Expected Output:
This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and a really <strong>important</strong> message here.

Experiment Rules

1. Anything surrounded by _ or * (a single symbol) should be modified to be surrounded by <em> and </em>
2. Anything surrounded by ** should be modified to be surrounded by <strong> and </strong>

Let’s see the places where we should modify the text

This is a very simple text with _emphasis_ here also
*emphasis* on something else and really **important** message here.

Tentative solution: Use Find and Replace

### First tentative
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."

content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really **important** message here."

content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <em><em>important<em><em> message here."

content.gsub!('**', '<strong>')
# => nil (no "**" is left to replace, so content is unchanged)

• We can find and replace _, but if we try to find just the *, we can’t easily distinguish the ones that need to become <em> from the ones that need to become <strong>.
• Alternative: find the symbols that repeat twice first and convert them, then convert the single-symbol ones.

The order in which we process the rules DOES matter

### Second tentative
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."

content.gsub!('**', '<strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important<strong> message here."

content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on something else and really <strong>important<strong> message here."

content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."

### Third tentative
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."

content.gsub!(/\*\*|_|\*/, '**' => '<strong>', '_' => '<em>', '*' => '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here."

This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here.

Output:
This is a very simple text with <em>emphasis<em> here also <em>emphasis<em> on something else and really <strong>important<strong> message here.

We can’t use simple find and replace as the symbols are ambiguous: _, * and ** can be either a start or an ending symbol

What’s missing

To be able to substitute the first _ with <em> and its second occurrence with </em>, you need to keep state in memory while reading the stream of characters.
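The idea of keeping state while reading the symbols can be sketched as follows. This is only an illustration, not code from the talk: the `toggle_tags` helper and the `TOGGLE_TAGS` table are hypothetical names, and the sketch assumes symbols are always properly paired.

```ruby
# Hypothetical lookup table: which HTML tag each formatting symbol maps to.
TOGGLE_TAGS = { '**' => 'strong', '_' => 'em', '*' => 'em' }.freeze

# Replaces each formatting symbol with an opening or a closing tag.
# The `opened` hash is the "memory": it remembers, per symbol, whether
# we are currently inside an open tag.
def toggle_tags(text)
  opened = Hash.new(false)

  text.gsub(/\*\*|_|\*/) do |symbol|
    tag = TOGGLE_TAGS.fetch(symbol)
    opened[symbol] = !opened[symbol]
    opened[symbol] ? "<#{tag}>" : "</#{tag}>"
  end
end

puts toggle_tags('a text with _emphasis_ here, *emphasis* there and an **important** message')
```

The alternation tries `\*\*` before `\*`, so a double asterisk is never read as two single ones.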

Let’s try to use RegExp with “patterns”…

Regular Expression

• /\*\*|_|\*/ will match all the special characters we need, but without holding the state of the first/second match.

Regular Expression

• /(\*\*)(.*)(\*\*)/ we expect it to match content surrounded by **
• /(_|\*)(.*)(_|\*)/ we expect it to match content surrounded by * or _

/(\*\*)(.*)(\*\*)/
• \*\* formatting symbols
• ( ) capture groups
• . matches any character
• * repeats zero or more times

/(_|\*)(.*)(_|\*)/
• _ and \* formatting symbols
• ( ) capture groups
• . matches any character
• * repeats zero or more times
• | logical “or”

### Fourth tentative
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."

content.gsub!(/(\*\*)(.*)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."

content.gsub!(/(_|\*)(.*)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."

"This is a very simple text with <em>emphasis_ here also *emphasis</em> on something else and really <strong>important</strong> message here."

/(_|\*)(.*)(_|\*)/
• . matches any character
• * repeats zero or more times (greedy operator)
• | logical “or”

The greedy .* first matches every character up to the end of the text. The last capture group then fails to match, so the engine backtracks (discarding characters one at a time) until the whole RegExp is satisfied.

/(_|\*)(.*?)(_|\*)/
• . matches any character
• * repeats zero or more times (greedy operator)
• ? after a greedy operator makes it lazy
• | logical “or”

The lazy matcher starts by matching as few characters as possible and then tries to evaluate the rest of the RegExp. If that fails it backtracks, but instead of discarding characters it expands the dot by one more character and tries again.
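The difference between the greedy and the lazy operator is easy to see by looking only at the second capture group. A minimal sketch, where `greedy_capture`, `lazy_capture` and `GREEDY_SAMPLE` are hypothetical names used for illustration:

```ruby
GREEDY_SAMPLE = 'some _emphasis_ here and *emphasis* there'

# Returns what the second capture group holds with a greedy ".*".
def greedy_capture(text)
  text[/(_|\*)(.*)(_|\*)/, 2]
end

# Returns what the second capture group holds with a lazy ".*?".
def lazy_capture(text)
  text[/(_|\*)(.*?)(_|\*)/, 2]
end

p greedy_capture(GREEDY_SAMPLE) # => "emphasis_ here and *emphasis"
p lazy_capture(GREEDY_SAMPLE)   # => "emphasis"
```

The greedy version runs from the first symbol to the last one in the text, while the lazy version stops at the nearest closing symbol.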

### Fifth tentative
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."

content.gsub!(/(\*\*)(.*?)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on something else and really <strong>important</strong> message here."

content.gsub!(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis</em> here also <em>emphasis</em> on something else and really <strong>important</strong> message here."

We can still improve it…

/(_|\*)(.*?)(_|\*)/
• The opening and closing formatting symbols are not guaranteed to match each other.

/(_|\*)(.*?)(\1)/
• \1 refers to the content matched by the first capture group: when the symbol in the first capture group is *, the closing group will look for * only.
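A small comparison shows why the backreference matters. The sample text and the names `LOOSE_EM`, `STRICT_EM` and `emphasize` are assumptions made for this sketch, not part of the original slides:

```ruby
LOOSE_EM  = /(_|\*)(.*?)(_|\*)/ # closing symbol may differ from the opening one
STRICT_EM = /(_|\*)(.*?)(\1)/   # closing symbol must equal the opening one

# Wraps the content of every match in <em>…</em>.
def emphasize(text, pattern)
  text.gsub(pattern, '<em>\2</em>')
end

sample = 'compute _2 * 3_ here'

puts emphasize(sample, LOOSE_EM)  # prints: compute <em>2 </em> 3_ here
puts emphasize(sample, STRICT_EM) # prints: compute <em>2 * 3</em> here
```

With the loose pattern the lone `*` closes the emphasis that `_` opened. With `\1` only another `_` can close it.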

Things can get complicated for block-level markup

Let’s talk about compilers

“A compiler is a program that can read a program in one language (source code) and translate it into an equivalent program in another language (machine code)”

Fundamental parts of a compiler

‣ Lexical Analyzer (Scanner)
‣ Syntax Analyzer (Parser)
‣ Semantic Analyzer
‣ Intermediate Code Generation
‣ Code Generation
‣ Code Optimizer

Diagram: compiler pipeline. Program text input (file) -> Lexical Analysis (characters to tokens) -> Syntax Analysis (AST) -> Context Handling (annotated AST) -> Intermediate Code Generation (IC) -> IC optimization (IC) -> Code generation (symbol instructions) -> Target code optimization (symbol instructions) -> Machine Code Generation (bit patterns) -> Executable code output (file)

We need only a slice of it

What a generic Markdown to HTML transpiler looks like

Diagram: Program text input (file) -> Lexical Analysis (characters to tokens) -> Syntax Analysis (AST) -> HTML Converter (HTML) -> HTML Document (file)


Understanding how a Scanner works…

A scanner only needs to identify the finite set of valid tokens/lexemes that belong to the language rules.

Glossary

• Token: a string with an assigned meaning, structured as a tuple with a type and an optional value (e.g. keywords, literals, numbers, operators…)
• Lexeme: a sequence of characters in the source program that matches the pattern for a token
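The token/lexeme distinction can be illustrated with Ruby's standard `StringScanner`. This is a minimal sketch: the `SketchToken` struct, the `TOKEN_RULES` table and the chosen token types are hypothetical and only serve the example.

```ruby
require 'strscan'

# A token is a (type, value) tuple; the value holds the matched lexeme.
SketchToken = Struct.new(:type, :value)

# Hypothetical rules: each token type has a pattern that its lexemes match.
TOKEN_RULES = [
  [:word,   /[a-zA-Z]+/],
  [:number, /\d+/],
  [:space,  /\s+/],
  [:symbol, /./m] # fallback: any single character
].freeze

def typed_tokens(text)
  scanner = StringScanner.new(text)
  tokens = []

  until scanner.eos?
    # Pick the first rule whose pattern matches at the current position.
    type, pattern = TOKEN_RULES.find { |_, candidate| scanner.match?(candidate) }
    lexeme = scanner.scan(pattern)
    tokens << SketchToken.new(type, lexeme) unless type == :space
  end

  tokens
end

typed_tokens('in *bold* 42').each { |token| puts "#{token.type}: #{token.value}" }
```

Whitespace is scanned but dropped here, which is a design choice of the sketch and not a rule of the talk.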

A simple scanner example

Simple scanner example: Return only words from a text.

Rules:
1. A "Word" token is formed by a consecutive stream of ASCII characters
2. Accepts only US keyboard visible ASCII characters.
3. Spaces or special characters are a token separator

EBNF Grammar

Content   -> Word | Separator
Word      -> Char
Char      -> [a-zA-Z]
Separator -> [!@#$%^&*()-_=+ ]
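A scanner for this grammar might look like the sketch below. It makes two assumptions that go beyond the grammar as written: a Word is read as one or more Char, and any character that is not a letter is treated as a separator. `scan_words` is a hypothetical name.

```ruby
require 'strscan'

# Returns only the "Word" tokens found in the text.
def scan_words(text)
  scanner = StringScanner.new(text)
  words = []

  until scanner.eos?
    if (word = scanner.scan(/[a-zA-Z]+/))
      words << word
    else
      scanner.getch # separator: consume one character and move on
    end
  end

  words
end

puts scan_words('This is a simple text: Hello World!').join(', ')
# prints: This, is, a, simple, text, Hello, World
```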

Example of how the tokens are parsed…

This text is in *bold*

['This', 'text', 'is', 'in', '*', 'bold', '*']

[{0: 'This'}, {5: 'text'}, {10: 'is'}, {13: 'in'}, {16: '*'}, {17: 'bold'}, {21: '*'}]
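The second list pairs each lexeme with the offset where it starts. A sketch of how such a list could be produced with `StringScanner` follows. `tokenize_with_offsets` is a hypothetical helper, and the offsets are byte positions, which equal character positions for ASCII input.

```ruby
require 'strscan'

# Returns an array of { offset => lexeme } hashes for words and "*" symbols.
def tokenize_with_offsets(text)
  scanner = StringScanner.new(text)
  tokens = []

  until scanner.eos?
    start = scanner.pos
    if (lexeme = scanner.scan(/[a-zA-Z]+|\*/))
      tokens << { start => lexeme }
    else
      scanner.getch # skip separators such as spaces
    end
  end

  tokens
end

tokenize_with_offsets('This text is in *bold*').each do |token|
  offset, lexeme = token.first
  puts "#{offset}: #{lexeme}"
end
```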

Tokenization

Representation using s-expression syntax

Input: This is a simple text: Hello World!

Tokens:
(content (word This) (word is) (word a) (word simple) (word text) (word Hello) (word World))
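Producing that s-expression from the input takes only a few lines. In this sketch `to_sexp` is a hypothetical helper, and words are assumed to be runs of ASCII letters:

```ruby
# Renders the words of a text as a (content (word …) …) s-expression.
def to_sexp(text)
  words = text.scan(/[a-zA-Z]+/).map { |word| "(word #{word})" }
  "(content #{words.join(' ')})"
end

puts to_sexp('This is a simple text: Hello World!')
```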

AST

Why do compilers have an AST?

After tokenization, tokens are converted to a tree structure, and they can be annotated with information required by later steps.

A few example use cases: error handling (displaying the file and line number), dynamic type inference, ambiguous grammars (context dependent), transformations, etc.

A few other Markdown-specific examples

When we generate a header token and place it inside an AST, we can annotate it with the header level it represents, for example.

An image token only has the URL, but there is probably an alternate text after it that can be combined into a single node in the AST.
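The header example could be modeled like this. It is a sketch under assumptions: `AstNode`, `header_node` and `render_header` are hypothetical names, and only ATX-style headers (`#` to `######`) are handled.

```ruby
# A node carries its type, its text value and free-form annotations.
AstNode = Struct.new(:type, :value, :attributes)

# Builds a :header node annotated with its level, or nil for other lines.
def header_node(line)
  match = line.match(/\A([#]{1,6})\s+(.+)\z/)
  return nil unless match

  AstNode.new(:header, match[2], { level: match[1].length })
end

# A later step can use the annotation without looking at the source again.
def render_header(node)
  level = node.attributes[:level]
  "<h#{level}>#{node.value}</h#{level}>"
end

puts render_header(header_node('## Experiment Rules')) # prints: <h2>Experiment Rules</h2>
```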

How can we use all of that in real life?

https://gitlab.com/gitlab-com/gitlab-docs

Introducing GitLab Kramdown: https://gitlab.com/gitlab-org/gitlab_kramdown

Many of the things we talked about earlier were instrumental to the Kramdown implementation

Kramdown::Parser

Kramdown differentiates Span Parsers from Block Parsers. The Parser delegates pattern matching to the StringScanner, and it generates an AST in the form of Kramdown::Document and Kramdown::Element.
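To make the span/block split concrete, here is a toy sketch that uses only Ruby's standard `StringScanner`. It is not Kramdown's code or API: `SketchElement`, `parse_spans` and `parse_blocks` are hypothetical names, and the rules (one block per non-empty line, only emphasis at span level) are simplifying assumptions.

```ruby
require 'strscan'

# Stand-in for an AST element: a type, an optional value and child elements.
SketchElement = Struct.new(:type, :value, :children)

# Span level: splits one line into :text and :em elements.
def parse_spans(text)
  scanner = StringScanner.new(text)
  spans = []

  until scanner.eos?
    if scanner.scan(/(_|\*)(.+?)\1/)
      spans << SketchElement.new(:em, scanner[2], [])
    else
      # Plain text: one character plus everything up to the next symbol.
      spans << SketchElement.new(:text, scanner.scan(/.[^_*]*/m), [])
    end
  end

  spans
end

# Block level: every non-empty line becomes a :header or a :p element,
# and its content is handed over to the span parser.
def parse_blocks(source)
  lines = source.each_line.map(&:strip).reject(&:empty?)
  children = lines.map do |line|
    if (match = line.match(/\A([#]{1,6})\s+(.*)\z/))
      SketchElement.new(:header, match[1].length, parse_spans(match[2]))
    else
      SketchElement.new(:p, nil, parse_spans(line))
    end
  end

  SketchElement.new(:root, nil, children)
end

tree = parse_blocks("# Hello _world_\n\nsome *text*")
tree.children.each do |block|
  puts "#{block.type}: #{block.children.map { |span| "#{span.type}(#{span.value})" }.join(' ')}"
end
```

The block parser decides what kind of element a line is, and the span parser only deals with inline formatting inside it.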

Let’s see some source code

Questions?

If you want to learn more: https://gitlab.com/gitlab-org/gitlab_kramdown

Thank You :)
