@brodock
blog.gabrielmazetto.eti.br
Gabriel Mazetto !
2
Slide 3
Slide 3 text
3
Photo by Grillot edouard
Slide 4
Slide 4 text
4
Slide 5
Slide 5 text
5
Slide 6
Slide 6 text
6
Slide 7
Slide 7 text
7
Building your own
Flavored Markdown
with Ruby
Slide 8
Slide 8 text
What is Markdown?
8
Slide 9
Slide 9 text
“Markdown is intended to be an
easy-to-read and easy-to-write
text to HTML markup
language”
9
Markdown
Slide 10
Slide 10 text
What is a Markup
Language?
10
Slide 11
Slide 11 text
“A markup language is a system
for annotating a document in a
way that is syntactically
distinguishable from the text.”
11
Markup Language
Slide 12
Slide 12 text
Welcome to Markdown
-------------------
This is a **simple** markup that supports _formatting_ using
just plain text symbols.
Not all markdowns are born the same: while the [original syntax](https://daringfireball.net/projects/markdown/syntax)
is mostly unchanged between the many different **flavors**, many have
their own **extensions** and can interpret things like _multiple
spaces_ or headers differently.
```ruby
puts "There are also many ways to share code snippets"
```
Slide 13
Slide 13 text
Slide 14
Slide 14 text
Text to HTML
conversion?
14
…
Slide 15
Slide 15 text
15
Let’s try to convert a very simple
Markdown document
Slide 16
Slide 16 text
Input:
This is a very simple text with _emphasis_ here also
*emphasis* on something else and a really **important**
message here.
Expected Output:
This is a very simple text with <em>emphasis</em> here also
<em>emphasis</em> on something else and a really
<strong>important</strong> message here.
Slide 17
Slide 17 text
1. Anything surrounded by _ or *
(a single character) should be modified
to be surrounded by <em> and </em>
2. Anything surrounded by **
should be modified to be
surrounded by <strong> and </strong>
17
Experiment Rules
Slide 18
Slide 18 text
Let’s see the places
where we should
modify the text
18
Slide 19
Slide 19 text
This is a very simple text with _emphasis_ here also
*emphasis* on something else and really **important**
message here.
Slide 20
Slide 20 text
Tentative solution:
Use Find and Replace
20
Slide 21
Slide 21 text
### First attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on
# something else and really **important** message here."
content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em>
# on something else and really <em><em>important<em><em> message here."
content.gsub!('**', '<strong>')
# => nil (no "**" is left to replace, so gsub! returns nil and content is unchanged)
Slide 22
Slide 22 text
22
Slide 23
Slide 23 text
• We can find and replace _, but if
we try to find just the * we can’t
easily distinguish the ones that
need to become <em> from the
ones that need to become <strong>
• Alternative: find the doubled
symbols (**) first and convert them,
then convert the single-character ones.
23
Tentative solution:
Use find and replace
Slide 24
Slide 24 text
The order we
process the rules,
DOES matter
24
Slide 25
Slide 25 text
### Second attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!('**', '<strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis* on
# something else and really <strong>important<strong> message here."
content.gsub!('_', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also *emphasis* on
# something else and really <strong>important<strong> message here."
content.gsub!('*', '<em>')
# => "This is a very simple text with <em>emphasis<em> here also <em>emphasis<em>
# on something else and really <strong>important<strong> message here."
Slide 26
Slide 26 text
26
Slide 27
Slide 27 text
### Third attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!(/\*\*|_|\*/, '**' => '<strong>', '_' => '<em>', '*' => '<em>')
# => "This is a very simple text with <em>emphasis<em>
# here also <em>emphasis<em> on something else and
# really <strong>important<strong> message here."
Slide 28
Slide 28 text
This is a very simple text with _emphasis_ here also
*emphasis* on something else and really **important**
message here.
Output:
This is a very simple text with <em>emphasis<em> here also
<em>emphasis<em> on something else and really
<strong>important<strong> message here.
Slide 29
Slide 29 text
29
Slide 30
Slide 30 text
We can’t use simple find and
replace as the symbols are
ambiguous: _, * and ** can be
either a start or an ending symbol
30
Tentative solution:
Use find and replace
Slide 31
Slide 31 text
To substitute the first _ with <em>
and its second occurrence with </em>,
you need to keep state in memory
while reading the stream of characters.
31
What's missing
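A minimal sketch of that idea, assuming each symbol simply alternates between opening and closing. The names `TAGS` and `toggle_tags` are made up for this illustration and are not from the talk.

```ruby
# Map each Markdown symbol to the HTML tag name it should produce.
TAGS = { '**' => 'strong', '_' => 'em', '*' => 'em' }.freeze

# Replace symbols with opening or closing tags, remembering which are open.
def toggle_tags(text)
  opened = Hash.new(false) # state: has this symbol been opened already?
  text.gsub(/\*\*|_|\*/) do |symbol|
    closing = opened[symbol]
    opened[symbol] = !closing
    closing ? "</#{TAGS[symbol]}>" : "<#{TAGS[symbol]}>"
  end
end

puts toggle_tags('with _emphasis_ also *emphasis* and **important** here')
```

The hash is the "memory" that plain find and replace lacks. It is still fragile: an unmatched symbol leaves a tag open for the rest of the text.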
Slide 32
Slide 32 text
Let’s try to use
RegExp with
“patterns”…
32
Slide 33
Slide 33 text
• /\*\*|_|\*/ will match
all the special characters we
needed, but without holding
state of first/second match.
33
Regular Expression
Slide 34
Slide 34 text
• /(\*\*)(.*)(\*\*)/
we expect it to match content
surrounded by **
• /(_|\*)(.*)(_|\*)/
we expect it to match content
surrounded by * or _
34
Regular Expression
Slide 35
Slide 35 text
/(\*\*)(.*)(\*\*)/
35
formatting
symbols
capture groups
. matches any character
* repeats zero or more
Slide 36
Slide 36 text
/(_|\*)(.*)(_|\*)/
36
formatting
symbols
capture groups
. matches any character
* repeats zero or more
| logical “or”
Slide 37
Slide 37 text
### Fourth attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!(/(\*\*)(.*)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis*
# on something else and really <strong>important</strong> message here."
content.gsub!(/(_|\*)(.*)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis_ here also
# *emphasis</em> on something else and really <strong>important</strong>
# message here."
Slide 38
Slide 38 text
38
"This is a very simple text with
<em>emphasis_ here also
*emphasis</em> on something else
and really <strong>important</strong>
message here."
Slide 39
Slide 39 text
39
Slide 40
Slide 40 text
/(_|\*)(.*)(_|\*)/
40
. matches any character
* repeats zero or more (greedy operator)
| logical “or”
The greedy .* first matches every character up to the end of the text.
The last capture group then fails to match, so the engine
backtracks (gives characters back) until the RegExp is satisfied.
Slide 41
Slide 41 text
/(_|\*)(.*?)(_|\*)/
41
. matches any character
* repeats zero or more (greedy operator)
? after a greedy operator makes it lazy
| logical “or”
The lazy matcher consumes as few characters as possible, then
tries the rest of the RegExp. If that fails it backtracks, but
instead of giving characters back it expands by one more character.
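The difference between the two quantifiers can be seen on a short string. This is only an illustration; `greedy_capture` and `lazy_capture` are hypothetical helper names.

```ruby
# Return what the capture group grabs between underscores.
def greedy_capture(text)
  text[/_(.*)_/, 1]  # .* takes as much as it can, then backtracks
end

def lazy_capture(text)
  text[/_(.*?)_/, 1] # .*? takes as little as it can, then expands
end

sample = 'with _emphasis_ here also _more_ text'
puts greedy_capture(sample) # prints: emphasis_ here also _more
puts lazy_capture(sample)   # prints: emphasis
```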
Slide 42
Slide 42 text
### Fifth attempt
content = "This is a very simple text with _emphasis_ here also *emphasis* on something else and really **important** message here."
content.gsub!(/(\*\*)(.*?)(\*\*)/, '<strong>\2</strong>')
# => "This is a very simple text with _emphasis_ here also *emphasis*
# on something else and really <strong>important</strong> message here."
content.gsub!(/(_|\*)(.*?)(_|\*)/, '<em>\2</em>')
# => "This is a very simple text with <em>emphasis</em> here also
# <em>emphasis</em> on something else and really <strong>important</strong>
# message here."
Slide 43
Slide 43 text
43
Slide 44
Slide 44 text
We can still improve
it…
44
Slide 45
Slide 45 text
/(_|\*)(.*?)(_|\*)/
45
formatting symbols not
guaranteed to match
Slide 46
Slide 46 text
/(_|\*)(.*?)(\1)/
46
\1 backreference: the content matched by the first capture group
When the symbol in the first capture group is *, the
closing \1 will look for * only.
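A small sketch of the effect of the backreference, comparing it with the previous pattern. The constant and method names are assumptions made for this example.

```ruby
# Closing symbol may differ from the opening one.
LOOSE_EMPHASIS = /(_|\*)(.*?)(_|\*)/
# Closing symbol must be the same as the opening one (\1).
STRICT_EMPHASIS = /(_|\*)(.*?)\1/

def emphasize(text, pattern)
  text.gsub(pattern, '<em>\2</em>')
end

puts emphasize('_mixed* symbols', LOOSE_EMPHASIS)  # prints: <em>mixed</em> symbols
puts emphasize('_mixed* symbols', STRICT_EMPHASIS) # prints: _mixed* symbols
puts emphasize('_one_ and *two*', STRICT_EMPHASIS) # prints: <em>one</em> and <em>two</em>
```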
Slide 47
Slide 47 text
Things can get complicated for
block-level markup
47
Slide 48
Slide 48 text
Welcome to Markdown
-------------------
```ruby
puts "There are also many ways to **share** code snippets"
```
| This is a table              | With columns          |
|------------------------------|-----------------------|
| How will you convert this    | Into an actual table? |
| How about some ``` ambiguous | symbols inside        |
| And **formatting**           |                       |
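To illustrate why span-level regular expressions are not enough here, the sketch below applies the `**` rule to a document containing a fenced code block, then shows one possible workaround that skips fenced blocks. All names are hypothetical, and the fence is built with `'`' * 3` only to avoid writing three literal backticks inside this example.

```ruby
FENCE = '`' * 3
STRONG = /\*\*(.*?)\*\*/
# A fenced block: from a line starting with the fence to the next such line.
FENCED_BLOCK = /^#{FENCE}.*?^#{FENCE}$/m

# Naive: rewrites ** everywhere, including inside code.
def naive_strong(markdown)
  markdown.gsub(STRONG, '<strong>\1</strong>')
end

# Block-aware: split keeps the fenced blocks (captured group) untouched.
def strong_outside_fences(markdown)
  markdown.split(/(#{FENCED_BLOCK})/).map do |part|
    part.start_with?(FENCE) ? part : naive_strong(part)
  end.join
end

source = "#{FENCE}ruby\nputs \"ways to **share** code\"\n#{FENCE}\nAnd **formatting**\n"
puts naive_strong(source)          # the code sample gets <strong> tags too
puts strong_outside_fences(source) # only the last line is converted
```

This only handles one block type. Tables, nested blocks and ambiguous fences quickly make this approach unmanageable, which is the motivation for a real parser.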
Slide 49
Slide 49 text
Let’s talk about compilers
Slide 50
Slide 50 text
“Compiler is a program that
can read a program in one
language (source code) and
translate it into an equivalent
program in another language
(machine code)”
50
Slide 51
Slide 51 text
‣ Lexical Analyzer (Scanner)
‣ Syntax Analyzer (Parser)
‣ Semantic Analyzer
‣ Intermediate Code Generation
‣ Code Generation
‣ Code Optimizer
51
Fundamental parts of a
compiler
Slide 52
Slide 52 text
52
[Diagram: compiler pipeline. file -> Program text input -> characters -> Lexical Analysis -> tokens -> Syntax Analysis -> AST -> Context Handling -> annotated AST -> Intermediate Code Generation -> IC -> IC optimization -> IC -> Code generation -> symbol instructions -> Target code optimization -> symbol instructions -> Machine Code Generation -> bit patterns -> Executable code output -> file]
Slide 53
Slide 53 text
53
We need only a slice of it
Slide 54
Slide 54 text
54
What a generic Markdown
to HTML transpiler looks
like
[Diagram: file -> Program text input -> characters -> Lexical Analysis -> tokens -> Syntax Analysis -> AST -> HTML Converter -> HTML -> HTML Document file]
Slide 55
Slide 55 text
55
[Diagram: file -> Program text input -> characters -> Lexical Analysis -> tokens -> Syntax Analysis -> AST -> HTML Converter -> HTML -> HTML Document file]
Slide 56
Slide 56 text
A scanner only needs to identify
a finite set of valid tokens/lexemes
that belong to the language's
rules.
56
Understanding how a
Scanner works…
Slide 57
Slide 57 text
• Token: a string with an assigned
meaning, structured as a tuple
with a type and an optional value
(e.g. keywords, literals, numbers,
operators…)
• Lexeme: a sequence
of characters in the source
program that matches the
pattern for a token
57
Glossary
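As a rough illustration of the two terms, the sketch below turns lexemes (raw character sequences) into tokens (a type plus an optional value). The `Token` struct, the `classify` method and the token type names are assumptions for this example.

```ruby
# A token is a tuple: a type and an optional value.
Token = Struct.new(:type, :value)

# Map a lexeme (the matched characters) to a token.
def classify(lexeme)
  case lexeme
  when /\A\d+\z/ then Token.new(:number, lexeme.to_i)
  when /\A[a-zA-Z]+\z/ then Token.new(:word, lexeme)
  when '**' then Token.new(:strong_marker)   # no value needed
  when '*', '_' then Token.new(:emphasis_marker)
  else Token.new(:unknown, lexeme)
  end
end

p classify('bold') # a :word token whose value is "bold"
p classify('**')   # a :strong_marker token with no value
```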
Slide 58
Slide 58 text
58
A simple scanner example
Slide 59
Slide 59 text
Rules:
1. A "Word" token is formed by a
consecutive stream of ASCII
characters.
2. Only visible ASCII characters
from a US keyboard are accepted.
3. Spaces or special characters
act as token separators.
59
Simple scanner example:
Return only words from a
text.
Slide 60
Slide 60 text
Content -> { Word | Separator }
Word -> Char { Char }
Char -> [a-zA-Z]
Separator -> [!@#$%^&*()-_=+ ]
60
EBNF Grammar
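The rules and grammar above could be sketched with `StringScanner` from Ruby's standard library (`strscan`). The method name `words` is hypothetical, and for simplicity this sketch treats every non-letter character as a separator instead of listing the separator set.

```ruby
require 'strscan'

WORD = /[a-zA-Z]+/ # Word -> Char { Char }

# Return only the words found in the text.
def words(text)
  scanner = StringScanner.new(text)
  found = []
  until scanner.eos?
    if (word = scanner.scan(WORD))
      found << word
    else
      scanner.getch # separator (or any other character): skip it
    end
  end
  found
end

p words('This is a simple text: Hello World!')
```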
Slide 61
Slide 61 text
61
[Diagram: file -> Program text input -> characters -> Lexical Analysis -> tokens -> Syntax Analysis -> AST -> HTML Converter -> HTML -> HTML Document file]
Slide 62
Slide 62 text
This text is in *bold*
62
Example of how the tokens
are parsed…
['This', 'text', 'is', 'in', '*', 'bold', '*']
[{0: 'This'}, {5: 'text'},
{10: 'is'}, {13: 'in'},
{16: '*'}, {17: 'bold'}, {21: '*'}]
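One way to obtain tokens together with their positions, as in the second list above, is to read the offset of each match. This is a sketch under the assumption that only words and `*` are tokens; `TOKEN` and `tokens_with_positions` are illustrative names.

```ruby
TOKEN = /[a-zA-Z]+|\*/

# Return an array of { offset => lexeme } hashes.
def tokens_with_positions(text)
  tokens = []
  text.scan(TOKEN) do
    match = Regexp.last_match # MatchData for the current match
    tokens << { match.begin(0) => match[0] }
  end
  tokens
end

p tokens_with_positions('This text is in *bold*')
```

Keeping the position around is what later allows error messages to point at the right place in the source.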
Slide 63
Slide 63 text
63
Tokenization
Slide 64
Slide 64 text
Input: This is a simple
text: Hello World!
Tokens:
(content
(word This)
(word is)
(word a)
(word simple)
(word text)
(word Hello)
(word World))
64
Representation using
s-expression syntax
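For reference, a nested Ruby array can be printed in this s-expression form with a few lines. The array layout (`[type, *children]`) and the `to_sexp` name are assumptions for this sketch.

```ruby
# Render [type, child, child, ...] as "(type child child ...)".
def to_sexp(node)
  type, *children = node
  parts = children.map do |child|
    child.is_a?(Array) ? to_sexp(child) : child.to_s
  end
  "(#{[type, *parts].join(' ')})"
end

tree = [:content, [:word, 'This'], [:word, 'is'], [:word, 'a']]
puts to_sexp(tree) # prints: (content (word This) (word is) (word a))
```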
Slide 65
Slide 65 text
65
AST
Slide 66
Slide 66 text
After tokenization, tokens are
converted to a tree structure, and
they can be annotated with
information required by later steps.
A few example use cases: error
handling (displaying the file and line
number), dynamic type inference,
ambiguous grammars (context
dependent), transformations, etc.
66
Why do compilers have an AST?
Slide 67
Slide 67 text
When we generate a header token
and it’s placed inside an AST, we
can annotate it with the header
level it represents, for example.
An image token only has the URL, but
there is probably an alternate text
next to it that can be combined into a
single <img> node in the AST.
67
A few other Markdown-specific
examples
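Both examples can be sketched with a small node structure that carries annotations. The `Node` struct and the helper names are hypothetical and only illustrate the idea of annotating the tree.

```ruby
Node = Struct.new(:type, :value, :attributes)

# Turn a header line ("## Title") into a node annotated with its level.
def header_node(line)
  match = line.match(/\A([#]{1,6})\s+(.+)\z/)
  return nil unless match

  Node.new(:header, match[2], { level: match[1].length })
end

# Combine an image URL and its alternate text into a single node.
def image_node(url, alt_text)
  Node.new(:img, nil, { src: url, alt: alt_text })
end

p header_node('## Building your own')
p image_node('logo.png', 'Project logo')
```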
Many things we talked about here
before were instrumental to
the Kramdown implementation
72
Slide 73
Slide 73 text
73
[Diagram: file -> Program text input -> characters -> Lexical Analysis -> tokens -> Syntax Analysis -> AST -> HTML Converter -> HTML -> HTML Document file]
Slide 74
Slide 74 text
Kramdown differentiates Span
Parsers from Block Parsers.
The Parser delegates pattern
matching to the StringScanner,
and it generates an AST in the
form of Kramdown::Document
and Kramdown::Element.
74
Kramdown::Parser
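To give a feel for a parser that delegates pattern matching to `StringScanner` and builds a tree of elements, here is a heavily simplified sketch. It is not Kramdown's code: `Element`, `parse_spans` and `to_html` are names invented for this example, only span-level emphasis is handled, and no HTML escaping is done.

```ruby
require 'strscan'

Element = Struct.new(:type, :value, :children)

# Hypothetical span parser: scans the text and builds a flat tree.
def parse_spans(text)
  scanner = StringScanner.new(text)
  root = Element.new(:root, nil, [])
  until scanner.eos?
    if scanner.scan(/\*\*(.+?)\*\*/)
      root.children << Element.new(:strong, scanner[1], [])
    elsif scanner.scan(/([_*])(.+?)\1/)
      root.children << Element.new(:em, scanner[2], [])
    else
      # Plain text up to the next symbol, or a lone unmatched symbol.
      run = scanner.scan(/[^_*]+/) || scanner.getch
      root.children << Element.new(:text, run, [])
    end
  end
  root
end

# Walk the tree and emit HTML.
def to_html(element)
  case element.type
  when :root then element.children.map { |child| to_html(child) }.join
  when :text then element.value
  else "<#{element.type}>#{element.value}</#{element.type}>"
  end
end

puts to_html(parse_spans('a _b_ *c* **d** e'))
```

Separating parsing from HTML generation means the same tree could feed other converters, which is the point of having an AST in the middle.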
Slide 75
Slide 75 text
75
Let’s see some source code
Slide 76
Slide 76 text
76
Questions?
Slide 77
Slide 77 text
If you want to learn more:
https://gitlab.com/gitlab-org/gitlab_kramdown
77