Lenient HTML parsing with Purescript

Slide 1

Slide 1 text

Lenient HTML parsing with Purescript
Justin Woo
10 Feb

Slide 2

Slide 2 text

What is Purescript?
● Functional programming language inspired by Haskell
● Compiles to plain Javascript with no separate language runtime to ship
● Strict evaluation
● Great FFI - write as much or little “raw” Javascript as you want

Slide 3

Slide 3 text

What is a parser?
Technically: “A parser is a software component that takes input data (frequently text) and builds a data structure” - Wikipedia
For my uses: something that will take my text and give me data in the type I specified OR some error so I know what to do with it

Slide 4

Slide 4 text

What is HTML?
This weird jumbled mess of XML that I have to get through in order to scrape pages

Slide 5

Slide 5 text

Why “Lenient”?
Because…
● I’m too lazy to read the HTML spec
● Many websites are broken; your web browser auto-corrects them
● Want to greedily collect as much crap as possible

Slide 6

Slide 6 text

Why write one?
● For fun!
  ○ For some twisted definition of “fun”
● Existing implementations I saw try to nest tags correctly to preserve the document structure
  ○ I only want a flat list of tags, since I accept that everything in HTML is suffering
● Learning experience
● Can’t do this with regex
  ○ (correctly)
  ○ (sanely)
● Don’t want to just use cheerio via FFI because that’s not as cool
  ○ Even though that’s what I did before out of laziness

Slide 7

Slide 7 text

Implementation

Slide 8

Slide 8 text

Modeling lenient HTML
Use Algebraic Data Types!

    data Tag
      = TagOpen TagName Attributes
      | TagSingle TagName Attributes
      | TagClose TagName
      | TNode String

I really have no idea what the actual HTML spec names for these are

    newtype TagName = TagName String
    type Attributes = List Attribute
    data Attribute = Attribute Name Value
    newtype Name = Name String
    newtype Value = Value String
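As a rough illustration of why a flat list of tags is convenient for scraping, the sketch below collects link targets from a List Tag. The module name and the helpers lookupAttr, hrefs and example are hypothetical and not from the talk; the types are repeated from the slide only so the block stands alone.

```purescript
module Example.Links where

import Prelude

import Data.List (List(..), mapMaybe, (:))
import Data.Maybe (Maybe(..))

-- Types repeated from the slide so this sketch is self-contained.
data Tag
  = TagOpen TagName Attributes
  | TagSingle TagName Attributes
  | TagClose TagName
  | TNode String

newtype TagName = TagName String
type Attributes = List Attribute
data Attribute = Attribute Name Value
newtype Name = Name String
newtype Value = Value String

-- Hypothetical helper: value of the first attribute with the given name.
lookupAttr :: String -> Attributes -> Maybe String
lookupAttr _ Nil = Nothing
lookupAttr key (Attribute (Name n) (Value v) : rest)
  | n == key = Just v
  | otherwise = lookupAttr key rest

-- Hypothetical helper: every href found on an opening <a> tag.
-- No tree walking is needed because the tags are already flat.
hrefs :: List Tag -> List String
hrefs = mapMaybe go
  where
  go (TagOpen (TagName "a") attrs) = lookupAttr "href" attrs
  go _ = Nothing

-- A hand-built tag list standing in for parser output.
example :: List String
example =
  hrefs
    ( TagOpen (TagName "a") (Attribute (Name "href") (Value "/x") : Nil)
        : TNode "link"
        : TagClose (TagName "a")
        : Nil
    )
-- example is ("/x" : Nil)
```

Because each constructor carries exactly the fields it needs, the pattern match in go cannot accidentally ask a closing tag or a text node for attributes.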

Slide 9

Slide 9 text

Purescript-string-parsers in a nutshell
● Bunch of primitive combinators (“A function or definition with no free variables.”)
  ○ char :: Char -> Parser Char
  ○ string :: String -> Parser String
  ○ Etc.
● A runner
  ○ runParser :: forall a. Parser a -> String -> Either ParseError a
● (Some other utilities that are useful but I don’t use for this project)
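A minimal sketch of how the primitives and the runner fit together. The module paths are an assumption: they match some releases of purescript-string-parsers (other releases expose the same functions under different module names). brTag and result are hypothetical names.

```purescript
module Example.Nutshell where

import Prelude

import Data.Either (Either)
-- Assumed module paths; adjust to the installed library version.
import Text.Parsing.StringParser (ParseError, Parser, runParser)
import Text.Parsing.StringParser.CodePoints (char, string)

-- Match '<', then the literal "br", then '>'.
-- `*>` keeps the result on its right, `<*` keeps the result on its left,
-- so only the String produced by `string "br"` survives.
brTag :: Parser String
brTag = char '<' *> string "br" <* char '>'

-- Running a parser yields either an error or the parsed value.
result :: Either ParseError String
result = runParser brTag "<br>"
```

Here result should be a Right holding "br"; feeding the same parser "<hr>" should give a Left with a ParseError instead.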

Slide 10

Slide 10 text

Building block 1: skip “space”
I want to
● Skip both whitespace and comments (I don’t need them!)
● Recursively skip “spaces” (my grouping of those two)
  ○ Because Purescript is strictly evaluated, a parser that refers to itself needs a trick called “fix”
    ■ fix :: forall l. Lazy l => (l -> l) -> l
    ■ Our Parser type comes with an instance for Lazy, so we’re good to go!
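To show “fix” on its own, here is a small sketch of a recursive parser that counts how deeply parentheses are nested. It is not part of the talk's parser: depth is a hypothetical name, and the string-parsers module paths are assumed and may differ between library versions. fix itself comes from Control.Lazy.

```purescript
module Example.Fix where

import Prelude

import Control.Alt ((<|>))
import Control.Lazy (fix)
-- Assumed module paths; adjust to the installed library version.
import Text.Parsing.StringParser (Parser)
import Text.Parsing.StringParser.CodePoints (char)

-- `fix` hands the parser a deferred reference to itself (`self`),
-- so the recursive definition does not loop forever under strict evaluation.
depth :: Parser Int
depth = fix \self ->
  -- Either an opening paren, a nested structure, and a closing paren...
  (char '(' *> (add 1 <$> self) <* char ')')
    -- ...or nothing at all, which has depth 0.
    <|> pure 0
```

Running depth on "(())" with runParser should produce Right 2: each successfully matched pair adds one to the depth found inside it.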

Slide 11

Slide 11 text

Skip “space” cont.
And so, my “algorithm”:
1) Try to skip a comment and start over skipping spaces. If this fails, try the following:
2) Try to skip one or more whitespace characters and start over skipping spaces. If this fails, try the following:
3) There is no more to do. Return Unit.

    skipSpace :: Parser Unit
    skipSpace = fix \_ ->
      (comment *> skipSpace)
        <|> (many1 ws *> skipSpace)
        <|> pure unit
      where
        ws = satisfy \c ->
          c == '\n' || c == '\r' || c == '\t' || c == ' '

Slide 12

Slide 12 text

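Skip comment
What’s a comment?
● It begins with “<!--”
● It ends with “-->”
● I don’t care about the contents, I just need it parsed away

    comment :: Parser Unit
    comment = do
      -- Results are bound to _ because current compilers reject
      -- silently discarding a non-Unit value in a do block.
      _ <- string "<!--"
      _ <- manyTill anyChar $ string "-->"
      pure unit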

Slide 13

Slide 13 text

BB 2&3: lexemes and name parser
Lexeme: basically, the smallest block that I actually care about.

    lexeme :: forall p. Parser p -> Parser p
    lexeme p = p <* skipSpace

I.e. for a given parser, I want to use it to parse my target and then snip off all of the trailing “spaces”

Name: tag and attribute names can have just about any character in my lenient HTML, except for the delimiter characters listed in noneOf below.

    validNameString :: Parser String
    validNameString =
      flattenChars
        <$> many1 (noneOf ['=', ' ', '<', '>', '/', '"'])

    flattenChars :: List Char -> String
    flattenChars = trim <<< fromCharArray <<< fromFoldable

Slide 14

Slide 14 text

Getting to work
● What I want in the end is a flat List of Tags
● Tags are separated by arbitrary amounts of space (and text nodes)
● My HTML input might have leading spaces, so I should clear those
● A Tag is going to be either an actual tag or a text node

    tags :: Parser (List Tag)
    tags = do
      skipSpace
      many tag

    tag :: Parser Tag
    tag = lexeme do
      tagOpenOrSingleOrClose <|> tnode

Slide 15

Slide 15 text

What is a text node?
● Well, has text
● Doesn’t have an angle bracket in the front
  ○ All text nodes that want to display an angle bracket require escaping?
● Just about any text up until we see the opening of a tag (i.e. an actual tag or comment block) can be considered part of the text node

    tnode :: Parser Tag
    tnode = lexeme do
      TNode <<< flattenChars <$> many1 (satisfy ((/=) '<'))

Slide 16

Slide 16 text

What is a “normal tag”?
● Starts with angle bracket
● It could be a closing tag
  ○ Recognize the closing tag by parsing out a slash
  ○ Grab the name string
  ○ Finalize by parsing out the closing angle bracket
  ○ Return the closing tag
● Otherwise, it’s an open tag or a single (“self-closing”??)

    tagOpenOrSingleOrClose :: Parser Tag
    tagOpenOrSingleOrClose = lexeme $
      char '<' *> (closeTag <|> tagOpenOrSingle)

    closeTag :: Parser Tag
    closeTag = lexeme do
      _ <- char '/'
      name <- validNameString
      _ <- char '>'
      pure $ TagClose (TagName name)

Slide 17

Slide 17 text

What is an “open or single tag”?
● Has a name
● Has zero to many attributes
  ○ First need to answer what an attribute is
● What is an “attribute”?
  ○ Has a name
  ○ Might have an equal sign if it has a value
    ■ We need to parse out =”
    ■ Grab everything inside until the next quote
  ○ Otherwise we can pretend it is attrib=””
    ■ As far as I know

    attribute :: Parser Attribute
    attribute = lexeme do
      name <- validNameString
      value <- (flattenChars <$> getValue) <|> pure ""
      pure $ Attribute (Name name) (Value value)
      where
        getValue = string "=\"" *> manyTill (noneOf ['"']) (char '"')

Slide 18

Slide 18 text

What is an “open or single tag”? cont.
● Finally, we can do this!
● Get our name
● Get our zero-to-many attributes
● Then we need to figure out
  ○ If it ends with angle bracket right away, then it’s a normal open tag
  ○ If it ends with slash-angle bracket, then we know it’s a single/“self-closing” tag
  ○ Otherwise I think it’s fair to fail the parser here and complain about broken HTML.

    tagOpenOrSingle :: Parser Tag
    tagOpenOrSingle = lexeme do
      tagName <- lexeme $ TagName <$> validNameString
      attrs <- many attribute <|> pure mempty
      let spec' = spec tagName attrs
      closeTagOpen spec'
        <|> closeTagSingle spec'
        <|> fail "no closure in sight for tag opening"
      where
        spec tagName attrs constructor = constructor tagName attrs
        closeTagOpen f = char '>' *> pure (f TagOpen)
        closeTagSingle f = string "/>" *> pure (f TagSingle)

Slide 19

Slide 19 text

That’s it!
We handled all four cases of Tags that we’re interested in, so we’re done writing our parser.
We can add a few convenience functions just for our sake:

    parse :: forall a. Parser a -> String -> Either ParseError a
    parse p s = runParser p s

    parseTags :: String -> Either ParseError (List Tag)
    parseTags s = parse tags s

Slide 20

Slide 20 text

Testing

Slide 21

Slide 21 text

Setup and Utility methods
● I used purescript-test-unit here
● Comes with
  ○ runTest (for the whole thing)
  ○ suite (identify your suite)
  ○ test (define test cases)
  ○ assert/failure
● Need utilities for
  ○ Testing my parser, taking a parser and input string
  ○ Testing for tags to be created from snippet of HTML
● Then need to throw everything at it

    testParser p s expected =
      case parse p s of
        Right x -> do
          assert "parsing worked:" $ x == expected
        Left e ->
          failure $ "parsing failed: " <> show e

    expectTags str exp =
      case parseTags str of
        Right x -> do
          assert "this should work" $ x == exp
        Left e -> do
          failure (show e)

Slide 22

Slide 22 text

Parser tests

    main = runTest do
      suite "LenientHtmlParser" do
        test "tnode" $
          testParser tnode "a b c " $ TNode "a b c"
        test "attribute" $
          testParser attribute "abc=\"1223\"" $
            Attribute (Name "abc") (Value "1223")
        test "empty attribute" $
          testParser attribute "abc=\"\"" $
            Attribute (Name "abc") (Value "")
        test "tag close" $
          testParser tag "</crap>" $ TagClose (TagName "crap")
        test "tag single" $
          testParser tag "<crap/>" $ TagSingle (TagName "crap") mempty
        test "tag open" $
          testParser tag "<crap> " $ TagOpen (TagName "crap") mempty
        test "tag open with attr" $
          testParser tag "<crap a=\"sdf\"> " $
            TagOpen (TagName "crap") (pure (Attribute (Name "a") (Value "sdf")))

Slide 23

Slide 23 text

HTML parsing “real world” test sample
Basically, just sanity tests with fixtures, e.g.

    testHtml = """
    Trash [悪因悪果] 今季のゴミ - 01 [140p].avi
    """

    test "parseTags" do
      expectTags testHtml expectedTestTags
    test "multiple comments" do
      expectTags testMultiCommentHtml expectedMultiCommentTestTags
    test "test fixtures/crap.html" do
      text <- readTextFile UTF8 "fixtures/crap.html"
      either (failure <<< show) (const success) (parseTags text)

Slide 24

Slide 24 text

That’s it!

Slide 25

Slide 25 text

Conclusions
Hopefully I’ve shown you that
● Writing an HTML (or any format) parser in Purescript is fun
● You don’t necessarily have to be an expert on FP crap to get started
  ○ Functor, Applicative, Alternative, Monad, Monoid, Foldable, etc. were used here fairly transparently
  ○ Why worry about abstract details when you know the concrete instantiation works?
● Being able to model the right data structure in the beginning saves a whole lot of work
  ○ E.g. what if our Tag type was just { type :: String, content :: { name :: String, attribute :: List String } }? This would let us represent too many impossible states and be frustrating
    ■ It’d be hard to work with
    ■ And the compiler would hardly know anything either
● Repo here: https://github.com/justinwoo/purescript-lenient-html-parser

Slide 26

Slide 26 text

Fin
Questions, comments?