Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lenient HTML parsing with Purescript

Justin Woo
February 10, 2017

Lenient HTML parsing with Purescript

Small talk about my lenient HTML parsing Purescript library

https://github.com/justinwoo/purescript-lenient-html-parser

Justin Woo

February 10, 2017
Tweet

More Decks by Justin Woo

Other Decks in Programming

Transcript

  1. What is Purescript? Functional programming language inspired by Haskell Produces

    first-class Javascript without runtime costs Strict evaluation Great FFI - write as much or little “raw” Javascript as you want
  2. What is a parser? Technically: “A parser is a software

    component that takes input data (frequently text) and builds a data structure” - Wikipedia For my uses: Something that will take my text and give me data in the type I specified OR some error so I know what to do with it
  3. What is HTML? This weird jumbled mess of XML that

    I have to get through in order to scrape pages <table> <tr> <td> <a href=”//the-link-i-actually-want”>
  4. Why “Lenient”? Because… • I’m too lazy to read the

    HTML spec • Many websites are broken, your web browser auto-corrects them • Want to greedily collect as much crap as possible
  5. Why write one? • For fun! ◦ For some twisted

    definition of “fun” • Existing implementations I saw try to nest them correctly to preserve nesting ◦ I only want a flat list of tags, in that I accept that kaikki HTML:ssä on kärsimystä • Learning experience • Can’t do this with regex ◦ (correctly) ◦ (sanely) • Don’t want to just use cheerio via FFI because that’s not as cool ◦ Even though that’s what I did before out of laziness
  6. Modeling lenient HTML Use Algebraic Data Types! data Tag =

    TagOpen TagName Attributes | TagSingle TagName Attributes | TagClose TagName | TNode String I really have no idea what the actual HTML spec names for these are newtype TagName = TagName String type Attributes = List Attribute data Attribute = Attribute Name Value newtype Name = Name String newtype Value = Value String
  7. Purescript-string-parsers in a nutshell • Bunch of primitive combinators (“A

    function or definition with no free variables.”) ◦ char :: Char -> Parser Char ◦ string :: String -> Parser String ◦ Etc. • A runner ◦ runParser :: forall a. Parser a -> String -> Either ParseError a • (Some other utilities that are useful but I don’t use for this project)
  8. Building block 1: skip “space” I want to • Skip

    both whitespace and comments (I don’t need them!) • Recursively skip “spaces” (my grouping of those two) ◦ Purescript being strictly evaluated, this requires a trick called “fix” ▪ fix :: forall l. Lazy l => (l -> l) -> l ▪ Our Parser type comes with an instance for Lazy, so we’re good to go!
  9. Skip “space” cont. And so, my “algorithm”: 1) Try to

    skip a comment and start over skipping spaces. If this fails, try the following: 2) Try to skip one or many whitespace characters and start over skipping spaces. If this fails, try the following: 3) There is no more to do. Return Unit. skipSpace :: Parser Unit skipSpace = fix \_ -> (comment *> skipSpace) <|> (many1 ws *> skipSpace) <|> pure unit where ws = satisfy \c -> c == '\n' || c == '\r' || c == '\t' || c == ' '
  10. Skip comment What’s a comment? • It begins with “<!--”

    • It has basically any character • It ends with “-->” • I don’t care about the contents, I just need it parsed away comment :: Parser Unit comment = do string "<!--" manyTill anyChar $ string "-->" pure unit
  11. BB 2&3: lexemes and name parser Lexeme: basically, the smallest

    block that I actually care about. lexeme :: forall p. Parser p -> Parser p lexeme p = p <* skipSpace I.e. for a given parser, I want to use it to parse my target and then snip off all of the trailing “spaces” Name: tag and attribute names can have just about any character in my lenient HTML, except for particles. validNameString :: Parser String validNameString = flattenChars <$> many1 (noneOf ['=', ' ', '<', '>', '/', '"']) flattenChars :: List Char -> String flattenChars = trim <<< fromCharArray <<< fromFoldable
  12. Getting to work • What I want in the end

    is a flat List of Tags • Tags are separated by arbitrary amounts of space (and text nodes) • My HTML input might have leading spaces, so I should clear those • A Tag is going to be either an actual tag or a text node tags :: Parser (List Tag) tags = do skipSpace many tag tag :: Parser Tag tag = lexeme do tagOpenOrSingleOrClose <|> tnode
  13. What is a text node? • Well, has text •

    Doesn’t have an angle bracket in the front ◦ All text nodes that want to display an angle bracket require escaping? • Just about any text up until we see the opening of a tag can be considered part of the text node (i.e. an actual tag or comment block) tnode :: Parser Tag tnode = lexeme do TNode <<< flattenChars <$> many1 (satisfy ((/=) '<'))
  14. What is a “normal tag”? • Starts with angle bracket

    • It could be a closing tag ◦ Close the tag by parsing out a slash ◦ Grab the name string ◦ Finalize by parsing out the closing angle bracket ◦ Return the closing tag • Otherwise, it’s an open tag or a single (“self-closing”??) tagOpenOrSingleOrClose :: Parser Tag tagOpenOrSingleOrClose = lexeme $ char '<' *> (closeTag <|> tagOpenOrSingle) closeTag :: Parser Tag closeTag = lexeme do char '/' name <- validNameString char '>' pure $ TagClose (TagName name)
  15. What is a “open or single tag”? • Has a

    name • Has zero to many attributes ◦ First need to answer what an attribute is • What is an “attribute”? ◦ Has a name ◦ Might have an equal sign if it has value ▪ We need to parse out =” ▪ Grab everything inside until the next quote ◦ Otherwise we can pretend it is attrib=”” ▪ As far as I know attribute :: Parser Attribute attribute = lexeme do name <- validNameString value <- (flattenChars <$> getValue) <|> pure "" pure $ Attribute (Name name) (Value value) where getValue = string "=\"" *> manyTill (noneOf ['"']) (char '"')
  16. What is a “open or single tag”? cont. • Finally,

    we can do this! • Get our name • Get our zero-to-many attributes • Then we need to figure out ◦ If it ends with angle bracket right away, then it’s a normal open tag ◦ If it ends with slash-angle bracket, then we know it’s a single/”self-closing” tag ◦ Otherwise I think it’s fair to fail the parser here and complain about broken HTML. tagOpenOrSingle :: Parser Tag tagOpenOrSingle = lexeme do tagName <- lexeme $ TagName <$> validNameString attrs <- many attribute <|> pure mempty let spec' = spec tagName attrs closeTagOpen spec' <|> closeTagSingle spec' <|> fail "no closure in sight for tag opening" where spec tagName attrs constructor = constructor tagName attrs closeTagOpen f = char '>' *> pure (f TagOpen) closeTagSingle f = string "/>" *> pure (f TagSingle)
  17. That’s it! We handled all four cases of Tags that

    we’re interested in, so we’re done writing our parser. We can add a few convenience functions just for our sake: parse :: forall a. Parser a -> String -> Either ParseError a parse p s = runParser p s parseTags :: String -> Either ParseError (List Tag) parseTags s = parse tags s
  18. Setup and Utility methods • I used purescript-unit-test here •

    Comes with ◦ runTest (for the whole thing) ◦ Suite (identify your suite) ◦ Test (define test cases) ◦ assert/fail • Need utilities for ◦ Testing my parser, taking a parser and input string ◦ Testing for tags to be created from snippet of HTML • Then need to throw everything at it testParser p s expected = case parse p s of Right x -> do assert "parsing worked:" $ x == expected Left e -> failure $ "parsing failed: " <> show e expectTags str exp = case parseTags str of Right x -> do assert "this should work" $ x == exp Left e -> do failure (show e)
  19. Parser tests main = runTest do suite "LenientHtmlParser" do test

    "tnode" $ testParser tnode "a b c " $ TNode "a b c" test "attribute" $ testParser attribute "abc=\"1223\"" $ Attribute (Name "abc") (Value "1223") test "empty attribute" $ testParser attribute "abc=\"\"" $ Attribute (Name "abc") (Value "") test "tag close" $ testParser tag "</crap>" $ TagClose (TagName "crap") test "tag single" $ testParser tag "<crap/>" $ TagSingle (TagName "crap") mempty test "tag open" $ testParser tag "<crap> " $ TagOpen (TagName "crap") mempty test "tag open with attr" $ testParser tag "<crap a=\"sdf\"> " $ TagOpen (TagName "crap") (pure (Attribute (Name "a") (Value "sdf")))
  20. HTML parsing “real world” test sample Basically, just sanity tests

    with fixtures e.g. testHtml = """ <!DOCTYPE html> <!-- whatever --> <table> <tr> <td>Trash</td> <td class="target"> <a href="http://mylink"> [悪因悪果] 今季のゴミ - 01 [140p].avi </a> </td> </tr> </table> """ test "parseTags" do expectTags testHtml expectedTestTags test "multiple comments" do expectTags testMultiCommentHtml expectedMultiCommentTestTags test "test fixtures/crap.html" do text <- readTextFile UTF8 "fixtures/crap.html" either (failure <<< show) (const success) (parseTags text)
  21. Hopefully I’ve shown you that • Writing an HTML (or

    any format) parser in Purescript is fun • You don’t necessarily have to be an expert on FP crap to get started ◦ Functor, Applicative, Alternative, Monad, Monoid, Foldable, etc. were used here fairly transparently ◦ Why worry about abstract details when you know the concrete instantiation works? • Being able to model the right data structure in the beginning saves a whole lot of work ◦ E.g. what if our Tag type was just { type :: String, content :: { name :: String, attribute :: List String } }? This would allow us to display too many impossible states and be frustrating ▪ It’d be hard to work with ▪ And the compiler wouldn’t know hardly anything either • Repo here: https://github.com/justinwoo/purescript-lenient-html-parser Conclusions