Lenient HTML parsing with Purescript

Justin Woo
February 10, 2017

Small talk about my lenient HTML parsing Purescript library

https://github.com/justinwoo/purescript-lenient-html-parser

Transcript

  1. Lenient HTML parsing with Purescript
    Justin Woo
    10 Feb

  2. What is Purescript?
    Functional programming language inspired by Haskell
    Produces first-class Javascript without runtime costs
    Strict evaluation
    Great FFI - write as much or little “raw” Javascript as you want

  3. What is a parser?
    Technically:
    “A parser is a software component that takes input data (frequently text) and builds a data structure” - Wikipedia
    For my uses:
    Something that will take my text and give me data in the type I specified OR some error so I know what to do with it

  4. What is HTML?
    This weird jumbled mess of XML that I have to get through in order to scrape pages

  5. Why “Lenient”?
    Because…
    ● I’m too lazy to read the HTML spec
    ● Many websites are broken; your web browser auto-corrects them
    ● Want to greedily collect as much crap as possible

  6. Why write one?
    ● For fun!
      ○ For some twisted definition of “fun”
    ● Existing implementations I saw try to nest tags correctly to preserve the document structure
      ○ I only want a flat list of tags, in that I accept that everything in HTML is suffering
    ● Learning experience
    ● Can’t do this with regex
      ○ (correctly)
      ○ (sanely)
    ● Don’t want to just use cheerio via FFI because that’s not as cool
      ○ Even though that’s what I did before out of laziness

  7. Implementation

  8. Modeling lenient HTML
    Use Algebraic Data Types!
    I really have no idea what the actual HTML spec names for these are
        data Tag
          = TagOpen TagName Attributes
          | TagSingle TagName Attributes
          | TagClose TagName
          | TNode String
        newtype TagName = TagName String
        type Attributes = List Attribute
        data Attribute = Attribute Name Value
        newtype Name = Name String
        newtype Value = Value String
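
As a rough illustration of the flat model, the sketch below writes out by hand the list of Tags that should correspond to the snippet `<a href="x">hi</a><br/>`. The types are copied from the slide; the module name `Example.Tags` and the value `example` are made up for this illustration.

```purescript
module Example.Tags where

import Data.List (List(..), (:))

-- Types as on the slide.
newtype TagName = TagName String
newtype Name = Name String
newtype Value = Value String
data Attribute = Attribute Name Value
type Attributes = List Attribute

data Tag
  = TagOpen TagName Attributes
  | TagSingle TagName Attributes
  | TagClose TagName
  | TNode String

-- Hypothetical value: <a href="x">hi</a><br/> as a flat list.
-- Nothing is nested; the closing tag is just another element.
example :: List Tag
example =
  TagOpen (TagName "a") (Attribute (Name "href") (Value "x") : Nil)
    : TNode "hi"
    : TagClose (TagName "a")
    : TagSingle (TagName "br") Nil
    : Nil
```

The newtypes keep tag names, attribute names, and attribute values from being mixed up, even though all three are plain strings underneath.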

  9. Purescript-string-parsers in a nutshell
    ● Bunch of primitive combinators (“A function or definition with no free variables.”)
      ○ char :: Char -> Parser Char
      ○ string :: String -> Parser String
      ○ Etc.
    ● A runner
      ○ runParser :: forall a. Parser a -> String -> Either ParseError a
    ● (Some other utilities that are useful but I don’t use for this project)
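
A minimal sketch of how these pieces fit together, assuming the module layout that purescript-string-parsers used around the time of this talk (`Text.Parsing.StringParser` and `Text.Parsing.StringParser.String`); newer releases have reorganized these modules, so the import lines may need adjusting. The names `Example.Run`, `openA`, `ok`, and `bad` are hypothetical.

```purescript
module Example.Run where

import Prelude
import Data.Either (Either)
-- Assumed module names (library version from early 2017).
import Text.Parsing.StringParser (Parser, ParseError, runParser)
import Text.Parsing.StringParser.String (char, string)

-- Combine two primitives: match the literal "<a", then a '>'.
-- (<*) runs both parsers and keeps the result of the left one.
openA :: Parser String
openA = string "<a" <* char '>'

-- Succeeds: the input starts with "<a>".
ok :: Either ParseError String
ok = runParser openA "<a>"

-- Fails: the input does not start with "<a", so this is a Left.
bad :: Either ParseError String
bad = runParser openA "<b>"
```

The `Either` result is the "data in the type I specified OR some error" from the earlier slide: a caller pattern matches on `Right` and `Left` and decides what to do in each case.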

  10. Building block 1: skip “space”
    I want to
    ● Skip both whitespace and comments (I don’t need them!)
    ● Recursively skip “spaces” (my grouping of those two)
      ○ Purescript being strictly evaluated, this requires a trick called “fix”
        ■ fix :: forall l. Lazy l => (l -> l) -> l
        ■ Our Parser type comes with an instance for Lazy, so we’re good to go!

  11. Skip “space” cont.
    And so, my “algorithm”:
    1) Try to skip a comment and start over skipping spaces. If this fails, try the following:
    2) Try to skip one or many whitespace characters and start over skipping spaces. If this fails, try the following:
    3) There is no more to do. Return Unit.
        skipSpace :: Parser Unit
        skipSpace = fix \_ ->
          (comment *> skipSpace)
            <|> (many1 ws *> skipSpace)
            <|> pure unit
          where
            ws = satisfy \c ->
              c == '\n' ||
              c == '\r' ||
              c == '\t' ||
              c == ' '

  12. Skip comment
    What’s a comment?
    ● It begins with “<!--” and ends with “-->”
    ● I don’t care about the contents, I just need it parsed away
        comment :: Parser Unit
        comment = do
          _ <- string "<!--"
          _ <- manyTill anyChar (string "-->")
          pure unit

  13. BB 2&3: lexemes and name parser
    Lexeme: basically, the smallest block that I actually care about.
        lexeme :: forall p. Parser p -> Parser p
        lexeme p = p <* skipSpace
    I.e. for a given parser, I want to use it to parse my target and then snip off all of the trailing “spaces”
    Name: tag and attribute names can have just about any character in my lenient HTML, except for “particles” such as =, space, <, >, / and ".
        validNameString :: Parser String
        validNameString =
          flattenChars <$> many1
            (noneOf ['=', ' ', '<', '>', '/', '"'])
        flattenChars :: List Char -> String
        flattenChars =
          trim <<< fromCharArray <<< fromFoldable

  14. Getting to work
    ● What I want in the end is a flat List of Tags
    ● Tags are separated by arbitrary amounts of space (and text nodes)
    ● My HTML input might have leading spaces, so I should clear those
    ● A Tag is going to be either an actual tag or a text node
        tags :: Parser (List Tag)
        tags = do
          skipSpace
          many tag
        tag :: Parser Tag
        tag = lexeme do
          tagOpenOrSingleOrClose <|> tnode

  15. What is a text node?
    ● Well, has text
    ● Doesn’t have an angle bracket in the front
      ○ All text nodes that want to display an angle bracket require escaping?
    ● Just about any text up until we see the opening of a tag (i.e. an actual tag or comment block) can be considered part of the text node
        tnode :: Parser Tag
        tnode = lexeme do
          TNode <<< flattenChars <$> many1
            (satisfy ((/=) '<'))

  16. What is a “normal tag”?
    ● Starts with angle bracket
    ● It could be a closing tag
      ○ Close the tag by parsing out a slash
      ○ Grab the name string
      ○ Finalize by parsing out the closing angle bracket
      ○ Return the closing tag
    ● Otherwise, it’s an open tag or a single (“self-closing”??)
        tagOpenOrSingleOrClose :: Parser Tag
        tagOpenOrSingleOrClose = lexeme $
          char '<' *> (closeTag <|> tagOpenOrSingle)
        closeTag :: Parser Tag
        closeTag = lexeme do
          _ <- char '/'
          name <- validNameString
          _ <- char '>'
          pure $ TagClose (TagName name)

  17. What is an “open or single tag”?
    ● Has a name
    ● Has zero to many attributes
      ○ First need to answer what an attribute is
    ● What is an “attribute”?
      ○ Has a name
      ○ Might have an equal sign if it has a value
        ■ We need to parse out =”
        ■ Grab everything inside until the next quote
      ○ Otherwise we can pretend it is attrib=””
        ■ As far as I know
        attribute :: Parser Attribute
        attribute = lexeme do
          name <- validNameString
          value <- (flattenChars <$> getValue)
            <|> pure ""
          pure $ Attribute (Name name) (Value value)
          where
            getValue =
              string "=\"" *> manyTill
                (noneOf ['"']) (char '"')

  18. What is an “open or single tag”? cont.
    ● Finally, we can do this!
    ● Get our name
    ● Get our zero-to-many attributes
    ● Then we need to figure out
      ○ If it ends with angle bracket right away, then it’s a normal open tag
      ○ If it ends with slash-angle bracket, then we know it’s a single/“self-closing” tag
      ○ Otherwise I think it’s fair to fail the parser here and complain about broken HTML.
        tagOpenOrSingle :: Parser Tag
        tagOpenOrSingle = lexeme do
          tagName <- lexeme $ TagName <$> validNameString
          attrs <- many attribute <|> pure mempty
          let spec' = spec tagName attrs
          closeTagOpen spec'
            <|> closeTagSingle spec'
            <|> fail "no closure in sight for tag opening"
          where
            spec tagName attrs constructor =
              constructor tagName attrs
            closeTagOpen f =
              char '>' *> pure (f TagOpen)
            closeTagSingle f =
              string "/>" *> pure (f TagSingle)

  19. That’s it!
    We handled all four cases of Tags that we’re interested in, so we’re done writing our parser.
    We can add a few convenience functions just for our sake:
        parse :: forall a. Parser a -> String -> Either ParseError a
        parse p s = runParser p s
        parseTags :: String -> Either ParseError (List Tag)
        parseTags s = parse tags s
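
To suggest why a flat list is enough for scraping, here is a small sketch that collects the `href` values of all `a` tags from a `List Tag`, such as the list inside a successful `parseTags` result. The types are redeclared so the block stands alone; `Example.Links`, `hrefOf`, `links`, and `sample` are hypothetical names and not part of the library.

```purescript
module Example.Links where

import Data.List (List(..), (:), concatMap, mapMaybe)
import Data.Maybe (Maybe(..))

-- Types as on the modeling slide.
newtype TagName = TagName String
newtype Name = Name String
newtype Value = Value String
data Attribute = Attribute Name Value
type Attributes = List Attribute

data Tag
  = TagOpen TagName Attributes
  | TagSingle TagName Attributes
  | TagClose TagName
  | TNode String

-- Hypothetical helper: the value of an attribute if it is named "href".
hrefOf :: Attribute -> Maybe String
hrefOf (Attribute (Name "href") (Value v)) = Just v
hrefOf _ = Nothing

-- Hypothetical helper: every href found on an opening <a> tag.
links :: List Tag -> List String
links = concatMap fromTag
  where
  fromTag :: Tag -> List String
  fromTag (TagOpen (TagName "a") attrs) = mapMaybe hrefOf attrs
  fromTag _ = Nil

-- Evaluates to ("x" : Nil).
sample :: List String
sample = links
  ( TagOpen (TagName "a") (Attribute (Name "href") (Value "x") : Nil)
      : TNode "hi"
      : TagClose (TagName "a")
      : Nil
  )
```

Because the list is flat, a single pass with pattern matching is enough; no tree traversal is needed.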

  20. Testing

  21. Setup and Utility methods
    ● I used purescript-test-unit here
    ● Comes with
      ○ runTest (for the whole thing)
      ○ suite (identify your suite)
      ○ test (define test cases)
      ○ assert/failure
    ● Need utilities for
      ○ Testing my parser, taking a parser and input string
      ○ Testing for tags to be created from snippet of HTML
    ● Then need to throw everything at it
        testParser p s expected =
          case parse p s of
            Right x -> do
              assert "parsing worked:" $ x == expected
            Left e ->
              failure $ "parsing failed: " <> show e
        expectTags str exp =
          case parseTags str of
            Right x -> do
              assert "this should work" $ x == exp
            Left e -> do
              failure (show e)

  22. Parser tests
        main = runTest do
          suite "LenientHtmlParser" do
            test "tnode" $
              testParser tnode "a b c " $
                TNode "a b c"
            test "attribute" $
              testParser attribute "abc=\"1223\"" $
                Attribute (Name "abc") (Value "1223")
            test "empty attribute" $
              testParser attribute "abc=\"\"" $
                Attribute (Name "abc") (Value "")
            test "tag close" $
              testParser tag "</crap>" $
                TagClose (TagName "crap")
            test "tag single" $
              testParser tag "<crap/>" $
                TagSingle (TagName "crap") mempty
            test "tag open" $
              testParser tag "<crap> " $
                TagOpen (TagName "crap") mempty
            test "tag open with attr" $
              testParser tag "<crap a=\"sdf\"> " $
                TagOpen (TagName "crap") (pure
                  (Attribute (Name "a") (Value "sdf")))

  23. HTML parsing “real world” test sample
    Basically, just sanity tests with fixtures e.g.
        testHtml = """
        Trash
        [悪因悪果] 今季のゴミ - 01 [140p].avi
        """
        test "parseTags" do
          expectTags testHtml expectedTestTags
        test "multiple comments" do
          expectTags testMultiCommentHtml
            expectedMultiCommentTestTags
        test "test fixtures/crap.html" do
          text <- readTextFile UTF8
            "fixtures/crap.html"
          either
            (failure <<< show)
            (const success)
            (parseTags text)

  24. That’s it!

  25. Conclusions
    Hopefully I’ve shown you that
    ● Writing an HTML (or any format) parser in Purescript is fun
    ● You don’t necessarily have to be an expert on FP crap to get started
      ○ Functor, Applicative, Alternative, Monad, Monoid, Foldable, etc. were used here fairly transparently
      ○ Why worry about abstract details when you know the concrete instantiation works?
    ● Being able to model the right data structure in the beginning saves a whole lot of work
      ○ E.g. what if our Tag type was just { type :: String, content :: { name :: String, attribute :: List String } }? This would allow us to represent too many impossible states and be frustrating
        ■ It’d be hard to work with
        ■ And the compiler would know hardly anything either
    ● Repo here: https://github.com/justinwoo/purescript-lenient-html-parser

  26. Fin
    Questions, comments?
