On Scraper

Simon Pai
September 29, 2017

Transcript

  1. Scenario: Working on websites with…

    • DOM rendered by JavaScript
    • DOM layout by CSS
    • Data access that requires user actions
    • Huge DOM tree
    • Various page types
    • Forever changing version
  2. Scenario: Working on websites with…

    • DOM rendered by JavaScript
    • DOM layout by CSS
    • Data access that requires user actions
      ◦ Headless/minimal browsers are maturing
      ◦ In the worst case, we have Selenium
  3. Scenario: Working on websites with…

    • Huge DOM tree → Powerful DOM search utilities
    • Various page types
    • Forever changing version
    → Flexible architecture
    → Robust development
      ▪ More search utilities, more options
  4. Topics

    1. Architecture
      • Paradigm: Unrendering
    2. DOM search utilities:
      • CSS Selector query function
      • Tree node position comparison
    3. Robust development
      • Document Query Language
    4. Unsolved problems
  5. Pain...

    • Tedious parameter model.
      ◦ Satisfying two types of tasks results in a union of both input models.
      ◦ Or… it leads to two separate scrapers with many shared functions.
    • Parameters carried all over the scraping logic.
      ◦ If-else jungles.
      ◦ Many subroutines with longAndSubtlyDifferentNamesLikeThis.
  6. Traditional Scraper Paradigm

    (Diagram labels: Scraper, Task Parameters, Result, Source)
    • Mixture of reverse engineering of their data rendering and our business logic
  7. Paradigm: Unrendering

    (Diagram labels: Scraper, Task Parameters, Result, Source, Data View, Task Handler)
    • Data View: Work on data retrieval
    • Task Handler: Work on task requirement
  8. Typical Data View API

    class ArticleFeedPageView {
      ArticleFeedPageView(DOMDocument document) { … }
      List<ArticleComponent> getArticles() { … }  // lazy eval
      boolean hasMoreArticles() { … }
      // invalidates getArticles() cache
      void loadMoreArticles() { … }
    }
  9. Typical Task Handler

    ArticleTaskParams params;
    ArticleFeedPageView view = new ArticleFeedPageView(document);
    // keep loading until the page runs out of articles or the task limit is reached
    while (view.hasMoreArticles() && view.getArticles().size() < params.limit()) {
      view.loadMoreArticles();
    }
    List<Article> articles = view.getArticles().stream()
        .map(ArticleComponent::getArticle)
        .collect(Collectors.toList());
  10. Bonus

    • We may have a shell which dynamically interprets task handler code in a dev environment.
  11. Recap: CSS Selector Model

    Simple selector: predicate of Node (Node → boolean)
    • *
    • tag
    • #id
    • .class
    • [attr="value"]
    • :pseudo-class
  12. Recap: CSS Selector Model: Pseudo Class

    Pseudo Class: parameterized predicate of Node (String[] → Node → boolean)
    • :first-child
    • :nth-child(4n+3)
    • :checked
    • :hover
    • :not(selector)
    Pseudo Element: map from Node to another object (Node → Object)
    • ::before
    • ::first-letter
  13. Recap: CSS Selector Model

    Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
    • tag.class[attr="value"]
    Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
    • #id.class1 .class2 > input[value="yes"]:focus
    Selector Group ::= Selector [, Selector]*
    • div#id, div.class, a[attr]
  14. Recap: CSS Selector Model: Combinator

    • Descendant Combinator
      ◦ a b
    • Child Combinator
      ◦ a > b
    • Adjacent Sibling Combinator
      ◦ a + b
    • General Sibling Combinator
      ◦ a ~ b
    See: https://www.w3.org/TR/css3-selectors/#grammar
  15. Pain...

    // feature buried in attributes...
    <div data-json="{x:100}">...</div>
    // feature buried in text...
    <p>Text in certain date format...</p>
    // can’t anchor text nodes...
    <p>Some text <a>link</a> some text</p>
    // only takes simple selector...
    :not(div.some-class)
  16. Solution

    • We need to support pseudo class extension.
      ◦ Pseudo Class: String[] → Node → boolean
    • If our selector query utility is not hackable, write our own utility.
      ◦ Query Function: Selector Group, Node → Iterator<Node>
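
One way the pseudo class extension point above might look, as a minimal sketch rather than the speaker's implementation. `PcNode`, `PseudoClassRegistry`, and the pseudo class names `text-matches` and `attr-contains` are hypothetical, introduced only to illustrate the `String[] → Node → boolean` shape.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical minimal node type, standing in for a real DOM element.
class PcNode {
    final String tag;
    final String text;
    final Map<String, String> attrs = new HashMap<>();

    PcNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }
}

// Registry of pseudo classes, each modeled as String[] -> (Node -> boolean).
class PseudoClassRegistry {
    private final Map<String, Function<String[], Predicate<PcNode>>> factories = new HashMap<>();

    void register(String name, Function<String[], Predicate<PcNode>> factory) {
        factories.put(name, factory);
    }

    // Applies the arguments and returns the resulting node predicate.
    Predicate<PcNode> resolve(String name, String... args) {
        Function<String[], Predicate<PcNode>> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown pseudo class: " + name);
        }
        return factory.apply(args);
    }

    // Two illustrative (assumed) extensions for the pains listed earlier.
    static PseudoClassRegistry withExamples() {
        PseudoClassRegistry registry = new PseudoClassRegistry();
        // :text-matches(regex) targets a feature buried in text.
        registry.register("text-matches", args -> {
            Pattern pattern = Pattern.compile(args[0]);
            return node -> node.text != null && pattern.matcher(node.text).find();
        });
        // :attr-contains(name, substring) targets a feature buried in an attribute.
        registry.register("attr-contains", args -> node -> {
            String value = node.attrs.get(args[0]);
            return value != null && value.contains(args[1]);
        });
        return registry;
    }
}
```

A selector parser would call `resolve` once per pseudo class it encounters and AND the returned predicate into the simple selector sequence.
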
  17. Query Function

    • Query Function: Selector Group, Node → Iterator<Node>
      ◦ Iterate by DFS (depth first search), filtered by Match Function
    • Match Function: Selector Group, Node → boolean
      ◦ Disjunction (OR) over selectors in selector group
      ◦ = match(selector1, node) || match(selector2, node) || ...
    Selector Group ::= Selector [, Selector]*
    • div#id, div.class, a[attr]
  18. Query Function

    • Match Function: Selector, Node → boolean
      ◦ Later
    • Match Function: Simple Selector Sequence, Node → boolean
      ◦ Conjunction (AND) over simple selectors in the sequence
      ◦ = match("#id", node) && match(".class", node) && …
      ◦ Simple selector itself is defined as predicate of node (Node → boolean)
      ◦ Custom pseudo classes are handled here
    Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
    • tag.class[attr="value"]
  19. Query Function

    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Child combinator: (>)
        ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.parent)
      ◦ Adjacent sibling combinator: (+)
        ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.previousSibling)
      ◦ Use dynamic programming
        ▪ Recursion with cache
    Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
    • #id.class1 .class2 > input[value="yes"]:focus
  20. Query Function

    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Descendant combinator:
        ▪ = match("[simple selector sequence]", node) && (match("[upstream selector]", node.parent) || match("[upstream selector]", node.parent.parent) || match("[upstream selector]", node.parent.parent.parent) || …)
  21. Query Function

    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Descendant combinator:
        ▪ = match("[simple selector sequence]", node) && (match("[upstream selector]", node.parent) || match("[upstream selector]", node.parent.parent) || match("[upstream selector]", node.parent.parent.parent) || …)
          = match("[simple selector sequence]", node) && match_ANC("[upstream selector]", node.parent)
        ▪ Where match_ANC("[selector]", node)
          = match("[selector]", node) || match("[selector]", node.parent) || match("[selector]", node.parent.parent) || …
          = match("[selector]", node) || match_ANC("[selector]", node.parent)
        ▪ Do dynamic programming on match_ANC as well: O(depth) → O(1) overall in DFS
  22. Query Function

    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ General sibling combinator:
        ▪ Same as descendant combinator, walking previous siblings instead of ancestors
    • Done!
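
The recurrences on the last few slides can be put together roughly as follows. This is a sketch under assumptions, not the speaker's code: `QNode`, `Step`, `Combinator`, and `SelectorMatcher` are hypothetical names, a selector is assumed to be pre-parsed into steps whose simple selector sequences are plain predicates, and the caches are only valid while the tree is not mutated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical minimal node type with the two links the combinators need.
class QNode {
    final String tag;
    QNode parent;
    QNode previousSibling;
    final List<QNode> children = new ArrayList<>();

    QNode(String tag) { this.tag = tag; }

    QNode add(QNode child) {
        child.parent = this;
        if (!children.isEmpty()) child.previousSibling = children.get(children.size() - 1);
        children.add(child);
        return this;
    }
}

enum Combinator { DESCENDANT, CHILD, ADJACENT_SIBLING, GENERAL_SIBLING }

// One simple selector sequence plus the combinator linking it to its upstream step.
class Step {
    final Combinator combinator; // null for the first (leftmost) step
    final Predicate<QNode> sequence;

    private Step(Combinator combinator, Predicate<QNode> sequence) {
        this.combinator = combinator;
        this.sequence = sequence;
    }

    static Step first(Predicate<QNode> sequence) { return new Step(null, sequence); }

    static Step then(Combinator combinator, Predicate<QNode> sequence) {
        return new Step(combinator, sequence);
    }
}

class SelectorMatcher {
    private final List<Step> steps;
    // Recursion with cache: one memo table per step for match and for match_ANC.
    private final List<Map<QNode, Boolean>> matchCache = new ArrayList<>();
    private final List<Map<QNode, Boolean>> chainCache = new ArrayList<>();

    SelectorMatcher(Step... steps) {
        this.steps = Arrays.asList(steps);
        for (int i = 0; i < steps.length; i++) {
            matchCache.add(new HashMap<>());
            chainCache.add(new HashMap<>());
        }
    }

    boolean matches(QNode node) { return match(steps.size() - 1, node); }

    // match(selector made of steps 0..i, node)
    private boolean match(int i, QNode node) {
        if (node == null) return false;
        Boolean cached = matchCache.get(i).get(node);
        if (cached != null) return cached;
        boolean result = steps.get(i).sequence.test(node) && (i == 0 || upstream(i, node));
        matchCache.get(i).put(node, result);
        return result;
    }

    private boolean upstream(int i, QNode node) {
        switch (steps.get(i).combinator) {
            case CHILD: return match(i - 1, node.parent);
            case ADJACENT_SIBLING: return match(i - 1, node.previousSibling);
            case DESCENDANT: return matchChain(i - 1, node.parent, true);
            default: return matchChain(i - 1, node.previousSibling, false);
        }
    }

    // match_ANC when viaParent is true; the same recurrence over previous siblings otherwise.
    // A step is only ever chained in one direction, so one cache per step is enough.
    private boolean matchChain(int i, QNode node, boolean viaParent) {
        if (node == null) return false;
        Boolean cached = chainCache.get(i).get(node);
        if (cached != null) return cached;
        QNode next = viaParent ? node.parent : node.previousSibling;
        boolean result = match(i, node) || matchChain(i, next, viaParent);
        chainCache.get(i).put(node, result);
        return result;
    }
}
```

The query function would then be a DFS over the subtree that keeps the nodes for which `matches` returns true, OR-ing over one `SelectorMatcher` per selector in the group.
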
  23. Query Function

    • Can we do better?
      ◦ Optimize on id selector
      ◦ Optimize on simple selector sequence conjunction order
      ◦ Optimize on selector group disjunction order
      ◦ DFS pruning
    • Can we do more?
      ◦ element.querySelectorAll("+ .some-class")
      ◦ Younger-sibling-first DFS
      ◦ Child-first DFS
        ▪ Beware of time cost
  24. Pain...

    // feature within younger siblings...
    <div>(our target)</div>
    <div class="strong-feature"></div>
    // feature within nephews...
    <div>
      <div class="strong-feature"></div>
    </div>
    <div>(our target)</div>
  25. A Common Scenario

    (Figure: page layout with Header, Body, Footer)
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
    • How to express the position constraints of header/footer to target body?
  26. Tree Node Position

    (Figure: a tree with root Root and nodes labeled A, B, C, D, E)
    • A: before
    • B: after
    • C: descendant
    • D: ancestor
    • E: equal
  27. A Common Scenario

    (Figure: page layout with Header, Body, Footer)
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
    • How to express the position constraints of header/footer to target body?
      ◦ Body is [after] header AND [before] footer
  28. Comparing Two Tree Nodes’ Position

    (Figures: a tree with root Root and nodes M and N, repeated for each step)
    • Get ancestors
    • Get youngest common ancestor
    • Compare order of first different ancestors
    Time Order = O(depth) + O(branching factor)
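
The ancestor-based comparison on this slide could look like the sketch below. `ANode` and `AncestorCompare` are hypothetical names, the result is returned as a plain string for brevity, and both nodes are assumed to belong to the same tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal tree node with a parent link and ordered children.
class ANode {
    final String name;
    ANode parent;
    final List<ANode> children = new ArrayList<>();

    ANode(String name, ANode... kids) {
        this.name = name;
        for (ANode kid : kids) {
            kid.parent = this;
            children.add(kid);
        }
    }
}

class AncestorCompare {
    // Path from the root down to the node, inclusive: O(depth).
    static List<ANode> pathFromRoot(ANode node) {
        List<ANode> path = new ArrayList<>();
        for (ANode n = node; n != null; n = n.parent) path.add(n);
        Collections.reverse(path);
        return path;
    }

    // Position of m relative to n: "before", "after", "descendant", "ancestor" or "equal".
    // Assumes m and n share the same root.
    static String compare(ANode m, ANode n) {
        List<ANode> pm = pathFromRoot(m);
        List<ANode> pn = pathFromRoot(n);
        int i = 0;
        // Skip the common ancestors.
        while (i < pm.size() && i < pn.size() && pm.get(i) == pn.get(i)) i++;
        if (i == pm.size() && i == pn.size()) return "equal";
        if (i == pm.size()) return "ancestor";    // m's path is a prefix of n's path
        if (i == pn.size()) return "descendant";  // n's path is a prefix of m's path
        // pm.get(i) and pn.get(i) are different children of the youngest common
        // ancestor; their order among its children decides: O(branching factor).
        List<ANode> siblings = pm.get(i - 1).children;
        return siblings.indexOf(pm.get(i)) < siblings.indexOf(pn.get(i)) ? "before" : "after";
    }
}
```
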
  29. Comparing Two Tree Nodes’ Position

    • Can we do better?
    • For an immutable tree, after O(n) indexing, node position comparison is O(1)
    • Key:
      ◦ Define leaf index, L(N) = index of leaf node in DFS order, for any leaf node N.
      ◦ Define left bound, LB(N) = leaf index of the oldest (leftmost) leaf descendant of N (including itself)
      ◦ Define right bound, RB(N) = leaf index of the youngest (rightmost) leaf descendant of N (including itself)
  30. Comparing Two Tree Nodes’ Position

    (Figure: a tree with root Root, node N, and regions A, B, C, D)
    Position of M relative to N:
    • A: before: M.rightBound < N.leftBound
    • B: after: M.leftBound > N.rightBound
    • C: descendant: not (A or B), M.depth > N.depth
    • D: ancestor: not (A or B), M.depth < N.depth
    • E: equal: otherwise
  31. Comparing Two Tree Nodes’ Position

    • Bounds can be computed in one DFS
      ◦ At leaf node:
        ▪ leftBound = rightBound = leafIndex = ++currentIndex;
      ◦ On entering non-leaf node:
        ▪ leftBound = currentIndex + 1;
      ◦ On exiting non-leaf node:
        ▪ rightBound = currentIndex;
    (Figure: node N under Root, above leaf indices … x-1 x ... y y+1 ...)
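
The indexing pass and the constant-time comparison described above can be sketched as follows. `PNode`, `Position`, and `PositionIndex` are hypothetical names; the bound and depth fields follow the definitions on the slides, and the tree is assumed not to change after indexing.

```java
import java.util.ArrayList;
import java.util.List;

enum Position { BEFORE, AFTER, DESCENDANT, ANCESTOR, EQUAL }

// Hypothetical minimal tree node carrying the index fields.
class PNode {
    final String name;
    final List<PNode> children = new ArrayList<>();
    int leftBound;
    int rightBound;
    int depth;

    PNode(String name, PNode... kids) {
        this.name = name;
        for (PNode kid : kids) children.add(kid);
    }
}

class PositionIndex {
    private int currentIndex = 0;

    // One DFS over the whole tree: O(n).
    static void index(PNode root) {
        new PositionIndex().visit(root, 0);
    }

    private void visit(PNode node, int depth) {
        node.depth = depth;
        if (node.children.isEmpty()) {
            // Leaf: both bounds are its own leaf index.
            node.leftBound = node.rightBound = ++currentIndex;
            return;
        }
        node.leftBound = currentIndex + 1;   // on entering a non-leaf node
        for (PNode child : node.children) visit(child, depth + 1);
        node.rightBound = currentIndex;      // on exiting a non-leaf node
    }

    // Position of m relative to n, O(1) once the tree is indexed.
    static Position compare(PNode m, PNode n) {
        if (m.rightBound < n.leftBound) return Position.BEFORE;
        if (m.leftBound > n.rightBound) return Position.AFTER;
        // Leaf ranges overlap, so one node contains the other or they are the same node.
        if (m.depth > n.depth) return Position.DESCENDANT;
        if (m.depth < n.depth) return Position.ANCESTOR;
        return Position.EQUAL;
    }
}
```

With the header/body/footer layout from the earlier slide, "body is after header and before footer" becomes two `compare` calls against the indexed nodes.
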
  32. Bottom-up Search Utility

    node.descendants()   // NodeStream extends Stream<Node>
        .before(node1)   // filter by n -> compare(n, node1) == BEFORE
        .after(node2);

    body = node.descendants()
        .after(header)
        .before(footer)
        .filter(...)
        .first();
  33. Development over Robust Materials

    • Search elements by robust features
      ◦ Presence of login form
      ◦ Permalink
    • Try to follow their code structure
      ◦ Conway’s Law
    • Scrape based on robust hypotheses
      ◦ Need to collect documents
      ◦ Need the ability to run experiments over a document set

    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");
  34. Document Query Language

    SELECT * FROM [Doc_1, Doc_2] WHERE *

    • Doc_1
      ◦ <document>
        ▪ <head>
          • <title>
          • <meta>
          • …
        ▪ <body>
          • …
    • Doc_2
      ◦ …
  35. Document Query Language

    SELECT * FROM [Doc_1, Doc_2] WHERE .c

    • Doc_1
      ◦ <div class="a b c">
      ◦ <section class="c">
        ▪ <div class="a b c">
          • <section class="c">
      ◦ <div class="a b c">
      ◦ …
    • Doc_2
      ◦ …
  36. Document Query Language

    SELECT tag FROM [Doc_1, Doc_2] WHERE .c

    • Doc_1
      ◦ div
      ◦ section
        ▪ div
          • section
      ◦ div
      ◦ …
    • Doc_2
      ◦ …
  37. Document Query Language

    SELECT COUNT(*) FROM [Doc_1, Doc_2] WHERE .c

    • Doc_1 (10)
      ◦ 1
      ◦ 3
        ▪ 2
          • 1
      ◦ 1
      ◦ …
    • Doc_2 (10)
      ◦ …
  38. Document Query Language

    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");

    SELECT COUNT(*) FROM /document/feed/* WHERE #load-more-btn
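
A hypothesis like the one above could be checked against a collected document set roughly like this. The sketch does not implement the query language itself: `DocNode` and `HypothesisCheck` are hypothetical names, documents are assumed to be already parsed into trees, and the WHERE clause is stood in for by a plain predicate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical minimal parsed-document node.
class DocNode {
    final String tag;
    final String id; // may be null
    final List<DocNode> children = new ArrayList<>();

    DocNode(String tag, String id, DocNode... kids) {
        this.tag = tag;
        this.id = id;
        for (DocNode kid : kids) children.add(kid);
    }
}

class HypothesisCheck {
    // COUNT(*) ... WHERE <predicate>, over one document tree.
    static int count(DocNode node, Predicate<DocNode> where) {
        int total = where.test(node) ? 1 : 0;
        for (DocNode child : node.children) total += count(child, where);
        return total;
    }

    // One count per document, in the order the documents were supplied.
    static Map<String, Integer> countPerDocument(Map<String, DocNode> docs,
                                                 Predicate<DocNode> where) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, DocNode> entry : docs.entrySet()) {
            counts.put(entry.getKey(), count(entry.getValue(), where));
        }
        return counts;
    }

    // "Unique and always present" holds only if every document has exactly one match.
    static boolean uniqueAndAlwaysPresent(Map<String, Integer> counts) {
        for (int c : counts.values()) {
            if (c != 1) return false;
        }
        return true;
    }
}
```

A document whose count is 0 or greater than 1 is a counterexample worth inspecting before the scraper relies on the selector.
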
  39. We deserve a better browser

    (Diagram, axis from Primitive to Sophisticated; labels: HTTP Connection, HTTP Client, Real Browser, Headless Browser, ?)
  40. QA on optional element

    • How to tell the difference between the absence of an optional element and a broken scraper?
      ◦ We can tell by statistics, but can we know earlier?
  41. Untemplating?

    • Given a set of documents, can we reverse engineer them into templates and arguments?