On Scraper

Simon Pai
September 29, 2017

Transcript

  1. Scenario: Working on websites with…
    • DOM rendered by JavaScript
    • DOM layout by CSS
    • Data access that requires user actions
    • Huge DOM tree
    • Various page types
    • Constantly changing versions
  2. Scenario: Working on websites with…
    • DOM rendered by JavaScript
    • DOM layout by CSS
    • Data access that requires user actions
      ◦ Headless/minimal browsers are maturing
      ◦ In the worst case, we have Selenium
  3. Scenario: Working on websites with…
    • Huge DOM tree → Powerful DOM search utilities
    • Various page types → Flexible architecture
    • Constantly changing versions → Robust development
      ▪ More search utilities, more options
  4. Topics
    1. Architecture
      • Paradigm: Unrendering
    2. DOM search utilities
      • CSS Selector query function
      • Tree node position comparison
    3. Robust development
      • Document Query Language
    4. Unsolved problems
  5. Pain...
    • Tedious parameter model.
      ◦ Satisfying two types of task results in a union of both input models.
      ◦ Or… leads to two scrapers with many shared functions.
    • Parameters carried all over the scraping logic.
      ◦ If-else jungles.
      ◦ Many subroutines with longAndSubtlyDifferentNamesLikeThis.
  6. Traditional Scraper Paradigm
    [Diagram: Scraper, Task Parameters, Result, Source]
    • Mixture of reverse engineering of their data rendering and our business logic
  7. Paradigm: Unrendering
    [Diagram: Scraper, Task Parameters, Result, Source, Data View, Task Handler]
    • Data View: work on data retrieval
    • Task Handler: work on task requirement
  8. Typical Data View API
    class ArticleFeedPageView {
        ArticleFeedPageView(DOMDocument document) { … }
        List<ArticleComponent> getArticles() { … } // lazy eval
        boolean hasMoreArticles() { … }
        // invalidates getArticles() cache
        void loadMoreArticles() { … }
    }
  9. Typical Task Handler
    ArticleTaskParams params;
    ArticleFeedPageView view = new ArticleFeedPageView(document);
    // keep loading until the feed is exhausted or the requested limit is reached
    while (view.hasMoreArticles() && view.getArticles().size() < params.limit()) {
        view.loadMoreArticles();
    }
    List<Article> articles = view.getArticles().stream()
        .map(ArticleComponent::getArticle)
        .collect(Collectors.toList());
  10. Bonus
    • We may have a shell which dynamically interprets task handler code in a dev environment.
  11. Recap: CSS Selector Model
    Simple selector: predicate of Node (Node → boolean)
    • *
    • tag
    • #id
    • .class
    • [attr="value"]
    • :pseudo-class
  12. Recap: CSS Selector Model: Pseudo Class
    Pseudo Class: parameterized predicate of Node (String[] → Node → boolean)
    • :first-child
    • :nth-child(4n+3)
    • :checked
    • :hover
    • :not(selector)
    Pseudo Element: map from Node to another object (Node → Object)
    • ::before
    • ::first-letter
  13. Recap: CSS Selector Model
    Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
    • tag.class[attr="value"]
    Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
    • #id.class1 .class2 > input[value="yes"]:focus
    Selector Group ::= Selector [, Selector]*
    • div#id, div.class, a[attr]
  14. Recap: CSS Selector Model: Combinator
    • Descendant Combinator
      ◦ a b
    • Child Combinator
      ◦ a > b
    • Adjacent Sibling Combinator
      ◦ a + b
    • General Sibling Combinator
      ◦ a ~ b
    See: https://www.w3.org/TR/css3-selectors/#grammar
  15. Pain...
    // feature buried in attributes...
    <div data-json="{x:100}">...</div>
    // feature buried in text...
    <p>Text in certain date format...</p>
    // can't anchor text nodes...
    <p>Some text <a>link</a> some text</p>
    // only takes simple selector...
    :not(div.some-class)
  16. Solution
    • We need to support pseudo class extension.
      ◦ Pseudo Class: String[] → Node → boolean
    • If our selector query utility is not hackable, write our own utility.
      ◦ Query Function: Selector Group, Node → Iterator<Node>
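
A minimal sketch of how such a pseudo class extension point might look in Java, assuming the `String[] → Node → boolean` shape from the slide. `SimpleNode`, `REGISTRY`, `pseudoClass`, and the two pseudo classes `text-matches` and `attr-contains` are hypothetical names introduced for illustration; they are not part of any CSS standard or of the speaker's utility.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: a registry of custom pseudo classes.
final class PseudoClassSketch {

    // Minimal stand-in for a DOM node (assumption, not a real DOM API).
    static final class SimpleNode {
        final String tag;
        final String text;
        final Map<String, String> attributes = new HashMap<>();

        SimpleNode(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    // Pseudo class: String[] -> Node -> boolean, as on the slide.
    static final Map<String, Function<String[], Predicate<SimpleNode>>> REGISTRY = new HashMap<>();

    static {
        // :text-matches(regex) targets a feature buried in text.
        REGISTRY.put("text-matches", args -> {
            Pattern pattern = Pattern.compile(args[0]);
            return node -> node.text != null && pattern.matcher(node.text).find();
        });
        // :attr-contains(name, fragment) targets a feature buried in an attribute value.
        REGISTRY.put("attr-contains", args -> node -> {
            String value = node.attributes.get(args[0]);
            return value != null && value.contains(args[1]);
        });
    }

    // Looks up a pseudo class by name and binds its arguments to get a node predicate.
    static Predicate<SimpleNode> pseudoClass(String name, String... args) {
        Function<String[], Predicate<SimpleNode>> factory = REGISTRY.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown pseudo class: " + name);
        }
        return factory.apply(args);
    }

    public static void main(String[] args) {
        SimpleNode p = new SimpleNode("p", "Posted on 2017-09-29");
        Predicate<SimpleNode> hasIsoDate = pseudoClass("text-matches", "\\d{4}-\\d{2}-\\d{2}");
        System.out.println(hasIsoDate.test(p)); // true
    }
}
```

Because each pseudo class is just a function from arguments to a node predicate, it can be combined with the other simple selectors in a sequence without special cases.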
  17. Query Function
    • Query Function: Selector Group, Node → Iterator<Node>
      ◦ Iterate by DFS (depth-first search), filtered by Match Function
    • Match Function: Selector Group, Node → boolean
      ◦ Disjunction (OR) over selectors in selector group
      ◦ = match(selector1, node) || match(selector2, node) || ...
    Selector Group ::= Selector [, Selector]*
    • div#id, div.class, a[attr]
  18. Query Function
    • Match Function: Selector, Node → boolean
      ◦ Later
    • Match Function: Simple Selector Sequence, Node → boolean
      ◦ Conjunction (AND) over simple selectors in the sequence
      ◦ = match("#id", node) && match(".class", node) && …
      ◦ Simple selector itself is defined as predicate of node (Node → boolean)
      ◦ Custom pseudo classes are handled here
    Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
    • tag.class[attr="value"]
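
The disjunction and conjunction rules above can be sketched as follows. This is an illustration under assumptions, not the speaker's implementation: `Node` is a toy stand-in for a DOM node, selectors are already parsed into predicates, combinators are left out, and the query returns a `List` rather than the `Iterator<Node>` named on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: group = OR over sequences, sequence = AND over simple selectors.
final class QuerySketch {

    // Toy DOM node (assumption): tag, optional id, classes, children.
    static final class Node {
        final String tag;
        final String id;
        final List<String> classes;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String id, String... classes) {
            this.tag = tag;
            this.id = id;
            this.classes = Arrays.asList(classes);
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Simple selectors are plain predicates of Node.
    static Predicate<Node> tag(String name) { return n -> n.tag.equals(name); }
    static Predicate<Node> id(String id) { return n -> id.equals(n.id); }
    static Predicate<Node> cls(String name) { return n -> n.classes.contains(name); }

    // Simple selector sequence: conjunction (AND). An empty sequence behaves like "*".
    static boolean matchSequence(List<Predicate<Node>> sequence, Node node) {
        return sequence.stream().allMatch(p -> p.test(node));
    }

    // Selector group: disjunction (OR) over its (combinator-free) selectors.
    static boolean matchGroup(List<List<Predicate<Node>>> group, Node node) {
        return group.stream().anyMatch(sequence -> matchSequence(sequence, node));
    }

    // Query: pre-order DFS over the descendants of root, filtered by matchGroup.
    static List<Node> query(List<List<Predicate<Node>>> group, Node root) {
        List<Node> result = new ArrayList<>();
        for (Node child : root.children) {
            collect(group, child, result);
        }
        return result;
    }

    private static void collect(List<List<Predicate<Node>>> group, Node node, List<Node> out) {
        if (matchGroup(group, node)) {
            out.add(node);
        }
        for (Node child : node.children) {
            collect(group, child, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("body", null);
        root.add(new Node("div", "id")).add(new Node("a", null));
        // Equivalent of the group "div#id, div.class"
        List<List<Predicate<Node>>> group = Arrays.asList(
                Arrays.asList(tag("div"), id("id")),
                Arrays.asList(tag("div"), cls("class")));
        System.out.println(query(group, root).size()); // 1
    }
}
```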
  19. Query Function
    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Child combinator: (>)
        ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.parent)
      ◦ Adjacent sibling combinator: (+)
        ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.previousSibling)
      ◦ Use dynamic programming
        ▪ Recursion with cache
    Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
    • #id.class1 .class2 > input[value="yes"]:focus
  20. Query Function
    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Descendant combinator:
        ▪ = match("[simple selector sequence]", node) &&
            (match("[upstream selector]", node.parent) ||
             match("[upstream selector]", node.parent.parent) ||
             match("[upstream selector]", node.parent.parent.parent) || …)
  21. Query Function
    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ Descendant combinator:
        ▪ = match("[simple selector sequence]", node) &&
            (match("[upstream selector]", node.parent) ||
             match("[upstream selector]", node.parent.parent) ||
             match("[upstream selector]", node.parent.parent.parent) || …)
          = match("[simple selector sequence]", node) && match_ANC("[upstream selector]", node.parent)
        ▪ Where match_ANC("[selector]", node)
          = match("[selector]", node) || match("[selector]", node.parent) || match("[selector]", node.parent.parent) || …
          = match("[selector]", node) || match_ANC("[selector]", node.parent)
        ▪ Do dynamic programming on match_ANC as well, O(depth) → O(1) overall in DFS
  22. Query Function
    • Match Function: Selector, Node → boolean
      ◦ match("[upstream selector] [combinator] [simple selector sequence]", node)
      ◦ General sibling combinator:
        ▪ Same as descendant combinator, walking node.previousSibling instead of node.parent
    • Done!
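
One way the combinator recursion with caching could be written is sketched below. It is an illustration under assumptions rather than the speaker's code: `Node`, `Selector`, and `Combinator` are hypothetical types, `matchAncestorOrSelf` plays the role of the slides' match_ANC, `matchPrecedingOrSelf` is its sibling counterpart, and the per-selector caches assume the tree does not change during a query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of combinator matching as recursion with cache.
final class CombinatorSketch {

    // Toy DOM node (assumption): only the links the combinators need.
    static final class Node {
        final String tag;
        Node parent;
        Node previousSibling;
        final List<Node> children = new ArrayList<>();

        Node(String tag) { this.tag = tag; }

        Node add(Node child) {
            child.parent = this;
            if (!children.isEmpty()) {
                child.previousSibling = children.get(children.size() - 1);
            }
            children.add(child);
            return this;
        }
    }

    enum Combinator { DESCENDANT, CHILD, ADJACENT_SIBLING, GENERAL_SIBLING }

    // "[upstream selector] [combinator] [simple selector sequence]"
    static final class Selector {
        final Selector upstream;        // null for the leftmost sequence
        final Combinator combinator;    // relation to upstream; null when upstream is null
        final Predicate<Node> sequence; // simple selector sequence, already parsed
        final Map<Node, Boolean> matchCache = new HashMap<>();
        final Map<Node, Boolean> ancestorCache = new HashMap<>();
        final Map<Node, Boolean> siblingCache = new HashMap<>();

        Selector(Selector upstream, Combinator combinator, Predicate<Node> sequence) {
            this.upstream = upstream;
            this.combinator = combinator;
            this.sequence = sequence;
        }

        static Selector of(Predicate<Node> sequence) {
            return new Selector(null, null, sequence);
        }

        Selector then(Combinator next, Predicate<Node> nextSequence) {
            return new Selector(this, next, nextSequence);
        }
    }

    static Predicate<Node> tag(String name) { return n -> n.tag.equals(name); }

    static boolean match(Selector s, Node n) {
        if (n == null) return false;
        Boolean cached = s.matchCache.get(n);
        if (cached != null) return cached;
        boolean result = s.sequence.test(n) && matchUpstream(s, n);
        s.matchCache.put(n, result);
        return result;
    }

    private static boolean matchUpstream(Selector s, Node n) {
        if (s.upstream == null) return true;
        switch (s.combinator) {
            case CHILD: return match(s.upstream, n.parent);
            case ADJACENT_SIBLING: return match(s.upstream, n.previousSibling);
            case DESCENDANT: return matchAncestorOrSelf(s.upstream, n.parent);
            default: return matchPrecedingOrSelf(s.upstream, n.previousSibling);
        }
    }

    // match_ANC(sel, n) = match(sel, n) || match_ANC(sel, n.parent)
    static boolean matchAncestorOrSelf(Selector s, Node n) {
        if (n == null) return false;
        Boolean cached = s.ancestorCache.get(n);
        if (cached != null) return cached;
        boolean result = match(s, n) || matchAncestorOrSelf(s, n.parent);
        s.ancestorCache.put(n, result);
        return result;
    }

    // Same recurrence along previous siblings, for the general sibling combinator.
    static boolean matchPrecedingOrSelf(Selector s, Node n) {
        if (n == null) return false;
        Boolean cached = s.siblingCache.get(n);
        if (cached != null) return cached;
        boolean result = match(s, n) || matchPrecedingOrSelf(s, n.previousSibling);
        s.siblingCache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        Node div = new Node("div");
        Node p = new Node("p");
        Node span = new Node("span");
        div.add(p);
        p.add(span);
        Selector divSpan = Selector.of(tag("div")).then(Combinator.DESCENDANT, tag("span"));
        System.out.println(match(divSpan, span)); // true: div is an ancestor of span
    }
}
```

The caches use explicit get/put rather than `computeIfAbsent`, because the recursion re-enters the same map while a value is still being computed.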
  23. Query Function
    • Can we do better?
      ◦ Optimize on id selector
      ◦ Optimize on simple selector sequence conjunction order
      ◦ Optimize on selector group disjunction order
      ◦ DFS pruning
    • Can we do more?
      ◦ element.querySelectorAll("+ .some-class")
      ◦ Younger-sibling-first DFS
      ◦ Child-first DFS
        ▪ Beware of time cost
  24. Pain...
    // feature within younger siblings...
    <div>(our target)</div>
    <div class="strong-feature"></div>
    // feature within nephews...
    <div>
      <div class="strong-feature"></div>
    </div>
    <div>(our target)</div>
  25. A Common Scenario
    [Diagram: Header, Body, Footer]
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
    • How do we express the position of the target body relative to the header and footer?
  26. Tree Node Position
    [Diagram: a tree under Root with regions A to E marked]
    • A: before
    • B: after
    • C: descendant
    • D: ancestor
    • E: equal
  27. A Common Scenario
    [Diagram: Header, Body, Footer]
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
    • How do we express the position of the target body relative to the header and footer?
      ◦ Body is [after] header AND [before] footer
  28. Comparing Two Tree Nodes’ Position
    [Diagram: a tree with Root and nodes M and N, shown for each step]
    • Get ancestors
    • Get youngest (lowest) common ancestor
    • Compare order of first different ancestors
    Time complexity = O(depth) + O(branching factor)
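
The three steps above can be sketched as follows, assuming a toy `Node` with parent and child links. `PositionSketch`, `Position`, `pathFromRoot`, and `compare` are hypothetical names; `compare(m, n)` is taken to return where m lies relative to n.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: compare two nodes by walking their ancestor paths.
final class PositionSketch {

    enum Position { BEFORE, AFTER, DESCENDANT, ANCESTOR, EQUAL }

    // Toy tree node (assumption).
    static final class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    // Step 1: path from the root down to the node, inclusive. O(depth).
    static List<Node> pathFromRoot(Node node) {
        List<Node> path = new ArrayList<>();
        for (Node n = node; n != null; n = n.parent) {
            path.add(n);
        }
        Collections.reverse(path);
        return path;
    }

    // Position of m relative to n; both must belong to the same tree.
    static Position compare(Node m, Node n) {
        List<Node> pm = pathFromRoot(m);
        List<Node> pn = pathFromRoot(n);
        // Step 2: skip the shared prefix; its last element is the lowest common ancestor.
        int i = 0;
        while (i < pm.size() && i < pn.size() && pm.get(i) == pn.get(i)) {
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("Nodes are not in the same tree");
        }
        if (i == pm.size() && i == pn.size()) return Position.EQUAL;
        if (i == pm.size()) return Position.ANCESTOR;   // m's path is a prefix of n's
        if (i == pn.size()) return Position.DESCENDANT; // n's path is a prefix of m's
        // Step 3: order of the first differing ancestors. O(branching factor).
        List<Node> siblings = pm.get(i - 1).children;
        return siblings.indexOf(pm.get(i)) < siblings.indexOf(pn.get(i))
                ? Position.BEFORE
                : Position.AFTER;
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node a = new Node("a");
        Node b = new Node("b");
        root.add(a).add(b);
        System.out.println(compare(a, b)); // BEFORE
    }
}
```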
  29. Comparing Two Tree Nodes’ Position
    • Can we do better?
    • In case of immutable tree, with O(n) indexing, node position comparison is O(1)
    • Key:
      ◦ Define leaf index, L(N) = index of leaf node in DFS order, for any leaf node N.
      ◦ Define left bound, LB(N) = leaf index of oldest leaf descendant of N (including itself)
      ◦ Define right bound, RB(N) = leaf index of youngest leaf descendant of N (including itself)
  30. Comparing Two Tree Nodes’ Position
    [Diagram: a tree under Root with regions A to D marked around node N]
    • A: before: M.rightBound < N.leftBound
    • B: after: M.leftBound > N.rightBound
    • C: descendant: not (A or B), M.depth > N.depth
    • D: ancestor: not (A or B), M.depth < N.depth
    • E: equal: otherwise
  31. Comparing Two Tree Nodes’ Position
    • Bounds can be computed in one DFS
      ◦ At leaf node:
        ▪ leftBound = rightBound = leafIndex = ++currentIndex;
      ◦ On entering non-leaf node:
        ▪ leftBound = currentIndex + 1;
      ◦ On exiting non-leaf node:
        ▪ rightBound = currentIndex;
    [Diagram: node N under Root, above leaf indices … x-1 x ... y y+1 ...]
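
A minimal sketch of the indexing pass and the resulting O(1) comparison, following the bound rules on the slides. The `Node` class, `Indexer`, and `compare` are hypothetical names, and the tree is assumed to stay unchanged after indexing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: leaf-index bounds computed in one DFS, then O(1) comparison.
final class BoundsSketch {

    enum Position { BEFORE, AFTER, DESCENDANT, ANCESTOR, EQUAL }

    // Toy tree node (assumption) carrying its index data.
    static final class Node {
        final List<Node> children = new ArrayList<>();
        int leftBound;
        int rightBound;
        int depth;

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // One DFS over the tree assigns depth, leftBound and rightBound to every node.
    static final class Indexer {
        private int currentIndex = 0;

        void index(Node node, int depth) {
            node.depth = depth;
            if (node.children.isEmpty()) {
                // At leaf node: both bounds are its own leaf index.
                node.leftBound = node.rightBound = ++currentIndex;
                return;
            }
            // On entering non-leaf node.
            node.leftBound = currentIndex + 1;
            for (Node child : node.children) {
                index(child, depth + 1);
            }
            // On exiting non-leaf node.
            node.rightBound = currentIndex;
        }
    }

    // Position of m relative to n, using only the precomputed fields.
    static Position compare(Node m, Node n) {
        if (m.rightBound < n.leftBound) return Position.BEFORE;
        if (m.leftBound > n.rightBound) return Position.AFTER;
        if (m.depth > n.depth) return Position.DESCENDANT;
        if (m.depth < n.depth) return Position.ANCESTOR;
        return Position.EQUAL;
    }

    public static void main(String[] args) {
        // Mirrors the header/body/footer scenario: three wrappers with one leaf each.
        Node root = new Node();
        Node header = new Node();
        Node body = new Node();
        Node footer = new Node();
        root.add(new Node().add(header)).add(new Node().add(body)).add(new Node().add(footer));
        new Indexer().index(root, 0);
        System.out.println(compare(body, header)); // AFTER
        System.out.println(compare(body, footer)); // BEFORE
    }
}
```

The comparison relies on the fact that two leaf ranges in a tree are either disjoint or nested, so once BEFORE and AFTER are ruled out, depth alone separates ancestor, descendant, and equal.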
  32. Bottom-up Search Utility
    node.descendants()      // NodeStream extends Stream<Node>
        .before(node1)      // filter by n -> compare(n, node1) == BEFORE
        .after(node2);

    body = node.descendants()
        .after(header)
        .before(footer)
        .filter(...)
        .first();
  33. Development over Robust Materials
    • Search elements by robust features
      ◦ Presence of login form
      ◦ Permalink
    • Try to follow their code structure
      ◦ Conway’s Law
    • Scrape based on robust hypotheses
      ◦ Need to collect documents
      ◦ Need the ability to run experiments over a document set
    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");
  34. Document Query Language
    SELECT * FROM [Doc_1, Doc_2] WHERE *
    • Doc_1
      ◦ <document>
        ▪ <head>
          • <title>
          • <meta>
          • …
        ▪ <body>
          • …
    • Doc_2
      ◦ …
  35. Document Query Language
    SELECT * FROM [Doc_1, Doc_2] WHERE .c
    • Doc_1
      ◦ <div class="a b c">
      ◦ <section class="c">
        ▪ <div class="a b c">
          • <section class="c">
      ◦ <div class="a b c">
      ◦ …
    • Doc_2
      ◦ …
  36. Document Query Language
    SELECT tag FROM [Doc_1, Doc_2] WHERE .c
    • Doc_1
      ◦ div
      ◦ section
        ▪ div
          • section
      ◦ div
      ◦ …
    • Doc_2
      ◦ …
  37. Document Query Language
    SELECT COUNT(*) FROM [Doc_1, Doc_2] WHERE .c
    • Doc_1 (10)
      ◦ 1
      ◦ 3
        ▪ 2
          • 1
      ◦ 1
      ◦ …
    • Doc_2 (10)
      ◦ …
  38. Document Query Language
    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");

    SELECT COUNT(*) FROM /document/feed/* WHERE #load-more-btn
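
A rough sketch of what checking this hypothesis over a collected document set might amount to. It does not implement the query language itself: `Node`, `countPerDocument`, and `uniqueAndAlwaysPresent` are hypothetical names, the WHERE clause is passed in as a ready-made predicate, and the count is flattened to one total per document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: evaluate a COUNT(*) style check over a set of documents.
final class HypothesisSketch {

    // Toy DOM node (assumption).
    static final class Node {
        final String tag;
        final String id;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String id) {
            this.tag = tag;
            this.id = id;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Counts the nodes in the subtree (including the root) that satisfy the predicate.
    static int count(Node node, Predicate<Node> where) {
        int total = where.test(node) ? 1 : 0;
        for (Node child : node.children) {
            total += count(child, where);
        }
        return total;
    }

    // One count per document, keyed by document name, in insertion order.
    static Map<String, Integer> countPerDocument(Map<String, Node> documents, Predicate<Node> where) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        documents.forEach((name, root) -> counts.put(name, count(root, where)));
        return counts;
    }

    // "Unique and always present" holds when every document has exactly one match.
    static boolean uniqueAndAlwaysPresent(Map<String, Integer> counts) {
        return counts.values().stream().allMatch(c -> c == 1);
    }

    public static void main(String[] args) {
        Map<String, Node> documents = new LinkedHashMap<>();
        documents.put("Doc_1", new Node("body", null).add(new Node("button", "load-more-btn")));
        documents.put("Doc_2", new Node("body", null));
        Map<String, Integer> counts = countPerDocument(documents, n -> "load-more-btn".equals(n.id));
        System.out.println(counts);                         // {Doc_1=1, Doc_2=0}
        System.out.println(uniqueAndAlwaysPresent(counts)); // false
    }
}
```

A result like the one above would show that the hypothesis fails for Doc_2 before the scraper relies on it.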
  39. We deserve a better browser
    [Diagram: scale from Primitive to Sophisticated, with labels HTTP Connection, HTTP Client, Real Browser, Headless Browser, and "?"]
  40. QA on optional element
    • How to tell the difference between the absence of an optional element and a broken scraper?
      ◦ We can tell by statistics, but can we know earlier?
  41. Untemplating?
    • Given a set of documents, can we reverse engineer them into templates and arguments?