On Scraper

On Scraper 2017/9, Simon Pai

Scenario
Working on websites with…
• DOM rendered by JavaScript
• DOM layout by CSS
• Data access that requires user actions
• Huge DOM tree
• Various page types
• Ever-changing versions

Scenario
Working on websites with…
• DOM rendered by JavaScript
• DOM layout by CSS
• Data access that requires user actions
  ◦ Headless/minimal browsers are maturing
  ◦ In the worst case, we have Selenium

Scenario
Working on websites with…
• Huge DOM tree → Powerful DOM search utilities
• Various page types → Flexible architecture
• Ever-changing versions → Robust development
  ▪ More search utilities, more options

Topics
1. Architecture
   • Paradigm: Unrendering
2. DOM search utilities
   • CSS Selector query function
   • Tree node position comparison
3. Robust development
   • Document Query Language
4. Unsolved problems

Unrendering

Traditional Scraper Paradigm
[Diagram with labels: Scraper, Task Parameters, Result, Source]

Pain...
• Tedious parameter model.
  ◦ Supporting two types of task results in a union of both input models.
  ◦ Or… it leads to two scrapers with many shared functions.
• Parameters are carried all over the scraping logic.
  ◦ If-else jungles.
  ◦ Many subroutines with longAndSubtlyDifferentNamesLikeThis.

Scraping is essentially reverse engineering of data rendering.

Traditional Scraper Paradigm
[Diagram with labels: Scraper, Task Parameters, Result, Source]
The scraper is a mixture of reverse engineering of their data rendering and our business logic.

Paradigm: Unrendering
[Diagram with labels: Scraper, Task Parameters, Result, Source, Data View, Task Handler]
• Data View: works on data retrieval
• Task Handler: works on the task requirement

Typical Data View API
    class ArticleFeedPageView {
        ArticleFeedPageView(DOMDocument document) { … }
        List<ArticleComponent> getArticles() { … }  // lazy eval
        boolean hasMoreArticles() { … }
        // invalidates getArticles() cache
        void loadMoreArticles() { … }
    }

Typical Data View API
    class ArticleComponent {
        ArticleComponent(DOMElement element) { … }
        Article getArticle() { … }  // lazy eval
    }

Typical Task Handler
    ArticleTaskParams params;
    ArticleFeedPageView view = new ArticleFeedPageView(document);
    while (view.hasMoreArticles()
            && view.getArticles().size() < params.limit()) {
        view.loadMoreArticles();
    }
    List<Article> articles = view.getArticles().stream()
            .map(ArticleComponent::getArticle)
            .collect(Collectors.toList());
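The split between data view and task handler can be tried without a browser. The following is a minimal sketch, assuming an in-memory stand-in for the page: `FakeFeedPage`, `FeedView`, `FeedTask`, and the batch-loading behaviour are hypothetical names and assumptions, not part of the deck. It only illustrates lazy evaluation, cache invalidation, and a handler that never touches the page directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a rendered page: reveals articles in batches.
final class FakeFeedPage {
    private final List<String> all;
    private final int batch;
    private int visible;

    FakeFeedPage(List<String> all, int batch) {
        this.all = all;
        this.batch = batch;
        this.visible = Math.min(batch, all.size());
    }

    List<String> visibleTitles() { return new ArrayList<>(all.subList(0, visible)); }
    boolean hasMore() { return visible < all.size(); }
    void clickLoadMore() { visible = Math.min(visible + batch, all.size()); }
}

// Data view: knows how the page exposes data, nothing about tasks.
final class FeedView {
    private final FakeFeedPage page;
    private List<String> cache; // null means "not evaluated yet"

    FeedView(FakeFeedPage page) { this.page = page; }

    List<String> getArticles() { // lazy eval
        if (cache == null) {
            cache = page.visibleTitles();
        }
        return cache;
    }

    boolean hasMoreArticles() { return page.hasMore(); }

    void loadMoreArticles() { // invalidates getArticles() cache
        page.clickLoadMore();
        cache = null;
    }
}

// Task handler: knows the task requirement, nothing about the page.
final class FeedTask {
    static List<String> fetch(FeedView view, int limit) {
        while (view.hasMoreArticles() && view.getArticles().size() < limit) {
            view.loadMoreArticles();
        }
        List<String> articles = view.getArticles();
        // Trimming to the limit is an addition of this sketch.
        return new ArrayList<>(articles.subList(0, Math.min(limit, articles.size())));
    }
}
```

Swapping `FakeFeedPage` for a real DOM-backed page would change only the data view; the task handler stays as it is.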

Benefits From Decoupling
[Diagram with labels: Scraper, Task Parameters, Result, Source, Data View, Task Handler]

Bonus
• We may have a shell that dynamically interprets task handler code in a dev environment.

CSS Selector Query Function

Recap: CSS Selector Model
Simple selector: predicate of Node (Node → boolean)
• *
• tag
• #id
• .class
• [attr="value"]
• :pseudo-class

Recap: CSS Selector Model: Pseudo Class
Pseudo Class: parameterized predicate of Node (String[] → Node → boolean)
• :first-child
• :nth-child(4n+3)
• :checked
• :hover
• :not(selector)
Pseudo Element: map from Node to another object (Node → Object)
• ::before
• ::first-letter

Recap: CSS Selector Model
Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
• tag.class[attr="value"]
Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
• #id.class1 .class2 > input[value="yes"]:focus
Selector Group ::= Selector [, Selector]*
• div#id, div.class, a[attr]

Recap: CSS Selector Model: Combinator
• Descendant Combinator
  ◦ a b
• Child Combinator
  ◦ a > b
• Adjacent Sibling Combinator
  ◦ a + b
• General Sibling Combinator
  ◦ a ~ b
See: https://www.w3.org/TR/css3-selectors/#grammar

Pain...
    // feature buried in attributes...
    <div data-json="{x:100}">...</div>
    // feature buried in text...
    <p>Text in certain date format...</p>
    // can't anchor text nodes...
    <p>Some text <a>link</a> some text</p>
    // only takes simple selector...
    :not(div.some-class)

Pain...
    // only need the last one...
    nodes = document.querySelectorAll(...);
    n = nodes[nodes.length - 1];

Solution
• We need to support pseudo class extension.
  ◦ Pseudo Class: String[] → Node → boolean
• If our selector query utility is not hackable, write our own utility.
  ◦ Query Function: Selector Group, Node → Iterator<Node>
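The signature String[] → Node → boolean maps directly onto a function that returns a predicate. A minimal sketch of an extension registry follows, assuming a hypothetical `SimpleNode` type and two made-up pseudo classes, `contains` and `matches-text`; neither name comes from the deck or from any CSS standard.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical minimal node: a tag plus its text content.
final class SimpleNode {
    final String tag;
    final String text;

    SimpleNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }
}

// Registry of custom pseudo classes: name -> (String[] args -> Node -> boolean).
final class PseudoClasses {
    private final Map<String, Function<String[], Predicate<SimpleNode>>> registry = new HashMap<>();

    void register(String name, Function<String[], Predicate<SimpleNode>> factory) {
        registry.put(name, factory);
    }

    // Binds the arguments and returns a plain node predicate.
    Predicate<SimpleNode> resolve(String name, String... args) {
        Function<String[], Predicate<SimpleNode>> factory = registry.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown pseudo class: " + name);
        }
        return factory.apply(args);
    }

    static PseudoClasses withExamples() {
        PseudoClasses p = new PseudoClasses();
        // Hypothetical :contains(substring), for features buried in text.
        p.register("contains", args -> node -> node.text.contains(args[0]));
        // Hypothetical :matches-text(regex), e.g. text in a certain date format.
        p.register("matches-text", args -> node -> node.text.matches(args[0]));
        return p;
    }
}
```

Because a resolved pseudo class is just a node predicate, it can sit in a simple selector sequence next to the built-in simple selectors.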

Query Function
• Query Function: Selector Group, Node → Iterator<Node>
  ◦ Iterate by DFS (depth first search), filtered by Match Function
• Match Function: Selector Group, Node → boolean
  ◦ Disjunction (OR) over selectors in selector group
  ◦ = match(selector1, node) || match(selector2, node) || ...
Selector Group ::= Selector [, Selector]*
• div#id, div.class, a[attr]

Query Function
• Match Function: Selector, Node → boolean
  ◦ Later
• Match Function: Simple Selector Sequence, Node → boolean
  ◦ Conjunction (AND) over simple selectors in the sequence
  ◦ = match("#id", node) && match(".class", node) && …
  ◦ Simple selector itself is defined as predicate of node (Node → boolean)
  ◦ Custom pseudo classes are handled here
Simple Selector Sequence ::= [ Simple_Selector ]* | "*"
• tag.class[attr="value"]
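The two match levels described so far compose naturally from predicates. Here is a minimal sketch, assuming a hypothetical `El` element type and helper names (`tag`, `id`, `cls`, `sequence`, `group`, `queryAll`) invented for illustration. It returns a List rather than an Iterator for brevity, and it leaves out combinators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical minimal element: tag, optional id, class list, children.
final class El {
    final String tag;
    final String id; // may be null
    final List<String> classes;
    final List<El> children = new ArrayList<>();

    El(String tag, String id, String... classes) {
        this.tag = tag;
        this.id = id;
        this.classes = Arrays.asList(classes);
    }

    El add(El child) {
        children.add(child);
        return this;
    }
}

final class Query {
    // Simple selectors are plain predicates of a node.
    static Predicate<El> tag(String t) { return n -> n.tag.equals(t); }
    static Predicate<El> id(String i) { return n -> i.equals(n.id); }
    static Predicate<El> cls(String c) { return n -> n.classes.contains(c); }

    // Simple selector sequence: conjunction (AND) over simple selectors.
    @SafeVarargs
    static Predicate<El> sequence(Predicate<El>... simple) {
        return n -> Arrays.stream(simple).allMatch(p -> p.test(n));
    }

    // Selector group: disjunction (OR) over selectors.
    @SafeVarargs
    static Predicate<El> group(Predicate<El>... selectors) {
        return n -> Arrays.stream(selectors).anyMatch(p -> p.test(n));
    }

    // Query function: pre-order DFS over descendants, filtered by the match function.
    static List<El> queryAll(El root, Predicate<El> match) {
        List<El> out = new ArrayList<>();
        for (El child : root.children) {
            collect(child, match, out);
        }
        return out;
    }

    private static void collect(El n, Predicate<El> match, List<El> out) {
        if (match.test(n)) {
            out.add(n);
        }
        for (El c : n.children) {
            collect(c, match, out);
        }
    }
}
```

With this shape, `Query.group(Query.sequence(Query.tag("div"), Query.cls("class")), Query.tag("a"))` plays the role of the selector group `div.class, a`.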

match("[upstream selector] [combinator] [simple selector sequence]", node)
◦ Child combinator: (>)
  ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.parent)
◦ Adjacent sibling combinator: (+)
  ▪ = match("[simple selector sequence]", node) && match("[upstream selector]", node.previousSibling)
◦ Use dynamic programming
  ▪ Recursion with cache
Selector ::= Simple_Selector_Sequence [ Combinator Simple_Selector_Sequence ]*
• #id.class1 .class2 > input[value="yes"]:focus

match("[upstream selector] [combinator] [simple selector sequence]", node)
◦ Descendant combinator:
  ▪ = match("[simple selector sequence]", node)
      && (match("[upstream selector]", node.parent)
          || match("[upstream selector]", node.parent.parent)
          || match("[upstream selector]", node.parent.parent.parent)
          || …)
    = match("[simple selector sequence]", node)
      && match_ANC("[upstream selector]", node.parent)
  ▪ Where match_ANC("[selector]", node)
    = match("[selector]", node) || match("[selector]", node.parent) || match("[selector]", node.parent.parent) || …
    = match("[selector]", node) || match_ANC("[selector]", node.parent)
  ▪ Do dynamic programming on match_ANC as well: per-node cost drops from O(depth) to O(1) over the whole DFS

match("[upstream selector] [combinator] [simple selector sequence]", node)
◦ General sibling combinator:
  ▪ Same as descendant combinator, walking previous siblings instead of ancestors
• Done!
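The recursion with cache for the child and descendant combinators can be sketched as follows. This is a minimal illustration, assuming hypothetical types `N` (a node with a parent link), `Step` (a selector stored as a chain, rightmost simple selector sequence first), and `Matcher`. The sibling combinators are left out; they would follow the same pattern over previousSibling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical minimal node with a parent link.
final class N {
    final String tag;
    final N parent;

    N(String tag, N parent) {
        this.tag = tag;
        this.parent = parent;
    }
}

// One step of a selector: [upstream selector] [combinator] [simple selector sequence].
final class Step {
    final Predicate<N> sequence; // simple selector sequence
    final char combinator;       // '>' child, ' ' descendant; unused when upstream is null
    final Step upstream;         // null for the leftmost sequence

    Step(Predicate<N> sequence, char combinator, Step upstream) {
        this.sequence = sequence;
        this.combinator = combinator;
        this.upstream = upstream;
    }
}

final class Matcher {
    private final Map<Step, Map<N, Boolean>> matchCache = new HashMap<>();
    private final Map<Step, Map<N, Boolean>> ancCache = new HashMap<>();

    boolean match(Step s, N node) {
        if (node == null) {
            return false;
        }
        Map<N, Boolean> cache = matchCache.computeIfAbsent(s, k -> new HashMap<>());
        Boolean hit = cache.get(node);
        if (hit != null) {
            return hit;
        }
        boolean result = s.sequence.test(node)
                && (s.upstream == null
                    || (s.combinator == '>'
                        ? match(s.upstream, node.parent)
                        : matchAnc(s.upstream, node.parent)));
        cache.put(node, result);
        return result;
    }

    // match_ANC(selector, node) = match(selector, node) || match_ANC(selector, node.parent)
    boolean matchAnc(Step s, N node) {
        if (node == null) {
            return false;
        }
        Map<N, Boolean> cache = ancCache.computeIfAbsent(s, k -> new HashMap<>());
        Boolean hit = cache.get(node);
        if (hit != null) {
            return hit;
        }
        boolean result = match(s, node) || matchAnc(s, node.parent);
        cache.put(node, result);
        return result;
    }
}
```

The caches are keyed by object identity, so one `Matcher` is meant to serve one query over one unchanged tree.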

Query Function
• Can we do better?
  ◦ Optimize on id selector
  ◦ Optimize on simple selector sequence conjunction order
  ◦ Optimize on selector group disjunction order
  ◦ DFS pruning
• Can we do more?
  ◦ element.querySelectorAll("+ .some-class")
  ◦ Younger-sibling-first DFS
  ◦ Child-first DFS
    ▪ Beware of time cost

Tree Node Position Comparison

Pain...
    // feature within younger siblings...
    <div>(our target)</div>
    <div class="strong-feature"></div>

    // feature within nephews...
    <div>
      <div class="strong-feature"></div>
    </div>
    <div>(our target)</div>

A Common Scenario
[Diagram: Header, Body, Footer]
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
• How do we express the position constraints of header/footer relative to the target body?

Tree Node Position
[Diagram: tree with Root and nodes A to E]
• A: before
• B: after
• C: descendant
• D: ancestor
• E: equal

A Common Scenario
[Diagram: Header, Body, Footer]
    <div>
      <header>...</header>
    </div>
    <div>
      <div>...</div> // body
    </div>
    <div>
      <footer>...</footer>
    </div>
• How do we express the position constraints of header/footer relative to the target body?
  ◦ Body is [after] header AND [before] footer

Comparing Two Tree Nodes' Position
[Diagrams: nodes M and N under Root, one per step]
• Get ancestors
• Get youngest common ancestor
• Compare order of first different ancestors
Time complexity: O(depth) + O(branching factor)
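The three steps can be sketched as below. `TreeNode`, `Position`, and `PositionByAncestors` are hypothetical names for illustration, and the sketch assumes both nodes belong to the same tree. Building the two root paths costs O(depth), and locating the two diverging children among their siblings costs O(branching factor).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal tree node; registers itself with its parent.
final class TreeNode {
    final String name;
    final TreeNode parent;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name, TreeNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }
}

enum Position { BEFORE, AFTER, DESCENDANT, ANCESTOR, EQUAL }

final class PositionByAncestors {
    // Step 1: path from the root down to the node, inclusive.
    static List<TreeNode> path(TreeNode n) {
        List<TreeNode> p = new ArrayList<>();
        for (TreeNode c = n; c != null; c = c.parent) {
            p.add(c);
        }
        Collections.reverse(p);
        return p;
    }

    // Position of m relative to n. Precondition: m and n share the same root.
    static Position compare(TreeNode m, TreeNode n) {
        List<TreeNode> pm = path(m);
        List<TreeNode> pn = path(n);
        // Step 2: skip the common prefix; index i - 1 is the youngest common ancestor.
        int i = 0;
        while (i < pm.size() && i < pn.size() && pm.get(i) == pn.get(i)) {
            i++;
        }
        if (i == pm.size() && i == pn.size()) {
            return Position.EQUAL;
        }
        if (i == pm.size()) {
            return Position.ANCESTOR; // m's path is a prefix of n's path
        }
        if (i == pn.size()) {
            return Position.DESCENDANT;
        }
        // Step 3: compare the order of the first different ancestors.
        List<TreeNode> siblings = pm.get(i - 1).children;
        return siblings.indexOf(pm.get(i)) < siblings.indexOf(pn.get(i))
                ? Position.BEFORE
                : Position.AFTER;
    }
}
```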

Comparing Two Tree Nodes' Position
• Can we do better?
• For an immutable tree, with O(n) indexing, node position comparison is O(1)
• Key:
  ◦ Define leaf index, L(N) = index of leaf node in DFS order, for any leaf node N.
  ◦ Define left bound, LB(N) = leaf index of the oldest leaf descendant of N (including itself)
  ◦ Define right bound, RB(N) = leaf index of the youngest leaf descendant of N (including itself)

Comparing Two Tree Nodes' Position
[Diagram: tree with Root, reference node N, and nodes A to D]
Position of M relative to N:
• A: before: M.rightBound < N.leftBound
• B: after: M.leftBound > N.rightBound
• C: descendant: not (A or B), M.depth > N.depth
• D: ancestor: not (A or B), M.depth < N.depth
• E: equal: otherwise

Comparing Two Tree Nodes' Position
• Bounds can be computed in one DFS
  ◦ At leaf node:
    ▪ leftBound = rightBound = leafIndex = ++currentIndex;
  ◦ On entering non-leaf node:
    ▪ leftBound = currentIndex + 1;
  ◦ On exiting non-leaf node:
    ▪ rightBound = currentIndex;
[Diagram: node N under Root spanning leaf indices x to y, with leaves … x-1 before it and y+1 … after it]
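Putting the indexing pass and the bound-based rules together gives the following minimal sketch. `IndexedNode`, `Relation`, and `LeafBounds` are hypothetical names, and the recursive DFS is an assumption made for brevity; a very deep DOM would call for an explicit stack.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node carrying the index fields filled in by LeafBounds.index.
final class IndexedNode {
    final List<IndexedNode> children = new ArrayList<>();
    int leftBound;
    int rightBound;
    int depth;

    IndexedNode add(IndexedNode child) {
        children.add(child);
        return this;
    }
}

enum Relation { BEFORE, AFTER, DESCENDANT, ANCESTOR, EQUAL }

final class LeafBounds {
    private int currentIndex = 0;

    // One DFS over an immutable tree: O(n) indexing.
    static void index(IndexedNode root) {
        new LeafBounds().visit(root, 0);
    }

    private void visit(IndexedNode n, int depth) {
        n.depth = depth;
        if (n.children.isEmpty()) {
            // At leaf node: leftBound = rightBound = leafIndex
            n.leftBound = n.rightBound = ++currentIndex;
            return;
        }
        n.leftBound = currentIndex + 1;   // on entering non-leaf node
        for (IndexedNode c : n.children) {
            visit(c, depth + 1);
        }
        n.rightBound = currentIndex;      // on exiting non-leaf node
    }

    // Relation of m to n in O(1); valid only after index(root) has run.
    static Relation compare(IndexedNode m, IndexedNode n) {
        if (m.rightBound < n.leftBound) {
            return Relation.BEFORE;
        }
        if (m.leftBound > n.rightBound) {
            return Relation.AFTER;
        }
        if (m.depth > n.depth) {
            return Relation.DESCENDANT;
        }
        if (m.depth < n.depth) {
            return Relation.ANCESTOR;
        }
        return Relation.EQUAL;
    }
}
```

The leaf ranges of two subtrees are either disjoint or nested, which is why depth alone settles the ancestor, descendant, and equal cases once before and after are ruled out.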

Bottom-up Search Utility
    node.descendants()   // NodeStream extends Stream<Node>
        .before(node1)   // filter by n -> compare(n, node1) == BEFORE
        .after(node2);

    body = node.descendants()
        .after(header)
        .before(footer)
        .filter(...)
        .first();

Document Query Language

A Common Trap...
    <div id="load-more-btn">...</div>

    loadMoreButton = document.querySelector("#load-more-btn");
    // tested on my machine… (with current page source)

Development over Robust Materials
• Search elements by robust features
  ◦ Presence of login form
  ◦ Permalink
• Try to follow their code structure
  ◦ Conway's Law
• Scrape based on robust hypotheses
  ◦ Need to collect documents
  ◦ Need the ability to run experiments over a document set

    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");

Document Query Language
    SELECT {attribute name|pseudo element|aggregate function}
    FROM {document set}
    WHERE {selector}

Document Query Language
    SELECT * FROM [Doc_1, Doc_2] WHERE *
• Doc_1
  ◦ <document>
    ▪ <head>
      • <title>
      • <meta>
      • …
    ▪ <body>
      • …
• Doc_2
  ◦ …

Document Query Language
    SELECT * FROM [Doc_1, Doc_2] WHERE .c
• Doc_1
  ◦ <div class="a b c">
  ◦ <section class="c">
    ▪ <div class="a b c">
      • <section class="c">
  ◦ <div class="a b c">
  ◦ …
• Doc_2
  ◦ …

Document Query Language
    SELECT tag FROM [Doc_1, Doc_2] WHERE .c
• Doc_1
  ◦ div
  ◦ section
    ▪ div
      • section
  ◦ div
  ◦ …
• Doc_2
  ◦ …

Document Query Language
    SELECT COUNT(*) FROM [Doc_1, Doc_2] WHERE .c
• Doc_1 (10)
  ◦ 1
  ◦ 3
    ▪ 2
      • 1
  ◦ 1
  ◦ …
• Doc_2 (10)
  ◦ …

Document Query Language
    // hypothesis: #load-more-btn is unique and always present
    loadMoreButton = document.querySelector("#load-more-btn");

    SELECT COUNT(*) FROM /document/feed/* WHERE #load-more-btn
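What such a COUNT(*) query buys us can be shown with a small experiment harness. This is a minimal sketch under strong simplifying assumptions: each document is a flat list of elements rather than a DOM, the selector is a plain predicate, and `HypothesisCheck` with its method names is hypothetical. It is not an implementation of the query language itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class HypothesisCheck {
    // Analogue of SELECT COUNT(*) FROM docs WHERE selector: one count per document.
    static <E> Map<String, Long> countPerDocument(Map<String, List<E>> docs, Predicate<E> selector) {
        Map<String, Long> counts = new LinkedHashMap<>();
        docs.forEach((name, elements) ->
                counts.put(name, elements.stream().filter(selector).count()));
        return counts;
    }

    // "Unique and always present" holds iff every collected document has exactly one match.
    static boolean uniqueAndAlwaysPresent(Map<String, Long> counts) {
        return !counts.isEmpty() && counts.values().stream().allMatch(c -> c == 1L);
    }
}
```

A document with a count of 0 or 2 is a counterexample to the hypothesis, found before the scraper ships rather than after it breaks.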

Unsolved Problems

We deserve a better browser
[Diagram with labels: HTTP Connection, HTTP Client, Real Browser, Headless Browser, and a "?", on an axis from Primitive to Sophisticated]

QA on optional element
• How do we distinguish the absence of an optional element from a broken scraper?
  ◦ We can tell by statistics, but can we know earlier?

Untemplating?
• Given a set of documents, can we reverse engineer them into templates and arguments?

Thank you!
