Slide 1

Slide 1 text

Clojure Scraping Utils
Shriphani Palakodety
spalakod@cs.cmu.edu

Slide 2

Slide 2 text

Scraping
• Dataset construction (unstructured -> structured)
• APIs not widespread
• Unavoidable

Slide 3

Slide 3 text

Libraries
• clj-xpath (XPath queries)
• enlive (CSS selectors)

Slide 4

Slide 4 text

XPath Queries
• Select nodes in an XML document
• Clojure lib: clj-xpath, which wraps the Java XPath library

Slide 5

Slide 5 text

XPath Example
• Select node lr
  • /root/l/lr
  • /root//lr
• Infinitely many queries select the same node

[Figure: tree with nodes root, l, r, ll, lr]
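The two queries can be checked against a small document. This is a minimal sketch that calls the JDK's javax.xml.xpath directly (the library that clj-xpath is said to wrap), so it needs no extra dependencies. The XML string, the namespace name, and the helpers parse-xml and select-names are assumptions made for illustration; the tree shape (ll and lr under l, r under root) is inferred from the node names.

```clojure
(ns scrape.xpath-sketch
  (:import [java.io ByteArrayInputStream]
           [javax.xml.parsers DocumentBuilderFactory]
           [javax.xml.xpath XPathConstants XPathFactory]
           [org.w3c.dom Document Node NodeList]))

(defn parse-xml
  "Parse an XML string into an org.w3c.dom.Document."
  ^Document [^String s]
  (-> (DocumentBuilderFactory/newInstance)
      (.newDocumentBuilder)
      (.parse (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn select-names
  "Evaluate an XPath query and return the names of the matched nodes."
  [^String query ^Document doc]
  (let [xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath query doc XPathConstants/NODESET)]
    (mapv (fn [i] (.getNodeName ^Node (.item nodes i)))
          (range (.getLength nodes)))))

;; Assumed tree: root -> (l -> (ll, lr), r)
(def doc (parse-xml "<root><l><ll/><lr/></l><r/></root>"))

(select-names "/root/l/lr" doc) ;; => ["lr"]  (explicit path)
(select-names "/root//lr" doc)  ;; => ["lr"]  (lr anywhere below root)
(select-names "//lr" doc)       ;; => ["lr"]  (lr anywhere in the document)
```

All three queries return the same node, which is the sense in which many different queries select lr.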

Slide 6

Slide 6 text

XPath Query Predicates
• Substring matches on a node's attributes using contains()

Slide 7

Slide 7 text

Example
//html/div[contains(@class, 'hello')]/a

[HTML sample not preserved; its two links read "Hello" and "Bye"]
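A sketch of this query run through clj-xpath. The page string is a hypothetical stand-in for the slide's HTML, and the use of `$x:text*` (which, as far as I know, returns the text of every matched node) is an assumption about how the original demo was written.

```clojure
(require '[clj-xpath.core :refer [$x:text*]])

;; Hypothetical page: two divs, only the first has "hello" in its class.
(def page
  "<html>
     <div class=\"hello world\"><a href=\"hello.html\">Hello</a></div>
     <div class=\"bye\"><a href=\"bye.html\">Bye</a></div>
   </html>")

;; Text of every <a> under a div whose class attribute contains "hello".
($x:text* "//html/div[contains(@class, 'hello')]/a" page)
```

Because contains() is a plain substring test, a class such as "othello" would match as well.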

Slide 8

Slide 8 text

Position Information
• Sometimes you need to refine your query by position
• Example: the first child of some node

Slide 9

Slide 9 text

Example
//html/div[2]/a
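A short sketch of the positional query with clj-xpath. The three-div page is hypothetical; note that XPath positions start at 1, so div[2] is the second div child of html.

```clojure
(require '[clj-xpath.core :refer [$x:text*]])

;; Hypothetical page with three sibling divs.
(def page
  "<html><div><a>first</a></div><div><a>second</a></div><div><a>third</a></div></html>")

;; Links inside the second div only.
($x:text* "//html/div[2]/a" page)

;; Links inside the first div, the "first child" case.
($x:text* "//html/div[1]/a" page)
```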

Slide 10

Slide 10 text

Workflow
• Parse HTML (HtmlCleaner)
• Convert to an org.w3c.dom.Document object
• Run XPath queries on this object
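The three steps might look roughly like this. It is a sketch under assumptions: the namespace name, the helper html->dom, and the sample markup are made up for illustration, and it assumes clj-xpath accepts an org.w3c.dom.Document as its second argument (as a later slide's `($x "//html" html-object)` suggests).

```clojure
(ns scrape.workflow
  (:require [clj-xpath.core :refer [$x:text*]])
  (:import [org.htmlcleaner DomSerializer HtmlCleaner]))

(defn html->dom
  "Clean possibly malformed HTML and convert it to an org.w3c.dom.Document."
  [^String html]
  (let [cleaner  (HtmlCleaner.)
        tag-node (.clean cleaner html)]            ; step 1: parse with HtmlCleaner
    (.createDOM (DomSerializer. (.getProperties cleaner))
                tag-node)))                        ; step 2: convert to a DOM Document

;; Hypothetical markup with an unclosed div, as real pages often have.
(def dom (html->dom "<div class=\"hello\"><a href=\"hello.html\">Hello</a>"))

;; Step 3: run XPath queries on the Document.
($x:text* "//div[contains(@class, 'hello')]/a" dom)
```

HtmlCleaner is what makes the XPath step possible at all: the Java XPath API needs a well-formed tree, and most pages in the wild are not well-formed XML.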

Slide 11

Slide 11 text

Demo
• Extract article metadata from reddit (title, submitter, number of comments)

Slide 12

Slide 12 text

CSS Selectors
• Pick the HTML elements we want to style
• Some examples:
  • Select a div with class hello: div.hello
  • Select a div with id hello: div#hello
• For selections like these, XPath queries and CSS selectors are interchangeable
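To make the correspondence concrete, here is a sketch of XPath queries that select the same elements as the two CSS selectors, run through clj-xpath. The page string is hypothetical. The class query uses the usual token-matching idiom, since a class attribute may hold several space-separated names.

```clojure
(require '[clj-xpath.core :refer [$x:text*]])

;; Hypothetical page.
(def page
  "<html><body>
     <div class=\"hello big\">by class</div>
     <div id=\"hello\">by id</div>
     <div class=\"othello\">neither</div>
   </body></html>")

;; CSS: div.hello  (class list contains the whole token "hello")
(def by-class
  "//div[contains(concat(' ', normalize-space(@class), ' '), ' hello ')]")

;; CSS: div#hello
(def by-id "//div[@id='hello']")

($x:text* by-class page)
($x:text* by-id page)
```

A bare contains(@class, 'hello') would also match the "othello" div, which is why the class query pads the attribute with spaces before testing.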

Slide 13

Slide 13 text

Enlive
• An HTML document is just a Clojure map

<html>
  <body>
    <div>HELLO</div>
  </body>
</html>

{:tag :html,
 :attrs nil,
 :content
 ({:tag :body,
   :attrs nil,
   :content
   ({:tag :div,
     :attrs nil,
     :content ("HELLO")})})}

Slide 14

Slide 14 text

Enlive Workflow
• Process your document:
  • (enlive/html-resource html-stream)
• Select your elements:
  • (enlive/select processed-doc selector)
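A minimal sketch of the two steps, assuming enlive is required under the alias `enlive` as on the slide. The namespace name and the inline HTML are illustrative; a StringReader stands in for html-stream.

```clojure
(ns scrape.enlive-workflow
  (:require [net.cgrand.enlive-html :as enlive])
  (:import [java.io StringReader]))

;; Step 1: process the document. html-resource returns a sequence of nodes,
;; each a map with :tag, :attrs and :content.
(def processed-doc
  (enlive/html-resource
   (StringReader. "<html><body><div>HELLO</div></body></html>")))

;; Step 2: select elements. The result is a sequence of matching nodes.
(def divs (enlive/select processed-doc [:div]))

;; enlive/text flattens a node to its text content.
(map enlive/text divs)
```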

Slide 15

Slide 15 text

Enlive Selectors
• Typically a vector of steps that match nodes (not necessarily contiguous in the document tree)
• Each element of this vector is a keyword that represents a CSS selector

Slide 16

Slide 16 text

Example

<html>
  <body>
    <div><a href="hello.html">HELLO</a></div>
  </body>
</html>

{:tag :html,
 :attrs nil,
 :content
 ({:tag :body,
   :attrs nil,
   :content
   ({:tag :div,
     :attrs nil,
     :content
     ({:tag :a,
       :attrs {:href "hello.html"},
       :content ("HELLO")})})})}

[:a]
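The selector above can be tried as follows. This sketch also illustrates the "not necessarily contiguous" point: consecutive steps match ancestors and descendants, not only parents and children. The namespace name and inline HTML are assumptions.

```clojure
(ns scrape.enlive-selectors
  (:require [net.cgrand.enlive-html :as enlive])
  (:import [java.io StringReader]))

(def doc
  (enlive/html-resource
   (StringReader.
    "<html><body><div><a href=\"hello.html\">HELLO</a></div></body></html>")))

;; [:a] matches every <a> in the document.
(def links (enlive/select doc [:a]))

;; Attributes live under :attrs in each node map.
(map #(get-in % [:attrs :href]) links)

;; [:body :a] still matches, even though a <div> sits between body and a.
(enlive/select doc [:body :a])

;; :> restricts a step to direct children, so this matches nothing here.
(enlive/select doc [:body :> :a])
```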

Slide 17

Slide 17 text

Example II

#{[:div.foo] [:div.bar]}

<html>
  <body>
    <div class="foo"></div>
    <div class="bar"></div>
  </body>
</html>

({:tag :html,
  :attrs nil,
  :content
  ({:tag :body,
    :attrs nil,
    :content
    ({:tag :div,
      :attrs {:class "foo"},
      :content nil}
     {:tag :div,
      :attrs {:class "bar"},
      :content nil})})})
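In enlive, a set of selectors means union, while a vector of steps means a descendant chain. The sketch below contrasts the two on a two-div document like the one in this example, assuming the intent is to select both divs. The namespace name and inline HTML are illustrative.

```clojure
(ns scrape.enlive-union
  (:require [net.cgrand.enlive-html :as enlive])
  (:import [java.io StringReader]))

(def doc
  (enlive/html-resource
   (StringReader.
    "<html><body><div class=\"foo\"></div><div class=\"bar\"></div></body></html>")))

;; Union: nodes matching either selector, in document order.
(def both (enlive/select doc #{[:div.foo] [:div.bar]}))

(map #(get-in % [:attrs :class]) both)

;; Descendant chain: a div.bar somewhere inside a div.foo.
;; The two divs are siblings, so nothing matches.
(enlive/select doc [:div.foo :div.bar])
```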

Slide 18

Slide 18 text

Demo
• Extract article metadata from reddit (title, submitter, number of comments)

Slide 19

Slide 19 text

Best of Both Worlds
• Want to use XPaths (legacy codebase)
• Also want to use enlive
• It is possible

Slide 20

Slide 20 text

Best of Both Worlds
• (:use [clj-xpath.core :only [$x]])
• ($x "//html" html-object)

<html>
  <body>
    <div>HELLO</div>
  </body>
</html>

{:tag :html,
 :attrs nil,
 :content
 ({:tag :body,
   :attrs nil,
   :content
   ({:tag :div,
     :attrs nil,
     :content ("HELLO")})})}

Slide 21

Slide 21 text

Further Reading
• Enlive can be used for HTML templates and more complicated scraping:
  • https://github.com/swannodette/enlive-tutorial
• More on XPath, CSS Selectors:
  • http://www.w3.org/TR/CSS2/selector.html
• Enlive-style selector syntax in the browser (using cljs):
  • https://github.com/prismatic/dommy/

Slide 22

Slide 22 text

Questions?