Clojure Scraping Overview

clj-xpath and enlive


Shriphani Palakodety

February 11, 2014


  1. Clojure Scraping Utils
     Shriphani Palakodety
     spalakod@cs.cmu.edu

  2. Scraping
     • Dataset construction (unstructured -> structured)
     • APIs are not widespread
     • Unavoidable
  3. Libraries
     • clj-xpath (XPath queries)
     • enlive (CSS selectors)

  4. XPath Queries
     • Select nodes in an XML document
     • Clojure lib: clj-xpath, which wraps the Java XPath library
  5. XPath Example
     • Tree nodes: root, l, r, ll, lr (lr sits under l, which sits under root)
     • Select node lr:
       • /root/l/lr
       • /root//lr
     • Many equivalent queries exist for the same node
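A minimal sketch of the two queries above, written against the JDK's `javax.xml.xpath` API directly (the Java XPath library that clj-xpath is described as wrapping). The exact tree shape (root with children l and r, l with children ll and lr) is an assumption read off the slide, and `parse-xml` and `select-names` are hypothetical helper names.

```clojure
(import '(javax.xml.parsers DocumentBuilderFactory)
        '(javax.xml.xpath XPathConstants XPathFactory)
        '(org.w3c.dom Document NodeList)
        '(org.xml.sax InputSource)
        '(java.io StringReader))

;; Assumed tree: root -> l, r; l -> ll, lr.
(def tree-xml "<root><l><ll/><lr/></l><r/></root>")

(defn parse-xml
  "Parses an XML string into an org.w3c.dom.Document."
  ^Document [^String s]
  (.parse (.newDocumentBuilder (DocumentBuilderFactory/newInstance))
          (InputSource. (StringReader. s))))

(defn select-names
  "Runs an XPath query and returns the names of the matching nodes."
  [^Document doc ^String query]
  (let [xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList hits (.evaluate xpath query doc XPathConstants/NODESET)]
    (mapv (fn [i] (.getNodeName (.item hits (int i))))
          (range (.getLength hits)))))

(def doc (parse-xml tree-xml))

(select-names doc "/root/l/lr") ; absolute path, one step per level
(select-names doc "/root//lr")  ; // matches lr at any depth below root
```

Both calls return `["lr"]` on this tree; the `//` form keeps working if extra levels are later inserted between root and lr, which is why the second query is often the more robust choice for scraping.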
  6. XPath Query Predicates
     • Substring matches on a node's attributes using contains (a plain substring test, not a regex)
  7. Example
     Query: //html/div[contains(@class, 'hello')]/a
     Document:
       <html>
         <div class='hello'>
           <a href='hello.html'>Hello</a>
         </div>
         <div class='bye'>
           <a href='bye.html'>Bye</a>
         </div>
       </html>
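The predicate query above could be run with clj-xpath roughly as follows. This is a sketch that assumes clj-xpath is on the classpath and that `$x` returns one map per matching node with `:text` and `:attrs` entries; `page` and `hello-links` are names made up for the example.

```clojure
(require '[clj-xpath.core :refer [$x]])

;; The document from the slide, written as well-formed XML.
(def page
  (str "<html>"
       "<div class='hello'><a href='hello.html'>Hello</a></div>"
       "<div class='bye'><a href='bye.html'>Bye</a></div>"
       "</html>"))

;; contains() is a substring test, so class='othello' would match as well.
(def hello-links
  ($x "//html/div[contains(@class, 'hello')]/a" page))

(map :text hello-links)               ; link text of each match
(map (comp :href :attrs) hello-links) ; href attribute of each match
```

Only the link inside the first div is selected, because the second div's class does not contain the substring `hello`.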
  8. Position Information
     • Sometimes you need to refine your query
     • Example: the first child of some node
  9. Example
     Query: //html/div[2]/a
     Document:
       <html>
         <div>
           <a href='hello.html'>
             hello
           </a>
         </div>
         <div>
           <a href='hello2.html'>
             hello again
           </a>
         </div>
       </html>
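A short sketch of the positional query above, again assuming clj-xpath is available and that `$x` results expose `:attrs`. XPath positions are 1-based, so `div[2]` is the second div child of html. The names `page` and `href-of` are hypothetical.

```clojure
(require '[clj-xpath.core :refer [$x]])

(def page
  (str "<html>"
       "<div><a href='hello.html'>hello</a></div>"
       "<div><a href='hello2.html'>hello again</a></div>"
       "</html>"))

(defn href-of
  "Returns the href of the first node matched by the query, or nil."
  [query]
  (-> ($x query page) first :attrs :href))

(href-of "//html/div[1]/a") ; link in the first div
(href-of "//html/div[2]/a") ; link in the second div
```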
  10. Workflow
     • Parse HTML (HtmlCleaner)
     • Convert to an org.w3c.dom.Document object
     • Run XPath queries on this object
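The three workflow steps might be wired together like this. It is a sketch under assumptions: HtmlCleaner and clj-xpath are both on the classpath, `DomSerializer` is used for the conversion step (the slide does not say which converter the author used), and `html->dom` is a hypothetical helper name.

```clojure
(require '[clj-xpath.core :refer [$x]])
(import '(org.htmlcleaner HtmlCleaner DomSerializer))

(defn html->dom
  "Cleans possibly malformed HTML and returns an org.w3c.dom.Document."
  [^String html]
  (let [cleaner  (HtmlCleaner.)
        tag-node (.clean cleaner html)]                     ; step 1: parse
    (.createDOM (DomSerializer. (.getProperties cleaner))  ; step 2: convert
                tag-node)))

;; Closing tags for div, body and html are missing on purpose;
;; HtmlCleaner is expected to repair them.
(def dom
  (html->dom "<html><body><div class='hello'><a href='hello.html'>Hello</a>"))

;; Step 3: run XPath queries on the DOM object.
(def hrefs
  (map (comp :href :attrs) ($x "//div[@class='hello']/a" dom)))
```

The cleaning step matters because real pages are rarely well-formed XML, and a strict XML parser would reject them before any XPath query could run.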
  11. Demo
     • Extract article metadata from reddit (title, submitter, number of comments)
  12. CSS Selectors
     • Pick the HTML elements we want to style
     • Some examples:
       • Select a div with class hello: div.hello
       • Select a div with id hello: div#hello
     • For selections like these, XPath queries and CSS selectors are interchangeable
  13. Enlive
     • An HTML document is just a Clojure map
       <html>
         <body>
           <div>
             HELLO
           </div>
         </body>
       </html>

       {:tag :html,
        :attrs nil,
        :content
        ({:tag :body,
          :attrs nil,
          :content
          ({:tag :div,
            :attrs nil,
            :content ("HELLO")})})}
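Because the parsed document is ordinary Clojure data, it can be explored with core functions alone. The sketch below needs no libraries; it uses vectors for `:content` (the slide prints sequences), and `nodes-with-tag` is a hypothetical helper, not part of Enlive.

```clojure
;; The map from the slide, typed in by hand.
(def doc
  {:tag :html
   :attrs nil
   :content [{:tag :body
              :attrs nil
              :content [{:tag :div
                         :attrs nil
                         :content ["HELLO"]}]}]})

(defn nodes-with-tag
  "Walks the node tree depth-first and returns every node with the given tag."
  [node tag]
  (filter #(= tag (:tag %))
          (tree-seq map? :content node)))

;; Drill down by hand: html -> body -> div -> text.
(-> doc :content first :content first :content first)

;; Or search the whole tree for a tag.
(nodes-with-tag doc :div)
```

The first expression yields `"HELLO"`, and the second yields a one-element sequence holding the `:div` node.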
  14. Enlive Workflow
     • Process your document:
       • (enlive/html-resource html-stream)
     • Select your elements:
       • (enlive/select processed-doc selector)
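The two steps can be tried on an in-memory document. This sketch assumes Enlive is on the classpath under the alias `enlive` and feeds `html-resource` a `StringReader` instead of a network stream. The sample HTML and the var names are invented for illustration; the selectors reuse `div.hello` and `div#hello` from the CSS selector slide.

```clojure
(require '[net.cgrand.enlive-html :as enlive])
(import '(java.io StringReader))

(def html
  (str "<html><body>"
       "<div class=\"hello\"><a href=\"hello.html\">Hello</a></div>"
       "<div id=\"hello\">by id</div>"
       "</body></html>"))

;; Step 1: process the document.
(def processed-doc (enlive/html-resource (StringReader. html)))

;; Step 2: select elements.
(def by-class (enlive/select processed-doc [:div.hello]))    ; div with class hello
(def by-id    (enlive/select processed-doc [:div#hello]))    ; div with id hello
(def links    (enlive/select processed-doc [:div.hello :a])) ; a inside div.hello

(map enlive/text by-id)                ; text content of each match
(map #(get-in % [:attrs :href]) links) ; href of each matched link
```

Each selection is a sequence of plain node maps, so the usual sequence functions apply to the results.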
  15. Enlive Selectors
     • Typically a vector of steps that match nested nodes (not necessarily contiguous, i.e. a match need not be a direct child of the previous one)
     • Each element of the vector is a keyword that represents a CSS selector
  16. Example
     Selector: [:a]
     Document:
       <html>
         <body>
           <div>
             <a href='hello.html'>
               HELLO
             </a>
           </div>
         </body>
       </html>

       {:tag :html,
        :attrs nil,
        :content
        ({:tag :body,
          :attrs nil,
          :content
          ({:tag :div,
            :attrs nil,
            :content
            ({:tag :a,
              :attrs {:href "hello.html"},
              :content ("HELLO")})})})}
  17. Example II
     Selector: #{[div.foo div.bar]}
     Document:
       <html>
         <body>
           <div class='foo'>
           </div>
           <div class='bar'>
           </div>
         </body>
       </html>

       ({:tag :html,
         :attrs nil,
         :content
         ({:tag :body,
           :attrs nil,
           :content
           ({:tag :div,
             :attrs {:class "foo"},
             :content nil}
            {:tag :div,
             :attrs {:class "bar"},
             :content nil})})})
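Assuming the selector on this slide is meant to match both divs, Enlive writes such a union as a set of selector vectors with keyword steps. The sketch below contrasts that with a single vector, which chains steps from ancestor to descendant. It assumes Enlive is on the classpath; `doc`, `either` and `nested` are illustrative names.

```clojure
(require '[net.cgrand.enlive-html :as enlive])
(import '(java.io StringReader))

(def doc
  (enlive/html-resource
   (StringReader.
    (str "<html><body>"
         "<div class=\"foo\"></div>"
         "<div class=\"bar\"></div>"
         "</body></html>"))))

;; A set of selector vectors is a union: nodes matching either selector.
(def either (enlive/select doc #{[:div.foo] [:div.bar]}))

;; A single vector chains steps: a div.bar nested somewhere inside a div.foo.
(def nested (enlive/select doc [:div.foo :div.bar]))

(set (map #(get-in % [:attrs :class]) either)) ; classes of the matched divs
(count nested)                                 ; the two divs are siblings
```

On this document the union matches both divs, while the chained selector matches nothing because neither div contains the other.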
  18. Demo
     • Extract article metadata from reddit (title, submitter, number of comments)
  19. Best of Both Worlds
     • Want to use XPaths (legacy codebase)
     • Also want to use enlive
     • It is possible
  20. Best of Both Worlds
     • (:use [clj-xpath.core :only [$x]])
     • ($x "//html" html-object)
       <html>
         <body>
           <div>
             HELLO
           </div>
         </body>
       </html>

       {:tag :html,
        :attrs nil,
        :content
        ({:tag :body,
          :attrs nil,
          :content
          ({:tag :div,
            :attrs nil,
            :content ("HELLO")})})}
  21. Further Reading
     • Enlive can be used for HTML templates and more complicated scraping:
       • https://github.com/swannodette/enlive-tutorial
     • More on XPath, CSS Selectors:
       • http://www.w3.org/TR/CSS2/selector.html
     • Enlive-style selector syntax in the browser (using cljs):
       • https://github.com/prismatic/dommy/
  22. Questions?