How to scrape web contents in Clojure

ayato
January 09, 2016

Transcript

  1. How to scrape web contents in Clojure @_ayato_p

  2. あやぴー (ayato_p), Clojurian, Cybozu Startups, Inc.

  3. What is web scraping? Web scraping is a computer software technique for extracting information from websites. (by Wikipedia)

  4. Problems: Web content is shaped much like a tree. Many pages look alike but differ subtly. You have to write recursive code to walk the tree. All of this is generally tedious.
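
To make the "recursive code" problem concrete, here is a minimal hand-written sketch in plain Clojure. The `site` map and the `crawl` function are hypothetical stand-ins, not part of the talk: the map plays the role of fetched and parsed pages, and a real scraper would also need HTTP requests, HTML parsing, and error handling.

```clojure
;; Hypothetical in-memory "site": each URL maps to a page type, optional
;; data, and the URLs of its child pages. A real scraper would fetch and
;; parse HTML here instead of looking up a map.
(def site
  {"/"           {:type :index :children ["/2016/01" "/2016/02"]}
   "/2016/01"    {:type :month :children ["/2016/01/08" "/2016/01/09"]}
   "/2016/02"    {:type :month :children ["/2016/02/01"]}
   "/2016/01/08" {:type :day :title "Post A"}
   "/2016/01/09" {:type :day :title "Post B"}
   "/2016/02/01" {:type :day :title "Post C"}})

(defn crawl
  "Visits `url` and, recursively, every page reachable from it.
  Returns the titles of all :day pages in depth-first order."
  [url]
  (let [{:keys [type title children]} (get site url)]
    (concat (when (= type :day) [title])
            (mapcat crawl children))))

(crawl "/")
;; => ("Post A" "Post B" "Post C")
```

Every new page type means another branch inside `crawl`, which is where the tedium comes from.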

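  5. Skyscraper: It walks the tree structure recursively for you. You only write how to process each type of page. It returns a lazy sequence. It comes with a caching mechanism. The scraping part depends on Enlive. https://github.com/nathell/skyscraper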
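
The division of labor described on this slide can be sketched in plain Clojure. This is not Skyscraper's implementation or API: `pages`, `process`, and `scrape-all` are hypothetical names, and the in-memory map stands in for downloaded HTML. The sketch only illustrates the idea of one processor per page type plus a generic driver that returns a lazy sequence.

```clojure
;; Hypothetical stand-in for downloaded pages, keyed by URL.
(def pages
  {"/"        {:months ["/2016/01" "/2016/02"]}
   "/2016/01" {:tweets ["a" "b"]}
   "/2016/02" {:tweets ["c"]}})

;; One method per page type. A processor returns either further contexts
;; to follow (maps with :url and :processor) or finished records.
(defmulti process (fn [ctx _page] (:processor ctx)))

(defmethod process :index-page [_ctx page]
  (for [url (:months page)]
    {:url url :processor :month-page}))

(defmethod process :month-page [_ctx page]
  (for [tweet (:tweets page)]
    {:tweet tweet}))

(defn scrape-all
  "Generic driver: expands contexts depth-first and lazily yields
  the finished records (maps without a :processor key)."
  [ctxs]
  (lazy-seq
    (when-let [[ctx & more] (seq ctxs)]
      (if (:processor ctx)
        (scrape-all (concat (process ctx (get pages (:url ctx))) more))
        (cons ctx (scrape-all more))))))

(scrape-all [{:url "/" :processor :index-page}])
;; => ({:tweet "a"} {:tweet "b"} {:tweet "c"})
```

With this split, supporting a new page type means adding one `defmethod`; the traversal code stays untouched.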

  6. Example

         ;; Aliases assumed from usage: s is Skyscraper, html is Enlive
         ;; (net.cgrand.enlive-html). The ns form is not shown on the slide.

         ;; Seed: the initial context that tells the scraper where to start.
         (defn seed [username from until]
           (let [url (str "http://twilog.org/" username)]
             [{:username username
               :from from
               :until until
               :url url
               :processor ::user-page}]))

         ;; Processor for a user's top page: report an error message or
         ;; hand off to the archives page.
         (s/defprocessor user-page
           :cache-template "twilog/:username"
           :process-fn (fn [res {:keys [username]}]
                         (let [not-registered (seq (html/select res [:div.box-info.box-icon]))
                               not-found (seq (html/select res [:div.box-attention.box-icon]))]
                           (cond
                             not-registered [{:msg "This account was not registered."}]
                             not-found [{:msg "This account was not found."}]
                             :else [{:url (str "http://twilog.org/" username "/archives")
                                     :processor ::archives-page}]))))

  7. Example

         ;; Entry point: takes a username plus an optional map of options.
         (defn scrape [username & [{:as options
                                    :keys [html-cache processed-cache from until]
                                    :or {html-cache true
                                         processed-cache true
                                         from "00000000"
                                         until "99999999"}}]]
           ;; create-handler is not shown on the slides.
           (let [handler (create-handler identity options)]
             (handler (s/scrape (seed username from until)
                                :html-cache html-cache
                                :processed-cache processed-cache))))

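
The argument vector of `scrape` combines three destructuring features: an optional trailing map (`& [{...}]`), `:keys` with `:or` defaults, and `:as`. The sketch below isolates that shape in a hypothetical function, `show-options`, which is not part of the talk. One detail worth seeing in action: the `:or` defaults apply only to the bound symbols, not to the map bound by `:as`.

```clojure
(defn show-options
  "Hypothetical helper with the same argument shape as `scrape`.
  Returns the values it ended up with so they can be inspected."
  [username & [{:as options
                :keys [from until]
                :or {from "00000000" until "99999999"}}]]
  {:username username :from from :until until :options options})

;; No options map: the defaults fill in, and `options` itself is nil.
(show-options "alice")
;; => {:username "alice", :from "00000000", :until "99999999", :options nil}

;; Partial options map: `until` falls back to its default, while
;; `options` holds exactly what the caller passed.
(show-options "alice" {:from "20160101"})
;; => {:username "alice", :from "20160101", :until "99999999",
;;     :options {:from "20160101"}}
```

So in the slide's code, whatever receives `options` sees only the keys the caller supplied, while `from` and `until` always have a value.
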
  8. Conclusion: Using Skyscraper makes you happy. Clojure is the best!

  9. Enjoy Clojure