How to scrape web contents in Clojure

ayato
January 09, 2016

Transcript

  1. How to scrape web contents in Clojure
    @_ayato_p

  2. Ayapi
    Clojurian
    Cybozu Startups, Inc.

  3. What is web scraping?
    "Web scraping is a computer software technique for extracting information from websites."
    by Wikipedia

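  4. Problems
    Web content is shaped roughly like a tree.
    There are many similar pages, but they differ in subtle ways.
    You have to write recursive code that walks the tree.
    It is mostly a chore.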
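
To make the pain point concrete, here is a minimal sketch of the kind of hand-written recursive walk this slide describes. The `site` map and its `:type`, `:title` and `:children` keys are hypothetical in-memory stand-ins for fetched pages; nothing here comes from the talk itself.

```clojure
;; Hypothetical stand-in for a small site: every nested map is one "page".
(def site
  {:type :index
   :children [{:type :category
               :title "clojure"
               :children [{:type :article :title "Enlive basics"}
                          {:type :article :title "Lazy seqs"}]}
              {:type :category
               :title "misc"
               :children [{:type :article :title "Hello"}]}]})

(defn article-titles
  "Walks the page tree recursively and collects the titles of :article pages."
  [page]
  (if (= :article (:type page))
    [(:title page)]
    (mapcat article-titles (:children page))))

(article-titles site)
```

Even in this toy form, the traversal and the per-page logic are tangled together in one function, which is the part that gets tedious once page types multiply.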

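  5. Skyscraper
    Walks the tree recursively for you.
    You only write how to process each type of page.
    Returns a lazy sequence.
    Comes with a caching mechanism.
    The scraping part depends on Enlive.
    https://github.com/nathell/skyscraper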
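
The "one processor per page type" idea can be sketched in plain Clojure without any library. This is only an illustration of the concept, not Skyscraper's actual implementation: the `pages` map, the `process` multimethod and the `walk` function are all hypothetical names, and no network access or caching is involved.

```clojure
;; Hypothetical in-memory "web": URL -> already-parsed page.
(def pages
  {"/"  {:links ["/a" "/b"]}
   "/a" {:title "Page A"}
   "/b" {:title "Page B"}})

;; One method per page type. Each returns a seq of maps: a map with a
;; :processor key means "follow this next", a map without one is a result.
(defmulti process (fn [context] (:processor context)))

(defmethod process :index [{:keys [url]}]
  (for [link (:links (pages url))]
    {:url link :processor :detail}))

(defmethod process :detail [{:keys [url]}]
  [{:title (:title (pages url))}])

(defn walk
  "Expands contexts until only result maps remain; returns a lazy sequence.
  Each child map is merged onto its parent context."
  [contexts]
  (lazy-seq
    (mapcat (fn [context]
              (if (:processor context)
                (walk (map #(merge (dissoc context :processor) %)
                           (process context)))
                [context]))
            contexts)))

(walk [{:url "/" :processor :index}])
```

The generic `walk` is written once, and supporting a new page type only means adding another `process` method.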

  6. Example
    ;; Assumes `s` aliases the Skyscraper namespace and `html` aliases
    ;; net.cgrand.enlive-html; the requires are not shown on the slide.
    (defn seed [username from until]
      (let [url (str "http://twilog.org/" username)]
        [{:username username
          :from from
          :until until
          :url url
          :processor ::user-page}]))

    ;; ::archives-page is another processor that is not shown on the slide.
    (s/defprocessor user-page
      :cache-template "twilog/:username"
      :process-fn (fn [res {:keys [username]}]
                    (let [not-registered (seq (html/select res [:div.box-info.box-icon]))
                          not-found (seq (html/select res [:div.box-attention.box-icon]))]
                      (cond
                        not-registered [{:msg "This account was not registered."}]
                        not-found [{:msg "This account was not found."}]
                        :else [{:url (str "http://twilog.org/" username "/archives")
                                :processor ::archives-page}]))))

  7. Example
    ;; `create-handler` is defined elsewhere and is not shown on the slides.
    (defn scrape [username & [{:as options
                               :keys [html-cache processed-cache from until]
                               :or {html-cache true processed-cache true
                                    from "00000000" until "99999999"}}]]
      (let [handler (create-handler identity options)]
        (handler (s/scrape (seed username from until)
                           :html-cache html-cache
                           :processed-cache processed-cache))))
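
The optional options map in `scrape` relies on `& [{:keys ... :or ...}]` destructuring. The sketch below isolates just that pattern so the defaults can be checked at a REPL; `resolve-options` and the username `"someuser"` are hypothetical and only echo back what the function body would see.

```clojure
(defn resolve-options
  "Hypothetical helper: returns the option values the body of `scrape` would see."
  [username & [{:keys [html-cache processed-cache from until]
                :or {html-cache true processed-cache true
                     from "00000000" until "99999999"}}]]
  {:username username
   :html-cache html-cache
   :processed-cache processed-cache
   :from from
   :until until})

;; No options map at all: every default applies.
(resolve-options "someuser")

;; Partial options: supplied keys win, including an explicit false.
(resolve-options "someuser" {:from "20160101" :html-cache false})
```

Because `:or` defaults apply only to missing keys, passing `:html-cache false` really does disable the cache instead of falling back to `true`.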

  8. Conclusion
    Using Skyscraper makes you happy.
    Clojure is the best!

  9. Enjoy Clojure
