Data Scraping with Ruby

Duke
October 21, 2012

Talk given at UNIFEI (Universidade Federal de Itajubá), where I spoke about data scraping with Ruby.

Transcript

  1. Data scraping

  2. Duke
    github: github.com/dukex
    twitter: twitter.com/_dukex
    blog*: duke.vertigem.xxx

  3. Data scraping

  4. THacker
    groups.google.com/group/thackday

  5. @PROJdeLei
    github: github.com/dukex/projdelei
    twitter: twitter.com/PROJdeLei

  6. @PROJdeLei
    github: github.com/dukex/projdelei
    twitter: twitter.com/PROJdeLei

  7. Ruby?
    Because it solves my problems WELL

  8. Nokogiri (鋸)
    LibXML
    Hpricot

  9. Nokogiri (鋸)
    LibXML
    Hpricot
    scraperwiki.com/docs/ruby/ruby_libraries

  10. Beautiful Soup
    lxml
    Scrapy

  11. Beautiful Soup
    lxml
    Scrapy
    scraperwiki.com/docs/python/python_libraries

  12. Simple HTML DOM
    YQL*

  13. Simple HTML DOM
    YQL*
    scraperwiki.com/docs/php/php_libraries

  14. JavaScript?!?!

  15. JavaScript?!?!
    ReactiveScraper
    github.com/OKFN-BR/reactive_scraper

  16. var parser = function (i, tr) {
      var item = $(tr),
          hour = item.find("td:eq(0)").text(),   // first cell
          title = item.find("td:eq(1)").text();  // second cell
      // Persist one record per row
      document.save({
        hour: hour,
        title: title
      });
    };
    // Run the parser over every matching table row
    $(".table-striped tbody tr").each(parser);

  17. Nokogiri (鋸)

  18. Nokogiri (鋸)
    C extension
    Searches with XPath or
    CSS3 selectors

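  19. require "nokogiri"
    # wewebconf_html holds the page source, downloaded beforehand
    parser = Nokogiri::HTML(wewebconf_html)
    parser.search(".table-striped tbody tr").each do |tr|
      # search (not jQuery's find) runs the CSS query on the row
      cells = tr.search("td")
      hour = cells[0].text
      title = cells[1].text  # second cell, matching the JavaScript version
      ...
    end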

  20. ScraperWiki
    Web platform
    Ruby, Python, and PHP

  21. scraperwiki.com/scrapers/funk_download

  22. def download_funk(category)
      target = "#{BASE_URL}?cat=#{category}"
      # parser(url) is a helper that is not shown on the slide
      music_index_parser = parser(target)
      # Each .download-funk element on the index page is one track
      music_index_parser.search(".download-funk").each do |parser_musica|
        begin
          # Follow the first link to the track's detail page
          music_show_link = parser_musica.search("a")[0].attr("href")
          music_show_parser = parser("#{BASE_URL}#{music_show_link}")
          music_info = music_show_parser.search("#interna_a")
          name = music_info.search("h2").text()
          link = music_info.search(".texto a").attr("href")
          download = parser_musica.search(".contador").text()
          date = parser_musica.search(".data").text()
          ScraperWiki.save(["name"], ....)
        rescue
          # Skip tracks whose page cannot be fetched or parsed
          next
        end
      end
    end

  23. scraperwiki.com/profiles/emersonvinicius
    scraperwiki.com/tags

  24. Tests (BDD)

  25. Tests (BDD)
    RSpec
    WebMock

  26. @PROJdeLei
    github.com/dukex/projdelei/blob/master/spec/lib/scraper_spec.rb

  27. class Beco203Bot < AugustaBot
      def create_parties!
      end
      private
      def parties
        parser(".agenda-item.beco-sp", "capa-beco-sp.php")
      end
    end
    class Beco203PartyBot < AugustaPartyBot
      def name
      end
    end

  28. require 'spec_helper'
    describe Beco203Bot do
      let(:beco_bot) { Beco203Bot.new }
      describe "create_parties!" do
        before do
          ...{stub_requests}...
        end
        it "saves a party with a name" do
          beco_bot.create_parties!
          Party.first.name.should == 'Festa 1'
        end
      end
    end

  29. class Beco203Bot < AugustaBot
      def create_parties!
        parties.each do |item|
          # {params} is a placeholder on the slide
          party_bot = Beco203PartyBot.new({params})
          # item/".texto" is Nokogiri shorthand for item.search(".texto")
          party_bot.name = (item/".texto").text
          Party.create! party_bot.attributes
        end
      end
      private
      def parties
        parser(".agenda-item.beco-sp", "capa-beco-sp.php")
      end
    end
    class Beco203PartyBot < AugustaPartyBot
      def name
        parser.search(".conteudo-interna h1").text.clean
      end
    end

  30. Redistribute

  31. Redistribute
    REST APIs

  32. github.com/vertigem/api_metrosp
    api.metrosp.vertigem.xxx/lines.json
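
Slides 31 and 32 point to REST APIs as a way to redistribute scraped data. A minimal sketch of that idea, using only Ruby's standard library, is a Rack-style callable that serves records as JSON at `/lines.json`. The record fields and the `LinesApi` name are hypothetical; the actual api_metrosp implementation is not shown in the talk.

```ruby
require "json"

# Hypothetical scraped records; the real API's fields are not shown in the talk.
LINES = [
  { "id" => 1, "name" => "Line A", "status" => "normal" },
  { "id" => 2, "name" => "Line B", "status" => "reduced speed" }
].freeze

# A Rack-style app: a callable that takes the request environment hash
# and returns [status, headers, body].
LinesApi = lambda do |env|
  if env["REQUEST_METHOD"] == "GET" && env["PATH_INFO"] == "/lines.json"
    [200, { "content-type" => "application/json" }, [JSON.generate(LINES)]]
  else
    [404, { "content-type" => "application/json" },
     [JSON.generate({ "error" => "not found" })]]
  end
end

# Calling the app directly with a minimal environment hash:
status, _headers, body = LinesApi.call(
  { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/lines.json" }
)
puts status
puts body.join
```

Any Rack-compatible server can mount a callable like this, and calling it directly with a hash is enough to try it out without a server.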

  33. github.com/vertigem/api_camara

  34. github.com/vertigem/api_camara
    api.camara.vertigem.xxx

  35. thedatahub.org

  36. Create Value
    Cross-referencing
    Visualizations
    Etc.

  37. scraperwiki.com/scrapers/augusta

  38. augustaapp.com

  39. Create Value
    Cross-referencing
    Visualizations
    Etc.

  40. datajournalism.stanford.edu

  41. Create Value
    Cross-referencing
    Visualizations
    Etc.

  42. What was the average price of a house in São Paulo
    in 1921? Which party is happening tonight on Rua
    Augusta? Data to answer many, many questions
    like these is out there on the Internet
    somewhere, but it is not always easy to find.
    - Hacked from thedatahub.org

  43. Thank you!
    github.com/dukex
    github.com/vertigem
