Slide 1

Slide 1 text

Data scraping

Slide 2

Slide 2 text

Duke

Slide 3

Slide 3 text

Duke
github: github.com/dukex
twitter: twitter.com/_dukex
blog*: duke.vertigem.xxx

Slide 4

Slide 4 text

Data scraping

Slide 5

Slide 5 text

THacker

Slide 6

Slide 6 text

THacker
groups.google.com/group/thackday

Slide 7

Slide 7 text

@PROJdeLei

Slide 8

Slide 8 text

@PROJdeLei
github: github.com/dukex/projdelei
twitter: twitter.com/PROJdeLei

Slide 9

Slide 9 text

@PROJdeLei
github: github.com/dukex/projdelei
twitter: twitter.com/PROJdeLei

Slide 10

Slide 10 text

Ruby?

Slide 11

Slide 11 text

Ruby? Because it solves my problems WELL

Slide 12

Slide 12 text

Nokogiri (鋸), LibXML, Hpricot

Slide 13

Slide 13 text

Nokogiri (鋸), LibXML, Hpricot
scraperwiki.com/docs/ruby/ruby_libraries

Slide 14

Slide 14 text

Beautiful Soup, lxml, Scrapy

Slide 15

Slide 15 text

Beautiful Soup, lxml, Scrapy
scraperwiki.com/docs/python/python_libraries

Slide 16

Slide 16 text

Simple HTML DOM, YQL*

Slide 17

Slide 17 text

Simple HTML DOM, YQL*
scraperwiki.com/docs/php/php_libraries

Slide 18

Slide 18 text

Javascript?!?!

Slide 19

Slide 19 text

Javascript?!?!
ReactiveScraper: github.com/OKFN-BR/reactive_scraper

Slide 20

Slide 20 text

var parser = function(i, tr){
  var item = $(tr)
    , hour = item.find("td:eq(0)").text()
    , title = item.find("td:eq(1)").text();
  document.save({ hour : hour, title : title });
};
$(".table-striped tbody tr").each(parser);

Slide 21

Slide 21 text

Nokogiri (鋸)

Slide 22

Slide 22 text

Nokogiri (鋸)
C extension
Searches with XPath or CSS3 selectors
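
To make "searches with XPath" concrete, here is a minimal sketch of an XPath query over a small table. It deliberately swaps in REXML (shipped with Ruby) instead of Nokogiri so it runs without installing a gem; with Nokogiri, the same XPath string would be passed to `doc.xpath`, and the CSS form `.table-striped tbody tr` to `doc.css`. The markup and the `SCHEDULE_MARKUP` name are made up for illustration, and REXML needs well-formed markup, which real pages often are not.

```ruby
require 'rexml/document'

# Made-up, well-formed markup standing in for a schedule page.
SCHEDULE_MARKUP = <<~MARKUP
  <table class="table-striped">
    <tbody>
      <tr><td>09:00</td><td>Opening</td></tr>
      <tr><td>10:00</td><td>Data scraping</td></tr>
    </tbody>
  </table>
MARKUP

doc = REXML::Document.new(SCHEDULE_MARKUP)

# XPath: every row inside the tbody of the striped table.
rows = REXML::XPath.match(doc, "//table[@class='table-striped']/tbody/tr")

# For each row, read the first two cells as hour and title.
schedule = rows.map do |tr|
  cells = REXML::XPath.match(tr, "td")
  { hour: cells[0].text, title: cells[1].text }
end

schedule.each { |item| puts "#{item[:hour]} #{item[:title]}" }
```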

Slide 23

Slide 23 text

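require 'nokogiri'

# wewebconf_html holds the HTML of the schedule page
parser = Nokogiri::HTML(wewebconf_html)
parser.search(".table-striped tbody tr").each do |tr|
  cells = tr.search("td")
  hour  = cells[0].text  # first column
  title = cells[1].text  # second column
  ...
end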

Slide 24

Slide 24 text

ScraperWiki

Slide 25

Slide 25 text

ScraperWiki
Web platform
Ruby, Python and PHP

Slide 26

Slide 26 text

scraperwiki.com/scrapers/funk_download

Slide 27

Slide 27 text

def download_funk(category)
  target = "#{BASE_URL}?cat=#{category}"
  music_index_parser = parser(target)
  music_index_parser.search(".download-funk").each do |parser_musica|
    begin
      music_show_link = parser_musica.search("a")[0].attr("href")
      music_show_parser = parser("#{BASE_URL}#{music_show_link}")
      music_info = music_show_parser.search("#interna_a")
      name = music_info.search("h2").text()
      link = music_info.search(".texto a").attr("href")
      download = parser_musica.search(".contador").text()
      date = parser_musica.search(".data").text()
      ScraperWiki.save(["name"], ....)
    rescue
      next
    end
  end
end
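
The `["name"]` argument in the save call above reads as a list of unique key columns: saving a record whose key values already exist replaces the earlier record instead of duplicating it. A minimal in-memory sketch of that idea follows. `KeyedStore` and the sample records are hypothetical and are not ScraperWiki's datastore; they only illustrate why a scraper can be re-run safely.

```ruby
# Hypothetical in-memory stand-in for a keyed datastore.
class KeyedStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Saves a record, replacing any earlier record with the same key values.
  def save(unique_keys, record)
    key = unique_keys.map { |k| record.fetch(k) }
    @rows[key] = record
  end
end

store = KeyedStore.new
store.save(["name"], { "name" => "Track A", "downloads" => "10" })
store.save(["name"], { "name" => "Track B", "downloads" => "3" })
# Second run of the scraper: same name, updated counter.
store.save(["name"], { "name" => "Track A", "downloads" => "12" })

puts store.rows.values.inspect
```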

Slide 28

Slide 28 text

scraperwiki.com/profiles/emersonvinicius scraperwiki.com/tags

Slide 29

Slide 29 text

Tests (BDD)

Slide 30

Slide 30 text

Tests (BDD)
RSpec
WebMock

Slide 31

Slide 31 text

@PROJdeLei
github.com/dukex/projdelei/blob/master/spec/lib/scraper_spec.rb

Slide 32

Slide 32 text

class Beco203Bot < AugustaBot
  def create_parties!
  end

  private

  def parties
    parser(".agenda-item.beco-sp", "capa-beco-sp.php")
  end
end

class Beco203PartyBot < AugustaPartyBot
  def name
  end
end

Slide 33

Slide 33 text

require 'spec_helper'

describe Beco203Bot do
  let(:beco_bot) { Beco203Bot.new }

  describe "create_parties!" do
    before do
      ...{stub_requests}...
    end

    it "saves a party with a name" do
      beco_bot.create_parties!
      Party.first.name.should == 'Festa 1'
    end
  end
end

Slide 34

Slide 34 text

class Beco203Bot < AugustaBot
  def create_parties!
    parties.each do |item|
      party_bot = Beco203PartyBot.new({params})
      party_bot.name = (item/".texto").text
      Party.create! party_bot.attributes
    end
  end

  private

  def parties
    parser(".agenda-item.beco-sp", "capa-beco-sp.php")
  end
end

class Beco203PartyBot < AugustaPartyBot
  def name
    parser.search(".conteudo-interna h1").text.clean
  end
end
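
The point of stubbing requests is that the scraper is tested against a canned page rather than the live site. The sketch below shows the same idea without RSpec or WebMock, by injecting the fetching step as a callable. It is a stand-in technique, not the setup used in the specs above: `PartyScraper`, `CANNED_PAGE`, and the simplified markup are all hypothetical, and REXML replaces Nokogiri so the example runs with plain Ruby.

```ruby
require 'rexml/document'

# Hypothetical scraper: `fetch` is any callable that maps a path to markup.
class PartyScraper
  def initialize(fetch)
    @fetch = fetch
  end

  # Returns the party names found in the page at `path`.
  def party_names(path)
    doc = REXML::Document.new(@fetch.call(path))
    REXML::XPath.match(doc, "//div[@class='agenda-item']/h1")
                .map { |h1| h1.text.strip }
  end
end

# Canned response used instead of a real HTTP request.
CANNED_PAGE = <<~MARKUP
  <div>
    <div class="agenda-item"><h1> Festa 1 </h1></div>
    <div class="agenda-item"><h1>Festa 2</h1></div>
  </div>
MARKUP

# The fake fetcher records what was requested and returns the canned page.
requested = []
fake_fetch = lambda do |path|
  requested << path
  CANNED_PAGE
end

scraper = PartyScraper.new(fake_fetch)
names = scraper.party_names("capa-beco-sp.php")

raise "unexpected names: #{names.inspect}" unless names == ["Festa 1", "Festa 2"]
raise "unexpected requests: #{requested.inspect}" unless requested == ["capa-beco-sp.php"]
puts "ok"
```

In production the same class would receive a callable that performs the real HTTP request, so the parsing logic under test stays identical.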

Slide 35

Slide 35 text

Redistribute

Slide 36

Slide 36 text

Redistribute
REST APIs
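
As a rough sketch of what redistributing scraped data through a REST API can look like, the snippet below serves a list of records as JSON from a Rack-style callable (an object that takes a request environment hash and returns status, headers, and body). The records, the `LINES_API` name, and the routing are assumptions made for illustration; they do not reproduce any of the APIs listed in this talk.

```ruby
require 'json'

# Made-up rows standing in for whatever a scraper collected.
LINES = [
  { id: 1, name: "Line A", status: "normal" },
  { id: 2, name: "Line B", status: "reduced speed" }
].freeze

# Rack-style endpoint: takes a request env hash, returns [status, headers, body].
LINES_API = lambda do |env|
  if env["REQUEST_METHOD"] == "GET" && env["PATH_INFO"] == "/lines.json"
    [200, { "content-type" => "application/json" }, [JSON.generate(LINES)]]
  else
    [404, { "content-type" => "application/json" }, [JSON.generate({ error: "not found" })]]
  end
end

status, _headers, body = LINES_API.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/lines.json" })
puts status
puts body.first
```

Because the callable follows the Rack calling convention, it could be mounted in any Rack-compatible server without changes to the data code.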

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

github.com/vertigem/api_metrosp api.metrosp.vertigem.xxx/lines.json

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

github.com/vertigem/api_camara

Slide 41

Slide 41 text

github.com/vertigem/api_camara api.camara.vertigem.xxx

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

thedatahub.org

Slide 44

Slide 44 text

Create Value

Slide 45

Slide 45 text

Create Value
Cross-referencing
Visualizations
Etc.
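
Cross-referencing usually means joining two scraped datasets on a shared key and then aggregating the result. A minimal sketch in plain Ruby follows; the venues, streets, and party names are made-up placeholders, not data from any of the scrapers mentioned here.

```ruby
# Made-up records standing in for two independently scraped sources.
parties = [
  { venue: "Venue A", name: "Festa 1" },
  { venue: "Venue B", name: "Festa 2" },
  { venue: "Venue A", name: "Festa 3" }
]
venues = [
  { venue: "Venue A", street: "Rua Augusta" },
  { venue: "Venue B", street: "Another Street" }
]

# Index one dataset by the shared key, then enrich the other with it.
street_by_venue = venues.to_h { |v| [v[:venue], v[:street]] }
crossed = parties.map { |party| party.merge(street: street_by_venue[party[:venue]]) }

# Aggregate the crossed data: party names per street.
by_street = crossed.group_by { |party| party[:street] }
                   .transform_values { |group| group.map { |party| party[:name] } }

by_street.each { |street, names| puts "#{street}: #{names.join(', ')}" }
```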

Slide 46

Slide 46 text

scraperwiki.com/scrapers/augusta

Slide 47

Slide 47 text

augustaapp.com

Slide 48

Slide 48 text

Create Value
Cross-referencing
Visualizations
Etc.

Slide 49

Slide 49 text

datajournalism.stanford.edu

Slide 50

Slide 50 text

Create Value
Cross-referencing
Visualizations
Etc.

Slide 51

Slide 51 text

What was the average price of a house in São Paulo in 1921? Which party is happening tonight on Rua Augusta? The data to answer many, many questions like these is out there on the Internet somewhere, but it is not always easy to find.
(Hacked from thedatahub.org)

Slide 52

Slide 52 text

Thank you!

Slide 53

Slide 53 text

Thank you!
github.com/dukex
github.com/vertigem

Slide 54

Slide 54 text

Questions?

Slide 55

Slide 55 text

Thank you!