
Maintaining 200+ spiders and still having time to sleep

Jusbrasil handles thousands of public documents every day. The volume of extracted documents is the work of countless spiders that constantly crawl the Surface and Deep Web in search of new artifacts. In this talk we discuss how the News project was structured to handle this large data volume with roughly 200 spiders in production.

Victor Martinez

December 02, 2016

Transcript

  1. [Overview diagram] NewsPipeline: spiders running on Scrapinghub produce items (item 1 ... item n) that flow into the Jusbrasil Network, surrounded by tests, builds, monitoring, and metrics.
  2. An open source and collaborative framework for extracting data from websites in a fast, simple, and extensible way. scrapy.org
  3. $ scrapy startproject <project_name>
     $ scrapy startproject semcomp

     semcomp/
     ├── scrapy.cfg
     └── semcomp/
         ├── __init__.py
         ├── items.py
         ├── pipelines.py
         ├── settings.py
         └── spiders/
             └── __init__.py
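To give a feel for what goes into the generated spiders/ directory, a minimal spider might look like the sketch below; the spider name, start URL, and CSS selectors are illustrative assumptions, not taken from the deck.

     # semcomp/spiders/news.py -- illustrative sketch; name, URL and selectors are assumptions
     import scrapy


     class NewsSpider(scrapy.Spider):
         name = 'news'                                 # hypothetical spider name
         start_urls = ['https://example.com/news']     # placeholder URL

         def parse(self, response):
             # Yield one item (a plain dict) per article block on the page
             for article in response.css('article'):
                 yield {
                     'title': article.css('h2::text').extract_first(),
                     'url': article.css('a::attr(href)').extract_first(),
                 }

Such a spider would then be run with $ scrapy crawl news from the project root.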
  4. Build and Push flow:
     (1) $ git push origin master
     (2) $ git pull
     (3) unit and regression tests: $ docker-compose -f docker-compose-test.yml rm
     (4) $ docker-compose -f docker-compose-deploy up
  5. Scrapinghub Command Line Client
     $ shub login
     $ shub deploy

     scrapinghub.yml:
     projects:
       default: 12345
       prod: 33333
     apikeys:
       default: 0bbf4f0f691e0d9378ae00ca7bcf7f0c

     https://doc.scrapinghub.com/shub.html
  6. >>> from scrapinghub import Connection
     >>> conn = Connection('1q2w3e4r54t56ydy87u89u8')
     >>> conn
     Connection('1q2w3e4r54t56ydy87u89u8')
     >>> conn.project_ids()
     [123, 456]
  7. >>> from scrapinghub import Connection
     >>> conn = Connection('1q2w3e4r54t56ydy87u89u8')
     >>> conn
     Connection('1q2w3e4r54t56ydy87u89u8')
     >>> conn.project_ids()
     [123, 456]
     >>> project = conn[123]
     >>> job = project.job(u'123/1/2')
  8. >>> from scrapinghub import Connection
     >>> conn = Connection('1q2w3e4r54t56ydy87u89u8')
     >>> conn.project_ids()
     [123, 456]
     >>> project = conn[123]
     >>> job = project.job(u'123/1/2')
     >>> for item in job.items():
     ...     # do something with item (it's just a dict)
     >>> for logitem in job.log():
     ...     # logitem is a dict with logLevel, message, time
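Outside the REPL, the same Connection calls shown above can be strung together into a small script; the API key, project id, and job id below are placeholders, not values from the deck.

     # fetch_job.py -- sketch based on the Connection calls shown above;
     # the API key, project id and job id are placeholders
     from scrapinghub import Connection

     conn = Connection('YOUR_API_KEY')
     project = conn[123]              # placeholder project id
     job = project.job(u'123/1/2')    # placeholder job id

     # Every scraped item is returned as a plain dict
     for item in job.items():
         print(item)

     # Log entries are dicts with logLevel, message and time
     for logitem in job.log():
         print(logitem['message'])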
  9. Cloud-based crawling, free ($0, no credit card required): unlimited team members, unlimited projects, unlimited requests, 24 hour max job run time, 1 concurrent crawl, 7 day data retention.
  10. Celery + RabbitMQ: the client sends a message (task) to the message broker, workers pick up tasks, the result is sent to the backend, and the client fetches the result from there.
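A minimal Celery setup matching that flow might look like the sketch below; the broker URL, result backend, and the task itself are assumptions for illustration.

     # tasks.py -- minimal Celery sketch; broker/backend URLs and the task are illustrative
     from celery import Celery

     app = Celery(
         'tasks',
         broker='amqp://guest@localhost//',   # RabbitMQ as the message broker
         backend='rpc://',                    # backend the caller reads results from
     )


     @app.task
     def extract(url):
         # Placeholder for real work, e.g. scheduling a spider run for this URL
         return {'url': url, 'status': 'queued'}

A worker started with celery -A tasks worker picks the task off the queue, and the caller retrieves the result with extract.delay('https://example.com').get().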