Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PyBR12
Search
Patty Vader
October 12, 2016
Programming
0
63
PyBR12
Patty Vader
October 12, 2016
Tweet
Share
More Decks by Patty Vader
See All by Patty Vader
Python para Machine Learning
pattyvader
0
31
Search Engines using Python and Elasticsearch
pattyvader
0
190
Pygame
pattyvader
0
83
GitHubWTM
pattyvader
0
47
Other Decks in Programming
See All in Programming
Rubyで鍛える仕組み化プロヂュース力
muryoimpl
0
310
Navigating Dependency Injection with Metro
l2hyunwoo
1
200
Deno Tunnel を使ってみた話
kamekyame
0
310
ZJIT: The Ruby 4 JIT Compiler / Ruby Release 30th Anniversary Party
k0kubun
1
310
これならできる!個人開発のすゝめ
tinykitten
PRO
0
140
大規模Cloud Native環境におけるFalcoの運用
owlinux1000
0
240
AI Agent Dojo #4: watsonx Orchestrate ADK体験
oniak3ibm
PRO
0
120
まだ間に合う!Claude Code元年をふりかえる
nogu66
5
920
Giselleで作るAI QAアシスタント 〜 Pull Requestレビューに継続的QAを
codenote
0
330
AI前提で考えるiOSアプリのモダナイズ設計
yuukiw00w
0
210
[AtCoder Conference 2025] LLMを使った業務AHCの上⼿な解き⽅
terryu16
6
1k
AIによるイベントストーミング図からのコード生成 / AI-powered code generation from Event Storming diagrams
nrslib
1
540
Featured
See All Featured
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
250
Build your cross-platform service in a week with App Engine
jlugia
234
18k
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
54
48k
Marketing Yourself as an Engineer | Alaka | Gurzu
gurzu
0
110
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
34
Evolving SEO for Evolving Search Engines
ryanjones
0
91
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
180
Producing Creativity
orderedlist
PRO
348
40k
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
61
51k
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
120
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
Unsuck your backbone
ammeep
671
58k
Transcript
Search Engines utilizando Python e Elasticsearch
Apresentação https://spekerdeck.com/pattyvader/pybr12
Projeto Athena https://github.com/pattyvader/athena
Roadmap 1. Busca de documentos 2. Indexação 3. Percorrendo a
web
Busca de documentos
Busca de documentos Resultado armazenado no Elasticsearch Termo de busca
Busca de documentos Servidor de aplicação com Django GET Browser
(HTML)
Servidor de aplicação com Django Busca de documentos GET Browser
(HTML)
Busca de documentos . Servidor de aplicação com Django
Busca de documentos views.py Acessar o método “search” Servidor de
aplicação com Django
Busca de documentos views.py Acessar o método “search” Acessar o
método “search_term” Servidor de aplicação com Django
Indexação https://www.elastic.co/downloads/elasticsearch Relevancy score Protocolo Restful Mensagens Json
Indexação Relevance score
Indexação Restful/Json PUT GET
Indexação https://www.elastic.co/use-cases
Indexação O processo de indexação utiliza a lib Elasticsearch-py para
conectar o Python com o Elasticsearch. indexer.py https://pypi.python.org/pypi/elasticsearch https://elasticsearch-py.readthedocs.io/en/master/ https://github.com/elastic/elasticsearch-py Cria um índice Adiciona uma nova página ao índice
Indexação scraper.py O scraper extrai os dados, do arquivo html,
utilizando a lib BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/
scraper.py Indexação Metatags do html
Percorrendo a web - Web crawler crawler.py
Percorrendo a web - Web crawler 1 3 2 4
5 Acessa arquivo robot.txt Download do html Extraí novos links Extraí os dados Insere dados no elasticsearch indexer.py scraper.py crawler.py
Percorrendo a web - Web crawler Antes de “crawlear” uma
página sempre verifique o arquivo “robot.txt”. É uma boa prática. crawler.py
Percorrendo a web - Web crawler A “urllib2” retorna o
html da página. crawler.py
Percorrendo a web- Web crawler crawler.py A extração de novos
links é realizada somente no domínio da url seed.
Percorrendo a web- Web crawler Método que realiza a extração
dos dados presentes no html. scraper.py crawler.py
Percorrendo a web- Web crawler Método que realiza a indexação
das páginas no Elasticsearch. indexer.py crawler.py
Finalizando... Browser (HTML) GET Servidor de aplicação com Django scraper.py
crawler.py indexer.py GET internet
https://github.com/pattyvader https://br.linkedin.com/in/patricia-regina-18790040 Contato *Designed by Freepik from www.flaticon.com*
[email protected]