PyBR12
Patty Vader
October 12, 2016
Transcript
Search Engines with Python and Elasticsearch
Presentation: https://speakerdeck.com/pattyvader/pybr12
Athena project: https://github.com/pattyvader/athena
Roadmap: 1. Document search 2. Indexing 3. Crawling the web
Document search
Document search: the search term is looked up in the results stored in Elasticsearch.
Document search: the browser (HTML) sends a GET request to the Django application server.
Document search, views.py: on the Django application server, the request goes to the “search” method and then to the “search_term” method.
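The two-step flow in views.py can be sketched roughly as follows. This is not the Athena code: the index name `pages`, the field names, the `q` parameter, and the `build_query` helper are all hypothetical, and `params` stands in for Django's `request.GET`. The client is passed in as `es` and is only assumed to expose a `search(index=..., body=...)` method, as elasticsearch-py clients do (newer releases of that library prefer keyword arguments over `body`).

```python
def build_query(term, fields=("title", "description", "content")):
    """Build a multi_match query body for the search term (fields are assumed)."""
    return {"query": {"multi_match": {"query": term, "fields": list(fields)}}}


def search_term(es, term, index="pages"):
    """Run the query and return the stored documents in the order received."""
    response = es.search(index=index, body=build_query(term))
    return [hit["_source"] for hit in response["hits"]["hits"]]


def search(params, es):
    """Stand-in for the view: read the 'q' parameter and delegate to search_term."""
    term = params.get("q", "").strip()
    if not term:
        return []  # nothing to search for, so skip the round trip
    return search_term(es, term)
```

In a real Django view, the returned list would be handed to a template for rendering instead of being returned directly.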
Indexing: Elasticsearch (https://www.elastic.co/downloads/elasticsearch) offers a relevancy score, a RESTful protocol, and JSON messages.
Indexing: relevance score
Indexing: RESTful/JSON (PUT, GET)
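To make the PUT/GET idea concrete, here is a small sketch that only builds the two HTTP requests with the standard library, without sending them. The node address, index name, and document fields are assumptions. The `/_doc/` path matches recent Elasticsearch versions; the versions current at the time of the talk used `/index/type/id` instead.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed address of a local Elasticsearch node


def put_document(index, doc_id, document):
    """Build a PUT request that stores one JSON document under the given id."""
    body = json.dumps(document).encode("utf-8")
    url = f"{BASE_URL}/{index}/_doc/{quote(str(doc_id), safe='')}"
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def get_search(index, term):
    """Build a GET request that runs a URI search (?q=...) against the index."""
    return Request(f"{BASE_URL}/{index}/_search?q={quote(term)}", method="GET")
```

Passing either request to `urllib.request.urlopen` would send it to the node; the response body is JSON as well.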
Indexing: https://www.elastic.co/use-cases
Indexing, indexer.py: the indexing process uses the elasticsearch-py library to connect Python to Elasticsearch. It creates an index and adds a new page to the index. https://pypi.python.org/pypi/elasticsearch https://elasticsearch-py.readthedocs.io/en/master/ https://github.com/elastic/elasticsearch-py
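The two indexer operations named on the slide (create an index, add a page) might look like the sketch below. It is not the project's indexer.py: the index name `pages` and the choice of the URL as document id are hypothetical. The functions take an already-connected elasticsearch-py client as `es`; `document=` is the keyword used by recent elasticsearch-py releases, while older ones expect `body=`.

```python
def create_index(es, index="pages"):
    """Create the index unless it already exists."""
    if not es.indices.exists(index=index):
        es.indices.create(index=index)


def add_page(es, url, page, index="pages"):
    """Add one page to the index, or overwrite it, using its URL as the id."""
    es.index(index=index, id=url, document=page)
```

With the library installed, the client would typically be created with `Elasticsearch("http://localhost:9200")` and passed to both functions.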
Indexing, scraper.py: the scraper extracts the data from the HTML file using the BeautifulSoup library. https://www.crummy.com/software/BeautifulSoup/
Indexing, scraper.py: HTML meta tags
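As an illustration of pulling meta tags out of a page, the sketch below uses the standard library's `html.parser` in place of BeautifulSoup, so that it runs without extra dependencies. The talk's scraper uses BeautifulSoup, so this is a swapped-in technique, and the choice of fields (title, description, keywords) is an assumption.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Meta names are case-insensitive, so normalize them.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Return the fields that would be sent to the index for one page."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
    }
```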
Crawling the web (web crawler): crawler.py
Crawling the web (web crawler), across crawler.py, scraper.py, and indexer.py:
1. Access the robots.txt file
2. Download the HTML
3. Extract new links
4. Extract the data
5. Insert the data into Elasticsearch
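The five steps can be tied together in a single loop. The following is a sketch under assumptions, not the Athena crawler: every step is passed in as a function (`allowed`, `fetch`, `extract_links`, `scrape`, `index_page` are hypothetical names), and the breadth-first queue and the `max_pages` limit are design choices made here for illustration.

```python
from collections import deque


def crawl(seed, allowed, fetch, extract_links, scrape, index_page, max_pages=50):
    """Breadth-first crawl starting at seed; returns the URLs that were indexed."""
    queue = deque([seed])
    seen = {seed}
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        if not allowed(url):                    # step 1: respect robots.txt
            continue
        html = fetch(url)                       # step 2: download the HTML
        for link in extract_links(url, html):   # step 3: extract new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url, scrape(html))           # steps 4 and 5: extract data, index it
        indexed.append(url)
    return indexed
```

Keeping the steps as injected functions lets the loop be exercised with in-memory fakes before pointing it at real pages.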
Crawling the web (web crawler), crawler.py: before crawling a page, always check the “robots.txt” file. It is good practice.
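A minimal way to honor robots.txt in Python 3 is the standard library's `urllib.robotparser`. The sketch below assumes a hypothetical user agent name, `athena-sketch`; `parse_robots` exists only so the rules can be tried without network access.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "athena-sketch"  # hypothetical crawler name


def load_robots(seed_url):
    """Download and parse the robots.txt of the seed URL's site (needs network)."""
    parser = RobotFileParser(urljoin(seed_url, "/robots.txt"))
    parser.read()
    return parser


def parse_robots(text):
    """Parse robots.txt rules that are already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

The crawler would call `load_robots` once per site and `is_allowed` before every download.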
Crawling the web (web crawler), crawler.py: “urllib2” returns the page's HTML.
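`urllib2` is a Python 2 module; in Python 3 its functionality lives in `urllib.request`. A download helper along those lines could look like this sketch, where the timeout, the user agent string, and the UTF-8 fallback are assumptions:

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10, user_agent="athena-sketch"):
    """Download a page and return its body decoded to text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Network errors surface as `urllib.error.URLError`, which a crawler would normally catch so that one bad page does not stop the run.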
Crawling the web (web crawler), crawler.py: new links are extracted only within the domain of the seed URL.
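Restricting link extraction to the seed's domain can be sketched with the standard library alone. The host comparison below is an exact match on the network location, which is one possible reading of "same domain" (subdomains would be excluded); the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(page_url, html, seed_url):
    """Return absolute, de-duplicated links that stay on the seed URL's host."""
    seed_host = urlparse(seed_url).netloc.lower()
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        same_site = parts.scheme in ("http", "https") and parts.netloc.lower() == seed_host
        if same_site and absolute not in links:
            links.append(absolute)
    return links
```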
Crawling the web (web crawler), scraper.py and crawler.py: the method that extracts the data present in the HTML.
Crawling the web (web crawler), indexer.py and crawler.py: the method that indexes the pages in Elasticsearch.
Wrapping up: the browser (HTML) sends a GET to the Django application server, while crawler.py, together with scraper.py and indexer.py, fetches pages from the internet via GET.
Contact: https://github.com/pattyvader https://br.linkedin.com/in/patricia-regina-18790040 [email protected]
*Designed by Freepik from www.flaticon.com*