Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PyBR12
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Patty Vader
October 12, 2016
Programming
0
64
PyBR12
Patty Vader
October 12, 2016
Tweet
Share
More Decks by Patty Vader
See All by Patty Vader
Python para Machine Learning
pattyvader
0
32
Search Engines using Python and Elasticsearch
pattyvader
0
190
Pygame
pattyvader
0
83
GitHubWTM
pattyvader
0
47
Other Decks in Programming
See All in Programming
Takumiから考えるSecurity_Maturity_Model.pdf
gessy0129
1
120
AI活用のコスパを最大化する方法
ochtum
0
120
CSC307 Lecture 12
javiergs
PRO
0
450
New in Go 1.26 Implementing go fix in product development
sunecosuri
0
310
AI駆動開発の本音 〜Claude Code並列開発で見えたエンジニアの新しい役割〜
hisuzuya
4
470
TipKitTips
ktcryomm
0
150
モジュラモノリスにおける境界をGoのinternalパッケージで守る
magavel
0
3.4k
AI主導でFastAPIのWebサービスを作るときに 人間が構造化すべき境界線
okajun35
0
510
Go Conference mini in Sendai 2026 : Goに新機能を提案し実装されるまでのフロー徹底解説
yamatoya
0
510
Head of Engineeringが現場で回した生産性向上施策 2025→2026
gessy0129
0
210
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
380
米国のサイバーセキュリティタイムラインと見る Goの暗号パッケージの進化
tomtwinkle
2
410
Featured
See All Featured
Information Architects: The Missing Link in Design Systems
soysaucechin
0
810
Discover your Explorer Soul
emna__ayadi
2
1.1k
HDC tutorial
michielstock
1
490
Un-Boring Meetings
codingconduct
0
220
Design in an AI World
tapps
0
160
SEO Brein meetup: CTRL+C is not how to scale international SEO
lindahogenes
0
2.4k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3.1k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.6k
We Are The Robots
honzajavorek
0
190
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
67
37k
More Than Pixels: Becoming A User Experience Designer
marktimemedia
3
340
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Transcript
Search Engines utilizando Python e Elasticsearch
Apresentação https://spekerdeck.com/pattyvader/pybr12
Projeto Athena https://github.com/pattyvader/athena
Roadmap 1. Busca de documentos 2. Indexação 3. Percorrendo a
web
Busca de documentos
Busca de documentos Resultado armazenado no Elasticsearch Termo de busca
Busca de documentos Servidor de aplicação com Django GET Browser
(HTML)
Servidor de aplicação com Django Busca de documentos GET Browser
(HTML)
Busca de documentos . Servidor de aplicação com Django
Busca de documentos views.py Acessar o método “search” Servidor de
aplicação com Django
Busca de documentos views.py Acessar o método “search” Acessar o
método “search_term” Servidor de aplicação com Django
Indexação https://www.elastic.co/downloads/elasticsearch Relevancy score Protocolo Restful Mensagens Json
Indexação Relevance score
Indexação Restful/Json PUT GET
Indexação https://www.elastic.co/use-cases
Indexação O processo de indexação utiliza a lib Elasticsearch-py para
conectar o Python com o Elasticsearch. indexer.py https://pypi.python.org/pypi/elasticsearch https://elasticsearch-py.readthedocs.io/en/master/ https://github.com/elastic/elasticsearch-py Cria um índice Adiciona uma nova página ao índice
Indexação scraper.py O scraper extrai os dados, do arquivo html,
utilizando a lib BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/
scraper.py Indexação Metatags do html
Percorrendo a web - Web crawler crawler.py
Percorrendo a web - Web crawler 1 3 2 4
5 Acessa arquivo robot.txt Download do html Extraí novos links Extraí os dados Insere dados no elasticsearch indexer.py scraper.py crawler.py
Percorrendo a web - Web crawler Antes de “crawlear” uma
página sempre verifique o arquivo “robot.txt”. É uma boa prática. crawler.py
Percorrendo a web - Web crawler A “urllib2” retorna o
html da página. crawler.py
Percorrendo a web- Web crawler crawler.py A extração de novos
links é realizada somente no domínio da url seed.
Percorrendo a web- Web crawler Método que realiza a extração
dos dados presentes no html. scraper.py crawler.py
Percorrendo a web- Web crawler Método que realiza a indexação
das páginas no Elasticsearch. indexer.py crawler.py
Finalizando... Browser (HTML) GET Servidor de aplicação com Django scraper.py
crawler.py indexer.py GET internet
https://github.com/pattyvader https://br.linkedin.com/in/patricia-regina-18790040 Contato *Designed by Freepik from www.flaticon.com*
[email protected]