Yet Another Crawler
medcl
September 24, 2017
Technology
A lightning talk given in the "放码过来" code-sharing session at OSC Chongqing.
Transcript
Yet Another Crawler
Zeng Yong (Medcl)
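2 What is a crawler?
• Also called a Robot, Bot, or Crawler
• Simply put, it:
‒ Explores websites automatically
‒ Visits the whole site for you
‒ Collects site information for you automatically
‒ Updates automatically
‒ Extracts and processes page content
‒ Stores indexes and snapshots
‒ ...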
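The "explore the site automatically" and "visit the whole site" bullets can be sketched as a breadth-first walk over same-host links. This is a minimal illustration, not Gopa's code: `crawl`, `extractLinks`, `fetchFunc`, and the in-memory `demoSite` are hypothetical names, the fetcher is injected so the sketch runs without network access, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern is a naive link matcher; a real crawler would parse the HTML.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// fetchFunc abstracts page retrieval so the sketch needs no network.
type fetchFunc func(pageURL string) (string, error)

// extractLinks resolves every href in body against base and returns
// absolute URLs with fragments removed.
func extractLinks(base *url.URL, body string) []string {
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		abs := base.ResolveReference(ref)
		abs.Fragment = ""
		links = append(links, abs.String())
	}
	return links
}

// crawl visits pages breadth-first from start, stays on the start host,
// and returns the URLs in the order they were fetched.
func crawl(start string, fetch fetchFunc, maxPages int) []string {
	startURL, err := url.Parse(start)
	if err != nil {
		return nil
	}
	seen := map[string]bool{start: true}
	queue := []string{start}
	var visited []string
	for len(queue) > 0 && len(visited) < maxPages {
		current := queue[0]
		queue = queue[1:]
		body, err := fetch(current)
		if err != nil {
			continue // skip pages that cannot be fetched
		}
		visited = append(visited, current)
		base, err := url.Parse(current)
		if err != nil {
			continue
		}
		for _, link := range extractLinks(base, body) {
			u, err := url.Parse(link)
			if err != nil || u.Host != startURL.Host || seen[link] {
				continue
			}
			seen[link] = true
			queue = append(queue, link)
		}
	}
	return visited
}

// demoSite is a hypothetical three-page site used instead of real HTTP.
var demoSite = map[string]string{
	"http://example.test/":  `<a href="/a">A</a> <a href="/b">B</a> <a href="http://other.test/">ext</a>`,
	"http://example.test/a": `<a href="/b">B</a> <a href="/">home</a>`,
	"http://example.test/b": `<a href="/a#top">A</a>`,
}

func demoFetch(pageURL string) (string, error) {
	body, ok := demoSite[pageURL]
	if !ok {
		return "", fmt.Errorf("not found: %s", pageURL)
	}
	return body, nil
}

func main() {
	for _, u := range crawl("http://example.test/", demoFetch, 10) {
		fmt.Println(u)
	}
}
```

Swapping `demoFetch` for a function built on `net/http` would turn the sketch into a real fetcher; the `seen` set is what keeps the walk from revisiting pages.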
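3 Why reinvent this wheel?
There are already plenty of open-source crawlers: Scrapy, Nutch, Heritrix, and so on. [1,2,3]
But!
• Most of them are only a crawler framework! (ES and Lucene)
• You have to learn all sorts of things unrelated to the crawling task itself!
• The development and deployment environments are complex and painful!
• Too heavy!
• Distribution, management, monitoring, and scaling are complicated!
• The result is a pile of messy, unmaintainable scripts, or a hodgepodge of technologies patched together.
1. http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
2. https://github.com/BruceDone/awesome-crawler
3. https://gitee.com/explore/starred/spider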
4 So
• Gopa (狗爬, gǒu pá): Golang + pá chóng (爬虫, "crawler")
https://gitee.com/medcl/gopa
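5 Goals
• Lightweight, memory footprint < 100MB
• Easy to deploy, no runtime or environment dependencies
• Easy to use, no programming or scripting skills required
• Works out of the box
• Provides a RESTful API and a UI
• Simple, scalable, extensible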
DEMO
7 GOPA overview
[Architecture diagram. Caption: Dynamic pipeline based on configuration. Labels: Pending Check, Pending Fetch, Pending Index; Checker, Crawler, Pipeline; Framework, Database, Storage, Filter, Index, Persistence Layer, API, UI, Dispatcher, Communication, Message Queue, Network, Internet.]
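One way to read "dynamic pipeline based on configuration" is a list of stage names that is resolved against a registry at startup. The sketch below illustrates that idea only; `Task`, `Stage`, `registry`, `buildPipeline`, and `runPipeline` are hypothetical names, the three stages are stubs that loosely mirror the check, fetch, and index steps on the slide, and none of it is taken from Gopa's source.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Task is a hypothetical unit of work passed through the pipeline.
type Task struct {
	URL   string
	Body  string
	Trail []string // names of the stages that have processed this task
}

// Stage is one processing step; returning an error stops the pipeline.
type Stage func(t *Task) error

// registry maps configuration names to stage implementations.
// The names and behaviors are illustrative stubs, not Gopa's real stages.
var registry = map[string]Stage{
	"check": func(t *Task) error {
		if !strings.HasPrefix(t.URL, "http://") && !strings.HasPrefix(t.URL, "https://") {
			return fmt.Errorf("unsupported URL %q", t.URL)
		}
		t.Trail = append(t.Trail, "check")
		return nil
	},
	"fetch": func(t *Task) error {
		t.Body = "<html>stub body for " + t.URL + "</html>" // no real network call
		t.Trail = append(t.Trail, "fetch")
		return nil
	},
	"index": func(t *Task) error {
		if t.Body == "" {
			return errors.New("nothing to index")
		}
		t.Trail = append(t.Trail, "index")
		return nil
	},
}

// buildPipeline turns a configured list of stage names into runnable stages.
func buildPipeline(names []string) ([]Stage, error) {
	stages := make([]Stage, 0, len(names))
	for _, name := range names {
		stage, ok := registry[name]
		if !ok {
			return nil, fmt.Errorf("unknown stage %q", name)
		}
		stages = append(stages, stage)
	}
	return stages, nil
}

// runPipeline applies the stages in order and stops at the first error.
func runPipeline(stages []Stage, t *Task) error {
	for _, stage := range stages {
		if err := stage(t); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// In practice this list would come from a configuration file.
	config := []string{"check", "fetch", "index"}
	stages, err := buildPipeline(config)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	task := &Task{URL: "http://example.test/"}
	if err := runPipeline(stages, task); err != nil {
		fmt.Println("pipeline error:", err)
		return
	}
	fmt.Println(strings.Join(task.Trail, " -> "))
}
```

Because the pipeline is just an ordered list of names, adding, removing, or reordering steps becomes a configuration change rather than a code change.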
Thank you!
https://gitee.com/medcl/gopa
9 Elastic Integration Overview
[Integration diagram. Labels: DISTRIBUTED CRAWLING, Elasticsearch, Clustering, not ready yet, Transform, Store, ingest node, data node, Logstash, 3rd Applications, Optional processing, Web Content, Kibana, Raft.]
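The diagram appears to show crawled web content being stored in Elasticsearch. As a rough illustration of what that hand-off could look like, the sketch below builds a request body in the newline-delimited format accepted by Elasticsearch's `_bulk` endpoint: one action line followed by one document line per page. The `Page` fields, the `bulkBody` helper, and the `gopa-pages` index name are assumptions made for this example, not Gopa's actual schema or code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Page is a hypothetical crawled document; the fields are illustrative.
type Page struct {
	URL   string `json:"url"`
	Title string `json:"title"`
}

// bulkMeta and bulkAction model the action line of a bulk request.
type bulkMeta struct {
	Index string `json:"_index"`
	ID    string `json:"_id"`
}

type bulkAction struct {
	Index bulkMeta `json:"index"`
}

// bulkBody renders pages as newline-delimited JSON: an action line, then
// the document itself, each terminated by a newline.
func bulkBody(index string, pages []Page) (string, error) {
	var buf bytes.Buffer
	for _, p := range pages {
		// Using the URL as the document ID means re-crawling a page
		// overwrites its previous entry instead of duplicating it.
		action, err := json.Marshal(bulkAction{Index: bulkMeta{Index: index, ID: p.URL}})
		if err != nil {
			return "", err
		}
		doc, err := json.Marshal(p)
		if err != nil {
			return "", err
		}
		buf.Write(action)
		buf.WriteByte('\n')
		buf.Write(doc)
		buf.WriteByte('\n')
	}
	return buf.String(), nil
}

func main() {
	body, err := bulkBody("gopa-pages", []Page{
		{URL: "http://example.test/", Title: "Home"},
		{URL: "http://example.test/a", Title: "Page A"},
	})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Print(body)
}
```

The resulting string would be sent with an HTTP POST to the cluster's `_bulk` endpoint using the `application/x-ndjson` content type; the sketch stops short of the network call so that it runs without a cluster.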