Dale Ma
December 08, 2012
RubyConf Taiwan 2012: Build Your Own Web Scraper
Build your own web scraper
Transcript
Build Your Own Web Scraper - Dale Ma @eguitarz, Saturday, December 8, 2012
It’s fun to build something small and easy.
I have always wanted to build a robot to serve me.
Since building a robot is too difficult, I chose to build a web bot instead.
Today I’m talking about how I built my own web scraper in Ruby.
Web scrapers have many uses. For example...
Uptime surveys, image collecting, automated web snapshots, and more...
Usually, many scrapers (threads) are fired at the same time.
So, first things first: I have to control the threads.
I decided to write #threadpool to do this.
You can find it at https://github.com/eguitarz/threadpool
Threadpool manages the lifetime of each thread.
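To make the idea of a pool controlling its threads concrete, here is a minimal sketch built only on Ruby's standard `Queue` and `Thread`. It is not the API of the threadpool gem linked in the talk; the class name `SimplePool` and its methods `schedule` and `shutdown` are hypothetical names chosen for illustration.

```ruby
# A minimal fixed-size thread pool (hypothetical SimplePool, NOT the
# threadpool gem's API). Workers live until they receive a :stop sentinel.
class SimplePool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop blocks until a job (or the sentinel) is available.
        while (job = @jobs.pop) != :stop
          job.call
        end
      end
    end
  end

  # Enqueue a block to be run by the next free worker.
  def schedule(&block)
    @jobs << block
  end

  # Ask every worker to stop after the queued jobs, then wait for them.
  def shutdown
    @workers.size.times { @jobs << :stop }
    @workers.each(&:join)
  end
end

results = Queue.new
pool = SimplePool.new(3)
5.times { |i| pool.schedule { results << i * i } }
pool.shutdown

squares = Array.new(results.size) { results.pop }.sort
p squares
```

In this sketch a job that raises would end its worker thread, so a real scraper would likely rescue errors inside the worker loop.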
Now, let’s go for the main dish.
A web scraper should be able to `grab page` and `parse html tags`.
#Nokogiri is good at those things.
I use a “Hash” to save parsed links.
There’s a problem: the links are stored in the hash by multiple threads, but Hash in Ruby is not thread-safe...
#hamster helps me with this.
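The talk solves this with the hamster gem. As an illustration of the underlying problem only, the sketch below swaps in a different technique: a plain Hash guarded by a standard-library `Mutex`, so that the check-then-insert step cannot interleave between threads. The class name `LinkStore` and its `add?` method are hypothetical and are not part of hamster or of the author's scraper.

```ruby
# Hypothetical LinkStore: a plain Hash protected by a Mutex.
# This is a stand-in technique, not how the hamster gem works.
class LinkStore
  def initialize
    @links = {}
    @lock = Mutex.new
  end

  # Records url => depth. Returns true only for the first caller to add a url.
  def add?(url, depth)
    @lock.synchronize do
      if @links.key?(url)
        false
      else
        @links[url] = depth
        true
      end
    end
  end

  def size
    @lock.synchronize { @links.size }
  end
end

store = LinkStore.new
first_adds = Queue.new

# Eight threads all try to record the same 100 URLs.
threads = Array.new(8) do
  Thread.new do
    100.times do |i|
      first_adds << i if store.add?("https://example.com/page/#{i}", 0)
    end
  end
end
threads.each(&:join)

puts "stored #{store.size} links, #{first_adds.size} first-time adds"
```

Because each URL is accepted exactly once, only one scraper thread would go on to fetch a given page.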
I use the `Depth-Limited Search` algorithm for my scraper.
(Diagram: a tree whose nodes are labeled 3, 2, 1, 1.)
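A depth-limited search follows links only until a depth budget runs out. The sketch below is one possible shape of it, not the author's implementation: `crawl` and `links_of` are hypothetical names, and a small in-memory link table stands in for real pages so the example runs without network access.

```ruby
require "set"

# Depth-limited, depth-first crawl. `links_of` is a hypothetical callable
# returning the outgoing links of a URL; `depth` is the remaining budget.
def crawl(url, depth, links_of, visited = Set.new)
  return visited if depth < 0 || visited.include?(url)

  visited << url
  links_of.call(url).each do |link|
    crawl(link, depth - 1, links_of, visited)
  end
  visited
end

# A made-up site: "/" links to "/a" and "/b", "/a" links to "/c", and so on.
site = {
  "/"  => ["/a", "/b"],
  "/a" => ["/c"],
  "/b" => [],
  "/c" => ["/d"],
  "/d" => []
}
links_of = ->(url) { site.fetch(url, []) }

pages = crawl("/", 2, links_of).to_a.sort
p pages
```

With a budget of 2 the crawl reaches "/c" but stops before "/d". Marking a page as visited on first encounter keeps the sketch short, but it can cut off a page's children when that page is first reached with little depth left; storing the best remaining depth per URL would avoid that.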
What if the page needs JavaScript to render?
There’s an easy way... use a browser to render the HTML with JavaScript.
How?
#Watir or #Selenium
Gonna show my little toy...
My scraper is on GitHub at https://github.com/eguitarz/macaron
The demo is simple; `you` can enhance it or create a new one.
Wikipedia scraper, Facebook scraper... could be interesting!
THANKS!