RubyConf Taiwan 2012: Build Your Own Web Scraper

Build Your Own Web Scraper - Dale Ma @eguitarz, Saturday, December 8, 2012

It’s fun to do something small and easy.

I have always wanted to build a robot to serve me.

Since making a robot is too difficult, I chose to make a web bot instead.

Today I’m talking about how I built my own web scraper in Ruby.

Web scrapers have many uses. For example...

Uptime surveys, image collecting, automated web snapshots, and more...

Usually, many scrapers (threads) are fired at the same time.

So, first things first: I have to control the threads.

I decided to write #threadpool to do this.

You can find it at https://github.com/eguitarz/threadpool

Threadpool decides the life of each thread.
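To make the idea of a pool controlling thread lifetimes concrete, here is a minimal sketch using only Ruby's built-in `Thread` and `Queue`. It is not the API of the threadpool gem linked above; the class name `SimplePool` and its `schedule` and `shutdown` methods are hypothetical names chosen for illustration.

```ruby
# Hypothetical minimal pool; NOT the API of the author's threadpool gem.
class SimplePool
  def initialize(size)
    @jobs = Queue.new
    # Start a fixed number of workers that live until they are told to stop.
    @workers = Array.new(size) do
      Thread.new do
        # Each worker pulls jobs until it receives the :stop sentinel.
        while (job = @jobs.pop) != :stop
          job.call
        end
      end
    end
  end

  # Queue a block to be run by the next free worker.
  def schedule(&block)
    @jobs << block
  end

  # Send one :stop per worker behind the queued jobs, then wait for all workers.
  def shutdown
    @workers.size.times { @jobs << :stop }
    @workers.each(&:join)
  end
end

# Usage: five small jobs shared by three worker threads.
results = Queue.new
pool = SimplePool.new(3)
%w[a b c d e].each { |name| pool.schedule { results << name.upcase } }
pool.shutdown

collected = []
collected << results.pop until results.empty?
```

The pool caps how many threads run at once, however many jobs are scheduled. In this sketch a job that raises would end its worker thread, so a real scraper would probably want to rescue errors inside the worker loop.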

Now, let’s go for the main dish.

Web scrapers should be able to `grab page` and `parse html tags`.

#Nokogiri is good at those things.

I use “Hash” to save parsed links.

There’s a problem: links are stored in the hash by many threads, but Hash in Ruby is not thread-safe...

#hamster helps me with this.
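The shared-hash problem can also be illustrated without any gem. The sketch below does not use Hamster; it swaps in a plain `Mutex` from Ruby's core library to guard an ordinary Hash, which is a different technique aimed at the same problem. The class `SafeLinkStore`, its methods, and the example URLs are hypothetical.

```ruby
# Hypothetical wrapper (not Hamster): every access to the Hash goes through one Mutex.
class SafeLinkStore
  def initialize
    @links = {}
    @lock = Mutex.new
  end

  # Record a link once; returns true only for the caller that added it first.
  def add?(url)
    @lock.synchronize do
      next false if @links.key?(url)
      @links[url] = true
      true
    end
  end

  def size
    @lock.synchronize { @links.size }
  end
end

store = SafeLinkStore.new
first_adds = Queue.new

# Four threads all try to record the same 100 links.
threads = 4.times.map do
  Thread.new do
    100.times do |i|
      first_adds << i if store.add?("http://example.com/page/#{i}")
    end
  end
end
threads.each(&:join)
```

Because the check and the write happen inside the same `synchronize` block, each link is claimed by exactly one thread, so no page would be scheduled for scraping twice.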

I use the `Depth-Limited Search` algorithm for my scraper.
[Diagram: tree of pages with nodes labeled 3, 2, 1, 1]
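A depth-limited search can be sketched in a few lines of plain Ruby. This is an illustration of the general algorithm, not the code in the author's scraper; the in-memory `LINKS` graph and the method name `depth_limited_crawl` are hypothetical stand-ins for real pages and real fetching.

```ruby
# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
  'home'    => %w[about blog],
  'about'   => %w[team],
  'blog'    => %w[post1 post2],
  'team'    => %w[home],
  'post1'   => %w[archive],
  'post2'   => [],
  'archive' => []
}.freeze

# Visit `page`, then follow its links with one less unit of depth.
# `seen` maps each visited page to the largest depth budget it was expanded with,
# so a page is re-expanded only when reached again with more depth remaining.
def depth_limited_crawl(page, depth, seen = {})
  return seen if depth <= 0
  return seen if seen.fetch(page, 0) >= depth

  seen[page] = depth
  LINKS.fetch(page, []).each do |link|
    depth_limited_crawl(link, depth - 1, seen)
  end
  seen
end

crawled = depth_limited_crawl('home', 3)
puts crawled.keys.inspect
```

With a limit of 3, the crawl reaches `home`, `about`, `team`, `blog`, `post1`, and `post2`, and stops before `archive`. The depth limit is also what keeps the `team` to `home` cycle from looping forever.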

What if the page needs JavaScript to render?

There’s an easy way... use a browser to render the HTML with JavaScript.

How?

#Watir or #Selenium

Gonna show my little toy...

My scraper is on GitHub at https://github.com/eguitarz/macaron

The demo is simple; `you` can enhance it or create a new one.

Wikipedia scraper, Facebook scraper... could be interesting!

THANKS!
