Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Introduction of Web scraping for PHP users
Search
kazusuke sasezaki
December 30, 2013
0
100
Introduction of Web scraping for PHP users
slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program
kazusuke sasezaki
December 30, 2013
Tweet
Share
More Decks by kazusuke sasezaki
See All by kazusuke sasezaki
できる!!! Validation !!! - builderscon tokyo 2017
sasezaki
1
170
はじめてのミューテーション解析 / Mutation Testing
sasezaki
2
1.4k
こんなPHP開発者はイヤだ
sasezaki
2
3.7k
Featured
See All Featured
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
113
50k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
Rebuilding a faster, lazier Slack
samanthasiow
79
8.8k
No one is an island. Learnings from fostering a developers community.
thoeni
19
3.1k
Docker and Python
trallard
43
3.2k
RailsConf 2023
tenderlove
29
970
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
4 Signs Your Business is Dying
shpigford
182
22k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
3
240
Automating Front-end Workflow
addyosmani
1366
200k
How to Ace a Technical Interview
jacobian
276
23k
Raft: Consensus for Rubyists
vanstee
137
6.7k
Transcript
Introduction of Web scraping for PHP users
15 years ago
15 years ago PHP 3.0.4 release, includes get_meta_tags().
It works.. <?php get_meta_tags("http://example.com/"); array(1) { 'viewport' => string(35) "width=device-width,
initial-scale=1" }
It works, sometimes! <?php get_meta_tags("http://www.discogs.com/"); PHP Warning: get_meta_tags(http://www.disc ogs.com/): failed
to open stream: HTTP request failed! 500 Client Refused
You are file_get_contents() fanboy
<?php get_meta_tags("data://text/html,". file_get_contents( "http://www.discogs.com/", false, stream_context_create( ["http" => ["header" =>
"User-Agent: Mozilla/4.0"] ] ) ) );
It works, too $php -d user_agent="Mozilla/4.0" \ -r 'get_meta_tags("http://www.discogs.com/");'
You would FEEL there are some problem.
"do separation of concerns! GET HTML & parse HTML."
"do separation of concerns! GET HTML & parse HTML." Doubt!
Doubt! Doubt!
Handling Request & Handling Response
HTTP Request
Sorry, today I don't talk about HTTP Request side. (no
time talk about HttpClient, Spider, crawler in 15 minutes)
HTTP ReSPONSE
HTTP ReSPONSE HeaDERS - BODY - Not only HTML ;-)
SCRAP SCRAPING FROM RESPONSE!
There are some fact you should take a act
There are some fact you should take a act •
ConTENT ENCODING • ChaRSET ENCODING • NORMALIZE HTML • EXTRACTING FROM HTML • SOLVE CONTEXT
CONTENT ENCODING
CONTENT ENCODING TODAY, WE ALREADY ACCEPTED IT.
CONTENT ENCODING • gzip • deflate • compress • identity
I recommend using good Response handlers before struggling.
CONTENT ENCODING • gzip • deflate I recommend using good
Response handlers before struggling. pear/HTTP_Request2 zendframework/zend-http guzzle/guzzle
CHARSET ENCODING
CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html)
CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html) This is not best way.
You had already got hint in Response Headers & html's meta nodes.
CHARSET ENCODING <?php header("Content-Type: text/html; charset=Shift-JIS"); ?> ①②③④⑤ But, Don't
forget, Most of Japanese PHP users do LIE.
CHARSET ENCODING diggin/diggin-http-charset diggin/guzzle-plugin-AutoCharsetEncodingPlugi I hope my component will help
you.
NORMALIZE HTML
NORMALIZE HTML
NORMALIZE HTML before parse as HTML, you can fix it.
NORMALIZE HTML before parse as HTML, you can fix it.
• php-ext/tidy, HTMLParser • other beautifiers • manually :-(
EXTRACTING FROM HTML
EXTRACTING FROM HTML Yes, there are several way in PHP.
• PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Mostly, boredom for entire HTML. • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML DOM is a API FOR HTML &
XML • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Xpath is your friend • PCRE /
String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Remember PHP's Feature, DOMXPath:: registerPhpFunctions • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse
Solve Context
Solve Context You will need solve context from got response
• Filtering extracted result for Domain.
Solve Context Don't reinvent the wheel • Resolve relative URI
/ RFC-3986 - pear/Net_URL2, zendframework/zend-uri supports it
Solve Context Don't reinvent the wheel • “Databases” that helps
you - wedata - OSS's repositories (not only PHP)
LIBRALIES • behat/mink • goutte, symfony/browserKit • zendframework/zend-dom • diggin-scraper
• simple_html_dom • phpQuery • fluentDOM • php-jsonpointer • beberlei/phpricot
Today, web is under control by JavaScript
JAVASCRIPT We need "REAL" BROWSER for AUTOMATION • Selenium •
PhantomJS / CasperJS • SlimerJS
You have a chance to survive with php.
Move forward, PHP. • HTML5 • HTTP 2.0 / SPDY
• Browser binding ardemiranda/WebKitGtk • concurrent programing, asynchronous • collective intelligence • NLP natural language processing
Deeper and Deeper • elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php • Accessing Web Resources
with PHP http://joind.in/3386 • Spidering Hacks http://www.oreilly.co.jp/books/4873111870/ • fuba: exthtml https://fuba.jottit.com/exthtml • kitamomonga http://d.hatena.ne.jp/kitamomonga/ • The Architecture of Open Source Applications selenium - https://github.com/m-takagi/aosa-ja
Thanks