Introduction of Web scraping for PHP users
kazusuke sasezaki
December 30, 2013
slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program
Transcript
Introduction of Web scraping for PHP users
15 years ago, PHP 3.0.4 was released. It included get_meta_tags().
It works:
<?php get_meta_tags("http://example.com/");
array(1) { 'viewport' => string(35) "width=device-width, initial-scale=1" }
It works, sometimes!
<?php get_meta_tags("http://www.discogs.com/");
PHP Warning: get_meta_tags(http://www.discogs.com/): failed to open stream: HTTP request failed! 500 Client Refused
You are a file_get_contents() fanboy
<?php
get_meta_tags(
    "data://text/html," . file_get_contents(
        "http://www.discogs.com/",
        false,
        stream_context_create(["http" => ["header" => "User-Agent: Mozilla/4.0"]])
    )
);
It works, too:
$ php -d user_agent="Mozilla/4.0" \
    -r 'get_meta_tags("http://www.discogs.com/");'
You can FEEL that something is wrong here.
"Separate your concerns! GET the HTML & parse the HTML."
Doubt!
Handling Request & Handling Response
HTTP Request
Sorry, today I won't talk about the HTTP request side. (There is no time to cover HTTP clients, spiders, and crawlers in 15 minutes.)
HTTP RESPONSE
• Headers
• Body: not only HTML ;-)
SCRAPING FROM THE RESPONSE!
There are some facts you should act on:
• Content encoding
• Charset encoding
• Normalize HTML
• Extracting from HTML
• Solve context
CONTENT ENCODING
Today, we already take it for granted.
• gzip
• deflate
• compress
• identity
I recommend using a good response handler before struggling with it yourself:
• pear/HTTP_Request2
• zendframework/zend-http
• guzzle/guzzle
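
To make concrete what a response handler does for content encoding, here is a minimal sketch that decodes a body according to its Content-Encoding header. The function name decodeBody() is hypothetical and not taken from any of the libraries above; the sketch assumes the zlib extension is loaded and does not handle "compress".

```php
<?php
// Hypothetical helper: decode a response body based on its Content-Encoding value.
// Assumes ext-zlib is available.
function decodeBody(string $body, string $contentEncoding): string
{
    switch (strtolower(trim($contentEncoding))) {
        case '':
        case 'identity':
            // Nothing to decode.
            return $body;
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // "deflate" should be zlib-wrapped data, but some servers send
            // a raw deflate stream, so try both.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            // "compress" (LZW) and anything else is not handled in this sketch.
            throw new InvalidArgumentException("Unsupported encoding: $contentEncoding");
    }
    if ($decoded === false) {
        throw new RuntimeException("Body is not valid $contentEncoding data");
    }
    return $decoded;
}
```

For example, decodeBody(gzencode('hello'), 'gzip') gives back 'hello'. In real code, a client library that negotiates Accept-Encoding and decodes the body for you is usually the easier path.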
CHARSET ENCODING
mb_convert_encoding($html, "UTF-8", "auto")
This is not the best way.
You already have hints in the response headers & the HTML's meta elements.
<?php header("Content-Type: text/html; charset=Shift-JIS"); ?> ①②③④⑤
But don't forget: most Japanese PHP users LIE.
• diggin/diggin-http-charset
• diggin/guzzle-plugin-AutoCharsetEncodingPlugi
I hope my components will help you.
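
As a rough illustration of using those hints, the sketch below looks for a declared charset in the Content-Type header first, then in the HTML's meta elements, and only falls back to detection if neither gives a name mbstring knows. All function names are hypothetical, the fallback candidate list is an assumption, and a declared charset can still be wrong, which this sketch does not try to catch.

```php
<?php
// Hypothetical helper: find a declared charset, header first, then <meta>.
function detectDeclaredCharset(string $contentType, string $html): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // Matches <meta charset="..."> and <meta http-equiv ... content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: is this an encoding name mbstring understands?
function isKnownEncoding(string $name): bool
{
    try {
        // Warns and returns false on PHP 7, throws ValueError on PHP 8.
        return @mb_encoding_aliases($name) !== false;
    } catch (\Throwable $e) {
        return false;
    }
}

function toUtf8(string $html, string $contentType = ''): string
{
    $charset = detectDeclaredCharset($contentType, $html);
    if ($charset === null || !isKnownEncoding($charset)) {
        // Assumed candidate list; adjust it to the sites you scrape.
        $detected = mb_detect_encoding($html, ['UTF-8', 'EUC-JP', 'SJIS', 'ISO-8859-1'], true);
        $charset = $detected !== false ? $detected : 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Calling toUtf8($body, $contentTypeHeader) with the raw Content-Type header value is the intended use; the header wins over the meta element when both are present.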
NORMALIZE HTML
Before parsing it as HTML, you can fix it.
• php-ext/tidy, HTMLParser
• other beautifiers
• manually :-(
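
The tidy extension is not installed everywhere, so the following sketch swaps in a different technique: it lets DOMDocument's lenient HTML parser repair the markup and serializes the result again. The function name normalizeHtml() is hypothetical, and the example sticks to ASCII input to stay clear of charset handling.

```php
<?php
// Hypothetical helper: repair broken markup by round-tripping it through
// DOMDocument (a stand-in for tidy, assuming ext-dom is available).
function normalizeHtml(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The serialized document has its unclosed elements closed.
    return $doc->saveHTML();
}

$fixed = normalizeHtml('<ul><li>one<li>two</ul><p>unclosed');
```

The returned string is a full document (the parser adds html and body elements), with the li and p elements closed.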
EXTRACTING FROM HTML
Yes, there are several ways in PHP:
• PCRE / String Functions
• dom
• SimpleXML
• php-ext/html_parse
Mostly, it is tedious to handle an entire HTML document.
DOM is an API for HTML & XML.
XPath is your friend.
Remember PHP's feature, DOMXPath::registerPhpFunctions.
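
A small sketch of DOM, XPath, and DOMXPath::registerPhpFunctions working together: it selects links whose href ends in ".pdf", case-insensitively, by calling preg_match from inside the XPath expression. The sample HTML and variable names are made up for illustration.

```php
<?php
// Made-up sample document.
$html = '<html><body>'
      . '<a href="/a.pdf">A</a><a href="/b.html">B</a><a href="/C.PDF">C</a>'
      . '</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The "php" prefix must be bound to this namespace before PHP functions can be called.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Expose only the function that is needed.
$xpath->registerPhpFunctions('preg_match');

// php:functionString() passes the string value of @href to preg_match().
// The result is compared with 0 explicitly rather than used as a bare predicate.
$nodes = $xpath->query('//a[php:functionString("preg_match", "/\.pdf$/i", @href) > 0]');

$hrefs = [];
foreach ($nodes as $a) {
    $hrefs[] = $a->getAttribute('href');
}
```

After this runs, $hrefs holds '/a.pdf' and '/C.PDF'. Plain XPath 1.0 has no regular expressions or case-insensitive matching, which is where registered PHP functions earn their keep.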
SOLVE CONTEXT
You will need to resolve the context of the response you got.
• Filter the extracted results for your domain.
Don't reinvent the wheel:
• Resolve relative URIs (RFC 3986): pear/Net_URL2 and zendframework/zend-uri support it.
• "Databases" that help you: wedata, OSS repositories (not only PHP)
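
To show what resolving a relative URI involves, here is a deliberately simplified sketch of RFC 3986 reference resolution for http(s) base URIs. Both function names are hypothetical; the sketch ignores userinfo and returns absolute references unchanged. It is meant to illustrate the steps, not to replace the libraries named above.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments of an absolute path.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $seg) {
        if ($seg === '.' || $seg === '..') {
            // Never pop the leading empty segment that stands for the root "/".
            if ($seg === '..' && count($out) > 1) {
                array_pop($out);
            }
            if ($i === $last) {
                $out[] = '';   // a trailing "." or ".." leaves a trailing slash
            }
        } else {
            $out[] = $seg;
        }
    }
    return implode('/', $out);
}

// Hypothetical helper: resolve $ref against an absolute base URI.
function resolveUri(string $base, string $ref): string
{
    $b = parse_url($base);
    $r = parse_url($ref);
    if (!is_array($b) || !is_array($r) || !isset($b['scheme'], $b['host'])) {
        throw new InvalidArgumentException('Need an absolute base URI and a parsable reference');
    }
    if (isset($r['scheme'])) {
        return $ref;                        // already absolute
    }
    if (isset($r['host'])) {
        return $b['scheme'] . ':' . $ref;   // network-path reference: //host/path
    }

    $basePath = $b['path'] ?? '/';
    $refPath = $r['path'] ?? '';
    if ($refPath === '') {
        // Same document: keep the base path, prefer the reference's query.
        $path = $basePath;
        $query = $r['query'] ?? ($b['query'] ?? null);
    } else {
        if ($refPath[0] !== '/') {
            // Merge: replace everything after the last "/" of the base path.
            $refPath = substr($basePath, 0, strrpos($basePath, '/') + 1) . $refPath;
        }
        $path = removeDotSegments($refPath);
        $query = $r['query'] ?? null;
    }

    $uri = $b['scheme'] . '://' . $b['host']
         . (isset($b['port']) ? ':' . $b['port'] : '') . $path;
    if ($query !== null) {
        $uri .= '?' . $query;
    }
    if (isset($r['fragment'])) {
        $uri .= '#' . $r['fragment'];
    }
    return $uri;
}
```

With the base URI from the RFC's examples, resolveUri('http://a/b/c/d;p?q', '../g') yields 'http://a/b/g'. The number of corner cases even in this short version is a fair argument for the slide's advice.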
LIBRARIES
• behat/mink
• goutte, symfony/browserKit
• zendframework/zend-dom
• diggin-scraper
• simple_html_dom
• phpQuery
• fluentDOM
• php-jsonpointer
• beberlei/phpricot
Today, the web is controlled by JavaScript.
JAVASCRIPT
We need a "REAL" BROWSER for AUTOMATION:
• Selenium
• PhantomJS / CasperJS
• SlimerJS
You have a chance to survive with PHP.
Move forward, PHP.
• HTML5
• HTTP 2.0 / SPDY
• Browser binding: ardemiranda/WebKitGtk
• Concurrent and asynchronous programming
• Collective intelligence
• NLP (natural language processing)
Deeper and Deeper
• elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php
• Accessing Web Resources with PHP http://joind.in/3386
• Spidering Hacks http://www.oreilly.co.jp/books/4873111870/
• fuba: exthtml https://fuba.jottit.com/exthtml
• kitamomonga http://d.hatena.ne.jp/kitamomonga/
• The Architecture of Open Source Applications, selenium https://github.com/m-takagi/aosa-ja
Thanks