Introduction of Web scraping for PHP users
kazusuke sasezaki
December 30, 2013
Slides for the Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program
Transcript
15 years ago, PHP 3.0.4 was released. It included get_meta_tags().
It works:

    <?php
    var_dump(get_meta_tags("http://example.com/"));

Output:

    array(1) {
      'viewport' =>
      string(35) "width=device-width, initial-scale=1"
    }
It works, sometimes!

    <?php
    get_meta_tags("http://www.discogs.com/");

    PHP Warning: get_meta_tags(http://www.discogs.com/): failed to open stream: HTTP request failed! 500 Client Refused
You are a file_get_contents() fanboy:

    <?php
    get_meta_tags(
        "data://text/html," . file_get_contents(
            "http://www.discogs.com/",
            false,
            stream_context_create(
                ["http" => ["header" => "User-Agent: Mozilla/4.0"]]
            )
        )
    );
It works, too:

    $ php -d user_agent="Mozilla/4.0" \
        -r 'get_meta_tags("http://www.discogs.com/");'
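The file_get_contents() trick above feeds an already-fetched string back into get_meta_tags() through a data: URI. Here is a minimal sketch of the same idea for HTML that is already in a variable. The helper name metaTagsFromString is hypothetical, and base64 is an assumption of this sketch (the slide concatenates the raw string), chosen so that characters such as # or % in the markup need no escaping.

```php
<?php
// Hypothetical helper (not from the slides): run get_meta_tags() on HTML
// that was fetched elsewhere, by wrapping it in a base64 data: URI.
function metaTagsFromString(string $html): array
{
    $uri = 'data://text/html;base64,' . base64_encode($html);
    $tags = get_meta_tags($uri);

    // get_meta_tags() returns false on failure; normalize that to an empty array.
    return $tags === false ? [] : $tags;
}

// Sample markup made up for this example.
$html = '<html><head>'
      . '<meta name="description" content="100% scraped #1">'
      . '<meta name="viewport" content="width=device-width">'
      . '</head><body></body></html>';

print_r(metaTagsFromString($html));
```

Fetching and parsing are now two separate steps: any HTTP client can produce the string, and the parsing side no longer cares where it came from.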
You would FEEL that something is wrong here.
"Separate your concerns: get the HTML, then parse the HTML."
Doubt it!
Handling Request & Handling Response
HTTP Request
Sorry, I won't talk about the HTTP request side today. (There is no time to cover HttpClient, spiders, and crawlers in 15 minutes.)
HTTP RESPONSE
Headers and body. The body is not only HTML ;-)
SCRAPING FROM THE RESPONSE!
There are some facts you should act on:
• CONTENT ENCODING
• CHARSET ENCODING
• NORMALIZE HTML
• EXTRACTING FROM HTML
• SOLVE CONTEXT
CONTENT ENCODING
Today, we have already accepted it.
• gzip
• deflate
• compress
• identity
I recommend using a good response handler before you struggle with this yourself:
• pear/HTTP_Request2
• zendframework/zend-http
• guzzle/guzzle
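For readers who want to see what such a response handler does internally, here is a minimal sketch using PHP's zlib extension. The function name decodeBody is hypothetical and is not part of any of the libraries listed above. The old "compress" coding is left unsupported in this sketch.

```php
<?php
// Hypothetical helper: decode a response body according to the value of
// its Content-Encoding header. Requires the zlib extension.
function decodeBody(string $body, string $contentEncoding): string
{
    switch (strtolower(trim($contentEncoding))) {
        case '':
        case 'identity':
            return $body; // nothing to undo
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // "deflate" should mean zlib-wrapped data, but some servers
            // send a raw deflate stream instead, so try both.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            throw new InvalidArgumentException(
                "Unsupported Content-Encoding: $contentEncoding"
            );
    }

    if ($decoded === false) {
        throw new RuntimeException('Body could not be decoded');
    }
    return $decoded;
}

// Usage: simulate a gzip-encoded response body and decode it again.
$original = '<html><body>hello</body></html>';
echo decodeBody(gzencode($original), 'gzip'), "\n";
```

In real code the libraries above already do this for you, which is the point of the recommendation.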
CHARSET ENCODING

    mb_convert_encoding($html, "UTF-8", "auto")

This is not the best way. You already have hints in the response headers and in the HTML's meta nodes.

    <?php header("Content-Type: text/html; charset=Shift-JIS"); ?>
    ①②③④⑤

But don't forget: most Japanese PHP users LIE. The page above declares Shift-JIS, yet the circled digits are vendor extension characters (CP932) that standard Shift_JIS does not define.
I hope my components will help you:
• diggin/diggin-http-charset
• diggin/guzzle-plugin-AutoCharsetEncodingPlugi
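The order of hints described above (response header first, then meta nodes, then guessing) can be sketched with mbstring as follows. The function names detectCharset and toUtf8 are hypothetical and do not reflect the API of the diggin components. Mapping every Shift_JIS spelling to mbstring's "SJIS-win" is an assumption of this sketch, made because of the mislabelled pages mentioned above.

```php
<?php
// Hypothetical helper: find the declared charset of a response.
function detectCharset(?string $contentType, string $html): ?string
{
    // 1. charset parameter of the Content-Type response header
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // 2. <meta charset="..."> or <meta http-equiv=... content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    // 3. Last resort: let mbstring guess among a few candidates.
    $guess = mb_detect_encoding($html, ['UTF-8', 'SJIS-win', 'eucJP-win'], true);
    return $guess === false ? null : $guess;
}

// Hypothetical helper: convert a response body to UTF-8.
function toUtf8(string $html, ?string $contentType = null): string
{
    $charset = detectCharset($contentType, $html) ?? 'UTF-8';

    // Assumption: pages labelled Shift_JIS may contain CP932-only
    // characters, so decode them with the wider "SJIS-win" table.
    $key = str_replace('-', '_', strtolower($charset));
    if (in_array($key, ['shift_jis', 'sjis', 'x_sjis'], true)) {
        $charset = 'SJIS-win';
    }

    try {
        $converted = @mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (\Throwable $e) {
        // PHP 8 throws ValueError for unknown encoding names.
        $converted = false;
    }
    // Fall back to the untouched input if conversion was not possible.
    return is_string($converted) ? $converted : $html;
}

// Usage: build a mislabelled Shift-JIS body, then convert it back to UTF-8.
$body = mb_convert_encoding('<p>①②③④⑤</p>', 'SJIS-win', 'UTF-8');
echo toUtf8($body, 'text/html; charset=Shift-JIS'), "\n";
```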
NORMALIZE HTML
Before parsing it as HTML, you can fix it:
• php-ext/tidy, HTMLParser
• other beautifiers
• manually :-(
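As an illustration of fixing markup before parsing, the sketch below repairs broken HTML with the tidy extension when it is loaded. The function name normalizeHtml is hypothetical. The fallback through DOMDocument is an assumption of this sketch, not something the slides propose, and it may mishandle non-ASCII text unless the document declares its charset.

```php
<?php
// Hypothetical helper: repair broken markup before real parsing starts.
function normalizeHtml(string $html): string
{
    // Preferred path: the tidy extension, if it is installed.
    if (function_exists('tidy_repair_string')) {
        $fixed = tidy_repair_string(
            $html,
            ['output-html' => true, 'wrap' => 0],
            'utf8'
        );
        if (is_string($fixed)) {
            return $fixed;
        }
    }

    // Fallback (assumption of this sketch): let libxml's lenient HTML
    // parser read the markup and serialize it back out.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parse warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (string) $doc->saveHTML();
}

// Usage: the two unclosed <p> elements come back properly closed.
echo normalizeHtml('<p>first<p>second');
```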
EXTRACTING FROM HTML
Yes, there are several ways in PHP:
• PCRE / String Functions
• dom
• SimpleXML
• php-ext/html_parse
Most of them are tedious when applied to an entire HTML document.
DOM is an API for HTML & XML.
XPath is your friend.
Remember PHP's feature: DOMXPath::registerPhpFunctions.
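To show how DOMXPath::registerPhpFunctions combines XPath with PHP functions, here is a minimal sketch. XPath 1.0 has no regular expressions, so the query calls preg_match() through the php: namespace. The function name extractItemLinks and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: collect links whose class attribute contains the
// word "item" in any letter case.
function extractItemLinks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('php', 'http://php.net/xpath');
    $xpath->registerPhpFunctions(['preg_match']); // expose only what is needed

    // php:functionString() passes the string value of @class to preg_match();
    // "> 0" turns the returned match count into a boolean test.
    $query = '//a[php:functionString("preg_match", "/\bitem\b/i", @class) > 0]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = ['href' => $a->getAttribute('href'), 'text' => $a->textContent];
    }
    return $links;
}

// Sample markup made up for this example.
$html = <<<'HTML'
<html><body>
  <a class="item first" href="/a">Alpha</a>
  <a class="other" href="/b">Beta</a>
  <a class="big ITEM" href="/c">Gamma</a>
</body></html>
HTML;

print_r(extractItemLinks($html));
```

Passing an explicit list to registerPhpFunctions() keeps the set of callable PHP functions small, which matters when the XPath expression is built from external input.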
Solve Context
You will need to resolve the context of the response you got:
• Filter the extracted results for your domain.
Don't reinvent the wheel:
• Resolve relative URIs (RFC 3986): pear/Net_URL2 and zendframework/zend-uri support it.
• "Databases" that help you: wedata, OSS repositories (not only PHP).
LIBRARIES
• behat/mink
• goutte, symfony/browserKit
• zendframework/zend-dom
• diggin-scraper
• simple_html_dom
• phpQuery
• fluentDOM
• php-jsonpointer
• beberlei/phpricot
Today, the web is controlled by JavaScript.
JAVASCRIPT
We need a "REAL" BROWSER for AUTOMATION:
• Selenium
• PhantomJS / CasperJS
• SlimerJS
You have a chance to survive with PHP.
Move forward, PHP.
• HTML5
• HTTP 2.0 / SPDY
• Browser binding: ardemiranda/WebKitGtk
• Concurrent programming, asynchronous
• Collective intelligence
• NLP (natural language processing)
Deeper and Deeper
• elazar/web-scraping-with-php: https://github.com/elazar/web-scraping-with-php
• Accessing Web Resources with PHP: http://joind.in/3386
• Spidering Hacks: http://www.oreilly.co.jp/books/4873111870/
• fuba: exthtml: https://fuba.jottit.com/exthtml
• kitamomonga: http://d.hatena.ne.jp/kitamomonga/
• The Architecture of Open Source Applications, Selenium: https://github.com/m-takagi/aosa-ja
Thanks