Introduction of Web scraping for PHP users

15 years ago

15 years ago 　 PHP 3.0.4 release, includes get_meta_tags().

It works.. <?php get_meta_tags("http://example.com/"); array(1) { 'viewport' => string(35) "width=device-width,
initial-scale=1" }

It works, sometimes! <?php get_meta_tags("http://www.discogs.com/"); PHP Warning: get_meta_tags(http://www.disc ogs.com/): failed
to open stream: HTTP request failed! 500 Client Refused

You are file_get_contents() fanboy

<?php get_meta_tags("data://text/html,". file_get_contents( "http://www.discogs.com/", false, stream_context_create( ["http" => ["header" =>
"User-Agent: Mozilla/4.0"] ] ) ) );

It works, too $php -d user_agent="Mozilla/4.0" \ -r 'get_meta_tags("http://www.discogs.com/");'

You would FEEL there are some problem.

"do separation of concerns! GET HTML & parse HTML."

"do separation of concerns! GET HTML & parse HTML." Doubt!
Doubt! Doubt!

Handling Request & Handling Response

HTTP Request

Sorry, today I don't talk about HTTP Request side. (no
time talk about HttpClient, Spider, crawler in 15 minutes)

HTTP ReSPONSE

HTTP ReSPONSE HeaDERS - BODY - Not only HTML ;-)

SCRAP SCRAPING FROM RESPONSE!

There are some fact you should take a act

There are some fact you should take a act •
ConTENT ENCODING • ChaRSET ENCODING • NORMALIZE HTML • EXTRACTING FROM HTML • SOLVE CONTEXT

CONTENT ENCODING

CONTENT ENCODING TODAY, WE ALREADY ACCEPTED IT.

CONTENT ENCODING • gzip • deflate • compress • identity
I recommend using good Response handlers before struggling.

CONTENT ENCODING • gzip • deflate I recommend using good
Response handlers before struggling. pear/HTTP_Request2 zendframework/zend-http guzzle/guzzle

CHARSET ENCODING

CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html)

CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html) This is not best way.
You had already got hint in Response Headers & html's meta nodes.

CHARSET ENCODING <?php header("Content-Type: text/html; charset=Shift-JIS"); ?> ①②③④⑤ But, Don't
forget, Most of Japanese PHP users do LIE.

CHARSET ENCODING diggin/diggin-http-charset diggin/guzzle-plugin-AutoCharsetEncodingPlugi I hope my component will help
you.

NORMALIZE HTML

NORMALIZE HTML before parse as HTML, you can fix it.

NORMALIZE HTML before parse as HTML, you can fix it.
• php-ext/tidy, HTMLParser • other beautifiers • manually :-(

EXTRACTING FROM HTML

EXTRACTING FROM HTML Yes, there are several way in PHP.
• PCRE / String Functions • dom • SimpleXML • php-ext/html_parse

EXTRACTING FROM HTML Mostly, boredom for entire HTML. • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse

EXTRACTING FROM HTML DOM is a API FOR HTML &
XML • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse

EXTRACTING FROM HTML Xpath is your friend • PCRE /
String Functions • dom • SimpleXML • php-ext/html_parse

EXTRACTING FROM HTML Remember PHP's Feature, DOMXPath:: registerPhpFunctions • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse

Solve Context

Solve Context You will need solve context from got response
• Filtering extracted result for Domain.

Solve Context Don't reinvent the wheel • Resolve relative URI
/ RFC-3986 - pear/Net_URL2, zendframework/zend-uri supports it

Solve Context Don't reinvent the wheel • “Databases” that helps
you - wedata - OSS's repositories (not only PHP)

LIBRALIES • behat/mink • goutte, symfony/browserKit • zendframework/zend-dom • diggin-scraper
• simple_html_dom • phpQuery • fluentDOM • php-jsonpointer • beberlei/phpricot

Today, web is under control by JavaScript

JAVASCRIPT We need "REAL" BROWSER for AUTOMATION • Selenium •
PhantomJS / CasperJS • SlimerJS

You have a chance to survive with php.

Move forward, PHP. • HTML5 • HTTP 2.0 / SPDY
• Browser binding ardemiranda/WebKitGtk • concurrent programing, asynchronous • collective intelligence • NLP natural language processing

Deeper and Deeper • elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php • Accessing Web Resources
with PHP http://joind.in/3386 • Spidering Hacks http://www.oreilly.co.jp/books/4873111870/ • fuba: exthtml https://fuba.jottit.com/exthtml • kitamomonga http://d.hatena.ne.jp/kitamomonga/ • The Architecture of Open Source Applications selenium - https://github.com/m-takagi/aosa-ja

Thanks

Introduction of Web scraping for PHP users

Introduction of Web scraping for PHP users

More Decks by kazusuke sasezaki

Featured

Transcript