Introduction of Web scraping for PHP users

kazusuke sasezaki
December 30, 2013

slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program

Transcript

  1. Sorry, today I won't talk about the HTTP request side. (There is no time to cover HttpClient, spiders, or crawlers in 15 minutes.)
  2. There are some facts you should act on:
    • CONTENT ENCODING
    • CHARSET ENCODING
    • NORMALIZE HTML
    • EXTRACTING FROM HTML
    • SOLVE CONTEXT
  3. CONTENT ENCODING
    • gzip
    • deflate
    • compress
    • identity
    I recommend using a good response handler before struggling with this yourself.
  4. CONTENT ENCODING
    • gzip
    • deflate
    I recommend using a good response handler before struggling with this yourself:
    pear/HTTP_Request2, zendframework/zend-http, guzzle/guzzle
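
The libraries above do this for you. Only to show what such a response handler has to deal with, here is a minimal sketch built on ext/zlib. The function name `decodeBody` is hypothetical (not from the slides), and `compress` is deliberately left unsupported in this sketch.

```php
<?php
/**
 * Hypothetical helper: decode a response body according to the value of
 * its Content-Encoding header. Requires ext/zlib.
 */
function decodeBody(string $body, string $contentEncoding): string
{
    switch (strtolower(trim($contentEncoding))) {
        case '':
        case 'identity':
            // Nothing to do: the body was sent as-is.
            return $body;
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // "deflate" should be zlib-wrapped data, but some servers send
            // raw deflate instead, so try both.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            throw new RuntimeException("Unsupported Content-Encoding: $contentEncoding");
    }

    if ($decoded === false) {
        throw new RuntimeException('Could not decode the response body');
    }
    return $decoded;
}
```

The double attempt for `deflate` is exactly the kind of detail that makes a well-tested HTTP client preferable to hand-written code.
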
  5. CHARSET ENCODING
    mb_convert_encoding($html, "UTF-8", "auto") is not the best way.
    You already have hints in the response headers and the HTML's meta elements.
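
One way to use those hints, as a rough sketch: look at the Content-Type header first, then at the meta elements, and only guess as a last resort. The function names `detectCharset` and `toUtf8` and the candidate list for guessing are assumptions made for illustration, and the regular expressions are simplistic rather than a full HTML5 encoding sniffer. Requires ext/mbstring.

```php
<?php
/**
 * Hypothetical helper: find the declared charset, or null if none is declared.
 */
function detectCharset(?string $contentType, string $html): ?string
{
    // 1. charset parameter of the Content-Type response header
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

/**
 * Hypothetical helper: convert the document to UTF-8 using the declared
 * charset. Note that mb_convert_encoding() rejects encoding names it does
 * not know, so real code should validate the declared name first.
 */
function toUtf8(?string $contentType, string $html): string
{
    $charset = detectCharset($contentType, $html);
    if ($charset === null) {
        // Last resort: guess among a few candidates (assumed list).
        $candidates = ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-2022-JP'];
        $charset = mb_detect_encoding($html, $candidates, true) ?: 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```
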
  6. NORMALIZE HTML
    Before parsing the HTML, you can fix it.
    • php-ext/tidy, HTMLParser
    • other beautifiers
    • manually :-(
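
A minimal sketch of the tidy route. The function name `normalizeHtml` is hypothetical, and the fallback to libxml's recovering parser when ext/tidy is not installed is an addition for this sketch, not something the slides suggest.

```php
<?php
/**
 * Hypothetical helper: repair broken markup before parsing it.
 */
function normalizeHtml(string $html): string
{
    if (function_exists('tidy_repair_string')) {
        // 'wrap' => 0 disables line wrapping; input is assumed to be UTF-8.
        $fixed = tidy_repair_string($html, ['output-html' => true, 'wrap' => 0], 'utf8');
        if ($fixed !== false) {
            return $fixed;
        }
    }

    // Fallback (assumption): let libxml recover what it can. Without an
    // encoding declaration loadHTML() does not treat the input as UTF-8,
    // so convert or declare the charset beforehand for non-ASCII pages.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->saveHTML();
}

// Unclosed <b> and <p> elements get their end tags back.
$clean = normalizeHtml('<p>unclosed <b>bold<p>next');
```
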
  7. EXTRACTING FROM HTML
    Yes, there are several ways in PHP.
    • PCRE / String Functions
    • dom
    • SimpleXML
    • php-ext/html_parse
  8. EXTRACTING FROM HTML
    Mostly tedious for an entire HTML document.
    • PCRE / String Functions
    • dom
    • SimpleXML
    • php-ext/html_parse
  9. EXTRACTING FROM HTML
    DOM is an API for HTML & XML.
    • PCRE / String Functions
    • dom
    • SimpleXML
    • php-ext/html_parse
  10. EXTRACTING FROM HTML
    XPath is your friend.
    • PCRE / String Functions
    • dom
    • SimpleXML
    • php-ext/html_parse
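
To make the DOM and XPath slides concrete, here is a small sketch with ext/dom. The sample markup and the function name `extractLinks` are made up for illustration.

```php
<?php
$html = <<<'HTML'
<html><body>
  <ul id="news">
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
HTML;

/**
 * Hypothetical helper: return [href => link text] for every <a> element
 * matched by the given XPath expression.
 */
function extractLinks(string $html, string $expression): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($expression) as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

$links = extractLinks($html, '//ul[@id="news"]/li/a');
```

One expression replaces what would otherwise be a fragile chain of regular expressions or string searches.
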
  11. EXTRACTING FROM HTML
    Remember PHP's feature: DOMXPath::registerPhpFunctions
    • PCRE / String Functions
    • dom
    • SimpleXML
    • php-ext/html_parse
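
XPath 1.0 has no regular expressions, and this is where DOMXPath::registerPhpFunctions helps: it lets an expression call PHP functions. The sketch below exposes only `preg_match` and uses it to pick links whose href ends in ".pdf". The function name `pdfLinks` and the sample markup are hypothetical.

```php
<?php
/**
 * Hypothetical helper: return the href of every link pointing at a PDF.
 */
function pdfLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The "php" prefix must be bound to this namespace URI.
    $xpath->registerNamespace('php', 'http://php.net/xpath');
    // Expose only the functions that are really needed.
    $xpath->registerPhpFunctions(['preg_match']);

    // php:functionString() passes its arguments to PHP as strings.
    $query = '//a[php:functionString("preg_match", "/\.pdf$/i", @href) > 0]';

    $hrefs = [];
    foreach ($xpath->query($query) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

$pdfs = pdfLinks('<div><a href="/a.pdf">A</a><a href="/b.html">B</a><a href="/C.PDF">C</a></div>');
```
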
  12. Solve Context
    You will need to resolve context from the response you got.
    • Filtering extracted results for the domain.
  13. Solve Context
    Don't reinvent the wheel.
    • Resolving relative URIs (RFC 3986): pear/Net_URL2 and zendframework/zend-uri support it.
  14. Solve Context
    Don't reinvent the wheel.
    • "Databases" that help you:
      - wedata
      - OSS repositories (not only PHP)
  15. LIBRARIES
    • behat/mink
    • goutte, symfony/browserKit
    • zendframework/zend-dom
    • diggin-scraper
    • simple_html_dom
    • phpQuery
    • fluentDOM
    • php-jsonpointer
    • beberlei/phpricot
  16. Move forward, PHP.
    • HTML5
    • HTTP 2.0 / SPDY
    • Browser binding: ardemiranda/WebKitGtk
    • concurrent programming, asynchronous
    • collective intelligence
    • NLP (natural language processing)
  17. Deeper and Deeper
    • elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php
    • Accessing Web Resources with PHP http://joind.in/3386
    • Spidering Hacks http://www.oreilly.co.jp/books/4873111870/
    • fuba: exthtml https://fuba.jottit.com/exthtml
    • kitamomonga http://d.hatena.ne.jp/kitamomonga/
    • The Architecture of Open Source Applications, selenium https://github.com/m-takagi/aosa-ja