Slide 1
Slide 1 text
Introduction to Web Scraping for PHP Users
Slide 2
Slide 2 text
15 years ago
Slide 3
Slide 3 text
15 years ago, PHP 3.0.4 was released. It included get_meta_tags().
Slide 4
Slide 4 text
It works.. string(35) "width=device-width, initial-scale=1" }
Slide 5
Slide 5 text
It works, sometimes!
Slide 6
Slide 6 text
You are a file_get_contents() fanboy
Slide 7
Slide 7 text
["header" => "User-Agent: Mozilla/4.0"] ] ) ) );
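The fragment above sets a User-Agent header through an HTTP stream context. A minimal sketch of that idea, assuming the context is passed to file_get_contents(); the helper names makeContext() and fetchHtml() are hypothetical and are not from the slides:

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header.
// The default header value is the one shown on the slide.
function makeContext(string $userAgent = 'Mozilla/4.0')
{
    return stream_context_create([
        'http' => ['header' => "User-Agent: $userAgent"],
    ]);
}

// Hypothetical helper: fetch a page with that context and fail loudly on errors.
function fetchHtml(string $url): string
{
    $html = file_get_contents($url, false, makeContext());
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}
```

Calling fetchHtml("http://www.discogs.com/") would perform the request; it is left out of the sketch so the block runs without network access.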
Slide 8
Slide 8 text
It works, too
$ php -d user_agent="Mozilla/4.0" \
    -r 'get_meta_tags("http://www.discogs.com/");'
Slide 9
Slide 9 text
You would FEEL there is some problem.
Slide 10
Slide 10 text
"do separation of concerns! GET HTML & parse HTML."
Slide 11
Slide 11 text
"do separation of concerns! GET HTML & parse HTML." Doubt! Doubt! Doubt!
Slide 12
Slide 12 text
Handling Request & Handling Response
Slide 13
Slide 13 text
HTTP Request
Slide 14
Slide 14 text
Sorry, today I won't talk about the HTTP Request side. (No time to talk about HttpClient, spiders, and crawlers in 15 minutes.)
Slide 15
Slide 15 text
HTTP RESPONSE
Slide 16
Slide 16 text
HTTP RESPONSE HEADERS - BODY - Not only HTML ;-)
Slide 17
Slide 17 text
SCRAP SCRAPING FROM RESPONSE!
Slide 18
Slide 18 text
There are some facts you should act on
Slide 19
Slide 19 text
There are some facts you should act on ● CONTENT ENCODING ● CHARSET ENCODING ● NORMALIZE HTML ● EXTRACTING FROM HTML ● SOLVE CONTEXT
Slide 20
Slide 20 text
CONTENT ENCODING
Slide 21
Slide 21 text
CONTENT ENCODING TODAY, WE ALREADY ACCEPTED IT.
Slide 22
Slide 22 text
CONTENT ENCODING ● gzip ● deflate ● compress ● identity I recommend using good Response handlers before struggling.
Slide 23
Slide 23 text
CONTENT ENCODING ● gzip ● deflate I recommend using good Response handlers before struggling. pear/HTTP_Request2 zendframework/zend-http guzzle/guzzle
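The libraries above decode the body for you. If you only have the raw body and the Content-Encoding header value, a minimal sketch using the zlib extension might look like this; decodeBody() is a hypothetical helper, not part of any of the listed libraries:

```php
<?php
// Hypothetical helper: decode a response body according to Content-Encoding.
// Assumes the zlib extension is loaded.
function decodeBody(string $body, string $contentEncoding): string
{
    switch (strtolower(trim($contentEncoding))) {
        case '':
        case 'identity':
            return $body;
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // HTTP "deflate" is the zlib format, but some servers send
            // raw deflate data instead, so try both.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            throw new InvalidArgumentException(
                "Unsupported Content-Encoding: $contentEncoding"
            );
    }
    if ($decoded === false) {
        throw new RuntimeException('Body could not be decoded');
    }
    return $decoded;
}
```

The "compress" encoding is deliberately left unsupported here; a real response handler also has to deal with stacked encodings and streaming.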
Slide 24
Slide 24 text
CHARSET ENCODING
Slide 25
Slide 25 text
CHARSET ENCODING mb_convert_encoding($html, "UTF-8", "auto")
Slide 26
Slide 26 text
CHARSET ENCODING mb_convert_encoding($html, "UTF-8", "auto") This is not the best way. You already have hints in the Response Headers & the HTML's meta nodes.
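A minimal sketch of using those hints: look at the Content-Type header first, then at the meta nodes, and fall back to detection only when neither declares a charset. detectCharset() and toUtf8() are hypothetical helpers, the candidate list for detection is an assumption, and a declared charset can still be wrong.

```php
<?php
// Hypothetical helper: find a declared charset, header first, then <meta>.
function detectCharset(string $html, ?string $contentType = null): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Covers <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert the body to UTF-8 using the declared charset.
function toUtf8(string $html, ?string $contentType = null): string
{
    $charset = detectCharset($html, $contentType);
    if ($charset === null) {
        // Assumed candidate list; adjust it to the sites you scrape.
        $candidates = ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-8859-1'];
        $charset = mb_detect_encoding($html, $candidates, true) ?: 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Note the argument order of mb_convert_encoding(): the string comes first, then the target encoding, then the source encoding.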
Slide 27
Slide 27 text
CHARSET ENCODING ①②③④⑤ But don't forget: most Japanese PHP users do LIE.
Slide 28
Slide 28 text
CHARSET ENCODING diggin/diggin-http-charset diggin/guzzle-plugin-AutoCharsetEncodingPlugi I hope my component will help you.
Slide 29
Slide 29 text
NORMALIZE HTML
Slide 30
Slide 30 text
NORMALIZE HTML
Slide 31
Slide 31 text
NORMALIZE HTML Before parsing it as HTML, you can fix it.
Slide 32
Slide 32 text
NORMALIZE HTML Before parsing it as HTML, you can fix it. ● php-ext/tidy, HTMLParser ● other beautifiers ● manually :-(
Slide 33
Slide 33 text
EXTRACTING FROM HTML
Slide 34
Slide 34 text
EXTRACTING FROM HTML Yes, there are several ways in PHP. ● PCRE / String Functions ● dom ● SimpleXML ● php-ext/html_parse
Slide 35
Slide 35 text
EXTRACTING FROM HTML Mostly tedious for an entire HTML document. ● PCRE / String Functions ● dom ● SimpleXML ● php-ext/html_parse
Slide 36
Slide 36 text
EXTRACTING FROM HTML DOM is an API for HTML & XML ● PCRE / String Functions ● dom ● SimpleXML ● php-ext/html_parse
Slide 37
Slide 37 text
EXTRACTING FROM HTML XPath is your friend ● PCRE / String Functions ● dom ● SimpleXML ● php-ext/html_parse
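A minimal sketch of extraction with the dom extension and XPath. The sample markup, the "releases" id, and the extractLinks() helper are made up for illustration:

```php
<?php
// Sample markup, invented for this example.
$html = <<<'HTML'
<html><body>
  <ul id="releases">
    <li><a href="/release/1">First</a></li>
    <li><a href="/release/2">Second</a></li>
  </ul>
</body></html>
HTML;

// Hypothetical helper: return [href => link text] for the links in the list.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//ul[@id="releases"]/li/a') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

print_r(extractLinks($html));
```

One XPath expression replaces the nested loops or fragile regular expressions that the string-based approaches tend to need.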
Slide 38
Slide 38 text
EXTRACTING FROM HTML Remember PHP's feature, DOMXPath::registerPhpFunctions ● PCRE / String Functions ● dom ● SimpleXML ● php-ext/html_parse
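A minimal sketch of DOMXPath::registerPhpFunctions(): XPath 1.0 has no lower-case function, so strtolower() is exposed to the expression for a case-insensitive class match. The markup and class names are made up for illustration:

```php
<?php
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML(
    '<ul><li class="Item New">A</li><li class="item">B</li>'
    . '<li class="other">C</li></ul>'
);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// The php: prefix must be bound to this namespace before use.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Expose only the functions the expression needs.
$xpath->registerPhpFunctions(['strtolower']);

// php:functionString() passes the attribute to PHP as a string.
$nodes = $xpath->query(
    '//li[contains(php:functionString("strtolower", @class), "item")]'
);

$texts = [];
foreach ($nodes as $li) {
    $texts[] = $li->textContent;
}
```

Passing an explicit list to registerPhpFunctions() keeps the expression from calling arbitrary PHP functions.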
Slide 39
Slide 39 text
Solve Context
Slide 40
Slide 40 text
Solve Context You will need to solve the context of the response you got ● Filtering extracted results for the domain.
Slide 41
Slide 41 text
Solve Context Don't reinvent the wheel ● Resolve relative URI / RFC 3986 - pear/Net_URL2, zendframework/zend-uri support it
Slide 42
Slide 42 text
Solve Context Don't reinvent the wheel ● “Databases” that help you - wedata - OSS's repositories (not only PHP)
Slide 43
Slide 43 text
LIBRARIES ● behat/mink ● goutte, symfony/browserKit ● zendframework/zend-dom ● diggin-scraper ● simple_html_dom ● phpQuery ● fluentDOM ● php-jsonpointer ● beberlei/phpricot
Slide 44
Slide 44 text
Today, the web is under the control of JavaScript
Slide 45
Slide 45 text
JAVASCRIPT We need a "REAL" BROWSER for AUTOMATION ● Selenium ● PhantomJS / CasperJS ● SlimerJS
Slide 46
Slide 46 text
You have a chance to survive with PHP.
Slide 47
Slide 47 text
Move forward, PHP. ● HTML5 ● HTTP 2.0 / SPDY ● Browser binding ardemiranda/WebKitGtk ● concurrent programming, asynchronous ● collective intelligence ● NLP natural language processing
Slide 48
Slide 48 text
Deeper and Deeper ● elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php ● Accessing Web Resources with PHP http://joind.in/3386 ● Spidering Hacks http://www.oreilly.co.jp/books/4873111870/ ● fuba: exthtml https://fuba.jottit.com/exthtml ● kitamomonga http://d.hatena.ne.jp/kitamomonga/ ● The Architecture of Open Source Applications selenium - https://github.com/m-takagi/aosa-ja
Slide 49
Slide 49 text
Thanks