Parsing HTML with PHP 8.4

Slides for my PHP Conference talk about parsing HTML, with a focus on the new HTML parser and DOM API in PHP 8.4.

Keyvan Minoukadeh

November 03, 2025

Transcript

  1. TALK OUTLINE
    • INTRO – A little about me
    • WHY PARSE? – What’s it usually used for
    • PARSING WITH 8.4 – PHP’s new HTML parser
    • NEW DOM API IN 8.4 – Using the new DOM classes to enable the new features
    • CSS SELECTORS – Retrieving HTML elements with modern CSS selectors
    • XPATH SELECTORS – When to use them
    • DEMO – Try out some code
    • CLOSING AND Q&A – Useful tools, further reading, Q&A
    2025 Parsing and Traversing HTML with PHP 8.4
  2. INTRO
    Sweden
    Interaction Design
    Software at www.mochi.is
    • Feed Control
    • Push to Kindle
    Open source
    • Readability.php
  3. TALK OUTLINE
    • INTRO – A little about me and my experience
    • PARSING HTML – What’s it used for
    • PARSING WITH 8.4 – PHP’s new HTML parser
    • NEW DOM API IN 8.4 – Using the new DOM classes to enable the new features
    • CSS SELECTORS – Retrieving HTML elements with modern CSS selectors
    • XPATH SELECTORS – When to use them
    • DEMO – Try out some code
    • CLOSING AND Q&A – Useful tools, further reading, Q&A
  4. WHY PARSE HTML?
    EXTRACT DATA
    • Get the data you need
    • No API needed
    MANIPULATE
    • Add or remove elements
    • Move elements around
    • Sanitise untrusted content
    CONVERT
    • Produce plain text, PDF, or audio
  5. WHY PARSE HTML?
    txtify.it/www.bbc.com/news/articles/cy0kvrnyy4wo
  7. HTML PARSERS
    LIBXML
    • Fast, C-based parser
    • Started in 1999
    • Based on HTML4 standard, with some HTML5 support added later
    • Available and enabled on most PHP installations
    HTML5-PHP
    • PHP library
    • Started in 2013
    • Based on an older version of the W3C HTML5 standard
    • Available via Composer
    LEXBOR
    • Fast, C-based parser
    • Started in 2018
    • Based on newer WHATWG standard
    • Available and enabled on most installations since PHP 8.4
  8. VALID HTML
    • no <head>
    • no <body>
    • no <html>
    • no closing </p>
  9. VALID HTML
    Un-escaped angle brackets in attributes
  10. COUNTING PARAGRAPHS: OLD AND NEW PARSER
    SOURCE HTML
    OLD PARSER
  11. COUNTING PARAGRAPHS: OLD AND NEW PARSER
    SOURCE HTML
    OLD PARSER: 3 paragraphs found. 🤔
  12. SOURCE HTML
    OLD PARSER: 3 paragraphs found.
    SERIALISED BACK TO HTML: ?
  13. SOURCE HTML
    OLD PARSER: 3 paragraphs found.
    SERIALISED BACK TO HTML
  14. COUNTING PARAGRAPHS: OLD AND NEW PARSER
    SOURCE HTML
    NEW PARSER (PHP 8.4+)
  15. COUNTING PARAGRAPHS: OLD AND NEW PARSER
    SOURCE HTML
    NEW PARSER (PHP 8.4+): 2 paragraphs found. 🙂
  16. SOURCE HTML
    NEW PARSER (PHP 8.4+): 2 paragraphs found.
    SERIALISED BACK TO HTML
  17. COUNTING PARAGRAPHS: OLD AND NEW PARSER
    SOURCE HTML
    OLD PARSER: 3 paragraphs found.
    NEW PARSER (PHP 8.4+): 2 paragraphs found.
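The source HTML and code from these slides did not survive in the transcript, so the following is only a minimal sketch of how such a comparison could be run. The sample markup is hypothetical (it is not the HTML shown in the talk), and the old parser's count depends on the input and on the installed libxml2 version, so no particular result is claimed for it.

```php
<?php
// Hypothetical sample (not the HTML from the slides): valid HTML5 that
// omits <html>, <head>, <body> and the closing </p> tags.
$html = '<!DOCTYPE html><title>Test</title><p>First paragraph<p>Second paragraph';

// Old parser (libxml2): flags silence its warnings about unfamiliar markup.
$old = new DOMDocument();
$old->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$oldCount = $old->getElementsByTagName('p')->length;

// New parser (Lexbor, PHP 8.4+).
$new = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
$newCount = $new->querySelectorAll('p')->length;

echo "Old parser: $oldCount paragraphs found.\n";
echo "New parser: $newCount paragraphs found.\n";

// Serialise back to HTML to see the tree each parser actually built.
echo $old->saveHTML(), "\n";
echo $new->saveHtml(), "\n";
```

Comparing the two serialised outputs is usually the quickest way to see where the parsers disagree about the document structure.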
  19. NEW DOM API
    • New DOM classes and namespace
      – Fewer lines needed to turn HTML into DOM
    • Top-level HTML elements as DOM properties
      – head, body, title
    • innerHTML property
      – Set or get child nodes using HTML strings
    • CSS selector support
      – querySelector() and querySelectorAll()
      – No need for CSS to XPath (e.g. Symfony’s CSS Selector)
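As a small illustration of the innerHTML bullet above, here is a sketch that reads and then replaces an element's children using HTML strings. The markup and the `box` id are made up for the example.

```php
<?php
// Hypothetical markup; the id "box" is only for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><div id="box"><p>Old</p></div>',
    LIBXML_NOERROR
);

$box = $dom->querySelector('#box');

// Get: the children serialised as an HTML string.
echo $box->innerHTML, "\n";

// Set: the string is parsed as HTML and replaces the existing children.
$box->innerHTML = '<p>New <b>content</b></p>';
echo $box->innerHTML, "\n";
```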
  20. NEW DOM API: NEW DOM CLASSES
    Old API → New API
    DOMDocument → Dom\HTMLDocument
    DOMElement → Dom\Element
    DOMNode → Dom\Node
    DOMText → Dom\Text
    DOMAttr → Dom\Attr
  21. NEW DOM API: NEW DOM CLASSES
    Old API → New API
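The side-by-side code on this slide is missing from the transcript. A rough sketch of what loading the same document looks like in each API, and which classes the resulting nodes belong to, might be the following (sample markup is hypothetical):

```php
<?php
// Hypothetical sample document used only for illustration.
$html = '<!DOCTYPE html><html><head><title>Hi</title></head>'
      . '<body><p class="intro">Hello</p></body></html>';

// Old API: construct an empty document, then load into it.
$old = new DOMDocument();
$old->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$oldP = $old->getElementsByTagName('p')->item(0);

// New API (PHP 8.4+): a single static factory call returns the document.
$new = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
$newP = $new->querySelector('p');

// Nodes from each document belong to the matching class family.
var_dump($oldP instanceof DOMElement);
var_dump($newP instanceof Dom\Element);
var_dump($newP->firstChild instanceof Dom\Text);
var_dump($newP->getAttributeNode('class') instanceof Dom\Attr);
```

The two class families are separate, so nodes created by one API are not instances of the other API's classes.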
  22. NEW DOM API: TOP-LEVEL HTML ELEMENTS AS DOM PROPERTIES
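The code for this slide is not in the transcript. A minimal sketch of the head, body and title properties, assuming a made-up document that leaves the `<head>` and `<body>` tags implicit:

```php
<?php
// Hypothetical markup: <html>, <head> and <body> are implied by the parser.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><title>My page</title><p>Hello',
    LIBXML_NOERROR
);

// No getElementsByTagName('body')->item(0) needed: they are plain properties.
echo $dom->title, "\n";
echo $dom->head->localName, "\n";
echo $dom->body->localName, "\n";
echo $dom->body->textContent, "\n";
```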
  24. NEW DOM API: CSS SELECTORS
    • querySelector($selectors)
      – “Returns the first descendant element that matches the CSS selectors”
    • querySelectorAll($selectors)
      – “Returns a NodeList containing all descendant elements that match the CSS selectors”
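A short sketch of the two methods in use, on hypothetical markup. Note that querySelector() yields null when nothing matches, so real code should check for that before dereferencing.

```php
<?php
// Hypothetical markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><p class="intro">One</p><p>Two</p>',
    LIBXML_NOERROR
);

$first   = $dom->querySelector('p');      // first match, or null
$all     = $dom->querySelectorAll('p');   // Dom\NodeList of every match
$missing = $dom->querySelector('table');  // null: no such element

echo $first->textContent, "\n";
foreach ($all as $p) {
    echo '- ', $p->textContent, "\n";
}
var_dump($missing);
```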
  25. NEW DOM API: CSS SELECTORS
    FIND MULTIPLE ELEMENTS
    Get all paragraph and heading elements, returned in document order
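The selector shown on the slide is not in the transcript; one plausible way to express it is a comma-separated selector list, sketched here on made-up markup. The results come back in document order, not grouped by selector.

```php
<?php
// Hypothetical markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p>',
    LIBXML_NOERROR
);

// A selector list: any element matching one of the alternatives is returned.
$nodes = $dom->querySelectorAll('p, h1, h2, h3, h4, h5, h6');

$names = [];
foreach ($nodes as $node) {
    $names[] = $node->localName;
}
echo implode(', ', $names), "\n";
```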
  26. NEW DOM API: CSS SELECTORS
    AVOID REPETITION WITH :IS OR :WHERE
    Get paragraphs and main headings that are direct children of the article
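The slide's exact selector is not in the transcript. As a sketch, assuming "main headings" means h1 and h2, the repeated form and the :is() form could look like this (markup is hypothetical):

```php
<?php
// Hypothetical markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><article><h1>T</h1><p>A</p>'
    . '<div><p>Nested</p></div><h2>S</h2></article><p>Outside</p>',
    LIBXML_NOERROR
);

// Repetitive form: the "article >" prefix is written once per alternative.
$long = $dom->querySelectorAll('article > p, article > h1, article > h2');

// Same query with :is(): the prefix is written once.
$short = $dom->querySelectorAll('article > :is(p, h1, h2)');

$texts = [];
foreach ($short as $el) {
    $texts[] = $el->textContent;
}
echo implode(', ', $texts), "\n";
```

In a stylesheet, :where() differs from :is() only in specificity, which plays no part when the selector is used just to find elements.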
  27. NEW DOM API: CSS SELECTORS
    Get all paragraphs in article that have at least one link inside them
    Get h1 headings that are followed immediately by a h2 heading
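Both descriptions can be expressed with the :has() pseudo-class. The selectors below are a guess at what the slide showed, since the code is missing from the transcript, and the markup is hypothetical.

```php
<?php
// Hypothetical markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><article><h1>A</h1><h2>B</h2><p>No link</p>'
    . '<p>With <a href="#">link</a></p><h1>C</h1><p>x</p></article>',
    LIBXML_NOERROR
);

// Paragraphs inside the article that contain at least one link.
$withLinks = $dom->querySelectorAll('article p:has(a)');

// h1 headings whose next sibling element is an h2.
$h1BeforeH2 = $dom->querySelectorAll('h1:has(+ h2)');

foreach ($withLinks as $p) {
    echo 'Paragraph: ', $p->textContent, "\n";
}
foreach ($h1BeforeH2 as $h1) {
    echo 'Heading: ', $h1->textContent, "\n";
}
```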
  28. NEW DOM API: CSS SELECTORS
    Get external links: URLs starting with “http” and not containing “example.com”, case insensitive
    • href attributes starting with “http”
    • case insensitive match
    • Exclude elements that contain “example.com” anywhere in the href attribute
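The three annotations above map onto a prefix attribute selector, the `i` flag, and :not(). The selector below is reconstructed from that description and is not copied from the slide; the links are made up.

```php
<?php
// Hypothetical links for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body>'
    . '<a href="https://example.com/a">internal</a>'
    . '<a href="HTTP://other.org/">external, upper-case scheme</a>'
    . '<a href="/relative">relative</a>'
    . '<a href="https://WWW.EXAMPLE.COM/x">internal, upper-case host</a>'
    . '<a href="https://php.net/">external</a>',
    LIBXML_NOERROR
);

// [href^="http" i]            href starts with "http", any letter case
// :not([href*="example.com" i]) drop links with "example.com" anywhere in href
$external = $dom->querySelectorAll('a[href^="http" i]:not([href*="example.com" i])');

$hrefs = [];
foreach ($external as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```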
  30. WHEN CSS SELECTORS ARE NOT ENOUGH: XPATH SELECTORS
    • CSS is more concise, but less expressive than XPath
    • XPath can…
      – Select elements based on text content
      – Select attributes
  31. WHEN CSS SELECTORS ARE NOT ENOUGH: XPATH SELECTOR EXAMPLES
    CSS selector → XPath 1.0
    div.content → //div[contains(concat(" ",normalize-space(@class)," ")," content ")]
    article#main → //article[@id="main"]
    [src*="avatar"] → //*[contains(@src, "avatar")]
    article p → //article//p
    article > p → //article/p
    ☹️ → //p[contains(., "Written by:")]
    ☹️ → //a/@href
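The last two rows have no CSS counterpart. A sketch of running them with Dom\XPath follows. One assumption to flag: a document parsed by Dom\HTMLDocument with default options places its elements in the XHTML namespace, so this sketch registers a prefix (`h`, chosen arbitrarily) and uses it in the expressions instead of the bare names shown in the table. The markup is hypothetical.

```php
<?php
// Hypothetical markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><body><p>Written by: Jane</p>'
    . '<p>Other <a href="/one">1</a> <a href="/two">2</a></p>',
    LIBXML_NOERROR
);

$xpath = new Dom\XPath($dom);
// Assumption: elements live in the XHTML namespace, so bind a prefix to it.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

// Select elements by their text content.
$bylines = $xpath->query('//h:p[contains(., "Written by:")]');

// Select attribute nodes directly instead of the elements that carry them.
$hrefs = [];
foreach ($xpath->query('//h:a/@href') as $attr) {
    $hrefs[] = $attr->value;
}

echo $bylines->item(0)->textContent, "\n";
print_r($hrefs);
```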
  33. AFTONBLADET ARTICLE
    https://www.aftonbladet.se/nyheter/a/25LgKq/greta-thunberg-they-kicked-me-every-time-the-flag-touched-my-face
  34. AFTONBLADET ARTICLE
    Original HTML: 931.93 KB
    Content HTML: 84.92 KB
  35. DEMO
    PARSING AND SANITISING HTML
    Using PHP’s new HTML parser
    USING CSS SELECTORS
    Feed Control’s Feed Creator
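The demo code itself is not part of the transcript. As a very rough sketch of the idea (extract the content element, drop unwanted nodes, compare sizes), one could write something like the following. The markup and the list of removed elements are assumptions, and removing a few tags is not a complete sanitiser: untrusted HTML needs an allowlist-based tool.

```php
<?php
// Hypothetical page: an article surrounded by navigation, styles and scripts.
$html = '<!DOCTYPE html><html><head><style>p{color:red}</style></head><body>'
      . '<nav>Menu</nav><article><h1>Title</h1><p>Text</p>'
      . '<script>alert(1)</script></article><footer>Footer</footer></body></html>';

$dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// Keep only the content element.
$article = $dom->querySelector('article');

// Copy matches into an array first, then remove them from the tree.
$unwanted = iterator_to_array($article->querySelectorAll('script, style, iframe'));
foreach ($unwanted as $el) {
    $el->remove();
}

$content = $article->innerHTML;
printf("Original HTML: %d bytes\n", strlen($html));
printf("Content HTML: %d bytes\n", strlen($content));
echo $content, "\n";
```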
  37. USEFUL TOOLS
    • Chrome Headless
      – Get HTML after JavaScript execution
      – chrome --headless --dump-dom https://www.example.com
    • Symfony’s HTML Sanitizer
      – Clean untrusted HTML for output
    FURTHER READING
    • Niels Dossche
      – Responsible for PHP 8.4’s DOM changes
    • DOM Living Standard
      – https://dom.spec.whatwg.org
  38. THANK YOU! QUESTIONS?
    Keyvan Minoukadeh
    • [email protected]
    Websites
    • mochi.is
    • keyvan.net