brightonSEO Sep 23 - Exploiting XPath in ScreamingFrog and Google Sheets

Thiago
September 17, 2023

Transcript

  1. Exploiting XPath in
    ScreamingFrog and
    Google Sheets
    Thiago Pojda
    SIXT
    Speakerdeck.com/pojda
    @tedois

  2. The Nerd
    ● Husband, dad
    ● Software Engineer
    ● SEO since 2008
    ● Since 2015, wesearch.media
    ● Since 2019, living in DE
    ● Since 2022, Director SEO @ SIXT

  4. 2,000+ locations
    100+ countries
    270,000+ cars
    7,500+ employees
    20 SEOs

  5. SEO @ SIXT

    SEO Specialists

    Local, Content, Tech, Authority, Data

    Product specialists

  9. Understanding how sites are built
    and how to extract information
    from pages has made my
    analysis much more relevant to
    both me and my clients

  10. DOM
    DOM
    DOM

  11. Source: https://en.wikipedia.org/wiki/Document_Object_Model

  12. XPath
    //div[@class="content"]/text()

  13. Expanded syntax is handy but looks
    terrible
    /descendant-or-self::div[@class="content"]/child::text()

  14. Abbreviated syntax is enough for
    99% of the cases
    /descendant-or-self::div[@class="content"]/child::text()
    //div[@class="content"]/text()

  16. Specifier   Meaning
      /           Selects a direct child
      //          Selects any descendant or self
      @           Attribute
      ..          Parent
      .           Self

  17. Specifier   Meaning
      [ and ]     Officially called predicates; you can think of them as filters
      *           Any element
      function()  A function call

  18. Useful functions

    text()

    contains(where, what)

    normalize-space(text)

    starts-with(where, what) & ends-with(where, what)
    (ends-with requires XPath 2.0 or later)

    sum()

    and, or

  19. //author[contains(.,"Matt")]
    Matches all author nodes whose
    text contains "Matt"
    (case-sensitive)

  20. //author[starts-with(.,"G")]
    Matches all author nodes whose
    text starts with "G"
    (case-sensitive)

  21. //author[matches(.,"Matt.*")]
    Regular expression match (matches() requires XPath 2.0 or later)
    Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
    License: CC BY 4.0

  22. //h3[1]
    The first H3 element within each parent
    (use (//h3)[1] for the first H3 on the whole page)

  23. //h3[last()]
    The last H3 element within each parent

  24. //h3[last()-1]
    The second-to-last H3 element within each parent
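
The positional predicates on slides 22 to 24 can be tried outside a crawler. This is a minimal Python sketch, assuming a made-up, well-formed snippet (the headings are invented) and the standard library's ElementTree, which supports only a subset of XPath and expects relative paths, hence the leading dot:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real pages are rarely valid XML.
PAGE = """
<html><body>
  <h3>Pickup</h3>
  <h3>Insurance</h3>
  <h3>Return</h3>
  <h3>FAQ</h3>
</body></html>
"""

root = ET.fromstring(PAGE)

# Positions are evaluated among the h3 children of each parent element.
first = [h.text for h in root.findall(".//h3[1]")]
last = [h.text for h in root.findall(".//h3[last()]")]
before_last = [h.text for h in root.findall(".//h3[last()-1]")]

print(first, last, before_last)
```

All four headings share one parent here, so each expression returns a single element. On a page where H3s sit in several containers, each container would contribute its own match.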

  25. //img[not(@alt)]
    Only images without alt attribute

  26. //img[@alt]
    Only images with alt attribute
    (will match empty alts)

  27. //img[string-length(@alt) >= 1]
    Only images with an alt attribute
    at least 1 character long (non-empty alt)
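
The difference between the three image filters on slides 25 to 27 is easiest to see on a tiny example. This is a rough Python sketch over an invented snippet. It assumes the standard library's ElementTree, whose XPath subset covers `[@alt]` but not `not()` or `string-length()`, so those two filters are written in plain Python instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for illustration.
PAGE = """
<html><body>
  <img src="a.jpg" alt="Red convertible"/>
  <img src="b.jpg" alt=""/>
  <img src="c.jpg"/>
</body></html>
"""

root = ET.fromstring(PAGE)
images = root.findall(".//img")

# //img[@alt]: the attribute exists, even if it is empty.
with_alt = [i.get("src") for i in root.findall(".//img[@alt]")]

# //img[not(@alt)]: emulated in Python, the attribute is missing entirely.
without_alt = [i.get("src") for i in images if "alt" not in i.attrib]

# //img[string-length(@alt) >= 1]: emulated in Python, alt is non-empty.
non_empty_alt = [i.get("src") for i in images if len(i.get("alt", "")) >= 1]

print(with_alt, without_alt, non_empty_alt)
```

The image with `alt=""` shows up in the first list but not in the third, which is why the string-length variant is the safer check when auditing alt text.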

  28. Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
    License: CC BY 4.0

  30. https://brightonseo.com/people/thiago-pojda

  32. Formula: =IMPORTXML(D1, "//title")
    Result: Thiago Pojda

    Formula: =IMPORTXML(D1, "//h1")
    Result: Thiago Pojda

    Formula: =JOIN(, IMPORTXML(D1, "//h1/following-sibling::*[1]"))
    Result: SIXT | Director of SEO & In-house Dad Joke Specialist

    Formula: =IMPORTXML(D1, "//h1/following-sibling::*[2]")
    Result: Thiago is a Brazilian SEO nerd who loves learning about (and nudging) consumer behaviours. Worked several years with SEO for big and small brands both at agencies and as in-house, he now leads the SEO Team at SIXT in Germany.

    Formula: =IMPORTXML(D1, "//h1/..//a/@href")
    Result: https://www.sixt.com/
    https://twitter.com/tedois
    https://linkedin.com/in/pojda
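
The same lookups can be sketched in Python to see what each expression selects. IMPORTXML takes the URL first and the XPath query second, and D1 is assumed to hold the page URL. The markup below is invented (it is not the real speaker page), and the standard library's ElementTree is used even though it supports only part of XPath: no `following-sibling` axis and no `@href` step, so the attribute is read in Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical profile page; names and URLs are placeholders.
PROFILE = """
<html><head><title>Speaker Name</title></head><body>
  <nav><a href="/">Home</a></nav>
  <section>
    <h1>Speaker Name</h1>
    <p>Company | Job title</p>
    <ul>
      <li><a href="https://example.com/">Website</a></li>
      <li><a href="https://example.com/social">Social</a></li>
    </ul>
  </section>
</body></html>
"""

root = ET.fromstring(PROFILE)

title = root.findtext(".//title")  # //title
h1 = root.findtext(".//h1")        # //h1

# //h1/..//a/@href: step up to the h1's parent, then collect every link under it.
links = [a.get("href") for a in root.findall(".//h1/..//a")]

print(title, h1, links)
```

The navigation link is left out because `..` scopes the search to the element that contains the `h1`, which is the point of the `//h1/..//a` pattern.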

  34. Find the XPath via Screaming Frog (SF)

  40. 🤩

  44. Use cases

  45. #1
    Pages with SEO Text

  47. //article[@data-testid='streamSeoSection']

  48. Is my competitor using “SEO texts”?
    On which page types?
    Any category where they’re doing it more?
    Why?
    WHY?
    WHY?

  53. Export your crawl (internal all)

  54. Categorise what you see

  55. Analyze
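
The export, categorise, analyze steps on slides 53 to 55 could look roughly like this in Python. It is only a sketch: the column names, URLs and texts are invented, and grouping pages by their first path segment is an assumption about how the site is structured, not something taken from the talk.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawl export; column names and rows are made up.
EXPORT = """\
Address,SEO Text 1
https://example.com/category/suv,Rent an SUV for your next trip
https://example.com/category/van,
https://example.com/city/berlin,Car rental in Berlin
https://example.com/city/munich,Car rental in Munich
https://example.com/blog/road-trips,
"""


def page_type(url: str) -> str:
    """Categorise a URL by its first path segment (a simplifying assumption)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else "home"


totals = defaultdict(int)
with_text = defaultdict(int)
for row in csv.DictReader(io.StringIO(EXPORT)):
    kind = page_type(row["Address"])
    totals[kind] += 1
    if row["SEO Text 1"].strip():  # the extraction returned something
        with_text[kind] += 1

# Share of pages per type where the XPath extraction found an SEO text.
share = {kind: with_text[kind] / totals[kind] for kind in totals}
print(share)
```

A spreadsheet pivot table does the same job; the sketch only shows that the analysis is a count of non-empty extractions per page type.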

  56. #2
    Products per category

  58. //div[@id="listing"]//div[contains(text(),"Ergebnisse")]/text()
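
The XPath returns the whole text node, so the number still has to be pulled out of it before counting products per category. This is a small Python sketch, assuming the text looks like "1.234 Ergebnisse" with a dot as thousands separator. The sample strings and the helper name `result_count` are made up:

```python
import re

# Digits, optionally grouped with dots, followed by the word "Ergebnisse".
COUNT_RE = re.compile(r"(\d[\d.]*)\s*Ergebnisse")


def result_count(text: str):
    """Return the number of results found in the text, or None if absent."""
    match = COUNT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))


print(result_count("1.234 Ergebnisse"))
print(result_count("  87 Ergebnisse "))
print(result_count("Keine Treffer"))
```

Returning `None` instead of 0 keeps categories where the extraction failed apart from categories that really are empty.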

  59. #3
    Mapping indexable filters

  62. //a[contains(@class,"pill")]/text()
    //a[contains(@class,"pill")]/@href
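
The two expressions return the labels and the targets as separate lists, and pairing them up gives the filter map. This is a rough Python sketch over invented markup. The standard library's ElementTree has no `contains()`, so the substring test on the class attribute is written in Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical listing fragment; class names and URLs are placeholders.
LISTING = """
<div>
  <a class="pill pill--active" href="/suv/automatic">Automatic</a>
  <a class="pill" href="/suv/electric">Electric</a>
  <a class="button" href="/login">Log in</a>
</div>
"""

root = ET.fromstring(LISTING)

# contains(@class, "pill") is a plain substring test on the attribute value.
pills = [
    (a.text, a.get("href"))
    for a in root.iter("a")
    if "pill" in a.get("class", "")
]

print(pills)
```

Because the test is a substring match, a class such as "pillar" would also match; a stricter check would split the class attribute on whitespace first.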

  63. Where to use

  65. Wrap up

  66. Think about elements you can use to
    break down your competitor’s strategy

    Find a great XPath for it, crawl,
    analyze

    Hate XPath until you love it

  67. Read more

    https://librarycarpentry.org/lc-webscraping/02-xpath/index.html

    https://www.searchenginejournal.com/xpaths-large-site-audits/329851/

    https://twitter.com/tedois
