Slide 1

Slide 1 text

Exploiting XPath in ScreamingFrog and Google Sheets Thiago Pojda SIXT Speakerdeck.com/pojda @tedois

Slide 2

Slide 2 text

The Nerd ● Husband, dad ● Software Engineer ● SEO since 2008 ● Since 2015, wesearch.media ● Since 2019, living in DE ● Since 2022, Director SEO @ SIXT

Slide 4

Slide 4 text

2,000+ locations 100+ countries 270,000+ cars 7,500+ employees 20 SEOs

Slide 5

Slide 5 text

SEO @ SIXT ● SEO Specialists ● Local, Content, Tech, Authority, Data ● Product specialists

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Understanding how sites are built and how to extract information from pages has made my analysis much more relevant to both me and my clients

Slide 10

Slide 10 text

DOM DOM DOM

Slide 11

Slide 11 text

Source: https://en.wikipedia.org/wiki/Document_Object_Model

Slide 12

Slide 12 text

XPath
//div[@class="content"]/text()

Slide 13

Slide 13 text

Expanded syntax is handy but looks terrible
/descendant-or-self::div[@class="content"]/child::text()

Slide 14

Slide 14 text

Abbreviated syntax is enough for 99% of the cases
Expanded: /descendant-or-self::div[@class="content"]/child::text()
Abbreviated: //div[@class="content"]/text()
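
To make the abbreviated expression concrete, here is a minimal sketch using Python's standard-library ElementTree. The sample markup is hypothetical and deliberately well-formed; real pages usually need a more lenient HTML parser. ElementTree only supports a subset of XPath: it has no text() step, so the sketch reads the .text property instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
PAGE = """
<html>
  <body>
    <div class="header">Menu</div>
    <div class="content">Hello XPath</div>
    <div class="content">Second block</div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//div" is how ElementTree spells "//div" relative to the root element.
content_divs = root.findall(".//div[@class='content']")

# No text() step in ElementTree, so read each element's leading text instead.
texts = [div.text for div in content_divs]
print(texts)  # ['Hello XPath', 'Second block']
```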

Slide 16

Slide 16 text

Specifier   Meaning
/           Selects a direct child
//          Selects any descendant (or self)
@           Attribute
..          Parent
.           Self

Slide 17

Slide 17 text

Specifier    Meaning
[ and ]      Predicates (the official name); you can think of them as filters
*            Any element
function()   A function call
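
The specifiers above can be tried out with a short sketch. It uses Python's standard-library ElementTree, which implements only a subset of XPath, and a hypothetical sample document; paths start with "." because ElementTree evaluates them relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = ET.fromstring("""<library>
  <book id="b1"><title>Dune</title><author>Frank Herbert</author></book>
  <book id="b2"><title>Emma</title><author>Jane Austen</author></book>
  <magazine id="m1"><title>Wired</title></magazine>
</library>""")

# "/" with "*": every direct child of the current element.
direct_children = [el.tag for el in DOC.findall("./*")]

# "//": any descendant, at any depth.
all_titles = [el.text for el in DOC.findall(".//title")]

# "[ ]" predicate: keep only children that have an <author> child.
with_author = [el.get("id") for el in DOC.findall("./*[author]")]

# "..": step back up to the parent of each <author>.
parents_of_author = [el.get("id") for el in DOC.findall(".//author/..")]

# "@": filter on an attribute value.
by_attribute = DOC.find(".//*[@id='m1']").tag

print(direct_children, all_titles, with_author, parents_of_author, by_attribute)
```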

Slide 18

Slide 18 text

Useful functions ● text() ● contains(where, what) ● normalize-space(text) ● starts-with(where, what) & ends-with(where, what) (ends-with needs XPath 2.0+) ● sum() ● and, or (operators)

Slide 19

Slide 19 text

//author[contains(.,"Matt")] Matches all author nodes whose text contains "Matt" (case-sensitive)

Slide 20

Slide 20 text

//author[starts-with(.,"G")] Matches all author nodes whose text starts with "G" (case-sensitive)

Slide 21

Slide 21 text

//author[matches(.,"Matt.*")] Matches all author nodes whose text matches the regular expression (XPath 2.0+)
Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html License: CC BY 4.0
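
The three string predicates can be mirrored in plain Python to see what each one keeps. This is only a sketch of the equivalent filters, not an XPath engine: the standard-library ElementTree does not implement contains(), starts-with() or matches(), so the filtering happens in Python, and the author list is made up. XPath's matches() succeeds if the pattern matches anywhere in the string, which corresponds to re.search rather than re.match.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = ET.fromstring("""<books>
  <author>Matt Haig</author>
  <author>Gabriel Garcia Marquez</author>
  <author>matt lowercase</author>
  <author>Greg Egan</author>
</books>""")

# "." in the XPath predicates is the node's full string value.
authors = ["".join(a.itertext()) for a in DOC.iter("author")]

# //author[contains(.,"Matt")]  (case-sensitive substring test)
contains_matt = [t for t in authors if "Matt" in t]

# //author[starts-with(.,"G")]
starts_with_g = [t for t in authors if t.startswith("G")]

# //author[matches(.,"Matt.*")]  (unanchored regular-expression search)
matches_matt = [t for t in authors if re.search(r"Matt.*", t)]

print(contains_matt, starts_with_g, matches_matt)
```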

Slide 22

Slide 22 text

//h3[1] The first H3 element within each parent; use (//h3)[1] for the first H3 on the whole page

Slide 23

Slide 23 text

//h3[last()] The last H3 element within each parent

Slide 24

Slide 24 text

//h3[last()-1] The second-to-last H3 element within each parent
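
Positional predicates are evaluated per parent element, which is easy to miss. The sketch below shows this with Python's standard-library ElementTree on a hypothetical document with two sections; taking the first H3 of the whole document is done here by indexing the full result list, which plays the role of (//h3)[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical document with headings spread over two parents.
DOC = ET.fromstring("""<body>
  <section><h3>A1</h3><h3>A2</h3><h3>A3</h3></section>
  <section><h3>B1</h3><h3>B2</h3></section>
</body>""")

# Each predicate is applied among the <h3> siblings of one parent.
first_per_parent = [h.text for h in DOC.findall(".//h3[1]")]
last_per_parent = [h.text for h in DOC.findall(".//h3[last()]")]
before_last_per_parent = [h.text for h in DOC.findall(".//h3[last()-1]")]

# Equivalent of (//h3)[1]: collect every <h3>, then take the first one.
first_in_document = DOC.findall(".//h3")[0].text

print(first_per_parent)        # ['A1', 'B1']
print(last_per_parent)         # ['A3', 'B2']
print(before_last_per_parent)  # ['A2', 'B1']
print(first_in_document)       # A1
```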

Slide 25

Slide 25 text

//img[not(@alt)] Only images without alt attribute

Slide 26

Slide 26 text

//img[@alt] Only images with alt attribute (will match empty alts)

Slide 27

Slide 27 text

//img[string-length(@alt) >= 1] Only images with a non-empty alt attribute (at least 1 character)
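
The three alt-attribute checks differ only in how they treat a missing versus an empty alt. A small sketch on hypothetical markup makes the difference visible. ElementTree understands [@alt] directly, but not not() or string-length(), so those two checks are written as Python filters here.

```python
import xml.etree.ElementTree as ET

# Hypothetical images: descriptive alt, empty alt, and no alt at all.
DOC = ET.fromstring("""<body>
  <img src="a.png" alt="A red car"/>
  <img src="b.png" alt=""/>
  <img src="c.png"/>
</body>""")

# //img[@alt]: the attribute exists, even if it is empty.
with_alt = [i.get("src") for i in DOC.findall(".//img[@alt]")]

# //img[not(@alt)]: the attribute is missing entirely.
without_alt = [i.get("src") for i in DOC.iter("img") if "alt" not in i.attrib]

# //img[string-length(@alt) >= 1]: the attribute has at least one character.
non_empty_alt = [i.get("src") for i in DOC.iter("img") if len(i.get("alt", "")) >= 1]

print(with_alt)       # ['a.png', 'b.png']
print(without_alt)    # ['c.png']
print(non_empty_alt)  # ['a.png']
```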

Slide 28

Slide 28 text

Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html License: CC BY 4.0

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

https://brightonseo.com/people/thiago-pojda

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Formula: =importxml(D1, "//title")
Result: Thiago Pojda
Formula: =importxml(D1, "//h1")
Result: Thiago Pojda
Formula: =join(, importxml(D1, "//h1/following-sibling::*[1]"))
Result: SIXT | Director of SEO & In-house Dad Joke Specialist
Formula: =importxml(D1, "//h1/following-sibling::*[2]")
Result: Thiago is a Brazilian SEO nerd who loves learning about (and nudging) consumer behaviours. Having worked for several years on SEO for big and small brands, both at agencies and in-house, he now leads the SEO Team at SIXT in Germany.
Formula: =importxml(D1, "//h1/..//a/@href")
Result: https://www.sixt.com/ https://twitter.com/tedois https://linkedin.com/in/pojda
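
The same extractions can be sketched offline in Python to see what each expression selects. The markup below is a hypothetical stand-in for the speaker page, not its real HTML, and following_siblings is a helper written for this example because the standard-library ElementTree has no following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the speaker page structure.
PAGE = ET.fromstring("""<html><head><title>Thiago Pojda</title></head>
<body><div id="speaker">
  <h1>Thiago Pojda</h1>
  <p>SIXT | Director of SEO</p>
  <p>Short biography text.</p>
  <a href="https://www.sixt.com/">SIXT</a>
  <a href="https://twitter.com/tedois">Twitter</a>
</div></body></html>""")


def following_siblings(root, element):
    """Emulate following-sibling::* by scanning the parent's children."""
    for parent in root.iter():
        children = list(parent)
        if element in children:
            return children[children.index(element) + 1:]
    return []


title = PAGE.findtext(".//title")              # //title
h1 = PAGE.find(".//h1")                        # //h1
siblings = following_siblings(PAGE, h1)
first_sibling = siblings[0].text               # //h1/following-sibling::*[1]
second_sibling = siblings[1].text              # //h1/following-sibling::*[2]

# //h1/..//a/@href: go up to the parent of <h1>, then collect every link below it.
container = PAGE.find(".//h1/..")
hrefs = [a.get("href") for a in container.iter("a")]

print(title, first_sibling, second_sibling, hrefs)
```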

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Find XPath via SF

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

🤩

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

use cases

Slide 45

Slide 45 text

#1 Pages with SEO Text

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

//article[@data-testid='streamSeoSection']
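
As a rough sketch of what this extraction answers per page, the check below reports whether a page contains the SEO text section. Only the data-testid value comes from the slide; the two sample pages and the has_seo_section helper are hypothetical, and the markup is assumed to be well-formed so that the standard-library ElementTree can parse it.

```python
import xml.etree.ElementTree as ET

SEO_SECTION = ".//article[@data-testid='streamSeoSection']"


def has_seo_section(markup: str) -> bool:
    """Return True if the page contains the SEO text <article>."""
    root = ET.fromstring(markup)
    return root.find(SEO_SECTION) is not None


# Hypothetical pages: one with the section, one without.
PAGE_WITH_TEXT = """<html><body>
  <main><article data-testid="streamSeoSection"><p>Long text</p></article></main>
</body></html>"""

PAGE_WITHOUT_TEXT = """<html><body>
  <main><article data-testid="productGrid"><p>Items</p></article></main>
</body></html>"""

print(has_seo_section(PAGE_WITH_TEXT))     # True
print(has_seo_section(PAGE_WITHOUT_TEXT))  # False
```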

Slide 48

Slide 48 text

Is my competitor using “SEO texts”? On which page types? Is there any category where they’re doing it more? Why? WHY? WHY?

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Export your crawl (internal all)

Slide 54

Slide 54 text

Categorise what you see

Slide 55

Slide 55 text

Analyze
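
The export, categorise and analyze steps can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the column names "Address" and "SEO Text 1", the example URLs, and the rule that the first path segment identifies the page type. A real export would be read from a file instead of an inline string.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawl export; column names and URLs are assumptions.
EXPORT = """Address,SEO Text 1
https://example.com/category/shoes,Buy shoes online
https://example.com/category/hats,
https://example.com/product/red-shoe,
https://example.com/product/blue-hat,
https://example.com/category/bags,All about bags
"""


def page_type(url: str) -> str:
    """Categorise a URL by its first path segment (assumed convention)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else "home"


totals = defaultdict(int)
with_text = defaultdict(int)
for row in csv.DictReader(io.StringIO(EXPORT)):
    kind = page_type(row["Address"])
    totals[kind] += 1
    if row["SEO Text 1"].strip():  # the extraction found something
        with_text[kind] += 1

# Share of pages per type that carry an SEO text.
share = {kind: with_text[kind] / totals[kind] for kind in totals}
for kind, value in share.items():
    print(f"{kind}: {value:.0%} of {totals[kind]} pages")
```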

Slide 56

Slide 56 text

#2 Products per category

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

//div[@id="listing"]//div[contains(text(),"Ergebnisse")]/text() ("Ergebnisse" is German for "results")
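
Once the label text is extracted, it still has to be turned into a number. The sketch below emulates the expression on hypothetical markup and then parses the count. It assumes the German convention of "." as thousands separator; the contains() test is done in Python because the standard-library ElementTree does not implement it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical category page fragment.
PAGE = ET.fromstring("""<html><body>
  <div id="listing">
    <div class="toolbar">Sortieren</div>
    <div class="count">1.234 Ergebnisse</div>
  </div>
</body></html>""")

# //div[@id="listing"]//div[contains(text(),"Ergebnisse")]/text()
listing = PAGE.find(".//div[@id='listing']")
labels = [d.text for d in listing.findall(".//div") if d.text and "Ergebnisse" in d.text]


def parse_count(label: str) -> int:
    """Pull the number out of a label such as '1.234 Ergebnisse'."""
    match = re.search(r"[\d.]+", label)
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return int(match.group().replace(".", ""))  # drop thousands separators


product_count = parse_count(labels[0])
print(labels, product_count)  # ['1.234 Ergebnisse'] 1234
```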

Slide 59

Slide 59 text

#3 Mapping indexable filters

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

//a[contains(@class,"pill")]/text()
//a[contains(@class,"pill")]/@href
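
Pairing the two extractions gives a label-to-URL map of the filter links. The sketch below does this on hypothetical markup, with the class test written in Python because the standard-library ElementTree has no contains(). Like the XPath version, it is a plain substring test, so a class such as "pillow" would match as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter links; only the "pill" class name comes from the slide.
PAGE = ET.fromstring("""<html><body>
  <a class="pill pill--active" href="/shoes/red">Red</a>
  <a class="pill" href="/shoes/size-42">Size 42</a>
  <a class="nav-link" href="/about">About</a>
</body></html>""")

# //a[contains(@class,"pill")] with text() and @href collected together.
pills = [
    (a.text, a.get("href"))
    for a in PAGE.iter("a")
    if "pill" in a.get("class", "")
]

for label, href in pills:
    print(f"{label} -> {href}")
```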

Slide 63

Slide 63 text

Where to use

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Wrap up

Slide 66

Slide 66 text

● Think about elements you can use to break down your competitor’s strategy ● Find a great XPath for it, crawl, analyze ● Hate XPath until you love it

Slide 67

Slide 67 text

Read more ● https://librarycarpentry.org/lc-webscraping/02-xpath/index.html ● https://www.searchenginejournal.com/xpaths-large-site-audits/329851/ ● https://twitter.com/tedois

Slide 68

Slide 68 text

Exploiting XPath in ScreamingFrog and Google Sheets Thiago Pojda SIXT Speakerdeck.com/pojda @tedois