Technical SEO Testing 2022 - SMX Advanced

@SPEAKERNAME/#SMX @peakaceag Technical SEO Testing 2022 Bastian Grimm, Peak Ace
AG | @basgr Separating fact from fiction with the Peak Ace test lab

One of the biggest problems in SEO?

Misinformation!

“I’ve heard…”
People incorrectly citing other parties, often without any context or deeper understanding of the issues at hand.

Google says one thing…
…but then actually does another (or it’s so cryptic that you can’t do much with it).

Say hello to the Peak Ace SEO playground

How does the setup work and what does it do?
Flow chart, run per test case (Case A, Case B, Case C … Case N):
Pick case
headerOption present? (Yes/No) If yes, apply the headerOption ruleset:
  0  X-Robots-Tag: noindex
  1  X-Robots-Tag: noindex, no follow
  2  Link: https://xxx.com/; rel="canonical"
  3  …
Is there a metaOption? (Yes/No) If yes, apply the metaOption ruleset:
  0  <meta name="robots" content="noindex, follow" />
  2  <meta name="robots" value="noindex, follow" />
  3  <meta name="robots" value="noindex, follow" content="noindex, follow" />
  4  <meta name="robots" content="noindex, follow" />
  5  <meta name="robot" content="noindex, follow" />
  6  <meta name="robots" content="noindex" /><meta name="googlebot" content="noindex" />
  7  <meta name="googlebot" content="unavailable_after: 1 Jan 1970 00:00:00 GMT" />
  8  …
regenerateOption present? (Yes/No) If yes, generate unique text
Page indexable? (Yes/No) If yes, the bot indexes the page
Store visit logs in DB
Session ends
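
To make the flow chart a little more tangible, here is a minimal Python sketch of how such a case generator could pick a header and a meta directive per test case. This is not the Peak Ace implementation (that one lives in the GitHub repository mentioned further down); the names `HEADER_OPTIONS`, `META_OPTIONS` and `build_test_page` are hypothetical, and only a few of the options from the slide are included.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rulesets; the keys mirror the option numbers on the slide.
HEADER_OPTIONS: Dict[int, str] = {
    0: "X-Robots-Tag: noindex",
}
META_OPTIONS: Dict[int, str] = {
    0: '<meta name="robots" content="noindex, follow" />',
    2: '<meta name="robots" value="noindex, follow" />',
    5: '<meta name="robot" content="noindex, follow" />',
}


def build_test_page(
    header_option: Optional[int],
    meta_option: Optional[int],
    body_text: str,
) -> Tuple[List[str], str]:
    """Return (response header lines, HTML document) for one test case."""
    headers: List[str] = []
    if header_option is not None:  # "headerOption present?"
        headers.append(HEADER_OPTIONS[header_option])

    head = ""
    if meta_option is not None:  # "Is there a metaOption?"
        head = META_OPTIONS[meta_option]

    html = f"<html><head>{head}</head><body><p>{body_text}</p></body></html>"
    return headers, html


if __name__ == "__main__":
    headers, html = build_test_page(0, 2, "some unique test phrase")
    print(headers)
    print(html)
```

Each combination of options becomes its own URL with its own unique text, which is what makes it possible to check later whether that exact phrase was indexed.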

Yeah… right! Let’s make this a bit more hands-on…
Essentially, it’s a mini “SEO CMS”:

Test-specific mark-up/directives in <head>, e.g. JS, meta or canonical tags

The actual URL that serves the content – especially interesting for redirects, etc.

Unique content, in different languages, to test the actual indexing of a page

A JS-based tracker, using feature detection to log Googlebot requests

A couple of things you can do with this
Set up new HTML documents/tests with the click of a button
Add an unlimited number of server-side headers, such as X-Robots, canonicals, hreflang, redirects, caching, etc.
Add elements to the document <head>, for example meta robots, canonical or <script> tags to run JS
Add unique content to the page, depending on the language you want to test for (sometimes, content generation has a valid use-case)
Add any type of HTML to the <body> / DOM
Integrated bot tracking (JS for evergreen Googlebot + non-JS) by default
Automatically generate output by using standard tags (e.g. <iframe>) as well as JavaScript (to ensure rendering is in play)
And lots more…

Sound good?
Interested in the slide deck and/or the GitHub repository (including all source code)? > https://pa.ag/smxtesting22 (all free!)

Context matters
Old domain vs new domain, mobile-first indexing vs non-mobile-first indexing, etc.

Warning: Draw your own conclusions!
Isolated “SEO testing” is next to impossible; be aware that there may be other (external) signals at play that you can’t control.

#1 Indexing
Robots meta & X-Robots tags

Anything wrong with this?
I looked at this client website the other day and something felt off…
<meta name="robots" value="noindex, follow" />

It needs to be “content” instead of “value”!
Using the “value” attribute is actually invalid according to the W3C HTML specifications:
<meta name="robots" content="noindex, follow" />

Interestingly enough, Google doesn’t seem to care
Google also honours the invalid “value” attribute to manage indexing:
<meta name="robots" value="noindex, follow" />

What if you combined “value” and “content” attributes?
Google prefers the valid attribute over the invalid one; it takes “content” in this instance:
<meta name="robots" value="noindex, follow" content="index, follow" />

What if you change the attribute order?
Order doesn’t matter – Google still takes the “content” attribute:
<meta name="robots" content="index, follow" value="noindex, follow" />
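
The outcome of these four tests can be summed up in a few lines of Python. This is only a sketch that mimics the observed behaviour ("content" wins, "value" is the fallback, attribute order is irrelevant); it is not how Google parses the tag, and `RobotsMetaParser` and `effective_directive` are made-up names.

```python
from html.parser import HTMLParser
from typing import Optional


class RobotsMetaParser(HTMLParser):
    """Collects the effective directive from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directive: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # attribute order is lost here on purpose
        if (attr.get("name") or "").lower() != "robots":
            return
        # Prefer the valid "content" attribute, fall back to the invalid "value".
        self.directive = attr.get("content") or attr.get("value")


def effective_directive(html: str) -> Optional[str]:
    """Return the directive string that the tests suggest would be applied."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directive


if __name__ == "__main__":
    print(effective_directive('<meta name="robots" value="noindex, follow" />'))
    print(effective_directive(
        '<meta name="robots" value="noindex, follow" content="index, follow" />'
    ))
```

Something like this can be handy in a crawler or audit script to flag pages that rely on the invalid "value" attribute, since they will be treated as noindex even though a strict validator would ignore the tag.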

Going down the rabbit hole…
So, what about this one? No… this can’t work, can it?
<meta name="robot" content="noindex, follow" />

Google internally corrects “robot” to “robots”
To control indexing, Google also considers the invalid “robot” name:
<meta name="robot" content="noindex, follow" />

What’s Google supposed to do with this one?
Noindex (because it’s more restrictive) or index (because of the more precise UA)?
<meta name="robots" content="noindex" />
<meta name="googlebot" content="index" />

Google considers the most specific user agent directive
It’s no surprise; this approach hasn’t changed for years:
<meta name="robots" content="noindex" />
<meta name="googlebot" content="index" />

But, what if…
…you added an “X-Robots-Tag: noindex” header into the mix?

Header and meta robots directives combined:

Header noindex vs meta robots index (for Googlebot)
The generic X-Robots-Tag (no specific UA) overrides the more specific robots meta tag for “Googlebot”:
<meta name="robots" content="noindex" />
<meta name="googlebot" content="index" />
+
X-Robots-Tag: noindex
I found this quite surprising, since the Googlebot directive should supersede the generic one; it appears that the header and meta indexing pipelines are somewhat separated?
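
Put into code, the precedence seen in these tests looks roughly like the Python sketch below. It only models the observed outcome (a generic header noindex beats everything, otherwise the Googlebot-specific meta tag beats the generic one); it makes no claim about how Google resolves this internally, and the function names are hypothetical.

```python
from typing import Optional


def _has_noindex(directives: Optional[str]) -> bool:
    """True if a comma-separated directive string contains 'noindex'."""
    if not directives:
        return False
    return "noindex" in [token.strip().lower() for token in directives.split(",")]


def observed_indexable(
    x_robots_tag: Optional[str],
    robots_meta: Optional[str],
    googlebot_meta: Optional[str],
) -> bool:
    """Model of the test results, from Googlebot's point of view."""
    # Observation: a generic X-Robots-Tag noindex header overrides the meta tags.
    if _has_noindex(x_robots_tag):
        return False
    # Observation: without that header, the most specific meta tag wins.
    meta = googlebot_meta if googlebot_meta is not None else robots_meta
    return not _has_noindex(meta)


if __name__ == "__main__":
    # Meta tags only: the googlebot-specific "index" wins.
    print(observed_indexable(None, "noindex", "index"))
    # Same meta tags plus a generic noindex header: the header wins.
    print(observed_indexable("noindex", "noindex", "index"))
```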

Of course, GSC also takes rendering into account!
For example: adding directives using JS works as expected (but only in <head>).

#2 Web Components
Custom elements, Shadow DOM & more

There seems to be a fair amount of confusion
A typical “SEO answer” about web components often looks like this:
“Look, I’m not really sure how Googlebot deals with custom HTML elements – better to play it safe and rely on good old, standardised HTML only…”

No idea what Web Components is?
Web Components is a suite of different technologies allowing you to create reusable custom elements – with their functionality encapsulated away from the rest of your code – and utilise them in your web apps.
Source: http://webcomponents.github.io

In this example, we define <custom-component>, our very own HTML element.
In this example, we’re generating a component using the shadow DOM.

No issues using Web Components whatsoever!
Content which lives in a web component, such as a custom HTML element, will be indexed properly. Essentially, it will be flattened into the main HTML:
Content is created in an element which is part of the Shadow DOM
Content which is part of the <custom-component>

#3 CSS Content-Visibility
Enables the user-agent to skip an element's rendering work

Content-visibility, a new CSS property to boost rendering
content-visibility enables the user-agent to skip an element's rendering work, including layout & painting, until it is needed – and therefore makes the initial load much faster!
Source: https://pa.ag/2Wxn399

Content exists in HTML mark-up but is set to “content-visibility:hidden”
Accordingly, the content will not be rendered or displayed at all.

Whether it’s “auto” or “hidden”, the content will be found
Even though these elements are skipped at rendering due to their content-visibility settings, the URL is returned for both test phrases:
content-visibility:hidden
content-visibility:auto

#4 iFrames
Including content from a second URL into a parent URL

According to BuiltWith (top 1m), iFrames are still a thing:
Source: https://pa.ag/2l8qDaN

Revisited: parent URL + iFrame
Parent page: area in the yellow square
iFramed content (from a 2nd URL): within the red highlighted square
<iframe src="URL"></iframe>

It appears that regular iFrames are dangerous these days
iFrame content will be attributed to its parent URL post-render; the parent page can now be found for content from within the iFrame:
This phrase is originally taken from within the iFrame, not from the parent URL

Post-render, the parent page can now be found for content within the iFrame:
To make it simple: this URL… …can now rank for content from this 2nd URL!

Page level quality?
What about all that 3rd party content people feed in?

Still not convinced?
We ran some follow-up tests, because: links!

Added two additional links (1 internal, 1 external) to the iFrame URL

Naturally, the GSC HTML displays the links:
Again, they’re flattened into the DOM of the parent URL

GSC’s “Top linked pages” report is also helpful
The parent URL appears as the linking page for bastiangrimm.com – however, this URL doesn’t have any links in its HTML mark-up.

So what can you do?

Of course, you can noindex/robots.txt the frame content
If you do, auto-generated meta descriptions will lack any iFrame content, and the GSC rendered HTML doesn’t show the in-lined content (from the iFrame):

Content from/in “hidden” frames won’t be indexed either!
Similar to noindexed frames, the auto-generated meta description in the SERPs lacks the iFrame content
The iFrame tag uses a display:none declaration; the content is not in-lined into the rendered DOM
No in-lining into the rendered DOM due to “hidden” applied via JS

What about this new Robots tag?
In early 2022, Google released “indexifembedded” specifically for iFrame usage:
Source: https://pa.ag/3KGvMfN
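
As documented by Google, "indexifembedded" only has an effect when it is combined with "noindex" on the framed URL: the URL itself stays out of the index, but its content may be indexed as part of the embedding page. A small Python helper, assuming you serve the framed document yourself, could generate both variants; `embedded_only_directives` is a hypothetical name.

```python
from typing import Dict


def embedded_only_directives(user_agent_token: str = "googlebot") -> Dict[str, str]:
    """Build the header and meta variants for 'index only when embedded'.

    The directives go on the framed (embedded) URL, not on the parent page.
    """
    directives = "noindex, indexifembedded"
    return {
        "header": f"X-Robots-Tag: {user_agent_token}: {directives}",
        "meta": f'<meta name="{user_agent_token}" content="{directives}" />',
    }


if __name__ == "__main__":
    for kind, value in embedded_only_directives().items():
        print(f"{kind}: {value}")
```

Use one of the two variants, not necessarily both; given the header vs meta findings earlier in this deck, it is worth testing which one behaves as expected in your setup.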

X-Frame-Options Header
If you want to prevent someone from loading (and ranking for) your content in an iFrame
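
A minimal sketch of what sending that header could look like, written as a plain Python WSGI application so it needs no framework. The app itself and its markup are made up for illustration; "SAMEORIGIN" still allows framing from your own origin, while "DENY" would block framing entirely.

```python
from typing import Callable, Iterable, List, Tuple


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Tiny WSGI app that forbids framing by other origins."""
    body = b"<html><body>Please do not frame me elsewhere</body></html>"
    headers: List[Tuple[str, str]] = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Browsers will refuse to render this response in a cross-origin frame.
        ("X-Frame-Options", "SAMEORIGIN"),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Serve locally for a quick manual check, e.g. with curl -I.
    from wsgiref.simple_server import make_server

    with make_server("127.0.0.1", 8000, app) as server:
        server.serve_forever()
```

Whether a crawler honours the header in the same way as a browser is exactly the kind of thing worth verifying in a test setup rather than assuming.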

#5 Longform content
Can pages actually become “too long”?

You all recall this, I presume?
Source: http://pa.ag/2A5630t

In case you need it:
Still true for desktop; for smartphone it’s fixed at ~1,700px in height, no scroll

GSC is really only a preview!
And here’s some further “proof” of that…

GSC screenshot vs post-rendered (live) viewport
Pushing the iFrame below 15,000 pixels, so that GSC cuts it off in its preview, still results in post-rendered content being found, just like in the first test:
GSC preview doesn’t show any text content
This content is only shown “below” a 15k pixel div container; the GSC rendered HTML does indeed show the container, and of course, the test phrase was returned as well.

The “More Info” tab is really awesome – use it!
It can really help with troubleshooting and debugging, so make good use of it
This is the same as/similar to your Chrome Developer Console

#6 CSS selectors
Ever heard of .class::before and .class::after?

What are CSS selectors and how do they work?
::before creates a pseudo-element that is the first child of the matched element
Source: https://pa.ag/2QRr9aH

Content that lives in the HTML mark-up
Content that lives in a CSS selector such as ::before

Again, the GSC preview shows what it would look like:
Googlebot seems to treat this identically to Chrome on desktop/smartphone; the rendered DOM remains unchanged (to be expected since it’s a pseudo-element):
HTML
CSS

Content from within CSS selectors won’t be indexed
Whether Googlebot renders the URL or not, the content will not be found
Content that lives in the HTML mark-up will be found and indexed, as expected
Content that lives in a CSS selector such as ::before won’t be indexed.

Why should you care?
Maybe you have to display certain content that gets classified as “boilerplate” (e.g. shipping info) or you want to create a certain content footprint?

#7 User-agent client hints
API-based access to information about a user's browser – or a crawler’s features

The User-Agent string is messy, like, very messy:
Over the decades, this string has accrued a variety of details about the client making the request, as well as cruft due to backwards compatibility:
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
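
To show how much guesswork it takes to get anything useful out of that string, here is a small Python sketch that pulls the Googlebot token and the Chrome major version out of the example above using regular expressions. The function name `parse_ua` is made up, and keep in mind that a UA string is trivially spoofed, so this alone does not verify that a request really comes from Google.

```python
import re
from typing import Optional, Tuple

# The example UA string from the slide.
UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def parse_ua(ua: str) -> Tuple[bool, Optional[int]]:
    """Return (claims to be Googlebot, Chrome major version or None)."""
    claims_googlebot = re.search(r"\bGooglebot/\d", ua) is not None
    chrome = re.search(r"\bChrome/(\d+)\.", ua)
    major = int(chrome.group(1)) if chrome else None
    return claims_googlebot, major


if __name__ == "__main__":
    print(parse_ua(UA))  # (True, 93)
```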

The UA string will be frozen, client hints to take over
User-Agent Client Hints are a new expansion to the Client Hints API and enable developers to access information about a user's browser – or a crawler’s features:
Source: https://pa.ag/3AiiUaI

It’s never too early to start testing these things:
Googlebot (running Chrome >89) apparently already populates those CH-headers:
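
If you want to log these headers yourself, the `Sec-CH-UA` request header carries a list of brand/version pairs. The Python sketch below parses such a value into a dictionary; the sample value is illustrative only (it is not a header captured from Googlebot), and `parse_sec_ch_ua` is a hypothetical helper, not a full structured-headers parser.

```python
import re
from typing import Dict

# Matches one list member such as: "Chromium";v="93"
# Brand names may contain semicolons or spaces, so the quoted part is
# matched as a whole instead of splitting the header on "," or ";".
_BRAND = re.compile(r'"((?:[^"\\]|\\.)*)"\s*;\s*v="([^"]*)"')


def parse_sec_ch_ua(value: str) -> Dict[str, str]:
    """Map each brand in a Sec-CH-UA header value to its version string."""
    return {brand: version for brand, version in _BRAND.findall(value)}


if __name__ == "__main__":
    # Illustrative sample value, assumed for this example.
    sample = '" Not;A Brand";v="99", "Chromium";v="93", "Google Chrome";v="93"'
    print(parse_sec_ch_ua(sample))
```

The odd-looking " Not;A Brand" entry is intentional on the browser side: it is there so that nobody gets away with naive string matching, which is the mistake everyone made with the old UA string.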

#8 Redirect fun
Redirect chains: 301 vs 302 vs JS

Redirect chains are bad – avoid them!
But what if you have to use them?

Up to 5 hops, they’ll show you the final destination
GSC shows the content from the final “destination” of a URL in a redirect chain

For 30x chains, GSC cuts you/the preview off after 5 hops:
Behaviour seems to be in sync with Google’s statements concerning this:
“In general, what happens is Googlebot will follow five 301s in a row, then if we can’t reach the destination page, then we will try again the next time.”
Source: https://pa.ag/2XdvKVr
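
The hop limit is easy to simulate. The Python sketch below follows a redirect map (a plain dictionary standing in for real 30x responses) and gives up after a configurable number of hops; the default of 5 is taken from the quote above, everything else, including the name `follow_redirects`, is an assumption for illustration.

```python
from typing import Dict, Optional, Tuple

MAX_HOPS = 5  # from the statement quoted above; not a documented constant


def follow_redirects(
    start: str,
    redirects: Dict[str, str],
    max_hops: int = MAX_HOPS,
) -> Tuple[Optional[str], int]:
    """Follow a chain of redirects.

    Returns (final URL, hops taken), or (None, hops taken) if the
    destination was not reached within max_hops.
    """
    url, hops = start, 0
    while url in redirects:
        if hops == max_hops:
            return None, hops  # give up; a crawler would retry later
        url = redirects[url]
        hops += 1
    return url, hops


if __name__ == "__main__":
    chain = {f"/hop-{i}": f"/hop-{i + 1}" for i in range(6)}  # 6 redirects
    print(follow_redirects("/hop-0", chain))               # gives up after 5 hops
    print(follow_redirects("/hop-0", chain, max_hops=10))  # reaches /hop-6
```

Because the loop stops at the hop limit, redirect loops (A to B and back to A) also terminate instead of running forever.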

Using JS, you could use 10 hops & it still seems to work
However, I’m not saying using 10 redirects in a row is a great idea. They might not pass the same equity (if any) and are super sloooooow!

Glad you asked, yes – they’ll even index the destination!
Again, this is the content from the URL after 10x JS redirects have been executed

Yeah, I really like to break things…
GSC gave up when I tried to go for 15 hops… still, I wonder why the limit is different from server-side redirects – maybe render timeout?

THANK YOU! SEE YOU AT THE NEXT SMX!
Care for the slides? Any questions? Email us > [email protected]
https://pa.ag/smxtesting22
twitter.com/peakaceag facebook.com/peakaceag www.pa.ag
Take your career to the next level: jobs.pa.ag
