Challenges of building a search engine like web rendering service

Giacomo Zecchini | Verve Search @giacomozecchini

Hi, I’m Giacomo
- Technical Director at
- Technical background and previous experience in development
- Love: understanding how things work and Web Performance

As always happens in all the worst stories, everything
started with a website migration.

One of the largest brands in the UK, with millions
of pounds on the line.

The client was migrating platform, moving from a server-side
rendered website to a client-side rendered one.

...and of course SSR was going to be put in
place.

But... two days before the migration the client told us
“SSR is not working”.

We were already at code freeze.

We implemented a short-term solution*: statically rendering the site
and using the user-agent to serve the right content.
* A sort of customized Rendertron script.
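The user-agent switch described above can be sketched as follows. This is a minimal illustration, not the script used in the migration: the crawler token list and the function names `wants_static_snapshot` and `choose_response` are hypothetical.

```python
import re

# Hypothetical list of crawler tokens; a real deployment would maintain its own.
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|duckduckbot|baiduspider", re.IGNORECASE
)


def wants_static_snapshot(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like a search engine crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def choose_response(user_agent: str, snapshot_html: str, app_shell_html: str) -> str:
    """Serve the pre-rendered snapshot to crawlers and the JavaScript app to everyone else."""
    if wants_static_snapshot(user_agent):
        return snapshot_html
    return app_shell_html
```

Matching on the user-agent string is easy to deploy under time pressure, which fits a short-term fix, but it relies on the token list staying current.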

After the migration we helped the client to move to
a medium-long term solution (Prerender.io).

Implementation is always harder than it seems.

My curiosity made me start to research web rendering services.

In the past, the HTML was the most important thing to download
in order to access the content of a page.
* Icons made by Freepik from www.flaticon.com

Today, JavaScript is a big part of the web and
makes everything more complex.

In the past we had a crawling-indexing process.
[Diagram labels: Crawler, Processing, Index, URLs, Crawl Queue, URL, HTML]

Now, we’ve moved to a crawling-rendering-indexing process.
https://developers.google.com/search/docs/guides/javascript-seo-basics
[Diagram labels: Crawler, Processing, Index, Renderer, URLs, Crawl Queue, URL, HTML, Render Queue]

Google calls the rendering element WRS (Web Rendering Service).
[Diagram labels: Crawler, Processing, Index, Renderer, URLs, Crawl Queue, URL, HTML, Render Queue, WRS]

Martin Splitt’s TechSEO Boost 2019 talk
https://www.youtube.com/watch?v=Qxd_d9m9vzo
In his presentation, Martin covered a lot of interesting implementation details. If you’re interested in Google’s WRS, this is the presentation to watch.

These are three of the most important things you can
get from a web rendering service:
- DOM Tree
- Render Tree + Layout
- Rendered HTML

DOM Tree & Render Tree
https://developers.google.com/web/fundamentals/performance/critical-rendering-path/render-tree-construction

Layout information
The layout information helps to understand where
elements are positioned on a page, their dimensions, and their importance.
https://youtu.be/WjMSfTK1_SY?t=239

The layout information is useful when it comes to:
- Understanding the semantics of a page
- Checking if a page is mobile friendly
- Finding intrusive interstitials
- Understanding above-the-fold content
https://youtu.be/WjMSfTK1_SY?t=239

My toy web rendering service implementation
[Diagram labels: Crawler, Renderer, Render queue, Fetch Server, Cache Server, Chrome instances, Robots.txt Server, DNS Server, Fetchers]

It uses a first-in, first-out queue.

...utilises Chrome DevTools Protocol.
https://chromedevtools.github.io/devtools-protocol/

I’m currently working on the fetch server.

It uses the cache when possible.

If the URL is not cached, the crawler fetches it.
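The queue and the cache-first lookup can be sketched in a few lines. This is only an illustration of the two ideas, not the toy service itself: `RenderQueue`, `get_resource`, and the `fetch` callback are hypothetical names, and the cache is a plain dictionary rather than a separate cache server.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class RenderQueue:
    """First-in, first-out queue of URLs waiting to be rendered."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def push(self, url: str) -> None:
        self._items.append(url)

    def pop(self) -> str:
        # popleft() returns the oldest URL, which gives FIFO ordering.
        return self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)


def get_resource(
    url: str, cache: Dict[str, str], fetch: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (body, source): use the cache when possible, otherwise fetch and store."""
    if url in cache:
        return cache[url], "cache"
    body = fetch(url)
    cache[url] = body
    return body, "fetch"
```

Passing the fetcher in as a callback keeps the sketch independent of any HTTP library; a real fetch server would also need robots.txt checks and cache expiry.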

The real problems start when you have to make choices
about the actual rendering of pages!

What about the viewport? Do you want to limit the
number of fetches? Are you going to render a page multiple times?

Developing software is a matter of choices and context.

The same group of people in a different context may
end up developing the same project in a totally different way.

You can’t replicate Google’s WRS without having the same data
they have.

This is where I began to see the case for
building your own rather than relying on tools.

You can’t replicate Google’s WRS but you can learn from
it.

Understanding Google's WRS behaviour

If you need JavaScript console message data you can use
the Mobile-Friendly Test or Search Console Live Test, but be careful!

The Mobile-Friendly Test, Search Console Live Test, AMP Test, and Rich
Results Test use the WRS infrastructure, but they bypass the cache, use shorter timeouts, and have a few other differences.

Hic sunt dracones / Here be Dragons
What follows is based on my own tests and assumptions; results may be false positives. Google can change implementation details at any time without notice, explanation, or justification.

How we’ll approach the edge cases
1. Define the edge case
2. Understand Google's WRS support and behaviour (personal assumption)
3. Check for tools support and behaviour
4. Propose a solution

Some of the tested tools
I did multiple tests for each edge case.

This is not an evaluation of those tools, but just
a comparison between their results and those of Google’s WRS.

Edge Case #1 HTTPS/HTTP mixed content

Mixed content occurs when initial HTML is loaded over a
secure HTTPS connection, but other resources are loaded over an insecure HTTP connection.
https://youtu.be/WjMSfTK1_SY?t=239

Website: https://www.example.com
CSS: http://www.example.com/style.css

Chrome will automatically upgrade mixed content from HTTP to HTTPS.
If the fetch fails that asset won’t be loaded.

Google’s WRS seems to behave like Chrome*
* This was not the case until recently.
N.B. These are just personal assumptions based on tests. Tests
could be wrong and implementation details may change tomorrow.

Tools support
10% of tests showed a different result

Solution
When visiting an HTTPS website, upgrade the URLs of
assets from HTTP to HTTPS. Using Chromium-based browsers you should already have the right solution in place.
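For a renderer that is not Chromium-based, the upgrade step could look like the sketch below. The function name `upgrade_asset_url` is hypothetical, and the sketch only rewrites the scheme; it does not retry or drop the asset if the HTTPS fetch fails, and it leaves any explicit port untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_asset_url(page_url: str, asset_url: str) -> str:
    """Rewrite an http:// asset URL to https:// when the page itself is served over HTTPS."""
    page = urlsplit(page_url)
    asset = urlsplit(asset_url)
    if page.scheme == "https" and asset.scheme == "http":
        # SplitResult is a named tuple, so _replace returns a copy with a new scheme.
        return urlunsplit(asset._replace(scheme="https"))
    return asset_url
```

With the URLs from the example slide, `upgrade_asset_url("https://www.example.com", "http://www.example.com/style.css")` gives `https://www.example.com/style.css`.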

Edge Case #2 Infinite scrolling / Lazy loading

SCROLL

N.B. These are just personal assumptions based on tests. Tests
could be wrong and implementation details may change tomorrow.

Google starts the rendering using a fixed viewport:
Mobile: 412 x 732
Desktop: 1024 x 1024

Then it calculates the new viewport:
Viewport = Page Height + pixels
The number of additional pixels depends on the page; it could be thousands of pixels.

A bigger viewport triggers infinite loading or lazy loading events.

10,000,000 px
This seems to be the maximum viewport height.

Tools support
95% of tests showed a different result*
* for very tall pages

Solution
Start with a fixed viewport.

Solution
Wait for an event:
- onload
- DOMContentLoaded
If you’re using Puppeteer:
- networkidle0
- networkidle2

Solution
Check the page height and compare it to the initial viewport.
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics

Solution
If the viewport is shorter than the page:
Viewport = Page Height + pixels

Solution
The simplest solution is then to wait for X
seconds and stop rendering, or check the viewport and page height again.*
* For more complex solutions you can look at ongoing requests or an event-based approach.
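The resize loop from these solution slides can be sketched as pure logic, leaving the browser calls out. Everything named here is an assumption for illustration: `EXTRA_PIXELS` is an arbitrary value (the slides only say the amount depends on the page), and `measure_page_height` is a hypothetical callback standing in for resizing the viewport, waiting, and reading the page height (for example via Page.getLayoutMetrics).

```python
from typing import Callable

MOBILE_VIEWPORT = (412, 732)      # width, height observed in the tests above
DESKTOP_VIEWPORT = (1024, 1024)
MAX_VIEWPORT_HEIGHT = 10_000_000  # apparent upper bound observed in the tests above
EXTRA_PIXELS = 2_000              # assumption: arbitrary padding for this sketch


def next_viewport_height(
    viewport_height: int, page_height: int, extra_pixels: int = EXTRA_PIXELS
) -> int:
    """Viewport = Page Height + pixels, applied only when the page is taller than the viewport."""
    if page_height <= viewport_height:
        return viewport_height
    return min(page_height + extra_pixels, MAX_VIEWPORT_HEIGHT)


def settle_viewport(
    initial_height: int,
    measure_page_height: Callable[[int], int],
    max_rounds: int = 5,
) -> int:
    """Grow the viewport until the page fits or max_rounds is reached."""
    height = initial_height
    for _ in range(max_rounds):
        page_height = measure_page_height(height)
        new_height = next_viewport_height(height, page_height)
        if new_height == height:
            break  # the page now fits inside the viewport
        height = new_height
    return height
```

The `max_rounds` limit plays the role of "wait for X seconds and stop rendering": a truly infinite page would otherwise keep growing until the maximum height.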

Edge Case #3 Content-visibility

content-visibility is a CSS property that enables the browser
to skip an element's rendering.
https://web.dev/content-visibility/

content-visibility is used together with contain-intrinsic-size, a CSS property
that allows you to specify the natural size of an element if the element is affected by size containment.

N.B. These are just personal assumptions based on tests. Tests
could be wrong and implementation details may change tomorrow.

Google starts the rendering using a fixed viewport:
Mobile: 412 x 732
Desktop: 1024 x 1024

Viewport = Page Height + pixels
When the browser starts the rendering, the page height is calculated using contain-intrinsic-size.

A bigger viewport makes the browser render the element affected by size containment.

Tools support
97% of tests showed a different result*
* for very tall pages

Solution
Start with a fixed viewport.

Solution
Wait for an event:
- onload
- DOMContentLoaded
If you’re using Puppeteer:
- networkidle0
- networkidle2

Solution
Check the page height and compare it to the initial viewport.
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics

Solution
If the viewport is shorter than the page:
Viewport = Page Height + pixels

Solution
The simplest solution is then to wait for X
seconds and stop rendering, or check the viewport and page height again.

Edge Case #4 Shadow DOM

Shadow DOM
https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM

Google is able to render and use Shadow DOM content.
N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow.

Tools support
93% of tests showed a different result

Solution
Using document.documentElement.outerHTML returns a DOMString containing an HTML
serialization of the element and its descendants, but not the Shadow DOM.
* Puppeteer’s Page.content() returns the outerHTML.

Solution
The solution is to get the DOM tree, traverse
it and serialize it into HTML.
https://www.w3schools.com/js/js_htmldom_navigation.asp

Solution
[Diagram: DOM tree with nodes Document, HTML, HEAD, BODY, DIV, P, P, Text, Text]

Solution
dom2html library: https://github.com/GoogleChromeLabs/dom2html
Chrome DevTools Protocol:
- DOM.getDocument
- DOMSnapshot.getSnapshot
- DOMSnapshot.captureSnapshot
https://chromedevtools.github.io/devtools-protocol/
* If interested in Shadow DOM, have a look at: https://web.dev/declarative-shadow-dom/
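A traversal of this kind can be sketched over the node tree that DOM.getDocument returns, here treated as plain dictionaries (fields `nodeType`, `localName`, `nodeValue`, `attributes` as a flat name/value list, `children`, `shadowRoots`). This is a simplified illustration, not how dom2html works: the `serialize` function is hypothetical, it flattens shadow content into the host element, drops comments and the doctype, and escapes text inside script and style elements, which a production serializer should not do.

```python
from html import escape
from typing import Any, Dict

Node = Dict[str, Any]

VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

ELEMENT_NODE = 1
TEXT_NODE = 3


def serialize(node: Node) -> str:
    """Serialize a DOM.getDocument-style node tree to HTML, including shadow roots."""
    node_type = node.get("nodeType")
    if node_type == TEXT_NODE:
        return escape(node.get("nodeValue", ""), quote=False)

    # Shadow roots are emitted before the light DOM children of the host.
    descendants = node.get("shadowRoots", []) + node.get("children", [])
    inner = "".join(serialize(child) for child in descendants)

    if node_type != ELEMENT_NODE:
        # Documents and shadow roots (fragments) contribute only their children.
        return inner

    name = node.get("localName") or node["nodeName"].lower()
    flat = node.get("attributes", [])
    attrs = "".join(
        f' {key}="{escape(value)}"' for key, value in zip(flat[0::2], flat[1::2])
    )
    if name in VOID_ELEMENTS:
        return f"<{name}{attrs}>"
    return f"<{name}{attrs}>{inner}</{name}>"
```

To receive shadow roots in the tree, DOM.getDocument would need to be called with its `pierce` option enabled and a depth of -1.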

Edge Case #5 Iframe

IFRAME
[Diagram: Page A embedding Page B]

N.B. These are just personal assumptions based on tests. Tests
could be wrong and implementation details may change tomorrow.

Google is able to render an iframe, inlining the <body> content in a <div>.

If the page included through the iframe has a noindex, the content is not included in the page.

Iframe - Tools support
95% of tests showed a different result

Solution
Get the DOM tree, traverse it, and serialize it
into HTML.
N.B. When traversing the DOM you only need the content of <body>; remove other HTML elements and tags such as the <head>. Remember to check for the noindex.

Solution
dom2html library: https://github.com/GoogleChromeLabs/dom2html
Chrome DevTools Protocol:
- DOM.getDocument
- DOMSnapshot.getSnapshot
- DOMSnapshot.captureSnapshot
https://chromedevtools.github.io/devtools-protocol/
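The iframe handling can be sketched as a tree rewrite over DOM.getDocument-style dictionaries, where an iframe element carries its embedded page in a `contentDocument` field. The helper names (`walk`, `has_noindex`, `find_body`, `inline_iframes`) are hypothetical, and the noindex check only looks at a robots meta tag; a noindex sent via an HTTP header would need to be checked separately.

```python
from typing import Any, Dict, Iterator, Optional

Node = Dict[str, Any]


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def attributes(node: Node) -> Dict[str, str]:
    """Turn the flat [name, value, name, value, ...] list into a dict."""
    flat = node.get("attributes", [])
    return dict(zip(flat[0::2], flat[1::2]))


def has_noindex(document: Node) -> bool:
    """True if the document contains <meta name="robots" content="...noindex...">."""
    for node in walk(document):
        if node.get("localName") == "meta":
            attrs = attributes(node)
            if (attrs.get("name", "").lower() == "robots"
                    and "noindex" in attrs.get("content", "").lower()):
                return True
    return False


def find_body(document: Node) -> Optional[Node]:
    return next((n for n in walk(document) if n.get("localName") == "body"), None)


def inline_iframes(node: Node) -> Optional[Node]:
    """Copy the tree, replacing each iframe with a <div> holding the framed <body> content.

    Returns None for iframes that must be dropped (no content, or noindex).
    """
    if node.get("localName") == "iframe":
        document = node.get("contentDocument")
        body = find_body(document) if document else None
        if body is None or has_noindex(document):
            return None
        children = body.get("children", [])
        result: Node = {"nodeType": 1, "nodeName": "DIV", "localName": "div", "attributes": []}
    else:
        children = node.get("children", [])
        result = {key: value for key, value in node.items() if key != "children"}
    inlined = (inline_iframes(child) for child in children)
    result["children"] = [child for child in inlined if child is not None]
    return result
```

Rewriting the tree first keeps the iframe logic separate from serialization, so the rewritten tree can then be passed to whichever serializer is in use.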

Are those the only problems that exist?

Web rendering services

What can we learn from this?

Sometimes you should reinvent the wheel. It’s fun and you
can learn a lot from that!

When you change the way you look at things, the
things you look at change. Understanding these limitations should change, in those edge cases, the advice that you provide.

Don’t use tools blindly! Tools are great and save us
a huge amount of time in all our tasks. The majority of pages on the web are not affected by those edge cases.

If your website uses or is affected by one of
the mentioned edge cases, you can open a support ticket to check with your tool provider whether they already cover it.

Thank You! Got questions? DMs on Twitter are open: @giacomozecchini
