
Challenges of building a search engine like web rendering service

SMX Advanced Europe, June 2021 - With the advent of new technologies and the massive use of JavaScript on the internet, search engines have started using Web Rendering Services (WRS) to better understand the content of pages. What are the difficulties in building a WRS? Do the tools we use every day replicate what search engines do? In this session, Giacomo takes you on a journey of discovery, digging into some technical implementation details of building a search-engine-like web rendering service and covering edge cases such as infinite scrolling, iframes, web components, and Shadow DOM, along with how to approach them.

Giacomo Zecchini

June 21, 2021

Transcript

  1. Challenges of building a search engine like web rendering service

    Giacomo Zecchini | Verve Search @giacomozecchini
  2. Hi, I’m Giacomo, Technical Director at Verve Search. Technical background and previous

    experience in development. Love: understanding how things work, and Web Performance. @giacomozecchini
  3. One of the largest brands in the UK, with millions

    of pounds on the line. @giacomozecchini
  4. The client was migrating platforms, moving from a server-side

    rendered website to a client-side rendered one. @giacomozecchini
  5. But... two days before the migration, the client told us

    “SSR is not working”. @giacomozecchini
  6. We implemented a short-term solution*: statically rendering the site

    and using the user-agent to serve the right content. * A sort of customized Rendertron script. @giacomozecchini
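A minimal Python sketch of the user-agent switch described on this slide. It is not the script used for the client: the crawler tokens in `BOT_PATTERN` and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical list of crawler tokens; extend it for the bots you care about.
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|baiduspider|duckduckbot", re.IGNORECASE
)


def is_bot(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like a known crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def choose_response(user_agent: str, static_html: str, client_side_shell: str) -> str:
    """Serve the statically rendered HTML to crawlers and the
    client-side rendered shell to everyone else."""
    return static_html if is_bot(user_agent) else client_side_shell
```

User-agent strings can be spoofed, so a production setup would usually verify crawlers by other means as well.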
  7. After the migration, we helped the client move to

    a medium-to-long-term solution (Prerender.io). @giacomozecchini
  8. In the past, the HTML was the most important

    thing to download in order to access the content of a page. * Icons made by Freepik from www.flaticon.com @giacomozecchini
  9. Today, JavaScript is a big part of the web and

    makes everything more complex. @giacomozecchini
  10. In the past we had a Crawling-Indexing process. Diagram labels:

    Crawler, Processing, Index, URLs, Crawl Queue, URL, HTML. @giacomozecchini
  11. Google calls the rendering element WRS. Diagram labels: Crawler, Processing,

    Index, Renderer, URLs, Crawl Queue, URL, HTML, Render Queue, WRS. @giacomozecchini
  12. Martin Splitt’s TechSEO Boost 2019 talk https://www.youtube.com/watch?v=Qxd_d9m9vzo In his presentation,

    Martin covered a lot of interesting implementation details. If you’re interested in Google’s WRS, this is the presentation to watch. @giacomozecchini
  13. These are three of the most important things you can

    get from a Web Rendering Service: DOM Tree, Render Tree + Layout, Rendered HTML. @giacomozecchini
  14. Layout information https://youtu.be/WjMSfTK1_SY?t=239 The layout information helps to understand where

    elements are positioned on a page, their dimensions, and their importance. @giacomozecchini
  15. The layout information is useful when it comes to:

    understanding the semantics of a page; checking whether a page is mobile friendly; finding intrusive interstitials; understanding above-the-fold content. https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
  16. My toy web rendering service implementation. Diagram components: Crawler, Renderer,

    Render queue, Fetch Server, Cache Server, Chrome instances, Robots.txt Server, DNS Server, Fetchers. @giacomozecchini
  17. It uses a first-in, first-out queue. @giacomozecchini
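A first-in, first-out render queue can be sketched in a few lines of Python. This is an illustration, not the toy implementation itself: the `RenderQueue` class name and the de-duplication of already-seen URLs are assumptions.

```python
from collections import deque
from typing import Optional


class RenderQueue:
    """FIFO queue of URLs waiting to be rendered.

    De-duplicating URLs is an assumption added for illustration;
    the slide only states that the queue is first in, first out.
    """

    def __init__(self) -> None:
        self._urls = deque()
        self._seen = set()

    def enqueue(self, url: str) -> bool:
        """Add a URL to the back of the queue; return False if already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._urls.append(url)
        return True

    def dequeue(self) -> Optional[str]:
        """Remove and return the oldest URL, or None when the queue is empty."""
        return self._urls.popleft() if self._urls else None

    def __len__(self) -> int:
        return len(self._urls)
```

`deque.popleft()` runs in constant time, which is why it is preferred here over popping from the front of a list.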
  18. It utilises the Chrome DevTools Protocol: https://chromedevtools.github.io/devtools-protocol/ @giacomozecchini
  19. I’m currently working on the fetch server. @giacomozecchini
  20. It uses the cache when possible. @giacomozecchini
  21. If the URL is not cached, the crawler fetches it. @giacomozecchini
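The cache-first behaviour on the last two slides can be sketched as follows. The `CachingFetcher` class and the injected `fetch` callable are hypothetical; a real fetch server would also handle expiry, robots.txt rules, and errors.

```python
from typing import Callable, Dict


class CachingFetcher:
    """Return a cached response when possible, otherwise fetch and store it."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # hypothetical function performing the real request
        self._cache: Dict[str, str] = {}
        self.network_fetches = 0  # counts how often the cache was missed

    def get(self, url: str) -> str:
        if url in self._cache:
            return self._cache[url]  # cache hit: no network request
        body = self._fetch(url)  # cache miss: fetch the resource
        self.network_fetches += 1
        self._cache[url] = body
        return body
```

Injecting the fetch function keeps the caching logic testable without network access.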
  22. The real problems start when you have to make choices

    about the actual rendering of pages! @giacomozecchini
  23. What about the viewport? Do you want to limit the

    number of fetches? Are you going to render a page multiple times? @giacomozecchini
  24. The same group of people in a different context may

    end up developing the same project in a totally different way. @giacomozecchini
  25. This is where I began to see the case for

    building your own rather than relying on tools. @giacomozecchini
  26. If you need JavaScript console message data, you can use

    the Mobile-Friendly Test or the Search Console Live Test, but be careful! @giacomozecchini
  27. Mobile-Friendly Test, Search Console Live Test, AMP Test, and Rich

    Results Test use the WRS infrastructure, but they bypass the cache, use shorter timeouts, and have a few other differences. @giacomozecchini
  28. Hic sunt dracones / Here be Dragons. What follows is

    based on my own tests and assumptions; results may be false positives. Google can change implementation details at any time without notice, explanation, or justification. @giacomozecchini
  29. How we’ll approach the edge cases: 1. Define the edge

    case. 2. Understand Google's WRS support and behaviour (personal assumption). 3. Check for tools support and behaviour. 4. Propose a solution. @giacomozecchini
  30. Some of the tested tools. I did multiple tests for

    each edge case. @giacomozecchini
  31. This is not an evaluation of those tools, but just

    a comparison between their results and those of Google’s WRS. @giacomozecchini
  32. Mixed content occurs when initial HTML is loaded over a

    secure HTTPS connection, but other resources are loaded over an insecure HTTP connection. https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
  33. Chrome will automatically upgrade mixed content from HTTP to HTTPS.

    If the fetch fails that asset won’t be loaded. @giacomozecchini
  34. Google’s WRS seems to behave like Chrome.* N.B. These are

    just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. * This was not the case until recently. @giacomozecchini
  35. Solution: when visiting an HTTPS website, upgrade the URLs of

    assets from HTTP to HTTPS. If you use a Chromium-based browser, you should already have the right solution in place. @giacomozecchini
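If the renderer is not Chromium-based, the upgrade has to be done by hand. A minimal sketch using only the Python standard library; the function name `upgrade_asset_url` is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_asset_url(page_url: str, asset_url: str) -> str:
    """Upgrade an HTTP asset URL to HTTPS when the page itself is HTTPS.

    Relative URLs and assets that are already HTTPS are returned unchanged.
    """
    page = urlsplit(page_url)
    asset = urlsplit(asset_url)
    if page.scheme == "https" and asset.scheme == "http":
        return urlunsplit(asset._replace(scheme="https"))
    return asset_url
```

As on the slide above, if the upgraded request fails, the asset simply should not be loaded; there is no fallback to HTTP.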
  36. Viewport: Google starts the rendering using a fixed viewport.

    Mobile: 412 x 732. Desktop: 1024 x 1024. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  37. Google then calculates the new viewport:

    Viewport = Page Height + pixels. The number of additional pixels depends on the page; it could be thousands of pixels. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  38. A bigger viewport triggers infinite loading or lazy loading events.

    N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  39. 10,000,000 px seems to be the maximum viewport height.

    N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  40. Tools support: 95% of tests showed a different result.*

    * For very tall pages. @giacomozecchini
  41. Solution: wait for an event: onload, DOMContentLoaded.

    If you’re using Puppeteer: networkidle0, networkidle2. @giacomozecchini
  42. Solution: check the page height and

    compare it to the initial viewport. https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
  43. Solution: if the viewport is shorter than the page,

    set Viewport = Page Height + pixels. @giacomozecchini
  44. Solution: the simplest solution is then to wait for X

    seconds and stop rendering, or check the viewport and Page Height again.* * For more complex solutions you can look at ongoing requests or an event-based approach. @giacomozecchini
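The resize-and-recheck loop from the last three slides can be sketched as pure logic in Python. The names, the `EXTRA_PIXELS` padding, and the `max_rounds` limit are assumptions. `measure_page_height` is a hypothetical callback: in a real renderer it would resize the viewport (for example with the DevTools Protocol method Emulation.setDeviceMetricsOverride), wait X seconds, and read the page height from Page.getLayoutMetrics.

```python
from typing import Callable

MAX_VIEWPORT_HEIGHT = 10_000_000  # maximum height observed in the tests above
EXTRA_PIXELS = 2_000  # hypothetical padding; the real amount depends on the page


def next_viewport_height(viewport_height: int, page_height: int,
                         extra_pixels: int = EXTRA_PIXELS) -> int:
    """Viewport = Page Height + pixels, applied only when the page is taller."""
    if page_height <= viewport_height:
        return viewport_height
    return min(page_height + extra_pixels, MAX_VIEWPORT_HEIGHT)


def expand_viewport(initial_height: int,
                    measure_page_height: Callable[[int], int],
                    max_rounds: int = 5) -> int:
    """Grow the viewport until the page fits or max_rounds is reached.

    measure_page_height(viewport_height) must return the page height
    measured after rendering with that viewport height.
    """
    height = initial_height
    for _ in range(max_rounds):
        page_height = measure_page_height(height)
        new_height = next_viewport_height(height, page_height)
        if new_height == height:
            break  # the page fits: nothing more to trigger
        height = new_height
    return height
```

Capping the number of rounds keeps a truly infinite scroll from looping forever.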
  45. content-visibility is used together with contain-intrinsic-size, a CSS property

    that allows you to specify the natural size of an element if the element is affected by size containment. https://web.dev/content-visibility/ @giacomozecchini
  46. Viewport: Google starts the rendering using a fixed viewport.

    Mobile: 412 x 732. Desktop: 1024 x 1024. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  47. Viewport = Page Height + pixels. When the browser starts

    the rendering, the Page Height is calculated using the contain-intrinsic-size. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  48. A bigger viewport makes the browser render the elements

    affected by size containment. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  49. Tools support: 97% of tests showed a different result.*

    * For very tall pages. @giacomozecchini
  50. Solution: wait for an event: onload, DOMContentLoaded.

    If you’re using Puppeteer: networkidle0, networkidle2. @giacomozecchini
  51. Solution: check the page height and

    compare it to the initial viewport. https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
  52. Solution: if the viewport is shorter than the page,

    set Viewport = Page Height + pixels. @giacomozecchini
  53. Solution: the simplest solution is then to wait for X

    seconds and stop rendering, or check the viewport and Page Height again. @giacomozecchini
  54. Google is able to render and use Shadow DOM content.

    N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  55. Solution: document.documentElement.outerHTML returns a DOMString containing an HTML

    serialization of the element and its descendants, but not the Shadow DOM. * Puppeteer’s Page.content() returns the outerHTML. @giacomozecchini
  56. Solution: get the DOM tree, traverse

    it, and serialize it into HTML. https://www.w3schools.com/js/js_htmldom_navigation.asp @giacomozecchini
  57. Solution: dom2html library: https://github.com/GoogleChromeLabs/dom2html Chrome DevTools Protocol: DOM.getDocument, DOMSnapshot.getSnapshot, DOMSnapshot.captureSnapshot.

    https://chromedevtools.github.io/devtools-protocol/ * If interested in Shadow DOM, have a look at: https://web.dev/declarative-shadow-dom/ @giacomozecchini
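A hedged Python sketch of the traverse-and-serialize step. It assumes the input is a node tree shaped like the result of DOM.getDocument called with depth -1 and pierce true, so nodes are dicts with `nodeType`, `localName`, a flat `attributes` list, `children`, and `shadowRoots`. Emitting each shadow root as a declarative `<template shadowrootmode>` element and skipping user-agent shadow roots are choices made for this illustration, not something the slides prescribe.

```python
from html import escape

ELEMENT, TEXT, COMMENT, DOCUMENT, DOCTYPE, FRAGMENT = 1, 3, 8, 9, 10, 11
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


def serialize(node: dict) -> str:
    """Serialize a DevTools-Protocol-style DOM node, Shadow DOM included."""
    node_type = node["nodeType"]
    if node_type == TEXT:
        return escape(node.get("nodeValue", ""), quote=False)
    if node_type == COMMENT:
        return "<!--" + node.get("nodeValue", "") + "-->"
    if node_type == DOCTYPE:
        return "<!DOCTYPE " + node["nodeName"] + ">"
    children = "".join(serialize(child) for child in node.get("children", []))
    if node_type in (DOCUMENT, FRAGMENT):
        return children  # documents and shadow roots have no tag of their own
    if node_type != ELEMENT:
        return ""
    tag = node.get("localName") or node["nodeName"].lower()
    flat = node.get("attributes", [])  # [name1, value1, name2, value2, ...]
    attrs = "".join(f' {name}="{escape(value)}"'
                    for name, value in zip(flat[0::2], flat[1::2]))
    # Assumption: emit each author shadow root as a declarative template.
    shadow = "".join(
        f'<template shadowrootmode="{root.get("shadowRootType", "open")}">'
        f"{serialize(root)}</template>"
        for root in node.get("shadowRoots", [])
        if root.get("shadowRootType") != "user-agent"
    )
    if tag in VOID_ELEMENTS:
        return f"<{tag}{attrs}>"
    return f"<{tag}{attrs}>{shadow}{children}</{tag}>"
```

The sketch escapes all text nodes, so the contents of `<script>` and `<style>` elements would need special handling in a real serializer.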
  58. Google is able to render an Iframe, inlining the <body>

    content in a <div>. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  59. If the page included through the Iframe has a noindex,

    the content is not included in the page. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  60. Iframe - Tools support: 95% of tests showed a

    different result. @giacomozecchini
  61. Solution: get the DOM tree, traverse it, and serialize it

    into HTML. N.B. When traversing the Iframe’s DOM you only need the content of <body>; remove other HTML elements and tags such as the <head>. Remember to check for the noindex. @giacomozecchini
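A minimal Python sketch of the Iframe step, under the same assumption that nodes are dicts shaped like DOM.getDocument results (with `pierce` enabled, an iframe node carries a `contentDocument`). The function names are hypothetical, and only a robots meta tag is checked; a noindex sent through an X-Robots-Tag HTTP header would have to be handled by the fetch layer.

```python
from typing import Iterator, Optional


def walk(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def attributes(node: dict) -> dict:
    """Turn the flat [name, value, ...] attribute list into a dict."""
    flat = node.get("attributes", [])
    return dict(zip(flat[0::2], flat[1::2]))


def has_noindex(document: dict) -> bool:
    """Return True if the document contains a robots meta tag with noindex."""
    for node in walk(document):
        if node.get("localName") == "meta":
            attrs = attributes(node)
            if (attrs.get("name", "").lower() in ("robots", "googlebot")
                    and "noindex" in attrs.get("content", "").lower()):
                return True
    return False


def inline_iframe(iframe: dict) -> Optional[dict]:
    """Replace an iframe with a <div> holding the framed page's <body> content.

    Returns None when there is nothing to inline or the framed page is noindex.
    """
    document = iframe.get("contentDocument")
    if document is None or has_noindex(document):
        return None
    body = next((n for n in walk(document) if n.get("localName") == "body"), None)
    if body is None:
        return None
    return {"nodeType": 1, "nodeName": "DIV", "localName": "div",
            "attributes": [], "children": body.get("children", [])}
```

The returned `<div>` node can be swapped in for the iframe before the tree is serialized into HTML.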
  62. Sometimes you should reinvent the wheel. It’s fun and you

    can learn a lot from that! @giacomozecchini
  63. When you change the way you look at things, the

    things you look at change. Understanding these limitations should change, in those edge cases, the advice that you provide. @giacomozecchini
  64. Don’t use tools blindly! Tools are great and save us

    a huge amount of time in all our tasks. The majority of pages on the web are not affected by those edge cases. @giacomozecchini
  65. If your website uses or is affected by one of

    the mentioned edge cases, you can open a support ticket to check with your tool provider if they are already covering that. @giacomozecchini