Web Pages Visual Similarity - Search Central Live Zurich 2024

Google's Search Central Live in Zurich, 2024

Giacomo Zecchini

December 16, 2024

Transcript

  1. Hi everyone! I’m Giacomo, R&D director at Merj.

    Merj blends digital marketing and engineering expertise to solve complex challenges for enterprise companies. We act as an embedded team for our customers, providing transformative solutions that merge strategy, data, automation and technology. We are hiring! More on that later…
  2. What we are going to cover today

    1. Understand the context of the research
    2. Explain why visual similarity is useful
    3. How to define visual similarity
    4. Implementation overview
    5. Additional use cases
  3. Scenario

    A company we were working with wanted to consolidate its brands onto a unified technical stack. They wanted to know if this would be a problem for users and search engines.
  4. Can you define similarity?

    Similarity means different things to different people. We approach this concept by defining it in terms of text similarity and visual similarity.
  5. The text similarity part was already completed

    I joined Merj when this project was already underway, with the team in the final stages of completing the text similarity component. Text similarity is a well-established problem with numerous documented approaches for tackling it effectively. In the following slides, we will turn our attention to visual similarity.
  6. Why visual similarity?

    Jakob's Law of Internet User Experience: “Users spend most of their time on other sites. This means that users prefer your site to work the same way as all the other sites they already know. Design for patterns for which users are accustomed.” - Jakob Nielsen
    Source: https://www.nngroup.com/videos/jakobs-law-internet-ux/
  7. When I’m familiar with it, I know how it works

    Icon source: Icons from flaticon.com
  8. When is being similar too much?

    When websites are too similar, the advantage of having multiple brands can be diminished. Users may perceive the websites as identical, which could negatively impact business metrics. By comparing multiple internal brands and competitors, and incorporating business metrics, we wanted to define the optimal threshold for visual similarity.
  9. Looking for a different approach for Visual Similarity

    The team had already explored several machine learning approaches. However, implementing them at the scale we required turned out to be both slow and costly. We started looking at a different approach.
  10. Project Status

    What we had:
    - List of domains to compare (internal brands and competitors)
    - List of categories for each domain (HTML templates)
    - List of web pages for each category
    - Web crawler (utilising a headless browser to render web pages)
  11. Browser Rendering Process

    1. Parsing HTML to construct the DOM tree
    2. Render tree construction
    3. Layout of the render tree
    4. Painting the render tree

    Image source: https://web.dev/articles/howbrowserswork
  12. Hypothesis

    “Two web pages can be compared for visual similarity by evaluating the elements that share similar coordinates and dimensions.”
  13. Automated Browser Actions

    There are many libraries that provide a high-level API for controlling Chrome, abstracting the DevTools Protocol. For lower-level tasks, we can directly use the Chrome DevTools Protocol (CDP), a protocol designed to automate actions on Chromium, Chrome, and other Blink-based browsers.
    Source: https://chromedevtools.github.io/devtools-protocol/
  14. CDP’s DOMSnapshot.captureSnapshot

    Using the Chrome DevTools Protocol, we can get a snapshot of all nodes (elements) rendered on the page, their content, positions, and dimensions.
    Source: https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot
  15. Code example

    // ES module import (top-level await requires an ES module)
    import puppeteer from 'puppeteer';

    // Launch Puppeteer and open a new page
    const browser = await puppeteer.launch({ headless: true });
    const context = await browser.createBrowserContext();
    const page = await context.newPage();

    // Connect to the DevTools protocol
    const CDPclient = await page.createCDPSession();

    // Enable DOMSnapshot domain
    await CDPclient.send('DOMSnapshot.enable');

    // Navigate to the target URL
    const url = 'https://example.com';
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Capture a DOM snapshot
    const snapshot = await CDPclient.send('DOMSnapshot.captureSnapshot', {
      // Define the computed styles to return
      computedStyles: ['visibility', 'display', 'z-index', 'color', 'background-color']
    });

    // Close the browser once the snapshot has been captured
    await browser.close();
  16. Parsing and normalising the DOMSnapshot output

    INFORMATION ABOUT THE ELEMENT
    nodeType:1
    nodeName:"DIV"
    attributes:[{"name":"class","value":"P6T2B6 _244GCM"}]
    ...

    LAYOUT TREE INFORMATION
    "x":868.390625,"y":81,"width":22,"height":22

    Source: https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot
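
    The normalisation step can be sketched roughly as follows. This is not the code used in the project: `extractBoxes` and the hand-written `sampleSnapshot` are hypothetical, and the sketch only assumes the documented response shape of DOMSnapshot.captureSnapshot, where node names and style values are indexes into a shared `strings` table and each layout entry points back to its node through `nodeIndex`.

```javascript
// Hypothetical helper: flatten a DOMSnapshot.captureSnapshot response into
// plain box objects. `styleNames` must be the same array that was passed as
// `computedStyles` when the snapshot was captured.
function extractBoxes(snapshot, styleNames) {
  const { documents, strings } = snapshot;
  const boxes = [];
  for (const doc of documents) {
    const { nodes, layout } = doc;
    layout.nodeIndex.forEach((nodeIdx, i) => {
      // Each bounds entry is a rectangle: [x, y, width, height]
      const [x, y, width, height] = layout.bounds[i];
      // Style values are indexes into the shared strings table
      const styles = {};
      (layout.styles[i] || []).forEach((stringIdx, j) => {
        styles[styleNames[j]] = strings[stringIdx];
      });
      boxes.push({
        nodeType: nodes.nodeType[nodeIdx],
        nodeName: strings[nodes.nodeName[nodeIdx]],
        x, y, width, height,
        styles,
      });
    });
  }
  return boxes;
}

// Tiny hand-written snapshot (illustrative data, not real crawler output)
const sampleSnapshot = {
  strings: ['DIV', 'visible', 'block'],
  documents: [{
    nodes: { nodeType: [1], nodeName: [0] },
    layout: {
      nodeIndex: [0],
      bounds: [[868.390625, 81, 22, 22]],
      styles: [[1, 2]],
    },
  }],
};

const sampleBoxes = extractBoxes(sampleSnapshot, ['visibility', 'display']);
console.log(sampleBoxes);
```

    Keeping only the fields needed for comparison (name, position, size, a few styles) makes the later box-by-box comparison independent of the raw CDP format.
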
  17. Sets of elements (nodes) for two web pages

    PAGE A
    "x":832.375,"y":84.296875,"width":16,"height":16
    "x":797.734375,"y":83.296875,"width":30.640625,"height":17
    "x":860.390625,"y":75,"width":78.890625,"height":36
    "x":868.390625,"y":81,"width":22,"height":22
    …

    PAGE B
    "x":868.390625,"y":81,"width":22,"height":22
    "x":898.390625,"y":83.28125,"width":32.890625,"height":17
    "x":943.28125,"y":75,"width":60.71875,"height":36
    "x":24,"y":181,"width":976,"height":128
    …
  18. Algorithm v0.1

    ThresholdWidth = PageWidth * X%
    ThresholdHeight = PageHeight * X%

    BoxA = "x":832.375,"y":84.296875,"width":16,"height":16
    BoxB = "x":868.390625,"y":81,"width":22,"height":22

    if BoxB coordinates are included in BoxA coordinates + thresholds
    and BoxB dimensions are included in BoxA dimensions + thresholds
    then
      Boxes are similar (add to the list of similar boxes)
    else
      Boxes are not similar

    … Continue comparing BoxA with …
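
    One possible reading of this comparison, as a minimal sketch. The slide does not say exactly how the thresholds are applied, so the symmetric tolerance around each coordinate and dimension, the `areBoxesSimilar` name, the page size and the tolerance values are all assumptions for illustration.

```javascript
// Assumption: two boxes are "similar" when each coordinate and dimension
// differs by no more than a tolerance derived from the page size.
function areBoxesSimilar(boxA, boxB, page, tolerance) {
  const thresholdWidth = page.width * tolerance;
  const thresholdHeight = page.height * tolerance;
  return (
    Math.abs(boxA.x - boxB.x) <= thresholdWidth &&
    Math.abs(boxA.y - boxB.y) <= thresholdHeight &&
    Math.abs(boxA.width - boxB.width) <= thresholdWidth &&
    Math.abs(boxA.height - boxB.height) <= thresholdHeight
  );
}

// Page size is an assumed viewport, not a value from the project
const pageSize = { width: 1024, height: 768 };
const boxA = { x: 832.375, y: 84.296875, width: 16, height: 16 };
const boxB = { x: 868.390625, y: 81, width: 22, height: 22 };

// 5% tolerance: thresholds are 51.2px (width) and 38.4px (height)
const looseMatch = areBoxesSimilar(boxA, boxB, pageSize, 0.05);
// 1% tolerance: thresholds are 10.24px and 7.68px, so the x offset is too large
const strictMatch = areBoxesSimilar(boxA, boxB, pageSize, 0.01);

console.log({ looseMatch, strictMatch });
```

    With these assumed values the two boxes from the slide match at 5% but not at 1%, which shows how strongly the chosen percentage drives the result.
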
  19. Jaccard Index

    Finally, to calculate the similarity between two pages we can use the Jaccard Index: J(A, B) = |A ∩ B| / |A ∪ B|, where A ∩ B (the intersection) is the set of similar elements on the two pages, and A ∪ B (the union) is the set of all unique elements from both pages combined.
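
    A small sketch of the page-level score, using only the four boxes listed per page on the earlier slide (the real sets are longer). The greedy one-to-one matching, the `jaccardSimilarity` name and the exact-equality predicate `sameBox` are assumptions made for illustration; in practice a tolerance-based predicate would take the place of `sameBox`.

```javascript
// Jaccard index over two lists of boxes. Each box on page B can be matched
// at most once (greedy one-to-one matching, an assumption of this sketch).
function jaccardSimilarity(boxesA, boxesB, isSimilar) {
  const usedB = new Set();
  let matched = 0;
  for (const a of boxesA) {
    const j = boxesB.findIndex((b, idx) => !usedB.has(idx) && isSimilar(a, b));
    if (j !== -1) {
      usedB.add(j);
      matched += 1;
    }
  }
  // |A ∪ B| = |A| + |B| - |A ∩ B|
  const union = boxesA.length + boxesB.length - matched;
  // Convention: two empty pages are treated as identical
  return union === 0 ? 1 : matched / union;
}

// Simplest possible predicate: boxes must be exactly equal
const sameBox = (a, b) =>
  a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;

const pageA = [
  { x: 832.375, y: 84.296875, width: 16, height: 16 },
  { x: 797.734375, y: 83.296875, width: 30.640625, height: 17 },
  { x: 860.390625, y: 75, width: 78.890625, height: 36 },
  { x: 868.390625, y: 81, width: 22, height: 22 },
];
const pageB = [
  { x: 868.390625, y: 81, width: 22, height: 22 },
  { x: 898.390625, y: 83.28125, width: 32.890625, height: 17 },
  { x: 943.28125, y: 75, width: 60.71875, height: 36 },
  { x: 24, y: 181, width: 976, height: 128 },
];

// One shared box out of seven unique boxes: 1 / 7
const pageScore = jaccardSimilarity(pageA, pageB, sameBox);
console.log(pageScore.toFixed(4));
```
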
  20. Optimisations

    After the v0.1 version, we improved the process by adding multiple optimisations:
    - Including only visible nodes
    - Background colors
    - Merging overlapping nodes
    - Considering the z-index of nodes
    - Using more performant data structures
    …and many more.
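
    As an illustration of the first item only, a visibility filter might look like the sketch below. The exact rules used in the project are not described, so the checks in the hypothetical `isVisibleBox` (non-zero size plus the `display` and `visibility` computed styles requested in the snapshot) are assumptions, and the sample boxes are made up.

```javascript
// Assumed definition of "visible": the box has a non-zero area and its
// computed styles do not hide it. Other cases (opacity, off-screen
// positioning, clipping) are deliberately not covered here.
function isVisibleBox(box) {
  const styles = box.styles || {};
  return (
    box.width > 0 &&
    box.height > 0 &&
    styles.display !== 'none' &&
    styles.visibility !== 'hidden' &&
    styles.visibility !== 'collapse'
  );
}

// Illustrative boxes, not real crawler output
const candidateBoxes = [
  { x: 24, y: 181, width: 976, height: 128, styles: { display: 'block', visibility: 'visible' } },
  { x: 0, y: 0, width: 0, height: 0, styles: { display: 'block', visibility: 'visible' } },
  { x: 10, y: 10, width: 50, height: 20, styles: { display: 'block', visibility: 'hidden' } },
];

const visibleBoxes = candidateBoxes.filter(isVisibleBox);
console.log(visibleBoxes.length);
```

    Filtering before comparison shrinks both sets, which helps the score reflect what users actually see and reduces the number of pairwise comparisons.
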
  21. We delivered this and…

    While we can't share many details, the company was highly satisfied with both the process and the outcomes. The resulting text and visual similarity metrics were integrated into the annual business goals as control metrics. The maximum visual similarity threshold in this case was around 40%, but this may vary depending on the websites, the types of pages, and the goals.
  22. Other people came up with similar ideas!

    While updating these slides, I found that a team of researchers from Harbin Institute of Technology, Harbin, China and Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China came up with a similar method for detecting phishing websites.
    Source: https://www.researchgate.net/publication/336377602_Algorithm_of_web_page_similarity_comparison_based_on_visual_block
  23. Additional use cases (1)

    1:1 website migrations: Ensure pages are migrated correctly and remain visually identical (similarity should be 100% across the domains).
  24. Additional use cases (2)

    Above-the-Fold content analysis: Verify that key content is visible and prioritised above the fold.
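
    A rough sketch of how box coordinates could support this check. The `aboveTheFoldShare` helper, the 768px viewport height and the sample boxes are assumptions for illustration; the talk does not describe an implementation for this use case.

```javascript
// Share of a box's height that falls inside the initial viewport
// (0 = entirely below the fold, 1 = entirely above it).
function aboveTheFoldShare(box, viewportHeight) {
  const visibleTop = Math.max(box.y, 0);
  const visibleBottom = Math.min(box.y + box.height, viewportHeight);
  const visibleHeight = Math.max(visibleBottom - visibleTop, 0);
  return box.height > 0 ? visibleHeight / box.height : 0;
}

const foldHeight = 768; // assumed viewport height in pixels

const fullyAbove = aboveTheFoldShare({ x: 24, y: 181, width: 976, height: 128 }, foldHeight);
const straddling = aboveTheFoldShare({ x: 24, y: 700, width: 976, height: 200 }, foldHeight);
const fullyBelow = aboveTheFoldShare({ x: 24, y: 900, width: 976, height: 100 }, foldHeight);

console.log({ fullyAbove, straddling, fullyBelow });
```

    A key element could then be flagged when its share drops below a chosen cut-off, which would need to be set per template and device size.
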
  25. Additional use cases (3)

    Intrusive interstitials and dialogs: Identify whether web pages have intrusive interstitials and dialogs that may interfere with search engines' understanding of the content.
    Image source: https://developers.google.com/search/docs/appearance/avoid-intrusive-interstitials
  26. Additional use cases (4)

    Web page element rendering verification: Confirm that web pages rendered by search engines or other systems successfully handle and position specific elements as intended, detecting misalignments or unexpected behaviours.
  27. We’re looking for experienced technical SEO consultants.

    If you’d like to discuss coming to work at Merj… https://www.linkedin.com/in/ryansiddle/
  28. Thank you for your time and attention!

    MERJ Ltd, 7 Pancras Square, London, N1C 4AG
    +44 (0) 20 3322 2660
    merj.com