Baruch Toledano - Behind the Scenes of Building Scalable Data Products for SEOs

Tech SEO Connect

December 12, 2025

Transcript

  1. Baruch Toledano, VP of Solutions at Similarweb

    Behind the scenes of building scalable data products for SEO: how to make complex SEO data usable. December 4-5, 2025, Durham, NC
  2. Scalable SEO Data Products, Similarweb

    LLMs, tools, products and reports are only as good as the data they are trained on.
  3. There’s misalignment everywhere: GA vs GSC

    • Ratio < 1: GA (MMX) shows more visits.
    • Ratio ≈ 1: GSC and GA are aligned.
    • Ratio > 1: GSC shows more clicks.
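The three cases on this slide can be read as a simple ratio check. The sketch below assumes the ratio is GSC clicks divided by GA visits (which matches "Ratio > 1: GSC shows more clicks"); the function name and the 10% tolerance band for "≈ 1" are illustrative assumptions, not values from the talk.

```python
def classify_alignment(gsc_clicks: float, ga_visits: float, tolerance: float = 0.1) -> str:
    """Label a page or site by how GSC clicks compare with GA visits.

    `tolerance` is a hypothetical band around 1.0 that counts as "aligned".
    """
    if ga_visits <= 0:
        raise ValueError("ga_visits must be positive to compute a ratio")
    ratio = gsc_clicks / ga_visits
    if abs(ratio - 1.0) <= tolerance:
        return "aligned"
    return "ga_higher" if ratio < 1.0 else "gsc_higher"


# Made-up sample numbers, one per case on the slide.
labels = [
    classify_alignment(80, 100),   # ratio 0.8
    classify_alignment(100, 100),  # ratio 1.0
    classify_alignment(150, 100),  # ratio 1.5
]
```

Running the check per URL or per keyword group is one way to find where the two sources disagree most.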
  4. Modern SERPs are a moving target

    Google now redraws its interface every few weeks.
    • Shifting pixel positions: shift by device, test cohort or even geography
    • Broken feature detection: when Google changes markup, layout or snippet structures
    • Unpredictable code changes: class names and DOM patterns change unpredictably
  5. Triangulating ‘the truth’ with different data sources

    [Diagram labels: Learning Set, Model, Panel Data]
  6. Let’s go behind the scenes…

    Techniques to adjust, adapt, train, extract and validate data to get the results you need.
  7. Challenge 1: Zero-click broke the model

    If traffic becomes a maybe:
    ❖ CTR is no longer a stable metric
    ❖ Organic share is harder to estimate
    ❖ Content investment vs ROI becomes fuzzy
    → We’re now dealing with missing data: ‘dark spaces’ in datasets.
  8. Different SERP structure affects user behavior

    [Chart: zero-click distribution, axis from 0 to 1, for Images, Instant answers and Expanded Sitelinks]
  9. Solution: SERP modelling that adapts to reality

    • Adaptive attribution: our orchestrator traffic model not only accounts for the percentage contribution of each item to zero clicks, but also manages the changes between the items.
    • Real-user verification: we verify this further against what people are actually clicking, and we can adjust as new items are introduced and behavior shifts.
    • Better estimation: we can better estimate the clicks associated with each keyword -> URL pair.
  10. Challenge 2: Data governance and modelling Lines of Business (LoB)

    Finding the balance between quality and expense of processing
  11. Splitting Walmart’s LoB

    Need to split Walmart into Lines of Business (LoB): classify Walmart’s pages by LoB and estimate traffic separately. Classify 100M pages every day.
  12. It relies on page classification

    Most websites have explainable URLs! Apparently, good SEO optimization at Walmart.
  13. End2End LLM

    URL classified to LoB by LLM classification on raw data. [Scorecard: Accuracy, Cost, Speed]
  14. Funnel of optimizations: LLM Embedding → Traditional Classifier

    Classifier: cosine similarity between the URL embedding and each LoB embedding. Example: walmart.com/cp/BabyKids_KidsClothing compared against the LoBs Kids Clothing, Kids Rooms and Kids Food.
    Started with OpenAI embeddings as a baseline:
    • Cheaper and easier to create.
    • Does not require heavy prompt engineering.
  15. Funnel of optimizations: LLM Embedding → Traditional Classifier

    Classifier: cosine similarity between the URL embedding and each LoB embedding. [Scorecard: Accuracy, Cost, Speed]
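The cosine-similarity classifier on these slides can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the three-dimensional vectors are made-up stand-ins for real embeddings (which would come from an embedding model), and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_url(url_embedding, lob_embeddings):
    """Return the LoB whose embedding is most similar to the URL embedding."""
    return max(
        lob_embeddings,
        key=lambda lob: cosine_similarity(url_embedding, lob_embeddings[lob]),
    )


# Toy vectors only; real embeddings have hundreds or thousands of dimensions.
LOB_EMBEDDINGS = {
    "Kids Clothing": [0.9, 0.1, 0.0],
    "Kids Rooms": [0.1, 0.9, 0.1],
    "Kids Food": [0.0, 0.2, 0.9],
}
url_embedding = [0.8, 0.2, 0.1]  # pretend embedding of the Kids Clothing URL
predicted_lob = classify_url(url_embedding, LOB_EMBEDDINGS)
```

Because the LoB embeddings are computed once and reused, the per-URL cost is one embedding call plus a handful of dot products, which is where the cost and speed gain over an end-to-end LLM call comes from.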
  16. Distillations, part 1

    Pipeline: Data → Teacher Model → Learning Set → Student Model (Fine Tuned SLM Embedding → Traditional Classifier).
    In our use case, we treat the teacher model as a labeler that builds a learning set, and we use this learning set as the input to the smaller model.
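The teacher-as-labeler flow can be illustrated with a toy example. Everything below is a stand-in: `teacher_label` is a hypothetical keyword rule playing the role of the expensive teacher LLM, the example.com URLs are invented, and the student is a simple token-count classifier rather than the fine-tuned SLM used in the talk.

```python
import re
from collections import Counter, defaultdict


def teacher_label(url: str) -> str:
    """Hypothetical stand-in for an expensive teacher-LLM call."""
    return "Kids Clothing" if "clothing" in url.lower() else "Kids Food"


def tokenize(url: str):
    """Split a URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", url.lower())


def build_learning_set(urls):
    """The teacher labels raw URLs once; the pairs become the learning set."""
    return [(url, teacher_label(url)) for url in urls]


def train_student(learning_set):
    """Student model: per-label token counts learned from teacher labels."""
    counts = defaultdict(Counter)
    for url, label in learning_set:
        counts[label].update(tokenize(url))
    return counts


def student_predict(model, url: str) -> str:
    """Pick the label whose training tokens overlap most with the URL."""
    tokens = tokenize(url)
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))


urls = [
    "example.com/kids/clothing-shirts",
    "example.com/kids/clothing-jackets",
    "example.com/kids/food-snacks",
    "example.com/kids/food-cereal",
]
student = train_student(build_learning_set(urls))
prediction = student_predict(student, "example.com/shop/jackets-sale")
```

The point of the pattern is that the teacher runs only while building the learning set; all daily traffic goes through the cheap student.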
  17. Distillations: Fine Tuned SLM Embedding → Traditional Classifier

    LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning LLMs by injecting lightweight trainable matrices alongside the existing weights. Instead of updating all model parameters, LoRA trains only a small set of added parameters (~0.5%), drastically reducing compute and memory requirements.
    LoRA advantages:
    • Reduces compute and memory usage
    • Faster fine-tuning
    • Preserves the base model
    • LoRA adapters are small and easily swappable
    • Cost-effective
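To make the "small fraction of parameters" claim concrete, here is a dependency-free sketch of the LoRA idea: the frozen weight matrix W is left untouched and a low-rank product B·A is added to its output. The layer size (4096 × 4096) and rank (8) are illustrative assumptions, not figures from the talk.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters (A is rank x d_in, B is d_out x rank)
    as a fraction of the frozen d_out x d_in weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, scale: float = 1.0):
    """y = W x + scale * B (A x); only A and B would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [p + scale * q for p, q in zip(base, update)]


fraction = lora_trainable_fraction(4096, 4096, 8)  # 0.00390625, about 0.4%

# Tiny rank-1 example: W is the 2x2 identity, A is 1x2, B is 2x1.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b = [[0.5], [0.0]]
y = lora_forward(w, a, b, [2.0, 3.0])  # [4.5, 3.0]
```

Swapping adapters amounts to swapping the small A and B matrices while the base weights stay shared, which is why the slide calls them small and easily swappable.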
  18. Distillations: Fine Tuned SLM Embedding → Traditional Classifier

    The triplet loss function minimizes the distance between an anchor and a positive embedding, and maximizes the distance between the anchor and a negative embedding. We used a “Class” variation of triplet loss.
    [Illustration: example embedding vectors for apple.com/iphone-16-pro, Phones and Shoes]
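For reference, the standard triplet loss can be written as max(0, d(anchor, positive) − d(anchor, negative) + margin). The sketch below implements that textbook form with Euclidean distance; the “Class” variation mentioned on the slide is not specified in the deck, so it is not reproduced here, and the margin value and sample vectors are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Standard triplet loss with Euclidean distance.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]         # e.g. a URL embedding
positive = [0.0, 1.0]       # embedding of the correct class
easy_negative = [3.0, 4.0]  # far away: already well separated
hard_negative = [1.0, 0.0]  # as close as the positive: still penalized

easy = triplet_loss(anchor, positive, easy_negative)  # 0.0
hard = triplet_loss(anchor, positive, hard_negative)  # 1.0
```

Training on this loss pulls a URL's embedding toward its own class and pushes it away from other classes, which is what makes the downstream cosine-similarity classifier work well.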
  19. Distillations: Fine Tuned SLM Embedding → Traditional Classifier

    Classifier: cosine similarity between the fine-tuned SLM embedding of the URL and the fine-tuned SLM embedding of each LoB. [Scorecard for the fine-tuned SLM: Accuracy, Cost, Speed]
  20. The problem of data gravity

    Large datasets attract complexity:
    • A pipeline that starts simple becomes a Rube Goldberg machine.
    • Every new insight multiplies the computing cost.
    • SEOs request just one more slice; suddenly you’re running terabyte-scale joins.
  21. AI-driven search rewrites the rules

    Generative answers:
    • Collapse multiple docs into one output
    • Change what visibility means
    • Reduce individual URL exposure
    • Introduce new ‘ranking layers’ in the LLM
    Measuring this requires:
    • Extracting LLM answers
    • Identifying source attribution
    • Modelling visibility after answer blocks
    • Detecting hallucinations
    • Tracking position within the AI module
    SEO data now requires reverse-engineering an LLM’s reasoning layer.
  22. From Search to AI Search

    [Screenshots: Results, AI Overview Results] From Search Engine Optimization (SEO) to AI Engine Optimization (AEO)
  23. AIO Brand Extraction

    Goal: Evaluate appearance of brands in chat results. Task: Extract brands from LLM responses.
  24. AIO Brand Extraction: how much data do we have?

    100,000,000,000s of tokens
    • ~33% of search queries return an AI Overview result
    • Running on 100+ million AIO paragraphs monthly (we need this for another model)
    • Each AI Overview analysis is about 1,000 tokens
    • No option for caching: each AI Overview result/question is unique
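The headline token figure follows from the two numbers on this slide: roughly 100 million AIO paragraphs per month at about 1,000 tokens each. The short calculation below makes that explicit; the cost helper and its price-per-million-tokens argument are hypothetical, since no prices are given in the deck.

```python
PARAGRAPHS_PER_MONTH = 100_000_000  # "100+ million AIO monthly paragraphs"
TOKENS_PER_ANALYSIS = 1_000         # "each AI overview analysis is of 1000 tokens"

monthly_tokens = PARAGRAPHS_PER_MONTH * TOKENS_PER_ANALYSIS  # 100,000,000,000


def monthly_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of processing `tokens` at a given (hypothetical) unit price."""
    return tokens / 1_000_000 * price_per_million_tokens


# Placeholder price purely to show the shape of the calculation.
example_cost = monthly_cost(monthly_tokens, price_per_million_tokens=2.0)
```

At this volume even small differences in unit price dominate the budget, which motivates the move to cheaper models in the slides that follow.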
  25. End2End Best-in-class LLM

    Using an LLM for brand extraction from AIO text:
    • Good results
    • Very expensive
    • Latency: cannot be done on the fly
    Solutions:
    • Run in batch and only serve results on the fly
    • Cheaper models
  26. AIO Brand Extraction – context

    Model Output: ["Jaguar", "BWM", "Honda", "Tesla", "Jeep"]
  27. AIO Brand Extraction – context

    Class A Model Output: ["Toyota", "Honda", "Mazda", "Subaru", "Kia"]
  28. AIO Brand Extraction – context

    Model Output: ["Tesla", "Jeep", "X", "Ford"]
  29. AIO Brand Extraction – context

    Class A- Model Output: The brands mentioned in the provided text are: 1. **Toyota** Camry and Corolla 2. **Honda** Accord, Civic and Fit 3. **Mazda** Mazda3 Here is the information in JSON format: ```json { "brands": [ { "name": "Toyota", "models": ["Camry", "Corolla"] }, { "name": "Honda", "models": ["Accord", "Civic",
  30. Phase 2 - End2End Cheap LLM

    • Best results with Model Class A
    • Downgrade to Model Class A-
    • Evaluate using LLM as Judge + Eval Tools
    • Prompt optimization using LLM
  31. Phase 2 - End2End Cheap LLM

    • Best results with Model Class A
    • Downgrade to Model Class A-
    • Evaluate using LLM as Judge + Eval Tools
    • Prompt optimization using LLM
    • Evaluate cost
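One simple complement to an LLM judge is a set-based comparison of the cheaper model's brand list against a reference list from the stronger model. This is a swapped-in, illustrative metric, not the evaluation tooling used in the talk. The lists reuse brand names from the earlier slides purely as sample data; that they answer the same input text is an assumption.

```python
def precision_recall(candidate, reference):
    """Case-insensitive set precision and recall of `candidate` vs `reference`."""
    cand = {brand.casefold() for brand in candidate}
    ref = {brand.casefold() for brand in reference}
    overlap = len(cand & ref)
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


reference = ["Toyota", "Honda", "Mazda", "Subaru", "Kia"]  # stronger-model output
candidate = ["Toyota", "Honda", "Mazda"]                   # cheaper-model output
precision, recall = precision_recall(candidate, reference)  # 1.0, 0.6
```

Tracking both numbers across prompt revisions shows whether a prompt change makes the cheap model miss brands (recall) or invent them (precision).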
  32. Phase 2 - End2End Cheap LLM

    • Length of prompt vs. model quality
      ◦ Cheap models need a better prompt: zero/single-shot training examples
    [Prompt excerpt shown on the slide; the full prompt template appears on slide 34.]
  33. Phase 2 - End2End Cheap LLM

    • Length of prompt vs. model quality
      ◦ Cheap models need a better prompt: zero/single-shot training examples
      ◦ Define a strict and short output
      ◦ Balance the number of tasks vs. context loss in a single prompt
    [Prompt excerpt shown on the slide, including the Output Rules section; the full prompt template appears on slide 34.]
  34. Phase 2 - End2End Cheap LLM

    • Length of prompt vs. model quality
      ◦ Cheap models need a better prompt: zero/single-shot training examples
      ◦ Define a strict and short output
      ◦ Balance the number of tasks vs. context loss in a single prompt
    • Brand extraction use case:
      ◦ Used Nova Lite: 12x cheaper than Nova Pro

    PROMPT_TEMPLATE = """You are an expert brand name extractor. Follow these principles and rules to analyze the text and return a JSON list of all commercial brand names found.

    ---

    ### **Core Principles**

    * **1. Extract Brands, Not Products:** Your primary goal is to identify the parent commercial brand.
        * **Isolate Parent Brands:** If a product name contains a brand (e.g., "Adobe Acrobat", "Microsoft Office"), extract only the parent brand ("Adobe").
        * **Exclude Creative Works:** Do not extract creative works like movies, TV shows, Youtube channels, video games, music artists, or songs.
    * **2. Commercial Only:** The extracted name must be a commercial entity.
        * **Exclude Non-Commercial Entities:** Do not extract proper nouns like people or government agencies.
        * **Exclude Generic Nouns:** Do not extract unbranded, generic nouns ("laptop").
    * **3. Standardize for Consistency:** Brand names should be cleaned and formalized.
        * **Formalize Names:** Convert brand names to their most common, formal version. Expand abbreviations and contextual references (e.g., "VW" becomes "Volkswagen").

    ### **Output Rules**

    * **1. Unique Entries Only:** Each unique brand must appear only once in the final list.
    * **2. JSON Array Format:** Return only a JSON list of strings. If no brands are found, return an empty list `[]`.
    * **3. No Explanations:** Do not include any reasoning or text outside of the JSON list.

    ---

    **Examples:**

    1. **Text:** "For the movie night, we watched The Avengers, a film by Marvel Studios, on our Samsung TV."
       **Output:** ["Marvel Studios", "Samsung"]
    2. **Text:** "Shai Gilgeous-Alexander primarily wears his signature shoe, the Converse SHAI 001, to release in Fall 2025."
       **Output:** ["Converse"]
    3. **Text:** "The new Adidas collection features updates to their classic shoes: the Superstar, Stan Smith, and Gazelle."
       **Output:** ["Adidas"]
    4. **Text:** "China developed a new power plant according to nytimes.com."
       **Output:** ["New York Times"]
    5. **Text:** "He drives a classic VW Beetle."
       **Output:** ["Volkswagen"]
    6. **Text:** "To edit the PDF, I need to install Adobe Acrobat."
       **Output:** ["Adobe"]
    7. **Text:** "The local market sells a variety of fresh fruits and vegetables."
       **Output:** []
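The Output Rules in this prompt (a bare JSON list of strings, unique entries, no surrounding text) are easy to enforce on the receiving side. The sketch below is one possible validator, not code from the talk: `parse_brand_list` is a hypothetical helper that rejects chatty responses like the Class A- output shown earlier and de-duplicates case-insensitively.

```python
import json


def parse_brand_list(raw: str) -> list[str]:
    """Validate a model response against the prompt's Output Rules.

    Raises ValueError if the response is not a bare JSON list of strings.
    """
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError("expected a JSON list of strings")

    seen = set()
    unique = []
    for brand in data:
        name = brand.strip()
        key = name.casefold()
        if name and key not in seen:  # keep first spelling, drop repeats
            seen.add(key)
            unique.append(name)
    return unique


brands = parse_brand_list('["Marvel Studios", "Samsung", "samsung"]')
```

Responses that fail validation can be counted as errors in evaluation or retried, which keeps malformed output from reaching downstream reports.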
  35. There’s a lot of noise right now.

    Clickstream data helps you focus on reality at macro and micro levels.
  36. Clickstream data can show you if and when to focus on a new channel or engine…

    AI Mode: first chatbot to reach 100m users in the US.
    [Chart: Volume of visits to Google AI Mode, USA, January 2025 - July 2025; y-axis ticks at 50m, 100m, 150m]
  37. Clickstream data can tell a richer story about your audience

    [Chart: Number of days active on AI Mode, US AI Mode users; x-axis: number of active days on AI Mode, 1 to 5]
  38. Adoption of new models

    Prompt volume continues to increase steadily over time as adoption of ChatGPT grows organically. ChatGPT 4o declines, while adoption of ChatGPT 5 and its sub-models increases.
    [Chart: ChatGPT 5 prompt volume is growing]
  39. And how people use models

    Since June 2025, file attachment has reached an unprecedented stage:
    • ~200 unique file attachment types
    • 46% increase in picture & image search from early June to today, and growing
    [Chart: Prompt volume containing images is growing]
  40. Clickstream data doesn’t come without challenges…

    To ensure clickstream data drives valuable insights, we need to get answers on your data’s:
    • Representation
    • Consistency & Reliability
    • Integration & Context
    • Segmentation & Granularity
    • Timeliness
    • Security & Privacy
  41. Representation: Panel Bias Fixed

    A bias correction is needed for each user, i.e. the adjustment that “translates” their events into the estimated events. [Illustration: per-user multipliers X2, X2, X4]
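The per-user correction can be pictured as a weight applied to each panelist's observed events. The sketch below uses the X2, X2, X4 multipliers from the slide; the event counts and the function name are made-up for illustration, and how the weights are actually derived is not covered in the deck.

```python
def estimate_events(panel):
    """Sum of observed events scaled by each user's bias-correction weight.

    `panel` is a list of (observed_events, weight) pairs.
    """
    return sum(events * weight for events, weight in panel)


# (observed events, correction weight); weights mirror the X2, X2, X4 on the slide.
panel = [(10, 2), (5, 2), (3, 4)]
estimated_total = estimate_events(panel)  # 10*2 + 5*2 + 3*4 = 42
```

A user from an under-represented segment gets a larger weight, so their few observed events stand in for more of the real population.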
  42. A good panel is powered by learning sets and modelling

    [Diagram labels: Panel, LS, Models]
  43. Choose the right data and the right way to access it for your use case

    • Platform
    • Build/AI Agents
    • MCP