

Marketing OGZ
September 17, 2025

 Dataprovider-The_World_of_Web_Crawling___And_AI_s_Groundbreaking_Impact.pdf


Transcript

  1. What I’m going to talk about today
     • What is web crawling, and how does it work
     • Which types of organisations can benefit from web crawled data and how
     • The role of web crawling in the rapidly evolving world of AI
     • How AI has transformed this space completely
  2. What is Web Crawling?
     • Automated programs that browse and collect data from multiple websites
     • Web crawling is the collection step, indexing makes it searchable
     • ‘Like someone flipping through all the books in the library, and leaving notes for the librarian’
     • Starts at one URL, follows links to discover more
     • Crawlers use algorithms and obey rules so they don’t overwhelm sites or crawl private areas
     • Powers search engines, web archiving, content analysis
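The crawl loop described on this slide (start at one URL, follow links, obey the site's rules) can be sketched roughly as follows. This is a minimal illustration, not the speaker's crawler: the `PAGES` dictionary stands in for the real web so the sketch runs offline, and the host `example.test`, the user agent `demo-crawler`, and the robots rules are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "https://example.test/": '<a href="/about">About</a> <a href="/private/admin">Admin</a>',
    "https://example.test/about": '<a href="/">Home</a> <a href="/contact">Contact</a>',
    "https://example.test/contact": "<p>No links here</p>",
    "https://example.test/private/admin": '<a href="/secret">Secret</a>',
}
# Hypothetical robots.txt rules: keep crawlers out of /private/.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    """Breadth-first crawl from `seed`, skipping URLs that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)
    queue, seen, visited = deque([seed]), {seed}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("demo-crawler", url):
            continue  # respect the site's rules
        html = PAGES.get(url)
        if html is None:
            continue  # page does not exist in the toy web
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("https://example.test/"))
```

A production crawler would also fetch pages over HTTP, rate-limit requests per host so it does not overwhelm sites, and hand the collected pages to a separate indexing step.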
  3. The Internet in numbers
     • ~372 million domains registered
     • ~160 million actively used
     • ~50 million company websites
     • 90% of those are small businesses
     • ~1 to 1.5 billion subdomains
     • ~400 trillion pages total
     • ~50 to 400 billion pages indexed
  4. How web crawled data can be beneficial
     • Governments
     • Statistical Offices
     • Hedge Funds
     • Domain Registrars and Registries
     • Brand Protection
     • Payments
     • Cyber Security
     • Business Information
     • Lead Generation
  5. Dedicated use case: On Brand Protection
     • Helping well-known fashion brands remove illegitimate websites that sell their goods or scam buyers
     • Bad actors create multiple websites that sell counterfeit goods or scam the buyer
     • AI has rapidly increased the number of these types of stores on the web
  6. Dedicated use case: Second example
     • Finding domains identical or confusingly similar to existing trademarks
     • These domains exploit the brand for phishing or scamming, divert web traffic, or are offered for sale back to the trademark owner
     • Structured web crawled data in combination with data science can find these domains
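One simple way to surface domains that are identical or confusingly similar to a trademark is to compare each domain label with the brand name using edit distance. The sketch below assumes that approach; the talk does not say which method is actually used, and the brand `acmeshoes`, the domain list, and the threshold of 2 are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def main_label(domain):
    """Naive: the label before the last dot. Real code would consult the
    Public Suffix List to handle suffixes such as .co.uk."""
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def flag_lookalikes(brand, domains, max_distance=2):
    """Return (domain, distance) for labels close to, or containing, the brand."""
    flagged = []
    for domain in domains:
        label = main_label(domain)
        distance = edit_distance(brand, label)
        if distance <= max_distance or brand in label:
            flagged.append((domain, distance))
    return flagged


# Hypothetical crawled domains for a hypothetical brand.
DOMAINS = [
    "acmeshoes.shop",        # identical label on another TLD
    "acmesh0es.com",         # character swap
    "acme-shoes.net",        # inserted hyphen
    "acmeshoes-outlet.com",  # brand embedded in a longer label
    "gardening.org",         # unrelated
]

for domain, distance in flag_lookalikes("acmeshoes", DOMAINS):
    print(domain, distance)
```

Edit distance alone misses some tricks, such as visually similar Unicode characters, so in practice it would be combined with other signals from the crawled data, for example page content or registration dates.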
  7. Dedicated use case: For Hedge Funds
     • The rapid growth of AI generated website builders
     • Since the rise of AI in 2023, regular website builders have made way for their AI substitutes
     • Hedge funds are monitoring these developments closely, and invest in the early stages of that growth
     • Structured web crawled data can look under the hood at instalments
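As an illustration of looking "under the hood", one common signal for the tool behind a site is the `<meta name="generator">` tag that many site builders write into the page head. The sketch below assumes that signal; the talk does not describe how builders are actually detected, and the builder names and pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class GeneratorParser(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            self.generator = attr.get("content")


def detect_generator(html):
    """Return the generator string for a page, or None if absent."""
    parser = GeneratorParser()
    parser.feed(html)
    return parser.generator


def count_builders(pages):
    """Count how many crawled pages report each generator."""
    return Counter(detect_generator(html) or "unknown" for html in pages.values())


# Hypothetical crawled pages with hypothetical builder names.
PAGES = {
    "shop-one.example": '<html><head><meta name="generator" content="ExampleAIBuilder"></head></html>',
    "shop-two.example": '<html><head><meta name="Generator" content="ExampleAIBuilder" /></head></html>',
    "blog.example": '<html><head><meta name="generator" content="ClassicSiteMaker"></head></html>',
    "plain.example": "<html><head><title>No hint</title></head></html>",
}

print(count_builders(PAGES))
```

Repeating such a count across crawls taken at different dates would give a growth curve per builder. Not every builder emits a generator tag, so this signal would normally be one of several.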
  8. Challenges
     • Overwhelming volume of data, hard to extract insights
     • Manual work still needed to combine results
     • Reliance on unstructured data reduces accuracy
     • Outcomes are often ‘best guesses’ rather than precise insights
     • Time-consuming to process at scale
     • Asking the right questions, drawing the right conclusions
  9. “The success of AI companies, and AI applications, is determined by access to quality underlying data.”
  10. The role of web crawled data in AI
     • Models learn how people write and communicate
     • Models learn about factual knowledge or events that are happening
     • Models learn about perspectives of people
     • We’re in the middle of the race between LLMs towards the highest quality data
     • Google’s AI summaries
     • X’s Grok
     • OpenAI & Reddit
     • Meta & illegally obtained ebooks
  11. How AI has transformed web crawling
     • With MCP, you can expose structured sources, like web crawled data, directly to AI
     • ‘Librarian’s assistant, who gathers insights from organised books, hands them to the LLM for reasoning, brings back clear results to the reader’
     • You can ‘talk to the data’ and get human-like responses
     • AI turns raw numbers into clear summaries, charts, and takeaways
     • Less manual combining, faster decision-making
  12. Translating data to insights
     • Crawling and indexing the web produces huge amounts of data
     • On the right you see a subset of web crawled data, classified under a few random data points
     • AI now helps to draw conclusions and create visuals from the already existing structured web data
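To make the step from structured records to a summary concrete, here is a small sketch of the kind of aggregation that structured crawl data permits before any AI model is involved. The field names (`country`, `has_webshop`) and the records are hypothetical and are not taken from the slide.

```python
from collections import Counter

# Hypothetical structured records, one per crawled website.
RECORDS = [
    {"domain": "a.example", "country": "NL", "has_webshop": True},
    {"domain": "b.example", "country": "NL", "has_webshop": False},
    {"domain": "c.example", "country": "DE", "has_webshop": True},
    {"domain": "d.example", "country": "NL", "has_webshop": True},
]


def summarise(records):
    """Per country: number of sites and the share that run a webshop."""
    sites = Counter(r["country"] for r in records)
    shops = Counter(r["country"] for r in records if r["has_webshop"])
    return {
        country: {"sites": n, "webshop_share": shops[country] / n}
        for country, n in sites.items()
    }


for country, stats in summarise(RECORDS).items():
    print(f"{country}: {stats['sites']} sites, "
          f"{stats['webshop_share']:.0%} with a webshop")
```

Because the inputs are structured, the figures are exact counts rather than estimates. A language model could then be given this compact summary to phrase conclusions or suggest charts.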
  13. Key takeaways
     • Web crawling is beneficial for businesses across multiple industries
     • Crawled data is a major training source for LLMs
     • Heavy reliance on unstructured data can reduce accuracy
     • AI + structured data unlocks more reliable and actionable insights
  14. Thank you very much for your attention!
     • We’re beyond excited to show you what’s possible
     • If you’re interested in hearing more, we’re at booth #165