
Talk To The Spider

Dawn Anderson
December 04, 2015

Transcript

  1. TALK TO THE SPIDER – Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas, And How To Train Them. Dawn Anderson @dawnieando
  2. THE KEY PERSONAS – 9 types of Googlebot.
     Supporting roles: the Indexer / Ranking Engine, the URL Scheduler, the History Logs, the Link Logs and the Anchor Logs.
  3. GOOGLEBOT'S JOBS
     'Ranks nothing at all.' Takes a list of URLs to crawl from the URL Scheduler; its job varies based on 'bot' type.
     Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs.
     Makes notes of outbound linked pages and additional links for future crawling.
     Takes note of 'hints' from the URL Scheduler when crawling.
     Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (a binary-data equivalent of the web content) for comparison with past visits by the history and link logs.
  4. ROLES – MAJOR PLAYERS – A 'BOSS': THE URL SCHEDULER
     Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.
     Schedules Googlebot visits to URLs and decides which URLs to 'feed' to Googlebot.
     Uses data from the history logs about past visits and assigns visit regularity of Googlebot to URLs.
     Drops 'hints' to Googlebot to guide it on types of content NOT to crawl, and excludes some URLs from schedules.
     Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits.
     Checks 'page importance' when scheduling visits.
     Assigns URLs to 'layers / tiers' for crawling schedules.
  5. TOO MUCH CONTENT
     The indexed web contains at least 4.73 billion pages (13/11/2015).
     [Chart: total number of websites, 2000–2014, rising towards 1,000,000,000.]
     Since 2013 the web is thought to have increased in size by a third.
  6. TOO MUCH CONTENT – How have search engines responded?
     There are capacity limits on Google's crawling system, so Google responds by prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work 'schedules' for Googlebots.
  7. GOOGLE CRAWL SCHEDULER PATENTS
     Include: 'Managing items in a crawl schedule', 'Scheduling a recrawl', 'Web crawler scheduler that utilizes sitemaps from websites', 'Document reuse in a search engine crawler', 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' and 'Scheduler for search engine'.
  8. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
     Three layers / tiers (illustrated in the sketch below):
     Real Time Crawl – crawled multiple times daily.
     Daily Crawl – crawled daily or bi-daily.
     Base Layer Crawl – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled.
     URLs are moved in and out of layers based on past visits data.
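     As an illustration only, here is a minimal sketch of that layer structure. The URLs, segment count and per-cycle cadence are invented for the example, not values from the patent.

        # Illustrative sketch of the three-tier crawl schedule described above.
        # URLs, segment sizes and cadence are made up for the example.
        from itertools import cycle

        real_time = ["/live-scores"]            # crawled multiple times daily
        daily = ["/news", "/offers"]            # crawled daily or bi-daily
        base_segments = [                       # base layer, split into rotating segments
            ["/about", "/old-post-1"],
            ["/old-post-2", "/old-post-3"],
            ["/archive-2012"],
        ]
        active = cycle(base_segments)           # round robin: one 'active' segment per cycle

        def urls_for_one_cycle():
            # URLs the scheduler would hand to Googlebot in one crawl cycle
            return real_time * 3 + daily + next(active)

        for day in range(3):
            print(f"cycle {day}:", urls_for_one_cycle())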
  9. GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
     The URL Scheduler controls the meal planner and carefully controls the list of URLs Googlebot visits.
     The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'.
     'Budgets' are allocated.
  10. CRAWL BUDGET
     What is a crawl budget? An allocation of 'crawl visit frequency' apportioned to the URLs on a site.
     It is apportioned by the URL Scheduler to Googlebots, and is roughly proportionate to page importance (link equity) and speed.
     Pages with a lot of healthy links get crawled more (can this include internal links?).
     But there are other factors affecting the frequency of Googlebot visits aside from importance and speed, and the vast majority of URLs on the web don't get a lot of budget allocated to them.
  11. HINTS & CRITICAL MATERIAL CONTENT CHANGE
     The change score is a weighted sum of change 'features':
     C = Σ (i = 0 to n−1) weight_i × feature_i
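     A worked example of that weighted sum follows; the feature names and weights are hypothetical, chosen only to show how a change score could be computed.

        # Hypothetical illustration of the weighted change score C above; the
        # feature names and weights are made up, not taken from the patent.
        weights  = {"body_text_changed": 0.6, "prices_changed": 0.3, "boilerplate_changed": 0.1}
        features = {"body_text_changed": 1.0, "prices_changed": 1.0, "boilerplate_changed": 0.0}

        # C = sum over i of weight_i * feature_i
        C = sum(weights[name] * features[name] for name in weights)
        print(f"critical material content change score C = {C:.2f}")  # 0.90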
  12. POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
     The current capacity of the web crawling system is high.
     Your URL is 'important'.
     Your URL is in the real time layer, the daily crawl layer or the 'active' base layer segment.
     Your URL changes a lot, with critical material content change.
     The probability and predictability of critical material content change is high for your URL.
     Your website speed is fast and Googlebot gets the time to visit your URL.
     Your URL has been 'upgraded' to a daily or real time crawl layer.
  13. NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
     The current capacity of the web crawling system is low.
     Your URL has been detected as a 'spam' URL.
     Your URL is in an 'inactive' base layer segment.
     Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content.
     The probability and predictability of critical material content change is low for your URL.
     Your website speed is slow and Googlebot doesn't get the time to visit your URL.
     Your URL has been 'downgraded' to an 'inactive' base layer segment.
     Your URL has returned an 'unreachable' server response code recently.
  14. IT'S NOT JUST ABOUT 'FRESHNESS'
     It's about the probability and predictability of future 'freshness', based on data from the history logs.
     How can we influence them to escape the base layer?
  15. CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER: LIKES & DISLIKES
     LIKES: going 'where the action is' in sites; the 'need for speed'; logical structure; correct 'response' codes; XML sitemaps; successful crawl visits; 'seeing everything' on a page; taking 'hints'; clear, unique, single 'URL fingerprints' (no duplicates); predicting the likelihood of 'future change'.
     DISLIKES: slow sites; too many redirects; being bored ('hints' are built in by the search engine systems, and Googlebot takes those 'hints'); being lied to (e.g. on XML sitemap priorities); crawl traps and dead ends; going round in circles (infinite loops); spam URLs; crawl-wasting minor-content-change URLs; 'hidden' and blocked content; uncrawlable URLs.
     CHANGE IS KEY: not just any change, but critical material change; predicting future change; dropping 'hints' to Googlebot; sending Googlebot where 'the action is'.
  16. FIND GOOGLEBOT
     Automate server log retrieval via a cron job:
     grep Googlebot access_log > googlebot_access.txt
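     If you would rather script the same filtering step before scheduling it with cron, a minimal Python sketch follows; the log filename and output path are placeholders for your own server's paths.

        # Minimal sketch of the grep step above; "access_log" and the output
        # filename are placeholders for your own paths.
        with open("access_log", encoding="utf-8", errors="replace") as logfile, \
             open("googlebot_access.txt", "w", encoding="utf-8") as out:
            for line in logfile:
                if "Googlebot" in line:      # matches any Googlebot user-agent string
                    out.write(line)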
  17. LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
     Prepare to be horrified:
     Incorrect URL header response codes (e.g. 302s).
     301 redirect chains.
     Old files or XML sitemaps left on the server from years ago.
     Infinite / endless loops (circular dependency).
     On parameter-driven sites, URLs crawled which produce the same output.
     URLs generated by spammers.
     Dead image files being visited.
     Old CSS files still being crawled.
     Identify your 'real time', 'daily' and 'base layer' URLs. Are they the ones you want there?
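     As a rough sketch of looking through 'spider eyes', the snippet below tallies Googlebot visits and response codes per URL from the filtered log produced earlier. It assumes the common/combined Apache log format; adjust the field positions for your own server.

        # Tally Googlebot visits and response codes per URL from googlebot_access.txt.
        # Assumes the common/combined log format, where the request and user agent
        # sit inside double quotes.
        from collections import Counter

        visits, statuses = Counter(), Counter()
        with open("googlebot_access.txt", encoding="utf-8", errors="replace") as log:
            for line in log:
                parts = line.split('"')
                if len(parts) < 3:
                    continue
                request = parts[1].split()        # e.g. ['GET', '/some-page', 'HTTP/1.1']
                status_fields = parts[2].split()  # e.g. ['200', '15243']
                if len(request) < 2 or not status_fields:
                    continue
                url, status = request[1], status_fields[0]
                visits[url] += 1
                statuses[(url, status)] += 1

        print("Most-crawled URLs (likely 'real time' / 'daily' candidates):")
        for url, count in visits.most_common(10):
            print(f"  {count:5d}  {url}")

        print("Non-200 responses served to Googlebot:")
        for (url, status), count in statuses.items():
            if status != "200":
                print(f"  {status}  x{count}  {url}")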
  18. FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
     Technical 'fixes':
     Speed up your site: implement compression, minification and caching.
     Fix incorrect header response codes.
     Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs.
     Use absolute rather than relative internal links.
     Ensure no parts of your content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content).
     Ensure no CSS or JavaScript files are blocked from crawlers.
     Unpick 301 redirect chains.
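     A quick spot check of response codes, compression and caching headers can be scripted; the sketch below uses the third-party requests library, and the URLs are placeholders.

        # Hedged sketch: spot-check response codes, compression and caching for a
        # few URLs. The URLs are placeholders; requests is a third-party library.
        import requests

        urls = ["https://www.example.com/", "https://www.example.com/category/widgets"]

        for url in urls:
            resp = requests.get(url, headers={"Accept-Encoding": "gzip, br"},
                                allow_redirects=False, timeout=10)
            print(url)
            print("  status code :", resp.status_code)                      # expect 200, not 302
            print("  compression :", resp.headers.get("Content-Encoding", "none"))
            print("  cache header:", resp.headers.get("Cache-Control", "none"))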
  19. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
     Minimise 301 redirects and minimise canonicalisation.
     Use 'if modified' headers on low-importance 'hygiene' pages.
     Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites).
     Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt).
     Use 410 'gone' headers on dead URLs liberally.
     Revisit your .htaccess file and review legacy pattern-matched 301 redirects.
     Combine CSS and JavaScript files.
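     To check whether a page already supports the 'if modified' behaviour, and how long its redirect chain is, something like the sketch below can help; the URL is a placeholder and it again assumes the third-party requests library.

        # Does the URL answer a conditional request with 304, and how many
        # redirect hops sit in front of it? URL is a placeholder.
        import requests

        url = "https://www.example.com/terms-and-conditions"

        first = requests.get(url, timeout=10)
        last_modified = first.headers.get("Last-Modified")

        if last_modified:
            second = requests.get(url, headers={"If-Modified-Since": last_modified}, timeout=10)
            print("conditional request status:", second.status_code)   # 304 saves crawl budget
        else:
            print("no Last-Modified header; conditional requests can't help here")

        print("redirect hops before the final URL:", len(first.history))  # aim for 0 or 1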
  20. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
     Emphasise page importance. Be clear to Googlebot which are your most important pages:
     Revisit 'votes for self' via internal links in GSC.
     Keep clear, 'unique' URL fingerprints.
     Use XML sitemaps for your important URLs (don't put everything on them).
     Use 'mega menus' (very selectively) to key pages.
     Use 'breadcrumbs' (for hierarchical structure).
     Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages.
     Consolidate (merge) important but similar content (e.g. merge FAQs).
     Consider flattening your site structure so 'importance' flows further.
     Reduce internal linking to low-priority URLs.
     Train on change. Googlebot goes where the action is, and where it is likely to be in the future:
     Not just any change – critical material change.
     Keep the 'action' in the key areas, NOT just the blog.
     Use relevant supplementary content to keep key pages 'fresh'.
     Remember the negative impact of 'crawl hints'.
     Regularly update key content.
     Consider 'updating' rather than replacing seasonal content URLs.
     Build 'dynamism' into your web development (sites that 'move' win).
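     One way to keep the XML sitemap limited to important URLs, as the slide advises, is to generate it from a curated list; a minimal sketch follows, with placeholder URLs and lastmod dates, using Python's standard library.

        # Sketch of an XML sitemap listing only the important URLs ("don't put
        # everything on it"). URLs and lastmod dates are placeholders.
        import xml.etree.ElementTree as ET

        important_urls = [
            ("https://www.example.com/", "2015-12-01"),
            ("https://www.example.com/key-category/", "2015-11-28"),
            ("https://www.example.com/key-product/", "2015-11-30"),
        ]

        urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        for loc, lastmod in important_urls:
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = loc
            ET.SubElement(url_el, "lastmod").text = lastmod   # only claim dates that really changed

        ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)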
  21. TOOLS YOU CAN USE
     Speed: YSlow, Pingdom, Google Page Speed tests, minification (JS Compress and CSS Minifier), image compression (compressjpeg.com, tinypng.com).
     Spider eyes: GSC crawl stats, DeepCrawl, Screaming Frog, server logs, SEMrush (auditing tools), Webconfs (header responses / similarity checker), Powermapper (bird's-eye view of the site).
     URL importance: GSC internal links report (URL importance), Link Research Tools (strongest sub-pages reports), GSC internal links (add site categories and sections as additional profiles), Powermapper.
     Savings & change: GSC index levels (over-indexation checks), GSC crawl stats, last-accessed tools (versus competitors), server logs.
     Plus the Webmaster Hangout office hours.
  22. WARNING SIGNS – TOO MANY VOTES BY SELF FOR THE WRONG PAGES
     Is this your blog?? Hope not.
     [Diagram: 'most important page 1', 'most important page 2', 'most important page 3'.]
  23. WARNING SIGNS – TAG MAN
     Creating 'thin' content and even more URLs to crawl.
     Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it.
     Image credit: Buzzfeed.
  24. REMEMBER
     "Googlebot's on a strict diet. Make sure the right URLs get on the menu."
     Dawn Anderson @dawnieando