

SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

Originally presented by Dawn Anderson at Brighton SEO in April 2016

Why you should take care of crawls (intelligent use of crawl allocation (budget)). Investigating 'crawl budget', 'crawl rank', 'crawl tank' and 'crawl scheduling' by search engines.


Dawn Anderson

April 22, 2016

Transcript

1. SEO 'CRAWL TANK' - 'DEATH AND RESURRECTION'. WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET)). THE QUEST FOR 'CRAWL RANK'. Dawn Anderson @dawnieando
2. THE WEB IS 'BIG'. The indexed web contains at least 4.73 billion pages (13/11/2015). [Chart: total number of websites, 2000-2014, climbing towards 1,000,000,000.] SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3.
3. THE ABILITY TO 'SELF PUBLISH' EASILY HAS CLEARLY INFLUENCED THIS - WE ALL 'LOVE CONTENT'. IMPORTANT TO NOTE THAT 75% OF WEBSITES ONLINE ARE DORMANT (E.G. PARKED DOMAINS). IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO - A LOT. http://www.internetlivestats.com/total-number-of-websites/
4. TOO MUCH CONTENT. There are capacity limits on Google's crawling system. How have search engines responded? By prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work 'schedules' for Googlebots.
5. HERE'S WHY -> EVERYTHING HAS A FINITE CAPACITY (EVEN CRAWLING). "While web pages can be manually selected for crawling, this becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents." - Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
6. SOME GOOGLE CRAWL SCHEDULER PATENTS. These include:
• 'Managing items in crawl schedule' - US 8666964 B1
• 'Scheduling a recrawl' - US 8386459 B1
• 'Web crawler scheduler that utilizes sitemaps from websites' - US 8037054 B2
• 'Document reuse in a search engine crawler' - US 8707312 B1
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' - US 8407204 B2
• 'Scheduler for search engine crawler' - US 8042112 B1
• 'Distributed crawling of hyperlinked documents' - US 7305610 B1
IT SEEMS PRIORITIZATION AND GOOGLEBOT CRAWL EFFICIENCY ARE IMPORTANT TO SEARCH ENGINES.
7. PAGE 'IMPORTANCE' AND URL SCHEDULING - "MANAGING ITEMS IN A CRAWL SCHEDULE" (GOOGLE PATENT US 8666964 B1). 3 layers / tiers:
• Real Time Crawl - crawled multiple times daily
• Daily Crawl - crawled daily or bi-daily
• Base Layer Crawl - crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled
URLs are moved in and out of layers based on past visits data (retrieved from logs).
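Purely to illustrate that layering logic (this is my own sketch, not Google's code, and every threshold and field name is an invented assumption):

```python
# Toy sketch of the 3-layer scheduling idea described in US 8666964 B1.
# All thresholds and field names are assumptions, for illustration only.
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float        # hypothetical PageRank-like score, 0..1
    changes_per_day: float   # 'learned' change rate from the history logs

def assign_layer(rec: UrlRecord) -> str:
    """Place a URL into a crawl layer based on past-visit data."""
    if rec.importance > 0.8 and rec.changes_per_day >= 1.0:
        return "real-time"   # crawled multiple times daily
    if rec.importance > 0.4 or rec.changes_per_day >= 0.5:
        return "daily"       # crawled daily or bi-daily
    return "base-layer"      # crawled least, in rotating segments

def active_base_segment(urls: list[str], day: int, segments: int = 30) -> list[str]:
    """Only one 'active' segment of the base layer is crawled per cycle."""
    return [u for i, u in enumerate(urls) if i % segments == day % segments]
```

URLs would then migrate between layers as fresh history-log data revises their importance and change-rate estimates.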
8. THE KEY SEARCH ENGINE (THE APPLIANCE) CHARACTERS: 10 types of Googlebot; The URL Scheduler; the Indexer / Ranking Engine; and the SUPPORTING ROLES (LOG MANAGERS & PAGE RANKERS): History Logs, Link Logs / Link Maps, Anchor Logs / Anchor Maps, Status Logs, Page Rankers.
9. THE 'LOG' MANAGERS ('The Clerks'). Consider these as 'record-keepers' (they record info on the crawled URLs).
• History Logs - jobs include: retrieving previous copies of documents for comparison with newly retrieved copies, for the purposes of 'change frequency' and 'change weight' calculation (last modified & update rate).
• Link Logs - jobs include: "identifies all the links (e.g., URLs, also called outbound links) that are found in the document associated with the record and the text that surrounds the link" (Brawer et al, Google Patent). This info is used to make LINK MAPS.
• Other Logs - include: Anchor Logs & Maps; Status Logs.
A LOT MORE INFO ON LOGS AT: Scheduler for Search Engine Crawler, US 20100241621 A1.
10. SUPERVISOR - TEAM LEADER - 'THE URL SCHEDULER'. Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system. JOBS:
• Schedules Googlebot visits to URLs; carefully controls the list of URLs Googlebot visits
• Decides which URLs to 'feed' to Googlebot (the URL Scheduler controls the meal planner)
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops 'hints' to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
• Analyses past 'change' periods and predicts future 'change' periods for URLs (based on past visit data) for the purposes of scheduling Googlebot visits
• Checks 'page importance' in scheduling visits (PRIORITIES); checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• Assigns URLs to 'layers / tiers' for crawling schedules (REAL TIME, DAILY, BASE LAYER SEGMENT)
• 'Budgets' are allocated
11. THE 10 GOOGLEBOTS - BOT TYPES HAVE VARYING DEGREES OF 'BUSY-NESS':
• MEDIA TYPES: Image (crawls images only), Video, News
• PAID SEARCH TYPES: Adsense, Adsbot (quality checks)
• MOBILE TYPES: Smartphone, Apps, Featurephone, Mobile Adsense
• GOOGLEBOT WEB SEARCH
• Babybot ('the Noob')
12. GOOGLEBOT JOBS:
• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Job varies based on 'bot' type (e.g. the image bot seems a bit of a 'part timer' - images change less frequently)
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling (in order for them to be assigned to future crawling schedules)
• Takes notes of 'hints' from the URL scheduler when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs
13. 'INDEXER'. Looks at all of the evidence from the various logs (and the page rankers) of the search engine to index the URLs.
• Uses the combined data collected in order to index the results for a given query
• TAKES DATA FROM THE LOGS TO GENERATE INDEXES
"The indexer(s) 724 use the anchor maps 718 and other logs 716 to generate index(es) 726. The index(es) are used by the search engine to identify documents matching queries entered by users of the search engine." (Web crawler scheduler that utilizes sitemaps from websites, US 8037054 B2, Google Patent, Brawer et al, pub 2011)
14. I ASKED JOHN MUELLER AT A GOOGLE WEBMASTER HANGOUT ABOUT 'URL QUEUEING'. BUT WHAT OTHER EVIDENCE DO WE HAVE TO SUPPORT OUR THEORIES? "URLS ARE NOT ALL CRAWLED IN ORDER; SOME RECEIVE MULTIPLE DAILY CRAWLS, SOME DAILY, SOME WEEKLY AND SOME VERY INFREQUENTLY." LOW IMPORTANCE URLs APPEAR TO BE 'QUEUED FOR LATER' AND VISITED INFREQUENTLY, WHEN THERE IS SPARE CAPACITY (LOWER PRIORITY) (SCHEDULES). https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
15. WHICH APPEARED TO SUPPORT… "Priority scores are computed for each remaining document identifier based on predetermined criteria (e.g., a page importance score of the document)." (Zhu et al, 2011). PATENT - Scheduler for search engine crawler, US 8042112 B1.
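If priority scores really do drive selection, scheduling reduces to a top-N pick under capacity limits. A minimal sketch (the scoring function, boost factor and numbers below are my assumptions, not the patent's):

```python
# Illustrative priority-based URL selection under a fixed crawl capacity.
import heapq

def priority(importance: float, boost_factor: float = 1.0) -> float:
    # Hypothetical: a 'page importance score' scaled by any 'boost factor'.
    return importance * boost_factor

def select_for_crawl(candidates: dict[str, float], capacity: int) -> list[str]:
    """Keep the top-`capacity` URLs; the rest queue for spare capacity."""
    top = heapq.nlargest(capacity, ((priority(score), url)
                                    for url, score in candidates.items()))
    return [url for _, url in top]

print(select_for_crawl({"/home": 0.9, "/sale": 0.7, "/old-page": 0.05}, capacity=2))
# -> ['/home', '/sale']; '/old-page' waits, matching the 'queued for later' behaviour
```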
16. CRAWL BUDGET:
1. CRAWL BUDGET - "AN ALLOCATION OF CRAWL VISITS TO A HOST"
2. ROUGHLY PROPORTIONATE TO PAGERANK AND HOST SPEED / CAPACITY
3. PAGES WITH A LOT OF LINKS GET CRAWLED MORE
4. THE VAST MAJORITY OF URLS ON THE WEB DON'T GET A LOT OF BUDGET ALLOCATED TO THEM (LOW TO 0 PAGERANK URLS)
Mostly taken from Eric Enge's interview with Matt Cutts (@mattcutts) from 2010: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
17. I ASKED SOME STUFF ABOUT CRAWL BUDGET ALLOCATION. DISTRIBUTED CRAWLING OF HYPERLINKED DOCUMENTS - patent abstract: "Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host." (Dean et al, Google, 2014). IT SEEMS BUDGET IS ASSIGNED TO THE HOST (I.P.) AND THEN SHARED BETWEEN THE SITES THERE.
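To make that abstract concrete, here is a rough sketch of host-level 'stall time' scheduling as I read it (the default stall and the politeness multiplier are invented numbers):

```python
# Sketch of the 'stall time' idea from US 7305610 B1: a host only becomes
# eligible for its next crawl once its stall time has elapsed, and the
# stall time adjusts with actual retrieval times from that host.
import time

class HostScheduler:
    def __init__(self, default_stall_s: float = 5.0):
        self.next_allowed: dict[str, float] = {}  # host -> earliest next-crawl time
        self.default_stall_s = default_stall_s

    def pick_host(self, hosts: list[str]) -> str | None:
        """Return a host whose stall time has passed, if any."""
        now = time.monotonic()
        ready = [h for h in hosts if self.next_allowed.get(h, 0.0) <= now]
        return ready[0] if ready else None

    def record_fetch(self, host: str, retrieval_s: float) -> None:
        # Slow hosts earn longer stall times, sharing capacity politely.
        stall = max(self.default_stall_s, retrieval_s * 10)
        self.next_allowed[host] = time.monotonic() + stall
```

On this reading, every site resolving to the same host (I.P.) queues behind one stall clock, which is consistent with budget being shared between the sites there.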
18. I ASKED SOME STUFF ABOUT LINKS AND CRAWL BUDGET (in light of the 2012 'DISAVOW TOOL'). TIP (IMHO - DAWN): YOU MAY NEED TO RESTRUCTURE / FLATTEN SO 'BUDGET' CAN REACH IMPORTANT URLS. "Thanks John" - waving :)
19. IT SEEMS THERE ARE MORE FACTORS AFFECTING 'CRAWL BUDGET'?? WEB PROMOS Q&A WITH GOOGLE'S ANDREY LIPATTSEV. Andrey, chatting with Ammon J, seemed to imply that a lot more things affect crawl frequency now than just PageRank. Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
20. ARE THERE OTHER FACTORS AFFECTING BUDGET AND / OR 'CRAWL RANK', AS WELL AS PAGERANK AND SPEED? SO I ASKED @johnmu IF I COULD ASK WHETHER THE FACTORS AFFECTING CRAWL BUDGET / CRAWL FREQUENCY HAD CHANGED - I.E. WHETHER THERE ARE ADDITIONAL FACTORS. JOHN SAID: "Sure… You can always ask" :) - "But he didn't tell me what they were (if any)" :)
21. GOOGLE PATENT - 'NOT ALL 'CHANGE' IS CONSIDERED EQUAL' (CRITICAL & NON-CRITICAL). "Changes can be described as critical or non-critical and that determination may depend on the portion of the document changed, or the context of the changes, rather than the amount of text or content changed. Sometimes a change to a document may be insubstantial, e.g., the change of advertisements associated with a document. In this case, it is more appropriate to ignore those accessory materials in a document prior to making content comparisons. In other cases, e.g., as part of a product search, not every piece of information in a document is weighted equally by a potential user. For instance, the user may care more about the unit price of the product and the availability of the product. In this case, it is more appropriate to focus on the changes associated with information that is deemed critical to a potential user rather than something that is less significant, e.g., a change in a product's colour." (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent US 20130226897 A1, pub 2013). Probability & predictability of future 'freshness' (newness or critical material change); 'CHANGE RATE' APPEARS TO BE 'LEARNED'; 'CHANGE RATE & CHANGE WEIGHT THRESHOLDS'.
22. CRITICAL MATERIAL CONTENT CHANGE (IMPORTANT CHANGE) & FEATURE WEIGHTS. The change-significance formula in the patent:

C = Σ (from i = 0 to n-1) weight_i * feature_i

NOT JUST 'RANDOM' CHANGE like Shuffle($variable) or RAND($variable). NOT ALL 'FEATURES' ARE CREATED EQUAL, ACCORDING TO THE "weight_i * feature_i" TERM IN THE PATENTS. EXAMPLE: A CHANGE IN PRICE (FEATURE) MAY BE WEIGHTED HIGHER THAN A CHANGE IN COLOUR (FEATURE) - FEATURE WEIGHT PRICE > FEATURE WEIGHT COLOUR. "DEPENDS ON HOW OFTEN THE PAGE CHANGES" IS MENTIONED A LOT IN WEBMASTER HANGOUTS. (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent US 20130226897 A1, pub 2013)
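As a worked toy example of that formula (the feature set, weights and threshold are all invented for illustration; the patent publishes no real values):

```python
# C = sum_{i=0}^{n-1} weight_i * feature_i  -- weighted change significance.
# A price change outweighs a colour change; rotating ads carry no weight.
features = {"price_changed": 1, "colour_changed": 1, "ads_rotated": 1}   # 0/1 flags
weights  = {"price_changed": 0.8, "colour_changed": 0.1, "ads_rotated": 0.0}

C = sum(weights[f] * present for f, present in features.items())

CHANGE_WEIGHT_THRESHOLD = 0.5  # assumed 'change weight threshold'
print(f"C = {C:.2f} -> {'critical' if C >= CHANGE_WEIGHT_THRESHOLD else 'non-critical'}")
# C = 0.90 -> critical (driven almost entirely by the price feature)
```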
23. "BE CONSISTENT" - (@johnmu, Nov 2015). SMX MILAN (November 2015), reported here by SERoundtable, quoting Google's John Mueller (@johnmu): https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html. DA - I HAVE A FEELING CONSISTENCY IS IMPORTANT FOR 'HISTORY LOGS' TO 'LEARN' CHANGE RATES / THRESHOLDS.
24. URL EXCLUSIONS FOR TRIPPING THE 'MINIMUM-CRAWL-THRESHOLD', REVISIT 'HINTS' AND 'SPAM' URLs. 'RANDOM' CHANGE created programmatically, like Shuffle($variable) or RAND($variable), may even be seen as a 'hint' TO GOOGLEBOT TO 'NOT' CRAWL. HINTS = 'MEH CHANGES' (E.G. PATTERNS OF 'SAME OLD, SAME OLD STUFF': DUPLICATES, PROGRAMMATICALLY GENERATED CONTENT). "Hints may also be employed on pages that are automatically generated and/or contain dynamically generated elements that result in the page having a different checksum every time it is crawled." (Managing Items In A Crawl Schedule, Google Patent US 8666964 B1)
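That checksum behaviour is easy to picture in code. A minimal sketch (the HTML, tag names and regex are illustrative assumptions) of why stripping 'accessory' material matters before comparing checksums:

```python
# If rotating ad markup is hashed along with the content, every crawl sees
# a 'changed' page; strip the accessory block first and the checksum is stable.
import hashlib
import re

def content_checksum(html: str) -> str:
    stable = re.sub(r'<aside class="ads">.*?</aside>', "", html, flags=re.S)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()

v1 = '<p>Blue widget, £9.99</p><aside class="ads">ad #17</aside>'
v2 = '<p>Blue widget, £9.99</p><aside class="ads">ad #42</aside>'
assert content_checksum(v1) == content_checksum(v2)  # no meaningful change
assert content_checksum(v1) != content_checksum(v1.replace("£9.99", "£7.99"))
```

A page whose raw checksum differs on every visit, with no critical change behind it, is exactly the pattern the patent says may earn a 'hint' not to crawl.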
25. GOOGLE THINKS CRAWL BUDGET IS IMPORTANT FOR SEO (CIRCA JULY 2015). BUT… NO ONE HAS EVER OFFICIALLY SAID THAT THERE'S ANY KIND OF RANKING BENEFIT FROM POSITIVE CRAWL ACTIVITY.
26. ENTER 'CRAWL RANK' - A BENEFIT OF CRAWL OPTIMISATION?? DOES THE MYTHOLOGICAL 'CRAWL RANK' BENEFIT EVEN EXIST? "The pages that aren't crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages. You win if you get your low PageRank pages crawled more frequently than the competition." "I'm still not entirely convinced this is what is happening, but I'm seeing success using this philosophy." - A J Kohn @ajkohn. OTHERS SEEM TO BE TRACKING IT TOO - E.G. SEO CLARITY.
27. DOES 'CRAWL RANK' STILL APPLY? I ASKED A J KOHN IF HE STILL THOUGHT IT APPLIED NOW. "I still see evidence that getting pages crawled frequently (within 7-10 days) seems to have an impact on their ability to rank well." (A J Kohn, 2016). "Thanks A.J" - waving :)
28. IS LONG-TAIL 'LEAP-FROGGING' (AND SOME CLUSTERING) WHAT 'CRAWL RANK' LOOKS LIKE? SITES JUMPING OVER EACH OTHER ON 'LONG TAILED QUERIES' IN AN ENDLESS LAST LAP RACE?
29. HOW IT APPEARS TO WORK - YOU DON'T ALWAYS HAVE TO FIGHT THE 'BOSS' URLS. Why fight with the Hulk when you can be Yoda? Image credit: Flickr
30. EVEN STRONGER DOMAINS HAVE WEAKER URLS. THE SITES MAY ALL BE STRONGER THAN YOU, BUT THERE ARE A LOT OF PAGES ON BIG SITES WITH NO STRENGTH. YOU WON'T BEAT THE STRONG URLs WITH CRAWL OPTIMISATION ALONE - these URLs are not the intended target for these tactics; they are TOO STRONG. SAVE SOME BATTLES FOR LATER.
31. FIGHT AT A URL v URL OR TEMPLATE v TEMPLATE LEVEL WITH LOW TO 0 PAGERANK URLS. PICK OFF THE WEAKER URLS WHEN BATTLING WITH A BIG SITE - THE LOW TO NO PAGERANK URLS.
• TARGET THE LOW STRENGTH PAGES FURTHER DOWN IN THE SITES OF COMPETITORS (SUBCATEGORY PAGES, E.G. IN ECOMMERCE SITES)
• THERE ARE A LOT OF PAGES (MILLIONS) WITH LITTLE TO NO PAGERANK
• YOU'RE AIMING TO BEAT THOSE
VIRTUALLY NO STRENGTH IN 1,000s OF URLS. POWERFUL, WELL KNOWN BRANDS, BUT NO STRENGTH LOWER DOWN THE ARCHITECTURE. MANY LOW VOLUME / DEEP URLs ARE COMPLETE WEEDS ON BEHEMOTH SITES.
32. A BIG FACTOR? - EMPHASIS OF 'URL IMPORTANCE' (E.G. ON PARAMETERS). THIS WAS IN THE ORIGINAL INTERVIEW WITH MATT CUTTS. ALSO, LOTS OF THE PATENTS MENTION "PAGE IMPORTANCE (WHICH MAY INCLUDE PAGERANK)". FULL TRANSCRIPT: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
33. WHICH SEEMS TO SUPPORT THIS PAPER ON IMPORTANCE: 'Efficient Crawling Through URL Ordering' (Page et al). THIS REFERENCES THE PROBLEM OF THE SIZE OF THE WEB AND PRIORITIZES IMPORTANT PAGES. "Thanks Bill" - waving :)
34. 'POINT TO THE NEEDLE IN THE HAY' - EMPHASISE IMPORTANCE.
• Googlebot is also 'hunting'… hunting for relevant 'needles' in 1,000,000,000s of straws of 'hay' on the web.
• It's about making your 'one needle' stand out in importance, not just in your own site's haystack, but among tens of thousands of competing similar straws of hay in other sites' haystacks… (DON'T JUST MAKE YOUR HAYSTACK BIGGER)
"Hey, you Googlebot… This is the needle" - via architectural internal linking, without the blur of duplication, too many redirects or canonicalization.
35. WHICH OF YOUR URLs ARE IMPORTANT? "If you don't consistently indicate the importance of your URLs, via clean internal individual URL importance emphasis, how will Googlebot know which are the most important?"
36. INTERNAL LINKS COUNT (A LOT) - RELATIVE VOTES ON URL IMPORTANCE FROM YOUR OWN SITE. THESE ARE YOUR 'VOTES' TO GOOGLEBOT ON THE IMPORTANCE OF EACH URL. EMPLOY 'CONSISTENT' INTERNAL LINK STRATEGIES. THINK OF THESE AS 'WALL-TIES' HOLDING YOUR BUILDING (SITE ARCHITECTURE) TOGETHER. STOP VOTING FOR THE WRONG URLS FROM WITHIN YOUR OWN SITE. WRONG TARGETS RANKING? … CHECK INTERNAL LINKS. Consistent internal & external emphasis of a URL's 'IMPORTANCE'. (From Google Support Pages)
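One cheap way to audit those 'votes' is to tally internal inlinks per URL from a crawl export. A minimal sketch (the edge-list input format is an assumption; any site crawler that exports source/target link pairs would do):

```python
# Count internal inlinks per target URL - your internal 'importance votes'.
from collections import Counter

def internal_link_votes(edges: list[tuple[str, str]]) -> Counter:
    return Counter(target for _source, target in edges)

edges = [("/", "/category"), ("/", "/about"),
         ("/category", "/product-1"), ("/blog", "/product-1")]
for url, votes in internal_link_votes(edges).most_common():
    print(f"{votes:3d} vote(s)  {url}")
```

If the wrong URLs top that list, your own architecture is casting its votes for the wrong targets.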
37. BUT IS THERE PERHAPS AN OPPOSITE OF 'CRAWL RANK' - 'CRAWL TANK'?? IS THERE AN ADVERSE EFFECT WHEN CRAWLING GOES BAD? NEGATIVE CONSEQUENCES FROM POOR CRAWL VISITS (E.G. SPIDER TRAPS (INFINITE LOOPS), INDIVIDUAL URLS VISITED LESS AND LESS FREQUENTLY BECAUSE THERE ARE TOO MANY).
38. WELL - I'VE SEEN 'CRAWL TANK'. IT AIN'T PRETTY. SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW THIN PARAMETER INTO A SITE, OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP)). "BEEN THERE, DONE THAT"
39. IT KIND OF LOOKS A BIT LIKE THIS. "BEEN THERE, DONE THAT" - DEFINITELY.
40. 'EXPONENTIAL URL UNIMPORTANCE'? Your URLs exponentially, CONSISTENTLY confirmed unimportant to queries with each iterative crawl visit to other similar or duplicate content checksum URLs? MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all competing in-site URLs, with no dominant, relevant, IMPORTANT URL?
41. STILL… SILVER LININGS. "EVERY SEO NEEDS A 'FLATLINER' SITE TO RESURRECT AND MAKE BETTER…" RIGHT?
42. GOOGLEBOT LIKES AND DISLIKES.
LIKES:
• Going 'where the action is' in sites
• The 'need for speed'
• Logical structure
• Correct 'response' codes
• XML sitemaps
• Successful crawl visits
• 'Seeing everything' on a page
• Taking 'hints' ('hints' are built in by the search engine systems)
• Clear, unique, single 'URL fingerprints' (no duplicates)
• Predicting likelihood of 'future change'
DISLIKES:
• Slow sites
• Too many redirects
• Being bored ('meh')
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl-wasting minor content change URLs
• 'Hidden' and blocked content
• Uncrawlable URLs
• Duplicate URLs
CHANGE IS KEY: not just any change - critical material change; predicting future change; dropping 'hints' to Googlebot; sending Googlebot where 'the action is'. BASED ON DATA FROM THE HISTORY LOGS - CAN WE INFLUENCE VIA CRAWL OPTIMISATION TO ESCAPE THE 'BASE LAYER HOME' OF THE 'UNIMPORTANT' URLS?
43. HERE'S ONE I MADE EARLIER… SOME CAVEATS. THIS IS A PERSONAL PROJECT - MY '20' IN A 70:20:10 MIX. IT'S NOT MOBILE FRIENDLY OR HTTPS (HANGS HEAD IN SHAME), AND YES, IT NEEDS A MAKEOVER… BUT… TIME…, RESOURCES, BUDGET… BLAH BLAH. THERE IS NO 'BIG BRAND' MARKETING, VC BACKING, TV OR RADIO ADS (UNLIKE COMPETITORS) - JUST ME, 'CHIPPING AWAY'. 90%+ OF TRAFFIC IS NON-BRANDED, GENERIC, ORGANIC.
44. URL CRAWL FREQUENCY 'CLOCKING'. Spreadsheet provided by @johnmu during a Webmaster Hangout: https://goo.gl/1pToL8. ARE THE URLS THAT YOU WANT BEING CRAWLED 'REAL TIME', DAILY OR INFREQUENTLY? (REGULAR LOG ANALYSIS AND INTERVENTION TO EMPHASISE IMPORTANCE.) MY THOUGHTS (DA) - You need to find out which ones are getting crawled in the 'real time' schedule, the 'daily crawl' schedule, and via random selection in the 'dross' (or UNLIKELY TO CHANGE A LOT / UNIMPORTANT) 'base layer' section. If it's not the URLs that you want to be there, then formulate a plan to improve the 'importance' of those URLs. (NOTE: JOHN DID NOT SAY THIS.)
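For the log-analysis step, a bare-bones sketch of 'clocking' Googlebot visits per URL from a combined-format access log (the regex and user-agent check are simplified assumptions; in practice you would also verify genuine Googlebot via reverse DNS):

```python
# Bucket Googlebot hits per URL per day, to see which URLs look 'real time',
# daily, or infrequent in your own logs.
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(\d+/\w+/\d+):[^\]]*\] "GET (\S+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"'
)

def googlebot_crawl_days(lines: list[str]) -> dict[str, set[str]]:
    days: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group(3):
            day, url = m.group(1), m.group(2)
            days[url].add(day)
    return days  # len(days[url]) ~ distinct crawl days in the log window

sample = ('66.249.66.1 - - [22/Apr/2016:09:15:01 +0000] '
          '"GET /category HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
print(googlebot_crawl_days([sample]))
```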
45. LOSE THE 'DEAD WOOD' SO GOOGLEBOT DETECTS 'IMPORTANCE'. FIX IT FOR A BETTER CRAWL. EMBRACE THE '410 GONE'. FLATTEN ARCHITECTURES, CONSISTENTLY AVOID CANNIBALISATION, USE INTERNAL LINK STRATEGIES, LINK RELEVANT CONTENT TO RELEVANT CONTENT, AND UTILISE XML & FRONT FACING SITEMAPS AND STRONG HUB PAGES TO 'HERD' GOOGLEBOT AROUND THE SITE.
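On the '410 GONE' point, one hedged example of what embracing the 410 can look like at the application layer (the Flask app and URL list are purely illustrative; the same idea works as server rewrite rules):

```python
# Serve an explicit 410 Gone for retired 'dead wood' URLs, signalling
# deliberate removal rather than a transient 404.
from flask import Flask, abort

app = Flask(__name__)
DEAD_WOOD = {"/old-thin-parameter-page", "/duplicate-landing"}

@app.route("/<path:page_path>")
def page(page_path: str):
    if f"/{page_path}" in DEAD_WOOD:
        abort(410)  # Gone: tells the crawler this URL was removed on purpose
    return f"content for /{page_path}"
```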
46. 40,000 TOWNS, CITIES & VILLAGES. 40,000+ towns, cities and villages across the UK, multiplied by X site categories (THAT'S A LOT OF LONG TAIL QUERY VOLUME).
47. FWIW - LONG TAIL CRAWL TECHNIQUES SEEM TO APPLY TO OTHER SEARCH ENGINES TOO. By shortening crawl paths and crawl frequency intervals, and emphasising the importance of subcategory URLs on frequently changed (fresh) URLs, it appears you may gain a competitive advantage on long tail queries.
48. IT'S ALIVE… NEEDS WORK… BUT ALIVE. CAVEAT: IT'S TOO COMPLEX TO ANSWER WITH A SIMPLE FEW EXAMPLES, OF COURSE (TOO MANY FACTORS) - BUT… FOOD FOR THOUGHT. 'CRITICAL MATERIAL CHANGE FREQUENCY' (FRESHNESS) AND DETECTED URL IMPORTANCE EMPHASIS VIA EXTERNAL OR INTERNAL SIGNALS (INC. PAGERANK) SEEM KEY. IS IT 'CRAWL RANK', OR IS IT 'EMPHASISING URL IMPORTANCE' BETTER THAN COMPETITORS? EMPHASISE THE IMPORTANCE OF LOW TO NO PAGERANK PAGES WHERE FEW OTHER FACTORS SEPARATE.
49. CRAWL BUDGET & 'CRAWL RANK' - OTHER FACTORS??
1. IT APPEARS TO BE APPORTIONED BY THE URL SCHEDULER (BUDGET)
2. PAGES WITH A LOT OF (HEALTHY??) LINKS GET CRAWLED MORE (EXTERNAL AND INTERNAL?) (BUDGET AND RANK?)
3. THERE ARE URL EXCLUSIONS ('HINT TRIPPERS', OBJECTIONABLE CONTENT AND 'SPAM URLS'??) (BUDGET)
4. 'CRITICAL MATERIAL CHANGE' (FRESHNESS) AND THE PROBABILITY AND PREDICTABILITY OF CHANGE CORRELATE (BUDGET)
5. 'CONSISTENT' EMPHASIS OF URL IMPORTANCE (BUT I THINK THAT THIS WAS ALWAYS THERE) MAY BE 'CRAWL RANK' (BUDGET AND RANK??)
'CRAWL RANK' - IS IT CORRELATION OR CAUSATION? (DO IMPORTANT PAGES GET CRAWLED MORE, OR IS IT BECAUSE THEY ARE CRAWLED MORE THAT THEY ARE IMPORTANT?)
50. CAN WEB PAGES CRAWLED INFREQUENTLY STILL RANK? YES - THEY CAN STILL BE 'IMPORTANT'. IT'S THE ONES YOU'RE INDICATING ARE UNIMPORTANT THAT YOU WANT TO KEEP AN EYE ON. #JUSTSAYING ;)
51. "BE SMART ABOUT YOUR TAGS AND SITE ARCHITECTURE, STAY FRESH AND RELEVANT" (@maileohye, 2016). SLIDE FROM APRIL 2016'S SEJ SUMMIT, ON SEO INSTRUCTIONS 2016, FROM GOOGLE'S @maileohye.
52. EITHER WAY - ARE ALL THE CHECKS AND BALANCES INDICATING YOU ARE STILL ON TRACK? BECAUSE BRINGING A ROCKET BACK ON COURSE IS 'CHALLENGING'. REGULAR TESTS AND EARLY DIAGNOSIS ARE CRUCIAL - STOP, CHECK AND KEEP CHECKING. 'TANK' OR 'RANK'? YOU DECIDE.
53. THANKS FOR LISTENING, FOLKS :) ENJOY BRIGHTON SEO. Dawn Anderson @dawnieando. TWITTER - @dawnieando; GOOGLE+ - +DawnAnderson888; LINKEDIN - msdawnanderson
54. REFERENCES
http://www.internetlivestats.com/total-number-of-websites/
Scheduler for search engine crawler - Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8042112
Managing items in crawl schedule - Google Patent (Alpert) - http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler - Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897
55. REFERENCES
Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old - A J Kohn @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q and A with Google's Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice - Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html