
Talk To The Spider

Dawn Anderson
December 04, 2015

Transcript

  1. TALK TO THE SPIDER – Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas, And How To Train Them. Dawn Anderson @dawnieando
  2. THE KEY PERSONAS – 9 types of Googlebot.
     Supporting roles: the Indexer / Ranking Engine, the URL Scheduler, the History Logs, the Link Logs and the Anchor Logs.
  3. GOOGLEBOT'S JOBS
     'Ranks nothing at all.' Takes a list of URLs to crawl from the URL Scheduler; its job varies based on 'bot' type.
     Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs.
     Makes notes of outbound linked pages and additional links for future crawling.
     Takes note of 'hints' from the URL Scheduler when crawling.
     Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (a binary-data equivalent of the web content) for comparison with past visits by the history and link logs.
  4. ROLES – MAJOR PLAYERS – A 'BOSS': THE URL SCHEDULER
     Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.
     Schedules Googlebot visits to URLs and decides which URLs to 'feed' to Googlebot.
     Uses data from the history logs about past visits and assigns visit regularity of Googlebot to URLs.
     Drops 'hints' to Googlebot to guide it on types of content NOT to crawl, and excludes some URLs from schedules.
     Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits.
     Checks 'page importance' when scheduling visits.
     Assigns URLs to 'layers / tiers' for crawling schedules.
  5. TOO MUCH CONTENT
     The indexed web contains at least 4.73 billion pages (13/11/2015).
     [Chart: total number of websites, 2000–2014, rising towards 1,000,000,000.]
     Since 2013 the web is thought to have increased in size by a third.
  6. TOO MUCH CONTENT – How have search engines responded?
     There are capacity limits on Google's crawling system, so Google responds by prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work 'schedules' for Googlebots.
  7. GOOGLE CRAWL SCHEDULER PATENTS
     Include: 'Managing items in a crawl schedule', 'Scheduling a recrawl', 'Web crawler scheduler that utilizes sitemaps from websites', 'Document reuse in a search engine crawler', 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' and 'Scheduler for search engine'.
  8. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
     Three layers / tiers (illustrated in the sketch below):
     Real Time Crawl – crawled multiple times daily.
     Daily Crawl – crawled daily or bi-daily.
     Base Layer Crawl – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled.
     URLs are moved in and out of layers based on past visits data.
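     As an illustration only, here is a minimal sketch of that layer structure. The URLs, segment count and per-cycle cadence are invented for the example, not values from the patent.

        # Illustrative sketch of the three-tier crawl schedule described above.
        # URLs, segment sizes and cadence are made up for the example.
        from itertools import cycle

        real_time = ["/live-scores"]            # crawled multiple times daily
        daily = ["/news", "/offers"]            # crawled daily or bi-daily
        base_segments = [                       # base layer, split into rotating segments
            ["/about", "/old-post-1"],
            ["/old-post-2", "/old-post-3"],
            ["/archive-2012"],
        ]
        active = cycle(base_segments)           # round robin: one 'active' segment per cycle

        def urls_for_one_cycle():
            # URLs the scheduler would hand to Googlebot in one crawl cycle
            return real_time * 3 + daily + next(active)

        for day in range(3):
            print(f"cycle {day}:", urls_for_one_cycle())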
  9. GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
     The URL Scheduler controls the meal planner and carefully controls the list of URLs Googlebot visits.
     The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'.
     'Budgets' are allocated.
  10. CRAWL BUDGET
     What is a crawl budget? An allocation of 'crawl visit frequency' apportioned to the URLs on a site.
     It is apportioned by the URL Scheduler to Googlebots, and is roughly proportionate to page importance (link equity) and speed.
     Pages with a lot of healthy links get crawled more (can this include internal links?).
     But there are other factors affecting the frequency of Googlebot visits aside from importance and speed, and the vast majority of URLs on the web don't get a lot of budget allocated to them.
  11. HINTS & CRITICAL MATERIAL CONTENT CHANGE
     The change score is a weighted sum of change 'features':
     C = Σ (i = 0 to n−1) weight_i × feature_i
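     A worked example of that weighted sum follows; the feature names and weights are hypothetical, chosen only to show how a change score could be computed.

        # Hypothetical illustration of the weighted change score C above; the
        # feature names and weights are made up, not taken from the patent.
        weights  = {"body_text_changed": 0.6, "prices_changed": 0.3, "boilerplate_changed": 0.1}
        features = {"body_text_changed": 1.0, "prices_changed": 1.0, "boilerplate_changed": 0.0}

        # C = sum over i of weight_i * feature_i
        C = sum(weights[name] * features[name] for name in weights)
        print(f"critical material content change score C = {C:.2f}")  # 0.90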
  12. POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
     The current capacity of the web crawling system is high.
     Your URL is 'important'.
     Your URL is in the real time layer, the daily crawl layer or the 'active' base layer segment.
     Your URL changes a lot, with critical material content change.
     The probability and predictability of critical material content change is high for your URL.
     Your website speed is fast and Googlebot gets the time to visit your URL.
     Your URL has been 'upgraded' to a daily or real time crawl layer.
  13. NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
     The current capacity of the web crawling system is low.
     Your URL has been detected as a 'spam' URL.
     Your URL is in an 'inactive' base layer segment.
     Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content.
     The probability and predictability of critical material content change is low for your URL.
     Your website speed is slow and Googlebot doesn't get the time to visit your URL.
     Your URL has been 'downgraded' to an 'inactive' base layer segment.
     Your URL has returned an 'unreachable' server response code recently.
  14. IT'S NOT JUST ABOUT 'FRESHNESS'
     It's about the probability and predictability of future 'freshness', based on data from the history logs.
     How can we influence them to escape the base layer?
  15. CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER: LIKES & DISLIKES
     LIKES: going 'where the action is' in sites; the 'need for speed'; logical structure; correct 'response' codes; XML sitemaps; successful crawl visits; 'seeing everything' on a page; taking 'hints'; clear, unique, single 'URL fingerprints' (no duplicates); predicting the likelihood of 'future change'.
     DISLIKES: slow sites; too many redirects; being bored ('hints' are built in by the search engine systems, and Googlebot takes those 'hints'); being lied to (e.g. on XML sitemap priorities); crawl traps and dead ends; going round in circles (infinite loops); spam URLs; crawl-wasting minor-content-change URLs; 'hidden' and blocked content; uncrawlable URLs.
     CHANGE IS KEY: not just any change, but critical material change; predicting future change; dropping 'hints' to Googlebot; sending Googlebot where 'the action is'.
  16. FIND GOOGLEBOT
     Automate server log retrieval via a cron job:
     grep Googlebot access_log > googlebot_access.txt
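     If you would rather script the same filtering step before scheduling it with cron, a minimal Python sketch follows; the log filename and output path are placeholders for your own server's paths.

        # Minimal sketch of the grep step above; "access_log" and the output
        # filename are placeholders for your own paths.
        with open("access_log", encoding="utf-8", errors="replace") as logfile, \
             open("googlebot_access.txt", "w", encoding="utf-8") as out:
            for line in logfile:
                if "Googlebot" in line:      # matches any Googlebot user-agent string
                    out.write(line)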
  17. LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
     Prepare to be horrified:
     Incorrect URL header response codes (e.g. 302s).
     301 redirect chains.
     Old files or XML sitemaps left on the server from years ago.
     Infinite / endless loops (circular dependency).
     On parameter-driven sites, URLs crawled which produce the same output.
     URLs generated by spammers.
     Dead image files being visited.
     Old CSS files still being crawled.
     Identify your 'real time', 'daily' and 'base layer' URLs. Are they the ones you want there?
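     As a rough sketch of looking through 'spider eyes', the snippet below tallies Googlebot visits and response codes per URL from the filtered log produced earlier. It assumes the common/combined Apache log format; adjust the field positions for your own server.

        # Tally Googlebot visits and response codes per URL from googlebot_access.txt.
        # Assumes the common/combined log format, where the request and user agent
        # sit inside double quotes.
        from collections import Counter

        visits, statuses = Counter(), Counter()
        with open("googlebot_access.txt", encoding="utf-8", errors="replace") as log:
            for line in log:
                parts = line.split('"')
                if len(parts) < 3:
                    continue
                request = parts[1].split()        # e.g. ['GET', '/some-page', 'HTTP/1.1']
                status_fields = parts[2].split()  # e.g. ['200', '15243']
                if len(request) < 2 or not status_fields:
                    continue
                url, status = request[1], status_fields[0]
                visits[url] += 1
                statuses[(url, status)] += 1

        print("Most-crawled URLs (likely 'real time' / 'daily' candidates):")
        for url, count in visits.most_common(10):
            print(f"  {count:5d}  {url}")

        print("Non-200 responses served to Googlebot:")
        for (url, status), count in statuses.items():
            if status != "200":
                print(f"  {status}  x{count}  {url}")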
  18. FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
     Technical 'fixes':
     Speed up your site: implement compression, minification and caching.
     Fix incorrect header response codes.
     Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs.
     Use absolute rather than relative internal links.
     Ensure no parts of your content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content).
     Ensure no CSS or JavaScript files are blocked from crawlers.
     Unpick 301 redirect chains.
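     A quick spot check of response codes, compression and caching headers can be scripted; the sketch below uses the third-party requests library, and the URLs are placeholders.

        # Hedged sketch: spot-check response codes, compression and caching for a
        # few URLs. The URLs are placeholders; requests is a third-party library.
        import requests

        urls = ["https://www.example.com/", "https://www.example.com/category/widgets"]

        for url in urls:
            resp = requests.get(url, headers={"Accept-Encoding": "gzip, br"},
                                allow_redirects=False, timeout=10)
            print(url)
            print("  status code :", resp.status_code)                      # expect 200, not 302
            print("  compression :", resp.headers.get("Content-Encoding", "none"))
            print("  cache header:", resp.headers.get("Cache-Control", "none"))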
  19. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
     Minimise 301 redirects and minimise canonicalisation.
     Use 'if modified' headers on low-importance 'hygiene' pages.
     Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites).
     Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt).
     Use 410 'gone' headers on dead URLs liberally.
     Revisit your .htaccess file and review legacy pattern-matched 301 redirects.
     Combine CSS and JavaScript files.
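     To check whether a page already supports the 'if modified' behaviour, and how long its redirect chain is, something like the sketch below can help; the URL is a placeholder and it again assumes the third-party requests library.

        # Does the URL answer a conditional request with 304, and how many
        # redirect hops sit in front of it? URL is a placeholder.
        import requests

        url = "https://www.example.com/terms-and-conditions"

        first = requests.get(url, timeout=10)
        last_modified = first.headers.get("Last-Modified")

        if last_modified:
            second = requests.get(url, headers={"If-Modified-Since": last_modified}, timeout=10)
            print("conditional request status:", second.status_code)   # 304 saves crawl budget
        else:
            print("no Last-Modified header; conditional requests can't help here")

        print("redirect hops before the final URL:", len(first.history))  # aim for 0 or 1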
  20. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
     Emphasise page importance. Be clear to Googlebot which are your most important pages:
     Revisit 'votes for self' via internal links in GSC.
     Keep clear, 'unique' URL fingerprints.
     Use XML sitemaps for your important URLs (don't put everything on them).
     Use 'mega menus' (very selectively) to key pages.
     Use 'breadcrumbs' (for hierarchical structure).
     Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages.
     Consolidate (merge) important but similar content (e.g. merge FAQs).
     Consider flattening your site structure so 'importance' flows further.
     Reduce internal linking to low-priority URLs.
     Train on change. Googlebot goes where the action is, and where it is likely to be in the future:
     Not just any change – critical material change.
     Keep the 'action' in the key areas, NOT just the blog.
     Use relevant supplementary content to keep key pages 'fresh'.
     Remember the negative impact of 'crawl hints'.
     Regularly update key content.
     Consider 'updating' rather than replacing seasonal content URLs.
     Build 'dynamism' into your web development (sites that 'move' win).
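     One way to keep the XML sitemap limited to important URLs, as the slide advises, is to generate it from a curated list; a minimal sketch follows, with placeholder URLs and lastmod dates, using Python's standard library.

        # Sketch of an XML sitemap listing only the important URLs ("don't put
        # everything on it"). URLs and lastmod dates are placeholders.
        import xml.etree.ElementTree as ET

        important_urls = [
            ("https://www.example.com/", "2015-12-01"),
            ("https://www.example.com/key-category/", "2015-11-28"),
            ("https://www.example.com/key-product/", "2015-11-30"),
        ]

        urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        for loc, lastmod in important_urls:
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = loc
            ET.SubElement(url_el, "lastmod").text = lastmod   # only claim dates that really changed

        ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)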
  21. TOOLS YOU CAN USE
     Speed: YSlow, Pingdom, Google Page Speed tests, minification (JS Compress and CSS Minifier), image compression (compressjpeg.com, tinypng.com).
     Spider eyes: GSC crawl stats, DeepCrawl, Screaming Frog, server logs, SEMrush (auditing tools), Webconfs (header responses / similarity checker), Powermapper (bird's-eye view of the site).
     URL importance: GSC internal links report (URL importance), Link Research Tools (strongest sub-pages reports), GSC internal links (add site categories and sections as additional profiles), Powermapper.
     Savings & change: GSC index levels (over-indexation checks), GSC crawl stats, last-accessed tools (versus competitors), server logs.
     Plus the Webmaster Hangout office hours.
  22. WARNING SIGNS – TOO MANY VOTES BY SELF FOR THE WRONG PAGES
     Is this your blog?? Hope not.
     [Diagram: 'most important page 1', 'most important page 2', 'most important page 3'.]
  23. WARNING SIGNS – TAG MAN
     Creating 'thin' content and even more URLs to crawl.
     Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it.
     Image credit: Buzzfeed.
  24. REMEMBER
     "Googlebot's on a strict diet. Make sure the right URLs get on the menu."
     Dawn Anderson @dawnieando