SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016
Originally presented by Dawn Anderson at Brighton SEO in April 2016
Why you should take care over crawling (intelligent use of crawl allocation, a.k.a. 'crawl budget'). Investigating 'crawl budget', 'crawl rank', 'crawl tank' and 'crawl scheduling' by search engines.
THE WEB IS ‘BIG’
[Chart: total number of websites, 2000–2014, rising towards 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
ADD TO THIS – WE ALL ‘LOVE CONTENT’. IMPORTANT TO NOTE THAT 75% OF WEBSITES ONLINE ARE DORMANT (E.G. PARKED DOMAINS). IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO – A LOT. http://www.internetlivestats.com/total-number-of-websites/
TOO MUCH CONTENT – HOW HAVE SEARCH ENGINES RESPONDED? By creating work ‘schedules’ for Googlebots and by assigning crawl period intervals to URLs for crawling.
PRIORITIZATION (OF CRAWLING) – “While web pages can be manually selected for crawling, this becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents.” – Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
SOME GOOGLE CRAWL SCHEDULER PATENTS
• ‘Scheduling a recrawl’ – US 8386459 B1
• ‘Web crawler scheduler that utilizes sitemaps from websites’ – US 8037054 B2
• ‘Document reuse in a search engine crawler’ – US 8707312 B1
• ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ – US 8407204 B2
• ‘Scheduler for search engine crawler’ – US 8042112 B1
• ‘Distributed crawling of hyperlinked documents’ – US 7305610 B1
IT SEEMS PRIORITIZATION AND GOOGLEBOT CRAWL EFFICIENCY ARE IMPORTANT TO SEARCH ENGINES
PAGE ‘IMPORTANCE’ AND URL SCHEDULING – “MANAGING ITEMS IN A CRAWL SCHEDULE” (GOOGLE PATENT US 8666964 B1)
3 layers / tiers:
• Real Time Crawl
• Daily Crawl
• Base Layer Crawl – crawled least; split into segments on random rotation, and only the ‘active’ segment is crawled, on a ‘round robin’ basis
URLs are moved in and out of layers based on past visit data (retrieved from the logs).
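To make the layering idea more concrete, here is a minimal sketch (my own illustration, not code from the patent) of how URLs could be bucketed into the three tiers. The thresholds, score names and segment rotation below are assumptions purely for demonstration.

```python
# Minimal sketch of a three-layer crawl schedule in the spirit of
# "Managing Items in a Crawl Schedule" (US 8666964 B1). Thresholds and
# field names are illustrative assumptions, not values from the patent.
import random
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float          # e.g. a PageRank-like score, 0..1
    daily_change_prob: float   # learned from history-log comparisons, 0..1

def assign_layer(rec: UrlRecord) -> str:
    """Place a URL into the real-time, daily or base layer."""
    if rec.importance > 0.8 and rec.daily_change_prob > 0.5:
        return "real-time"     # crawled multiple times per day
    if rec.importance > 0.4 or rec.daily_change_prob > 0.2:
        return "daily"         # crawled roughly once a day
    return "base"              # crawled least, one segment at a time

def base_layer_segments(base_urls, n_segments=10):
    """Split the base layer into segments on random rotation;
    only the 'active' segment is crawled in a given cycle."""
    shuffled = base_urls[:]
    random.shuffle(shuffled)
    return [shuffled[i::n_segments] for i in range(n_segments)]

if __name__ == "__main__":
    records = [
        UrlRecord("https://example.com/", 0.9, 0.7),
        UrlRecord("https://example.com/category/widgets", 0.5, 0.3),
        UrlRecord("https://example.com/widget-1?colour=red", 0.05, 0.01),
    ]
    for r in records:
        print(assign_layer(r), r.url)
    base = [r.url for r in records if assign_layer(r) == "base"]
    print(base_layer_segments(base, n_segments=2))
```

The point is only that tier membership falls out of two learned quantities – importance and probability of change – which is exactly what the history logs feed to the scheduler.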
THE LOGS (HISTORY LOGS, LINK LOGS, ANCHOR LOGS, STATUS LOGS AND OTHER LOGS)
Consider these as ‘record-keepers’ (they record info on the crawled URLs). Jobs include:
• Retrieving previous copies of documents for comparison with newly retrieved copies, for the purposes of ‘change frequency’ and ‘change weight’ calculation (last modified & update rate)
• Identifying links: “identifies all the links (e.g., URLs, also called outbound links) that are found in the document associated with the record and the text that surrounds the link” (Brawer et al, Google Patent) – info used to make link maps
• Anchor logs & maps, status logs
A LOT MORE INFO ON LOGS AT: Scheduler for Search Engine Crawler US 20100241621 A1
THE URL SCHEDULER – Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system. The URL Scheduler controls the ‘meal planner’ and carefully controls the list of URLs Googlebot visits. Jobs include:
• Schedules Googlebot visits to URLs and decides which URLs to ‘feed’ to Googlebot
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops ‘hints’ to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
• Analyses past ‘change’ periods and predicts future ‘change’ periods for URLs (based on past visit data) for the purposes of scheduling Googlebot visits
• Checks ‘page importance’ in scheduling visits (priorities)
• Assigns URLs to ‘layers / tiers’ for crawling schedules (real time, daily, base layer segment)
• Checks URLs for ‘importance’, ‘boost factor’ candidacy and ‘probability of modification’
• ‘Budgets’ are allocated
GOOGLEBOT TYPES – BOT TYPES HAVE VARYING DEGREES OF ‘BUSY-NESS’
• Googlebot (web search)
• Googlebot Images (crawls images only)
• Googlebot Mobile (smartphone, feature phone)
• Media types, Apps, AdSense
• Quality checks
• Babybot (’the Noob’)
GOOGLEBOT – JOBS
• Takes a list of URLs to crawl from the URL Scheduler
• Job varies based on ‘bot’ type (e.g. the image bot seems a bit of a ‘part timer’ – images change less frequently)
• Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs
• Makes notes of outbound linked pages and additional links for future crawling (so they can be assigned to future crawling schedules)
• Takes note of ‘hints’ from the URL scheduler when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (a binary equivalent of the web content) for comparison with past visits by the history and link logs
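As a toy illustration of that bookkeeping (not Googlebot's actual implementation), the sketch below fetches a URL, records the response code, collects outbound links for future scheduling and computes a content checksum for later change comparison. It uses only the Python standard library; the example URL is a placeholder.

```python
# Toy crawl-and-record pass: status code, outbound links, content checksum.
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def crawl_once(url):
    with urlopen(url, timeout=10) as resp:
        body = resp.read()
        status = resp.status
    checksum = hashlib.sha256(body).hexdigest()   # stand-in for a content checksum
    parser = LinkCollector(url)
    parser.feed(body.decode("utf-8", errors="replace"))
    return {"url": url, "status": status, "checksum": checksum,
            "outbound_links": parser.links}

if __name__ == "__main__":
    print(crawl_once("https://example.com/"))
```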
THE INDEXER / RANKING ENGINE – Takes data from the various logs (and the page rankers) of the search engine to index the URLs
• Uses the combined data collected in order to index the results for a given query
• Takes data from the logs to generate indexes: “The indexer(s) 724 use the anchor maps 718 and other logs 716 to generate index(es) 726. The index(es) are used by the search engine to identify documents matching queries entered by users of the search engine.” (Web crawler scheduler that utilizes sitemaps from websites, US 8037054 B2, Google Patent, Brawer et al, pub. 2011)
GOOGLE WEBMASTER HANGOUT QUESTION ON ‘URL QUEUEING’ – BUT WHAT OTHER EVIDENCE DO WE HAVE TO SUPPORT OUR THEORIES?
“URLS ARE NOT ALL CRAWLED IN ORDER, BUT THAT SOME RECEIVE MULTIPLE DAILY CRAWLS, SOME DAILY, SOME WEEKLY AND SOME VERY INFREQUENTLY”
https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
LOW IMPORTANCE URLs APPEAR TO BE ‘QUEUED FOR LATER’ AND VISITED INFREQUENTLY, WHEN THERE IS SPARE CAPACITY (LOWER PRIORITY IN SCHEDULES)
“…for each remaining document identifier based on predetermined criteria (e.g., a page importance score of the document).” (Zhu et al, 2011) – PATENT: Scheduler for search engine crawler, US 8042112 B1
1. …“CRAWL VISITS TO A HOST”
2. ROUGHLY PROPORTIONATE TO PAGERANK AND HOST SPEED / CAPACITY
3. PAGES WITH A LOT OF LINKS GET CRAWLED MORE
4. THE VAST MAJORITY OF URLS ON THE WEB DON’T GET A LOT OF BUDGET ALLOCATED TO THEM (LOW TO 0 PAGERANK URLS)
Mostly taken from Eric Enge’s interview with Matt Cutts (@mattcutts) from 2010: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
DISTRIBUTED CRAWLING OF HYPERLINKED DOCUMENTS – Patent Abstract: “Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host” (Dean et al, Google, 2014)
IT SEEMS – BUDGET IS ASSIGNED TO THE HOST (IP) AND THEN SHARED BETWEEN THE SITES THERE
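A rough sketch of that host-level ‘stall time’ idea (my interpretation, not the patented implementation): URLs are queued per host, and each host carries an earliest-next-visit time that is pushed further out when the host responds slowly. The default delay and the multiplier applied to fetch time are illustrative assumptions.

```python
# Host-level politeness sketch: each host has a stall time (earliest next visit).
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class HostScheduler:
    def __init__(self, default_delay=5.0):
        self.default_delay = default_delay
        self.queues = defaultdict(deque)   # host -> URLs waiting
        self.next_allowed = {}             # host -> earliest time to crawl again

    def add(self, url):
        host = urlparse(url).netloc
        self.queues[host].append(url)
        self.next_allowed.setdefault(host, 0.0)

    def next_url(self):
        """Return a URL from the host whose stall time expires soonest."""
        ready = [h for h, q in self.queues.items() if q]
        if not ready:
            return None
        host = min(ready, key=lambda h: self.next_allowed[h])
        wait = self.next_allowed[host] - time.time()
        if wait > 0:
            time.sleep(wait)
        start = time.time()
        url = self.queues[host].popleft()
        # ... fetch the URL here; a slow fetch pushes the stall time further out ...
        fetch_time = time.time() - start
        self.next_allowed[host] = time.time() + max(self.default_delay, fetch_time * 10)
        return url

if __name__ == "__main__":
    s = HostScheduler(default_delay=1.0)
    for u in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
        s.add(u)
    while (u := s.next_url()):
        print(u)
```

Slow hosts effectively get fewer visits per day, which is one way budget ends up ‘shared’ across everything sitting on the same host.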
(…IN LIGHT OF THE 2012 ‘DISAVOW TOOL’)
TIP (IMHO – DAWN) – YOU MAY NEED TO RESTRUCTURE / FLATTEN SO ‘BUDGET’ CAN REACH IMPORTANT URLS
“Thanks John” :)
WEB PROMOS Q&A WITH GOOGLE’S ANDREY LIPATTSEV
https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Andrey, chatting with Ammon Johns, seemed to imply that a lot more things affect crawl frequency now than just PageRank.
SO I ASKED @johnmu IF I COULD ASK WHETHER THE FACTORS AFFECTING CRAWL BUDGET / CRAWL FREQUENCY HAD CHANGED – I.E. ADDITIONAL FACTORS? OR ‘CRAWL RANK’ AS WELL AS PAGERANK AND SPEED?
JOHN SAID – “Sure… you can always ask” :)
– “But he didn’t tell me what they were (if any)” :)
‘CHANGE RATE & CHANGE WEIGHT THRESHOLDS’ – PROBABILITY AND PREDICTABILITY OF FUTURE ‘FRESHNESS’ (NEWNESS OR CRITICAL MATERIAL CHANGE). CHANGES CAN BE CRITICAL OR NON-CRITICAL (‘CHANGE RATE’ APPEARS TO BE ‘LEARNED’)
“Changes can be described as critical or non-critical and that determination may depend on the portion of the document changed, or the context of the changes, rather than the amount of text or content changed. Sometimes a change to a document may be insubstantial, e.g., the change of advertisements associated with a document. In this case, it is more appropriate to ignore those accessory materials in a document prior to making content comparisons. In other cases, e.g., as part of a product search, not every piece of information in a document is weighted equally by a potential user. For instance, the user may care more about the unit price of the product and the availability of the product. In this case, it is more appropriate to focus on the changes associated with information that is deemed critical to a potential user rather than something that is less significant, e.g., a change in a product's colour” (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents – Anton Carver, Google Patent US 20130226897 A1, pub. 2013)
NOT ALL ‘FEATURES’ ARE CREATED EQUAL, ACCORDING TO THIS LINE IN THE PATENTS:
C = Σ (i = 0 … n−1) weight_i × feature_i
NOT JUST ‘RANDOM’ CHANGE like Shuffle($variable) or RAND($variable).
EXAMPLE FEATURES – e.g. a change in price (feature) may be weighted higher than a change in colour (feature): feature weight (price) > feature weight (colour).
“DEPENDS ON HOW OFTEN THE PAGE CHANGES” IS MENTIONED A LOT IN WEBMASTER HANGOUTS.
(Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents – Anton Carver, Google Patent US 20130226897 A1, pub. 2013)
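Read literally, that line is just a weighted sum over changed features. A tiny sketch, with made-up feature names and weights, shows why a price change would cross a ‘critical change’ threshold while an advert swap would not:

```python
# The patent's weighted sum, read literally: C = sum(weight_i * feature_i).
# Feature names and weights below are illustrative assumptions.
def change_score(feature_changes, feature_weights):
    """feature_changes: feature name -> 1.0 if changed, else 0.0
    feature_weights: feature name -> learned importance weight"""
    return sum(feature_weights.get(name, 0.0) * changed
               for name, changed in feature_changes.items())

weights = {"price": 0.9, "availability": 0.7, "colour": 0.1, "advert_block": 0.0}

print(change_score({"price": 1.0}, weights))         # 0.9 -> likely 'critical'
print(change_score({"colour": 1.0}, weights))         # 0.1 -> likely 'non-critical'
print(change_score({"advert_block": 1.0}, weights))   # 0.0 -> ignored accessory material
```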
GOOGLE’S NUMBER ONE SEO ADVICE – “BE CONSISTENT” (2015), reported by SERoundtable quoting Google’s John Mueller (@johnmu): https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
DA – I HAVE A FEELING CONSISTENCY IS IMPORTANT FOR THE ‘HISTORY LOGS’ TO ‘LEARN’ CHANGE RATES / THRESHOLDS
HINTS = ‘MEH CHANGES’ (E.G. PATTERNS OF ’SAME OLD, SAME OLD’ STUFF: DUPLICATES, PROGRAMMATICALLY GENERATED CONTENT)
‘RANDOM’ CHANGE created programmatically, like Shuffle($variable) or RAND($variable), may even be seen as a ‘hint’ to Googlebot NOT to crawl those URLs.
"Hints may also be employed on pages that are automatically generated and/or contain dynamically generated elements that result in the page having a different checksum every time it is crawled” (Managing Items In A Crawl Schedule, Google Patent US 8666964 B1)
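A quick demonstration of why that matters: if a page carries a randomly shuffled block (related products, ‘random’ testimonials, etc.), its checksum differs on every crawl even though nothing meaningful changed. The HTML and item list below are invented for illustration.

```python
# A shuffled-on-every-request block gives a different checksum per crawl
# without any real (critical) change to the content.
import hashlib
import random

def page_html(related_items):
    return ("<h1>Widget 1</h1><p>Blue widget, £9.99</p><ul>"
            + "".join(f"<li>{i}</li>" for i in related_items) + "</ul>")

items = ["Widget 2", "Widget 3", "Widget 4", "Widget 5"]

for crawl in range(3):
    random.shuffle(items)                        # e.g. a 'random related products' block
    checksum = hashlib.md5(page_html(items).encode()).hexdigest()
    print(f"crawl {crawl}: {checksum}")          # different checksum every visit

# The core content never changed, so each differing checksum is a 'meh' change --
# exactly the pattern the patent says may be treated as a hint not to crawl.
```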
DOES THE MYTHOLOGICAL ‘CRAWL RANK’ BENEFIT EVEN EXIST?
“The pages that aren’t crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages. You win if you get your low PageRank pages crawled more frequently than the competition.” “I’m still not entirely convinced this is what is happening, but I’m seeing success using this philosophy.” – A J Kohn (@ajkohn)
OTHERS SEEM TO BE TRACKING IT TOO – E.G. SEO CLARITY
I ASKED A J KOHN IF HE STILL THOUGHT IT APPLIED NOW?
”I still see evidence that getting pages crawled frequently (within 7–10 days) seems to have an impact on their ability to rank well” (A J Kohn, 2016)
“Thanks A.J” :)
STRONG URLs – SAVE SOME BATTLES FOR LATER
The strong URLs on big sites may all be stronger than you, but there are a lot of pages on big sites with no strength. You are unlikely to beat the strong URLs with crawl optimisation techniques alone – they are not the intended target for these tactics (TOO STRONG).
WEAK URLs – PICK OFF THE WEAKER URLS WHEN BATTLING WITH A BIG SITE (LOW TO NO PAGERANK URLS)
• Target the low-strength pages further down in competitors’ sites (e.g. subcategory pages in ecommerce sites)
• There are a lot of pages (millions) with little to no PageRank – you’re aiming to get level with, and beat, those
• Powerful, well-known brands often have virtually no strength in 1,000s of URLs lower down the architecture
• Many low-volume / deep URLs are complete weeds on behemoth sites
…(E.G. ON PARAMETERS) – THIS WAS IN THE ORIGINAL INTERVIEW WITH MATT CUTTS. FULL TRANSCRIPT: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
ALSO, LOTS OF THE PATENTS MENTION “PAGE IMPORTANCE (WHICH MAY INCLUDE PAGERANK)”
ON IMPORTANCE – ‘Efficient Crawling Through URL Ordering’ (Cho, Garcia-Molina & Page) REFERENCES THE PROBLEM OF THE SIZE OF THE WEB AND PRIORITIZES IMPORTANT PAGES
“Thanks Bill” :)
• Googlebot is also ‘hunting’… hunting for relevant ‘needles’ in 1,000,000,000s of straws of ‘hay’ on the web
• It’s about making your ‘one needle’ stand out in importance, not just in your own site’s haystack, but among tens of thousands of competing similar straws of hay in other sites’ haystacks… (DON’T JUST MAKE YOUR HAYSTACK BIGGER)
“Hey, you, Googlebot… This is the needle” – via architectural internal linking, without the blur of duplication, too many redirects or canonicalization issues
If you don’t consistently indicate the importance of your URLs, via clean, individual, internal URL importance emphasis, how will Googlebot know which are the most important?
WRONG TARGETS RANKING? … CHECK INTERNAL LINKS (URL IMPORTANCE FROM YOUR OWN SITE)
• Employ ‘consistent’ internal link strategies – consistent internal and external emphasis of a URL’s ‘importance’ (from Google Support pages)
• These are your ‘votes’ to Googlebot on the importance of each URL
• Think of these as ‘wall-ties’ holding your building (site architecture) together
• STOP VOTING FOR THE WRONG URLS FROM WITHIN YOUR OWN SITE
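One way to audit those ‘votes’ yourself is to crawl a sample of your own site and count the internal links pointing at each URL. The sketch below is a deliberately naive, standard-library-only crawler; the seed URL, the 50-page cap and the same-host rule are assumptions, and it ignores robots.txt, nofollow and canonical tags, so treat it only as a starting point.

```python
# Count internal link 'votes' per URL from a small breadth-first crawl.
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.request import urlopen

class Links(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def internal_link_votes(seed, max_pages=50):
    host = urlparse(seed).netloc
    votes, queue, seen = Counter(), [seed], {seed}
    while queue and len(seen) <= max_pages:
        page = queue.pop(0)
        try:
            html = urlopen(page, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = Links()
        parser.feed(html)
        for href in parser.hrefs:
            url = urldefrag(urljoin(page, href))[0]
            if urlparse(url).netloc != host:
                continue
            votes[url] += 1                      # one internal 'vote' for this URL
            if url not in seen and len(seen) < max_pages:
                seen.add(url)
                queue.append(url)
    return votes

if __name__ == "__main__":
    for url, count in internal_link_votes("https://example.com/").most_common(20):
        print(count, url)
```

If the URLs collecting the most votes are not the URLs you want to rank, that is the mismatch the slide is warning about.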
BUT IS THERE PERHAPS AN OPPOSITE OF ‘CRAWL RANK’ – ‘CRAWL TANK’? IS THERE AN ADVERSE EFFECT WHEN CRAWLING GOES BAD? (E.G. SPIDER TRAPS (INFINITE LOOPS), OR INDIVIDUAL URLS VISITED LESS AND LESS FREQUENTLY BECAUSE THERE ARE TOO MANY OF THEM)
SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW THIN PARAMETER INTO A SITE, OR AN INFINITE LOOP FROM A CODING ERROR (A SPIDER TRAP)). ”BEEN THERE, DONE THAT”
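A simple early-warning check for that kind of URL explosion is to group a crawled or logged URL list by path and count how many parameterised variants each path has spawned. The input file name and the ‘one URL per line’ format are assumptions for this sketch.

```python
# Spot parameter bloat: how many parameterised variants has each path spawned?
from collections import Counter
from urllib.parse import urlparse

def parameter_variants(urls):
    variants = Counter()
    for url in urls:
        parts = urlparse(url)
        if parts.query:                        # only count URLs carrying parameters
            variants[parts.path] += 1
    return variants

if __name__ == "__main__":
    with open("crawled_urls.txt") as fh:       # one URL per line (assumed export)
        urls = [line.strip() for line in fh if line.strip()]
    for path, count in parameter_variants(urls).most_common(20):
        print(f"{count:>6}  {path}")           # paths spawning thousands of variants
                                               # are spider-trap candidates
```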
Are URLs confirmed unimportant to queries with each iterative crawl visit to other similar or duplicate-checksum URLs? Do MULTIPLE random URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant, relevant, IMPORTANT URL?
GOOGLEBOT LIKES
• ‘Speed’
• Logical structure
• Correct ‘response’ codes
• XML sitemaps
• Successful crawl visits
• ‘Seeing everything’ on a page
• Taking ‘hints’ (‘hints’ are built in by the search engine systems)
• Clear, unique, single ‘URL fingerprints’ (no duplicates)
• Predicting likelihood of ‘future change’
GOOGLEBOT DISLIKES
• Slow sites
• Too many redirects
• Being bored (‘meh’ changes)
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl-wasting minor content change URLs
• ‘Hidden’ and blocked content
• Uncrawlable URLs
• Duplicate URLs
CHANGE IS KEY – not just any change, but critical material change. Predict future change, drop ‘hints’ to Googlebot and send Googlebot where ‘the action is’.
BASED ON DATA FROM THE HISTORY LOGS – CAN WE INFLUENCE, VIA CRAWL OPTIMISATION, AN ESCAPE FROM THE ‘BASE LAYER’ HOME OF THE ‘UNIMPORTANT’ URLS?
PERSONAL PROJECT – MY ‘20’ IN A 70:20:10 MIX. IT’S NOT MOBILE FRIENDLY OR HTTPS (HANGS HEAD IN SHAME), AND YES, IT NEEDS A MAKEOVER… BUT… TIME, RESOURCES, BUDGET… BLAH BLAH. THERE IS NO ‘BIG BRAND’ MARKETING, VC BACKING, TV OR RADIO ADS (UNLIKE COMPETITORS) – JUST ME, ‘CHIPPING AWAY’. 90%+ OF TRAFFIC IS NON-BRANDED, GENERIC ORGANIC.
ARE THE URLS THAT YOU WANT BEING CRAWLED ‘REAL TIME’, DAILY OR INFREQUENTLY? (REGULAR LOG ANALYSIS AND INTERVENTION TO EMPHASISE IMPORTANCE)
MY THOUGHTS (DA) – You need to find out which URLs are getting crawled in the ‘real time’ schedule, the ‘daily crawl’ schedule, and via random selection in the ‘dross’ (unlikely-to-change / unimportant) ‘base layer’ section. If it’s not the URLs that you want to be there, then formulate a plan to improve the ‘importance’ of those URLs. (NOTE: JOHN DID NOT SAY THIS.)
@johnmu during a Webmaster Hangout: https://goo.gl/1pToL8
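A rough way to do that log analysis: count Googlebot requests per URL in your access logs over a known period and bucket each URL by hits per day. The sketch below assumes a combined-format log at ‘access.log’ covering roughly 30 days, matches user agents only by the ‘Googlebot’ string (reverse-DNS verification is left out), and the bucket labels are my own, not Google's actual tiers.

```python
# Bucket URLs by Googlebot crawl frequency from an access log.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def googlebot_hits(log_path):
    hits = Counter()
    with open(log_path, errors="replace") as fh:
        for line in fh:
            if "Googlebot" not in line:        # naive UA match, not verified by rDNS
                continue
            m = LOG_LINE.search(line)
            if m:
                hits[m.group("path")] += 1
    return hits

def bucket(hits_per_url, days_in_log=30):
    for path, count in hits_per_url.most_common():
        per_day = count / days_in_log
        if per_day >= 2:
            layer = "multiple daily (real-time?)"
        elif per_day >= 0.9:
            layer = "daily"
        elif per_day >= 0.1:
            layer = "weekly-ish"
        else:
            layer = "infrequent (base layer?)"
        print(f"{count:>6}  {layer:<28} {path}")

if __name__ == "__main__":
    bucket(googlebot_hits("access.log"), days_in_log=30)
```

Run against a month of logs, this gives a first approximation of which of your URLs sit in the ‘real time’, ‘daily’ and ‘base layer’ buckets described earlier.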
FOR A BETTER CRAWL: EMBRACE THE ‘410 GONE’. FLATTEN ARCHITECTURES, consistently avoid cannibalisation, employ internal link strategies, link relevant content to relevant content, and utilise XML and front-facing sitemaps and strong hub pages to ‘herd’ Googlebot around the site.
THIS LIKELY APPLIES TO OTHER SEARCH ENGINES TOO. By shortening crawl paths and crawl frequency intervals, and emphasising the importance of subcategory URLs on frequently changed (fresh) URLs, it appears you may gain a competitive advantage on long-tail queries.
IS IT ‘CRAWL RANK’, OR ‘EMPHASISING URL IMPORTANCE’ BETTER THAN COMPETITORS?
TOO COMPLEX TO ANSWER WITH A SIMPLE FEW EXAMPLES, OF COURSE (TOO MANY FACTORS) – BUT… FOOD FOR THOUGHT. ‘CRITICAL MATERIAL CHANGE FREQUENCY’ (FRESHNESS) AND DETECTED URL IMPORTANCE EMPHASIS VIA EXTERNAL OR INTERNAL SIGNALS (INC. PAGERANK) SEEM KEY. EMPHASISE THE IMPORTANCE OF LOW TO NO PAGERANK PAGES WHERE FEW OTHER FACTORS SEPARATE COMPETITORS.
’CRAWL RANK’ – IS IT CORRELATION OR CAUSATION? (DO IMPORTANT PAGES GET CRAWLED MORE, OR ARE THEY IMPORTANT BECAUSE THEY ARE CRAWLED MORE?)
1. …IT APPEARS TO BE APPORTIONED BY THE URL SCHEDULER (BUDGET)
2. PAGES WITH A LOT OF (HEALTHY?) LINKS GET CRAWLED MORE (EXTERNAL AND INTERNAL?) (BUDGET AND RANK?)
3. THERE ARE URL EXCLUSIONS (’HINT TRIPPERS’, OBJECTIONABLE CONTENT AND ‘SPAM URLS’?) (BUDGET)
4. ‘CRITICAL MATERIAL CHANGE’ (FRESHNESS) AND THE PROBABILITY AND PREDICTABILITY OF CHANGE CORRELATE (BUDGET)
5. ’CONSISTENT’ EMPHASIS OF URL IMPORTANCE (BUT I THINK THAT THIS WAS ALWAYS THERE) MAY BE ’CRAWL RANK’ (BUDGET AND RANK??)
‘TANK’ OR ‘RANK’? – YOU DECIDE. IS EVERYTHING INDICATING YOU ARE STILL ON TRACK? BECAUSE BRINGING A ROCKET BACK ON COURSE IS ‘CHALLENGING’. REGULAR TESTS AND EARLY DIAGNOSIS ARE CRUCIAL – STOP, CHECK AND KEEP CHECKING.
REFERENCES & FURTHER READING
• Scheduler for search engine crawler – Google Patent US 8042112 B1 (Zhu et al) – https://www.google.com/patents/US8707313
• Managing items in a crawl schedule – Google Patent (Alpert) – http://www.google.ch/patents/US8666964
• Document reuse in a search engine crawler – Google Patent (Zhu et al) – https://www.google.com/patents/US8707312
• Web crawler scheduler that utilizes sitemaps (Brawer et al) – http://www.google.com/patents/US8037054
• Distributed crawling of hyperlinked documents (Dean et al) – http://www.google.co.uk/patents/US7305610
• Minimizing visibility of stale content (Carver) – http://www.google.ch/patents/US20130226897
• Crawl Optimisation (Blind Five Year Old – A J Kohn, @ajkohn) – http://www.blindfiveyearold.com/crawl-optimization
• Scheduling a recrawl (Auerbach) – http://www.google.co.uk/patents/US8386459
• Scheduler for search engine crawler (Zhu et al) – http://www.google.co.uk/patents/US8042112
• Efficient crawling through URL ordering (Cho, Garcia-Molina & Page) – http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
• Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) – https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
• Crawl Data Aggregation Propagation (Mueller) – https://goo.gl/1pToL8
• Matt Cutts Interviewed By Eric Enge – https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
• Web Promo Q&A with Google’s Andrey Lipattsev – https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
• Google Number 1 SEO Advice – Be Consistent – https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html