
Crawling, Indexation and Logfiles

Andor Palau

October 04, 2024

Transcript

  1. "And then, if you think about it, one thing that we do and we might not need to do that much is refresh crawls." 20 January 2022 5 https://www.seroundtable.com/google-crawling-more-efficient-environmental-friendly-32792.html
  2. "And often, we can't estimate this well, and we definitely have room for improvement there on refresh crawls, because sometimes, it just seems wasteful that we are hitting the same URL over and over again." 20 January 2022 6 https://www.seroundtable.com/google-crawling-more-efficient-environmental-friendly-32792.html
  3. @andorpalau #brightonseo 9 https://www.contentkingapp.com/academy/crawl-budget/ Why crawl budget should be optimised: 1. The primary goal: getting content changes processed as quickly as possible after they have been made. 2. Newly published content is crawled quickly and subsequently indexed. 3. Best possible management of resources such as CSS, JavaScript, or images is also an important goal of crawl budget optimization.
  4. @andorpalau #brightonseo 11 These are file types that can be indexed by Google: .pdf, .ps, .csv, .kml, .kmz, .gpx, .hwp, .htm, .html, .xls, .xlsx, .ppt, .pptx, .doc, .docx, .odp, .ods, .odt, .rtf, .svg, .tex, .txt, .text, .bas, .c, .cc, .cpp, .cxx, .h, .hpp, .cs, .java, .pl, .py, .wml, .wap, .xml, .bmp, .gif, .jpeg, .png, .webp, .svg, .3gp, .3g2, .asf, .avi, .divx, .m2v, .m3u, .m3u8, .m4v, .mkv, .mov, .mp4, .mpeg, .ogv, .qvt, .ram, .rm, .vob, .webm, .wmv, and .xap https://developers.google.com/search/docs/crawling-indexing/indexable-file-types?hl=en
  5. "After crawling, Google can already decide that the URL does

    not run through the processing if the HTML is bad, and the problem is not that Google renders JS slowly, but that the HTML is bad."
  6. @andorpalau #brightonseo 23 Summary: What is a log file? Servers and computer applications of all kinds usually generate a so-called log entry automatically when they perform an action. This entry is written to a file. If a bot crawls a URL, this creates an entry in the log file. We analyse these entries afterwards to understand how the bot moved around the domain. www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 14486 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" Important information from the log file: • Host • IP address • User agent • Date • URL • Status code
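
     To make the example entry above concrete, here is a minimal Python sketch (not from the deck) that parses this kind of vhost + combined log line and extracts exactly the fields listed on the slide: host, IP address, user agent, date, URL and status code. The regex is an assumption based on the single example line; real formats vary with the server configuration, and Googlebot user agents should additionally be verified via reverse DNS because they can be spoofed.

     ```python
     import re

     # Assumed pattern for the vhost + combined log format shown on the slide.
     LOG_PATTERN = re.compile(
         r'^(?P<host>\S+)\s+'                          # virtual host, e.g. www.oncrawl.com:80
         r'(?P<ip>\S+)\s+\S+\s+\S+\s+'                 # client IP, identd, auth user
         r'\[(?P<date>[^\]]+)\]\s+'                    # timestamp
         r'"(?P<method>\S+)\s+(?P<url>\S+)[^"]*"\s+'   # request line (method + URL)
         r'(?P<status>\d{3})\s+\S+\s+'                 # status code, bytes sent
         r'"[^"]*"\s+'                                 # referrer
         r'"(?P<user_agent>[^"]*)"'                    # user agent
     )

     line = ('www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] '
             '"GET /blog/ HTTP/1.1" 200 14486 "-" '
             '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"')

     match = LOG_PATTERN.match(line)
     if match:
         entry = match.groupdict()
         # Keep only hits that identify themselves as Googlebot.
         if "Googlebot" in entry["user_agent"]:
             print(entry["date"], entry["method"], entry["url"], entry["status"])
     ```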
  7. @andorpalau #brightonseo 29 Oncrawl / *last 45 days: only 35% are linked & 14% of these generate clicks
  8. @andorpalau #brightonseo 48 Some Key Take-Aways • Crawling & logfile analyses are still extremely exciting and helpful in 2024 for bigger sites. • Combine as much data as possible so that you can compare the sources with each other. • Segment your data: directories, URL types, authors, temporal aspects, etc. This allows you to gain much more insight from your data (see the segmentation sketch below). • Look especially for inefficiencies: don't just let static recommendations pile up on you; think about what would and wouldn't make sense in your context and check this against the data.
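
     As an illustration of the segmentation take-away, here is a small Python sketch (not part of the deck) that groups crawled URLs by their top-level directory so bot activity can be compared per site section. The URLs and segment labels are invented for the example; in practice the URLs would come from the parsed log entries above.

     ```python
     from collections import Counter
     from urllib.parse import urlparse

     def segment(url: str) -> str:
         """Return the top-level directory of a URL as its segment label."""
         parts = [p for p in urlparse(url).path.split("/") if p]
         return "/" + parts[0] + "/" if parts else "/"

     # Hypothetical crawl hits; in a real analysis these come from the log file.
     crawled_urls = [
         "https://example.com/blog/post-1",
         "https://example.com/blog/post-2",
         "https://example.com/products/item-42",
         "https://example.com/",
     ]

     hits_per_segment = Counter(segment(u) for u in crawled_urls)
     for seg, hits in hits_per_segment.most_common():
         print(f"{seg}: {hits} bot hits")
     ```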