Crawling, Indexation and Logfiles

Slide 1

Slide 1 text

Crawling, Indexation and Logfiles Andor Palau Palau Consulting Speakerdeck.com/andorpalau @andorpalau @andorpalau

Slide 2

Slide 2 text

@andorpalau #brightonseo 2 https://blog.google/outreach-initiatives/sustainability/our-commitment-to-climate-conscious-data-center-cooling/#:~:text=In%202021%2C%20the%20average%20Google,manufacture%20160%20pairs%20of%20jeans. One data center requires about 620 million litres of water per year

Slide 3

Slide 3 text

@andorpalau #brightonseo 3 https://www.cnbc.com/2021/03/18/google-to-spend-7-billion-in-data-centers-and-office-space-in-2021.html & https://www.google.com/intl/de/about/datacenters/efficiency/ Google spends billions on its infrastructure

Slide 4

Slide 4 text

@andorpalau #brightonseo 4 https://www.seroundtable.com/google-crawling-more-efficient-environmental-friendly-32792.html Crawling is set to become more efficient

Slide 5

Slide 5 text

"And then, if you think about it, one thing that we do and we might not need to do that much is refresh crawls." 20 January 2022 5 https://www.seroundtable.com/google-crawling-more-efficient-environmental-friendly-32792.html

Slide 6

Slide 6 text

"And often, we can't estimate this well, and we definitely have room for improvement there on refresh crawls, because sometimes, it just seems wasteful that we are hitting the same URL over and over again." 20 January 2022 6 https://www.seroundtable.com/google-crawling-more-efficient-environmental-friendly-32792.html

Slide 7

Slide 7 text

@andorpalau #brightonseo 7 https://www.linkedin.com/posts/garyillyes_my-mission-this-year-is-to-figure-out-how-activity-7180832169156562945-aleJ/ Google does not crawl less, but schedules better

Slide 8

Slide 8 text

@andorpalau #brightonseo 8 Indexing may depend on storage capacity

Slide 9

Slide 9 text

@andorpalau #brightonseo 9 https://www.contentkingapp.com/academy/crawl-budget/ Why crawl budget should be optimised
1. The primary goal: getting changes to content processed as quickly as possible after they have been made.
2. Newly published content is crawled quickly and subsequently indexed.
3. Best possible management of resources such as CSS, JavaScript, or images is also an important goal of crawl budget optimisation.

Slide 10

Slide 10 text

@andorpalau #brightonseo 10 https://www.seroundtable.com/google-allocates-crawl-budget-by-hostname-37224.htm , https://www.seroundtable.com/crawl-budget-all-googlebot-crawling-37214.htmll Crawl budget is allocated by hostname

Slide 11

Slide 11 text

@andorpalau #brightonseo 11 .pdf, .ps, .csv, .kml, .kmz, .gpx, .hwp, .htm, .html, .xls, .xlsx, .ppt, .pptx, .doc, .docx, .odp, .ods, .odt, .rtf, .svg, .tex, .txt, .text, .bas, .c, .cc, .cpp, .cxx, .h, .hpp, .cs, .java, .pl, .py, .wml, .wap, .xml, .bmp, .gif, .jpeg, .png, .webp, .svg, .3gp, .3g2, .asf, .avi, .divx, .m2v, .m3u, .m3u8, .m4v, .mkv, .mov, .mp4, .mpeg, .ogv, .qvt, .ram, .rm, .vob, .webm, .wmv, and .xap https://developers.google.com/search/docs/crawling-indexing/indexable-file-types?hl=en These are files that can be indexed by Google

Slide 12

Slide 12 text

@andorpalau #brightonseo 12 GSC Question of efficiency arises for many websites

Slide 13

Slide 13 text

@andorpalau #brightonseo 13 https://twitter.com/johnmu/status/867364568921714689 Noindex / Canonical do not help with crawling

Slide 14

Slide 14 text

"After crawling, Google can already decide not to send the URL through further processing if the HTML is bad. The problem is then not that Google renders JS slowly, but that the HTML is bad."

Slide 15

Slide 15 text

@andorpalau #brightonseo 15 GSC Google gives indications of quality problems

Slide 16

Slide 16 text

@andorpalau #brightonseo 16 https://searchengineland.com/super-fresh-google-index-server-errors-rankings-impacts-230975 & https://www.pingdom.com/synthetic-monitoring/#1 Server errors cause extreme & rapid damage!

Slide 17

Slide 17 text

@andorpalau #brightonseo 17 Always monitor your robots.txt
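One way to monitor robots.txt is to keep the last known version and compare each new fetch against it. The following is a minimal sketch under that assumption. The function names and the sample rules are hypothetical, and fetching and alerting are left out.

```python
import difflib
import hashlib


def fingerprint(robots_txt):
    """Return a stable hash of the robots.txt content."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()


def robots_changes(previous, current):
    """List removed ('- ') and added ('+ ') lines between two snapshots."""
    if fingerprint(previous) == fingerprint(current):
        return []
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep only changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical snapshots: the new version blocks the whole site.
old_snapshot = "User-agent: *\nDisallow: /search"
new_snapshot = "User-agent: *\nDisallow: /"
changes = robots_changes(old_snapshot, new_snapshot)
for change in changes:
    print(change)
```

Run on a schedule, a non-empty result would be the trigger for an alert.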

Slide 18

Slide 18 text

@andorpalau #brightonseo 18 https://developers.google.com/search/blog/2023/02/dont-404-my-yum 4XX errors cost no crawl budget, with one exception (429)

Slide 19

Slide 19 text

@andorpalau #brightonseo 19 https://support.google.com/a/answer/10026322?hl=en & https://www.gstatic.com/ipranges/goog.json 2021: Google published its IP addresses

Slide 20

Slide 20 text

@andorpalau #brightonseo 20 Blocking fake bots has become easier as well
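With published IP ranges, a log entry that claims to be Googlebot can be checked against the list. The sketch below assumes the ranges come as JSON with a `prefixes` list of `ipv4Prefix` / `ipv6Prefix` entries, the shape used by the linked goog.json. The sample ranges are illustrative only, not the current official list, and the function names are hypothetical.

```python
import ipaddress

# Illustrative sample in the shape of the published JSON.
# Assumption: NOT the real, current list; load the official file in practice.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/19"},
        {"ipv6Prefix": "2001:4860:4801::/48"},
    ]
}


def load_networks(ranges):
    """Turn the 'prefixes' entries into ip_network objects."""
    networks = []
    for prefix in ranges.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_in_ranges(ip, networks):
    """True if the IP address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES)
print(is_in_ranges("66.249.73.145", networks))  # IP from a log entry
print(is_in_ranges("203.0.113.10", networks))   # documentation-range IP
```

A request whose user agent says Googlebot but whose IP is outside the published ranges is a candidate for blocking.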

Slide 21

Slide 21 text

@andorpalau #brightonseo 21 https://www.seroundtable.com/reddit-it-not-blocking-google-search-37671.html & https://searchengineland.com/microsoft-confirms-reddit-blocked-bing-search-444385 Reddit blocked Bing from crawling, not Google

Slide 22

Slide 22 text

Some notes before we look at some data

Slide 23

Slide 23 text

@andorpalau #brightonseo 23 Summary: What is a log file?
Servers and computer applications of all kinds usually generate a log entry automatically whenever they perform an action, and write it to a file. When a bot crawls a URL, this creates an entry in the log file. We analyse these entries afterwards to understand how the bot moved around the domain.
www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 14486 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Important information from the log file:
• Host
• IP address
• User agent
• Date
• URL
• Status code
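The fields named above can be pulled out of the sample line with a regular expression. This is a minimal sketch that assumes every line follows the format of the sample entry (virtual host first, then the usual combined log fields). Real server configurations differ, so the pattern and the function name are assumptions to adapt.

```python
import re
from datetime import datetime

# Assumed layout: host[:port] ip ident user [time] "request" status size
# "referer" "user agent". Anything after the user agent is ignored.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+?)(?::\d+)? (?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


sample = (
    'www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] '
    '"GET /blog/ HTTP/1.1" 200 14486 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"'
)
entry = parse_log_line(sample)
print(entry["host"], entry["ip"], entry["url"], entry["status"])
```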

Slide 24

Slide 24 text

@andorpalau #brightonseo 24 https://medium.com/geekculture/load-balancing-da0bde7882f1 Watch out: Server logs vs. load balancer logs

Slide 25

Slide 25 text

@andorpalau #brightonseo 25 https://medium.com/geekculture/load-balancing-da0bde7882f1 Data differences are possible with status codes (e.g. 302 vs. 301)

Slide 26

Slide 26 text

@andorpalau #brightonseo 26 https://developers.google.com/search/blog/2022/01/url-inspection-api & https://searchengineland.com/seo-tools-google-search-console-url-inspection-api-379955 No logfiles? GSC Inspector API is helpful
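A rough sketch of how a URL Inspection API call could be prepared and its result read, for sites without log access. It assumes a valid OAuth access token for the Search Console property. The helper names, the example URLs, and the sample response values are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

INSPECT_ENDPOINT = (
    "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"
)


def build_inspection_request(page_url, property_url, access_token):
    """Build (but do not send) the POST request for one URL."""
    body = json.dumps(
        {"inspectionUrl": page_url, "siteUrl": property_url}
    ).encode("utf-8")
    return urllib.request.Request(
        INSPECT_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def summarize(response):
    """Pick the crawl-related fields out of a decoded JSON response."""
    status = response.get("inspectionResult", {}).get("indexStatusResult", {})
    return {
        "coverage": status.get("coverageState"),
        "last_crawl": status.get("lastCrawlTime"),
        "google_canonical": status.get("googleCanonical"),
    }


request = build_inspection_request(
    "https://www.example.com/blog/", "https://www.example.com/", "TOKEN"
)

# Hypothetical response values, for illustration only.
sample_response = {
    "inspectionResult": {
        "indexStatusResult": {
            "coverageState": "Submitted and indexed",
            "lastCrawlTime": "2024-04-01T08:00:00Z",
            "googleCanonical": "https://www.example.com/blog/",
        }
    }
}
print(summarize(sample_response))
```

The last crawl time per URL gives a coarse substitute for log data, limited by the API quota.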

Slide 27

Slide 27 text

Just imagine, you realise that...

Slide 28

Slide 28 text

@andorpalau #brightonseo Oncrawl & GSC Web crawl deviates massively from the GSC data

Slide 29

Slide 29 text

@andorpalau #brightonseo 29 Oncrawl / *last 45 days Only 35% are linked & 14% of these generate clicks

Slide 30

Slide 30 text

@andorpalau #brightonseo 30 Oncrawl Adding log files results in over 700K orphan pages
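The idea behind this number can be sketched as a set difference: URLs that bots request according to the logs, but that the site crawl never reached through internal links. This is a simplified illustration with hypothetical URLs and a hypothetical function name, not how Oncrawl computes it.

```python
def find_orphans(crawled_urls, log_urls):
    """URLs seen in the log files that the site crawl did not discover."""
    return sorted(set(log_urls) - set(crawled_urls))


# Hypothetical data: the crawl found two URLs, the logs show four hits.
crawled = ["/", "/blog/"]
from_logs = ["/", "/blog/", "/old-landing-page/", "/old-landing-page/"]
orphans = find_orphans(crawled, from_logs)
print(orphans)
```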

Slide 31

Slide 31 text

@andorpalau #brightonseo 31 Oncrawl What next? Segment your reports

Slide 32

Slide 32 text

@andorpalau #brightonseo 32 Oncrawl Orphan pages & sessions: Watch your structure!

Slide 33

Slide 33 text

@andorpalau #brightonseo 33 Oncrawl Check if URLs with bot hits and traffic make sense

Slide 34

Slide 34 text

@andorpalau #brightonseo 34 Oncrawl Look for processed URLs without GSC signals

Slide 35

Slide 35 text

@andorpalau #brightonseo 35 https://www.searchenginejournal.com/how-http-status-codes-impact-seo/411762/#close 70% 204 status codes? Do you know what that means?

Slide 36

Slide 36 text

@andorpalau #brightonseo 36 https://www.linkedin.com/posts/garyillyes_http-304-not-modified-is-super-useful-to-activity-7101917948038000640-G0gW/ With a 304 status code it can be even worse
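To spot an unusual share of 204 or 304 responses, the status codes of all bot hits can be counted. This is a minimal sketch: the input is assumed to be a list of status codes already extracted from the logs, and the sample numbers are made up to mirror the 70% case.

```python
from collections import Counter


def status_code_share(status_codes):
    """Percentage of bot hits per status code, most frequent first."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        code: round(100 * hits / total, 1)
        for code, hits in counts.most_common()
    }


# Made-up sample: 7 of 10 hits return 204.
shares = status_code_share([204] * 7 + [200] * 2 + [304])
print(shares)
```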

Slide 37

Slide 37 text

@andorpalau #brightonseo 37 Oncrawl Number of unique pages as baseline

Slide 38

Slide 38 text

@andorpalau #brightonseo 38 Oncrawl Evaluate the percentage of newly crawled URLs
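One simple reading of this metric: of all URLs crawled in the current period, what share was not crawled in the previous one? The sketch below assumes two sets of URLs taken from the logs. The period length and the function name are open choices, not something the slide specifies.

```python
def newly_crawled_share(previous_urls, current_urls):
    """Percentage of currently crawled URLs not seen in the previous period."""
    current = set(current_urls)
    if not current:
        return 0.0
    new_urls = current - set(previous_urls)
    return 100 * len(new_urls) / len(current)


# Hypothetical periods: two of the four current URLs are new.
last_month = ["/a", "/b", "/c"]
this_month = ["/b", "/c", "/d", "/e"]
share = newly_crawled_share(last_month, this_month)
print(share)
```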

Slide 39

Slide 39 text

@andorpalau #brightonseo 39 Oncrawl Are directories crawled more after changes?

Slide 40

Slide 40 text

@andorpalau #brightonseo 40 Oncrawl Are your pagination pages actually being crawled?

Slide 41

Slide 41 text

@andorpalau #brightonseo 41 Oncrawl What’s with other URL types?

Slide 42

Slide 42 text

@andorpalau #brightonseo 42 Oncrawl How does crawling correlate with the word count?

Slide 43

Slide 43 text

@andorpalau #brightonseo 43 Oncrawl Are "light" URLs (<100KB) crawled less?

Slide 44

Slide 44 text

@andorpalau #brightonseo 44 Oncrawl Are deep-lying URLs being crawled?

Slide 45

Slide 45 text

@andorpalau #brightonseo 45 Oncrawl Does age affect crawling significantly?

Slide 46

Slide 46 text

@andorpalau #brightonseo 46 Oncrawl Expiry date: non-crawl rate? Processed at all?

Slide 47

Slide 47 text

@andorpalau #brightonseo 47 Oncrawl When does your content start generating traffic?

Slide 48

Slide 48 text

@andorpalau #brightonseo 48 Some Key Take-Aways
• Crawling & logfile analyses are still extremely exciting and helpful in 2024 for bigger sites.
• Combine as much data as possible so that you can compare the sources with each other.
• Segment your data: directories, URL types, authors, temporal aspects, etc. This allows you to gain much more insight from your data.
• Look especially for inefficiencies: don't just let static recommendations pile up on you. Think about what would and wouldn't make sense in your context and check this against the data.
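Segmenting by directory, as suggested above, can be sketched by grouping bot hits on the first path segment. The URLs and function names below are hypothetical, and other segmentations (URL type, author, date) would follow the same pattern with a different key function.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_directory(url):
    """Return the top-level directory of a URL, or '/' for root-level pages."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not path.endswith("/"):
        segments = segments[:-1]  # the last segment is the page itself
    return f"/{segments[0]}/" if segments else "/"


def hits_by_directory(urls):
    """Count bot hits per top-level directory."""
    return Counter(first_directory(url) for url in urls)


# Hypothetical bot hits taken from a log file.
hits = hits_by_directory(["/blog/a", "/blog/b", "/shop/item?x=1", "/"])
print(hits.most_common())
```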

Slide 49

Slide 49 text

Crawling, Indexation and Logfiles Andor Palau Palau Consulting Speakerdeck.com/andorpalau @andorpalau @andorpalau