Hidden Traps with Robots.txt @BrightonSEO 2024

With your robots.txt, you risk losing traffic and revenue, and potentially even running into information security issues. You will learn why, and what to do to safeguard yourself from trouble.

Gianna Brachetti-Truskawa

October 04, 2024

Transcript

  1. Hidden Traps with Robots.txt and GSC | Gianna Brachetti-Truskawa, DeepL SE
     Speakerdeck.com/giannabrachetti | @gianna-brachetti-truskawa | @tentaclequing
  2. What's the fastest way to get your domain deindexed?
  3. Today, I want you to look beyond syntax …
  4. What you think it is:
     • A security guard
     • A law bots will have to obey
     • A tool to prevent testing environments from being accessible via the internet
  5. What it really is:
     • An optional set of directives that crawlers can use to save on resources for crawl efficiency
     • A security risk
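Because compliance with these directives is voluntary, a small sketch makes the point concrete. It assumes a placeholder domain (example.com) and a hypothetical user-agent name, and uses Python's standard-library urllib.robotparser: a polite crawler consults the file before fetching, while a non-compliant bot can simply skip that check.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder domain

parser = RobotFileParser(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

url = "https://www.example.com/internal/report.pdf"
if parser.can_fetch("MyPoliteCrawler", url):  # hypothetical user agent
    print("Directives allow fetching", url)
else:
    # A well-behaved crawler stops here -- but that is a convention,
    # not a security control: the file is public and nothing enforces it.
    print("Directives disallow fetching", url)
```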
  6. (image-only slide, no transcript text)
  7. What we're going to do today:
     1. What if you don't have a robots.txt?
     2. What if you have one but it becomes unavailable?
     3. Why can't you use it as a security guard?
     4. What can you do to save yourself trouble?
  8. Without a robots.txt, crawlers assume they can access everything
  9. Fine for small sites – for large sites, it helps to manage crawl budget more efficiently
  10. But how would I reference my XML sitemap?
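For context on what "referencing" usually means: the conventional approach is a Sitemap: line in robots.txt, which anyone reading the public file will see. The sketch below (placeholder URLs, Python's standard-library parser) shows that reference being picked up; the following slides argue you can instead leave it out and submit the sitemap directly in Google Search Console.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file contents -- the Sitemap: line is the conventional reference.
robots_txt = """\
User-agent: *
Disallow: /checkout/

Sitemap: https://www.example.com/sitemap-index.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Anyone reading the public file (including competitors) sees the reference:
print(parser.site_maps())  # ['https://www.example.com/sitemap-index.xml']
```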
  11. Why would you hide your XML sitemap? Vulnerability reduction and crawl control
  12. Why would you hide your XML sitemap? Control who accesses your XML sitemap.
     Use case:
     • Competitive intelligence in eCommerce
  13. Why would you hide your XML sitemap? Make it harder to use the sitemap as a vector for attacks.
     Use cases:
     • Scraping
     • (soft) DDoS attacks
  14. Specified robots.txt web standards and Google documentation:
     1. RFC 9309: Robots Exclusion Protocol (REP) – https://www.rfc-editor.org/rfc/rfc9309.html
     2. Search Central documentation: How Google interprets the robots.txt specification – https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
     3. Search Console Help: robots.txt report – https://support.google.com/webmasters/answer/6062598?hl=en#crawl_errors
     Last access: 30.09.2024
  15. Google may cache your robots.txt for up to 24h
  16. If you make changes, it can still take up to 24h until they're fetched
  17. Robots.txt redirect chains can lead to it being ignored
  18. Client errors (4xx) – what happens if Google cannot fetch your robots.txt anymore? What the sources say:
     • Status codes 400–499 MAY be treated as allow all
     • For all 4xx but 429: treat as allow all; 429 (Too many requests) will be treated as 5xx
     • First 12h: stops crawling the domain; <= 30 days: use last cached version; > 30 days: check if the site is available in general, treat as allow all
     (Sources compared: REP, Search Console Help, Search Central Documentation)
  19. If you have time-sensitive info, it might not be fetched on time
  20. A 429 can lead to your domain being deindexed!
  21. Server errors (5xx) – what happens if Google cannot fetch your robots.txt anymore? What the sources say:
     • Treat as complete disallow
     • > 30 days: preferably use last cached version unless unavailable – else treat as 4xx (= allow all)
     • Treat 4xx and 5xx all the same: allow all
     (Sources compared: REP, Search Console Help, Search Central Documentation)
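A rough sketch of the fallback logic described on the 4xx and 5xx slides above (4xx vs. 429/5xx, plus the 12-hour and 30-day windows). Since the sources contradict each other, treat the function below as one possible reading of the documentation, not as Google's definitive algorithm; the thresholds and return values are illustrative.

```python
from datetime import timedelta

def robots_policy(status: int, unreachable_for: timedelta, has_cached_copy: bool) -> str:
    """Coarse crawl policy for a robots.txt fetch result (illustrative only)."""
    if status == 429 or 500 <= status <= 599:
        # 429 and server errors: effectively "cannot fetch"
        if unreachable_for <= timedelta(hours=12):
            return "pause crawling the host"
        if unreachable_for <= timedelta(days=30) and has_cached_copy:
            return "use the last cached robots.txt"
        # Beyond 30 days the documentation and RFC 9309 diverge.
        return "allow all if the site is generally reachable (contested)"
    if 400 <= status <= 499:
        # Other client errors: treated as if no robots.txt existed.
        return "allow all"
    return "use the fetched robots.txt"

print(robots_policy(503, timedelta(days=2), has_cached_copy=True))
```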
  22. Server errors (5xx): contradictions between known sources – REP, Search Console Help, Search Central Documentation and Gary Illyes variously point to disallow all, allow all, or deindex all
  23. DNS errors and connection timeouts will be treated the same!
  24. Error codes can lead to your robots.txt becoming a liability
  25. Robots.txt might tell us where the most interesting files might be
  26. How your robots.txt can become a liability (even if it's a 200!):
     • You're exposing vulnerabilities of your website or servers
     • You're relying on your robots.txt to keep things off the internet
     • You're not monitoring the content or uptime of your robots.txt
  27. Exposing sensitive information can become an expensive GDPR issue
  28. Robots.txt emerged as a practical solution to real-world problems in the early web
  29. Example: click tracking via parameters in URLs in prominent places, e.g. https://www.deepl.com/pro?cta=header-prices
  30. If in doubt, Google may choose indexation over restriction
  31. Want to remove leaked data from the web?
  32. Remove leaked data from the web: 410 (Gone) + X-Robots-Tag: none
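A minimal sketch of that combination, assuming a Flask app and a hypothetical /leaked/ path: the server answers with 410 Gone and an X-Robots-Tag: none header, so crawlers both drop the URL and are told not to index it, rather than merely being asked to stay away via robots.txt.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/leaked/<path:anything>")  # hypothetical path for leaked files
def gone(anything):
    # 410 signals permanent removal; X-Robots-Tag: none tells crawlers
    # not to index or serve the URL.
    return "", 410, {"X-Robots-Tag": "none"}

if __name__ == "__main__":
    app.run()
```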
  33. Want to protect data from leaking into the web?
  34. Protect data from leaking into the web: HTTP authentication, X-Robots-Tag: none, avoid internal links
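A sketch of the same idea for content that must never leak in the first place, again assuming Flask and using hard-coded placeholder credentials: real HTTP authentication keeps crawlers (and everyone else) out, and X-Robots-Tag: none is added as a belt-and-braces signal on whatever is served.

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/private/<path:page>")  # hypothetical protected area
def private_area(page):
    auth = request.authorization
    # Placeholder credentials -- use a real user store and HTTPS in production.
    if not auth or (auth.username, auth.password) != ("staff", "change-me"):
        # Unauthenticated requests (including crawlers) get a 401, not content.
        return Response(status=401, headers={"WWW-Authenticate": 'Basic realm="private"'})
    # Even authenticated responses carry a no-indexing signal.
    return Response(f"private page: {page}", headers={"X-Robots-Tag": "none"})
```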
  35. Want to make it hard to spy on you or train genAI with your content?
  36. GSC shows you versions and status of your robots.txt
  37. Set filters for WNC- and the error classes most relevant for you
  38. If you do want to test syntax, use a parser, e.g. the one by Will Critchlow: www.realrobotstxt.com
  39. Go multi-level to successfully control which content can be accessed by AI crawlers
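One way to read "multi-level" in practice, sketched below with Flask. The user-agent tokens (GPTBot, CCBot, Google-Extended) are examples that were current around the time of the talk and should be verified against each vendor's documentation; note that Google-Extended is only a robots.txt token, not a crawling user agent. Level 1 is the polite robots.txt opt-out, level 2 refuses to serve known AI crawlers regardless of whether they honour it.

```python
from flask import Flask, abort, request

# Example opt-out tokens -- verify against each vendor's documentation.
ROBOTS_TOKENS = ("GPTBot", "CCBot", "Google-Extended")
# Google-Extended never appears as an HTTP user agent, so it is not
# part of the server-side blocklist.
UA_BLOCKLIST = ("GPTBot", "CCBot")

ROBOTS_TXT = "\n".join(f"User-agent: {t}\nDisallow: /\n" for t in ROBOTS_TOKENS)

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Level 1: the polite opt-out via robots.txt.
    return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

@app.before_request
def block_ai_crawlers():
    # Level 2: refuse to serve bots that ignore the opt-out.
    if request.path == "/robots.txt":
        return  # always let bots read the policy itself
    ua = request.headers.get("User-Agent", "")
    if any(token in ua for token in UA_BLOCKLIST):
        abort(403)
```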
  40. About the speaker:
     • Multilingual tech SEO strategist of 15 years
     • Turned PM driving growth in global markets
     • Ask me about complex tech issues, B2B SaaS SEO, and…
     • … my paper for the IAB Workshop on AI Control
     Current role: PM Search at DeepL
  41. How your robots.txt can become a liability: you're exposing vulnerabilities of your website or servers
     • Outdated folders with sensitive data
     • Server vulnerabilities (Apache server status)
     • Admin login paths in your CMS
     • Internal services
  42. How your robots.txt can become a liability: you're relying on your robots.txt to keep things off the internet
     • Mitigate scraping attacks
     • Restrict access to personal data
     • Avoid duplicate content issues (why not target the root cause?)
     • Get things out of the index you accidentally leaked
  43. How your robots.txt can become a liability: you're not monitoring the content or uptime of your robots.txt
     • In-house teams make changes without aligning with you
     • CMS plugins can change it without that being their purpose or being disclosed in their release notes
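A minimal monitoring sketch for exactly this risk, using only the standard library; the URL and check interval are placeholders. It re-fetches robots.txt, flags any non-200 response (including DNS errors and timeouts), and flags content changes, e.g. after an unannounced plugin or in-house edit.

```python
import hashlib
import time
import urllib.request
from urllib.error import HTTPError, URLError

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder
CHECK_EVERY_SECONDS = 900  # placeholder interval

def fetch_robots(url):
    """Return (status_code, body); status 0 means DNS error or timeout."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        return err.code, b""
    except URLError:
        return 0, b""

last_hash = None
while True:
    status, body = fetch_robots(ROBOTS_URL)
    if status != 200:
        print(f"ALERT: robots.txt returned {status or 'no response'}")
    else:
        digest = hashlib.sha256(body).hexdigest()
        if last_hash and digest != last_hash:
            print("ALERT: robots.txt content changed")
        last_hash = digest
    time.sleep(CHECK_EVERY_SECONDS)
```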