Your robots.txt can cost you traffic and revenue, and can even create information security issues. You will learn why, and what to do to safeguard yourself from trouble.
What robots.txt is not:
• A law bots have to obey
• A tool to prevent testing environments from being accessible via the internet
What we're going to do today:
2. What if you have one but it becomes unavailable?
3. Why can't you use it as a security guard?
4. What can you do to save yourself trouble?
Client Errors (4xx): What happens if Google cannot fetch your robots.txt anymore?
• Status codes 400–499
• MAY be treated as allow all
• First 12 hours: stops crawling the domain
• ≤ 30 days: uses the last cached version
• > 30 days: checks if the site is available in general, then treats robots.txt as allow all
• Exception: all 429 (Too Many Requests) responses are treated as 5xx
Sources: REP, Search Console Help, Search Central Documentation
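A minimal sketch of the 4xx handling above, assuming a simplified crawler model; the function name and policy strings are illustrative, not Google's actual implementation:

```python
def policy_for_client_error(status: int) -> str:
    """Map a robots.txt client-error status to a crawl policy (illustrative)."""
    if status == 429:
        # Exception: 429 (Too Many Requests) is treated like a 5xx.
        return "treat_as_server_error"
    if 400 <= status <= 499:
        # An unavailable robots.txt MAY be treated as allow all.
        return "allow_all"
    return "parse_robots_txt"  # e.g. a normal 200 response
```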
Server Errors (5xx): What happens if Google cannot fetch your robots.txt anymore?
• Treat as complete disallow
• > 30 days: preferably use the last cached version unless unavailable, else treat as 4xx (= allow all)
• Treat 4xx and 5xx all the same: allow all
Sources: REP, Search Console Help, Search Central Documentation
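To make the 5xx rules above concrete, here is a hedged sketch of the decision logic; outage_days and cached_robots are hypothetical simplifications of the fetch history a real crawler would track:

```python
from typing import Optional

def policy_when_unreachable(outage_days: int, cached_robots: Optional[str]) -> str:
    """Illustrative 5xx handling per the slide above, not a crawler's real code."""
    if outage_days <= 30:
        # An unreachable robots.txt means complete disallow.
        return "disallow_all"
    # After 30 days: preferably keep using the last cached version;
    # if none is available, treat it like a 4xx (= allow all).
    return "use_cached_copy" if cached_robots is not None else "allow_all"
```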
How your robots.txt can become a liability (even if it's a 200!)
• You're exposing vulnerabilities of your website or servers
• You're relying on your robots.txt to keep things out of the internet
• You're not monitoring the content or uptime of your robots.txt
About me:
• Turned PM driving growth in global markets
• Ask me about complex tech issues, B2B SaaS SEO, and…
• … my paper for the IAB Workshop on AI Control
• Current role: PM Search at DeepL
You're exposing vulnerabilities of your website or servers:
• Outdated folders with sensitive data
• Server vulnerabilities (e.g., the Apache server-status page)
• Admin login paths in your CMS
• Internal services
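As a rough illustration, a short audit script can flag Disallow rules that advertise exactly these kinds of paths; the URL and the SUSPICIOUS patterns are placeholders to adapt to your own stack:

```python
import urllib.request

# Illustrative patterns; extend them for your own CMS and infrastructure.
SUSPICIOUS = ("admin", "login", "backup", "server-status", "internal", ".git")

def audit_robots(url: str) -> list[str]:
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode("utf-8", "replace").splitlines()
    findings = []
    for line in lines:
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            # robots.txt is public, so a Disallow rule on a sensitive
            # path is a signpost to the very thing you wanted hidden.
            if any(token in path.lower() for token in SUSPICIOUS):
                findings.append(path)
    return findings

print(audit_robots("https://example.com/robots.txt"))
```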
You're relying on your robots.txt to keep things out of the internet:
• Mitigate scraping attacks
• Restrict access to personal data
• Avoid duplicate content issues (why not target the root cause?)
• Get things out of the index you accidentally leaked
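If the goal is to keep a URL out of the index or away from strangers, target the root cause instead: require authentication, or serve a noindex directive. Note that noindex only works if crawlers are allowed to fetch the URL, so it must not also be disallowed in robots.txt. A minimal Flask sketch; the route and responses are hypothetical:

```python
from flask import Flask, Response, abort, request

app = Flask(__name__)

@app.route("/internal-report")
def internal_report():
    # Root-cause fix 1: require authentication instead of hiding the URL.
    if request.authorization is None:
        abort(401)
    resp = Response("sensitive content")
    # Root-cause fix 2: tell crawlers not to index the page. This header
    # is only seen if the URL is NOT blocked in robots.txt.
    resp.headers["X-Robots-Tag"] = "noindex, nofollow"
    return resp
```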
You're not monitoring the content or uptime of your robots.txt:
• In-house teams make changes without aligning with you
• CMS plugins can change it without that being their purpose or disclosed in their release notes
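A minimal monitoring sketch, assuming a local state file and print-based alerting as placeholders for your real alerting channel:

```python
import hashlib
import pathlib
import urllib.error
import urllib.request

STATE = pathlib.Path("robots_txt.sha256")  # hypothetical state file

def check(url: str) -> None:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        print(f"ALERT: robots.txt returned {err.code}")
        return
    except urllib.error.URLError as err:
        print(f"ALERT: robots.txt unreachable: {err.reason}")
        return
    digest = hashlib.sha256(body).hexdigest()
    if STATE.exists() and STATE.read_text() != digest:
        print("ALERT: robots.txt content changed since the last check")
    STATE.write_text(digest)

check("https://example.com/robots.txt")
```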