# Robots.txt and Agent Web Access

Status: public · Confidence: medium (0.865) · Basis: verified_sources

## TL;DR

Robots.txt is a public crawl policy file that helps crawlers and web agents decide which URLs they should request from a site.

## Core Explanation

Agent web access needs more than search results. Before fetching or crawling a site, an agent should understand the site's robots policy, the host and protocol scope of that policy, and the difference between crawl guidance and access control.
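As a rough illustration of that pre-fetch check, the sketch below uses Python's standard-library `urllib.robotparser` to evaluate a URL against a policy. The policy text, the `example-agent` user-agent name, and the helper names `build_parser` and `may_fetch` are assumptions made for this example; a real agent would download the file from the target site and may need matching behavior that the standard-library parser does not provide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy text, inlined so the sketch runs without network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed policy allows user_agent to request url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "example-agent" is a made-up user-agent token for illustration.
print(may_fetch(parser, "example-agent", "https://example.com/docs/index.html"))
print(may_fetch(parser, "example-agent", "https://example.com/private/report.html"))
```

The check only answers whether the policy permits a request. It says nothing about whether the server will actually authorize it.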

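Robots.txt is useful for respectful crawling and rate management, but it is advisory and does not secure private data: a disallowed URL stays reachable by any client that ignores the file, and RFC 9309 states that its rules are not a form of access authorization. Sensitive resources still need authentication, authorization, and server-side controls.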

## Source-Mapped Facts

- RFC 9309 says crawlers should follow at least five consecutive redirects when fetching robots.txt, even across hosts. If the file is reached within five consecutive redirects, the crawler must fetch and parse it and follow its rules in the context of the originally requested host. ([source](https://datatracker.ietf.org/doc/html/rfc9309))
- Google crawling documentation says robots.txt must be placed in the top-level directory of a site and its rules apply only to the host, protocol, and port where it is hosted. ([source](https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec))
- Google Search Central documentation says a robots.txt file tells search engine crawlers which URLs they can access, mainly to avoid overloading a site with requests. ([source](https://developers.google.com/search/docs/crawling-indexing/robots/intro))
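The scope and redirect facts above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a reference implementation: `policy_scope`, `robots_url_for`, and `resolve_robots` are hypothetical helper names, `fetch_once` is an injected callable standing in for a single HTTP request with automatic redirects disabled, and IPv6 hosts and non-HTTP schemes are not handled.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
REDIRECT_STATUSES = {301, 302, 303, 307, 308}
MAX_ROBOTS_REDIRECTS = 5  # redirect budget taken from the RFC 9309 guidance


def policy_scope(url):
    """Return the (protocol, host, port) triple that a robots.txt policy covers."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, 0)
    return scheme, host, port


def robots_url_for(url):
    """Build the top-level robots.txt URL that governs the given page URL."""
    scheme, host, port = policy_scope(url)
    # Leave the port out of the URL when it is the protocol default.
    netloc = host if port == DEFAULT_PORTS.get(scheme) else f"{host}:{port}"
    return f"{scheme}://{netloc}/robots.txt"


def resolve_robots(start_url, fetch_once):
    """Follow up to MAX_ROBOTS_REDIRECTS redirects while fetching robots.txt.

    fetch_once(url) is a hypothetical callable returning
    (status, location_or_None, body) for one request.
    Returns (status, body), or None if the redirect budget is exhausted.
    """
    url = start_url
    for _ in range(MAX_ROBOTS_REDIRECTS + 1):  # first request plus five redirects
        status, location, body = fetch_once(url)
        if status in REDIRECT_STATUSES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return status, body
    return None


print(robots_url_for("https://example.com/blog/post?id=1"))
print(policy_scope("https://example.com/") == policy_scope("http://example.com/"))
```

Whatever file `resolve_robots` ends up with, its rules would be applied to the scope of `start_url`, not to the host the redirects landed on. Treating an exhausted redirect budget as "no usable policy" is a design choice left to the caller.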

## Further Reading

- [RFC 9309: Robots Exclusion Protocol](https://datatracker.ietf.org/doc/html/rfc9309)
- [Google robots.txt specification](https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec)
- [Google Search Central robots.txt guide](https://developers.google.com/search/docs/crawling-indexing/robots/intro)