diff --git a/norobots-rfc.html b/norobots-rfc.html
deleted file mode 100644
index 7c934958..00000000
--- a/norobots-rfc.html
+++ /dev/null
@@ -1,651 +0,0 @@
-
- -
-
-
-
-
-
-Network Working Group M. Koster
-INTERNET DRAFT WebCrawler
-Category: Informational November 1996
-Dec 4, 1996 Expires June 4, 1997
-<draft-koster-robots-00.txt>
-
- A Method for Web Robots Control
-
-
-Status of this Memo
-
- This document is an Internet-Draft. Internet-Drafts are
- working documents of the Internet Engineering Task Force
- (IETF), its areas, and its working groups. Note that other
- groups may also distribute working documents as Internet-
- Drafts.
-
- Internet-Drafts are draft documents valid for a maximum of six
- months and may be updated, replaced, or obsoleted by other
- documents at any time. It is inappropriate to use Internet-
- Drafts as reference material or to cite them other than as
- ``work in progress.''
-
- To learn the current status of any Internet-Draft, please
- check the ``1id-abstracts.txt'' listing contained in the
- Internet- Drafts Shadow Directories on ftp.is.co.za (Africa),
- nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
- ds.internic.net (US East Coast), or ftp.isi.edu (US West
- Coast).
-
-
-
-Koster draft-koster-robots-00.txt [Page 1]
-
-INTERNET DRAFT A Method for Robots Control December 4, 1996
-
-
-Table of Contents
-
- 1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 2
- 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 2
- 3. Specification . . . . . . . . . . . . . . . . . . . . . . . 3
- 3.1 Access method . . . . . . . . . . . . . . . . . . . . . . . 3
- 3.2 File Format Description . . . . . . . . . . . . . . . . . . 4
- 3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 5
- 3.2.2 The Allow and Disallow lines . . . . . . . . . . . . . . . 5
- 3.3 Formal Syntax . . . . . . . . . . . . . . . . . . . . . . . 6
- 3.4 Expiration . . . . . . . . . . . . . . . . . . . . . . . . 8
- 4. Examples . . . . . . . . . . . . . . . . . . . . . . . . . 8
- 5. Implementor's Notes . . . . . . . . . . . . . . . . . . . . 9
- 5.1 Backwards Compatibility . . . . . . . . . . . . . . . . . . 9
-   5.2 Interoperability . . . . . . . . . . . . . . . . . . . 10
- 6. Security Considerations . . . . . . . . . . . . . . . . . . 10
-   7. Acknowledgements . . . . . . . . . . . . . . . . . . . . 10
-   8. References . . . . . . . . . . . . . . . . . . . . . . . 11
- 9. Author's Address . . . . . . . . . . . . . . . . . . . . . 11
-
-
-1. Abstract
-
- This memo defines a method for administrators of sites on the World-
- Wide Web to give instructions to visiting Web robots, most
- importantly what areas of the site are to be avoided.
-
-   This document provides a more rigid specification of the Standard
-   for Robots Exclusion [1], which has been in widespread use by the
-   Web community since 1994.
-
-
-2. Introduction
-
- Web Robots (also called "Wanderers" or "Spiders") are Web client
- programs that automatically traverse the Web's hypertext structure
- by retrieving a document, and recursively retrieving all documents
- that are referenced.
-
-   Note that "recursively" here doesn't limit the definition to any
-   specific traversal algorithm; even if a robot applies some heuristic
-   to the selection and order of documents to visit and spreads out
-   requests over a long period of time, it still qualifies as a
-   robot.
-
- Robots are often used for maintenance and indexing purposes, by
- people other than the administrators of the site being visited. In
- some cases such visits may have undesirable effects which the
- administrators would like to prevent, such as indexing of an
- unannounced site, traversal of parts of the site which require vast
- resources of the server, recursive traversal of an infinite URL
- space, etc.
-
- The technique specified in this memo allows Web site administrators
- to indicate to visiting robots which parts of the site should be
- avoided. It is solely up to the visiting robot to consult this
-   information and act accordingly. Blocking parts of the Web site
-   regardless of a robot's compliance with this method is outside
-   the scope of this memo.
-
-
-3. Specification
-
- This memo specifies a format for encoding instructions to visiting
- robots, and specifies an access method to retrieve these
- instructions. Robots must retrieve these instructions before visiting
- other URLs on the site, and use the instructions to determine if
- other URLs on the site can be accessed.
-
-3.1 Access method
-
- The instructions must be accessible via HTTP [2] from the site that
- the instructions are to be applied to, as a resource of Internet
- Media Type [3] "text/plain" under a standard relative path on the
- server: "/robots.txt".
-
-   For convenience we will refer to this resource as the "/robots.txt
-   file", though the resource need not in fact originate from a
-   filesystem.
-
- Some examples of URLs [4] for sites and URLs for corresponding
- "/robots.txt" sites:
-
- http://www.foo.com/welcome.html http://www.foo.com/robots.txt
-
- http://www.bar.com:8001/ http://www.bar.com:8001/robots.txt
-
-   If the server response indicates Success (HTTP 2xx Status Code),
- the robot must read the content, parse it, and follow any
- instructions applicable to that robot.
-
- If the server response indicates the resource does not exist (HTTP
- Status Code 404), the robot can assume no instructions are
- available, and that access to the site is not restricted by
- /robots.txt.
-
-   Specific behaviors for other server responses are not required by
-   this specification, though the following behaviors are recommended:
-
-   - On a server response indicating access restrictions (HTTP Status
-     Code 401 or 403) a robot should regard access to the site as
-     completely restricted.
-
-   - If the request attempt results in a temporary failure, a robot
-     should defer visits to the site until such time as the resource
-     can be retrieved.
-
-   - On a server response indicating Redirection (HTTP Status Code
-     3xx) a robot should follow the redirects until a resource can be
-     found.
-
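The response handling above can be sketched as a small decision function. This is an illustrative sketch only, not part of the specification; the policy labels returned here are our own names for the behaviors described in section 3.1.

```python
def policy_for_status(status: int) -> str:
    """Map the HTTP status of a GET /robots.txt response to the
    behavior recommended in section 3.1 (labels are illustrative)."""
    if 200 <= status < 300:
        return "parse"            # read the content and follow instructions
    if status == 404:
        return "unrestricted"     # no instructions; access not restricted
    if status in (401, 403):
        return "all-restricted"   # regard the whole site as restricted
    if 300 <= status < 400:
        return "follow-redirect"  # follow until a resource can be found
    return "defer"                # e.g. temporary failure: retry later
```

A robot would typically call this once per fetch attempt and only proceed to parse the body in the "parse" case.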
-
-3.2 File Format Description
-
- The instructions are encoded as a formatted plain text object,
- described here. A complete BNF-like description of the syntax of this
- format is given in section 3.3.
-
-   The format logically consists of a non-empty set of records,
- separated by blank lines. The records consist of a set of lines of
- the form:
-
- <Field> ":" <value>
-
- In this memo we refer to lines with a Field "foo" as "foo lines".
-
- The record starts with one or more User-agent lines, specifying
- which robots the record applies to, followed by "Disallow" and
- "Allow" instructions to that robot. For example:
-
- User-agent: webcrawler
- User-agent: infoseek
- Allow: /tmp/ok.html
- Disallow: /tmp
- Disallow: /user/foo
-
- These lines are discussed separately below.
-
- Lines with Fields not explicitly specified by this specification
- may occur in the /robots.txt, allowing for future extension of the
- format. Consult the BNF for restrictions on the syntax of such
- extensions. Note specifically that for backwards compatibility
- with robots implementing earlier versions of this specification,
- breaking of lines is not allowed.
-
- Comments are allowed anywhere in the file, and consist of optional
- whitespace, followed by a comment character '#' followed by the
- comment, terminated by the end-of-line.
-
-3.2.1 The User-agent line
-
- Name tokens are used to allow robots to identify themselves via a
- simple product token. Name tokens should be short and to the
- point. The name token a robot chooses for itself should be sent
- as part of the HTTP User-agent header, and must be well documented.
-
- These name tokens are used in User-agent lines in /robots.txt to
- identify to which specific robots the record applies. The robot
- must obey the first record in /robots.txt that contains a User-
- Agent line whose value contains the name token of the robot as a
-   substring. The name comparisons are case-insensitive. If no such
-   record exists, it should obey the first record with a User-agent
-   line with a "*" value, if present. If no record satisfies either
-   condition, or no records are present at all, access is unlimited.
-
-   For example, a fictional company FigTree Search Services, which
-   names its robot "Fig Tree", might send HTTP requests like:
-
- GET / HTTP/1.0
- User-agent: FigTree/0.1 Robot libwww-perl/5.04
-
-   and might scan the "/robots.txt" file for records with:
-
- User-agent: figtree
-
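The record-selection rule in this section can be sketched as follows. The representation of records as (agent_values, rules) pairs is a hypothetical one chosen for illustration; the memo does not prescribe a data structure.

```python
def select_record(records, name_token):
    """Pick the record that applies to a robot, per section 3.2.1.

    `records` is a list of (agent_values, rules) pairs in file order.
    Returns None when no record applies (access is then unlimited).
    """
    token = name_token.lower()
    # First record with a User-agent value containing the robot's
    # name token as a case-insensitive substring wins.
    for agents, rules in records:
        if any(token in agent.lower() for agent in agents):
            return (agents, rules)
    # Otherwise fall back to the first record with a "*" value.
    for agents, rules in records:
        if "*" in agents:
            return (agents, rules)
    return None
```
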
-3.2.2 The Allow and Disallow lines
-
- These lines indicate whether accessing a URL that matches the
- corresponding path is allowed or disallowed. Note that these
- instructions apply to any HTTP method on a URL.
-
- To evaluate if access to a URL is allowed, a robot must attempt to
- match the paths in Allow and Disallow lines against the URL, in the
- order they occur in the record. The first match found is used. If no
- match is found, the default assumption is that the URL is allowed.
-
- The /robots.txt URL is always allowed, and must not appear in the
- Allow/Disallow rules.
-
- The matching process compares every octet in the path portion of
- the URL and the path from the record. If a %xx encoded octet is
-   encountered it is decoded prior to comparison, unless it is the
-   %2F encoding of the "/" character, which has special meaning in a
-   path. The match evaluates positively if and only if the end of the
-   path from the record is reached before a difference in octets is
-   encountered.
-
- This table illustrates some examples:
-
- Record Path URL path Matches
- /tmp /tmp yes
- /tmp /tmp.html yes
- /tmp /tmp/a.html yes
- /tmp/ /tmp no
- /tmp/ /tmp/ yes
- /tmp/ /tmp/a.html yes
-
- /a%3cd.html /a%3cd.html yes
- /a%3Cd.html /a%3cd.html yes
- /a%3cd.html /a%3Cd.html yes
- /a%3Cd.html /a%3Cd.html yes
-
- /a%2fb.html /a%2fb.html yes
- /a%2fb.html /a/b.html no
- /a/b.html /a%2fb.html no
- /a/b.html /a/b.html yes
-
- /%7ejoe/index.html /~joe/index.html yes
- /~joe/index.html /%7Ejoe/index.html yes
-
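The matching process above amounts to a prefix comparison over %xx-decoded octets, keeping an encoded "/" undecoded. A minimal sketch (not normative; the helper names are ours):

```python
import re

def _decode(path: str) -> str:
    # Decode %xx escapes for comparison, except %2F: an encoded "/"
    # keeps its special meaning in a path, so normalize it to "%2F"
    # instead of decoding it.
    def repl(m):
        if m.group(1).lower() == "2f":
            return "%2F"
        return chr(int(m.group(1), 16))
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)

def path_matches(record_path: str, url_path: str) -> bool:
    # The match succeeds iff the end of the record path is reached
    # before a difference in octets, i.e. the decoded record path is
    # a prefix of the decoded URL path.
    return _decode(url_path).startswith(_decode(record_path))
```

This reproduces the rows of the table above, including the %2F and %7E cases.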
-3.3 Formal Syntax
-
- This is a BNF-like description, using the conventions of RFC 822 [5],
- except that "|" is used to designate alternatives. Briefly, literals
- are quoted with "", parentheses "(" and ")" are used to group
- elements, optional elements are enclosed in [brackets], and elements
- may be preceded with <n>* to designate n or more repetitions of the
- following element; n defaults to 0.
-
- robotstxt = *blankcomment
- | *blankcomment record *( 1*commentblank 1*record )
- *blankcomment
- blankcomment = 1*(blank | commentline)
- commentblank = *commentline blank *(blankcomment)
- blank = *space CRLF
- CRLF = CR LF
- record = *commentline agentline *(commentline | agentline)
- 1*ruleline *(commentline | ruleline)
- agentline = "User-agent:" *space agent [comment] CRLF
- ruleline = (disallowline | allowline | extension)
- disallowline = "Disallow" ":" *space path [comment] CRLF
- allowline = "Allow" ":" *space rpath [comment] CRLF
-   extension     = token ":" *space value [comment] CRLF
- value = <any CHAR except CR or LF or "#">
-
- commentline = comment CRLF
- comment = *blank "#" anychar
- space = 1*(SP | HT)
- rpath = "/" path
- agent = token
- anychar = <any CHAR except CR or LF>
- CHAR = <any US-ASCII character (octets 0 - 127)>
- CTL = <any US-ASCII control character
- (octets 0 - 31) and DEL (127)>
- CR = <US-ASCII CR, carriage return (13)>
- LF = <US-ASCII LF, linefeed (10)>
- SP = <US-ASCII SP, space (32)>
- HT = <US-ASCII HT, horizontal-tab (9)>
-
- The syntax for "token" is taken from RFC 1945 [2], reproduced here for
- convenience:
-
- token = 1*<any CHAR except CTLs or tspecials>
-
- tspecials = "(" | ")" | "<" | ">" | "@"
- | "," | ";" | ":" | "\" | <">
- | "/" | "[" | "]" | "?" | "="
- | "{" | "}" | SP | HT
-
- The syntax for "path" is defined in RFC 1808 [6], reproduced here for
- convenience:
-
- path = fsegment *( "/" segment )
- fsegment = 1*pchar
- segment = *pchar
-
- pchar = uchar | ":" | "@" | "&" | "="
- uchar = unreserved | escape
- unreserved = alpha | digit | safe | extra
-
- escape = "%" hex hex
- hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
- "a" | "b" | "c" | "d" | "e" | "f"
-
- alpha = lowalpha | hialpha
- lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
- "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
- "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
- hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
- "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
- "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
-
- digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
- "8" | "9"
-
- safe = "$" | "-" | "_" | "." | "+"
- extra = "!" | "*" | "'" | "(" | ")" | ","
-
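A rough line-level parser following this grammar might look like the sketch below. It is deliberately liberal (per section 5.2) rather than a strict BNF validator: tspecials in tokens are not rejected, and a new record is started whenever a User-agent line follows rule lines, instead of requiring a blank-line separator.

```python
import re

# One "Field: value" line with an optional trailing comment; a rough
# regex rendering of agentline/ruleline/extension from the grammar.
_LINE = re.compile(r"^(?P<field>[^:#\s]+)\s*:\s*(?P<value>[^#\r\n]*?)\s*(?:#.*)?$")

def parse_robots_txt(text: str):
    """Split /robots.txt into records: (user_agents, [(field, value), ...])."""
    records, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commentlines
        m = _LINE.match(line)
        if not m:
            continue  # be liberal: skip unparseable lines
        field, value = m.group("field").lower(), m.group("value")
        if field == "user-agent":
            if rules:  # a new record starts
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        else:
            rules.append((field, value))
    if agents or rules:
        records.append((agents, rules))
    return records
```
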
-
-3.4 Expiration
-
- Robots should cache /robots.txt files, but if they do they must
- periodically verify the cached copy is fresh before using its
- contents.
-
- Standard HTTP cache-control mechanisms can be used by both origin
- server and robots to influence the caching of the /robots.txt file.
-   Specifically, robots should take note of the Expires header set by
-   the origin server.
-
- If no cache-control directives are present robots should default to
- an expiry of 7 days.
-
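The expiration rule can be sketched as below; the function name and the use of Unix timestamps are our own choices for illustration.

```python
import email.utils

DEFAULT_TTL = 7 * 24 * 3600  # default expiry of 7 days, per section 3.4

def robots_expiry(fetched_at, expires_header=None):
    """Return the Unix time at which a cached /robots.txt goes stale."""
    if expires_header:
        # Honor an Expires header set by the origin server.
        return email.utils.parsedate_to_datetime(expires_header).timestamp()
    # No cache-control directives: default to 7 days from the fetch.
    return fetched_at + DEFAULT_TTL
```
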
-
-4. Examples
-
- This section contains an example of how a /robots.txt may be used.
-
- A fictional site may have the following URLs:
-
- http://www.fict.org/
- http://www.fict.org/index.html
- http://www.fict.org/robots.txt
- http://www.fict.org/server.html
- http://www.fict.org/services/fast.html
- http://www.fict.org/services/slow.html
- http://www.fict.org/orgo.gif
- http://www.fict.org/org/about.html
- http://www.fict.org/org/plans.html
- http://www.fict.org/%7Ejim/jim.html
- http://www.fict.org/%7Emak/mak.html
-
-   The site's /robots.txt file may have specific rules for robots
-   that send an HTTP User-agent of "UnhipBot/0.1", "WebCrawler/3.0", and
- "Excite/1.0", and a set of default rules:
-
- # /robots.txt for http://www.fict.org/
- # comments to webmaster@fict.org
-
- User-agent: unhipbot
- Disallow: /
-
- User-agent: webcrawler
- User-agent: excite
- Disallow:
-
- User-agent: *
- Disallow: /org/plans.html
- Allow: /org/
- Allow: /serv
- Allow: /~mak
- Disallow: /
-
- The following matrix shows which robots are allowed to access URLs:
-
- unhipbot webcrawler other
- & excite
- http://www.fict.org/ No Yes No
- http://www.fict.org/index.html No Yes No
- http://www.fict.org/robots.txt Yes Yes Yes
- http://www.fict.org/server.html No Yes Yes
- http://www.fict.org/services/fast.html No Yes Yes
- http://www.fict.org/services/slow.html No Yes Yes
- http://www.fict.org/orgo.gif No Yes No
- http://www.fict.org/org/about.html No Yes Yes
- http://www.fict.org/org/plans.html No Yes No
- http://www.fict.org/%7Ejim/jim.html No Yes No
- http://www.fict.org/%7Emak/mak.html No Yes Yes
-
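The "other" column of the matrix can be reproduced by combining the path matching of section 3.2.2 with first-match rule evaluation. The sketch below hard-codes the default record from the example; `_decode` repeats the illustrative helper from section 3.2.2.

```python
import re

def _decode(path):
    # %xx-decode for comparison; keep an encoded "/" (as uppercase %2F).
    return re.sub(r"%([0-9A-Fa-f]{2})",
                  lambda m: "%2F" if m.group(1).lower() == "2f"
                  else chr(int(m.group(1), 16)), path)

# The "User-agent: *" record from the example, in file order.
DEFAULT_RULES = [
    ("disallow", "/org/plans.html"),
    ("allow", "/org/"),
    ("allow", "/serv"),
    ("allow", "/~mak"),
    ("disallow", "/"),
]

def allowed(rules, url_path):
    """First matching Allow/Disallow line wins; default is allowed."""
    if url_path == "/robots.txt":
        return True  # the /robots.txt URL is always allowed
    for field, path in rules:
        if not path:
            continue  # an empty value disallows nothing (simplified here)
        if _decode(url_path).startswith(_decode(path)):
            return field == "allow"
    return True
```
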
-
-5. Implementor's Notes
-
-5.1 Backwards Compatibility
-
-   Previous versions of this specification didn't provide the Allow
-   line. The introduction of the Allow line causes robots to behave
-   slightly differently under either specification:
-
-   If a /robots.txt contains an Allow which overrides a later occurring
-   Disallow, a robot ignoring Allow lines will not retrieve those
-   parts. This is considered acceptable because there is no requirement
-   for a robot to access URLs it is allowed to retrieve, and it is
-   safe, in that no URLs a Web site administrator wants to Disallow
-   will be allowed. It is expected this may in fact encourage robots
-   to upgrade compliance to the specification in this memo.
-
-5.2 Interoperability
-
-   Implementors should pay particular attention to robustness when
-   parsing the /robots.txt file. Web site administrators who are not
-   aware of the /robots.txt mechanism often notice repeated failing
-   requests for it in their log files, and react by putting up pages
-   asking "What are you looking for?".
-
- As the majority of /robots.txt files are created with platform-
- specific text editors, robots should be liberal in accepting files
- with different end-of-line conventions, specifically CR and LF in
- addition to CRLF.
-
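In Python, for instance, accepting all three end-of-line conventions falls out of `str.splitlines`, which treats CR, LF, and CRLF alike; a minimal sketch:

```python
def robots_lines(raw: bytes):
    """Split a /robots.txt body into lines, accepting CR, LF, or CRLF
    end-of-line conventions, as recommended in section 5.2."""
    # splitlines() treats "\r", "\n", and "\r\n" all as line breaks.
    return raw.decode("ascii", errors="replace").splitlines()
```
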
-
-6. Security Considerations
-
- There are a few risks in the method described here, which may affect
- either origin server or robot.
-
- Web site administrators must realise this method is voluntary, and
- is not sufficient to guarantee some robots will not visit restricted
- parts of the URL space. Failure to use proper authentication or other
-   restriction may result in exposure of restricted information. It is
-   even possible that the occurrence of paths in the /robots.txt file
-   may expose the existence of resources not otherwise linked to on
-   the site, which may aid people guessing for URLs.
-
- Robots need to be aware that the amount of resources spent on dealing
- with the /robots.txt is a function of the file contents, which is not
- under the control of the robot. For example, the contents may be
- larger in size than the robot can deal with. To prevent denial-of-
- service attacks, robots are therefore encouraged to place limits on
- the resources spent on processing of /robots.txt.
-
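One simple such limit is a cap on the number of bytes processed; the cap below is an arbitrary value chosen for illustration, as the memo sets no specific limit.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # arbitrary cap; the memo sets no limit

def bounded_robots_body(raw: bytes) -> bytes:
    """Truncate an oversized /robots.txt body before parsing, so the
    work a hostile or broken server can cause is bounded."""
    return raw[:MAX_ROBOTS_BYTES]
```
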
-   The /robots.txt directives are retrieved and applied in separate,
-   possibly unauthenticated HTTP transactions, and it is possible that
-   one server can impersonate another or otherwise intercept a
-   /robots.txt, and provide a robot with false information. This
- specification does not preclude authentication and encryption
- from being employed to increase security.
-
-7. Acknowledgements
-
-   The author would like to thank the subscribers to the robots
-   mailing list for their contributions to this specification.
-
-8. References
-
- [1] Koster, M., "A Standard for Robot Exclusion",
- http://info.webcrawler.com/mak/projects/robots/norobots.html,
- June 1994.
-
- [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
- Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
-
- [3] Postel, J., "Media Type Registration Procedure." RFC 1590,
- USC/ISI, March 1994.
-
- [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
- Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
- University of Minnesota, December 1994.
-
- [5] Crocker, D., "Standard for the Format of ARPA Internet Text
- Messages", STD 11, RFC 822, UDEL, August 1982.
-
- [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
- UC Irvine, June 1995.
-
-9. Author's Address
-
- Martijn Koster
- WebCrawler
- America Online
- 690 Fifth Street
- San Francisco
- CA 94107
-
- Phone: 415-3565431
- EMail: m.koster@webcrawler.com
-
- Expires June 4, 1997