diff --git a/norobots-rfc.html b/norobots-rfc.html
deleted file mode 100644
index 7c934958..00000000
--- a/norobots-rfc.html
+++ /dev/null
@@ -1,651 +0,0 @@
-A Standard for Robot Exclusion
-The Web Robots Pages
-
-
-
-
-
-
-
-Network Working Group                                          M. Koster
-INTERNET DRAFT                                                WebCrawler
-Category: Informational                                    November 1996
-Dec 4, 1996                                         Expires June 4, 1997
-<draft-koster-robots-00.txt>
-
-                      A Method for Web Robots Control
-
-
-Status of this Memo
-
-     This document is an Internet-Draft.  Internet-Drafts are
-     working documents of the Internet Engineering Task Force
-     (IETF), its areas, and its working groups.  Note that other
-     groups may also distribute working documents as Internet-
-     Drafts.
-
-     Internet-Drafts are draft documents valid for a maximum of six
-     months and may be updated, replaced, or obsoleted by other
-     documents at any time.  It is inappropriate to use Internet-
-     Drafts as reference material or to cite them other than as
-     ``work in progress.''
-
-     To learn the current status of any Internet-Draft, please
-     check the ``1id-abstracts.txt'' listing contained in the
-     Internet-Drafts Shadow Directories on ftp.is.co.za (Africa),
-     nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
-     ds.internic.net (US East Coast), or ftp.isi.edu (US West
-     Coast).
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 1]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-Table of Contents
-
-   1.    Abstract  . . . . . . . . . . . . . . . . . . . . . . . . . 2
-   2.    Introduction  . . . . . . . . . . . . . . . . . . . . . . . 2
-   3.    The Specification . . . . . . . . . . . . . . . . . . . . . 3
-   3.1   Access method . . . . . . . . . . . . . . . . . . . . . . . 3
-   3.2   File Format Description . . . . . . . . . . . . . . . . . . 4
-   3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 5
-   3.2.2 The Allow and Disallow lines  . . . . . . . . . . . . . . . 5
-   3.3   Formal Syntax . . . . . . . . . . . . . . . . . . . . . . . 6
-   3.4   Expiration  . . . . . . . . . . . . . . . . . . . . . . . . 8
-   4.    Examples  . . . . . . . . . . . . . . . . . . . . . . . . . 8
-   5.    Notes for Implementors  . . . . . . . . . . . . . . . . . . 9
-   5.1   Backwards Compatibility . . . . . . . . . . . . . . . . . . 9
-   5.2   Interoperability  . . . . . . . . . . . . . . . . . . . . . 10
-   6.    Security Considerations . . . . . . . . . . . . . . . . . . 10
-   7.    Acknowledgements  . . . . . . . . . . . . . . . . . . . . . 10
-   8.    References  . . . . . . . . . . . . . . . . . . . . . . . . 11
-   9.    Author's Address  . . . . . . . . . . . . . . . . . . . . . 11
-
-
-1.  Abstract
-
-   This memo defines a method for administrators of sites on the World-
-   Wide Web to give instructions to visiting Web robots, most
-   importantly what areas of the site are to be avoided.
-
-   This document provides a more rigid specification of the Standard
-   for Robots Exclusion [1], which has been in widespread use by the
-   Web community since 1994.
-
-
-2.  Introduction
-
-   Web Robots (also called "Wanderers" or "Spiders") are Web client
-   programs that automatically traverse the Web's hypertext structure
-   by retrieving a document, and recursively retrieving all documents
-   that are referenced.
-
-   Note that "recursively" here doesn't limit the definition to any
-   specific traversal algorithm; even if a robot applies some heuristic
-   to the selection and order of documents to visit and spaces out
-   requests over a long space of time, it qualifies to be called a
-   robot.
-
-   Robots are often used for maintenance and indexing purposes, by
-   people other than the administrators of the site being visited. In
-   some cases such visits may have undesirable effects which the
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 2]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-   administrators would like to prevent, such as indexing of an
-   unannounced site, traversal of parts of the site which require vast
-   resources of the server, recursive traversal of an infinite URL
-   space, etc.
-
-   The technique specified in this memo allows Web site administrators
-   to indicate to visiting robots which parts of the site should be
-   avoided. It is solely up to the visiting robot to consult this
-   information and act accordingly. Blocking parts of the Web site
-   regardless of a robot's compliance with this method is outside
-   the scope of this memo.
-   
-   
-3. The Specification
-
-   This memo specifies a format for encoding instructions to visiting
-   robots, and specifies an access method to retrieve these
-   instructions. Robots must retrieve these instructions before visiting
-   other URLs on the site, and use the instructions to determine if
-   other URLs on the site can be accessed.
-
-3.1 Access method
-
-   The instructions must be accessible via HTTP [2] from the site that
-   the instructions are to be applied to, as a resource of Internet
-   Media Type [3] "text/plain" under a standard relative path on the
-   server: "/robots.txt".
-
-   For convenience we will refer to this resource as the "/robots.txt
-   file", though the resource need in fact not originate from a file-
-   system.
-
-   Some examples of URLs [4] for sites and the URLs of the
-   corresponding "/robots.txt" resources:
-
-     http://www.foo.com/welcome.html http://www.foo.com/robots.txt
-
-     http://www.bar.com:8001/        http://www.bar.com:8001/robots.txt
-
-   If the server response indicates Success (HTTP Status Code 2xx),
-   the robot must read the content, parse it, and follow any
-   instructions applicable to that robot.
-
-   If the server response indicates the resource does not exist (HTTP
-   Status Code 404), the robot can assume no instructions are
-   available, and that access to the site is not restricted by
-   /robots.txt.
-
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 3]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-   Specific behaviours for other server responses are not required by
-   this specification, though the following behaviours are recommended:
-
-     - On a server response indicating access restrictions (HTTP Status
-       Code 401 or 403), a robot should regard access to the site as
-       completely restricted.
-
-     - If the request attempt results in a temporary failure, a robot
-       should defer visits to the site until such time as the resource
-       can be retrieved.
-
-     - On a server response indicating Redirection (HTTP Status Code
-       3xx), a robot should follow the redirects until a resource can
-       be found.
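
The access rules above can be sketched in Python. This is only an
illustration, not part of the specification: the function names and the
action labels are hypothetical, and treating 5xx responses as the
"temporary failure" case is an assumption, since the memo does not name
specific status codes for it.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(url: str) -> str:
    """Return the /robots.txt URL for the site that serves `url`."""
    parts = urlsplit(url)
    # Keep scheme, host and port; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def action_for_status(status: int) -> str:
    """Map an HTTP status code for /robots.txt to a (hypothetical) action."""
    if 200 <= status < 300:
        return "parse"            # required: read, parse and follow
    if status == 404:
        return "allow_all"        # no instructions available
    if status in (401, 403):
        return "deny_all"         # recommended: treat site as restricted
    if 300 <= status < 400:
        return "follow_redirect"  # recommended: follow until found
    if 500 <= status < 600:
        return "defer"            # assumption: 5xx is a temporary failure
    return "unspecified"          # the memo gives no guidance here


print(robots_url("http://www.foo.com/welcome.html"))
print(robots_url("http://www.bar.com:8001/"))
```

The two print calls reproduce the example URL pairs given earlier in
this section.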
-
-
-3.2 File Format Description
-
-   The instructions are encoded as a formatted plain text object,
-   described here. A complete BNF-like description of the syntax of this
-   format is given in section 3.3.
-  
-   The format logically consists of a non-empty set of records,
-   separated by blank lines. The records consist of a set of lines of
-   the form:
-  
-     <Field> ":" <value>
-  
-   In this memo we refer to lines with a Field "foo" as "foo lines".
-
-   The record starts with one or more User-agent lines, specifying
-   which robots the record applies to, followed by "Disallow" and
-   "Allow" instructions to that robot. For example:
-  
-     User-agent: webcrawler
-     User-agent: infoseek
-     Allow:    /tmp/ok.html
-     Disallow: /tmp
-     Disallow: /user/foo
-    
-   These lines are discussed separately below.
-   
-   Lines with Fields not explicitly specified by this specification
-   may occur in the /robots.txt, allowing for future extension of the
-   format. Consult the BNF for restrictions on the syntax of such
-   extensions. Note specifically that for backwards compatibility 
-   with robots implementing earlier versions of this specification,
-   breaking of lines is not allowed.
-
-
-   
-Koster                draft-koster-robots-00.txt                [Page 4]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-   Comments are allowed anywhere in the file, and consist of optional
-   whitespace, followed by a comment character '#' followed by the
-   comment, terminated by the end-of-line.
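
A minimal parsing sketch for the record format described above,
assuming Python. The function name and the (agents, rules) tuple layout
are hypothetical. Lowercasing field names and skipping lines without a
":" are leniency assumptions, not requirements of this memo. Note that
str.splitlines() accepts CR, LF and CRLF line endings (and a few other
separators).

```python
def parse_records(text):
    """Split /robots.txt text into a list of (agents, rules) records.

    agents is a list of User-agent values; rules is a list of
    (field, value) pairs in file order, with field lowercased.
    """
    records = []
    agents, rules = [], []
    # The trailing "" acts as a final blank line so the last record is kept.
    for raw in text.splitlines() + [""]:
        if not raw.strip():
            # A blank line ends the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        # Strip the comment; a comment-only line does not end a record.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        else:
            # Allow, Disallow and unknown extension fields are all kept.
            rules.append((field, value))
    return records


sample = (
    "User-agent: webcrawler\r\n"
    "User-agent: infoseek\r\n"
    "Allow:    /tmp/ok.html # exception\r\n"
    "Disallow: /tmp\r\n"
    "Disallow: /user/foo\r\n"
)
print(parse_records(sample))
```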
-  
-3.2.1 The User-agent line
-
-   Name tokens are used to allow robots to identify themselves via a
-   simple product token. Name tokens should be short and to the
-   point. The name token a robot chooses for itself should be sent
-   as part of the HTTP User-agent header, and must be well documented.
-
-   These name tokens are used in User-agent lines in /robots.txt to
-   identify to which specific robots the record applies. The robot
-   must obey the first record in /robots.txt that contains a
-   User-agent line whose value contains the name token of the robot
-   as a substring. The name comparisons are case-insensitive. If no
-   such record exists, it should obey the first record with a
-   User-agent line with a "*" value, if present. If no record
-   satisfies either condition, or no records are present at all,
-   access is unlimited.
-  
-   For example, a fictional company FigTree Search Services, which
-   names its robot "Fig Tree" and sends HTTP requests like:
-  
-     GET / HTTP/1.0
-     User-agent: FigTree/0.1 Robot libwww-perl/5.04
-    
-   might scan the "/robots.txt" file for records with:
-  
-     User-agent: figtree
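
The record selection rule of this section can be sketched as follows,
assuming Python and assuming records have already been parsed into
(agents, rules) pairs. That layout and the function name are
hypothetical and serve only as illustration.

```python
def select_record(records, name_token):
    """Return the rules of the record a robot must obey, or None.

    records: list of (agents, rules) pairs in file order.
    name_token: the robot's own name token, e.g. "figtree".
    """
    token = name_token.lower()
    # First pass: first record with a User-agent value that contains
    # the robot's name token as a substring, compared case-insensitively.
    for agents, rules in records:
        if any(token in agent.lower() for agent in agents):
            return rules
    # Second pass: first record with a "*" User-agent value.
    for agents, rules in records:
        if "*" in agents:
            return rules
    # No applicable record: access is unlimited.
    return None


records = [
    (["FigTree"], [("disallow", "/private")]),
    (["*"], [("disallow", "/")]),
]
print(select_record(records, "figtree"))
print(select_record(records, "otherbot"))
```

A return value of None is used here to signal that no record applies,
so the caller can treat access as unlimited.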
-  
-3.2.2 The Allow and Disallow lines
-
-   These lines indicate whether accessing a URL that matches the
-   corresponding path is allowed or disallowed. Note that these
-   instructions apply to any HTTP method on a URL.
-  
-   To evaluate if access to a URL is allowed, a robot must attempt to
-   match the paths in Allow and Disallow lines against the URL, in the
-   order they occur in the record. The first match found is used. If no
-   match is found, the default assumption is that the URL is allowed.
-
-   The /robots.txt URL is always allowed, and must not appear in the
-   Allow/Disallow rules.
-
-   The matching process compares every octet in the path portion of
-   the URL and the path from the record. If a %xx encoded octet is
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 5]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-   encountered it is decoded prior to comparison, unless it is "%2f",
-   the encoded "/" character, which has special meaning in a path. The
-   match evaluates positively if and only if the end of the path from
-   the record is reached before a difference in octets is encountered.
-
-   This table illustrates some examples:
-  
-     Record Path        URL path         Matches
-     /tmp               /tmp               yes
-     /tmp               /tmp.html          yes
-     /tmp               /tmp/a.html        yes
-     /tmp/              /tmp               no
-     /tmp/              /tmp/              yes
-     /tmp/              /tmp/a.html        yes
-     
-     /a%3cd.html        /a%3cd.html        yes
-     /a%3Cd.html        /a%3cd.html        yes
-     /a%3cd.html        /a%3Cd.html        yes
-     /a%3Cd.html        /a%3Cd.html        yes
-     
-     /a%2fb.html        /a%2fb.html        yes
-     /a%2fb.html        /a/b.html          no
-     /a/b.html          /a%2fb.html        no
-     /a/b.html          /a/b.html          yes
-     
-     /%7ejoe/index.html /~joe/index.html   yes
-     /~joe/index.html   /%7Ejoe/index.html yes
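
The matching process can be sketched in Python as below. This is one
possible reading of the rule, not a reference implementation: the
helper names are hypothetical, and paths are assumed to contain only
US-ASCII characters. An encoded "/" is kept as a distinct marker so
that it never compares equal to a literal "/".

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _octets(path):
    """Decode %xx escapes into octet values, keeping %2F as a marker."""
    out = []
    i = 0
    while i < len(path):
        m = _ESCAPE.match(path, i)
        if m:
            value = int(m.group(1), 16)
            # An encoded "/" stays distinct from a literal "/".
            out.append("%2F" if value == 0x2F else value)
            i = m.end()
        else:
            out.append(ord(path[i]))
            i += 1
    return out


def path_matches(record_path, url_path):
    """True if the record path is an octet-wise prefix of the URL path."""
    rec = _octets(record_path)
    url = _octets(url_path)
    return url[:len(rec)] == rec


print(path_matches("/tmp", "/tmp.html"))
print(path_matches("/tmp/", "/tmp"))
print(path_matches("/a%2fb.html", "/a/b.html"))
print(path_matches("/%7ejoe/index.html", "/~joe/index.html"))
```

The four calls correspond to rows of the table above.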
-    
-3.3 Formal Syntax
-
-  This is a BNF-like description, using the conventions of RFC 822 [5],
-  except that "|" is used to designate alternatives.  Briefly, literals
-  are quoted with "", parentheses "(" and ")" are used to group
-  elements, optional elements are enclosed in [brackets], and elements
-  may be preceded with <n>* to designate n or more repetitions of the
-  following element; n defaults to 0.
-
-    robotstxt    = *blankcomment
-                 | *blankcomment record *( 1*commentblank 1*record )
-                   *blankcomment
-    blankcomment = 1*(blank | commentline)
-    commentblank = *commentline blank *(blankcomment)
-    blank        = *space CRLF
-    CRLF         = CR LF
-    record       = *commentline agentline *(commentline | agentline)
-                   1*ruleline *(commentline | ruleline)
-
-
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 6]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-
-    agentline    = "User-agent:" *space agent  [comment] CRLF
-    ruleline     = (disallowline | allowline | extension)
-    disallowline = "Disallow" ":" *space path [comment] CRLF
-    allowline    = "Allow" ":" *space rpath [comment] CRLF
-    extension    = token ":" *space value [comment] CRLF
-    value        = <any CHAR except CR or LF or "#">
-
-    commentline  = comment CRLF
-    comment      = *blank "#" anychar
-    space        = 1*(SP | HT)
-    rpath        = "/" path
-    agent        = token
-    anychar      = <any CHAR except CR or LF>
-    CHAR         = <any US-ASCII character (octets 0 - 127)>
-    CTL          = <any US-ASCII control character
-                        (octets 0 - 31) and DEL (127)>
-    CR           = <US-ASCII CR, carriage return (13)>
-    LF           = <US-ASCII LF, linefeed (10)>
-    SP           = <US-ASCII SP, space (32)>
-    HT           = <US-ASCII HT, horizontal-tab (9)>
-
-   The syntax for "token" is taken from RFC 1945 [2], reproduced here for
-   convenience:
-   
-    token        = 1*<any CHAR except CTLs or tspecials>
-
-    tspecials    = "(" | ")" | "<" | ">" | "@"
-                 | "," | ";" | ":" | "\" | <">
-                 | "/" | "[" | "]" | "?" | "="
-                 | "{" | "}" | SP | HT
-
-  The syntax for "path" is defined in RFC 1808 [6], reproduced here for
-  convenience:
-
-    path        = fsegment *( "/" segment )
-    fsegment    = 1*pchar
-    segment     =  *pchar
-
-    pchar       = uchar | ":" | "@" | "&" | "="
-    uchar       = unreserved | escape
-    unreserved  = alpha | digit | safe | extra
-
-    escape      = "%" hex hex
-    hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
-                         "a" | "b" | "c" | "d" | "e" | "f"
-
-    alpha       = lowalpha | hialpha
-
-
-
-
-Koster                draft-koster-robots-00.txt                [Page 7]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-    lowalpha    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
-                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
-                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
-    hialpha     = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
-                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
-                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
-
-    digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
-                  "8" | "9"
-
-    safe        = "$" | "-" | "_" | "." | "+"
-    extra       = "!" | "*" | "'" | "(" | ")" | ","
-
-                   
-3.4 Expiration
-
-   Robots should cache /robots.txt files, but if they do they must
-   periodically verify the cached copy is fresh before using its
-   contents.
-
-   Standard HTTP cache-control mechanisms can be used by both origin
-   server and robots to influence the caching of the /robots.txt file.
-   Specifically, robots should take note of the Expires header set by
-   the origin server.
-
-   If no cache-control directives are present robots should default to
-   an expiry of 7 days.
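
A sketch of the freshness check, assuming Python. Only the Expires
header and the 7-day default are modelled; other cache-control
mechanisms are left out. The function names are hypothetical, and
falling back to the default when the header cannot be parsed is an
assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_TTL = timedelta(days=7)  # default expiry when no directive is given


def expiry_time(fetched_at, expires_header=None):
    """Return the time at which a cached /robots.txt stops being fresh."""
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            expires = None  # assumption: ignore an unparsable header
        if expires is not None:
            if expires.tzinfo is None:
                # Treat a date without zone information as UTC.
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
    return fetched_at + DEFAULT_TTL


def is_fresh(fetched_at, now, expires_header=None):
    """True while the cached copy may still be used without re-checking."""
    return now < expiry_time(fetched_at, expires_header)


fetched = datetime(1996, 12, 4, tzinfo=timezone.utc)
print(is_fresh(fetched, fetched + timedelta(days=6)))
print(is_fresh(fetched, fetched + timedelta(days=8)))
```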
-
-
-4. Examples
-
-   This section contains an example of how a /robots.txt may be used.
-
-   A fictional site may have the following URLs:
-
-     http://www.fict.org/
-     http://www.fict.org/index.html
-     http://www.fict.org/robots.txt
-     http://www.fict.org/server.html
-     http://www.fict.org/services/fast.html
-     http://www.fict.org/services/slow.html
-     http://www.fict.org/orgo.gif
-     http://www.fict.org/org/about.html
-     http://www.fict.org/org/plans.html
-     http://www.fict.org/%7Ejim/jim.html
-     http://www.fict.org/%7Emak/mak.html
-
-   The site may have specific rules in its /robots.txt for robots that
-   send an HTTP User-agent of "UnhipBot/0.1", "WebCrawler/3.0", and
-
-
-Koster                draft-koster-robots-00.txt                [Page 8]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-   "Excite/1.0", and a set of default rules:
-
-      # /robots.txt for http://www.fict.org/
-      # comments to webmaster@fict.org
-
-      User-agent: unhipbot
-      Disallow: /
-
-      User-agent: webcrawler
-      User-agent: excite
-      Disallow: 
-
-      User-agent: *
-      Disallow: /org/plans.html
-      Allow: /org/
-      Allow: /serv
-      Allow: /~mak
-      Disallow: /
-
-   The following matrix shows which robots are allowed to access URLs:
-
-                                               unhipbot webcrawler other
-                                                        & excite
-     http://www.fict.org/                         No       Yes       No
-     http://www.fict.org/index.html               No       Yes       No
-     http://www.fict.org/robots.txt               Yes      Yes       Yes
-     http://www.fict.org/server.html              No       Yes       Yes
-     http://www.fict.org/services/fast.html       No       Yes       Yes
-     http://www.fict.org/services/slow.html       No       Yes       Yes
-     http://www.fict.org/orgo.gif                 No       Yes       No
-     http://www.fict.org/org/about.html           No       Yes       Yes
-     http://www.fict.org/org/plans.html           No       Yes       No
-     http://www.fict.org/%7Ejim/jim.html          No       Yes       No
-     http://www.fict.org/%7Emak/mak.html          No       Yes       Yes
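
The matrix above can be reproduced with a short Python sketch. The
record data is transcribed from the example /robots.txt. The function
names and the "otherbot" name token are hypothetical. Treating an empty
Disallow value as "nothing is restricted" is an assumption drawn from
the matrix itself, and the escape handling is simplified (it decodes
%xx except an encoded "/").

```python
import re
from urllib.parse import urlsplit

# Records from the example, as (agents, rules) in file order.
RECORDS = [
    (["unhipbot"], [("Disallow", "/")]),
    (["webcrawler", "excite"], [("Disallow", "")]),
    (["*"], [("Disallow", "/org/plans.html"), ("Allow", "/org/"),
             ("Allow", "/serv"), ("Allow", "/~mak"), ("Disallow", "/")]),
]


def _decode(path):
    """Decode %xx escapes, leaving an encoded "/" (%2F) untouched."""
    return re.sub(r"%(?!2[Ff])([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), path)


def rules_for(name_token):
    """First record naming the robot, else the first "*" record."""
    token = name_token.lower()
    for agents, rules in RECORDS:
        if any(token in agent.lower() for agent in agents):
            return rules
    for agents, rules in RECORDS:
        if "*" in agents:
            return rules
    return None


def is_allowed(name_token, url):
    path = urlsplit(url).path or "/"
    if path == "/robots.txt":
        return True  # the /robots.txt URL is always allowed
    rules = rules_for(name_token)
    if rules is None:
        return True
    path = _decode(path)
    for field, value in rules:
        if value == "":
            continue  # assumption: an empty path restricts nothing
        if path.startswith(_decode(value)):
            return field == "Allow"  # the first match is used
    return True  # no match: allowed by default


for robot in ("unhipbot", "webcrawler", "otherbot"):
    print(robot, is_allowed(robot, "http://www.fict.org/%7Emak/mak.html"))
```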
-
-
-5. Notes for Implementors
-
-5.1   Backwards Compatibility
-
-   Previous versions of this specification did not provide the Allow
-   line. The introduction of the Allow line causes robots to behave
-   slightly differently under either specification:
-
-   If a /robots.txt contains an Allow which overrides a later occurring
-   Disallow, a robot ignoring Allow lines will not retrieve those
-   parts. This is considered acceptable because there is no requirement
-   for a robot to access URLs it is allowed to retrieve, and it is safe,
-   in that no URLs a Web site administrator wants to Disallow will be
-   allowed. It is expected this may in fact encourage robots to upgrade
-   compliance to the specification in this memo.
-
-
-Koster                draft-koster-robots-00.txt                [Page 9]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-5.2   Interoperability
-
-   Implementors should pay particular attention to the robustness in
-   parsing of the /robots.txt file. Web site administrators who are not
-   aware of the /robots.txt mechanisms often notice repeated failed
-   requests for it in their log files, and react by putting up pages
-   asking "What are you looking for?".
-
-   As the majority of /robots.txt files are created with platform-
-   specific text editors, robots should be liberal in accepting files
-   with different end-of-line conventions, specifically CR and LF in
-   addition to CRLF.
-
-
-6. Security Considerations
-
-   There are a few risks in the method described here, which may affect
-   either origin server or robot.
-
-   Web site administrators must realise this method is voluntary, and
-   is not sufficient to guarantee that robots will not visit restricted
-   parts of the URL space. Failure to use proper authentication or other
-   restriction may result in exposure of restricted information. It is
-   even possible that the occurrence of paths in the /robots.txt file
-   may expose the existence of resources not otherwise linked to on the
-   site, which may aid people guessing for URLs.
-
-   Robots need to be aware that the amount of resources spent on dealing
-   with the /robots.txt is a function of the file contents, which is not
-   under the control of the robot. For example, the contents may be
-   larger in size than the robot can deal with. To prevent denial-of-
-   service attacks, robots are therefore encouraged to place limits on
-   the resources spent on processing of /robots.txt.
-
-   The /robots.txt directives are retrieved and applied in separate,
-   possibly unauthenticated HTTP transactions, and it is possible that
-   one server can impersonate another or otherwise intercept a
-   /robots.txt, and provide a robot with false information. This
-   specification does not preclude authentication and encryption
-   from being employed to increase security.
-
-7. Acknowledgements
-
-   The author would like to thank the subscribers to the robots mailing
-   list for their contributions to this specification.
-
-
-
-
-
-
-
-Koster                draft-koster-robots-00.txt               [Page 10]
-
-INTERNET DRAFT        A Method for Robots Control       December 4, 1996
-
-8. References
-
-   [1] Koster, M., "A Standard for Robot Exclusion", 
-       http://info.webcrawler.com/mak/projects/robots/norobots.html,
-       June 1994.
-   
-   [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
-       Transfer Protocol -- HTTP/1.0", RFC 1945, MIT/LCS, May 1996.
-
-   [3] Postel, J., "Media Type Registration Procedure", RFC 1590,
-       USC/ISI, March 1994.
-
-   [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
-       Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
-       University of Minnesota, December 1994.
-
-   [5] Crocker, D., "Standard for the Format of ARPA Internet Text
-       Messages", STD 11, RFC 822, UDEL, August 1982.
-
-   [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
-       UC Irvine, June 1995.
-
-9. Author's Address
-
-   Martijn Koster
-   WebCrawler
-   America Online
-   690 Fifth Street
-   San Francisco
-   CA 94107
-   
-   Phone: 415-3565431
-   EMail: m.koster@webcrawler.com
-
-                                                    Expires June 4, 1997
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Koster                draft-koster-robots-00.txt               [Page 11]
-
-
-
-