Ecosyste.ms: Packages

An open API service providing package, version, and dependency metadata for many open source software ecosystems and registries.

Top 6.1% on proxy.golang.org
Top 2.8% of dependent packages on proxy.golang.org

proxy.golang.org: github.com/benjaminestes/robots/v2

Package robots implements robots.txt parsing and matching based on Google's specification. For a robots.txt primer, please read the full specification at: https://developers.google.com/search/reference/robots_txt.

Clients of this package have one obligation: when testing whether a URL can be crawled, use the correct robots.txt file. The specification uses scheme, port, and punycode variations to define which URLs are in scope. To get the right robots.txt file, use Locate. Locate takes as its only argument the URL you want to access, and it returns the URL of the robots.txt file that governs access. Locate will always return a single unique robots.txt URL for all input URLs sharing a scope.

In practice, a client pattern for testing whether a URL is accessible would be: (a) Locate the robots.txt file for the URL; (b) check whether you have fetched data for that robots.txt file; (c) if yes, use the data to Test the URL against your user agent; (d) if no, fetch the robots.txt data and try again. For details, see "File location & range of validity" in the specification: https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity

A generous parser is specified. A valid line is accepted, and an invalid line is silently discarded. This is true even if the content parsed is in an unexpected format, like HTML. For details, see "File format" in the specification: https://developers.google.com/search/reference/robots_txt#file-format

The specification states that a crawler will assume all URLs are accessible, even if there is no robots.txt file or the body of the robots.txt file is empty. So a robots.txt file with a 404 status code will result in all URLs being crawlable. The exception is a 5xx status code, which is treated as a temporary "full disallow" of crawling. For details, see "Handling HTTP result codes" in the specification: https://developers.google.com/search/reference/robots_txt#handling-http-result-codes
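A minimal sketch of that client pattern in Go is shown below. Locate and Test are the names given in the description above; the parser constructor From, the exact signatures, and the error-handling choices are assumptions to confirm against the package documentation. The status-code rules (4xx means everything is crawlable, 5xx means a temporary full disallow) are applied by the caller here, as the description leaves fetching the robots.txt file to the client.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/benjaminestes/robots/v2"
)

// canCrawl reports whether userAgent may fetch pageURL, following the
// client pattern described above: locate the governing robots.txt file,
// fetch it, and test the URL against the parsed rules.
func canCrawl(pageURL, userAgent string) (bool, error) {
	// Locate maps any in-scope URL to the single robots.txt URL
	// that governs it (signature assumed here).
	robotsURL, err := robots.Locate(pageURL)
	if err != nil {
		return false, err
	}

	resp, err := http.Get(robotsURL)
	if err != nil {
		// Treat a failed fetch conservatively, like a 5xx.
		return false, err
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode >= 500:
		// 5xx: temporary "full disallow" per the specification.
		return false, nil
	case resp.StatusCode >= 400:
		// 4xx (including 404): no usable robots.txt, so all URLs
		// are treated as crawlable.
		return true, nil
	}

	// From is an assumed constructor name for the parser; the parsed
	// value is expected to expose Test(agent, url).
	r, err := robots.From(resp.Body)
	if err != nil {
		return false, err
	}
	return r.Test(userAgent, pageURL), nil
}

func main() {
	ok, err := canCrawl("https://example.com/some/page", "MyCrawler")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("crawlable:", ok)
}
```

In a real crawler you would cache the parsed result keyed by the robots.txt URL returned from Locate, so that step (b) of the pattern avoids refetching the same file for every URL in the same scope.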

Registry - Source - Documentation - JSON
purl: pkg:golang/github.com/benjaminestes/robots/v2
Keywords: go, robots-txt
License: MIT
Latest release: over 4 years ago
First release: over 1 year ago
Namespace: github.com/benjaminestes/robots
Dependent packages: 4
Stars: 3 on GitHub
Forks: 2 on GitHub
See more repository details: repos.ecosyste.ms
Last synced: 13 days ago
