msg331595
Author: larsfuse (larsfuse)
Date: 2018-12-11 09:30

The standard (http://www.robotstxt.org/robotstxt.html) says:

> To allow all robots complete access:
> User-agent: *
> Disallow:
> (or just create an empty "/robots.txt" file, or don't use one at all)

Here I give python an empty file:

$ curl http://10.223.68.186/robots.txt
$

Code:

rp = robotparser.RobotFileParser()
print (robotsurl)
rp.set_url(robotsurl)
rp.read()
print( "fetch /", rp.can_fetch(useragent = "*", url = "/"))
print( "fetch /admin", rp.can_fetch(useragent = "*", url = "/admin"))

Result:

$ ./test.py
http://10.223.68.186/robots.txt
('fetch /', False)
('fetch /admin', False)

As the result shows, robotparser thinks the whole site is blocked.
|
|
msg331870
Author: Terry J. Reedy (terry.reedy) *
Date: 2018-12-14 21:08

https://docs.python.org/2.7/library/robotparser.html#module-robotparser and https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser refer users, for file structure, to http://www.robotstxt.org/orig.html. That page says nothing about the effect of an empty file, so I don't see this as a bug. Even if it were one, I would be dubious about reversing the behavior without a deprecation notice first, and definitely not in 2.7. I would propose instead that the doc be changed to refer to the new file, with more and better examples, but add a note that robotparser interprets empty files as 'block all' rather than 'allow all'. Try bringing this up on python-ideas.
|
|
msg331963
Author: larsfuse (larsfuse)
Date: 2018-12-17 10:02

> (...) refers users, for file structure, to http://www.robotstxt.org/orig.html. This says nothing about the effect of an empty file, so I don't see this as a bug.

That is incorrect. From that url you can find:

> The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

So this is definitely a bug.
|
|
msg359180
Author: Andre Burgaud (gallicrooster) *
Date: 2020-01-02 03:41

Hi,

Is this ticket still relevant for Python 3.8? While running some tests with an empty robots.txt file, I realized that it was returning "ALLOWED" for any path, as per the current draft of the Robots Exclusion Protocol: https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1

Code:

from urllib import robotparser

robots_url = "file:///tmp/empty.txt"
rp = robotparser.RobotFileParser()
print(robots_url)
rp.set_url(robots_url)
rp.read()
print( "fetch /", rp.can_fetch(useragent = "*", url = "/"))
print( "fetch /admin", rp.can_fetch(useragent = "*", url = "/admin"))

Output:

$ cat /tmp/empty.txt
$ python -V
Python 3.8.1
$ python test_robot3.py
file:///tmp/empty.txt
fetch / True
fetch /admin True
|
|
msg359185
Author: Karthikeyan Singaravelan (xtreak) *
Date: 2020-01-02 08:36

There is a behavior change. parse() sets the modified time, and unless the modified time is set, the can_fetch method returns False. In Python 2 the parse method was called only when the file was non-empty [0], but in Python 3 it is always called, even when the file is empty [1]. The change was made in 1afc1696167547a5fa101c53e5a3ab4717f8852c to always read and parse the file, and then in 122541beceeccce4ef8a9bf739c727ccdcbf2f28 modified() was made to be called unconditionally during parse(), setting the modified time so that can_fetch returns True in the end. I think the behavior of robotparser for an empty file was undefined, which allowed these changes, and it would be good to have a test for this behavior.

[0] https://github.com/python/cpython/blob/f82e59ac4020a64c262a925230a8eb190b652e87/Lib/robotparser.py#L66-L67
[1] https://github.com/python/cpython/blob/149175c6dfc8455023e4335575f3fe3d606729f9/Lib/urllib/robotparser.py#L69-L70
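[Editor's note: the mechanism described in msg359185 can be sketched directly against the Python 3 API, bypassing the network fetch by calling parse() ourselves. This is a minimal illustration of the observed Python 3.8 behavior, not code from the tracker.]

```python
from urllib import robotparser

# Feed an empty robots.txt to the parser. In Python 3, parse() calls
# modified() even when there are no lines, so the modified time gets set
# and can_fetch() falls through to its permissive default.
rp = robotparser.RobotFileParser()
rp.parse([])

print(rp.mtime() > 0)           # the modified time was set by parse()
print(rp.can_fetch("*", "/"))
print(rp.can_fetch("*", "/admin"))
```

On Python 3.8 all three lines print True; on Python 2, where parse() was skipped for empty files, the equivalent can_fetch() calls returned False because the modified time was never set.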
|
|
msg359202
Author: Andre Burgaud (gallicrooster) *
Date: 2020-01-02 15:45

Thanks @xtreak for providing some clarification on this behavior! I can write some tests to cover this behavior, assuming that we agree that an empty file means "unlimited access". This was worded as such in the old internet draft from 1996 (section 3.2.1 in https://www.robotstxt.org/norobots-rfc.txt). The current draft is more ambiguous: "If no group satisfies either condition, or no groups are present at all, no rules apply." (https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1)

https://www.robotstxt.org/robotstxt.html clearly states that an empty file gives full access, but I'm getting lost in figuring out which is the official spec at the moment :-)
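[Editor's note: a test of the kind offered in msg359202 could be sketched as below. This is a hypothetical example, not the test that landed in CPython; it assumes the Python 3 behavior described in msg359185, where parse() on an empty file sets the modified time and leaves all paths allowed.]

```python
import unittest
from urllib import robotparser


class EmptyRobotsTxtTest(unittest.TestCase):
    """An empty robots.txt should mean unlimited access
    (https://www.robotstxt.org/robotstxt.html)."""

    def test_empty_file_allows_all(self):
        rp = robotparser.RobotFileParser()
        rp.parse([])  # simulate reading an empty robots.txt
        self.assertTrue(rp.can_fetch("*", "/"))
        self.assertTrue(rp.can_fetch("*", "/admin"))


if __name__ == "__main__":
    unittest.main(exit=False)
```

Using parse([]) rather than read() keeps the test hermetic: no local file or network server is needed to simulate the empty-file case.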
|
|