msg184017
Author: Ben Mezger (benmezger)
Date: 2013-03-12 10:58
I am trying to parse Google's robots.txt (http://google.com/robots.txt) and it fails when checking whether I can crawl the URL /catalogs/p? (which is allowed): it returns False. See my question on Stack Overflow -> http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly Someone answered that it has to do with the line "urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])" in the robotparser module, since it removes the "?" from the end of the URL. Here is the answer I received -> http://stackoverflow.com/a/15350039/1649067
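To illustrate the failure described above, here is a minimal sketch using the Python 3 names (the report quotes the Python 2 equivalents urllib.quote and urlparse.urlparse); the URL is the one from the report:

    from urllib.parse import quote, unquote, urlparse

    url = "http://google.com/catalogs/p?"
    # Index [2] of the urlparse() result is the path only, so the empty
    # query string (and with it the trailing '?') is discarded.
    normalized = quote(urlparse(unquote(url))[2])
    print(normalized)                              # '/catalogs/p'

    # A rule written as '/catalogs/p?' no longer prefix-matches the
    # normalized URL, so a rule such as 'Disallow: /catalogs' ends up
    # applying instead and can_fetch() reports False.
    print(normalized.startswith("/catalogs/p?"))   # False
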
msg184525
Author: Mher Movsisyan (mher)
Date: 2013-03-18 21:19
Attaching patch.

msg184609
Author: Mher Movsisyan (mher)
Date: 2013-03-19 07:11
The second patch only normalizes the url. From http://www.robotstxt.org/norobots-rfc.txt it is not clear how to handle multiple rules with the same prefix.
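A rough sketch of the normalization idea (not the attached patch itself): round-tripping a path through urlparse()/urlunparse() drops a bare trailing '?', so the rule path and the checked URL collapse to the same string on both sides:

    from urllib.parse import quote, unquote, urlparse, urlunparse

    def normalize(path):
        # An empty query component is dropped by urlunparse(), so a bare
        # trailing '?' disappears consistently on both sides.
        return quote(urlunparse(urlparse(unquote(path))))

    print(normalize("/catalogs/p?"))   # '/catalogs/p'
    print(normalize("/catalogs/p"))    # '/catalogs/p'
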
msg184614
Author: Ezio Melotti (ezio.melotti) *
Date: 2013-03-19 07:35
I left a couple of comments on rietveld.

msg185311
Author: andrew cooke (acooke)
Date: 2013-03-27 00:10
What is rietveld? And why is this marked as "easy"? It seems to involve issues that aren't described well in the spec - solving it completely requires some kind of canonical way to describe URLs with (and without) parameters.

msg185312
Author: Ezio Melotti (ezio.melotti) *
Date: 2013-03-27 00:14
Rietveld is the review tool. You can access it by clicking on the "review" link at the right of the patch. You should have received an email as well when I made the review.

msg185313
Author: andrew cooke (acooke)
Date: 2013-03-27 00:19
Thanks (I only subscribed to this now, so no previous email). My guess is that Google assumes a dumb regexp, so http://example.com/foo? in a rule does not match http://example.com/foo. I also realised that http://google.com/robots.txt doesn't contain any URL with multiple parameters, so perhaps I was wrong about needing a canonical representation (i.e. parameter ordering).

msg185314
Author: R. David Murray (r.david.murray) *
Date: 2013-03-27 02:36
Well, the code is easy. Figuring out what the code is supposed to do turns out to be hard, but we didn't know that when we marked it as easy :) I want to do more research before OKing a fix for this. (There is clearly a bug, I'm just not certain what the correct fix is.)

msg187523
Author: R. David Murray (r.david.murray) *
Date: 2013-04-21 20:30
Łukasz pointed out on IRC that the problem is that the current robotparser implements an outdated robots.txt standard. He may work on fixing that.

msg187552
Author: Mher Movsisyan (mher)
Date: 2013-04-22 10:54
Can you share a link to the new robots.txt standard? I may help implement it.

msg187557
Author: R. David Murray (r.david.murray) *
Date: 2013-04-22 12:39
I haven't a clue; that was part of the research I was going to do but haven't done yet (and probably won't for now...I'll wait to see if you or Łukasz pick it up first :). I see he hasn't added himself to the nosy list yet, though, so I've done that. Maybe he'll respond.

msg187560
Author: Łukasz Langa (lukasz.langa) *
Date: 2013-04-22 13:16
robotparser implements http://www.robotstxt.org/orig.html; there's even a link to this document at http://docs.python.org/3/library/urllib.robotparser.html. As mher points out, there's a newer version of that spec in the form of an RFC: http://www.robotstxt.org/norobots-rfc.txt. It introduces Allow, and specifies how percent encoding and expiration should be handled.

Moreover, there is a de facto standard agreed on by Google, Yahoo and Microsoft in 2008, documented in their respective blog posts:
- http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
- http://www.ysearchblog.com/2008/06/03/one-standard-fits-all-robots-exclusion-protocol-for-yahoo-google-and-microsoft/
- http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx

For reference, two third-party robots.txt parsers implement these extensions:
- https://pypi.python.org/pypi/reppy
- https://pypi.python.org/pypi/robotexclusionrulesparser

We need to decide how to incorporate those new features while addressing backwards compatibility concerns.

msg187561
Author: Senthil Kumaran (orsenthil) *
Date: 2013-04-22 13:29
My suggestion for this issue is to go ahead with Mher's patch2. It does a simple normalization and does the right thing. The case in the question is an empty query string and the behavior of Allow and Disallow for it, and the patch addresses that. (I don't know why this *bug* was not detected earlier.) Robotparser implements the updated spec (<www.robotstxt.org/norobots-rfc.txt>) - you can see the handling of Allow in both the code and the tests. That said, if robotparser is updated further to be more compliant with the cases the 3rd-party modules handle, +1 to that. I suggest that be taken up as a separate issue and not be confused with this bug.
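For illustration, here is the reported case reduced to a hypothetical two-rule robots.txt (not Google's actual file), fed straight to the stdlib parser; the Allow line is listed first so it is consulted before the broader Disallow:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Allow: /catalogs/p?",
        "Disallow: /catalogs",
    ])

    # Without the normalization the Allow path kept its '?' while the
    # checked URL lost it, so this returned False; with patch2 applied
    # both sides normalize to '/catalogs/p' and it returns True.
    print(rp.can_fetch("*", "http://www.google.com/catalogs/p?"))
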
msg190304
Author: Roundup Robot (python-dev)
Date: 2013-05-29 12:59
New changeset 30128355f53b by Senthil Kumaran in branch '3.3':
#17403: urllib.parse.robotparser normalizes the urls before adding to ruleline.
http://hg.python.org/cpython/rev/30128355f53b

New changeset e954d7a3bb8a by Senthil Kumaran in branch 'default':
merge from 3.3
http://hg.python.org/cpython/rev/e954d7a3bb8a

New changeset bcbad715c2ce by Senthil Kumaran in branch '2.7':
#17403: urllib.parse.robotparser normalizes the urls before adding to ruleline.
http://hg.python.org/cpython/rev/bcbad715c2ce

msg190306
Author: Senthil Kumaran (orsenthil) *
Date: 2013-05-29 13:01
This is fixed in default, 3.3 and 2.7. I will merge this change to the 3.2 code line before closing this. I shall raise a new request for updating robotparser with the other goodies.

msg237995
Author: Martin Panter (martin.panter) *
Date: 2015-03-13 00:23
Perhaps it’s too late to modify the 3.2 branch now? IMO the change made for this bug abuses the behaviour of urlunparse() removing empty query strings; see Issue 22852 where I proposed to stop it doing that.
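For reference, the behaviour in question: urlparse() cannot represent a bare trailing '?' (the query component is just an empty string), and urlunparse() emits no '?' for an empty query, so the round trip silently drops it:

    from urllib.parse import urlparse, urlunparse

    parts = urlparse("/catalogs/p?")
    print(parts.query)         # '' -- indistinguishable from "no query at all"
    print(urlunparse(parts))   # '/catalogs/p' -- the trailing '?' is gone
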
msg237998
Author: Berker Peksag (berker.peksag) *
Date: 2015-03-13 00:39
Yes, this doesn't look like a security issue to me. Too late for 3.2. Closing this as "fixed".
