msg171711
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2012-10-01 12:58
Robotparser doesn't support two quite important optional parameters from the robots.txt file. I have implemented them in the following way (robotparser should be initialized in the usual way: rp = robotparser.RobotFileParser(); rp.set_url(...); rp.read()):

crawl_delay(useragent) - returns the time in seconds that you need to wait between requests while crawling. If no delay is specified, or it doesn't apply to this user agent, returns -1.

request_rate(useragent) - returns a list in the form [requests, seconds]. If no rate is specified, or it doesn't apply to this user agent, returns -1.
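As a rough illustration (not part of the patch itself), the proposed methods would be exercised like this; the URL and user agent string are hypothetical:

    # Sketch of the proposed API, assuming the patch is applied
    # (Python 2 module name, as in the original patch).
    # The URL and user agent are made up for illustration.
    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical URL
    rp.read()

    delay = rp.crawl_delay("MyCrawler")   # seconds to wait, or -1
    rate = rp.request_rate("MyCrawler")   # [requests, seconds], or -1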
|
|
msg171712
Author: Christian Heimes (christian.heimes)
Date: 2012-10-01 13:16
Thanks for the patch. New features must be implemented in Python 3.4. Python 2.7 is in feature freeze mode and therefore doesn't get new features. |
|
|
msg171715
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2012-10-01 13:37
Okay, sorry, I didn't know that (: Here's the same patch (same functionality) for Python 3. Feedback is welcome, as always (:
|
|
msg171719
Author: Christian Heimes (christian.heimes)
Date: 2012-10-01 13:52
We have a team that mentors new contributors. If you are interested in getting your patch into Python 3.4, please read http://pythonmentors.com/. The people are really friendly and will help you with every step of the process.
|
|
msg172327
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2012-10-07 18:20
Okay, here's a proper patch with a documentation entry and test cases. Please review and comment.
|
|
msg172338
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2012-10-07 19:56
Reformatted patch |
|
|
msg205567
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2013-12-08 14:41
Hey, it has been more than a year since the last activity. Is there anything else I should do in order for someone from the Python dev team to review my changes and perhaps give some feedback?

Nick
|
|
msg205641
Author: Berker Peksag (berker.peksag)
Date: 2013-12-09 02:31
I left a few comments on Rietveld. |
|
|
msg205755
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2013-12-10 00:22
Thank you for the review! I have addressed your comments and released a v2 of the patch. Highlights:

- No longer crashes when provided with a malformed Crawl-delay/Request-rate parameter in robots.txt.
- Returns None when a parameter is missing or its syntax is invalid.
- Simplified several functions.
- Extended tests.

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
File Doc/library/urllib.robotparser.rst (right):

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)

On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is crawl_delay used for search engines? Google recommends you to set crawl
> speed via Google Webmaster Tools instead.
> See https://support.google.com/webmasters/answer/48620?hl=en.

The Crawl-delay and Request-rate parameters are targeted at custom crawlers that many people/companies write for specific tasks. Google Webmaster Tools applies only to Google's crawler, and web admins typically set different rates for Google/Yahoo/Bing than for all other user agents.

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
File Lib/urllib/robotparser.py (right):

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
Lib/urllib/robotparser.py:168: for entry in self.entries:

On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is there a better way to calculate this? (perhaps O(1)?)

I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions employ the

    for entry in self.entries:
        if entry.applies_to(useragent):

logic. I don't think this matters much here, as these two functions only need to be called once per domain, and robots.txt seldom contains more than three entries. This is why I have just followed the design laid out by the original developer.

Thanks
Nick
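To make the trade-off concrete, here is a minimal sketch of the linear-scan pattern being discussed; Entry, applies_to(), and the delay attribute are illustrative stand-ins for the robotparser internals, not the actual patch code:

    # Sketch of the existing per-entry lookup design; names are
    # illustrative stand-ins, not the actual patch.
    class Entry:
        def __init__(self, useragents, delay):
            self.useragents = useragents  # user agents this block covers
            self.delay = delay            # parsed Crawl-delay value

        def applies_to(self, useragent):
            # match on the token before "/", case-insensitively
            useragent = useragent.split("/")[0].lower()
            return any(agent == "*" or agent.lower() in useragent
                       for agent in self.useragents)

    def crawl_delay(entries, useragent):
        # O(len(entries)) scan, mirroring the existing design; cheap in
        # practice since robots.txt rarely has more than a few entries.
        for entry in entries:
            if entry.applies_to(useragent):
                return entry.delay
        return None  # v2 behaviour: None when no rule applies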
|
|
msg205761
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2013-12-10 00:41
Oh... sorry for the spam. Could you please verify my documentation link syntax? I'm not entirely sure I got it right.
|
|
msg208721
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2014-01-21 23:30
Hey, just a friendly reminder that there hasn't been any activity for a month and I have released a v2, pending review (:
|
|
msg219212
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2014-05-27 09:29
Updated patch, all comments addressed; sorry for the six-month delay. Please review.
|
|
msg223099
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2014-07-15 10:38
Hey, just a friendly reminder that there has been no activity for a month and a half and v3 is pending review (:
|
|
msg225916
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2014-08-26 13:15
Hey, just a friendly reminder that the patch is pending review and there has been no activity for three months (:
|
|
msg252483
Author: Nikolay Bogoychev (XapaJIaMnu)
Date: 2015-10-07 20:01
Hey, friendly reminder that there has been no activity on this issue for more than a year.

Cheers,
Nick
|
|
msg252521
Author: Roundup Robot (python-dev)
Date: 2015-10-08 09:27
New changeset dbed7cacfb7e by Berker Peksag in branch 'default':
Issue #16099: RobotFileParser now supports Crawl-delay and Request-rate
https://hg.python.org/cpython/rev/dbed7cacfb7e
|
|
msg252525
Author: Berker Peksag (berker.peksag)
Date: 2015-10-08 09:34
I've finally committed your patch to default. Thank you for not giving up, Nikolay :) Note that the link in the example section currently doesn't work; I will open a new issue for that.
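For reference, the committed API (urllib.robotparser, available since Python 3.6) can be exercised as in the sketch below; the URL is hypothetical, and note that request_rate() returns a named tuple rather than the list from the original patch:

    # Usage of the committed API (urllib.robotparser, Python 3.6+).
    # The URL is hypothetical; both methods return None when robots.txt
    # has no matching rule.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical URL
    rp.read()

    delay = rp.crawl_delay("MyCrawler")    # seconds, or None
    rate = rp.request_rate("MyCrawler")    # named tuple, or None
    if rate is not None:
        print(rate.requests, rate.seconds)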
|
|