Hello,

I have stumbled upon a couple of inconsistencies in urllib.robotparser's __str__ methods. These appear to be unintentional omissions; basically the code was modified but the string methods were never updated.

1. The RobotFileParser.__str__ method doesn't include the default (*) User-agent entry.

>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
...
... User-agent: Googlebot
... Allow: /folder1/myfile.html
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: Googlebot
Allow: /folder1/myfile.html


>>>

This is *especially* awkward when parsing a valid robots.txt that only contains a wildcard User-agent.

>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
... """
>>> parser.parse(text.splitlines())
>>> print(parser)

>>>

2. Support was recently added for `Crawl-delay` and `Request-Rate` lines, but __str__ does not include these.

>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: figtree
... Crawl-delay: 3
... Request-rate: 9/30
... Disallow: /tmp
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: figtree
Disallow: /tmp


>>>

3. Two unnecessary trailing newlines are being appended to the string output (one for the last RuleLine and one for the last Entry) (see above examples).

Taken on their own these are all minor issues, but they do make things quite confusing when using robotparser from the REPL!
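To make clear that only the string output is affected, here is a minimal sketch that queries the parsed data through the documented can_fetch, crawl_delay and request_rate methods. The robots.txt content combines the examples above; the agent name "SomeBot" and the variable names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample input combining the wildcard and "figtree" examples from the report.
ROBOTS_TXT = """
User-agent: *
Allow: /some/path
Disallow: /another/path

User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "SomeBot" is a hypothetical agent with no entry of its own, so it
# falls back to the wildcard (*) rules.
wildcard_blocked = not parser.can_fetch("SomeBot", "/another/path")

# Crawl-delay and Request-rate are available through the public API
# even when they are missing from str(parser).
figtree_delay = parser.crawl_delay("figtree")
figtree_rate = parser.request_rate("figtree")

print(wildcard_blocked, figtree_delay, figtree_rate)
```

If the rules are honoured here but absent from print(parser), the problem is confined to the __str__ methods.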
The default entry was moved out of entries added in , but RobotFileParser.__str__ was not updated. Support for "Crawl-delay" and "Request-Rate" was added in , but Entry.__str__ was not updated.

These look like bugs to me, and I think the fix should be backported. But two unnecessary trailing newlines should be kept for compatibility in maintained versions. I think we can get rid of them in 3.8 (unless Senthil has a different opinion).
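As a rough illustration of what a complete rendering could look like, here is a sketch of a standalone helper. It is not the actual patch: render_robots is a hypothetical name, and the attributes it reads (entries, default_entry, useragents, rulelines, delay, req_rate, allowance, path) are undocumented CPython implementation details that may change. It also drops the trailing newlines, which would only be appropriate for 3.8.

```python
from urllib.robotparser import RobotFileParser


def render_robots(parser):
    """Render every parsed entry, including the default (*) one.

    Hypothetical helper; relies on undocumented CPython attributes.
    """
    entries = list(parser.entries)
    default = getattr(parser, "default_entry", None)
    if default is not None and default not in entries:
        entries.append(default)

    blocks = []
    for entry in entries:
        lines = [f"User-agent: {agent}" for agent in entry.useragents]
        if entry.delay is not None:
            lines.append(f"Crawl-delay: {entry.delay}")
        if entry.req_rate is not None:
            rate = entry.req_rate
            lines.append(f"Request-rate: {rate.requests}/{rate.seconds}")
        for rule in entry.rulelines:
            kind = "Allow" if rule.allowance else "Disallow"
            lines.append(f"{kind}: {rule.path}")
        blocks.append("\n".join(lines))

    # One blank line between entries, no trailing newlines.
    return "\n\n".join(blocks)


parser = RobotFileParser()
parser.parse("""
User-agent: *
Disallow: /another/path

User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
""".splitlines())

rendered = render_robots(parser)
print(rendered)
```

Named entries are listed first and the default entry last, which mirrors the order in which the parser consults them.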
> But two unnecessary trailing newlines should be kept for compatibility in maintained versions.

Yup, that sounds good to me. It doesn't seem to be an RFC requirement. The newlines are only kept for compatibility, and we can do away with them in 3.8.