Title/Summary: CodingProjectIdeas/StandardLibrary/CleanupUrlLibProject

Organization: Python Software Foundation

Abstract: In the Python standard library, the functions required to handle URL-specific operations are spread across different modules in the Internet Protocols category, the primary modules being urllib, urllib2 and urlparse. These modules are old, have overlapping functionality, and each is missing features that the others provide. All three stand in need of a cleanup covering various bugs, inconsistencies, feature requests and implementation changes required in the code. With the cleanup of the urllib modules, all of them can be unified into a single module providing better interfaces to the most commonly used URL-handling methods, and many of those methods can be improved considerably using function decorators. The present modules are coded against RFC 1738 and RFC 1808, with only certain properties of RFC 2396 included; all of them will be required to comply with the specification described in RFC 2396. The unified module should incorporate methods and interfaces similar to those provided by separate packages such as mechanize and urlgrabber.

With the changes to the URL modules, the following bug fixes, enhancement requests and feature requests are to be incorporated into this project.

Bug Fixes:

1) Implement a timeout parameter for connections.
2) [ 600362 ] relocate cgi.parse_qs() into urlparse
3) [ 735515 ] urllib,urllib2 to cache redirections.

Fixes from patches submitted at SourceForge:

1) [ 1648102 ] proxy_bypass in urllib handling of <local> macro
2) [ 1664522 ] Fix for urllib.ftpwrapper.retrfile()
3) [ 1667860 ] urllib2 raises an UnboundLocalError if "auth-int" is the qop
4) [ 1673007 ] urllib2 requests history + HEAD support
5) [ 1675455 ] Use getaddrinfo() in urllib2.py for IPv6 support

Changes:

1) [PROPOSAL] [MAJOR] Unification of urllib, urllib2 and urlparse into a single module.
2) In all modules, follow the new RFC 2396 in favour of RFC 1738 and RFC 1808.
3) Implement higher-level interfaces to the url module that can accomplish the most common tasks, following the interfaces of the mechanize and urlgrabber modules.
4) Make proxy and proxy-authentication support consistent.
5) Provide HTTPRefreshProcessor, HTTPEquivProcessor, HTTPRobotRulesProcessor, HTTPRedirectHandler, HTTPRefererProcessor and HTTPRequestUpgradeProcessor objects in the urllib module.
6) Provide an interface to the URL-grabbing feature.
7) Address the urlparse brokenness issue: compliance with RFC 2396.
8) Add a cache facility and a facility to query the cache.

Write unit tests for all fixes and changes proposed.

Detailed Description: This proposal aims to clean up the urllib modules and proposes a unified module to handle the URL-specific functions. The approach is to fix the existing identified bugs, implement the requested changes, and develop a unified solution modelled on the existing URL-handling modules. The bugs which will be fixed with the cleanup tasks are:

1) Implement a timeout parameter for connections. This means implementing urllib2.timeout() for timing out after a specified interval, with the ability to pass a timeout down to the underlying socket. urllib and urllib2 use the socket module but do not yet have a way to time out when a request has not been served within a specified interval of time. Providing a timeout value to the request methods will be a useful addition to urllib; a sketch of the intended usage follows below.
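A minimal sketch of the timeout problem and the proposed interface. The per-request keyword shown in the final comment is hypothetical, part of this proposal rather than anything urllib2 accepts today; everything else uses the existing stdlib API.

    import socket
    import urllib2

    # Current workaround: a process-wide default timeout that affects
    # every socket created afterwards, not just this one request.
    socket.setdefaulttimeout(10.0)
    response = urllib2.urlopen('http://www.python.org/')
    print response.geturl()

    # Proposed per-request form (hypothetical signature from this
    # cleanup; urllib2.urlopen() does not take this argument yet):
    # response = urllib2.urlopen('http://www.python.org/', timeout=10.0)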
2) [ 600362 ] relocate cgi.parse_qs() into urlparse. URL-parsing functionality is currently distributed among the urlparse, cgi and urllib modules, and it would be good to consolidate it to make it easier to find. The urlparse.urlparse() function splits a URL into its relative pieces, but it does not parse the query string into a dictionary; that role is played by cgi.parse_qs(). And to convert a dictionary back into a query string, the programmer needs to know that that function is urllib.urlencode(). It would be nice to have cgi.parse_qs() and urllib.urlencode() in a unified place, within the urlparse module if appropriate. This will help reduce the amount of hunting-and-pecking that beginners do when they are trying to deal with URLs. Reference: http://mail.python.org/pipermail/tutor/2002-August/016823.html

3) [ 735515 ] urllib,urllib2 to cache redirections. urllib and urllib2 should cache the results of 301 (permanent) redirections. This shouldn't break anything, since from one point of view it is just an internal optimisation, but it is also what RFC 2616 (section 10.3.2, first paragraph) says SHOULD happen.

Fixes from the patches submitted at SourceForge. While writing the new modules, the functionality provided by these patches will be kept intact, but the modules as such need to be rewritten using the newer features of Python to enable faster processing. Contacting the patch authors to inform them of the changes will be necessary when implementing them.

1) [ 1648102 ] proxy_bypass in urllib handling of <local> macro. Handling of the <local> macro in urllib.proxy_bypass is broken. According to the Microsoft documentation for this macro, what should be checked is simply that the specified host name does not contain a period. Since urllib gets its proxy information directly from the Windows registry, it makes sense to use the same definitions that Microsoft does. The submitted patch does exactly this. Documentation reference: http://msdn2.microsoft.com/en-gb/library/aa384098.aspx

2) [ 1664522 ] Fix for urllib.ftpwrapper.retrfile(). When trying to retrieve a non-existent file using the urllib.ftpwrapper.retrfile() method, a valid hook is returned instead of an error message, and the caller receives a 0-byte file. The current behaviour tries to emulate what one typically sees with HTTP servers and directory indexes, which means: try to RETR the file, and if that fails, assume it is a directory and LIST it.

3) [ 1667860 ] urllib2 raises an UnboundLocalError if "auth-int" is the qop. If the proxy server being connected to specifies the "auth-int" quality of protection (qop) code, or in fact any qop code other than "auth", urllib2 raises an UnboundLocalError exception.

4) [ 1673007 ] urllib2 requests history + HEAD support. Add a history of all sent and received headers/requests to the addinfourl object, and save the redirection history too.

5) [ 1675455 ] Use getaddrinfo() in urllib2.py for IPv6 support. A number of base Python modules use gethostbyname() when they should be using getaddrinfo(). The big limitation hit when using gethostbyname() is the lack of IPv6 support. This patch for urllib2.py replaces all uses of gethostbyname() with getaddrinfo(). getaddrinfo() returns 5-tuples, so additional code needs to wrap each getaddrinfo() call that replaces a gethostbyname() call, as the sketch below illustrates.
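A rough sketch of the wrapping that patch 1675455 calls for (the host and port are arbitrary example values): gethostbyname() yields a single IPv4 address string, while getaddrinfo() yields a list of 5-tuples that the replacement code must iterate over and unpack.

    import socket

    host, port = 'www.python.org', 80

    # gethostbyname() can only ever return an IPv4 address:
    print socket.gethostbyname(host)

    # getaddrinfo() supports IPv4 and IPv6, but returns a list of
    # (family, socktype, proto, canonname, sockaddr) 5-tuples, so
    # callers must loop and unpack instead of using a single value:
    for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
            host, port, 0, socket.SOCK_STREAM):
        print family, sockaddr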
The major proposal being the unification of the modules, it requires research into backward compatibility and into communication with the older modules. The design of the unified module can follow that of the mechanize module, which provides bug fixes, a simpler interface and features co-existing with those of urllib2, and also provides a communication mechanism between urllib2 objects and mechanize objects.

In all modules, follow the new RFC 2396 in favour of RFC 1738 and RFC 1808. The standard for URIs described in RFC 2396 differs from the older RFCs, and the urllib and urllib2 modules implement the URL specifications based on the older documents. This will require changes in urlparse and the other parsing modules to handle URLs as specified in RFC 2396.

As in mechanize and urlgrabber, implement higher-level interfaces to the url module that can accomplish the most common tasks. urllib is a relatively raw interface to the underlying protocols; urlgrabber is a much better interface for fetching URLs: it is extremely simple to drop into an existing program and provides a clean interface to protocol-independent file access.

Solve the urlparse brokenness discussed in http://mail.python.org/pipermail/python-dev/2005-November/058301.html. urlparse currently splits URLs along the lines of RFC 1738 and RFC 1808 and needs to be updated to support RFC 2396, in which case it should parse a URL into five components, <scheme>://<netloc>/<path>?<query>#<fragment>, and return a 5-tuple: (scheme, netloc, path, query, fragment). A small illustration is given at the end of this section.

Provide better consistency in the proxy-support features of the urllib modules. Enable a caching feature in urllib (as present in the mechanize module) and include a function to query whether a particular URL is in the cache.

The following can be adopted from the mechanize module:

1. The handler classes HTTPRefreshProcessor, HTTPEquivProcessor and HTTPRobotRulesProcessor.
2. The HTTPRedirectHandler, HTTPRequestUpgradeProcessor and ResponseUpgradeProcessor classes.
3. Interoperability: request and response objects from code based on urllib2 work with mechanize and urlgrabber, and vice versa.

With the above-mentioned changes, the plan for the completion of CleanUrllib can be undertaken in the specified timeframe.
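The illustration promised above: the existing urlparse module already exposes both parsing styles, with urlparse() splitting out the RFC 1808 ';parameters' field and urlsplit() returning the RFC 2396-style 5-tuple. Treating urlsplit() as the model for the cleaned-up parser is an assumption of this sketch, and the example URL is an arbitrary one chosen to exercise every component.

    import urlparse

    url = 'http://www.python.org/doc/current;type=html?name=guido#intro'

    # Older 6-tuple, with params split out of the last path segment:
    # (scheme, netloc, path, params, query, fragment)
    print urlparse.urlparse(url)

    # RFC 2396-style 5-tuple, params kept as part of the path:
    # (scheme, netloc, path, query, fragment)
    print urlparse.urlsplit(url)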