[Python-Dev] I'm not getting email from SF when assigned a bug/patch

Fredrik Lundh fredrik at pythonware.com
Mon Apr 3 00:28:29 CEST 2006


> Fredrik, if you would like to help move this all forward, great; I
> would appreciate the help. You can write a page scraper to get the
> data out of SF

challenge accepted ;-)

http://effbot.python-hosting.com/browser/stuff/sandbox/sourceforge/

contains three basic tools; getindex to grab index information from a python tracker, getpages to get "raw" xhtml versions of the item pages, and getfiles to get attached files.

I'm currently downloading a tracker snapshot that could be useful for testing; it'll take a few more hours before all data are downloaded (provided that SF doesn't ban me, and I don't stumble upon more cases where a certain rhettinger has pasted binary gunk into an iso-8859-1 form ;-).

alright, it took my poor computer nearly eight hours to grab all the data, and some tracker items needed special treatment to work around some interesting SF bugs, but I've finally managed to download all items available via the SF tracker index, and all data files available via the item pages:

tracker-105470 (bugs)
    6682 items
    6682 pages (100%)
    1912 files
tracker-305470 (patches)
    3610 items
    3610 pages (100%)
    4663 files
tracker-355470 (feature requests)
    430 items
    430 pages (100%)
    80 files

the complete data set is about 300 megabytes uncompressed, and ~85 megabytes zipped.
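as a quick sanity check on the figures above, the totals and the compression ratio can be worked out like this. this is only a sketch; the counts are copied from the summary, the sizes are the rounded values quoted above, and the variable names are made up for illustration.

```python
# counts copied from the summary above: (items, pages, files)
trackers = {
    "bugs": (6682, 6682, 1912),
    "patches": (3610, 3610, 4663),
    "feature requests": (430, 430, 80),
}

total_items = sum(items for items, _, _ in trackers.values())
total_files = sum(files for _, _, files in trackers.values())

# rounded sizes in megabytes, so the ratio is approximate
uncompressed_mb = 300
zipped_mb = 85
zip_ratio = zipped_mb / uncompressed_mb

print(total_items, "items,", total_files, "files")
print("zipped size is roughly %d%% of the uncompressed size" % round(zip_ratio * 100))
```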

the scripts are designed to make it easy to update the dataset; adding new items and files only takes a couple of minutes; refreshing the item information may take a few hours.
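the difference in cost comes from how much has to be fetched: new items are only the ones in the index that aren't on disk yet, while a refresh has to revisit every item already downloaded. a minimal sketch of that split, assuming items are identified by numeric ids; plan_update is a hypothetical helper, not part of the actual scripts.

```python
def plan_update(index_ids, downloaded_ids):
    """Split the tracker index into items to add and items to refresh.

    index_ids      -- ids currently listed in the tracker index
    downloaded_ids -- ids that already have a page on disk
    """
    index = set(index_ids)
    have = set(downloaded_ids)
    missing = sorted(index - have)   # cheap: only these need a first download
    existing = sorted(index & have)  # expensive: a refresh revisits all of these
    return missing, existing


missing, existing = plan_update([1, 2, 3, 4], [2, 3])
print("to add:", missing)
print("to refresh:", existing)
```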

:::

I've also added a basic "extract" module which parses the XHTML pages and the data files. this module can be used by import scripts, or be used to convert the dataset into other formats (e.g. a single XML file) for further processing.
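to give an idea of what such a conversion could look like, here is a rough sketch that turns a set of XHTML pages into a single XML document. it is not the extract module itself; the function names, the output element names (tracker, item, title, text), and the assumption that each page is well-formed XHTML with a title and a body are all made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_item(item_id, xhtml_text):
    """Build an <item> element from one XHTML page (hypothetical layout)."""
    root = ET.fromstring(xhtml_text)

    # look for namespaced elements first, then fall back to plain names
    title = root.findtext(f".//{XHTML_NS}title")
    if title is None:
        title = root.findtext(".//title") or ""
    body = root.find(f".//{XHTML_NS}body")
    if body is None:
        body = root.find(".//body")

    item = ET.Element("item", id=str(item_id))
    ET.SubElement(item, "title").text = title.strip()
    # flatten the body to plain text and collapse runs of whitespace
    text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    ET.SubElement(item, "text").text = text
    return item


def build_dataset(pages):
    """Combine {item_id: xhtml_text} into one XML string."""
    tracker = ET.Element("tracker")
    for item_id, xhtml_text in sorted(pages.items()):
        tracker.append(extract_item(item_id, xhtml_text))
    return ET.tostring(tracker, encoding="unicode")


pages = {
    1: '<html xmlns="http://www.w3.org/1999/xhtml"><head><title> Bug one </title></head>'
       '<body><p>hello  world</p></body></html>',
}
print(build_dataset(pages))
```

a real import script would pull out individual fields (status, submitter, comments and so on) rather than the flattened body text, but that depends on the page layout, which isn't described here.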

the source code is available via the above link; I'll post the ZIP file somewhere tomorrow (drop me a line if you want the URL).


