Issue 881522: Shelve slow after 7/8000 key

Created on 2004-01-21 17:09 by marcoberi, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
test1.py marcoberi,2004-01-21 17:09 Little test program (9 lines) to show the problem
test1skip.py skip.montanaro,2004-01-22 00:28
test3skip.py skip.montanaro,2004-01-22 18:02
Messages (24)
msg19737 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-21 17:09
After about 8,000 insertions, shelve becomes really, really slow. This happens only with 2.3.3 #51 on Windows, not with 2.2 and not with 2.3 on Linux. I tried with writeback True and False: same problem. Help! :-))
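The attached test1.py is not reproduced in this thread. A minimal sketch of that kind of test might look like the following, assuming a modern Python 3 where shelve.open picks whichever dbm backend is available (so it does not exercise the 2.3-era dbhash / Berkeley DB code path). The function name time_shelve_inserts and the batch parameter are illustrative and not taken from the original script.

```python
import os
import shelve
import tempfile
import time


def time_shelve_inserts(n_keys, batch=1000):
    """Insert n_keys entries into a fresh shelf.

    Returns a list with the elapsed seconds for each full batch of inserts,
    so a slowdown that starts after a few thousand keys shows up as growing
    per-batch times. (Hypothetical helper, not the original test1.py.)
    """
    timings = []
    with tempfile.TemporaryDirectory() as tmp:
        shelf = shelve.open(os.path.join(tmp, "test_shelf"))
        try:
            start = time.perf_counter()
            for i in range(n_keys):
                shelf[str(i)] = i  # shelve keys must be strings
                if (i + 1) % batch == 0:
                    now = time.perf_counter()
                    timings.append(now - start)
                    start = now
        finally:
            shelf.close()
    return timings
```

Calling time_shelve_inserts(10000) and printing the returned list gives one number per 1,000 inserts; roughly flat numbers would suggest the slowdown described here is not present on that interpreter and backend.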
msg19738 - (view) Author: Thomas Heller (theller) * (Python committer) Date: 2004-01-21 18:24
Logged In: YES user_id=11105 Hm, are windows bugs automatically assigned to me ;-)??
msg19739 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-21 23:57
Logged In: YES user_id=588604 Skip Montanaro discovered that whichdb reports bsddb185 with python 2.2 and dbhash with 2.3.3. So why is it so slow after a few thousand keys?
msg19740 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 00:28
Logged In: YES user_id=44345 Can't reproduce on Mac OS X. I tried with 2.2, 2.3 and CVS using attached test1skip.py (no writeback - 2.2 doesn't support it, no import pickle - not used, no key prints - just muddies the water, print whichdb's result). The times are close enough to not worry me:

    montanaro:tmp% time python2.3 test1.py
    dbhash
    real 0m1.927s
    user 0m1.720s
    sys 0m0.080s

    montanaro:tmp% time python2.2 test1.py
    dbhash
    real 0m1.250s
    user 0m0.850s
    sys 0m0.360s

    montanaro:tmp% time python test1.py
    dbhash
    real 0m2.179s
    user 0m1.950s
    sys 0m0.120s

Please try this modified version just to make sure we are both looking at the same thing.
msg19741 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-22 07:30
Logged In: YES user_id=588604 I tried your version: 31.36 seconds vs 0.65. Just to be sure I tried on three different computers with Windows 2000: same gap.

    [c:\tmp]timer & \Python23\python test1skip.py & timer
    Timer 1 on: 8.21.58
    dbhash
    Timer 1 off: 8.22.29 Elapsed: 0.00.31,36

    [c:\tmp]timer & \Python22\python test1skip.py & timer
    Timer 1 on: 8.22.40
    dbhash
    Timer 1 off: 8.22.41 Elapsed: 0.00.00,65
msg19742 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 17:29
Logged In: YES user_id=31435 FYI, on a Win98SE box, test1skip.py took about 30 seconds under 2.3.3, and about 1 second under both 2.2.3 and 2.1.3. Under 2.3.3, no significant time is taken by a.close(), so it's all in the loop. It prints "dbhash" under all versions.
msg19743 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 18:01
Logged In: YES user_id=44345 Try test3skip.py. You run it like this:

    python test3skip.py hashopen
    python test3skip.py btopen

I ran it on win2k under cygwin so I could use the time command (but ran the Windows version of Python). Using btopen was much faster. I got rid of shelve to eliminate it and pickle as possible sources of problems.

    $ time /cygdrive/c/Python23/python test3skip.py hashopen
    real 0m6.801s
    user 0m0.015s
    sys 0m0.000s

    Administrator@CYCLOPS ~/tmp
    $ time /cygdrive/c/Python23/python test3skip.py btopen
    real 0m0.345s
    user 0m0.015s
    sys 0m0.015s

I don't know if the relationship between real, user and sys time means anything on cygwin, but the reported real times are very repeatable and match my subjective feel of the elapsed time. This suggests there's something fishy with either the underlying library or with __setitem__ when using hash files. I'm assigning to Greg so he can take a peek. As the bsddb/pybsddb guy he might have some better insight (certainly better than me).
msg19745 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-22 18:16
Logged In: YES user_id=588604 I get your same results under normal cmd: 7.07 seconds vs 0.46.

    [c:\tmp]timer & \python23\python test3skip.py hashopen & timer
    Timer 1 on: 19.13.22
    Timer 1 off: 19.13.29 Elapsed: 0.00.07,07

    [c:\tmp]timer & \python23\python test3skip.py btopen & timer
    Timer 1 on: 19.13.45
    Timer 1 off: 19.13.45 Elapsed: 0.00.00,46
msg19746 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2004-01-22 18:32
Logged In: YES user_id=413 This problem is not specific to windows. hashopen in the test3skip.py test case is 10x slower than btopen on my linux-alpha system. I don't know why BerkeleyDB hash databases are so much slower than B-Tree ones. My best suggestion is: if it hurts, don't do that. Use a btree rather than a hash database. Running the python process under strace on linux reveals nothing obvious (no system calls are being made during the time hashopen is consuming lots of cpu). You'll have to ask sleepycat themselves if you want a real answer as to why hash databases don't perform well.
msg19747 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 18:56
Logged In: YES user_id=31435 The original question is why a BDB hash is some 30x slower under 2.3 than under 2.2 or 2.1, and that does appear specific to Windows. Skip threw btrees into this too, but that complication doesn't appear relevant to the original report (despite marcoberi's hearsay 2004-01-21 18:57 comment -- others posted actual output, making clear that dbhash is used under all Python versions in test1skip). I'll note in passing that the test case inserts keys in already-mostly-sorted order, which is a friendly order for a btree-based mapping. To get back to the original report, ignore everything here concerning test3skip and btrees.
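The remark about key order can be checked in a benchmark by timing the same inserts once in generation order and once shuffled. The sketch below is only an illustration of that idea: make_keys and time_inserts are hypothetical helper names, the mapping argument can be any dict-like object (a plain dict, a shelf, or a database opened some other way), and nothing here is taken from test1skip.py or test3skip.py.

```python
import random
import time


def make_keys(n_keys, shuffled=False, seed=0):
    """Return string keys "0".."n-1", optionally in a fixed shuffled order."""
    keys = [str(i) for i in range(n_keys)]
    if shuffled:
        # A seeded generator keeps the shuffled order repeatable across runs.
        random.Random(seed).shuffle(keys)
    return keys


def time_inserts(mapping, keys):
    """Insert every key into the given dict-like mapping; return elapsed seconds."""
    start = time.perf_counter()
    for key in keys:
        mapping[key] = key
    return time.perf_counter() - start
```

Comparing time_inserts(db, make_keys(n)) against time_inserts(db, make_keys(n, shuffled=True)) on fresh databases would show how much of a btree's advantage depends on the friendly insertion order.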
msg19748 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2004-01-22 19:12
Logged In: YES user_id=413 python 2.2 and earlier on windows linked against some form of bsddb 1.85. python 2.3 and later link against modern BerkeleyDB (not really related to bsddb 1.85 much at all other than by name and a legacy api). They are very different libraries with very different capabilities and performance. regardless, i don't have a windows development platform anymore. someone who does, please take this. i suspect this is not something we can fix. try asking sleepycat why modern DB_HASH databases might be slower than bsddb 1.85 hash databases on windows and see what they say.
msg19749 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 20:22
Logged In: YES user_id=44345 I guess I get similar results on Mac OS X after looking at it a bit. The differences are just not as dramatic (or disappointing) as they are on Windows. Here's the output of a little shell script which runs test3skip.py with various Python interpreters and Berkeley DB versions:

    Python version: (2, 4, 0, 'alpha', 0)
    Berkeley DB version: 4.2.4
    hashopen: 0m1.621s
    btopen: 0m0.608s

    Python version: (2, 3, 3, 'final', 0)
    Berkeley DB version: 4.2.0
    hashopen: 0m1.359s
    btopen: 0m0.450s

    Python version: (2, 2, 0, 'final', 0)
    Berkeley DB version: ???
    hashopen: 0m0.514s
    btopen: 0m0.202s

Only real (wall clock) times are displayed. Marco, unfortunately, there doesn't seem to be much we can do at this end to remedy the situation with hash files. If you want to use shelve but switch to bsddb.btopen as the underlying db file open call, try posting to comp.lang.python. Anything you do will probably be a miserable hack, but we can probably figure something out.
msg19750 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 20:28
Logged In: YES user_id=44345 Whoops, sorry about polluting the waters with the btree stuff. Dang time lag. Looking at just the hashopen times between 2.2, 2.3 and 2.4 does show that hash file times have gotten worse since Berkeley 1.85 days. Whether or not btree times muddy these particular waters, figuring out a way to switch to a different db type and still use the shelve module may be Marco's best bet for a short term performance improvement.
msg19751 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 20:36
Logged In: YES user_id=31435 Greg, I didn't expect you to fix it, I just didn't want the bug report closed based on misunderstanding what it was about. I've unassigned this item, and if nobody volunteers to dig into it within a few weeks, it should indeed be closed as "3rd Party" and "Won't Fix". Skip, maybe we should try to force spambayes to use a btree mapping too -- then maybe we could get a whole new class of intractable corruption errors <wink -- but it might be a lot faster>.
msg19752 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 21:11
Logged In: YES user_id=44345 If we wanted speed and didn't care about corruption, my vote would be bsddb185. ;-)
msg19753 - (view) Author: James Kew (jkew) Date: 2004-01-22 23:53
Logged In: YES user_id=598066 FWIW, to throw another use case into the pot: I (used to) run Roundup (roundup.sf.net) trackers on anydbm/Win2K and experienced a significant drop in performance between 2.2.x (bsddb185) and 2.3.x (dbhash). I understand that this is a third-party issue, and that there were significant known problems with bsddb 1.85, but it did cause me a bit of a double-take after having heard so much about Python 2.3 being faster... I say "used to" because the slowdown prompted me to migrate to Roundup's sqlite backend, solving my problem.
msg19754 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 00:08
Logged In: YES user_id=588604 I get your same results under normal cmd: 7.07 seconds vs 0.46.

    [c:\tmp]timer & \python23\python test3skip.py hashopen & timer
    Timer 1 on: 19.13.22
    Timer 1 off: 19.13.29 Elapsed: 0.00.07,07

    [c:\tmp]timer & \python23\python test3skip.py btopen & timer
    Timer 1 on: 19.13.45
    Timer 1 off: 19.13.45 Elapsed: 0.00.00,46
msg19755 - (view) Author: James Kew (jkew) Date: 2004-01-23 00:16
Logged In: YES user_id=598066 FWIW2, on skip's "miserable hack" comment below, vis-a-vis running shelve on btree: isn't this exactly the sort of thing shelve.Shelf is intended for?

    import bsddb
    import shelve

    db = bsddb.btopen("temp.db")
    sh = shelve.Shelf(db)
    # do stuff with sh
    sh.close() # automatically calls close() on the underlying db

(Not sure why Shelf and friends are documented in shelve's "Restrictions" subsection...)
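The same pattern can be sketched on a current Python 3, where the bsddb module is no longer part of the standard library. The example below assumes dbm.dumb as a stand-in backend purely so that it runs anywhere; it says nothing about btree or hash performance. The roundtrip function name is hypothetical.

```python
import dbm.dumb
import os
import shelve
import tempfile


def roundtrip(path):
    """Store and reload one value through a Shelf on an explicitly chosen backend."""
    # Open the underlying database ourselves instead of letting shelve.open
    # choose one; dbm.dumb stands in for bsddb.btopen() from the message above.
    db = dbm.dumb.open(path, "c")
    sh = shelve.Shelf(db)
    sh["spam"] = {"count": 1}
    sh.close()  # also closes the underlying db

    # Reopen read-only and fetch the value back.
    sh = shelve.Shelf(dbm.dumb.open(path, "r"))
    try:
        return sh["spam"]
    finally:
        sh.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(roundtrip(os.path.join(tmp, "temp")))
```

Because shelve.Shelf only needs a dict-like object with string-to-bytes storage, swapping the backend is a one-line change at the point where db is opened.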
msg19756 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 00:44
Logged In: YES user_id=588604 jkew, I also got a bit of a headache. I was pretty sure performance would improve with Python 2.3.3, but instead it got incredibly worse. I know this is perhaps a third-party issue, but I use a python feature (shelve), and I think it would be better to at least remove it or flag this problem in the documentation. We are talking about a few thousand keys, not billions! BTW I didn't post the previous message twice.
msg19757 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 10:01
Logged In: YES user_id=588604 I gave wrong info: I didn't try it on Linux, so I'm not so sure it's a windows specific problem. Besides this, looking at greg's 2004-01-22 18:32 comment, it seems that the linux-alpha system also has this problem. Would it be better to change the category to "Python library"?
msg19758 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 10:03
Logged In: YES user_id=588604 I mean: I didn't try with python 2.3 on linux (just with python 2.2)
msg19759 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-04-22 01:39
Logged In: YES user_id=31435 As threatened months ago, closed as 3rd Party, Won't Fix -- there's no sign that this will ever make progress.
msg19760 - (view) Author: Marco Beri (marcoberi) Date: 2005-02-17 13:42
Logged In: YES user_id=588604 FYI, with Python 2.4 the speed is ok again. So the problem is confined to the 2.3 versions (2.3.5 also has the slow shelve problem).
History
Date User Action Args
2022-04-11 14:56:02 admin set github: 39844
2004-01-21 17:09:32 marcoberi create