Issue 849662: reading shelves is really slow

Created on 2003-11-26 14:06 by ganssauge, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
all_idx2.shelve.bz2 | ganssauge, 2003-11-27 10:42 | The shelve in question
69228.profile | ganssauge, 2003-11-27 10:43 | The profiling data I made
slow_shelve.py | ganssauge, 2003-11-28 16:01 |
Messages (10)
msg19147 - Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-26 14:06
My application uses a shelve file which is created by another process using the same Python version. Before Python 2.3, using this shelve with the exact same application was almost twice as fast as a binary pickle containing the same data. Now, with Python 2.3, the same application is suddenly about 150 times slower than using the binary pickle.

The usage is as follows:

    idx_dict = shelve.open(idx_dict_name, "r")
    ...
    while not infile.eof:
        index = get_index_from_somewhere_else()
        if not idx_dict.has_key(index):
            do_something(index)
        else:
            do_something_else(index)
    idx_dict.close()

Profiling revealed that most of the time is spent within UserDict.
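For readers on current Python, the loop above can be sketched roughly as follows. This is an illustrative Python 3 translation, not the reporter's code: `classify`, `build_demo_shelf`, the file name `demo_idx.shelve` and the key `HI000001` are all hypothetical, and `in` stands in for `has_key()`, which no longer exists in Python 3.

```python
import os
import shelve
import tempfile


def classify(indices, shelf_path):
    """Split indices into those present in the shelf and those missing."""
    known, unknown = [], []
    # flag="r" opens the existing shelf read-only, as in the report
    with shelve.open(shelf_path, flag="r") as idx_dict:
        for index in indices:
            # "in" replaces the Python 2 has_key() call
            if index in idx_dict:
                known.append(index)
            else:
                unknown.append(index)
    return known, unknown


def build_demo_shelf(directory):
    """Create a tiny shelf with made-up keys shaped like those in the report."""
    path = os.path.join(directory, "demo_idx.shelve")
    with shelve.open(path, flag="c") as shelf:
        for key in ("HI568817", "HI000001"):
            shelf[key] = None
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = build_demo_shelf(tmp)
        print(classify(["HI568817", "HI999999"], demo_path))
```

The `with` block closes the shelf even if the loop raises, which replaces the explicit `close()` call in the original fragment.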
msg19148 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-27 09:17
I can reproduce a four-fold slowdown that persists even after the UserDict.DictMixin lines are commented out of shelve.py and bsddb/__init__.py. For me, the only thing that has changed is the underlying bsddb implementation.

Let's see if your system is going somewhere else to get its shelving done. After the first line, add:

    idx_dict.has_key([])

Then post the traceback here. Do that for both Py2.2 and Py2.3. Thank you.

Also, post what a typical record in the index looks like and tell me how many entries are typically in idx_dict. That way, I can try to reproduce your timings with greater fidelity.

Which OS are you using, and what are the minor bugfix version numbers of the Py2.2 and Py2.3 you are using?
msg19149 - Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-27 10:32
I uploaded my profiling data, maybe it will help you ...

Here is the information you requested:

----------------><------------------------><------------
(gotti@gglinux 534) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.3 python Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.3/shelve.py", line 104, in has_key
    return self.dict.has_key(key)
  File "/usr/lib/python2.3/bsddb/__init__.py", line 142, in has_key
    return self.db.has_key(key)
TypeError: String or Integer object expected for key, list found
(gotti@gglinux 535) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.2 python2.2 Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.2/shelve.py", line 62, in has_key
    return self.dict.has_key(key)
TypeError: key type must be string
(gotti@gglinux 536) python -V
Python 2.3.2
(gotti@gglinux 537) python2.2 -V
Python 2.2.3
(gotti@gglinux 538) uname -a
Linux gglinux 2.4.22 #1 SMP Mon Nov 3 11:40:28 CET 2003 i686 unknown unknown GNU/Linux
(gotti@gglinux 538) cat /etc/debian_version
testing/unstable
(gotti@gglinux 539) python2.2 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print len (d.keys()) ; print d.keys()[0], d [d.keys()[0]]'
34983
HI568817 None
(gotti@gglinux 540) python2.3 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print "# items in shelve:", len (d.keys()) ; print "Items look like: index", d.keys() [0], "value", d [d.keys()[0]]'
# items in shelve: 34983
Items look like: index HI568817 value None
msg19150 - Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-27 10:42
What the heck ... here is the shelve in question.
msg19151 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-27 17:55
The fragment in the original posting showed that the only inner-loop shelve access was through has_key(). The tracebacks show that UserDict is nowhere in the traceback chain. I conclude that the fragment does not represent what is really going on in the problematic script. So, please attach the profiled script, Konvertierung/entsch_pass2.py.

The attached profile indicates that somewhere there is a line like: for k,v in idx_dict.iteritems(). This is surprising because shelves did not support iteritems() in Py2.2. That would mean that you've timed and compared two different pieces of code.

Please show the shortest script with data that runs at radically different speeds on Py2.2 vs Py2.3.
msg19152 - Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-28 16:01
I think I found the answer: apart from has_key() I'm using "dict != None". If I leave that out of my test program, both Python variants run at the same speed. The dict != None condition seems to trigger len(dict.keys()), and that seems to be way slower than before.

I definitely didn't time different scripts: the script is part of our CDROM production system, and the only variables I had during my tests were Python itself and the Python path.

Find my test script attached...
msg19153 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-28 21:57
Yes, that was the culprit. I'll look for a way to make __cmp__ a bit smarter. In the meantime, the proper way to check for None is always: if dict is None.
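A minimal sketch of why the two checks behave so differently, written for Python 3 rather than the Python 2.3 of this report. `ExpensiveMapping` is a hypothetical stand-in, not the actual shelve or UserDict.DictMixin code; it only assumes, as the thread suggests, that comparing the mapping walks every stored item. The 34983 entries mirror the shelf size reported above.

```python
class ExpensiveMapping:
    """Hypothetical mapping whose equality test scans every item."""

    def __init__(self, data):
        self._data = dict(data)
        self.items_scanned = 0  # counts how many items comparisons touch

    def items(self):
        for pair in self._data.items():
            self.items_scanned += 1
            yield pair

    def __eq__(self, other):
        # Naive comparison: materialize the whole mapping before comparing,
        # even when the other operand is None
        snapshot = dict(self.items())
        if isinstance(other, ExpensiveMapping):
            other = dict(other.items())
        return snapshot == other


mapping = ExpensiveMapping((f"HI{i}", None) for i in range(34983))

# "!=" dispatches to the comparison method, which scans the full mapping
equality_result = mapping != None  # noqa: E711 (deliberate, to show the cost)
scanned_by_equality = mapping.items_scanned

# "is" compares object identity only and never calls into the mapping
mapping.items_scanned = 0
identity_result = mapping is None
scanned_by_identity = mapping.items_scanned

print(scanned_by_equality, scanned_by_identity)
```

Inside a loop that runs once per input record, the first form repeats the full scan every iteration, while the identity test costs the same regardless of how large the mapping is.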
msg19154 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-12-07 11:55
I fixed up your particular problem for Py2.3.3 and Py2.4. Leaving the report open because there are other calls which have performance issues.
msg55408 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-29 01:57
Raymond - can we close this ticket?
msg110108 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-12 16:30
Raymond - can we close this ticket?
History
Date User Action Args
2022-04-11 14:56:01 admin set github: 39611
2010-07-12 19:41:33 rhettinger set status: open -> closed; resolution: out of date
2010-07-12 16:30:01 BreamoreBoy set nosy: + BreamoreBoy; messages: +
2009-02-16 06:25:10 skip.montanaro set nosy: - skip.montanaro
2009-02-14 12:32:22 ajaksu2 set stage: test needed; versions: + Python 2.7, - Python 2.3
2008-03-16 21:06:20 georg.brandl set type: performance
2007-08-29 01:57:38 skip.montanaro set nosy: + skip.montanaro; messages: +
2003-11-26 14:06:12 ganssauge create