[Numpy-discussion] Distance Matrix speed
Sebastian Beca sebastian.beca at gmail.com
Mon Jun 19 16:04:31 EDT 2006
I just ran Alan's script and I don't get consistent results for 100 repetitions. I boosted it to 1000 and ran it several times. Which version came out faster varied a lot from run to run, but the two always ended up within about +-1.5% of each other.
When it comes to scaling: for my problem (fuzzy clustering), N is the size of the dataset, which should span from thousands to millions; C is the number of clusters, usually less than 10; and K, the number of features (the dimension I sum over), is usually less than 100. So mainly I'm concerned with scaling across N. I tried C=3, K=4, and N = 1000, 2500, 5000, 7500, 10000, again with 1000 runs. The results were:

                N=1000   2500   5000   7500  10000
dist_beca          1.1    4.5     16     28     37
dist_loehner1      1.7    6.5     22     35     47
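In case anyone wants to reproduce this, the run looked roughly like the sketch below (here, unlike in Alan's script at the bottom of this message, the two functions take A and B as arguments instead of reading module globals):

import timeit
import numpy as np

def dist_beca(A, B):
    # loop over the shorter of the two leading dimensions
    N, C = A.shape[0], B.shape[0]
    d = np.zeros((N, C), dtype=float)
    if N < C:
        for i in range(N):
            d[i, :] = np.sqrt(((A[i] - B) ** 2).sum(axis=1))
    else:
        for j in range(C):
            d[:, j] = np.sqrt(((A - B[j]) ** 2).sum(axis=1))
    return d

def dist_loehner1(A, B):
    # broadcasting builds the full (N, C, K) difference array at once
    d = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    return np.sqrt((d ** 2).sum(axis=2))

C, K, runs = 3, 4, 1000
for N in (1000, 2500, 5000, 7500, 10000):
    A = np.random.random((N, K))
    B = np.random.random((C, K))
    for fn in (dist_beca, dist_loehner1):
        t = timeit.Timer(lambda f=fn: f(A, B)).timeit(runs)
        print("%-14s N=%-6d %6.1f" % (fn.__name__, N, t))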
I also tried scaling across K, with C=3, N=2500, and K from 5 to 50. I couldn't get consistent results for small K, but for large K (K > 15) both perform about equally well (within +-2%).
I'm not sure how these work in the backend, so I can't argue with any authority about why one should scale better than the other.
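My rough guess, though (just a back-of-the-envelope check, not something I've profiled): dist_loehner1 materializes the whole (N, C, K) difference array at once, while dist_beca's loop only ever holds one (N, K) slice at a time, so the broadcast version pushes roughly C times as much data through memory:

import numpy as np

N, C, K = 10000, 3, 4
# temporary built by dist_loehner1: the full (N, C, K) difference array
print(np.zeros((N, C, K)).nbytes)  # 960000 bytes
# largest temporary built by dist_beca: one (N, K) slice per loop pass
print(np.zeros((N, K)).nbytes)     # 320000 bytes

Regards,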
Sebastian.
On 6/19/06, Alan G Isaac <aisaac at american.edu> wrote:
On Sun, 18 Jun 2006, Tim Hochberg apparently wrote:
> Alan G Isaac wrote:
>> On Sun, 18 Jun 2006, Sebastian Beca apparently wrote:
>>> def dist():
>>>     d = zeros([N, C], dtype=float)
>>>     if N < C:
>>>         for i in range(N):
>>>             xy = A[i] - B
>>>             d[i,:] = sqrt(sum(xy**2, axis=1))
>>>         return d
>>>     else:
>>>         for j in range(C):
>>>             xy = A - B[j]
>>>             d[:,j] = sqrt(sum(xy**2, axis=1))
>>>         return d
>> But that is 50% slower than Johannes's version:
>> def dist_loehner1():
>>     d = A[:, newaxis, :] - B[newaxis, :, :]
>>     d = sqrt((d**2).sum(axis=2))
>>     return d
> Are you sure about that? I just ran it through timeit, using Sebastian's
> array sizes, and I get Sebastian's version being 150% faster. This could
> well be cache size dependent, so may vary from box to box, but I'd expect
> Sebastian's current version to scale better in general.

No, I'm not sure. Script attached at bottom. Most recent output follows:
for reasons I have not determined, it doesn't match my previous runs ...

Alan

>>> execfile(r'c:\temp\temp.py')
dist_beca    :    3.042277
dist_loehner1:    3.170026
#################################
# THE SCRIPT
import sys
sys.path.append(r"c:\temp")
import numpy
from numpy import *
import timeit

K = 10
C = 2500
N = 3
# One could switch around C and N now.
A = numpy.random.random([N, K])
B = numpy.random.random([C, K])

# beca
def dist_beca():
    d = zeros([N, C], dtype=float)
    if N < C:
        for i in range(N):
            xy = A[i] - B
            d[i,:] = sqrt(sum(xy**2, axis=1))
        return d
    else:
        for j in range(C):
            xy = A - B[j]
            d[:,j] = sqrt(sum(xy**2, axis=1))
        return d

# loehnert
def dist_loehner1():
    # drawback: memory usage temporarily doubled
    # solution see below
    d = A[:, newaxis, :] - B[newaxis, :, :]
    # written as 3 expressions for more clarity
    d = sqrt((d**2).sum(axis=2))
    return d

if __name__ == "__main__":
    t1 = timeit.Timer('dist_beca()', 'from temp import dist_beca').timeit(100)
    t8 = timeit.Timer('dist_loehner1()', 'from temp import dist_loehner1').timeit(100)
    fmt = "%-10s:\t" + "%10.6f"
    print fmt % ('dist_beca', t1)
    print fmt % ('dist_loehner1', t8)
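As an aside (this isn't from the timings above): for larger K there is a third formulation worth knowing, based on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which replaces the large temporary with a single matrix product. A sketch only; rounding can push the squared distances slightly negative, hence the clip before the sqrt:

import numpy as np

def dist_dot(A, B):
    # A is (N, K), B is (C, K); returns the (N, C) distance matrix
    aa = (A ** 2).sum(axis=1)[:, np.newaxis]   # (N, 1) squared row norms
    bb = (B ** 2).sum(axis=1)[np.newaxis, :]   # (1, C) squared row norms
    d2 = aa + bb - 2.0 * np.dot(A, B.T)        # (N, C) squared distances
    return np.sqrt(np.maximum(d2, 0.0))        # clip rounding noise below 0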