python - How to efficiently compute the inner product of two dictionaries -


suppose represent feature vector using dictionary (why? because know features sparse, but, more on later).

how should implement inner product of 2 such dictionaries (denoted, a, b)

i tried naive approach:

for k in a:   if k in b:     sum += a[k] * b[k] 

but turns out slow.

some more details:

  • i'm using dictionary represent features because

    1. the feature keys strings
    2. there ~20k possible keys
    3. each vector sparse (say, 1000 non-zero elements).
  • i'm interested in computing pairwise inner product of n=2000 different dictionaries (that is, linear kernel).

hmm, seems approach best dense vectors:

>>> # eric's answer >>> timeit.timeit('sum([a[k]*b[k] k in set(a.keys()) & set(b.keys())])', setup='a=dict((i,i) in xrange(100));b=dict((i,i) in xrange(100))', number=10000) 0.4360210521285808  >>> # comment >>> timeit.timeit('for k,v in a.iteritems(): sum += v*b.get(k,0)', setup='a=dict((i,i) in xrange(100));b=dict((i,i) in xrange(100));sum=0', number=10000) 0.4082838999682963  # comment, more compact >>> timeit.timeit('sum(v*b.get(k,0) k,v in a.iteritems())', setup='a=dict((i,i) in xrange(100));b=dict((i,i) in xrange(100))', number=10000) 0.38053266868496394  >>> #your approach >>> timeit.timeit('for k in a: sum += a[k]*b[k] if k in b else 0.', setup='a=dict((i,i) in xrange(100));b=dict((i,i) in xrange(100));sum=0', number=10000) 0.35574231962510794  >>> # approach, more compact >>> timeit.timeit('sum(a[k]*b[k] k in if k in b)', setup='a=dict((i,i) in xrange(100));b=dict((i,i) in xrange(100))', number=10000) 0.3400850549682559 

for sparser ones, eric's answer performs better yours still fastest:

# mine >>> timeit.timeit('sum(v*b.get(k,0) k,v in a.iteritems())', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=10000) 0.1390782696843189  # eric's >>> timeit.timeit('sum([a[k]*b[k] k in set(a.keys()) & set(b.keys())])', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=10000) 0.11702822992151596  # yours >>> timeit.timeit('sum(a[k]*b[k] k in if k in b)', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=10000) 0.07878250570843193 

edit

after messing around bit, seems sum([x x ...]) faster sum(x x in ...). rebenchmarking , janne's remark keys in eric's answer, yours still on top (with joowani's giving slight improvement):

>>> timeit.timeit('sum([v*b.get(k,0) k,v in a.items()])', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=100000) 1.1604375791416714 >>> timeit.timeit('sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()])', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=100000) 0.9234189571552633 >>> timeit.timeit('sum([a[k]*b[k] k in if k in b])', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=100000) 0.5411289579401455 >>> timeit.timeit('sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a])', setup='import random;a=dict((i,i) in xrange(100) if random.random() < 0.3);b=dict((i,i) in xrange(100) if random.random() < 0.2)', number=100000) 0.5198972138696263 

scaling large sizes, see same pattern:

>>> #mine >>> timeit.timeit('sum([v*b.get(k,0) k,v in a.iteritems()])', setup='import random;a=dict((i,i) in xrange(10000) if random.random() < 0.1);b=dict((i,i) in xrange(10000) if random.random() < 0.2)', number=100000) 45.328807250833506  >>> #eric's >>> timeit.timeit('sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()])', setup='import random;a=dict((i,i) in xrange(10000) if random.random() < 0.1);b=dict((i,i) in xrange(10000) if random.random() < 0.2)', number=100000) 28.042937058640973  >>> #yours >>> timeit.timeit('sum([a[k]*b[k] k in if k in b])', setup='import random;a=dict((i,i) in xrange(10000) if random.random() < 0.1);b=dict((i,i) in xrange(10000) if random.random() < 0.2)', number=100000) 16.55080344861699  >>> #joowani's >>> timeit.timeit('sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a])', setup='import random;a=dict((i,i) in xrange(10000) if random.random() < 0.1);b=dict((i,i) in xrange(10000) if random.random() < 0.2)', number=100000) 15.485236119691308 

i think joowani's trick not improve here because vectors same size, depending on problem (if vectors ridiculously smaller others) might more significant...

edit again

oops, seems should have taken coffee before posting... eric pointed out (although missed it...), defining array in setup keeps same trials, not best way benchmark. proper random vectors being tested, results not different, sake of completeness:

>>> timeit.timeit('mine(dict((i,i) in xrange(100) if random.random() < 0.3),dict((i,i) in xrange(100) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=100000) 6.294158102577967 >>> timeit.timeit('erics(dict((i,i) in xrange(100) if random.random() < 0.3),dict((i,i) in xrange(100) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=100000) 6.068052507449011 >>> timeit.timeit('yours(dict((i,i) in xrange(100) if random.random() < 0.3),dict((i,i) in xrange(100) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=100000) 5.745110704570834 >>> timeit.timeit('joowanis(dict((i,i) in xrange(100) if random.random() < 0.3),dict((i,i) in xrange(100) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=100000) 5.737499445367575 

to scale:

>>> timeit.timeit('mine(dict((i,i) in xrange(10000) if random.random() < 0.1),dict((i,i) in xrange(10000) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=1000) 5.0510995368395015 >>> timeit.timeit('erics(dict((i,i) in xrange(10000) if random.random() < 0.1),dict((i,i) in xrange(10000) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=1000) 4.350612399185138 >>> timeit.timeit('yours(dict((i,i) in xrange(10000) if random.random() < 0.1),dict((i,i) in xrange(10000) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=1000) 4.15619379016789 >>> timeit.timeit('joowanis(dict((i,i) in xrange(10000) if random.random() < 0.1),dict((i,i) in xrange(10000) if random.random() < 0.2))', setup='import random;joowanis=lambda a,b:sum([a[k]*b[k] k in if k in b]) if len(a)<len(b) else sum([a[k]*b[k] k in b if k in a]);mine=lambda a,b:sum([v*b.get(k,0) k,v in a.iteritems()]);erics=lambda a,b:sum([a[k]*b[k] k in a.viewkeys() & b.viewkeys()]);yours=lambda a,b:sum([a[k]*b[k] k in if k in b])', number=1000) 4.185129374341159 

i think bottom line cannot expect significant speedup cleverly reordering expressions kind of thing... maybe try doing numerical part in c/cython or using scipy's sparse package ?


Comments

Popular posts from this blog

c# - Send Image in Json : 400 Bad request -

jquery - Fancybox - apply a function to several elements -

An easy way to program an Android keyboard layout app -