regex - Python: Compiling regexes in parallel
I have a program where I need to compile several thousand large regexes, all of which will be used many times. The problem is, it takes too long (according to cProfile, 113 secs) to re.compile() them. (BTW, actually searching using all of these regexes takes < 1.3 secs once compiled.)
If I don't precompile, it just postpones the problem to when I actually search, since re.search(expr, text) implicitly compiles expr. Actually, it's worse, because re is going to recompile the entire list of regexes every time I use them.
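For reference, the precompile-once pattern described above might look like the following minimal sketch. It is written for Python 3, and the patterns, texts, and the helper name count_matches are made up for illustration. The re module does keep a small internal cache of recently compiled patterns, but a list of several thousand expressions should not be expected to fit in it, which is why the compiled objects are held explicitly here.

```python
import re

# Hypothetical patterns and texts, for illustration only.
patterns = [r"\bfoo\d+\b", r"\bbar[a-z]*\b"]
texts = ["foo42 and barley", "nothing here", "bar"]

# Pay the compilation cost once, up front, and keep the compiled objects.
compiled = [re.compile(p) for p in patterns]

def count_matches(compiled_patterns, lines):
    """Count (pattern, line) pairs where the pattern matches somewhere in the line."""
    return sum(1 for rx in compiled_patterns for line in lines if rx.search(line))

# Every later search reuses the compiled objects instead of the pattern strings.
total = count_matches(compiled, texts)
print(total)
```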
I tried using multiprocessing, but that actually slows things down. Here's a small test to demonstrate:
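    ## rgxparallel.py ##
    import re
    import multiprocessing as mp

    def serial_compile(strings):
        # Compile every pattern in the current process.
        return [re.compile(s) for s in strings]

    def parallel_compile(strings):
        print("using {} processors.".format(mp.cpu_count()))
        # One worker per CPU; each compiled pattern is pickled back to the parent.
        pool = mp.Pool()
        result = pool.map(re.compile, strings)
        pool.close()
        return result

    # Test data: 100,000 trivial patterns (Python 2: xrange).
    l = map(str, xrange(100000))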
And the test script:
    #!/bin/sh
    python -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
    python -m timeit -n 1 -s "import rgxparallel as r" "r.parallel_compile(r.l)"
    # Output:
    # 1 loops, best of 3: 6.49 sec per loop
    # using 4 processors.
    # using 4 processors.
    # using 4 processors.
    # 1 loops, best of 3: 9.81 sec per loop
I'm guessing that the parallel version is:
- In parallel, compiling and pickling the regexes, ~2 secs
- In serial, un-pickling, and therefore recompiling them all, ~6.5 secs
Together with the overhead for starting and stopping the processes, multiprocessing on 4 processors is more than 25% slower than serial.
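The guess about un-pickling can be checked directly. The sketch below (Python 3, standard library only) looks at the reducer that the re module registers with copyreg for pattern objects. It suggests that a pickled pattern carries only its source string and flags, so loading it in the parent process has to compile it again. The example pattern is arbitrary.

```python
import copyreg
import pickle
import re

compiled = re.compile(r"(\d+)-(\d+)")

# The reducer registered for pattern objects returns (callable, args).
# args holds only the pattern source and the flags, not the compiled program.
reducer = copyreg.dispatch_table[type(compiled)]
rebuild, args = reducer(compiled)
print(args)

# A round trip through pickle therefore has to go through compilation on load.
restored = pickle.loads(pickle.dumps(compiled))
print(restored.pattern == compiled.pattern)
```

If that holds for the Python version in use, any scheme that sends compiled patterns back from worker processes pays the compile cost a second time in the parent.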
I also tried divvying up the list of regexes into 4 sub-lists, and pool.map-ing the sublists, rather than the individual expressions. This gave a small performance boost, but I still couldn't get better than ~25% slower than serial.
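The sub-list variant is not shown in the question. A minimal Python 3 sketch of what it might look like follows. The helper names split_into_chunks, compile_chunk, and parallel_compile_chunked are assumptions for illustration, not the original code.

```python
import multiprocessing as mp
import re

def split_into_chunks(items, n):
    """Split items into n contiguous sub-lists of near-equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def compile_chunk(strings):
    """Worker: compile one whole sub-list in a single task."""
    return [re.compile(s) for s in strings]

def parallel_compile_chunked(strings, processes=4):
    """Send one sub-list per worker instead of one task per expression."""
    chunks = split_into_chunks(strings, processes)
    with mp.Pool(processes) as pool:
        compiled_chunks = pool.map(compile_chunk, chunks)
    # Flatten back to one list, preserving the input order.
    return [rx for chunk in compiled_chunks for rx in chunk]

if __name__ == "__main__":
    regexes = parallel_compile_chunked([str(i) for i in range(1000)])
    print(len(regexes))
```

This cuts the number of tasks handed to the pool, but the compiled patterns are still pickled on their way back to the parent, which may be why the gain reported above was small.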
Is there any way to compile faster than serial?
EDIT: Corrected the running time of the regex compilation.
I also tried using threading, but due to the GIL, only one processor was used. It was slightly better than multiprocessing (130 secs vs. 136 secs), but still slower than serial (113 secs).
EDIT 2: I realized that some regexes were likely to be duplicated, so I added a dict for caching them. This shaved off ~30 sec. I'm still interested in parallelizing, though. The target machine has 8 processors, which would reduce compilation time to ~15 secs.
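The dict cache from EDIT 2 is not shown. A minimal Python 3 sketch of the idea, with the hypothetical helper name compile_with_cache and made-up patterns, could be:

```python
import re

def compile_with_cache(strings):
    """Compile each distinct pattern string once and reuse it for duplicates."""
    cache = {}      # pattern string -> compiled pattern
    compiled = []
    for s in strings:
        rx = cache.get(s)
        if rx is None:
            rx = cache[s] = re.compile(s)
        compiled.append(rx)
    return compiled, cache

# "a+" appears twice but is only compiled once.
compiled, cache = compile_with_cache(["a+", "b+", "a+"])
print(len(compiled), len(cache))
```

The saving depends entirely on how many duplicates the real list contains.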
As much as I love python, I think the best solution is to do it in perl (see this speed comparison, for example), or C, etc.
If you want to keep the main program in python, just use subprocess to call a perl script (just make sure to pass as many values as possible in as few subprocess calls as possible to avoid overhead).
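A rough sketch of the "few calls, many values" idea, in Python 3: everything is sent to the child process on stdin in one subprocess call, and one line of output is read back per result. The helper name run_batch is made up, and the child used here is a Python stand-in so that the sketch runs without Perl installed. It is not the Perl script itself.

```python
import subprocess
import sys

def run_batch(command, lines):
    """Send all lines to one child process on stdin; return its stdout lines."""
    completed = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return completed.stdout.splitlines()

# Stand-in child that upper-cases its input (assumption: the real child
# would be a Perl script that reads its work from stdin).
stand_in = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
results = run_batch(stand_in, ["foo", "bar"])
print(results)
```

With a real script, the call might look like run_batch(["perl", "match_all.pl"], lines), where match_all.pl is a hypothetical file name. How the patterns and texts are laid out on stdin would be up to that script.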