multithreading - Python for loop using threading or multiprocessing


All, I am rather new and am looking for assistance. I need to perform a string search on a data set that is about 20 GB of data when compressed. I have an 8-core Ubuntu box with 32 GB of RAM that I can use to crunch through this, but I am not able to implement or determine the best possible code for such a task. Would threading or multiprocessing be best for such a task? Please provide code samples. Thank you. Please see my current code:

#!/usr/bin/python
import sys

logs = []
iplist = []

# Both files are read into memory in full.
logs = open(sys.argv[1], 'r').readlines()
iplist = open(sys.argv[2], 'r').readlines()
print("+loaded {0} entries {1}".format(len(logs), sys.argv[1]))
print("+loaded {0} entries {1}".format(len(iplist), sys.argv[2]))

# Compare every log line against every entry of the IP list.
for a in logs:
    for b in iplist:
        if a.lower().strip() in b.lower().strip():
            print("match! --> {0}".format(a.lower().strip()))

I'm not sure whether multithreading can help you, but your code has a problem that hurts performance: reading the logs in one go consumes incredible amounts of RAM and thrashes your cache. Instead, open the file and read it sequentially. After all, you are making a sequential scan, aren't you? Then, don't repeat any operations on the same data. In particular, iplist doesn't change, yet for every log entry you are repeatedly calling b.lower().strip(). Do that once, right after reading the file with the IP addresses.

In short, it looks like this:

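# Read the IP list once and normalize each entry a single time.
with open(..) as f:
    iplist = [l.lower().strip() for l in f]

# Stream the log line by line instead of loading it all into memory.
with open(..) as f:
    for l in f:
        l = l.lower().strip()
        if l in iplist:
            print('match!')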

You can improve performance even more by using a set for iplist, because looking things up there is faster when there are many elements. That said, I'm assuming that the second file is huge, while iplist will remain relatively small.
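As a rough illustration of the set idea, here is a minimal sketch, assuming Python 3. The helper names `load_ips` and `find_matches` and the sample data are made up for this example, and `io.StringIO` objects stand in for the two real files. Note that a set lookup tests whole-entry equality, like the list lookup above, and not the substring test from the question's code.

```python
import io

def load_ips(lines):
    """Normalize each entry once and keep the results in a set."""
    return {line.lower().strip() for line in lines if line.strip()}

def find_matches(log_lines, ips):
    """Yield every normalized log line that equals an entry in the set."""
    for line in log_lines:
        entry = line.lower().strip()
        if entry in ips:  # hash lookup instead of scanning a list
            yield entry

# In-memory stand-ins for the two files (hypothetical sample data).
ip_file = io.StringIO("10.0.0.1\n10.0.0.2\n")
log_file = io.StringIO("10.0.0.2\n192.168.1.5\n10.0.0.1\n")

ips = load_ips(ip_file)
matches = list(find_matches(log_file, ips))
for match in matches:
    print("match! --> {0}".format(match))
```

With real data, the two `io.StringIO` objects would be replaced by file objects opened with `with open(...) as f:`, so the log is still streamed one line at a time.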

BTW: you could improve performance with multiple CPUs by using one to read the file and the others to scan for matches, but I guess the above will already give you a sufficient performance boost.
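If the single-process version turns out to be too slow, the "one reads, others scan" split could be sketched with `multiprocessing.Pool` roughly as follows. This is only a sketch under assumptions, not a tuned solution: it assumes Python 3, and the names `init_worker`, `scan_chunk`, `chunked`, and `parallel_matches`, the chunk size, and the sample data are all made up for illustration. Sending chunks of lines to worker processes has its own cost, so it is worth measuring before assuming this is faster.

```python
import itertools
from multiprocessing import Pool

_IPS = frozenset()  # per-process lookup set, filled in by init_worker

def init_worker(ips):
    """Run once in each worker process: store the lookup set in a global."""
    global _IPS
    _IPS = frozenset(ips)

def scan_chunk(lines):
    """Return the normalized lines of one chunk that are in the lookup set."""
    hits = []
    for line in lines:
        entry = line.lower().strip()
        if entry in _IPS:
            hits.append(entry)
    return hits

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def parallel_matches(log_lines, ips, workers=2, chunk_size=10000):
    """Read lines in the parent process and scan the chunks in workers."""
    with Pool(workers, initializer=init_worker, initargs=(ips,)) as pool:
        # imap returns the per-chunk results in the order the chunks were sent.
        for hits in pool.imap(scan_chunk, chunked(log_lines, chunk_size)):
            for hit in hits:
                yield hit

if __name__ == "__main__":
    sample_ips = {"10.0.0.1", "10.0.0.2"}  # hypothetical sample data
    sample_log = ["10.0.0.2\n", "192.168.1.5\n", "10.0.0.1\n"]
    for match in parallel_matches(sample_log, sample_ips, chunk_size=2):
        print("match! --> {0}".format(match))
```

For a real run, `log_lines` would be the open log file object, so the parent process reads it sequentially while the workers do the lowercasing, stripping, and lookups.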

