C# - Multithreading a web scraper?


I've been thinking about making my web scraper multithreaded, not with normal threads (e.g. Thread scrape = new Thread(Function);) but with something like a ThreadPool, where there can be a large number of threads.

My scraper works by using a for loop to scrape pages.

for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)

So how would I multithread the function (the one that contains the loop) with a ThreadPool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.


I've modified my loop to this:

int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
Parallel.For(min, max, pOptions, i =>
{
    // scraping
});

Would this work, or have I got something wrong?
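One detail worth checking in the Parallel.For call above: its upper bound is exclusive, whereas the original for loop used <= and so included the last page. Here is a minimal sketch of an inclusive version. The class and method names are made up for illustration, and maxThreads stands in for Properties.Settings.Default.Threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class PageRangeSketch
{
    // Visits every page from min to max INCLUSIVE, like the original for loop.
    // maxThreads is an assumed stand-in for the Threads setting in the question.
    public static int[] VisitPages(int min, int max, int maxThreads)
    {
        var visited = new ConcurrentBag<int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // Parallel.For excludes the upper bound, so pass max + 1 to include max.
        Parallel.For(min, max + 1, options, i =>
        {
            visited.Add(i); // the scraping of page i would go here
        });

        // Iterations finish in no particular order, so sort before returning.
        return visited.OrderBy(p => p).ToArray();
    }
}
```

Calling PageRangeSketch.VisitPages(1, 5, 2) visits pages 1 through 5. With Parallel.For(min, max, ...) page 5 would be skipped.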

The problem with using pool threads is that they spend most of their time waiting for a response from the web site. And the problem with using Parallel.ForEach is that it limits your parallelism.

I got my best performance by using asynchronous web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.

The main thread creates the Semaphore, like this:

Semaphore _requestsSemaphore = new Semaphore(20, 20);

The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution and, on average, it takes about 50 ms. At least, it did in my environment. 20 concurrent requests is the absolute maximum. 15 is probably more reasonable.

The main thread essentially loops, like this:

while (true)
{
    // Blocks here while the maximum number of requests is in flight
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl();  // get the next URL to crawl
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}

The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.

void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}
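If you want to see the throttling in isolation, without hitting a real web site, something like the following sketch may help. It is not the scraper itself: the web request is replaced by a short Thread.Sleep on a pool thread, and the names ThrottleSketch and Run are invented for the example. The pattern is the same as above: WaitOne before starting the work, Release in a finally block when it completes.

```csharp
using System;
using System.Threading;

public static class ThrottleSketch
{
    // Runs jobCount simulated requests, never more than maxConcurrent at once.
    // Returns the highest number of jobs that were observed in flight.
    public static int Run(int jobCount, int maxConcurrent)
    {
        object gate = new object();
        int inFlight = 0, peak = 0;

        using (var semaphore = new Semaphore(maxConcurrent, maxConcurrent))
        using (var allDone = new CountdownEvent(jobCount))
        {
            for (int i = 0; i < jobCount; i++)
            {
                // Blocks while maxConcurrent jobs are already running.
                semaphore.WaitOne();

                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        lock (gate)
                        {
                            inFlight++;
                            if (inFlight > peak) peak = inFlight;
                        }
                        Thread.Sleep(20); // stand-in for waiting on the response
                    }
                    finally
                    {
                        lock (gate) { inFlight--; }
                        semaphore.Release(); // lets the main loop start another job
                        allDone.Signal();
                    }
                });
            }

            allDone.Wait(); // wait for the last jobs before disposing anything
        }

        return peak;
    }
}
```

ThrottleSketch.Run(12, 3) should never report more than 3 jobs in flight, however many pool threads are available.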

The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See the linked question on whether this call is really asynchronous for more information.

This is simple to implement and works quite well. It's possible to get more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.

You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
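For what it's worth, here is one way the Task-based version might look. Treat it as a sketch under assumptions, not a tested replacement for the code above. It swaps the Semaphore for SemaphoreSlim (which has an awaitable WaitAsync), and it takes the download step as a delegate named fetch, a hypothetical hook that lets the throttling be tried without a network. AsyncScraperSketch and FetchAllAsync are made-up names too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncScraperSketch
{
    // Fetches every URL with at most maxConcurrent requests in flight.
    // 'fetch' is an assumed hook: pass whatever actually downloads a page.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        int maxConcurrent,
        Func<string, Task<string>> fetch)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrent, maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                // Waits without blocking a thread until a slot is free.
                await throttle.WaitAsync();
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    // Release the slot so that another request can start.
                    throttle.Release();
                }
            }).ToList();

            // Results come back in the same order as the input URLs.
            return await Task.WhenAll(tasks);
        }
    }
}
```

With a real HttpClient instance named client, the call might be FetchAllAsync(urls, 15, url => client.GetStringAsync(url)). The 15 is just the figure suggested above and would need tuning for your own environment.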

