C# - Multithreading a web scraper?
I've been thinking about making my web scraper multithreaded, not with normal threads (e.g. Thread scrape = new Thread(Function);) but with something like a thread pool, where there can be a large number of threads.
My scraper works by using a for loop to scrape pages:
    for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)
So how could I multithread the function (the one that contains the loop) with something like a thread pool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.
I've modified my loop into this:
    int min = (int)pagesMin.Value;
    int max = (int)pagesMax.Value;
    ParallelOptions pOptions = new ParallelOptions();
    pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
    Parallel.For(min, max, pOptions, i =>
    {
        // Scraping
    });
Would that work, or have I got something wrong?
The problem with using pool threads is that they spend most of their time waiting for a response from the web site. And the problem with using Parallel.ForEach is that it limits your parallelism.
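One detail worth checking in the Parallel.For call from the question: the second argument of Parallel.For is an exclusive upper bound, while the original for loop uses <= and so includes the last page. A minimal sketch of an inclusive version follows. The names PageRangeDemo, ScrapeRange and scrapePage are hypothetical and only stand in for the real scraping code.

```csharp
using System;
using System.Threading.Tasks;

public static class PageRangeDemo
{
    // Runs scrapePage once for every page number from min to max, inclusive,
    // using at most maxThreads concurrent workers.
    // scrapePage is a hypothetical placeholder for the real scraping routine.
    public static void ScrapeRange(int min, int max, int maxThreads, Action<int> scrapePage)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // Parallel.For stops before its second argument, so pass max + 1
        // to match a loop written as "i <= max".
        Parallel.For(min, max + 1, options, scrapePage);
    }
}
```

This only corrects the page range. Each worker still blocks while it waits for the web site to respond.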
I got the best performance by using asynchronous web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.
The main thread creates the Semaphore, like this:
    Semaphore _requestsSemaphore = new Semaphore(20, 20);
The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution, which takes 50 ms on average. At least, it did in my environment. 20 concurrent requests was the absolute maximum; 15 is more reasonable.
The main thread loops, like this:
    while (true)
    {
        // block until one of the request slots is free
        _requestsSemaphore.WaitOne();
        string urlToCrawl = DequeueUrl(); // supply your own queue logic
        var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
        // set request properties as appropriate
        // and then do an asynchronous request
        request.BeginGetResponse(ResponseCallback, request);
    }
The ResponseCallback method, which is called on a pool thread, does the processing, disposes of the response, and releases the semaphore so that another request can be made.
    void ResponseCallback(IAsyncResult ir)
    {
        try
        {
            var request = (HttpWebRequest)ir.AsyncState;
            // you'll want exception handling here
            using (var response = (HttpWebResponse)request.EndGetResponse(ir))
            {
                // process the response here
            }
        }
        finally
        {
            // release the semaphore so that another request can be made
            _requestsSemaphore.Release();
        }
    }
The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See the linked post on whether this is really asynchronous for more information.
This is simple to implement and works quite well. It's possible to get more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.
You can simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
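For readers who want to try the Task-based route, here is a minimal sketch of how the same throttling idea might look with async/await. It is not the code described above, only one possible translation. It assumes SemaphoreSlim in place of Semaphore and HttpClient in place of HttpWebRequest, and the names AsyncScraper, ScrapeAllAsync and fetch are hypothetical. The fetch delegate is injectable so the throttling can be exercised without a network.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncScraper
{
    // One shared client; assumed here, not taken from the original answer.
    private static readonly HttpClient Client = new HttpClient();

    // Convenience overload that downloads each URL as a string.
    public static Task<IReadOnlyList<string>> ScrapeAllAsync(
        IEnumerable<string> urls, int maxConcurrent)
    {
        return ScrapeAllAsync(urls, maxConcurrent, url => Client.GetStringAsync(url));
    }

    // Calls fetch for every URL with at most maxConcurrent calls in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<IReadOnlyList<string>> ScrapeAllAsync(
        IEnumerable<string> urls, int maxConcurrent, Func<string, Task<string>> fetch)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent, maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                // wait for a free slot without blocking a thread
                await gate.WaitAsync();
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    // free the slot so that another request can start
                    gate.Release();
                }
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}
```

A call such as await AsyncScraper.ScrapeAllAsync(urls, 15) would keep at most 15 requests in flight. Unlike the loop above, this sketch creates one task per URL up front, so it suits a bounded list of pages rather than an endless crawl queue, and it does nothing about the DNS bottleneck.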