Find duplicate files using PHP with high performance -


I've got around 25,000 files scattered across many folders, varying between 5 MB and 200 MB, on two external hard drives. I need to find out which of these are duplicates, leaving only unique files on the drives.

Currently I'm running md5_file() on each source file and comparing the hashes to see whether the same file has been found before. The issue is that md5_file() can easily take more than 10 seconds to execute, and I've seen it take up to a minute for some files. If I let the script run in its current form, the whole process would take more than a week to finish.
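The sequential approach described above looks roughly like this (a minimal sketch; the function and variable names are illustrative, not taken from the original script):

```php
<?php
/* Sequential duplicate scan: hash each file and compare the
   hash against everything seen so far. */
function findDuplicates(array $files) {
    $seen = array();       /* md5 sum => first file that produced it */
    $duplicates = array(); /* duplicate file => original file */

    foreach ($files as $file) {
        /* md5_file() reads the whole file, which is why it is
           slow for files in the 5 MB - 200 MB range */
        $sum = md5_file($file);

        if (isset($seen[$sum])) {
            $duplicates[$file] = $seen[$sum];
        } else {
            $seen[$sum] = $file;
        }
    }

    return $duplicates;
}
```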

Note that I'm saving each hash after it has been computed, so I don't have to re-hash every file on each run. The thing is, most of these files haven't been hashed yet.
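Persisting the hashes between runs could be done, for example, with a path => hash map saved as JSON (a sketch under the assumption that file paths stay stable between runs; the cache-file handling is made up for illustration):

```php
<?php
/* Load a previously saved path => md5 cache, or start empty. */
function loadHashCache($cacheFile) {
    if (is_file($cacheFile)) {
        $cache = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cache)) {
            return $cache;
        }
    }
    return array();
}

/* Hash a file, consulting and updating the cache so that
   reruns skip files that were already hashed. */
function cachedMd5(array &$cache, $file) {
    if (!isset($cache[$file])) {
        $cache[$file] = md5_file($file);
    }
    return $cache[$file];
}

/* Write the cache back to disk for the next run. */
function saveHashCache($cacheFile, array $cache) {
    file_put_contents($cacheFile, json_encode($cache));
}
```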

I'm wondering how to speed this up. I need it to finish in less than 5 days, so a script that takes more than a week is no option. I was thinking multithreading (using pthreads) might be the solution, but since the drives are slow and the CPU is not the issue, I don't think that would help. What else is there to do?

As you guessed, it's hard to tell whether you can see any gains by using threading ...

However, I decided to write a nice pthreads example based on your idea; I think it illustrates the things you should do while threading ...

Your mileage will vary, but here's the example all the same:

<?php
/* create a mutex for readable logging output */
define("LOG", Mutex::create());

/* log a message to stdout, using a thread-safe printf */
function out($message, $format = null) {
    $format = func_get_args();

    if ($format) {
        $message = array_shift(
            $format);

        Mutex::lock(LOG);
        echo vsprintf(
            $message, $format
        );
        Mutex::unlock(LOG);
    }
}

/* a collection of sum => file, shared among workers */
class Sums extends Stackable {
    public function run() {}
}

/* the worker that will execute Check tasks */
class CheckWorker extends Worker {
    public function run() {}
}

/* the simplest version of a job that calculates the checksum of a file */
class Check extends Stackable {

    /* keep properties public */
    public $file;
    public $sum;

    /* accept a file and the shared Sums collection */
    public function __construct($file, Sums &$sums) {
        $this->file = $file;
        $this->sums = $sums;
    }

    public function run() {
        out(
            "checking: %s\n", $this->file);

        /* calculate checksum */
        $sum = md5_file($this->file);

        /* check for the sum in the shared list */
        if (isset($this->sums[$sum])) {

            /* deal with the duplicate */
            out(
                "duplicate file found: %s, duplicate of %s\n",
                $this->file, $this->sums[$sum]);
        } else {
            /* set the sum in the shared list */
            $this->sums[$sum] = $this->file;

            /* output some info ... */
            out(
                "unique file found: %s, sum (%s)\n",
                $this->file, $sum);
        }
    }
}

/* start a timer */
$start = microtime(true);

/* the checksum collection, shared across threads */
$sums = new Sums();

/* create a suitable number of worker threads */
$workers = array();
$checks = array();
$worker = 0;

/* how many worker threads you should have depends on your hardware */
while (count($workers) < 16) {
    $workers[$worker] = new CheckWorker();
    $workers[$worker]->start();
    $worker++;
}

/* scan the path given on the command line for files */
foreach (scandir($argv[1]) as $id => $path) {

    /* @todo(u) write code to recursively scan the path */
    $path = sprintf(
        "%s/%s",
        $argv[1], $path
    );

    /* create a job to calculate the checksum of this file */
    if (!is_dir($path)) {
        $checks[$id] = new Check(
            $path, $sums);

        /* @todo(u) write code to stack to the most appropriate worker */
        $workers[array_rand($workers)]->stack($checks[$id]);
    }
}

/* join the threads */
foreach ($workers as $worker) {
    $worker->shutdown();
}

/* output some info */
out("complete in %.3f seconds\n", microtime(true) - $start);

/* destroy the logging mutex */
Mutex::destroy(LOG);
?>

Play around with it, see how different numbers of workers affect the runtime, and implement your own logic to delete files and recursively scan directories (this is basic stuff you should know already, left out to make for a simple example) ...
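The recursive directory scan that the @todo in the example leaves open could, for instance, be filled in with SPL iterators (a sketch, not part of the original answer):

```php
<?php
/* Recursively collect all regular files under a path using SPL iterators. */
function scanFilesRecursively($root) {
    $files = array();

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $info) {
        if ($info->isFile()) {
            $files[] = $info->getPathname();
        }
    }

    return $files;
}
```

Each Check job could then be created from this list instead of the flat scandir() loop.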

