c - Maintain a sorted array that a separate, iterative function can keep accessing -


i'm writing code decision tree in c. right gives me correct result (0% training error, low test error), takes long time run.

the problem lies in how run qsort. basic algorithm this:

    every feature         sort feature column using qsort         remove duplicate feature values in column         every unique feature value           split           determine entropy given split     save best feature split + split value     every training_example       if training_example's value best feature < best split value, store in left[]       else store in right[]     recursively call function, using left[] training examples     recursively call function, using right[] training examples 

because last 2 lines iterative calls, , because tree can extend dozens , dozens of branches, number of calls qsort huge (especially dataset has > 1000 features).

my idea reduce runtime create 2d array (in separate function) each column sorted feature column. then, long maintain vector of row numbers of training examples in left[] , right[] each recursive call, can call separate function, grab rows want in pre-sorted feature vector, , save cost of having qsort each time.

i'm new c , i'm not sure how code this. in matlab can have global array function can change or access, looking in c.

global arrays in c totally possible. there 2 ways of doing that. in first case dimensions of array fixed application:

#define nrows   100 #define ncols   100 int array[nrows][ncols];  int main(void) {         int     i, j;          (i = 0; < nrows; i++)         (j = 0; j < ncols; j++)         {                 array[i][j] = i+j;         }         return 0; } 

in second example dimensions may depend on values input.

#include <stdlib.h> int **array;  int main(void) {         int     nrows = 100;         int     ncols = 100;         int     i, j;          array = malloc(nrows*sizeof(*array));         (i = 0; < nrows; i++)         {                 array[i] = malloc(ncols*sizeof(*(array[i])));                 (j = 0; j < ncols; j++)                 {                         array[i][j] = i+j;                 }         } } 

although access arrays in both examples looks deceivingly similar, implementation of arrays quite different. in first example array located in 1 piece of memory , strides access rows whole row. in second example each row access pointer row, 1 piece of memory. various rows can located in different areas of memory. in second example rows might have different length. in case need store length of each row somewhere too.

i don't understand trying achieve, because i'm not familiar terminology of decision tree, feature , standard approaches training sets. may want have @ other data structures maintain sorted data:

  1. http://en.wikipedia.org/wiki/red–black_tree maintains more or less balanced , sorted tree.
  2. avl tree bit slower more balanced , sorted tree.
  3. trie sorted tree on lists of elements.
  4. hash function map complex element integral value can used sort elements. finding exact elements, there no real order in elements itself.

p.s1: coming matlab may want consider different language c move to. c++ has standard libraries support above data structures. java, python come mind or haskell if daring. pointer handling in c can quite tedious , error prone.

p.s2: i'm unable include - in url on stackoverflow. red-black tree links bit off , can't clicked. if can edit post fix it, appreciate that.


Comments

Popular posts from this blog

c# - Send Image in Json : 400 Bad request -

javascript - addthis share facebook and google+ url -

ios - Show keyboard with UITextField in the input accessory view -