c - Maintain a sorted array that a separate, iterative function can keep accessing -
i'm writing code decision tree in c. right gives me correct result (0% training error, low test error), takes long time run.
the problem lies in how run qsort. basic algorithm this:
every feature sort feature column using qsort remove duplicate feature values in column every unique feature value split determine entropy given split save best feature split + split value every training_example if training_example's value best feature < best split value, store in left[] else store in right[] recursively call function, using left[] training examples recursively call function, using right[] training examples
because last 2 lines iterative calls, , because tree can extend dozens , dozens of branches, number of calls qsort huge (especially dataset has > 1000 features).
my idea reduce runtime create 2d array (in separate function) each column sorted feature column. then, long maintain vector of row numbers of training examples in left[] , right[] each recursive call, can call separate function, grab rows want in pre-sorted feature vector, , save cost of having qsort each time.
i'm new c , i'm not sure how code this. in matlab can have global array function can change or access, looking in c.
global arrays in c totally possible. there 2 ways of doing that. in first case dimensions of array fixed application:
#define nrows 100 #define ncols 100 int array[nrows][ncols]; int main(void) { int i, j; (i = 0; < nrows; i++) (j = 0; j < ncols; j++) { array[i][j] = i+j; } return 0; }
in second example dimensions may depend on values input.
#include <stdlib.h> int **array; int main(void) { int nrows = 100; int ncols = 100; int i, j; array = malloc(nrows*sizeof(*array)); (i = 0; < nrows; i++) { array[i] = malloc(ncols*sizeof(*(array[i]))); (j = 0; j < ncols; j++) { array[i][j] = i+j; } } }
although access arrays in both examples looks deceivingly similar, implementation of arrays quite different. in first example array located in 1 piece of memory , strides access rows whole row. in second example each row access pointer row, 1 piece of memory. various rows can located in different areas of memory. in second example rows might have different length. in case need store length of each row somewhere too.
i don't understand trying achieve, because i'm not familiar terminology of decision tree, feature , standard approaches training sets. may want have @ other data structures maintain sorted data:
- http://en.wikipedia.org/wiki/red–black_tree maintains more or less balanced , sorted tree.
- avl tree bit slower more balanced , sorted tree.
- trie sorted tree on lists of elements.
- hash function map complex element integral value can used sort elements. finding exact elements, there no real order in elements itself.
p.s1: coming matlab may want consider different language c move to. c++ has standard libraries support above data structures. java, python come mind or haskell if daring. pointer handling in c can quite tedious , error prone.
p.s2: i'm unable include -
in url on stackoverflow. red-black tree links bit off , can't clicked. if can edit post fix it, appreciate that.
Comments
Post a Comment