matlab - Remove highly correlated components
I have a problem removing highly correlated components. Can I ask how to do this?
For example, I have 40 instances with 20 features (randomly created). Features 2 and 18 are highly correlated with feature 4, and feature 6 is highly correlated with feature 10. How do I remove the highly correlated (redundant) features, such as 2, 18, and 10? Essentially, I need the indices of the remaining features: 1, 3, 4, 5, 6, ..., 9, 11, ..., 17, 19, 20.
MATLAB code:
    x = randn(40,20);
    x(:,2) = 2.*x(:,4);
    x(:,18) = 3.*x(:,4);
    x(:,6) = 100.*x(:,10);
    x_corr = corr(x);
    size(x_corr)
    figure, imagesc(x_corr), colorbar

The correlation matrix x_corr looks like this:
[figure: imagesc plot of the correlation matrix x_corr]

Edit:
I worked out a way:
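    x_corr = x_corr - diag(diag(x_corr));       % zero the diagonal (self-correlations)
    [x_corrx, x_corry] = find(x_corr>0.8);      % row/column indices of pairs above 0.8
    for i = 1:size(x_corrx,1)
        xx = find(x_corry == x_corrx(i));       % entries whose column is feature x_corrx(i)
        x_corrx(xx,:) = 0;
        x_corry(xx,:) = 0;
    end
    x_corrx = unique(x_corrx);
    x_corrx = x_corrx(2:end);                   % drop the leading 0 placeholder
    im = setxor(x_corrx, (1:20)');              % indices of the features that are kept

Am I right? If you have a better idea, please post it. Thanks.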
Edit 2: Is this method the same as using PCA?
It seems quite clear that this idea of yours, to simply remove highly correlated variables from the analysis, is not the same as PCA. PCA is a good way to do rank reduction of what seems to be a complicated problem into one that turns out to have only a few independent things happening. PCA uses an eigenvalue (or SVD) decomposition to achieve that goal.
Anyway, you might have a problem. For example, suppose that A is highly correlated with B, and B is highly correlated with C. However, it need not be true that A and C are highly correlated. Since correlation can be viewed as a measure of the angle between those vectors in their corresponding high-dimensional vector space, this can be made to happen.
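To make the angle interpretation concrete, here is a minimal sketch. The variable names u, v, uc, and vc are illustrative and not part of the original answer. It checks that the Pearson correlation of two vectors matches the cosine of the angle between their mean-centered versions.

```matlab
% Sketch: correlation as the cosine of the angle between centered vectors.
% u and v are made-up example data, not taken from the answer.
n = 50;
u = randn(n,1);
v = u + randn(n,1);

uc = u - mean(u);                               % center each vector
vc = v - mean(v);
cosAngle = (uc' * vc) / (norm(uc) * norm(vc));  % cosine of the angle between them

r = corr(u, v);                                 % same corr function used in the question
fprintf('cosine = %.6f, corr = %.6f\n', cosAngle, r);
```

Because angles only obey a triangle inequality, two small angles (A to B, B to C) can add up to a large one (A to C), which is why high correlation is not transitive.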
As a trivial example, I'll create two variables, A and B, that are correlated at a "moderate" level.
    n = 50;
    A = rand(n,1);
    B = A + randn(n,1)/2;
    corr([A,B])

    ans =
                1      0.55443
          0.55443            1

So here 0.55 is the correlation. I'll create C to be virtually the average of A and B. It will be highly correlated by your definition.
    C = [A + B]/2 + randn(n,1)/100;
    corr([A,B,C])

    ans =
                1      0.55443      0.80119
          0.55443            1      0.94168
          0.80119      0.94168            1

Clearly C is the bad guy here. But if one were to simply look at the pair [A,C] and remove A from the analysis, then do the same with the pair [B,C] and remove B, we would have made the wrong choices. And this is a trivially constructed example.
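One way to avoid that pairwise trap is sketched below. This is a hypothetical greedy rule, not something the original answer proposes: repeatedly drop the variable that has the most partners above the threshold, until no pair exceeds it. The threshold of 0.8 comes from the question; the names thresh, keep, and counts are assumptions for illustration.

```matlab
% Hypothetical greedy rule (an illustration, not the answer's method):
% drop the variable with the most above-threshold partners, then repeat.
n = 50;
A = rand(n,1);
B = A + randn(n,1)/2;
C = (A + B)/2 + randn(n,1)/100;
X = [A, B, C];

thresh = 0.8;                        % cutoff taken from the question
R = abs(corr(X));                    % |r|, so strong negative correlations count too
R(logical(eye(size(R)))) = 0;        % ignore self-correlation
keep = 1:size(X,2);                  % column indices that survive

while true
    counts = sum(R(keep,keep) > thresh, 2);  % partners above the cutoff, per survivor
    [worst, k] = max(counts);                % ties go to the lowest index
    if worst == 0
        break;                       % no remaining pair exceeds the cutoff
    end
    keep(k) = [];                    % remove the most "connected" variable
end
disp(keep)
```

When C exceeds the threshold with both A and B, as in the matrix above, C has two partners and is removed first, leaving A and B. If only one pair exceeds the threshold, the tie is broken arbitrarily by index, so this remains a heuristic rather than a guarantee of the right choice.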
In fact, it is true that the eigenvalues of the correlation matrix might be of interest.
    [V,D] = eig(corr([A,B,C]))

    V =
         -0.53056     -0.78854       -0.311
         -0.57245      0.60391     -0.55462
         -0.62515      0.11622       0.7718

    D =
           2.5422            0            0
                0      0.45729            0
                0            0   0.00046204

The fact that D has two significant diagonal elements and one tiny one tells us that this is really a two-variable problem. What PCA will not easily tell us, though, is which vector to simply remove. And the problem is less clear with more variables, with many interactions between all of them.
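As a rough sketch of how one might count the "significant" eigenvalues programmatically, the snippet below expresses each eigenvalue as a share of the total. The cutoff tol = 0.01 is an arbitrary assumption, and the names lambda, share, and effectiveRank are illustrative.

```matlab
% Sketch: share of total variance carried by each eigenvalue of the
% correlation matrix. The 1% cutoff is an arbitrary assumption.
n = 50;
A = rand(n,1);
B = A + randn(n,1)/2;
C = (A + B)/2 + randn(n,1)/100;

R = corr([A, B, C]);
lambda = sort(eig(R), 'descend');    % eigenvalues, largest first
share = lambda / sum(lambda);        % the eigenvalues sum to trace(R) = 3

tol = 0.01;                          % assumed cutoff for "significant"
effectiveRank = sum(share > tol);    % number of directions above the cutoff
disp(share')
disp(effectiveRank)
```

With the eigenvalues printed above, the shares would be roughly 0.85, 0.15, and 0.00015, so only two directions clear a 1% cutoff. As the answer notes, this says how many independent directions there are, not which original column to drop.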