I am searching for pointers to feature selection algorithms.

EDIT: all the answers helped me a lot, I cannot decide which one I should accept. THX guys!

# What I did:

For discrete variables (i.e. $D_i, E$ are finite sets) $X_i : \Omega \to D_i$ and a given data table $$ \begin{pmatrix} X_1 & \dots & X_n & X_{n+1} \\ x_1^{(1)} & \dots & x_n^{(1)} & x_{n+1}^{(1)} \\ \vdots & & \vdots & \vdots \\ x_1^{(m)} & \dots & x_n^{(m)} & x_{n+1}^{(m)} \end{pmatrix} $$ (the last variable will be the 'outcome', that's why I stress it with a special index) and $X, Y$ being two of the $X_1, ..., X_{n+1}$ (so if $X=X_a, Y=X_b$ then $D=D_a, E=D_b$), compute

$$H(X) = - \sum_{d \in D} P[X=d] \cdot \log(P[X=d])$$

$$H(Y|X) = - \sum_{d \in D} P[X=d] \cdot \sum_{e \in E} P[Y=e|X=d] \cdot \log(P[Y=e|X=d])$$

where we estimate $$P[X_a=d] = \frac{1}{m} \left|\{j \in \{1, ..., m\} : x_a^{(j)} = d\}\right|$$ and analogously $$P[X_a=d \cap X_b=e] = \frac{1}{m} \left|\{j \in \{1, ..., m\} : x_a^{(j)} = d ~\text{and}~ x_b^{(j)}=e\}\right|$$ and then $$I(Y;X) = \frac{H(Y) - H(Y|X)}{\log(\min(|D|, |E|))}$$ which is to be interpreted as the influence of $Y$ on $X$ (or vice versa, it's symmetric).
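A minimal sketch of these estimates in Python (the function names `entropy`, `conditional_entropy` and `normalized_mi` are mine; each argument is one column of the data table, and probabilities are estimated by relative frequencies as above):

```python
import numpy as np
from collections import Counter

def entropy(column):
    """Empirical entropy H(X) of one column of discrete values."""
    m = len(column)
    probs = np.array([count / m for count in Counter(column).values()])  # P[X = d] as relative frequency
    return float(-np.sum(probs * np.log(probs)))

def conditional_entropy(y, x):
    """Empirical conditional entropy H(Y | X) for two parallel columns."""
    m = len(x)
    h = 0.0
    for d, count in Counter(x).items():
        y_given_d = [y[j] for j in range(m) if x[j] == d]
        h += (count / m) * entropy(y_given_d)   # P[X = d] * H(Y | X = d)
    return h

def normalized_mi(y, x):
    """I(Y; X) = (H(Y) - H(Y | X)) / log(min(|D|, |E|)) as defined above."""
    denom = np.log(min(len(set(x)), len(set(y))))
    if denom == 0.0:   # one of the variables only takes a single value
        return 0.0
    return (entropy(y) - conditional_entropy(y, x)) / denom
```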

EDIT: A little late now but still:

This is wrong:
~~Exercise for you: show that if $X=Y$ then $I(X,Y)=1$.~~

This is correct: Exercise for you: show that if $X=Y$ then $I(X,X)=H(X)/\log(|D|)$ and if $X$ is additionally uniformly distributed then $I(X,X)=1$.
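A quick sketch of the argument: knowing $X$ leaves no uncertainty about $X$, so $H(X|X) = 0$ and hence $$I(X;X) = \frac{H(X) - H(X|X)}{\log(\min(|D|, |D|))} = \frac{H(X)}{\log(|D|)},$$ and if $X$ is uniformly distributed on $D$ then $H(X) = -\sum_{d \in D} \frac{1}{|D|} \log\frac{1}{|D|} = \log(|D|)$, so indeed $I(X;X) = 1$.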

For selecting features, start with the available set $\{X_1, ..., X_n\}$ and an ordered list 'already selected' $= ()$. We select features step by step, always taking the one that maximizes $$\text{goodness}(X) = I(X, X_{n+1}) - \beta \sum_{X_i ~\text{already selected}} I(X, X_i)$$ for a value $\beta$ to be determined (the authors suggest $\beta = 0.5$), i.e. goodness = influence on the outcome minus the redundancy introduced by selecting this variable. After running this procedure, keep only the first 'few' features and throw away the lower-ranked ones (how many to keep is something I still have to play with a little). This is what is described in this paper.
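A minimal sketch of that greedy loop, reusing `normalized_mi` from the sketch above (the names `select_features` and `k` are mine; `k` stands in for the 'few' features to keep):

```python
def select_features(data, outcome, beta=0.5, k=10):
    """Greedy forward selection using the goodness score above.

    data    : dict mapping feature name -> column of discrete values
    outcome : column of outcome values (same length), i.e. X_{n+1}
    beta    : redundancy weight (the paper suggests 0.5)
    k       : how many of the top-ranked features to keep (the 'few')
    """
    available = set(data)
    selected = []                                                   # the ordered 'already selected' list
    relevance = {f: normalized_mi(outcome, data[f]) for f in data}  # I(X, X_{n+1}), computed once

    while available and len(selected) < k:
        def goodness(f):
            redundancy = sum(normalized_mi(data[f], data[s]) for s in selected)
            return relevance[f] - beta * redundancy
        best = max(available, key=goodness)
        selected.append(best)
        available.remove(best)
    return selected
```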

For computing $I$ for continuous variables one needs to bin them in some way. More concretely, the inventors of MIC suggest taking the maximal value over binnings of $X$ into $n_X$ bins and $Y$ into $n_Y$ bins with $n_X \cdot n_Y \leq m^{0.6}$, i.e. compute $$ \text{MIC}(X;Y) = \max_{n_X \cdot n_Y \leq m^{0.6}} \left( \frac{I_{n_X, n_Y}(X;Y)}{\log(\min(n_X, n_Y))} \right)$$

where $I_{n_X, n_Y}(X;Y)$ means: compute the mutual information $H(Y) - H(Y|X)$ (the numerator of $I$) precisely as you did for discrete variables, treating $X$ as a discrete random variable after binning it into $n_X$ bins, and analogously for $Y$.
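A rough Python sketch under the simplifying assumption of equal-width bins (the real MIC maximizes over all grid placements, not just equal-width grids, so treat this as an approximation); it reuses `entropy` and `conditional_entropy` from the sketch above, and the name `mic_equal_width` is mine:

```python
import numpy as np

def mic_equal_width(x, y, alpha=0.6):
    """Rough MIC-style score: maximize the normalized I over equal-width binnings
    with n_X * n_Y <= m**alpha."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = len(x)
    limit = m ** alpha
    best = 0.0
    for n_x in range(2, int(limit) + 1):
        # bin X into n_x equal-width bins
        x_binned = np.digitize(x, np.linspace(x.min(), x.max(), n_x + 1)[1:-1]).tolist()
        for n_y in range(2, int(limit // n_x) + 1):
            # bin Y into n_y equal-width bins
            y_binned = np.digitize(y, np.linspace(y.min(), y.max(), n_y + 1)[1:-1]).tolist()
            mi = entropy(y_binned) - conditional_entropy(y_binned, x_binned)  # I_{n_X, n_Y}(X; Y)
            best = max(best, mi / np.log(min(n_x, n_y)))
    return best
```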

---

## ORIGINAL QUESTION

More precisely: I have a classification problem for one boolean variable; let's call this variable `outcome`.

I have lots of data and lots of features (around 150). These features are not totally 'meaningless' as in image prediction (where every x and y coordinate is a feature); rather, they are of the form gender, age, etc.

What I have done until now: from these 150 features, I guessed the ones that 'seem' to have some importance for the outcome. Still, I am unsure which features to select and how to measure their importance before starting the actual learning algorithm (which involves yet more selection like PCA and stuff).

For example, for a feature `f` taking only finitely many values `x_1, ..., x_n`, my very naive approach would be to compute some relation between `P(outcome==TRUE | f==x_1)`, ..., `P(outcome==TRUE | f==x_n)` and `P(outcome==TRUE)` (i.e. the feature is important when I can deduce more information about the outcome from it than without any knowledge about the feature).
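A tiny Python sketch of that naive computation, assuming `outcome` is a list of booleans (the name `conditional_positive_rates` is mine):

```python
def conditional_positive_rates(f_values, outcome):
    """Baseline P(outcome==TRUE) and P(outcome==TRUE | f==x_i) for each observed value x_i.

    f_values and outcome are parallel lists; outcome contains booleans.
    """
    baseline = sum(outcome) / len(outcome)
    rates = {}
    for value in set(f_values):
        matching = [o for fv, o in zip(f_values, outcome) if fv == value]
        rates[value] = sum(matching) / len(matching)
    return baseline, rates
```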

*Concrete question(s): Is that a good idea? Which relation to take? What to do with continuous variables?*

I'm sure that I'm not the first one ever wondering about this. I've read about (parts of) algorithms that do this selection in a sort-of automated way. Can somebody point me into the right direction (references, names of algorithms to look for, ...)?