
Suppose that $\Omega$ is a finite probability space, with measure $P$ (we can take $P$ to be uniform). Let $C \in \{\pm 1 \}$ be a random variable on $\Omega$, the classifier. Let

$$H(C) = H(C, \Omega, P) = -\sum_{i \in \pm 1} P(C = i) \log P(C = i)$$

be the entropy of $C$.

For a random variable $Y$ on $\Omega$ and $r \in \mathbb{R}$, define the level set $\{Y = r\} = V_r$, and define

$$H(C | Y) = \sum_{r \in \mathbb{R}} P(Y = r) H(C|_{V_r}, V_r, P_{V_r})$$

where $P_{V_r}(B) = P(B \cap V_r) / P(V_r)$ is the conditional probability measure on $V_r$, and $H(C|_{V_r}, V_r, P_{V_r})$ is the entropy of $C$ restricted to the subset $V_r$ under that measure.
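
For concreteness, here is a minimal Python sketch of these two quantities, assuming $\Omega = \{0, \dots, n-1\}$ with the uniform measure; the function names and toy values of $C$ and $Y$ below are only illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a finite sample under the uniform (empirical) measure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, feature):
    """H(C | Y): entropy of C on each level set {Y = r}, weighted by P(Y = r)."""
    n = len(labels)
    level_sets = {}
    for c, y in zip(labels, feature):
        level_sets.setdefault(y, []).append(c)
    return sum(len(g) / n * entropy(g) for g in level_sets.values())

# Toy Omega with 6 points: C is the +/-1 labeling, Y a real-valued feature.
C = [+1, +1, -1, -1, +1, -1]
Y = [0.5, 0.5, 0.5, 2.0, 2.0, 2.0]
print(entropy(C))                 # H(C) = 1.0
print(conditional_entropy(C, Y))  # H(C|Y) ~ 0.918
```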

As in a decision tree classifier, I want to determine the subset $A \subset \Omega$ that maximizes $H(C) - H(C | 1_A)$, i.e. the information gain. What is the "best" algorithm for doing so? What is the complexity of this problem? Are there good probabilistic algorithms?

Some computation allows one to reformulate this as finding the subset $A$ that maximizes

$$\sum_e \sum_{i = \pm 1} P(C^i \cap A^e) \log P(C^i \cap A^e) \;-\; \sum_e P(A^e) \log P(A^e),$$

where $A^e$ ranges over $A$ and $A^c$, and $C^i = \{C = i\}$.
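
As a sanity check, here is a small Python sketch (the uniform-measure setup and the helper names `plogp`, `info_gain`, `objective` are mine, for illustration only) that brute-forces every subset $A$ of a toy $\Omega$ and verifies that $H(C) - H(C \mid 1_A)$ equals $H(C)$ plus the expression above. The brute force enumerates all $2^{|\Omega|}$ subsets, so it is only feasible for tiny examples.

```python
import math
from itertools import combinations

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def info_gain(labels, A):
    """H(C) - H(C | 1_A) on Omega = {0, ..., n-1} with uniform P."""
    n = len(labels)
    gain = -sum(plogp(labels.count(i) / n) for i in (+1, -1))  # H(C)
    for part in (A, set(range(n)) - A):                        # A^e = A, then A^c
        p_part = len(part) / n
        for i in (+1, -1):
            p_joint = sum(1 for x in part if labels[x] == i) / n
            if p_joint > 0:
                gain += p_joint * math.log2(p_joint / p_part)  # sums to -P(A^e) H(C|_{A^e})
    return gain

def objective(labels, A):
    """sum_{e,i} P(C^i ∩ A^e) log P(C^i ∩ A^e) - sum_e P(A^e) log P(A^e)."""
    n = len(labels)
    val = 0.0
    for part in (A, set(range(n)) - A):
        val -= plogp(len(part) / n)
        for i in (+1, -1):
            val += plogp(sum(1 for x in part if labels[x] == i) / n)
    return val

labels = [+1, -1, +1, +1, -1, -1]
n = len(labels)
H_C = -sum(plogp(labels.count(i) / n) for i in (+1, -1))
# Brute force over all 2^n subsets: gain(A) = H(C) + objective(A) for every A.
for k in range(n + 1):
    for A in map(set, combinations(range(n), k)):
        assert abs(info_gain(labels, A) - (H_C + objective(labels, A))) < 1e-9
```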

  • I'm not sure what exactly you're trying to define. The notion of information gain, in the context of decision trees, deals with choosing an attribute, not a subset of samples. As for your definition, some questions pop up: what exactly is $C$? Is it a function from the set to $\left\{-1,1\right\}$? Is it constant? If not, how is it generated? Also, you shouldn't write $\sum\limits_{r\in\mathbb{R}}\cdot$; some people might cringe. Commented Apr 11, 2017 at 10:47
  • @Ariel This is just a specific simple case of feature selection, where our features are the subsets. Yes, $C$ is a function to $\{-1, 1\}$. I suppose it could be a constant. It is just a function -- a labeling. As for the summation notation, I think it's fine because only finitely many terms are nonzero. Commented Apr 11, 2017 at 16:55
  • If $C$ is fixed, then you run into the same issue you had in your previous question. Remember that $H(X|Y)=H(X)-H(Y)$, where $H(X|Y)$ is the entropy of the random variable $X$ relative to the conditional distribution. The information gain $H(X)-H(X|Y)=H(Y)$ is maximized when $Y$ is uniform, which in that case means that $\Pr\left[c(x)=1|x\in A\right]=\frac{1}{2}$. Commented Apr 11, 2017 at 17:40
  • @Ariel Your formula for $H(X |Y)$ is not correct. The correct formula is $H(X | Y) = H(X,Y) - H(Y)$. For example, if $Y$ and $X$ are independent so that $H(X,Y) = H(X) + H(Y)$, then $H(X | Y) = H(X)$, and if we assume that $H(X) = H(Y) \not = 0$ then you have a contradiction. Commented Apr 11, 2017 at 19:46
  • Yeah, you're obviously correct. I still think that the information gain here won't tell you anything interesting; I'll try to think about it later. Commented Apr 11, 2017 at 20:01

1 Answer


As in the discussion, there are too many "features" for this to be an interesting question. In particular, letting $A = C^{-1}(1)$, the set of positive examples, we get $H(C | 1_A) = H(C | C) = 0$, so the information gain is $H(C)$, which is maximal.

This is obvious in retrospect: to decrease the entropy as much as possible, make $C$ deterministic on $A$ and $A^c$.

For decision trees, the point is that we have a limited collection of features, specifically excluding $C$ itself, on which to split the data set.
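
For a quick numeric check of this claim (same uniform-measure setup as in the question; the helpers below are only illustrative), the split $A = C^{-1}(1)$ indeed attains gain $H(C)$:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy under the uniform (empirical) measure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, A):
    """H(C) - H(C | 1_A) on Omega = {0, ..., n-1} with uniform P."""
    n = len(labels)
    gain = entropy(labels)
    for part in (A, set(range(n)) - A):
        if part:
            gain -= len(part) / n * entropy([labels[x] for x in part])
    return gain

labels = [+1, -1, +1, +1, -1, -1, +1, -1]
A = {x for x, c in enumerate(labels) if c == +1}  # A = C^{-1}(1): the positive examples
print(info_gain(labels, A), entropy(labels))      # both are 1.0: the gain equals H(C)
```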

