What should we do when we have equal observations with different labels?


Suppose we have a labeled data set with columns $A$, $B$, and $C$ and a binary outcome variable $X$. Suppose we have rows as follows:

 row  A B C X
  1   1 2 3 1
  2   4 2 3 0
  3   6 5 1 1
  4   1 2 3 0

Rows 1 and 4 have identical values of $A$, $B$, and $C$ but different values of the outcome variable $X$. Should we throw one of them away, or keep both?


Posted 2019-08-23T21:19:55.670

Reputation: 51

I think your question is quite naive. If you can share your motivation for the question, then the question would attract more apt answers. – naive – 2019-08-30T12:37:03.860



The problem you are portraying looks like a modified XOR problem. You can't throw away the rows with a label of 1, because then the model won't be able to learn this class.


Posted 2019-08-23T21:19:55.670

Reputation: 141


This is perfectly acceptable in a stochastic environment. Generally your loss is to minimize $-\log p(Y|X)$, or equivalently $-\sum_i \log p(y_i|x_i)$. Up to a constant factor, this is the empirical estimate of $-\mathbb{E}[\log p(y|x)]$. In other words, writing $x_0$ for the feature values shared by rows 1 and 4, in this case you are minimizing:

$$ \begin{align*} L &= -\log p(1|x_0) - \log p(0|x_0) \\ &= -\log [p(1|x_0) \cdot p(0|x_0)] \\ &= -\log [p(1|x_0) \cdot (1 - p(1|x_0))] \end{align*} $$
or, since $\log$ is monotonically increasing, equivalently minimizing
$$ \hat L = -p(1|x_0) \cdot (1 - p(1|x_0)) $$
After some basic calculus (setting the derivative with respect to $p(1|x_0)$ to zero), we see that the optimal result we want the system to learn is
$$ p(1|x_0) = 0.5 $$

Note that if you had more evidence, the result would just be that you want it to learn that it is $1$ with probability $\mathbb{E}[y \mid x]$, i.e. the fraction of rows with those feature values that are labeled $1$.
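The derivation above can be checked numerically. The following is only a sketch using the Python standard library: it grid-searches the probability $p(1|x_0)$ that minimizes the negative log-likelihood for a set of labels sharing the same features. The function names and the `more_evidence` labels are made up for illustration; only the `[1, 0]` case comes from the question.

```python
import math


def neg_log_likelihood(p, labels):
    """Negative log-likelihood of predicting P(y=1) = p for every label."""
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in labels)


def best_probability(labels, steps=999):
    """Grid-search p over 0.001, 0.002, ..., 0.999 for the lowest loss."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda p: neg_log_likelihood(p, labels))


# Rows 1 and 4 from the question: same features, labels 1 and 0.
conflicting = [1, 0]
# Hypothetical extra evidence (not from the question): three 1s and one 0.
more_evidence = [1, 1, 1, 0]

p_two = best_probability(conflicting)
p_four = best_probability(more_evidence)
print(p_two, p_four)
```

Under these assumptions the minimizer for the two conflicting rows lands at 0.5, and for the hypothetical four rows it lands at the fraction of 1 labels, which matches the $\mathbb{E}[y \mid x]$ statement.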


Posted 2019-08-23T21:19:55.670

Reputation: 1 845

So throw one away, or keep both rows? – PrimeNumber – 2019-08-26T03:03:10.813

Keep both. Do not throw away data unless you have good reason to. In this case you want your model to output 0.5 (not 0 or 1). – mshlis – 2019-08-26T03:04:26.713

What happens if the outcome variable for both rows (or $n$ rows) is the same (i.e. we have duplicate rows)? Should we throw one of them out? Or keep them both? Does it really matter? – PrimeNumber – 2019-08-26T03:09:05.523

@Prime In that case it depends. If the duplicate is due to the sampling scheme, then do not delete it, because usually that will mean the data point is twice as important. On the other hand, if someone accidentally copied and pasted a row, then yes, delete it, because it will give additional importance to a point that doesn't deserve it. – mshlis – 2019-08-26T11:08:46.550
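Before deciding what to do, it may help to see which feature combinations repeat and whether their labels agree. The following is a rough sketch in plain Python, using the four rows from the question. The helper names are invented for illustration, and whether a repeat is a sampling effect or a copy-paste accident still has to be judged by hand.

```python
from collections import defaultdict

# Rows from the question as (A, B, C, X); the last entry is the label X.
rows = [
    (1, 2, 3, 1),
    (4, 2, 3, 0),
    (6, 5, 1, 1),
    (1, 2, 3, 0),
]


def group_labels_by_features(rows):
    """Map each feature tuple (A, B, C) to the list of labels seen with it."""
    groups = defaultdict(list)
    for *features, label in rows:
        groups[tuple(features)].append(label)
    return dict(groups)


def summarize_repeats(groups):
    """Describe every feature tuple that occurs more than once."""
    summary = {}
    for features, labels in groups.items():
        if len(labels) < 2:
            continue  # unique observation, nothing to report
        summary[features] = {
            "count": len(labels),
            "conflicting": len(set(labels)) > 1,  # True if labels disagree
            "mean_label": sum(labels) / len(labels),
        }
    return summary


repeated = summarize_repeats(group_labels_by_features(rows))
print(repeated)
```

For the question's data this reports the feature tuple `(1, 2, 3)` as occurring twice with conflicting labels. A repeated tuple with `conflicting` equal to `False` would be the exact-duplicate case discussed in the comment above.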


I might consider two models (one that throws away row 1 and one that throws away row 4), plus one more that keeps both, and see which generalises better to a test set.
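As a sketch of the first step of that comparison, the three candidate training sets could be built like this. The function name `make_variants` and the variant names are hypothetical, and the model fitting and test-set scoring are deliberately left out because the question does not specify a model or a test set.

```python
# Rows from the question as (A, B, C, X); the last entry is the label X.
rows = [
    (1, 2, 3, 1),
    (4, 2, 3, 0),
    (6, 5, 1, 1),
    (1, 2, 3, 0),
]


def make_variants(rows, first, second):
    """Build three training sets; first/second are 0-based row positions."""
    return {
        "drop_first": [r for i, r in enumerate(rows) if i != first],
        "drop_second": [r for i, r in enumerate(rows) if i != second],
        "keep_both": list(rows),
    }


# Rows 1 and 4 in the question are positions 0 and 3 here.
variants = make_variants(rows, first=0, second=3)
for name, train_rows in variants.items():
    print(name, train_rows)
```

Each variant would then be fit with the same model and scored on the same held-out test set, so that any difference in performance comes from how the conflicting rows were handled.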


Posted 2019-08-23T21:19:55.670

Reputation: 11