A response variable (label) $B$ can either be $0$ or $1$.
In the training set, $B_i = 1$ is extremely rare, occurring in only $0.26\%$ of rows, which makes predicting this label on a test data set a difficult problem.
SMOTE was used to sample from the training set of some $1.55 \times 10^5$ rows to obtain a completely balanced set of $620$ rows:
library(DMwR)  # provides SMOTE()
balanced.df <- SMOTE(B ~ ., df, perc.over = 100, perc.under = 200)
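For reference, DMwR's SMOTE sizes its output from these two percentages: perc.over = 100 creates one synthetic minority case per original minority case, and perc.under = 200 then keeps two majority cases per synthetic case. A minimal sketch of that arithmetic, using a hypothetical minority count n1 (not taken from the question's data):

```r
# Hypothetical minority count, chosen only for illustration.
n1 <- 155

# perc.over = 100: one synthetic minority row per original minority row.
minority.rows <- n1 + n1 * 100 / 100

# perc.under = 200: two majority rows per synthetic minority row.
majority.rows <- (n1 * 100 / 100) * 200 / 100

total.rows <- minority.rows + majority.rows
total.rows  # 620 for n1 = 155, a perfectly balanced 310/310 split
```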
randomForest with $1000$ trees was used to fit the model:
library(randomForest)
randomForest.Fit <- randomForest(B ~ ., data = balanced.df, ntree = 1000)
To make a validation set, $2000$ rows were sampled at random without replacement from the data set.
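A minimal base-R sketch of that split, assuming the full data sits in a data frame df with label column B (the simulated df below is a stand-in, not the question's data):

```r
set.seed(1)  # for reproducibility

# Stand-in data frame: 1.55e5 rows with roughly 0.26% positives.
df <- data.frame(x = rnorm(1.55e5),
                 B = factor(rbinom(1.55e5, 1, 0.0026)))

# Draw 2000 row indices at random, without replacement.
val.idx <- sample(nrow(df), 2000, replace = FALSE)
validation.df <- df[val.idx, ]

table(validation.df$B)  # actual label frequencies in the validation set
```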
The actual frequencies of $B_i$ in the validation set are:
   0    1
1998    2
And those in the predicted set are:
   0    1
1836  164
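These two marginal counts alone already bound the error: only $2$ validation rows are actually positive, so of the $164$ predicted positives at most $2$ can be true positives, leaving at least $162$ false positives and a precision of at most $2/164 \approx 1.2\%$. A quick check of that arithmetic:

```r
actual.pos    <- 2     # rows with B = 1 in the validation set
predicted.pos <- 164   # rows the model labelled as 1

# True positives cannot exceed the number of actual positives.
max.tp <- min(actual.pos, predicted.pos)
min.fp <- predicted.pos - max.tp          # at least this many false positives
max.precision <- max.tp / predicted.pos   # best-case precision

min.fp          # 162
max.precision   # ~0.0122
```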
The results seem promising, but perhaps a little too promising. It is also essential that the number of false positives be reduced.
My questions are:
How severely do you think the skew in the data is affecting the validation results?
Is there any point in validating again on a deliberately biased data set, for example one constructed with more $B_i = 1$ rows in the validation set?
What other metrics or validation techniques would better reflect the "accuracy*" of the predictions?
*The term accuracy is used in a generic sense.
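On questions 1 and 3: with a $0.26\%$ positive rate, raw accuracy is nearly meaningless (predicting all zeros already scores about $99.7\%$), so per-class metrics such as precision, recall, and F1 on the positive class are more informative. A minimal base-R sketch, assuming vectors of actual and predicted labels (the example values are illustrative, not the question's data):

```r
# Illustrative labels; replace with the real actual/predicted vectors.
actual    <- factor(c(0, 0, 0, 0, 1, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 0, 0, 1, 0, 0, 1), levels = c(0, 1))

cm <- table(actual, predicted)   # rows = actual, columns = predicted
tp <- cm["1", "1"]               # true positives
fp <- cm["0", "1"]               # false positives
fn <- cm["1", "0"]               # false negatives

precision <- tp / (tp + fp)      # share of predicted 1s that are real 1s
recall    <- tp / (tp + fn)      # share of real 1s that were found
f1        <- 2 * precision * recall / (precision + recall)

c(precision = precision, recall = recall, f1 = f1)
```

Precision-recall curves or ROC/AUC (e.g. via the pROC or PRROC packages) are also commonly used for rare-event problems, since they summarise the trade-off across all classification thresholds rather than at a single cut-off.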