I've been doing a few data science competitions now, and I'm noticing something quite odd and frustrating. Why frustrating? Because, in theory, when you read about data science it's all about features: the careful selection, extraction and engineering of them to squeeze the maximum information out of raw variables. Yet so far, throwing every variable into the mix as-is seems to work fine with the right encodings. Even removing a variable that is 80% null (which in theory should contribute to overfitting) slightly decreases the performance of my regression model.
For a practical case: I have longitude/latitude for a pickup point and a destination point. I did the logical thing and computed the distance (all kinds of it) between these points, then dropped the long/lat. The model performs noticeably better when you include both (coordinates and distance) in the feature list. Any explanations? And, more generally, any thoughts on my dilemma about the real utility of feature selection/engineering/extraction?
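For context, here is a minimal sketch of the setup I mean: computing a haversine distance from the coordinate pairs and keeping it *alongside* the raw coordinates rather than replacing them. The coordinate values are hypothetical, just to make the snippet runnable.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical trip: keep the raw coordinates AND the derived distance.
# The distance encodes trip length; the coordinates still carry location
# information (neighbourhood, airport, downtown) that the scalar distance
# alone throws away.
features = {
    "pickup_lat": 40.7580, "pickup_lon": -73.9855,
    "dropoff_lat": 40.6413, "dropoff_lon": -73.7781,
}
features["haversine_km"] = haversine_km(
    features["pickup_lat"], features["pickup_lon"],
    features["dropoff_lat"], features["dropoff_lon"],
)
```

Dropping the long/lat after computing the distance, as I did at first, is exactly what hurt the model.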
EDIT: could it be that the coordinates carry more information than the distance does? Is it possible to extract features from them that are more beneficial to my model than plain long/lat?
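To make my own question concrete: one hedged illustration of "more information than the distance" is that coordinates encode *location*, not just trip length. A sketch of a simple derived zone feature, using a coarse grid over made-up NYC-ish coordinates (the bounds and grid size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pickup coordinates roughly spanning a city's bounding box.
lats = rng.uniform(40.60, 40.85, size=1000)
lons = rng.uniform(-74.05, -73.75, size=1000)

# Bin coordinates into a coarse 5x5 grid: each cell acts as a crude "zone"
# feature. A tree model given raw lat/long can discover similar regions on
# its own by splitting repeatedly, which a single distance scalar cannot.
lat_bin = np.digitize(lats, np.linspace(40.60, 40.85, 6))
lon_bin = np.digitize(lons, np.linspace(-74.05, -73.75, 6))
zone_id = lat_bin * 10 + lon_bin  # simple composite zone identifier
```

If zones like "airport" or "downtown" have their own fare or demand patterns, that signal lives in the coordinates and is lost once you reduce them to a distance.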