## How to include both origin and destination in your features?


I'm trying to predict the price of transportation for trucking freight. Two features that I think would have a great impact are Origin and Destination. What's the best way to include them in the feature set?

They are categorical variables, and if I one-hot encode them the dataset will be extremely sparse. I also thought about converting both features into lat/long coordinates and treating them as numerical variables.

Anyone dealt with this situation before?


Since your goal is to predict the price, I think it would be more useful to include features such as:

• distance between origin and destination
• whether origin and destination are in the same state/country/continent/area...
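As a rough illustration of these two derived features, here is a minimal sketch. It assumes each location is available as a record with `lat`, `lon` and `state` fields (hypothetical names), and it uses the great-circle (haversine) distance as a stand-in; for trucking, actual road distance may be a better signal if you have it. The city coordinates are approximate and only there for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def route_features(origin, destination):
    """Build derived features for one shipment (field names are assumptions)."""
    return {
        "distance_km": haversine_km(
            origin["lat"], origin["lon"], destination["lat"], destination["lon"]
        ),
        "same_state": int(origin["state"] == destination["state"]),
    }

# Approximate coordinates, for illustration only.
chicago = {"lat": 41.8781, "lon": -87.6298, "state": "IL"}
dallas = {"lat": 32.7767, "lon": -96.7970, "state": "TX"}
features = route_features(chicago, dallas)
```

The same pattern extends to coarser flags (same country, same region) by comparing other fields of the two records.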

However, the actual origin and destination might still be useful, at least the frequent ones, so it's worth experimenting. Those which appear only once or twice in your training data are unlikely to help the model; on the contrary, they might cause overfitting. So you could filter out the origin/destination cities which appear fewer than $$N$$ times and keep the others. This will very likely reduce the number of possible values a lot.
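One possible way to apply the frequency threshold is sketched below. Note that it is a variant of the idea: instead of dropping rows with rare cities, it maps every city seen fewer than `min_count` times to a single placeholder category. The function name `group_rare` and the `"OTHER"` label are made up for this example, and in practice the counts should come from the training split only.

```python
from collections import Counter

def group_rare(values, min_count, other="OTHER"):
    """Replace categories that occur fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Toy data: Reno and Boise each appear only once.
origins = ["Chicago", "Dallas", "Chicago", "Reno", "Dallas", "Chicago", "Boise"]
grouped = group_rare(origins, min_count=2)
```

With `min_count=2`, the five distinct-city column above collapses to three categories (Chicago, Dallas, and the placeholder).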

The market price fluctuates, and the origin-destination combination matters a lot in this field. For example, there can be more capacity in one region than in another. If there's too much capacity, the price dips (and vice versa). – TheSugoiBoi – 2019-08-03T16:35:52.917

But how should I include them in my training set? Encode them and make it super sparse? Or convert to latlong? – TheSugoiBoi – 2019-08-03T16:37:00.833

Technically this is not a problem of sparse data afaik, because all the binary features would represent a single element so it makes sense that only one value is non-zero. This is done all the time when representing words. You could certainly try latlong coordinates as well, but my vague intuition is that it won't work as well. – Erwan – 2019-08-03T16:46:46.687

Ah okay, thank you! – TheSugoiBoi – 2019-08-03T16:55:05.133
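To make the one-hot point from the comments concrete, here is a small sketch in plain Python (the helper names are invented for this example). Each row has exactly one non-zero entry, so the encoding can be stored as a single column index per row, and the dense 0/1 vector is only materialised when needed. Libraries with sparse-matrix support do essentially this under the hood.

```python
def build_vocab(values):
    """Map each distinct category to a column index (sorted for reproducibility)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def to_dense(index, size):
    """Expand a column index into a dense one-hot row of the given width."""
    row = [0] * size
    row[index] = 1
    return row

origins = ["Chicago", "Dallas", "Chicago", "Reno"]
vocab = build_vocab(origins)

# Compact form: one integer per row instead of one column per city.
indices = [vocab[v] for v in origins]

# Dense form, shown only to illustrate that each row sums to 1.
dense_rows = [to_dense(i, len(vocab)) for i in indices]
```

Origin and destination would each get their own vocabulary, and the two one-hot blocks are concatenated alongside the numeric features.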