Destination prediction with Naive Bayes and sparse output matrix


Given a dataset of historical cab rides, I'm trying to predict the final zip code destination of a ride based on the following features:

• origin zip code (e.g. 10006 Wall Street, Manhattan)
• pickup hour of the day
• pickup day of the week
• pickup month of the year
• car class (e.g. standard, premium, luxury, van...)
• current zip code at the time of the observation (e.g. 10013 Tribeca, Manhattan)

As an example, a ride that started on Wall Street in January, on a Saturday at 9am, in a premium car, and that is currently underway in Tribeca, is likely to end up dropping its passenger in Chelsea.

The classification experiments I've done so far using Naive Bayes show some interesting results, but I can see a couple of issues that I'm not sure how to tackle.
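For reference, here is a minimal sketch of what such a baseline might look like, not the exact setup used in the experiments. It is a small categorical Naive Bayes with Laplace smoothing written in plain Python; the class name `CategoricalNaiveBayes`, the toy rides, and the zip codes used as labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Tiny categorical Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, rows, labels):
        n_features = len(rows[0])
        self.class_counts = Counter(labels)
        # counts[j][label][value] = training rows of that class with that value
        self.counts = [defaultdict(Counter) for _ in range(n_features)]
        # values[j] = distinct values seen for feature j
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.counts[j][label][value] += 1
                self.values[j].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)  # log prior
            for j, value in enumerate(row):
                num = self.counts[j][label][value] + self.alpha
                den = n_label + self.alpha * len(self.values[j])
                score += math.log(num / den)  # smoothed log likelihood
            log_scores[label] = score
        # Normalise in log space to avoid underflow with many features.
        top = max(log_scores.values())
        unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(unnorm.values())
        return {k: v / norm for k, v in unnorm.items()}


# Toy rides: (origin zip, hour, day of week, month, car class, current zip).
rows = [
    ("10006", 9, "Sat", 1, "premium", "10013"),
    ("10006", 9, "Sat", 1, "premium", "10013"),
    ("10006", 18, "Mon", 3, "standard", "10013"),
    ("10027", 8, "Tue", 1, "standard", "10025"),
]
labels = ["10011", "10011", "11201", "10019"]  # destination zip codes

model = CategoricalNaiveBayes().fit(rows, labels)
probs = model.predict_proba(("10006", 9, "Sat", 1, "premium", "10013"))
best_zip = max(probs, key=probs.get)
```

With this formulation every destination zip code is its own class, which is exactly where the two issues below come from.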

First issue: sparsity of the output matrix

The first one is that the output matrix is very sparse, with hundreds or even thousands of possible zip codes. I'm not sure how to model that appropriately in order to optimise both the results and the training time. Modelling this as a multiclass problem sounds a bit naive. Would a regression on lat/lon potentially work better? Is it realistic to think of a combination of both approaches and, in that case, would we need two independent models?

Second issue: embedding geographical knowledge

The other issue is that one zip code can have the single highest probability while a group of several others, closely packed together, can have lower individual probabilities but a better chance overall. For example, Morningside (upper Manhattan) can have 15%, but six adjacent zip codes in Brooklyn could have, together, an overall probability of 60%, meaning that the most likely scenario is that the ride is going to Brooklyn (although we're not sure where exactly) rather than Morningside. For this one, I'm torn between hardcoding an aggregation by borough from the individual predictions or, alternatively, having two models (one for the zip codes and one for the boroughs) and merging their predictions afterwards. There might be other approaches, like encoding the mutual distances between boroughs, but I'm not sure how to exploit this information in the model.
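To make the first option concrete, here is a rough sketch of the hardcoded aggregation: sum the per-zip probabilities within each borough and compare the totals. The `ZIP_TO_BOROUGH` lookup, the helper names, and the probabilities are assumptions chosen to mirror the Morningside/Brooklyn example above, not real model output.

```python
from collections import defaultdict

# Hypothetical zip-to-borough lookup, for illustration only.
ZIP_TO_BOROUGH = {
    "10027": "Manhattan",  # Morningside
    "11201": "Brooklyn",
    "11205": "Brooklyn",
    "11206": "Brooklyn",
    "11211": "Brooklyn",
    "11217": "Brooklyn",
    "11238": "Brooklyn",
    "11101": "Queens",
    "11102": "Queens",
}

# Made-up per-zip probabilities, as a zip-level classifier might output them.
zip_probs = {
    "10027": 0.15,
    "11201": 0.10, "11205": 0.10, "11206": 0.10,
    "11211": 0.10, "11217": 0.10, "11238": 0.10,
    "11101": 0.13, "11102": 0.12,
}


def aggregate_by_borough(zip_probs, zip_to_borough):
    """Sum zip-level probabilities into borough-level probabilities."""
    totals = defaultdict(float)
    for zip_code, prob in zip_probs.items():
        totals[zip_to_borough[zip_code]] += prob
    return dict(totals)


def most_likely_borough(zip_probs, zip_to_borough):
    """Return the borough with the largest summed probability."""
    totals = aggregate_by_borough(zip_probs, zip_to_borough)
    return max(totals, key=totals.get)


top_zip = max(zip_probs, key=zip_probs.get)                   # single best zip code
top_borough = most_likely_borough(zip_probs, ZIP_TO_BOROUGH)  # best borough
borough_probs = aggregate_by_borough(zip_probs, ZIP_TO_BOROUGH)
```

Here the single best zip code is in Manhattan while the best borough is Brooklyn, which is the disagreement described above. This only needs the zip-level model, but it fixes the grouping in advance and ignores distances between zip codes that sit on either side of a borough boundary.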

Any ideas or guidance for either of the above issues would be very helpful.