I am building a model to predict the 5-year sales production of a salesperson in every zip code in the United States. I have only 7 years of data. This task is one piece of a larger model whose purpose is to determine the optimal location (identified by zip code) in which to place a new salesperson. My firm has approximately 20,000 salespeople located in 10,000 zip codes. These 20,000 salespeople have customers in 35,000 different zip codes.
I have historical sales information and zip code demographics. My problem is determining the “home zip code” in which to place a salesperson. Only 10,000 of the 43,000 zip codes are the “home zip code” of a salesperson. My company faces different constraints in every state, so our representation is not uniform, but we are in most states. I can build a machine learning model to predict the production of a salesperson in the 10,000 “home zips”, but how should I approach the other 33,000 zip codes? I’ve thought about building a similarity index of the zip codes based on demographics, sales history and regulatory constraints, and then building a separate model for each of these groupings. However, this seems to add a lot of uncertainty through the clustering/similarity grouping, not to mention leaving smaller data sets for learning.
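To make the similarity idea concrete, here is a rough sketch of the kind of thing I have in mind. It is a simplification, not what I have built: instead of clustering and fitting one model per group, it estimates production in a non-home zip from its k most similar home zips. The zip codes, the two features (median income, population density), and the production figures are all made-up placeholders.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy data: features are (median_income, population_density).
home_zips = {
    "10001": {"features": (85.0, 30.0), "production": 500.0},
    "60601": {"features": (80.0, 25.0), "production": 450.0},
    "73301": {"features": (40.0, 5.0), "production": 150.0},
    "59001": {"features": (35.0, 1.0), "production": 100.0},
}
# Zips with no salesperson today (also hypothetical).
candidate_zips = {
    "02108": (82.0, 28.0),
    "82001": (37.0, 2.0),
}

def column_stats(rows):
    """Per-feature (mean, std) so features can be put on a comparable scale."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]

def scale(row, stats):
    """Standardize one feature vector using the home-zip statistics."""
    return tuple((x - m) / s for x, (m, s) in zip(row, stats))

def knn_estimate(candidate, homes, stats, k=2):
    """Average the production of the k home zips closest to the candidate."""
    target = scale(candidate, stats)
    by_distance = sorted(
        (math.dist(target, scale(h["features"], stats)), h["production"])
        for h in homes.values()
    )
    return mean(production for _, production in by_distance[:k])

stats = column_stats([h["features"] for h in home_zips.values()])
estimates = {
    zip_code: knn_estimate(features, home_zips, stats)
    for zip_code, features in candidate_zips.items()
}
print(estimates)
```

With real data the feature set would be much wider (demographics, sales history, state-level regulatory flags), and my worry is exactly how much error this similarity step adds on top of the production model.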
My data is based on annual contracts, i.e., each customer signs a new one-year agreement every year.
Any ideas on how best to approach this problem would be appreciated.