Modelling regressor of historic data on basic features test set


I am using a historical dataset of item sales in shops, and I need to predict the sales for the month following the historical period.

I performed feature engineering, and now I have 10 features in the train set: medians and averages of sales (prices and quantities). I want to fit a linear regression, but the test set contains only 2 features, Shop_id and Item_id. How can I predict with a fitted regressor that has 10 coefficients when the test set has only those 2?

I am not talking about doing PCA. The historical data has a few features derived from past sales, and the test set only has the basic ID columns.

Dataset: Competitive data science predict future sales

Yuval Asher

Posted 2019-04-27T10:01:57.560

Reputation: 11



Well, what you have here is a time series problem.

In a time series problem, you don't have feature values for the same period as the variable you are trying to predict. That means your data stops before time $T$, and you have to predict the value at $T$ based on the data from times $1$ to $T-1$.

Also, Shop_id and Item_id are not regressors that should get a coefficient, because they are IDs. Using IDs as predictors causes your model to memorize the data for each ID, which gives an overfitted model.

Imagine the real-life situation: the store asks you to predict the sales of the next month. Of course, you don't have data like the average price of the next month either, so you have to predict with historical data for all of the variables.
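One way to connect this to the ID-only test set is to treat Shop_id and Item_id as join keys rather than as regressors: compute the last known values of each series from the training history, then merge them onto the test rows. The following is only a minimal sketch with pandas. The column names (`month`, `sales`, `avg_price`), the toy data, and the helper `last_known_features` are assumptions made for illustration, not the actual layout of the competition files.

```python
import pandas as pd

# Hypothetical monthly history: one row per (shop_id, item_id, month).
train = pd.DataFrame({
    "shop_id":   [1, 1, 1, 2, 2, 2],
    "item_id":   [10, 10, 10, 10, 10, 10],
    "month":     [0, 1, 2, 0, 1, 2],
    "sales":     [5.0, 7.0, 6.0, 2.0, 3.0, 4.0],
    "avg_price": [9.0, 9.5, 9.5, 12.0, 11.0, 11.0],
})

# Test set as described in the question: only the two ID columns.
test = pd.DataFrame({"shop_id": [1, 2, 3], "item_id": [10, 10, 10]})


def last_known_features(history: pd.DataFrame, n_lags: int = 2) -> pd.DataFrame:
    """Per (shop_id, item_id), collect the last n_lags months of sales and price."""
    history = history.sort_values("month")
    rows = []
    for (shop, item), grp in history.groupby(["shop_id", "item_id"]):
        row = {"shop_id": shop, "item_id": item}
        for lag in range(1, n_lags + 1):
            if len(grp) >= lag:  # shorter histories are left as missing values
                row[f"sales_lag{lag}"] = grp["sales"].iloc[-lag]
                row[f"avg_price_lag{lag}"] = grp["avg_price"].iloc[-lag]
        rows.append(row)
    return pd.DataFrame(rows)


features = last_known_features(train, n_lags=2)

# The IDs act as join keys; pairs never seen in training get NaN features.
test_features = test.merge(features, on=["shop_id", "item_id"], how="left")
print(test_features)
```

After the merge, each test row carries the same lagged columns the regressor was trained on. Pairs with no history (shop 3 here) come out as missing values and need a separate decision, for example a fallback prediction.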

The solution you are looking for is an ARIMAX model, which allows you to use past values of variables like Sales, Avg price, and others.

$$Sales_t = \alpha_1 Sales_{t-1} + \alpha_2 Sales_{t-2}+\dots+\beta_1 AvgPrice_{t-1}+\dots+\beta_m AvgPrice_{t-m}$$

The model above is one example of how your predictions could be made. The IDs can be used to fit "a model for every store", and each $Sales$ variable in the model could cover a set of products.
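As a rough illustration of the lagged equation, the sketch below fits it for a single series by ordinary least squares with NumPy and then forecasts one step ahead. This only covers the autoregressive and exogenous-lag terms written in the equation, not a full ARIMAX with differencing or moving-average terms (a library such as statsmodels, with its `SARIMAX` class and `exog` argument, would be the usual choice for that). The function names, the use of the same number of lags for both variables, and the synthetic data with its coefficients are all assumptions made for the example.

```python
import numpy as np


def build_lagged_design(sales, avg_price, n_lags):
    """Rows are [Sales_{t-1..t-n}, AvgPrice_{t-1..t-n}]; the target is Sales_t."""
    X, y = [], []
    for t in range(n_lags, len(sales)):
        past_sales = sales[t - n_lags:t][::-1]      # Sales_{t-1}, ..., Sales_{t-n}
        past_price = avg_price[t - n_lags:t][::-1]  # AvgPrice_{t-1}, ..., AvgPrice_{t-n}
        X.append(np.concatenate([past_sales, past_price]))
        y.append(sales[t])
    return np.array(X), np.array(y)


def fit_and_forecast(sales, avg_price, n_lags=2):
    """Estimate the alphas and betas by least squares, then predict the next month."""
    sales = np.asarray(sales, dtype=float)
    avg_price = np.asarray(avg_price, dtype=float)
    X, y = build_lagged_design(sales, avg_price, n_lags)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Only past values are needed for the forecast: the most recent n_lags months.
    next_row = np.concatenate([sales[-n_lags:][::-1], avg_price[-n_lags:][::-1]])
    return coef, float(next_row @ coef)


# Synthetic series for one hypothetical store, generated from known coefficients.
rng = np.random.default_rng(0)
true_coef = np.array([0.5, 0.2, 0.3, -0.1])  # alpha_1, alpha_2, beta_1, beta_2
price = rng.uniform(8.0, 12.0, size=40)
sales = np.empty(40)
sales[:2] = [6.0, 7.0]
for t in range(2, 40):
    sales[t] = (true_coef[0] * sales[t - 1] + true_coef[1] * sales[t - 2]
                + true_coef[2] * price[t - 1] + true_coef[3] * price[t - 2])

coef, forecast = fit_and_forecast(sales, price, n_lags=2)
print(coef, forecast)
```

For "a model for every store", the same function could be called once per Shop_id (or per shop and item pair) on that group's own history, so the IDs select which model to use instead of entering the regression.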

Juan Esteban de la Calle

Posted 2019-04-27T10:01:57.560

Reputation: 2 102