Machine learning models in a production environment



Let's say a model was trained on date $dt1$ using the available labelled data, split into training and test sets, i.e. $train_{dt1}$ and $test_{dt1}$. This model is then deployed in production and makes predictions on new incoming data. After $X$ days pass, a batch of labelled data has been collected between $dt1$ and $dt1 + X$ days; let's call it $Data_x$. In my current approach, I take random samples out of $Data_x$ (e.g. an 80/20 split), so that:

$80\%$ of $Data_x$ = $train_x$ (new data used to fine-tune the existing model trained on $dt1$), and $20\%$ of $Data_x$ = $test_x$ (new data added to $test_{dt1}$).

This fine-tuning process is repeated as time passes.

By doing this I get an ever-expanding test set, and I avoid retraining the whole model (essentially, I can throw away the old data, since the model has already learnt from it). The new model is just a fine-tuned version of the old one.
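As a sketch of the loop described above, using scikit-learn's `SGDClassifier` as a stand-in for any model that supports incremental updates (the synthetic data, batch sizes, and hyperparameters are placeholders, not the actual setup):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Initial training at dt1 (synthetic data stands in for the real labelled set).
X, y = rng.randn(500, 5), rng.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
model.partial_fit(X_train, y_train, classes=np.array([0, 1]))

# Every X days: split the newly collected Data_x 80/20,
# fine-tune on the 80% and append the 20% to the growing test set.
for _ in range(3):
    X_new, y_new = rng.randn(100, 5), rng.randint(0, 2, 100)
    X_tr, X_te, y_tr, y_te = train_test_split(X_new, y_new, test_size=0.2, random_state=0)
    model.partial_fit(X_tr, y_tr)              # fine-tune; old data can be dropped
    X_test = np.vstack([X_test, X_te])         # ever-expanding test set
    y_test = np.concatenate([y_test, y_te])

print(X_test.shape)  # (160, 5): 100 initial test samples + 3 batches of 20
```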

I have some questions, regarding this approach:

  1. Are there any obvious drawbacks in doing this?
  2. Would the model ever need to be completely retrained (forgetting everything learnt before, and training on new train/test splits) after some time, or can the approach described above continue indefinitely?
  3. What should the condition be for swapping the existing deployed model with the newly fine-tuned one?


Posted 2016-08-11T17:48:38.837

Reputation: 263

Excuse a neophyte, please. You must have a very special dataset for it to come labeled, yes? Supervised labeling is by nature costly and slow. – xtian – 2016-08-12T10:13:09.987

@xtian The cost of supervised labelling, and the time it takes, depend significantly on the problem. Say you had an ML model that predicted whether someone walking into a dealership will buy a car or not (given the person's attributes). Labelled data collection is relatively fast in this case: in a day, you might get 100+ labelled samples. – trailblazer – 2016-08-12T16:41:14.653



I think this is a good approach in general. However:

  • How well fine-tuning your model (online learning) works depends a lot on the algorithm and the model. Depending on your algorithm, it might be wise to retrain the whole thing

  • Your sample space might change over time. If you have enough data, retraining every few days/weeks/months on only the last year's worth of data might be better. If your old samples no longer represent the current situation well, including them might hurt your performance more than the extra samples help

  • The biggest conditions are whether the new model has been tested and how much downtime the swap involves, but in general swapping more often is better, and this can be automated
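One way to automate that swap condition is to compare deployed and candidate models on the shared held-out test set before promoting. A minimal sketch (the `should_swap` helper, the models, and the data are all hypothetical; in practice you would use your real test set and metric):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def should_swap(deployed, candidate, X_test, y_test, margin=0.0):
    # Promote the candidate only if it beats (or ties) the deployed model
    # on the shared test set, by at least `margin`.
    return candidate.score(X_test, y_test) >= deployed.score(X_test, y_test) + margin

rng = np.random.RandomState(0)
X = rng.randn(400, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable target
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

deployed = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
candidate = LogisticRegression().fit(X_train, y_train)
print(should_swap(deployed, candidate, X_test, y_test))  # True: candidate scores higher
```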

Jan van der Vegt


Reputation: 8 538

Thanks for the reply! I am currently using ensemble methods such as Random Forest and Gradient Boosted Trees. The reason I did not mention them is that I wanted to know how good the approach is, agnostic of the type of algorithm. – trailblazer – 2016-08-11T18:10:00.463

About the sample space: don't you think that can be handled by giving weights to the observations, building in some notion of time? – trailblazer – 2016-08-11T18:10:59.317

@trailblazer Adding trees to your forest is a decent approach, I think; I have never tried it, but there should be literature about it. Look for online learning. Being algorithm-agnostic will not be possible, because some algorithms can only learn over the whole set. – Jan van der Vegt – 2016-08-19T07:50:05.913
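For what it's worth, scikit-learn's `RandomForestClassifier` supports tree-appending directly via `warm_start` — a minimal sketch (synthetic data; tree counts are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.randn(200, 4), rng.randint(0, 2, 200)

# Initial forest trained at dt1.
forest = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X1, y1)

# Later: grow extra trees. Caveat -- with warm_start the *new* trees are fit
# on whatever data is passed to fit(), and the existing trees are kept
# untouched; this is tree-appending, not true online learning.
X2, y2 = rng.randn(200, 4), rng.randint(0, 2, 200)
forest.n_estimators += 50
forest.fit(X2, y2)

print(len(forest.estimators_))  # 150 trees total
```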

@trailblazer With regards to the sample-space question: that could work for some algorithms but not for others. This again depends on whether online learning is possible, but you would also need to keep increasing the weights, or retrain on everything; you cannot retroactively decrease the weight of older samples without retraining. – Jan van der Vegt – 2016-08-19T07:51:37.663
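A sketch of the weighting idea from the comments, using exponential time decay with a hypothetical 90-day half-life (the data is synthetic, and note the caveat: applying such weights means refitting on the full weighted set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
n = 300
X = rng.randn(n, 3)
y = rng.randint(0, 2, n)
age_days = np.linspace(365, 0, n)  # oldest sample first, newest has age 0

# Exponential time decay: recent samples count more. Older samples cannot be
# down-weighted retroactively without this full refit.
half_life = 90.0
weights = 0.5 ** (age_days / half_life)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=weights)
print(weights[0], weights[-1])  # oldest weight is small, newest is 1.0
```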


It mainly depends on the kind of learning your ML algorithm does. Offline learning: retraining the whole thing is wise, as some algorithms require your full data to form better hypotheses. Online learning: your model can be fine-tuned on the most recent data, with the model updated as the data arrives.
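To illustrate the distinction in scikit-learn terms (synthetic data; the two estimators are just examples of each category): offline learners such as `SVC` expose only `fit`, so incorporating new data means refitting on everything, while online learners such as `SGDClassifier` expose `partial_fit` for incremental updates.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X1, y1 = rng.randn(100, 3), rng.randint(0, 2, 100)
X2, y2 = rng.randn(100, 3), rng.randint(0, 2, 100)

# Offline learner: no partial_fit, so new data requires a full refit
# on the concatenation of old and new.
offline = SVC()
offline.fit(np.vstack([X1, X2]), np.concatenate([y1, y2]))

# Online learner: update incrementally as each batch arrives.
online = SGDClassifier(random_state=0)
online.partial_fit(X1, y1, classes=np.array([0, 1]))
online.partial_fit(X2, y2)

print(hasattr(offline, "partial_fit"), hasattr(online, "partial_fit"))  # False True
```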

yash kumar


Reputation: 1