What happens to the training data after your machine learning model has been trained?

5

I am completely new to all this, and for the life of me I can't find the answer to this question anywhere on Google.

What happens after you have used machine learning to train your model? What happens to the training data?

Let's pretend it predicted correctly 99.99999% of the time, and you were happy with it and wanted to share it with the world. If you put in 10GB of training data, is the file you share with the world 10GB? If it was all trained on AWS, can people only use your service if they connect to AWS through an API?

What happens to all the old training data? Does the model still need all of it to make new predictions?

icYou520

Posted 2018-08-28T04:55:58.417

Reputation: 161

2 Your dataset is not trained; your model is (with your dataset) – Jérémy Blain – 2018-08-28T07:11:28.277

Answers

4

In many cases, a production-ready model has everything it needs to make predictions without retaining training data. For example: a linear model might only need the coefficients, a decision tree just needs rules/splits, and a neural network needs architecture and weights. The training data isn't required as all the information needed to make a prediction is incorporated into the model.
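To make the linear-model case concrete, here is a minimal sketch in plain Python. The data, the `fit_line` and `predict` helpers, and the JSON format are all made up for illustration; real libraries have their own save formats. The point is only that the thing you share is the fitted numbers, not the data they were fitted on.

```python
import json


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: 10,000 points on the line y = 3x + 2.
xs = [float(i) for i in range(10_000)]
ys = [3.0 * x + 2.0 for x in xs]

slope, intercept = fit_line(xs, ys)

# The shareable "model" is just the two fitted numbers.
model_file = json.dumps({"slope": slope, "intercept": intercept})

# Discard the training data; prediction does not need it.
del xs, ys


def predict(serialized_model, x):
    """Rebuild the model from its serialized parameters and predict."""
    params = json.loads(serialized_model)
    return params["slope"] * x + params["intercept"]


prediction = predict(model_file, 5.0)
```

However large the training set gets, `model_file` here stays a few dozen bytes, because a one-feature linear model only ever has two parameters.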

However, some algorithms retain some or all of the training data. A support vector machine stores the points ('support vectors') closest to the separating hyperplane, so that portion of the training data will be stored with the model. Further, k-nearest neighbours must evaluate all points in the dataset every time a prediction is made, and as a result the model incorporates the entire training set.
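By contrast, a toy k-nearest neighbours classifier shows why the data has to travel with the model. This is a sketch with a hypothetical `KNearestNeighbours` class and made-up points, not how any particular library implements it (real implementations usually add an index to speed up the search).

```python
from collections import Counter


class KNearestNeighbours:
    """Toy k-NN classifier; class name and interface are illustrative only."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # "Training" only stores the data; nothing is summarised or discarded.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # Compare the query against every stored training point.
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(point, query)), label)
            for point, label in zip(self.points, self.labels)
        )
        votes = Counter(label for _, label in distances[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical training set: two small clusters.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_labels = ["a", "a", "a", "b", "b", "b"]

model = KNearestNeighbours(k=3).fit(train_points, train_labels)

label_near_origin = model.predict((0.3, 0.3))
label_near_five = model.predict((4.8, 5.0))
stored_points = len(model.points)
```

If you saved this `model` object to disk, the file would grow with the training set, since `model.points` is the training set.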

Having said that, the training data should be retained where possible. If additional data is received, a new model can be trained on the enlarged dataset. If a different approach is later required, or if there are concerns about concept drift, it's good to have the original data still on hand. In many cases, the training data might contain personal data or form part of a company's competitive advantage, so the model and the data should be kept separate.

If you'd like to see how this can work, this Keras blog post has some information (note: no training data required to make predictions once a model is re-instantiated).

redhqs

Posted 2018-08-28T04:55:58.417

Reputation: 291

Ahhh, thank you so much, this makes total sense now. Quick follow-up question: is this newly trained model lightweight enough for an older computer to run, or do you still need the power of a modern-day GPU etc.? Also, once you deploy your perfected model, is it done learning, or does your model continue to fine-tune itself on a slower scale? – icYou520 – 2018-08-28T15:07:06.387

2 @redhqs This is a good start at an answer, but it's not quite complete: some models embed all or some of the training data inside themselves, and must retain it to make future predictions. K-Nearest Neighbours is one example where data is explicitly retained, but SVM models are often large precisely because they implicitly retain selected points from the training data, and must retain many points for complex problems. – John Doucette – 2018-08-29T01:45:34.413

1 @icYou520 Whether the newly trained model is small enough to run easily depends on the algorithm that was used, and to some extent on the parameter settings. Most algorithms do produce small models, though. For some applications, like deep packet inspection, even regular models can be too slow. Most models are "frozen" after training, but some can continue to learn. Transfer learning is a family of techniques for training a pre-trained model on new data. – John Doucette – 2018-08-29T01:47:57.080

1 @JohnDoucette Thank you very much. My journey into ML just started on Sunday. With no math or programming background, I feel so in over my head. I have a great plan of action though, and I felt some very basic things were confusing me. You and redhqs just helped something click. So thank you very much, off to Khan Academy for some math lessons. :) – icYou520 – 2018-08-29T01:58:22.367

1 @JohnDoucette thank you for the clarification and reminder - and all the best, icYou520 – redhqs – 2018-08-29T09:31:45.537

Is your 3rd paragraph talking about transfer learning? If so, I don't find it very clear :s – Jérémy Blain – 2018-08-29T09:44:00.740

@JérémyBlain I'm not referring to transfer learning; the question asks what happens to the training data after you train your model so I wanted to mention that you'd typically retain it in case it is needed later. If it's not clear I'll have a think. – redhqs – 2018-08-29T10:26:14.277

@redhqs OK, but you are talking about 'retraining with new data', which is why I thought that! – Jérémy Blain – 2018-08-29T10:28:23.963

1 @JérémyBlain fair enough, edited for clarity :) – redhqs – 2018-08-29T10:51:24.260