## How to know that your machine learning problem is hopeless?

167

122

Imagine a standard machine-learning scenario:

You are confronted with a large multivariate dataset and you have a pretty blurry understanding of it. What you need to do is to make predictions about some variable based on what you have. As usual, you clean the data, look at descriptive statistics, run some models, cross-validate them etc., but after several attempts, going back and forth and trying multiple models, nothing seems to work and your results are miserable. You can spend hours, days, or weeks on such a problem...

The question is: when to stop? How do you know that your data actually is hopeless and all the fancy models wouldn't do you any more good than predicting the average outcome for all cases or some other trivial solution?

Of course, this is a forecastability issue, but as far as I know, it is hard to assess forecastability for multivariate data before trying something on it. Or am I wrong?

Disclaimer: this question was inspired by another one, "When have I to stop looking for a model?", which did not attract much attention. It would be nice to have a detailed answer to such a question for reference.

1This problem can be answered in practical terms (as @StephanKolassa did) or in absolute terms (some sort of theorem that shows a given model can learn a problem iff certain conditions are satisfied). Which one do you want? – Superbest – 2016-07-05T19:40:39.823

3

This sounds similar to the classic halting problem of computer science? Let's say you have some algorithm A of arbitrary complexity which searches over input data D looking for predictive models, and the algorithm halts when it finds a "good" model for the data. Without adding significant structure on A and D, I don't see how you could tell whether A will ever halt given input D, that is, whether A will eventually succeed or continue searching forever.

– Matthew Gunn – 2016-07-05T19:43:10.633

@Superbest it can be both. If you have something to add, feel free to answer. I have never heard of a theorem that states anything about dealing with real-life multidimensional noisy data, but if you know one that applies, then I'd be interested to read your answer. – Tim – 2016-07-05T19:46:45.373

3Based on @StephenKolassa's answer, another question you could spin off this one is 'At what point should I take my work so far back to the subject matter experts and discuss my results (or lack of results)?' – Robert de Graaf – 2016-07-06T13:20:19.620

– Jan Kukacka – 2018-03-02T19:07:05.637

197

## Forecastability

You are right that this is a question of forecastability. There have been a few articles on forecastability in the IIF's practitioner-oriented journal Foresight. (Full disclosure: I'm an Associate Editor.)

The problem is that forecastability is already hard to assess in "simple" cases.

## A few examples

Suppose you have a time series like this but don't speak German:

How would you model the large peak in April, and how would you include this information in any forecasts?

Unless you knew that this time series is the sales of eggs in a Swiss supermarket chain, which peaks right before western calendar Easter, you would not have a chance. Plus, with Easter moving around the calendar by as much as six weeks, any forecasts that don't include the specific date of Easter (by assuming, say, that this was just some seasonal peak that would recur in a specific week next year) would probably be far off.
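
A minimal sketch of how the Easter date could enter a model as a feature, assuming a "days until Easter" column is what the forecasting model needs. It uses the anonymous Gregorian algorithm for the western Easter date; the function names `western_easter` and `days_until_easter` are illustrative and not taken from any library.

```python
from datetime import date


def western_easter(year: int) -> date:
    """Western (Gregorian) Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day0 = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day0 + 1)


def days_until_easter(day: date) -> int:
    """Days from `day` to the next Easter Sunday (0 on Easter Sunday itself)."""
    easter = western_easter(day.year)
    if easter < day:
        easter = western_easter(day.year + 1)
    return (easter - day).days


# Show how much the date moves between consecutive years.
for year in (2009, 2010, 2011):
    print(year, western_easter(year))
```

Easter fell on April 12 in 2009, April 4 in 2010 and April 24 in 2011, so a model that simply expects the peak in the same calendar week as last year would miss it by up to several weeks.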

Similarly, assume you have the blue line below and want to model whatever happened on 2010-02-28 so differently from "normal" patterns on 2010-02-27:

Again, without knowing what happens when a whole city full of Canadians watches an Olympic ice hockey finals game on TV, you have no chance whatsoever to understand what happened here, and you won't be able to predict when something like this will recur.

Finally, look at this:

This is a time series of daily sales at a cash and carry store. (On the right, you have a simple table: 282 days had zero sales, 42 days saw sales of 1... and one day saw sales of 500.) I don't know what item it is.

To this day, I don't know what happened on that one day with sales of 500. My best guess is that some customer pre-ordered a large amount of whatever product this was and collected it. Now, without knowing this, any forecast for this particular day will be far off. Conversely, assume that this happened right before Easter, and we have a dumb-smart algorithm that believes this could be an Easter effect (maybe these are eggs?) and happily forecasts 500 units for the next Easter. Oh my, could that go wrong.

## Summary

In all cases, we see how forecastability can only be well understood once we have a sufficiently deep understanding of likely factors that influence our data. The problem is that unless we know these factors, we don't know that we may not know them. As per Donald Rumsfeld:

[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.

If Easter or Canadians' predilection for hockey are unknown unknowns to us, we are stuck - and we don't even have a way forward, because we don't know what questions we need to ask.

The only way of getting a handle on these is to gather domain knowledge.

## Conclusions

I draw three conclusions from this:

1. You always need to include domain knowledge in your modeling and prediction.
2. Even with domain knowledge, you are not guaranteed to get enough information for your forecasts and predictions to be acceptable to the user. See that outlier above.
3. If "your results are miserable", you may be hoping for more than you can achieve. If you are forecasting a fair coin toss, then there is no way to get above 50% accuracy. Don't trust external forecast accuracy benchmarks, either.
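
The coin-toss ceiling in point 3 can be sketched in a few lines. For a binary outcome that comes up 1 with probability p regardless of the inputs, no predictor can have expected accuracy above max(p, 1 - p). The simulation below is illustrative only: `accuracy_ceiling`, `simulate_accuracy` and the two toy prediction rules are hypothetical names, and the data are simulated rather than real.

```python
import random


def accuracy_ceiling(p: float) -> float:
    """Best achievable expected accuracy for an outcome that is 1 with probability p."""
    return max(p, 1.0 - p)


def simulate_accuracy(predict, p: float, n: int = 100_000, seed: int = 0) -> float:
    """Share of correct guesses when `predict(history)` calls each Bernoulli(p) draw."""
    rng = random.Random(seed)
    history, hits = [], 0
    for _ in range(n):
        guess = predict(history)          # the rule only sees past outcomes
        outcome = 1 if rng.random() < p else 0
        hits += guess == outcome
        history.append(outcome)
    return hits / n


def always_one(history):
    return 1


def repeat_last(history):
    # A "clever" rule that chases the most recent outcome.
    return history[-1] if history else 1


for p in (0.5, 0.8):
    print(p, accuracy_ceiling(p),
          round(simulate_accuracy(always_one, p), 3),
          round(simulate_accuracy(repeat_last, p), 3))
```

With a fair coin both rules should hover around 50%. With p = 0.8 the rule that chases the last outcome should do worse than always guessing 1, which is a reminder that a more elaborate rule can fall below the ceiling but never rise above it.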

## The Bottom Line

Here is how I would recommend building models - and noticing when to stop:

1. Talk to someone with domain knowledge if you don't already have it yourself.
2. Identify the main drivers of the data you want to forecast, including likely interactions, based on step 1.
3. Build models iteratively, including drivers in decreasing order of strength as per step 2. Assess models using cross-validation or a holdout sample.
4. If your prediction accuracy does not increase any further, either go back to step 1 (e.g., by identifying blatant mis-predictions you can't explain, and discussing these with the domain expert), or accept that you have reached the end of your models' capabilities. Time-boxing your analysis in advance helps.
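
Steps 3 and 4 can be sketched as a loop that adds one driver at a time and stops when the holdout error plateaus. This is a minimal sketch under several assumptions: the drivers are already ordered by expected strength, each stage is a one-variable least-squares fit on the current residuals (forward stagewise fitting, swapped in here for a full multiple regression to keep the code dependency-free), and `min_gain` is a hypothetical tolerance that would have to be chosen for the problem at hand. All function names and the simulated data are illustrative.

```python
import random
from statistics import mean


def fit_line(x, y):
    """One-variable least squares; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def build_until_plateau(train_X, train_y, test_X, test_y, min_gain=0.01):
    """Add drivers in the given order; stop once holdout MAE improves by less than min_gain."""
    base = mean(train_y)                      # trivial model: predict the training average
    train_pred = [base] * len(train_y)
    test_pred = [base] * len(test_y)
    history = [("mean only", mae(test_y, test_pred))]
    for name in train_X:                      # dict order = assumed order of driver strength
        resid = [y - p for y, p in zip(train_y, train_pred)]
        slope, icpt = fit_line(train_X[name], resid)
        new_train = [p + icpt + slope * x for p, x in zip(train_pred, train_X[name])]
        new_test = [p + icpt + slope * x for p, x in zip(test_pred, test_X[name])]
        err = mae(test_y, new_test)
        if history[-1][1] - err < min_gain:   # plateau: back to the expert, or stop
            break
        train_pred, test_pred = new_train, new_test
        history.append((name, err))
    return history


# Simulated example: one strong driver, one weak driver, one pure-noise column.
rng = random.Random(1)
n = 400
X = {k: [rng.gauss(0, 1) for _ in range(n)] for k in ("strong", "weak", "noise")}
y = [3 * s + 0.5 * w + rng.gauss(0, 1) for s, w in zip(X["strong"], X["weak"])]
split = 300
train_X = {k: v[:split] for k, v in X.items()}
test_X = {k: v[split:] for k, v in X.items()}
for step, err in build_until_plateau(train_X, y[:split], test_X, y[split:]):
    print(f"{step:10s} holdout MAE = {err:.3f}")
```

The first entry is the trivial "predict the average" model, so every later step is judged against it. When the loop stops, the choice is the one described in step 4: take the worst mis-predictions back to the domain expert, or accept the plateau.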

Note that I am not advocating trying different classes of models if your original model plateaus. Typically, if you started out with a reasonable model, using something more sophisticated will not yield a strong benefit and may simply be "overfitting on the test set". I have seen this often, and other people agree.

9+1 for a marvelous answer that I totally agree with. I'm not accepting it (yet), since I'm still hoping for other answers, as the problem is broad. – Tim – 2016-07-05T09:46:08.553

1Sure. I'd love to see someone else's perspective on this, too! – Stephan Kolassa – 2016-07-05T09:46:49.827

6`If you are forecasting a fair coin toss, then there is no way to get above 50% accuracy.`. You said everything there. – Walfrat – 2016-07-05T11:57:43.123

3Using domain knowledge you can add new features to the first two cases (eg, time till Easter and TV viewing numbers, though the latter needs forecasting of its own) to get much better results. In neither case is the situation hopeless. The really interesting part is how to tell missing domain knowledge from a data set of fair coin flips. – Karolis Juodelė – 2016-07-05T12:11:50.437

4@KarolisJuodelė: that is exactly my point. We can't even know when our situation is hopeless, unless we talk to an expert... and then, sometimes the expert can't help us either, and there are "unknown unknowns" to the experts, which conceivably somebody else might know. – Stephan Kolassa – 2016-07-05T12:13:51.447

@Walfrat: the question of course is whether we have already arrived at the level of "residual" variance, or whether there is still any unmodeled but modelable structure left - and how we can know the answer to this question. A "coin toss" may look random but be perfectly forecastable once you know the DGP.

– Stephan Kolassa – 2016-07-13T07:12:28.937

Dice are not all that random, either.

– Stephan Kolassa – 2016-07-13T07:12:35.140

49

The answer from Stephan Kolassa is excellent, but I would like to add that there is also often an economic stop condition:

1. When you are doing ML for a customer and not for fun, you should take a look at the amount of money the customer is willing to spend. If he pays your firm 5000€ and you spend a month finding a model, you will lose money. Sounds trivial, but I have seen "there must be a solution!!!!"-thinking which led to huge cost overruns. So stop when the money is out and communicate the problem to your customer.
2. If you have done some work, you often have a feeling for what is possible with the current dataset. Try to apply that to the amount of money you can earn with the model. If the amount is trivial or a net negative (e.g. due to the time to collect data, develop a solution etc.), you should stop.
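
The two checks above amount to simple arithmetic. The sketch below is hypothetical throughout: the function name, the day rate, the success probability and every figure are made-up illustrations, not numbers from a real project.

```python
def should_continue(budget, spent, daily_cost, est_days_left,
                    est_model_value, p_success):
    """Return (go_on, reason) from a crude budget check and an expected-value check."""
    remaining_cost = daily_cost * est_days_left
    if spent + remaining_cost > budget:
        return False, "remaining work would exceed the customer's budget"
    expected_gain = p_success * est_model_value - remaining_cost
    if expected_gain <= 0:
        return False, "expected value of the model does not cover the remaining cost"
    return True, "expected gain of %.0f" % expected_gain


# Hypothetical figures: a 5000 EUR contract, 800 EUR per day, four days already spent.
print(should_continue(budget=5000, spent=3200, daily_cost=800,
                      est_days_left=5, est_model_value=20000, p_success=0.3))
```

The estimates for remaining effort, model value and chance of success are guesses in any real project; the point of writing them down is that the stop decision becomes explicit and can be discussed with the customer.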

As an example: we had a customer who wanted to predict when his machines break. We analyzed existing data and found essentially noise. We dug into the process and found that the most critical data was not recorded and was very difficult to collect. But without that data, our model was so poor that nobody would have used it and it was canned.

While I focused on the economics when working on a commercial product, this rule also applies to academia or to fun projects - while money is less of a concern in such circumstances, time is still a scarce commodity. E.g., in academia you should stop working when you produce no tangible results and you have other, more promising projects you could do. But do not simply drop that project - please also publish null or "need more / other data" results, they are important, too!

1+1 definitely a great point! I guess all the answers to this question may seem "obvious", but I haven't seen anywhere all those "obvious" things gathered for reference. – Tim – 2016-07-06T08:06:37.667

3Btw, this stopping rule also applies to non-business cases: for example, if you do some kind of research, then your results also have some abstract value, and continuing a "hopeless" analysis is reasonable only as long as the value of your analysis exceeds that of what you could have done instead. So in fact this decision-theoretic argument can be made more general. – Tim – 2016-07-06T08:38:02.853

3+1. I have seen this exact problem very often. – Stephan Kolassa – 2016-07-06T10:00:21.283

2I think "nobody would have used ist and was canned" should probably be changed to "nobody would have used it and it was canned" - was this your intended meaning? – Silverfish – 2016-07-06T13:21:42.320

A question that is suggested by the above is 'Having reached a dead end in attempting to build a model, is there a way to salvage some value from the situation, and if so, what possible steps are available?' I say this as someone whose former occupation encompassed reliability engineering (before I became involved with statistics), hence your machine related example is intriguing on two fronts. – Robert de Graaf – 2016-07-06T13:28:52.707

@Robert de Graaf I think in this case you should present some EDA and explain why the problem can't be solved using the given data. At least if you make a lot of graphs, the client will see that you have done something with your time! – Flounderer – 2016-07-07T03:52:50.817

@Silverfish fixed spelling – Christian Sauer – 2016-07-07T05:53:45.933

@Tim and added a paragraph to account for non economic work, does that capture the idea? – Christian Sauer – 2016-07-07T05:54:01.717

@RobertdeGraaf That could be its own question. But I think that it is always a good idea to think about what you would need to answer the question and present that to the customer. All industry 4.0 questions I've seen suffered from the fact that somebody added sensors and somebody else tried to answer questions (reliability...) with them. Most of the time, there was no real match between the data the sensors provided and the data you would need. All you can do then is to inform the customer what he would have to do. – Christian Sauer – 2016-07-07T05:57:23.140

2Thanks. I'd say that it's not only about time but about the fact that you could invest the time differently. You could instead work on a research project on another life-saving drug: you'd save your time, but the public would also benefit from the results, etc. – Tim – 2016-07-07T06:22:10.663

1@ChristianSauer In my experience as an engineer the problem of a mismatch between sensors (cf gauges) and a useful purpose likely predates the invention of the transistor. – Robert de Graaf – 2016-07-08T12:34:16.197

This is why the first thing I like to do is "get familiar with the data" and THEN "get familiar with the problem". When I know what the data can say to me before I hear the problem, I have a better chance of being able to detect that the data can't speak the answer. – EngrStudent – 2016-08-10T14:23:54.603

8

There is another way. Ask yourself -

1. Who or what makes the best possible forecasts of this particular variable?
2. Does my machine learning algorithm produce better or worse results than the best forecasts?

So, for example, if you had a large number of variables associated with different soccer teams and you were trying to forecast who would win, you might look at bookmaker odds or some form of crowd-sourced prediction to compare with the results of your machine learning algorithm. If you are better, you might be at the limit; if worse, then clearly there is room for improvement.
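
One way to make this comparison concrete, assuming both your model and the benchmark (say, bookmaker odds converted to probabilities) give a win probability for each match, is a proper scoring rule such as the Brier score. The sketch below uses made-up numbers; `implied_probability`, `brier_score` and the sample data are illustrative, and the simple normalisation of decimal odds is only one of several ways to remove the bookmaker's margin.

```python
def implied_probability(decimal_odds):
    """Turn decimal odds for all outcomes of one match into probabilities summing to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)                 # exceeds 1 when the odds include a bookmaker margin
    return [r / total for r in raw]


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


# Made-up data: decimal odds (home win, draw, away win) and whether the home team won.
odds = [(1.8, 3.6, 4.5), (2.5, 3.2, 2.9), (1.4, 4.8, 8.0), (3.1, 3.3, 2.3)]
home_won = [1, 0, 1, 0]
benchmark = [implied_probability(o)[0] for o in odds]
model = [0.60, 0.45, 0.65, 0.40]     # hypothetical model output for the same matches

print("benchmark Brier:", round(brier_score(benchmark, home_won), 3))
print("model Brier:    ", round(brier_score(model, home_won), 3))
```

Four matches are far too few to conclude anything; a real comparison needs a large holdout set of matches that neither the model nor its tuning has seen.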

Your ability to improve depends (broadly) on two things:

1. Are you using the same data as the best expert at this particular task?
2. Are you using the data as effectively as the best expert at this particular task?

It depends on exactly what I'm trying to do, but I tend to use the answers to these questions to drive the direction I go in when building a model, particularly whether to try and extract more data that I can use or to concentrate on trying to refine the model.

I agree with Stephan that usually the best way of doing this is to ask a domain expert.

1Actually, your answer contradicts @StephanKolassa's answer, where he refers to literature suggesting that forecast benchmarks are rather misleading. – Tim – 2016-07-06T12:42:48.353

1@Tim: full disclosure - that link went to an article on benchmarks that I wrote myself. Nevertheless, I stand by my conclusions: all demand forecasting accuracy benchmarks I have seen very probably compare apples to oranges. I'm thus a bit sceptical about looking to external benchmarks. In addition, I think this answer somewhat begs the question. Once your ML algorithm improves on "the best known", how do you know whether you can improve it further, or whether we have achieved The Plateau of Hopelessness? – Stephan Kolassa – 2016-07-06T15:57:46.557

1My most recent use case was rather different. I was trying to predict who was at risk of killing themselves from their postings on the internet. There are various psychometric tests that one can use to gauge severity of depression, such as the PHQ9. As it's a widely used medical test, there is considerable work on its validity and reliability, such as "The PHQ-9: Validity of a brief depression severity measure". I found the reliability and other measures in that paper a good starting point as to the likely results one could achieve from machine learning. – Gavin Potter – 2016-07-07T08:48:18.173

1You are right, of course: once you improve on the "best known", there is no way of telling whether to continue searching for a better model. But in my experience, it's fairly rare that this occurs in a real-world situation. Most of the work I do seems to be about trying to apply expert judgements at scale through the use of machine learning, not trying to improve on the best expert in the field. – Gavin Potter – 2016-07-07T08:51:57.270

On a philosophical level...if we believe the data is produced by a deterministic physical process (i.e. excluding true physical randomness like quantum effects) then in principle, there exists more information that could improve the prediction. Any failure of prediction in this setting reflects ignorance of the correct information and/or suboptimal use of it. Of course, that's preposterous in practice. – user20160 – 2016-07-07T12:48:28.423