I like this question because it gets at the politics that exist in every organization. In my view, expectations about model performance are to a significant degree a function of the org culture and of how "technically literate" the organization is. One way to make clear what I mean is to consider the differences between the 4 big "data science" entities -- Google, FB, Amazon and Yahoo -- and the 4 big agency holding entities -- WPP, Omnicom, Interpublic and Publicis. Google, et al., are very technically literate. The agencies, on the other hand, are known to lean towards tech phobia. What's the evidence for this? First off, the technically literate group was founded by or is run by engineers, computer scientists, geeks and people with strong tech backgrounds. Who runs the tech illiterate companies? Marketers who have risen to prominence by virtue of their soft communication and people skills. And not only that, having worked in some of these shops in NYC, I can testify that these organizations systematically punish and/or push out the highly technically literate types as not a "fit" with the culture.
Next, consider their aggregate (stock) market caps. The tech literate group adds up to about 800 billion dollars while the tech illiterate group amounts to 80 billion. Tech literate entities are 10x bigger than the others in market cap. This is a clear statement of the market's expectations, and they are not high for the illiterates. So, by extrapolation, what kind of hope can you have for challenging the "predictive accuracy" expectations of bozos like these?
So, given that cultural breakdown and depending on where you fall, you should have more or less realistic expectations. Of course, some "tech illiterate" entities will have managers who know what they're doing, but for the most part, these entities are dominated by the idiocy of the lowest common denominator in tech skills, i.e., people who are at best technical semi-literates (and dangerous) or, more commonly, totally innumerate but don't know it. Case in point: I worked for a guy who wanted words like "correlation" scrubbed from c-suite decks. This is an extreme case: after all, every secretary knows what a "correlation" is.
This raises the issue of how one deals with the maddeningly naive and innumerate when they ask a really dumb question like, "Why aren't you getting 99% predictive accuracy (PA)?" One good response is to reply with a question like, "Why would you assume such an unrealistically high PA is even possible?" Another might be, "Because if I actually got 99% PA, I would have assumed that I was doing something wrong." That is highly likely to be true, even with 90% PA.
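To make the "doing something wrong" point concrete, here is a minimal sketch of one common way an absurdly high PA shows up: the outcome is rare. The data and the 1% base rate are made up for illustration, and the "model" is deliberately useless.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of true positives that the model actually found."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical data: 10 positives (e.g. buyers) out of 1,000 cases.
y_true = [1] * 10 + [0] * 990

# A "model" that never predicts the positive class.
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))  # 0.99
print(recall(y_true, always_negative))    # 0.0
```

The headline number is 99% and the model finds none of the cases anyone cares about. Leakage of the target into the predictors is another usual suspect, but it is harder to show in a few lines.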
Then there's the more fundamental question of the insistence on PA as the sole criterion for model value. The late Leo Breiman left many footprints on the statistical and predictive modeling community, and the emphasis on PA is one of them. His primary concern with PA was to address the many criticisms being made in the 90s regarding the instability and error inherent in running a single CART tree. His solution was to motivate “random forests” as an approximate and provisional method that would maximize accuracy and reduce instability by eliminating tree structure. He benchmarked the lower MSE from ~1,000 iterative RF “mini-models” against the error from a single logistic regression model. The only problem was that he never bothered to mention the glaring apples-to-oranges comparison: what might the logistic regression MSE have been if it had also been performed 1,000 times?
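For what it's worth, the like-for-like comparison is not hard to set up. The sketch below fits one logistic regression and a bootstrap-averaged (bagged) set of logistic regressions on the same simulated data, then reports the holdout MSE of the predicted probabilities for each. Everything here is an assumption made for illustration: the one-predictor simulated data, the plain gradient-descent fit, and the helper names. It is bagging, not Breiman's random forest procedure, and it does not reproduce his benchmark. The default of 50 resamples keeps the run short; set `n_models=1000` to match the figure above.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=200, lr=0.5):
    """Fit p(y=1|x) = sigmoid(a + b*x) by batch gradient descent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(a + b * x) - y
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def mse(models, xs, ys):
    """MSE of the predicted probability, averaged over all models."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sum(sigmoid(a + b * x) for a, b in models) / len(models)
        total += (p - y) ** 2
    return total / len(xs)


def make_data(n, rng):
    """Simulated data with an assumed true model: logit = 0.5 + 1.5*x."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(0.5 + 1.5 * x) else 0 for x in xs]
    return xs, ys


def compare(n_models=50, seed=0):
    rng = random.Random(seed)
    train_x, train_y = make_data(200, rng)
    test_x, test_y = make_data(500, rng)

    single = fit_logistic(train_x, train_y)

    bagged = []
    for _ in range(n_models):
        # Bootstrap: resample the training rows with replacement.
        idx = [rng.randrange(len(train_x)) for _ in train_x]
        bagged.append(fit_logistic([train_x[i] for i in idx],
                                   [train_y[i] for i in idx]))

    return mse([single], test_x, test_y), mse(bagged, test_x, test_y)


single_mse, bagged_mse = compare()
print(f"single logistic regression MSE: {single_mse:.4f}")
print(f"bagged logistic regression MSE: {bagged_mse:.4f}")
```

Whatever the two numbers turn out to be on a given dataset, this is the comparison that puts both methods on the same footing: many resampled fits on each side, scored on the same holdout.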
The Netflix Prize, launched in 2006 and awarded in 2009, offered a $1 million reward to any statistician or team able to improve upon the root mean squared error (RMSE) of their recommender system. At the time, Netflix was spending $150 million a year on this system, convinced that the costs were more than recovered in customer loyalty and purchase of movies that would otherwise never have been chosen. The eventual winners used a complex ensemble of 107 different models.
As Netflix learned, however, the real problem was that, from a fully loaded cost perspective, the actual improvement in error over their current model was a mere 0.005% reduction in the 5-point ratings. Not to mention that the IT costs in time, heavy lifting and maintenance of the winning ensemble of 107 models more than nullified any gains from the error reduction. Given this, Netflix eventually abandoned the pursuit of ever-lower error, and no more Netflix Prizes have been awarded.
And this is the point: minimizing predictive error can be easily gamed or p-hacked and is prone to analyst fraud (i.e., finding a solution that glorifies the analyst’s modeling skills, positively impacting his potential end-of-year bonus). Moreover, it is a completely statistical solution and goal set in an economic and business vacuum. The metric provides little or no consideration of ancillary, collateral costs -- the very real operational consequences evaluated from A to Z that should be an integral part of any fully-loaded, trade-off based decision-making process.
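A fully loaded comparison can be sketched in a few lines. The two candidates, their error figures, and the dollar amounts below are entirely hypothetical, and `net_value` is just benefit minus cost; a real analysis would need defensible estimates for both.

```python
def net_value(model):
    """Annual benefit minus annual cost (build, compute, maintenance)."""
    return model["annual_benefit"] - model["annual_cost"]


# Hypothetical candidates: the ensemble has lower error but costs far more to run.
models = [
    {"name": "single model", "error": 0.95,
     "annual_benefit": 1_000_000, "annual_cost": 200_000},
    {"name": "large ensemble", "error": 0.86,
     "annual_benefit": 1_100_000, "annual_cost": 700_000},
]

best_by_error = min(models, key=lambda m: m["error"])
best_by_value = max(models, key=net_value)

print(best_by_error["name"])  # large ensemble
print(best_by_value["name"])  # single model
```

With these made-up numbers the two criteria pick different winners, which is the whole argument: the model with the lowest error is not automatically the one worth deploying.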
This has become one of those issues that is embedded in organizations and is very, very difficult to change. In other words, I am fully aware that I am tilting at windmills with this rant about the caveats with the use of PA.