Multi-country model or single model



I am working on a ML model to be deployed in a product operating in many countries.

The issue that I am having is the following: should I

  • train one model and serve it for all countries?
  • train a model per country and serve each model in its country?

I've faced this problem several times, and to me, there's a trade-off in the learning: in the first case, the model has more data to learn, and it'll be more robust (also, the solution is simpler). In the second case, I'll have a more tailored model to each country, and will be able to see effects that are specific to that country.

I'm very interested in knowing if there's an intermediate solution - a general model with some country-specific fine-tuning, one that can see all the data but also specialize in each specific country. If I were to use neural networks, this fine-tuning would be natural: train some epochs with all the data, then the last epochs with each specific country's data. I am wondering if something similar can be done with linear regression models and XGBoost, which are the models I generally use.
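To make the analogy concrete, here is a rough sketch of what I mean by NN-style fine-tuning, applied to an incrementally trained linear model (scikit-learn's `SGDRegressor`; the data is synthetic and all names are illustrative):

```python
# Sketch: "pre-train" a linear model on all countries, then fine-tune a
# copy of it per country with a few more passes over that country's data.
import copy

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
country = rng.integers(0, 3, size=600)          # 3 synthetic countries
slopes = np.array([1.0, 1.5, 2.0])              # country-specific effect
y = X[:, 0] * slopes[country] + rng.normal(scale=0.1, size=600)

# 1) global pre-training: several "epochs" over all countries
global_model = SGDRegressor(random_state=0)
for _ in range(20):
    global_model.partial_fit(X, y)

# 2) per-country fine-tuning: copy the fitted weights, keep training
local_models = {}
for c in np.unique(country):
    m = copy.deepcopy(global_model)             # keeps coefficients and SGD state
    mask = country == c
    for _ in range(20):
        m.partial_fit(X[mask], y[mask])
    local_models[c] = m
```

Each fine-tuned model starts from the global solution and drifts toward its country's own slope, while countries with little data stay close to the global model because the learning rate has already decayed.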

Is there any literature on this? It seems like a fairly generic topic, so there should be some.

David Masip

Posted 2020-07-09T07:20:00.450

Reputation: 5 101

1In the context of neural networks, it's also common to freeze the first several layers and build the specialist models on top of that (rather than letting the specialization tweak the entire network). You could even take the output of one of those layers as features in another type of model. – Ben Reiniger – 2020-07-21T16:16:57.510



In Hinton et al.'s paper, *Distilling the Knowledge in a Neural Network*, the following is mentioned (Section 5) when defining specialist models:

When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).

They use a general model first, and then specialist models that each focus on a different subset of the classes.

You could treat your problem similarly: instead of specialists for subsets of classes, build specialists for countries. This way you could build a country-specific (or cluster-of-countries-specific) ensemble of models.

Carlos Mougan

Posted 2020-07-09T07:20:00.450

Reputation: 4 420

1I like the idea of ensembling the general model with the specific model - maybe even something Bayesian can be useful work, like the general model being the prior and the specific being the posterior – David Masip – 2020-07-20T10:22:30.980


I think the only objective criterion to decide this is to simply compare the performance of the candidate approaches over the validation data.

That being said, if I were to blindly choose the approach upfront without any other information, I would choose a single model, where the model is aware of the country of each piece of data. This would let it model the peculiarities of each country while profiting from the combined training data.

If you have reasons to believe this is harming the global performance because of the intrinsic differences of some countries, you can apply boosting and let the performance of the classifiers speak for itself.
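A minimal sketch of the "one global model that knows the country" approach, with made-up data and column names (Ridge regression stands in for whatever model you actually use):

```python
# One model for all countries, with country one-hot encoded as a feature.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature": rng.normal(size=300),
    "country": rng.choice(["AR", "BR", "CL"], size=300),
})
offsets = {"AR": 0.0, "BR": 1.0, "CL": 2.0}     # country-specific intercepts
y = df["feature"] + df["country"].map(offsets) + rng.normal(scale=0.1, size=300)

pre = ColumnTransformer(
    [("country", OneHotEncoder(), ["country"])],
    remainder="passthrough",                     # keep the numeric feature
)
model = make_pipeline(pre, Ridge())
model.fit(df, y)
```

The model learns the shared feature effect from all 300 rows while the one-hot columns absorb the per-country offsets.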


Posted 2020-07-09T07:20:00.450

Reputation: 10 494


I don't have the theoretical resources to confirm this, but I think it's possible to train a first model on the whole dataset with limited degrees of freedom (high regularization) and the common features, which will allow you to capture the global trends, and then train local models on the residuals.
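A rough sketch of this global-then-residuals idea (synthetic data; a heavily regularized Ridge regression stands in for the global model, and all names are illustrative):

```python
# Heavily regularized global model for the shared trend,
# then one small model per country fitted on the residuals.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
country = rng.integers(0, 2, size=n)
y = X[:, 0] + 0.5 * country + rng.normal(scale=0.1, size=n)  # shared slope + country offset

global_model = Ridge(alpha=100.0)          # high regularization = few degrees of freedom
global_model.fit(X, y)
residuals = y - global_model.predict(X)

local_models = {}
for c in np.unique(country):
    mask = country == c
    local_models[c] = Ridge(alpha=1.0).fit(X[mask], residuals[mask])

def predict_country(X_new, c):
    """Global trend plus the country-specific residual correction."""
    return global_model.predict(X_new) + local_models[c].predict(X_new)
```

The local models only need to explain what the global trend misses, so they can stay small without losing the shared structure.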


Posted 2020-07-09T07:20:00.450

Reputation: 326


I think that @mirimo's idea of having a regularized model as an offset is very interesting.

My proposal is a slight variation where you ensure you don't overfit.

The idea is, to obtain the model for group $j$, train a model with all groups except $j$ and use that model as an offset to the model for group $j$. This way, we can have a complex model for the general behavior and still not train on the same target twice, thus having a more stable model.

The downside is that this is much slower: if there are $J$ groups, it takes around $J$ times longer than regular training.
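A hedged sketch of this leave-one-group-out offset scheme (synthetic data; Ridge regression is just a placeholder for the real models):

```python
# For each group j: train an offset model on every group EXCEPT j,
# then fit the group-j model on the residuals w.r.t. that offset.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
group = rng.integers(0, 3, size=300)
y = X[:, 0] + 0.3 * group + rng.normal(scale=0.1, size=300)

models = {}
for j in np.unique(group):
    holdout = group != j                         # all groups except j
    offset_model = Ridge().fit(X[holdout], y[holdout])
    mask = group == j
    residuals = y[mask] - offset_model.predict(X[mask])
    group_model = Ridge().fit(X[mask], residuals)
    models[j] = (offset_model, group_model)

def predict_group(X_new, j):
    """Offset from the other groups plus the group-specific correction."""
    offset_model, group_model = models[j]
    return offset_model.predict(X_new) + group_model.predict(X_new)
```

Because the offset model never saw group $j$, the group model is not fitting the same target twice, which is the point of the scheme.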


On top of @Carlos Mougan's proposal, we can:

  • Train a global model
  • Train a specific model for each country
  • Ensemble both models

The ensemble can have some shrinkage, like: $$y_{final} = \frac{y_{global} \cdot m + y_{country} \cdot n_{country}}{m + n_{country}} $$ where $y_{country}$ is the prediction of the country-specific model, $y_{global}$ the global prediction, $n_{country}$ the number of samples in a country, and $m$ a hyperparameter to tune; the higher the $m$, the more we trust the global model.

I think this shrinkage is very relevant to the problem.
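The shrinkage formula above is easy to write down as a small function (the argument names mirror the symbols in the formula):

```python
def shrunk_prediction(y_global, y_country, n_country, m):
    """Blend global and country predictions: the country model is weighted
    by its sample count, the global model by the tunable constant m."""
    return (y_global * m + y_country * n_country) / (m + n_country)
```

With $m \to \infty$ you recover the global model; with $n_{country} \gg m$ you trust the country model. A country with few samples is automatically pulled toward the global prediction.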

David Masip

Posted 2020-07-09T07:20:00.450

Reputation: 5 101

Exactly @David. It might not be considered a good production-ready solution, whether you take a neural network model or a simple ML model. – Gaurav Koradiya – 2020-07-20T10:51:30.283


I think the most important thing you can do to bridge both assumptions is to include the country as a variable in the global model.

Should there be any country-specific effects, they will simply be modeled as interactions in the global model. This is how the model deals with any other variable anyway, so why should country be any different?

I think the problem is much more complicated if the data is heavily imbalanced, e.g. some products are only sold in one country. However, this only becomes a problem at the point where training a global model is infeasible anyway.
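A note on the "interactions" point: tree ensembles can learn country × feature interactions on their own, but a plain linear model needs them spelled out. A sketch with made-up data and column names:

```python
# Explicit country-by-feature interaction terms for a linear model.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "x": rng.normal(size=500),
    "country": rng.choice(["A", "B"], size=500),
})
slopes = {"A": 1.0, "B": 2.0}                  # country-specific slope
y = df["x"] * df["country"].map(slopes) + rng.normal(scale=0.1, size=500)

dummies = pd.get_dummies(df["country"], prefix="country", dtype=float)
interactions = dummies.mul(df["x"], axis=0).add_suffix("_x_x")
design = pd.concat([dummies, interactions], axis=1)

# fit_intercept=False: the dummies already span the intercept
model = LinearRegression(fit_intercept=False).fit(design, y)
```

The interaction coefficients recover the per-country slopes, so one global linear model can behave like a model per country when the design matrix allows it.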


Posted 2020-07-09T07:20:00.450

Reputation: 1 300

I think this isn't very different from the first approach - of course I would use country as a feature in the model. I think country should be treated differently than other features, as we ideally want the performance in all countries to be decent. Let's say, for a fraud model, country A has much more fraud than the others. A dummy model that says every user in country A is fraudulent would have pretty good overall performance, but it would be useless in country A. – David Masip – 2020-07-20T10:25:36.420

But by modeling country as a predictor variable, this wouldn't happen even if you used a straightforward model like a linear regression. The fact that we include country as a modeling variable will be enough to mostly take care of it. Put it this way: it is likely that another predictor is even more important than country (e.g., in your example, prior fraud) and we are still fine having it as a predictor. By including it in the model, the algorithm can identify the best way to model country instead of prescribing the way to deal with the variable (via grouped models). – Fnguyen – 2020-07-20T11:44:59.880

What do you mean by grouped models? I am not sure of what's that – David Masip – 2020-07-20T11:50:20.617

A grouped model means one model per group / country in this case. – Fnguyen – 2020-07-20T11:52:02.263


I don't think there is a unique rule to answer that. It strongly depends on how pertinent the country information is relative to the other input data and what you want to predict.

It is possible to face cases where similar input data in different countries lead to different outputs. In that case, it would be mandatory either to add the country as input information or to create a model per country.

In other cases, the country information would not lead to any improvement in the model (so no need to do a specific model per country).

Finally, there are cases for which you will find global information (whatever the country) and specific information per country. In that case, there are multiple approaches to deal with it. The first and most common is to include the country as an input of your global model. As @Fnguyen mentioned, why deal with the country differently than other inputs?


If you think that the country has a specific impact on the prediction, here is a non-exhaustive list of ways you could create models that deal with your assumption:

  • Using transfer learning: train a global model to capture general trends, then continue training the same model on each specific country, starting from the global weights. You may still not be able to capture some country-specific effects.
  • Using a boosting method: train a first classifier on all countries, then train a model per country that boosts on the output of the globally trained classifier. This way you keep the global trends and then use country-specific information.
  • Using a bagging method: train some classifier(s) on all countries and others on a specific country, then combine them in parallel in one big model per country.
  • A specific example using NNs: train a global model and one model per country, then, per country, use a model that combines both the global and the specific models you trained before and only retrain the 'head'. For instance, if using DNNs/CNNs, you only retrain the green part of the final model (figure: global-local NN combination).

The list is non-exhaustive, and you should have a good reason to use such approaches, which give more importance to the country information; normally, the machine learning algorithms would handle it on their own.
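One way to realize the boosting bullet with scikit-learn alone is `warm_start`: grow extra trees on country data after an initial fit on all countries (a stand-in for continued training in gradient boosting libraries; data and names below are made up):

```python
# Boost on top of the global model: trees 1..50 on all countries,
# trees 51..100 on one country's data only.
import copy

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(600, 2))
country = rng.integers(0, 2, size=600)
y = X[:, 0] + 0.8 * country + rng.normal(scale=0.1, size=600)

global_model = GradientBoostingRegressor(
    n_estimators=50, warm_start=True, random_state=0
)
global_model.fit(X, y)                         # stages 1..50, all countries

local_models = {}
for c in np.unique(country):
    m = copy.deepcopy(global_model)            # keep the global trees
    m.n_estimators = 100                       # stages 51..100, this country only
    mask = country == c
    m.fit(X[mask], y[mask])
    local_models[c] = m
```

The extra stages fit the country-specific residuals that the global trees could not explain, which is exactly the "boost on the output of the global classifier" idea.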


Posted 2020-07-09T07:20:00.450

Reputation: 1 005


First of all, in this use case I see that we want our model to learn or understand the data first. This is similar to problems in natural language processing, where one first tries to make the model learn representations from the data. Here we can do something a little tricky: declare country as the target variable and the rest of the features as input variables. We can train a model to learn the mapping from input features to country, so the model has some understanding of how the inputs relate to the country. We can then use this model in the ensemble modeling suggested above. I think it would give a small accuracy increment and would also be a cost-effective solution.

Gaurav Koradiya

Posted 2020-07-09T07:20:00.450

Reputation: 149