You should try all of the following:

- Using a classifier that can handle missing data. Decision trees can handle missing features in both input and output; try `xgboost`, which has done well among Kaggle competition winners. See this answer.

- Off-the-shelf imputation routines.

- Writing your own custom imputation routines (this option will probably get you the best performance).

- Given your pattern of missing values, splitting the problem into four parts and learning a classifier for each.

## Custom routines for imputation

Let's call your sets of columns A, B, C, and D.

Looking at this explanation of MICE, it seems to benefit from random patterns in missing values. In your case, the chained equations will go only one way and repeated iterations as in MICE may not help. But the highly regular nature of your missing values may make implementing your own variant of MICE easier.

Use the rows in set A to fill B. You can write this as a matrix problem $XW = Z$: $X$ contains the rows filled in $A \cap B$ and $Z$ the rows filled in $A - B$. These two row sets don't intersect, and since $B \subseteq A$, together they cover all the rows. Learn $W$ with cross-validation and use it to impute B.
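This $XW = Z$ step can be sketched with plain numpy least squares. Everything here is a toy illustration: the sizes, the linear generating process, and the choice of `lstsq` as the solver are all assumptions, not your actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the A columns (4 features) are always observed; the B columns
# (2 features) are observed only on the first 60 of 100 rows
n, n_a, n_b = 100, 4, 2
X_a = rng.normal(size=(n, n_a))
true_W = rng.normal(size=(n_a, n_b))
X_b = X_a @ true_W + 0.1 * rng.normal(size=(n, n_b))

observed = np.arange(n) < 60   # rows where both A and B columns are filled
missing = ~observed            # rows where only the A columns are filled

# Solve X W = Z by least squares on the fully observed rows
W, *_ = np.linalg.lstsq(X_a[observed], X_b[observed], rcond=None)

# Impute the B columns for the remaining rows
X_b_imputed = X_b.copy()
X_b_imputed[missing] = X_a[missing] @ W
```

In practice you'd cross-validate this fit (e.g. swap the plain least squares for a ridge regression with a tuned penalty) rather than trust the unregularized solution.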

Use A and B to impute C.

- You're double-dipping on A, but I don't think it's a problem overall; any errors in A will just get double the influence on the result.

Use A, B, and C to impute D.
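The three steps form a chain, each one folding the freshly imputed columns into the predictor set for the next. A rough numpy sketch on synthetic nested data (the group sizes, row counts, and linear generating process are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
sizes = {"A": 5, "B": 3, "C": 2, "D": 2}          # columns per group (invented)
filled = {"A": 200, "B": 150, "C": 100, "D": 50}  # rows observed, nested

# Synthetic ground truth: B, C, D are linear in A plus a little noise
full = {"A": rng.normal(size=(n, sizes["A"]))}
for g in ["B", "C", "D"]:
    W = rng.normal(size=(sizes["A"], sizes[g]))
    full[g] = full["A"] @ W + 0.1 * rng.normal(size=(n, sizes[g]))

# Mask the tail rows of each group to produce the nested missing pattern
data = {g: full[g].copy() for g in full}
for g in ["B", "C", "D"]:
    data[g][filled[g]:] = np.nan

# Chained imputation: fill B from A, then C from [A, B], then D from [A, B, C]
predictors = data["A"]
for g in ["B", "C", "D"]:
    obs = ~np.isnan(data[g]).any(axis=1)
    W, *_ = np.linalg.lstsq(predictors[obs], data[g][obs], rcond=None)
    data[g][~obs] = predictors[~obs] @ W
    predictors = np.hstack([predictors, data[g]])
```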

Learn on A, B, C, D with the imputed values. Unlike MICE, your error will not be equal across all imputed values, so you may want to offset the imputation errors by giving the four data sets different weights: "rows from A are all original data, so they get a higher weight"; "rows from B get a small penalty, because I have less data"; and so on.

These four weights can be learned by another "stacked" classifier, similar to the next section.
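One simple way to apply such per-set weights, if your base learner supports them, is weighted least squares: scale each row by the square root of its weight, then fit as usual. A toy sketch (the weight values and the regression setup are invented):

```python
import numpy as np

# Hypothetical per-origin weights: fully observed rows count in full, rows
# whose later columns were imputed get progressively down-weighted
weights = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4}

rng = np.random.default_rng(2)
n = 120
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
origin = rng.choice(list(weights), size=n)   # which set each row came from
w = np.array([weights[g] for g in origin])

# Weighted least squares: scale each row by sqrt(weight), then solve as usual
sw = np.sqrt(w)[:, None]
coef, *_ = np.linalg.lstsq(X * sw, y * sw.ravel(), rcond=None)
```

Rather than hand-picking the four weight values, you'd tune them by cross-validation or learn them with the stacked classifier described below.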

## Stacked classifiers

A possible disadvantage of imputing is that it may be inaccurate, leaving you with different errors on different data points. So, skip imputing and just predict.

Instead of sorting columns from most filled to least filled, sort the rows, i.e. data points, from most columns to least columns.

Then you have four sets of data. Train a classifier for each: the first uses all the data but the fewest features, the next uses more features but less data, and so on, until the last one, which uses the most features but the least data. Which is individually best, more data or more features? That's an empirical question, answered by your dataset.

After getting the four classifiers, combine them with another linear classifier on top (the "stacked" classifier). It may decide to give more weight to the classifier with the most features, or to the classifier with the most data; we'll see (I'm betting on most data). Either way, you ARE using all the features and ALL the data, each in a sensible way. I think this should be a reasonable baseline, at least.
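A minimal stacking sketch with numpy, using thresholded least-squares fits as stand-ins for whatever base learners you actually pick. The feature counts and row counts are invented, and a real setup should train the combiner on held-out predictions to avoid overfitting:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 5] > 0).astype(float)

# Nested pattern: each model sees more features but fewer rows (invented sizes)
n_feats = [2, 4, 6, 8]
n_rows = [400, 300, 200, 100]

models = []
for f, r in zip(n_feats, n_rows):
    Xf = np.hstack([X[:r, :f], np.ones((r, 1))])    # features + bias column
    w, *_ = np.linalg.lstsq(Xf, y[:r], rcond=None)  # least-squares "classifier"
    models.append(w)

# Each base model scores every row using its own feature subset
scores = np.column_stack([
    np.hstack([X[:, :f], np.ones((n, 1))]) @ w
    for f, w in zip(n_feats, models)
])

# Stacked linear combiner on top of the four base scores
# (for honesty, fit this on out-of-fold predictions in real use)
S = np.hstack([scores, np.ones((n, 1))])
w_stack, *_ = np.linalg.lstsq(S, y, rcond=None)
pred = (S @ w_stack > 0.5).astype(float)
accuracy = (pred == y).mean()
```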

You could also try chaining them: start from the last classifier, the one using the least data and the most features. The next classifier uses more data but fewer features, and gets one new feature: $0$ if the data point is "new", and the previous classifier's output $y_0$ if it comes from the old set.
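A sketch of this chaining idea with just two classifiers, again using least-squares stand-ins on toy data (all sizes and the generating process are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 4))
y = (X[:, 0] - X[:, 3] > 0).astype(float)

# Classifier 1: least data, most features (first 100 rows, all 4 columns)
X1 = np.hstack([X[:100], np.ones((100, 1))])
w1, *_ = np.linalg.lstsq(X1, y[:100], rcond=None)

# Classifier 2: more data, fewer features (all rows, first 2 columns), plus
# one extra feature: classifier 1's score where available, 0 for "new" rows
score1 = np.zeros(n)
score1[:100] = X1 @ w1
X2 = np.hstack([X[:, :2], score1[:, None], np.ones((n, 1))])
w2, *_ = np.linalg.lstsq(X2, y, rcond=None)
accuracy = (((X2 @ w2) > 0.5).astype(float) == y).mean()
```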

There are three kinds of ensemble methods: bagging, which randomly subsamples the data to train each classifier (helps with very noisy data and gives lower variance); methods like random forests, which also randomly throw away columns; and boosting, which chains learning: you predict the values (with anything, including bagging and random forests), then train another model (of any of the above types) to predict the residuals, and so on.
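The residual-chaining (boosting) idea can be sketched with hand-rolled decision stumps on a toy regression problem. In real use you'd reach for a library like `xgboost`; the target function, stump search, and learning rate here are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2        # toy regression target

def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error on r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            pl, pr = r[left].mean(), r[~left].mean()
            err = ((r - np.where(left, pl, pr)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, pl, pr)
    return best[1:]

# Boosting: repeatedly fit a stump to the residuals of the current ensemble
pred = np.zeros(n)
learning_rate = 0.5
for _ in range(50):
    j, t, pl, pr = fit_stump(X, y - pred)
    pred += learning_rate * np.where(X[:, j] <= t, pl, pr)
```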

You can look up the literature on these, but honestly, those four classifiers, from (max data, min features) to (min data, max features), can be easily generated with any library. Then use ensemble learning techniques to chain or stack them.
