## Mean estimation for nested location data


I want to estimate the average income for a location. I have nested data in the following way: A block is inside a neighborhood, which is inside a zipcode, which is inside a district, which is inside a region, which is inside a state.

I want to estimate the average income at the block level, but I have very little data at that level. I have much more data at the state level, but the state average is not a good approximation for an individual block.

How would you deal with this problem? Are there ways to account for the uncertainty that comes from having few data points at the block level? Are there Bayesian frameworks that allow us to incorporate data from all levels? Could mixed models do this?

If you explain a method, it would be great if you could also point to a Python package that implements it!

Thanks!

What have you tried so far? What comes to mind is a dummy fixed-effects model: in a linear regression, include dummies for a spatial level where you have "okay" data (e.g. region) and a dummy for each block. You could then test whether the block level is statistically different from the higher spatial level. – Peter – 2020-05-19T11:04:36.807
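A minimal sketch of the dummy-regression idea from this comment, on made-up toy data. The arrays `income`, `region`, `block` and the name `target_block` are hypothetical, and only one block dummy is included here (a dummy for every block would be collinear with the region dummies). The coefficient on the block dummy is the gap between that block and the rest of its region, and the t-statistic uses the usual OLS standard error.

```python
import numpy as np

# Hypothetical toy data: one income observation per row.
income = np.array([50.0, 52.0, 48.0, 70.0, 72.0, 30.0, 31.0, 29.0])
region = np.array(["A", "A", "A", "A", "A", "B", "B", "B"])
block = np.array(["a1", "a1", "a1", "a2", "a2", "b1", "b1", "b1"])

target_block = "a2"  # the block we want to compare against its region

# Design matrix: one dummy per region (no intercept) plus one dummy for the block.
regions = np.unique(region)
columns = [(region == r).astype(float) for r in regions]
columns.append((block == target_block).astype(float))
X = np.column_stack(columns)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

# Classical OLS standard errors from the residual variance.
resid = income - X @ beta
dof = len(income) - X.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X.T @ X)

block_effect = beta[-1]                       # block mean minus rest-of-region mean
t_stat = block_effect / np.sqrt(cov[-1, -1])  # compare with a t distribution on `dof` d.o.f.
print(block_effect, t_stat, dof)
```

A large absolute t-statistic suggests the block really differs from its region; a small one suggests the region-level estimate may be a reasonable stand-in.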

1

I'm just trying a damped mean from city to block, a Bayesian-like thing, where the prior is the city mean and the block mean is estimated via the likelihood and the posterior update rule (as in https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf). The issue is that I don't know how to account for all the other levels.

– David Masip – 2020-05-19T13:06:12.723
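One way the damped mean could be extended to all levels is to chain the same normal-normal update from the state down to the block, using each level's posterior as the prior for the level below it. This is only a heuristic sketch, not a full hierarchical model. It assumes a known observation variance `noise_var`, a single known between-level variance `between_var`, and that each level's sample excludes the observations already used at deeper levels (to avoid double counting). The function names `shrink` and `nested_estimate` are made up for illustration.

```python
def shrink(prior_mean, prior_var, values, noise_var):
    """Normal-normal posterior for a mean, with known observation variance."""
    n = len(values)
    if n == 0:
        # No data at this level: the prior passes through unchanged.
        return prior_mean, prior_var
    precision = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + sum(values) / noise_var) / precision
    return post_mean, 1.0 / precision


def nested_estimate(samples_by_level, noise_var, between_var, root_mean, root_var):
    """Chain the update from the coarsest level (state) to the finest (block).

    samples_by_level: list of lists of incomes, ordered state -> ... -> block.
    """
    mean, var = root_mean, root_var
    for values in samples_by_level:
        # Widen the prior before each step: a unit's true mean may differ
        # from its parent's by roughly between_var.
        mean, var = shrink(mean, var + between_var, values, noise_var)
    return mean, var


# Hypothetical example: some data at the upper level, none at the block itself.
block_mean, block_var = nested_estimate(
    samples_by_level=[[20.0, 20.0], []],
    noise_var=8.0,
    between_var=1.0,
    root_mean=10.0,
    root_var=3.0,
)
print(block_mean, block_var)
```

With few block observations the estimate stays close to the parent level, and the returned variance gives a rough measure of the remaining uncertainty. A proper multilevel (mixed) model would estimate the variances from the data instead of fixing them.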

2

This blog post could be related: https://simongrund1.github.io/posts/multiple-imputation-for-three-level-and-cross-classified-data/

– oW_ – 2020-05-22T00:02:05.733

2

I don't know if this applies here, but if some kind of continuity assumption is realistic, you could move away from categorical variables (block) to continuous variables (longitude and latitude). Then, if you have information on two neighboring blocks, you could interpolate between those values with, say, a spline.

Of course, this can also be fed into a machine learning model, with predictors such as the average income of blocks within distance x. And if you don't have data on nearby blocks, then your state average might be the next best approximation.
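The "average income of nearby blocks, with the state average as a fallback" predictor could be sketched as follows. The function names, the `(lat, lon, mean_income)` tuple layout, and the toy coordinates are assumptions for illustration; distance is computed with the haversine formula on a spherical Earth.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def neighbor_mean(target, blocks, radius_km, fallback):
    """Mean income of blocks closer than radius_km to target.

    target: (lat, lon); blocks: list of (lat, lon, mean_income).
    Returns `fallback` (e.g. the state average) when no block is close enough.
    """
    nearby = [
        income
        for lat, lon, income in blocks
        if haversine_km(target[0], target[1], lat, lon) < radius_km
    ]
    return sum(nearby) / len(nearby) if nearby else fallback


# Hypothetical blocks with known mean incomes.
known_blocks = [
    (40.000, -3.0, 30000.0),
    (40.001, -3.0, 40000.0),
    (41.000, -3.0, 90000.0),  # roughly 111 km away from the first two
]
state_average = 50000.0

print(neighbor_mean((40.0005, -3.0), known_blocks, 1.0, state_average))
print(neighbor_mean((50.0, 10.0), known_blocks, 1.0, state_average))
```

The resulting value can be used directly as an estimate or as one feature among others in a regression model.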

Your state level data can serve as a predictor and also as validation.

Also, plotting your data always helps get some kind of intuition.

2

One option is to move to a more rigorous geographic information system (GIS) data structure.

For example, both Plus Codes and H3 are designed for nested location data. If your data is reformatted to either system, you can easily choose the level of precision at which to aggregate location data.
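To illustrate choosing the precision level, here is a small sketch that does not use the H3 or Plus Codes libraries. Plain tuples such as `(state, region, block)` stand in for hierarchical cell IDs, and the function names and the `min_n` threshold are hypothetical. It aggregates at the finest level that still has at least `min_n` observations.

```python
from collections import defaultdict


def aggregate_by_prefix(records, depth):
    """Group incomes by the first `depth` components of each location path.

    records: iterable of (path, income), where path is a tuple ordered
    from the coarsest unit to the finest.
    Returns {prefix: (mean_income, count)}.
    """
    groups = defaultdict(list)
    for path, income in records:
        groups[path[:depth]].append(income)
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in groups.items()}


def finest_reliable_mean(records, path, min_n):
    """Walk from the finest level upward until a cell has at least min_n points.

    Returns (depth_used, mean_income); depth 0 means all records pooled.
    Returns (None, None) if even the pooled data has fewer than min_n points.
    """
    for depth in range(len(path), -1, -1):
        stats = aggregate_by_prefix(records, depth)
        hit = stats.get(path[:depth])
        if hit is not None and hit[1] >= min_n:
            return depth, hit[0]
    return None, None


# Hypothetical records: (state, region, block) paths with one income each.
records = [
    (("S", "R1", "B1"), 10.0),
    (("S", "R1", "B2"), 20.0),
    (("S", "R1", "B2"), 30.0),
    (("S", "R2", "B3"), 100.0),
]

print(finest_reliable_mean(records, ("S", "R1", "B1"), min_n=2))
```

With real H3 or Plus Codes data the same idea applies, with the coarser parent cell taking the place of the shorter tuple prefix.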