I have a spatial raster that I am using as input for a random forest regression model. My goal is to predict, for each cell, the number of occurrences of a certain property among the individuals in that cell, based on the cell's properties. The number of individuals per cell varies widely, between 1 and 2000. As a result, the data looks like this (cell_id is not used in training):
(cell_id) | cell_data   | property occurrences
1         | 0.1 0.9 0.1 | 1
2         | 0.1 0.8 0.1 | 670
3         | 0.7 0.1 2.4 | 9
To compensate for the very different numbers of individuals per cell, I instead predicted the "property occurrences per individual" for each cell, from which I can calculate the actual occurrences afterwards.
(cell_id) | cell_data   | property occurrences per individual
1         | 0.1 0.9 0.1 | 0.1
2         | 0.1 0.8 0.1 | 0.1
3         | 0.7 0.1 2.4 | 0.8
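To make the transformation concrete, here is a minimal sketch of how the per-individual target is derived and converted back to a count. The arrays `occurrences` and `n_individuals` hold hypothetical values chosen only for illustration; they are not taken from my actual raster.

```python
import numpy as np

# Hypothetical values for three cells (not the real raster data).
occurrences = np.array([1, 670, 9])
n_individuals = np.array([10, 2000, 12])

# Training target: property occurrences per individual.
rate = occurrences / n_individuals

# After prediction, a predicted rate is converted back to a count per cell.
# Here the true rate stands in as a placeholder for the model output.
predicted_rate = rate
predicted_occurrences = predicted_rate * n_individuals
```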
However, there are still two problems:
- A row based on only a few individuals has the same weight as a row based on 1000 individuals.
- If the number of individuals is very small, the property occurrence is almost always 0, regardless of the cell data.
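The second point can be illustrated with a small calculation. This is only a sketch under a simplifying assumption that I have not verified for my data: each individual shows the property independently with the same probability `p`. The function name `prob_zero_occurrences` and the value `p = 0.01` are hypothetical.

```python
def prob_zero_occurrences(p: float, n: int) -> float:
    """Probability of observing no occurrence among n individuals,
    assuming each individual independently has the property with probability p."""
    return (1.0 - p) ** n

# Assumed per-individual probability, for illustration only.
p = 0.01
for n in (1, 5, 50, 1000):
    print(n, prob_zero_occurrences(p, n))
```

Under that assumption, small cells yield a target of 0 most of the time whatever the cell data is, while large cells almost never do.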
I am thinking about adding a sample weight based on the number of individuals. Alternatively, I could add a column with the number of individuals and predict the actual occurrences directly. Are these reasonable approaches, or are they bad ideas? Which one is likely the better choice?
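For reference, this is roughly how I would set up the two options with scikit-learn. It is a sketch on synthetic stand-in data: the feature matrix, the relation between features and rate, and all variable names are assumptions made for illustration. The `sample_weight` argument of `RandomForestRegressor.fit` is used for the weighting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not the real raster).
n_cells = 500
X = rng.random((n_cells, 3))                    # three cell features
n_individuals = rng.integers(1, 2001, n_cells)  # 1..2000 individuals per cell
true_rate = 0.05 + 0.5 * X[:, 0]                # assumed relation, illustration only
occurrences = rng.binomial(n_individuals, true_rate)
rate = occurrences / n_individuals

# Option A: predict the per-individual rate, weighting each row
# by the number of individuals it is based on.
model_weighted = RandomForestRegressor(n_estimators=100, random_state=0)
model_weighted.fit(X, rate, sample_weight=n_individuals)
pred_occ_a = model_weighted.predict(X) * n_individuals

# Option B: add the number of individuals as a feature and
# predict the occurrence count directly.
X_with_n = np.column_stack([X, n_individuals])
model_counts = RandomForestRegressor(n_estimators=100, random_state=0)
model_counts.fit(X_with_n, occurrences)
pred_occ_b = model_counts.predict(X_with_n)
```

This only shows how the two options would be wired up. It predicts on the training data and says nothing about which option generalizes better.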