Splitting a geospatial data set with existing groupings


Data set: I have a data set consisting of trip data for a large number of cars. The raw data consists of rows with GPS coordinates, timestamp, current fuel consumption, temperature, etc. It has been map-matched to OpenStreetMap (OSM), which yields trip segments. Each trip segment has a road segment from OSM, a number of associated rows from the raw data, and aggregated values such as total fuel usage and average temperature. Additionally, the trip segments have been grouped into trips, each of which is one continuous driving session with its own trip ID.

Target: I want to predict fuel consumption for whole trips. My plan is to train a model that predicts fuel consumption for individual trip segments and then sum the predictions over each trip. To do this, I need to split my data set of trip segments into training and test sets.
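A minimal sketch of the trip-level evaluation I have in mind, assuming each trip segment is a record with a trip ID, its measured fuel usage, and a model prediction. The field names (`trip_id`, `fuel`, `predicted_fuel`) and the numbers are made up for illustration:

```python
from collections import defaultdict

def trip_totals(segments, value_key):
    """Sum a per-segment value over each trip ID."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["trip_id"]] += seg[value_key]
    return dict(totals)

# Hypothetical segment records (fuel in litres); not real data.
segments = [
    {"trip_id": "A", "fuel": 0.5, "predicted_fuel": 0.25},
    {"trip_id": "A", "fuel": 0.25, "predicted_fuel": 0.5},
    {"trip_id": "B", "fuel": 1.0, "predicted_fuel": 0.75},
]

actual = trip_totals(segments, "fuel")
predicted = trip_totals(segments, "predicted_fuel")

# Absolute error per whole trip, then averaged over trips.
trip_abs_error = {t: abs(predicted[t] - actual[t]) for t in actual}
mean_trip_abs_error = sum(trip_abs_error.values()) / len(trip_abs_error)
```

Note that the two segment errors in trip A cancel out in the trip total, which is why I care about evaluating on whole trips rather than on individual segments.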

Challenges: First, I want to keep all segments of a trip together in the same set, since this lets me calculate the error for whole trips during testing.

Second, I would like to hit a specific ratio of trip segments between the two sets. My trips have different numbers of segments, so simply splitting the trips according to this ratio might give a different ratio of segments.
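One way to approach this could be a greedy assignment of whole trips. The sketch below is only a heuristic under my own assumptions, not an established method: it shuffles the trip IDs and adds a trip to the test set whenever doing so brings the test share of segments closer to the target. The function name and the `segment_counts` input (trip ID mapped to number of segments) are hypothetical:

```python
import random

def split_trips(segment_counts, test_fraction=0.2, seed=0):
    """Assign whole trips to train/test so that the test share of
    segments lands close to test_fraction.

    segment_counts: dict mapping trip ID -> number of segments in that trip.
    Returns (train_trips, test_trips, achieved_test_share).
    """
    rng = random.Random(seed)
    trip_ids = sorted(segment_counts)  # sort first so the seed is reproducible
    rng.shuffle(trip_ids)

    total = sum(segment_counts.values())
    target = test_fraction * total
    test_trips, test_segments = set(), 0
    for trip_id in trip_ids:
        n = segment_counts[trip_id]
        # Add the trip only if that moves the test set closer to the target.
        if abs(test_segments + n - target) < abs(test_segments - target):
            test_trips.add(trip_id)
            test_segments += n

    train_trips = set(trip_ids) - test_trips
    return train_trips, test_trips, test_segments / total
```

Because every trip lands in exactly one set, trips stay intact, and the achieved share is returned so it can be checked against the target.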

Third, I believe I should split the data so that both sets stay representative of the full data set. That is, both sets should contain trip segments from across the whole area that I have data for.
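To make "representative" measurable, one option is to bin segments into grid cells and compare the share of segments per cell between the two sets. This is a rough sketch under assumptions of mine: each segment is reduced to one representative `(lat, lon)` point, and the cell size in degrees is an arbitrary choice:

```python
from collections import Counter

def cell_shares(points, cell_size_deg=0.05):
    """Share of segments per grid cell; points are (lat, lon) pairs."""
    cells = Counter(
        (int(lat // cell_size_deg), int(lon // cell_size_deg))
        for lat, lon in points
    )
    n = sum(cells.values())
    return {cell: count / n for cell, count in cells.items()}

def coverage_gap(train_points, test_points, cell_size_deg=0.05):
    """Total variation distance between the two spatial distributions:
    0 means identical cell shares, 1 means the sets share no cells."""
    train = cell_shares(train_points, cell_size_deg)
    test = cell_shares(test_points, cell_size_deg)
    cells = set(train) | set(test)
    return 0.5 * sum(abs(train.get(c, 0.0) - test.get(c, 0.0)) for c in cells)
```

A value near 0 would suggest that both sets cover the area in similar proportions. A candidate split could be rejected and redrawn if the value is too high.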

However, I have read a couple of articles on other types of spatial data (for example electoral and agricultural data) that explicitly split the data so that the two sets do not overlap spatially, in order to prevent implicit contamination of the training data with test data. I don't have much experience with spatial data sets, so this makes me unsure of what my goal should be in this regard.

I don't believe it's possible to both keep the trips grouped and avoid overlapping trip segments, since most trips share segments on heavily travelled roads.
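To quantify how much overlap a given trip-grouped split actually has, something like the following could be used. It is a sketch only, and the `osm_way_id` field name is a placeholder for whatever identifies the matched OSM road segment in my data:

```python
def road_overlap(train_segments, test_segments):
    """Fraction of test trip segments whose OSM road also occurs in training."""
    train_roads = {seg["osm_way_id"] for seg in train_segments}
    if not test_segments:
        return 0.0
    shared = sum(1 for seg in test_segments if seg["osm_way_id"] in train_roads)
    return shared / len(test_segments)

# Hypothetical example: roads 1 and 2 appear in both sets, 3 and 4 only in test.
train_segments = [{"osm_way_id": 1}, {"osm_way_id": 2}]
test_segments = [{"osm_way_id": w} for w in (1, 3, 2, 4)]
overlap = road_overlap(train_segments, test_segments)
```

Here half of the test segments lie on roads that also appear in the training set. The same test segments could also be scored separately for seen and unseen roads to see how much the overlap affects the error.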

Question: How should I split the data set to best satisfy the goals described above? Or are any of my assumptions incorrect, and should I approach the problem in a different way?

Jacob Bundgaard

Posted 2018-10-09T14:04:44.150

No answers