Can GPS coordinates (latitude and longitude) be used as features in a linear model?



I have data sets that contain, among many features, GPS coordinates (latitude and longitude). I'd like to use these data sets to explore problems such as: (1) computing ETA to drive between start and end points; and (2) estimating the amount of crime for a specific point.

I'd like to use a linear regression model. However, can I use these GPS coordinates directly in a linear model?

Latitude and longitude do not have an ordinal property, such as with a person's age. For example, the two points (40.805996, -96.681473) and (41.226682, -95.986587) do not seem to have any meaningful ordering. They are just points in space. I was thinking of replacing them with categorical US zip codes and then doing one-hot encoding, but that would result in a lot of variables.


Posted 2017-10-09T20:19:27.787

Reputation: 241

Do you have to use them directly? Have you heard about zoning tools, such as the AZP algorithm by S. Openshaw? You could even manually delimit regions in a map to separate regions/zones, if the area is relatively consistent. – Mephy – 2017-10-09T21:24:17.833

@Mephy: That would mean I'd convert lat / long to zones, right? But then I would have hundreds or thousands of categorical zones, just like with zip codes. I'd have to one-hot encode all of them. – stackoverflowuser2010 – 2017-10-09T21:56:36.290

Depends on how you cut the zones, of course. If you choose "south of the Equatorial line/north of the Equatorial line", then it's only two. Many zoning algorithms have some hyper-parameters to define quantities such as the number of zones or minimum zone size. – Mephy – 2017-10-09T22:21:51.917

I have the same issue. I want to predict a person's position. I have geohashed all the geolocation features in the training data. After that, LabelEncoder is used to transform the categorical location feature. In the end, the result is terrible. Are there any good ideas for dealing with spatial prediction? – berisfu – 2018-01-30T03:53:14.070



You cannot use them directly, as it is unlikely there is a true linear relationship unless you're looking to predict "how far east or north" someone is. As mentioned in the comments, you need to convert them into zones. If you wanted to keep it really simple, you could use a k-means clustering algorithm with a low number of clusters, assign each instance a new feature holding its cluster ID, and then one-hot encode that.
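A minimal sketch of that idea, using only the Python standard library. The function names `kmeans` and `one_hot` and the sample points are made up for illustration, and distances are computed on raw degrees, which is a simplification that ignores the Earth's curvature:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on (lat, lon) pairs; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance in degrees.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

def one_hot(labels, k):
    """Turn cluster IDs into k indicator columns."""
    return [[1 if lab == j else 0 for j in range(k)] for lab in labels]

# Hypothetical sample: three points near each of the two locations in the question.
points = [
    (40.806, -96.681), (40.810, -96.700), (40.790, -96.650),
    (41.227, -95.987), (41.250, -95.940), (41.260, -96.010),
]
centroids, labels = kmeans(points, k=2)
features = one_hot(labels, k=2)  # these columns replace raw lat/long in the model
```

With a small `k`, the number of one-hot columns stays small, which addresses the concern about hundreds of zip-code columns at the cost of coarser zones.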

You may also want to read about how people interpolate coordinates to predict values across a whole map. The first example is with temperature stations, but you can also imagine it being "hot zones" for crime.
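As one simple illustration of spatial interpolation, here is a sketch of inverse distance weighting (IDW). This is not necessarily the method used in the example referred to above; the function name `idw_predict` and the sample values are assumptions for illustration, and distances are again taken on raw degrees:

```python
def idw_predict(known, query, power=2.0):
    """Estimate a value at `query` from known ((lat, lon), value) samples.

    Each sample is weighted by 1 / distance**power, so closer samples
    influence the estimate more.
    """
    num = 0.0
    den = 0.0
    for (lat, lon), value in known:
        d2 = (lat - query[0]) ** 2 + (lon - query[1]) ** 2
        if d2 == 0.0:
            return value  # query coincides with a known sample
        weight = 1.0 / d2 ** (power / 2)
        num += weight * value
        den += weight
    return num / den

# Hypothetical measurements (e.g. incident counts) at two locations.
known = [((0.0, 0.0), 10.0), ((0.0, 2.0), 20.0)]
estimate = idw_predict(known, (0.0, 1.0))  # query halfway between the samples
```

The interpolated value could then be fed to the model as a feature, or used directly as the prediction for a point on the map.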




Reputation: 1 548


You could do whatever your heart desires, but unless your model predicts temperature or time difference, I cannot come up with any other target variable that depends solely on the coordinates.

What you probably want to do is use an external data source and enrich your data with country / zip code / climate / other geographic features that will help your model perform.



Reputation: 131


GPS coordinates can be directly converted to a geohash. Geohash divides the Earth into "buckets" of different sizes based on the number of digits: short geohash codes cover large areas, and longer codes cover smaller areas.
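To make the bucketing concrete, here is a sketch of the standard geohash encoding in plain Python. The function names `geohash_encode` and `geohash_to_int` are chosen for illustration and do not come from any particular library:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash with `precision` digits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 digit
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

def geohash_to_int(code):
    """Read a geohash as a base-32 integer."""
    value = 0
    for ch in code:
        value = value * 32 + _BASE32.index(ch)
    return value

# One of the points from the question, at two precisions.
coarse = geohash_encode(40.805996, -96.681473, precision=4)
fine = geohash_encode(40.805996, -96.681473, precision=8)
# The coarse bucket is a prefix of the fine one: fine.startswith(coarse)
```

Truncating a code to fewer digits yields the enclosing, larger bucket, so the precision controls how many distinct values the feature can take.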

A geohash is a single number that can be used as a feature in a model.

Geohash applies to the entire world; zip codes do not.

Brian Spiering


Reputation: 10 864

The output of a geohasher is a string, not a single number, right? And if the geohash is a string, then I'd have to one-hot encode it, which would result in a lot of variables, just like with a one-hot encoded zipcode. – stackoverflowuser2010 – 2017-10-12T18:11:00.770

A geohash is a single number, encoded in base 32. There is no reason to 1-hot encode. Pick the level of precision and use the relevant number of digits. – Brian Spiering – 2017-10-16T21:11:47.553

I have only seen string representations of geohashes. However, even if geohashes were represented as a long int, is there any linear relationship between them for use in a linear model? That's exactly the point of my original question. – stackoverflowuser2010 – 2017-10-17T20:02:37.167

The relationship between geohashes is slightly complex. – Brian Spiering – 2017-10-17T21:13:32.403

If the geohashes don't have a linear, ordinal relationship, then they can't be used as numeric values in a linear model, right? If that's the case, then they have to be treated as a categorical feature, which would require one-hot encoding. – stackoverflowuser2010 – 2017-10-18T16:26:30.100

There are many ways of feature engineering beyond linear and one-hot encoding. For example, the kernel trick or Helmert transformation. – Brian Spiering – 2017-10-20T15:38:27.380