How to predict ETA using Regression?

6

5

I have GPS data in the form

1.('latitude', 'longitude','Timestamp').
2.('latitude', 'longitude','Timestamp').
3.('latitude', 'longitude','Timestamp').

I am transforming this data into the following form:

'latitude_1', 'longitude_1', 'Timestamp_1', 'latitude_2', 'longitude_2', 'Timestamp_2', 'Timestamp_2 - Timestamp_1'.
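For illustration, the transformation described above might look like the sketch below. It assumes that each row pairs a GPS fix with the next fix in time and that timestamps are numeric seconds; the question states neither, and the name `to_pairs` and the sample values are hypothetical.

```python
from typing import List, Tuple

# Assumed layout of one GPS fix: (latitude, longitude, timestamp in seconds).
Fix = Tuple[float, float, float]
PairRow = Tuple[float, float, float, float, float, float, float]


def to_pairs(fixes: List[Fix]) -> List[PairRow]:
    """Pair each fix with the next one in time and append the elapsed time."""
    ordered = sorted(fixes, key=lambda fix: fix[2])  # sort by timestamp
    rows = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(ordered, ordered[1:]):
        rows.append((lat1, lon1, t1, lat2, lon2, t2, t2 - t1))
    return rows


# Hypothetical sample data, not taken from the question.
sample = [(52.0, 4.0, 0.0), (52.1, 4.1, 600.0), (52.3, 4.2, 1500.0)]
pairs = to_pairs(sample)
```

With three fixes this yields two rows, which is one possible reading of why the transformed form only shows the suffixes _1 and _2.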

With this format, I am training Spark's LinearRegressionWithSGD model, where the label is Timestamp_2 - Timestamp_1 and the features are latitude_1, longitude_1, latitude_2, longitude_2.

But when I give the model an origin (latitude and longitude) and a destination (latitude and longitude), the predictions are very bad.

Is this the right approach? If not, how can I build a model from this data to predict the estimated time of arrival (ETA)?

user825828

Posted 2016-02-18T09:59:43.063

Reputation: 99

This is not very clear, and I don't understand how anyone can try to answer it. Where are the origin coordinates? Where are the destination coordinates? What do the timestamps mean? Why are there three lines in your sample data and only _1 and _2 in your changed form? What happened to the third line? – Spacedman – 2016-02-22T18:20:47.187

Instead of trying to "learn" the relation between lat1,lat2,long1 and long2, just calculate the distance between them. Use Haversine instead of Euclidean as our world is not flat... – Omri374 – 2016-02-23T05:13:53.060

I'm with @Spacedman on this one. Congratulations to those who managed to come up with answers. – tagoma – 2017-10-24T22:17:05.070

Answers

4

I suggest calculating the Haversine distance between the two points and fitting a linear regression to find the relation between the Haversine distance and the trip duration. So your regression will be

$duration_t = timestamp_t - timestamp_{t-1} = \alpha + \beta*d(point_t,point_{t-1})$

Where $d$ is the Haversine distance. $point_t$ is a lat/long pair at time $t$.
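A minimal sketch of this idea, using only the Python standard library rather than Spark. The function names, the Earth radius constant, and the choice of kilometres are assumptions made for illustration; `fit_line` is ordinary least squares for a single feature, which estimates $\alpha$ and $\beta$ in the equation above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def fit_line(distances, durations):
    """Least-squares fit of duration = alpha + beta * distance."""
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(durations) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(distances, durations))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def predict_duration(alpha, beta, lat1, lon1, lat2, lon2):
    """Predicted trip duration for an origin/destination pair."""
    return alpha + beta * haversine_km(lat1, lon1, lat2, lon2)
```

To use it, compute `haversine_km` for every consecutive pair of points, fit `alpha` and `beta` on those distances and the observed durations, then call `predict_duration` with the origin and destination. The predicted duration is in whatever unit the training durations use.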

Note, however, that this assumes the user travels at a roughly constant speed. If half of your data was gathered while walking and half while driving, then the relation between distance and duration is probably not linear.

Omri374

Posted 2016-02-18T09:59:43.063

Reputation: 215

2

To predict timestamps from two predictor variables longitude and latitude, you want to train a multiple linear regression model of the form

$$Timestamp = \alpha + \beta_0 \cdot Longitude + \beta_1 \cdot Latitude.$$

Given the latitude-longitude pair of your destination, you can then compute the ETA.

Spark's LinearRegressionWithSGD model should be able to perform multiple linear regression out of the box, using Timestamp as label and latitude and longitude as features. There's no need to transform the data beforehand.

PidgeyUsedGust

Posted 2016-02-18T09:59:43.063

Reputation: 139

Thanks for the answer. However, I still have a question. If I give the origin (latitude-longitude) I get TimestampA, and if I give the destination (latitude-longitude) I get TimestampB. The predicted ETA is then just the difference between TimestampB and TimestampA, right? I also want to add a third feature: the time of day when the data was recorded, e.g. 16:00 or 18:00, to account for traffic conditions. Will this model be able to take traffic conditions into account? – user825828 – 2016-02-19T15:00:48.473

For the origin, you don't have to compute anything. You can indeed compute how long it will take to reach destination by subtracting the timestamps. Given that you have sufficient data for high-traffic times, the model might be able to capture that. – PidgeyUsedGust – 2016-02-19T23:21:00.003
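One possible way to feed time of day into a regression, sketched under the assumption that the hour is available as a number from 0 to 24. The name `hour_features` is hypothetical. Encoding the hour as a sine/cosine pair keeps 23:00 and 01:00 close together, which a raw hour value would not; whether a linear model can capture rush-hour effects from these two features alone is not guaranteed.

```python
import math


def hour_features(hour_of_day: float):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * (hour_of_day % 24) / 24
    return math.sin(angle), math.cos(angle)


# The two values would be appended to the feature vector of each trip.
evening = hour_features(18.0)
```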

I don't see how that helps. You can be at a specific lat long at different timestamps. How does learning the relation between the coordinates and timestamp help in any way? – Omri374 – 2016-02-22T18:33:15.790

0

The problem is with how you are training the model. If you use Timestamp1 and Timestamp2 as training features, they will carry 100% of the predictive power, because the label is simply their difference, and the algorithm will completely disregard the location features. If you want to make predictions based only on an origin and a destination, you need to train your model using only those features.

j.a.gartner

Posted 2016-02-18T09:59:43.063

Reputation: 1 185

0

This might not be of much help, but I wanted to point out that you might have to control for direction, given that the GPS coordinates will decrease or increase depending on the direction of travel.
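One way to make direction explicit, as a sketch rather than a recommendation from this answer: compute the initial compass bearing from origin to destination and pass it to the model as a sine/cosine pair. The function names are hypothetical, and the formula is the standard initial great-circle bearing.

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing in degrees [0, 360) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def bearing_features(lat1, lon1, lat2, lon2):
    """Sine and cosine of the bearing, so that 359 and 1 degrees stay close."""
    theta = math.radians(initial_bearing_deg(lat1, lon1, lat2, lon2))
    return math.sin(theta), math.cos(theta)
```

Here 0 degrees means due north and 90 degrees means due east. The bearing is undefined when origin and destination coincide; in that case the function returns 0.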

user1887071

Posted 2016-02-18T09:59:43.063

Reputation: 1