Learning estimated time of arrival (ETA) from historical vehicle location data



I have location data of taxis moving around the city, sourced from Microsoft Research.
Overall, it has around 17 million data points.

I have converted the data to JSON and loaded it into MongoDB. A sample looks like this:

{'lat': 39.84349, 'timestamp': '2008-02-08 17:38:10', 'lon': 116.33986, 'ID': 1131}
{'lat': 39.84441, 'timestamp': '2008-02-08 17:38:15', 'lon': 116.33995, 'ID': 1131}
{'lat': 39.8453, 'timestamp': '2008-02-08 17:38:20', 'lon': 116.34004, 'ID': 1131}
{'lat': 39.84615, 'timestamp': '2008-02-08 17:38:25', 'lon': 116.34012, 'ID': 1131}
{'lat': 39.84705, 'timestamp': '2008-02-08 17:38:30', 'lon': 116.34022, 'ID': 1131}
{'lat': 39.84891, 'timestamp': '2008-02-08 17:38:40', 'lon': 116.34039, 'ID': 1131}
{'lat': 39.85083, 'timestamp': '2008-02-08 17:38:50', 'lon': 116.3406, 'ID': 1131}

Each record consists of a taxi ID (the ID field), a timestamp, and the latitude and longitude recorded at that time.

My question is: I want to use this data to calculate the estimated time of arrival (ETA).

So far, I am doing it in a crude way by querying MongoDB with aggregation queries, which is very inefficient.

I am looking for some sort of learning algorithm that can be trained on the historical data. In the end, given two points, the algorithm should traverse the possible routes by referring to the historical data and give a time estimate. Calculating the time estimate is not a problem at all if I can get the array of JSON documents between the two points. Getting the right arrays is the hard part.
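One possible way to get closer to "the right arrays" is to stop querying raw points and instead precompute travel-time segments between consecutive pings of the same taxi. The sketch below is only an illustration under stated assumptions: coordinates are snapped to a coarse grid as a stand-in for real map matching, and the names `cell`, `segments`, `CELL_DEG` and `max_gap_s` are hypothetical, not part of the dataset or of any library.

```python
from collections import defaultdict
from datetime import datetime

CELL_DEG = 0.001  # assumed grid size in degrees (roughly 100 m in latitude)


def cell(lat, lon, size=CELL_DEG):
    """Snap a coordinate to a grid cell id (a crude substitute for map matching)."""
    return (round(lat / size), round(lon / size))


def segments(records, max_gap_s=120):
    """Yield (from_cell, to_cell, seconds) for consecutive pings of the same taxi."""
    by_taxi = defaultdict(list)
    for r in records:
        ts = datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S")
        by_taxi[r["ID"]].append((ts, cell(r["lat"], r["lon"])))
    for pings in by_taxi.values():
        pings.sort()  # order each taxi's pings by time
        for (t0, c0), (t1, c1) in zip(pings, pings[1:]):
            dt = (t1 - t0).total_seconds()
            # Skip pings that stay in one cell and gaps too long to be one trip.
            if c0 != c1 and 0 < dt <= max_gap_s:
                yield c0, c1, dt


# First three records from the sample in the question.
sample = [
    {"lat": 39.84349, "timestamp": "2008-02-08 17:38:10", "lon": 116.33986, "ID": 1131},
    {"lat": 39.84441, "timestamp": "2008-02-08 17:38:15", "lon": 116.33995, "ID": 1131},
    {"lat": 39.8453, "timestamp": "2008-02-08 17:38:20", "lon": 116.34004, "ID": 1131},
]
segs = list(segments(sample))
for s in segs:
    print(s)
```

Each segment then becomes one observation of how long it took to move between two neighbouring cells, which is a much smaller and more query-friendly structure than the raw points.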

Any pointers in this direction will be very helpful.


Posted 2015-05-15T14:14:20.280

Reputation: 167

Hi Spiralarchitect, I am also trying to solve a similar problem, and I am using Kalman filters. Can you share which of the solutions suggested below worked for you? – Prayalankar – 2018-09-24T17:41:32.063



Based on what I figured out from your problem:


  • You can convert your data to a graph using NetworkX, igraph or any other tool or library. Then what you need is a shortest-path algorithm (Dijkstra is widely used and is implemented in most graph and network analysis packages). Once you have created the graph, you can calculate the average estimated time along a path.
  • To turn the problem into a learning problem, you can use historical time estimates for different paths, assign each edge a weight that reflects the properties of that edge (e.g. traffic jam probability, time of day), and try to predict the ETA for a new query.
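As a rough illustration of the graph idea above, the sketch below weights each directed edge with the median of its observed travel times and runs Dijkstra over the result. It uses only the standard library to stay self-contained; the node labels, the sample observations and the function names `build_graph` and `eta` are made up for the example. With NetworkX, a weighted `DiGraph` and `nx.dijkstra_path_length` should give the same kind of answer.

```python
import heapq
from collections import defaultdict
from statistics import median


def build_graph(segments):
    """Map each directed edge to the median of its observed travel times (seconds)."""
    samples = defaultdict(list)
    for src, dst, seconds in segments:
        samples[(src, dst)].append(seconds)
    graph = defaultdict(dict)
    for (src, dst), times in samples.items():
        graph[src][dst] = median(times)
    return graph


def eta(graph, start, goal):
    """Dijkstra: return (total_seconds, path), or (None, []) if goal is unreachable."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in settled:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None, []


# Hypothetical observations: (from_node, to_node, seconds).
observed = [
    ("A", "B", 60), ("A", "B", 80),
    ("B", "C", 30),
    ("A", "C", 150), ("A", "C", 130),
]
graph = build_graph(observed)
print(eta(graph, "A", "C"))  # the route via B (70 + 30) beats the direct edge (140)
```

The median is used here only because it is less sensitive to a few very slow trips than the mean; any other summary of the historical times could serve as the edge weight.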


  • You can also treat it as a network science problem and use graph-theoretic approaches. You can start with a statistical analysis of node and edge attributes, e.g. the passing time distribution, the shortest path length distribution, probabilistic modeling of traffic jams and so on, to see whether some meaningful insight leads you to the next step.
  • Another idea is to use graph clustering algorithms to extract the most connected parts of the town and analyze them separately, i.e. calculate the ETA for different parts instead of the whole data set, assign the estimated time to the members of the corresponding cluster, and so reduce the computational complexity of your algorithm.
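A first step toward the statistical analysis of edge attributes mentioned above could be to summarise the passing times per edge and per hour of day. This is a minimal sketch, assuming observations have already been reduced to `(edge, hour, seconds)` tuples; the function name `edge_time_stats` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def edge_time_stats(observations):
    """Summarise travel times for each (edge, hour-of-day) bucket."""
    buckets = defaultdict(list)
    for edge, hour, seconds in observations:
        buckets[(edge, hour)].append(seconds)
    return {
        key: {
            "n": len(times),
            "mean": mean(times),
            "median": median(times),
            "stdev": pstdev(times),  # population standard deviation
        }
        for key, times in buckets.items()
    }


# Hypothetical observations of one edge at rush hour (8) and late evening (23).
observations = [
    (("A", "B"), 8, 90), (("A", "B"), 8, 110), (("A", "B"), 8, 100),
    (("A", "B"), 23, 40), (("A", "B"), 23, 60),
]
stats = edge_time_stats(observations)
for key, summary in sorted(stats.items()):
    print(key, summary)
```

If the per-hour summaries differ a lot for the same edge, that would suggest making the edge weights time-dependent rather than using a single average.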


  • Last but not least, have a look at ArangoDB. It is a relatively new database that supports a graph data model, and you can run queries on millions of edges at impressive speed. All you need is a bit of JavaScript knowledge, and even if you don't have it, you can use AQL, the query language designed for ArangoDB. The interesting point is that it uses JSON as its standard data format, so you are already halfway there ;)

Hope I could help :) Good luck!

Kasra Manshaei

Posted 2015-05-15T14:14:20.280

Reputation: 5 323

Thanks for so many ideas. I have to spend some time on each and will get back to you. :) I highly appreciate your answer. – Spiralarchitect – 2015-05-16T16:43:45.513