I'm processing geospatial data using Spark 2.0 Dataframes with the following schema:
root
 |-- date: timestamp (nullable = true)
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- accuracy: double (nullable = true)
 |-- track_id: long (nullable = true)
I have noticed that the location signal sometimes jumps to a completely different place. The strange thing is that the signal then stays at the remote location for a certain time, say around 25 seconds or 5 samples, before jumping back to where I actually am.
I'd like to remove these outliers by calculating the speed between the current record and the "last valid record". If the speed is above a given threshold, the current record is dropped and the "last valid record" stays the same. If the speed is below the threshold, the current record is added to the result data frame and becomes the new "last valid record".
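To make the intended logic concrete, here is a minimal sketch of that filter in plain Python (not Spark), assuming records are already sorted by timestamp within one track_id and the first record of a track is trusted. The names `haversine_m` and `filter_jumps` are illustrative, not from any library:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_jumps(records, max_speed_mps):
    """records: list of (timestamp_seconds, lat, lon), sorted by time.
    Keep a record only if the speed implied relative to the last *kept*
    record is at most max_speed_mps; otherwise drop it and keep
    comparing against the same last valid record."""
    if not records:
        return []
    kept = [records[0]]  # first record is trusted by assumption
    for t, lat, lon in records[1:]:
        t0, lat0, lon0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # drop duplicate or out-of-order timestamps
        speed = haversine_m(lat0, lon0, lat, lon) / dt
        if speed <= max_speed_mps:
            kept.append((t, lat, lon))
    return kept

# Example: third sample jumps ~55 km away for one reading, then returns.
track = [
    (0.0, 48.00000, 11.0),
    (5.0, 48.00001, 11.0),
    (10.0, 48.50000, 11.0),   # outlier: implies ~11 km/s
    (15.0, 48.00002, 11.0),
]
clean = filter_jumps(track, max_speed_mps=50.0)
# → keeps 3 records; the jump at t=10 is dropped
```

Note that because each decision depends on the last *kept* record rather than simply the previous row, this is not a plain `lag`-over-window computation; in Spark it would likely have to run as a sequential pass per track_id (e.g. inside a per-group function), which is part of why I'm unsure about the right approach.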
Any suggestions on how to implement this strategy, or any better strategy, are highly appreciated. Thanks.
PS: I asked the same question on Stack Overflow with a concrete implementation. But since I'm not sure this is the right approach, and I don't want to bias the answers toward a particular Spark method, I'm asking here for any suggestions. https://stackoverflow.com/questions/41002844/how-to-filter-outlier-rows-from-spark-dataframe-based-on-distance-to-previous-va