## Predict how many days late or early someone will finish their work


So I have a set of deadlines and people, with a database of when those people finished their previous work and how long after the deadline that was, as well as when the work was assigned. The work consisted of articles, so I also have the word count for each. How do you, based on the previous data, calculate the number of days early or late somebody will most probably finish their work?

As a concrete example of the problem I'm trying to solve:

John finished his last 5 projects 5, 4, 3, 6, and 2 days late. What is the most probable number of days early or late he will finish his next one?

Basically I'm looking for an appropriate machine learning algorithm to calculate this probable end date.

Very well written question, props! Roughly how many deadlines do you have per person and in total? Do you have access to other data, like a textual description of a task? – jonnor – 2019-04-07T11:38:46.693


If we assume that each task delivery is independent of the others, and that the process does not change much over time (it is stationary), we can treat this as a standard regression problem.

Since this is about deadlines, we expect that there might be variations over time, or patterns of delay across the seasons of the year or week. So time-based features might look something like:

|deadline_year|deadline_week_number|deadline_day_of_week|
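As a rough sketch of how those three columns could be derived, assuming each deadline is available as a Python `date` (the function name `deadline_time_features` is made up for illustration):

```python
from datetime import date

def deadline_time_features(deadline: date) -> dict:
    """Split a deadline date into simple calendar features."""
    # isocalendar() yields (ISO year, ISO week number, ISO weekday)
    _, iso_week, _ = deadline.isocalendar()
    return {
        "deadline_year": deadline.year,
        "deadline_week_number": iso_week,            # ISO 8601 week, 1..53
        "deadline_day_of_week": deadline.weekday(),  # 0 = Monday, 6 = Sunday
    }

features = deadline_time_features(date(2019, 4, 7))
print(features)
```

Note that ISO week numbers can belong to the neighbouring year around New Year, so the first days of January may show up as week 52 or 53.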

We also expect that the size of a delay might depend on the size of the task. So if you have the start date, or an estimate of the number of days required, definitely include that. If people can have multiple tasks at the same time, include that as well.

|workdays_between_start_and_deadline|workdays_estimated|concurrent_tasks|
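A minimal sketch of how two of these could be computed, assuming each task is a dict with hypothetical `owner`, `start` and `deadline` keys. The people and dates below are invented, and public holidays are ignored:

```python
from datetime import date, timedelta

def workdays_between(start: date, deadline: date) -> int:
    """Count Monday-Friday days in the half-open range [start, deadline)."""
    days = max((deadline - start).days, 0)
    return sum(
        1 for offset in range(days)
        if (start + timedelta(days=offset)).weekday() < 5
    )

def concurrent_tasks(task: dict, all_tasks: list) -> int:
    """Count other tasks of the same owner whose date range overlaps this one."""
    return sum(
        1 for other in all_tasks
        if other is not task
        and other["owner"] == task["owner"]
        and other["start"] <= task["deadline"]
        and task["start"] <= other["deadline"]
    )

# Invented example tasks, for illustration only
tasks = [
    {"owner": "John", "start": date(2019, 4, 1), "deadline": date(2019, 4, 8)},
    {"owner": "John", "start": date(2019, 4, 5), "deadline": date(2019, 4, 12)},
    {"owner": "Mary", "start": date(2019, 4, 2), "deadline": date(2019, 4, 9)},
]

for task in tasks:
    task["workdays_between_start_and_deadline"] = workdays_between(
        task["start"], task["deadline"]
    )
    task["concurrent_tasks"] = concurrent_tasks(task, tasks)
```

Here John's two tasks overlap, so each of them gets `concurrent_tasks = 1`.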

And we expect that delays may depend on the person who performs the task, and who created the task.

|task_owner|task_creator|
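Before fitting any model, it can help to have a per-person baseline to compare against. The sketch below is not a machine learning model, just summary statistics per task owner, using John's five delays from the question (the `owner_baselines` helper is hypothetical):

```python
from collections import defaultdict
from statistics import mean, median, stdev

# (task_owner, days_delayed) pairs; John's numbers come from the question
history = [("John", 5), ("John", 4), ("John", 3), ("John", 6), ("John", 2)]

def owner_baselines(records):
    """Summarise past delays per owner as mean, median and sample std dev."""
    by_owner = defaultdict(list)
    for owner, days_delayed in records:
        by_owner[owner].append(days_delayed)
    return {
        owner: {
            "mean": mean(delays),
            "median": median(delays),
            # stdev needs at least two observations
            "stdev": stdev(delays) if len(delays) > 1 else 0.0,
        }
        for owner, delays in by_owner.items()
    }

baselines = owner_baselines(history)
print(baselines["John"])
```

For John this gives a mean and median of 4 days late, with a sample standard deviation of about 1.6 days. Any regression model should beat this kind of baseline to be worth the extra complexity.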

Use Exploratory Data Analysis and your knowledge about the processes that created the data to find more of these possible relationships. Use scatterplots of each feature against the target days_delayed (negative = before the deadline, 0 = on time, positive = late).

One can start with a strong non-linear model such as a random forest. Its predictions can be scored (by mean squared error, for example), which indicates whether your features are predictive or not. To get an interval around each prediction rather than a single number, you can use a Bayesian model such as Bayesian Ridge Regression. Bayesian Ridge is a linear model, so you may have to spend more time on feature engineering to make the relationships between features and target (roughly) linear.
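A minimal sketch of both steps with scikit-learn, assuming it and NumPy are installed. The data here is synthetic and generated only so the example runs; the two features and their relationship to the delay are invented, not taken from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: replace with your real feature table
rng = np.random.default_rng(0)
n = 200
word_count = rng.integers(300, 3000, size=n)
workdays_available = rng.integers(1, 15, size=n)
days_delayed = (
    0.002 * word_count - 0.3 * workdays_available + rng.normal(0, 1, size=n)
)
X = np.column_stack([word_count, workdays_available])

# Step 1: score a random forest with cross-validated mean squared error
forest = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(
    forest, X, days_delayed, cv=5, scoring="neg_mean_squared_error"
)
mse = -scores.mean()  # scikit-learn reports negated MSE

# Step 2: Bayesian ridge returns a predictive mean and standard deviation
bayes = BayesianRidge().fit(X, days_delayed)
new_task = np.array([[1500, 5]])  # 1500 words, 5 workdays available
pred_mean, pred_std = bayes.predict(new_task, return_std=True)

# Approximate 95% interval, assuming a Gaussian predictive distribution
low = pred_mean[0] - 1.96 * pred_std[0]
high = pred_mean[0] + 1.96 * pred_std[0]
print(f"forest CV MSE: {mse:.2f}")
print(f"predicted delay: {pred_mean[0]:.1f} days ({low:.1f} to {high:.1f})")
```

Categorical features such as task_owner would need to be encoded (for example one-hot) before being passed to either model.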