I'm trying to perform anomaly detection on the open data from Citi Bike. They publish bike-share trip data for the past 30+ months, as well as monthly reports. Those reports state how many bikes were repaired each month.
I am building one sample per bike per day. I don't have labels for those samples, since I don't know which bike was repaired on which day. But if I classify each sample as normal or abnormal, I can count the bikes classified as abnormal during a month and compare that number to the monthly report.
I want to know how one usually deals with this kind of problem, or what it is called, so I can read research papers on the subject.
Example of samples:
    bikeid, day, feature1, feature2...
    1, 2016-01-01, 0.6, -0.2
    2, 2016-01-01, 0.5, -0.8
    1, 2016-01-02, 0.7, -0.1
    2, 2016-01-02, 0.9, 1
    ...
    1, 2016-01-31, -0.32, -0.45
    2, 2016-01-31, -0.5, -0.8
Example of label: 3456 bicycle repairs in January.
The shape of the data is not the point, though. What matters is that each label describes a group of samples, not a single sample.
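To make the comparison concrete, here is a minimal sketch of what I have in mind. Everything in it is made up for illustration: the anomaly scores, the reported counts, the candidate thresholds, and the function names are all hypothetical, and I am assuming that some unsupervised model already gives one score per bike per day. I also assume that a bike counts once per month if it is flagged on at least one day, which may not be the right way to match the reports.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-(bike, day) anomaly scores; in practice these would come
# from whatever unsupervised model is applied to the features.
samples = [
    # (bikeid, day, anomaly_score)
    (1, date(2016, 1, 1), 0.10),
    (2, date(2016, 1, 1), 0.80),
    (1, date(2016, 1, 2), 0.20),
    (2, date(2016, 1, 2), 0.90),
    (1, date(2016, 2, 1), 0.70),
    (2, date(2016, 2, 1), 0.30),
]

# Hypothetical monthly repair counts (the group-level labels).
reported_repairs = {(2016, 1): 1, (2016, 2): 1}


def predicted_repairs_per_month(samples, threshold):
    """Count distinct bikes flagged abnormal at least once in each month."""
    flagged = defaultdict(set)
    for bikeid, day, score in samples:
        if score > threshold:
            flagged[(day.year, day.month)].add(bikeid)
    return {month: len(bikes) for month, bikes in flagged.items()}


def monthly_count_error(samples, reported, threshold):
    """Sum of absolute differences between predicted and reported counts."""
    predicted = predicted_repairs_per_month(samples, threshold)
    return sum(abs(predicted.get(m, 0) - n) for m, n in reported.items())


# Pick the candidate threshold whose monthly counts best match the reports.
candidates = [0.25, 0.5, 0.75, 0.95]
best_threshold = min(
    candidates,
    key=lambda t: monthly_count_error(samples, reported_repairs, t),
)
```

This only tunes a single threshold against the monthly totals, so it says nothing about whether the right bikes are being flagged, which is exactly the part I don't know how to handle.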