How to learn from unlabeled samples when only groups of samples are labeled?


I'm trying to perform anomaly detection on the open data from Citi Bike. They publish bikeshare trip data for the past 30+ months, as well as monthly reports. Those reports state how many bikes were repaired each month.

I am building one sample per bike per day. I don't have labels for those samples, since I don't know which bike was repaired on which day. But if I classify each sample as normal or abnormal, I can sum the number of bikes classified as abnormal during a month and compare that number to the monthly report.

I want to know how one usually deals with this, or what it is called, so I can read research papers on the subject.

Example of samples:

bikeid, day,         feature1, feature2...
1,      2016-01-01,  0.6,      -0.2 
2,      2016-01-01,  0.5,      -0.8
1,      2016-01-02,  0.7,      -0.1
2,      2016-01-02,  0.9,       1
1,      2016-01-31, -0.32,     -0.45
2,      2016-01-31, -0.5,      -0.8

Example of label: 3456 bicycle repairs in January.

But the shape of the data is irrelevant. What matters is that each label applies to a group of samples rather than to a single sample.
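To make the comparison concrete, here is a minimal sketch of what I have in mind. It assumes some model has already produced an anomaly score per sample; the scores, the threshold, the repair counts, and the function names (`predicted_counts`, `aggregate_squared_error`) are all made up for illustration. It also assumes that each flagged sample corresponds to exactly one repair, which is a simplification.

```python
from collections import defaultdict

# Hypothetical per-sample anomaly scores: (bikeid, day, score in [0, 1]).
# In practice these would come from whatever model is being trained.
scored_samples = [
    (1, "2016-01-01", 0.10),
    (2, "2016-01-01", 0.80),
    (1, "2016-01-02", 0.20),
    (2, "2016-01-02", 0.90),
    (1, "2016-02-01", 0.70),
    (2, "2016-02-01", 0.30),
]

# Hypothetical group labels: number of repairs reported per month.
reported_repairs = {"2016-01": 2, "2016-02": 1}


def predicted_counts(samples, threshold=0.5):
    """Count the samples flagged as abnormal in each month (YYYY-MM)."""
    counts = defaultdict(int)
    for _bikeid, day, score in samples:
        month = day[:7]  # "2016-01-02" -> "2016-01"
        counts[month] += int(score >= threshold)
    return dict(counts)


def aggregate_squared_error(samples, reported, threshold=0.5):
    """Sum over months of (predicted count - reported count) squared."""
    counts = predicted_counts(samples, threshold)
    return sum((counts.get(month, 0) - target) ** 2
               for month, target in reported.items())


print(predicted_counts(scored_samples))
print(aggregate_squared_error(scored_samples, reported_repairs))
```

The only supervision signal here is the per-month error, so any training procedure would have to adjust the model (or the threshold) to shrink that aggregate error without ever seeing a per-sample label.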


Posted 2016-07-08T08:05:55.527

Reputation: 141

It's a little unclear to me, can you give some small samples of data? – Jan van der Vegt – 2016-07-08T08:50:20.360

@Jan van der Vegt : I edited to give you samples of data but I don't care about this problem in particular, I just don't know how to describe my problem to Google – Borbag – 2016-07-08T09:00:20.660

Maybe aggregation is the word you are looking for. – Valentas – 2016-07-08T16:52:03.500

Aggregate Output Learning was the right term, thanks @Valentas. – Borbag – 2016-07-15T09:37:42.990

Here is the paper giving the definition – Borbag – 2016-07-15T09:38:17.867

No answers