Predictive model to maximize sum of dependent variable?

1

I am trying to classify cars for a towing company. Junky cars earn more when sent to the junkyard, while more valuable cars should earn more at auction, despite the auction fee. A logistic regression that takes into account Make, Model, Mileage, Year and Run status improves the accuracy of deciding which cars should go where, but a difficulty arises: sometimes a car that would be classified as junk is actually an outlier and sells for a lot of money. So to optimize our model, we care less about being right or wrong on an individual car than about maximizing our bottom line.

All of the models I have seen (logistic regression, RF, linear regression) make predictions on a line-by-line basis. What would be a good model to maximize the aggregate sum of the predictions?

Below is a reprex of my data, as well as the basic code I used. What I have tried so far is to look at past data and classify, in hindsight, what should have been done, based on the prices earned at auction vs. the available junk prices. I then ran a glm against that classification to predict the future. As mentioned above, my code improved the accuracy of our decisions and would have correctly sent more cars to junk, but some that we classified as junk sold for so much in the auction that it wasn't worth sending any to junk.

What is the proper way to approach this?

cars <- structure(list(
    YearOfCar = c(2009L, 2009L, 2003L, 2004L),
    Make = c("Hyundai", "Lexus", "Ford", "Toyota"),
    Model = c("Sonata", "GS 350", "F-250 Super Duty", "Camry"),
    PickUpState = c("MN", "LA", "MA", "NJ"),
    Auction_Result = c(650, 625, 425, 1500),
    Auction_Fee = c(144.25, 373.54, 213.5, 187),
    Mileage = c(116120L, 198900L, 140241L, 312927L),
    Runs = structure(c(1L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
    junkyard_Offer = c(230L, 235L, 140L, 300L),
    Date = structure(c(17592, 17707, 17674, 17583), class = "Date")),
    row.names = 3:6, class = "data.frame")

# Hindsight label: 1 if the net auction proceeds beat the junkyard offer
cars$hindsight <- ifelse(cars$Auction_Result - cars$Auction_Fee > cars$junkyard_Offer, 1, 0)

# Logistic regression on the hindsight label
glmodel <- glm(hindsight ~ Make + Model + Mileage + Runs, data = cars, family = "binomial")
prediction <- predict(glmodel, cars, type = "response")

# ifelse() is base R, so no extra package is needed here
prediction_classifier <- ifelse(prediction > .501, 1, 0)

# Revenue realized under the model's decision for each car
cars$prediction_results <- ifelse(prediction_classifier == 1,
    cars$Auction_Result - cars$Auction_Fee, cars$junkyard_Offer)

Lamden

Posted 2020-11-02T17:44:21.057

Reputation: 11

Answers

0

Interesting problem, which potentially involves many aspects of ML. Here are a few thoughts:

  • At first sight, this looks more like an optimization problem than a regular classification problem. In that case I would suggest trying something like genetic algorithms, because they can search for an assignment of individual elements that optimizes a global cost or reward function.
  • However, I suspect that the problem is not entirely well defined: "but some that we classified as junk sold for so much in the auction that it wasn't worth sending any to junk." I don't know the business model, but wouldn't it damage the reputation of the company if it tried to sell all the junk cars at auction? If I'm right, then it matters that only a small proportion of cars are sent to auction, and the problem might be related to resource allocation (which is a specific kind of optimization problem).
  • Another way to look at it is that the vast majority of the cars are junk. Thus it might be relevant to treat the problem as anomaly detection, in the sense that one tries to pick the rare cases out of a sea of regular cases. The same idea could be implemented as one-class classification: the model learns to identify the common cases (junk), and anything else is potentially valuable.
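
One way to make the "global reward" idea concrete is to score a decision rule by the total revenue it yields instead of by its accuracy. The sketch below is only an illustration under stated assumptions: it reuses the price columns from the question's reprex, `total_revenue` is a hypothetical helper, the `score` vector is made up to stand in for model output, and a plain grid search over the cutoff takes the place of anything as elaborate as a genetic algorithm.

```r
# Price columns copied from the question's reprex (four rows, purely illustrative)
cars <- data.frame(
  Auction_Result = c(650, 625, 425, 1500),
  Auction_Fee    = c(144.25, 373.54, 213.5, 187),
  junkyard_Offer = c(230, 235, 140, 300)
)

# Hypothetical helper: total revenue for a decision vector
# (TRUE = send to auction, FALSE = send to junkyard)
total_revenue <- function(to_auction, data) {
  net_auction <- data$Auction_Result - data$Auction_Fee
  sum(ifelse(to_auction, net_auction, data$junkyard_Offer))
}

# Assumed scores standing in for a model's P(auction beats junk); not fitted
score <- c(0.55, 0.40, 0.45, 0.90)

# Choose the cutoff that maximizes total revenue rather than accuracy
cutoffs <- seq(0, 1, by = 0.05)
revenue <- sapply(cutoffs, function(k) total_revenue(score >= k, cars))
best_cutoff <- cutoffs[which.max(revenue)]
```

On these four rows every car nets more at auction than the junkyard offers, so the search settles on a cutoff of 0 (send everything to auction), which is consistent with what the question reports. With real data the cutoff would need to be chosen on held-out cars, not on the rows used to fit the model.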

Finally, it's certainly worth investigating whether the valuable cars can actually be identified from the features: would a human expert with only the information in the features be able to classify a car correctly? I could imagine, for instance, that if a particular car is valuable because it was used in a famous movie, it doesn't help to know only its model and mileage. It would also be useful to check the relation between how common a car model is and its junk/valuable status; this could be an important indicator to include in the model as a feature. In the simplest case, it might even be possible to detect potentially valuable cars just by looking at this indicator, in which case there's no need for ML at all.
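
The commonness check could be sketched roughly as follows. This is an assumption-laden illustration, not a finished analysis: it reuses the question's reprex columns and hindsight label, `model_count` and `by_commonness` are hypothetical names, and with only four distinct models the output says nothing about the real relationship.

```r
# Columns copied from the question's reprex
cars <- data.frame(
  Make  = c("Hyundai", "Lexus", "Ford", "Toyota"),
  Model = c("Sonata", "GS 350", "F-250 Super Duty", "Camry"),
  Auction_Result = c(650, 625, 425, 1500),
  Auction_Fee    = c(144.25, 373.54, 213.5, 187),
  junkyard_Offer = c(230, 235, 140, 300)
)

# Same hindsight label as in the question: 1 if auction net beats the junk offer
cars$hindsight <- as.integer(
  cars$Auction_Result - cars$Auction_Fee > cars$junkyard_Offer
)

# Commonness feature: number of cars in the data sharing the same Make and Model
cars$model_count <- ave(rep(1L, nrow(cars)), cars$Make, cars$Model, FUN = sum)

# Share of auction-worthy cars at each commonness level
by_commonness <- aggregate(hindsight ~ model_count, data = cars, FUN = mean)
```

On the full history, `model_count` could be added to the glm formula as an extra predictor, or `by_commonness` inspected directly to see whether rare models are the ones that do well at auction.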

Erwan

Posted 2020-11-02T17:44:21.057

Reputation: 12 600

As far as anomaly detection goes, I did actually try to use a logistic regression to classify the (otherwise) junk cars that did particularly well at the auction... I also tried only including cars where the logistic prediction output was above a more stringent hurdle rate, such as .70 instead of .5. For some reason, I wasn't yet able to find any meaningful relationships. Will investigate your ideas. Thanks. – Lamden – 2020-11-03T19:19:37.447