How do I avoid hardware limitations while competing on Kaggle?


I've learned machine learning via textbooks and examples, which don't delve into the engineering challenges of working with "big-ish" data like Kaggle's.

As a specific example, I'm working on the New York taxi trip challenge: a regression task with ~2 million rows and 20 columns.

My 4GB-RAM laptop can barely handle EDA with pandas and matplotlib in a Jupyter Notebook. When I try to fit a random forest with 1000 trees, the machine hangs and the Jupyter kernel dies with a restart error.
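For context, the training call that kills the kernel looks roughly like this (the data below is a small synthetic stand-in with made-up column names; the real frame is ~2 million rows by 20 columns):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Small stand-in for the real ~2M-row taxi frame; only the shapes matter here.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 20)), columns=[f"f{i}" for i in range(20)])
y = rng.random(1000)

# n_estimators=1000 is what exhausts RAM on the full data: every tree
# keeps its entire structure in memory at once.
model = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
model.fit(df, y)
print(model.predict(df.head()).shape)  # (5,)
```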

To work around this, I set up a 16GB-RAM desktop. I ssh in, start a browser-less Jupyter Notebook server there, and connect my local Notebook to that kernel. However, I still max out that machine.
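Concretely, my remote setup is the standard port-forwarding arrangement (hostname and port here are placeholders):

```shell
# On the 16GB desktop: start a headless notebook server
jupyter notebook --no-browser --port=8888

# On the laptop: forward local port 8888 to the desktop,
# then browse to http://localhost:8888
ssh -N -L 8888:localhost:8888 user@desktop
```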

At this point, I'm guessing that I need to run my model training code as a script.

  • Will this keep my machine from hanging?
  • What's the workflow to store the model results and use it later for prediction? Do you use a Makefile to keep this reproducible?
  • Running a script also sacrifices the interactivity of Jupyter Notebook -- is there a workflow that preserves it?
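To make the second question concrete, the script-based workflow I imagine would persist the fitted model so a separate script can load it for prediction later -- something like joblib, sketched below with synthetic data and a hypothetical output path:

```python
import numpy as np
import joblib
from sklearn.ensemble import RandomForestRegressor

# train.py: fit on stand-in data and persist the fitted model to disk
rng = np.random.default_rng(0)
X, y = rng.random((200, 20)), rng.random(200)
model = RandomForestRegressor(n_estimators=50, n_jobs=-1).fit(X, y)
joblib.dump(model, "model.joblib")  # hypothetical path

# predict.py: reload later without retraining
loaded = joblib.load("model.joblib")
preds = loaded.predict(X[:5])
print(preds.shape)  # (5,)
```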

My current toolkit is RStudio, Jupyter Notebook, and Emacs, but I'm willing to pick up new tools.


Posted 2017-10-12T17:03:35.873
