I've learned machine learning via textbooks and examples, which don't delve into the engineering challenges of working with "big-ish" data like Kaggle's.
As a specific example, I'm working on the New York taxi trip challenge. It's a regression task on ~2 million rows and 20 columns.
My 4GB-RAM laptop can barely handle EDA with matplotlib in Jupyter Notebook. When I try to train a random forest with 1000 trees, it hangs entirely (Jupyter reports a kernel restart error).
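For concreteness, here's a scaled-down sketch of the kind of call that hangs (synthetic data stands in for the taxi features; sizes and parameters here are placeholders, not my actual code):

```python
# Scaled-down sketch of the training call that exhausts memory on the
# real data. Synthetic data stands in for the ~2M-row taxi features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # real data: ~2,000,000 rows x 20 columns
y = rng.normal(size=1000)

# n_jobs=-1 parallelizes tree building across all cores, but each worker
# needs its own working memory, so peak RAM usage grows with core count.
model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
model.fit(X, y)
print(len(model.estimators_))  # 50 fitted trees
```

On the real data this is n_estimators=1000, which is where the kernel dies.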
To work around this, I set up a 16GB-RAM desktop. I ssh in, start a browser-less Jupyter Notebook server there, and connect my local Notebook to that kernel. However, I still max out that machine's memory.
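The remote setup is roughly the following (hostname, user, and port are placeholders for my actual values):

```shell
# On the desktop: start Jupyter without opening a browser
jupyter notebook --no-browser --port=8888

# On the laptop: forward the remote port over ssh,
# then browse to localhost:8888 locally
ssh -N -L 8888:localhost:8888 user@desktop
```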
At this point, I'm guessing that I need to run my model training code as a script.
- Will this actually prevent the machine from hanging?
- What's the workflow for storing the fitted model and loading it later for prediction? Do people use a Makefile to keep this reproducible?
- Moving to scripts also sacrifices Jupyter Notebook's interactivity -- is there a workflow that preserves it?
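For the storage question, what I have in mind is something like the sketch below, splitting training and prediction into separate steps (joblib is my guess at the tool; data and filenames are placeholders):

```python
# Persist a fitted model at the end of a training script, then reload it
# later in a separate prediction step. Synthetic data stands in for the
# taxi features; "model.joblib" is a placeholder filename.
import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
dump(model, "model.joblib")      # training script ends here

reloaded = load("model.joblib")  # prediction script starts here
preds = reloaded.predict(X[:3])
print(preds.shape)  # (3,)
```

A Makefile rule could then make the prediction step depend on `model.joblib`, which depends on the training script and the raw data.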
My current toolkit is RStudio, Jupyter Notebook, and Emacs, but I'm willing to pick up new tools.