Which cloud platform should I choose to maximize my impact as a data scientist?

I am looking to pick up the knowledge and software skills to move towards becoming an end-to-end deep learning engineer. By this I mean handling the following on my own:

  1. preprocess big data at low latency
  2. design & train deep learning models on massive data
  3. deploy models to serve predictions at massive scale
  4. stream/preprocess incoming data to update models in real time

Which cloud platform would you choose to do this?

  • GCP: Allows me to do the above with the minimum effort (serverless model hosting, model versioning, etc.); however, it ties me to TensorFlow (I'm an MXNet fan). It also looks like I need to pick up Apache Beam for distributed data preprocessing (rough sketch after this list)...
  • AWS: Maximum flexibility but seems far less clean. Appears to be more suited to a team of 5 experts wanting to achieve the above.
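From what I can tell, a minimal Beam preprocessing pipeline looks roughly like the sketch below. The bucket paths and the parse/normalise steps are placeholders (not a real dataset), and the same code should run locally with the DirectRunner or on Dataflow just by changing the pipeline options:

    # Minimal Apache Beam preprocessing sketch (Python SDK). Paths and
    # transforms are placeholders for whatever the real data needs.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_record(line):
        # Each input line is assumed to be one JSON record.
        return json.loads(line)

    def normalise(record):
        # Stand-in feature transform; replace with real preprocessing.
        record["value"] = float(record.get("value", 0.0)) / 255.0
        return record

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.json")
         | "Parse" >> beam.Map(parse_record)
         | "Normalise" >> beam.Map(normalise)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/preprocessed/part"))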

What software would you choose?

  • I'm essentially looking to pick up the minimum number of things that have the maximum impact.
  • I currently spend most of my time using Python + MXNet + EC2 and am comfortable with (2) (see the sketch below).
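For context, the kind of single-machine training I'm comfortable with today looks roughly like this (toy network, random stand-in data, purely illustrative):

    # Toy MXNet/Gluon training loop -- illustrative only, random data.
    import mxnet as mx
    from mxnet import gluon, autograd, nd

    ctx = mx.cpu()  # switch to mx.gpu(0) on a GPU EC2 instance

    net = gluon.nn.Sequential()
    net.add(gluon.nn.Dense(128, activation="relu"),
            gluon.nn.Dense(10))
    net.initialize(mx.init.Xavier(), ctx=ctx)

    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": 1e-3})

    # Random stand-in data; in practice a DataLoader over the real dataset.
    X = nd.random.uniform(shape=(256, 64), ctx=ctx)
    y = nd.random.uniform(0, 10, shape=(256,), ctx=ctx).floor()

    for epoch in range(5):
        with autograd.record():
            loss = loss_fn(net(X), y)
        loss.backward()
        trainer.step(X.shape[0])
        print(epoch, loss.mean().asscalar())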

Ollie

Posted 2018-04-05T16:18:36.850

Reputation: 113

Question was closed 2018-04-06T01:10:59.553

Knowing this stuff is more the domain of the data engineer than the data scientist. In a startup, you may be given the choice of platform, in which case either should be fine. In an enterprise, they will tell you what to use and the question is moot. However, I should note that TensorFlow and Beam/Spark are industry standards. Welcome to the site! – Emre – 2018-04-05T16:50:17.517

Answers

It has been my experience that transitioning from local modeling to large-scale distributed programming is a lot more work than most data scientists realize, and leaves little room for anything BUT data engineering, similar to what @Emre said above.

If you're building the infrastructure yourself (say, Spark) on GCP or AWS VMs, installing and maintaining it is a LOT of work. This is doubly true if you're running a multi-tenant system and/or supporting production jobs. You will constantly be solving the 'why didn't my job run?' or 'why does my job take 14 days to run?' problems.

If you're using the data science infrastructure built into those platforms (Redshift, Athena, Elasticsearch, etc.), you can save some time, but it is still EXTREMELY non-trivial to manage. There is a reason why, year after year, data scientists' favorite tools include Databricks and the like -- managing these things is a pain, and it requires a completely different skill set than actual modeling.

All that being said, I have two suggestions. First, AWS is more mature and has more community answers to help you over hurdles than GCP does. You will run into issues with its IAM system (a necessary evil of responsible big data engineering) and the quirkiness of its various products (Lambda, for instance, will only run scripts that finish in under ~3 minutes; a minimal handler sketch is below), but overall, everything you're trying to do has already been done and documented by someone else. It is the better choice, IMO.
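To make the Lambda point concrete, serving a prediction from a Lambda handler ends up looking roughly like the sketch below. The model here is a dummy stand-in (a real handler would pull the trained artifact from S3 and deserialise it), and in practice you also have to squeeze the model and its dependencies into Lambda's deployment-size and runtime limits:

    # Hypothetical Lambda handler for lightweight model serving.
    # DummyModel is a placeholder; a real handler would download the
    # trained artifact from S3 (e.g. with boto3) and load it instead.
    import json

    _model = None  # cached across warm invocations of the same container

    class DummyModel:
        def predict(self, features):
            # Stand-in for a real model's prediction.
            return sum(features) / max(len(features), 1)

    def _load_model():
        # Placeholder: download and deserialise the real model here.
        return DummyModel()

    def handler(event, context):
        global _model
        if _model is None:
            _model = _load_model()
        features = json.loads(event["body"])["features"]
        prediction = _model.predict(features)
        return {"statusCode": 200,
                "body": json.dumps({"prediction": prediction})}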

However, I'd urge you to take a closer look at a managed platform like Qubole or Databricks (I work for neither). I set up and maintained a Qubole / AWS environment for over a year. I learned a ton about data engineering, architecture, elastic Spark infrastructure, and all the nuances / pitfalls / limitations of distributed computing that the rote documentation doesn't tell you about, and I was still able to maintain a functioning system. My use cases didn't include deep learning, but they could have, with a few knobs turned.

Then, once you have taken your lessons from these environments that work, you can architect, deploy & support your own big data infrastructure to your heart's content (which will now be your full-time job). Hope that helps.

TheProletariat

Posted 2018-04-05T16:18:36.850

Reputation: 151

Interesting. I appreciate the insight around potential pitfalls. Sounds like regardless of tools, handling points 1-4 will never be one person's job. Regarding the choice of cloud platform, I often have the freedom to work with either, since many organizations still use MS SQL Server etc. In this scenario, it is up to me to choose who I want to rent GPUs from & deploy models with. This is where it seems GCP allows me to deploy ML models at scale with far less code/infrastructure than AWS, using Google's global load balancing. Am I wrong? – Ollie – 2018-04-05T18:34:33.697

Yes, I agree that GCP would be easier and faster to get up and running, but I was under the impression that you were interested in becoming a meta-stack unicorn (i.e. someone who knows all the ins and outs of distributed processing, data flows, storage formats, encryption, compression, architecture, streaming, graph & time-series DBs, ML, statistics, AI, etc.). If you just want to run large amounts of data through ML models, I'd definitely look at GCP first (or Nvidia DGX, if money is no object). – TheProletariat – 2018-04-05T18:57:49.147