Why are NLP and Machine Learning communities interested in deep learning?



I hope you can help me, as I have some questions on this topic. I'm new in the field of deep learning, and while I did some tutorials, I can't relate or distinguish concepts from one another.


Posted 2014-10-11T10:24:01.393

Reputation: 430

8Basically because they deliver state-of-the-art results, and eliminate the need for feature engineering. – Emre – 2014-10-20T07:39:09.177

2I would suggest not to use NLP and ML together in such context, as if they were comparable entities. They are IMHO two distinctly separate fields. Despite the fact that NLP uses a lot of ML approaches, methods, algorithms and tools, NLP is an applied field, whereas ML is a (more) generic one. – Aleksandr Blekh – 2014-12-21T10:34:29.063



Why to use deep networks?

Let's first try to solve very simple classification task. Say, you moderate a web forum which is sometimes flooded with spam messages. These messages are easily identifiable - most often they contain specific words like "buy", "porn", etc. and a URL to outer resources. You want to create filter that will alert you about such suspecious messages. It turns to be pretty easy - you get list of features (e.g. list of suspicious words and presence of a URL) and train simple logistic regression (a.k.a. perceptron), i.e. model like:

g(w0 + w1*x1 + w2*x2 + ... + wnxn)

where x1..xn are your features (either presence of specific word or a URL), w0..wn - learned coefficients and g() is a logistic function to make result be between 0 and 1. It's very simple classifier, but for this simple task it may give very good results, creating linear decision boundary. Assuming you used only 2 features, this boundary may look something like this:

linear boundary

Here 2 axes represent features (e.g. number of occurrences of specific word in a message, normalized around zero), red points stay for spam and blue points - for normal messages, while black line shows separation line.

But soon you notice that some good messages contain a lot of occurrences of word "buy", but no URLs, or extended discussion of porn detection, not actually refferring to porn movies. Linear decision boundary simply cannot handle such situations. Instead you need something like this:

non-linear boundary

This new non-linear decision boundary is much more flexible, i.e. it can fit the data much closer. There are many ways to achieve this non-linearity - you can use polynomial features (e.g. x1^2) or their combination (e.g. x1*x2) or project them out to a higher dimension like in kernel methods. But in neural networks it's common to solve it by combining perceptrons or, in other words, by building multilayer perceptron. Non-linearity here comes from logistic function between layers. The more layers, the more sophisticated patterns may be covered by MLP. Single layer (perceptron) can handle simple spam detection, network with 2-3 layers can catch tricky combinations of features, and networks of 5-9 layers, used by large research labs and companies like Google, may model the whole language or detect cats on images.

This is essential reason to have deep architectures - they can model more sophisticated patterns.

Why deep networks are hard to train?

With only one feature and linear decision boundary it's in fact enough to have only 2 training examples - one positive and one negative. With several features and/or non-linear decision boundary you need several orders more examples to cover all possible cases (e.g. you need not only find examples with word1, word2 and word3, but also with all possible their combinations). And in real life you need to deal with hundreds and thousands of features (e.g. words in a language or pixels in an image) and at least several layers to have enough non-linearity. Size of a data set, needed to fully train such networks, easily exceeds 10^30 examples, making it totally impossible to get enough data. In other words, with many features and many layers our decision function becomes too flexible to be able to learn it precisely.

There are, however, ways to learn it approximately. For example, if we were working in probabilistic settings, then instead of learning frequencies of all combinations of all features we could assume that they are independent and learn only individual frequencies, reducing full and unconstrained Bayes classifier to a Naive Bayes and thus requiring much, much less data to learn.

In neural networks there were several attempts to (meaningfully) reduce complexity (flexibility) of decision function. For example, convolutional networks, extensively used in image classification, assume only local connections between nearby pixels and thus try only learn combinations of pixels inside small "windows" (say, 16x16 pixels = 256 input neurons) as opposed to full images (say, 100x100 pixels = 10000 input neurons). Other approaches include feature engineering, i.e. searching for specific, human-discovered descriptors of input data.

Manually discovered features are very promising actually. In natural language processing, for example, it's sometimes helpful to use special dictionaries (like those containing spam-specific words) or catch negation (e.g. "not good"). And in computer vision things like SURF descriptors or Haar-like features are almost irreplaceable.

But the problem with manual feature engineering is that it takes literally years to come up with good descriptors. Moreover, these features are often specific

Unsupervised pretraining

But it turns out that we can obtain good features automatically right from the data using such algorithms as autoencoders and restricted Boltzmann machines. I described them in detail in my other answer, but in short they allow to find repeated patterns in the input data and transform it into higher-level features. For example, given only row pixel values as an input, these algorithms may identify and pass higher whole edges, then from these edges construct figures and so on, until you get really high-level descriptors like variations in faces.

deep learning

After such (unsupervised) pretraining network is usually converted into MLP and used for normal supervised training. Note, that pretraining is done layer-wise. This significantly reduces solution space for learning algorithm (and thus number of training examples needed) as it only needs to learn parameters inside each layer without taking into account other layers.

And beyond...

Unsupervised pretraining have been here for some time now, but recently other algorithms were found to improve learning both - together with pretraining and without it. One notable example of such algorithms is dropout - simple technique, that randomly "drops out" some neurons during training, creatig some distortion and preventing networks of following data too closely. This is still a hot research topic, so I leave this to a reader.


Posted 2014-10-11T10:24:01.393

Reputation: 2 751

3A beautiful explanation this is! – Dawny33 – 2015-11-12T05:26:19.133


First we need to understand why we need Deep learning. To build models ML need Test Data with Labels (supervised or unsupervised). In many domains as the data grows maintaining the data with labels is hard. Deep learning networks doesn't need labeled data. The Deep learning algorithms can figure out the labels. So this obviates the need for domain experts to come out with labels for the data which is very important in the areas of speech recognition, computer vision, and language understanding. Google Cat image recognition is a very interesting experiment. Also it is interesting to know "Geoffrey Hinton" the professor who was hired by Google.

You may get more insight as you explore in this framework.


Posted 2014-10-11T10:24:01.393

Reputation: 71

1Deep learning cannot figure out the labels (how could it be?), but only learn features. You still need some amount of labeled data and supervised learning to solve tasks like classification or regression. – ffriend – 2015-01-16T08:28:22.303


Deep Learning has been around for a long time. CNN,RNN,Boltzmann Machines sure look like new techniques but they were developed long time back. Check the history of deep learning.

The resurgence of Deep Learning is due to the fact that the computational powers increased exponentially from that time. With a laptop equipped with GPUs you can train a complex Deep Learning model is a very small time as compared to its previous days. Deep learning models are also very efficient empirically. Current state of the art in Images,Speech and many fields is Deep Learning models.

I believe due to these factors we can see lot of NLP/ML community has shifted its focus on Deep Learning.


Posted 2014-10-11T10:24:01.393

Reputation: 2 845