What is self-supervised learning in machine learning?



What is self-supervised learning in machine learning? How is it different from supervised learning?


Posted 2019-02-16T20:02:58.273

Reputation: 19 783





The term self-supervised learning (SSL) has been used, sometimes with different meanings, in different contexts and fields, such as representation learning [1], neural networks, robotics [2], natural language processing, and reinforcement learning. In all cases, the basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset).

I will describe what SSL means more specifically in three contexts: representation learning, neural networks and robotics.

Representation learning

The term self-supervised learning has been widely used to refer to techniques that do not use human-annotated datasets to learn (visual) representations of the data (i.e. representation learning).


In [1], two patches are randomly selected and cropped from an unlabelled image and the goal is to predict the relative position of the two patches. Of course, we know the relative position of the two patches once they have been chosen (i.e. we can keep track of their centers), so, in this case, this is the automatically generated supervisory signal. The idea is that, to solve this task (known as a pretext or auxiliary task in the literature [3, 4, 5, 6]), the neural network needs to learn features of the images. These learned representations can then be used to solve the so-called downstream tasks, i.e. the tasks you are actually interested in (e.g. object detection or semantic segmentation).
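Under toy assumptions (a random array standing in for a grayscale image, patches on a fixed grid, hypothetical sizes), the automatic generation of this supervisory signal can be sketched as follows: the label is simply the index of the second patch's position relative to the first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled "image": a 128x128 array (e.g. a grayscale photo).
image = rng.random((128, 128))

PATCH = 32  # patch side length

# The 8 possible positions of the second patch relative to the first,
# as (row_offset, col_offset) in patch-sized steps.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def make_pretext_pair(image, rng):
    """Crop a patch and one of its 8 neighbours; the neighbour's relative
    position (an index in 0..7) is the automatically generated label."""
    label = rng.integers(len(OFFSETS))
    dr, dc = OFFSETS[label]
    # Pick the first patch so that the neighbour stays inside the image.
    r = rng.integers(PATCH, image.shape[0] - 2 * PATCH + 1)
    c = rng.integers(PATCH, image.shape[1] - 2 * PATCH + 1)
    patch_a = image[r:r + PATCH, c:c + PATCH]
    patch_b = image[r + dr * PATCH:r + (dr + 1) * PATCH,
                    c + dc * PATCH:c + (dc + 1) * PATCH]
    return patch_a, patch_b, label

a, b, y = make_pretext_pair(image, rng)
```

A network trained to predict `y` from the pair `(a, b)` has to learn something about the image content; no human labelling is involved at any point.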

So, you first learn representations of the data (by SSL pre-training), then you transfer these learned representations to the task you actually want to solve by fine-tuning the neural network that contains them on a smaller, labeled dataset. In other words, you can use SSL for transfer learning.

This example is similar to the example given in this other answer.

Neural networks

Some neural networks, such as autoencoders (AE) [7], are sometimes called self-supervised learning tools. In fact, you can train AEs without images that have been manually labeled by a human. More concretely, consider a denoising AE, whose goal is to reconstruct the original image when given a noisy version of it. During training, you actually have the original image, given that you have a dataset of uncorrupted images and you just corrupt them with some noise, so you can compute some kind of distance (e.g. the mean squared error) between the original image and the reconstruction, where the original image acts as the supervisory signal. In this sense, AEs are self-supervised learning tools, but it's more common to say that AEs are unsupervised learning tools, so SSL has also been used to refer to unsupervised learning techniques.
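To make the supervisory signal concrete, here is a minimal sketch of a denoising autoencoder in NumPy, under toy assumptions: random vectors stand in for images, the encoder and decoder are linear, and the gradients are written by hand. The key line is that the loss compares the reconstruction of the *corrupted* input against the *clean* original, which plays the role of the label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset of 100 clean "images", flattened to 16-dim vectors.
X_clean = rng.random((100, 16))

# Corruption step: the clean data itself will act as the supervisory signal.
X_noisy = X_clean + 0.1 * rng.standard_normal(X_clean.shape)

# A tiny linear autoencoder: encode 16 -> 8 dims, decode 8 -> 16.
W_enc = 0.1 * rng.standard_normal((16, 8))
W_dec = 0.1 * rng.standard_normal((8, 16))

lr = 0.05
for _ in range(2000):
    Z = X_noisy @ W_enc      # encode the corrupted input
    X_rec = Z @ W_dec        # decode: the reconstruction
    err = X_rec - X_clean    # compare against the *clean* data
    # Hand-written gradients of the mean squared error.
    grad_dec = (Z.T @ err) / len(X_clean)
    grad_enc = (X_noisy.T @ (err @ W_dec.T)) / len(X_clean)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X_noisy @ W_enc @ W_dec - X_clean) ** 2)
```

A real denoising AE would of course use a deep nonlinear network and a framework with automatic differentiation, but the structure of the supervisory signal is the same.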


In [2], the training data is automatically but approximately labeled by finding and exploiting the relations or correlations between inputs coming from different sensor modalities (and this technique is called SSL by the authors). So, as opposed to representation learning or auto-encoders, in this case, an actual labeled dataset is produced automatically.


Consider a robot that is equipped with a proximity sensor (a short-range sensor that detects objects directly in front of the robot) and a camera (a long-range sensor, but one that does not provide a direct way of detecting objects). You can also assume that this robot is capable of performing odometry. An example of such a robot is the Mighty Thymio.

Consider now the task of detecting objects in front of the robot at longer ranges than the proximity sensor allows. In general, we could train a CNN to achieve that. However, to train such a CNN with supervised learning, we would first need a labelled dataset of images (or videos), where the labels could e.g. be "object in the image" or "no object in the image". This dataset would need to be manually labelled by a human, which clearly requires a lot of work.

To overcome this issue, we can use a self-supervised learning approach. In this example, the basic idea is to associate the output of the proximity sensor at a time step $t' > t$ with the output of the camera at the earlier time step $t$.

More specifically, suppose that the robot is initially at coordinates $(x, y)$ (on the plane), at time step $t$. At this point, we still do not have enough info to label the output of the camera (at the same time step $t$). Suppose now that, at time $t'$, the robot is at position $(x', y')$. At time step $t'$, the output of the proximity sensor will e.g. be "object in front of the robot" or "no object in front of the robot". Without loss of generality, suppose that the output of the proximity sensor at $t' > t$ is "no object in front of the robot", then the label associated with the output of the camera (an image frame) at time $t$ will be "no object in front of the robot".
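The labelling scheme can be sketched with toy, hypothetical sensor logs. Assume (for simplicity) that the robot advances one unit per time step and that the camera at time $t$ sees the spot the robot will reach a fixed number of steps later; then the proximity reading at $t'$ becomes the label of the camera frame at $t$.

```python
# Toy sketch of the self-labelling idea, under simplified assumptions:
# the robot moves forward one unit per time step, the camera at time t
# looks LOOKAHEAD units ahead, and the proximity sensor only reports
# what is directly in front of the robot.

LOOKAHEAD = 3  # hypothetical camera range, in time steps of travel

# Hypothetical sensor logs, indexed by time step.
camera_frames = ["frame0", "frame1", "frame2", "frame3", "frame4",
                 "frame5", "frame6", "frame7"]
proximity = [False, False, False, False, True, False, False, True]
# True = "object in front of the robot" at that time step.

def auto_label(camera_frames, proximity, lookahead):
    """Label the camera frame at time t with the proximity reading
    at t' = t + lookahead: no human annotation involved."""
    dataset = []
    for t in range(len(camera_frames) - lookahead):
        label = proximity[t + lookahead]  # supervisory signal from t'
        dataset.append((camera_frames[t], label))
    return dataset

labelled = auto_label(camera_frames, proximity, LOOKAHEAD)
```

For example, `labelled[1]` pairs `"frame1"` with `True`: the object the proximity sensor detects at $t' = 4$ was already visible to the camera at $t = 1$. The real system uses odometry to decide which frame corresponds to which later position, but the principle is the same.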




Self-supervised learning is when you use some parts of the samples as labels for a task that requires a good degree of comprehension to be solved. I'll emphasize two key points before giving an example:

  • Labels are extracted from the sample itself, so they can be generated automatically by some very simple algorithm (maybe just random selection).

  • The task requires understanding. This means that, in order to predict the output, the model has to extract good patterns from the data, generating a good representation in the process.

A very common use case for self-supervised learning arises in natural language processing, when you need to solve a task but have little labeled data. In such cases, you need to learn a good representation or language model, so you take sentences and give your network self-supervision tasks like these:

  • Ask the network to predict the next word in a sentence (which you know because you took it away).

  • Mask a word and ask the network to predict which word goes there (which you know because you had to mask it).

  • Replace a word with a random one (which probably doesn't fit the sentence) and ask the network to find the word that is wrong.

As you can see, these tasks are fairly simple to formulate and the labels are part of the same sample, but they require a certain understanding of the context to be solved.

And it's always like this: alter your data in some way, generating the label in the process, and ask the model something related to that transformation. If the task requires enough understanding of the data, you'll have success.
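The masked-word task above can be sketched in a few lines. The sentence and the `[MASK]` placeholder are illustrative, but the key point is visible: the label comes from the sample itself.

```python
import random

random.seed(0)

# Hypothetical unlabeled sentence, split into tokens.
sentence = "the robot detects objects with its camera".split()

def make_masked_example(tokens, rng=random):
    """Pick one token, replace it with a [MASK] placeholder, and use
    the hidden token itself as the label: no annotation needed."""
    i = rng.randrange(len(tokens))
    label = tokens[i]
    inputs = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    return inputs, label, i

inputs, label, i = make_masked_example(sentence)
```

To predict `label` from `inputs`, a model has to exploit the surrounding context, which is exactly the "understanding" requirement above.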



Isn't what you describe "distant learning"? – Make42 – 2020-06-25T12:04:59.580

I haven't heard of that concept. Could you elaborate on that? – David – 2020-07-03T08:31:26.583

Sure, I described it in https://ai.stackexchange.com/questions/22186/what-is-the-difference-between-distant-supervision-and-self-supervision: Distant supervision is a type of weak supervision (there is uncertainty in the labeling) that uses an auxiliary automatic mechanism to produce weak labels / reference output (in contrast to non-expert human labelers) from data. Possibly (I am not clear on that point yet), the data which is used by this mechanism, cannot be from the input data of the main model.

– Make42 – 2020-07-03T09:25:07.250

I see. I'd say the main difference is on the target task. Distant supervision keeps focusing on the same original task, while self-supervision focuses on a surrogate task as a proxy for learning a good representation. I will elaborate this as an answer in that question. – David – 2020-07-04T11:24:52.793

But, in that other answer you wrote that it might be that "the task casually matches the target task". Then there would be no surrogate task anymore and if that is still self-supervision, then having a surrogate task is not a requirement for a learning to be self-supervised. Aren't you contradicting yourself...? – Make42 – 2020-07-04T12:06:57.470

It might be (very corner case) but it's usually not. It is not a requirement of self-supervision to work on a surrogate task, but to extract all labels from input data. Doing so, it pretty much always results in a surrogate task. – David – 2020-07-04T12:11:21.033


In visual recognition, self-supervised learning is often applied to representation learning. Here we first learn features on unlabeled data (representation learning), and then train the real model on features extracted from the labeled data. This makes particular sense when we have a lot of unlabeled data and little labeled data.

The features can be learned by solving so-called pretext tasks. Examples of pretext tasks are predicting the rotation of a jittered image, recognizing jittered instances of the same image, or predicting the spatial relationship of image patches.
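The rotation pretext task, for instance, can be sketched as follows (toy random arrays stand in for images; a real pipeline would feed the rotated images and labels to a classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled batch: 4 tiny 8x8 "images".
images = rng.random((4, 8, 8))

def make_rotation_batch(images, rng):
    """Rotate each image by a random multiple of 90 degrees; the
    multiple (0..3) is the automatically generated label."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels

rotated, labels = make_rotation_batch(images, rng)
```

Predicting the rotation forces the network to recognize object orientation, which in turn requires features that transfer to the downstream task.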

A nice overview and interesting results can be found in this recent paper.



Do you call the entire setup "self-supervised learning" or only the representation learning part? What is the difference between the terms "self-supervised learning" and "representation learning" then? – Make42 – 2020-06-25T12:08:31.247

Self-supervised learning is a way to achieve representation learning. Some other ways to achieve the same goal are supervised learning and unsupervised learning. – ssegvic – 2020-06-26T13:05:05.633

Ok, I understand you. However, this is in contrast to the answer of @nbro who does not mention the solving of an auxiliary task, implying that an auxiliary task is not required for calling something "self-supervised". See my question https://ai.stackexchange.com/questions/22184/does-self-supervised-learning-require-auxiliary-tasks

– Make42 – 2020-06-26T13:13:24.420

@Make42 I've updated my answer to explicitly mention the concept of pretext or auxiliary task. In the specific context of the robotics paper, which my answer is particularly based on, the concept of pretext or auxilary task is not made explicit, but these terms are used in other contexts, so it's important to emphasize them. – nbro – 2020-06-26T18:32:59.300