The short answer is yes. Have a look at what Google has done in this area:
Google Cloud Video Intelligence makes videos searchable, and discoverable, by extracting metadata with an easy to use REST API. You can now search every moment of every video file in your catalog. It quickly annotates videos stored in Google Cloud Storage, and helps you identify key entities (nouns) within your video; and when they occur within the video.
So Google does recognize many kinds of content in a video: it classifies what appears in the video into tags (labels) and reports when each one occurs.
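To make the "tags with timestamps" idea concrete, here is a minimal sketch of what a label-detection call to the Video Intelligence REST API looks like and how the returned labels could be flattened into a list. It only builds the request and does not send it. The bucket path, the access token, and the sample response are made up for illustration; the sample is hand-written in the documented response shape and is not real API output. The real call also returns a long-running operation that you have to poll for the result, which is left out here.

```python
import json
import urllib.request

ANNOTATE_URL = "https://videointelligence.googleapis.com/v1/videos:annotate"


def build_annotate_request(gcs_uri, access_token):
    """Build (but do not send) a label-detection request for one video."""
    body = {"inputUri": gcs_uri, "features": ["LABEL_DETECTION"]}
    return urllib.request.Request(
        ANNOTATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def labels_with_times(response):
    """Flatten segment-level labels into (entity, start, end, confidence)."""
    rows = []
    for result in response.get("annotationResults", []):
        for annotation in result.get("segmentLabelAnnotations", []):
            entity = annotation["entity"]["description"]
            for seg in annotation.get("segments", []):
                span = seg["segment"]
                rows.append((
                    entity,
                    span.get("startTimeOffset"),
                    span.get("endTimeOffset"),
                    seg.get("confidence"),
                ))
    return rows


# Hypothetical, hand-written sample in the documented response shape.
SAMPLE_RESPONSE = {
    "annotationResults": [{
        "segmentLabelAnnotations": [{
            "entity": {"description": "cat"},
            "segments": [{
                "segment": {"startTimeOffset": "0s", "endTimeOffset": "14.8s"},
                "confidence": 0.98,
            }],
        }],
    }],
}

if __name__ == "__main__":
    # Placeholder bucket and token; nothing is sent over the network.
    req = build_annotate_request("gs://my-bucket/cat.mp4", "ACCESS_TOKEN")
    print(req.get_method(), req.full_url)
    for row in labels_with_times(SAMPLE_RESPONSE):
        print(row)
```

Each row is one entity plus the time span where it was detected, which is exactly the kind of metadata that makes a video catalog searchable.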
What about Humanoid Robot Sophia?
Cameras within Sophia's eyes combined with computer algorithms allow her to see. She can follow faces, sustain eye contact, and recognize individuals. She is able to process speech and have conversations using a natural language subsystem.
Both point in the direction of understanding (Google) and producing (Sophia) language from sounds and images. Machines are still not ready to learn to think by themselves, though. If you look into these two cases more closely, you will see that they are still quite mechanical and depend on a lot of manual preparation by humans.
It is said that machines are now at the stage of a toddler who can ask for the names of the things around her and name them. Give it a few more years and the abilities may be more advanced ;)
You asked about unsupervised learning. There is a video of a talk by an MIT researcher who ran experiments on text and images. In his closing remarks he noted that it would be nice to do the same with videos, with the same reasoning you had: to learn a language. He promised to keep that in mind with his colleagues; maybe some of them are already working on it.
An interesting research paper on the topic was at this link:
We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. [..] Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.