You can also think of a convolutional neural network (CNN) as an encoder, i.e. a neural network that learns a smaller representation of the input, which then acts as the feature vector (input) to a fully connected network (or another neural network). In fact, there are CNNs that can be thought of as auto-encoders (i.e. an encoder followed by a decoder): for example, the U-Net can be thought of as an auto-encoder.
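To make the encoder view concrete, here is a minimal sketch in NumPy: a convolution followed by a ReLU and a flatten produces a feature vector, which a fully connected layer then consumes. All shapes and the random kernels are hypothetical; a real CNN would learn the kernels and the head jointly by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernels):
    """Valid 2D convolution of an (H, W) image with (K, kh, kw) kernels."""
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

def encode(image, kernels):
    """Encoder: convolution + ReLU + flatten -> feature vector."""
    features = np.maximum(conv2d_valid(image, kernels), 0.0)
    return features.reshape(-1)

# Hypothetical sizes: an 8x8 input and 4 random 3x3 kernels.
image = rng.standard_normal((8, 8))
kernels = rng.standard_normal((4, 3, 3))
z = encode(image, kernels)      # feature vector of length 4 * 6 * 6 = 144

# Fully connected head acting on the learned representation.
W = rng.standard_normal((2, z.size))
logits = W @ z                  # scores for 2 classes
print(z.shape, logits.shape)    # (144,) (2,)
```

The point is only the data flow: the convolutional part compresses the image into `z`, and everything after the flatten is an ordinary fully connected network.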
Although it is (almost) never the case that you transform the input into an extremely small feature vector (e.g. a single number), even one floating-point number can encode a lot of information. For example, if you want to classify the object in the image into one of two classes (assuming there is only one main object in the image), then a single floating-point number is more than sufficient (in fact, you only need one bit to encode that information).
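As a small illustration of that claim, one floating-point score is enough to decide between two classes by thresholding its sigmoid. The threshold of 0.5 is the usual convention, not something the text prescribes:

```python
import math

def predict_class(score):
    """Map a single floating-point score to one of two classes."""
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid: probability of class 1
    return 1 if p >= 0.5 else 0

print(predict_class(2.3))   # → 1
print(predict_class(-0.7))  # → 0
```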
This smaller representation (the feature vector) that is then fed to a fully connected network is learned from your training data. In fact, CNNs are known as data-driven feature extractors.
I am not aware of any theoretical guarantee that the learned representation is the best suited for your task (you would probably need to look into learning theory to know more about this). In practice, the quality of the learned feature vector will mainly depend on your available data and on the inductive bias (i.e. the assumptions that you make, which are also affected by the specific neural network architecture that you choose).