FaceNet uses a triplet loss to train a model that outputs embeddings (128-D in the paper) such that two faces of the same identity have a small Euclidean distance, while two faces of different identities are farther apart by at least a specified margin. However, it needs a separate mechanism (e.g. a HOG-based detector or MTCNN) to detect and extract faces from images in the first place.
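For reference, here is a minimal sketch of the triplet loss for a single (anchor, positive, negative) triplet, as I understand it. Plain Python lists stand in for embeddings. The use of squared distances and the default margin of 0.2 are my assumptions from reading the paper, and the function names are only illustrative.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet of embeddings.

    The loss drops to zero once the negative is farther from the anchor
    than the positive by at least `margin` (measured in squared distance).
    The margin value is an assumption, not something fixed by the method.
    """
    d_ap = squared_l2(anchor, positive)  # same identity: should be small
    d_an = squared_l2(anchor, negative)  # different identity: should be large
    return max(d_ap - d_an + margin, 0.0)
```

With toy 2-D embeddings, a well-separated triplet such as `triplet_loss([1, 0], [1, 0], [0, 1])` gives zero loss, while swapping the positive and negative gives a positive loss.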
Can this idea be extended to object recognition? That is, can an object detection framework such as Mask R-CNN be used to extract an object's bounding box, crop the object, feed the crop to a network trained with triplet loss, and then compare the resulting embeddings to decide whether two crops show the same object?
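To make the question concrete, this is roughly the pipeline I have in mind, written as a hedged sketch rather than a working system. `detect` and `embed` are hypothetical callables standing in for a detector (e.g. Mask R-CNN) and a triplet-trained embedding network; neither is a real library API, and the distance `threshold` would have to be tuned on validation data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def crop(image, box):
    """Crop a row-major image (list of rows) to box = (x1, y1, x2, y2).

    The box is end-exclusive, like Python slices.
    """
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def same_object(image_a, image_b, detect, embed, threshold):
    """Return True if any detected object in image_a matches one in image_b.

    detect(image) -> list of boxes   (hypothetical detector)
    embed(crop)   -> list of floats  (hypothetical embedding network)
    Two crops count as the same object when their embeddings are closer
    than `threshold`.
    """
    embs_a = [embed(crop(image_a, box)) for box in detect(image_a)]
    embs_b = [embed(crop(image_b, box)) for box in detect(image_b)]
    return any(euclidean(ea, eb) < threshold for ea in embs_a for eb in embs_b)
```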
Has any research been done on this, and are there any publicly available datasets for it?