Your problem can be solved with either Word2vec or Doc2vec. Doc2vec usually gives better results because it learns a vector for each sentence (document) as a whole, not just for individual words, while training the model.
You can train your doc2vec model following this link. You may want to perform some pre-processing steps first, such as removing stop words (words like "the", "an", etc. that add little meaning to a sentence). Once you have trained your model, you can find similar sentences using the following code.
import gensim

model = gensim.models.Doc2Vec.load('saved_doc2vec_model')
new_sentence = "I opened a new mailbox".split(" ")
# infer a vector for the unseen sentence and rank the training documents
# by cosine similarity (use model.docvecs.most_similar on gensim < 4)
model.dv.most_similar(positive=[model.infer_vector(new_sentence)], topn=5)
The above call returns a list of
(label, cosine_similarity_score) tuples. You can map each label back to its original sentence through the tags you assigned to the documents during training.
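For illustration, here is a hypothetical mapping from results back to sentences. The `results` list and the tag scheme (each document tagged with its corpus index as a string) are assumptions for the sketch, not output from a real model:

```python
# hypothetical similarity output: (label, cosine_similarity_score) tuples
train_sentences = ["I opened a new mailbox", "the mailbox was empty"]
results = [('0', 0.91), ('1', 0.67)]

# map each label back to the sentence it was tagged with
similar_sentences = [(train_sentences[int(label)], score)
                     for label, score in results]
print(similar_sentences)
# → [('I opened a new mailbox', 0.91), ('the mailbox was empty', 0.67)]
```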
Please note that this approach will only give good results if your doc2vec model contains embeddings for the words found in the new sentence. If you ask for similarities to a gibberish sentence like
sdsf sdf f sdf sdfsdffg, it will still return results, but those may not be genuinely similar sentences, because the trained model has most likely never seen these gibberish words during training. So try to train your model on as many sentences as possible to cover as large a vocabulary as you can for better results.
If you are using word2vec instead, you need to compute the average vector over all words in each sentence, then compare sentences using cosine similarity between those average vectors.
import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    # function to average the vectors of all in-vocabulary words in a sentence
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv[word])
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
from sklearn.metrics.pairwise import cosine_similarity

# vocabulary of the trained model, used to skip out-of-vocabulary words
# (use word2vec_model.wv.index2word on gensim < 4)
index2word_set = set(word2vec_model.wv.index_to_key)

#get average vector for sentence 1
sentence_1 = "this is sentence number one"
sentence_1_avg_vector = avg_sentence_vector(sentence_1.split(), model=word2vec_model, num_features=100, index2word_set=index2word_set)

#get average vector for sentence 2
sentence_2 = "this is sentence number two"
sentence_2_avg_vector = avg_sentence_vector(sentence_2.split(), model=word2vec_model, num_features=100, index2word_set=index2word_set)

# cosine_similarity expects 2-D arrays, so reshape the 1-D sentence vectors
sen1_sen2_similarity = cosine_similarity(sentence_1_avg_vector.reshape(1, -1), sentence_2_avg_vector.reshape(1, -1))[0][0]