Assuming you want to create a new speech signal that sounds natural, there are multiple levels of "sound" that you have to consider:
- base frequency (pitch of your voice)
- overtone frequencies (sound of your voice)
- volume of your voice
- tempo of speech
- transitions of phonemes
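The base frequency is the easiest of these levels to quantify. As a rough illustration, here is a hedged sketch of pitch estimation by autocorrelation: the signal repeats itself once per pitch period, so the lag with the strongest self-correlation reveals the base frequency. The function name and parameters are my own for illustration, not from any particular library, and real pitch trackers are considerably more robust.

```python
import math

def estimate_f0(signal, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate the fundamental (base) frequency of a voiced frame.

    Searches lags corresponding to plausible human pitches and picks
    the lag where the signal correlates most strongly with a shifted
    copy of itself; that lag is one pitch period."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(signal[i] * signal[i - lag]
                   for i in range(lag, len(signal)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag

# Demo: a 120 Hz tone with one overtone (the overtone shapes the
# "sound of the voice" but does not change the detected pitch).
sr = 16000
tone = [math.sin(2 * math.pi * 120 * t / sr)
        + 0.5 * math.sin(2 * math.pi * 240 * t / sr)
        for t in range(2048)]
print(estimate_f0(tone, sr))  # close to 120 Hz
```

Note that the overtone at 240 Hz does not fool the estimator: the autocorrelation peaks at the full pitch period, which is exactly why base frequency and overtone structure can be treated as separate levels.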
The last point itself comprises very different aspects and is the most difficult to deal with. You cannot achieve a natural-sounding result by just splicing together single phonemes if they do not fit each other in their original recordings. This is why many speech synthesis algorithms use diphone synthesis instead: they record and concatenate phoneme pairs, so each cut falls in the stable middle of a phoneme rather than at the fragile transition between two.
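To see why the joins matter, here is a minimal sketch of the crossfade step used when concatenating recorded segments. It assumes segments are plain lists of samples; a hard butt-splice produces an audible click at the boundary, so the end of one segment is faded out while the start of the next fades in. The function name and linear fade are my own simplification of what concatenative synthesizers actually do.

```python
def splice_with_crossfade(a, b, fade_len):
    """Join two recorded segments by linearly crossfading the last
    `fade_len` samples of `a` into the first `fade_len` samples of
    `b`, avoiding the discontinuity a hard splice would create."""
    assert fade_len <= len(a) and fade_len <= len(b)
    out = list(a[:-fade_len])
    for i in range(fade_len):
        w = i / fade_len  # weight ramps from a toward b
        out.append((1 - w) * a[len(a) - fade_len + i] + w * b[i])
    out.extend(b[fade_len:])
    return out

# Demo: two flat segments joined over a 10-sample fade; the result
# steps down smoothly instead of jumping from 1.0 to 0.0.
joined = splice_with_crossfade([1.0] * 100, [0.0] * 100, 10)
print(len(joined))        # 190 samples: overlap region is shared
print(joined[88:102])
```

A crossfade only hides the amplitude discontinuity, though. If the two segments disagree in pitch or overtone structure at the joint, the transition still sounds wrong, which is exactly the problem diphone inventories are designed to minimize.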
Adobe's VoCo is state of the art in manipulating speech signals. It is a neural-network-based, self-learning algorithm capable of creating completely new sentences once it has been trained on a large set of sentences from a speaker (about 20 minutes of speech is needed). As far as I know, VoCo has not been released to the public because of its massive potential for criminal misuse.