I can speak from a more theoretical point of view, but honestly I haven't had much success with VAEs.
1) How deep should my encoder and decoder network be? Are there any good guidelines?
That depends entirely on your dataset. If your data are highly nonlinear, a deeper network should do well: the successive nonlinearities let the network capture higher-order correlations in the input, which might be good for your situation.
Do you have any idea of the number of truly independent dimensions in your dataset? Have you tried something like PCA (essentially a linear autoencoder)? That might tell you where to aim the dimensionality of your bottleneck layer, and from there help you choose the sizes of the layers leading down to it.
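If you want a quick way to run that check, here's a minimal sketch; the function name, the 95% variance cutoff, and the synthetic data are all mine, and I'm using a plain NumPy SVD rather than a PCA library:

```python
import numpy as np

def pca_intrinsic_dim(X, var_threshold=0.95):
    """Estimate intrinsic dimensionality: the number of principal
    components needed to explain `var_threshold` of the variance."""
    Xc = X - X.mean(axis=0)                       # center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    var_ratio = s**2 / np.sum(s**2)               # per-component variance share
    cum = np.cumsum(var_ratio)
    return int(np.searchsorted(cum, var_threshold) + 1)

# Synthetic example: 100-d observations generated from 3 latent factors
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))                     # true latent codes
W = rng.normal(size=(3, 100))                     # random linear "decoder"
X = Z @ W + 0.01 * rng.normal(size=(500, 100))    # small observation noise
print(pca_intrinsic_dim(X))                       # recovers 3 for this data
```

Whatever number comes out of your real data is only a linear lower bound, but it's a sensible starting point for the bottleneck size.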
2) Should one use fully connected dense networks, or stacked Conv1D nets?
I'd say try both.
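One concrete reason to try Conv1D on a long series is weight sharing: a conv layer's parameter count doesn't grow with the sequence length, while a dense layer's does. A rough comparison with made-up layer sizes (the `conv1d` helper below is a naive valid-mode convolution I wrote purely for illustration):

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Naive valid-mode 1-D convolution: x has length L, kernels is (F, K)."""
    L, (F, K) = len(x), kernels.shape
    out = np.empty((F, L - K + 1))
    for f in range(F):
        for i in range(L - K + 1):
            out[f, i] = x[i:i + K] @ kernels[f] + bias[f]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=1000)               # a length-1000 series

# Dense layer to 64 hidden units: one weight per (input, unit) pair
dense_params = 1000 * 64 + 64           # 64,064 parameters

# Conv1D layer, 16 filters of width 5: weights shared across all time steps
kernels, bias = rng.normal(size=(16, 5)), np.zeros(16)
conv_params = kernels.size + bias.size  # 96 parameters
h = conv1d(x, kernels, bias)            # feature maps of shape (16, 996)
```

Fewer parameters isn't automatically better, but on long time series the convolutional option is usually much easier to train, so it's worth including in the comparison.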
3) What activation functions are good choices?
Hugo Larochelle once said that you should always start with ReLUs. See if you get a good enough result with them, as they aren't as prone to the problem of exploding/vanishing gradients, which is something you're going to face with time series data. Remember to initialize them with small positive values, though, to avoid dead neurons.
[EDIT: actually, vanishing and exploding gradients can still be an issue with ReLUs, it's just mitigated a bit. Also, initialize the bias with small positive values. Andrej Karpathy said so in a lecture]
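Here's a small sketch of what that initialization looks like in practice. The He-style weight scale and the 0.01 bias are conventional choices, not anything specific to your problem:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, units = 256, 512

# He initialization for the weights (suits ReLU), small positive bias
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, units))
b = np.full(units, 0.01)                # small positive bias, per the EDIT above

x = rng.normal(size=(100, fan_in))
h = np.maximum(0.0, x @ W + b)          # ReLU activation

# Roughly half the units fire on Gaussian input; none are dead at init
active = (h > 0).mean()                 # ≈ 0.5
```

If `active` is near 0 at initialization (or collapses toward 0 during training), that's the dead-neuron problem the bias trick is meant to mitigate.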
4) Can we say anything about the 'best' dimensionality of the latent dimension?
In some contexts, yes. The 'best' dimensionality is the one that achieves the greatest lossless compression: the one that compresses your input data the most without losing information when you reconstruct it. Good luck finding that optimum, though.
5) Like everyone else I imagine, my loss function is the sum of a reconstruction term, and the KL divergence regularization term. Is there something else one should consider?
You could incorporate various types of regularization that will modify your loss function. You could use dropout, for example, or L1 or L2 regularization.
EDIT: you could also consider an attention-based model, for example see this discussion.
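Going back to the loss itself, here's a sketch of the two standard terms with an optional L2 penalty bolted on. The MSE reconstruction term assumes roughly Gaussian outputs, so swap it for Bernoulli cross-entropy if your data are binary; the function name and shapes are mine:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, l2_weights=None, l2_lambda=0.0):
    """Per-batch VAE loss: MSE reconstruction plus the closed-form KL
    divergence between N(mu, sigma^2) and the standard normal prior,
    with an optional L2 weight penalty as one extra regularizer."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    l2 = l2_lambda * sum(np.sum(w**2) for w in (l2_weights or []))
    return recon + kl + l2

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 10))
x_hat = x + 0.1 * rng.normal(size=(8, 10))
mu, log_var = np.zeros((8, 2)), np.zeros((8, 2))  # posterior matching the prior
loss = vae_loss(x, x_hat, mu, log_var)
# The KL term is exactly 0 when mu = 0 and log_var = 0, which is a handy
# sanity check for your own implementation.
```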
6) Batch normalization? Currently, on my problem, it doesn't make a blind bit of difference. But it should. Should one always use batch-norm?
'Always' is a strong term. I would say not always, but batch normalization ought to be a good choice. Have another look at your implementation before you give up on it.
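For reference, here's what training-mode batch norm actually computes, in case it helps you sanity-check your implementation. I've left out the running statistics you'd track for inference:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over the batch axis."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta              # learned re-scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 16))    # shifted, scaled activations
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# Each feature comes out re-centred at 0 with roughly unit variance.
```

If your own layer's output doesn't look like that on a random batch, that's where I'd start debugging before concluding it makes no difference.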
7) In the decoder layer, is it better to up-sample before returning to the original input dimension?
That's the default presentation: an autoencoder is usually drawn as a symmetric construct, from the input down to the bottleneck and back up to the reconstruction. I would say experiment with breaking that symmetry, but I don't know whether one is consistently better than the other.