## What is the proper way to reduce audio sample rate and bit depth?

When presented with a lossless audio file, what is the proper way to reduce sample rate and bit depth?

When reducing bit depth, I understand that I should apply dither, but I don't know what scale or method to use. (Actually, I read that I should use either a "triangular" or "shaping" dither, but I still don't know which of those to use. Does it come down to personal preference?)

When reducing sample rate, I have no idea if additional processing is required. (Evidently sample rate is far more complicated than bit depth.) At the moment, my conversion process looks like this:

ffmpeg -i <input.file> -compression_level 12 [-ar 44100 -sample_fmt s16] <output.flac>


I'd prefer general answers to application-specific ones, but I'll take what I can get. Thanks in advance!

I'm a consumer, not a producer. No further processing will be applied to these files.

Edit 0: Real world example: I have a 96/24 PCM (WAV) file. I want it to be 44.1/16.

Edit 1: Further research yielded this fantastic comparison of sample rate converters (with graphs)!

I came across this video on another question and found it very informative and easy to follow. It's basically about A/D and D/A conversion, but there's some good information about dither and its effects.

– None – 2013-07-01T19:37:32.313

– None – 2013-07-02T22:30:33.983

SoX (standalone) is what I ultimately used. The latter link was included in my question. – None – 2013-07-03T02:07:29.700

Let's take these one at a time, in the correct order:

1. You are starting with a lossless file. This is either PCM or losslessly compressed (you didn't specify). If it is compressed, it must first be decoded to PCM before anything meaningful can be done with it. Any application that claims it can read the format usually does this automatically, but you should be aware that it's taking place. Since further work is to be done on this audio, the samples should be converted to floating-point format.

2. Next comes the sample rate conversion. As you correctly said, SR conversion is complex. A naive approach to SR conversion will produce "aliasing" distortion. For example, taking every other sample from a 44100 Hz file will give you a 22050 Hz file, but any content above 11025 Hz (half the new rate), which can't be correctly represented at 22050 Hz, will not be eliminated. Instead, it will be incorrectly represented as lower frequencies. Depending on how much of it there was, the incorrect representation can sound extremely bad. The solution is to filter that content out first, so SR conversion consists of filtering followed by resampling. (There is a trick that allows you to do both in one step, but conceptually it's two steps, in that order.) Filtering is usually done automatically by SR conversion software, but you should be aware of it because it makes the biggest difference to quality. Don't skimp on filter quality if you care.

3. Finally, you reduce the bit depth. As with SR conversion, a naive approach to bit-depth reduction results in a kind of distortion, but this kind of distortion is generally considered less obnoxious. The solution to this distortion is to add "dither", which is a small amount of noise. You must add the dither before the bit-depth reduction. Adding dither after the reduction will have no effect other than adding noise. In most software, dithering is an option (if it's available at all) and is not performed automatically. Which dither you use, and whether you dither at all, is to a large extent a matter of preference and of the source material. There are major mastering studios that don't bother dithering anymore, even though, mathematically speaking, it's the only way to eliminate the distortion.[1] As to triangular vs "shaped" dither[2], the rule of thumb is this: shaped dither is used for the last step, triangular dither is used for other steps. So, in your case, you should probably use a shaped dither. However, you might prefer standard triangular dither as your last step if a shaped dither has already been applied. You have to let your ears decide that one, but hopefully the difference will be so subtle that you won't really notice it anyway.
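The filter-then-resample order from step 2 can be sketched in plain Python. This is only a minimal illustration, not a production resampler: it assumes a simple 2:1 reduction (44100 Hz to 22050 Hz), a windowed-sinc low-pass filter with a Hann window, and function names and parameters (`lowpass_taps`, `cutoff`, `num_taps`) invented for this example. Real converters such as SoX use much better filters and handle non-integer ratios like 96000 to 44100.

```python
import math

def sine(freq_hz, rate_hz, n):
    """Generate n samples of a unit-amplitude sine as floats."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def lowpass_taps(cutoff, num_taps=101):
    """Windowed-sinc low-pass FIR; cutoff is a fraction of the sample rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * i / (num_taps - 1))  # Hann window
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC

def fir_filter(samples, taps):
    """Direct convolution; output has the same length as the input."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]
        out.append(acc)
    return out

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

rate = 44100
tone = sine(15000, rate, 4000)   # 15 kHz: above the new Nyquist limit of 11025 Hz

naive = tone[::2]                # drop every other sample, no filtering
taps = lowpass_taps(cutoff=0.2)  # 0.2 * 44100 = 8820 Hz, an arbitrary choice for this demo
filtered = fir_filter(tone, taps)[::2]  # filter first, then drop every other sample

# Skip the filter's start-up transient before measuring levels.
naive_level = rms(naive[200:])
filtered_level = rms(filtered[200:])
```

With these assumptions, `naive_level` stays at roughly 0.7 because the 15 kHz tone survives as a 7050 Hz alias (22050 − 15000), while `filtered_level` drops close to zero because the tone was removed before the samples were discarded.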

[1] They don't bother because the distortion that dither is designed to protect against is effectively eliminated by the lack of dynamic range in the music they are producing. If the dynamic range of the music (modern pop has a dynamic range of about 6 to 10 dB) is very small compared to the dynamic range of the medium (CDs have a dynamic range of about 96 dB), you don't really need dither.
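The 96 dB figure follows from the usual rule of thumb for linear PCM, in which each bit contributes about 6 dB. As a rough worked calculation for a bit depth of $N$ (ignoring the small correction terms that depend on the test signal and on dither):

$$
\text{DR} \approx 20\log_{10}\!\left(2^{N}\right) \approx 6.02\,N\ \text{dB}, \qquad N = 16 \Rightarrow \approx 96.3\ \text{dB}, \qquad N = 24 \Rightarrow \approx 144.5\ \text{dB}
$$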

[2] The difference between the two is that "triangular" dither is white noise (the name refers to the triangular shape of its amplitude distribution), whereas "shaped" dither attempts to be less audible by shifting the noise energy toward frequencies where our hearing is less sensitive. This works remarkably well, but if you do it over and over again it might result in some high-frequency build-up, especially if you are using a really aggressively shaped dither like POW-R.
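The dither-then-quantize order from step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: samples are floats in [-1.0, 1.0), the target is 16-bit signed PCM, and triangular (TPDF) dither is generated as the sum of two independent uniform random values, each spanning one quantization step. The function name `to_int16` and its parameters are invented for this example, and noise shaping is not shown.

```python
import math
import random

def to_int16(samples, dither=True, seed=0):
    """Quantize floats in [-1.0, 1.0) to 16-bit integers, optionally with TPDF dither."""
    rng = random.Random(seed)   # fixed seed only so the example is repeatable
    scale = 32768               # 2**15: one LSB equals 1/32768 of full scale
    out = []
    for s in samples:
        v = s * scale           # sample value in units of LSBs
        if dither:
            # The sum of two uniform values in [-0.5, 0.5) has a triangular
            # distribution spanning [-1, 1) LSB. It is added BEFORE rounding.
            v += (rng.random() - 0.5) + (rng.random() - 0.5)
        q = math.floor(v + 0.5)                 # round to the nearest integer
        out.append(max(-32768, min(32767, q)))  # clip to the int16 range
    return out

# A very quiet 1 kHz tone, only 0.4 LSB in amplitude, sampled at 44100 Hz.
reference = [math.sin(2 * math.pi * 1000 * i / 44100) for i in range(44100)]
quiet = [0.4 / 32768 * r for r in reference]

plain = to_int16(quiet, dither=False)
dithered = to_int16(quiet, dither=True)

# Correlate each result with the original tone to see whether it survived.
plain_corr = sum(q * r for q, r in zip(plain, reference))
dithered_corr = sum(q * r for q, r in zip(dithered, reference))
```

Without dither, every sample rounds to 0 and the tone is lost entirely (`plain_corr` is 0). With dither, the output toggles between -1, 0 and 1, and the tone survives underneath the added noise (`dithered_corr` is clearly positive).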

• I generally only deal with WAV (PCM) and FLAC (which ultimately decodes to PCM, yes).
• I'll look into FFmpeg's default algorithm, or maybe recompile with SoX enabled, as its algorithm seems well regarded. Hydrogenaudio's resampling page also appears to have numerous references.
• I seem to recall reading that rectangular dither should be used during intermediate steps (vs. triangular), but I can't seem to relocate that reference. But that's a moot point for my situation. ;)
• – None – 2013-07-01T21:31:47.707