Taming sox

Bart Massey and Claude Code

Source: https://github.com/pdx-cs-sound/taming-sox.

Licensed under CC-BY 4.0.


Introduction

Sox is a command-line audio Swiss Army knife: it converts formats, applies DSP effects, mixes files, generates tones, and slots cleanly into shell pipelines. Its CLI is genuinely strange — effects come after the output filename, format flags are positional, and a typo can silently mean something completely different. This tutorial introduces those quirks in an order that makes them feel inevitable rather than arbitrary.

What this book covers. Sox has many more effects than any one tutorial can reasonably teach. The approach here: show the arguments that teach concepts — the q/t/h/l shapes in fade, the Q/Hz/octaves units in equalizer, the transfer-function syntax in compand. For arguments that just tune existing behavior, the sox man page is the right reference. When an effect is mentioned only in passing, that’s why.

Sox has real limits — they’re collected at the end of the last chapter if you want to check whether it fits your problem before investing in learning it.

Sample audio files are provided; test tones are generated as we go.

sox and sox_ng

Original sox stopped releasing in 2015. In 2024 a community fork, sox_ng, picked up active development. Most modern distros now ship sox_ng under the name sox — when this book says “sox,” it means whichever binary you have. Everything here works on both. A handful of features in the later chapters are sox_ng 14.5+ only; those are called out where they come up.

The title of this book stays Taming sox because the command is still sox. “sox_ng” only comes up when the fork itself is the subject.

Getting sox

Most modern distros ship sox_ng under the name sox. Some older stable releases still ship the 2015 legacy build. Homebrew and the BSDs vary.

Run sox --version to check. SoX_ng means you’re set. SoX 14.4.2 means you have the legacy build — everything in this book still works, but you’ll miss the sox_ng-flagged features later, and you’ll be running an unmaintained binary.

If your platform doesn’t ship sox_ng and you want it, build from https://codeberg.org/sox_ng/sox_ng. Windows binaries are on the releases page.

Getting Started

Check your install

sox --version

You should see a version line. If it starts with SoX_ng, you have the maintained fork; if it says SoX 14.4.2, you have the 2015 legacy release. The book works on both — a few sox_ng-only features are flagged where they come up.

Inspecting files with soxi

soxi reads metadata without touching the audio. The book ships with a short voice recording you can try it on:

⬇ voice.wav (CC0)

soxi voice.wav
Input File     : 'voice.wav'
Channels       : 1
Sample Rate    : 48000
Precision      : 16-bit
Duration       : 00:00:04.09 = 196162 samples ~ 306.503 CDDA sectors
File Size      : 392k
Bit Rate       : 768k
Sample Encoding: 16-bit Signed Integer PCM
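The derived fields in that report follow from the sample count by simple arithmetic, which is worth knowing when a file looks suspicious. A small sketch that recomputes them; the function name soxi_arithmetic is made up for this example, and the numbers are copied from the output above:

```shell
# Recompute soxi's derived fields from the raw numbers in the report above.
soxi_arithmetic() {
    awk 'BEGIN {
        samples = 196162; rate = 48000; bits = 16; channels = 1
        printf "duration %.2f s\n", samples / rate
        printf "bit rate %d bit/s\n", rate * bits * channels
        printf "audio data %d bytes\n", samples * channels * bits / 8
        printf "CDDA sectors %.3f\n", samples / rate * 75   # 75 sectors per second
    }'
}
soxi_arithmetic
```

The audio data alone comes to about 392k, which matches File Size; the WAV header adds a few dozen bytes on top.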

Useful individual fields (handy in shell scripts):

soxi -r voice.wav    # sample rate
soxi -b voice.wav    # bit depth
soxi -c voice.wav    # channels
soxi -D voice.wav    # duration in seconds

A test tone

When you want predictable audio to experiment with, generate it:

sox -n test.wav synth 10 sine 440 gain -6
play test.wav

Three new pieces, one sentence each: -n in the input position means “no input file — generate audio instead.” synth 10 sine 440 synthesizes ten seconds of a 440 Hz sine wave. gain -6 knocks it down 6 dB (roughly half amplitude), which leaves headroom so later effects don’t clip. Chapter 10 goes deeper on synth; this is all you need for now.

Playing audio with play

play test.wav

play is sox with your speaker as the implicit output; it is typically installed as a symlink to the same binary. It needs a working audio device. On headless servers, run the chain with sox ... -n stat, or write a file and inspect it with soxi, to verify results instead.

Scale the playback amplitude with -v (1.0 = unchanged, 0.5 = half amplitude). Despite the name, -v operates on linear amplitude: halving the amplitude lowers the level by about 6 dB, which sounds clearly quieter but not half as loud. Use gain (chapter 3) if you want to think in decibels:

play -v 0.5 test.wav

Play only the first half-second:

play test.wav trim 0 0.5

That trim 0 0.5 at the end is an effect. Don’t worry about the syntax yet — chapter 2 will make it click.

Recording with rec

rec is the mirror image: sox with your microphone as the implicit input.

rec capture.wav              # record until Ctrl-C
rec capture.wav trim 0 5     # record for 5 seconds

Conversions and Anatomy

Your first conversion

Sox infers format from file extension. Converting is often just:

sox test.wav test.mp3
sox test.wav test.flac
sox test.wav test.ogg

Use soxi to verify the output matches your expectations.

The null output -n discards the result entirely — useful for checking that sox can read a file without writing anything:

sox test.wav -n

The anatomy of a sox command

Every sox command follows this structure:

sox  [global opts]  [input opts]  infile(s)  [output opts]  outfile  [effects...]
     ────────────   ──────────────────────   ────────────────────    ──────────
     globals        input                    output                  effects
  • globals — options affecting the whole run (-V, --buffer, …)
  • input — format options for the input(s), placed immediately before the filename
  • output — the output file, with its own optional format options before it
  • effects — the effects chain, applied left to right

The -v flag from chapter 1 lives in the input section: it scales the input amplitude as the file is read, before any effects run. The vol effect (chapter 3) does the same arithmetic but in the effects chain. These two commands produce identical output:

sox -v 0.5 test.wav out.wav   # scale at input
sox test.wav out.wav vol 0.5  # scale in effects chain

-v is input-only — sox will reject it as an output flag. To scale the output, use vol or gain in the effects chain.

play and rec are just sox with one section missing. play has no output section (the speaker is implicit). rec has no input section (the microphone is implicit). Any effect chain you’d write after the output filename in sox comes directly after the input in play:

sox  test.wav out.wav highpass 300 norm -3
play test.wav            highpass 300 norm -3

Format flags are positional

A format flag describes the next filename in the command — input or output, depending on where you place it:

sox -r 8000 test.wav out.wav   # override input's declared rate; output inherits 8000 Hz
sox test.wav -r 8000 out.wav   # resample output to 8000 Hz; input rate unchanged
sox -r 16000 test.wav -r 8000 out.wav  # both specified explicitly

All three are valid and mean different things. Placing a flag in the wrong section will silently produce a different result than you intended, which is the most common source of bugs in sox commands. Format options are covered fully in chapter 5.
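To see how far apart the first two forms are, it helps to work out what overriding the input rate does. No samples change; they are simply read at the declared rate, so duration and pitch both scale. A rough sketch of that arithmetic, assuming test.wav was generated at 48000 Hz (check with soxi -r) and using a made-up helper name:

```shell
# reinterpret REAL_RATE DECLARED_RATE SECONDS HZ
# Print the duration and tone frequency that result when audio recorded at
# REAL_RATE is read as if it were DECLARED_RATE (samples untouched).
reinterpret() {
    awk -v real="$1" -v declared="$2" -v secs="$3" -v hz="$4" 'BEGIN {
        factor = declared / real
        printf "%.1f s at %.1f Hz\n", secs / factor, hz * factor
    }'
}
reinterpret 48000 8000 10 440
```

Under that assumption the ten-second 440 Hz tone becomes a minute-long tone near 73 Hz, whereas putting -r 8000 before the output keeps ten seconds at 440 Hz and only lowers the sample rate.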

Effects come last

Effects go after the output filename. This surprises most people once and never again:

sox test.wav out.wav trim 5 10 reverse
#                    ───────────────── effects
play out.wav

Multiple effects are applied left to right: first trim, then reverse on the trimmed result.

Basic Effects

trim — cut out a section

play test.wav trim start [length]

trim takes a start position and a length, not start and end.

play test.wav trim 0 5      # first 5 seconds
play test.wav trim 3 4      # 4 seconds starting at 3s
play test.wav trim 5        # skip the first 5 seconds
play test.wav trim -3       # last 3 seconds
play test.wav trim 00:01:30 # start at 1m30s
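If you think in start and end times, the length is just the difference. A minimal sketch of a helper that does the subtraction; trim_args is a hypothetical name, not part of sox:

```shell
# trim_args START END: print the "start length" pair that trim expects.
trim_args() {
    awk -v s="$1" -v e="$2" 'BEGIN { printf "%g %g\n", s, e - s }'
}
trim_args 3 7      # prints: 3 4
# Usage (word splitting of the result is intended here):
#   play test.wav trim $(trim_args 3 7)
```

The helper only handles plain seconds, not the hh:mm:ss form.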

reverse — play backwards

play test.wav reverse

Sox loads the whole file into memory to do this; large files are slow.

fade — smooth edges

play test.wav fade [type] fade-in [duration [fade-out]]

The type can be q (quarter-sine, natural sounding), h (half-sine), t (linear), l (logarithmic), or p (inverted parabola). Omitting type defaults to logarithmic.

play test.wav fade 1         # 1s logarithmic fade-in, play to end
play test.wav fade q 2 0 2   # 2s fade-in, full duration, 2s fade-out

Duration 0 means “play to the natural end of the file.”

vol and gain — adjust volume

vol takes a multiplier; gain takes decibels:

play test.wav vol 0.5    # half amplitude
play test.wav vol 2.0    # double (can clip!)
play test.wav gain -6    # quieter by 6 dB
play test.wav gain 6     # louder by 6 dB (can clip!)

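A rough guide: −6 dB ≈ half amplitude; +6 dB ≈ double amplitude. Perceived loudness moves more slowly: a change of roughly 10 dB is the usual rule of thumb for half or twice as loud.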

sox_ng 14.5+: vol accepts a second argument that enables a soft-clipping limiter so boosts don’t hard-clip when they exceed 0 dBFS. See man sox for the exact argument. On legacy sox, vol 2 clips; on sox_ng with the limiter, it shapes the peak instead.

norm — automatic normalization

norm brings the peak sample to a target level (default 0 dBFS):

play test.wav norm       # peak to 0 dBFS
play test.wav norm -3    # peak to -3 dBFS (safer headroom)

To save the result: sox test.wav out.wav norm -3.

stat — measure levels

Use -n as the output to discard audio and just print statistics:

sox test.wav -n stat     # linear amplitudes, whole file mixed to mono
sox test.wav -n stats    # dB levels, per-channel columns

stats is generally more useful: it reports in dB and breaks out each channel separately. stat reports linear amplitude values, which are harder to interpret. Both print to stderr.

Chaining Effects

Effects in the effects chain are applied strictly left to right. The output of one effect becomes the input to the next.

play test.wav trim 5 10 fade q 1 0 1 norm -3
#             ──────── ──────────── ──────
#             1. trim  2. fade      3. norm

Order matters

# norm then gain: normalize to 0 dBFS, then boost 6 dB — likely clips
play test.wav norm gain 6

# gain then norm: boost first, then normalize back down — norm undoes the gain
play test.wav gain 6 norm

Neither is wrong — they just do different things. Think through the pipeline before you run it.
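The difference is easy to see by following the peak level through each chain. Here is a toy model that tracks only the peak in dBFS; it understands just norm (with no argument) and gain N with a whole-number N, and the function name is invented for this sketch:

```shell
# peak_after PEAK_DB EFFECT...: follow the peak level through norm and gain.
peak_after() {
    peak=$1; shift
    while [ $# -gt 0 ]; do
        case $1 in
            norm) peak=0 ;;                          # bare norm targets 0 dBFS
            gain) shift; peak=$((peak + $1)) ;;      # gain adds its dB argument
        esac
        shift
    done
    echo "$peak"
}
peak_after -6 norm gain 6   # prints: 6  (above 0 dBFS, so it clips)
peak_after -6 gain 6 norm   # prints: 0
```

Anything above 0 in the result means the chain asks for more level than the format can hold.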

Writing to a file

When you’re happy with the chain, swap play for sox and add an output filename:

sox test.wav output.wav trim 5 10 fade q 1 0 1 norm -3
play output.wav

Sox converts the format and applies the effects in a single pass, so trimming and converting to MP3 is one command:

sox test.wav output.mp3 trim 0 30 norm -3
play output.mp3

Format Options

Sox detects format from file extensions. When that isn’t possible — raw PCM files, pipes, unusual encodings — you provide it explicitly.

Recall from chapter 2: format flags describe the next filename. Put them in the wrong section and they apply to the wrong file.

The four core flags

These describe the audio itself. Sox will resample, convert, or remix as needed to meet them.

Flag   Meaning            Example
-r     sample rate (Hz)   -r 44100
-b     bit depth          -b 16
-c     channels           -c 1 (mono), -c 2 (stereo)
-e     encoding           -e signed-integer

Common encodings: signed-integer, unsigned-integer, floating-point, a-law, u-law.

The file type flag

-t is different: it names the container format (WAV, AIFF, FLAC, raw, and so on) rather than a property of the audio. Sox normally infers it from the filename extension, so you rarely set it. Reach for -t only when there’s no extension to read (pipes with -, headerless raw files) or the extension lies about the content.

-t raw      # headerless PCM
-t wav      # force WAV regardless of extension

Resampling

sox input.wav -r 8000 telephone.wav    # downsample to 8 kHz
play telephone.wav                     # noticeably lo-fi
sox input.wav -r 48000 hq.wav          # upsample to 48 kHz

The format flag before telephone.wav describes the output. Sox resamples automatically.

Changing bit depth and channels

sox input.wav -b 24 output.wav    # convert to 24-bit
sox stereo.wav -c 1 mono.wav            # stereo → mono (averages channels)
play mono.wav
sox mono.wav -c 2 stereo.wav            # mono → stereo (duplicates channel)

-c uses sox’s default algorithm: averaging when going down, duplication when going up. For anything more specific — dropping a channel, swapping L and R, custom mix weights — use remix (chapter 8).

Fully-specified output

Sometimes you want to know exactly what comes out: a specific sample rate, bit depth, channel count, and encoding. This matters for archival (so the artifact doesn’t drift with the default audio config), for interop (another tool expects 16-bit 44.1 kHz stereo signed-integer and nothing else), and for pipelines that hand audio to downstream processes with narrow assumptions.

The recipe: specify all four flags on the output.

sox input.wav -r 44100 -b 16 -c 2 -e signed-integer output.wav

That produces a WAV with exactly those properties regardless of what the input looked like — sox resamples, converts bit depth, remixes channels, and re-encodes as needed. Verify with soxi output.wav.
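In a script, the soxi check can be automated with the single-field options from chapter 1. A minimal sketch; check_format is a hypothetical helper, and it compares only rate, bit depth, and channel count, not the encoding:

```shell
# check_format FILE RATE BITS CHANNELS: succeed only if soxi reports exactly these.
check_format() {
    [ "$(soxi -r "$1")" = "$2" ] &&
    [ "$(soxi -b "$1")" = "$3" ] &&
    [ "$(soxi -c "$1")" = "$4" ]
}
# Usage:
#   check_format output.wav 44100 16 2 || echo "unexpected format" >&2
```

Because it returns an exit status rather than printing, it drops straight into if or || in a pipeline script.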

The same four-flag pattern works for raw output — just add -t raw:

sox input.wav -t raw -r 44100 -b 16 -c 1 -e signed-integer output.raw

Reading raw PCM

Raw files have no header, so you must describe them completely:

sox -r 44100 -b 16 -c 1 -e signed-integer input.raw output.wav
play output.wav

Writing raw output (see “Fully-specified output” above):

sox input.wav -t raw -r 8000 -b 8 -c 1 -e unsigned-integer output.raw

Piping

Use - for stdin or stdout, with -t to specify the format:

# Two sox processes in a pipeline; the flags on the right must match what input.wav holds
sox input.wav -t raw - | sox -t raw -r 44100 -b 16 -c 1 -e signed-integer - output.wav

For piping between two sox processes specifically, the -p flag emits sox’s own internal format, which avoids specifying all those flags manually:

sox test.wav -p trim 0 5 | sox - output.wav norm -3

Filters

Filters shape the frequency content of audio. A quick reference: human hearing spans roughly 20 Hz (low rumble) to 20 kHz (high hiss).

Filtering a single sine wave is uninteresting — it either passes or it doesn’t. Pink noise has energy across the whole spectrum, so filters produce an audible and visible change. Generate some:

sox -n noise.wav synth 5 pinknoise gain -6
play noise.wav

highpass and lowpass

Remove everything below or above a cutoff frequency:

play noise.wav highpass 2000    # remove everything below 2 kHz
play noise.wav lowpass 2000     # remove everything above 2 kHz
play noise.wav highpass 300 lowpass 3400   # telephone band

The telephone band example is a good one to listen to: the characteristic “tinny phone” sound comes entirely from cutting the low and high ends.

bass and treble are shelving variants of equalizer — convenient when you just want to lift or cut one end. See man sox for arguments.

equalizer — parametric EQ

Three arguments: center frequency, width, gain in dB. Width units are controlled by a suffix:

Suffix   Unit       Example
h        Hz         200h = 200 Hz wide
q        Q factor   2q = Q of 2
o        octaves    1o = one octave wide

Always write the suffix. The man page gives the width as width[q|o|h|k] (k is kHz), and a bare number is read as a Q factor rather than as Hz, so equalizer 1000 200 -6 would be an extremely narrow Q=200 cut.

Q and Hz are inversely related: a higher Q means a narrower band. Q = center / bandwidth, so 2q at 1 kHz equals a 500 Hz bandwidth. Q is more useful when you want consistent relative width across different center frequencies.

Stack multiple equalizer effects to build a full EQ:

play noise.wav equalizer 1000 200h -6   # cut 6 dB at 1 kHz, 200 Hz wide
play noise.wav equalizer 1000 2q -6     # same center, Q=2 (500 Hz wide)
play noise.wav equalizer 3000 1o 3      # boost 3 dB at 3 kHz, one octave wide
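The Q = center / bandwidth relation is easy to turn around when you know the Q and want the width in Hz. A small sketch with an invented helper name:

```shell
# q_to_hz CENTER_HZ Q: bandwidth in Hz for a given center frequency and Q.
q_to_hz() {
    awk -v f="$1" -v q="$2" 'BEGIN { printf "%g\n", f / q }'
}
q_to_hz 1000 2    # prints: 500
q_to_hz 4000 2    # prints: 2000
```

The same Q gives a band four times wider at 4 kHz than at 1 kHz, which is what "consistent relative width" means in practice.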

A practical voice cleanup chain

⬇ voice.wav (CC0)

sox voice.wav clean.wav \
    highpass 100 \
    equalizer 3000 500h 2 \
    norm -3
play clean.wav

Removes low-frequency noise, adds a little presence, normalizes.

sox_ng 14.5+: adds a FIR filter designed from frequency-response knots — you specify points on the desired magnitude response and sox builds the filter. Useful when neither a shelving nor a parametric shape fits what you want. See man sox.

Time and Pitch

Four effects; two axes:

Effect   Changes speed?   Changes pitch?   Notes
rate     no               no               proper resampler; changes sample rate only
speed    yes              yes              varispeed tape
tempo    yes              no               time-stretch, pitch preserved
pitch    no               yes              pitch-shift, duration preserved

rate — resampling

Resamples the audio to a new sample rate. Pitch and duration are both preserved — the output just has fewer (or more) samples per second. Use it to change the technical format of a file, not to alter how it sounds:

sox test.wav out.wav rate 22050    # downsample to 22050 Hz
play out.wav
sox test.wav out.wav rate 48000    # upsample to 48000 Hz
play out.wav

This is equivalent to writing -r 22050 out.wav as an output format flag, but as an explicit effect it fits naturally in a chain and gives access to quality options:

  • -h — high quality, and the default when no option is given: long anti-aliasing filter, strong stopband rejection
  • -v — very high quality: even longer filter; diminishing returns over -h but useful for archival or repeated resampling where rounding errors accumulate

The lower settings -q, -l, and -m trade quality for speed.

speed — varispeed

Like a tape running faster or slower. Factor > 1 speeds up and raises pitch; < 1 slows down and lowers pitch.

play test.wav speed 1.5    # faster and higher
play test.wav speed 0.75   # slower and lower
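How much the pitch moves for a given speed factor follows from the octave being a doubling: the shift in cents (1200 per octave) is 1200 × log2(factor). A quick sketch, with a made-up function name:

```shell
# speed_to_cents FACTOR: pitch change in cents caused by a speed factor.
speed_to_cents() {
    awk -v f="$1" 'BEGIN { printf "%.0f\n", 1200 * log(f) / log(2) }'
}
speed_to_cents 1.5    # prints: 702   (about a perfect fifth up)
speed_to_cents 0.75   # prints: -498  (about a perfect fourth down)
```

A factor of 2 comes out at 1200 cents, exactly one octave.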

tempo — time-stretch only

Changes duration while preserving pitch using the WSOLA algorithm (chops audio into overlapping segments and re-stitches them). Practical range: 0.5–2.0.

play test.wav tempo 1.2    # 20% faster, same pitch
play test.wav tempo 0.8    # 20% slower, same pitch

Three presets tune the algorithm for different material:

play test.wav tempo -m 1.2   # music (default)
play test.wav tempo -s 0.75  # speech
play test.wav tempo -l 1.1   # linear (least CPU, more artifacts)

pitch — pitch-shift only

Argument is in cents (100 cents = 1 semitone, 1200 = one octave).

play test.wav pitch 200     # up 2 semitones
play test.wav pitch -1200   # down one octave

pitch uses the same WSOLA algorithm as tempo — it is implemented as a tempo stretch followed by a rate resample in the opposite direction, so the duration cancels out and only the pitch shift remains. The -m/-s/-l presets are not exposed on pitch, but you can pass the same segment search overlap tuning parameters if needed.

Combining them

tempo and pitch are independent effects applied in sequence:

play test.wav tempo 1.2 pitch -400   # faster but lower

See also: Rubber Band

Sox has no phase vocoder. When quality matters — especially for time-stretching, pitch-shifting, or formant-preserved vocal shifts — rubberband is the standard tool. The “Beyond sox” chapter covers how to reach for it.

Combining Files

Sample files — download and place in your working directory:

⬇ music.wav — “Erase Data” by Koi-discovery (CC0)

⬇ voice.wav (CC0)

Setup:

sox -n a.wav synth 3 sine 440 gain -6
sox -n b.wav synth 3 sine 660 gain -6
# normalise samples to a common format for mixing
sox samples/music.wav -c 1 -r 44100 music.wav
sox samples/voice.wav -r 44100 voice.wav

Per-input format flags

With multiple inputs, input-section format flags repeat independently for each input file — place them immediately before the file they describe:

sox [input-a] infile_a [input-b] infile_b [output] outfile [effects]

Any input flag works this way: -v, -r, -b, -c, -t, -e. The most common use is -v for per-input volume (shown below), and format flags when combining files of different types or encodings.

-v takes a linear multiplier only — there is no dB form. Common conversions: −6 dB ≈ 0.5, −12 dB ≈ 0.25, −20 dB = 0.1.

sox -v 0.8 a.wav -t raw -r 48000 -b 32 -c 1 -e signed-integer -v 0.5 b.raw out.wav
play out.wav
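Since -v has no dB form, a one-line converter saves looking the multipliers up. The sketch below uses the standard amplitude relation, multiplier = 10^(dB/20); db_to_linear is a hypothetical helper name:

```shell
# db_to_linear DB: print the linear amplitude multiplier for a dB change.
db_to_linear() {
    awk -v db="$1" 'BEGIN { printf "%.3f\n", 10 ^ (db / 20) }'
}
db_to_linear -6     # prints: 0.501
db_to_linear -12    # prints: 0.251
db_to_linear -20    # prints: 0.100
# Usage:
#   sox -v "$(db_to_linear -6)" a.wav out.wav
```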

Concatenation — A then B

List multiple inputs before the output:

sox a.wav b.wav combined.wav
play combined.wav

Files must have identical sample rates and channel counts — sox hard-fails if they differ. Use rate to resample first if needed.

For a smooth crossfade at the join, use the splice effect:

sox a.wav b.wav out.wav splice 3    # crossfade at the 3-second mark
play out.wav

Mixing — A over B

The -m global flag sums inputs together rather than concatenating:

play -m music.wav voice.wav

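With no -v flags, sox scales each of the n inputs by 1/n before summing, so the mix cannot clip but comes out quieter than its parts. Normalize afterward to bring the level back up: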

play -m music.wav voice.wav norm -3

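Set per-file volume with -v immediately before each input. Giving -v on any input turns the automatic 1/n scaling off, so keep the multipliers summing to about 1.0 or less:

play -m -v 0.3 music.wav -v 0.7 voice.wav norm -3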

Merging channels — A and B side by side

-M puts channels from each file side by side. Two mono files become one stereo file:

sox -M left.wav right.wav stereo.wav
play stereo.wav

remix — channel routing

Where -c uses sox’s default averaging/duplication, remix gives explicit control. Each argument describes one output channel by naming the input channel(s) that feed it.

play stereo.wav remix 2 1       # swap L and R
play stereo.wav remix -         # average all channels to mono
play stereo.wav remix 1         # keep left channel only, drop right
play stereo.wav remix 1,2 1,2   # both output channels = L+R mix

- mixes all input channels into one output channel, like -c 1 but as an explicit effect. 1,2 mixes channels 1 and 2; by default remix scales a mix of n channels by 1/n, so the result is an average rather than a raw sum.

Effects and Dynamics

Sample file — download and place in your working directory:

⬇ music.wav — “Erase Data” by Koi-discovery (CC0)

Setup:

# Varying dynamics for compand: loud / quiet / loud
sox -n _loud.wav synth 2 sawtooth 220 gain -6
sox -n _quiet.wav synth 2 sawtooth 220 gain -20
sox _loud.wav _quiet.wav _loud.wav dynamics.wav
play dynamics.wav

reverb

Simulates room acoustics. Arguments: reverberance (0–100), HF damping (0–100), room scale (0–100). Defaults are reasonable.

play music.wav reverb
play music.wav reverb 80 50 100    # large, bright room

--wet-only removes the dry signal, leaving only the wet (reverberated) signal:

play music.wav reverb --wet-only 80

Note: reverb does not extend the output file. The reverb decay is truncated at the input length. To capture the full tail, pad silence onto the end of the input first:

play music.wav pad 0 2 reverb 80

silence — trim silence

These effects need a file that actually has silence. Generate one with pad, which adds silence (in seconds) to the start and end:

sox -n padded.wav synth 5 sawtooth 220 gain -6 pad 1 1
play padded.wav

Remove leading and trailing silence:

play padded.wav silence 1 0.1 1% -1 0.1 1%

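Each group is: periods duration threshold. The first group trims silence from the start. The second group, whose period count is -1, trims it from the end; per the man page, a negative count also restarts the effect after each silent stretch, so silence in the middle of a file is removed too.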

For voice recordings, vad (voice activity detection) is simpler — it finds the onset of audio activity and trims everything before it:

play padded.wav vad

compand — dynamic range compression

compand reduces the gap between loud and quiet passages. dynamics.wav from the setup has 14 dB of range to work with.

play dynamics.wav compand 0.3,1 6:-70,-60,-20 -5 -90 0.2

Breaking that down:

  • 0.3,1 — attack 0.3 s, decay 1 s
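  • 6:-70,-60,-20 — a 6 dB soft knee, then the transfer function as input/output dB points: -70 alone means -70 dB in gives -70 dB out, -60,-20 lifts a -60 dB input to -20 dB, and the point 0,0 is implied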
  • -5 — output gain offset (reduce if sox warns about clipping)
  • -90 — initial signal level
  • 0.2 — delay before processing
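To get a feel for what that transfer function does to dynamics.wav, the points can be treated as straight lines on a dB-in, dB-out graph. The sketch below is a simplification, not how sox computes it: it ignores the soft knee and the attack/decay smoothing, handles only inputs between -70 and 0 dB, and uses an invented function name:

```shell
# compand_out IN_DB: where the points -70,-70 / -60,-20 / 0,0 put an input
# level, using straight lines between points (soft knee ignored).
compand_out() {
    awk -v x="$1" 'BEGIN {
        n = split("-70 -60 0", xin, " ")
        split("-70 -20 0", yout, " ")
        for (i = 2; i <= n; i++)
            if (x + 0 <= xin[i] + 0) {
                t = (x - xin[i-1]) / (xin[i] - xin[i-1])
                printf "%.1f\n", yout[i-1] + t * (yout[i] - yout[i-1])
                exit
            }
    }'
}
compand_out -6     # loud section:  prints -2.0
compand_out -20    # quiet section: prints -6.7
```

Under this approximation the 14 dB gap between the loud and quiet sections shrinks to under 5 dB, before the -5 dB output gain pulls everything back down.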

A practical podcast leveling chain:

sox dynamics.wav podcast.wav \
    highpass 80 \
    compand 0.3,1 6:-70,-60,-20 -5 -90 0.2 \
    norm -3

Other time-based effects

Sox also provides echo (discrete delays), chorus, and flanger. Their defaults are reasonable starting points; man sox covers the tuning parameters.

Synthesis

synth — generating audio from nothing

-n in the input position means “no input file; generate audio.” synth tells sox what to generate.

play -n synth duration waveform frequency

Waveforms

play -n synth 3 sine     440
play -n synth 3 square   440
play -n synth 3 triangle 440
play -n synth 3 sawtooth 440

Noise

play -n synth 5 whitenoise
play -n synth 5 pinknoise
play -n synth 5 brownnoise

Sweeps

Specify frequency as a range to sweep:

play -n synth 5 sine 100:8000    # 100 Hz → 8 kHz over 5s

Chords

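Multiple waveforms on one synth are generated simultaneously, one per channel when the input is -n. Ask for a single output channel with -c 1 so sox mixes them down into a chord:

# C major: C4, E4, G4
play -n -c 1 synth 2 sine 261.63 sine 329.63 sine 392.00 gain -6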
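The three frequencies in that chord come from twelve-tone equal temperament: each semitone multiplies the frequency by 2^(1/12), counted here from A4 = 440 Hz. A short sketch for working out other notes; note_hz is a hypothetical helper, and the tuning reference is an assumption you can change:

```shell
# note_hz SEMITONES: frequency of the note SEMITONES away from A4 (440 Hz).
note_hz() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", 440 * 2 ^ (n / 12) }'
}
note_hz -9    # C4, prints: 261.63
note_hz -5    # E4, prints: 329.63
note_hz -2    # G4, prints: 392.00
```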

Specifying output format

The output format follows your system’s default audio configuration, which may not be what you want. Specify it explicitly with output format flags between -n and the output filename (see “Fully-specified output” in chapter 5):

sox -n -r 44100 -b 16 -c 1 out.wav synth 3 sine 440

Adding effects

play -n synth accepts a full effects chain:

play -n synth 10 sine 440 reverb 80

Batch Processing

Sox works well in shell scripts and pipelines. The examples here assume a POSIX shell (bash, zsh, etc.).

Shell loops

Process a directory

mkdir -p normalized
for f in *.wav; do
    sox "$f" "normalized/$f" norm -3
done

Construct output filenames

for f in *.wav; do
    out="${f%.wav}_clean.wav"
    sox "$f" "$out" highpass 100 norm -3
done

${f%.wav} strips the .wav suffix.
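A few related expansions cover most output-naming needs. They are plain POSIX shell, so you can try them without sox; the function name and the sample path are only for illustration:

```shell
# name_variants PATH: show the expansions commonly used to build output names.
name_variants() {
    f=$1
    echo "${f%.wav}"             # strip a trailing .wav
    echo "${f%.wav}_clean.wav"   # add a new suffix
    echo "${f##*/}"              # strip leading directories
    echo "${f%.*}.flac"          # swap whatever the last extension is
}
name_variants "takes/My Song.wav"
```

Keeping the expansions inside double quotes, as in the loops above, is what lets filenames with spaces survive.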

Batch format conversion

mkdir -p mp3
for f in *.wav; do
    sox "$f" "mp3/${f%.wav}.mp3"
done

Use soxi in scripts

duration=$(soxi -D "$f")
# pass the value with -v instead of splicing it into the awk program
if awk -v d="$duration" 'BEGIN { exit !(d > 5) }'; then
    sox "$f" trimmed.wav trim 0 5
fi

Parallel processing

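# NUL-delimited so odd filenames survive; needs GNU or BSD xargs and an existing out/
printf '%s\0' *.wav | xargs -0 -P 4 -I{} sox {} "out/{}" norm -3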

Check exit codes

Sox exits non-zero on errors. Always check in scripts:

for f in *.wav; do
    sox "$f" "out/$f" norm -3 || echo "Failed: $f" >&2
done
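For longer batches it is often more useful to keep going, count the failures, and report a single status at the end. One way to sketch that, using a hypothetical wrapper function around the same norm -3 step:

```shell
# normalize_all OUTDIR FILE...: normalize each file into OUTDIR, report
# failures on stderr, and return non-zero if any file failed.
normalize_all() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    failed=0
    for f in "$@"; do
        if ! sox "$f" "$outdir/${f##*/}" norm -3; then
            echo "Failed: $f" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
# Usage:
#   normalize_all out *.wav || echo "some files failed" >&2
```

The wrapper creates the output directory itself, so a missing out/ does not show up as a failure on every file.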

Piping between sox processes

The -p flag emits sox’s internal format on stdout — no need to specify sample rate, bit depth, or encoding on the receiving end:

sox voice.wav -p trim 0 3 | play - reverb 80

This avoids intermediate files in multi-step pipelines.

Troubleshooting

Most sox problems fall into a small set of patterns. Each one has a quick diagnostic.

Silent output

The file exists, soxi shows sensible numbers, and you hear nothing. Likely causes, in order:

  • Audio device: another app has the output, or play is pointed at the wrong sink. Try play on a known-good file first (play -n synth 1 sine 440). If that is silent, it’s not a sox problem.
  • -v in the wrong section: sox -v 0 input.wav out.wav scales the input to zero, producing a silent file — no error. Check that -v belongs where you put it (see chapter 2).
  • System volume muted at the OS level — check that independently.

Clipping

Clipping sounds like harsh distortion on loud passages — a kind of fuzzy crunch that tracks peaks rather than being continuous. Common causes:

  • gain N after norm: norm lifts the peak to 0 dBFS, then gain pushes above it. Reorder, or norm -N instead.
  • Mixing without headroom: sox protects a plain -m mix by scaling each of the n inputs by 1/n, but once any input has -v that scaling is off and the inputs sum as given. Keep the -v multipliers adding up to about 1.0 or less.
  • Upsampling a signal that was already at 0 dBFS — the interpolator’s ringing can exceed the original peak.

Detect clipping with stats:

sox output.wav -n stats

Watch the Pk lev dB line and the Flat factor / Num samples report for saturated counts. A non-zero Flat factor on output that shouldn’t have any flat runs is a strong signal.

Format mismatch on concat or mix

Sox hard-fails when inputs to concatenation or -m mixing differ in sample rate or channel count:

sox FAIL sox: Input files must have the same sample-rate

Fix by pre-processing the outlier:

sox other.wav -r 44100 other-44k.wav      # match the rate
sox mono.wav -c 2 mono-stereo.wav         # match channels
sox main.wav other-44k.wav combined.wav   # now concat works

Or do it in a single pipeline with -p:

sox other.wav -p rate 44100 channels 2 | sox main.wav - combined.wav

play fails on headless systems

play needs a working audio device. On servers, CI, and containers, it typically can’t find one and errors out. Diagnose a chain without actually playing:

# Write to /dev/null-style null output and read stats
sox input.wav -n stats

# Or render to a temp file and inspect with soxi
sox input.wav out.wav <effects>
soxi out.wav

-n as the output lets effects run through to stat/stats without needing a device.

Typos that silently mean something else

Effect names in sox aren’t validated against a “did you mean” list; a misspelling is often a valid effect that does something completely different. bass and bas both parse; one of them isn’t the shelving filter. Similarly, format flags put in the wrong section apply to the wrong file without a warning.

The defense: eyeball the command before running, and use -V3 to see what sox actually thinks it’s doing.

Use -V for diagnostics

Sox takes a verbosity level from -V1 (errors only) up through -V4 (everything it knows). -V3 is the usual sweet spot — it prints the effect chain as sox understands it, including which effects actually ran and with what arguments:

sox -V3 input.wav out.wav highpass 100 norm -3

If an effect isn’t doing what you expected, -V3 usually tells you why in the first few lines.

Reading sox error messages

Most sox errors are in the form sox FAIL <subsystem>: <message>. A few common ones:

  • sox FAIL formats: no handler for file extension ... — you asked sox to read or write a format without a -t flag, and the extension didn’t disambiguate. Add -t wav (or whatever).
  • sox FAIL sox: Input files must have the same sample-rate — see the format-mismatch section above.
  • sox FAIL rate: Input sample-rate ... is unchanged — you asked rate to resample to the rate it’s already at. Remove the effect.
  • sox WARN ...: clipped N samples; ... — output clipped; see the clipping section above.

When in doubt, re-run with -V3 — the warning is usually right next to the line that caused it.

Beyond sox

The manual

man sox is the authoritative reference — comprehensive, well-written, and covers every effect and flag in detail. Two companion pages are also worth bookmarking:

man sox          # effects, global options, examples
man soxformat    # format flags, encodings, file type details

LADSPA plugins

Sox can load any LADSPA plugin via the ladspa effect, which opens up hundreds of production-quality processors — noise gates, limiters, multiband compressors, pitch correction, and more:

# List installed plugins
listplugins

# Apply a plugin by label
play voice.wav ladspa <plugin-label> [params...]

On Debian/Ubuntu, apt install swh-plugins installs Steve Harris’s widely-used collection. LADSPA extends sox without changing its pipeline model.

ffmpeg

ffmpeg handles the containers and codecs sox can’t: AAC, Opus, MP4, video tracks, streaming protocols. The two tools pair naturally — use ffmpeg to get audio into or out of awkward formats, sox for signal processing:

# Extract audio from a video, then process with sox
ffmpeg -i video.mp4 -vn audio.wav
sox audio.wav processed.wav highpass 100 norm -3

Rubber Band: high-quality time-stretching and pitch-shifting

Sox has no phase vocoder. WSOLA (tempo, pitch) is fast and reasonable, but on complex music you can hear it working. For higher-quality time-stretching, pitch-shifting, or near-unity resampling, reach for rubberband:

rubberband --tempo 1.2 input.wav output.wav   # 20% faster (same sense as sox tempo)
rubberband --time 0.8 input.wav output.wav    # 0.8x duration (--time is 1/--tempo)
rubberband --pitch 2 input.wav output.wav     # up 2 semitones (not cents)
rubberband --pitch -2 --tempo 1.1 input.wav output.wav  # combine freely
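When porting a sox chain to rubberband, the two unit differences noted in the comments above are the usual trip-ups: sox's pitch takes cents where --pitch takes semitones, and --time is the reciprocal of --tempo. A small sketch of the conversions, with invented helper names:

```shell
# sox_to_rubberband TEMPO CENTS: rubberband options for sox's "tempo TEMPO pitch CENTS".
sox_to_rubberband() {
    awk -v tempo="$1" -v cents="$2" 'BEGIN {
        printf "--tempo %g --pitch %g\n", tempo, cents / 100
    }'
}
# tempo_to_time TEMPO: the --time value that gives the same stretch.
tempo_to_time() {
    awk -v t="$1" 'BEGIN { printf "%g\n", 1 / t }'
}
sox_to_rubberband 1.2 -400    # prints: --tempo 1.2 --pitch -4
tempo_to_time 1.25            # prints: 0.8
```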

Rubber Band has two engines: R2 (the default: faster, and itself a phase vocoder rather than WSOLA) and R3 (a newer, slower engine that is noticeably better on music):

rubberband --fine --pitch 4 input.wav output.wav   # R3 engine
rubberband-r3 --pitch 4 input.wav output.wav       # equivalent

For vocal pitch-shifting, --formant preserves the formant structure so voices don’t sound cartoonish:

rubberband --fine --formant --pitch 3 voice.wav output.wav

rubberband doesn’t support stdout, so to combine with sox for format conversion, route through a temp file:

rubberband -q --pitch 2 input.wav tmp.wav && sox tmp.wav output.flac

libsox

Sox is also a C library. If you need to embed audio processing in a program, libsox exposes the full effect chain and format I/O via a C API. The header is sox.h; the source ships with examples.

What sox isn’t

Sox is a Swiss Army knife, but some problems aren’t shaped like a knife. Knowing what sox is not good at saves time:

  • No phase vocoder. WSOLA (tempo, pitch) works well but produces artifacts on complex material; use Rubber Band (above) when quality matters.
  • No multitrack routing. Sox processes one stream at a time. For independent tracks with sends and returns, look at ecasound or a DAW.
  • No streaming protocols. Sox reads and writes files and pipes; it has no RTSP, HLS, or WebRTC support.