Building Frequency Altered Feedback in the Browser

Koray Ulusan

TL;DR

I extended DAF Online with real-time Frequency Altered Feedback (FAF). FAF pitch-shifts your voice in your headphones while you speak, triggering the brain’s choral effect to reduce stuttering. Getting it right took three algorithm iterations, some careful latency accounting, and a custom jitter benchmark that works around a non-obvious limitation of AudioWorkletGlobalScope. This post covers all of it.

Figure 1: The DAF Online interface featuring a clean, minimalist design with the addition of FAF sliders.

What is Frequency Altered Feedback?

Delayed Auditory Feedback (DAF) slows speech by creating a timing mismatch between articulation and perception. I covered DAF in the previous post.

Frequency Altered Feedback (FAF) attacks the same problem through a different mechanism. It shifts the pitch of your voice in your headphones by a ratio $r$, where

$$r = 2^{s/12}$$

and $s$ is the semitone shift. A shift of 3-6 semitones is enough to trigger the choral effect: your brain perceives itself as speaking alongside another voice and drops into a group-speaker processing mode, which disengages the feedback loop that drives stuttering.
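The semitone-to-ratio mapping is small enough to write out. This is a minimal sketch; the function names are my own for illustration and are not taken from the DAF Online source.

```typescript
// Convert a semitone shift s to the frequency ratio r = 2^(s/12).
function semitonesToRatio(semitones: number): number {
  return Math.pow(2, semitones / 12);
}

// 100 cents make one semitone, so cents map to a ratio via 2^(c/1200).
function centsToRatio(cents: number): number {
  return Math.pow(2, cents / 1200);
}

console.log(semitonesToRatio(12));            // 2 (one octave up)
console.log(semitonesToRatio(-8).toFixed(3)); // "0.630"
```

A ±300-cent setting and a ±3-semitone setting therefore produce the same ratio; the cents UI only offers finer steps.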

The two mechanisms are neurally independent. DAF acts on auditory-motor timing pathways. FAF acts on the choral speech cortical network. Clinical devices like SpeechEasy combine both, which is why DAF Online now supports them together.

Clinical semitone ranges

| Goal | Range | Mechanism |
|---|---|---|
| Stuttering therapy | 3-6 semitones | Choral effect. ~35% fluency improvement at 3 st, 65-70% at 6 st. Below ~2 st too subtle; above ~6 st diminishing returns and unnatural quality. |
| Pitch-Shift Reflex (PSR) research | 50-200 cents (0.5-2 st) | Small enough that the brain reads the shift as accidental pitch drift, triggering an involuntary compensatory counter-shift within 50-150 ms. Larger shifts are interpreted as an external error and no reflex fires. |

The ±300-cent range mode in the app’s cents UI exists specifically for PSR experiments.

Key Takeaway

FAF and DAF engage different neural pathways and their effects add up. In trials, the combination outperforms either alone, which is why clinical hardware uses both.

The Wrong Algorithm First: WSOLA

My initial implementation used WSOLA (Waveform Similarity Overlap-Add). The algorithm advances its input read head at a rate proportional to $r$:

this._inR += H * r;

For $r > 1$ (pitch up), the read head races ahead of incoming audio: the input buffer drains faster than new samples arrive. For $r < 1$, it falls behind. In both cases pitch and delay are coupled: setting $+8$ semitones audibly shortened the DAF delay; $-8$ semitones stretched it. That is unsuitable for a tool whose delay is supposed to be set independently by its own slider.
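The coupling can be seen in a toy model that tracks only the two buffer heads. This is not the app's WSOLA code; the function and parameter names are hypothetical, and the model ignores the similarity search and any clamping a real implementation would do.

```typescript
// Toy model: how far the read head trails the write head after some hops.
function delayAfterHops(
  hops: number,
  hopSize: number,      // H, samples per synthesis hop
  ratio: number,        // r, pitch ratio
  initialDelay: number  // samples between the heads at the start
): number {
  let writeHead = initialDelay;
  let readHead = 0;
  for (let n = 0; n < hops; n++) {
    writeHead += hopSize;        // new audio arrives at the real-time rate
    readHead += hopSize * ratio; // mirrors this._inR += H * r
  }
  return writeHead - readHead;
}

console.log(delayAfterHops(10, 128, 1, 2400));   // 2400 (unchanged)
console.log(delayAfterHops(10, 128, 2, 2400));   // 1120 (delay shrinks)
console.log(delayAfterHops(10, 128, 0.5, 2400)); // 3040 (delay grows)
```

The delay changes by $H(1 - r)$ samples per hop, so any $r \neq 1$ moves it steadily away from the value the user set.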

Fixed-Anchor OLA: Decoupling Pitch from Delay

The fix is to remove the drifting read head. Every synthesis hop, the analysis position is anchored at a fixed offset behind the write head:

$$\text{anchor} = w_{in} - L$$

where $w_{in}$ is the input write head and $L = \lceil G / R_{min} \rceil + 1$ is the lookback depth. Pitch is shifted by resampling within the grain: read $G/r$ input samples and interpolate them into $G$ output samples.

$$\text{srcPos}(i) = \text{anchor} + \frac{i}{G - 1} \cdot \frac{G}{r}, \quad i \in [0, G-1]$$

  • $r > 1$: $G/r < G$, fewer input samples stretched across $G$ outputs → higher frequency
  • $r = 1$: $G/r = G$, identity
  • $r < 1$: $G/r > G$, more input samples compressed → lower frequency

The anchor is fixed relative to the write head, so the delay is invariant under changes to $r$.
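A grain read under this scheme might look like the following. It is a sketch under assumptions: the ring-buffer layout, the convention that the write head is the index of the next sample to be written, and every identifier here are illustrative rather than copied from the processor.

```typescript
// Read one resampled grain from a ring buffer using a fixed anchor.
function readGrain(
  inBuf: Float32Array, // input ring buffer
  inW: number,         // write head: index of the next sample to be written
  lookback: number,    // L
  grainSize: number,   // G
  ratio: number        // r
): Float32Array {
  const size = inBuf.length;
  const wrap = (k: number): number => ((k % size) + size) % size;
  const grain = new Float32Array(grainSize);
  const anchor = inW - lookback;  // fixed offset behind the write head
  const span = grainSize / ratio; // G / r input samples feed this grain
  for (let i = 0; i < grainSize; i++) {
    const srcPos = anchor + (i / (grainSize - 1)) * span;
    const ip = Math.floor(srcPos);
    const frac = srcPos - ip;
    const a = inBuf[wrap(ip)];
    const b = inBuf[wrap(ip + 1)];
    grain[i] = a + frac * (b - a); // linear interpolation between neighbours
  }
  return grain;
}
```

Nothing in the function carries state between calls, which is the point: the read position depends only on the current write head and $r$, never on how many grains came before.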

Choosing $R_{min}$ and the $+1$ guard

The lookback depth is $L = \lceil G / R_{min} \rceil + 1$. The original implementation used $R_{min} = 0.5$ (one full octave), giving

$$L = \left\lceil \frac{256}{0.5} \right\rceil = 512 \text{ samples} \approx 10.7 \text{ ms at 48 kHz}$$

But the UI slider is hard-limited to ±8 semitones, so the minimum reachable ratio is $2^{-8/12} \approx 0.630$. The theoretical $R_{min} = 0.5$ was paying latency for headroom the slider can never reach. Setting $R_{min} = 0.6$ (≈9 st down, giving 1 st of headroom beyond the slider) reduces this to

$$L = \left\lceil \frac{256}{0.6} \right\rceil + 1 = 428 \text{ samples} \approx 8.9 \text{ ms at 48 kHz}$$

The $+1$ guards the linear interpolation from reading one sample past the write head at exactly $r = R_{min}$. Without it:

$$\text{srcPos}_{max} = (w_{in} - 427) + \frac{256}{0.6} = w_{in} - 0.33$$

$$\lfloor w_{in} - 0.33 \rfloor = w_{in} - 1 \implies \text{ip} + 1 = w_{in}$$

That reads inBuf[inW], one sample past the write head, containing stale ring buffer data from a previous wrap. With $L = 428$:

$$\text{srcPos}_{max} = w_{in} - 1.33 \implies \text{ip} + 1 = w_{in} - 1 \checkmark$$
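The guard arithmetic can be checked mechanically. In this sketch the helper names are hypothetical; offsets are expressed relative to the write head, so 0 means the interpolator touches inBuf[inW] itself and negative values are safely behind it.

```typescript
// Lookback depth L = ceil(G / Rmin) + 1.
function lookbackDepth(grainSize: number, rMin: number): number {
  return Math.ceil(grainSize / rMin) + 1;
}

// Offset (relative to the write head) of the highest sample that linear
// interpolation reads for the last output sample of a grain, i.e. ip + 1.
function maxReadOffset(grainSize: number, ratio: number, lookback: number): number {
  const srcPosMax = -lookback + grainSize / ratio;
  return Math.floor(srcPosMax) + 1;
}

const depth = lookbackDepth(256, 0.6);
console.log(depth);                               // 428
console.log(((depth / 48000) * 1000).toFixed(1)); // "8.9" (ms at 48 kHz)
console.log(maxReadOffset(256, 0.6, 427));        // 0  (reads the write head slot)
console.log(maxReadOffset(256, 0.6, 428));        // -1 (one sample behind it)
```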

Amplitude normalisation

With grain size $G$ and synthesis hop $H/r$, the COLA overlap density changes with $r$. At $r = 1$ with 50% overlap the Hann sum integrates to ~1. At arbitrary $r$, each output point accumulates approximately $r$ Hann-weighted contributions:

$$\text{COLA sum} \approx r \implies \text{scale each grain by } \frac{1}{r}$$

Without this correction, pitch-up ($r > 1$) is louder and pitch-down ($r < 1$) is quieter, with amplitude directly coupled to pitch ratio.

Why Switch to PSOLA?

Fixed-anchor OLA solves the delay-coupling problem and runs cheaply. But it has a fundamental quality limit: grain boundaries don’t align to the signal’s periodicity.

A 256-sample grain cut from a 150 Hz voice at 48 kHz captures 0.8 pitch periods. The next grain starts at a random phase of the next cycle. The Hann window smooths the transition but cannot eliminate the phase discontinuity, which manifests as metallic flutter on sustained vowels, audible at shifts above ±2 semitones.

For stuttering therapy, sustained vowels at 3-6 semitones are exactly the use case.

TD-PSOLA (Time-Domain Pitch Synchronous Overlap-Add) fixes this by aligning grain boundaries to the signal’s own pitch periods. Each grain is exactly $2T_0$ samples, centered on a pitch mark (glottal closure instant). The synthesis hop is $T_0 / r$, which is what changes the perceived fundamental frequency. Because adjacent grains start at the same phase of their respective pitch cycles, overlap-add is phase-coherent and the output is perceptually smooth.

FD-PSOLA (Frequency-Domain PSOLA) is a more complex variant that applies a Fourier transform to each grain, manipulates the spectrum, then inverse transforms back to time domain. It can achieve higher quality at extreme shifts but is more computationally expensive and has higher latency due to the FFT window size. This latency makes it unsuitable for real-time FAF in the browser.

PSOLA Implementation

YIN pitch detection

YIN (de Cheveigné & Kawahara, 2002) estimates $T_0$ via the cumulative mean normalised difference function. The raw difference function is:

$$d(\tau) = \sum_{j=0}^{W-1} \left( x[j] - x[j - \tau] \right)^2$$

Normalised by the running cumulative mean:

$$d'(\tau) = \frac{d(\tau) \cdot \tau}{\sum_{k=1}^{\tau} d(k)}$$

$d'(\tau)$ tends toward 1.0 for aperiodic signals and dips below a threshold (0.15 here) at the true period. The running-sum formulation keeps normalisation at $O(1)$ per $\tau$. We re-run every 4 grains (~20 ms) since $T_0$ doesn’t change faster in natural speech, and smooth the estimate with a first-order IIR:

$$\hat{T}_0 \leftarrow 0.85 \cdot \hat{T}_0 + 0.15 \cdot T_0^{raw}$$

This prevents grain-size discontinuities at voiced/voiced boundaries.
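The two formulas above translate into a short estimator. This is a simplified sketch, not the app's detector: it evaluates $d(\tau)$ directly (quadratic cost), skips YIN's parabolic interpolation step, and assumes the analysis window starts tauMax samples into the buffer so that $x[j - \tau]$ never reads before index 0. The function name and signature are my own.

```typescript
// Returns the estimated period T0 in samples, or null if no dip falls
// below the threshold (treated as unvoiced).
// x must hold at least windowSize + tauMax samples.
function yinPeriod(
  x: Float32Array,
  windowSize: number,
  tauMax: number,
  threshold: number = 0.15
): number | null {
  const cmnd = new Float64Array(tauMax + 1);
  cmnd[0] = 1;
  let runningSum = 0; // sum of d(1..tau), updated in O(1) per lag
  for (let tau = 1; tau <= tauMax; tau++) {
    let d = 0;
    for (let j = tauMax; j < tauMax + windowSize; j++) {
      const diff = x[j] - x[j - tau];
      d += diff * diff;
    }
    runningSum += d;
    // Silence makes the sum zero; report "no periodicity" instead of NaN.
    cmnd[tau] = runningSum > 0 ? (d * tau) / runningSum : 1;
  }
  for (let tau = 2; tau < tauMax; tau++) {
    if (cmnd[tau] < threshold) {
      // Walk down to the local minimum of the first dip.
      while (tau + 1 <= tauMax && cmnd[tau + 1] < cmnd[tau]) tau++;
      return tau;
    }
  }
  return null;
}

// One step of the first-order smoother from the text.
function smoothPeriod(previous: number, raw: number): number {
  return 0.85 * previous + 0.15 * raw;
}
```

Taking the first dip below the threshold, rather than the global minimum, is what keeps the estimator from locking onto integer multiples of the true period.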

Unvoiced fallback

Fricatives, plosive releases, and silence have no $T_0$. YIN returns _voiced = false and we fall through to fixed 128-sample OLA passthrough at 50% overlap with no pitch manipulation. There is no $T_0$ to shift; attempting PSOLA on /s/ produces artifacts.

PSOLA latency and the voice type selector

The processor needs $2T_{max}$ samples of input history before synthesis can begin:

$$T_{max} = \left\lfloor \frac{f_s}{f_{floor}} \right\rfloor, \quad L_{PSOLA} = \frac{2 T_{max}}{f_s} \times 1000 \text{ ms}$$

pitchFloor is passed at node construction via processorOptions, the correct mechanism for init-time configuration that cannot be updated via AudioParam automation:

new AudioWorkletNode(ctx, 'pitch-shifter-psola', {
  processorOptions: { pitchFloor: this._pitchFloor },
});

The voice type selector adjusts this value:

| Voice type | $f_{floor}$ | $T_{max}$ @ 48 kHz | $L_{PSOLA}$ |
|---|---|---|---|
| Deep | 80 Hz | 600 samples | ≈ 25.0 ms |
| Average | 120 Hz | 400 samples | ≈ 16.6 ms |
| High-pitched | 150 Hz | 320 samples | ≈ 13.3 ms |

Matching voice type to your actual pitch range is not cosmetic; it directly controls the algorithm’s latency floor.
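The latency floor for each voice type follows directly from the two formulas above. A minimal sketch, with a hypothetical helper name:

```typescript
// PSOLA history requirement for a given pitch floor.
function psolaLatency(
  sampleRate: number,
  pitchFloorHz: number
): { tMax: number; latencyMs: number } {
  const tMax = Math.floor(sampleRate / pitchFloorHz); // longest period in samples
  const latencyMs = ((2 * tMax) / sampleRate) * 1000; // two periods of history
  return { tMax, latencyMs };
}

for (const floorHz of [80, 120, 150]) {
  const { tMax, latencyMs } = psolaLatency(48000, floorHz);
  console.log(floorHz, tMax, latencyMs.toFixed(2));
}
```

Raising the floor from 80 Hz to 150 Hz nearly halves the history the processor has to wait for, which is the whole reason the selector exists.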

Loading Both Processors in One Blob

Both OLA and PSOLA are available in the app. Since each processor is a self-contained IIFE string, they are concatenated into a single blob and loaded with one addModule() call, with no extra Vite config and no public/ files:

const blob = new Blob([COMBINED_CODE], { type: 'application/javascript' });
await ctx.audioWorklet.addModule(URL.createObjectURL(blob));

The shared Hann table (1024 floats, 4 KB) is allocated once at module scope, not per node instance.

Switching modes (OLA ↔ PSOLA) requires tearing down existing nodes because processorOptions.pitchFloor is init-time only and cannot be changed post-construction. The “same count fast-path” (which updates only pitchRatio AudioParams without graph surgery) is explicitly bypassed on mode switches.

Multi-FAF: N Parallel Signals

The app supports $N$ simultaneous pitch-shifted signals. The graph topology:

$$\text{source} \rightarrow \left\{ \begin{array}{c} \text{fafNode}_1 \\ \vdots \\ \text{fafNode}_N \end{array} \right\} \rightarrow \text{sumGain}\!\left(\tfrac{1}{N}\right) \rightarrow \text{delayNode} \rightarrow \text{out}$$

The normalising gain at $1/N$ keeps perceived loudness constant as $N$ increases. Parallel worklet nodes don’t stack latency; they all process within the same 128-sample scheduler pass.

Each PSOLA node already applies its internal $1/r$ gain correction. The $1/N$ sumGain composes multiplicatively and correctly on top of it.

The Effective Delay Display

Total user-perceived delay follows the same equation as DAF:

$$k_{eff} = k_{user} + k_{sys}$$

where $k_{user}$ is the slider value and

$$k_{sys} = k_{floor} + k_{FAF}$$

with $k_{floor}$ = baseLatency + outputLatency + inputLatency from the AudioContext, and $k_{FAF}$ the algorithmic latency of whichever FAF mode is active. The display recomputes whenever any component changes: slider moved, FAF toggled, session started, graph resumed, voice type changed.

Chrome reports outputLatency = 0 immediately after context creation and updates asynchronously. A 200 ms delayed re-read of measureLatencyFloor catches the stabilised value without a full benchmark run.
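The bookkeeping is a plain sum, but the units are easy to get wrong: AudioContext.baseLatency and AudioContext.outputLatency are reported in seconds, while the slider and the display work in milliseconds. The sketch below assumes a plain object with the three latency components in seconds; the interface, its inputLatency field, and the function name are placeholders for however the app gathers those values.

```typescript
// Latency components in seconds, as the Web Audio API reports them.
interface LatencyReport {
  baseLatency: number;
  outputLatency: number;
  inputLatency: number; // placeholder: how this is measured is app-specific
}

// k_eff = k_user + k_floor + k_FAF, all in milliseconds.
function effectiveDelayMs(
  userDelayMs: number,
  report: LatencyReport,
  fafLatencyMs: number
): number {
  const floorMs =
    (report.baseLatency + report.outputLatency + report.inputLatency) * 1000;
  return userDelayMs + floorMs + fafLatencyMs;
}

// Example: 50 ms slider, 18 ms of system latency, OLA mode at 8.9 ms.
const shownDelay = effectiveDelayMs(
  50,
  { baseLatency: 0.005, outputLatency: 0.01, inputLatency: 0.003 },
  8.9
);
console.log(shownDelay.toFixed(1)); // "76.9"
```

Passing outputLatency as 0, as Chrome reports right after context creation, would simply under-report the total until the delayed re-read corrects it.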

Jitter Measurement

Accurate jitter measurement from an AudioWorklet required solving a non-obvious problem.

The naive approach of measuring intervals between consecutive process() calls using currentTime produces stddev = 0 by definition. currentTime is the audio clock: it advances by exactly $128 / f_s$ seconds per quantum, as guaranteed by the spec. It measures ideal scheduling, not actual scheduling.

performance.now() is not available in AudioWorkletGlobalScope. Date.now() is, but its 1 ms resolution causes its own problem: at 48 kHz the quantum is 2.67 ms, so two consecutive quanta can land on the same millisecond tick, producing 0 ms intervals that accumulate into spurious spikes. Raw intervals from Date.now() have correct mean but meaningless stddev.

The fix is to stop measuring intervals and measure drift from expected time instead. Anchor to both currentFrame (exact integer sample count) and Date.now() (coarse wall clock) at the first quantum:

$$\text{drift}[n] = t_{wall}[n] - \left( t_0 + \frac{f[n] - f_0}{f_s} \times 1000 \right)$$

where $f[n]$ is currentFrame at quantum $n$ and $f_0$, $t_0$ are the anchor values. currentFrame advances in exact samples with no resolution loss. Even with 1 ms Date.now() resolution, drift is meaningful: an 8 ms CPU preemption produces ~8 ms drift regardless of rounding. The standard deviation of drift samples is the canonical jitter metric.

Why drift and not intervals?

An interval measures the gap between two coarse timestamps. A drift measures the deviation of one coarse timestamp from a precise prediction. The prediction is exact (derived from currentFrame), so only one endpoint carries Date.now() coarsening instead of two. If the rounding errors at the two endpoints are independent, this halves the variance of the resolution noise compared to raw intervals.
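The drift computation can be exercised offline on synthetic timestamps. In this sketch, Tick stands for a (currentFrame, Date.now()) pair captured inside process(); the simulator, its stall parameters, and all names are assumptions made for illustration, not the benchmark's actual code.

```typescript
// One sample per render quantum: exact frame counter plus coarse wall clock.
interface Tick { frame: number; wallMs: number }

// drift[n] = wall[n] - (t0 + (f[n] - f0) / fs * 1000), anchored at the first tick.
function driftSeries(ticks: Tick[], sampleRate: number): number[] {
  const f0 = ticks[0].frame;
  const t0 = ticks[0].wallMs;
  return ticks.map(t => t.wallMs - (t0 + ((t.frame - f0) / sampleRate) * 1000));
}

function stddev(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, b) => a + (b - mean) * (b - mean), 0) / values.length;
  return Math.sqrt(variance);
}

// Synthetic schedule: perfectly regular quanta stamped by a 1 ms clock,
// with an optional stall of stallMs applied from quantum stallAt onward.
function simulateTicks(
  count: number,
  sampleRate: number,
  stallAt: number = -1,
  stallMs: number = 0
): Tick[] {
  const ticks: Tick[] = [];
  for (let n = 0; n < count; n++) {
    const frame = n * 128;
    const late = stallAt >= 0 && n >= stallAt ? stallMs : 0;
    const exactMs = 1000000 + (frame / sampleRate) * 1000 + late;
    ticks.push({ frame, wallMs: Math.floor(exactMs) }); // mimic Date.now()
  }
  return ticks;
}

const calm = driftSeries(simulateTicks(1000, 48000), 48000);
const stalled = driftSeries(simulateTicks(1000, 48000, 500, 8), 48000);
console.log(stddev(calm).toFixed(2), Math.max(...stalled).toFixed(2));
```

On the calm schedule the drift never leaves a sub-millisecond band, while the stalled schedule jumps by about 8 ms at the stall and stays there, which is the behaviour the text describes.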

Non-Obvious Bugs

onStateChange null dereference. The original listener:

this._ctx?.addEventListener('statechange', () => cb(this._ctx!.state));

The teardown path calls AudioContext.close() and sets this._ctx = null; the context then fires a final statechange event with state "closed". The non-null assertion (!) is erased at compile time, so the callback reads .state from null and throws. Fix: read state from the event target, not the instance field:

this._ctx?.addEventListener('statechange', (e) =>
  cb((e.target as AudioContext).state)
);

this._graph! after await in teardown/rebuild chain.

this._graph.setFAFNodes([])
  .then(() => this._graph!.setFAFNodes(semitones)) // throws if DAF stopped mid-await

The user can stop DAF between teardown and rebuild, which leaves this._graph null. The ! assertion is compile-time only, so the call throws on null. Fix: this._graph?.setFAFNodes(semitones), which silently no-ops if the session ended.

activateModeBtn called before declaration. Chrome hoists block-scoped function declarations within a block; Firefox and Safari do not (ES2015 strict mode). The initialisation call to activateModeBtn('faf-type', 'single') preceded the function declaration by 10 lines within the same DOMContentLoaded callback. This works on Chrome but throws a TypeError on Firefox. Fixed by reordering the declaration above the call.

OLA vs PSOLA: When Each is Appropriate

| | OLA (Low Latency) | PSOLA (High-Fidelity) |
|---|---|---|
| Latency @ 120 Hz floor | ≈ 8.9 ms | ≈ 16.6 ms |
| Quality at ±4 st | Audible flutter on vowels | Phase-coherent, natural |
| Requires voice type config | No | Yes |
| Computational cost | Low | Low + YIN every ~20 ms |
| Unvoiced handling | Always OLA | Falls back to OLA |

OLA is appropriate when minimum total latency matters and the semitone shift is small ($\leq 2$ st). PSOLA is appropriate for sustained practice sessions at clinical shifts (3-6 st) where audio quality affects the therapeutic experience.

References