Building Frequency Altered Feedback in the Browser

TL;DR

I extended DAF Online with real-time Frequency Altered Feedback (FAF). FAF pitch-shifts your voice in your headphones while you speak, triggering the brain’s choral effect to reduce stuttering. Getting it right took three algorithm iterations, some careful latency accounting, and a custom jitter benchmark that works around a non-obvious limitation of AudioWorkletGlobalScope. This post covers all of it.

Figure 1: The DAF Online interface featuring a clean, minimalist design with the addition of FAF sliders.

What is Frequency Altered Feedback?

Delayed Auditory Feedback (DAF) slows speech by creating a timing mismatch between articulation and perception. I covered DAF in the previous post.

Frequency Altered Feedback (FAF) attacks the same problem through a different mechanism. It shifts the pitch of your voice in your headphones by a ratio $r$ , where

r = 2^{s/12}

and $s$ is the semitone shift. A shift of 3-6 semitones is enough to trigger the choral effect: your brain perceives itself as speaking alongside another voice and drops into a group-speaker processing mode, which disengages the feedback loop that drives stuttering.
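As a quick illustration of the ratio formula, here is a minimal TypeScript sketch. The function names are mine and are not taken from the DAF Online source.

```typescript
// Hypothetical helpers: convert a semitone or cent shift into a pitch ratio r = 2^(s/12).
function semitonesToRatio(semitones: number): number {
  return Math.pow(2, semitones / 12);
}

// 100 cents = 1 semitone, so a cent shift is just a scaled semitone shift.
function centsToRatio(cents: number): number {
  return semitonesToRatio(cents / 100);
}

console.log(semitonesToRatio(6).toFixed(3));  // "1.414" (6 st up)
console.log(semitonesToRatio(-8).toFixed(3)); // "0.630" (8 st down)
```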

The two mechanisms are neurally independent. DAF acts on auditory-motor timing pathways. FAF acts on the choral speech cortical network. Clinical devices like SpeechEasy combine both, which is why DAF Online now supports them together.

Clinical semitone ranges

Goal	Range	Mechanism
Stuttering therapy	$3$ - $6$ semitones	Choral effect. ~35% fluency improvement at 3 st, 65-70% at 6 st. Below ~2 st too subtle; above ~6 st diminishing returns and unnatural quality.
Pitch-Shift Reflex (PSR) research	$50$ - $200$ cents ( $0.5$ - $2$ st)	Small enough that the brain reads the shift as accidental pitch drift, triggering an involuntary compensatory counter-shift within 50-150 ms. Larger shifts are interpreted as an external error and no reflex fires.

The ±300-cent range mode in the app’s cents UI exists specifically for PSR experiments.

Key Takeaway

FAF and DAF engage different neural pathways and their effects add up. In trials, the combination outperforms either alone, which is why clinical hardware uses both.

The Wrong Algorithm First: WSOLA

My initial implementation used WSOLA (Waveform Similarity Overlap-Add). The algorithm advances its input read head at a rate proportional to $r$ :

this._inR += H * r;

For $r > 1$ (pitch up), the read head races ahead of incoming audio and the input buffer drains faster than new samples arrive. For $r < 1$ , it falls behind. Either way, pitch and delay are coupled: setting $+8$ semitones audibly shortened the DAF delay, and $-8$ semitones stretched it. For a tool whose delay is a user-set parameter, that coupling is unacceptable.

Fixed-Anchor OLA: Decoupling Pitch from Delay

The fix is to remove the drifting read head. Every synthesis hop, the analysis position is anchored at a fixed offset behind the write head:

\text{anchor} = w_{in} - L

where $w_{in}$ is the input write head and $L = \lceil G / R_{min} \rceil + 1$ is the lookback depth. Pitch is shifted by resampling within the grain: read $G/r$ input samples and interpolate them into $G$ output samples.

\text{srcPos}(i) = \text{anchor} + \frac{i}{G - 1} \cdot \frac{G}{r}, \quad i \in [0, G-1]

$r > 1:\quad$ $G/r < G$ — fewer input samples stretched across $G$ outputs → higher frequency
$r = 1:\quad$ $G/r = G$ — identity
$r < 1:\quad$ $G/r > G$ — more input samples compressed → lower frequency

The anchor is fixed relative to the write head. The delay is invariant under changes to $r$ .
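The resampling step above can be sketched roughly as follows. This is an illustration under assumptions, not the app's processor: the ring-buffer layout, the parameter names, and the omission of the Hann window and overlap-add are all simplifications of mine.

```typescript
// Sketch of fixed-anchor grain resampling with linear interpolation.
// All names are illustrative; windowing and overlap-add are left out.
function resampleGrain(
  inBuf: Float32Array, // ring buffer of input samples
  inW: number,         // write head: index where the NEXT sample will be written
  L: number,           // lookback depth in samples
  G: number,           // grain length in output samples
  r: number,           // pitch ratio
): Float32Array {
  const N = inBuf.length;
  const out = new Float32Array(G);
  const anchor = inW - L; // fixed offset behind the write head, independent of r
  const span = G / r;     // number of input samples this grain covers
  for (let i = 0; i < G; i++) {
    const srcPos = anchor + (i / (G - 1)) * span;
    const ip = Math.floor(srcPos);
    const frac = srcPos - ip;
    // Wrap both taps into the ring buffer (handles negative positions too).
    const a = inBuf[((ip % N) + N) % N];
    const b = inBuf[(((ip + 1) % N) + N) % N];
    out[i] = a + frac * (b - a);
  }
  return out;
}
```

Because `anchor` depends only on the write head and `L`, changing `r` changes how far the grain reaches forward from the anchor but never moves the anchor itself.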

Choosing $R_{min}$ and the $+1$ guard

The lookback depth is $L = \lceil G / R_{min} \rceil + 1$ . The original implementation used $R_{min} = 0.5$ (one full octave), giving

L = \left\lceil \frac{256}{0.5} \right\rceil + 1 = 513 \text{ samples} \approx 10.7 \text{ ms at 48 kHz}

But the UI slider is hard-limited to ±8 semitones, so the minimum reachable ratio is $2^{-8/12} \approx 0.630$ . The theoretical $R_{min} = 0.5$ was buying latency headroom that is never reachable. Setting $R_{min} = 0.6$ (≈9 st down, giving 1 st of headroom beyond the slider) reduces this to

L = \left\lceil \frac{256}{0.6} \right\rceil + 1 = 428 \text{ samples} \approx 8.9 \text{ ms at 48 kHz}

The $+1$ guards the linear interpolation from reading one sample past the write head at exactly $r = R_{min}$ . Without it:

\text{srcPos}_{max} = (w_{in} - 427) + \frac{256}{0.6} = w_{in} - 0.33

\lfloor w_{in} - 0.33 \rfloor = w_{in} - 1 \implies \text{ip} + 1 = w_{in}

That reads inBuf[inW], one sample past the write head, containing stale ring buffer data from a previous wrap. With $L = 428$ :

\text{srcPos}_{max} = w_{in} - 1.33 \implies \text{ip} + 1 = w_{in} - 1 \checkmark
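The guard arithmetic can be checked numerically. The sketch below uses hypothetical helper names and expresses read positions relative to the write head, so an offset of 0 means "reads the sample at the write head" and negative values are safely behind it.

```typescript
// Lookback depth as defined in the text: ceil(G / R_min) + 1.
function lookbackDepth(G: number, rMin: number): number {
  return Math.ceil(G / rMin) + 1;
}

// Offset (relative to the write head) of the highest sample that linear
// interpolation touches, i.e. floor(srcPos_max) + 1.
function maxReadOffset(G: number, r: number, L: number): number {
  const srcPosMax = -L + G / r;
  return Math.floor(srcPosMax) + 1;
}

console.log(lookbackDepth(256, 0.6));      // 428
console.log(maxReadOffset(256, 0.6, 427)); // 0  -> touches the write head (stale data)
console.log(maxReadOffset(256, 0.6, 428)); // -1 -> one sample behind the write head
```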

Amplitude normalisation

With grain size $G$ and synthesis hop $H/r$ (where $H$ is the hop at $r = 1$ ), the COLA overlap density changes with $r$ . At $r = 1$ with 50% overlap, the overlapped Hann windows sum to ~1. At arbitrary $r$ , each output point accumulates approximately $r$ Hann-weighted contributions:

\text{COLA sum} \approx r \implies \text{scale each grain by } \frac{1}{r}

Without this correction, pitch-up ( $r > 1$ ) is louder and pitch-down ( $r < 1$ ) is quieter, with amplitude directly coupled to pitch ratio.
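To see where the factor of $r$ comes from, the average overlap-add sum can be computed directly: it is the area under one Hann window divided by the hop. The sketch below assumes a periodic Hann window and $H = G/2$; the function name is mine.

```typescript
// Mean overlap-add sum of Hann grains of length G laid down every `hop` samples.
function meanOverlapSum(G: number, hop: number): number {
  let windowArea = 0;
  for (let n = 0; n < G; n++) {
    windowArea += 0.5 - 0.5 * Math.cos((2 * Math.PI * n) / G); // periodic Hann
  }
  // Each output sample receives, on average, windowArea / hop of window weight.
  return windowArea / hop;
}

const G = 256;
const H = G / 2; // 50% overlap at r = 1
for (const r of [0.63, 1, 1.587]) {
  const sum = meanOverlapSum(G, H / r);
  // Second column tracks r; third column (after 1/r scaling) stays at 1.
  console.log(r, sum.toFixed(3), (sum / r).toFixed(3));
}
```

This only captures the average gain. With a hop that is not a clean divisor of the grain length the instantaneous sum still ripples around that average.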

Why Switch to PSOLA?

Fixed-anchor OLA solves the delay-coupling problem and runs cheaply. But it has a fundamental quality limit: grain boundaries don’t align to the signal’s periodicity.

A 256-sample grain cut from a 150 Hz voice at 48 kHz captures 0.8 pitch periods (5.33 ms of a 6.67 ms period). The next grain starts at an arbitrary phase of the next cycle. The Hann window smooths the transition but cannot eliminate the phase discontinuity, which manifests as metallic flutter on sustained vowels, audible at shifts above ±2 semitones.

For stuttering therapy, sustained vowels at 3-6 semitones are exactly the use case.

TD-PSOLA (Time-Domain Pitch Synchronous Overlap-Add) fixes this by aligning grain boundaries to the signal’s own pitch periods. Each grain is exactly $2T_0$ samples, centered on a pitch mark (glottal closure instant). The synthesis hop is $T_0 / r$ , which is what changes the perceived fundamental frequency. Because adjacent grains start at the same phase of their respective pitch cycles, overlap-add is phase-coherent and the output is perceptually smooth.

FD-PSOLA (Frequency-Domain PSOLA) is a more complex variant that applies a Fourier transform to each grain, manipulates the spectrum, then inverse transforms back to time domain. It can achieve higher quality at extreme shifts but is more computationally expensive and has higher latency due to the FFT window size. This latency makes it unsuitable for real-time FAF in the browser.

PSOLA Implementation

YIN pitch detection

YIN (de Cheveigné & Kawahara, 2002) estimates $T_0$ via the cumulative mean normalised difference function. The raw difference function is:

d(\tau) = \sum_{j=0}^{W-1} \left( x[j] - x[j - \tau] \right)^2

Normalised by the running cumulative mean:

d'(\tau) = \frac{d(\tau) \cdot \tau}{\sum_{k=1}^{\tau} d(k)}

$d'(\tau)$ tends toward 1.0 for aperiodic signals and dips below a threshold (0.15 here) at the true period. The running-sum formulation keeps normalisation at $O(1)$ per $\tau$ . We re-run every 4 grains (~20 ms) since $T_0$ doesn’t change faster in natural speech, and smooth the estimate with a first-order IIR:

\hat{T}_0 \leftarrow 0.85 \cdot \hat{T}_0 + 0.15 \cdot T_0^{raw}

This prevents grain-size discontinuities between consecutive voiced frames when the raw estimate jumps.
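A compact sketch of the estimator described above follows. It is an illustration, not the app's code: the function names and buffer layout are assumptions, it compares against `x[j + tau]` rather than `x[j - tau]` to keep indices non-negative (equivalent up to a shift of the analysis window), and it omits refinements such as parabolic interpolation.

```typescript
// Illustrative YIN-style period estimator. Returns a period in samples, or null if unvoiced.
function yinPeriod(
  x: Float32Array,
  tauMin: number,
  tauMax: number,
  threshold = 0.15,
): number | null {
  const W = x.length - tauMax; // integration window that keeps x[j + tau] in range
  if (W <= 0 || tauMin < 1) return null;

  const cmnd = new Float64Array(tauMax + 1);
  cmnd[0] = 1;
  let runningSum = 0; // sum of d(1..tau), updated in O(1) per tau
  for (let tau = 1; tau <= tauMax; tau++) {
    let d = 0;
    for (let j = 0; j < W; j++) {
      const diff = x[j] - x[j + tau];
      d += diff * diff;
    }
    runningSum += d;
    cmnd[tau] = runningSum > 0 ? (d * tau) / runningSum : 1;
  }

  // First dip below the threshold, then walk down to the local minimum.
  for (let tau = tauMin; tau <= tauMax; tau++) {
    if (cmnd[tau] < threshold) {
      while (tau + 1 <= tauMax && cmnd[tau + 1] < cmnd[tau]) tau++;
      return tau;
    }
  }
  return null;
}

// First-order IIR smoothing of the period estimate, using the coefficients from the text.
function smoothT0(prev: number, raw: number): number {
  return 0.85 * prev + 0.15 * raw;
}
```

The inner loop makes this O(W · tauMax) per run, which is one reason to re-run it only every few grains rather than on every hop.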

Unvoiced fallback

Fricatives, plosive releases, and silence have no $T_0$ . YIN returns _voiced = false and we fall through to fixed 128-sample OLA passthrough at 50% overlap with no pitch manipulation. There is no $T_0$ to shift; attempting PSOLA on /s/ produces artifacts.
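Putting the voiced and unvoiced paths side by side, the per-frame choice of grain length and synthesis hop might look like the sketch below. The `GrainPlan` type and `planGrain` function are hypothetical names, and rounding $T_0$ to whole samples is a simplification I have assumed.

```typescript
interface GrainPlan {
  grainLength: number;  // samples per grain
  synthesisHop: number; // spacing between successive output grains, in samples
  shiftPitch: boolean;  // false means plain OLA passthrough
}

function planGrain(voiced: boolean, t0: number, r: number): GrainPlan {
  if (!voiced) {
    // Unvoiced fallback: fixed 128-sample grains at 50% overlap, no pitch change.
    return { grainLength: 128, synthesisHop: 64, shiftPitch: false };
  }
  // Voiced: grains span two pitch periods and are re-spaced at T0 / r.
  return { grainLength: 2 * Math.round(t0), synthesisHop: t0 / r, shiftPitch: true };
}
```

For a 150 Hz voice at 48 kHz ( $T_0 = 320$ samples) shifted up by a ratio of 2, this gives 640-sample grains spaced 160 samples apart.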

PSOLA latency and the voice type selector

The processor needs $2T_{max}$ samples of input history before synthesis can begin:

T_{max} = \left\lfloor \frac{f_s}{f_{floor}} \right\rfloor, \quad L_{PSOLA} = \frac{2 T_{max}}{f_s} \times 1000 \text{ ms}

pitchFloor is passed at node construction via processorOptions, the correct mechanism for init-time configuration that cannot be updated via AudioParam automation:

new AudioWorkletNode(ctx, 'pitch-shifter-psola', {
  processorOptions: { pitchFloor: this._pitchFloor },
});

The voice type selector adjusts this value:

Voice type	$f_{floor}$	$T_{max}$ @ 48 kHz	$L_{PSOLA}$
Deep	80 Hz	600 samples	≈ 25.0 ms
Average	120 Hz	400 samples	≈ 16.7 ms
High-pitched	150 Hz	320 samples	≈ 13.3 ms

Matching voice type to your actual pitch range is not cosmetic; it directly controls the algorithm’s latency floor.
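The latency formula is easy to evaluate for any pitch floor. A small sketch, with a function name of my own choosing:

```typescript
// Algorithmic latency of the PSOLA path for a given sample rate and pitch floor.
function psolaLatencyMs(
  sampleRate: number,
  pitchFloorHz: number,
): { tMax: number; latencyMs: number } {
  const tMax = Math.floor(sampleRate / pitchFloorHz); // longest period, in samples
  const latencyMs = (2 * tMax * 1000) / sampleRate;   // two periods of history
  return { tMax, latencyMs };
}

for (const floorHz of [80, 120, 150]) {
  const { tMax, latencyMs } = psolaLatencyMs(48000, floorHz);
  console.log(floorHz, tMax, latencyMs.toFixed(2));
}
```

At other sample rates the sample counts change but the millisecond figures stay nearly the same, since $2T_{max}/f_s \approx 2/f_{floor}$ .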

Loading Both Processors in One Blob

Both OLA and PSOLA are available in the app. Since each processor is a self-contained IIFE string, they are concatenated into a single blob and loaded with one addModule() call, with no extra Vite config and no public/ files:

const blob = new Blob([COMBINED_CODE], { type: 'application/javascript' });
await ctx.audioWorklet.addModule(URL.createObjectURL(blob));

The shared Hann table (1024 floats, 4 KB) is allocated once at module scope, not per node instance.

Switching modes (OLA ↔ PSOLA) requires tearing down existing nodes because processorOptions.pitchFloor is init-time only and cannot be changed post-construction. The “same count fast-path” (which updates only pitchRatio AudioParams without graph surgery) is explicitly bypassed on mode switches.

Multi-FAF: N Parallel Signals

The app supports $N$ simultaneous pitch-shifted signals. The graph topology:

\text{source} \rightarrow \left\{ \begin{array}{c} \text{fafNode}_1 \\ \vdots \\ \text{fafNode}_N \end{array} \right\} \rightarrow \text{sumGain}\!\left(\tfrac{1}{N}\right) \rightarrow \text{delayNode} \rightarrow \text{out}

The normalising gain at $1/N$ keeps perceived loudness constant as $N$ increases. Parallel worklet nodes don’t stack latency; they all process within the same 128-sample scheduler pass.

Each PSOLA node already applies its internal $1/r$ gain correction. The $1/N$ sumGain composes multiplicatively and correctly on top of it.

The Effective Delay Display

Total user-perceived delay follows the same equation as DAF:

k_{eff} = k_{user} + k_{sys}

where $k_{user}$ is the slider value and

k_{sys} = k_{floor} + k_{FAF}

with $k_{floor}$ = baseLatency + outputLatency + inputLatency from the AudioContext, and $k_{FAF}$ the algorithmic latency of whichever FAF mode is active. The display recomputes whenever any component changes: slider moved, FAF toggled, session started, graph resumed, voice type changed.

Chrome reports outputLatency = 0 immediately after context creation and updates asynchronously. A 200 ms delayed re-read of measureLatencyFloor catches the stabilised value without a full benchmark run.
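The delay bookkeeping reduces to a few additions. In the sketch below the `LatencySources` shape and the function name are assumptions; in particular, how `inputLatency` is obtained is app-specific and is simply taken as an input here.

```typescript
interface LatencySources {
  baseLatency: number;   // seconds, e.g. from AudioContext.baseLatency
  outputLatency: number; // seconds, e.g. from AudioContext.outputLatency (may read 0 early on)
  inputLatency: number;  // seconds; source of this value is assumed, not specified
}

// k_eff = k_user + k_sys, with k_sys = k_floor + k_FAF (all in milliseconds).
function effectiveDelayMs(userMs: number, src: LatencySources, fafMs: number): number {
  const floorMs = (src.baseLatency + src.outputLatency + src.inputLatency) * 1000;
  const sysMs = floorMs + fafMs;
  return userMs + sysMs;
}

// Example: 50 ms slider, 20 ms device floor, 8.9 ms OLA lookback.
const example = effectiveDelayMs(
  50,
  { baseLatency: 0.005, outputLatency: 0.01, inputLatency: 0.005 },
  8.9,
);
console.log(example.toFixed(1)); // "78.9"
```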

Jitter Measurement

Accurate jitter measurement from an AudioWorklet required solving a non-obvious problem.

The naive approach of measuring intervals between consecutive process() calls using currentTime produces stddev = 0 by definition. currentTime is the audio clock: it advances by exactly $128 / f_s$ seconds per quantum, as guaranteed by the spec. It measures ideal scheduling, not actual scheduling.

performance.now() is not available in AudioWorkletGlobalScope. Date.now() is, but its 1 ms resolution causes its own problem: at 48 kHz the quantum is 2.67 ms, so raw intervals quantise to 2 or 3 ms, and quanta that the browser renders back to back can land on the same millisecond tick, producing 0 ms intervals followed by spurious spikes. Raw intervals from Date.now() have the correct mean but a meaningless stddev.

The fix is to stop measuring intervals and measure drift from expected time instead. Anchor to both currentFrame (exact integer sample count) and Date.now() (coarse wall clock) at the first quantum:

\text{drift}[n] = t_{wall}[n] - \left( t_0 + \frac{f[n] - f_0}{f_s} \times 1000 \right)

where $f[n]$ is currentFrame at quantum $n$ and $f_0$ , $t_0$ are the anchor values. currentFrame advances in exact samples with no resolution loss. Even with 1 ms Date.now() resolution, drift is meaningful: an 8 ms CPU preemption produces ~8 ms drift regardless of rounding. The standard deviation of drift samples is the canonical jitter metric.
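The drift formula can be expressed as a pure function over recorded samples. This sketch assumes the worklet has already collected parallel arrays of `currentFrame` values and `Date.now()` readings and posted them to the main thread; the function name and that collection scheme are my assumptions.

```typescript
// Drift of each wall-clock reading from the time predicted by the frame counter.
function driftJitterMs(
  frames: number[], // currentFrame at each quantum
  wallMs: number[], // Date.now() at each quantum
  sampleRate: number,
): { drift: number[]; stddev: number } {
  if (frames.length === 0 || frames.length !== wallMs.length) {
    return { drift: [], stddev: 0 };
  }
  const f0 = frames[0];
  const t0 = wallMs[0];
  const drift = frames.map(
    (f, n) => wallMs[n] - (t0 + ((f - f0) / sampleRate) * 1000),
  );
  const mean = drift.reduce((acc, d) => acc + d, 0) / drift.length;
  const variance =
    drift.reduce((acc, d) => acc + (d - mean) ** 2, 0) / drift.length;
  return { drift, stddev: Math.sqrt(variance) };
}
```

The anchor pair only adds a constant offset to every drift sample, so the rounding error in $t_0$ shifts the mean but does not affect the standard deviation.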

Why drift and not intervals?

An interval measures the gap between two coarse timestamps. A drift measures the deviation of one coarse timestamp from a precise prediction. The prediction is exact (derived from currentFrame), so only one endpoint carries Date.now() rounding error instead of two. Assuming the two rounding errors are independent, that halves the quantisation-noise variance compared to raw intervals.

Non-Obvious Bugs

onStateChange null dereference. The original listener:

this._ctx?.addEventListener('statechange', () => cb(this._ctx!.state));

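The teardown path calls AudioContext.close() and sets this._ctx = null; the context then fires a final statechange event with state "closed". By that point this._ctx is null, and because the ! assertion is erased at compile time, reading .state throws a TypeError. Fix: read state from the event target, not the instance field: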

this._ctx?.addEventListener('statechange', (e) =>
  cb((e.target as AudioContext).state)
);

this._graph! after await in teardown/rebuild chain.

this._graph.setFAFNodes([])
  .then(() => this._graph!.setFAFNodes(semitones)) // throws if DAF stopped mid-await

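The user can stop DAF between teardown and rebuild, leaving this._graph null when the .then callback runs. The ! assertion is compile-time only, so the call throws a TypeError at runtime. Fix: this._graph?.setFAFNodes(semitones), which silently no-ops if the session ended.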

activateModeBtn called before declaration. The initialisation call to activateModeBtn('faf-type', 'single') preceded the function declaration by 10 lines within the same DOMContentLoaded callback. This worked on Chrome but threw a TypeError on Firefox: how function declarations nested inside a block are hoisted has historically differed between engines, so code that relies on it is fragile. Fixed by moving the declaration above the call.

OLA vs PSOLA: When Each is Appropriate

	OLA (Low Latency)	PSOLA (High-Fidelity)
Latency @ 120 Hz floor	≈ 8.9 ms	≈ 16.7 ms
Quality at ±4 st	Audible flutter on vowels	Phase-coherent, natural
Requires voice type config	No	Yes
Computational cost	Low	Low + YIN every ~20 ms
Unvoiced handling	Always OLA	Falls back to OLA

OLA is appropriate when minimum total latency matters and the semitone shift is small ( $\leq 2$ st). PSOLA is appropriate for sustained practice sessions at clinical shifts (3-6 st) where audio quality affects the therapeutic experience.

You can try it yourself at DAF Online.

References

Natke, U., et al. (2001). Fluency, fundamental frequency, and speech rate under frequency-shifted auditory feedback in stuttering and nonstuttering persons. Journal of the Acoustical Society of America.
Kalinowski, J., et al. (1996). Effect of alterations in auditory feedback and speech rate on stuttering frequency. Journal of Speech and Hearing Research, 39, 396–407.
