TL;DR
I extended DAF Online with real-time Frequency Altered Feedback (FAF). FAF pitch-shifts your voice in your headphones while you speak, triggering the brain’s choral effect to reduce stuttering. Getting it right took three algorithm iterations, some careful latency accounting, and a custom jitter benchmark that works around a non-obvious limitation of AudioWorkletGlobalScope. This post covers all of it.
What is Frequency Altered Feedback?
Delayed Auditory Feedback (DAF) slows speech by creating a timing mismatch between articulation and perception. I covered DAF in the previous post.
Frequency Altered Feedback (FAF) attacks the same problem through a different mechanism. It shifts the pitch of your voice in your headphones by a ratio $r$, where

$$r = 2^{s/12}$$

and $s$ is the semitone shift. A shift of 3-6 semitones is enough to trigger the choral effect: your brain perceives itself as speaking alongside another voice and drops into a group-speaker processing mode, which disengages the feedback loop that drives stuttering.
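The semitone-to-ratio mapping is small enough to show directly. The following is a minimal sketch; the helper names are illustrative and not taken from the app's source.

```typescript
// Hypothetical helpers; names are illustrative only.

/** Convert a shift in semitones to a frequency ratio, r = 2^(s/12). */
function semitonesToRatio(semitones: number): number {
  return Math.pow(2, semitones / 12);
}

/** Convert a shift in cents (100 cents = 1 semitone) to a frequency ratio. */
function centsToRatio(cents: number): number {
  return Math.pow(2, cents / 1200);
}

/** Inverse mapping: frequency ratio back to semitones. */
function ratioToSemitones(ratio: number): number {
  return 12 * Math.log2(ratio);
}

const ratioUp3 = semitonesToRatio(3);      // about 1.189
const ratioUp6 = semitonesToRatio(6);      // about 1.414
const ratioOctave = semitonesToRatio(12);  // 2
```

So the 3-6 semitone therapy range corresponds to playing the voice back roughly 19% to 41% higher in frequency.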
The two mechanisms are neurally independent. DAF acts on auditory-motor timing pathways. FAF acts on the choral speech cortical network. Clinical devices like SpeechEasy combine both, which is why DAF Online now supports them together.
Clinical semitone ranges
| Goal | Range | Mechanism |
|---|---|---|
| Stuttering therapy | 3-6 semitones | Choral effect. ~35% fluency improvement at 3 st, 65-70% at 6 st. Below ~2 st the shift is too subtle; above ~6 st returns diminish and the voice sounds unnatural. |
| Pitch-Shift Reflex (PSR) research | - cents (- st) | Small enough that the brain reads the shift as accidental pitch drift, triggering an involuntary compensatory counter-shift within 50-150 ms. Larger shifts are interpreted as an external error and no reflex fires. |
The ±300-cent range mode in the app’s cents UI exists specifically for PSR experiments.
FAF and DAF engage different neural pathways and their effects add up. In trials, the combination outperforms either alone, which is why clinical hardware uses both.
The Wrong Algorithm First: WSOLA
My initial implementation used WSOLA (Waveform Similarity Overlap-Add). The algorithm advances its input read head at a rate proportional to $r$:
this._inR += H * r;
For $r > 1$ (pitch up), the read head races ahead of incoming audio and the input buffer drains faster than new samples arrive. For $r < 1$, it falls behind. In both cases pitch and delay are coupled: a positive semitone shift audibly shortened the DAF delay, and a negative shift stretched it. That coupling makes WSOLA unsuitable for a tool whose delay must stay where the user set it.
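To make the coupling concrete, here is a small sketch that tracks only the two head positions, not audio. The hop size, starting lag, and function name are assumptions chosen for illustration.

```typescript
// Illustration only: simulates head positions, not the author's WSOLA code.
// hopSize = 128 and initialLag = 2400 samples (50 ms at 48 kHz) are assumed values.
function readHeadLagAfter(
  hops: number,
  r: number,
  hopSize = 128,
  initialLag = 2400
): number {
  let writeHead = initialLag; // samples written so far
  let readHead = 0;           // samples consumed so far
  for (let i = 0; i < hops; i++) {
    writeHead += hopSize;     // live input arrives at 1x speed
    readHead += hopSize * r;  // read head advances at r x speed
  }
  return writeHead - readHead; // remaining distance = effective delay in samples
}

const lagUnshifted = readHeadLagAfter(100, 1.0); // stays at 2400
const lagPitchUp = readHeadLagAfter(100, 1.1);   // shrinks
const lagPitchDown = readHeadLagAfter(100, 0.9); // grows
```

After 100 hops the lag has moved by `100 * 128 * (1 - r)` samples, so the delay the listener hears depends on the pitch setting.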
Fixed-Anchor OLA: Decoupling Pitch from Delay
The fix is to remove the drifting read head. Every synthesis hop, the analysis position is anchored at a fixed offset behind the write head:

$$a = w - L$$

where $w$ is the input write head and $L$ is the lookback depth. Pitch is shifted by resampling within the grain: read $G/r$ input samples and interpolate them into $G$ output samples.
- $r > 1$: fewer input samples stretched across $G$ outputs → higher frequency
- $r = 1$: identity
- $r < 1$: more input samples compressed → lower frequency
The anchor is fixed relative to the write head, so the delay is invariant under changes to $r$.
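Choosing $r_{\min}$ and the $+1$ guard
The lookback depth is $L = \lceil G / r_{\min} \rceil + 1$ for grain size $G = 256$. The original implementation used $r_{\min} = 0.5$ (one full octave), giving

$$L = \lceil 256 / 0.5 \rceil + 1 = 513 \text{ samples} \approx 10.7 \text{ ms at 48 kHz}$$

But the UI slider is hard-limited to ±8 semitones, so the minimum reachable ratio is $2^{-8/12} \approx 0.63$. The theoretical $r_{\min} = 0.5$ was paying latency for headroom that is never reachable. Setting $r_{\min} = 0.6$ (≈9 st down, giving 1 st of headroom beyond the slider) reduces this to

$$L = \lceil 256 / 0.6 \rceil + 1 = 428 \text{ samples} \approx 8.9 \text{ ms}$$

The $+1$ guards the linear interpolation from reading one sample past the write head at exactly $r = r_{\min}$. Without it, the grain spans the full lookback and its last interpolation tap lands on the write head itself. That reads inBuf[inW], one sample past the write head, containing stale ring buffer data from a previous wrap. With the $+1$, the same tap lands on inBuf[inW - 1], the most recently written sample.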
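The lookback arithmetic can be checked with a few lines. This is a sketch under stated assumptions: a 256-sample grain, a 48 kHz sample rate, a ratio floor of 0.6 for the trimmed case, and a ceiling applied so that a fractional grain span still yields a whole number of samples. The function names are hypothetical.

```typescript
// Assumptions (not confirmed by the app's source): grain size 256, 48 kHz,
// and rounding the grain span up before adding the one-sample guard.
function lookbackSamples(grainSize: number, rMin: number): number {
  const span = Math.ceil(grainSize / rMin); // input samples a grain needs at the lowest ratio
  return span + 1;                          // +1 keeps the last interpolation tap behind the write head
}

function samplesToMs(samples: number, sampleRate = 48000): number {
  return (samples / sampleRate) * 1000;
}

const sliderMinRatio = Math.pow(2, -8 / 12);         // about 0.63, the lowest ratio the slider reaches
const lookbackOctave = lookbackSamples(256, 0.5);    // 513 samples
const lookbackTrimmed = lookbackSamples(256, 0.6);   // 428 samples
const latencyOctaveMs = samplesToMs(lookbackOctave); // about 10.7 ms
const latencyTrimmedMs = samplesToMs(lookbackTrimmed); // about 8.9 ms
```

Because 0.6 is still below the slider's minimum ratio, the trimmed lookback covers every reachable setting while saving a little under 2 ms.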
Amplitude normalisation
With the grain size and synthesis hop held fixed, the COLA overlap density changes with $r$. At $r = 1$ with 50% overlap the Hann sum integrates to ~1. At arbitrary $r$, the amount of Hann-weighted signal accumulated at each output point grows with $r$, so the output gain is scaled to compensate.
Without this correction, pitch-up ($r > 1$) is louder and pitch-down ($r < 1$) is quieter, with amplitude directly coupled to pitch ratio.
Why Switch to PSOLA?
Fixed-anchor OLA solves the delay-coupling problem and runs cheaply. But it has a fundamental quality limit: grain boundaries don’t align to the signal’s periodicity.
A 256-sample grain cut from a 150 Hz voice at 48 kHz captures 0.8 pitch periods. The next grain starts at a random phase of the next cycle. The Hann window smooths the transition but cannot eliminate the phase discontinuity, which manifests as metallic flutter on sustained vowels, audible at shifts above ±2 semitones.
For stuttering therapy, sustained vowels at 3-6 semitones are exactly the use case.
TD-PSOLA (Time-Domain Pitch Synchronous Overlap-Add) fixes this by aligning grain boundaries to the signal’s own pitch periods. Each grain is exactly $2T_0$ samples, centered on a pitch mark (glottal closure instant), where $T_0$ is the local pitch period. The synthesis hop is $T_0/r$, which is what changes the perceived fundamental frequency. Because adjacent grains start at the same phase of their respective pitch cycles, overlap-add is phase-coherent and the output is perceptually smooth.
FD-PSOLA (Frequency-Domain PSOLA) is a more complex variant that applies a Fourier transform to each grain, manipulates the spectrum, then inverse transforms back to time domain. It can achieve higher quality at extreme shifts but is more computationally expensive and has higher latency due to the FFT window size. This latency makes it unsuitable for real-time FAF in the browser.
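The TD-PSOLA placement rule (two-period grains, re-spaced at the pitch period divided by the ratio) can be sketched on a synthetic signal. This is not the app's processor: it assumes a constant, known pitch period, pitch marks at exact multiples of that period, and a ratio for which the new spacing is a whole number of samples. All names are illustrative.

```typescript
// Sketch under strong assumptions: constant known period t0 (samples),
// pitch marks at multiples of t0, and t0 / r an integer.

/** Periodic Hann window of length n. */
function hannWindow(n: number): Float32Array {
  const w = new Float32Array(n);
  for (let i = 0; i < n; i++) w[i] = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / n);
  return w;
}

function tdPsolaConstantPitch(x: Float32Array, t0: number, r: number): Float32Array {
  const grainLen = 2 * t0; // each grain spans two pitch periods
  const synHop = t0 / r;   // synthesis marks are spaced t0 / r apart
  const win = hannWindow(grainLen);
  const y = new Float32Array(x.length);

  for (let k = 0; ; k++) {
    const synCentre = t0 + k * synHop;           // k-th synthesis mark
    const outStart = Math.round(synCentre) - t0; // grain is centred on the mark
    // Reuse the analysis mark nearest in time, so overall duration is preserved.
    const mark = Math.max(1, Math.round(synCentre / t0)) * t0;
    const inStart = mark - t0;
    if (outStart + grainLen > y.length || inStart + grainLen > x.length) break;
    for (let i = 0; i < grainLen; i++) {
      y[outStart + i] += win[i] * x[inStart + i];
    }
  }
  return y; // amplitude is not renormalised in this sketch
}

// Usage: a waveform with a 400-sample period (120 Hz at 48 kHz), shifted by r = 1.25.
const demoPeriod = 400;
const demoInput = new Float32Array(8000);
for (let n = 0; n < demoInput.length; n++) {
  const phase = (2 * Math.PI * (n % demoPeriod)) / demoPeriod;
  demoInput[n] = Math.sin(phase) + 0.5 * Math.sin(2 * phase);
}
const shifted = tdPsolaConstantPitch(demoInput, demoPeriod, 1.25);
```

Away from the buffer edges the output repeats every 320 samples instead of 400, which is the 1.25x rise in fundamental frequency. A real implementation also has to find the pitch marks and track a changing period, which this sketch sidesteps.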
PSOLA Implementation
YIN pitch detection
YIN (de Cheveigné & Kawahara, 2002) estimates $F_0$ via the cumulative mean normalised difference function. The raw difference function is:

$$d(\tau) = \sum_{j=1}^{W} \left(x_j - x_{j+\tau}\right)^2$$

Normalised by the running cumulative mean:

$$d'(\tau) = \begin{cases} 1, & \tau = 0 \\ \dfrac{d(\tau)}{\frac{1}{\tau}\sum_{j=1}^{\tau} d(j)}, & \tau > 0 \end{cases}$$

$d'(\tau)$ tends toward 1.0 for aperiodic signals and dips below a threshold (0.15 here) at the true period. The running-sum formulation keeps normalisation at $O(1)$ per $\tau$. We re-run every 4 grains (~20 ms) since $F_0$ doesn’t change faster in natural speech, and smooth the estimate with a first-order IIR filter.
This prevents grain-size discontinuities at voiced/voiced boundaries.
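A compact version of the YIN steps described above might look like the following. It is a sketch rather than the app's detector: it omits parabolic interpolation, the function names are hypothetical, and the smoothing coefficient of 0.5 is an assumed value.

```typescript
/** Cumulative mean normalised difference d'(tau) for tau in [0, maxLag). */
function cmndf(x: Float32Array, windowSize: number, maxLag: number): Float32Array {
  const out = new Float32Array(maxLag);
  out[0] = 1;
  let runningSum = 0; // running sum of d(1..tau), so normalisation is O(1) per lag
  for (let tau = 1; tau < maxLag; tau++) {
    let d = 0;
    for (let j = 0; j < windowSize; j++) {
      const diff = x[j] - x[j + tau];
      d += diff * diff;
    }
    runningSum += d;
    out[tau] = runningSum > 0 ? (d * tau) / runningSum : 1;
  }
  return out;
}

/** Returns the period in samples, or null when the frame looks unvoiced. */
function detectPeriod(
  x: Float32Array,
  sampleRate: number,
  minHz: number,
  maxHz: number,
  threshold = 0.15
): number | null {
  const maxLag = Math.floor(sampleRate / minHz) + 1;
  const minLag = Math.max(1, Math.floor(sampleRate / maxHz));
  const windowSize = x.length - maxLag;
  if (windowSize <= 0) return null;
  const dn = cmndf(x, windowSize, maxLag);
  for (let tau = minLag; tau < maxLag; tau++) {
    if (dn[tau] < threshold) {
      // Walk down to the local minimum below the threshold.
      while (tau + 1 < maxLag && dn[tau + 1] < dn[tau]) tau++;
      return tau;
    }
  }
  return null;
}

/** First-order IIR smoothing of successive estimates; alpha = 0.5 is an assumption. */
function smoothPeriod(previous: number, measured: number, alpha = 0.5): number {
  return alpha * previous + (1 - alpha) * measured;
}

// Usage: a 120 Hz sine at 48 kHz has a period of 400 samples.
const yinRate = 48000;
const voicedFrame = new Float32Array(2048);
for (let n = 0; n < voicedFrame.length; n++) {
  voicedFrame[n] = Math.sin((2 * Math.PI * 120 * n) / yinRate);
}
const detectedPeriod = detectPeriod(voicedFrame, yinRate, 80, 400);
const silentPeriod = detectPeriod(new Float32Array(2048), yinRate, 80, 400);
```

The `null` return plays the role of the voiced/unvoiced flag: a silent or noisy frame never dips below the threshold, so no period is reported.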
Unvoiced fallback
Fricatives, plosive releases, and silence have no $F_0$. YIN returns _voiced = false and we fall through to fixed 128-sample OLA passthrough at 50% overlap with no pitch manipulation. There is no $F_0$ to shift; attempting PSOLA on /s/ produces artifacts.
PSOLA latency and the voice type selector
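The processor needs $2T_{\max}$ samples of input history before synthesis can begin, where $T_{\max} = f_s / F_{0,\min}$ is the longest pitch period allowed by the pitch floor $F_{0,\min}$:

$$\text{latency} = \frac{2T_{\max}}{f_s} = \frac{2}{F_{0,\min}}$$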
pitchFloor is passed at node construction via processorOptions, the correct mechanism for init-time configuration that cannot be updated via AudioParam automation:
new AudioWorkletNode(ctx, 'pitch-shifter-psola', {
  processorOptions: { pitchFloor: this._pitchFloor },
});
The voice type selector adjusts this value:
| Voice type | Pitch floor | Longest period @ 48 kHz | PSOLA latency |
|---|---|---|---|
| Deep | 80 Hz | 600 samples | ≈ 25.0 ms |
| Average | 120 Hz | 400 samples | ≈ 16.7 ms |
| High-pitched | 150 Hz | 320 samples | ≈ 13.3 ms |
Matching voice type to your actual pitch range is not cosmetic; it directly controls the algorithm’s latency floor.
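The table's figures are consistent with a latency of two of the longest pitch periods, which can be computed from the pitch floor alone. A minimal sketch, with a hypothetical function name and a 48 kHz default:

```typescript
// Assumes latency = 2 * (longest pitch period), which matches the table above.
function psolaLatencyMs(pitchFloorHz: number, sampleRate = 48000): number {
  const longestPeriod = sampleRate / pitchFloorHz; // samples per cycle at the pitch floor
  return ((2 * longestPeriod) / sampleRate) * 1000;
}

const latencyDeepMs = psolaLatencyMs(80);     // about 25.0 ms
const latencyAverageMs = psolaLatencyMs(120); // about 16.7 ms
const latencyHighMs = psolaLatencyMs(150);    // about 13.3 ms
```

The sample rate cancels out, so the latency floor is simply `2000 / pitchFloorHz` milliseconds.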
Loading Both Processors in One Blob
Both OLA and PSOLA are available in the app. Since each processor is a self-contained IIFE string, they are concatenated into a single blob and loaded with one addModule() call, with no extra Vite config and no public/ files:
const blob = new Blob([COMBINED_CODE], { type: 'application/javascript' });
await ctx.audioWorklet.addModule(URL.createObjectURL(blob));
The shared Hann table (1024 floats, 4 KB) is allocated once at module scope, not per node instance.
Switching modes (OLA ↔ PSOLA) requires tearing down existing nodes because processorOptions.pitchFloor is init-time only and cannot be changed post-construction. The “same count fast-path” (which updates only pitchRatio AudioParams without graph surgery) is explicitly bypassed on mode switches.
Multi-FAF: N Parallel Signals
The app supports $N$ simultaneous pitch-shifted signals. In the graph, the $N$ pitch-shifter nodes run in parallel and their outputs are summed through a shared normalising gain (sumGain), which keeps perceived loudness constant as $N$ increases. Parallel worklet nodes don’t stack latency; they all process within the same 128-sample scheduler pass.
Each PSOLA node already applies its internal gain correction. The sumGain composes multiplicatively and correctly on top of it.
The Effective Delay Display
Total user-perceived delay follows the same equation as DAF:

$$D_{\text{total}} = D_{\text{slider}} + D_{\text{floor}}$$

where $D_{\text{slider}}$ is the slider value and

$$D_{\text{floor}} = D_{\text{hw}} + D_{\text{alg}}$$

with $D_{\text{hw}}$ = baseLatency + outputLatency + inputLatency from the AudioContext, and $D_{\text{alg}}$ the algorithmic latency of whichever FAF mode is active. The display recomputes whenever any component changes: slider moved, FAF toggled, session started, graph resumed, voice type changed.
Chrome reports outputLatency = 0 immediately after context creation and updates asynchronously. A 200 ms delayed re-read of measureLatencyFloor catches the stabilised value without a full benchmark run.
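As a rough illustration of how the displayed figure could be assembled, the sketch below assumes the components simply add, and that the hardware latencies are supplied in seconds (the unit Web Audio uses for `baseLatency` and `outputLatency`). The interface and function names are hypothetical.

```typescript
// Hypothetical shape; the additive model is an assumption, not the app's confirmed formula.
interface LatencyParts {
  sliderMs: number;       // user-selected delay, in milliseconds
  baseLatencyS: number;   // seconds
  outputLatencyS: number; // seconds (may read 0 right after context creation)
  inputLatencyS: number;  // seconds
  algorithmicMs: number;  // FAF algorithm latency in milliseconds, 0 when FAF is off
}

function effectiveDelayMs(p: LatencyParts): number {
  const hardwareMs = (p.baseLatencyS + p.outputLatencyS + p.inputLatencyS) * 1000;
  return p.sliderMs + hardwareMs + p.algorithmicMs;
}

const displayedMs = effectiveDelayMs({
  sliderMs: 50,
  baseLatencyS: 0.005,
  outputLatencyS: 0.010,
  inputLatencyS: 0.005,
  algorithmicMs: 8.9,
}); // about 78.9 ms
```

Re-running this whenever any input changes, including the delayed re-read of the output latency, keeps the display in step with the graph.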
Jitter Measurement
Accurate jitter measurement from an AudioWorklet required solving a non-obvious problem.
The naive approach of measuring intervals between consecutive process() calls using currentTime produces stddev = 0 by definition. currentTime is the audio clock: it advances by exactly $128 / f_s$ seconds per quantum, as guaranteed by the spec. It measures ideal scheduling, not actual scheduling.
performance.now() is not available in AudioWorkletGlobalScope. Date.now() is, but its 1 ms resolution causes its own problem: at 48 kHz the quantum is 2.67 ms, so two consecutive quanta can land on the same millisecond tick, producing 0 ms intervals that accumulate into spurious spikes. Raw intervals from Date.now() have correct mean but meaningless stddev.
The fix is to stop measuring intervals and measure drift from expected time instead. Anchor to both currentFrame (exact integer sample count) and Date.now() (coarse wall clock) at the first quantum:

$$\text{drift}_k = (t_k - t_0) - \frac{n_k - n_0}{f_s} \times 1000 \ \text{ms}$$

where $n_k$ is currentFrame and $t_k$ is Date.now() at quantum $k$, and $n_0$, $t_0$ are the anchor values. currentFrame advances in exact sample counts with no resolution loss. Even with 1 ms Date.now() resolution, drift is meaningful: an 8 ms CPU preemption produces ~8 ms drift regardless of rounding. The standard deviation of drift samples is the canonical jitter metric.
Why drift and not intervals?
An interval measures the gap between two coarse timestamps. A drift measures the deviation of one coarse timestamp from a precise prediction. The prediction is exact (derived from currentFrame), so the noise is one-sided: only Date.now() contributes coarsening, not both endpoints. This halves the effective resolution noise compared to raw intervals.
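The drift calculation can be sketched outside the worklet by feeding it recorded timestamps. In the sketch below the record shape and function names are hypothetical, and the data is synthetic: 100 quanta at 48 kHz with an 8 ms stall injected halfway through.

```typescript
// Hypothetical record captured once per process() call.
interface QuantumStamp {
  frame: number;  // currentFrame at this quantum (exact sample count)
  wallMs: number; // Date.now() at this quantum (1 ms resolution)
}

/** Drift of each wall-clock reading from the time predicted by the sample count. */
function driftSeries(stamps: QuantumStamp[], sampleRate: number): number[] {
  const first = stamps[0];
  if (!first) return [];
  return stamps.map(
    (s) => (s.wallMs - first.wallMs) - ((s.frame - first.frame) / sampleRate) * 1000
  );
}

/** Population standard deviation, used here as the jitter figure. */
function stddev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Synthetic run: on-time quanta with coarse timestamps, then an 8 ms stall from quantum 50.
const benchRate = 48000;
const benchStamps: QuantumStamp[] = [];
for (let k = 0; k < 100; k++) {
  const idealMs = ((k * 128) / benchRate) * 1000;
  const stallMs = k >= 50 ? 8 : 0;
  benchStamps.push({ frame: k * 128, wallMs: Math.floor(idealMs) + stallMs });
}
const driftMs = driftSeries(benchStamps, benchRate);
const jitterMs = stddev(driftMs);
```

Before the stall every drift value stays within one millisecond of zero, which is just timestamp rounding. After it, the drift sits between 7 and 8 ms, so the stall is visible despite the coarse clock.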
Non-Obvious Bugs
onStateChange null dereference. The original listener:
this._ctx?.addEventListener('statechange', () => cb(this._ctx!.state));
AudioContext.close() sets this._ctx = null, then the context fires a final statechange event with state "closed". The non-null assertion ! is erased at compile time, so this._ctx.state dereferences null and throws. Fix: read state from the event target, not the instance field:
this._ctx?.addEventListener('statechange', (e) =>
  cb((e.target as AudioContext).state)
);
this._graph! after await in teardown/rebuild chain.
this._graph.setFAFNodes([])
  .then(() => this._graph!.setFAFNodes(semitones)) // throws if DAF stopped mid-await
The user can stop DAF between teardown and rebuild. The ! assertion does nothing at runtime, so the call dereferences null. Fix: this._graph?.setFAFNodes(semitones), which silently no-ops if the session ended.
activateModeBtn called before declaration. Chrome hoists block-scoped function declarations within a block; Firefox and Safari do not (ES2015 strict mode). The initialisation call to activateModeBtn('faf-type', 'single') preceded the function declaration by 10 lines within the same DOMContentLoaded callback. This works on Chrome but throws a TypeError on Firefox. Fixed by reordering the declaration above the call.
OLA vs PSOLA: When Each is Appropriate
| | OLA (Low Latency) | PSOLA (High-Fidelity) |
|---|---|---|
| Latency @ 120 Hz floor | ≈ 8.9 ms | ≈ 16.7 ms |
| Quality at ±4 st | Audible flutter on vowels | Phase-coherent, natural |
| Requires voice type config | No | Yes |
| Computational cost | Low | Low + YIN every ~20 ms |
| Unvoiced handling | Always OLA | Falls back to OLA |
OLA is appropriate when minimum total latency matters and the semitone shift is small (≤ 2 st, below the point where flutter becomes audible). PSOLA is appropriate for sustained practice sessions at clinical shifts (3-6 st) where audio quality affects the therapeutic experience.
- You can try it yourself at DAF Online.
References
- Natke, U., et al. (2001). Fluency, fundamental frequency, and speech rate under frequency-shifted auditory feedback in stuttering and nonstuttering persons. Journal of the Acoustical Society of America.
- Kalinowski, J., et al. (1996). Effect of alterations in auditory feedback and speech rate on stuttering frequency. Journal of Speech and Hearing Research, 39, 396–407.
