<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/rss.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Koray Ulusan | Blog</title><description>Writing on Machine Learning, GenAI, and Engineering.</description><link>https://korayulusan.github.io/</link><language>en-us</language><pubDate>Sat, 18 Apr 2026 07:00:00 GMT</pubDate><lastBuildDate>Wed, 22 Apr 2026 08:47:00 GMT</lastBuildDate><atom:link href="https://korayulusan.github.io/rss.xml" rel="self" type="application/rss+xml"/><item><title>The Tech Stack Behind This Site</title><link>https://korayulusan.github.io/blog/the-tech-stack-behind-this-site/</link><guid isPermaLink="true">https://korayulusan.github.io/blog/the-tech-stack-behind-this-site/</guid><description>Post about Astro, Tailwind CSS, Web Development, RSS</description><pubDate>Wed, 22 Apr 2026 05:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I built this website in April 2026 using &lt;code&gt;Astro 6&lt;/code&gt;, &lt;code&gt;Tailwind CSS 4&lt;/code&gt;, and &lt;code&gt;MDX&lt;/code&gt; to create a high-performance blog that bridges the gap between academic research and modern web standards. My stack centers on a &lt;code&gt;Vite 7&lt;/code&gt;-powered pipeline optimized with &lt;code&gt;Terser&lt;/code&gt; and &lt;code&gt;astro-compress&lt;/code&gt;. Integrating &lt;code&gt;remark-math&lt;/code&gt; and &lt;code&gt;rehype-katex&lt;/code&gt; means math renders beautifully. There&apos;s also an RSS feed styled with a custom &lt;code&gt;xsl&lt;/code&gt; stylesheet.&lt;/p&gt;
&lt;p&gt;The entire lifecycle, from transforming &lt;em&gt;content collections&lt;/em&gt; to final minification and deployment to &lt;code&gt;GitHub Pages&lt;/code&gt;, is orchestrated via &lt;code&gt;make&lt;/code&gt; utilizing &lt;code&gt;exiftool&lt;/code&gt; and &lt;code&gt;gh&lt;/code&gt; to ensure a lean, production-ready build.&lt;/p&gt;
&lt;p&gt;Rather than inheriting the legacy overhead of &lt;em&gt;al-folio&lt;/em&gt; or similar, I spent three days engineering a custom solution tailored to my specific needs. Building from scratch ensured total control over the stack, and while it took a few more days to iron out the finer details, I&apos;m happy with the result. I hope you like it, too!&lt;/p&gt;
</content:encoded></item><item><title>I Built a Free DAF Tool to Replace OS Native Paid Apps</title><link>https://korayulusan.github.io/blog/i-built-delayed-auditory-feedback-online-tool/</link><guid isPermaLink="true">https://korayulusan.github.io/blog/i-built-delayed-auditory-feedback-online-tool/</guid><description>DAF devices cost hundreds of dollars so I built a sub-6ms browser-based alternative that&apos;s free, instant, and reaches 500+ users a month.</description><pubDate>Mon, 20 Apr 2026 05:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;TL;DR&lt;/h3&gt;
&lt;p&gt;I built &lt;strong&gt;DAF Online&lt;/strong&gt;, a free, browser-based tool for speech therapy that helps people who stutter and Parkinson&apos;s patients find fluency. While native apps and $1000 hardware exist, I used the &lt;strong&gt;Web Audio API&lt;/strong&gt; to achieve sub-6ms latency in the browser by aggressively optimizing the audio graph.&lt;/p&gt;
&lt;h2&gt;What is Delayed Auditory Feedback?&lt;/h2&gt;
&lt;p&gt;Delayed Auditory Feedback (DAF) is simple: you hear your own voice played back with a short delay. What&apos;s less obvious is what that tiny lag does to your brain.&lt;/p&gt;
&lt;p&gt;For people who stutter, speaking while hearing a slightly delayed version of your own voice can induce near-instant fluency. It&apos;s called the &lt;strong&gt;Chorus Effect&lt;/strong&gt;. Your brain perceives a second speaker and shifts into a different, more fluid processing mode. The same principle is used by speech-language pathologists (SLPs) for Parkinson&apos;s patients, where the delay acts as a natural &quot;speed limit,&quot; forcing slower, more deliberate speech.&lt;/p&gt;
&lt;p&gt;The tool has three core audiences: people who stutter, individuals with Parkinson&apos;s Disease, and SLPs running remote telehealth sessions who need a quick, zero-friction way to get a patient practicing from home.&lt;/p&gt;
&lt;h2&gt;The Landscape Before I Built This (Early 2025)&lt;/h2&gt;
&lt;p&gt;When I went looking for a free, browser-based DAF tool, I found: nothing that actually worked.&lt;/p&gt;
&lt;p&gt;The market looked roughly like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dedicated hardware&lt;/strong&gt; (e.g., Casa Futuro, SpeechEasy): $1000-$2500+. Clinically validated, but you need to order, wait, and pay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Native mobile apps (e.g., DAF Pro)&lt;/strong&gt;: A handful exist on iOS and Android. Some are free-tier, most push you toward a subscription. They work reasonably well on modern phones.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web-based pages&lt;/strong&gt;: The few I found were marketing funnels pointing back to the native apps, or had limited functionality, or long delays. No one had built an actual working web implementation you could just... open and use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The &quot;Developer Gap&quot;&lt;/strong&gt;: I found a few GitHub repositories that implemented DAF logic. Some used the Web Audio API, while others were native C++ or Python implementations. Their problem was that they weren&apos;t hosted. Just code sitting in a repo.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The implementation isn&apos;t complex. The Web Audio API has had a &lt;code&gt;DelayNode&lt;/code&gt; for years. The gap wasn&apos;t technical; nobody had bothered to close it.&lt;/p&gt;
&lt;h2&gt;The Math: What&apos;s Actually Happening&lt;/h2&gt;
&lt;p&gt;The feedback loop is simple. The output signal is the input signal shifted in time:&lt;/p&gt;
&lt;p&gt;$$
y[n] = \alpha \cdot x[n - (k_{user} + k_{sys})]
$$&lt;/p&gt;
&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$y[n]$: The signal the user hears at time $n$&lt;/li&gt;
&lt;li&gt;$x[n]$: The user&apos;s voice entering the microphone&lt;/li&gt;
&lt;li&gt;$k_{user}$: &lt;strong&gt;Intentional Lag&lt;/strong&gt; is the delay you dial in&lt;/li&gt;
&lt;li&gt;$k_{sys}$: &lt;strong&gt;System Floor&lt;/strong&gt; is the device lag: the hidden hardware/OS latency floor&lt;/li&gt;
&lt;li&gt;$\alpha$: Gain (volume)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then the effective delay $k_{eff}$ is the sum of the intentional lag and the device lag:&lt;/p&gt;
&lt;p&gt;$$
k_{eff} = k_{user} + k_{sys}
$$&lt;/p&gt;
&lt;p&gt;The variable most people ignore is $k_{sys}$. It&apos;s not zero. And if it&apos;s high, your &quot;50ms delay&quot; is actually 100ms, which is a qualitatively different therapeutic experience and potentially useless.&lt;/p&gt;
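&lt;p&gt;As a concrete illustration, the equation above is just a ring buffer: write the current sample, read the one written $k$ samples ago, scale by $\alpha$. A minimal sketch with hypothetical names (the real tool delegates this to the browser&apos;s &lt;code&gt;DelayNode&lt;/code&gt;, shown later):&lt;/p&gt;

```javascript
// Sketch of y[n] = alpha * x[n - k] as a ring-buffer delay line.
// Hypothetical helper for illustration, not the site's production code.
function makeDelayLine(delaySamples, alpha) {
  const buffer = new Float32Array(delaySamples + 1); // holds x[n-k] .. x[n]
  let writeIndex = 0;
  return function process(x) {
    buffer[writeIndex] = x;                             // store x[n]
    const readIndex = (writeIndex + 1) % buffer.length; // oldest sample = x[n - k]
    writeIndex = readIndex;
    return alpha * buffer[readIndex];                   // y[n]
  };
}

// Example: k = 3 samples, unity gain. The input re-emerges 3 samples late.
const daf = makeDelayLine(3, 1.0);
const out = [1, 2, 3, 4, 5].map(daf);
// out = [0, 0, 0, 1, 2]
```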
&lt;h2&gt;Why Latency Is the Whole Game&lt;/h2&gt;
&lt;p&gt;For DAF to work therapeutically, the &lt;strong&gt;internal device latency&lt;/strong&gt; needs to stay under &lt;strong&gt;15-20ms&lt;/strong&gt;. That is the time your hardware and software spend processing audio &lt;em&gt;before&lt;/em&gt; your intentional delay is added.&lt;/p&gt;
&lt;p&gt;Here&apos;s why it matters: if $k_{sys}$ is already 50ms and you set a 50ms intentional delay, the user hears a 100ms echo. Worse, high internal latency usually comes with &lt;strong&gt;jitter&lt;/strong&gt; (timing variance), which breaks the chorus effect entirely. Jitter makes the delay feel unstable. The brain doesn&apos;t settle into choral mode, it just gets confused.&lt;/p&gt;
&lt;h3&gt;The Latency Landscape by Device&lt;/h3&gt;



&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Setup&lt;/th&gt;&lt;th&gt;Typical Internal Latency&lt;/th&gt;&lt;th&gt;Verdict&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Dedicated PC Drivers/Hardware&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;1-9 ms&lt;/td&gt;&lt;td&gt;Excellent&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Dedicated DAF Hardware&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&amp;lt; 10 ms&lt;/td&gt;&lt;td&gt;Excellent&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;High-End PC + Chrome$^*$&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;6-10 ms&lt;/td&gt;&lt;td&gt;Excellent&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;iPhone 16 + Safari$^*$&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;13 ms&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;aptX Low Latency codec (Bluetooth)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;40 ms&lt;/td&gt;&lt;td&gt;Borderline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;AAC codec (Bluetooth)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;100-200 ms&lt;/td&gt;&lt;td&gt;Unusable&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;SBC codec (Bluetooth)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;150-250 ms&lt;/td&gt;&lt;td&gt;Unusable&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;$^*$: My implementation.&lt;/p&gt;
&lt;p&gt;This is why native apps have historically had an edge over web tools. iOS and Android give native audio code direct access to the hardware buffer. The browser sits a layer above that, but with the right initialization options you can close most of the gap.&lt;/p&gt;
&lt;h3&gt;How to Test Your Own Floor&lt;/h3&gt;
&lt;p&gt;Set the software delay to &lt;strong&gt;0 ms&lt;/strong&gt;. Speak a sharp &quot;P&quot; or &quot;K&quot; sound. If it sounds like one sound, your floor is likely under 15ms. If it sounds like a double-hit or a slap-back echo, your internal latency is above 30ms and you should switch to a wired headset or a better audio driver before using the tool therapeutically.&lt;/p&gt;
&lt;p&gt;Bluetooth headphones are effectively incompatible with DAF therapy. Their codec latency (roughly 100-250 ms for the common AAC and SBC codecs) dwarfs any intentional delay you&apos;d set, making the total delay unpredictable and therapeutically ineffective. Always use wired headphones.&lt;/p&gt;
&lt;h2&gt;How It&apos;s Built for Speed&lt;/h2&gt;
&lt;p&gt;To be a legitimate alternative to dedicated hardware, the implementation needed to minimize $k_{sys}$ as aggressively as possible. Three things matter most.&lt;/p&gt;
&lt;h3&gt;1. Minimal Audio Graph Topology&lt;/h3&gt;
&lt;p&gt;Every node in the Web Audio API graph adds overhead. The final implementation uses a lean, four-node linear chain. No branches, no unnecessary processing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal four-node linear chain for maximum performance
_connectAudioNodes() {
    const nodes = this.audioNodes;

    // source (Mic) -&amp;gt; delay (DAF) -&amp;gt; gain (Vol) -&amp;gt; destination (Output)
    nodes.source.connect(nodes.delayNode);
    nodes.delayNode.connect(nodes.gainNode);
    nodes.gainNode.connect(this.audioContext.destination);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Requesting Hardware-Level Latency&lt;/h3&gt;
&lt;p&gt;Browsers default to an &quot;interactive&quot; latency mode (~50ms buffer). Setting &lt;code&gt;latencyHint: 0&lt;/code&gt; tells the browser to request the minimum buffer size the hardware allows. Matching the native device sample rate eliminates resampling lag.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;_createAudioContext() {
    const contextOptions = {
        // Request absolute minimum buffer size from hardware
        latencyHint: 0,
    };

    // Match native hardware sample rate to bypass resampling lag
    if (this.deviceSampleRate) {
        contextOptions.sampleRate = this.deviceSampleRate;
    }

    this.audioContext = new (window.AudioContext || window.webkitAudioContext)(contextOptions);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Honest Latency Measurement&lt;/h3&gt;
&lt;p&gt;The tool reads &lt;code&gt;baseLatency&lt;/code&gt; and &lt;code&gt;outputLatency&lt;/code&gt; directly from the &lt;code&gt;AudioContext&lt;/code&gt; and adds them to the display so the user always sees their &lt;em&gt;effective&lt;/em&gt; delay, not just the slider value.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Measuring the true hardware &quot;floor&quot;
const outputMs = (this.audioContext.baseLatency + (this.audioContext.outputLatency || 0)) * 1000;
this.measuredFloorMs = outputMs;

// UI shows both the target and the honest effective delay
const effective = Math.round(targetDelay + measuredFloorMs);
this.displayLabel = `${targetDelay} ms (~${effective} ms effective)`;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This matters for trust. A user who sees &quot;50ms (est. 56ms effective)&quot; understands their setup. A user who sees &quot;50ms&quot; and hears 100ms thinks the tool is broken.&lt;/p&gt;
&lt;h2&gt;Frequency Altered Feedback (FAF)&lt;/h2&gt;
&lt;p&gt;I haven&apos;t gotten around to implementing FAF yet, but it&apos;s also used in speech therapy. Instead of delaying the signal, it shifts the pitch up or down. The effect is similar: it disrupts the brain&apos;s normal feedback loop and can improve fluency for some users. It&apos;s on the roadmap for a future update, but DAF was the priority since it&apos;s more widely used and has a clearer latency requirement.&lt;/p&gt;
&lt;p&gt;Implementing FAF is a challenge because it requires buffering audio to analyze and shift the frequency content. The buffering means added latency, which can break the therapeutic effect if it exceeds the 15-20ms threshold. More on that in the deep dive below.&lt;/p&gt;
&lt;p&gt;The therapeutic mechanism is similar to DAF&apos;s, but the implementation is messier.&lt;/p&gt;
&lt;p&gt;The problem is that you can&apos;t shift pitch without first buffering a chunk of audio to analyze. A simple delay line just holds samples and replays them. Pitch shifting has to look at a window of the signal before it can do anything, which means latency before your intentional delay is even added.&lt;/p&gt;
&lt;p&gt;At 44.1 kHz, the relationship between buffer size $N$ (in samples), sample rate $f_s$, and added latency is straightforward:&lt;/p&gt;
&lt;p&gt;$$
L_{ms} = \frac{N}{f_s} \cdot 1000
$$&lt;/p&gt;
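&lt;p&gt;Evaluating the formula for common buffer sizes gives the numbers in the table; this throwaway sketch (variable names are mine, not the tool&apos;s) just plugs $N$ in:&lt;/p&gt;

```javascript
// L_ms = (N / f_s) * 1000, evaluated at f_s = 44.1 kHz
const SAMPLE_RATE = 44100;

function bufferLatencyMs(bufferSamples, sampleRate = SAMPLE_RATE) {
  return (bufferSamples / sampleRate) * 1000;
}

// Common audio buffer sizes, rounded to one decimal place:
const latencies = [128, 256, 512, 1024].map(
  (n) => Number(bufferLatencyMs(n).toFixed(1))
);
// latencies = [2.9, 5.8, 11.6, 23.2]  (milliseconds)
```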



&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Buffer Size (Samples)&lt;/th&gt;&lt;th&gt;Latency Added (at 44.1 kHz)&lt;/th&gt;&lt;th&gt;Therapeutic Verdict&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;128&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~2.9 ms&lt;/td&gt;&lt;td&gt;Fine&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;256&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~5.8 ms&lt;/td&gt;&lt;td&gt;Fine&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;512&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~11.6 ms&lt;/td&gt;&lt;td&gt;Borderline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;1024&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~23.2 ms&lt;/td&gt;&lt;td&gt;Already over the threshold&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Why you can&apos;t skip the buffer&lt;/h3&gt;
&lt;p&gt;The naive fix is sample-by-sample processing: shift pitch like a sped-up record. That works for about half a second until the playback outruns the input and you get a gap. To keep the feedback in sync with actual speech rate, you need time-domain splicing (SOLA): small grains of audio, cross-faded together. That requires an &lt;code&gt;AudioWorklet&lt;/code&gt; and a minimum window size.&lt;/p&gt;
&lt;h3&gt;Phase vocoders vs. granular synthesis&lt;/h3&gt;
&lt;p&gt;Phase vocoders do this better perceptually. They use FFTs to shift pitch cleanly with no metallic artifacts. The catch is they need large buffers for frequency resolution, typically 1024 samples or more, which puts you at 23ms of algorithmic latency before anything else. That&apos;s already past the cutoff.&lt;/p&gt;
&lt;p&gt;Granular synthesis sounds rougher, but it runs on 128 or 256 samples. For this use case, a slightly robotic voice at 10ms beats a natural-sounding one at 40ms.&lt;/p&gt;
&lt;h2&gt;SEO: Why the Body Text Is Long (On Purpose)&lt;/h2&gt;
&lt;p&gt;&quot;Delayed Auditory Feedback&quot; is an incredibly niche topic. If you compare it to a broader term like &quot;Stuttering&quot; in Google Trends, you can see how small the specific search market is for the tool itself compared to the condition it treats.&lt;/p&gt;
&lt;p&gt;Building the tool was the easy part. Getting it in front of people who need it took just as long.&lt;/p&gt;
&lt;p&gt;Most users arrive via high-intent functional queries. They know what a DAF tool is. They just need to find one that works. &lt;strong&gt;A minimal landing page with a slider and a button would rank for nothing.&lt;/strong&gt; In the first months, the tool was &lt;strong&gt;invisible&lt;/strong&gt; to these high-intent users, stalled at 11th-15th in the rankings while hardware retailers and native apps claimed the top spots.&lt;/p&gt;
&lt;p&gt;By writing thorough, accurate content about the science and the use cases, the site achieved a &lt;strong&gt;2.4 weighted average position&lt;/strong&gt; for core keywords, capturing &lt;strong&gt;85% of organic traffic&lt;/strong&gt; from the top 3 results with a &lt;strong&gt;75% CTR&lt;/strong&gt; on primary search intent (Feb 2026).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Fun Fact: As it turns out, &quot;DAF&quot; is a congested acronym dominated by Dutch heavy-duty truck manufacturer DAF Trucks N.V. If you search &quot;DAF&quot; without context, Google assumes you&apos;re looking for a 7.5-ton hauler, not a speech aid.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;With the right initialization flags and a minimal graph topology, a browser-based DAF tool can match native app latency on decent hardware, with no install, no account, and no payment required.&lt;/p&gt;
&lt;p&gt;The gap wasn&apos;t a hard engineering problem; it was just an ignored one. Speech therapy is a small market, and most developers aren&apos;t building for people who stutter or have Parkinson&apos;s. Which is why it was worth doing.&lt;/p&gt;
&lt;p&gt;If you want to look at the implementation or try the tool yourself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DAF Online — Try it here!&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source code on GitHub&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Enhancing Facial Realism with Synthetic Data Augmentation</title><link>https://korayulusan.github.io/blog/genai-synthetic-data-facial-resemblance-dreambooth-instantid/</link><guid isPermaLink="true">https://korayulusan.github.io/blog/genai-synthetic-data-facial-resemblance-dreambooth-instantid/</guid><description>How using InstantID to generate synthetic training data for DreamBooth dramatically improves facial resemblance in AI-generated professional portraits — and why classical augmentations often make things worse.</description><pubDate>Wed, 28 May 2025 05:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This post explores research originally presented at the CVPR 2025 Workshop on Synthetic Data for Computer Vision (SynData4CV).&lt;/p&gt;
&lt;h3&gt;TL;DR&lt;/h3&gt;
&lt;p&gt;Instead of using &quot;classical&quot; image tweaks like flipping or rotating, which actually distort facial identity, this research proves that using &lt;strong&gt;InstantID&lt;/strong&gt; to generate high-quality synthetic portraits as training data significantly improves a &lt;strong&gt;DreamBooth&lt;/strong&gt; model&apos;s ability to produce realistic, professional-grade headshots.&lt;/p&gt;
&lt;h2&gt;The Problem with Few-Shot AI Portraits&lt;/h2&gt;
&lt;p&gt;You have five casual phone photos and want a polished LinkedIn headshot. Sounds like a job for a text-to-image model, right? In theory, yes. In practice, personalized diffusion models like &lt;em&gt;DreamBooth&lt;/em&gt; struggle with a bottleneck of identity retention: they need to learn &lt;em&gt;who you are&lt;/em&gt; from a tiny handful of images, then generalize that identity to entirely new scenes and styles.&lt;/p&gt;
&lt;p&gt;This is the few-shot personalization problem. It sits at the tension between two competing goals: &lt;strong&gt;identity retention&lt;/strong&gt; (the output should actually look like you) and &lt;strong&gt;recontextualization&lt;/strong&gt; (you should be placeable in any scene the user prompts). Most standard training pipelines lean hard in one direction or the other. My research, published as &quot;Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID&quot;, investigates a third path: using one generative model to improve the training of another.&lt;/p&gt;
&lt;h2&gt;Why Classical Augmentations Backfire&lt;/h2&gt;
&lt;p&gt;When deep learning practitioners want more training data, the first instinct is to reach for classical augmentations: random flips, crops, rotations, colour jitter. These are reliable staples for large-scale classification tasks. For few-shot face personalization, they are a trap.&lt;/p&gt;
&lt;h3&gt;Geometric traps&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Random Horizontal Flip&lt;/strong&gt; seems harmless, but faces are subtly asymmetric. A mole on the left cheek, a slightly crooked smile, the direction of a part in the hair: flipping these teaches the model a &lt;em&gt;second&lt;/em&gt; identity that contradicts the first. Rather than generalizing, the model averages them into an uncanny composite.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Random Rotation&lt;/strong&gt; introduces black padding bars around the frame, which the model dutifully learns as part of the subject&apos;s visual signature. It also misaligns facial landmarks, undermining the spatial consistency that makes face generation coherent.&lt;/p&gt;
&lt;h3&gt;Colour confusion&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Colour Jittering&lt;/strong&gt; (tweaking brightness, contrast, saturation, and hue) causes the model to incorrectly associate those shifts with the rare token representing your subject. The result is erratic generations where the subject might appear with an alien skin tone or under lighting that was never in any real photograph.&lt;/p&gt;
&lt;h3&gt;Segmentation imperfections&lt;/h3&gt;
&lt;p&gt;Replacing backgrounds using a segmentation model like $U^2$-Net sounds like a clean solution to background leakage. In practice, the segmentation boundary around fine hair creates a blended halo artifact. The model then learns that wispy, semi-transparent fringe is part of the subject&apos;s identity, making clean background swaps nearly impossible downstream.&lt;/p&gt;
&lt;p&gt;The pattern is the same across all of these: classical augmentations introduce distributional artifacts, and the model, with no other signal to reject them, faithfully memorizes those artifacts as identity-defining features.&lt;/p&gt;
&lt;p&gt;  Classical augmentations introduce distributional artifacts. With no other signal to reject them, the model faithfully memorizes these artifacts as identity-defining features.&lt;/p&gt;
&lt;h2&gt;A New Approach: GenAI Improving GenAI&lt;/h2&gt;
&lt;p&gt;Instead of perturbing real images in ways that corrupt facial structure, the approach explored in this paper asks a different question: &lt;em&gt;what if the augmented images were themselves high-quality generations of the person, produced by a model that already understands faces?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The answer is &lt;strong&gt;generative augmentation via InstantID&lt;/strong&gt;. By conditioning InstantID on a subject&apos;s facial landmarks and a set of reference images, we can synthesize diverse, photo-realistic portraits of that person across varied poses, lighting conditions, and contexts, all while preserving the structural integrity of their face. These synthetic images already live in the diffusion model&apos;s feature space, so DreamBooth does not have to reconcile the domain gap that plagues classical augmentations.&lt;/p&gt;
&lt;p&gt;The result is measurably better facial resemblance in the fine-tuned DreamBooth model, with the full range of recontextualization still intact.&lt;/p&gt;
&lt;h2&gt;Practical Takeaways&lt;/h2&gt;
&lt;p&gt;These findings translate directly into actionable recommendations for anyone building portrait personalization pipelines.&lt;/p&gt;
&lt;h3&gt;1. Balance real and synthetic images&lt;/h3&gt;
&lt;p&gt;The most important constraint for preventing overfitting is &lt;strong&gt;dataset diversity&lt;/strong&gt;. No single concept (a specific background, a particular outfit, a generated style) should represent more than 25% of your training set. When synthetic images crowd out real ones, the model loses its grip on genuine identity and begins to replicate InstantID&apos;s stylistic fingerprint rather than the subject&apos;s actual face.&lt;/p&gt;
&lt;h3&gt;2. The Rule of Four&lt;/h3&gt;
&lt;p&gt;When generating synthetic training data with InstantID, providing &lt;strong&gt;four reference images&lt;/strong&gt; offers the best trade-off between usability and facial similarity. Fewer references produce inconsistent identity across generations; more references yield diminishing returns and increase annotation overhead.&lt;/p&gt;
&lt;h3&gt;3. Resolution matters&lt;/h3&gt;
&lt;p&gt;Images around &lt;strong&gt;1 megapixel&lt;/strong&gt; align with the native training resolution of SDXL and deliver the best qualitative results. Upscaling smaller images introduces compression artefacts; downscaling large images discards high-frequency facial detail. If your source photos are from a phone camera, a light centre-crop to roughly 1024 × 1024 is ideal.&lt;/p&gt;
&lt;h3&gt;4. Skip the flips, rotations, and jitter&lt;/h3&gt;
&lt;p&gt;Given the evidence above: do not use Random Horizontal Flip, Random Rotation, or Colour Jitter in the fine-tuning pipeline. Their well-known benefits for large-scale classification tasks do not transfer to few-shot face personalization.&lt;/p&gt;
&lt;h2&gt;Measuring Resemblance: The FaceDistance Metric&lt;/h2&gt;
&lt;p&gt;Qualitative &quot;vibes&quot; are a start, but human intuition is subjective. To systematically rank checkpoints and understand how synthetic data actually moves the needle, we needed a reproducible, automated metric. This led to the development of &lt;strong&gt;FaceDistance&lt;/strong&gt;, a validation tool built on &lt;strong&gt;FaceNet&lt;/strong&gt; embeddings.&lt;/p&gt;
&lt;p&gt;Rather than looking at pixels, FaceDistance looks at geometry. It projects facial images into a 128-dimensional hyperspherical space where the distance between points reflects perceptual similarity. Specifically, the metric calculates the average cosine distance between a generated image $G_i$ and the set of original reference images $\{R_j\}$:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; Given batches of generated images $G = \{G_i\}_{i=1}^m$ and original reference images $R = \{R_j\}_{j=1}^n$, the &lt;strong&gt;FaceDistance&lt;/strong&gt; is defined as:&lt;/p&gt;
&lt;p&gt;$$
\bigl[\operatorname{FaceDistance}(G, R)\bigr]_i := \frac{1}{n} \sum_{j=1}^{n} \delta^{[0,2]}_{\cos}\!\bigl(f(G_i),\, f(R_j)\bigr)
$$&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breaking down the logic:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Encoder ($f$):&lt;/strong&gt; We use &lt;strong&gt;MTCNN&lt;/strong&gt; for precise face detection, followed by &lt;strong&gt;FaceNet&lt;/strong&gt; to extract the identity embedding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Distance ($\delta_{\cos}$):&lt;/strong&gt; We use cosine distance, clipped to a $[0, 2]$ range for numerical stability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A &lt;strong&gt;lower FaceDistance score&lt;/strong&gt; indicates a stronger mathematical resemblance to the subject.&lt;/p&gt;
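&lt;p&gt;In code, the metric reduces to a mean of clipped cosine distances. A minimal sketch, assuming the embeddings $f(G_i)$ and $f(R_j)$ have already been extracted by FaceNet as plain arrays (function names are illustrative, not from the paper&apos;s codebase):&lt;/p&gt;

```javascript
// Cosine distance 1 - cos(a, b), clipped to [0, 2] for numerical stability.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; a.length > i; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const d = 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  return Math.min(2, Math.max(0, d)); // clip to [0, 2]
}

// FaceDistance for one generated embedding: mean clipped cosine
// distance to all reference embeddings.
function faceDistance(generated, references) {
  const total = references.reduce(
    (sum, ref) => sum + cosineDistance(generated, ref), 0);
  return total / references.length;
}
```

&lt;p&gt;Identical embeddings score 0, orthogonal ones 1, and opposed ones 2, matching the $[0, 2]$ range above.&lt;/p&gt;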
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; FaceDistance acts more like a high-pass filter than a perfect judge. It is excellent for identifying &quot;catastrophic drift&quot; (where the model loses the subject entirely) but it isn&apos;t sensitive enough to decide if a &quot;good&quot; image is &quot;great.&quot; &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In our pipeline, we found that simply discarding the top &lt;strong&gt;15%&lt;/strong&gt; of highest-distance embeddings from the training set (in cases with $n \geq 8$ references) consistently led to cleaner, more recognizable results.&lt;/p&gt;
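&lt;p&gt;The pruning step described above is straightforward to sketch: score every candidate, sort by FaceDistance, and keep the lowest-distance 85%. The &lt;code&gt;scored&lt;/code&gt; record shape here is my assumption, not the paper&apos;s code:&lt;/p&gt;

```javascript
// Drop the highest-distance fraction (default 15%) of scored candidates.
// Each element is assumed to look like { image: ..., distance: number }.
function dropHighestDistance(scored, fraction = 0.15) {
  const sorted = scored.slice().sort((a, b) => a.distance - b.distance);
  const keepCount = Math.ceil(sorted.length * (1 - fraction));
  return sorted.slice(0, keepCount); // best (lowest-distance) candidates
}
```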
&lt;h2&gt;The Human Test: Does It Fool Real People?&lt;/h2&gt;
&lt;p&gt;Metrics only go so far. To validate that these portraits actually pass muster in professional contexts, the study recruited &lt;strong&gt;97 white-collar workers&lt;/strong&gt; to evaluate the generated headshots. Both DreamBooth and InstantID produced portraits that were frequently indistinguishable from genuine professional photographs.&lt;/p&gt;
&lt;p&gt;Participants&apos; preferences split along an interesting fault line. Those who valued &lt;strong&gt;identity accuracy&lt;/strong&gt; (&quot;does this actually look like the person?&quot;) tended to prefer DreamBooth outputs. Those drawn to overall aesthetics favoured InstantID for its polished, retouched quality. Neither model dominated on all dimensions, which points to a useful practical heuristic: use DreamBooth-with-generative-augmentation when fidelity to a specific individual is paramount, and use InstantID directly when a studio-quality aesthetic matters more than strict identity retention.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Classical augmentations are not universally beneficial. For few-shot face personalization, several common techniques actively degrade output quality. Replacing them with generative augmentation, where InstantID synthesizes diverse but identity-consistent training images, closes the gap between a handful of casual snapshots and a high-fidelity professional portrait.&lt;/p&gt;
&lt;p&gt;The broader takeaway extends beyond portraits: &lt;strong&gt;synthetic data is not just a fallback for when real data is scarce. It is a tool for shaping precisely what a model learns.&lt;/strong&gt; As generative models improve, training pipelines that use one model to curate data for another will become increasingly common.&lt;/p&gt;
&lt;h2&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;This work would not have been possible without the dedicated mentorship of &lt;strong&gt;Benjamin Kiefer&lt;/strong&gt;. Beyond steering the technical direction of this research, Benjamin was a constant guide through the often-turbulent process of publishing my first paper. His attentiveness during our weekly meetings and his rigorous feedback were fundamental to the success of this project. I am deeply grateful for his support in turning these initial ideas into a peer-reviewed publication.&lt;/p&gt;
&lt;p&gt;I am also grateful to the CVPR SynData4CV workshop reviewers for their constructive comments.&lt;/p&gt;
&lt;h2&gt;Citation&lt;/h2&gt;
&lt;p&gt;If you build on this work or wish to explore the full list of references and literature supporting this research, please refer to the formal paper:&lt;/p&gt;
&lt;p&gt;Ulusan, K., &amp;amp; Kiefer, B. (2025). Generating synthetic data via augmentations for improved facial resemblance in DreamBooth and InstantID [Paper presentation]. CVPR Workshop on Synthetic Data for Computer Vision (SynData4CV), Nashville, TN, United States. https://arxiv.org/abs/2505.03557&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@inproceedings{ulusan2025generating,
  author    = {Ulusan, Koray and Kiefer, Benjamin},
  title     = {Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID},
  booktitle = {Proceedings of the CVPR Workshop on Synthetic Data for Computer Vision (SynData4CV)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.03557},
  note      = {Presented at CVPR 2025 Workshop}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The full paper is available at arXiv:2505.03557.&lt;/p&gt;
</content:encoded></item><item><title>Modifying TD3 with PER, N-Step Returns, and Reward Shaping</title><link>https://korayulusan.github.io/blog/rl-hockey-td3-per-reward-shaping-curriculum/</link><guid isPermaLink="true">https://korayulusan.github.io/blog/rl-hockey-td3-per-reward-shaping-curriculum/</guid><description>How I modified Twin Delayed DDPG with Prioritized Experience Replay, reward shaping, and multi-step learning to train an agent in a simulated air hockey environment — and what broke along the way.</description><pubDate>Tue, 25 Feb 2025 06:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;TL;DR&lt;/h3&gt;
&lt;p&gt;I modified &lt;strong&gt;TD3&lt;/strong&gt; with three reinforcement learning techniques: Prioritized Experience Replay (PER), potential-based reward shaping, and multi-step returns. I used these to train an agent in a simulated air hockey game. Most modifications made things &lt;em&gt;worse&lt;/em&gt;. What actually worked was a &lt;strong&gt;curriculum&lt;/strong&gt;: pre-training on shooting and defending modes before facing the real opponent. The final agent wins 98.3% of games against the strong built-in opponent.&lt;/p&gt;
&lt;h2&gt;The Problem: Sparse Rewards in a Competitive Environment&lt;/h2&gt;
&lt;p&gt;Air hockey is a hard environment for RL. Goals are rare and delayed, preceded by a long sequence of positioning decisions that receive no direct reward signal. The agent needs to learn to move toward the puck, hit it in the right direction, and coordinate defense and offense, all from a reward that stays at zero until something decisive happens.&lt;/p&gt;
&lt;p&gt;The environment I used is &lt;strong&gt;HockeyEnv&lt;/strong&gt; (a.k.a. &quot;Laser Hockey&quot;), a Box2D/Gymnasium simulation of a two-player air hockey game. The observation space is 18-dimensional (positions, velocities, angles of both player and puck), and the action space is a 4-dimensional continuous vector covering movement and shooting. Each episode runs for up to 250 timesteps in normal mode, or a shorter 80-step window in dedicated shooting/defending training modes.&lt;/p&gt;
&lt;p&gt;The standard &quot;run a good off-policy algorithm and wait&quot; approach struggles here. The agent&apos;s first instinct is to stand still and play for a draw, because a draw scores better than the penalties its near-random early actions rack up. Getting past that local optimum requires deliberate intervention.&lt;/p&gt;
&lt;h2&gt;Base Algorithm: Twin Delayed DDPG (TD3)&lt;/h2&gt;
&lt;p&gt;TD3 is an actor-critic algorithm that addresses the well-known overestimation bias of DDPG by maintaining &lt;em&gt;two&lt;/em&gt; critics and taking the minimum of their Q-value estimates when computing targets:&lt;/p&gt;
&lt;p&gt;$$
y = r_t + \gamma \min_{k=1,2} Q_{\theta_k&apos;}(s_{t+1}, \pi_{\phi&apos;}(s_{t+1}) + \epsilon), \quad \epsilon \sim \text{clip}(\mathcal{N}(0,\sigma), -c, c)
$$&lt;/p&gt;
&lt;p&gt;It also delays actor updates relative to critic updates (the &quot;Delayed&quot; in the name; the &quot;Twin&quot; is the pair of critics), which gives the critics time to stabilize before the policy starts chasing them. I built &lt;code&gt;UlusanTD3&lt;/code&gt; on top of Stable Baselines 3, extending the base &lt;code&gt;TD3&lt;/code&gt; class to support the three techniques described below.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The name UlusanTD3 is chosen purely for the convenience of the project graders and for easy identification in the codebase. It doesn&apos;t imply any fundamental change to the TD3 algorithm; it simply serves as a container for the modifications and experiments conducted in this project.&lt;/p&gt;
&lt;/blockquote&gt;
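&lt;p&gt;As a concrete illustration, here is a minimal NumPy sketch of the target computation above. The critic and actor stand-ins (&lt;code&gt;q1&lt;/code&gt;, &lt;code&gt;q2&lt;/code&gt;, &lt;code&gt;pi&lt;/code&gt;) are hypothetical toy functions, not the project&apos;s networks:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two target critics and the target actor
# (hypothetical functions for illustration only).
def q1(s, a): return -(s - a) ** 2
def q2(s, a): return -(s - a) ** 2 + 0.1
def pi(s):    return 0.5 * s

def td3_target(r, s_next, gamma=0.99, sigma=0.2, c=0.5):
    # Target policy smoothing: clipped Gaussian noise on the target action
    eps = np.clip(rng.normal(0.0, sigma), -c, c)
    a_next = pi(s_next) + eps
    # Clipped double-Q: take the minimum of the two critic estimates
    q_min = min(q1(s_next, a_next), q2(s_next, a_next))
    return r + gamma * q_min

y = td3_target(r=1.0, s_next=2.0)
```

&lt;p&gt;Because the target always takes the pessimistic minimum, a critic that overestimates cannot pull the target up on its own, which is what suppresses DDPG&apos;s overestimation spiral.&lt;/p&gt;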
&lt;h2&gt;Technique 1: Prioritized Experience Replay&lt;/h2&gt;
&lt;p&gt;Standard experience replay samples uniformly from a FIFO buffer. &lt;strong&gt;PER&lt;/strong&gt; [Schaul et al., 2015] argues that transitions where the agent was &lt;em&gt;wrong&lt;/em&gt; (those with a high temporal-difference (TD) error) are more informative and should be sampled more often. The sampling probability for a transition is:&lt;/p&gt;
&lt;p&gt;$$
P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}
$$&lt;/p&gt;
&lt;p&gt;where $\alpha$ controls prioritization strength and $p_i$ is the priority of transition $i$. For TD3 with two critics, I define the priority as the absolute value of the mean TD error across both:&lt;/p&gt;
&lt;p&gt;$$
p_i = \left|\tfrac{1}{2}\left(\delta_{1,i} + \delta_{2,i}\right)\right| + \epsilon
$$&lt;/p&gt;
&lt;p&gt;To keep priorities tractable and prevent divergence, I clip TD errors to the range $[\epsilon, 1]$. Without this upper bound, a single catastrophic prediction early in training can dominate the buffer forever and destabilize the actor.&lt;/p&gt;
&lt;p&gt;Sampling more from high-error transitions introduces a bias, which is corrected with &lt;strong&gt;importance-sampling (IS) weights&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;$$
w_i = \left(N \cdot P(i)\right)^{-\beta}, \quad \text{normalized by } \max_i w_i
$$&lt;/p&gt;
&lt;p&gt;These weights are folded into the critic loss, replacing the standard MSE:&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L}(\theta_k) = \mathbb{E}\left[w \cdot \delta_k^2\right]
$$&lt;/p&gt;
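&lt;p&gt;The two formulas above fit in a few lines of NumPy. This is a standalone sketch with assumed defaults $\alpha = 0.6$ and $\beta = 0.4$, not the project&apos;s buffer code:&lt;/p&gt;

```python
import numpy as np

def per_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    return scaled / scaled.sum()

def is_weights(probs, beta=0.4):
    # w_i = (N * P(i))^(-beta), normalized by max_i w_i
    n = len(probs)
    w = (n * probs) ** -beta
    return w / w.max()

p = per_probabilities([1.0, 1.0, 0.1])
w = is_weights(p)
```

&lt;p&gt;The rare low-priority transition ends up with the largest normalized weight: it is sampled less often than uniform replay would sample it, so when it does appear, its gradient counts for more. That is exactly the bias correction PER needs.&lt;/p&gt;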
&lt;p&gt;Because I have two critics with potentially different scales, I give each its own optimizer rather than summing their losses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In UlusanTD3.__init__
if isinstance(self.replay_buffer, PrioritizedExperienceReplayBuffer):
    self.critic1_optimizer = th.optim.Adam(
        self.critic.q_networks[0].parameters(), lr=learning_rate
    )
    self.critic2_optimizer = th.optim.Adam(
        self.critic.q_networks[1].parameters(), lr=learning_rate
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The PER buffer itself is backed by a &lt;strong&gt;SumSegmentTree&lt;/strong&gt;, which supports O(log N) priority updates and O(log N) stratified sampling. This is essential when the buffer holds a million transitions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def _sample_indicies_proportional(self, batch_size: int) -&amp;gt; np.ndarray:
    p_total = self._td_errors.sum(end=self.size())
    segment_length = p_total / batch_size
    elem_at_segment_prefixsum = (
        np.arange(batch_size) + np.random.uniform(0, 1, batch_size)
    ) * segment_length
    return np.array([
        self._td_errors.find_prefixsum_idx(p)
        for p in elem_at_segment_prefixsum
    ])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a NumPy array in &lt;code&gt;SegmentTree&lt;/code&gt; was important because a Python list was too slow for the large buffer size and high update frequency.&lt;/p&gt;
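&lt;p&gt;To make the tree operations concrete, here is a minimal sum-tree sketch. The capacity is assumed to be a power of two for the simple array layout; this illustrates the idea, not the project&apos;s actual &lt;code&gt;SumSegmentTree&lt;/code&gt; class:&lt;/p&gt;

```python
import numpy as np

class SumTree:
    """Minimal binary sum tree over a fixed-capacity priority array.

    Supports O(log N) point updates and O(log N) prefix-sum search,
    the two operations proportional PER sampling needs.
    Capacity must be a power of two for this layout.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        # index 0 unused, root at 1, leaves at [capacity, 2*capacity)
        self.tree = np.zeros(2 * capacity)

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:  # propagate the new sum up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def total(self):
        return self.tree[1]

    def find_prefixsum_idx(self, prefixsum):
        # Descend from the root toward the leaf whose cumulative
        # priority interval contains `prefixsum`.
        i = 1
        while i < self.capacity:
            if self.tree[2 * i] > prefixsum:
                i = 2 * i
            else:
                prefixsum -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.capacity
```

&lt;p&gt;&lt;code&gt;update&lt;/code&gt; climbs from a leaf to the root; &lt;code&gt;find_prefixsum_idx&lt;/code&gt; descends from the root, subtracting the left subtree&apos;s mass whenever it branches right. Both walk a single root-to-leaf path, hence the logarithmic cost.&lt;/p&gt;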
&lt;p&gt;After each gradient step, priorities are updated to reflect the latest TD errors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Back in the train loop
td_errors = (td_error1 + td_error2) / 2.0
self.replay_buffer.set_priorities(
    batch_inds,
    td_errors.abs().squeeze().detach().cpu().numpy()
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;What actually happened:&lt;/strong&gt; PER made performance &lt;em&gt;worse&lt;/em&gt; in every configuration I tested. More on why below.&lt;/p&gt;
&lt;h2&gt;Technique 2: Potential-Based Reward Shaping&lt;/h2&gt;
&lt;p&gt;In environments with sparse rewards, auxiliary signals that encode domain knowledge can accelerate learning without changing the optimal policy. &lt;strong&gt;Potential-based reward shaping&lt;/strong&gt; [Ng et al., 1999] adds a shaping term:&lt;/p&gt;
&lt;p&gt;$$
F(s_t, s_{t+1}) = \gamma \phi(s_{t+1}) - \phi(s_t)
$$&lt;/p&gt;
&lt;p&gt;The key property is that this &lt;em&gt;never changes the optimal policy&lt;/em&gt;. It only changes how quickly the agent converges to it. The potential function $\phi: S \to \mathbb{R}$ can encode whatever domain knowledge you have.&lt;/p&gt;
&lt;p&gt;HockeyEnv conveniently exposes sub-reward components in its &lt;code&gt;info&lt;/code&gt; dict. I used a combination of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;closeness_to_puck&lt;/code&gt; — reward for staying near the puck&lt;/li&gt;
&lt;li&gt;&lt;code&gt;touch_puck&lt;/code&gt; — bonus for making contact&lt;/li&gt;
&lt;li&gt;&lt;code&gt;puck_direction&lt;/code&gt; — reward for hitting the puck toward the opponent&apos;s goal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Two components I tried and removed: &lt;code&gt;centered_puck&lt;/code&gt; introduced noise and slowed training, and &lt;code&gt;game_length&lt;/code&gt; inadvertently taught the agent to step aside and let in own goals.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def shaped_reward(self, rewards, infos):
    phis = [
        (info.get(&quot;prev_potential_reward&quot;, 0),
         info.get(&quot;current_potential_reward&quot;, 0))
        for info in infos
    ]
    # F(s, s&apos;) = gamma * phi(s&apos;) - phi(s)
    return [
        r + self.gamma * phi - phi_prev
        for r, (phi_prev, phi) in zip(rewards, phis)
    ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because HockeyEnv is a fully observable MDP, I compute $\phi$ directly from the environment state on every step, so no approximation is needed.&lt;/p&gt;
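&lt;p&gt;The &quot;never changes the optimal policy&quot; property rests on a telescoping argument: over any trajectory, the discounted shaping terms collapse to a difference of endpoint potentials, independent of the path in between. A quick numeric check with arbitrary made-up potential values:&lt;/p&gt;

```python
def shaping_terms(phis, gamma=0.99):
    # F(s_t, s_{t+1}) = gamma * phi(s_{t+1}) - phi(s_t)
    return [gamma * phis[t + 1] - phis[t] for t in range(len(phis) - 1)]

# Arbitrary potentials along a hypothetical 5-state trajectory
phis = [0.3, 1.2, -0.5, 2.0, 0.8]
gamma = 0.99
F = shaping_terms(phis, gamma)

# The discounted sum of shaping terms telescopes:
# only the endpoint potentials survive.
discounted = sum(gamma**t * f for t, f in enumerate(F))
expected = gamma ** (len(phis) - 1) * phis[-1] - phis[0]
assert abs(discounted - expected) < 1e-9
```

&lt;p&gt;Every intermediate $\phi(s_t)$ appears once with a plus sign and once with a minus sign, so the total shaping bonus depends only on where the trajectory starts and ends, which is why the optimal policy is unchanged.&lt;/p&gt;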
&lt;h2&gt;Technique 3: Multi-Step Returns&lt;/h2&gt;
&lt;p&gt;Standard TD3 bootstraps one step into the future. In hockey, the decisive action (the puck shot) is made many timesteps before the goal is actually scored, so the one-step target has no way to credit that shot with the eventual reward.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;truncated n-step return&lt;/strong&gt; addresses this by accumulating rewards forward:&lt;/p&gt;
&lt;p&gt;$$
R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k}
$$&lt;/p&gt;
&lt;p&gt;and substituting it into the TD3 target:&lt;/p&gt;
&lt;p&gt;$$
y = R_t^{(n)} + \gamma^n \min_{k=1,2} Q_{\theta_k&apos;}(s_{t+n}, \pi_{\phi&apos;}(s_{t+n}) + \epsilon)
$$&lt;/p&gt;
&lt;p&gt;With reward shaping combined, the shaping terms inside the n-step sum telescope, leaving only the endpoint potentials $\Phi$, and the shaped target becomes:&lt;/p&gt;
&lt;p&gt;$$
y = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \Phi(s_{t+n}) - \Phi(s_t) + \gamma^n \min_{k=1,2} Q_{\theta_k&apos;}(s_{t+n}, \pi_{\phi&apos;}(s_{t+n}) + \epsilon)
$$&lt;/p&gt;
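&lt;p&gt;A tiny standalone sketch of the two formulas with toy numbers (not project code):&lt;/p&gt;

```python
def n_step_return(rewards, gamma=0.99):
    # R_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k}
    return sum(gamma**k * r for k, r in enumerate(rewards))

def n_step_target(rewards, q_next, gamma=0.99):
    # y = R_t^(n) + gamma^n * min-Q at s_{t+n} (q_next stands in
    # for the already-minimized twin-critic estimate)
    n = len(rewards)
    return n_step_return(rewards, gamma) + gamma**n * q_next

# A goal reward arrives two steps after the decisive action:
y = n_step_target([0.0, 0.0, 1.0], q_next=2.0, gamma=0.9)
# y ≈ 0.9**2 * 1.0 + 0.9**3 * 2.0 = 2.268
```

&lt;p&gt;The reward at $t+2$ now appears directly in the target for the state at $t$, instead of having to propagate back through two separate one-step bootstrap updates.&lt;/p&gt;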
&lt;p&gt;Implementation-wise, this requires buffering the last $n$ transitions before committing any of them to the replay buffer. The buffer is flushed early when a terminal state is reached:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def add(self, obs, next_obs, action, reward, done, infos):
    self._reward_info_buffer.append((reward, infos))

    if any(done):
        # flush remaining transitions on episode end
        while len(self._reward_info_buffer) &amp;gt; 0:
            self._add_n_step_return(obs, next_obs, action, done, infos)
        return

    if len(self._reward_info_buffer) &amp;lt; self.n_step_return_num:
        return  # keep buffering

    self._add_n_step_return(obs, next_obs, action, done, infos)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The discount exponent in the Bellman target also needs updating to account for the extended horizon:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;target_q_values = (
    replay_data.rewards
    + (1 - replay_data.dones)
    * self.gamma ** self.n_step_return_num  # γⁿ instead of γ in TD3
    * next_q_values
).detach()
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Ablation Study: Most Things Didn&apos;t Help&lt;/h2&gt;
&lt;p&gt;I ran combinations of all three techniques (26 configurations in total, sweeping $n \in \{2, 3, 4, 5, 10, 20, 30\}$ for multi-step returns) and evaluated each against the weak built-in opponent. The results were not great.&lt;/p&gt;
&lt;p&gt;Most modifications performed &lt;em&gt;worse&lt;/em&gt; than the TD3 baseline. The 3-step return variant was the only technique that consistently outperformed the baseline, and even that improvement was modest.&lt;/p&gt;
&lt;p&gt;PER failed systematically. Training with IS weights disabled diverged immediately: without correction, the buffer fills with high-error transitions and the critic chases a badly biased distribution. With IS weights enabled, training was stable but still underperformed the baseline.&lt;/p&gt;
&lt;p&gt;The failure of PER wasn&apos;t a bug; it was informative. PER&apos;s design assumes a stationary data distribution. When the environment or the opponent changes, that assumption breaks.&lt;/p&gt;
&lt;h2&gt;Curriculum Learning: The Thing That Actually Worked&lt;/h2&gt;
&lt;p&gt;The real insight was about changing &lt;em&gt;how&lt;/em&gt; the agent was trained rather than what algorithm it used.&lt;/p&gt;
&lt;p&gt;Without guidance, an agent facing a strong opponent quickly figures out that drawing (never scoring, never conceding) is safer than attempting to score. Once it settles into that strategy, it&apos;s hard to unlearn, because the risk of a failed shot (giving the opponent a chance to score) outweighs any expected benefit from trying.&lt;/p&gt;
&lt;p&gt;The curriculum I designed breaks this trap in two phases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 (steps 0 to 440k):&lt;/strong&gt; Alternate every episode between the dedicated shooting mode and defending mode of HockeyEnv. These stripped-down scenarios cut out the full-game complexity and force the agent to develop fundamental skills: aim and shoot; track and block. The episode horizon is only 80 steps, which enables much faster iteration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2 (steps 440k+):&lt;/strong&gt; Empty the replay buffer entirely and switch to training against the strong &lt;code&gt;BasicOpponent&lt;/code&gt; in full normal-mode games. The clean buffer prevents old experiences from contaminating the new distribution.&lt;/p&gt;
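&lt;p&gt;The schedule fits in a few lines. This is a sketch of the logic described above; the function names and the exact alternation rule are illustrative, not the project&apos;s code:&lt;/p&gt;

```python
PHASE_BOUNDARY = 440_000  # step at which Phase 2 begins (from the text)

def curriculum_mode(step, episode_idx):
    """Pick the HockeyEnv mode for the next episode."""
    if step < PHASE_BOUNDARY:
        # Phase 1: alternate the two 80-step skill modes every episode
        return "shooting" if episode_idx % 2 == 0 else "defending"
    # Phase 2: full games against the strong opponent
    return "normal"

def should_reset_buffer(prev_step, step):
    # Empty the replay buffer exactly once, at the phase transition
    return prev_step < PHASE_BOUNDARY <= step
```

&lt;p&gt;Keeping the buffer reset as an explicit one-shot condition matters: resetting it every episode would throw away useful data, while never resetting it lets Phase 1 skill-mode transitions contaminate the full-game distribution.&lt;/p&gt;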
&lt;p&gt;The agent that had been stuck below 50% win rate against the strong opponent reached &lt;strong&gt;98.3% win rate&lt;/strong&gt; within 110k steps of Phase 2 training. Notably, the win rate against the &lt;em&gt;weak&lt;/em&gt; opponent also climbed during Phase 1, even though the agent had never played full games during that phase.&lt;/p&gt;
&lt;h2&gt;Why PER and Curriculum Don&apos;t Mix&lt;/h2&gt;
&lt;p&gt;When the curriculum switches from Phase 1 to Phase 2, the replay buffer gets emptied. For standard TD3, this is a clean reset. For PER, it causes problems.&lt;/p&gt;
&lt;p&gt;The new transitions in Phase 2 initially have high TD errors (the agent has never seen full-game states before), and they saturate the buffer with maximum-priority entries.&lt;/p&gt;
&lt;p&gt;When a large fraction of transitions share the same (clipped) maximum priority, the IS weight assigned to each of them collapses:&lt;/p&gt;
&lt;p&gt;$$w_i = (N \cdot P(i))^{-\beta}$$&lt;/p&gt;
&lt;p&gt;When $P(i_{\text{new}}) \gg P(i_{\text{old}})$, then $w_{i_{\text{new}}} \to 0$ after normalization by $\max_i w_i$.&lt;/p&gt;
&lt;p&gt;The critic is updated on high-error samples with effectively zero weight, so it barely updates at all, and the actor loss then diverges because it receives gradients from an unmoving, inaccurate critic.&lt;/p&gt;
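&lt;p&gt;A small numeric check of this effect, assuming $\alpha = 0.6$ and a fully annealed $\beta = 1$ (with smaller $\beta$ the down-weighting is milder but still present):&lt;/p&gt;

```python
import numpy as np

def norm_is_weights(priorities, alpha=0.6, beta=1.0):
    # P(i) = p_i^alpha / sum_k p_k^alpha;  w_i = (N * P(i))^(-beta),
    # normalized by max_i w_i
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    w = (len(p) * probs) ** -beta
    return w / w.max()

# After the Phase 2 reset, almost every stored transition sits at the
# clipped maximum priority; one old low-error transition remains.
w = norm_is_weights([1.0] * 99 + [0.01])
```

&lt;p&gt;The 99 high-error samples that dominate sampling each train the critic with a weight of less than a tenth of the maximum, so the critic barely moves despite being fed exactly the transitions it is most wrong about.&lt;/p&gt;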
&lt;p&gt;This is why PER was dropped from the final curriculum configuration entirely.&lt;/p&gt;
&lt;h2&gt;Self-Play&lt;/h2&gt;
&lt;p&gt;I also explored self-play, training the agent against a pool of its own past checkpoints. The hope was to develop generalization beyond the scripted &lt;code&gt;BasicOpponent&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It didn&apos;t work. The agents converged to a Nash equilibrium of &lt;em&gt;mutual avoidance&lt;/em&gt;: both players positioning themselves to not touch the puck rather than risk conceding a goal. Episode lengths climbed toward the 250-step maximum. Once discovered, this drawing strategy was self-reinforcing, because any agent that tried to attack would get punished by an opponent that had learned to exploit aggressive positioning.&lt;/p&gt;
&lt;p&gt;Risk aversion dominates in self-play when the stakes are symmetric. The agent&apos;s value function correctly estimates that the expected return from &quot;don&apos;t touch the puck&quot; is higher than the noisy expected return from &quot;attempt a shot.&quot; Injecting &lt;code&gt;BasicOpponent&lt;/code&gt; episodes or clearing the buffer when switching opponents did not fix this.&lt;/p&gt;
&lt;p&gt;The approach that actually works, from what I heard from peers, is to mix self-play with skill-based training against an easy opponent throughout the whole training run. That way the agent never completely forgets that scoring goals is the point.&lt;/p&gt;
&lt;p&gt;Self-play can lead to degenerate equilibria if not carefully structured. In competitive environments, it&apos;s crucial to maintain a curriculum that keeps the agent focused on the ultimate goal rather than settling for safe but unproductive strategies.&lt;/p&gt;
&lt;h2&gt;Tournament Results&lt;/h2&gt;
&lt;p&gt;The final agent (3-step return, curriculum, no PER) competed in the 2025 RL course tournament at the University of Tübingen, ranking &lt;strong&gt;131/146&lt;/strong&gt; with a &lt;strong&gt;40% win rate&lt;/strong&gt; against other students&apos; agents (the ranking pool includes stale accounts, and some students chose not to enter at all).&lt;/p&gt;
&lt;p&gt;This is a more honest number than the 98.3% against &lt;code&gt;BasicOpponent&lt;/code&gt;. The tournament agents had trained on the same environment and knew the same scripted behaviors. Against agents that could also &lt;em&gt;plan&lt;/em&gt;, the curriculum advantage faded, and the self-play deficiencies became obvious.&lt;/p&gt;
&lt;p&gt;Had I included the basic defending and shooting modes throughout the self-play phase, tournament performance would likely have been noticeably better. The agent was robust, but its training setup was never complete.&lt;/p&gt;
&lt;h2&gt;Practical Takeaways&lt;/h2&gt;
&lt;p&gt;These findings aren&apos;t specific to air hockey.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On PER:&lt;/strong&gt; It works well in stationary, single-distribution settings. In non-stationary environments (curriculum training, population-based training, anything that changes the data distribution mid-training), the mismatch between stored priorities and the current distribution becomes a liability. Either clear the buffer on every regime change or skip PER in this setting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On reward shaping:&lt;/strong&gt; Be selective about what you encode in $\phi$. Subcomponents that make intuitive sense (stay near the puck) can introduce perverse incentives at the MDP level, as &lt;code&gt;game_length&lt;/code&gt; did by rewarding own goals. The policy-invariance theorem guarantees no harm in the limit, but finite training is far from that limit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On multi-step returns:&lt;/strong&gt; Modest $n \in [2, 5]$ is almost always better than large $n \in [10, 30]$ in continuous control. Large $n$ introduces high variance in the return estimate and makes the bootstrap target less reliable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On self-play:&lt;/strong&gt; Combine it with skill-based training modes from day one. Self-play alone, starting from scratch, finds the drawing equilibrium before it finds the scoring one.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Adding algorithmic improvements to TD3 mostly made things worse. What actually unlocked real performance was a carefully structured training curriculum: a decision about &lt;em&gt;what the agent practices&lt;/em&gt; rather than &lt;em&gt;how it learns&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In sparse-reward environments, the hardest problem is not the algorithm, it&apos;s the training setup. PER, multi-step returns, and reward shaping are all principled ideas, but they operate on data. Curriculum learning shapes what data is generated in the first place.&lt;/p&gt;
&lt;p&gt;A stable self-play loop that combines pool-based opponent selection with dedicated skill modes is the most promising direction for pushing these agents further.&lt;/p&gt;
&lt;p&gt;Here are the resources if you want to dive deeper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Technical Report&lt;/li&gt;
&lt;li&gt;Presentation&lt;/li&gt;
&lt;li&gt;GitHub Repo&lt;/li&gt;
&lt;li&gt;RL Agents Code&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;This project was completed as part of the RL Course 2024/25 taught by Prof. Georg Martius at the University of Tübingen, in collaboration with Elia Frederick Reppchen (Rainbow DQN) and ChandraLekha Ramireddy (SAC). Compute was provided by the TCML cluster offered by the Cognitive Systems Group (of Prof. Andreas Zell) at the University of Tübingen.&lt;/p&gt;
</content:encoded></item></channel></rss>