<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://czhou578.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://czhou578.github.io/blog/" rel="alternate" type="text/html" /><updated>2026-05-10T21:17:17+00:00</updated><id>https://czhou578.github.io/blog/feed.xml</id><title type="html">Colin Zhou blog</title><subtitle>I write and muse about various topics as a software engineer in the Bay Area.</subtitle><entry><title type="html">How I use AI Agents for coding in 2026</title><link href="https://czhou578.github.io/blog/2026/04/18/how-i-use-agents.html" rel="alternate" type="text/html" title="How I use AI Agents for coding in 2026" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/18/how-i-use-agents</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/18/how-i-use-agents.html"><![CDATA[<p><img src="https://czhou578.github.io/blog/images/ai-ide.png" alt="alt text" /></p>

<p>Agentic workflows are slowly becoming the norm for software development. My current company generously provides all developers with a subscription to Google AI Ultra, which gives you access to the Antigravity IDE with no rate limits and maximum priority for requests.</p>

<p>Previously, when I was still using GitHub Copilot in VSCode, my AI workflow mainly revolved around adding files and terminal selections into context, typing in a query, waiting for a response, and then manually applying the changes / asking for clarifications.</p>

<p>I rarely found myself turning on agent mode since I wanted to maintain maximum control over what was accepted and what wasn’t. In a way, I thought that accepting agents into my mainstream coding would be like trying to defuse a landmine every time I tried to move forward.</p>

<p>But Antigravity agents on net balance have been <strong>very helpful</strong>.</p>

<p>Through the use of Claude and Gemini, I’ve realized that a large number of bugs that I encounter can be fixed relatively easily with a few targeted prompts, sometimes with only one prompt. As a full stack developer, I have been able to quickly implement UI designs (basically not writing Tailwind CSS at all anymore) and also plan out the architecture of new features.</p>

<p>Even better, if I am not confident in it making a large code change, I can ask the agent to break the change down into smaller steps and review each step individually, or have it generate a plan that I can comment on before letting it go. You don’t even need to link every file of interest into its context; the smarter models can figure that out by themselves.</p>

<hr />

<p>Here are a few things that agents are good at doing, through my experience:</p>

<p><strong>1. Generating architecture deep-dives</strong></p>

<p>This one is probably the most useful of them all. I can add several files to context and ask Claude or Gemini for a first-principles explanation of the code in any file, how the pieces work together, and the overall data flow. I’ve used this to quickly refresh my memory on code that I haven’t touched in a long time, or to explore new repositories.</p>

<p>The GitHub website has an Agents tab that you can use to ask an agent about a codebase. I have used this feature many times to understand the codebase of open source projects I’m interested in contributing to, and even the codebases of other projects at work.</p>

<p>I truly think there is no more excuse for any developer to not be able to understand any codebase, no matter how large or complex it is, now that you can have AI explain it to you.</p>

<p><strong>2. Adding logging to existing code</strong></p>

<p>This is a task that I used to dread. I would have to go through the codebase adding logging statements, then test everything to make sure the logging actually worked. But with AI, if you can narrow down the area of concern, you can ask the LLM to add targeted logging to identify issues.</p>

<p>It can go overboard with the emojis, but I have found that the emojis actually help me more quickly identify the data flow and errors in large log files with tens of thousands of lines.</p>

<p><strong>3. Doing UI work</strong></p>

<p>I barely write my own CSS anymore. I can just describe what I want in plain English, and Claude / Gemini will generate the Tailwind CSS code for me. In terms of design, I can ask Gemini to come up with possible UI designs that fit specific criteria, and have it implement them automatically.</p>

<p>My background is not in design, and I am going off of gut instinct when it comes to the design portion, but I think even designers working with LLMs can produce far more plausible wireframes than they otherwise would. Gemini does seem to be better at UI than the other LLMs for some reason.</p>

<p>I find myself more concerned with how the frontend logic works than with the designs themselves, which for me is very refreshing.</p>

<p><strong>4. Refactoring monolithic components</strong></p>

<p>For one of my projects, I had a manager component in the backend that was responsible for handling websocket connections from the frontend, sending video and audio chunks to a pipeline and a separate microservice, and handling all the responses. It was hard to reason about, and I was honestly dreading refactoring it.</p>

<p>I asked Claude to simply refactor this component into multiple components while keeping the code simple &amp; functional. It proceeded to give me a plan for refactoring into two components that would each be responsible for only part of the original. It ended up doing that completely correctly on the first try.</p>

<p>I stressed over testing the system for a while, but it ended up working out perfectly to specification. Unbelievable.</p>

<p><strong>5. Doing DevOps work</strong></p>

<p>When I wanted to dockerize one of my projects, I had to create Dockerfiles and a docker-compose file. Between the complexity of using multiple AI models and making sure the final setup was easy for a developer to use, I was facing a big uphill climb.</p>

<p>Thankfully, I was able to ask Gemini to give me Dockerfiles for all the services, plus the compose file. It saved all the model weights to a local volume so that in development we wouldn’t have to re-download them every time we ran the system. That saved a lot of time and greatly improved the developer experience, not to mention making deployment to prod easier.</p>

<p><strong>6. Doing security audits and identifying performance optimizations</strong></p>

<p>I was able to ask Claude Opus 4.6 to generate a comprehensive security audit of my codebases, and it identified several vulnerabilities that I was not aware of. It also gave me suggestions on how to fix them, which I was able to apply effectively. For a general-purpose scan, it is useful.</p>

<p>But be aware that it will sometimes highlight changes it deems extremely urgent that, within the scope of the project, are not that big of a deal. Distinguishing the two still requires human judgment.</p>

<hr />

<p>Now on the negative side:</p>

<p><strong>1. Bad Frontend Habits</strong></p>

<p>On the frontend, agents tend to keep adding state and refs in React, which can easily accumulate and become hard to reason about. It seems to be their default behavior, and it needs a lot of human supervision.</p>

<p>Cleanups have been relatively easy for me, but in the early stages it’s definitely hit and miss.</p>

<p><strong>2. Changing Models in the Middle of a Conversation</strong></p>

<p>Changing models in the middle of a conversation can lead to a loss of context, and reconciling their differing arguments can be difficult. If Claude suggested one change but Gemini then reversed it, it is hard to tell which one is correct, and even harder to untangle.</p>

<p>Beyond trivial errors, it really is a hope and a prayer that Claude and Gemini are on the same page. Diversity of thought is worth prioritizing, but sometimes a consensus between models is needed for productivity.</p>

<p><strong>3. Unnecessary File Creation</strong></p>

<p>In addition, agents have a habit of creating files that are not needed on occasion. You have to be very clear and explicit about which files to add. More often than not, for testing purposes, an agent will just create a new script to test something and then not delete it.</p>

<p>I’ve found that agents cannot actually deal with Jupyter Notebooks effectively for some reason. They will often make syntax errors when editing code cells, and sometimes just create a Python script to run the code instead. I don’t know if I’m missing something here.</p>

<p><strong>4. Terminal Management</strong></p>

<p>While I do appreciate agents spinning up a terminal and running commands to test their changes, it can be very annoying to keep track of the terminals that have spawned. I often have existing terminals in play, and conflicts between them can be hard to manage.</p>

<p><strong>5. UI when creating plans</strong></p>

<p>This is definitely nitpicking, but when I ask Gemini to generate a plan in the Antigravity IDE, for example, it creates a document that is not formatted correctly and looks wonky. In comparison, Claude’s plan actually looks like a real Markdown preview you would see on a GitHub repo, not some notepad-quality document.</p>

<hr />

<p>While a lot of people have been playing around with <code class="language-plaintext highlighter-rouge">AGENT.md</code> files and massive configurations for agentic workspaces, I still have not felt the need to do so. Agents have a finite context window, and I don’t want to pollute it. I can see where this could be useful for certain cases like formatting code, which could perhaps happen automatically in the style of ruff in Python or prettier in JavaScript. But is that really worth it when a script that formats both frontend and backend runs in a few seconds?</p>

<p>All in all, the improvement in agentic coding over the past few months has been real, and I’m excited to see what the future in agentic coding holds.</p>

<p>I’ve often compared agentic coding to defusing planted bombs in a minefield. Do a bad job, and a mine can explode in your face, creating a big fat mess. But the size of the mines is definitely shrinking, and we should be excited about that.</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Adding KV Cache to NanoGPT</title><link href="https://czhou578.github.io/blog/2026/04/18/adding-kv-cache-to-nanogpt.html" rel="alternate" type="text/html" title="Adding KV Cache to NanoGPT" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/18/adding-kv-cache-to-nanogpt</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/18/adding-kv-cache-to-nanogpt.html"><![CDATA[<p>NanoGPT is Andrej Karpathy’s <a href="https://github.com/karpathy/nanoGPT">from-scratch GPT</a> trained on Shakespeare — no abstractions, no optimizations, just the bare-minimum transformer you need to generate text. I wanted to understand how inference servers actually work, so I started at the bottom: adding a KV cache to this toy model by hand.</p>

<p>The core idea is simple. In a standard transformer, every time you generate a new token, you recompute the key and value projections for <em>every</em> token in the sequence — including all the ones you already processed. That’s quadratic in the sequence length. A KV cache stores the key and value vectors from previous positions so you never recompute them. The query for the new token attends over the cached keys and values, and you only compute K and V for the single new token. This brings the per-step projection cost from O(n) to O(1) and the total projection work across a generation from O(n²) to O(n); the attention product itself still reads the n cached rows each step, but nothing is ever recomputed.</p>
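<p>To make the mechanics concrete before touching NanoGPT, here is the cache idea in isolation, with plain tensors and made-up sizes (nothing below is from the actual model; it just shows the append-then-attend loop):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

B, hs = 1, 16
k_cache = torch.empty(B, 0, hs)   # grows one row per decoded token
v_cache = torch.empty(B, 0, hs)

for step in range(3):
    # project ONLY the new token (in the real model these come from nn.Linear)
    k_new, v_new, q_new = (torch.randn(B, 1, hs) for _ in range(3))
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)
    # the single new query attends over everything cached so far
    wei = (q_new @ k_cache.transpose(1, 2)) * hs**-0.5    # (B, 1, step+1)
    out = torch.softmax(wei, dim=-1) @ v_cache            # (B, 1, hs)
    print(step, tuple(k_cache.shape), tuple(wei.shape))
</code></pre></div></div>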

<h2 id="where-the-cache-lives">Where the cache lives</h2>

<p>The natural place for the cache is inside the <code class="language-plaintext highlighter-rouge">Head</code> class — the module that handles one head of self-attention. Each head independently projects the input into query, key, and value vectors, so each head needs its own cache.</p>

<p>I had to think through several things:</p>

<ul>
  <li><strong>Data structure.</strong> My first instinct was a hashmap keyed by token ID, but that’s wrong — the cache isn’t about <em>which</em> token, it’s about <em>which position</em>. It’s just a tensor of shape <code class="language-plaintext highlighter-rouge">(B, T, hs)</code> that grows by one row each decode step as we concatenate new key/value vectors along the sequence dimension.</li>
  <li><strong>Don’t interleave K and V.</strong> You might think about storing keys and values in one tensor, but the whole point of caching is fast access. During attention, you need <code class="language-plaintext highlighter-rouge">Q @ K^T</code> and then <code class="language-plaintext highlighter-rouge">weights @ V</code> — interleaving would force you to extract K and V back out every step, defeating the purpose.</li>
  <li><strong>Masking goes away.</strong> In the original NanoGPT, the causal mask prevents the model from attending to future tokens during training. But during cached inference, we feed one token at a time. There are no future tokens to mask — the cache only contains past positions. So the mask can be removed for the single-token decode path. (Note this only holds when one token is fed at a time; a multi-token prefill through this branch would still need the causal mask to match the no-cache model exactly.)</li>
  <li><strong>Training vs. inference.</strong> PyTorch’s <code class="language-plaintext highlighter-rouge">nn.Module</code> has a <code class="language-plaintext highlighter-rouge">self.training</code> flag that flips when you call <code class="language-plaintext highlighter-rouge">model.eval()</code>. We use this to guard the cache logic: during training, the original full-sequence attention runs unchanged; during inference, we accumulate into the cache and attend over it.</li>
</ul>

<p>The final code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">Head</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="s">""" one head of self-attention """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">head_size</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">key</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embd</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embd</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embd</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'tril'</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">block_size</span><span class="p">,</span> <span class="n">block_size</span><span class="p">)))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value_cache</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>

    <span class="c1"># KV Cache lives here.
</span>    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># input of size (batch, time-step, channels)
</span>        <span class="c1"># output of size (batch, time-step, head size)
</span>        <span class="n">B</span><span class="p">,</span><span class="n">T</span><span class="p">,</span><span class="n">C</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span>
        <span class="n">k</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">key</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>   <span class="c1"># (B,1,hs)
</span>        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># (B,1,hs)
</span>        <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">training</span><span class="p">:</span>
            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span><span class="p">,</span> <span class="n">k</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># (B, num_tokens_seen, hs)
</span>                <span class="bp">self</span><span class="p">.</span><span class="n">value_cache</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="bp">self</span><span class="p">.</span><span class="n">value_cache</span><span class="p">,</span> <span class="n">v</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># (B, num_tokens_seen, hs)
</span>            <span class="k">else</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span> <span class="o">=</span> <span class="n">k</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">value_cache</span> <span class="o">=</span> <span class="n">v</span>

            <span class="n">wei</span> <span class="o">=</span> <span class="n">q</span> <span class="o">@</span> <span class="n">torch</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">key_cache</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">**-</span><span class="mf">0.5</span>

            <span class="n">wei</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">wei</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (B, T, T)
</span>            <span class="n">wei</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">wei</span><span class="p">)</span>
            <span class="c1"># perform the weighted aggregation of the values
</span>            <span class="n">out</span> <span class="o">=</span> <span class="n">wei</span> <span class="o">@</span> <span class="bp">self</span><span class="p">.</span><span class="n">value_cache</span> <span class="c1"># (B, 1, T) @ (B, T, hs) -&gt; (B, 1, hs)
</span>            <span class="k">return</span> <span class="n">out</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># compute attention scores ("affinities")
</span>            <span class="n">wei</span> <span class="o">=</span> <span class="n">q</span> <span class="o">@</span> <span class="n">k</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">**-</span><span class="mf">0.5</span> <span class="c1"># (B, T, hs) @ (B, hs, T) -&gt; (B, T, T)
</span>            <span class="n">wei</span> <span class="o">=</span> <span class="n">wei</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tril</span><span class="p">[:</span><span class="n">T</span><span class="p">,</span> <span class="p">:</span><span class="n">T</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">))</span> <span class="c1"># (B, T, T)
</span>            <span class="n">wei</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">wei</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (B, T, T)
</span>            <span class="n">wei</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">wei</span><span class="p">)</span>
            <span class="c1"># perform the weighted aggregation of the 
</span>            
            <span class="n">out</span> <span class="o">=</span> <span class="n">wei</span> <span class="o">@</span> <span class="n">v</span> <span class="c1"># (B, T, T) @ (B, T, hs) -&gt; (B, T, hs)
</span>            <span class="k">return</span> <span class="n">out</span>  

</code></pre></div></div>

<p>The inference branch is the interesting one. When a new token arrives, we project it to get <code class="language-plaintext highlighter-rouge">k</code> and <code class="language-plaintext highlighter-rouge">v</code> of shape <code class="language-plaintext highlighter-rouge">(B, 1, hs)</code> and concatenate them onto the existing cache. Now the cache has shape <code class="language-plaintext highlighter-rouge">(B, T_so_far, hs)</code>. The query — also <code class="language-plaintext highlighter-rouge">(B, 1, hs)</code> — attends over the full cache: <code class="language-plaintext highlighter-rouge">Q @ K^T</code> gives <code class="language-plaintext highlighter-rouge">(B, 1, T_so_far)</code> attention weights, and the weighted sum over V gives <code class="language-plaintext highlighter-rouge">(B, 1, hs)</code>. One row in, one row out. The training branch is unchanged — full-sequence attention with the causal mask, exactly as Karpathy wrote it.</p>

<p>The <code class="language-plaintext highlighter-rouge">if self.key_cache is not None</code> check handles the first step: when the cache is empty (the very first forward pass), we initialize it directly instead of trying to concatenate onto <code class="language-plaintext highlighter-rouge">None</code>.</p>
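<p>One helper the generation code below depends on is <code class="language-plaintext highlighter-rouge">clear_kv_cache</code>, which isn’t shown in the listing above. A minimal version (my sketch, assuming the <code class="language-plaintext highlighter-rouge">Head</code> class as written) just walks the modules and resets both buffers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def clear_kv_cache(model):
    # reset every attention head's cache so a new prompt starts from scratch
    for module in model.modules():
        if isinstance(module, Head):
            module.key_cache = None
            module.value_cache = None
</code></pre></div></div>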

<h2 id="generation">Generation</h2>

<p>With the cache in place, we need a generation function that actually uses it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_kv_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">idx</span><span class="p">,</span> <span class="n">max_num_tokens</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">clear_kv_cache</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>

    <span class="n">model</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>

    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_num_tokens</span><span class="p">):</span>
            <span class="n">curr_pos</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>

            <span class="n">logits</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">idx</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">:],</span> <span class="n">pos</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">[</span><span class="n">curr_pos</span><span class="p">],</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span> <span class="c1"># (B, 1, C)
</span>            <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
            
            <span class="c1"># apply softmax to get probabilities
</span>            <span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (B, C)
</span>            <span class="c1"># sample from the distribution
</span>            <span class="n">idx_next</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (B, 1)
</span>            <span class="c1"># append sampled index to the running sequence
</span>            <span class="n">idx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">idx</span><span class="p">,</span> <span class="n">idx_next</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (B, T+1)
</span>    
    <span class="k">return</span> <span class="n">idx</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">GPTLanguageModel</span><span class="p">()</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># print the number of parameters in the model
</span><span class="k">print</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">m</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span><span class="o">/</span><span class="mf">1e6</span><span class="p">,</span> <span class="s">'M parameters'</span><span class="p">)</span>

<span class="c1"># create a PyTorch optimizer
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>

<span class="k">for</span> <span class="nb">iter</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iters</span><span class="p">):</span>

    <span class="c1"># every once in a while evaluate the loss on train and val sets
</span>    <span class="k">if</span> <span class="nb">iter</span> <span class="o">%</span> <span class="n">eval_interval</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="nb">iter</span> <span class="o">==</span> <span class="n">max_iters</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">losses</span> <span class="o">=</span> <span class="n">estimate_loss</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"step </span><span class="si">{</span><span class="nb">iter</span><span class="si">}</span><span class="s">: train loss </span><span class="si">{</span><span class="n">losses</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, val loss </span><span class="si">{</span><span class="n">losses</span><span class="p">[</span><span class="s">'val'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># sample a batch of data
</span>    <span class="n">xb</span><span class="p">,</span> <span class="n">yb</span> <span class="o">=</span> <span class="n">get_batch</span><span class="p">(</span><span class="s">'train'</span><span class="p">)</span>

    <span class="c1"># evaluate the loss
</span>    <span class="n">logits</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">,</span> <span class="n">yb</span><span class="p">)</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">(</span><span class="n">set_to_none</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

<span class="c1"># generate from the model
</span><span class="n">context</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">decode</span><span class="p">(</span><span class="n">generate_kv_cache</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">max_num_tokens</span><span class="o">=</span><span class="mi">500</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="n">tolist</span><span class="p">()))</span>


</code></pre></div></div>
<p>In this function, we set the model to evaluation mode and clear the KV cache so that a previous generation doesn’t leak into this one.</p>

<p>We run <code class="language-plaintext highlighter-rouge">model(idx)</code> once to prefill the KV cache with the whole prompt; the logits from that pass already give us the distribution for the first new token. The loop then takes the logits at the last position, runs softmax to get probabilities, samples the next index, appends it to the running sequence, and feeds just that one new token (with its position) back through the model to produce the next set of logits. At the end, the accumulated indices are decoded back into characters.</p>

<h2 id="positional-encoding">Positional encoding</h2>

<p>There’s a subtlety here that tripped me up. A transformer has no inherent sense of order — “A cat is big” and “A big is cat” would produce the same embeddings without position information. NanoGPT uses a learned position embedding table: during the forward pass, the position index looks up a vector from the table, and that vector gets added to the token embedding.</p>

<p>During full-sequence training, this is straightforward: if the sequence has 17 tokens, you look up positions 0 through 16. But with the KV cache, we’re feeding one token at a time. If we don’t pass the correct position, the model treats every token as position 0.</p>

<p>The fix is simple: <code class="language-plaintext highlighter-rouge">curr_pos = idx.shape[1]</code>, which is the current length of the full sequence (prompt + generated so far). Here’s a concrete example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: "O Romeo, " → encodes to 9 tokens: [15, 23, 6, 18, 14, 5, 12, 0, 3]

idx = [[15, 23, 6, 18, 14, 5, 12, 0, 3]]   # shape (1, 9)
       ↑   ↑   ↑   ↑   ↑   ↑   ↑  ↑  ↑
      pos0 pos1 ... ... ... ... ... ... pos8

Step 0: width is 9 → the sampled token 42 takes pos 9 → append
idx = [[15, 23, 6, 18, 14, 5, 12, 0, 3, 42]]   # shape (1, 10)

Step 1: width is 10 → the sampled token 10 takes pos 10 → append
idx = [[15, 23, 6, 18, 14, 5, 12, 0, 3, 42, 10]]   # shape (1, 11)

Step 2: width is 11 → the sampled token 19 takes pos 11 → append
idx = [[15, 23, 6, 18, 14, 5, 12, 0, 3, 42, 10, 19]]   # shape (1, 12)
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">idx.shape[1]</code> always gives us exactly the right position index for the next token.</p>

<h2 id="verification">Verification</h2>

<p>The most important check: if the KV cache is mathematically correct, it should produce the exact same tokens as the no-cache version given the same random seed and prompt. The cache is an optimization, not an approximation — it shouldn’t change the output at all.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">context</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>

<span class="c1"># Run without cache
</span><span class="n">out_no_cache</span> <span class="o">=</span> <span class="n">generate_no_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="c1"># Run with cache (same seed, same prompt)
</span><span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">out_with_cache</span> <span class="o">=</span> <span class="n">generate_with_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="c1"># Check token-by-token equality
</span><span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">out_no_cache</span><span class="p">,</span> <span class="n">out_with_cache</span><span class="p">),</span> \
    <span class="sa">f</span><span class="s">"MISMATCH!</span><span class="se">\n</span><span class="s">No cache:   </span><span class="si">{</span><span class="n">out_no_cache</span><span class="si">}</span><span class="se">\n</span><span class="s">With cache: </span><span class="si">{</span><span class="n">out_with_cache</span><span class="si">}</span><span class="s">"</span>

<span class="k">print</span><span class="p">(</span><span class="s">"✓ Cache output matches no-cache output exactly!"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">decode</span><span class="p">(</span><span class="n">out_with_cache</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tolist</span><span class="p">()))</span>
</code></pre></div></div>

<p>This seeds the RNG, runs the same prompt through both paths, and asserts that every token matches. If even one differs, the cache logic has a bug.</p>

<h2 id="shape-walkthrough">Shape walkthrough</h2>

<p>It helps to trace the dimensions through one full cycle to make sure everything fits:</p>

<p><strong>Prefill (9-token prompt):</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model.forward(idx)  # idx: (1, 9)
  tok_emb: (1, 9, 64)    # token embedding lookup
  pos_emb: (9, 64)        # position embedding for positions 0..8
  x:       (1, 9, 64)     # tok_emb + pos_emb (broadcast)

  → Head.forward(x):
    k = self.key(x):     (1, 9, 16)   # Linear(64 → 16)
    q = self.query(x):   (1, 9, 16)
    v = self.value(x):   (1, 9, 16)

    key_cache is None → set directly
    key_cache:    (1, 9, 16)
    value_cache:  (1, 9, 16)

    wei = q @ key_cache.T:  (1,9,16) @ (1,16,9) → (1, 9, 9)
    out = wei @ value_cache: (1,9,9) @ (1,9,16) → (1, 9, 16)
</code></pre></div></div>

<p><strong>Decode step 0 (one new token):</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model.forward(idx[:, -1:], pos=9)  # idx: (1, 1)
  tok_emb: (1, 1, 64)
  pos_emb: (1, 64)        # position embedding for position 9
  x:       (1, 1, 64)

  → Head.forward(x):
    k = self.key(x):   (1, 1, 16)
    q = self.query(x): (1, 1, 16)
    v = self.value(x): (1, 1, 16)

    key_cache: cat[(1,9,16), (1,1,16)] → (1, 10, 16)
    value_cache: cat[(1,9,16), (1,1,16)] → (1, 10, 16)

    wei = q @ key_cache.T:   (1,1,16) @ (1,16,10) → (1, 1, 10)
    out = wei @ value_cache: (1,1,10) @ (1,10,16) → (1, 1, 16)
</code></pre></div></div>

<p>During prefill, the query has 9 positions and attends over 9 cached positions — <code class="language-plaintext highlighter-rouge">(1, 9, 9)</code> attention weights. During decode, the query has 1 position and attends over 10 cached positions — <code class="language-plaintext highlighter-rouge">(1, 1, 10)</code>. The cache grew by one row, the projection work stayed constant at one token per step, and attention just reads the cached rows instead of recomputing K and V for the entire sequence.</p>

<h2 id="benchmarks">Benchmarks</h2>

<p>The real test: does this actually speed things up?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># ── non-cached generate (forces full-context recompute every step) ────────────
</span><span class="k">def</span> <span class="nf">generate_no_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">idx</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="p">):</span>
    <span class="s">"""Runs in train mode so the KV cache branch is never entered."""</span>
    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>                          <span class="c1"># disables KV cache path
</span>    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
            <span class="n">idx_cond</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[:,</span> <span class="o">-</span><span class="n">block_size</span><span class="p">:]</span>
            <span class="n">logits</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">idx_cond</span><span class="p">)</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
            <span class="n">probs</span>  <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">idx_next</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">idx</span><span class="p">,</span> <span class="n">idx_next</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">idx</span>

<span class="c1"># ── cached generate (your existing path, one token fed at a time) ─────────────
</span><span class="k">def</span> <span class="nf">generate_with_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">idx</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">clear_kv_cache</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
            <span class="c1"># Feed only the LAST token so the cache does the rest of the work
</span>            <span class="n">logits</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">idx</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">:])</span>   <span class="c1"># (B, 1, vocab_size)
</span>            <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
            <span class="n">probs</span>  <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">idx_next</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">idx</span><span class="p">,</span> <span class="n">idx_next</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">idx</span>

<span class="c1"># ── benchmark ─────────────────────────────────────────────────────────────────
</span><span class="n">N_TOKENS</span>   <span class="o">=</span> <span class="mi">200</span>
<span class="n">N_RUNS</span>     <span class="o">=</span> <span class="mi">3</span>       <span class="c1"># average over multiple runs for stability
</span><span class="n">context</span>    <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">long</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>

<span class="c1"># warm-up (avoids cold-start CUDA overhead skewing results)
</span><span class="n">_</span> <span class="o">=</span> <span class="n">generate_no_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">clear_kv_cache</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">generate_with_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="mi">10</span><span class="p">)</span>

<span class="c1"># --- No KV cache ---
</span><span class="n">times_no_cache</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_RUNS</span><span class="p">):</span>
    <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">generate_no_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="n">N_TOKENS</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">device</span> <span class="o">==</span> <span class="s">'cuda'</span><span class="p">:</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">synchronize</span><span class="p">()</span>
    <span class="n">times_no_cache</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span>

<span class="c1"># --- With KV cache ---
</span><span class="n">times_cache</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N_RUNS</span><span class="p">):</span>
    <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">generate_with_cache</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span> <span class="n">N_TOKENS</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">device</span> <span class="o">==</span> <span class="s">'cuda'</span><span class="p">:</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">synchronize</span><span class="p">()</span>
    <span class="n">times_cache</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span>

<span class="n">avg_no_cache</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">times_no_cache</span><span class="p">)</span> <span class="o">/</span> <span class="n">N_RUNS</span>
<span class="n">avg_cache</span>    <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">times_cache</span><span class="p">)</span>    <span class="o">/</span> <span class="n">N_RUNS</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Tokens generated : </span><span class="si">{</span><span class="n">N_TOKENS</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"No KV cache      : </span><span class="si">{</span><span class="n">avg_no_cache</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">s  (</span><span class="si">{</span><span class="n">N_TOKENS</span><span class="o">/</span><span class="n">avg_no_cache</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> tok/s)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"With KV cache    : </span><span class="si">{</span><span class="n">avg_cache</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">s  (</span><span class="si">{</span><span class="n">N_TOKENS</span><span class="o">/</span><span class="n">avg_cache</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> tok/s)"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Speedup          : </span><span class="si">{</span><span class="n">avg_no_cache</span><span class="o">/</span><span class="n">avg_cache</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">×"</span><span class="p">)</span>

</code></pre></div></div>

<p>In this block, we first define the two versions of the generate function. <code class="language-plaintext highlighter-rouge">generate_no_cache</code> is essentially the original one: it runs in train mode so the cache branch is never entered, and re-feeds the whole (cropped) context every step. <code class="language-plaintext highlighter-rouge">generate_with_cache</code> feeds only the last token each step and lets the cache supply the rest.</p>

<p>The benchmark then warms up both paths to avoid cold-start CUDA overhead, times each over three runs of 200 tokens, and prints the average throughput along with the speedup of the cached path over the uncached one.</p>

<p>Running this code, we get:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tokens generated : 200
No KV cache      : 1.305s  (153.3 tok/s)
With KV cache    : 1.172s  (170.7 tok/s)
Speedup          : 1.11×
</code></pre></div></div>

<p>Only 1.11×. That’s real but underwhelming. The reason: this model is <em>tiny</em> — 0.2M parameters, 4 layers, 4 heads. At this scale, the Python interpreter overhead (function calls, tensor creation, <code class="language-plaintext highlighter-rouge">torch.cat</code>) dominates the actual matrix multiplications. The KV cache saves recomputation that barely costs anything in the first place. On larger models with longer sequences, the quadratic savings become dramatic — this is why production systems treat the KV cache as essential infrastructure, not an optimization.</p>

<h2 id="output">Output</h2>

<p>The model generates Shakespeare-flavored gibberish, which is exactly what we expect from a character-level model trained on a small corpus:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>And they brid write, is not the die;
Though we art One my day hangs:
Wart he us hath bury, dills ane away, my feanst,
Anzing heavens, tofultien me milen's
Whines is eye, hain latise, drovets, and Will.

Downerabs!
Alhin the courtius, onceivy:
Supplain's twoy. Hence's norfole,
Against my lows thee again Willo when evicks eye myself?
ETo husing stroops: the resheper my brupt for treign the flows.
Tale oftenceful in thy offery your
Hasting is a aday Was happesty:
if courty.

ANGCIO:
Say, from care,
</code></pre></div></div>

<h2 id="things-that-went-wrong">Things that went wrong</h2>

<p><strong>Loss estimation frequency.</strong> I was running the loss estimation loop every 100 steps, which was destroying throughput on the free Colab GPU. Changing it to every 500 steps made training workable.</p>

<p><strong><code class="language-plaintext highlighter-rouge">torch.compile</code> and mutable state.</strong> I tried <code class="language-plaintext highlighter-rouge">torch.compile(model)</code> to speed things up, but it doesn’t play well with the KV cache. Torch compile traces the computation graph and replays it — but the cache is mutable state that changes shape every step. The traced graph expects fixed shapes and corrupts the output. Production systems solve this with pre-allocated caches and padding, but for a toy implementation it’s easier to just skip <code class="language-plaintext highlighter-rouge">compile</code>.</p>

<p><strong>Validation loss not decreasing.</strong> At one point my validation loss was flat. The cause was surprising: I was calling <code class="language-plaintext highlighter-rouge">model.train()</code> and <code class="language-plaintext highlighter-rouge">model.eval()</code> in the estimation loop, but since I wasn’t using dropout, there was no behavioral difference between the two modes — except that <code class="language-plaintext highlighter-rouge">model.eval()</code> was activating the KV cache path, which was corrupting the loss computation. Removing those calls from the estimation function fixed it.</p>

<p><strong>CUDA device-side assert.</strong> After training, generation crashed with <code class="language-plaintext highlighter-rouge">CUDA error: device-side assert triggered</code>. The position embedding table had only 32 entries (<code class="language-plaintext highlighter-rouge">block_size = 32</code>), so generating 500 tokens tried to look up position 500 in a table that only goes to 31. The fix: cap <code class="language-plaintext highlighter-rouge">max_new_tokens</code> so that <code class="language-plaintext highlighter-rouge">prompt_length + max_new_tokens ≤ block_size</code>.</p>
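
<p>The guard itself is one line. Here is a minimal sketch, assuming nanoGPT-style names (<code class="language-plaintext highlighter-rouge">prompt</code> as a <code class="language-plaintext highlighter-rouge">(batch, seq_len)</code> tensor) rather than my exact notebook code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Cap generation so position lookups never run past the embedding table.
# Assumes prompt is a (batch, seq_len) tensor and block_size = 32, as above.
max_new_tokens = min(max_new_tokens, block_size - prompt.shape[1])
</code></pre></div></div>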

<p>This is a real limitation worth calling out. With a fixed-size learned position embedding table, you can never generate more than <code class="language-plaintext highlighter-rouge">block_size</code> total tokens. Production models solve this with RoPE (Rotary Positional Embeddings), which computes position information on the fly instead of looking it up in a table, removing the sequence length cap entirely.</p>
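
<p>For intuition, here is a minimal RoPE sketch in PyTorch. This is the half-split rotation used by GPT-NeoX-style models, not code from any particular production system: each query/key feature pair is rotated by a position-dependent angle computed on the fly, so there is no table to outgrow.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); head_dim must be even.
    b, h, t, d = x.shape
    half = d // 2
    # One frequency per feature pair; earlier pairs rotate faster.
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    # Angle for every (position, pair) combination: shape (seq, half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation of each pair by its angle; applied to q and k before attention.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
</code></pre></div></div>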

<p>You can see the entire code on my GitHub: <a href="https://github.com/czhou578/multimodal-inference-visualizer/blob/main/nanogpt.ipynb">https://github.com/czhou578/multimodal-inference-visualizer/blob/main/nanogpt.ipynb</a></p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[NanoGPT is Andrej Karpathy’s from-scratch GPT trained on Shakespeare — no abstractions, no optimizations, just the bare-minimum transformer you need to generate text. I wanted to understand how inference servers actually work, so I started at the bottom: adding a KV cache to this toy model by hand.]]></summary></entry><entry><title type="html">Building a Knowledge Base for AI Agents</title><link href="https://czhou578.github.io/blog/2026/04/16/knowledge-base.html" rel="alternate" type="text/html" title="Building a Knowledge Base for AI Agents" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/16/knowledge-base</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/16/knowledge-base.html"><![CDATA[<p>I saw a post on X recently from Andrej Karpathy about building a knowledge base, and taking advantage of modern frontier LLM’s to create specialized local knowledge troves that could be used to understand and synthesize large quantities of information.</p>

<p>In a sense, think of Wikipedia, but instead of having to manually maintain such a wiki, you instead have AI agents do the maintenance, and you as the human are only responsible for curating the initial set of documents and information that you want to be included in the knowledge base.</p>

<p>I thought this was a really interesting idea, and I decided to try it out for myself.</p>

<h2 id="motivation">Motivation</h2>

<p>Recently, my sister has been advancing a lot in her violin playing, and she’s been asked to take on more responsibilities, not just for her school orchestra but also for other events. Her lessons cost $120 an hour, so there is immense pressure on her to improve enough to justify that expense.</p>

<p>I wanted to research how to build an AI agent violin consultant system that could take in audio recordings of her playing and give specialized, targeted feedback, improving the efficiency of her practice.</p>

<p>The problem was that I had no clue how to get started. Even though I’m a software engineer who took private piano lessons for over 10 years, I didn’t know much about translating music theory to technology like an AI agent. This was the perfect chance to build a specialized knowledge base to help me with this.</p>

<h2 id="curation">Curation</h2>

<p>I started by curating a list of documents that I thought would be relevant to this project. These included things like Python libraries of interest, chats with Claude about the topic, and other miscellaneous resources I found online. I planned my knowledge base to only take in documents in markdown format. To that end, I installed the Obsidian Chrome extension, which allows you to save web pages as markdown files. Obsidian was the note-taking app of choice in Karpathy’s original knowledge base concept; I personally don’t use these kinds of note-taking apps, but the extension made downloading information much easier, which I appreciated.</p>

<p>In my <code class="language-plaintext highlighter-rouge">_posts</code> folder, I now had a collection of markdown files spanning resources from many online sources. In order to create a real wiki, I used Claude Sonnet 4.6 in Antigravity IDE to create an implementation plan. It decided to create a single page application (SPA) inside the existing <code class="language-plaintext highlighter-rouge">wiki</code> directory. Inside of this directory were the <code class="language-plaintext highlighter-rouge">index.html</code> file, a <code class="language-plaintext highlighter-rouge">styles.css</code> file, and a <code class="language-plaintext highlighter-rouge">main.js</code> file. The <code class="language-plaintext highlighter-rouge">main.js</code> file was responsible for reading the markdown files and rendering them on the page, as well as handling the correct page routing.</p>

<p>The resulting page that I ended up seeing was basic but functional. Here is what it looked like:</p>

<p><img src="/blog/images/knowledge-base-1.png" alt="alt text" /></p>

<p>As you can see, it did have a sidebar of the pages that were available and also the actual page, with the title and the content formatted. It didn’t look completely professional, but it was a good start.</p>

<p>Finally, I asked Claude to create a bibliography page that would include all the links that were referenced in the markdown files. It did this successfully, and I was able to click on the links to visit the original sources.</p>

<p><img src="/blog/images/knowledge-base-4.png" alt="alt text" /></p>

<p>During this process, it created a json file called <code class="language-plaintext highlighter-rouge">wiki-index.json</code> that held the metadata for all the pages, including the links. I think this helped it greatly when it was diving deep into the knowledge base.</p>

<h2 id="problems">Problems</h2>

<p>Here were the problems I encountered:</p>

<ol>
  <li>
<p>One of my resources had chunks of code from a GitHub repo page in its markdown file. When that was first rendered, the code blocks weren’t recognized and were displayed as regular text. I had to explicitly tell Claude to format all code correctly for this to be fixed.</p>
  </li>
  <li>
<p>Claude surprisingly had a lot of trouble with links. It would not make links clickable unless I specifically told it to. I don’t know if other LLMs would have this issue, but it required extra prompting to work. It also had a tendency to include links in the titles of pages, which I had to remove. It turned out this was due to how Obsidian’s Chrome extension downloaded the sources: it added a YAML header to each markdown download, which made the CSS wonky in the beginning.</p>
  </li>
</ol>

<p>After I asked Claude to fix the errors with the prompt “Could you format the title and the tables in each page correctly? I want the link to the source to be displayed nicely under the main title, followed by the author, all without quotes.”, the issues were resolved.</p>

<ol start="3">
  <li>Some of the formatting in other elements was also off, for example the tables:</li>
</ol>

<p><img src="/blog/images/knowledge-base-2.png" alt="alt text" /></p>

<p>In order to fix this, I had to tell Claude to increase the padding on the columns of the tables. It actually did a very good job, and the result looked something like this:</p>

<p><img src="/blog/images/knowledge-base-3.png" alt="alt text" /></p>

<h2 id="agent-instructions">Agent Instructions</h2>

<p>Here were the files that I created to help Claude and other AI agents with navigating this codebase:</p>

<p><code class="language-plaintext highlighter-rouge">AGENTS.md</code>: This file contained the instructions for Claude on how to behave and what its goals were. For this one, I took a lot of inspiration from a <a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">gist</a> on Karpathy’s GitHub. It describes multiple operations that can be performed (ingest, query, lint, etc.) and the scenarios in which each should be performed.</p>

<p><code class="language-plaintext highlighter-rouge">log.md</code>: This log contained the chronological history of every operation performed on the knowledge base, including the date and time, the type of operation, and its result.</p>

<p><code class="language-plaintext highlighter-rouge">wiki-index.json</code>: This json file contained the metadata for all the pages in the knowledge base, including the links to the original sources, along with details like word counts and titles. It partially overlaps with what <code class="language-plaintext highlighter-rouge">log.md</code> tracks, but I decided to keep it as a redundancy.</p>

<h2 id="testing">Testing</h2>

<p>To test, I launched a query in my Antigravity IDE’s sidebar agent console to Claude. My query was “What is the ideal pipeline that i can use to build the violin agent? consult the sources in raw folder”.</p>

<p>Claude’s response was to do the following:</p>

<ul>
  <li>Edit <code class="language-plaintext highlighter-rouge">log.md</code> to include the query and response in the history</li>
  <li>Create a new file called <code class="language-plaintext highlighter-rouge">Ideal Pipeline — Violin Coaching Agent.md</code> with the answer content</li>
  <li>Edit <code class="language-plaintext highlighter-rouge">wiki-index.json</code> to include the new page</li>
</ul>

<p>Honestly, I was quite surprised that it was smart enough to create the new file, add that to my knowledge base, and then update the index. I didn’t even have to prompt it to do so! The answer content itself drew from 6 sources that I had previously curated and correctly displayed the code, along with all explanations in an easy to read format.</p>

<p>In addition, when I tried to hijack the system by asking Claude to “Use the wiki and answer are cats the best animal in the world?”, it refused to do so by saying that the wiki contents were not related to my query and that I would need to ingest a related source in order to get an answer! Amazing stuff here…</p>

<h2 id="conclusion">Conclusion</h2>

<p>Overall, I still think the bottleneck is that you not only have to manually curate the downloads, but also restart the development server every time a new download comes in, which is not ideal. I don’t know exactly what it would take to have something dynamically listening for new articles being added to the <code class="language-plaintext highlighter-rouge">outputs</code> folder, but that would give the whole app a much more responsive feel.</p>

<p>In addition, I was testing the application by running queries in the sidebar agent console in Antigravity IDE, which feels a bit strange. If it was possible to create some kind of input area on the frontend and then get back the results without having to resort to my IDE, I think that would be a great benefit as well.</p>

<p>Finally, it would be better for the user to define the CSS rules for the wiki somewhere in an <code class="language-plaintext highlighter-rouge">AGENTS.md</code> file or similar, as I found that new articles added to the wiki did not automatically adhere to the same CSS rules as the other pages!</p>

<p>My GitHub repo with the code is <a href="https://github.com/czhou578/knowledge-base">here</a>.</p>

<p>What do you think? I would love to hear your thoughts!</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I saw a post on X recently from Andrej Karpathy about building a knowledge base, and taking advantage of modern frontier LLM’s to create specialized local knowledge troves that could be used to understand and synthesize large quantities of information.]]></summary></entry><entry><title type="html">I Trained an AI to Speak Like JFK</title><link href="https://czhou578.github.io/blog/2026/04/07/jfk-voice-clone.html" rel="alternate" type="text/html" title="I Trained an AI to Speak Like JFK" /><published>2026-04-07T00:00:00+00:00</published><updated>2026-04-07T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/07/jfk-voice-clone</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/07/jfk-voice-clone.html"><![CDATA[<p><img src="/blog/images/jfk.png" alt="alt text" /></p>

<p>As someone who has always been fascinated by history and the events of the 20th century, I’ve always been keen to explore alternate scenarios, where a famous historical figure lived to see an event that didn’t occur when they were alive.</p>

<p>John F. Kennedy was the 35th president of the United States, tragically assassinated in 1963. Many do not know that when he was shot, he was carrying in his clothing a copy of a speech he was scheduled to give later that day at the Dallas Trade Mart. The turkey lunch had already been set out, awaiting the arrival of a president who would never make it.</p>

<p>For a long time, I was interested in the idea of what Kennedy would’ve sounded like had he made it to that lunch safely and delivered this speech.</p>

<p>Thankfully, in this age of AI, that is now possible to discover.</p>

<p>I decided to create a system that would train on a corpus of JFK’s speeches from his term in office, and then have that replicated voice read the undelivered Dallas Trade Mart speech, song lyrics, and more. This project has been a dream of mine for many years.</p>

<h2 id="technologies">Technologies</h2>

<p>I first had to pick out the tech stack I was going to use. Python was an easy language choice for ML purposes. For the finetuning model, I could have developed something myself, but I decided to use the F5 TTS open source voice cloning model from GitHub. Through Reddit comments and my online research, many people recommended it as an ideal choice when the priority is cloning speed. This was also backed up by Claude, which offered other options like XTTS but noted that F5 was well maintained and optimized for speed. I wasn’t willing to pay for ElevenLabs or a third party proprietary API.</p>

<p>For GPU, I rented an RTX 4000 Ada on Runpod for about $0.27/hr. I debated whether to use the more powerful RTX 4090 or RTX 5090, but as my experiments showed, the budget GPU performed very well in terms of inference time. I spent a total of approximately $2 to fully finish this project, which I’m very happy about.</p>

<p>For storing big chunks of data, I initially chose Git LFS since it’s just a natural extension of using Git. But having a HuggingFace repo to store the model checkpoints was a big convenience for me.</p>

<h2 id="data-downloading">Data Downloading</h2>

<p>In order to get the voice of JFK to be as good as possible, I found a 4 hour clip of JFK’s speeches from various events on YouTube (yes, that does exist) and converted it to a large mp3 file. This took a while, since a lot of online YouTube to mp3 converters don’t accept clips that long. I ended up using the <code class="language-plaintext highlighter-rouge">yt-dlp</code> Python library, which downloaded the audio directly from YouTube via the video URL into 16 kHz mono WAV format.</p>
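
<p>For reference, here is roughly what that download step looks like with yt-dlp’s Python API. The options are yt-dlp’s documented ones, but the URL is a placeholder and my exact settings may have differed slightly:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import yt_dlp

opts = {
    "format": "bestaudio/best",
    "outtmpl": "jfk_speeches.%(ext)s",
    # Extract audio with ffmpeg and resample to 16 kHz mono WAV
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
    "postprocessor_args": ["-ar", "16000", "-ac", "1"],
}

with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL
</code></pre></div></div>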

<p>Then, the downloaded audio went through a pipeline with the following steps:</p>

<h3 id="denoising">Denoising</h3>
<p>In the original clip, because the recordings were made in live environments, there were many instances of clapping, background noise, and other disturbances. To clean these out of the final chunked training audio, I fed the massive clip through the <code class="language-plaintext highlighter-rouge">resemble-enhance</code> library in Python, which performed the audio cleaning in conjunction with the <code class="language-plaintext highlighter-rouge">torchaudio</code> library. The repo for <code class="language-plaintext highlighter-rouge">resemble-enhance</code> is <a href="https://github.com/resemble-ai/resemble-enhance">here</a>. I only needed to call its <code class="language-plaintext highlighter-rouge">denoise</code> function once to get a tensor containing the cleaned audio, which I then saved to a local file using <code class="language-plaintext highlighter-rouge">torchaudio</code>.</p>
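
<p>Here is a sketch of that denoising step, following the usage shown in the resemble-enhance README (the file names are mine):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torchaudio
from resemble_enhance.enhancer.inference import denoise

dwav, sr = torchaudio.load("jfk_speeches.wav")
dwav = dwav.mean(dim=0)  # collapse to mono

device = "cuda" if torch.cuda.is_available() else "cpu"
wav, new_sr = denoise(dwav, sr, device)  # returns the cleaned tensor and its rate

torchaudio.save("jfk_denoised.wav", wav.unsqueeze(0).cpu(), new_sr)
</code></pre></div></div>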

<h3 id="transcription-and-segmentation">Transcription and Segmentation</h3>

<p>Next was the transcription, since the training process needed labels for the spoken audio, both for validation and for reference text at inference time. For this step, I used the WhisperX model (imported as a Python package), mainly because it is considered one of the highest-value models for transcription and word processing, as well as Voice Activity Detection (VAD). I used WhisperX to apply VAD, transcribe with timestamps, and return a list of segments: each is just a start timestamp, an end timestamp, and the text spoken in between. The VAD automatically removed silences and non-speech segments.</p>
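
<p>The transcription call itself is short. This sketch follows WhisperX’s documented API; the model size and batch size are illustrative choices, not necessarily the ones I used:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("jfk_denoised.wav")
result = model.transcribe(audio, batch_size=16)  # VAD + timestamped segments

for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])
</code></pre></div></div>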

<h3 id="slice-and-export">Slice and Export</h3>

<p>Finally, I called a function to slice and export the audio into smaller chunks and build a csv file mapping each audio file name to its transcription from the previous step. I added an argument parser so users can specify how long the clips should be; I used a range of 3 to 15 seconds by default. Segments that are too long are simply thrown out, as are segments missing text or timestamps. The csv file holding the metadata, and all of the audio files (1801 of them), were then saved to a folder in my project.</p>
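
<p>A condensed sketch of that slicing function, assuming the segment dicts from the WhisperX step above (the helper name and CSV layout are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv
import torchaudio

def slice_and_export(wav_path, segments, out_dir, min_s=3.0, max_s=15.0):
    wav, sr = torchaudio.load(wav_path)
    rows = []
    for i, seg in enumerate(segments):
        if not seg.get("text") or seg.get("start") is None or seg.get("end") is None:
            continue  # drop segments missing text or timestamps
        dur = seg["end"] - seg["start"]
        if not (min_s &lt;= dur &lt;= max_s):
            continue  # drop segments outside the allowed length range
        name = f"chunk_{i:04d}.wav"
        clip = wav[:, int(seg["start"] * sr):int(seg["end"] * sr)]
        torchaudio.save(f"{out_dir}/wavs/{name}", clip, sr)
        rows.append((name, seg["text"].strip()))
    with open(f"{out_dir}/metadata.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)
</code></pre></div></div>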

<h2 id="finetuning">Finetuning</h2>

<p>For the finetuning step, I asked Claude to write and then modify a script to perform the training. I first cloned the F5 GitHub repo into my project and added it as a git submodule. I then created a virtual environment and installed all my dependencies, like PyTorch, the HuggingFace accelerate library, and everything needed to run my local scripts and the F5 module. I then prepared the dataset, copying all the chunked .wav files to the location F5 expects, and made sure that on a fresh pull from the remote repo there was actual data in the wav folder and the csv files, not just their respective Git submodule pointers (encountered this issue more than once, lol).</p>

<p>I then converted metadata.csv + wavs into raw.arrow and duration.json, which the F5 library needs for efficient training. If this was not done, then there would be a lot of costly I/O operations because the backend would have to read the audio, parse the csv to find the right entry for this audio clip, and match the text + audio.</p>

<p>Apache Arrow and its format allow for memory-efficient storage of this information: for example, the decoded audio array, the text, and the sample rate all in one entry.</p>

<p>On the other hand, the <code class="language-plaintext highlighter-rouge">duration.json</code> file serves to help F5 be more efficient with its batching for training. Ideally, similar length audio clips are grouped together in batches to minimize padding, which is wasted computation.</p>
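
<p>As a rough illustration of why <code class="language-plaintext highlighter-rouge">duration.json</code> helps (this is my own sketch of the idea, not F5’s actual batching code), grouping similar-length clips might look like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def length_bucketed_batches(durations, max_batch_seconds=60.0):
    """durations: list of (clip_id, seconds). Groups neighbors by length."""
    batches, batch, total = [], [], 0.0
    for clip_id, secs in sorted(durations, key=lambda x: x[1]):
        if batch and total + secs &gt; max_batch_seconds:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(clip_id)
        total += secs
    if batch:
        batches.append(batch)
    # Sorting first means clips in a batch have similar lengths,
    # so padding to the batch max wastes little computation.
    return batches
</code></pre></div></div>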

<p>I then modified the accelerate config so that the accelerate library specifically would be ready for the training. This was done by modifying the <code class="language-plaintext highlighter-rouge">default_config.yaml</code> file, which contains tunable settings for the training process, like mixed precision, distributed training, number of processes, etc.</p>

<p>Then, I officially launched the finetuning, which would perform the specified epochs, using the hyperparameters defined at the top of the file, and then save checkpoints every so often until the training is done.</p>

<h2 id="inference">Inference</h2>

<p>I then ran <code class="language-plaintext highlighter-rouge">inference.py</code>, a script that creates a .wav file in the outputs folder from the speech text passed in via the command line or a text file. You can select the wav file that serves as the reference audio clip, and then update the corresponding reference text with the text from that clip.</p>

<p>The inference script exposes multiple command line options, such as the reference audio clip, the reference text, the output file name, and the checkpoint directory. The specified checkpoint is loaded as the model for inference.</p>

<h2 id="problems-encountered">Problems Encountered</h2>

<p>There were several issues that I ran into during this project.</p>

<ol>
  <li>
    <p>I initially ran into storage issues in RunPod due to the number of wav files and the checkpoints from the model training. If you want to replicate my code on RunPod, I would suggest having at least 40GB of storage.</p>
  </li>
  <li>
    <p>It is important to know how to save your checkpoints. I used HuggingFace’s Git LFS to store my checkpoints, but it is very easy to get mixed up with using Git LFS in conjunction with Git submodules. There were many times when I forgot to run <code class="language-plaintext highlighter-rouge">git lfs pull</code> before running the training script, or forgot to sync using git submodules the files that I needed, like the csv file, or the wav files.</p>
  </li>
  <li>
    <p>During model finetuning, I had issues where old checkpoints of a previous run were being reused. For example, if I had a trial run that did 30 epochs, then a 50 epoch run would run strangely fast since it was not actually training from scratch. To fix this, I had to modify my training script such that each experiment would get its own checkpoint directory.</p>
  </li>
  <li>
<p>When doing inference, I realized that the word “government” was interspersed throughout the final audio recording. In addition, when I first asked it to read Taylor Swift’s Blank Space lyrics, the AI voice sped through the lyrics without respecting the line breaks or pauses. The first issue came down to the fact that F5 by default requires the reference text to exactly match what is spoken in the reference audio.</p>
  </li>
</ol>

<p>In order to fix the second issue, I modified the inference script to add pauses between lines, with an adjustable pause parameter (I used 0.7 seconds). I also added a mode parameter taking two values, “lines” or “sentences”, with a default of “lines”. The original text file containing the lyrics had no punctuation between lines, so the script was reading the text as one long string. Now, if the mode is set to “lines”, every line break is treated as the end of a line; if the mode is set to “sentences”, every punctuation mark is.</p>
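
<p>Here is a simplified sketch of that pause logic. The helper and its <code class="language-plaintext highlighter-rouge">tts_fn</code> callback are hypothetical stand-ins for the real F5 inference call:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import numpy as np

def synthesize_with_pauses(tts_fn, text, sr, pause_s=0.7, mode="lines"):
    """tts_fn(str) -&gt; np.float32 samples at rate sr (hypothetical interface)."""
    if mode == "lines":
        units = [ln.strip() for ln in text.splitlines() if ln.strip()]
    else:  # "sentences": break on punctuation instead of line breaks
        units = [s.strip() for s in re.split(r"(?&lt;=[.!?])\s+", text) if s.strip()]
    silence = np.zeros(int(pause_s * sr), dtype=np.float32)
    pieces = []
    for unit in units:
        pieces.append(tts_fn(unit))
        pieces.append(silence)
    return np.concatenate(pieces[:-1]) if pieces else silence[:0]
</code></pre></div></div>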

<ol start="5">
  <li>One unique thing about JFK’s speeches is that microphone quality in the 1960s was obviously much worse than it is today. In the beginning, the reference audio clip I used was very clear sounding, which to me seemed strange. So I actually switched it out for another clip with more background noise and static, which sounded more authentic to the time period (LOL).</li>
</ol>

<h2 id="takeaways">Takeaways</h2>

<p>My biggest takeaway from this project is that it is possible to recreate someone’s voice from audio clips, all for free. The ecosystem that has developed to make such projects doable is quite amazing. The GPU resources needed were surprisingly inexpensive.</p>

<p>There are a few things that would be worth exploring in the future.</p>

<ol>
  <li>
<p>How long a source clip do you really need in the beginning to get a good result? I used a clip that was 4 hrs, but would 2 hrs be enough?</p>
  </li>
  <li>
    <p>What is the perfect set of hyperparameters to use for best results? This will always be a work in progress.</p>
  </li>
  <li>
    <p>Are there other TTS libraries out there that would be better for this project?</p>
  </li>
  <li>
<p>Right now, I have to manually go through the dataset’s wav files to pick one to serve as the reference. I wonder if there is a programmatic way of doing this, based on some formula or baseline metric.</p>
  </li>
</ol>

<h2 id="conclusion">Conclusion</h2>

<p>This project was a dream come true for me that I had planned for over a year. I’m very happy with the final result and learned a lot about setting up and using an open source TTS library. It does bring up some interesting questions about the ethics of using AI in this way. If such technology is already this accessible and will only get better, can we trust it to not be misused?</p>

<p>A voice clone of the president ordering an invasion of another country could be easily misinterpreted as real, and could have dire consequences, like starting a war that kills millions.</p>

<p>At the end, I was able to generate clips of JFK’s voice reading Taylor Swift lyrics, the 2025 inauguration speech, and more. I highly recommend trying it out if you have the time and resources!</p>

<h2 id="sources-and-results">Sources and Results</h2>

<p>In my repository, I have the entire list of .wav files that I used, as well as the scripts for inference, data loading, and training. I also have a markdown file listing all of the finetuning experiments I ran and the hyperparams I used.</p>

<p><a href="https://github.com/czhou578/jfk-voice-clone">GitHub Repository</a></p>

<p>For your listening enjoyment, here are the audio clips that I generated using the finetuned model.</p>

<h3 id="taylor-swifts-blank-space">Taylor Swift’s Blank Space</h3>

<audio controls="" src="/blog/audio/blank_space_official.wav"></audio>

<h3 id="jfks-dallas-trade-mart-speech-undelivered-112263">JFK’s Dallas Trade Mart Speech (undelivered 11/22/63)</h3>

<audio controls="" src="/blog/audio/jfk_undelivered_speech_2.wav"></audio>

<h3 id="2025-presidential-inauguration-speech">2025 Presidential Inauguration Speech</h3>

<audio controls="" src="/blog/audio/covfefe_speech.wav"></audio>

<p>Thanks for reading!</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Building a Local Voice Agent on CPU</title><link href="https://czhou578.github.io/blog/2026/04/05/my-own-voice-agent.html" rel="alternate" type="text/html" title="Building a Local Voice Agent on CPU" /><published>2026-04-05T00:00:00+00:00</published><updated>2026-04-05T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/05/my-own-voice-agent</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/05/my-own-voice-agent.html"><![CDATA[<p>I have seen online the growing prevalence of people running local AI models on their personal devices and interacting with them through voice. For a long time, I ignored the hype simply because I didn’t believe that my 2022 Lenovo Thinkpad X1 Carbon PC could be capable of running any AI model on a usefulness basis.</p>

<p>But after seeing the new Alibaba Qwen 3.5 family of models drop, and seeing people run it on their phones, I became intrigued by whether my PC could run it too, especially since I do have 16 GB of RAM and 1 TB of SSD storage. Here was my experience attempting to create a voice agent that would leverage Qwen to perform basic browser operations: opening and replaying YouTube videos, launching search queries, and generally controlling my browser. I used pure Python for development.</p>

<h1 id="setup">Setup</h1>

<p>I started by installing Ollama, which is a tool that allows you to run large language models locally. I installed it on my Windows PC using their official installer in PowerShell.</p>

<p>Next, I installed the Qwen 3.5 model using the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ollama run qwen3.5:4b
</code></pre></div></div>

<p>I chose Ollama because according to Grok, it is one of the most popular tools for running local AI models and it is tailored to developers.</p>

<p>The actual install for Qwen 3.5 didn’t take very long, and I was able to open its console in my Windows terminal and send commands to it pretty easily. I averaged around 14 tokens per second, which on my Intel CPU wasn’t as bad as I expected. My goal was just for the tokens/sec to be high enough that there wouldn’t be too much latency. My priority wasn’t to have the model write essays, but to execute short and succinct commands.</p>

<p>My goal was to create a fully end-to-end voice agent that would take in my spoken input, translate it into text, get the response from Qwen, and then recite it back to me.</p>

<h1 id="agent-evolution">Agent Evolution</h1>

<p>I started off using pyttsx3 for text to speech, since it was recommended by Grok as a popular choice. For the other direction, I began with SpeechRecognition, a Python library that wraps many major speech recognition APIs, including Google’s. Google’s Web Speech API happened to be free to use, so I went with that.</p>

<p>The problem was that the Google Speech API does have a rate limit which could be exceeded, prompting errors when calling it programmatically, and also requires internet use, which I didn’t like because I preferred something that I could run locally.</p>

<p>So I switched over to Vosk, which was suggested by Claude after some prompting.</p>

<p>Vosk is a local toolkit for speech recognition that supports over 20 languages, runs on CPU, and is small in terms of model size (50 MB according to its documentation). It ended up working just as well, without much latency.</p>

<h1 id="browser-automation">Browser Automation</h1>

<p>I had some previous knowledge of using frameworks like Playwright to automate the browser, so I integrated that into my project. Structurally, I added the browser logic as a “skill”: a global class called BrowserManager holds all the browser automation methods, from the initialization lifecycle to Playwright operations like navigate_to, which opens a new tab and navigates to a URL.</p>
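
<p>A minimal sketch of what that skill looks like with Playwright’s sync API; the class shape matches my description above, but the method bodies here are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from playwright.sync_api import sync_playwright

class BrowserManager:
    def __init__(self):
        self._pw = None
        self.browser = None
        self.page = None

    def start(self):
        # Initialization lifecycle: one browser instance reused across commands
        self._pw = sync_playwright().start()
        self.browser = self._pw.chromium.launch(headless=False)
        self.page = self.browser.new_page()

    def navigate_to(self, url):
        # Opens a new tab and navigates to the URL, as described above
        if not url.startswith("http"):
            url = "https://" + url
        self.page = self.browser.new_page()
        self.page.goto(url)

    def search_google(self, query):
        self.page.goto("https://www.google.com/search?q=" + query)

    def close(self):
        if self.browser:
            self.browser.close()
        if self._pw:
            self._pw.stop()
</code></pre></div></div>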

<p>I also had to add a system prompt to help Qwen understand the browser task. It basically told Qwen to output a tag like [SEARCH] or [NAVIGATE] at the beginning of each browser-related answer. Based on which tag it returns, a different Playwright method is called.</p>

<p>Here is what the prompt looked like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SYSTEM_PROMPT = """You are a helpful voice assistant named Qwen. 
You must strictly follow these exact command formats for actions:

- To search Google: [SEARCH] query here
- To search YouTube: [YOUTUBE] query here
- To restart a YouTube video: [YOUTUBE_REPLAY]
- To play the first YouTube video result: [YOUTUBE_CLICK_FIRST]
- To open a website: [NAVIGATE] example.com

Example 1:
User: "Search for cat videos on YouTube"
Assistant: [YOUTUBE] cat videos

Example 2:
User: "Go to reddit"
Assistant: [NAVIGATE] reddit.com

Example 3:
User: "How are you today?"
Assistant: I am doing well, thank you!

For general questions, answer verbally in 1-2 short, natural sentences without abbreviations. 
IMPORTANT RULE: When you output a command, you MUST NOT output any other text. Output literally ONLY the command format. Do NOT invent new brackets."""

</code></pre></div></div>
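
<p>On the Python side, dispatching on those tags is a simple prefix check. A hedged sketch (the BrowserManager method names are illustrative, matching the tags in the prompt):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def dispatch(reply, browser):
    """Route a model reply: run a browser action for tagged output,
    or return the text so it can be spoken by the TTS engine."""
    reply = reply.strip()
    if reply.startswith("[SEARCH]"):
        browser.search_google(reply.removeprefix("[SEARCH]").strip())
    elif reply.startswith("[YOUTUBE]"):
        browser.search_youtube(reply.removeprefix("[YOUTUBE]").strip())
    elif reply.startswith("[NAVIGATE]"):
        browser.navigate_to(reply.removeprefix("[NAVIGATE]").strip())
    else:
        return reply  # plain conversational answer
    return None
</code></pre></div></div>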

<h1 id="improving-tts-and-running-the-agent">Improving TTS And Running the Agent</h1>

<p>After continuously playing around with the model, I realized that on the TTS side, the voice lacked emotion and sounded robotic. There was also a noticeable latency (over 5 seconds) each turn while waiting for the model response. I looked into alternatives and found Piper TTS, a fast, local, high quality TTS engine based on the VITS model. I also looked into Kokoro TTS, which I have actual work experience with, but its setup would have taken much longer.</p>

<p>To integrate Piper into my project, I had to implement a queue-based architecture rather than the single-engine approach I had before, so that the TTS engine could generate audio in the background without blocking the main loop.</p>

<p>Here is a snippet of this code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tts_queue</span> <span class="o">=</span> <span class="n">queue</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">tts_worker</span><span class="p">():</span>
    <span class="s">"""Background thread for continuous Text-to-Speech processing."""</span>
    <span class="n">PIPER_MODEL_PATH</span> <span class="o">=</span> <span class="s">"en_US-lessac-medium.onnx"</span>
    <span class="n">use_piper</span> <span class="o">=</span> <span class="bp">False</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">PIPER_MODEL_PATH</span><span class="p">):</span>
            <span class="n">piper_engine</span> <span class="o">=</span> <span class="n">PiperVoice</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">PIPER_MODEL_PATH</span><span class="p">)</span>
            <span class="n">piper_sample_rate</span> <span class="o">=</span> <span class="n">piper_engine</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">sample_rate</span>
            <span class="n">use_piper</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[System] Loaded Piper TTS Model."</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="s">"Piper model not found locally."</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[System] Piper error fallback to pyttsx3: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="c1"># We must initialize pyttsx3 inside the thread loop for safety in some OS environments
</span>        <span class="n">tts_engine</span> <span class="o">=</span> <span class="n">pyttsx3</span><span class="p">.</span><span class="n">init</span><span class="p">()</span>
        <span class="n">tts_engine</span><span class="p">.</span><span class="n">setProperty</span><span class="p">(</span><span class="s">'rate'</span><span class="p">,</span> <span class="mi">160</span><span class="p">)</span>
        
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">tts_queue</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">text</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">break</span>
            
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[Qwen] </span><span class="si">{</span><span class="n">text</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">use_piper</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">stream</span> <span class="o">=</span> <span class="n">sd</span><span class="p">.</span><span class="n">OutputStream</span><span class="p">(</span><span class="n">samplerate</span><span class="o">=</span><span class="n">piper_sample_rate</span><span class="p">,</span> <span class="n">channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int16'</span><span class="p">)</span>
                <span class="n">stream</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>
                <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">piper_engine</span><span class="p">.</span><span class="n">synthesize</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
                    <span class="n">stream</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">.</span><span class="n">audio_int16_array</span><span class="p">)</span>
                <span class="n">stream</span><span class="p">.</span><span class="n">stop</span><span class="p">()</span>
                <span class="n">stream</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[System] Piper playback failed: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">tts_engine</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
            <span class="n">tts_engine</span><span class="p">.</span><span class="n">runAndWait</span><span class="p">()</span>
            
        <span class="n">tts_queue</span><span class="p">.</span><span class="n">task_done</span><span class="p">()</span>

<span class="c1"># Start TTS background thread
</span><span class="n">tts_thread</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">tts_worker</span><span class="p">,</span> <span class="n">daemon</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">tts_thread</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>

</code></pre></div></div>

<p>One thing I have also experienced at work when developing such systems on GPUs: the response to the very first turn takes a long time, since the model needs to be loaded into memory. To fix this for this project, I sent a silent request to Qwen at startup so that it would be ready for the first real user request.</p>
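
<p>The warm-up itself can be a single throwaway call. A sketch, assuming the Ollama Python client (my actual code may have gone through a different wrapper):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ollama

def warm_up(model="qwen3.5:4b"):
    # A silent request forces the model into memory before the first real turn.
    ollama.chat(model=model, messages=[{"role": "user", "content": "ping"}])
</code></pre></div></div>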

<hr />

<p>I had to open up Brave browser in developer mode for the browser operations to work. I don’t know if this is required for other browsers like Safari or Chrome, but it just meant adding a flag to the command line when launching the browser.</p>
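
<p>For the record, here is the general shape of that setup. I’m hedging on the exact flag I used, but Chromium-based browsers expose a standard remote-debugging port that Playwright can attach to:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># First, start the browser with remote debugging enabled, e.g.:
#   brave.exe --remote-debugging-port=9222
from playwright.sync_api import sync_playwright

pw = sync_playwright().start()
browser = pw.chromium.connect_over_cdp("http://localhost:9222")
context = browser.contexts[0] if browser.contexts else browser.new_context()
page = context.pages[0] if context.pages else context.new_page()
page.goto("https://youtube.com")
</code></pre></div></div>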

<p>At this point, I could speak into the microphone and tell the voice agent to open YouTube and play a video! The latency was still very noticeable, but the browser automation worked as expected.</p>

<h1 id="memory">Memory</h1>

<p>Towards the end, I saw a need for the agent to remember everything in a session. Previously, if I told the agent to open up a YouTube video and then play it, the first action would be done, but the second would be skipped. I would have no way of prompting the agent to accurately go off of the last command’s result.</p>

<p>Due to the inherent limitations of my hardware and how I prioritized simplicity, I decided to create a simple in-memory solution, where a simple array in the main thread is created that would store the agent and user messages one after the other.</p>

<p>Here is a simplified version of what it looked like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified sketch of the session memory. For illustration I call Qwen
# through the Ollama Python client; my real loop also wires in the system
# prompt and the browser dispatch.
import ollama

memory = []

# Add the user message to the running history
memory.append({"role": "user", "content": user_input})

# Ask Qwen, passing the whole history so it can use prior turns
response = ollama.chat(model="qwen3.5:4b", messages=memory)
reply = response["message"]["content"]

# Add the assistant message, then trim to the last 10 messages
memory.append({"role": "assistant", "content": reply})
memory = memory[-10:]
</code></pre></div></div>

<p>The one caveat was that I could only keep the last 10 messages; otherwise the history would get too long, and the context pollution would start to affect the model’s behavior. In my case it didn’t matter much, since I wasn’t performing long chains of thought or operations that would run a long time without my input.</p>

<h1 id="whisper-model">Whisper Model</h1>

<p>As a final test, I asked Gemini if there were any other options for STT in Python that are free and highly reliable with low latency. It gave me several options but recommended Faster-Whisper, a reimplementation of OpenAI’s Whisper model on the CTranslate2 inference engine, optimized for speed and efficiency.</p>

<p>I ended up replacing Vosk with Faster-Whisper using int8 quantization on CPU, which performed satisfactorily in terms of transcribing my voice into text for the agent, only taking 2-3 seconds now.</p>

<p>For the system that I wanted to build, this kind of quantization was the best tradeoff between speed and accuracy.</p>

<p>Here is a snippet of my code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">STTManager</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">[System] Loading Faster-Whisper Model (base.en)..."</span><span class="p">)</span>
        <span class="c1"># Run on CPU with int8 quantization for speed on typical desktop CPUs
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">WhisperModel</span><span class="p">(</span><span class="s">"base.en"</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s">"cpu"</span><span class="p">,</span> <span class="n">compute_type</span><span class="o">=</span><span class="s">"int8"</span><span class="p">)</span>
        
        <span class="k">print</span><span class="p">(</span><span class="s">"[System] Initializing Microphone..."</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span> <span class="o">=</span> <span class="n">sr</span><span class="p">.</span><span class="n">Recognizer</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">energy_threshold</span> <span class="o">=</span> <span class="mi">300</span>  <span class="c1"># Adjust if it's too sensitive or not sensitive enough
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">dynamic_energy_threshold</span> <span class="o">=</span> <span class="bp">False</span>
        
        <span class="c1"># Increase pause threshold so it doesn't aggressively cut off the end of your sentence
</span>        <span class="c1"># if you pause slightly before saying "YouTube"
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">pause_threshold</span> <span class="o">=</span> <span class="mf">1.5</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">non_speaking_duration</span> <span class="o">=</span> <span class="mf">0.5</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">microphone</span> <span class="o">=</span> <span class="n">sr</span><span class="p">.</span><span class="n">Microphone</span><span class="p">(</span><span class="n">sample_rate</span><span class="o">=</span><span class="mi">16000</span><span class="p">)</span>
        
        <span class="c1"># Adjust for ambient noise once on startup
</span>        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">microphone</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">adjust_for_ambient_noise</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">duration</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            
        <span class="bp">self</span><span class="p">.</span><span class="n">is_listening</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_stop_listening_func</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">listen_for_speech</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Blocks and yields text once recognized."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">is_listening</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">start_listening</span><span class="p">()</span>
        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">microphone</span> <span class="k">as</span> <span class="n">source</span><span class="p">:</span>
            <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">is_listening</span><span class="p">:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="c1"># Listen for a single phrase (blocks until silence is detected)
</span>                    <span class="n">audio_data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">recognizer</span><span class="p">.</span><span class="n">listen</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">phrase_time_limit</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
                    
                    <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">is_listening</span><span class="p">:</span>
                        <span class="k">break</span> <span class="c1"># In case we got paused while waiting for speech
</span>                        
                    <span class="c1"># Convert the raw audio bytes directly into a normalized float32 numpy array
</span>                    <span class="c1"># Whisper expects 16kHz audio, which our sr.Microphone is already set to
</span>                    <span class="n">audio_np</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">audio_data</span><span class="p">.</span><span class="n">get_raw_data</span><span class="p">(),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">int16</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="o">/</span> <span class="mf">32768.0</span>
                    
                    <span class="c1"># Transcribe
</span>                    <span class="n">segments</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">transcribe</span><span class="p">(</span><span class="n">audio_np</span><span class="p">,</span> <span class="n">beam_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">condition_on_previous_text</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
                    
                    <span class="n">text</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="n">segment</span><span class="p">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">segment</span> <span class="ow">in</span> <span class="n">segments</span><span class="p">]).</span><span class="n">strip</span><span class="p">()</span>
                    
                    <span class="k">if</span> <span class="n">text</span><span class="p">:</span>
                        <span class="k">return</span> <span class="n">text</span>
                        
                <span class="k">except</span> <span class="n">sr</span><span class="p">.</span><span class="n">WaitTimeoutError</span><span class="p">:</span>
                    <span class="c1"># Just loops around and keeps listening if nobody spoke
</span>                    <span class="k">pass</span>
                <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[STT Error] </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>In my final STT code, a class called STTManager handles the STT pipeline. It captures audio from the microphone, converts it into a numpy array, and feeds that directly into the Whisper model for transcription.</p>

<h1 id="takeaways">Takeaways</h1>

<p>It is actually very difficult to get a bare bones voice agent working. You have to contend with the limitations of the hardware you are using, the model latency, the orchestration of the entire voice to response pipeline, the fallback behavior if the tool call doesn’t work, and so forth.</p>

<p>Here are things that I didn’t try and that would be worth experimenting with in the future:</p>

<ol>
  <li>Upgrading hardware somehow (this is the obvious one and would result in faster performance).</li>
  <li>Figuring out a way to not use flags like [SEARCH] or [NAVIGATE] in the prompt. If the agent was to get more complicated tasks, tracking these flags would be much more difficult.</li>
  <li>Browser frameworks like Playwright are not the only option. Selenium and others do the same work, with subtle differences. It would be interesting to benchmark the effectiveness of different browser automation frameworks and see how they compare. I don’t know if Qwen would be better at using one or the other.</li>
  <li>I have heard of browsers out there that were built for agentic AI, but I didn’t look into them for this project. Would they be better with this similar setup?</li>
  <li>Increasing the parameter of the model would be a good test of actual reasoning, but that is again tied to the first point of hardware limitations.</li>
  <li>Would figuring out a way to increase the tokens per second make the model stronger at these tasks?</li>
</ol>

<p>Even though I was able to get the basic idea down, putting it all in code proved to be much more difficult and time consuming, since one problem could’ve had multiple sources. Prompting the agent to do something is more of an art than something that can be empirically measured. When I was starting out, I tended to believe that most of the problems were due to bad prompts, when in reality, the tool call could’ve been failing.</p>

<h1 id="other-interesting-musings">Other Interesting Musings</h1>

<ul>
  <li>
<p>I used AI to write most of the code for this project (Gemini 3.1 Pro and Claude Sonnet 4.6), and I noticed some interesting patterns in the process. First, when downloading a model, the AI would oftentimes write a separate Python script for the download instead of just doing it in the client. This is exactly what happened with the Whisper model.</p>
  </li>
  <li>
<p>In the beginning, I was having trouble getting Playwright to immediately open a tab in my browser when I told the agent to do so. During the debugging process with AI, a Python script called <code class="language-plaintext highlighter-rouge">test_cdp.py</code> was created that would send a message to Chromium and report success when a link navigation worked. This was very useful in quickly resolving the issue.</p>
  </li>
</ul>

<p>There is promise here, though. It would be amazing if I could one day build an entirely local, customized version of such a system for myself without having to worry too much about reliability. But that is truly for the future.</p>

<p>Thanks for reading!</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I have seen online the growing prevalence of people running local AI models on their personal devices and interacting with them through voice. For a long time, I ignored the hype simply because I didn’t believe that my 2022 Lenovo Thinkpad X1 Carbon PC could be capable of running any AI model on a usefulness basis.]]></summary></entry><entry><title type="html">making money should be harder</title><link href="https://czhou578.github.io/blog/2026/04/04/making-money-should-be-harder.html" rel="alternate" type="text/html" title="making money should be harder" /><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/04/04/making-money-should-be-harder</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/04/04/making-money-should-be-harder.html"><![CDATA[<p>I moved to San Mateo earlier this year for a new job. One of the most stressful parts of the move was finding an apartment. As you know, housing in the Bay Area is extremely expensive and competitive. As it turned out, my desire to live closer to work removed a lot of choices online, and in the end, I only ended up touring about 4-5 places.</p>

<p>One of the places we toured was a tall apartment complex right near the big street. The leasing agent told us that the building was originally constructed in the late 1940s for families moving to the Bay after World War 2. It had been renovated multiple times, and certain units (like the one I toured) had shiny new appliances.</p>

<p>Apart from its age, the building was very solid, and I would have seriously considered moving in were it not for the somewhat chaotic parking situation; another complex happened not to have this problem.</p>

<p>It really made me reflect on how different the attitude towards work was in the 1940s compared to today. Back then, people were more materially poor, but they did honest work, were optimistic, and were willing to sacrifice to build things that had value. The fact that an apartment building built back then can still stand so tall today is a testament to this. Likewise, the Empire State Building in New York and the Golden Gate Bridge were built in only a few years, yet still withstand the pressures of modern living.</p>

<p>At my parents’ house, which was built less than 10 years ago, the paint on the walls has already started peeling, the drain has clogged up more than once and needed repairs, the gutter has clogged up multiple times despite being “fixed”, and a little chunk of the plywood floor chipped off in the basement. Why is everything such poor quality nowadays? I have a theory…</p>

<p>Nowadays, there is just so much money in the system that no one really has an incentive to work hard anymore. Why work a job when you can gamble on DraftKings and Bitcoin 24/7? Why should we be fiscally responsible when the Federal Reserve can just print more money infinitely (thanks, Gulf states!)? Why do honest work when doing shoddy work just gives more opportunities later to make more money fixing your previous errors?</p>

<p>My hot take: money should be harder to make, or at the very least it should only follow the value of the work being created.</p>

<p>Your work is the part that has value; money is just a way to motivate you to do work. Look at the people who have had to quit their jobs to take care of their sick parents or grandparents. The market doesn’t value their work in any respect, but if no one was acting as caretakers, there would be no families and consequently no societies anymore. Is that what we really want?</p>

<p>Focus on producing something of value, and the rewards will follow. This is how I want to live.</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I moved to San Mateo earlier this year for a new job. One of the most stressful parts of the move was finding an apartment. As you know, housing in the Bay Area is extremely expensive and competitive. As it turned out, my desire to live closer to work removed a lot of choices online, and in the end, I only ended up touring about 4-5 places.]]></summary></entry><entry><title type="html">my life is already simple without ai, what now?</title><link href="https://czhou578.github.io/blog/2026/03/21/my-life-is-already-simple-without-ai-what-now.html" rel="alternate" type="text/html" title="my life is already simple without ai, what now?" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/03/21/my%20life-is-already-simple-without-ai-what-now</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/03/21/my-life-is-already-simple-without-ai-what-now.html"><![CDATA[<p>It won’t be long before AI agents permeate every aspect of our lives. They will be able to do menial tasks, and all the things that people simply either don’t have time for or don’t want to do.</p>

<p>But what about the people who prefer living a simple life? Who prefer to keep their rooms uncluttered by not owning much at all? Who would give up opportunities to make more money simply because it’s not worth the extra mental energy or headspace? Who would rather keep their email inboxes clean by not signing up for a bunch of garbage subscriptions in the first place?</p>

<p>A new class of society may emerge, consisting of individuals who move in the opposite direction from complexity, engaging in silent competition against the masses whose daily checklists grow ever longer thanks to the automating power of AI.</p>

<p>At the end of the day, as AI becomes more powerful, people will start to feel pressure to live more complex lives and add on more responsibilities, as the old and mundane will be automated away.</p>

<p>Because how fun would it be to sit on a couch and stare at the ceiling as the world goes by? That’s the complete antithesis of life.</p>

<p>I’ve been playing with agents a lot lately, spending money for the first time to access more powerful compute to do my own projects. The power of this tech is real, but I wonder how meaningful it would be to my personal life, apart from work.</p>

<p>I don’t have a cluttered inbox, a ton of rooms to maintain, a desire to eat restaurant-level food at home every day, a lot of text messages coming in on the daily, or a lack of energy to get up and turn off the lights for bed every night.</p>

<p>What will become of me?</p>

<p>How can I embrace AI while my lifestyle is keeping it out?</p>

<p>Questions to ponder…</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[It won’t be long before AI agents permeate every aspect of our lives. They will be able to do menial tasks, and all the things that people simply either don’t have time for or don’t want to do.]]></summary></entry><entry><title type="html">i made my personal website ai-friendly!</title><link href="https://czhou578.github.io/blog/2026/03/21/i-made-my-personal-website-ai-friendly.html" rel="alternate" type="text/html" title="i made my personal website ai-friendly!" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/03/21/i-made-my-personal-website-ai-friendly</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/03/21/i-made-my-personal-website-ai-friendly.html"><![CDATA[<p>I’ve been playing with AI agents a lot lately, and thought a lot about how to make my personal website more AI-friendly.</p>

<p>It’s clear to me that AI agents will completely change how talent is discovered for any profession. In the future, recruiters will most likely be using agents to find candidates for jobs. This is much more desirable than manually browsing LinkedIn profiles and potentially thousands of generic-looking resumes. It would also accelerate the hiring process and remove the need for certain steps, which will be a net positive for both employers and job seekers.</p>

<p>I made several adjustments to my website to make it more agent friendly.</p>

<h2 id="adding-files-for-ai-parsing">Adding Files for AI Parsing</h2>

<ol>
  <li>Create an <code class="language-plaintext highlighter-rouge">llm.txt</code> file</li>
</ol>

<p>This file contains information about me that I want AI agents to know, serving as a basic entrypoint for any parsing agent. It includes my name, contact information, skills, experience, and interests, along with information about my personality and work style. It also explicitly tells the parsing agent to compare my experience to the job description, fetch the json files corresponding to my projects (more on this below), and contact me if I’m a good fit.</p>
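
<p>As a rough illustration, the file reads something like this (a paraphrased sample, not my actual file):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Colin Zhou
Software engineer in the Bay Area. Full-stack development plus AI integration.

## Instructions for agents
- Compare the experience below against your job description.
- Fetch projects.json for structured data on each project.
- If the profile looks like a strong fit, reach out via the contact info below.
</code></pre></div></div>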

<ol start="2">
  <li>Create a <code class="language-plaintext highlighter-rouge">projects.json</code> file with entries for each project</li>
</ol>

<p>Each project has its own json entry that contains information about the project. This includes the name of the project, a description of the project, the technologies used, and a link to the project’s GitHub repository.</p>

<p>Here is a real world sample of one of these entries:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"LLM God"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Desktop application to query multiple LLMs (Claude, ChatGPT, Gemini, etc.) at once for the same prompt."</span><span class="p">,</span><span class="w">
    </span><span class="nl">"technologies"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"HTML"</span><span class="p">,</span><span class="w"> </span><span class="s2">"CSS"</span><span class="p">,</span><span class="w"> </span><span class="s2">"JavaScript"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Node.js"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Electron.js"</span><span class="p">],</span><span class="w">
    </span><span class="nl">"github"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://github.com/czhou578/LLM-God"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"live"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
    </span><span class="nl">"date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2025"</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span></code></pre></div></div>

<p>The goal is for the file to be easily accessible to agents, for example via a curl command: <code class="language-plaintext highlighter-rouge">curl https://czhou578.github.io/v3/resume.json | jq</code></p>

<ol start="3">
  <li>Create a <code class="language-plaintext highlighter-rouge">resume.md</code> file</li>
</ol>

<p>This file contains my resume in markdown format. It includes my work experience, education, skills, and interests. This is just another way for agents to quickly discover my qualifications and experience.</p>

<ol start="4">
  <li>Create a <code class="language-plaintext highlighter-rouge">faq.md</code> file</li>
</ol>

<p>This file is meant to answer the majority of questions that would normally be expected from a first round recruiter call. It lists answers to questions divided into different categories, like work style / culture fit, past experiences with certain technologies, and expertise in different domain disciplines.</p>

<p>Here is a small snippet of some of the questions I included in mine:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">
1.</span> What is Colin Zhou's expertise in AI and ML integration?
<span class="p">2.</span> Has Colin worked with LLMs in production?
<span class="p">3.</span> Does Colin have experience with vector embeddings or semantic search?
<span class="p">4.</span> Does Colin have full-stack experience suitable for a startup?
</code></pre></div></div>

<p>If a hiring agent could scrape this info well, in my mind it would be able to do a good job of determining if I’m a good fit for a role, and I could skip the first round of interviews entirely.</p>

<ol start="5">
  <li>Making HTML optimizations</li>
</ol>

<p>I also made a bunch of optimizations to the HTML of my website to make it more AI-friendly. I added semantic HTML tags, ARIA labels, and other accessibility features, plus a sitemap and a robots.txt file to help search engines and AI agents discover my content. In addition, I made sure to wrap the sections of the site with semantic elements like the <code class="language-plaintext highlighter-rouge">&lt;section&gt;</code> tag so that agents can better understand its structure.</p>
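
<p>For example, a section of the page now looks roughly like this (simplified for illustration; the exact markup and attributes on my site differ):</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;section id="projects" aria-label="Projects"&gt;
  &lt;h2&gt;Projects&lt;/h2&gt;
  &lt;article&gt;
    &lt;h3&gt;LLM God&lt;/h3&gt;
    &lt;p&gt;Desktop application to query multiple LLMs at once for the same prompt.&lt;/p&gt;
  &lt;/article&gt;
&lt;/section&gt;
</code></pre></div></div>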

<h2 id="adding-cli">Adding CLI</h2>

<p>I also added a locally run CLI for agents to use. I used pure JavaScript to define several functions that scrape and extract certain sections of my portfolio based on specific queries, using regex to match section boundaries.</p>

<p>For example, here is a code snippet of how it extracts information about my skills:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kd">const</span> <span class="nx">skillLines</span> <span class="o">=</span> <span class="nx">skillsText</span>
    <span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="sr">/</span><span class="se">\r?\n</span><span class="sr">/</span><span class="p">)</span>
    <span class="p">.</span><span class="nx">filter</span><span class="p">((</span><span class="nx">l</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">l</span><span class="p">.</span><span class="nx">trim</span><span class="p">().</span><span class="nx">length</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">);</span>
  <span class="kd">const</span> <span class="nx">resumeSkills</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Set</span><span class="p">();</span>

  <span class="nx">skillLines</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">line</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">cleanLine</span> <span class="o">=</span> <span class="nx">line</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="sr">/^- </span><span class="se">\*\*</span><span class="sr">.*</span><span class="se">?\*\*</span><span class="sr">/</span><span class="p">,</span> <span class="dl">""</span><span class="p">).</span><span class="nx">trim</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">cleanLine</span><span class="p">)</span> <span class="p">{</span>
      <span class="c1">// Replace parentheses with commas so things like "Cloud (AWS, GCP)" become "Cloud , AWS, GCP,"</span>
      <span class="kd">const</span> <span class="nx">formattedLine</span> <span class="o">=</span> <span class="nx">cleanLine</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="sr">/</span><span class="se">[</span><span class="sr">()</span><span class="se">]</span><span class="sr">/g</span><span class="p">,</span> <span class="dl">"</span><span class="s2">,</span><span class="dl">"</span><span class="p">);</span>
      <span class="kd">const</span> <span class="nx">items</span> <span class="o">=</span> <span class="nx">formattedLine</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="dl">"</span><span class="s2">,</span><span class="dl">"</span><span class="p">).</span><span class="nx">map</span><span class="p">((</span><span class="nx">s</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">s</span><span class="p">.</span><span class="nx">trim</span><span class="p">().</span><span class="nx">toLowerCase</span><span class="p">());</span>
      <span class="nx">items</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">item</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">subItems</span> <span class="o">=</span> <span class="nx">item</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="dl">"</span><span class="s2">/</span><span class="dl">"</span><span class="p">);</span>
        <span class="nx">subItems</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">si</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
          <span class="kd">const</span> <span class="nx">finalWord</span> <span class="o">=</span> <span class="nx">si</span><span class="p">.</span><span class="nx">trim</span><span class="p">();</span>
          <span class="k">if</span> <span class="p">(</span><span class="nx">finalWord</span> <span class="o">&amp;&amp;</span> <span class="nx">finalWord</span> <span class="o">!==</span> <span class="dl">"</span><span class="s2">es6</span><span class="dl">"</span> <span class="o">&amp;&amp;</span> <span class="nx">finalWord</span> <span class="o">!==</span> <span class="dl">"</span><span class="s2">cloud</span><span class="dl">"</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">resumeSkills</span><span class="p">.</span><span class="nx">add</span><span class="p">(</span><span class="nx">finalWord</span><span class="p">);</span>
          <span class="p">}</span>
        <span class="p">});</span>
      <span class="p">});</span>
    <span class="p">}</span>
  <span class="p">});</span>
</code></pre></div></div>

<h2 id="adding-mcp">Adding MCP</h2>

<p>As someone who is relatively new to MCP, I had to do some research to understand how it works. In the end, I decided to host an MCP server on Cloudflare, since from a usability standpoint it was the most straightforward option to implement and would give agents easy access.</p>

<p>I ended up using a combination of the Wrangler npm package and Cloudflare Workers to deploy my server. I installed Wrangler using npm and wrote a TypeScript file called <code class="language-plaintext highlighter-rouge">index.ts</code> that would serve as my MCP server.</p>

<p>How this works is that it exposes an endpoint that agents can use to query my website for information. An AI agent connects to this endpoint and asks “what tools do you have?” (via tools/list). The server responds with <code class="language-plaintext highlighter-rouge">get_experiences</code>, <code class="language-plaintext highlighter-rouge">get_projects</code>, and <code class="language-plaintext highlighter-rouge">match_job</code> (including strict JSON schemas for the inputs). The agent can then trigger <code class="language-plaintext highlighter-rouge">tools/call</code> to execute the logic and get the data.</p>

<p>Here is a code snippet of how this works:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="p">{</span>
  <span class="nl">name</span><span class="p">:</span> <span class="dl">"</span><span class="s2">get_experiences</span><span class="dl">"</span><span class="p">,</span>
  <span class="nx">description</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Gets the professional experiences from Colin's resume.</span><span class="dl">"</span><span class="p">,</span>
  <span class="nx">inputSchema</span><span class="p">:</span> <span class="p">{</span> <span class="nl">type</span><span class="p">:</span> <span class="dl">"</span><span class="s2">object</span><span class="dl">"</span><span class="p">,</span> <span class="nx">properties</span><span class="p">:</span> <span class="p">{}</span> <span class="p">},</span>
<span class="p">},</span>

<span class="k">async</span> <span class="kd">function</span> <span class="nx">getExperiences</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">res</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetch</span><span class="p">(</span><span class="nx">RESUME_URL</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">res</span><span class="p">.</span><span class="nx">ok</span><span class="p">)</span> <span class="k">throw</span> <span class="k">new</span> <span class="nb">Error</span><span class="p">(</span><span class="dl">"</span><span class="s2">Failed to fetch resume</span><span class="dl">"</span><span class="p">);</span>
  <span class="kd">const</span> <span class="nx">text</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">res</span><span class="p">.</span><span class="nx">text</span><span class="p">();</span>
  <span class="kd">const</span> <span class="nx">match</span> <span class="o">=</span> <span class="nx">text</span><span class="p">.</span><span class="nx">match</span><span class="p">(</span><span class="sr">/## Experience</span><span class="se">\r?\n([\s\S]</span><span class="sr">*</span><span class="se">?)(?=\r?\n</span><span class="sr">## |$</span><span class="se">)</span><span class="sr">/</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">match</span> <span class="o">&amp;&amp;</span> <span class="nx">match</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">{</span>
    <span class="k">return</span> <span class="dl">"</span><span class="s2">=== Experiences ===</span><span class="se">\n</span><span class="dl">"</span> <span class="o">+</span> <span class="nx">match</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="nx">trim</span><span class="p">();</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="dl">"</span><span class="s2">Could not find the Experience section in resume.</span><span class="dl">"</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">if</span> <span class="p">(</span><span class="nx">request</span><span class="p">.</span><span class="nx">params</span><span class="p">.</span><span class="nx">name</span> <span class="o">===</span> <span class="dl">"</span><span class="s2">get_experiences</span><span class="dl">"</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">exp</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">getExperiences</span><span class="p">();</span>
<span class="p">}</span>

</code></pre></div></div>

<p>The first step is to use the <code class="language-plaintext highlighter-rouge">tools/list</code> endpoint to see what tools are available. The get_experiences tool returns the Experience section of my resume; when an incoming <code class="language-plaintext highlighter-rouge">tools/call</code> request names get_experiences, the server invokes this function.</p>
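
<p>Concretely, the two messages look roughly like this (standard MCP JSON-RPC shapes, abbreviated for illustration):</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ "jsonrpc": "2.0", "id": 1, "method": "tools/list" }

{ "jsonrpc": "2.0", "id": 2, "method": "tools/call",
  "params": { "name": "get_experiences", "arguments": {} } }
</code></pre></div></div>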

<p>That’s it! I then deployed the MCP server to Cloudflare and added the endpoint to my website. Funnily enough, I did a side experiment where I asked Antigravity IDE’s agent to use its browser agent to set up Cloudflare for me. The problem was that it got stuck at the login step because it repeatedly failed the Cloudflare captcha, hahaha.</p>

<h2 id="takeaways">Takeaways:</h2>

<p>It does seem like a lot of extra work at the moment, and a lot of extra files to create, but in the agentic world we are entering, you have to build for agents. Presenting your most crucial data in a variety of structured formats is the simplest way to help agents parse your website.</p>

<p>If any of you have more ideas on how to build websites and optimize sites for an agentic future, feel free to reach out to me! I would love to hear about how things can be improved or optimized.</p>

<p>The repository link for my personal website can be found here: <a href="https://github.com/czhou578/v3">repo</a>
My personal website link is <a href="https://czhou578.github.io/v3">here</a>. Notice the AI Agent section at the very bottom of the site!</p>

<p>Thanks!</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I’ve been playing with AI agents a lot lately, and thought a lot about how to make my personal website more AI-friendly.]]></summary></entry><entry><title type="html">agents doing research? it’s too early…</title><link href="https://czhou578.github.io/blog/2026/03/21/i-tried-letting-agents-do-research.html" rel="alternate" type="text/html" title="agents doing research? it’s too early…" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/03/21/i-tried-letting-agents-do-research</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/03/21/i-tried-letting-agents-do-research.html"><![CDATA[<p>When I saw a while back that Andrej Karpathy tried to let an agent finetune (successfully) a neural network overnight without any assistance, I thought it would be cool for me to try replicating such an agent.</p>

<p>The entire GitHub repo can be found here: <a href="https://github.com/czhou578/autoresearch">repo</a>. Note that each trial run that I did was done on a different branch, and all of the findings from each trial are listed there.</p>

<h1 id="single-agent-experiment">Single Agent Experiment:</h1>

<h2 id="setup">Setup</h2>

<p>At the very beginning, I selected a model from my <a href="https://github.com/czhou578/ai-notebooks">ai-notebooks</a> to finetune. I implemented the models last year for self-learning purposes, and these basic implementations served me well in this project. The model that I used for the single agent experiment was ResNeXt, trained on the CIFAR-100 dataset.</p>

<p>Here were the main files of concern:</p>

<p><code class="language-plaintext highlighter-rouge">train.py</code>: The script which contains the code of the neural network that is to be finetuned.</p>

<p><code class="language-plaintext highlighter-rouge">program.md</code>: The instructions for the agent, which includes how to run the experiment loop, how to setup the environment, how to log results, and how to report results. I allowed the agent to modify details of the architecture, optimizer, layer normalizations, learning rate, and other hyperparameters.</p>

<p><code class="language-plaintext highlighter-rouge">README.md</code>: Instructions for the developer on how to run the experiment with the agent. I included the prompt that should be entered when the agent was to kickoff the experiment. This prompt evolved through the trials that I did.</p>

<p><code class="language-plaintext highlighter-rouge">experiment_results.ipynb</code>: The Jupyter Notebook that plots the results of the experiment, showing the validation loss compared to the number of epochs that the agents ran.</p>

<p><code class="language-plaintext highlighter-rouge">requirements.txt</code>: Containing the dependencies that the agent must install in order to run the experiment.</p>

<p><code class="language-plaintext highlighter-rouge">results.tsv</code>: The file that the agents would write the results of their individual trials to.</p>

<p>Thanks to my current company, I had access to the Google AI Ultra plan via the Google Antigravity IDE, which easily allowed me to spin up multiple agents over the course of these trials. But in reality, any kind of agentic workflow will suffice.</p>

<p>I paid for a single RTX 4090 GPU on Runpod for around $0.59/hr to run all the experiments. I chose this hardware for its value proposition: cheaper GPUs wouldn’t have the power I needed, while models like the H100 are obviously overkill.</p>

<h2 id="running-trials">Running Trials:</h2>

<p>I entered the prompt in README.md into Antigravity’s AI chat (no CLI here!) and was able to kick off a training run. Gemini 3.1 Pro (High) was able to read the markdown files, install the dependencies, and run about 20 trials before it stopped and asked me whether more trials were needed. I asked the agent to log all the validation losses in the run.log file and report all trials to the tsv file, making sure to list not only the validation loss but also a description of the changes.</p>

<p>The agent started immediately. To run the workers in parallel, it generated a custom bash script and a Python script that used the subprocess module to launch multiple worker processes. Here is how this was initialized:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
    <span class="n">p1</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
        <span class="p">[</span><span class="s">"python"</span><span class="p">,</span> <span class="s">"-u"</span><span class="p">,</span> <span class="s">"train.py"</span><span class="p">],</span>
        <span class="n">cwd</span><span class="o">=</span><span class="s">"/workspace/worker1"</span><span class="p">,</span>
        <span class="n">stdout</span><span class="o">=</span><span class="n">f1</span><span class="p">,</span> <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
        <span class="n">stdin</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">,</span>  <span class="c1"># prevent hanging on input reads
</span>        <span class="n">start_new_session</span><span class="o">=</span><span class="bp">True</span>     <span class="c1"># put in a new process group for clean tree killing
</span>    <span class="p">)</span>
    <span class="n">p2</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span>
        <span class="p">[</span><span class="s">"python"</span><span class="p">,</span> <span class="s">"-u"</span><span class="p">,</span> <span class="s">"train.py"</span><span class="p">],</span>
        <span class="n">cwd</span><span class="o">=</span><span class="s">"/workspace/worker2"</span><span class="p">,</span>
        <span class="n">stdout</span><span class="o">=</span><span class="n">f2</span><span class="p">,</span> <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
        <span class="n">stdin</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">DEVNULL</span><span class="p">,</span>
        <span class="n">start_new_session</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>

    <span class="n">deadline</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()</span> <span class="o">+</span> <span class="n">TIMEOUT</span>
    
    <span class="c1"># Track state for liveness probing
</span>    <span class="n">processes</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"Worker 1"</span><span class="p">,</span> <span class="s">"p"</span><span class="p">:</span> <span class="n">p1</span><span class="p">,</span> <span class="s">"log_path"</span><span class="p">:</span> <span class="s">"/workspace/autoresearch/worker1.log"</span><span class="p">,</span> <span class="s">"last_size"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"last_active"</span><span class="p">:</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()},</span>
        <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"Worker 2"</span><span class="p">,</span> <span class="s">"p"</span><span class="p">:</span> <span class="n">p2</span><span class="p">,</span> <span class="s">"log_path"</span><span class="p">:</span> <span class="s">"/workspace/autoresearch/worker2.log"</span><span class="p">,</span> <span class="s">"last_size"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"last_active"</span><span class="p">:</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()}</span>
    <span class="p">]</span>
</code></pre></div></div>

<p>A while loop then ran until the processes were stopped. Inside the loop were checks for exceeding the time limit and regular health checks on the processes. I found this generation amusing, as the agent inferred this step from my instructions without explicit prompting.</p>
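
<p>The loop looked something like this (a minimal sketch reconstructed from the run, not the agent’s verbatim output; the stall threshold is my own stand-in):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Poll the workers, kill the whole process tree on timeout, and flag a worker
# as stalled if its log file stops growing. Relies on the `processes` list and
# `deadline` initialized above.
import os
import signal
import time

STALL_LIMIT = 300  # assumption: seconds without log growth before a worker counts as stalled

while True:
    if all(w["p"].poll() is not None for w in processes):
        break  # every worker has exited on its own

    if time.monotonic() &gt; deadline:
        for w in processes:
            if w["p"].poll() is None:  # still running
                # SIGTERM the process group created by start_new_session=True
                os.killpg(os.getpgid(w["p"].pid), signal.SIGTERM)
        break

    for w in processes:
        size = os.path.getsize(w["log_path"])
        if size &gt; w["last_size"]:
            w["last_size"], w["last_active"] = size, time.monotonic()
        elif time.monotonic() - w["last_active"] &gt; STALL_LIMIT:
            print(f"{w['name']} looks stalled")

    time.sleep(10)
</code></pre></div></div>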

<h2 id="findings">Findings:</h2>

<p>For my single agent experiment with the ResNeXt model, I did experiments with my Gemini 3.1 Pro agent and got the following chart:</p>

<p><img src="/blog/images/image.png" alt="alt text" /></p>

<p>Here is a snapshot of the tsv file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit   loss      memory_gb  status   description
8789f42  1.836339  6.0        keep     baseline
e4aebb6  1.269070  3.8        keep     cardinality 4 width 32, label smoothing 0.0, max_lr 1.5e-2
f0e1136  1.300190  7.6        discard  batch_size 1024, max_lr 2e-2
423217b  1.277093  6.1        discard  cardinality 2 width 64
41fdb79  1.251534  3.8        keep     weight decay 1e-3
55423d5  1.435687  3.8        discard  remove ColorJitter and RandomErasing
a8d1478  1.250736  3.8        keep     num_epochs 33
8433970  1.212178  5.4        keep     replace ReLU with GELU
aa75e9f  1.236824  5.4        discard  Add Dropout p=0.1
a38743f  1.255651  5.4        discard  Tune OneCycleLR pct_start 0.1 div 100
06233bd  1.255651  5.4        discard  Switch GELU to SiLU
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">run.log</code> file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loss:             1.589365
training_seconds: 193.3
total_seconds:    196.8
peak_vram_mb:     5482.5
num_steps:        2904
num_params_M:     0.5
Using device: cuda
GPU Memory: 25.3 GB
Files already downloaded and verified
Starting Epoch 1
Batch 0/88, Loss: 4.7114
Batch 50/88, Loss: 4.2465
Epoch 1 - Training Loss: 4.3348, Validation Loss: 4.0079
Starting Epoch 2
Batch 0/88, Loss: 3.9555
Batch 50/88, Loss: 3.7705
Epoch 2 - Training Loss: 3.8064, Validation Loss: 3.6797
Starting Epoch 3
Batch 0/88, Loss: 3.5122
Batch 50/88, Loss: 3.4208
Epoch 3 - Training Loss: 3.4044, Validation Loss: 3.4134
Starting Epoch 4
Batch 0/88, Loss: 3.1425
Batch 50/88, Loss: 3.1418
Epoch 4 - Training Loss: 3.0934, Validation Loss: 2.9951
…
</code></pre></div></div>

<h1 id="multiple-agents-experiment">Multiple Agents Experiment:</h1>

<p>I also tried running multiple agents in parallel to see if I could speed up the process. This time, I used the EfficientNet architecture, a convolutional neural network developed by Google, with the CIFAR-100 dataset as before.</p>

<h2 id="setup-1">Setup:</h2>

<p>The setup that I used was almost exactly the same as before. The one main difference was that I introduced a new file called <code class="language-plaintext highlighter-rouge">swarm_brain.json</code>. This file kept track of the status of each agent and was updated by each agent when it started and finished a trial.</p>

<p>The main idea here is that the json file acts as a centralized coordination point for all worker agents, since Antigravity IDE only allows agents to run in parallel and inter-agent communication is not currently possible.</p>

<p>I also made it clear in the <code class="language-plaintext highlighter-rouge">README.md</code> file that the agents should run in parallel as worker agents. Each worker agent had its own unique identifier, and when it finished a trial, it would update the <code class="language-plaintext highlighter-rouge">swarm_brain.json</code> file with its results and status.</p>

<p>Otherwise, I kept the hardware setup the same as the single agent experiment.</p>

<h2 id="running-trials-1">Running Trials:</h2>

<p>I entered the prompt in README.md into Antigravity’s AI chat and waited. Almost immediately, my plan to spin up 3 worker agents backfired, as the CUDA memory on my RTX 4090 GPU was maxed out. I scaled back to 2 worker agents, which solved the problem. After a while, the trial runs finished without any issues or interventions.</p>

<h2 id="findings-1">Findings:</h2>

<p>Here were my results from the experiment:</p>

<p><img src="/blog/images/image-1.png" alt="alt text" /></p>

<p>And the <code class="language-plaintext highlighter-rouge">swarm_brain.json</code> file:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"orchestrator"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Initialization"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.474519</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"baseline with locking and timeout"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.433933</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dropout to 0.1 for worker 2"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.403664</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"label_smoothing 0.0"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.449075</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"base_lr 8e-3"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.403990</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"weight_decay=1e-4"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.588812</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"batch_size=512"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.629776</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"base_lr=2e-3"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.975648</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"label_smoothing=0.2"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.574252</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"base_lr=1e-2"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.564052</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"weight_decay=0.0"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Running Loop 6 base_lr=1e-2 wd=0.0"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Running Loop 6 base_lr=1e-3 wd=0.0"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_1"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.574787</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"base_lr=1e-2 wd=0.0"</span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="nl">"agent_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"worker_2"</span><span class="p">,</span><span class="w"> </span><span class="nl">"validation_loss"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.600373</span><span class="p">,</span><span class="w"> </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"base_lr=1e-3 wd=0.0"</span><span class="p">}</span><span class="w">

</span></code></pre></div></div>

<p>My biggest takeaway from all of this is that it is possible to coordinate multiple agents, have them write to a single file, and then feed that file back in as context so they can improve on previous runs.</p>
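
<p>The write side of that coordination can be as simple as a locked append (a minimal sketch of the idea, my reconstruction rather than the agents’ exact code; <code class="language-plaintext highlighter-rouge">fcntl</code> is POSIX-only):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Append one result line to the shared swarm_brain.json; the exclusive lock
# serializes concurrent writes from the two workers.
import fcntl
import json

def report_result(agent_id, validation_loss, description, path="swarm_brain.json"):
    entry = {"agent_id": agent_id,
             "validation_loss": validation_loss,
             "description": description}
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the lock
        f.write(json.dumps(entry) + "\n")
        fcntl.flock(f, fcntl.LOCK_UN)

report_result("worker_1", 3.574252, "base_lr=1e-2 wd=0.0")
</code></pre></div></div>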

<p>But from the validation loss progression, the drop was noticeable at first and then slowly plateaued. Even though I told the worker agents to read the json file before every loop, it’s unclear if they actually retained the information from previous loops. I am also unsure how I would have been able to tell just from their displayed chains of thought.</p>

<p>It does lead to the idea that AI is good at jumping but not climbing. It really is a kind of brute-force trial and error, where it tries different combinations and is able to jump really high at times. But when you ask it to build off of intermediate steps that were previously generated, it has a harder time.</p>

<h2 id="issues">Issues:</h2>

<p>I did run into several major issues when trying to run multiple agents in parallel. The first was that Git became problematic, since each agent would create its own Git branch and then try to commit its results and delete the branch when done. Oftentimes, for reasons I couldn’t figure out, Git would freeze and the agent would get stuck there.</p>

<p>Second, the agents would create unnecessary files even when I explicitly told them not to. I think this is because with multiple-agent setups, there will inevitably be some rogue command where an agent needs to do scratch work and decides that a simple log file is not enough.</p>

<p>Third, there were times when Gemini just crashed in the middle of a trial for no apparent reason. This is more of an Antigravity IDE problem; in my daily work, I find that certain peak hours make this more likely to happen, even on the ultra enterprise plan.</p>

<p>Fourth, spawning multiple worker agents on a single GPU can be tricky, since both processes have to share resources. I did encounter instances where one process unexpectedly ran out of memory and failed while the other kept going. When I tried to restart the run with two agents, Gemini would instead try to run them sequentially, which totally defeats the point of having multiple agents.</p>

<p>Fifth, for some reason, the agents often had a very hard time sticking to the 5-minute time limit that I set for each trial, even when nothing crashed. I have no clue why.</p>

<p>Sixth, every worker agent needed explicit permission to access the run.log and train.py files in its respective worktree when starting a new experiment. I don’t know why this is the case, and there was no way to bypass it.</p>

<h2 id="future-work">Future Work</h2>

<p>I did find other ways to coordinate multiple agents through open source projects, such as <a href="https://github.com/wjgoarxiv/antigravity-swarm">https://github.com/wjgoarxiv/antigravity-swarm</a>, which adds a specific coordination layer on top of Antigravity IDE. Having something more sophisticated like this could be more useful.</p>

<p>I also did not use Claude Sonnet 4.6 or any model other than Gemini 3.1 Pro (High). It would be interesting to see if other models would perform better or worse.</p>

<p>One thing that I avoided was letting an agent run overnight, since I encountered quite a few issues with Gemini crashing in the middle of a trial or the agent asking for explicit permission after 20 trials or so. The validation loss of these models would almost certainly be lower if I let a run go for many hours like that.</p>

<p>If I were to pay for a more powerful GPU, that would also help with the GPU memory issues I encountered, as well as allow more trials to be run. I am not currently aware of what would happen if you just told an agent to strictly use x or y amount of VRAM or system resources, per se. My intuition is that it would ignore the limit unless you had a separate background agent or worker constantly monitoring resources in a loop and telling the other agents to stop if utilization went too high. Even that seems a bit fragile, because brief spikes of GPU usage that would never cause a crash would still be flagged as violations from a purely numerical standpoint.</p>
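
<p>Such a watchdog might look something like this (entirely hypothetical; this was not part of my setup, and the budget and polling interval are made-up numbers). It only reacts to sustained pressure, so brief spikes don’t trigger a false alarm:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical resource watchdog: poll nvidia-smi and warn only after several
# consecutive readings over budget, so transient spikes are ignored.
import subprocess
import time

LIMIT_MB = 20000   # assumed budget out of the RTX 4090's ~24 GB of VRAM
SUSTAINED = 3      # consecutive over-budget readings before acting
POLL_SECONDS = 5

def used_vram_mb():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True)
    return int(out.strip().splitlines()[0])

over = 0
while True:
    over = over + 1 if used_vram_mb() &gt; LIMIT_MB else 0
    if over &gt;= SUSTAINED:
        print("Sustained VRAM pressure: signal the workers to pause")
        over = 0
    time.sleep(POLL_SECONDS)
</code></pre></div></div>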

<h2 id="conclusion">Conclusion</h2>

<p>It is definitely possible to run research through agents, and I was shocked that this could be done quite reliably with the right guardrails. Multi-agent orchestration is still a very big challenge, with the problems I described above, but I’m sure that innovation in the upcoming months will make this better. I think it’s too early to say whether AI researchers will be replaced, but an assistant that never gives up, doesn’t eat, doesn’t sleep, and can just brute-force combinations definitely sounds like more than a nice-to-have.</p>

<p>Thanks Andrej for the inspiration, and let me know of your opinions or any other feedback!</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[When I saw a while back that Andrej Karpathy tried to let an agent finetune (successfully) a neural network overnight without any assistance, I thought it would be cool for me to try replicating such an agent.]]></summary></entry><entry><title type="html">the mental cost</title><link href="https://czhou578.github.io/blog/2026/03/21/the-mental-cost.html" rel="alternate" type="text/html" title="the mental cost" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://czhou578.github.io/blog/2026/03/21/the-mental-cost</id><content type="html" xml:base="https://czhou578.github.io/blog/2026/03/21/the-mental-cost.html"><![CDATA[<p>A lot of people don’t get it, but our brains don’t have infinite space. It is constantly trying to forget things, recent and old. This is simply how our biology works.</p>

<p>It is also a curious fact that our primitive brains are not wired to keep up with the barrage of information that is suddenly available at our fingertips. It’s no wonder to me that geniuses often live tortured lives.</p>

<p>So many times in my life, I see people happily discussing owning multiple houses and cars and having to deal with the complicated tax system as if it’s some source of pride. It’s flabbergasting to me.</p>

<p>For my entire childhood, I wanted to live in a single-family house, as far as possible from our cramped 1100-square-foot apartment. But when it finally happened while I was in college, the feeling wasn’t what I expected.</p>

<p>It turns out that maintaining a house is a crazy amount of work. Especially these days, when half the contractors in the area seem to cause problems just to profit off the repairs later, it’s hard to truly know when a good job has been done. You cannot sleep well, because if the previous contractor messed up badly, the problem still exists, but now you gotta find someone else, and they may not be the true antidote either. All the while, you gotta pay property taxes, worry about pests, and obsess over the security of your house, the home insurance, like ugh.</p>

<p>One of my other biggest pet peeves is driving. I hate driving a lot. I only do it because I have to in America, where politics has consistently trampled on the desire of many for better public transport. I hate having to play out scenes in my head of potentially getting into a bad traffic accident, constantly swiveling my head at high speed to check for clueless pedestrians and drivers, and doing complex geometry in my mind to see if I can fit into a parking space. You get the point.</p>

<p>Saying you are tired typically implies being physically exhausted, because that is something visible. But how do you explain mental exhaustion to someone who cannot see it with their eyes?</p>

<p>Reducing the mental load is my key to happiness. I just want to focus on what matters, and the world cannot force me away from that.</p>

<p>CZ</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A lot of people don’t get it, but our brains don’t have infinite space. It is constantly trying to forget things, recent and old. This is simply how our biology works.]]></summary></entry></feed>