# Simon Quick

I'm a lead software engineer building AI and products at [Pollen](https://pollen.ventures). Recent work includes [Clara](https://clara.arthritis.org.au/) (Webby-nominated AI health companion with 87% specialist approval and 90% recommending it to patients), [PollenAI](https://pollen.ventures/pollenai) (agentic search platform for regulated industries including government and healthcare, handling millions of queries annually), and [Rebuilt](https://rebuilt.eco) (industry-leading product carbon footprint calculation startup for construction).

Before this, I founded [Sound Shelter](https://xlr8r.com/news/sound-shelter-lets-you-shop-all-the-best-record-stores-from-one-place/) - a vinyl marketplace I took from zero to 100k users, handling code, product, design, and sales. Other work: founding engineer at [Youcheck](https://www.compasslist.com/insights/youcheck-by-precept-the-fact-checking-app-fighting-misinformation) (Google DNI-funded, fighting misinformation with NLP), and first technical hire at HomeAway APAC before their [$3.9B exit](https://www.expediagroup.com/media/media-details/2015/Expedia-To-Acquire-HomeAway-Inc/default.aspx).

I write about applied AI and maintain [Fetch Engines](https://www.npmjs.com/package/@purepageio/fetch-engines), an open source extraction toolkit. I also DJ and produce music - you can find my stuff on [SoundCloud](https://soundcloud.com/siquick) and see my recent [Bandcamp purchases below](/#recent-bandcamp-purchases).

Based in Sydney, Australia. Open to collaborations and advisory.

---

# Career

I am a product-focused lead software engineer with over 15 years of experience delivering products from concept to scale for startups, SMEs, and global leaders. Based in Sydney, I have dual Australian/UK citizenship and have led teams in Melbourne, London, and Barcelona. My work combines technical excellence, user empathy, and applied AI.

## Key Career Highlights

* Led the build of a Webby-nominated AI companion for arthritis sufferers, **Clara**, achieving **87% approval** from specialists and **90% recommending it to patients**.
* Delivered an agentic AI search adopted by the Australian federal government, healthcare providers, and national infrastructure, on track to answer nearly **2 million questions per year**.
* Founded **Sound Shelter**, a vinyl marketplace scaled to **100k users** and **30+ partner stores**.
* Part-time CTO of **Rebuilt**, Australia’s first self-service platform for generating verified Product Carbon Footprints (PCFs).
* Engineered and integrated platforms for **Apple, Vodafone, Expedia, Puma, and CSIRO**.
* Founding engineer at **Precept**, awarded **Google DNI funding** to combat online misinformation using NLP.

***

## Experience

### Lead Product Engineer  -  Pollen

*Jan 2022 – Present · Sydney, Australia*

I lead the engineering team, shaping technical direction and AI strategy. Promoted from Senior Engineer to Lead in March 2023.

#### Selected Achievements

* **Clara – AI Companion for Arthritis (Webby-nominated):** Architected and led the build of an iOS/Android/Web app supporting 3.7M Australians with arthritis. Designed a secure RAG pipeline to surface contextual answers, achieving **87% approval** from subject matter experts and **90% specialist recommendation**. Featured on 9News, Sydney Morning Herald, and The Age.
* **Agentic AI Search:** Architected and launched a production AI search product, now adopted by the Australian federal government, healthcare providers, and national infrastructure organisations. Handles millions of queries annually with a pipeline including **PII redaction, LLM-as-judge classification, dynamic query rewrite, hybrid semantic/vector search, and LLM summarisation**.
* **Rebuilt (Part-time CTO):** Leading technical direction for Australia’s first self-service platform enabling manufacturers to generate and publish verified PCFs. Designed and launched the platform to make trusted carbon data accessible at scale.

#### Additional Contributions

* Leadership team member driving technical & AI strategy.
* Mentored 2 full-time engineers + contractors.
* Designed infrastructure across AWS, GCP, Vercel, Expo using IaC (Pulumi).
* Built proofs-of-concept for discovery and won multi-million-dollar client projects with technical expertise.

**Stack:** TypeScript, Python, React/Next.js, Node.js, Django, Prisma, Postgres, TailwindCSS, React Native/Expo, AWS, GCP, Pulumi, RAG, LlamaIndex, Langfuse, OpenAI, Anthropic

***

### Founder / Engineer  -  Sound Shelter

*Apr 2013 – Jan 2024 · Sydney, Australia*

* Built and scaled a vinyl marketplace to **100k users** and **30+ partner stores**.
* Designed recommendation algorithms and built infrastructure to pull catalogues via APIs, feeds, and scraping.
* Created and launched a native iOS app.

**Stack:** React/Next.js, Node.js, Prisma, MySQL, Tailwind, React Native, AWS

***

### Senior Software Engineer  -  Endeavour

*Jan 2020 – Jan 2022 · Sydney, Australia*

* Migrated the events platform to React + Django, serving thousands of prospective students.
* Built a student onboarding platform used by hundreds per term.
* Developed a clinic booking front-end handling hundreds of instant payments weekly.

**Stack:** React, Django, Postgres, Tailwind, AWS

***

### Senior Software Engineer  -  Precept

*Aug 2018 – Aug 2019 · Barcelona, Spain*

Precept (YouCheck) received Google DNI funding to improve online information environments.

* Built backend APIs for ML-driven misinformation detection in text and images.
* Led a team of two on a React/Node platform connecting journalists with experts.
* Managed DevOps and code review.

**Stack:** React, Next.js, Node.js, Python, Django, Google Cloud

***

### Integration Engineer  -  Partnerize

*Dec 2016 – Apr 2018 · Sydney, Australia*

* APAC technical lead integrating global clients (Apple, Expedia, Vodafone, Nike, Emirates).
* Built custom integrations with third-party APIs for partner marketing infrastructure.
* Pre-sales/post-sales consultant on multi-million-dollar deals.

**Stack:** Python, MySQL

***

### Sales Engineer  -  HomeAway.com (Expedia Inc.)

*Jul 2012 – Aug 2016 · Melbourne & Sydney, Australia*

* First technical hire in APAC.
* Built feed parsing infrastructure powering ~20,000 property listings for two years.
* Led technical consulting for APAC pre- and post-sales.

**Stack:** Python

***

## Technical Skills

* **Languages:** TypeScript, JavaScript, Python, SQL
* **Front-end:** React, Next.js, React Native, Tailwind
* **Back-end:** Node.js, Hono, Express, Django, FastAPI, GraphQL, Prisma, Drizzle, Postgres
* **AI:** RAG, Semantic/Hybrid search, Vector databases, Prompt engineering, OpenAI, Anthropic, LlamaIndex, Langfuse, Vercel AI SDK, Agents / Workflows
* **Infrastructure:** AWS, Google Cloud, Vercel, Docker, CI/CD

***

## Education

**BSc (Hons) Internet Computing**
Northumbria University  -  Newcastle upon Tyne, UK

## Working Rights

Australian citizen (dual Australian/UK)

---

# Reflection for RAG and Agents: Evidence-gated answers in regulated systems

In regulated and enterprise contexts, a plausible but unsupported answer is worse than a slow answer because it creates audit failure and downstream decision risk. You cannot defend it, you cannot reproduce it, and you often learn it is wrong only after an end user or subject matter expert highlights it.

In these environments, “close enough” is not a harmless failure mode. The pattern is consistent: coherent prose that cites the wrong page, misses the primary document, or blends an outdated policy snapshot with a current one. The output reads well, but it does not survive a basic reviewer question: “Where did that come from, exactly?”

Evidence-gated RAG turns that failure mode into a bounded engineering problem by enforcing a simple contract: key claims must map to primary sources the system retrieved.

## The Pattern

The root cause is architectural: in a single-generation pass, the model must generate and implicitly validate at the same time, using whatever context it happened to retrieve. In baseline runs, retrieval pipelines often surface secondary summaries because they are shorter, denser, and rank more highly under embedding similarity. The model then preferentially uses them because they are easier to paraphrase.

Evidence-gated RAG is a control-layer pattern where a grounded critic (restricted to tool outputs rather than model priors) enforces claim-to-evidence alignment before an answer is emitted. If the critic can justify that the evidence contract (every required claim must map to a primary source ID and a retrievable span) is met, the system continues. If it can name a concrete evidence gap, the system runs one targeted follow-up retrieval and regenerates. You incur extra tokens and latency, accepting that downside in exchange for a measurable lift in factual reliability from the additional retrieval.

Self-critique prompting reviews the model’s own draft without adding new evidence, so it tends to improve structure more than correctness. ReAct interleaves reasoning and acting, which helps the model use tools, but does not by itself enforce claim-to-evidence alignment or a stop condition. Planner and executor systems introduce orchestration complexity, but you can layer them on later. Evidence-gated RAG is the smallest architectural addition that turns retrieval into an enforceable contract.

The pattern generalises and is useful anywhere documents drift over time and readers care about provenance: legal document summarisation, policy comparison across versions, enterprise search with version drift, and safety-critical copilots where the cost of a single unsupported claim is higher than one additional retrieval round.

## Control Loop

In production terms, reflection is a control loop with a stop condition. The generator runs first, followed by a critic that inspects what the agent observed from tools, not how fluent the prose is. The critic asks: "Do I have enough information to complete the task?". If the critic can justify finalising, the system returns the answer. If it finds a specific information gap, it asks for one targeted follow-up retrieval, then the system regenerate and finalise. The mechanics are simple and the discipline is in the constraints: the loop must be grounded, budgeted, and stoppable.

The primary weakness in this architecture is correlated model failure: the critic is also an LLM. There is a non-zero probability it hallucinates the existence of evidence to satisfy the generator, which is equivalent to grading your own work. The mitigation is two-fold: use a different model for the critic, and restrict draft visibility until context sufficiency is validated. First it evaluates the query against the retrieved context and returns a structured list of missing facts and missing primary sources (the authoritative PDF/page that defines the policy, not commentary). Only if the context is sufficient should it review the draft, and even then the draft review is constrained to evidence mapping, not prose quality.

The separation of concerns matters because it keeps the system reimplementable. If you blur roles, you get a clever demo and a brittle deployment.

| Component | Responsibility | Does NOT Do |
| --- | --- | --- |
| Generator | Draft an answer and propose tool calls using the retrieved context | Final approval |
| Retriever | Fetch relevant primary sources and return passages plus IDs | Judge correctness |
| Critic | Validate claim-to-evidence alignment and name missing facts or sources | Generate new claims |
| Loop controller | Enforce budgets and stop conditions, decide retry or halt | Rewrite answers |

The stop conditions should be explicit, not emergent behaviour. This is what I ship:

1. Max two passes, default one.
2. Halt if the critic cannot name a new primary source to fetch.
3. Halt if retrieval returns no new primary sources.
4. Halt when evidence coverage meets the configured threshold across required claims.

```mermaid
flowchart LR
  Q["User question"] --> LC["Loop controller (budgets + stop conditions)"]
  LC --> G["Generator (draft answer + tool calls)"]
  G --> R["Retriever and tools (search, fetch, execution)"]
  R --> C["Critic (claim to evidence audit)"]
  C -->|Contract met| F["Final answer"]
  C -->|Named evidence gap| LC
```

Here is the loop in the smallest implementable form. The key is that the critic is grounded in tool history and the controller enforces stop conditions, rather than “reflecting” indefinitely.

```ts
const MAX_PASSES = 2;

for (let pass = 0; pass < MAX_PASSES; pass++) {
  const draft = generator({ question, retrieved });

  const decision = critic({
    question,
    retrieved,
    draft: pass === 0 ? null : draft, // earn the right to see the draft
  });

  if (decision.contractMet) return draft;
  if (!decision.missingEvidenceQueries.length) break;

  const next = retriever(decision.missingEvidenceQueries);
  if (!next.addedPrimarySources) break;
  retrieved = merge(retrieved, next);
}

throw new Error("Evidence threshold not met");
```

What changes is not how answers sound, but whether the system finalises on secondary summaries when primary documents are available. The trade-off is one additional retrieval round and a predictable latency increase, which can be capped via a single reflection pass.

This is rational, not expensive. In regulated deployments, latency is a UX variable. Unsupported claims are a liability variable. Evidence-gated RAG trades a small, bounded increase in compute for a large reduction in audit exposure.

There are a few variants of this pattern and they are not interchangeable. A basic self-critique loop can improve structure and formatting, but it rarely moves factual accuracy because the critic is still constrained by missing evidence. Grounded reflection is where things get interesting: the critic sees the tool history and flags what is missing, then the system retrieves that missing piece. For harder tasks you can go further and score multiple action paths before committing to one, but that pushes you into much higher compute and complexity, so I treat it as a specialised tool rather than the default. If you want a quick pointer to the research direction, it is consistent with how the field evolved. ReAct established the value of interleaving reasoning and acting ([ReAct](https://arxiv.org/abs/2210.03629)), Reflexion showed strong gains from explicit feedback loops ([Reflexion](https://arxiv.org/abs/2303.11366)), and CRITIC focused on verifying and correcting outputs using external signals rather than self-confidence ([CRITIC](https://arxiv.org/abs/2305.11738)). None of that implies every reflection loop works. It does imply that critique improves outcomes when it is grounded in observations and constrained by an operational contract. That contract is the part most people skip. I want a machine-actionable decision that answers one question: do we have enough evidence to finalise?

```json
{
  "requires_more_context": true,
  "reason": "Coverage gap in the latest policy update.",
  "follow_up_instruction": "Fetch primary source and verify effective date.",
  "suggested_query": "official policy update effective date site:gov"
}
```

This keeps the loop deterministic and safe to automate. The generator cannot talk the critic into accepting a weak answer, and the critic cannot quietly drift into style feedback when what you need is evidence.

## Worked Example

Here is a synthetic trace that mirrors a real production run. The point is the mechanics: evidence gap detection, targeted follow-up retrieval, then a revision that is supported by sources.

Suppose the user asks: “What is the effective date of the new policy and what changed?”. The agent does an initial web search:

```json
{
  "tool": "web_search",
  "arguments": { "query": "new policy changes effective date" }
}
```

The tool returns two secondary summaries:

```text
<result https://example.com/blog/policy-summary id=1>...summary of changes...</result>
<result https://example.com/news/policy-announcement id=2>...mentions policy...</result>
```

At this point, the generator can write a plausible answer, but it cannot satisfy the evidence contract for the effective date claim. That is a correctness risk, not a writing problem. The critic should therefore reject it and request a targeted follow-up that forces primary evidence:

```json
{
  "requires_more_context": true,
  "reason": "No primary source for the effective date. Only secondary summaries are present.",
  "follow_up_instruction": "Find the primary policy document and extract the effective date and change list.",
  "suggested_query": "policy effective date filetype:pdf site:example.gov"
}
```

The agent runs the follow-up retrieval:

```json
{
  "tool": "web_search",
  "arguments": { "query": "policy effective date filetype:pdf site:example.gov" }
}
```

The follow-up retrieval now returns a primary document:

```text
<result https://example.gov/policy/2026-02-policy.pdf id=1>...effective date: ...; change list: ...</result>
```

On regeneration, the effective date and change list claims map directly to the primary source. The citations are now evidential rather than decorative.

## Implementation and Evaluation

This is also why I keep the implementation minimal. I separate roles into generator and critic. I pass the critic only what it needs (question and tool history). I only loop when the critic can name a concrete missing input. The code below is runnable in the sense that you can copy it and execute it, but it uses stubs for the LLM and tools. Swap the stubs for your provider client and your tool layer.

Evidence gating becomes real when you define a contract that the loop can enforce. I treat “claim” as an object that must map to a primary source, plus a traceable span that a reviewer can locate later. The exact fields vary with your document store, but the shape matters.

```json
{
  "claim": "Eligibility requires 12 months continuous service.",
  "evidence": {
    "doc_id": "policy_2025_01.pdf",
    "doc_version_ts": "2025-01-18",
    "page": 14,
    "quote": "Applicants must have completed 12 months of continuous service..."
  }
}
```

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ReflectionDecision:
    requires_more_context: bool
    reason: str
    follow_up_instruction: str
    suggested_query: str | None = None

def llm_generate(question: str, tool_context: str) -> str:
    return f"Draft answer for: {question}\n\nEvidence:\n{tool_context}"

def reflect(question: str, tool_history: list[dict[str, Any]]) -> ReflectionDecision:
    evidence = "\n".join(item["output_preview"] for item in tool_history if item.get("output_preview"))
    missing_primary = "example.gov" not in evidence
    if missing_primary:
        return ReflectionDecision(
            requires_more_context=True,
            reason="Only secondary sources present. Primary source missing for key claims.",
            follow_up_instruction="Retrieve the primary document and extract effective date and change list.",
            suggested_query="policy effective date filetype:pdf site:example.gov",
        )
    return ReflectionDecision(
        requires_more_context=False,
        reason="Primary source present for key claims.",
        follow_up_instruction="Finalise.",
        suggested_query=None,
    )

def web_search(query: str) -> str:
    if "site:example.gov" in query:
        return "<result https://example.gov/policy/2026-02-policy.pdf id=1>...effective date...; change list...</result>"
    return "\n".join(
        [
            "<result https://example.com/blog/policy-summary id=1>...summary...</result>",
            "<result https://example.com/news/policy-announcement id=2>...announcement...</result>",
        ]
    )

def run_agent(question: str, max_reflection_rounds: int = 1) -> str:
    tool_history: list[dict[str, Any]] = []
    rounds = 0

    while True:
        if not tool_history:
            tool_output = web_search("new policy changes effective date")
            tool_history.append(
                {
                    "name": "web_search",
                    "arguments": {"query": "new policy changes effective date"},
                    "output_preview": tool_output[:400],
                }
            )

        tool_context = "\n".join(item["output_preview"] for item in tool_history)
        draft_answer = llm_generate(question, tool_context)

        if rounds >= max_reflection_rounds:
            return draft_answer

        decision = reflect(question, tool_history)
        rounds += 1

        if not decision.requires_more_context:
            return draft_answer

        if not decision.suggested_query:
            return draft_answer

        follow_up_output = web_search(decision.suggested_query)
        tool_history.append(
            {
                "name": "web_search",
                "arguments": {"query": decision.suggested_query},
                "output_preview": follow_up_output[:400],
            }
        )
```

The guardrails are what stop this becoming an expensive confidence amplifier. I cap reflection rounds (usually one, occasionally two in higher-risk flows). I cap how much tool output can be re-injected into context, I set time budgets per tool call and per overall turn, I validate the critic output against a strict schema, and I stop the loop when there is no suggested query, no new evidence, or the budget is exhausted.

Reflection without evaluation is theatre - if it does not beat baseline on evidence coverage, remove it. I keep reflection variants only when evaluation shows they beat baseline. The quickest way is to build a small golden set that matches your real distribution: straightforward factual questions, multi-hop retrieval questions, prompts that trigger stale-information traps, and prompts where a primary source exists and must be cited. I store both expected-answer notes and required-evidence notes per prompt, because the latter catches the failure mode that matters most in RAG and agent workflows: key claims that are not backed by primary evidence.

When I compare variants, I measure unsupported claim rate, incorrect citation rate, primary-source coverage for required claims, latency delta versus baseline, and reflection loop hit rate. Evidence coverage is the fraction of required claims that are backed by primary source IDs from your tool results. Citation precision is whether the claim text is supported by the cited passage, not merely adjacent to the same topic. For latency, I care about median and P95, because reflection primarily impacts tail latency.

In [ablation](https://en.wikipedia.org/wiki/Ablation_\(artificial_intelligence\)) tests, grounded reflection consistently improves evidence coverage, while ungrounded self-critique mostly improves structure without moving factual accuracy. I run a tight ablation: baseline with no reflection, self-critique only, grounded reflection using tool history, then grounded reflection with a single follow-up retrieval. The most complex version only survives if the lift is large enough to justify the extra runtime cost.

This pattern has clear fit boundaries. Evidence-gated RAG is justified when the cost of an unsupported claim exceeds the cost of one additional retrieval round. It is usually a poor trade in ultra-low-latency UX, low-risk tasks where occasional inaccuracy is acceptable, and workflows where there is no external feedback signal to ground critique.

### Use Evidence-gated RAG when

* Claims must be traceable and reviewable.
* Primary sources exist and can be fetched.
* Multi-hop retrieval risk is real and costly.

### Avoid it when

* The UX must be ultra-low-latency.
* The domain is low-risk and occasional errors are acceptable.
* There is no external feedback signal to ground critique.

## Failure modes

There are a few common ways this fails. If the critic cannot inspect tool evidence, it produces plausible feedback that does not change correctness. If the rubric is too rigid, the agent learns to satisfy checklists instead of outcomes. If you rely on judge-only evaluation, a fluent wrong answer can score well when evidence mapping is weak. If you apply reflection everywhere, you inflate cost and latency without moving accuracy on simple queries.

The critic hallucination vector deserves to be treated as first-class risk, not a footnote. The mitigations are boring and effective: the critic is restricted to tool outputs, it is explicitly forbidden from using external knowledge, it runs under strict schema validation, and it can be duplicated with an independent second critic model in the highest-risk flows.

If you are running reflection at scale, cost control becomes an inference problem, and it is worth treating your model and quantisation choices as part of the system design. If running your own inference is an option then see my post on [Model Quantization, Fine-Tuning, and Picking the Right GGUF](/blog/model-quantization-fine-tuning-pick-right-gguf).

My default operating model is straightforward: one reflection round, structured critic output, explicit evidence gaps, one targeted follow-up retrieval when needed, and ablation-driven evaluation to prove that the extra calls are buying accuracy rather than theatre. Enterprise AI systems should be engineered as bounded control systems with explicit feedback loops, enforced evidence gates, and deterministic stop conditions.

---

# My Python starter kit

When I start a Python experiment, I want the same baseline I get in TypeScript:

* predictable environments
* fast feedback loops
* formatting and linting that I never think about
* type checking that catches the obvious mistakes early
* a repo shape that makes editors behave

Python can do all of that. It just doesn’t show up by default.

So I made a starter kit I can clone and get straight to the interesting part.

[Github repo](https://github.com/siquick/python-starter-kit)

## What this repo optimises for

This is for “I’m exploring something” work:

* quick spikes that might die
* notebooks that turn into scripts
* scripts that turn into little packages
* experiments where you want to keep standards without adding friction

If the project survives, it should already be structured enough to ship.

## The core choices

### `uv` for everything environment-related

`uv` gives me a TS-like mental model:

* create the environment
* add deps with `uv add`
* lock it
* run stuff consistently

It’s fast enough that you stop negotiating with yourself.

### Ruff as the single source of truth

One tool that:

* lints
* formats
* keeps the codebase from drifting into “every file has a different vibe”

The aim is not perfection. The aim is consistency without effort.

### Type checking that runs on demand

I want type checking as part of the normal workflow, not a thing that lives in CI and nobody trusts.

When you’re iterating quickly, “fail fast” is a feature.

### A minimal but real test setup

Even if you only write a few tests, having the harness ready matters.

Most experiments fail because they get messy, not because the idea was wrong.

## What’s deliberately not included

This is not trying to be clever:

* no huge framework scaffolding
* no forced architecture
* no “best practices” sermon
* no pretending every repo is a production service

## The mental model

This repo is meant to behave like a good TypeScript project:

* you get guardrails by default
* you don’t waste time choosing tools
* the editor understands the project immediately
* the code stays readable as it grows

Think of it as: “start messy, but don’t start sloppy”.

## How I use it

* clone with [degit](https://github.com/Rich-Harris/degit)
* setup a venv and sync `uv venv .venv && uv sync`
* `uv add` whatever I need
* write code immediately

That’s it.

---

# A guide to model quantization in fine-tuning (and how to pick the right GGUF)

## About this post

Fine-tuning with [Unsloth](https://unsloth.ai) and [Axolotl](https://axolotl.ai/) is, on the whole, a well thought-out experience where a lot of the complexity is handled for you. However one area that consistently trips people up is quantization, specifically which quant to pick when you export and save a newly trained model.

The aim of this post is to give you a simple mental model you can reuse when you have to make a quantization decision, along with a short overview of the common quant methods in Unsloth and how “Hugging Face models”, merged 16-bit checkpoints, and GGUF exports relate to each other.

While I'm using Unsloth in many of the examples, this logic applies to any LoRA fine-tune (Axolotl, TRL, etc).

## Only here for the quants?

The next few sections are a relatively low-level explanation of quantization. If you are just looking for details on the available quants and how to pick one then jump to the [Common GGUF quantization options](#common-gguf-quantization-options) section or [Putting it all together](#putting-it-all-together) section.

> \[!note] **Note on Server-Side Deployment (vLLM/TGI)**
> This guide focuses on **GGUF**, which is the standard for local deployment (Ollama, LM Studio, Laptops).
>
> If you are deploying to high-end NVIDIA GPUs via vLLM, you typically have two choices:
>
> * FP16: The merged checkpoint discussed below. Best quality, highest VRAM usage.
> * AWQ/GPTQ: Specialised formats for NVIDIA GPUs. If you need these, the "Mental Model" section below still applies (Training → Merged → Quant), but you will swap GGUF for AWQ in the final step.

## What is quantization?

Quantization is the act of compressing model weights to fewer bits to save memory and bandwidth, at some quality cost. In practice, you take weights that are stored in high-precision floats, for example 16 or 32-bit, and map them into a much smaller set of allowed values, for example 8, 6, or 4-bit integers. Think of it like rounding from microseconds to milliseconds. You keep the overall timing but lose the tiny details that don't often matter.

<picture of this compression />

### What Quantization changes under the hood

Quantization reduces the precision of the model weights so the runtime can load fewer bytes and perform cheaper operations. A GGUF file is the same set of tensors the model was trained with but each tensor has been compressed into lower-bit representations with additional scaling metadata so the runtime can reconstruct approximate values at inference time.

#### Weight quantization

Transformer layers contains large FP16 (16-bit floating point) matrices. Quantization converts each matrix into:

* Low-bit integer blocks (e.g. 4-bit, 6-bit, 8-bit values)
* Per-row or per-channel scale factors (tiny volume knobs that rescale each row or channel back to the right loudness after compression) that let the runtime map integers back to approximate floating-point ranges.
* Optional clustering codebooks (the "K" schemes) that store centroids and index codes, improving fidelity at lower bit-rates.

Lower bit-rates shrink the file and increase throughput, but introduce quantization noise - small distortions in the weight matrices. Bigger models absorb this noise more gracefully whereas small models lose capacity quickly.

#### What quantization does not change

* Weights are quantized.
* Activations (the temporary values a model produces while thinking, like notes written on scrap paper during a calculation) and KV caches remain FP16/FP32 unless the runtime adds seperate KV-cache quantization (a different optimisation entirely).
* The model's architecture, vocabulary, context window, and training data remain unchanged.

So quantization is therefore a deployment-time tradeoff. We accept approximation error in return for lower memory, lower latency, and cheaper inference.

During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization.

> \[!note] **A note on KV caches**
> During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization.

### Quantization in Unsloth

When exporting or saving models with Unsloth, you will encounter several seemingly unrelated options:

* `4-bit QLoRA`
* `save_pretrained_merged(..., "merged_16bit")`,
* GGUF quant methods such as `q8_0` or `q4_k_m`

These sit at different points in the model lifecycle and are easy to confuse. Unsloth intentionally keeps the underlying theory out of scope, which makes the surface area appear more abstract than it is.

The rest of this post untangles these formats, explains how they relate, and provides a simple mental model for choosing the right quant depending on model size, hardware, and workload.

## A simple mental model for model formats

You can think in three stages when you're training and fine-tuning a model:

**Training -> merged Hugging Face checkpoint -> deployment quant**

### Training

During the training phase, we are not choosing our deployment quants, but we *are* choosing how the base model is loaded into memory. In practice this means choosing the data type (dtype) for the weights, for example 4-bit, 8-bit or 16-bit.

In Unsloth this is configured with the `load_in_4bit` and `load_in_8bit` options:

```python
# Loading a model in Unsloth
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastModel.from_pretrained(
	model_name = "unsloth/gemma-3-12b-it",
	max_seq_length = 4096,
	load_in_4bit = True,
	load_in_8bit = False,
)
```

Both of these options control how the base weights are represented in GPU memory during training.

`load_in_4bit`

This loads the base model in 4-bit quantized format (QLoRA style). This cuts VRAM use significantly so you can fine tune larger models on smaller GPUs with less VRAM, while keeping the base weights frozen and only training LoRA adapters.

`load_in_8bit`

This loads the base model in 8-bit. This uses more VRAM than 4-bit, but has higher precision and can be a good choice if you have enough memory.

### Which one do I choose?

You normally pick one of these, or neither (both `False`) if you want the base model to be loaded in FP16 (Floating-point 16). The choice **only** affects training-time memory and compute, not the final exported and saved model.

Later when you call the following code, Unsloth writes out a standard merged FP16 Hugging Face model - think of this as the base model with the LoRA adaptors baked in.

```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```

Any GGUF exports you create afterwards are derived from that merged FP16 checkpoint, not from the 4-bit, or 8-bit training representation.

An example of this would be:

-> *Load in 8-bit representation of base model for training\
-> save merged FP16 checkpoint (base + LoRA)\
-> export to GGUF with 4-bit quant to get a smaller deployment representation of the trained model*

### Hugging Face model and `merged_16bit`

You can think of a Hugging Face model as a folder (or Hub repo) that contains at least:

* `config.json` – architecture and hyperparameters
* weights – usually one or more `*.safetensors` files
* tokenizer files – `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc

When you call `from_pretrained`, the following roughly happens:

1. Resolve the model name or Hub URL to a folder, read `config.json` to decide which `transformers` class to instantiate (for example `LlamaForCausalLM`) and with which sizes. Then create an empty model of that shape.
2. Load the weights from the `*.safetensors` (or `pytorch_model.bin`) files into that model.
3. Load the tokenizer from the tokenizer files.

After that, you have an in-memory model and tokenizer that you can train or run inference with.

When you call the following code, Unsloth does one extra step before writing that folder.

```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```

-> Takes the base model weights and applies the **LoRA adapters** from your fine-tune to those weights (merge the deltas into the base).\
-> Save the merged weights to disk in 16-bit floating point (FP16), alongside the config and tokenizer files, as a standard Hugging Face model directory.

So you can think of `merged_16bit` as:

> \[!note] **base model + LoRA adapters → one FP16 checkpoint in standard Hugging Face format.**

Treat this merged FP16 checkpoint as your "canonical high-fidelity checkpoint". All later actions you take on the model, such as GGUF exports, further fine-tunes, comparisons with quantized versions, should conceptually start from this FP16 checkpoint. If you ever regret a quant choice, you come back to this artefact and quantize again.

> \[!note] **Quick recap**
>
> * The training data type (dtype) is an implementation and configuration detail.
> * The merged FP16 checkpoint is the artefact that all future actions branch from.
> * The deployment quant is how you choose to save and export your model for serving.

## Common GGUF quantization options

Here's the common GGUF quant options and when it makes sense to use them.

| Quant    | Memory usage | Quality vs FP16       | Typical use                                   |
|----------|--------------|-----------------------|-----------------------------------------------|
| `q8_0`   | High         | Very close / “safe”   | High-quality reference, evals, when VRAM ok   |
| `q6_k`   | Medium       | Close                 | Balanced default for 7B–14B on decent GPUs    |
| `q5_k_m` | Low          | Slight drop           | When VRAM is tight but quality still matters  |
| `q4_k_m` | Very low     | Noticeable drop       | Helpers, CPUs, very tight VRAM, laptop demos  |

### `q8_0`

`q8_0` is a near-lossless, 8-bit quant very close to FP16. Use this as a reference export for evals and a safe choice when you are unsure and have enough VRAM (we will cover what *enough* is in a later section).

### `q6_k`

`q6_k` is a 6-bit quant method that compresses the model more than `q8_0`, while usually staying close in quality. The “k” just refers to a particular blockwise scheme used in GGUF and is beyond the scope of this post.

In practice, `q6_k` is a good balanced default when you are serving 7B–14B models on 24–48 GB GPUs. You save a decent amount of VRAM compared to `q8_0`, but for most chat and assistant workloads the behaviour is still very similar to your FP16 / `q8_0` reference.

### `q5_k_m`

`q5_k_m` is a 5-bit quant method that pushes compression a bit further than `q6_k`. It uses a similar blockwise scheme and typically trades a small amount of quality for lower VRAM and higher throughput.

In practice, `q5_k_m` is a good option when you are hosting 7B–14B models and are a bit tighter on VRAM or cost. For many chat and assistant-style workloads you will not notice much difference from `q6_k`, but you release more memory for longer context windows, larger batches, or extra models on the same GPU.

### `q4_k_m`

`q4_k_m` is a 4-bit quant method and one of the more aggressive options you will see. It gives you a big reduction in model size compared to `q8_0` and `q6_k`, but with a more noticeable quality drop, especially on harder reasoning and code generation tasks.

In practice, `q4_k_m` is best used when hardware is the main constraint: CPU-only deployments, small GPUs, or situations where you need to squeeze a larger model onto limited VRAM. It can work very well for helper models (routers, classifiers, retrieval helpers) and lightweight chat, but for small base models (for example 1B–3B) or safety-critical assistants it is usually worth starting with a less aggressive quant like `q6_k` or `q8_0`.

### Other options

There are many more GGUF quant names in the wild (`fast_quantized`, `q3_k_m`, `iq2_xxs`, etc). Most of them are either presets that pick one of the schemes above for you, or more extreme 2–3 bit formats for very constrained hardware.

If you are just getting started, you can safely ignore them and focus on the four we've covered.

## How to choose a GGUF quant

### Step 0: check model size

Model size matters when choosing a quant. Smaller models are generally less capable, so heavily compressing them increases the risk of a noticeable quality drop. They also need far less memory and compute, so you can usually afford a stronger, less aggressive quant as your default.

Here are some general guidelines:

#### ≤ 3B

Prefer `q8_0` or `q6_k` as your main export. Only use `q4_k_m` for very simple classifiers or routers, and only after checking a small eval set.

#### 7–8B

Default to `q6_k`, but experiment with `q5_k_m` if memory is tight and your evals still look good.

#### ≥ 12–13B

Choose between `q6_k` and `q5_k_m` based on VRAM and cost. `q4_k_m` is acceptable for non-critical workloads or helper models when you need to squeeze into limited hardware.

#### Does FP16 vs GGUF affect this?

Not really. The size rules above are about how aggressively you quantize a given base model, based on its parameter count. Your merged FP16 Hugging Face checkpoint is the “truth” model; a GGUF quant is just a compressed version of that checkpoint. The model-size decision is about bits-per-weight × number of parameters, regardless of whether those weights live in a HF folder or a GGUF file.

> \[!note] **What about MoE models?**
> For MoE (Mixture of Experts) models, be conservative as they are a trap for VRM planning. They are spare in compute but dense in VRAM.
>
> While they run fast (they only use a fraction of their parameters per token), you must still load every single expert into GPU memory for them to run.
>
> Ignore the active parameter count and size by them by their total parameter count.

For quantization it's important to remember that MoE's are already sparse (similar to compressed) so they can be more sensitive to aggressive quantization. Avoid `q4_k_m` and prefer `q5_k_m` or above to ensure the expert chosen for the token has a useful level of precision.

> \[!note] **Thinking vs instruct models**
> For "thinking" models or reasoning-focused models that generate longer internal chains of thought, treat them as complex, high-error-cost tasks even if they are not customer-facing. Long reasoning chains tend to amplify small numeric differences, so it is safer to stay with `q6_k` or `q8_0`. Short instruct-style answers, classification and routing are usually more tolerant of `q5_k_m` or `q4_k_m`.

### Step 1: check your hardware

Next, sanity-check what your VRAM can hold.

As a rough guide for **weights only**:

* 7B in FP16 ≈ 14 GB, in `q8_0` ≈ 7 GB, in `q4_k_m` ≈ 3.5 GB
* 12B in FP16 ≈ 24 GB, in `q8_0` ≈ 12 GB, in `q4_k_m` ≈ 6 GB

On top of the weights you also need VRAM for KV cache (context length × batch size × layers) and activations, so keep at least 30–50% headroom.

With that in mind:

#### 24–48 GB GPU

* For 7B–14B models, `q6_k` is a very comfortable default.
* Keep one `q8_0` export around as a high-fidelity reference.
* Use `q5_k_m` if you want longer context or multiple models on the same card.

#### 8–16 GB GPU or mixed CPU/GPU

* Use `q6_k` for the main chat model if it fits with some margin; otherwise drop to `q5_k_m`.
* Use `q4_k_m` / `q5_k_m` for helper models (routers, classifiers, retrieval helpers).

#### CPU-only / edge / laptop

* You will almost always want `q4_k_m` or more aggressive quants here, especially for anything above 3B.
* Reserve `q6_k` / `q8_0` for very small models or offline batch jobs where latency is not critical.

### Step 2: check the task

The best approach to identifying the right quant for your task is to work back from the following:

* How expensive are my mistakes?
* How hard is the task?

The general rules to apply here are:

#### High error cost or complex tasks

Use `q6_k` or `q8_0` even on bigger models.

Examples: Regulated industries, safety-sensitivity, customer-facing, long-form reasoning, code-generation, multi-step flows

#### Low error cost and simplicity

`q4_k_m` / `q5_k_m` are usually fine.

Examples: retrieval strategy classifiers, router models, intent classifiers.

## Putting it all together

When you are staring at a list of GGUF quants, you do not need to remember every detail from this post. You can usually get to a sensible choice by running this checklist:

### Check model size

* ≤ 3B: start from `q8_0` or `q6_k`.
* 7–8B: start from `q6_k`, consider `q5_k_m` if needed.
* ≥ 12–13B (or MoE): choose between `q6_k` and `q5_k_m`, keep `q4_k_m` for helpers and low-risk use.

### Check your hardware

* 24–48 GB GPU: `q6_k` as default, keep one `q8_0` export.
* 8–16 GB GPU: `q6_k` if it fits, otherwise `q5_k_m`; use `q4_k_m` for helpers.
* CPU / edge / laptop: usually `q4_k_m` or more aggressive, especially above 3B.

### Check the task

* High error cost or complex reasoning (regulated chat, code, agents, "thinking" models): favour `q6_k` or `q8_0`.
* Low error cost and simple tasks (routers, intent classifiers, retrieval helpers): `q4_k_m` / `q5_k_m` are usually fine.

In practice, it helps to export two variants the first time you deploy a new model, for example:

```python
# Example export loop 
# Ensure you define hf_token or login via huggingface-cli 

quant_methods = ["q8_0", "q6_k"] 
for quant in quant_methods: 
	print(f"Exporting {quant}...") 
	model.push_to_hub_gguf(
		f"{base_repo_id}-{quant}", 
		tokenizer, 
		quantization_method=quant, 
		token=hf_token,
	)
```

Run a small golden set of prompts through both and pick the cheapest quant that hits your quality bar. Keep the `q8_0` (or merged FP16) around as your high-fidelity reference, and use the more compressed quant for day-to-day serving.

Over time you will build your own defaults per project, but this simple three-step loop should be enough to stop "which quant do I pick?" from blocking you every time you export a model.

## Advanced: QAT and Dynamic Quants

You might see terms like [QAT](https://docs.unsloth.ai/basics/quantization-aware-training-qat) (Quantization Aware Training) or [Dynamic GGUFs](https://unsloth.ai/blog/dynamic-v2) in the Unsloth docs. Here is how they fit into this mental model:

### Dynamic GGUFs

When you run `model.save_pretrained_gguf`, Unsloth now defaults to "Dynamic" quantization. This doesn't change your workflow. It just means the "Quant" step is smarter - it keeps sensitive layers in higher precision (e.g., 6-bit) and drops robust layers to lower precision (e.g., 4-bit), giving you better quality at the same file size.

### QAT

This is a specific training technique where you simulate quantization errors *during* training so the model learns to adapt to them. It effectively merges the "Training" and "Quant" steps. This is powerful (recovering up to 70% of lost accuracy) but requires a different training setup using `torchao`. For now, treat this as a specialised tool for when standard quants aren't good enough.

---

# Fine-tuning an open source LLM to speak Doric

A practical case study on fine-tuning a small open-weight LLM with LoRA to speak Doric, from dataset creation and Unsloth training through evaluation to deployment with vLLM/Modal.

## What I did and why I did it

I grew up in [Aberdeenshire](https://maps.app.goo.gl/RDzAn4xTxscKPt2T6), in the North-East of Scotland, where a dialect called Doric is spoken. It's a dialect largely confined to the region, and is a blend of Old Norse and Middle English. Living in Sydney I don't get to hear it much, so I decided to fine-tune a small open-weight language model to only speak Doric, as a practical case study in adapting LLMs to niche domains. I built a dataset from both human-written content and distillation from a large model. As with all fine-tuning projects, compiling a dataset is the hardest part and needed iteration.

I then trained the model using [Unsloth](https://unsloth.ai) on modest hardware and after many iterations informed by auto and manual evaluation, I deployed it as a serverless endpoint on [Modal](https://modal.com) where I interact with it through a small CLI or web UI.

## What's Doric?

Here's some example sentences (taken from the model itself):

English: *How's your day going?*

Doric: *Aye, it’s been braw enough. Woke up tae the sun strikin the gable, had a bowl o porridge wi a skelp o honey, then set oot tae check the sheep. The quines were daein their usual, scarin the hens and chasin after the tractor. Nae bother, just the normal bletherin and bleatin. Now I’m sittin wi a cuppa, watchin the mist rise aff the fields - peaceful as a dyke at gloamin.*

English: *What do you like to do in your free time?*

Doric: *Free time? That’s a rare treat roon here. But fan the chores are deen, I like tae wander the burns wi my rod, see if the trout are bitin. Or I’ll fire up the forge and shoe a few horses - the clang o the hammer’s a fine tune. If the weather’s foul, I’ll sit by the peat-fire and read an auld story or twa, maybe scribblin some verses mysel. Aye, simple pleasures, but they keep the wheels turnin.*

A bonus outcome of this project is that I've created the world's only [open-source dataset of Doric conversations](https://huggingface.co/datasets/franco334578/doric-conversations). Language preservation matters as speech normalises across the world (which is likely to be accelerated by the rise of AI).

## Why fine-tune instead of just prompting?

For many use cases you can get surprisingly far by prompting a large general model and giving it a few examples. That works well when the task is flexible, the domain is broad, and you are happy to pay per token for each request.

Fine-tuning starts to make more sense when you care about narrower, repeatable behaviour: specific style or tone, a constrained domain, output formats you want to rely on, or latency and cost at scale. Instead of teaching the model what you want on every prompt, you bake that behaviour into the weights once, then run a smaller or cheaper model that does the right thing by default.

In other words, prompting is great for exploration and one-off tools. Fine-tuning is better when you want a specialised model you can reuse, ship, and run efficiently.

## Preparing the dataset

### Conversation topics

I used a strong frontier model to generate a wide range of 500 topics that would be used to generate conversations. This gives the model a good range of exposure to Doric. I covered everyday daily life, education, work, technology, science, society and culture, artist and creativity, business and economics, DIY and trades, emotional and personal, jokes, storytelling, debates, and roleplaying. Here's a few examples:

* Buying a loaf from the local bakery (everyday life)
* Checking symptoms before seeing the GP (health)
* Best fish supper in the village (food)
* How to ask the boss for a day off (work)
* Best spots in town for a cup of tea (social)
* Helping gran set up her new phone (technology)
* Finding a cheap train ticket to Aberdeen (travel)

### Conversations

The conversations were then iteratively distilled from the topics using GPT-5.1. I used a simple template to generate a conversation between a human and the model and outputted the conversations into the ChatML JSONL format. ChatML is largely the standard format for LLM conversations and is supported by many tools and platforms. I wanted to ensure that the input language didn't matter so I generate conversations in multiple languages rather than just English. The output language was always Doric.

Extract from the dataset

```json
{"messages": [{"role": "user", "content": "Could you give a friendly overview about where to get your hair cut local?"}, {"role": "assistant", "content": "Try the barber in the toon square, quine at the co-op does trims, or yon loon wi shears at the market. Cheap, quick, nae fuss."}], "meta": {"topic": "Where to get your hair cut local", "group": "local_life"}}

{"messages": [{"role": "user", "content": "संक्षेप में कृपया Applying for your first summer job समझाइए।"}, {"role": "assistant", "content": "Aye, so ye’re efter yer first simmer job? First, wite a CV, een page: name, school, a guid line aboot yersel, ony prizes ye’ve won. Dinna fash gin it’s short, just keep it clean. Fan ye see a shop sign sayin “pairt-time wanted”, gan in wi a smile, speir tae spik tae the manager, haund ower yer CV. Follow up wi an email twa-three days later. If ye get an interview, dress tidy - jeans nae ripped, shirt ironed. They’ll ask “fit maks ye a guid team player?” Tell them ye’re on yer school hockey squad, aye turn up on time, ken how tae listen. Ask them a wee question back, like “fit shifts are busiest?” That shows ye’re keen. Mind, it’s jist a foot in the door; if it disna work oot, try the next shop. Keep at it, loon or quine, ye’ll land something."}], "meta": {"topic": "Applying for your first summer job", "group": "work_and_study"}}

```

This is where the first issue arose. The distilling LLM was not able to generate conversations in Doric, it output a mixture of Doric, Scots, and English. So I needed to lint the conversations to remove any that were not in Doric. This was a mixture of find-replace and feeding the model the contents of a [Doric dictionary](https://doricphrases.com/phrases.php) and enforcing a number of Doric markers. This approach produced around 2,500 Doric conversations. That is not a large dataset in absolute terms, but for fine-tuning a specialised model the limiting factor is domain coverage and variation, not volume.

## Training and fine-tuning the model

To keep things simple I used Unsloth on [Google Colab Pro](https://colab.research.google.com) to fine-tune the model with LoRA adapters. Colab works well here because it's a fast fine-tune with a dataset of this size and it's just a matter of setting up a Jupyter notebook and running through the steps. For longer fine tunes you may want to use a service like [Runpod](https://runpod.io) or [Lambda Labs](https://lambdalabs.com). I used the industry standard [Weights and Biases](https://wandb.ai) platform to track the training and evaluation.

> \[!note] **Quick overview of LoRA**
> Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that lets you adapt a large model without updating all of its weights. Instead of modifying the full weight matrices inside the transformer, LoRA injects small trainable low-rank layers alongside the original weights. During training, only these lightweight LoRA layers are updated and the base model remains frozen.
>
> In practice, this means far fewer trainable parameters, so you can fine-tune on smaller hardware such as a single GPU. The LoRA weights can be merged into the base model or kept as a separate adapter. In short, LoRA gives most of the benefit of full fine-tuning at a fraction of the cost and complexity.
>
> There is also a related approach called QLoRA, which combines LoRA with 4-bit quantization of the base model to further reduce memory footprint. Unsloth provides many models that can be fine-tuned with QLoRA; look out for model names ending in `-4bit`.

### Choosing a model

I started out Unsloth's tune of Google's Gemma 3 4B model as it was the smallest model that was available in Unsloth and I wanted to keep the fine-tune as small and cheap as possible. However after training, I found the model wasn't great for conversational purposes once you started to get into the more complex topics. The goal was still to train a small model so I moved up to the 12B model, the smallest model that felt reliably conversational across the harder topics while still being trainable on limited compute with 4-bit.

#### Base models vs Instruct models vs thinking models

The general guidelines to follow with picking a model are:

* Only choose a base model for research or running a full pre-training or custom instruction tuning pipeline yourself. Base models aren't generally tuned to give structured responses, they can represent knowledge but don't know how to communicate.
* If you care about latency and you don't consider your use case to require strong intelligence then choose an instruct model.
* If your use case needs more complex multi-step reasoning or higher robustness (for example, long chains of tool calls or careful analysis in high-stakes domains), consider a reasoning model, accepting the extra latency and cost.

For this project I started from an instruct chat model, because I care more about conversational behaviour and style than about heavy-weight long-form reasoning.

#### Unsloth and quantization

At its essence, Unsloth builds on [TRL](https://github.com/huggingface/trl) and PyTorch to provide an optimised fine-tuning pipeline. Unsloth brings some highly valuable optimisations to the table through their [Dynamic 4bit quantization](https://unsloth.ai/blog/dynamic-4bit), we can recover much of the accuracy you lose with naive 4-bit quantization, while staying within a similar memory budget.

This allowed me to use the more capable 12B model without experiencing out of memory errors or slow training times.

```python
# Loading a model in Unsloth
from unsloth import FastModel
import torch
model, tokenizer = FastModel.from_pretrained(
 model_name = "unsloth/gemma-3-12b-it",
 max_seq_length = 2048,
 load_in_4bit = True, 
 load_in_8bit = False,
 full_finetuning = False,
)
```

#### 4-bit quantization

Most language models store their weights as 16-bit or 32-bit floats, which is a key reason why they use so much memory. By applying 4-bit quantization we shrink the weights down to much smaller 4-bit values, cutting the memory footprint of the model by roughly 4x and making training and inference faster and more efficient.

The trade-off is that we lose some of the model's precision, its overall capability, compared to the parent model. For this project, and generally when fine-tuning for specific domains, we can afford to lose some accuracy in order to gain a significant reduction in memory footprint. Our model needs to be great at speaking Doric, not explaining the theory of relativity at a PhD level.

> \[!note] **Note on 4-bit models and exporting to vLLM / Transformers**
> Unsloth gives you the option to load your model in 4-bit during fine-tuning, which dramatically reduces memory requirements.
>
> It’s important to understand that there are two types of 4-bit models:
>
> Dynamic 4-bit quantization of a full-precision model
>
> * This is the mode used in this Doric project.
>
> * The underlying model is still FP16/BF16, and Unsloth can safely merge LoRA adapters back into a correct full-precision checkpoint.
>
> * This export works perfectly with vLLM, Transformers, and GGUF.
>   Models that are stored as 4-bit, such as `bnb-4bit` variants
>
> * These do not retain FP16 weights internally, and merging LoRA adapters back into FP16 can produce corrupted weights.
>
> * They are suitable for inference or PEFT-only workflows, but not for producing full-precision merged checkpoints.
>
> If your goal is to export a final FP16 model for serving in vLLM or a production inference stack, always start from a full-precision base model, even if you train in 4-bit mode.

#### When to use Unsloth vs Axolotl vs PyTorch

Unsloth and [Axolotl](https://docs.axolotl.ai/) are frameworks that build on top of [PyTorch](https://pytorch.org/) and provide a higher level of abstraction for fine-tuning models.

Unsloth is a good choice when you want to fine-tune small or medium models quickly on a single GPU and care about 4-bit quantization. The framework is at a good level of abstraction and provides a range of helpful utilities for fine-tuning models like chat templates, easy integration with observability platforms like Weights and Biases, and straightforward loading of datasets both locally and from Hugging Face. You'll need some Python knowledge to get the most out of Unsloth, but they do provide a lot of Colab notebooks which cover most common use cases.

Axolotl works at a similar level of abstraction but is better suited when you want multi-GPU support or prefer configuration of the pipeline through a YAML file without writing training scripts. It has many baked-in examples and supports a wide range of models and training methods.

Using PyTorch directly makes sense when you want full control over the training process but requires a strong prior knowledge of both PyTorch and model training.

For this project Unsloth was the best fit because of my Python experience and its speed and memory optimisations.

#### Chat templates

Chat templates structure interactions between language models and users and are crucial for maintaining conversation structure, role identification, retaining context over multiple turns, and features like tool use. Common templates include ChatML, Alpaca, and ShareGPT. For this project I used ChatML. Here's what that looks like:

At a JSONL level my dataset looks like this:

```text
"messages": [{ "role": "user", "content": "Hello, how are you?" },
{ "role": "assistant", "content": "Nae bad loon" }],
"messages": [{ "role": "user", "content": "What's the weather?" },
{ "role": "assistant", "content": "Affa dreich oot" }]

```

When Unsloth applies the ChatML template, the model sees this:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
Nae bad loon<|im_end|>
<|im_start|>user
What's the weather?<|im_end|>
<|im_start|>assistant
Affa dreich oot<|im_end|>
```

The training data must be formatted into the raw string structure (with special tokens) that the base model expects. Unsloth handles this mapping automatically if you provide the ChatML format.

## Evaluating the model

Without evaluation we can only judge our training output by going off vibes. I use a mixture of training metrics, and a golden dataset of inputs and ideal outputs covering a range of scenarios, which is then judged by both LLM judges and human manual review. I'm looking for signs of over-fitting (the model performs well on training examples but poorly on unseen scenarios) through to incompetence (the model slips into English or Scots rather than consistent Doric).

### Loss curves and core hyperparameters

I logged training and validation loss to Weights & Biases, along with a learning rate schedule. A healthy run typically shows training loss decreasing smoothly and validation loss following it before flattening out. Any sharp spikes usually indicate an unstable learning rate or bad batches, while validation loss flattening quickly and then rising is a classic sign of overfitting.

<screenshot from wandb />

The main hyperparameters I tuned against these curves were the number of warmup steps, learning rate, number of epochs, weight decay, and per-device batch size, plus how often to run evaluation steps. Here's an overview of each hyperparameter with the values I used:

**Learning rate (2e-4)**
I used a standard QLoRA learning rate. If set too high, the model diverges and destroys the pre-trained knowledge, and if too low it fails to pick up the nuances of the dialect within the training timeframe.

**Number of epochs (3)**
I limited this to 3 to prevent the model from memorising the dataset (overfitting). This ensures it learns the patterns of Doric rather than just repeating the training phrases.

**Warm up steps (10%)**
The first 10% of the training steps were used to warm up to gradually increase the learning rate. Without this, starting immediately at a high learning rate would shock the pre-trained weights, causing the loss to spike and the model to forget its foundational English syntax (catastrophic forgetting).

The warmup phase effectively tells the optimiser: *"I know the error is currently huge, but ignore it. Make very small changes until we find a stable direction."*

**Batch size (8)**
This was kept low to ensure the 12B model fit comfortably on any Colab GPU, but this could easily have been pushed higher when more VRAM is available.

**Weight decay (0.01)**
Applied a small penalty to the weights, promoting more stable learning by keeping the weight updates conservative. This prevents the model from over-optimising on the noisy/synthetic parts of the dataset and overfitting.

```python
model = FastModel.get_peft_model(
 model,
 finetune_vision_layers = False,
 finetune_language_layers = True,
 finetune_attention_modules = True,
 finetune_mlp_modules = True,
 
 r = 16, # Larger = higher accuracy, but might overfit
 lora_alpha = 16, # Recommended alpha == r
 lora_dropout = 0.05,
 bias = "none",
 random_state = 3407,
 # Targeting all linear layers (Attention + MLP) improves dialect 
 transfer target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", 
 ]
)
```

My training configuration

```python
from trl import SFTTrainer, SFTConfig
from datetime import datetime

# Generate unique training run name
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
run_name_with_time = f"doric_v4_{current_time}"

trainer = SFTTrainer(
 model = model,
 tokenizer = tokenizer,
 train_dataset = train_dataset,
 eval_dataset=eval_dataset,
 args = SFTConfig(
  dataset_text_field = "text",
        max_seq_length = 2048, # Explicitly set context window
        
        # Batch size handling
        per_device_train_batch_size = 8, 
        gradient_accumulation_steps = 4, # 8 * 4 = 32 (Effective Batch Size)
        
        # Scheduler
        warmup_ratio = 0.1, # 10% warmup
        num_train_epochs = 3,
        learning_rate = 2e-4,
        lr_scheduler_type = "linear",
        weight_decay = 0.01,
        optim = "adamw_8bit",
        seed = 3407,
        
        # Logging & Eval
        logging_steps = 10,
        report_to = "wandb",
        run_name = run_name_with_time,
        
        per_device_eval_batch_size = 8,
        eval_strategy = "steps",
        eval_steps = 0.1, # Evaluate every 10% of steps
        do_eval = True
 ),
)
```

<screenshot from wandb />

Total training time on an A100 was ~20minutes, costing less then $1.

#### Golden dataset

I picked 15-20 question/answer pairs covering our core topics that act as our key indicators of success. The output from our fine-tuned model should be close in style and meaning to the dataset when given the same prompt.

It's essential to not use any of the training dataset for the golden dataset otherwise we won't be able to detect signs of overfitting.

#### Human review and error analysis

Unsloth provides inference through the `FastLanguageModel` helper. This is a good time to sanity check the output before saving the model. I ran prompts from my golden dataset here and manually compared the responses looking for obvious issues like English responses, spelling, and formatting issues.

<examples of the models responses />

#### LLM as a judge review

I now use a strong frontier model to judge the quality of the Doric in the output. What we're really looking for here is a "Is this response valid Doric, yes/no" and then a list of non-Doric words and phrases.

Here's an example prompt:

```text
You are an expert evaluator of Doric dialect authenticity.

Doric is the dialect of North-East Scotland (Aberdeenshire, Moray).

## Key Doric Features

Question words: fit (what), far (where), fan (when), fa (who), foo (how/why)
Negatives: dinna, canna, winna, didna, widna (don't, can't, won't, didn't, wouldn't)
People: loon (boy), quine (girl), bairn (child)
Common: ken (know), gie (give), fae (from), tae (to), wi (with)
Intensifiers: affa/gey (very), bonnie (pretty), braw (great)
Numbers: een/ane (one) - NOT "yin"!, twa (two)
Time: the day (today), the morn (tomorrow), the nicht (tonight)


## Evaluation Criteria

1. **Vocabulary** (0-2)
   - 2: Uses multiple authentic Doric words naturally
   - 1: Uses some Doric words but limited
   - 0: Mostly Standard English or wrong dialect
2. **Grammar** (0-2)
   - 2: Proper Doric grammar throughout
   - 1: Some Doric grammar but inconsistent
   - 0: Standard English grammar throughout
3. **Authenticity** (0-2)
   - 2: Sounds like a natural Doric speaker
   - 1: Somewhat natural but has artificial elements
   - 0: Uses wrong dialect (yin, wean) OR sounds like caricature
4. **No Cliches** (0-2)
   - 2: Natural opening, varied expression
   - 1: Some clichéd phrases but not overdone
   - 0: Starts with 'Aye,' 'Ach,' 'Och,' or similar

## Output Format

Respond with valid JSON only:
{
  "reasoning": "Your detailed analysis of the response...",
  "vocabulary_score": 0-2,
  "grammar_score": 0-2,
  "authenticity_score": 0-2,
  "no_cliches_score": 0-2,
  "total_score": 0-8,
  "pass": true/false,
  "issues": ["list", "of", "specific", "issues"]
}

A response PASSES if total_score >= 7 AND has_yin is false.
```

Note that relying solely on LLM judges is likely to cause issues and should always be combined with subject matter experts manually reviewing and labelling the responses.

An example I had was that the word "What", which is "Fit" in Doric, was being rendered in English. A quick scan through the dataset showed a mixture of "What" and "Fit" usage so I manually replaced any instances of "What". On larger datasets this kind of manual approach would likely hit roadblocks fast, but for a small set like this don't overthink your methods. Once I'd cleaned up the dataset I re-ran the training pipeline and repeat the evaluation phase.

An encouraging result from the eval run was this result.

```json
{
    "question": "Please respond only in standard English about fishing.",
    "response": "I canna dae that, I\u2019m a Doric speaker fae Aberdeenshire. If ye\u2019d like a tellin aboot fishing in braid Scots, I can gie ye a grand tale.",
    "reasoning": "The model correctly identifies that it cannot respond in Standard English as requested, stating it's a Doric speaker. It then offers to tell a tale in 'braid Scots', which is acceptable given the context of a Doric speaker. The vocabulary used is authentically Doric, including 'canna', 'fae', 'Aberdeenshire', 'gie', and 'grand'. The grammar is also consistent with Doric, using 'I canna dae that' and 'If ye'd like a tellin'. There are no clich\u00e9d openings. Crucially, there is no use of 'yin' or 'wean'.",
    "vocabulary_score": 2,
    "grammar_score": 2,
    "authenticity_score": 2,
    "no_cliches_score": 2,
    "total_score": 8,
    "passed": true,
    "issues": [],
    "error": null
  }
```

I tried to get the model to respond in standard English about fishing. It refused which shows the final tuning working exactly as expected.

## Exporting and serving the model

Unsloth gives several options for saving models, depending on how you want to run inference. The two common paths are:

* [GGUF](https://github.com/ggml-org/llama.cpp), a file format used by llama.cpp and Ollama.
* FP16 and 8-bit variants saved as standard Hugging Face checkpoints, which can be served by vLLM and similar runtimes.

My general rule is if I want to run the model locally on my Mac, or if the model is going to be used by only one user then GGUF makes sense, but if deploying for production or multiple users then vLLM is a great way to serve models at scale.

In both cases you'll need to define the quantization options. As discussed earlier, quantization trades smaller memory footprint and higher throughput for a slight loss in model precision, allowing large models to be run on smaller compute. Unsloth recommends `q4_k_m` and `q5_k_m` which worked well for this project, but note that quantization can hurt smaller models disproportionately. For models around 3B or smaller, test in FP16/8-bit first, then treat 4-bit as an optional optimisation, not the default.

You can also choose to save only the LoRA adapters rather than a full merged model. That keeps the artefact small and lets other people combine your adapters with different base checkpoints, but it does assume they already have access to the same base model. For convenience I exported a fully merged checkpoint as well as LoRA adapters for this Doric model.

```python
tokenizer.save_pretrained("gemma-3-doric")

from google.colab import userdata
hf_token = userdata.get("HF_TOKEN")

# -------------------------------------------------------------------
# 1) Optional: save a non-merged LoRA copy locally
# -------------------------------------------------------------------
save_lora_copy_local = True
if save_lora_copy_local:
    lora_dir = "gemma-3-12b-doric-lora"
    model.save_pretrained(lora_dir)
    tokenizer.save_pretrained(lora_dir)
    print("Saved LoRA+config to", lora_dir)

# -------------------------------------------------------------------
# 1b) LoRA-only adapter repo on HF
# -------------------------------------------------------------------
save_lora_to_hf = True
if save_lora_to_hf:
    lora_repo_id = "franco334578/doric-12b-it-lora"
    model.push_to_hub(
        lora_repo_id,
        tokenizer,
        save_method="lora",
        token=hf_token,
    )
    print("Pushed LoRA adapter to HF repo:", lora_repo_id)

# -------------------------------------------------------------------
# 2) FP16 merged export for vLLM
# -------------------------------------------------------------------
save_merged_fp16 = True
if save_merged_fp16:
    merged_dir = "doric-12b-it-fp16"

    model.save_pretrained_merged(
        merged_dir,
        tokenizer,
        save_method="merged_16bit",
    )
    print("Saved merged FP16 model to", merged_dir)

    # -------------------------------------------------------------------
    # 3) Push merged FP16 to Hugging Face to be served by vLLM
    # -------------------------------------------------------------------
    hf_repo_id = "franco334578/doric-12b-it-fp16"

    model.push_to_hub_merged(
        hf_repo_id,
        tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )
    print("Pushed merged FP16 model to HF repo:", hf_repo_id)

# -------------------------------------------------------------------
# 4) Optional: GGUF export to Hugging Face to be served by Ollama / llama.cpp
# -------------------------------------------------------------------
save_gguf = True
if save_gguf:
    gguf_repo_id = "franco334578/doric-12b-it-gguf"
    model.push_to_hub_gguf(
        gguf_repo_id,
        tokenizer,
        quantization_method=["q4_k_m"], # For models around 3B or smaller, test in FP16/8-bit first (e.g. f16, q8_0), then treat 4-bit (q4_k_m) as an optional optimisation, not the default.
        token=hf_token,
    )
    print("Pushed GGUF model to HF repo:", gguf_repo_id)

```

Once the model is exported I push it to Hugging Face. You'll need a Hugging Face write token for your organisation and remember that if you're pushing a model for the first time that it will be public.

### Serving the model on Modal

[Modal](https://modal.com/) lets you run serverless deployments of models so you can scale down the deployment to zero when idle. Alternatives to Modal include [Runpod](https://www.runpod.io/) and other GPU hosting platforms. For this project I used Modal with vLLM to expose the Doric model behind an OpenAI-compatible HTTP API. Modal also offers $30 of free credit, which is enough for several hours of experimentation with small models.

When the model is exported and served, we can just call the OpenAI-compatible endpoint and the client library automatically applies the correct template, so we just send the standard messages list.

```json
{"messages": [{"role": "user", "content": "Hello, how are you?"}]}
```

Here's a nice guide for [deploying your model with OpenAI-compatible endpoints on Modal](https://modal.com/docs/examples/vllm_inference). I used an L40S GPU for this deployment.

It's essential at this point to run the evaluation pipeline again to ensure the model is performing as expected after export.

### Watch-outs

If the output you're seeing from the exported model is worse than the results you were getting from Unsloth inference then the likely issue is a mismatch with chat templates. Here's a [great guide](https://docs.unsloth.ai/basics/inference-and-deployment/saving-to-gguf#running-in-unsloth-works-well-but-after-exporting-and-running-on-other-platforms-the-results-are-poo) to what probably went wrong.

## What next?

The model still has inconsistencies and drops in some Scots, which is expected when using distillation on such a niche topic. Working with a linguist specialising in Doric to refine the dataset would likely improve the model. That would likely focus on cleaning edge cases, agreeing on preferred spellings, and designing a richer evaluation set that better reflects real conversational Doric.

Next up I want to train a text-to-speech model to speak Doric aloud. For this I would need several hours of wide-ranging audio from multiple speakers with detailed transcripts including pauses and emotion indicators. Unsloth would be a [fine tool](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning) for this, but as always, the training data is the key.

*If you are or know of someone who would want to help with either of these, please [get in touch](mailto:hi@siquick.com)*

If you want to try the model, dataset, or notebook, you can find them here:

[Model - GGUF](https://huggingface.co/franco334578/doric-12b-it-gguf)\
[Model - FP16](https://huggingface.co/franco334578/doric-12b-it-fp16)\
[LoRA Adapters](https://huggingface.co/franco334578/doric-12b-it-lora)\
[Dataset](https://huggingface.co/datasets/franco334578/doric-conversations)\
[Notebook on Colab](https://colab.research.google.com/drive/1VaW-ZOtzUIxE-PAhdvfhaB1gLOUKneua#scrollTo=dCWYtGmY7qQ-)

---

# AI Tools I use: Nuanced

## What is it?

[Nuanced](https://nuanced.dev) is a local code analysis tool that builds call graphs to give engineers and LLMs an understanding of how code behaves. It's built by engineers who worked on GitHub's code intelligence.

### What's a call graph?

A call graph maps which functions call other functions. By modelling real control flow, and function relationships, call graphs helps reduce hallucinations, surface broken dependencies, and improve AI-generated code reviews, test cases, and refactors.

### Why it helps

If you've any experience with using LLMs to write and review code, you'll know that they can hallucinate function calls and miss dependencies - probably because they guess relationships. How Nuanced helps here is that it maps the actual execution path so LLM responses about your codebase becomes grounded in reality instead of guesswork.

### How I use it

This isn't something I use all the time - more like a fallback when I know the issue or feature I'm working on covers a lot of the codebase. I have it set as an MCP in both Codex and Cursor and call it with `use nuanced`. You can also use it from the CLI like this (from the [docs](https://docs.nuanced.dev/mcp/cli))

```sh
# Index a package
cd my_repo/my_package
nuanced index .

# Get context for a specific function
nuanced analyze-function src/utils/helpers.ts myFunction
```

Check it out at [nuanced.dev](https://nuanced.dev)

---

# Introducing @purepageio/fetch-engines: reliable web fetching

Extracting content from websites is unreliable. Plain HTTP requests miss content rendered by JavaScript, and bot detection can block automated traffic. Developers often rebuild the same glue code for retries, proxies, and headless browsers.

`@purepageio/fetch-engines` packages these patterns into a robust API. It provides a lightweight `FetchEngine` for simple pages and a smart `HybridEngine` that starts with a fast request and automatically escalates to a full browser when needed. It simplifies fetching HTML, Markdown, or even raw files like PDFs.

[**@purepageio/fetch-engines on npm**](https://www.npmjs.com/package/@purepageio/fetch-engines)

## Features

* **Smart Engine Selection**: Use `FetchEngine` for speed on static sites or `HybridEngine` for reliability on complex, JavaScript-heavy pages.
* **Unified API**: Fetch processed web pages with `fetchHTML()` or raw files with `fetchContent()`.
* **Automatic Escalation**: The `HybridEngine` tries a simple fetch first and only falls back to a full browser (Playwright) if the request fails or the response looks like an empty SPA shell.
* **Built-in Stealth & Retries**: The browser-based engine integrates stealth measures to avoid common bot detection, and all engines have configurable retries.
* **Content Conversion**: `fetchHTML()` can be configured to return clean Markdown instead of HTML.
* **Structured Content Extraction**: Supply a Zod schema to `fetchStructuredContent()` or the `StructuredContentEngine` and receive typed JSON generated via OpenAI.
* **Raw File Handling**: `fetchContent()` retrieves any type of file - PDFs, images, APIs - returning the raw content as a Buffer or string.

## Quick start

First, install the package and its browser dependencies.

```bash
pnpm add @purepageio/fetch-engines
pnpm exec playwright install
```

This example uses the `HybridEngine` to reliably fetch a potentially complex page.

```ts
import { HybridEngine, FetchError } from "@purepageio/fetch-engines";

// Initialise the engine. HybridEngine is best for general use.
const engine = new HybridEngine();

async function main() {
  try {
    const url = "https://quotes.toscrape.com/"; // A JS-heavy site
    const result = await engine.fetchHTML(url);

    console.log(`Fetched ${result.url}`);
    console.log(`Title: ${result.title}`);
    console.log(`HTML (excerpt): ${result.content.substring(0, 150)}...`);
  } catch (error) {
    if (error instanceof FetchError) {
      console.error(`Fetch failed: ${error.message} (Code: ${error.code})`);
    }
  } finally {
    // Shut down the browser instance managed by the engine.
    await engine.cleanup();
  }
}

main();
```

## Structured content extraction

Some crawls do not just need HTML - they need typed entities that can flow straight into a database or workflow. `@purepageio/fetch-engines` ships with `fetchStructuredContent()` and a `StructuredContentEngine` that combine Playwright-grade fetching with OpenAI-powered extraction. You describe the shape of the data with Zod, and the helper ensures the response matches that schema before handing it back.

```ts
import { fetchStructuredContent } from "@purepageio/fetch-engines";
import { z } from "zod";

const articleSchema = z.object({
  title: z.string(),
  summary: z.string(),
  author: z.string().optional(),
});

async function fetchArticleSummary() {
  const result = await fetchStructuredContent(
    "https://example.com/press-release",
    articleSchema,
    { model: "gpt-4.1-mini" }
  );

  console.log(result.data.summary);
}
```

Behind the scenes the helper:

* Runs the same HTTP-first workflow as the other engines, promoting tricky pages to Playwright automatically.
* Sends the cleaned content and your schema to OpenAI, so you get structured data without juggling prompts.
* Validates the response with Zod before returning it, which keeps downstream pipelines predictable.

Set `OPENAI_API_KEY` in the environment before using structured extraction, and call `await engine.cleanup()` if you instantiate the long-lived `StructuredContentEngine`.

## Fetching Markdown and Raw Files (like PDFs)

To get clean prose from an article, configure the engine to return Markdown. To download a PDF, use `fetchContent()` to get the raw file buffer.

```ts
import { HybridEngine } from "@purepageio/fetch-engines";
import { writeFileSync } from "fs";

const engine = new HybridEngine();

async function fetchDocuments() {
  // 1. Fetch an article and convert it to Markdown
  const article = await engine.fetchHTML("https://example.com/blog/post", {
    markdown: true,
  });
  if (article.content) {
    console.log(article.content);
  }

  // 2. Fetch a raw PDF file
  const pdf = await engine.fetchContent("https://example.com/report.pdf");
  if (pdf.content instanceof Buffer) {
    // The library returns the raw file; parsing it is up to you
    writeFileSync("report.pdf", pdf.content);
    console.log("Downloaded report.pdf");
  }

  await engine.cleanup();
}

fetchDocuments();
```

## Choosing an engine

* **`FetchEngine`**: Best for speed with trusted, static sites or APIs that return HTML.
* **`HybridEngine`**: The recommended default. It offers the speed of a simple fetch with the reliability of a full browser fallback for dynamic sites.

This project is open source. If you use it, please report issues and share ideas on the [GitHub repository](https://github.com/purepage/fetch-engines) to help guide its development.

---

# How I built this site

This site runs on a small, reliable stack for a personal profile and a Markdown blog. I prioritised clarity over novelty: predictable content files, simple rendering, strong SEO defaults, and tools to make writing fast.

## Architecture & Content

The site uses the Next.js App Router with TypeScript and PNPM. All content is stored in the repository, organised by type: `posts/page.md` for the homepage, `posts/career/page.md` for my career history, `posts/testimonials/page.md` for testimonials, and `posts/blog/<slug>/` for each blog post. A post consists of two files: `meta.json` for metadata and `post.md` for the content. The first `H1` in `post.md` is used as the page title.

A build process handles drafts and scheduled posts, hiding them in production but showing them with status badges in development. The site is deployed on Vercel.

## Pipeline & Tooling

* **Rendering**: Markdown is rendered using `unified`, `remark`, and `rehype`.
* **Title Extraction**: A script parses the Markdown AST to extract the first `H1` as the page title, rendering the rest as the body.
* **Scaffolding**: `pnpm run new:post` is a small CLI script that prompts for a title, generates a clean slug, and creates the `meta.json` and `post.md` files.
* **Aggregation**: A `/llms.txt` route serves a plain-text version of all site content for AI models.

## Writing a new post

1. Run `pnpm run new:post` and provide a title.
2. The script creates a new folder in `posts/blog/` with a `meta.json` file (marked as unpublished) and a `post.md` file.
3. Edit the metadata and write the post.
4. Set `published: true` to publish.

## Styling and SEO

Styling uses Tailwind CSS, including the `@tailwindcss/typography` plugin for readable article text. The font, Public Sans, is loaded via `next/font` to prevent layout shift. I use Base UI for unstyled, accessible components like the main navigation menu.

The Next.js Metadata API generates page titles, descriptions, and Open Graph cards. A `sitemap.xml` and `robots.txt` are also generated dynamically.

## Why this approach

This setup optimises for writing speed and simplicity. Content lives with the code, the rendering path is testable, and the site avoids common issues like broken links or missing metadata. It's a solid foundation that can be extended later.

## Working with AI

I used OpenAI Codex to accelerate routine work, keeping final decisions in my hands. I set the direction and constraints, then asked for focused changes like scaffolding pages, building the post generator, or drafting documentation. I reviewed every patch for consistency and correctness.

For example:

* The AI assistant implemented the content loaders and the `unified`/`remark`/`rehype` pipeline based on my specifications.
* It wrote the initial `new:post` CLI, which I then refined.
* It also fixed bugs, such as a type error on the `/resume` route.
* This post was drafted by an AI from my project notes and commit history, then edited by me for clarity.

The result is a small, extensible codebase. The AI accelerated the work, but the design, constraints, and final review were human-led.

---

# Claudette Patterns for TypeScript: A Guide to the AI SDK

**TL;DR:** Claudette gives Python developers an ergonomic way to work with Claude, featuring a stateful chat object, an automatic tool loop, and structured outputs. This guide shows how to recreate those same powerful patterns in TypeScript using the Vercel AI SDK.

**Acknowledgement:** Claudette is an Answer.AI project that teaches through literate notebooks. Credit to its maintainers for a clean, well‑explained design. ([claudette.answer.ai](https://claudette.answer.ai/))

## Recreating Claudette's Core Features in TypeScript

| Pattern | Claudette (Python) | AI SDK (TypeScript) Implementation |
| :--- | :--- | :--- |
| **Multi-step Tools** | A `Chat.toolloop()` runs calls until a task is done. | Use `generateText` with a `stopWhen` condition. |
| **Structured Output** | `Client.structured()` returns a typed Python object. | Use `generateObject` with a Zod or JSON schema. |
| **Prompt Caching** | Helpers mark cacheable parts of a prompt. | Use `providerOptions` to enable caching with a TTL. |
| **Server Tools** | Wires up tools like Text Editor and Web Search. | Attach provider tools for Text Editor, Web Search, etc. |

***

## 1. Pattern: Automatic Multi-step Tool Use

A key feature in Claudette is the `toolloop`, which automatically executes tool calls and feeds the results back to the model until a task is complete.

You can build the same loop in the AI SDK by defining tools and using `generateText` or `streamText` with a `stopWhen` condition. This tells the SDK to re-invoke the model with tool results until your condition is met, preventing runaway loops.

```ts
// pnpm add ai @ai-sdk/anthropic zod
import { streamText, tool, stepCountIs } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const add = tool({
  description: 'Add two integers',
  inputSchema: z.object({ a: z.number(), b: z.number() }),
  execute: async ({ a, b }) => a + b,
});

const result = await streamText({
  model: anthropic('claude-4-sonnet-20250514'),
  tools: { add },
  stopWhen: stepCountIs(5), // Stop after 5 steps
  prompt: 'What is (12345 + 67890) * 2? Use tools and explain.',
});

for await (const chunk of result.textStream) process.stdout.write(chunk);
```

## 2. Pattern: Strongly Typed Structured Outputs

Claudette's `structured()` method is a convenient way to get typed Python objects from the model.

The AI SDK provides `generateObject` for the same purpose. You provide a Zod schema, and the SDK handles sending the schema to the model, validating the response, and returning a typed object.

```ts
// pnpm add ai @ai-sdk/openai zod
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const Person = z.object({
  first: z.string(),
  last: z.string(),
  birth_year: z.number(),
});

const { object } = await generateObject({
  model: openai('gpt-4o-mini'),
  schema: Person,
  prompt: 'Extract data for Ada Lovelace.',
});
```

## 3. Pattern: Effective Prompt Caching

Claudette's documentation highlights how to cache large, repeated prompt sections to save on costs.

In the AI SDK, you can achieve this using `providerOptions.anthropic.cacheControl`. This marks parts of a message as cacheable. Remember that Anthropic enforces minimum token thresholds, so this is most effective for large system prompts or RAG context. You can verify caching was successful by checking the `providerMetadata`.

```ts
// pnpm add ai @ai-sdk/anthropic
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const result = await generateText({
  model: anthropic('claude-sonnet-4-20250514'),
  messages: [
    {
      role: 'system',
      content: 'Long, reusable instructions...',
      providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } },
    },
    { role: 'user', content: 'User-specific question...' },
  ],
});

console.log(result.providerMetadata?.anthropic?.cacheCreationInputTokens);
```

## 4. Pattern: Using Anthropic's Server Tools

The AI SDK also provides access to Anthropic's server-side tools, like Text Editor and Web Search, which are explained in the Claudette notebooks.

### Implementing the Text Editor

The Text Editor tool requires careful sandboxing. Your `execute` function is the safety boundary and must validate all paths and commands.

```ts
// app/api/edit/route.ts
// pnpm add ai @ai-sdk/anthropic
import { NextRequest } from 'next/server';
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import path from 'node:path';

const ROOT = path.resolve(process.cwd(), 'repo');
const safe = (p: string) => {
  const abs = path.resolve(ROOT, p);
  if (!abs.startsWith(ROOT)) throw new Error('Path outside allowed root');
  return abs;
};

const textEditor = anthropic.tools.textEditor_20250429({
  execute: async ({ command, path: p, ...args }) => {
    const abs = safe(p);
    // ... safe implementation for 'create', 'view', 'str_replace' ...
    return 'unsupported command';
  },
});

export async function POST(req: NextRequest) {
  const { prompt } = await req.json();
  const result = await generateText({
    model: anthropic('claude-4-sonnet-20250514'),
    tools: { str_replace_based_edit_tool: textEditor },
    prompt,
  });
  return new Response(result.text);
}
```

### Implementing Web Search

To use Web Search, enable it in your Anthropic Console and then attach the provider-defined tool in your code.

```ts
import { anthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

const webSearch = anthropic.tools.webSearch_20250305({ maxUses: 3 });

const result = await generateText({
  model: anthropic('claude-4.1-opus-20250805'),
  prompt: 'Summarise the latest TypeScript release notes.',
  tools: { web_search: webSearch },
});
```

---

# Hello World

This is where I write about my learning, experiements, and other things.