Deploying GLM-OCR as a Serverless GPU API on Modal
I put together a small repo that deploys GLM-OCR to Modal as a PDF-to-Markdown API.
The goal was simple. I wanted something open, fairly compact, and easy to drop into an LLM ingestion pipeline without pulling in a pile of extra infrastructure. GLM-OCR turned out to be a good fit, and Modal makes it very easy to expose it as a public GPU-backed endpoint.
Here for the code?
The repo is here.
What's happening
There are plenty of OCR tools around, but I wanted something small and hackable. GLM-OCR is a compact document OCR model that works through transformers and can produce structured text from page images without much fuss.
The deployment side is equally minimal. It is just a FastAPI app and a Modal GPU class. The endpoint accepts a PDF, the worker renders each page to an image, runs OCR, and returns the result as Markdown.
The shape of the code
The request flow is straightforward:
- `POST` a PDF to `/parse`
- render the PDF pages with `pypdfium2` (sketched below)
- send each page image to GLM-OCR
- return the result as Markdown
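The rendering step is only a few lines. Here is a minimal sketch of what it might look like with pypdfium2; the helper name and the scale factor are my own choices, not necessarily what the repo uses:

```python
import io

import pypdfium2 as pdfium


def render_pages(pdf_bytes: bytes, scale: float = 2.0) -> list[bytes]:
    """Rasterise each PDF page to PNG bytes for the OCR model."""
    pdf = pdfium.PdfDocument(pdf_bytes)
    pages = []
    for page in pdf:
        bitmap = page.render(scale=scale)  # higher scale = more pixels for OCR
        buf = io.BytesIO()
        bitmap.to_pil().save(buf, format="PNG")
        pages.append(buf.getvalue())
    return pages
```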
You can hit it with curl like this:
```bash
curl -X POST "https://<your-modal-url>/parse" \
  -F "pdf=@/absolute/path/to/file.pdf"
```
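Or, if you would rather call it from Python, the equivalent with `requests` looks like this (the `pdf` field name matches the curl example above):

```python
import requests

# Replace the URL with your deployed Modal endpoint.
with open("file.pdf", "rb") as f:
    resp = requests.post(
        "https://<your-modal-url>/parse",
        files={"pdf": ("file.pdf", f, "application/pdf")},
        timeout=120,  # leave headroom for a cold start
    )
resp.raise_for_status()
print(resp.text)  # the Markdown result
```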
The nice part is how little code you need around the model itself. The worker loads GLM-OCR once, and the API calls it remotely:
```python
@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
)
class GlmOcrService:
    @modal.enter(snap=True)
    def load(self) -> None:
        from transformers import AutoModelForImageTextToText, AutoProcessor

        self.processor = AutoProcessor.from_pretrained(MODEL_NAME)
        self.model = AutoModelForImageTextToText.from_pretrained(
            pretrained_model_name_or_path=MODEL_NAME,
            torch_dtype="auto",
            device_map="auto",
        )
```
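The snippet above stops at loading. For completeness, the actual inference call for an image-text-to-text model typically looks something like the sketch below. This is my illustration, not the repo's method, and the exact prompt format depends on the model's chat template:

```python
    # Inside GlmOcrService (continuing the class above):
    @modal.method()
    def ocr_page(self, image_bytes: bytes) -> str:
        import io

        from PIL import Image

        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Convert this page to Markdown."},
            ],
        }]
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=4096)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```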
You still write normal Python functions and classes, but end up with a public serverless GPU service.
Why Modal works well for this
Modal suits bursty workloads like PDF processing that do not need to sit online all day.
In this setup I am using:
- a FastAPI endpoint for uploads
- an `L40S` GPU worker
- scale-to-zero after about a minute of inactivity
That gives you a proper hosted OCR API without managing a VM, Kubernetes, or a permanently warm GPU.
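Wiring those pieces together in Modal is short. Here is a sketch of what the app-level setup might look like; names like `hf-cache` and the exact package list are illustrative, and recent Modal versions spell the idle timeout `scaledown_window`:

```python
import modal

app = modal.App("glm-ocr")

# Persist the Hugging Face cache across containers so weights download once.
hf_cache_volume = modal.Volume.from_name("hf-cache", create_if_missing=True)
MODEL_CACHE_PATH = "/root/.cache/huggingface"

inference_image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "accelerate", "pillow", "pypdfium2")
    .env({"HF_HOME": MODEL_CACHE_PATH})
)


@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
    scaledown_window=60,  # scale to zero after about a minute idle
)
class GlmOcrService:
    ...
```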
Observed performance
The useful bit is end-to-end behaviour, not just model-card numbers.
In my quick tests with a 4-page PDF:
- cold start: 26.08s
- warm start: 17.98s
- warm start: 18.17s
So in practice that looks like:
- roughly 8 seconds of cold-start overhead
- then about 4.5 seconds per page once the worker is warm
That is slower than raw model benchmark figures, but those are measuring something different. Here you are timing the whole request path: request handling, PDF rasterisation, image preparation, OCR, and Markdown assembly, plus cold start where relevant.
You can do this for almost free
As of 20 March 2026, Modal’s Starter plan includes $30/month in compute credits.
This repo uses an L40S, currently priced at $0.000542/sec, or about $1.95/hour. That works out to roughly 15.4 hours of GPU runtime per month before you pay anything beyond the included credit.
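If you want to sanity-check that arithmetic (prices change, so plug in the current rate):

```python
# Back-of-the-envelope check of the free-tier runtime estimate.
price_per_sec = 0.000542  # USD per second for an L40S, as quoted above
price_per_hour = price_per_sec * 3600
monthly_credit = 30.0     # Starter plan credit in USD

print(f"${price_per_hour:.2f}/hour")                   # ~$1.95/hour
print(f"{monthly_credit / price_per_hour:.1f} hours")  # ~15.4 hours/month
```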
So you can deploy a real OCR API backed by a proper GPU, keep the codebase small, and experiment without the cost becoming its own project.
Cover photo by Brett Jordan on Unsplash