Deploying GLM-OCR as a Serverless GPU API on Modal
I put together a small repo that deploys GLM-OCR to Modal as a PDF-to-Markdown API.
The goal was simple. I wanted something open, fairly compact, and easy to drop into an LLM ingestion pipeline without pulling in a pile of extra infrastructure. GLM-OCR turned out to be a good fit, and Modal makes it very easy to expose it as a public GPU-backed endpoint.
Here for the code?
The repo is here.
What's happening
There are plenty of OCR tools around, but I wanted something small and hackable. GLM-OCR is a compact document OCR model that works through transformers and can produce structured text from page images without much fuss.
The deployment side is equally minimal. It is just a FastAPI app and a Modal GPU class. The endpoint accepts a PDF, the worker renders each page to an image, runs OCR, and returns the result as Markdown.
The shape of the code
The request flow is straightforward:
- `POST` a PDF to `/parse`
- render the PDF pages with `pypdfium2` (sketched below)
- send each page image to GLM-OCR
- return the result as Markdown
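The rendering step is only a few lines. Here is a minimal sketch of what it might look like with pypdfium2; the helper name and the scale factor are my own choices, not necessarily what the repo uses:

```python
import io

import pypdfium2 as pdfium


def render_pages(pdf_bytes: bytes, scale: float = 2.0) -> list[bytes]:
    """Rasterise each PDF page to PNG bytes for the OCR model."""
    pdf = pdfium.PdfDocument(pdf_bytes)
    pages = []
    for page in pdf:
        bitmap = page.render(scale=scale)  # higher scale = more pixels for OCR
        buf = io.BytesIO()
        bitmap.to_pil().save(buf, format="PNG")
        pages.append(buf.getvalue())
    return pages
```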
You can hit it with curl like this:
```bash
curl -X POST "https://<your-modal-url>/parse" \
  -F "pdf=@/absolute/path/to/file.pdf"
```
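Or, if you would rather call it from Python, the equivalent with `requests` looks like this (the `pdf` field name matches the curl example above):

```python
import requests

# Replace the URL with your deployed Modal endpoint.
with open("file.pdf", "rb") as f:
    resp = requests.post(
        "https://<your-modal-url>/parse",
        files={"pdf": ("file.pdf", f, "application/pdf")},
        timeout=120,  # leave headroom for a cold start
    )
resp.raise_for_status()
print(resp.text)  # the Markdown result
```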
The nice part is how little code you need around the model itself. The worker loads GLM-OCR once, and the API calls it remotely:
```python
@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
)
class GlmOcrService:
    @modal.enter(snap=True)
    def load(self) -> None:
        from transformers import AutoModelForImageTextToText, AutoProcessor

        self.processor = AutoProcessor.from_pretrained(MODEL_NAME)
        self.model = AutoModelForImageTextToText.from_pretrained(
            pretrained_model_name_or_path=MODEL_NAME,
            torch_dtype="auto",
            device_map="auto",
        )
```
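The snippet above stops at loading. For completeness, the actual inference call for an image-text-to-text model typically looks something like the sketch below. This is my illustration, not the repo's method, and the exact prompt format depends on the model's chat template:

```python
    # Inside GlmOcrService (continuing the class above):
    @modal.method()
    def ocr_page(self, image_bytes: bytes) -> str:
        import io

        from PIL import Image

        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Convert this page to Markdown."},
            ],
        }]
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=4096)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```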
You still write normal Python functions and classes, but end up with a public serverless GPU service.
Why Modal works well for this
Modal suits bursty workloads like PDF processing that do not need to sit online all day.
In this setup I am using:
- a FastAPI endpoint for uploads
- an `L40S` GPU worker
- scale-to-zero after about a minute of inactivity
That gives you a proper hosted OCR API without managing a VM, Kubernetes, or a permanently warm GPU.
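Wiring those pieces together in Modal is short. Here is a sketch of what the app-level setup might look like; names like `hf-cache` and the exact package list are illustrative, and recent Modal versions spell the idle timeout `scaledown_window`:

```python
import modal

app = modal.App("glm-ocr")

# Persist the Hugging Face cache across containers so weights download once.
hf_cache_volume = modal.Volume.from_name("hf-cache", create_if_missing=True)
MODEL_CACHE_PATH = "/root/.cache/huggingface"

inference_image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "accelerate", "pillow", "pypdfium2")
    .env({"HF_HOME": MODEL_CACHE_PATH})
)


@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
    scaledown_window=60,  # scale to zero after about a minute idle
)
class GlmOcrService:
    ...
```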
Observed performance
The useful bit is end-to-end behaviour, not just model-card numbers.
In my quick tests with a 4-page PDF:
- cold start: 26.08s
- warm start: 17.98s
- warm start: 18.17s
So in practice that looks like:
- roughly 8 seconds of cold-start overhead
- then about 4.5 seconds per page once the worker is warm
That is slower than raw model benchmark figures, but those are measuring something different. Here you are timing the whole request path: request handling, PDF rasterisation, image preparation, OCR, and Markdown assembly, plus cold start where relevant.
You can do this for almost free
As of 20 March 2026, Modal’s Starter plan includes $30/month in compute credits.
This repo uses an L40S, currently priced at $0.000542/sec, or about $1.95/hour. That works out to roughly 15.4 hours of GPU runtime per month before you pay anything beyond the included credit.
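If you want to sanity-check that arithmetic (prices change, so plug in the current rate):

```python
# Back-of-the-envelope check of the free-tier runtime estimate.
price_per_sec = 0.000542  # USD per second for an L40S, as quoted above
price_per_hour = price_per_sec * 3600
monthly_credit = 30.0     # Starter plan credit in USD

print(f"${price_per_hour:.2f}/hour")                   # ~$1.95/hour
print(f"{monthly_credit / price_per_hour:.1f} hours")  # ~15.4 hours/month
```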
So you can deploy a real OCR API backed by a proper GPU, keep the codebase small, and experiment without the cost becoming its own project.
Cover photo by Brett Jordan on Unsplash