Compile once.
Deploy anywhere.

ZeptonML extracts computational graphs from trained models and compiles them into minimal, deterministic binaries. No runtime. No framework. Just inference.

Performance

Container size: 15 GB → 120 MB (125× smaller)

Cold start: 45 s → 8 ms (5,625× faster)

Memory: ~2 GB → 48 MB (42× lighter)
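As a quick sanity check, the compiled figures can be recovered by dividing each baseline by its quoted ratio. The snippet below is only a sketch of that arithmetic; it assumes decimal units (1 GB = 1000 MB, 1 s = 1000 ms), and the dictionary keys are illustrative names, not part of ZeptonML.

```python
# Baselines and improvement ratios as quoted in the Performance figures.
# Assumption: decimal units (1 GB = 1000 MB, 1 s = 1000 ms).
baselines = {
    "container_mb": 15 * 1000,   # 15 GB container
    "cold_start_ms": 45 * 1000,  # 45 s cold start
    "memory_mb": 2 * 1000,       # ~2 GB memory
}
ratios = {
    "container_mb": 125,
    "cold_start_ms": 5625,
    "memory_mb": 42,
}

# Implied compiled figure = baseline / ratio.
implied = {name: baselines[name] / ratios[name] for name in baselines}

for name, value in implied.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the ratios imply roughly 120 MB, 8 ms, and a little under 48 MB.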

One Dockerfile to deploy

Dockerfile

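# Stage 1: compile a torch.export model to a static binary.
# BUILDER_IMAGE (placeholder): any image that provides the zeptonml CLI.
ARG BUILDER_IMAGE
FROM ${BUILDER_IMAGE} AS build
ARG MODEL=model.pt2
COPY ${MODEL} /model.pt2
RUN zeptonml compile /model.pt2 -o /infer
# Stage 2: ship only the compiled binary on an empty base image.
FROM scratch
COPY --from=build /infer /infer
ENTRYPOINT ["/infer"]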
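The Dockerfile expects a `model.pt2` file produced by `torch.export`. A minimal sketch of creating one is shown below, assuming a recent PyTorch 2.x release; the tiny model and the function name `export_example_model` are placeholders for illustration, and only `torch.export.export` and `torch.export.save` are actual PyTorch calls.

```python
def export_example_model(path="model.pt2"):
    """Export a toy model with torch.export and save it to `path`.

    Hypothetical helper for illustration; swap in your own trained model.
    """
    # Imported lazily so this file can be loaded without PyTorch installed.
    import torch

    # Placeholder network: 4 input features -> 2 class probabilities.
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 2),
        torch.nn.Softmax(dim=-1),
    ).eval()

    # torch.export traces the model with example inputs (a tuple of args).
    example_inputs = (torch.randn(1, 4),)
    exported = torch.export.export(model, example_inputs)

    # Serialize the ExportedProgram; .pt2 is the conventional extension.
    torch.export.save(exported, path)
    return path
```

Calling `export_example_model()` in the build context writes `model.pt2`, which matches the default value of the `MODEL` build argument.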

Why ZeptonML

Zero runtime

No Python, no PyTorch, no CUDA runtime. A single static binary that runs on bare metal.

Deterministic memory

Every allocation planned at compile time. No GC pauses, no fragmentation, no surprises.
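To illustrate what compile-time allocation planning can look like in general, here is a small sketch of greedy offset assignment into a single arena. It is not ZeptonML's planner: the `Buffer` type, the `plan_arena` function, and the sizes and lifetimes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int   # bytes
    first: int  # index of the op that produces the buffer
    last: int   # index of the last op that reads it


def plan_arena(buffers):
    """Assign each buffer a fixed offset in one arena.

    Greedy first-fit: largest buffers first, each placed at the lowest
    offset that does not collide with a buffer alive at the same time.
    """
    placed = []   # (offset, buffer) pairs already assigned
    offsets = {}
    for buf in sorted(buffers, key=lambda b: (-b.size, b.name)):
        # Address ranges held by buffers whose lifetimes overlap this one.
        busy = sorted(
            (off, off + other.size)
            for off, other in placed
            if other.first <= buf.last and buf.first <= other.last
        )
        offset = 0
        for start, end in busy:
            if offset + buf.size <= start:
                break  # fits in the gap before this range
            offset = max(offset, end)
        placed.append((offset, buf))
        offsets[buf.name] = offset
    arena_size = max((off + b.size for off, b in placed), default=0)
    return offsets, arena_size


# Toy graph: input -> hidden -> logits -> probs (illustrative sizes).
buffers = [
    Buffer("input", 64, first=0, last=1),
    Buffer("hidden", 128, first=1, last=2),
    Buffer("logits", 32, first=2, last=3),
    Buffer("probs", 32, first=3, last=4),
]
offsets, arena_size = plan_arena(buffers)
print(offsets, arena_size)
```

In this toy case the four buffers total 256 bytes but fit in a 192-byte arena, because buffers whose lifetimes never overlap share the same addresses. Since every offset is fixed before the program runs, nothing needs to be allocated or freed during inference.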

Cross-compilation

One build step produces binaries for x86, ARM, or any other target. No toolchain gymnastics.

Edge ready

Static binary for embedded deployment. If it has a processor, it runs your model.

From the notebook

Engineering Notes — 2026.05.30

Why MTEB Scores Don't Predict Performance on Your Data

Benchmark leaderboards measure generalization across tasks you'll never run. Evaluate embedding models on the only distribution that matters — yours.
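One way to evaluate on your own distribution is to measure retrieval recall on labeled query and document pairs drawn from your data. The sketch below assumes embeddings have already been computed by whichever model is under test; the vectors, IDs, and the `recall_at_k` helper are toy placeholders rather than anything from the notebook post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries whose relevant document ranks in the top k."""
    hits = 0
    for qid, qvec in query_vecs.items():
        ranked = sorted(
            doc_vecs,
            key=lambda doc_id: cosine(qvec, doc_vecs[doc_id]),
            reverse=True,
        )
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings standing in for your model's output on your own data.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.2, 0.8], "q3": [1.0, 0.2]}
relevant = {"q1": "d1", "q2": "d2", "q3": "d3"}  # one labeled doc per query

for k in (1, 2):
    print(f"recall@{k} = {recall_at_k(query_vecs, doc_vecs, relevant, k):.2f}")
```

Running the same script with embeddings from two candidate models, on the same labeled pairs, gives a direct comparison on the data you actually serve.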

It is better to be vaguely right than exactly wrong.

— Carveth Read, 1898