
Caltech Research Produces a 1-Bit LLM That Fits on Any Smartphone While Matching Full-Precision Competitors


A standard 16-bit 8B language model requires roughly 16 gigabytes of memory to run. An iPhone 17 Pro cannot host one. PrismML's 1-bit Bonsai 8B requires 1.15 gigabytes, runs at around 40 tokens per second on that same iPhone, and scores competitively on benchmarks against models fourteen times its size.
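The memory figures follow directly from the parameter count and bit width, and are worth sanity-checking. A minimal Python check (the 1.125 effective bits per weight is reported later in the article; the small gap to the 1.15 GB figure presumably covers metadata and runtime buffers):

```python
PARAMS = 8e9  # 8B parameters

# 16-bit weights: 16 bits = 2 bytes per parameter.
fp16_gb = PARAMS * 16 / 8 / 1e9
print(fp16_gb)  # 16.0 -- matches the "roughly 16 gigabytes" figure

# Bonsai's effective 1.125 bits per weight (1 bit plus a shared scale).
onebit_gb = PARAMS * 1.125 / 8 / 1e9
print(onebit_gb)  # 1.125 -- close to the reported 1.15 GB
```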


That combination — capable benchmark scores at a fraction of the memory footprint — is the central claim of Bonsai, and it puts PrismML's release squarely in the middle of a long-running debate about whether 1-bit quantization can produce models that are actually deployable rather than merely small.


The company emerged from stealth on March 31, 2026, founded by Caltech mathematicians including Babak Hassibi, the university's electrical engineering department chair, and backed by $16.25 million from Khosla Ventures, Cerberus Capital, and Caltech itself. Amir Salek, who led the TPU program at Google and is now at Cerberus Ventures, is among the investors. The model weights are available immediately under the Apache 2.0 license.


What 1-Bit Means Here


Most language models store weights in 16-bit or 32-bit floating point format; each parameter costs that many bits of memory. 1-bit quantization reduces each weight to a single stored bit, conventionally decoded as +1 or −1 and multiplied by a shared scale. The tradeoff has historically been severe performance degradation, particularly on tasks requiring multi-step reasoning, instruction following, and reliable tool use.


PrismML's architecture goes further than most prior 1-bit attempts. Embeddings, attention projections, MLP layers, and the language model head are all 1-bit. There are no higher-precision components in the network. The effective bits per weight come out to 1.125, accounting for a shared 16-bit scale factor applied per group of 128 weights.
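The 1.125 figure falls out of the grouping arithmetic: one bit per weight, plus 16 bits amortized over each group of 128. A minimal numpy sketch, assuming sign-based binarization with a per-group absolute-mean scale (a standard scheme for binary networks; the article does not specify PrismML's exact quantization rule):

```python
import numpy as np

GROUP = 128  # weights sharing one 16-bit scale factor

def quantize_1bit(w):
    # Sign-based binarization with a per-group absolute-mean scale,
    # assumed here for illustration.
    w = w.reshape(-1, GROUP)
    scale = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    bits = w >= 0                      # 1 bit stored per weight
    return bits, scale

def dequantize(bits, scale):
    # Each bit decodes to +scale or -scale.
    return np.where(bits, 1.0, -1.0) * scale.astype(np.float32)

# Effective storage: 1 bit per weight + 16 bits shared across 128 weights.
effective_bits = 1 + 16 / GROUP
print(effective_bits)  # 1.125

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * GROUP).astype(np.float32)
bits, scale = quantize_1bit(w)
w_hat = dequantize(bits, scale)  # reconstructed weights, all +/-scale
```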


The company describes this as the first commercially viable 1-bit LLM, a claim that rests on both the benchmark scores and the breadth of deployment targets the model actually reaches. Independent verification of that commercial viability claim will take time, but the released weights, inference code, and benchmarking methodology are publicly available for scrutiny.


Benchmark Results


Evaluated using EvalScope v1.4.2 and vLLM 0.15.1 on an NVIDIA H100, Bonsai 8B achieves an average score of 70.5 across six benchmark categories, placing it in range of full-precision 8B instruct models including Qwen3 8B, Llama3 8B, and Mistral 7B. The benchmarks cover reasoning, math, coding, and general instruction following.


PrismML introduces a metric they call intelligence density, defined as the negative log of the model's average error rate divided by model size in gigabytes. By this measure, Bonsai 8B scores 1.06 per GB. Qwen3 8B scores 0.10 per GB. The metric penalizes improvements at lower accuracy levels relative to improvements near ceiling, which the company argues provides a more realistic picture of capability gains.
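The definition can be reproduced directly. The article does not state the logarithm's base, but the natural log recovers the published Bonsai figure:

```python
import math

def intelligence_density(avg_score_pct, size_gb):
    # -log(average error rate) / model size in GB. Natural log is an
    # assumption; it reproduces the reported 1.06/GB for Bonsai 8B
    # (70.5 average score, 1.15 GB).
    error_rate = 1 - avg_score_pct / 100
    return -math.log(error_rate) / size_gb

print(round(intelligence_density(70.5, 1.15), 2))  # 1.06
```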


The metric is internally defined, which is worth noting. It is not a standard used elsewhere in the field, and readers should treat it as PrismML's framing of their own advantage rather than an externally validated measurement. Raw average benchmark scores, which are directly comparable to reported numbers from other labs, show Bonsai 8B competitive with but not clearly superior to leading 8B models on accuracy alone.


What distinguishes the result is that this score comes at 1.15 GB. A Pareto frontier analysis PrismML conducted across twenty models from 1.2 GB to 16.4 GB shows the Bonsai family (8B, 4B, and 1.7B) sitting well beyond the previous efficiency boundary defined by Qwen3 models and Ministral 3B.


Speed and Energy Numbers


On an M4 Pro Mac, Bonsai 8B runs at 131 tokens per second. On an RTX 4090, it reaches 368 tokens per second. On an iPhone 17 Pro Max, it runs at approximately 44 tokens per second. For comparison, a standard 16-bit 1B model runs at 23 tokens per second on the same iPhone — roughly half the speed, at a fraction of the capability.


Energy efficiency gains are substantial. On the M4 Pro, Bonsai 8B requires 0.074 milliwatt-hours per token. On the iPhone 17 Pro Max, 0.068 mWh per token. The company reports 4 to 5 times better energy efficiency relative to full-precision 8B models. This figure matters for edge deployment economics, where battery life and thermal limits impose real constraints on inference frequency.
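A back-of-envelope conversion shows what 0.068 mWh per token means in practice. The battery capacity below is an illustrative assumption, not a figure from the article:

```python
MWH_PER_TOKEN = 0.068   # iPhone 17 Pro Max, from the article
BATTERY_MWH = 18_000    # ~18 Wh battery, assumed for illustration

tokens_per_charge = BATTERY_MWH / MWH_PER_TOKEN
print(f"{tokens_per_charge:,.0f}")  # roughly 265,000 tokens per charge
```

Even if inference were the only drain on the battery, that budget is far beyond typical on-device usage, which is why the constraint in practice is thermal and incremental battery drain rather than a hard token ceiling.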


PrismML notes that these gains derive primarily from reduced memory bandwidth requirements — moving fewer bits to and from memory during inference — rather than from hardware designed specifically for 1-bit arithmetic. In linear layers, 1-bit weights theoretically allow multiplication to be replaced by addition, but current GPU and mobile hardware does not exploit this. The company says purpose-built 1-bit hardware could push efficiency gains by another order of magnitude.
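The multiplication-to-addition point can be made concrete: with ±1 weights, each dot product in a linear layer reduces to signed sums of activations, with one multiply by the shared scale at the end. A small numpy sketch (illustrative only; real kernels operate on packed bits):

```python
import numpy as np

def binary_dot(x, bits, scale):
    # With +/-1 weights, no multiplies against the weights are needed:
    # add activations where the bit is set, subtract where it is not,
    # then apply the shared scale once.
    return scale * (x[bits].sum() - x[~bits].sum())

rng = np.random.default_rng(1)
x = rng.standard_normal(128)
bits = rng.random(128) >= 0.5
scale = 0.3
w = np.where(bits, 1.0, -1.0) * scale  # the equivalent dense weights
assert np.isclose(binary_dot(x, bits, scale), x @ w)
```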


Deployment Breadth


The model runs across a wider range of platforms than most releases of this capability tier. It operates natively on Apple Silicon via the company's MLX fork, on NVIDIA GPUs via a llama.cpp fork with GGUF Q1_0_g128 format, on iPhone and iPad via the Locally AI app, and on CPU. Android support is noted in the Hugging Face model card.


PrismML has released the inference kernels as open-source forks of llama.cpp and MLX. The kernels are not yet merged into upstream versions of either framework, which means users need to build from the PrismML forks. A setup script and Colab notebook are available to lower that barrier.


Three model sizes ship simultaneously: 8B at 1.15 GB, 4B at 0.57 GB (reaching 132 tokens per second on an M4 Pro), and 1.7B at 0.24 GB (reaching 130 tokens per second on an iPhone 17 Pro Max). The family covers the range from server-side throughput optimization to phones with limited memory budgets.


The Open-Prem Inflection Point Connection


The Bonsai release is a data point in a longer trend. In the year since the first Open-Prem Inflection Point paper, capable open-weight models have progressively closed the gap with frontier proprietary systems on standard benchmarks. What Bonsai adds is a different axis: capable models at memory footprints that allow local deployment on hardware previously excluded from serious AI workloads entirely.


On-premises deployment economics shift when an 8B-class model fits on a consumer phone at 44 tokens per second. The cloud dependency for inference narrows. For enterprises with data residency requirements, healthcare environments with connectivity constraints, and developers building offline applications, a model this small that performs this well changes the build calculus.

The Open-Prem Inflection Point V3 Released April 1st, 2026

The argument PrismML is making — that intelligence density matters as much as raw parameter count — aligns with a pattern visible across the last several model generations: the gap between large proprietary models and efficiently designed smaller ones has been narrowing faster than most analysts expected in 2023.


What Comes Next


PrismML describes the current Bonsai generation as the beginning of a category. The company has not disclosed specifics about future model releases or training runs, but notes that its 1-bit architecture is not tied to a specific base model and can be applied to newer foundations as the field advances.


The hardware opportunity they're pointing toward — chips designed for 1-bit inference that could replace multiplication with addition at scale — does not yet exist commercially. If it materializes, efficiency gains beyond the current 4-5x over full-precision models become plausible. The company has investors with direct semiconductor experience, which may inform how seriously they're developing that path.


For now, the result is a model that runs on an iPhone at 8B-class performance, consumes 1.15 GB of memory, and is freely available. Whether the commercial viability claim holds up as developers build real applications on top of it is a question the open-source release is designed to answer.

 
 
 
© 2026 by David Borish IP, LLC, All Rights Reserved
