When Interactivity Scales With Intelligence: Mira Murati's Thinking Machines Lab Launches New Model
- David Borish

Most AI research focuses on making models smarter. A new paper from Thinking Machines Lab focuses on something different: making models actually usable in the way humans naturally work together.
The paper, published today, introduces what the team calls interaction models, a class of AI systems designed to handle real-time, multimodal collaboration natively rather than through the add-on components that current systems depend on. The research arrives with benchmark results that put their model, TML-Interaction-Small, ahead of GPT Realtime and Gemini Live on several interactivity measures while remaining competitive on standard intelligence tests.
The Bottleneck They're Trying to Fix
The problem Thinking Machines identifies is structural. Current large language models experience the world one turn at a time. A user types or speaks while the model waits; then the model responds while the user waits. Until the user finishes, the model perceives nothing. Until the model finishes generating, it receives no new information. The researchers describe this as a narrow channel that limits how much of a person's knowledge, intent, and judgment can actually reach the model.
They point to a telling admission buried in a recent frontier model card, which noted that when models are used in interactive, hands-on patterns, users often perceive them as too slow and don't realize as much value. The card suggested autonomous, long-running harnesses better elicit model capabilities. Thinking Machines reads this as a design concession, not a feature, one that effectively pushes humans out of their own work.
This matters practically. Most real work can't be fully specified upfront. A designer, an analyst, or an engineer working through a problem needs to clarify, redirect, and give feedback as the work develops. When the AI interface requires the human to batch all of that guidance into discrete prompts, it's not a collaboration; it's a delegation that happens to allow some revision.
How the Model Works
The core technical idea is what the team calls time-aligned micro-turns. Rather than waiting for a complete user input and generating a complete response, the model processes 200-millisecond chunks of audio, video, and text continuously, interleaving input and output in a single token stream. At any given moment the model is both listening and generating, which allows it to interrupt when something is wrong, backchannel while the user is talking, respond to visual cues without being explicitly prompted, and speak simultaneously with the user when a task, such as live translation, requires it.
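To make the mechanics concrete, here is a minimal sketch of what such a loop could look like. The method names (encode_chunk, step, decode_audio) and the mic/camera/speaker objects are hypothetical stand-ins; the paper does not publish this interface, so the structure below is illustrative only.

```python
CHUNK_MS = 200  # micro-turn granularity described in the paper

def interaction_loop(model, mic, camera, speaker):
    """Interleave perception and generation in a single token stream."""
    stream = []  # one shared sequence holding both input and output tokens
    while True:
        # Perceive: encode the latest 200 ms of audio and video.
        stream.extend(model.encode_chunk(
            audio=mic.read(ms=CHUNK_MS),
            video=camera.read(ms=CHUNK_MS),
        ))
        # Act: the model may emit zero or more tokens this micro-turn --
        # silence, a backchannel, an interruption, or continued speech.
        out_tokens = model.step(stream, budget_ms=CHUNK_MS)
        stream.extend(out_tokens)
        if out_tokens:
            speaker.play(model.decode_audio(out_tokens))
```

The key property is that perception never pauses while generation happens: both live in the same stream, so choosing to stay silent for a micro-turn is itself an output the model produces.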
Most existing real-time systems emulate this behavior through external components, primarily voice-activity-detection systems that predict when a speaker has finished so a turn-based model can begin. Thinking Machines argues that these components, being far less intelligent than the models they're managing, place a ceiling on what real-time AI can actually do. Proactive behaviors like "interrupt me when I make a factual error" or "count how many pushups I complete" require the system to decide when to respond based on context rather than an audio boundary. A harness that listens for silence can't do that.
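For contrast, a toy version of the silence-gated harness the paper critiques might look like the sketch below. The energy threshold and timings are invented for illustration; the point is only that the trigger is acoustic, not semantic.

```python
import numpy as np

SILENCE_RMS = 0.01         # assumed energy threshold for "not speaking"
SILENCE_MS_TO_REPLY = 600  # assumed pause length before the model may reply

def vad_harness(audio_chunks, turn_based_model):
    """Yield model replies only after the user falls silent."""
    buffered, silent_ms = [], 0
    for chunk in audio_chunks:  # 200 ms float32 frames
        buffered.append(chunk)
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        silent_ms = silent_ms + 200 if rms < SILENCE_RMS else 0
        if silent_ms >= SILENCE_MS_TO_REPLY and buffered:
            # The model only ever speaks here, at an acoustic boundary.
            # "Interrupt me when I'm wrong" or "count my pushups" can
            # never fire, because content is never the trigger.
            yield turn_based_model.respond(np.concatenate(buffered))
            buffered, silent_ms = [], 0
```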
Their architecture uses minimal preprocessing. Audio is encoded using dMel spectrograms with a lightweight embedding layer, images are split into patches processed by a compact MLP, and a flow-based decoder generates audio output. All components are trained together from scratch with the main transformer rather than bolted on afterward.
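A rough PyTorch-style reading of that front end, under assumed shapes and sizes: the actual dMel discretization, patch size, and the flow-based audio decoder are not detailed in the article, so the decoder is omitted and the dimensions below are guesses for illustration.

```python
import torch
import torch.nn as nn

class InteractionFrontEnd(nn.Module):
    """Illustrative minimal-preprocessing front end; sizes are assumptions."""
    def __init__(self, d_model=4096, n_mels=80, patch_dim=16 * 16 * 3):
        super().__init__()
        # Audio: dMel spectrogram frames -> lightweight embedding layer
        # (treated as dense frames here purely to keep the sketch short).
        self.audio_embed = nn.Linear(n_mels, d_model)
        # Vision: flattened image patches -> compact MLP.
        self.patch_mlp = nn.Sequential(
            nn.Linear(patch_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, dmel_frames, patches):
        # Both modalities land in the same token space as text, so the main
        # transformer can be trained on them jointly from scratch.
        audio_tokens = self.audio_embed(dmel_frames)  # (T_audio, d_model)
        image_tokens = self.patch_mlp(patches)        # (T_patch, d_model)
        return torch.cat([audio_tokens, image_tokens], dim=0)
```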
For tasks that require deeper reasoning than can be produced in real time, the interaction model delegates to a background model running asynchronously. The interaction model stays active throughout, handling follow-ups and new input, then integrates the background results into the conversation when the moment is right. The researchers frame this as getting reasoning-model depth at non-thinking-model latency.
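The foreground/background split could be sketched with asyncio roughly as follows; interaction_model and reasoning_model, along with their methods, are placeholders rather than the paper's API.

```python
import asyncio

async def answer_hard_question(question, interaction_model, reasoning_model):
    """Stay interactive while a slower reasoning model works in the background."""
    # Kick off deep reasoning without blocking the live conversation.
    background = asyncio.create_task(reasoning_model.solve(question))

    while not background.done():
        # The interaction model keeps handling follow-ups, interruptions,
        # and new audio/video input at micro-turn latency.
        user_chunk = await interaction_model.next_chunk()
        await interaction_model.respond(user_chunk)

    # Weave the finished result back into the conversation when appropriate.
    await interaction_model.present(background.result())
```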
What the Benchmarks Show
On FD-bench, which measures interactivity across scenarios including user interruptions, backchanneling, and speech detection, TML-Interaction-Small scored 77.8 on version 1.5, compared to 54.3 for Gemini Flash Live and 48.3 for GPT Realtime 1.5. On turn-taking latency, the model responded in 0.40 seconds, against 1.18 seconds for GPT Realtime 2.0 at minimal settings and 0.57 seconds for Gemini Flash Live.
On Audio MultiChallenge, a benchmark for intelligence and instruction following, TML-Interaction-Small scored 43.4, above GPT Realtime 1.5 (34.7) and Gemini Flash Live minimal (26.8), and below the thinking-mode versions of those models (48.5 and 36.1 respectively). The researchers position their model as occupying the space between those poles: more interactive than any current model, and more intelligent than any current non-thinking model.
The more interesting results come from benchmarks the team built themselves to test capabilities that existing benchmarks don't measure. On TimeSpeak, which tests whether a model can initiate speech at user-specified times with the correct content, TML-Interaction-Small scored 64.7 percent accuracy. GPT Realtime 2.0 scored 4.3 percent.
On CueSpeak, which tests whether the model speaks at the correct moment in response to verbal cues from the user, the scores were 81.7 percent versus 2.9 percent. On RepCount-A, which streams video of repeated physical actions and asks the model to count reps aloud in real time, TML-Interaction-Small hit 35.4 percent within one rep of the correct count; GPT Realtime 2.0 scored 1.3 percent.
These aren't marginal differences. Existing real-time models essentially fail at tasks that require initiating speech based on time or visual context. The underlying reason is that those models are built around audio turn detection and have no mechanism for deciding to speak in response to something they see or a time condition they're tracking.
What This Signals for How AI Gets Used
The framing in the paper is worth taking seriously. The researchers argue that interactivity should scale alongside intelligence, meaning that as models become more capable, they should also become better collaborators, not systems that are best used in long autonomous runs while the human waits for results.
The model is a mixture-of-experts design with 276 billion total parameters and 12 billion active. The team acknowledges that larger pretrained models are still too slow to serve in this real-time setting, and plans to release larger versions later this year. Extended sessions also present context-management challenges, since continuous audio and video accumulate quickly.
The safety work reflects the specific pressures of real-time interaction. The team used text-to-speech systems to generate training data for natural-sounding refusals in speech, and built automated red-teaming harnesses to test robustness across extended conversations, aiming for parity between the model's behavior in voice and text.
The research preview is limited for now. Thinking Machines is accepting feedback and plans broader access later this year. They're also launching a research grant to encourage more work on interaction model evaluation, noting that existing benchmarks don't adequately capture what these systems can and can't do.
The paper's core claim is straightforward: when interactivity is built into the model rather than assembled around it, it improves as the model improves. That's a different bet than the one most labs are making, and today's benchmarks suggest it's not an unreasonable one.
