The 300 Millisecond Window

Two weeks ago I set out to find the cheapest voice AI stack in Europe.

I tested every combination. Groq, Gemini, Resemble, ElevenLabs, Twilio, Kimi. I wrote benchmarks. I built spreadsheets. I calculated cost per minute to the fourth decimal place.

I found the answer. And then I realized I'd been asking the wrong question.

The Setup

Here's what prompted this. I'm building voice agents for a call center automation project in Prague. The pitch is simple: replace the human on the phone with something that sounds human, responds like a human, and costs less than a human.

Every vendor deck I've seen makes the same argument: look how cheap we are compared to your agents.

So I did what any engineer would do. I built the comparison myself.

I tested from Prague. Real telephony. Real speech-to-text. Real language models. Real voice synthesis. End-to-end, the way a caller would actually experience it.

The first thing I learned had nothing to do with cost.

The Silence

I called my own Gemini-powered agent on a Tuesday afternoon. It picked up instantly. I said "Hi, I'd like to check the status of my order."

Then nothing.

One second. Two seconds. I checked if the call had dropped. Three seconds. I was about to hang up when the voice came back, pleasant, articulate, completely correct.

But by second two, I was already gone. Not physically—I was still on the line. But psychologically, I'd left. The trust was broken. Whatever came next was recovery, not conversation.

I ran the same test with Groq. The response came in under a second. I didn't notice the gap. I just... continued talking. Like I would with a person.

That's when I stopped optimizing for pennies and started optimizing for milliseconds.

The Number

Three hundred milliseconds.

That's the natural gap between speakers in human conversation. It's not a design choice or a UX preference. It's neurological. Hardwired. Every human on earth expects the next speaker to begin within 300ms of when the previous speaker stops.

Below 300ms and you're interrupting. Above 500ms and something feels off—a wrongness you can't articulate but your body registers immediately. Past 800ms, the conversation feels robotic. Past one second, you're reaching for the end-call button.

The research on this is brutal and unambiguous. Contact centers report 40% more hang-ups when voice agents exceed one second of response latency. Conversion rates drop approximately 7% for every 100ms of additional delay. One-third of callers abandon if they feel they're not getting a prompt response.

One-third.

I spent two weeks comparing stacks that differ by $0.08 per minute. The actual differentiator is a window of time shorter than a blink.

What I Actually Measured

I ran end-to-end turn latency from Prague. Not network round-trip—actual time from the moment I stop speaking to the first audio byte reaching my ear.
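That measurement can be sketched as a simple wall-clock timer around a streaming response. Everything here is a stand-in for illustration: `stream_response` is any zero-argument callable that returns an iterator of audio chunks from whatever stack is under test, and `fake_stack` simulates a model that thinks for 300 milliseconds before speaking.

```python
import time

def turn_latency_ms(stream_response) -> float:
    """Time from end-of-speech to the first audio chunk of the reply.

    `stream_response`: any zero-arg callable returning an iterator of
    audio chunks from the stack under test (a placeholder, not a real API).
    """
    start = time.perf_counter()
    next(stream_response())  # block until the first audio byte arrives
    return (time.perf_counter() - start) * 1000

# Fake stack for illustration: "thinks" for 300 ms, then streams audio.
def fake_stack():
    time.sleep(0.3)
    yield b"\x00" * 160  # first 20 ms frame of 8 kHz mu-law audio

print(f"{turn_latency_ms(fake_stack):.0f} ms")
```

The point of measuring to the first audio byte, not the full response, is that callers judge the gap, not the completion time.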

Groq running Llama 3.3 70B on their LPU hardware in Helsinki: under two seconds, consistently. The consistency is the point. Not the average—the consistency. Every call. Every turn. No spikes.

Gemini 2.5 Flash, GA since June 2025, served from EU regions in Finland and Belgium: one to three seconds on good runs. But "good runs" is doing heavy lifting. The p95 is where the bodies are buried. On paper, Gemini's average latency looks competitive. In production, one out of every six or seven turns hits a multi-second spike. That's not a statistics problem. That's a hang-up.

I know a team that reported "great" 400ms average latency on their voice agent. Looked wonderful in the dashboard. Then they dug into the distribution and found 15% of turns were hitting two seconds or more. Their users weren't complaining about average performance. Their users were hanging up on tail latency.

The average hid everything.
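How an average hides a tail is easy to reproduce with a toy distribution. The numbers below are assumptions chosen to match the story above: most turns respond fast, 15% hit a multi-second spike.

```python
import random
import statistics

random.seed(0)

# Toy model (assumed, for illustration): 85% of turns answer in ~120 ms,
# 15% hit a spike around 2,000 ms.
turns_ms = [
    random.gauss(120, 20) if random.random() < 0.85 else random.gauss(2000, 300)
    for _ in range(10_000)
]

mean = statistics.mean(turns_ms)
p95 = statistics.quantiles(turns_ms, n=100)[94]  # 95th percentile

print(f"mean ≈ {mean:.0f} ms")  # looks fine on a dashboard
print(f"p95  ≈ {p95:.0f} ms")   # what one caller in twenty actually hears
```

The mean lands around 400 milliseconds, squarely in the "great" zone, while the p95 sits above two seconds. Same data, opposite conclusions.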

Grok from xAI: solid at one to four seconds, but function calling broke repeatedly. Fine for chat. Unreliable for agents that need to actually do things.

The Dirty Secret

Here's something nobody talks about with Gemini's pricing.

Google appears to preload tokens—generating responses speculatively to simulate speed, then discarding the unused output. You're billed for tokens that never reach the caller.

Try calculating your actual cost per minute in production. You can't. The meter is running on ghost tokens.

On paper, a Gemini-native voice stack prices out around $0.06 per minute. The cheapest option by far. In practice, nobody I've spoken to can reproduce that number across a week of production traffic.

The cheapest API is the one that doesn't bill you for hallucinated work.

The Anatomy of a Cent

I said I found the answer. Here it is, component by component. Every number verified, every source checked.

Speech-to-text: Whisper Large v3 Turbo on Groq. Four cents per hour of transcribed audio, and only the caller's side of the line needs transcribing—roughly half of each conversation minute—which works out to about $0.0003 per conversation minute. Three hundredths of a cent. Essentially a rounding error in the universe's ledger.

The brain: Llama 3.3 70B on Groq's LPU. $0.59 per million input tokens, $0.79 per million output. A typical conversation minute—four turns, system prompt, context window, user utterances, agent responses—costs $0.0017. Less than two-tenths of a cent.

The voice: Resemble AI at $0.06 per minute. Forty-plus voices. Voice cloning available. On-premise deployment for data residency.

The pipes: Twilio Media Streams at $0.004 per minute.

Total: $0.066 per minute.

Read those numbers again. The brain and the ears cost two-tenths of a cent combined. The intelligence is free. The listening is free. Ninety-one percent of the entire bill is the voice—the synthesis layer, the part that turns tokens into sound waves.

The thing you're paying for isn't thinking. It's speaking.
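The breakdown is simple enough to sanity-check in a few lines. Per-minute figures are the ones quoted above; this is a sketch of the arithmetic, not a billing model.

```python
# Per-minute component costs from the breakdown above (USD).
costs = {
    "stt (Whisper v3 Turbo on Groq)":   0.0003,
    "llm (Llama 3.3 70B on Groq)":      0.0017,
    "tts (Resemble AI)":                0.06,
    "telephony (Twilio Media Streams)": 0.004,
}

total = sum(costs.values())
print(f"total: ${total:.4f}/min")
for name, cost in costs.items():
    print(f"  {name}: {cost / total:.0%}")  # TTS dominates at ~91%
```

Run it and the synthesis line eats roughly 91 cents of every dollar, exactly the imbalance the numbers above describe.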

The Premium Version

For completeness: swap Resemble for ElevenLabs at roughly $0.08 per minute and Twilio Media Streams for Twilio ConversationRelay at $0.07 per minute. Add Gemini 3 Flash Preview for the text layer. Total comes to about $0.15 per minute.

That's 2.3 times the Groq/Resemble stack. The extra $0.086 buys you ten thousand voices and enterprise observability dashboards that look great in procurement presentations.

Enterprise buyers love paying for insurance. Builders love shipping.

The Variant Worth Watching

Kimi K2 from Moonshot AI, released July 2025, running on Groq infrastructure. Same price bracket. But with reasoning capabilities that approach frontier territory.

The September 2025 Instruct variant sharpened it. The January 2026 K2.5 release sharpened it again. For use cases where the agent needs to actually think—not just retrieve and recite, but reason through a caller's problem—the intelligence gap matters more than the cost gap.

Kimi K2.5 on Groq might be the first sub-$0.15 stack with genuine frontier-level reasoning. That's not an incremental improvement. That's a category shift. The voice agent that can think while costing less than a cup of coffee per hour of conversation.

The Question Everyone Asks

GDPR?

Solved. Every major provider in this stack has compliance mechanisms in place as of 2026. Groq has a Data Processing Agreement, EU/UK representatives, and Standard Contractual Clauses. Twilio is ISO and PCI certified. Google Cloud carries DPA, SCC, and ISO 27001. Resemble AI offers on-premise deployment for organizations that need data to never leave their walls.

This was the blocker for two years. It isn't anymore.

Stop using compliance as an excuse not to ship. The compliance story is settled. The latency story is not.

The Wrong Question

This is where I tell you I was asking the wrong question from the start.

I set out to find the cheapest stack. I built the spreadsheet. I found it. Here's the spreadsheet. You're welcome.

But then I did the math on what it actually costs to have a human answer the phone.

The cheapest option on earth—a shared offshore call center in the Philippines or India, per-minute billing—runs $0.27 to $0.45 per minute. A dedicated offshore agent at $7 to $16 per hour works out to $0.18 to $0.40 per minute when you account for the roughly 40 productive talk minutes in every paid hour. The rest is idle time, breaks, after-call work, hold gaps between calls.

A US-based center: $0.75 to $1.35 per minute.

The Groq/Resemble stack at $0.066 is four times cheaper than the cheapest shared offshore line. Eleven to twenty times cheaper than a US call center. Even the premium ElevenLabs stack at $0.15 per minute is 1.8 times cheaper than bargain-basement offshore.
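The human-agent conversion is just salary over productive talk time. A minimal sketch, using the article's assumption of roughly 40 productive talk minutes per paid hour:

```python
def per_minute_cost(hourly_wage: float, productive_min_per_hour: float = 40) -> float:
    """Effective cost per talk minute for a dedicated human agent."""
    return hourly_wage / productive_min_per_hour

# Dedicated offshore range from above: $7-$16 per hour.
low = per_minute_cost(7)     # lower bound
high = per_minute_cost(16)   # upper bound

ai_stack = 0.066  # Groq/Resemble stack, per minute
print(f"offshore: ${low:.3f}-${high:.3f}/min")
print(f"AI stack is {low / ai_stack:.1f}x-{high / ai_stack:.1f}x cheaper")
```

The productive-minutes denominator is what vendor decks quietly skip: an hourly wage divided by 60 looks far better than the same wage divided by 40.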

Every AI stack wins on cost. Every single one. The cheapest. The most expensive. The one I haven't tested yet that launched yesterday. They all win.

The cost war is over. AI won it before the first call connected.

The Right Question

So why did I almost hang up on my own Gemini agent?

Not because it was expensive. Because it was slow.

A human agent picks up after hold time—one to thirty seconds of waiting. An AI agent picks up instantly. That's the easy win, and everyone celebrates it.

But then the human agent takes 200 milliseconds to respond to your question. Because that's how human brains work. That's the cadence we evolved over a hundred thousand years of spoken language. 200 to 300 milliseconds. Not because we're fast. Because we start formulating our response while the other person is still talking.

An AI agent that can match that 300 millisecond gap wins.

An AI agent that pauses for 1.5 seconds while the language model thinks loses the call. Every time. Regardless of whether it costs $0.066 or $0.15 or $0.005.

The cost difference between the cheapest and most expensive AI stack is $0.086 per minute. That's $5.16 per hour. That's the price of a mediocre coffee in Prague.

The cost of one dropped call from a two-second latency spike is one lost customer.

The math isn't close.

What I Should Have Known

I should have known this from Moltbook.

When 1.5 million AI agents tried to build a society in seven days, the system didn't collapse because of misalignment or cost or capability. It collapsed because the infrastructure couldn't keep up with the speed of interaction. Latency in identity verification. Latency in security checks. Latency in the feedback loops that should have caught the prompt injection attacks before they propagated.

The bots that thrived on Moltbook weren't the smartest or the cheapest to run. They were the fastest to respond. The ones that could hold a conversation thread. The ones whose replies arrived before the other agent's context window moved on.

Speed is the substrate. Everything else is a feature request.

The Prediction

Voice AI in 2026 looks like search in 2004.

Everyone knows it matters. Nobody agrees on the architecture. The default choice—Google—works, but with tail latency that will shred your p95 and ghost-token billing that will shred your budget projections.

The Groq/Resemble stack is the equivalent of building on AWS in 2008 instead of waiting for Google Cloud. Less obvious. More reliable. And the builders who choose it now will have eighteen months of production data while everyone else is still comparing pricing pages.

Here's what I'd build today. Groq for the brain. Resemble for the voice. Twilio Media Streams for the pipes. Optimize every component for sub-800 millisecond end-to-end latency. Ship it in a week. Iterate on speed, not cost.
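One way to keep that sub-800ms target honest is to budget the pipeline stage by stage and fail loudly when a component change blows the budget. The per-stage figures here are illustrative assumptions, not measurements of any particular vendor:

```python
# Hypothetical per-stage latency budget (ms) for one conversational turn.
budget_ms = {
    "endpointing (detect caller stopped)": 150,
    "stt final transcript":                100,
    "llm time-to-first-token":             250,
    "tts time-to-first-audio":             150,
    "telephony + network":                 100,
}

total = sum(budget_ms.values())
assert total <= 800, f"budget blown: {total} ms end-to-end"
print(f"end-to-end budget: {total} ms")
```

Treating the budget as a failing assertion in CI means a model swap or region change that adds 200 milliseconds gets caught before a caller ever hears it.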

The voice stack war isn't about who has the best demo. It's about who answers before the caller gives up.

The Punchline

I started this project trying to save pennies.

I ended it understanding that the most expensive thing in voice AI isn't the language model, or the voice synthesis, or the telephony, or the compliance overhead.

It's silence.

Two seconds of silence on a phone call costs more than every API in the stack combined. Because silence is where the caller decides this isn't a person. Silence is where trust breaks. Silence is where the finger moves to the red button.

At 300 milliseconds, the caller doesn't know they're talking to a machine.

At 1,500 milliseconds, they don't care. They've already hung up.

Build for the 300 millisecond window. That's where the money is.

The author benchmarks voice AI stacks and writes about what happens when you optimize for the wrong metric. The technical configuration for Twilio Media Stream integration with the Groq/Resemble stack is available on request.

Shipping AI outcomes @verduona • Intelligence from the frontier